A previous post covered NLR identification and classification: NLR鉴定及分类(test) - 简书 (jianshu.com).
Today we tally those results.
Required scripts:
###tq1.py
import sys

input_file = sys.argv[1]   # input file name
output_file = sys.argv[2]  # output file name
with open(input_file, 'r') as f_input, open(output_file, 'w') as f_output:
    for line in f_input:
        columns = line.strip().split('\t')
        if len(columns) >= 2 and columns[1] == 'Coils':
            # column 2 is Coils: write columns 1 and 2 to the new file
            f_output.write(columns[0] + '\t' + columns[1] + '\n')
        elif len(columns) >= 3:
            # column 2 is not Coils: write columns 1 and 3 to the new file
            f_output.write(columns[0] + '\t' + columns[2] + '\n')
print("Done; results written to", output_file)
###tr.py
import sys

input_file = sys.argv[1]   # input file name
with open(input_file, 'r') as f:
    table = [line.strip().split('\t') for line in f]

# Collect the set of features seen for each sequence ID
feature_dict = {}
for row in table:
    seq_id = row[0]
    feature = row[1]
    if seq_id in feature_dict:
        feature_dict[seq_id].add(feature)
    else:
        feature_dict[seq_id] = {feature}

# a = Coils, b = LRR, c = NB-ARC (renamed by the sed step below)
A_count = 0  # neither Coils nor LRR
B_count = 0  # Coils without LRR
C_count = 0  # Coils + LRR + NB-ARC
D_count = 0  # LRR without Coils
for features in feature_dict.values():
    if 'a' in features and 'b' in features and 'c' in features:
        C_count += 1
    elif 'a' not in features and 'b' not in features:
        A_count += 1
    elif 'a' in features and 'b' not in features:
        B_count += 1
    elif 'b' in features and 'a' not in features:
        D_count += 1

output_file = sys.argv[2]  # output file name
with open(output_file, "w") as f_output:  # avoid shadowing the file-name string
    f_output.write("{}\t{}\t{}\t{}\n".format(A_count, B_count, C_count, D_count))
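To make the A/B/C/D rule in tr.py concrete, here is a small worked example of the same logic, using hypothetical sequence IDs p1–p3, after Coils, LRR and NB-ARC have been renamed to a, b and c:

```python
def classify(features):
    """Return the NLR category for one sequence's feature set (tr.py's rule)."""
    if 'a' in features and 'b' in features and 'c' in features:
        return 'C'   # Coils + LRR + NB-ARC
    if 'a' not in features and 'b' not in features:
        return 'A'   # neither Coils nor LRR
    if 'a' in features and 'b' not in features:
        return 'B'   # Coils without LRR
    if 'b' in features and 'a' not in features:
        return 'D'   # LRR without Coils
    return None      # a and b present but no c: tr.py leaves this uncounted

# Hypothetical (id, feature) rows, as they look after the sed renaming
rows = [('p1', 'a'), ('p1', 'b'), ('p1', 'c'),
        ('p2', 'c'),
        ('p3', 'a'), ('p3', 'c')]
feature_dict = {}
for seq_id, feat in rows:
    feature_dict.setdefault(seq_id, set()).add(feat)

counts = {'A': 0, 'B': 0, 'C': 0, 'D': 0}
for feats in feature_dict.values():
    cat = classify(feats)
    if cat:
        counts[cat] += 1
print(counts)  # {'A': 1, 'B': 1, 'C': 1, 'D': 0}
```

Note one quirk of the original elif chain: a sequence with both Coils and LRR but no NB-ARC matches none of the four branches and is silently skipped.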
Pipeline commands:
###Extract IDs
###path /public/home/fengting/work/zyyy/NLR_num
ls | sed 's/\.nb\.tsv//g' > ../NLR_num/id
###Extract columns 1, 5 and 13
####In theory this could be done in a single scripted step, but in practice the Coils column would not print for unknown reasons, so the needed columns are extracted first
cat id |while read id
do
cut -f 1,5,13 ../inter/${id}.nb.tsv >01cut/$id.tsv
done
###Consolidate
cat id |while read id
do
python tq1.py 01cut/$id.tsv 02zh/$id.tsv
done
###Substitute feature names
cat id | while read id
do
sed -i 's/Coils/a/g' 02zh/$id.tsv
sed -i 's/Leucine-rich repeat, typical subtype/b/g' 02zh/$id.tsv
sed -i 's/Leucine-rich repeat, cysteine-containing subtype/b/g' 02zh/$id.tsv
sed -i 's/Leucine rich repeat 4/b/g' 02zh/$id.tsv
sed -i 's/Leucine-rich repeat/b/g' 02zh/$id.tsv
sed -i 's/NB-ARC/c/g' 02zh/$id.tsv
done
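The six sed calls above can be collapsed into one invocation per file. The expression order matters: the more specific LRR descriptions must be replaced before the generic "Leucine-rich repeat", otherwise the generic pattern matches first and leaves residue such as ", typical subtype" behind. A sketch on a hypothetical demo file:

```shell
# Demo input (hypothetical): one protein with three features
printf 'p1\tCoils\np1\tLeucine-rich repeat, typical subtype\np1\tNB-ARC\n' > input.tsv

# One sed invocation for all six patterns; specific LRR descriptions
# must come before the generic "Leucine-rich repeat"
sed -e 's/Coils/a/g' \
    -e 's/Leucine-rich repeat, typical subtype/b/g' \
    -e 's/Leucine-rich repeat, cysteine-containing subtype/b/g' \
    -e 's/Leucine rich repeat 4/b/g' \
    -e 's/Leucine-rich repeat/b/g' \
    -e 's/NB-ARC/c/g' input.tsv > output.tsv

cat output.tsv
```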
###Count
cat id |while read id
do
python tr.py 02zh/$id.tsv 03tj/${id}.tsv
done
###Merge the tsv files (Perl)
use strict;
use warnings;
# Get the names of all tsv files in the current directory
my @file_names = glob "*.tsv";
# Open a new file to hold the merged data
open my $integrated_file, '>', 'integrated_file.tsv' or die "Cannot open integrated_file.tsv: $!";
# Read each tsv file and append its rows to the merged file
foreach my $file_name (@file_names) {
    open my $file, '<', $file_name or die "Cannot open $file_name: $!";
    while (<$file>) {
        chomp;
        print $integrated_file "$file_name\t$_\n"; # prepend the source file name as the first column
    }
    close $file;
}
# Close the merged file
close $integrated_file;
The header row of the final table can be added either in the Perl script or later in Excel.
Update 2024-4-9, non-redundant NLR counting method:
###Merge the fas files
import os

# current working directory
dir_path = os.getcwd()
# list holding the contents of every fas file
fas_contents = []
# walk the current directory
for file in os.listdir(dir_path):
    if file.endswith(".fas"):
        # read each fas file and store its contents
        with open(file, 'r') as f:
            content = f.read()
            fas_contents.append(content)
# join the contents of all fas files into one string
merged_content = "\n".join(fas_contents)
# write the merged contents to a new file
with open("merged_fas_file.fas", 'w') as merged_file:
    merged_file.write(merged_content)
print("All fas files merged into merged_fas_file.fas.")
Remove redundancy with cd-hit (-c 0.9 sets a 90% identity threshold; -n 5 is the matching word size for that range)
cd-hit -i merged_fas_file.fas -o output_fas_file.fas -c 0.9 -n 5
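cd-hit keeps one representative sequence per cluster in the output FASTA, so the non-redundant NLR count is simply the number of `>` headers in that file. A minimal sketch (the output_fas_file.fas name follows the command above):

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as f:
        return sum(1 for line in f if line.startswith('>'))

# Example: count_fasta_records('output_fas_file.fas')
```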