NLR Statistics

Author: 花生学生信 | Published 2024-03-02 16:09

Previously I covered NLR identification and classification: NLR鉴定及分类(test) - 简书 (jianshu.com)
Today let's tally up those earlier results.

Required scripts

###tq1.py
import sys

input_file = sys.argv[1]   # input file name
output_file = sys.argv[2]  # output file name

with open(input_file, 'r') as f_input, open(output_file, 'w') as f_output:
    for line in f_input:
        columns = line.strip().split('\t')
        if len(columns) >= 2:
            if columns[1] == 'Coils':
                # if column 2 is 'Coils', write columns 1 and 2 to the new file
                f_output.write(columns[0] + '\t' + columns[1] + '\n')
            elif len(columns) >= 3:
                # if column 2 is not 'Coils', write columns 1 and 3 to the new file
                f_output.write(columns[0] + '\t' + columns[2] + '\n')

print("Done; results written to", output_file)
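The per-row rule above can be exercised on a toy table (the gene IDs and analysis names here are invented for illustration): a Coils row keeps column 2, any other annotated row keeps column 3.

```python
# Toy check of tq1.py's row rule (gene IDs are made up):
# keep column 2 when it is 'Coils', otherwise keep column 3.
def pick(columns):
    if len(columns) >= 2 and columns[1] == 'Coils':
        return columns[0] + '\t' + columns[1]
    if len(columns) >= 3:
        return columns[0] + '\t' + columns[2]
    return None  # short rows are skipped, as in tq1.py

print(pick(['gene1', 'Coils', '-']))      # gene1	Coils
print(pick(['gene2', 'Pfam', 'NB-ARC']))  # gene2	NB-ARC
```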
###tr.py
import sys

input_file = sys.argv[1]  # input file name

with open(input_file, 'r') as f:
    table = [line.strip().split('\t') for line in f]

# collect the set of features seen for each gene ID
feature_dict = {}
for row in table:
    gene_id = row[0]
    feature = row[1]
    if gene_id in feature_dict:
        feature_dict[gene_id].add(feature)
    else:
        feature_dict[gene_id] = {feature}

A_count = 0
B_count = 0
C_count = 0
D_count = 0

for features in feature_dict.values():
    if 'a' in features and 'b' in features and 'c' in features:
        C_count += 1
    elif 'a' not in features and 'b' not in features:
        A_count += 1
    elif 'a' in features and 'b' not in features:
        B_count += 1
    elif 'b' in features and 'a' not in features:
        D_count += 1

output_file = sys.argv[2]  # output file name

with open(output_file, "w") as f_out:
    f_out.write("{}\t{}\t{}\t{}\n".format(A_count, B_count, C_count, D_count))
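Here 'a', 'b' and 'c' are the single-letter codes produced by the sed replacements later in the pipeline (a = Coils, b = leucine-rich repeat, c = NB-ARC), so the four counters correspond to the usual NLR domain architectures. A sketch of the same decision chain as a function (the class labels are my reading of the counters, not named in the original script):

```python
# Re-implementation of tr.py's counting rule, assuming the letter codes
# from the sed step: a = Coils, b = LRR, c = NB-ARC.
def classify(features):
    """Return the domain-architecture class for one gene's feature set."""
    if {'a', 'b', 'c'} <= features:
        return 'CNL'   # Coils + LRR + NB-ARC  -> C_count
    if 'a' not in features and 'b' not in features:
        return 'N'     # NB-ARC only           -> A_count
    if 'a' in features and 'b' not in features:
        return 'CN'    # Coils + NB-ARC        -> B_count
    if 'b' in features and 'a' not in features:
        return 'NL'    # LRR + NB-ARC          -> D_count
    return None        # a + b without c falls through uncounted, as in tr.py

print(classify({'a', 'b', 'c'}))  # CNL
print(classify({'c'}))            # N
```

Note that a gene with Coils and LRR but no NB-ARC matches none of the branches and is not counted; that mirrors the original script's behaviour.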


Pipeline commands:

###Extract the IDs

###path   /public/home/fengting/work/zyyy/NLR_num
ls | sed 's/\.nb\.tsv//g' > ../NLR_num/id
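What this one-liner produces: one sample ID per line, with the `.nb.tsv` suffix stripped. The same transformation in Python, on invented file names:

```python
# Strip the .nb.tsv suffix from each file name, as the `ls | sed` step
# does (file names here are invented for illustration).
files = ["speciesA.nb.tsv", "speciesB.nb.tsv"]
ids = [f.removesuffix(".nb.tsv") for f in files]  # Python 3.9+
print(ids)  # ['speciesA', 'speciesB']
```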

###Extract columns 1, 5 and 13
####In theory a single script could do this in one step, but in practice the Coils column would not print for reasons unknown, so extract the needed columns first
cat id |while read id 
do
cut -f 1,5,13 ../inter/${id}.nb.tsv >01cut/$id.tsv
done

###Consolidate
cat id |while read id
do 
python tq1.py 01cut/$id.tsv 02zh/$id.tsv
done



###Replace strings
cat id | while read id
do
sed -i 's/Coils/a/g' 02zh/$id.tsv
sed -i 's/Leucine-rich repeat, typical subtype/b/g' 02zh/$id.tsv
sed -i 's/Leucine-rich repeat, cysteine-containing subtype/b/g' 02zh/$id.tsv
sed -i 's/Leucine rich repeat 4/b/g' 02zh/$id.tsv
sed -i 's/Leucine-rich repeat/b/g' 02zh/$id.tsv
sed -i 's/NB-ARC/c/g' 02zh/$id.tsv
done
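The order of these substitutions matters: the longer LRR names must be rewritten before the bare "Leucine-rich repeat" pattern, otherwise the generic pattern fires first and leaves fragments behind. Sequential `str.replace` calls in Python show the same effect:

```python
# Mimic the sed chain with sequential replacements; order matters.
line = "gene1\tLeucine-rich repeat, typical subtype"

# Correct order: most specific pattern first (as in the sed loop above).
good = (line.replace("Leucine-rich repeat, typical subtype", "b")
            .replace("Leucine-rich repeat", "b"))
print(good)  # gene1	b

# Wrong order: the generic pattern fires first and leaves a fragment.
bad = (line.replace("Leucine-rich repeat", "b")
           .replace("Leucine-rich repeat, typical subtype", "b"))
print(bad)   # gene1	b, typical subtype
```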


###Count

cat id |while read id
do
python tr.py 02zh/$id.tsv 03tj/${id}.tsv
done

###Merge the tsv files
use strict;
use warnings;

# get the names of all tsv files in the current directory
my @file_names = glob "*.tsv";

# create a new file to hold the merged data
open my $integrated_file, '>', 'integrated_file.tsv' or die "Cannot open integrated_file.tsv: $!";

# read each tsv file in turn and append its rows to the new file
foreach my $file_name (@file_names) {
    open my $file, '<', $file_name or die "Cannot open $file_name: $!";
    while (<$file>) {
        chomp;
        print $integrated_file "$file_name\t$_\n";  # prepend the source file name as the first column
    }
    close $file;
}

# close the merged file
close $integrated_file;
For the final table, the header row can be added either in Perl or in Excel.

Update 2024-04-09: method for counting non-redundant NLRs:

###Merge the fas files
import os

# current working directory
dir_path = os.getcwd()

# list holding the contents of every fas file
fas_contents = []

# walk the current directory
for file in os.listdir(dir_path):
    if file.endswith(".fas"):
        # read each fas file and store its contents
        with open(file, 'r') as f:
            content = f.read()
            fas_contents.append(content)

# join all fas file contents into one string
merged_content = "\n".join(fas_contents)

# write the merged contents to a new file
with open("merged_fas_file.fas", 'w') as merged_file:
    merged_file.write(merged_content)

print("All fas files merged into merged_fas_file.fas.")

cd-hit去冗余

cd-hit -i merged_fas_file.fas -o output_fas_file.fas -c 0.9 -n 5

相关文章

网友评论

      本文标题:NLR统计

Article link: https://www.haomeiwen.com/subject/ariczdtx.html