未比对上序列的提取遇到的问题及解决策略

作者: 花生学生信 | 来源:发表于2024-02-23 17:15 被阅读0次

2021-05-20未比对上的序列提取(二）
【未解决】遇到的问题
blast-小技巧：提取序列
提取未比对序列
Python3 - 过滤序列元素(推导式)
DU2:如何准确获取基因的某个保守区域序列
Python 过滤序列元素
【总结】升级Xcode8遇到的问题及解决方案!!!
成长就是解决问题
【Python进阶】1.16 过滤序列元素（列表推导会占大量内存

未必对序列提取的方法请看：
请看前期：
泛基因组提取未必对上序列 - 简书 (jianshu.com)

脚本会自动把其他染色体上的序列给补上

提取序列，full表示完全没有比上，partial表示部分未比上

这里其实应该是脚本的缺陷，如果quast的比对的是两个完整的chr级的序列（水稻12条，不含contig）

实际上未必对的结果约占1/10是比较合理的，约40Mb的序列

因为序列名称等各种原因，直接提取出来的结果会有各种问题，这里我写了几个脚本，但是解决起来还是很繁琐，如果有更好的办法后面会更新，

首先使用脚本提取quast结果，考虑到后面批处理及提取的问题，所以提取的文件fasta文件id编号不添加样本名

python3 unalign_drop_direct.py \
    DL651.500.fa \
    H7L1/contigs_reports/contigs_report_DL651-500.unaligned.info \
    unalign/651_unalign_500.fa \
    500 \
#   DL651

提取出的contig比原来多200多，因为有的contig是部分比对上的

首先是把quast/contig_report/unalign.info文件第一列提取出来，在序列前加_符号（unalign_drop_direct.py脚本的问题），

##提取第一列
cut -f 1  unalign.info |sed '1d' >contigid

###tj.py
import sys

if len(sys.argv) != 3:
    print("Usage: python add_underscore.py input_file output_file")
    sys.exit(1)

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(input_file, 'r') as fin:
    lines = fin.readlines()

with open(output_file, 'w') as fout:
    for line in lines:
        new_line = '_'+line.strip() + '\n'
        fout.write(new_line)

print("文件处理完成，结果已保存到", output_file)

结果是模糊匹配，这里写一个脚本提取。
即写一个脚本，读取id文件和fasta文件，根据id文件的内容提取fasta文件，fasta文件只要:前匹配id的都提取，并保存成新的fasta文件，命令行形式

#tq1.py
import sys

if len(sys.argv) != 4:
    print("Usage: python extract_fasta_by_id_prefix.py id_file fasta_file output_file")
    sys.exit(1)

id_file = sys.argv[1]
fasta_file = sys.argv[2]
output_file = sys.argv[3]

# 读取 ID 文件中的 ID 前缀列表
with open(id_file, 'r') as id_f:
    id_prefixes = [line.strip() for line in id_f]

# 读取 FASTA 文件，提取匹配的序列
with open(fasta_file, 'r') as fasta_f:
    sequences = {}
    curr_id = None
    for line in fasta_f:
        if line.startswith('>'):
            curr_id = line.strip().split(':')[0][1:]
            sequences[curr_id] = ''
        else:
            sequences[curr_id] += line.strip()

# 将匹配的序列保存到输出文件中
with open(output_file, 'w') as out_f:
    for seq_id in sequences.keys():
        for id_prefix in id_prefixes:
            if seq_id.startswith(id_prefix):
                out_f.write('>' + seq_id + '\n')
                out_f.write(sequences[seq_id] + '\n')

print("序列提取完成，结果已保存到", output_file)

这是最终提取出来的结果，应该没有什么问题了

cat ../../fqid|while read id 
do
cut -f 1 /public/home/fengting/work/zyyy/DK/jelly/DK_pan/02_quast/$id/contigs_reports/contigs_report_${id}-500.unaligned.info|sed '1d' >000unid/$id
done

cat ../../fqid|while read id
do
python tj.py 000unid/$id 001unid/$id 
python tq1.py 001unid/$id ../03_unmapp/${id}_unalign_500.fa 002unmap/$id.fa
done

关于不同软件组装,如spadas、soapdenovo、megahit等组装效果对结果的影响会在后面的分析中检测。