染色体变异（持续更新中...）

作者: Morriyaty | 来源:发表于2022-03-20 21:24 被阅读0次

多基因组共线性分析（无注释版）

1. 处理基因组，只保留挂载到染色体水平的染色体
在这使用的方法是计算每个 > 的长度，根据长度筛选染色体
python3 genome_len.py sp1.fa > sp1.len
python3 select_len.py sp1.fa 61420004(从len文件得到的长度) > sp1.chr.fa
2. 基因组之间比对
2.1 mumer比对基因组（这里使用mumer，last也可）
nucmer -t 20 --mum sp1.chr.fa sp2.chr.fa -p sp1-sp2
2.2 根据相似度，长度过滤比对信息 【这里选择90% ，30k长，具体可自己安排】
delta-filter -i 90 -l 30000 -q sp1-sp2.delta > filter.sp1-sp2.delta
2.3 得到coords信息
show-coords -c -r filter.sp1-sp2.delta > filter.sp1-sp2.delta.coords
2.4 脚本处理coords文件
perl Noname_1.pl filter.sp1-sp2.delta.coords > filter.sp1-sp2.delta.coords.txt
2.5 根据txt文件得到jcvi输入文件
 [同一路径记得改名字，不然覆盖了]
python3 coordsTxT2jcviInput.py filter.sp1-sp2.delta.coords.txt sp1.bed sp2.bed sp1-sp2.sample sp1 sp2
3. 如果是多物种比较，重复上述步骤，bed文件cat到一起
4. jcvi画图
python3 -m jcvi.graphics.karyotype seqids layout

软屏蔽重复序列

因为后续lastz比对用软屏蔽基因组可以加快速度，我们需要对基因组进行屏蔽，如果是已经屏蔽的可以跳过此步骤。

BuildDatabase -name sp1 sp1.fa
RepeatModeler -LTRStruct -database sp1 -pa 30
RepeatMasker -e rmblast -pa 30 -q -xsmall -lib sp1-families.fa  sp1.fa

基于基因组建树

之前的很多基因组建树方法都是基于gff走直系同源单拷贝那一套，对于没有注释的基因组就很不方便，下面用到的Mash + Mashtree就只用到基因组序列，比较方便。

1. Mash下载安装【编译好的，直接用】
wget https://github.com/marbl/Mash/releases/download/v2.2/mash-Linux64-v2.2.tar
tar -zxvf mash-Linux64-v2.2.tar
2. Mashtree下载安装
这里用的singularity安装
singularity pull docker://bioedge/mashtree
singularity build --sandbox mashtree mashtree_latest.sif
其中bootstrap的两个脚本需要安装quicktree和几个perl包，perl包的话缺少就用cpanm装就行。
quicktree也比较简单
wget https://github.com/khowe/quicktree/archive/v2.5.tar.gz
tar xvf v2.5.tar.gz 
cd quicktree-2.5
make
不过运行脚本的话需要把mash和quicktree添加到环境变量
3. 先使用sketch功能转化基因序列
mash sketch sp1.fa
mash sketch sp2.fa
4. 建树
singularity exec -B /data mashtree mashtree --mindepth 0 --numcpus 10 *.msh > test.dnd
5. 还可以跑个bootstrap 【同时还有jackknife】
perl mashtree_bootstrap.pl --reps 1000 --numcpus 10 *.msh -- --min-depth 0 > test.bootstrap dnd
6. ITOL之类的可视化软件画树就行了

多基因组比对 lastz

1. 将fa文件转换生成2bit文件，计算每条染色体的长度
faToTwoBit sp1.fa sp1.2bit
faToTwoBit sp2.fa sp2.2bit
faSize sp1.fa -detailed > sp1.sizes
faSize sp2.fa -detailed > sp2.sizes
2. lastz本身不支持并行，所以我们将sp1基因组按照染色体切分，手动并行
mkdir sp1_t
faSplit byName  sp1.fa sp1_t
3. lastz比对
mkdir axt
for i in sp1_t/*.fa; 
do prefix=$(basename $i .fa); 
lastz $i sp2.fa O=400 E=30 K=3000 L=3000 H=2200 T=1 --format=axt --ambiguous=n --ambiguous=iupac > axt/${prefix}.axt;
done
4. 生成Chain文件及Net文件
axtChain test.axt sp1.2bit sp2.2bit test.chain -minScore=3000 -linearGap=medium
chainNet test.chain -minSpace=1 cishu.soft.masked.sizes di.softmasked.sizes test1.net test2.net

参考资料：

本节使用的脚本

# genome_len.py ：
import sys

f1 = open(sys.argv[1],'r')

for line in f1 :
    line = line.strip()
    if line[0] == ">" :
        x = line
    else :
        leng = len(line)
        print(x+"\t"+str(leng))

f1.close()

# select_len.py
import sys

f1 = open(sys.argv[1],'r')
f2 = int(sys.argv[2])

for line in f1 :
    line = line.strip()
    if line[0] == ">":
        x = line
    else :
        length = len(line)
        if length >= f2 :
            print(x)
            print(line)

f1.close()

# Noname_1.pl
#!/usr/bin/perl -w
use strict;
die  "Version 1.0\t2021-01-11;\nUsage: $0 <InPut>\n" unless (@ARGV ==1);
open (IA,"$ARGV[0]") || die "input file can't open $!";
<IA>;<IA>;<IA>;<IA>;<IA>;
        while(<IA>)
        {
                chomp ;
                my @inf=split ;
                next if  ($inf[6]<90);
                print "$inf[-2]\t$inf[0]\t$inf[1]\t$inf[-2]\t$inf[-1]\t$inf[3]\t$inf[4]\t$inf[-7]\n";
        }
close IA;

# coordsTxT2jcviInput.py
import sys

f1 = open(sys.argv[1],'r') # sp1-sp2.coords.txt
f2 = open(sys.argv[2],'w') # sp1.bed
f3 = open(sys.argv[3],'w') # sp2.bed
f4 = open(sys.argv[4],'w') # .sample
n1 = sys.argv[5] # sp1 name
n2 = sys.argv[6] # sp2 name

for line in f1 :
    line = line.strip().split()
    new_n1 = n1 + "_" + line[0]
    new_n2 = n2 + "_" + line[4]
    if int(line[1]) < int(line[2]) :
        f2.write(new_n1+"\t"+str(int(line[1])-1)+"\t"+str(int(line[1])+1)+"\t"+new_n1+"_"+line[1]+"\t0\t+\n")
        f2.write(new_n1+"\t"+str(int(line[2])-1)+"\t"+str(int(line[2])+1)+"\t"+new_n1+"_"+line[2]+"\t0\t+\n")
    else :
        f2.write(new_n1+"\t"+str(int(line[2])-1)+"\t"+str(int(line[2])+1)+"\t"+new_n1+"_"+line[1]+"\t0\t-\n")
        f2.write(new_n1+"\t"+str(int(line[1])-1)+"\t"+str(int(line[1])+1)+"\t"+new_n1+"_"+line[2]+"\t0\t-\n")
    if int(line[5]) < int(line[6]) :
        f3.write(new_n2+"\t"+str(int(line[5])-1)+"\t"+str(int(line[5])+1)+"\t"+new_n2+"_"+line[5]+"\t0\t+\n")
        f3.write(new_n2+"\t"+str(int(line[6])-1)+"\t"+str(int(line[6])+1)+"\t"+new_n2+"_"+line[6]+"\t0\t+\n")
    else :
        f3.write(new_n2+"\t"+str(int(line[6])-1)+"\t"+str(int(line[6])+1)+"\t"+new_n2+"_"+line[6]+"\t0\t-\n")
        f3.write(new_n2+"\t"+str(int(line[5])-1)+"\t"+str(int(line[5])+1)+"\t"+new_n2+"_"+line[5]+"\t0\t-\n")
    if int(line[1]) < int(line[2]) and int(line[5]) < int(line[6]) :
        f4.write(new_n1+"_"+line[1]+"\t"+new_n1+"_"+line[2]+"\t"+new_n2+"_"+line[5]+"\t"+new_n2+"_"+line[6]+"\t500\t+\n")
    elif int(line[1]) < int(line[2]) and int(line[5]) > int(line[6]) :
        f4.write(new_n1+"_"+line[1]+"\t"+new_n1+"_"+line[2]+"\t"+new_n2+"_"+line[6]+"\t"+new_n2+"_"+line[5]+"\t500\t-\n")
    elif int(line[1]) > int(line[2]) and int(line[5]) < int(line[6]) :
        f4.write(new_n1+"_"+line[2]+"\t"+new_n1+"_"+line[1]+"\t"+new_n2+"_"+line[5]+"\t"+new_n2+"_"+line[6]+"\t500\t-\n")
    else :
        f4.write(new_n1+"_"+line[2]+"\t"+new_n1+"_"+line[1]+"\t"+new_n2+"_"+line[6]+"\t"+new_n2+"_"+line[5]+"\t500\t+\n")

f1.close()
f2.close()
f3.close()
f4.close()

网友评论

本文标题：染色体变异（持续更新中...）

本文链接：https://www.haomeiwen.com/subject/iuradrtx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！