Genome De-redundancy (Part 1)

Author: GenomeStudy | Published 2023-08-03 11:40

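De-redundancy

Simple genomes usually need no de-redundancy after assembly. Longan, cauliflower and papaya are all simple genomes, and their assemblies can go straight to the next step. Some genomes, however, contain duplicated sequence, sometimes a great deal of it, and their assemblies need a de-redundancy step.

Here I recommend a de-redundancy tool: khaper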

https://github.com/lardo/khaper

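Installation

The tool works straight after download once its scripts are made executable. The only thing you need to install is jellyfish.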

git clone https://github.com/lardo/khaper.git
cd khaper/Bin   &&  chmod 755 *
conda install -c bioconda jellyfish
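
Before moving on it can help to confirm that the prerequisites are visible to the shell. The snippet below is only a minimal sketch: `have_tool` is a helper name I made up, and it assumes that `perl` and `jellyfish` are expected on PATH.

```shell
# have_tool is a hypothetical helper: succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report the status of each prerequisite instead of aborting on the first miss.
for tool in perl jellyfish; do
    if have_tool "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool"
    fi
done
```

If jellyfish is reported missing after the conda install, the conda environment is probably not activated in the current shell.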

Usage

1. Prepare input files

Prepare:
assemble.fasta      # genome assembly with duplicated sequences
PE300_1.fq.gz       # read1
PE300_2.fq.gz       # read2

2. Build the k-mer frequency table

ls *.gz > fq.lst
# Most genomes are larger than 100 Mb and smaller than 10 Gb, so k=17 is usually a suitable choice
perl Bin/Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3 -d Kmer_17
# result:
# kmer bit file: Kmer_17/02.Uinque_bit/kmer_17.bit

Note:

a. k=15 is suitable for genomes smaller than 100 Mb.
b. k=17 is suitable for genomes smaller than 10 Gb.
c. This version only supports k<=17.
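
The rule in the notes above can be written down as a small shell function. This is a sketch, not part of khaper: `choose_k` is a hypothetical name, and it assumes the 100M and 10G thresholds are genome sizes in base pairs.

```shell
# choose_k is a hypothetical helper: print the k suggested by the notes
# for a genome size given in base pairs.
choose_k() {
    size_bp=$1
    if [ "$size_bp" -lt 100000000 ]; then
        echo 15          # genome smaller than 100 Mb
    elif [ "$size_bp" -lt 10000000000 ]; then
        echo 17          # genome smaller than 10 Gb
    else
        # The notes give no recommendation here, and k is capped at 17.
        echo "no recommended k for genomes of 10 Gb or more" >&2
        return 1
    fi
}

choose_k 800000000    # an 800 Mb genome
```

The printed value can then be passed to the -k option of Graph.pl and the --kmer option of remDup.pl so that both steps use the same k.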

3. Compress the assembly file

# compress the genome

# Usage:
perl remDup.pl <genome.fa> <outdir> <cutoff:0.7>

    Options:
        --ref   <str>  The ref genome to build kbit
        --kbit  <str>  The unique kmer file
        --kmer  <int>  The kmer size [15]
        --sort  <int>  Sort seq by length [1]

Description
    This script removes duplicated sequences.

# Demo
perl Bin/remDup.pl --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 assemble.fasta Compress 0.3

# result:
compress file: Compress/trinity.single.fasta.gz

A worked example

## Prepare input files
HiFi_path=/share/home/off/Work/Genome_assembly/Sre/01.HiFi
contig=/share/home/off/Work/Genome_assembly/Sre/03.Assembly/01.hifiasm/Sre.asm.hic.p_ctg.fa
## Build the k-mer frequency table
ls "${HiFi_path}"/*.gz > fq.list
perl ~/biosoft/khaper/Bin/Graph.pl pipe -i fq.list -m 2 -k 17 -s 1,3 -d Kmer_17
## Compress the assembly file
#perl ~/biosoft/khaper/Bin/remDup.pl "${contig}" --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 Compress 0.3
## 0.3 is the cutoff value; the default cutoff is 0.7. Try 0.6, 0.5 and 0.4 in turn and compare the results.
## Compress is the name of the output directory; change it as you like.
perl ~/biosoft/khaper/Bin/remDup.pl "${contig}" --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 cutoff_0.6 0.6
perl ~/biosoft/khaper/Bin/remDup.pl "${contig}" --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 cutoff_0.5 0.5
perl ~/biosoft/khaper/Bin/remDup.pl "${contig}" --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 cutoff_0.4 0.4
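
Trying several cutoffs is repetitive, so a loop keeps the runs consistent. The sketch below follows the remDup.pl usage shown in step 3 and only prints the commands. `sweep_cutoffs` is a hypothetical helper, and the `khaper/Bin` path assumes the clone location from the installation step.

```shell
# Assumed locations; adjust to your own layout.
khaper_bin="khaper/Bin"
kbit="Kmer_17/02.Uinque_bit/kmer_17.bit"

# sweep_cutoffs is a hypothetical helper: print one remDup.pl command per cutoff.
# Nothing is executed here; pipe the output to sh once the commands look right.
sweep_cutoffs() {
    contig=$1
    shift
    for cutoff in "$@"; do
        echo "perl $khaper_bin/remDup.pl --kbit $kbit --kmer 17 $contig cutoff_$cutoff $cutoff"
    done
}

sweep_cutoffs assemble.fasta 0.6 0.5 0.4
```

Each cutoff gets its own output directory (cutoff_0.6, cutoff_0.5, cutoff_0.4), so the results can be compared side by side afterwards.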

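Output files

compress file: cutoff_0.6/trinity.single.fasta.gz

So what is the criterion for choosing a cutoff? Pick the one at which the compressed assembly reaches the estimated genome size, then run a BUSCO assessment and confirm that the BUSCO score has not dropped much.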
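
To compare an assembly against the estimated genome size you need its total length. The following is a rough sketch using only gzip and awk; `fasta_size` and `size_ratio` are hypothetical helper names, and the 800 Mb estimate in the comments is a made-up figure for illustration.

```shell
# fasta_size is a hypothetical helper: print the total number of bases in a
# FASTA file (plain or gzip-compressed), ignoring header lines.
fasta_size() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
                END   { printf "%.0f\n", total }'
}

# size_ratio is a hypothetical helper: assembly size divided by the estimated
# genome size, printed to two decimal places.
size_ratio() {
    awk -v a="$1" -v g="$2" 'BEGIN { printf "%.2f\n", a / g }'
}

# Example, assuming an estimated genome size of 800 Mb:
#   size=$(fasta_size cutoff_0.6/trinity.single.fasta.gz)
#   size_ratio "$size" 800000000
```

A ratio close to 1 suggests that the cutoff has brought the assembly down to roughly the estimated genome size. The BUSCO check is still needed to confirm that gene content was not lost along the way.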

Reference

https://github.com/lardo/khaper

Article link: https://www.haomeiwen.com/subject/xrjmpdtx.html
