核苷酸/氨基酸序列聚类去重Cd-hit

作者: xiaoji_hb | 来源:发表于2023-07-04 10:41 被阅读0次

转录组重建系统发育（四）使用cdhit去冗余和使用transde
cd-hit 基因集去冗余
CD-hit安装及使用
使用cd-hit对蛋白质或核酸序列进行聚类
CD-hit如何得到聚类转录本的数目
R语言中通过ggmsa包展示多序列比对结果
生信常用文件格式
Qiime2 (三)-DADA2降噪（Denoise）流程
如何使用SnapGene将DNA序列转换为氨基酸序列
序列比对1

我们在分析测序数据或者下载数据库数据时，经常需要合并数据。期间不可避免的出现重复序列，为了减少减少后续资源的使用，更快速地分析数据，往往需要先对数据进行聚类/去重。这种类型的软件很多，今天给大家介绍两个常用的工具。

seqkit软件

该软件是一款很老的、用于序列处理的软件，经常被用来处理二代测序fastq文件，比如常用的序列提取(sample、grep)、文件格式转化(fq2fa)等。

软件中rmdup模块用来序列去重，可以根据序列id、序列全名以及序列进行去重。

$ seqkit rmdup -h
remove duplicated sequences by ID/name/sequence

Attentions:
  1. When comparing by sequences, both positive and negative strands are
     compared. Switch on -P/--only-positive-strand for considering the
     positive strand only.
  2. Only the first record is saved for duplicates.

Usage:
  seqkit rmdup [flags]

Flags:
  -n, --by-name                by full name instead of just id
  -s, --by-seq                 by seq
  -D, --dup-num-file string    file to save number and list of duplicated seqs
  -d, --dup-seqs-file string   file to save duplicated seqs
  -h, --help                   help for rmdup
  -i, --ignore-case            ignore case
  -P, --only-positive-strand   only considering positive strand when comparing by sequence

软件去重时通过-n、-s两个参数指定使用序列全名和序列进行去重，默认使用id(此处的id依照一般的序列读取规则，选取>行第一个空格前的部分作为id)进行去重。去重时，默认考虑正负两条链，同时可以通过-i参数，忽略序列序列中碱基或者氨基酸的大小写。

该软件去重时使用精确匹配，只有完全一致才会去除。为了方便后续的分析，我们有时想对序列进行相似性聚类，这款软件就难以实现这样的功能，可以考虑cd-hit软件。

cd-hit软件

该软件可以对序列进行聚类，以此来达到缩减数据集的目的。软件基本原理是先对输出的fasta文件中所有序列进行排序，以最长的序列为代表序列，考查其他序列与该序列的相似程度，在阈值范围内的序列会归为一类。该软件最大的特点是运行速度快，比如使用90%相似度阈值对NCBI网站中nr数据库(2006.02)进行过滤，该数据库共包含3287897条序列，过滤后剩余1982695条序列，用时7h50m。

软件包括很多分析模块，可以根据自己的数据类型选择

* cd-hit Cluster peptide sequences
* cd-hit-est Cluster nucleotide sequences
* cd-hit-2d Compare 2 peptide databases
* cd-hit-est-2d Compare 2 nucleotide databases
* psi-cd-hit Cluster proteins at <40% cutoff
* cd-hit-lap Identify overlapping reads
* cd-hit-dup Identify duplicates from single or paired Illumina reads
* cd-hit-454 Identify duplicates from 454 reads
* cd-hit-otu Cluster rRNA tags
* cd-hit Web server Cluster user-uploaded data
* cd-hit-para Cluster sequences in parallel on a computer cluster
* scripts Parse results and so on
* h-cd-hit Hierarchical clustering

一般常用的是cd-hit对氨基酸序列聚类，用cd-hit-est对核苷酸序列聚类，聚类时有两个重要的参数，一个是-c，指定聚类阈值，一个是-n，指定word size，这两个配套使用，有推荐的阈值。

cd-hit 氨基酸聚类阈值：
word size 5 is for thresholds 0.7 ~ 1.0
word size 4 is for thresholds 0.6 ~ 0.7
word size 3 is for thresholds 0.5 ~ 0.6
word size 2 is for thresholds 0.4 ~ 0.5

cd-hit-est 核苷酸聚类阈值：
Word size 10-11 is for thresholds 0.95 ~ 1.0
Word size 8,9 is for thresholds 0.90 ~ 0.95
Word size 7 is for thresholds 0.88 ~ 0.9
Word size 6 is for thresholds 0.85 ~ 0.88
Word size 5 is for thresholds 0.80 ~ 0.85
Word size 4 is for thresholds 0.75 ~ 0.8

什么样的word size和聚类阈值条件下能达到比较好的分析效果，需要根据自己的数据进行测试。如果数据量比较大，可以使用-T参数并行计算，减少分析时间。

运行完成后，分析结果会给出聚类后的筛选结果及聚类详细信息。

>Cluster 0
0       502nt, >63778083-f569-4214-... at +/100.00%
1       256nt, >eec9e8c2-abb7-4e90-... at +/100.00%
2       951nt, >31654eb2-f4e9-4d3b-... at +/100.00%
3       30300nt, >440a3b49-c1a7-4e3c-... *
4       262nt, >78d3005d-da28-4f0f-... at +/100.00%
5       1219nt, >de1fecf0-671d-45ad-... at +/100.00%
6       363nt, >be8a1159-80b5-4ee4-... at +/100.00%
7       340nt, >3b3cd125-fed8-4655-... at +/100.00%
8       449nt, >71b64698-a551-4358-... at +/100.00%
9       517nt, >2d3b31bf-2492-4750-... at +/100.00%
10      1047nt, >79eae31c-e365-4687-... at +/100.00%
11      453nt, >0cea7a85-f764-41fc-... at +/100.00%
12      448nt, >425d4faa-497c-4e5c-... at +/100.00%
13      306nt, >908f4fb7-45c2-49ee-... at +/100.00%
14      360nt, >f1367801-98f6-46f3-... at +/100.00%
15      269nt, >2a461953-cbce-4bf8-... at +/100.00%
16      360nt, >8e45ff14-1c2f-4e81-... at +/100.00%

每一个聚类中，+/后面表示的是序列相似度，标注*为该聚类下最终选择的序列(一般是最长的那条)