English content reposted from:
https://bioinformaticsworkbook.org/introduction/dataTerminology.html
Learning Objective
- base/nucleotide
- read
- contig
- scaffold
- chromosome
What is a base?
There are four common bases in a DNA sequence: Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). Uracil (U) is found in RNA in place of Thymine.
Image taken from Wikipedia, where more information about nucleotides can also be found.
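The replacement of Thymine by Uracil in RNA can be illustrated with a few lines of Python. This is only a minimal sketch: the function name dna_to_rna and the example sequence are made up for illustration, and the conversion is a plain letter substitution, not a model of transcription.

```python
DNA_BASES = set("ACGT")

def dna_to_rna(dna: str) -> str:
    """Return the RNA spelling of a DNA string (every T replaced by U)."""
    dna = dna.upper()
    unexpected = set(dna) - DNA_BASES
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return dna.replace("T", "U")

print(dna_to_rna("GATTACA"))  # GAUUACA
```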
What is a read?
A read is a string of bases represented by their one-letter codes. Here is an example of a read that is 70 bases long.
TTAACCTTGGTTTTGAACTTGAACACTTAGGGGATTGAAGATTCAACAACCCTAAAGCTTGGGGTAAAAC
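Because a read is just a string, its length and base composition can be checked directly. The sketch below uses the example read from above; the helper name base_counts is hypothetical.

```python
from collections import Counter

read = "TTAACCTTGGTTTTGAACTTGAACACTTAGGGGATTGAAGATTCAACAACCCTAAAGCTTGGGGTAAAAC"

def base_counts(seq: str) -> Counter:
    """Count how often each one-letter base code occurs in a read."""
    return Counter(seq.upper())

print(len(read))          # number of bases in the read
print(base_counts(read))  # occurrences of A, C, G and T
```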
What is a contig?
A contig is the consensus sequence generated by aligning reads to one another.
[Figure: contig] In the figure, the last line is the consensus of the aligned reads. We call this consensus sequence a contig.
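The idea of a consensus can be sketched as a per-column majority vote. This toy example assumes the reads are already aligned and that each read's start position (offset) is known; real assemblers have to find the overlaps themselves, which is not shown here. The function name consensus and the example reads are illustrative only.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus of reads given as (offset, sequence) pairs."""
    length = max(offset + len(seq) for offset, seq in aligned_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    # Columns not covered by any read are reported as N (unknown base).
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)

reads = [
    (0, "TTAACCTTGG"),
    (2, "AACCTTGGTT"),
    (4, "CCTAGGTTTT"),  # contains one sequencing error (A instead of T)
    (8, "GGTTTTGAAC"),
]
print(consensus(reads))  # TTAACCTTGGTTTTGAAC
```

The erroneous A in the third read is outvoted by the two other reads covering the same position, so it does not appear in the contig.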
What is a scaffold?
A scaffold is a set of contigs that have been ordered and oriented based on mate-pair or other long-distance information.
contigNNNNNNNNNNNNgitnocNNNNNNNNcontigNNNNNNNNcontigNNNNgitnoc
In the line above
- contig is a string of bases (A, T, C or G)
- N is an unknown base
- gitnoc is the word contig written backwards to represent the reverse complement of a contig
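The notation above can be turned into a small sketch that joins contigs into a scaffold string. It assumes each contig comes with a strand ("+" as is, "-" for reverse complement) and a gap length to its right; the names reverse_complement and build_scaffold and the example sequences are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def build_scaffold(pieces):
    """Join (sequence, strand, gap_after) triples into one scaffold string."""
    parts = []
    for seq, strand, gap_after in pieces:
        if strand not in ("+", "-"):
            raise ValueError(f"strand must be '+' or '-', got {strand!r}")
        parts.append(seq if strand == "+" else reverse_complement(seq))
        parts.append("N" * gap_after)  # unknown bases between contigs
    return "".join(parts)

pieces = [("ATGCC", "+", 4), ("GGTTA", "-", 3), ("CCGTA", "+", 0)]
print(build_scaffold(pieces))  # ATGCCNNNNTAACCNNNCCGTA
```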
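Some supplementary notes gathered from other articles (figures would make this even better):
contig/scaffold and N50/N90
When sequencing reads are assembled and can be joined completely, with no gaps in between, the result is a contig. If there are gaps in between but the gap lengths are known, the resulting sequence is called a scaffold.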
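contig N50 and scaffold N50
Sort the contigs or scaffolds from longest to shortest and add up their lengths in that order. When the running total reaches 50% of the genome size (here, the total length of all contigs or scaffolds), the length of the contig/scaffold just added is the contig/scaffold N50. A larger N50 generally indicates a more contiguous, higher-quality assembly. N90 is defined in the same way: it is the length of the contig/scaffold at which the running total reaches 90% of the genome size.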
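The N50/N90 definition can be sketched as one small function. This is a minimal illustration, assuming the assembly is given simply as a list of contig or scaffold lengths; the function name nx and the example lengths are made up. Integer arithmetic is used so that the threshold comparison is exact.

```python
def nx(lengths, percent=50):
    """Return the Nx value (N50 for percent=50, N90 for percent=90)."""
    if not lengths:
        raise ValueError("no lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the running total
        # to at least percent% of the total length.
        if running * 100 >= total * percent:
            return length

lengths = [80, 70, 50, 40, 30, 20, 10]  # total length 300
print(nx(lengths, 50))  # 70, because 80 + 70 = 150 is half of 300
print(nx(lengths, 90))  # 30, because 80 + 70 + 50 + 40 + 30 = 270
```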
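Author: wo_monic
Link: https://www.jianshu.com/p/9876964e3d20
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, please credit the source.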
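Genome assembly is generally divided into three levels: contig, scaffold and chromosome. A contig is a consensus sequence derived from the short reads produced by large-scale sequencing. The first step of assembly is to assemble contigs from short-insert (paired-end) libraries. Next, using large-insert (mate-pair) libraries of different sizes, the previously isolated contigs are ordered and linked end to end. In this step the orientation of contigs is adjusted, and gaps (represented by N) may remain between contigs. The result is a set of scaffolds, also known as supercontigs or metacontigs. Finally, scaffolds are merged and adjusted based on a genetic map or an optical map to form a chromosome-level assembly.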
https://zhuanlan.zhihu.com/p/38317398
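What is a Scaffold?
In de novo genome sequencing, after Contigs have been obtained by assembling reads, it is often also necessary to build 454 Paired-end libraries or Illumina Mate-pair libraries in order to obtain the sequences at both ends of fragments of a given size (e.g. 3 Kb, 6 Kb, 10 Kb, 20 Kb). Based on these sequences, the relative order of some Contigs can be determined, and the Contigs whose order is known make up a Scaffold.
Contig N50
Assembling reads yields Contigs of different lengths. Adding up the lengths of all Contigs gives the total Contig length. The Contigs are then sorted from longest to shortest, e.g. Contig 1, Contig 2, Contig 3, ..., Contig 25. The Contig lengths are added up in this order, and when the running sum reaches half of the total Contig length, the length of the last Contig added is the Contig N50. For example, if Contig 1 + Contig 2 + Contig 3 + Contig 4 is the first sum to reach 1/2 of the total Contig length, then the length of Contig 4 is the Contig N50. Contig N50 can serve as one criterion for judging the quality of a genome assembly.
Scaffold N50
Scaffold N50 is defined in the same way as Contig N50. Assembling Contigs yields Scaffolds of different lengths. Adding up the lengths of all Scaffolds gives the total Scaffold length. The Scaffolds are then sorted from longest to shortest, e.g. Scaffold 1, Scaffold 2, Scaffold 3, ..., Scaffold 25. The Scaffold lengths are added up in this order, and when the running sum reaches half of the total Scaffold length, the length of the last Scaffold added is the Scaffold N50. For example, if Scaffold 1 + Scaffold 2 + Scaffold 3 + Scaffold 4 + Scaffold 5 is the first sum to reach 1/2 of the total Scaffold length, then the length of Scaffold 5 is the Scaffold N50. Scaffold N50 can serve as one criterion for judging the quality of a genome assembly.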
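Contig N50 and Scaffold N50 can both be computed from the same scaffold sequences. The sketch below makes one assumption that is not stated in the text: contigs are recovered by splitting each scaffold at runs of N. The function names and the tiny example sequences are made up for illustration, and scaffold lengths here include the N gaps.

```python
import re

def n50(lengths):
    """Length of the sequence at which the running sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")

def split_into_contigs(scaffold: str):
    """Split a scaffold at runs of N and keep the non-empty pieces."""
    return [piece for piece in re.split(r"N+", scaffold.upper()) if piece]

scaffolds = [
    "ACGTACGTAC" + "N" * 5 + "GGCCTT",  # contigs of 10 and 6 bases
    "TTGACA" + "N" * 3 + "CAGT",        # contigs of 6 and 4 bases
    "AACCGGTT",                         # a single contig of 8 bases
]
contigs = [c for s in scaffolds for c in split_into_contigs(s)]

print(n50([len(c) for c in contigs]))    # 8  (Contig N50)
print(n50([len(s) for s in scaffolds]))  # 21 (Scaffold N50)
```

Scaffold N50 is at least as large as Contig N50 in this setup, since every contig lies inside some scaffold.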
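Author: 白羊铁蛋
Link: https://www.jianshu.com/p/117441ac6eb8
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, please credit the source.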
What is a chromosome?
Chromosomes are the largest DNA molecules in a cell.
Scaffolds can be ordered and oriented using a genetic map or Hi-C data into linkage groups or chromosomes.
The ultimate goal of a genome assembly project is to assemble reads into phased chromosomes that represent an actual individual.
Most chromosomal assemblies produced today are not phased or may represent multiple individuals.