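I. Introduction to FastQC
When I receive a new batch of data to analyze, the first question is whether the data quality is good enough, since that determines how the downstream bioinformatics analysis should proceed.
A tool I commonly use for assessing next-generation sequencing (NGS) data is FastQC. It is written in Java and can evaluate sequencing data quickly using multiple threads. It produces a report covering, among other things, per-base read quality, GC content, read length, and k-mer distribution, which gives a fast overview of data quality.
FastQC official website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
II. Using the software
Parameters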
$ fastqc -h
FastQC - A high throughput sequence QC analysis tool
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
DESCRIPTION
FastQC reads a set of sequence files and produces from each one a quality
control report consisting of a number of different modules, each one of
which will help to identify a different potential type of problem in your
data.
If no files to process are specified on the command line then the program
will start as an interactive graphical application. If files are provided
on the command line then the program will run with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.
The options for the program as as follows:
-h --help Print this help file and exit
-v --version Print the version of the program and exit
-o --outdir Create all output files in the specified output directory.
Please note that this directory must exist as the program
will not create it. If this option is not set then the
output file for each sequence file is created in the same
directory as the sequence file which was processed.
--casava Files come from raw casava output. Files in the same sample
group (differing only by the group number) will be analysed
as a set rather than individually. Sequences with the filter
flag set in the header will be excluded from the analysis.
Files must have the same names given to them by casava
(including being gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nano Files come from nanopore sequences and are in fast5 format. In
this mode you can pass in directories to process and the program
will take in all fast5 files within those directories and produce
a single output file from the sequences found in all files.
--nofilter If running with --casava then don't remove read flagged by
casava as poor quality when performing the QC analysis.
--extract If set then the zipped output file will be uncompressed in
the same directory after it has been created. By default
this option will be set if fastqc is run in non-interactive
mode.
-j --java Provides the full path to the java binary you want to use to
launch fastqc. If not supplied then java is assumed to be in
your path.
--noextract Do not uncompress the output file after creating it. You
should set this option if you do not wish to uncompress
the output when running in non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp. All reports will
show data for every base in the read. WARNING: Using this
option will cause fastqc to crash and burn if you use it on
really long reads, and your plots may end up a ridiculous size.
You have been warned!
--min_length Sets an artificial lower limit on the length of the sequence
to be shown in the report. As long as you set this to a value
greater or equal to your longest read length then this will be
the sequence length used to create your read groups. This can
be useful for making directly comaparable statistics from
datasets with somewhat variable read lengths.
-f --format Bypasses the normal sequence file format detection and
forces the program to use the specified format. Valid
formats are bam,sam,bam_mapped,sam_mapped and fastq
-t --threads Specifies the number of files which can be processed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run more threads than your
available memory will cope with, and not more than
6 threads on a 32 bit machine
-c Specifies a non-default file which contains the list of
--contaminants contaminants to screen overrepresented sequences against.
The file must contain sets of named contaminants in the
form name[tab]sequence. Lines prefixed with a hash will
be ignored.
-a Specifies a non-default file which contains the list of
--adapters adapter sequences which will be explicity searched against
the library. The file must contain sets of named adapters
in the form name[tab]sequence. Lines prefixed with a hash
will be ignored.
-l Specifies a non-default file which contains a set of criteria
--limits which will be used to determine the warn/error limits for the
various modules. This file can also be used to selectively
remove some modules from the output all together. The format
needs to mirror the default limits.txt file found in the
Configuration folder.
-k --kmers Specifies the length of Kmer to look for in the Kmer content
module. Specified Kmer length must be between 2 and 10. Default
length is 7 if not specified.
-q --quiet Supress all progress messages on stdout and only report errors.
-d --dir Selects a directory to be used for temporary files written when
generating report images. Defaults to system temp directory if
not specified.
BUGS
Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
or in www.bioinformatics.babraham.ac.uk/bugzilla/
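Options commonly used when assessing NGS data:
-o: output directory for the results. The directory must already exist, because FastQC will not create it.
--extract: by default all result files are packed into one zip archive; when this option is given, the archive is also unpacked in the same directory.
-t: number of files processed simultaneously. Each thread is allocated 250MB of memory, so do not request more threads than the available memory can support.
-c: a file in "name[tab]sequence" format listing possible contaminant sequences. FastQC screens overrepresented sequences against this list; if the option is omitted, the default contaminant list is used.
-a: a file in "name[tab]sequence" format listing adapter sequences. FastQC uses it to assess how much adapter sequence remains in the reads; if the option is omitted, the default list of common adapters is used.
-k: k-mer length used by the Kmer content module; valid values are 2 to 10, and the default is 7.
-q: by default FastQC reports progress as it runs; with this option only errors are reported.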
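The "name[tab]sequence" layout expected by -c and -a is easy to get wrong (spaces instead of a tab, for instance). The following is a minimal sketch of a format check; `check_named_seqs` is a hypothetical helper, not part of FastQC, and the demo entries are placeholders rather than real adapter sequences. Skipping blank lines is a choice made for this sketch.

```shell
#!/usr/bin/env bash
# Check a "name[tab]sequence" file of the kind passed to -c or -a.
# Lines starting with '#' are ignored, as the help text describes.
# check_named_seqs is a hypothetical helper, not part of FastQC.
check_named_seqs() {
    awk -F '\t' '
        /^#/ || /^[[:space:]]*$/ { next }        # skip comments and blank lines
        NF != 2 || $2 !~ /^[ACGTNacgtn]+$/ {     # need exactly name<TAB>sequence
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }                         # non-zero exit if any line failed
    ' "$1"
}

# Demo with made-up entries (names and sequences are placeholders).
demo=$(mktemp)
printf '# custom adapters\nexample_adapter_1\tACGTACGTACGT\nexample_adapter_2\tTTGGCCAATTGG\n' > "$demo"
check_named_seqs "$demo" && echo "format OK"
```

A file that passes this check could then be supplied with `-a` or `-c` on the fastqc command line.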
Assessing NGS data downloaded from the GEO database with FastQC
$ ls
SRR9292576_1.fastq.gz SRR9292578_1.fastq.gz SRR9292580_1.fastq.gz SRR9292582_1.fastq.gz SRR9292584_1.fastq.gz SRR9292586_1.fastq.gz
SRR9292576_2.fastq.gz SRR9292578_2.fastq.gz SRR9292580_2.fastq.gz SRR9292582_2.fastq.gz SRR9292584_2.fastq.gz SRR9292586_2.fastq.gz
SRR9292577_1.fastq.gz SRR9292579_1.fastq.gz SRR9292581_1.fastq.gz SRR9292583_1.fastq.gz SRR9292585_1.fastq.gz SRR9292587_1.fastq.gz
SRR9292577_2.fastq.gz SRR9292579_2.fastq.gz SRR9292581_2.fastq.gz SRR9292583_2.fastq.gz SRR9292585_2.fastq.gz SRR9292587_2.fastq.gz
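$ mkdir -p Fastqc_results
$ for id in *.gz; do fastqc "$id" -o Fastqc_results/ -q; done
When the run finishes, each input file has two result files, a ".zip" and a ".html". The ".html" file is the FastQC report and is the one to focus on; the ".zip" file is the packed result archive, which contains the same HTML report together with the statistics tables and images. All results from this run have been saved to a network drive and can be consulted if needed.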
$ ls QC/Fastqc_results/
SRR9292576_1_fastqc.html SRR9292578_1_fastqc.html SRR9292580_1_fastqc.html SRR9292582_1_fastqc.html SRR9292584_1_fastqc.html SRR9292586_1_fastqc.html
SRR9292576_1_fastqc.zip SRR9292578_1_fastqc.zip SRR9292580_1_fastqc.zip SRR9292582_1_fastqc.zip SRR9292584_1_fastqc.zip SRR9292586_1_fastqc.zip
SRR9292576_2_fastqc.html SRR9292578_2_fastqc.html SRR9292580_2_fastqc.html SRR9292582_2_fastqc.html SRR9292584_2_fastqc.html SRR9292586_2_fastqc.html
SRR9292576_2_fastqc.zip SRR9292578_2_fastqc.zip SRR9292580_2_fastqc.zip SRR9292582_2_fastqc.zip SRR9292584_2_fastqc.zip SRR9292586_2_fastqc.zip
SRR9292577_1_fastqc.html SRR9292579_1_fastqc.html SRR9292581_1_fastqc.html SRR9292583_1_fastqc.html SRR9292585_1_fastqc.html SRR9292587_1_fastqc.html
SRR9292577_1_fastqc.zip SRR9292579_1_fastqc.zip SRR9292581_1_fastqc.zip SRR9292583_1_fastqc.zip SRR9292585_1_fastqc.zip SRR9292587_1_fastqc.zip
SRR9292577_2_fastqc.html SRR9292579_2_fastqc.html SRR9292581_2_fastqc.html SRR9292583_2_fastqc.html SRR9292585_2_fastqc.html SRR9292587_2_fastqc.html
SRR9292577_2_fastqc.zip SRR9292579_2_fastqc.zip SRR9292581_2_fastqc.zip SRR9292583_2_fastqc.zip SRR9292585_2_fastqc.zip SRR9292587_2_fastqc.zip
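With many input files it is worth confirming that every one of them produced a report. The sketch below assumes the naming pattern visible in the listing above (input `X.fastq.gz` gives `X_fastqc.html`); `report_name` and `missing_reports` are hypothetical helpers, and only the `.fastq`/`.fq` extensions (optionally gzipped) are handled.

```shell
#!/usr/bin/env bash
# Map an input file name to the report name seen in the listing above,
# e.g. SRR9292576_1.fastq.gz -> SRR9292576_1_fastqc.html.
# Hypothetical helper; handles only .fastq/.fq with an optional .gz suffix.
report_name() {
    local base
    base=$(basename "$1")
    base=${base%.gz}
    base=${base%.fastq}
    base=${base%.fq}
    printf '%s_fastqc.html\n' "$base"
}

# Print the inputs whose report is missing from the output directory.
# $1 = output directory, remaining arguments = input files.
missing_reports() {
    local outdir=$1 f
    shift
    for f in "$@"; do
        [ -e "$outdir/$(report_name "$f")" ] || printf '%s\n' "$f"
    done
}
```

For example, `missing_reports Fastqc_results *.fastq.gz` prints nothing when every input has a matching report.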
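Interpreting the FastQC HTML report for one sample
Summary
The Summary shows at a glance which modules passed (PASS, green tick), which raised a warning (WARN, orange exclamation mark), and which failed (FAIL, red cross). The more green ticks, the better the data quality. Modules marked with a red cross deserve close attention, and their cause should be investigated.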
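The same PASS/WARN/FAIL information can be read without opening a browser. The sketch below assumes the `summary.txt` file inside each result archive uses the layout status[tab]module[tab]file name, one module per line; `tally_status` and `failed_modules` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Tally PASS/WARN/FAIL from summary.txt-style input on stdin.
# Assumed layout: status<TAB>module name<TAB>file name, one module per line.
tally_status() {
    awk -F '\t' '
        { count[$1]++ }
        END {
            printf "PASS=%d WARN=%d FAIL=%d\n", count["PASS"], count["WARN"], count["FAIL"]
        }
    '
}

# Print only the names of modules that failed.
failed_modules() {
    awk -F '\t' '$1 == "FAIL" { print $2 }'
}
```

Assuming the archive stores the file under a folder named after the report, something like `unzip -p SRR9292576_1_fastqc.zip SRR9292576_1_fastqc/summary.txt | tally_status` would give the counts for one sample.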
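Basic Statistics
Basic Statistics reports the file type, the quality encoding (which indicates the sequencing platform), the total number of reads, the range of read lengths, and the average GC content of the reads.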
Per base sequence quality
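Per base sequence quality uses box plots to show the distribution of base quality scores at each position along the reads, from 5' to 3'. The x-axis is the position in the read and the y-axis is the Phred quality score Q, where Q = -10*log10(P) and P is the probability that the base call is wrong; Q20 therefore corresponds to an error rate of 1% and Q30 to 0.1%.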
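The relationship between Q and the error probability can be checked with a few lines of awk. This is a minimal sketch; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Convert a Phred quality score Q to an error probability P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Convert an error probability back to a Phred score Q = -10*log10(P).
error_to_phred() {
    awk -v p="$1" 'BEGIN { printf "%.1f\n", -10 * log(p) / log(10) }'
}

# Print the error probability for a few common quality scores.
for q in 10 20 30 40; do
    printf 'Q%s -> error probability %s\n' "$q" "$(phred_to_error "$q")"
done
```

Each step of 10 in Q divides the error probability by 10, which is why the gap between Q20 and Q30 matters more than the numbers suggest.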
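Because of how the sequencing technology works, base quality toward the end of a read is usually lower than at the start, which is normal. If quality at the read ends is clearly poor, consider trimming those bases from all reads.
The module reports "WARN" if the lower quartile at any position is below 10 or the median is below 25, and "FAIL" if the lower quartile at any position is below 5 or the median is below 20.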
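The thresholds just described can be written as a small decision rule for a single position. This is a sketch of the rule as stated in the text, not FastQC's own code; `classify_base_quality` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Classify one read position from its lower quartile and median quality.
# Usage: classify_base_quality LOWER_QUARTILE MEDIAN
classify_base_quality() {
    awk -v lq="$1" -v med="$2" 'BEGIN {
        if (lq < 5 || med < 20)       print "FAIL"   # stricter limits checked first
        else if (lq < 10 || med < 25) print "WARN"
        else                          print "PASS"
    }'
}
```

Since the rule applies to any position, the module's overall status is the worst result over all positions in the read.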
In this example, the base qualities in "Bacillus_subtilis.clean_R1.fastq.gz" fall almost entirely in the high-quality (green) region, which indicates good sequencing quality.
Per sequence quality scores
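Per sequence quality scores plots the mean base quality of each read on the x-axis against the number of reads on the y-axis. The better the sequencing quality, the more reads fall in the high-quality region, so the peak of the curve sits at a high score.
The module reports "WARN" when the peak is below 27 (error rate 0.2%) and "FAIL" when the peak is below 20 (error rate 1%).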
Per base sequence content
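Per base sequence content shows the proportion of A, T, C, and G at each position, which gives some indication of whether sequencing went normally. The x-axis is the position in the read and the y-axis is the percentage of each base at that position. By base complementarity, the proportions of A and T should be close to each other, as should those of C and G.
Random primers used during library preparation cause the base composition to fluctuate over the first few positions. This is normal, although you may also choose to trim the first few bases at the 5' end of all reads.
The module reports "WARN" when the difference between A and T, or between G and C, exceeds 10% at any position, and "FAIL" when that difference exceeds 20%.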
Per sequence GC content
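Per sequence GC content shows the distribution of GC content across reads. The x-axis is the GC content of a read and the y-axis is the number of reads; the blue curve is the theoretical GC distribution (a clear single peak) and the red curve is the observed distribution.
The more closely the red curve follows the blue one, the better the data quality. Deviations in shape are often caused by library contamination or by a biased subset of reads (overrepresented reads). A curve that is close to normal but shifted away from the theoretical distribution suggests a systematic bias, and a red curve with two peaks indicates that other DNA sequences are mixed in.
The module reports "WARN" when more than 15% of reads deviate from the theoretical distribution and "FAIL" when more than 30% do.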
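To see what goes into this plot, the GC percentage of each read can be computed directly from a FASTQ file. The sketch below assumes standard 4-line FASTQ records on stdin; `gc_percent` is a hypothetical helper and does not reproduce FastQC's fitting of the theoretical curve.

```shell
#!/usr/bin/env bash
# Print the GC percentage of every read in uncompressed FASTQ on stdin.
# Assumes 4-line records (sequence on the second line of each record).
gc_percent() {
    awk 'NR % 4 == 2 {
        len = length($0)
        if (len == 0) next                 # skip empty reads
        seq = toupper($0)
        gc = gsub(/[GC]/, "", seq)         # gsub returns the number of G/C bases removed
        printf "%.1f\n", 100 * gc / len
    }'
}
```

For a gzipped file, `gzip -dc SRR9292576_1.fastq.gz | gc_percent | sort -n | uniq -c` gives a rough histogram of the values.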
Per base N content
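Per base N content: when the sequencer cannot call a base with confidence it writes an N, and this plot shows how those N calls are distributed. The x-axis is the position in the read and the y-axis is the percentage of N at that position; the lower, the better.
The module reports "WARN" when the proportion of N at any position exceeds 5% and "FAIL" when it exceeds 20%.
In this example, "Bacillus_subtilis.clean_R1.fastq.gz" contains almost no N bases, which again indicates good sequencing quality.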
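The per-position N percentage is straightforward to reproduce from the raw reads. This is a minimal sketch assuming 4-line FASTQ records on stdin; `n_content` is a hypothetical helper, and the percentage at each position is taken over the reads long enough to reach it.

```shell
#!/usr/bin/env bash
# Per-position percentage of N calls, read from uncompressed FASTQ on stdin.
# Output: position<TAB>percent N. Assumes 4-line FASTQ records.
n_content() {
    awk '
        NR % 4 == 2 {
            len = length($0)
            if (len > max) max = len
            for (i = 1; i <= len; i++) {
                total[i]++                                   # reads covering position i
                if (toupper(substr($0, i, 1)) == "N") n[i]++ # N calls at position i
            }
        }
        END {
            for (i = 1; i <= max; i++)
                printf "%d\t%.2f\n", i, 100 * n[i] / total[i]
        }
    '
}
```

Running `gzip -dc SRR9292576_1.fastq.gz | n_content` would list one line per read position, which can be compared against the 5% and 20% limits.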
Sequence Length Distribution
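Sequence Length Distribution shows the distribution of read lengths; the x-axis is read length and the y-axis is the number of reads.
For raw reads, every read produced in a run should in theory have the same length. For clean reads after quality control, removing adapters and low-quality bases makes the lengths vary, but in good data the distribution is still concentrated at the longest length.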
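A quick way to compare raw and clean reads is to tabulate read lengths yourself. The sketch below assumes 4-line FASTQ records on stdin; `read_length_hist` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Count reads per length from uncompressed FASTQ on stdin.
# Output: length<TAB>number of reads, sorted by length.
read_length_hist() {
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) printf "%d\t%d\n", len, count[len] }' | sort -n
}
```

For raw reads this typically prints a single line; for trimmed reads it prints several, with most reads expected at the largest length.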
Sequence Duplication Levels
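Sequence Duplication Levels counts how often reads with exactly the same sequence occur; such reads are treated as duplication reads. They arise mainly from biased PCR amplification during the NGS workflow. In general, the deeper the sequencing, the more likely some duplication is, and this is normal. The x-axis is the duplication level (number of copies) and the y-axis is the number of duplication reads. In principle, the lower the proportion of duplication reads, the better.
When the dataset is large, computing duplication over all reads would be very slow, so FastQC takes the first 200000 reads and tracks how often they recur in the whole dataset; reads with a duplication level of 10 or more are binned together. Because longer reads are less likely to be exactly identical (sequencing errors make copies differ), the duplication level may still be underestimated.
The module reports "WARN" when duplication reads exceed 20% of the total and "FAIL" when they exceed 50%.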
Adapter Content
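Adapter Content shows how much adapter sequence is present in the reads. The x-axis is the position in the read and the y-axis is the cumulative percentage of reads in which an adapter sequence has been seen by that position.
Raw reads usually contain some adapter sequence, which needs to be filtered out; in clean reads the adapters should in theory already have been removed.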