Transcriptome Analysis in Practice, Part 1: Quality Control and Cleaning of Raw Data

Author: Yeyuntian | Published: 2018-09-14 14:33

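When transcriptome sequencing data come back, the sequencing company usually delivers both a raw data set (Rawdata) and a cleaned data set (Cleandata).

If you like to tinker, you can start from the raw data, clean it yourself, and then carry on with the downstream analysis.

Today we take a raw data set obtained earlier, check its quality, and clean it up in preparation for later analysis.

The software to install is MultiQC, which is used to visualize data quality.

1. Installing and configuring MultiQC

My server has no internet access, so I installed and configured MultiQC on my local machine instead. The QC itself runs on the server, and the results are copied back to the local machine for visualization.

For installation, follow the official MultiQC guide.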

yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ curl -LOk https://github.com/ewels/MultiQC/archive/master.zip
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ l
master.zip       rawdata/                         rnaseq_workshop_slides.pdf                trinityrnaseq-Trinity-v2.8.3/  tuxedo_nprot.2012.016.pdf               RNASeq_Trinity_Tuxedo_Workshop/  TrinityNatureProtocol.nprot.2013.084.pdf  Trinity-v2.8.3.tar
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ unzip master.zip 
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ cd MultiQC-master/
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/MultiQC-master$ sudo python setup.py install
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/MultiQC-master$ multiqc --help
Usage: multiqc [OPTIONS] <analysis directory>

  MultiQC aggregates results from bioinformatics analyses across many
  samples into a single report.

  It searches a given directory for analysis logs and compiles a HTML
  report. It's a general use tool, perfect for summarising the output from
  numerous bioinformatics tools.

  To run, supply with one or more directory to scan for analysis results. To
  run here, use 'multiqc .'

  See http://multiqc.info for more details.

  Author: Phil Ewels (http://phil.ewels.co.uk)

Options:
  -f, --force                     Overwrite any existing reports
  -d, --dirs                      Prepend directory to sample names
  -dd, --dirs-depth INTEGER       Prepend [INT] directories to sample names.
                                  Negative number to take from start of path.
  -s, --fullnames                 Do not clean the sample names (leave as full
                                  file name)
  -i, --title TEXT                Report title. Printed as page header, used
                                  for filename if not otherwise specified.
  -b, --comment TEXT              Custom comment, will be printed at the top
                                  of the report.
  -n, --filename TEXT             Report filename. Use 'stdout' to print to
                                  standard out.
  -o, --outdir TEXT               Create report in the specified output
                                  directory.
  -t, --template [default|default_dev|geo|simple|sections]
                                  Report template to use.
  --tag TEXT                      Use only modules which tagged with this
                                  keyword, eg. RNA
  --view-tags, --view_tags        View the available tags and which modules
                                  they load
  -x, --ignore TEXT               Ignore analysis files (glob expression)
  --ignore-samples TEXT           Ignore sample names (glob expression)
  --ignore-symlinks               Ignore symlinked directories and files
  --sample-names PATH             File containing alternative sample names
  -l, --file-list                 Supply a file containing a list of file
                                  paths to be searched, one per row
  -e, --exclude [module name]     Do not use this module. Can specify multiple
                                  times.
  -m, --module [module name]      Use only this module. Can specify multiple
                                  times.
  --data-dir                      Force the parsed data directory to be
                                  created.
  --no-data-dir                   Prevent the parsed data directory from being
                                  created.
  -k, --data-format [tsv|yaml|json]
                                  Output parsed data in a different format.
                                  Default: tsv
  -z, --zip-data-dir              Compress the data directory.
  -p, --export                    Export plots as static images in addition to
                                  the report
  -fp, --flat                     Use only flat plots (static images)
  -ip, --interactive              Use only interactive plots (HighCharts
                                  Javascript)
  --lint                          Use strict linting (validation) to help code
                                  development
  --pdf                           Creates PDF report with 'simple' template.
                                  Requires Pandoc to be installed.
  --no-megaqc-upload              Don't upload generated report to MegaQC,
                                  even if MegaQC options are found
  -c, --config PATH               Specific config file to load, after those in
                                  MultiQC dir / home dir / working dir.
  --cl-config, --cl_config TEXT   Specify MultiQC config YAML on the command
                                  line
  -v, --verbose                   Increase output verbosity.
  -q, --quiet                     Only show log warnings
  --version                       Show the version and exit.
  -h, --help                      Show this message and exit.

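OK, if multiqc --help prints the usage text above, the installation worked. If the installation fails partway, simply run it again: the installer resolves a number of dependencies by downloading packages, and a network hiccup during one of those downloads causes an error, so a retry is usually enough.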
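The "just try again" advice can be wrapped in a small helper. The following is a minimal sketch, not part of the original walkthrough: the function name retry and the RETRY_DELAY variable are made up for illustration, and the pip command in the comment is an alternative to the setup.py route used above.

```shell
#!/usr/bin/env bash
# retry N CMD...: run CMD up to N times, pausing between attempts.
# RETRY_DELAY (seconds, default 5) is a hypothetical knob for this sketch.
retry() {
    local max_attempts=$1
    shift
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying: $*" >&2
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
}

# Example (not executed here):
#   retry 3 pip install --user multiqc
```

The helper only re-runs the command; it does not distinguish a network failure from any other error, so a command that fails for a different reason will simply fail N times.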

2. Running FastQC on the server

FastQC is mainly used as a quality-check tool for raw NGS data.

yeyt@ubuntu:~/biodata/NH160034/NH160034/rawdata$ pwd
/home/yeyt/biodata/NH160034/NH160034/rawdata
yeyt@ubuntu:~/biodata/NH160034/NH160034/rawdata$ l
B251_1.fq.gz  B251_2.fq.gz  B252_1.fq.gz  B252_2.fq.gz  R251_1.fq.gz  R251_2.fq.gz  R252_1.fq.gz  R252_2.fq.gz  W251_1.fq.gz  W251_2.fq.gz  W252_1.fq.gz  W252_2.fq.gz  rd_md5.txt
yeyt@ubuntu:~/biodata/NH160034/NH160034/rawdata$ fastqc *.gz 
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = "en_US:en",
    LC_ALL = (unset),
    LC_PAPER = "zh_CN.UTF-8",
    LC_ADDRESS = "zh_CN.UTF-8",
    LC_MONETARY = "zh_CN.UTF-8",
    LC_NUMERIC = "zh_CN.UTF-8",
    LC_TELEPHONE = "zh_CN.UTF-8",
    LC_IDENTIFICATION = "zh_CN.UTF-8",
    LC_MEASUREMENT = "zh_CN.UTF-8",
    LC_TIME = "zh_CN.UTF-8",
    LC_NAME = "zh_CN.UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Started analysis of B251_1.fq.gz
Approx 5% complete for B251_1.fq.gz
Approx 10% complete for B251_1.fq.gz
Approx 15% complete for B251_1.fq.gz
...
# Just let it run. There is a faster way, which I did not use here and will cover later.
Approx 30% complete for W252_2.fq.gz
Approx 35% complete for W252_2.fq.gz
Approx 40% complete for W252_2.fq.gz
Approx 45% complete for W252_2.fq.gz
Approx 50% complete for W252_2.fq.gz
Approx 55% complete for W252_2.fq.gz
Approx 60% complete for W252_2.fq.gz
Approx 65% complete for W252_2.fq.gz
Approx 70% complete for W252_2.fq.gz
Approx 75% complete for W252_2.fq.gz
Approx 80% complete for W252_2.fq.gz
Approx 85% complete for W252_2.fq.gz
Approx 90% complete for W252_2.fq.gz
Approx 95% complete for W252_2.fq.gz
Analysis complete for W252_2.fq.gz
# When the run finishes, the directory contains some new files.
yeyt@ubuntu:~/biodata/NH160034/NH160034/rawdata$ l
B251_1.fq.gz        B251_2_fastqc.html  B252_1_fastqc.zip   R251_1.fq.gz        R251_2_fastqc.html  R252_1_fastqc.zip   W251_1.fq.gz        W251_2_fastqc.html  W252_1_fastqc.zip   rd_md5.txt
B251_1_fastqc.html  B251_2_fastqc.zip   B252_2.fq.gz        R251_1_fastqc.html  R251_2_fastqc.zip   R252_2.fq.gz        W251_1_fastqc.html  W251_2_fastqc.zip   W252_2.fq.gz
B251_1_fastqc.zip   B252_1.fq.gz        B252_2_fastqc.html  R251_1_fastqc.zip   R252_1.fq.gz        R252_2_fastqc.html  W251_1_fastqc.zip   W252_1.fq.gz        W252_2_fastqc.html
B251_2.fq.gz        B252_1_fastqc.html  B252_2_fastqc.zip   R251_2.fq.gz        R252_1_fastqc.html  R252_2_fastqc.zip   W251_2.fq.gz        W252_1_fastqc.html  W252_2_fastqc.zip

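After the run, the directory contains new html and zip files. We need to copy these files to the local machine and visualize the data quality there with multiqc.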

yeyt@ubuntu:~/biodata/NH160034/NH160034/rawdata$ mv *.html Fastqcresult/
yeyt@ubuntu:~/biodata/NH160034/NH160034/rawdata$ mv *.zip Fastqcresult/
yeyt@ubuntu:~/biodata/NH160034/NH160034/rawdata$ l
B251_1.fq.gz  B251_2.fq.gz  B252_1.fq.gz  B252_2.fq.gz  Fastqcresult/  R251_1.fq.gz  R251_2.fq.gz  R252_1.fq.gz  R252_2.fq.gz  W251_1.fq.gz  W251_2.fq.gz  W252_1.fq.gz  W252_2.fq.gz  rd_md5.txt
# Create a folder to hold the QC results (Fastqcresult/ has to exist before the mv commands above are run)
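FastQC can also use several threads and write directly into an output directory, which avoids the separate mv step. The sketch below relies on FastQC's --threads and --outdir options; the run_fastqc wrapper and the DRY_RUN variable are hypothetical names introduced here, and the thread count is only an example.

```shell
#!/usr/bin/env bash
# run_fastqc THREADS OUTDIR FILE...
# Wraps FastQC so reports land in OUTDIR and files are processed in parallel.
# With DRY_RUN set (a hypothetical switch), the command is printed, not run.
run_fastqc() {
    local threads=$1 outdir=$2
    shift 2
    mkdir -p "$outdir"    # FastQC expects the output directory to exist
    local cmd=(fastqc --threads "$threads" --outdir "$outdir" "$@")
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Example (not executed here):
#   run_fastqc 4 Fastqcresult *.fq.gz
```

FastQC processes one file per thread, so with 12 input files a value above 12 brings no further speed-up.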

On the local machine, copy the results over with scp.

yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ scp yeyt@220:/home/yeyt/biodata/NH160034/NH160034/rawdata/Fastqcresult/* .
B251_1_fastqc.html                                                                                                                                                        100%  299KB 299.1KB/s   00:01    
B251_1_fastqc.zip                                                                                                                                                         100%  375KB 187.7KB/s   00:02    
B251_2_fastqc.html                                                                                                                                                        100%  308KB 307.8KB/s   00:01    
B251_2_fastqc.zip                                                                                                                                                         100%  389KB 389.1KB/s   00:01    
B252_1_fastqc.html                                                                                                                                                        100%  308KB 308.0KB/s   00:01    
B252_1_fastqc.zip                                                                                                                                                         100%  389KB 389.4KB/s   00:01    
B252_2_fastqc.html                                                                                                                                                        100%  310KB 309.6KB/s   00:00    
B252_2_fastqc.zip                                                                                                                                                         100%  392KB 392.5KB/s   00:00    
R251_1_fastqc.html                                                                                                                                                        100%  300KB 300.0KB/s   00:01    
R251_1_fastqc.zip                                                                                                                                                         100%  377KB 377.3KB/s   00:00    
R251_2_fastqc.html                                                                                                                                                        100%  304KB 304.4KB/s   00:00    
R251_2_fastqc.zip                                                                                                                                                         100%  384KB 384.1KB/s   00:00    
R252_1_fastqc.html                                                                                                                                                        100%  302KB 302.5KB/s   00:00    
R252_1_fastqc.zip                                                                                                                                                         100%  381KB 381.2KB/s   00:00    
R252_2_fastqc.html                                                                                                                                                        100%  306KB 305.9KB/s   00:00    
R252_2_fastqc.zip                                                                                                                                                         100%  387KB 386.7KB/s   00:01    
W251_1_fastqc.html                                                                                                                                                        100%  300KB 300.3KB/s   00:00    
W251_1_fastqc.zip                                                                                                                                                         100%  378KB 377.6KB/s   00:01    
W251_2_fastqc.html                                                                                                                                                        100%  313KB 313.3KB/s   00:00    
W251_2_fastqc.zip                                                                                                                                                         100%  398KB 397.9KB/s   00:00    
W252_1_fastqc.html                                                                                                                                                        100%  306KB 306.4KB/s   00:00    
W252_1_fastqc.zip                                                                                                                                                         100%  387KB 387.4KB/s   00:00    
W252_2_fastqc.html                                                                                                                                                        100%  310KB 310.0KB/s   00:00    
W252_2_fastqc.zip                                                                                                                                                         100%  392KB 391.7KB/s   00:01    
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ l
B251_1_fastqc.html  B252_1_fastqc.zip   R251_1_fastqc.html  R252_1_fastqc.zip                rnaseq_workshop_slides.pdf                W251_1_fastqc.html  W252_1_fastqc.zip
B251_1_fastqc.zip   B252_2_fastqc.html  R251_1_fastqc.zip   R252_2_fastqc.html               TrinityNatureProtocol.nprot.2013.084.pdf  W251_1_fastqc.zip   W252_2_fastqc.html
B251_2_fastqc.html  B252_2_fastqc.zip   R251_2_fastqc.html  R252_2_fastqc.zip                trinityrnaseq-Trinity-v2.8.3/             W251_2_fastqc.html  W252_2_fastqc.zip
B251_2_fastqc.zip   master.zip          R251_2_fastqc.zip   rawdata/                         Trinity-v2.8.3.tar                        W251_2_fastqc.zip
B252_1_fastqc.html  MultiQC-master/     R252_1_fastqc.html  RNASeq_Trinity_Tuxedo_Workshop/  tuxedo_nprot.2012.016.pdf                 W252_1_fastqc.html
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ mkdir Fastqcresult
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ mv *zip Fastqcresult/
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ mv *html Fastqcresult/
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ l
Fastqcresult/    rawdata/                         rnaseq_workshop_slides.pdf                trinityrnaseq-Trinity-v2.8.3/  tuxedo_nprot.2012.016.pdf
MultiQC-master/  RNASeq_Trinity_Tuxedo_Workshop/  TrinityNatureProtocol.nprot.2013.084.pdf  Trinity-v2.8.3.tar
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest$ cd Fastqcresult/
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/Fastqcresult$ l
B251_1_fastqc.html  B251_2_fastqc.zip   B252_2_fastqc.html  R251_1_fastqc.html  R251_2_fastqc.zip   R252_2_fastqc.html  W251_1_fastqc.zip   W252_1_fastqc.html  W252_2_fastqc.zip
B251_1_fastqc.zip   B252_1_fastqc.html  B252_2_fastqc.zip   R251_1_fastqc.zip   R252_1_fastqc.html  R252_2_fastqc.zip   W251_2_fastqc.html  W252_1_fastqc.zip
B251_2_fastqc.html  B252_1_fastqc.zip   master.zip          R251_2_fastqc.html  R252_1_fastqc.zip   W251_1_fastqc.html  W251_2_fastqc.zip   W252_2_fastqc.html

Then use multiqc to aggregate the results.

yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/Fastqcresult$ multiqc ./
[INFO   ]         multiqc : This is MultiQC v1.7.dev0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching './'
Searching 25 files..  [####################################]  100%
[INFO   ]          fastqc : Found 12 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/Fastqcresult$ l
B251_1_fastqc.html  B251_2_fastqc.zip   B252_2_fastqc.html  multiqc_data/        R251_1_fastqc.zip   R252_1_fastqc.html  R252_2_fastqc.zip   W251_2_fastqc.html  W252_1_fastqc.zip
B251_1_fastqc.zip   B252_1_fastqc.html  B252_2_fastqc.zip   multiqc_report.html  R251_2_fastqc.html  R252_1_fastqc.zip   W251_1_fastqc.html  W251_2_fastqc.zip   W252_2_fastqc.html
B251_2_fastqc.html  B252_1_fastqc.zip   master.zip          R251_1_fastqc.html   R251_2_fastqc.zip   R252_2_fastqc.html  W251_1_fastqc.zip   W252_1_fastqc.html  W252_2_fastqc.zip
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/Fastqcresult$ cd multiqc_data/
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/Fastqcresult/multiqc_data$ l
multiqc_data.json  multiqc_fastqc.txt  multiqc_general_stats.txt  multiqc.log  multiqc_sources.txt

####### Once the aggregation is done, open multiqc_report.html to view the results.
####### Below we interpret these results.

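First item in the report

The first item summarizes the data overall, including GC content, read length, and the number of sequences.

From these results, the read length is 150 bp and each run in the raw data contains about 25 M sequences. For paired-end sequencing, the data volume per sample is therefore: 150 bp × 25 M × 2 ends = 7.5 Gb
This figure is for the raw data; the amount of sequence data is an important sequencing quality metric.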
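The yield arithmetic above can be reproduced on the command line. This is a small sketch using the rounded figures from the text (150 bp, 25 M reads per end, 2 ends); yield_bases is a name made up for this example.

```shell
#!/usr/bin/env bash
# Estimate sequencing yield: read length x reads per end x number of ends.
yield_bases() {
    local read_length=$1 reads_per_end=$2 ends=$3
    echo $((read_length * reads_per_end * ends))
}

total=$(yield_bases 150 25000000 2)
echo "total bases: $total"

# Convert to gigabases (1 Gb = 1e9 bases) with one decimal place.
awk -v b="$total" 'BEGIN { printf "%.1f Gb\n", b / 1e9 }'
```

To check an individual sample, replace 25000000 with its actual read count from the report.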

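Failed check

All 12 reports show fluctuating base composition in the first 15 bases

Let us go straight to the one check that failed, the per-base sequence content plot. All 12 reports fail it, with an uneven base distribution around the first 10~15 positions.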
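The same pass/warn/fail information can be read without opening the HTML report, because each FastQC zip archive contains a tab-separated summary.txt (status, module name, file name). The sketch below assumes the *_fastqc.zip files are in the current directory; non_pass_modules is a helper name introduced here.

```shell
#!/usr/bin/env bash
# Read FastQC summary.txt lines on stdin and print every module whose
# status is not PASS, as: file <TAB> status <TAB> module.
non_pass_modules() {
    awk -F'\t' '$1 != "PASS" { printf "%s\t%s\t%s\n", $3, $1, $2 }'
}

# Scan every FastQC archive in the current directory (skipped if none exist).
for z in *_fastqc.zip; do
    [ -e "$z" ] || continue
    # summary.txt sits in a folder named after the archive, without ".zip"
    unzip -p "$z" "${z%.zip}/summary.txt" | non_pass_modules
done
```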

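If the difference between A and T, or between G and C, exceeds 10% at any position, FastQC reports warn
If that difference exceeds 20%, FastQC reports fail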
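Expressed as code, the two thresholds look like this. It is only an illustration of the rule, not FastQC's own implementation; classify_base_bias is a hypothetical name and its argument is the largest A/T or G/C difference in percent.

```shell
#!/usr/bin/env bash
# classify_base_bias DIFF
# DIFF is the largest |A - T| or |G - C| difference at any position, in percent.
classify_base_bias() {
    awk -v d="$1" 'BEGIN {
        if (d > 20)      print "fail"
        else if (d > 10) print "warn"
        else             print "pass"
    }'
}

classify_base_bias 12.5   # falls between the two thresholds
```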

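In theory each base should sit at roughly 25%. Our data still show an uneven base composition at the start of the reads, but the 15~150 bp range is acceptable.

So the next step is to trim off these first 15 bases.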

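3. Running Trimmomatic on the server to clean the reads

For details on Trimmomatic and its usage, see the software's manual.

To remove the leading bases from the 5' end, we use the commands below. Note that the script actually run here crops 18 bases (HEADCROP:18), slightly more than the 10~15 affected positions.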

yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/Fastqcresult$ wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.38.zip
--2018-09-13 00:55:35--  http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.38.zip
Connecting to 127.0.0.1:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 132647 (130K) [application/zip]
Saving to: ‘Trimmomatic-0.38.zip’

Trimmomatic-0.38.zip                               100%[================================================================================================================>] 129.54K   114KB/s    in 1.1s    

2018-09-13 00:55:38 (114 KB/s) - ‘Trimmomatic-0.38.zip’ saved [132647/132647]
# Download locally first, then copy to the server
yeyuntian@yeyuntian-rescuer-r720-15ikbn:~/trinitytest/Fastqcresult$ scp Trimmomatic-0.38.zip yeyt@220:/home/yeyt/biosoft/
Trimmomatic-0.38.zip

Then switch to the remote server and unzip the archive.

yeyt@ubuntu:~/biosoft$ l
MultiQC-master/  Trimmomatic-0.38.zip  bin/  iqtree-1.6.7-Linux/  lib/  master.zip
yeyt@ubuntu:~/biosoft$ unzip Trimmomatic-0.38.zip 
Archive:  Trimmomatic-0.38.zip
   creating: Trimmomatic-0.38/
  inflating: Trimmomatic-0.38/LICENSE  
  inflating: Trimmomatic-0.38/trimmomatic-0.38.jar  
   creating: Trimmomatic-0.38/adapters/
  inflating: Trimmomatic-0.38/adapters/NexteraPE-PE.fa  
  inflating: Trimmomatic-0.38/adapters/TruSeq2-PE.fa  
  inflating: Trimmomatic-0.38/adapters/TruSeq2-SE.fa  
  inflating: Trimmomatic-0.38/adapters/TruSeq3-PE-2.fa  
  inflating: Trimmomatic-0.38/adapters/TruSeq3-PE.fa  
  inflating: Trimmomatic-0.38/adapters/TruSeq3-SE.fa  
yeyt@ubuntu:~/biosoft$ l
MultiQC-master/  Trimmomatic-0.38/  Trimmomatic-0.38.zip  bin/  iqtree-1.6.7-Linux/  lib/  master.zip
yeyt@ubuntu:~/biosoft$ cd Trimmomatic-0.38/
yeyt@ubuntu:~/biosoft/Trimmomatic-0.38$ l
LICENSE  adapters/  trimmomatic-0.38.jar
yeyt@ubuntu:~/biosoft/Trimmomatic-0.38$ pwd
/home/yeyt/biosoft/Trimmomatic-0.38
yeyt@ubuntu:~/biosoft/Trimmomatic-0.38$ 
yeyt@ubuntu:~/biosoft/Trimmomatic-0.38$ java -jar trimmomatic-0.38.jar 
Usage: 
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or: 
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
   or: 
       -version

Then go back to the data directory and write a shell script for batch processing.

yeyt@ubuntu:~/biodata/NH160034/NH160034/rawdata$ cat trimmomaitc.sh 
nohup java -jar ~/biosoft/Trimmomatic-0.38/trimmomatic-0.38.jar  PE  -threads 3  B251_1.fq.gz B251_2.fq.gz B251_1.P.fq.gz B251_1.UP.fq.gz B251_2.P.fq.gz B251_2.UP.fq.gz  HEADCROP:18 MINLEN:50 TOPHRED33 & 
nohup java -jar ~/biosoft/Trimmomatic-0.38/trimmomatic-0.38.jar  PE  -threads 3  B252_1.fq.gz B252_2.fq.gz B252_1.P.fq.gz B252_1.UP.fq.gz B252_2.P.fq.gz B252_2.UP.fq.gz  HEADCROP:18 MINLEN:50 TOPHRED33 & 
nohup java -jar ~/biosoft/Trimmomatic-0.38/trimmomatic-0.38.jar  PE  -threads 3  R251_1.fq.gz R251_2.fq.gz R251_1.P.fq.gz R251_1.UP.fq.gz R251_2.P.fq.gz R251_2.UP.fq.gz  HEADCROP:18 MINLEN:50 TOPHRED33 & 
nohup java -jar ~/biosoft/Trimmomatic-0.38/trimmomatic-0.38.jar  PE  -threads 3  R252_1.fq.gz R252_2.fq.gz R252_1.P.fq.gz R252_1.UP.fq.gz R252_2.P.fq.gz R252_2.UP.fq.gz  HEADCROP:18 MINLEN:50 TOPHRED33 & 
nohup java -jar ~/biosoft/Trimmomatic-0.38/trimmomatic-0.38.jar  PE  -threads 3  W251_1.fq.gz W251_2.fq.gz W251_1.P.fq.gz W251_1.UP.fq.gz W251_2.P.fq.gz W251_2.UP.fq.gz  HEADCROP:18 MINLEN:50 TOPHRED33 & 
nohup java -jar ~/biosoft/Trimmomatic-0.38/trimmomatic-0.38.jar  PE  -threads 3  W252_1.fq.gz W252_2.fq.gz W252_1.P.fq.gz W252_1.UP.fq.gz W252_2.P.fq.gz W252_2.UP.fq.gz  HEADCROP:18 MINLEN:50 TOPHRED33 &
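# Explanation
# java -jar ~/biosoft/Trimmomatic-0.38/trimmomatic-0.38.jar launches the jar
# PE -threads 3 declares the data as paired-end and uses 3 threads per job
# B251_1.fq.gz B251_2.fq.gz are the two read files of the paired-end run
# B251_1.P.fq.gz B251_1.UP.fq.gz B251_2.P.fq.gz B251_2.UP.fq.gz are the four output files (P = paired, UP = unpaired)
# HEADCROP:18 MINLEN:50 TOPHRED33 are the trimming steps: remove the first 18 bases from the 5' end, then drop reads shorter than 50 bp, and convert the quality scores to phred33 encoding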
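As an alternative to typing six nearly identical lines, the commands can be generated from the sample names. This is a sketch under the same parameters as the script above; trim_cmd and TRIMMOMATIC_JAR are names introduced here, and the loop only prints the commands so they can be inspected or redirected into a script file.

```shell
#!/usr/bin/env bash
# Path to the Trimmomatic jar, as unpacked earlier in this walkthrough.
TRIMMOMATIC_JAR="$HOME/biosoft/Trimmomatic-0.38/trimmomatic-0.38.jar"

# trim_cmd SAMPLE: print the paired-end Trimmomatic command for one sample.
trim_cmd() {
    local s=$1
    echo "java -jar $TRIMMOMATIC_JAR PE -threads 3" \
         "${s}_1.fq.gz ${s}_2.fq.gz" \
         "${s}_1.P.fq.gz ${s}_1.UP.fq.gz ${s}_2.P.fq.gz ${s}_2.UP.fq.gz" \
         "HEADCROP:18 MINLEN:50 TOPHRED33"
}

# Print one command per sample; nothing is executed here.
for sample in B251 B252 R251 R252 W251 W252; do
    trim_cmd "$sample"
done
```

The original script starts all six jobs in the background at once, which means up to 18 threads in total. Running the generated lines one after another is slower but gentler on a shared server.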

Then run the script.

yeyt@ubuntu:~$ bash trimmomaitc.sh 
yeyt@ubuntu:~/biodata/NH160034/NH160034/rawdata$ cat nohup.out 
TrimmomaticPE: Started with arguments:
 -threads 3 B252_1.fq.gz B252_2.fq.gz B252_1.P.fq.gz B252_1.UP.fq.gz B252_2.P.fq.gz B252_2.UP.fq.gz HEADCROP:18 MINLEN:50 TOPHRED33
TrimmomaticPE: Started with arguments:
 -threads 3 W251_1.fq.gz W251_2.fq.gz W251_1.P.fq.gz W251_1.UP.fq.gz W251_2.P.fq.gz W251_2.UP.fq.gz HEADCROP:18 MINLEN:50 TOPHRED33
TrimmomaticPE: Started with arguments:
 -threads 3 R251_1.fq.gz R251_2.fq.gz R251_1.P.fq.gz R251_1.UP.fq.gz R251_2.P.fq.gz R251_2.UP.fq.gz HEADCROP:18 MINLEN:50 TOPHRED33
TrimmomaticPE: Started with arguments:
 -threads 3 W252_1.fq.gz W252_2.fq.gz W252_1.P.fq.gz W252_1.UP.fq.gz W252_2.P.fq.gz W252_2.UP.fq.gz HEADCROP:18 MINLEN:50 TOPHRED33
TrimmomaticPE: Started with arguments:
 -threads 3 R252_1.fq.gz R252_2.fq.gz R252_1.P.fq.gz R252_1.UP.fq.gz R252_2.P.fq.gz R252_2.UP.fq.gz HEADCROP:18 MINLEN:50 TOPHRED33
TrimmomaticPE: Started with arguments:
 -threads 3 B251_1.fq.gz B251_2.fq.gz B251_1.P.fq.gz B251_1.UP.fq.gz B251_2.P.fq.gz B251_2.UP.fq.gz HEADCROP:18 MINLEN:50 TOPHRED33
Quality encoding detected as phred33
Quality encoding detected as phred33
Quality encoding detected as phred33
Quality encoding detected as phred33
Quality encoding detected as phred33
Quality encoding detected as phred33
Input Read Pairs: 23929511 Both Surviving: 23929511 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Input Read Pairs: 24577100 Both Surviving: 24577100 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Input Read Pairs: 24423445 Both Surviving: 24423445 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Input Read Pairs: 24498964 Both Surviving: 24498964 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Input Read Pairs: 25553075 Both Surviving: 25553075 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Input Read Pairs: 28213701 Both Surviving: 28213701 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
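The log above can be condensed to one line per job. The sketch below assumes the log is in nohup.out in the current directory; surviving_summary is a helper name made up for this example.

```shell
#!/usr/bin/env bash
# Read a Trimmomatic PE log on stdin and print, per job:
#   input pairs, pairs where both mates survived, and the percentage.
surviving_summary() {
    awk '/^Input Read Pairs:/ {
        # $4 = input pairs, $7 = both surviving, $8 = "(percent%)"
        print $4, $7, $8
    }'
}

if [ -f nohup.out ]; then
    surviving_summary < nohup.out
fi
```

A survival rate of 100% is what one would expect here if every raw read is 150 bp long: after cropping 18 bases, 132 bp remain, which is well above the MINLEN:50 cutoff, so no read is dropped.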


Finally, the paired output files have been generated. Copy them to a new folder as the cleandata; this cleandata is then used for the downstream analysis.
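Collecting the paired files can be sketched as follows. The .P.fq.gz suffix matches the output names used in the script above; the collect_clean function and the destination path in the example are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# collect_clean SRC DEST: copy every paired output file (*.P.fq.gz)
# from SRC into DEST, creating DEST if needed.
collect_clean() {
    local src=$1 dest=$2 f
    mkdir -p "$dest"
    for f in "$src"/*.P.fq.gz; do
        [ -e "$f" ] || continue    # nothing matched the pattern
        cp "$f" "$dest"/
    done
}

# Example (not executed here), run from the rawdata directory:
#   collect_clean . ../cleandata
```

The unpaired (*.UP.fq.gz) files are left where they are; in this run they are empty anyway, since all read pairs survived.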
