RNA-seq分析的流程及软件

作者: 谢俊飞 | 来源:发表于2018-10-04 18:14 被阅读36次

    前言:
    观看B站孟浩巍的系列视频2017-08-04-高通量测序技术交流录像,内容颇多,直接记录笔记代码,没有经过本地运行,仅作记录,他日再做梳理。

    RNA-seq分析的流程及软件

    • raw data quality control
      fastqc;cutadapt;fastx_trimmer
    • mapping to the genome
      tophat2;STAR; hisat2
    • remove the duplication
      picard
    • calculate RPKM
      cufflinks
    • find different expression gene
      cuffdiff


      RNAseq pipeline.png

    1. fastqc:

    fastqc -t 2 -o ./fastqc_result/ -q ./raw.fast/*gz &
    

    输出结果包含:
    html、fastqc.zip

    2. cutadapt.sh

    for case_name in SRR1033853
    do 
    fastq_1 = ./raw.fastq/${case_name}_1.fastq.gz
    fastq_2 = ./raw.fastq/${case_name}_2.fastq.gz
    log_file = ./raw.fastq/${case_name}_cutadapt.log
    out_fastq_1 = ./raw.fastq/${case_name}_1_cutadapt.fastq.gz
    out_fastq_2 = ./raw.fastq/${case_name}_2_cutadapt.fastq.gz
    nohup cutadapt --times 1 -e 0.1 -O 3 --qualigy-cutoff 6 -m 75 -a AGATTTGGGG -A AGTTGGAATC -o $out_fastq_1 -p $out_fastq_2 $fastq_1 $fastq_2  > &log_file 2>&1 &
    done
    

    参数说明:
    --times 1 :去掉一次
    -e 0.1 允许错误率
    -O 3 至少有三个bp与reads序列overlap上
    -m 75 去adapter 后最小长度
    -a reads1需要去的序列
    -A reads2需要去的序列
    后面依次是输出1,2,输入1,2
    log_file 屏幕内容输出端到文件中
    2>&1 报错内容也输出到屏幕上
    nohup &不挂起,退出不终止

    1. trim.sh
    for case_name in SRR1033853
    do 
    fastq_1 = ./raw.fastq/${case_name}_1_cutadapt.fastq.gz
    fastq_2 = ./raw.fastq/${case_name}_2_cutadapt.fastq.gz
    
    out_fastq_1 = ./raw.fastq/${case_name}_R1_trim.fq.gz
    out_fastq_2 = ./raw.fastq/${case_name}_R2_trim.fq.gz
    
    zcat $fastq_1 | fastx_trimmer -f 6 -l 75 -z -o $out_fastq_1 &
    zcat $fastq_2 | fastx_trimmer -f 6 -l 75 -z -o $out_fastq_2 &
    done
    

    参数说明:
    zcat 解压缩gz文件
    -f 6 -l 75 从第6个碱基一直trimmer到第75个碱基
    -z 输出文件也用gzip压缩

    05. mm10_mapping.sh

    mm10-index = /home/reference/mm10_fasta/mm10_genome_bt2
    for case-name in  SRR1033853
    do 
    fastq_1 = ./raw.fastq/${case_name}_R1_trim.fq.gz
    fastq_2 = ./raw.fastq/${case_name}_R2_trim.fq.gz
    
    out_put_dir = ./${case_name}_tophat_result
    log= ./${case_name}_tophat_result/${case-name}_tophat.log
    mkdir -p $output_dir
    
    nohup tophat2 -p 32 -o $output_dir $mm10-index $fastq_1 $fastq_2 >$log 2>&1 &
    done
    

    06. rmdup.sh

    对于单细胞测序,去除重复

    mm10-index = /home/reference/mm10_fasta/mm10_genome_bt2
    for case-name in  SRR1033853
    do 
    input_bam = ./${case_name}_tophat_result/accepted_hits.bam
    output_bam = ./${case_name}_tophat_result/accepted_hits_rmdup.bam
    
    matrix_file = ./${case_name}_tophat_result/accepted_hists_rmdup.matrix
    log_file= ./${case_name}_tophat_result/accepted_hists_rmdup.log
    
    nohup java -Xms16g -Xmx32g -XX:ParallelGCThreads=32 -jar /home/my_software/picard/picard.jar MarkDuplicates INPUT= $input_bam OUTPUT=$out_bam METRICS_FILE=$matrix_file ASSUME_SORTED=true REMOVE_DUPLICATES=true >$log_file 2>&1 &
    done
    

    参数说明:
    -Xms16g -Xmx32g 线程最小16g,最大32g
    -XX:ParallelGCThreads=32 32核
    METRICS_FILE=$matrix_file 指定一个文件名
    ASSUME_SORTED=true 把完全一样的序列去掉,先排序再运行

    07. cufflink.sh

    mm10_gtf = /home/reference/mm10_fasta/mm10_RefSeq.fix.gtf
    for case-name in  SRR1033853
    do 
    cufflink_dir= ./${case_name}_cufflink_result/
    log= ./${case_name}_cufflink_result/cufflink.log
    bam_file = ./${case_name}_tophat_result/accepted_hits_rmdup.bam
    
    mkdir -p $cufflink_dir
    nohup cufflinks -o $cufflink_dir -p 16 -G $mm10_gtf  $bam_file  > $log 2>&1 &
    done
    

    参数说明:
    cufflinks计算fpkm
    -G 基因的注释文件
    RSEM计算表达量,用统计回归去计算拟合表达量

    08. R中分析

    主成分分析PCA
    principal component analysis 简单来说,就是在寻找变量中最能代表这组数据的变量,用少数几个变量的线性变化来代表原始数据,从而叨叨数据降维的目的。

    #load fpkm table
    rm(list=ls())
    combine_fpkm_table = read.table(file="./nature_2014-single/", headr = T, sep="\t")
    dim(combine_fpkm_table)
    
    #R自带的princcomp(input_matrix)
    input_matrix = combine_fpkm_table[,c(2:ncol(combine_fpkm_table))]
    princomp(input_matrix )
    
    #install.packages("psych")
    install.packages("pysch")
    library(psych)
    input_matrix = combine_fpkm_table[,c(2:ncol(combine_fpkm_table))]
    pca_result = principal(input_matrix, nfactors = 3)
    dim(pca_result$scores)
    pca_result$values
    pca_result$scores
    pca_result$weights
    plot(pca_result$scores[,1],pca_result$scores[,2],xlim=c(0,50),ylim=c(0,50))
    

    09.绘图

    1. boxplot

    barplot.png
    Q1-Q3:25%与75%差中位数
    IQR:interquartile range
    boxplot(log2(gene_fpkm_table.select$FPKM+1),col="orange")
    

    2. vioplot

    vioplot.png
    可以看出基因的丰度
    rm (list=ls())
    library(vioplot)
    vioplot::vioplot(log2(gene_fpkm_table.select$FPKM+1),col="orange")
    

    3. pheatmap

    pheatmap(log2(input_matrix[c(1:500),]+1))
    

    相关文章

      网友评论

      本文标题:RNA-seq分析的流程及软件

      本文链接:https://www.haomeiwen.com/subject/lppxaftx.html