前言:
观看B站孟浩巍的系列视频2017-08-04-高通量测序技术交流录像,内容颇多,直接记录笔记代码,没有经过本地运行,仅作记录,他日再做梳理。
RNA-seq分析的流程及软件
- raw data quality control
fastqc;cutadapt;fastx_trimmer - mapping to the genome
tophat2;STAR; hisat2 - remove the duplication
picard - calculate RPKM
cufflinks -
find different expression gene
cuffdiff
RNAseq pipeline.png
1. fastqc:
fastqc -t 2 -o ./fastqc_result/ -q ./raw.fast/*gz &
输出结果包含:
html、fastqc.zip
2. cutadapt.sh
for case_name in SRR1033853
do
fastq_1 = ./raw.fastq/${case_name}_1.fastq.gz
fastq_2 = ./raw.fastq/${case_name}_2.fastq.gz
log_file = ./raw.fastq/${case_name}_cutadapt.log
out_fastq_1 = ./raw.fastq/${case_name}_1_cutadapt.fastq.gz
out_fastq_2 = ./raw.fastq/${case_name}_2_cutadapt.fastq.gz
nohup cutadapt --times 1 -e 0.1 -O 3 --qualigy-cutoff 6 -m 75 -a AGATTTGGGG -A AGTTGGAATC -o $out_fastq_1 -p $out_fastq_2 $fastq_1 $fastq_2 > &log_file 2>&1 &
done
参数说明:
--times 1 :去掉一次
-e 0.1 允许错误率
-O 3 至少有三个bp与reads序列overlap上
-m 75 去adapter 后最小长度
-a reads1需要去的序列
-A reads2需要去的序列
后面依次是输出1,2,输入1,2
log_file 屏幕内容输出端到文件中
2>&1 报错内容也输出到屏幕上
nohup &不挂起,退出不终止
- trim.sh
for case_name in SRR1033853
do
fastq_1 = ./raw.fastq/${case_name}_1_cutadapt.fastq.gz
fastq_2 = ./raw.fastq/${case_name}_2_cutadapt.fastq.gz
out_fastq_1 = ./raw.fastq/${case_name}_R1_trim.fq.gz
out_fastq_2 = ./raw.fastq/${case_name}_R2_trim.fq.gz
zcat $fastq_1 | fastx_trimmer -f 6 -l 75 -z -o $out_fastq_1 &
zcat $fastq_2 | fastx_trimmer -f 6 -l 75 -z -o $out_fastq_2 &
done
参数说明:
zcat 解压缩gz文件
-f 6 -l 75 从第6个碱基一直trimmer到第75个碱基
-z 输出文件也用gzip压缩
05. mm10_mapping.sh
mm10-index = /home/reference/mm10_fasta/mm10_genome_bt2
for case-name in SRR1033853
do
fastq_1 = ./raw.fastq/${case_name}_R1_trim.fq.gz
fastq_2 = ./raw.fastq/${case_name}_R2_trim.fq.gz
out_put_dir = ./${case_name}_tophat_result
log= ./${case_name}_tophat_result/${case-name}_tophat.log
mkdir -p $output_dir
nohup tophat2 -p 32 -o $output_dir $mm10-index $fastq_1 $fastq_2 >$log 2>&1 &
done
06. rmdup.sh
对于单细胞测序,去除重复
mm10-index = /home/reference/mm10_fasta/mm10_genome_bt2
for case-name in SRR1033853
do
input_bam = ./${case_name}_tophat_result/accepted_hits.bam
output_bam = ./${case_name}_tophat_result/accepted_hits_rmdup.bam
matrix_file = ./${case_name}_tophat_result/accepted_hists_rmdup.matrix
log_file= ./${case_name}_tophat_result/accepted_hists_rmdup.log
nohup java -Xms16g -Xmx32g -XX:ParallelGCThreads=32 -jar /home/my_software/picard/picard.jar MarkDuplicates INPUT= $input_bam OUTPUT=$out_bam METRICS_FILE=$matrix_file ASSUME_SORTED=true REMOVE_DUPLICATES=true >$log_file 2>&1 &
done
参数说明:
-Xms16g -Xmx32g 线程最小16g,最大32g
-XX:ParallelGCThreads=32 32核
METRICS_FILE=$matrix_file 指定一个文件名
ASSUME_SORTED=true 把完全一样的序列去掉,先排序再运行
07. cufflink.sh
mm10_gtf = /home/reference/mm10_fasta/mm10_RefSeq.fix.gtf
for case-name in SRR1033853
do
cufflink_dir= ./${case_name}_cufflink_result/
log= ./${case_name}_cufflink_result/cufflink.log
bam_file = ./${case_name}_tophat_result/accepted_hits_rmdup.bam
mkdir -p $cufflink_dir
nohup cufflinks -o $cufflink_dir -p 16 -G $mm10_gtf $bam_file > $log 2>&1 &
done
参数说明:
cufflinks计算fpkm
-G 基因的注释文件
RSEM计算表达量,用统计回归去计算拟合表达量
08. R中分析
主成分分析PCA
principal component analysis 简单来说,就是在寻找变量中最能代表这组数据的变量,用少数几个变量的线性变化来代表原始数据,从而叨叨数据降维的目的。
#load fpkm table
rm(list=ls())
combine_fpkm_table = read.table(file="./nature_2014-single/", headr = T, sep="\t")
dim(combine_fpkm_table)
#R自带的princcomp(input_matrix)
input_matrix = combine_fpkm_table[,c(2:ncol(combine_fpkm_table))]
princomp(input_matrix )
#install.packages("psych")
install.packages("pysch")
library(psych)
input_matrix = combine_fpkm_table[,c(2:ncol(combine_fpkm_table))]
pca_result = principal(input_matrix, nfactors = 3)
dim(pca_result$scores)
pca_result$values
pca_result$scores
pca_result$weights
plot(pca_result$scores[,1],pca_result$scores[,2],xlim=c(0,50),ylim=c(0,50))
09.绘图
1. boxplot
Q1-Q3:25%与75%差中位数
IQR:interquartile range
boxplot(log2(gene_fpkm_table.select$FPKM+1),col="orange")
2. vioplot
可以看出基因的丰度
rm (list=ls())
library(vioplot)
vioplot::vioplot(log2(gene_fpkm_table.select$FPKM+1),col="orange")
3. pheatmap
pheatmap(log2(input_matrix[c(1:500),]+1))
网友评论