植物RNA-seq (Salmon)从软件安装到下游分析(纯Ma

作者: dandanwu90 | 来源:发表于2019-03-13 22:06 被阅读166次

植物RNA-seq (Salmon)从软件安装到下游分析(纯Ma
将salmon生成的数据导入R语言做DESeq2差异分析
RNA-seq分析实战
salmon：从RNA-seq数据中定量转录本
RNA-seq分析简洁版
2019-05-12 利用salmon进行转录组定量(上)
TCR
RNA-seq分析：从软件安装到富集分析
转录组分析：RNA-seq pipeline through k
salmon分析RNA-seq实战

写在前面：

整个流程是首次从软件安装-数据下载-上游分析-下游分析。
代码没毛病，奇怪的是salmon比对后，mapping～15%（不正常），一般来说期望mapping应该是70-90%。这导致在下游分析时，找到的差异表达基因很少，虽然火山图勉强画出，但是这些基因功能注释及信号通路没办法继续。因此我决定再分析一次，对其中的原因进行探究，比较两次分析的过程和结果。

1. 准备工作之软件安装

1.1 下载miniconda来安装软件

#下载miniconda
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh 
#更改镜像
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes

1.2 先创建一个相对的环境,使用python2

conda create rna python=2

下载软件

conda install -y fastqc trim-galore bwa bowtie salmon subread sra-tools

conda install -y *

数据来源
Temporal dynamics of gene expression and histone marks at the Arabidopsis shoot meristem during flowerin

2. 下载数据及参考基因组

2.1 下载序列

创建文件夹放数据

mkdir -p project/rna

获取样本信息

wget http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-5130/E-MTAB-5130.sdrf.txt

获得样本下载链接，并放入id.txt的文档

tail -n +2 E-MTAB-5130.sdrf.txt | cut -f 33 >id.txt

下载链接信息

写脚本批量下载

cat  >download.sh
cat id.txt|while read id;do (wget ${id});done
nohup bash download.sh &

2.2 下载参考基因组

#创建文件夹
mkdir reference && cd reference
#下载gff3(注释基因组)和gtf(注释基因)
nohup wget ftp://ftp.ensemblgenomes.org/pub/release-42/plants/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.42.gff3.gz &
nohup wget ftp://ftp.ensemblgenomes.org/pub/release-42/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.42.gtf.gz &

下载参考基因组dna和cdna，我下载到了本地的。所以这里没代码。注意的是：拟南芥中没有primary assemble，阅读READMRE得知：如果没有primary assemle就用toplevel

readme

查阅其他类型的参考序列含义

fasta dna dumps 包括些什么

一般参考基因组要下载primary assemble
dna_rm：有屏蔽的基因组，用RepeatMasker tool可以检测到的分散的重复序列及复杂性较低的区域，用“N”代替这些碱基
dna_sm：软屏蔽，所有重复以及复杂性低的序列用小写字母代替
mt 线粒体
**pt **叶绿体
toplevel ：所有组装到的序列标签信息，也包含没有组装到的染色体上的染色体，区域以及用N代替单倍型/批次区域
primary assemble ：除不包含单倍型及批次序列外与toplevel相同。用来序列相似性分析的最佳选择，单倍型或是批次序列的存在会对相似性结果产生影响。

构建索引

# salmon index --help 来查看salmon构建索引的方法
# -t [--transcripts] arg 转录本fasta 文件
# 注意，其他软件都是利用参考基因组(dna)的fa文件建立索引，这个软件用的cdna的fa数据建立索引
# -i [--index] arg salmon的索引
# k-mers 长度31，大约大于等于75bp
salmon index -t Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz -i athal_index_salmon -k 31

salmon构建索引

不需要比对直接定量分析

#将下载链接信息中的ID抽出来，
#-d '/'指定分隔符为‘/‘
#-f8提取第8列
#ID号本身是以顺序排列所以不用sort，直接uniq
#然后定向到一个文本
cat id.txt |cut -d"/" -f8|uniq >sample.txt
mkdir quant_salmon && cd quant_salmon
#写脚本
# salmon index 中的参数：
#-l 字符串类型描述文库类型；
#-i salmon index(索引)；
#-1 文件中包含#1匹配；
#-2 文件中包含#2匹配；
#--validateMappings产生不对齐比对，最佳或临近对齐；
-p 线程数 默认为2 
#-o 输出路径
cat >quant_salmon.sh
index=/Users/wudandan/project/rna/reference/athal_index_salmon/
quant=/Users/wudandan/project/rna/quant_salmon/
cat sample.txt|while read id; do salmon quant -i $index -l IU -1 ${id}_1.fastq.gz -2 ${id}_2.fastq.gz -p 1 -o $quant/${id}_quant
done
nohup bash quant_salmon.sh &

salmon_quant

其中一个日志信息

没加--validateMappings的结果

在-MTAB-5130.sdrf.txt中提取分组信息，为DESeq2分析准备

tail -n +2 E-MTAB-5130.sdrf.txt | cut -f 32,36|uniq >sampleTable.txt

下述流程在Rstudio中完成

1. 配置镜像，下载软件，下载Bioconductor,加载安装包

rm(list = ls())   
options()$repos 
options()$BioC_mirror
options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/")
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options()$repos 
options()$BioC_mirror

install.packages("tidyverse") ; library(tidyverse)
install.packages("optparse") ; library(optparse)
install.packages("UpSetR") ; library(UpSetR)
install.packages("rjson") ; library(rjson)
# https://bioconductor.org/packages/release/bioc/html/GEOquery.html
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("tximport","KEGG.db","DESeq2","edgeR" ,"org.At.tair.db","pheatmap","AnnotationHub"),ask = F,update = F)
BiocManager::install(c("clusterProfiler","limma","DOSE","GenomicFeatures","RColorBrewer"),ask = F,update = F)

2.准备quant files和基因名-转录本名字相关的数据框tx2genes

rm(list = ls())
options(stringsAsFactors = F)
# 得到salmon_quant路径
getwd()
dir=getwd()
# *.sf文件为得到表达矩阵信息
files=list.files(pattern = "*sf",dir,recursive = T)
files=file.path(dir,files)
all(file.exists(files))

# 安装Bioconductor包中的Arabidopsis thaliana对基因组注释包
BiocManager::install("TxDb.Athaliana.BioMart.plantsmart28", version = "3.8")
library(TxDb.Athaliana.BioMart.plantsmart28)
ls('package:TxDb.Athaliana.BioMart.plantsmart28')
a=TxDb.Athaliana.BioMart.plantsmart28
# 查看包有哪些列
columns(a)
# keys返回这个数据包可以当作关键字查找的列，
# keytypes返回的列等于或少于columns返回的结果，不是所有的列都可以当作对象查找
k=keys(a,keytype = "GENEID")
# select可以根据你提供的key取查找注释数据库，返回你需要的columns信息
df=select(a, keys=k, keytype = "GENEID",columns = "TXNAME")
# 检查得到的矩阵，得到列名为“GENEID”和“TXNAME”的两列
head(df)
# 将“TXNAME”放在第一列，“GENEID”放在第二列
tx2gene=df[,2:1]
head(tx2gene)


#### 与上面的步骤一样，都为得到Arabidopsis thaliana对基因组注释包，提取注释信息，推荐这种，可以安装数据库中已包含的所有物种都基因注释包。
#### 安装AnnotationHub注释信息数据库
# BiocManager::install('AnnotationHub')
#### 加载并创建AnnotationHub对象
# library(AnnotationHub)
# ah <- AnnotationHub()
#### 用query来查找数据库中'Arabidopsis thaliana'的相关信息
# ath <- query(ah,'Arabidopsis thaliana')
#### 获得'Arabidopsis thaliana'最新注释ID为AH52247
# ath_tx <- ath[['AH52247']]
#### 与上述相同
# columns(ath_tx)
# k <- keys(ath_tx,keytype = "GENEID")
# df <- select(ath_tx, keys=k, keytype = "GENEID",columns = "TXNAME")
# tx2gene <- df[,2:1]

3. 将salmon_count得到的数据合并作为inputdata

# tximport
library('tximport')
library('readr')
txi=tximport(files ,type = "salmon",tx2gene = tx2gene)
head(txi)
head(txi$length)
files
# 加载stringr包,为了得到将txi的列名定义为样本名
library(stringr)
# 以'\'为分隔符，将含有quant.sf的路径分割，并提第七列---取样本名（ERR1698194_quant）
t1=sapply(strsplit(files,'\\/'),function(x)x[7])
# txi 的列counts的列名为样本名,
colnames(txi$counts)=sapply(strsplit(t1,'_'),function(x)x[1])
tmp=txi$counts
head(tmp)
# 1为行，2为列，将列都转化为整数
exprSet=apply(tmp,2, as.integer)
rownames(exprSet)=rownames(tmp)
head(exprSet)
dim(exprSet)
save(exprSet,file=paste0('quants-exprSet.Rdata'))

表达矩阵有问题

创建dds时报错

txi中的counts为float，需要转化为整数

还是txi，counts需要转化为data.frame

image.png

最后发现是txi $abundance和txi$ length都没有列名，于是想当然的加上与txi$counts一样的列名

colnames(txi$abundance)=colnames(txi$counts)
colnames(txi$length)=colnames(txi$counts)

再构建dds就不会报错了

dds<-DESeqDataSetFromTximport(txi,sampleTable,design = ~group_list)

均一化

suppressMessages(dds2 <- DESeq(dds)) 
##直接用DESeq函数即可
## 下面的代码如果你不感兴趣不需要运行，免得误导你
## 就是看看normalization前面的数据分布差异
rld <- rlogTransformation(dds2)  ## 得到经过DESeq2软件normlization的表达矩阵！
exprSet_new=assay(rld)
par(cex = 0.7)
n.sample=ncol(exprSet)
if(n.sample>40) par(cex = 0.5)
cols <- rainbow(n.sample*1.2)
par(mfrow=c(2,2))
boxplot(exprSet, col = cols,main="expression value",las=2)
boxplot(exprSet_new, col = cols,main="expression value",las=2)
hist(exprSet)
hist(exprSet_new)

normalization

制定比较矩阵，共三组，分别用day0与其他三组进行比较，得到DESeq差异分析结果

resultsNames(dds2)
res0vs1 <- results(dds2, contrast = c("group_list","day1","day0"))
res0vs2 <- results(dds2, contrast = c("group_list","day2","day0"))
res0vs3 <- results(dds2, contrast = c("group_list","day3","day0"))

从DESeq差异分析结果中提取矩阵,包含：mean，

log2FoldChange,IfcSE, stat, pvalue, padj，用来画图

resOrdered1 <- res0vs1[order(res0vs1$padj),]
resOrdered1=as.data.frame(resOrdered1)
head(resOrdered1)

resOrdered2 <- res0vs2[order(res0vs2$padj),]
resOrdered2=as.data.frame(resOrdered2)
head(resOrdered1)

resOrdered3 <- res0vs3[order(res0vs3$padj),]
resOrdered3=as.data.frame(resOrdered3)
head(resOrdered3)

先画热图，在画图之前要去掉矩阵中的NA，

去掉NA前基因为4499，去掉后只含有表达量的基因数量减少。

热图之前先去掉NA

resOrdered1=na.omit(resOrdered1)
choose_gene1=rownames(resOrdered1)
choose_matrix1=exprSet[choose_gene1,]
#归一化
choose_matrix1=t(scale(t(choose_matrix1)))
#将热图保存在本地
pheatmap(choose_matrix1,filename = "DEG1_all_heatmap.png")

resOrdered2=na.omit(resOrdered2)
choose_gene2=rownames(resOrdered2)
choose_matrix2=exprSet[choose_gene2,]
choose_matrix2=t(scale(t(choose_matrix2)))
pheatmap(choose_matrix2,filename = "DEG2_all_heatmap.png")

resOrdered3=na.omit(resOrdered3)
choose_gene3=rownames(resOrdered3)
choose_matrix3=exprSet[choose_gene3,]
choose_matrix3=t(scale(t(choose_matrix3)))
pheatmap(choose_matrix3,filename = "DEG3_all_heatmap.png")

DEG1_all_heatmap.png(这么丑的热图也是第一次，毕竟是自己画的不能嫌弃)

DEG1_all_heatmap.png与DEG2_all_heatmap.png类似，DEG3_all_heatmap.png略有不同

DEG3_all_heatmap.png

画个火山图

library("org.At.tair.db")
library("KEGG.db")
library("clusterProfiler")
library(ggplot2)

resOrdered1$gene_id=rownames(resOrdered1)
id2symbol=toTable(org.At.tairSYMBOL) 
#rownames会变为数字
resOrdered1=merge(resOrdered1,id2symbol,by='gene_id')
DEG=resOrdered1

DEG_analysis <- function(DEG,prefix="test"){
  
  colnames(DEG)=c('gene_id' ,'baseMean','logFC','lfcSE','stat','pvalue' , 'P.Value' , 'symbol')
  ## 下面我都用padj，抛弃了pvalue
  ## 我不想修改我以前的代码，所以我更改了这个列名
  
  ############################################################
  ############  DEG filter      #############################
  ############################################################
  
  ## please keep in mind that I only keep the genes with symbol.
  DEG$symbol=as.character(DEG$symbol)
  #作用？？？
  DEG_filter=DEG[nchar(DEG$symbol)>1,]
  DEG_filter=DEG_filter[!is.na(DEG_filter$symbol),]
  
  ############################################################
  ############ volcano plot      #############################
  ############################################################
  logFC_Cutof <- with(DEG_filter,mean(abs( logFC)) + 2*sd(abs( logFC)) )
  logFC_Cutof = 0
  ## 这里我不准备用logFC来挑选差异基因，仅仅是用padj即可
  DEG_filter$change = as.factor(ifelse(DEG_filter$P.Value < 0.05 & 
                                         abs(DEG_filter$logFC) > logFC_Cutof,
                                       ifelse(DEG_filter$logFC > logFC_Cutof ,'UP','DOWN'),'NOT'))
  this_tile <- paste0('Cutoff for logFC is ',round(logFC_Cutof,3),
                      '\nThe number of up gene is ',nrow(DEG_filter[DEG_filter$change =='UP',]) ,
                      '\nThe number of down gene is ',nrow(DEG_filter[DEG_filter$change =='DOWN',])
  )
  g_volcano = ggplot(data=DEG_filter, aes(x=logFC, y=-log10(P.Value), color=change)) +
    geom_point(alpha=0.4, size=1.75) +
    theme_set(theme_set(theme_bw(base_size=20)))+
    xlab("log2 fold change") + ylab("-log10 p-value") +
    ggtitle( this_tile  ) + 
    theme(plot.title = element_text(size=15,hjust = 0.5))+
    scale_colour_manual(values = c('blue','black','red'))  ## corresponding to the levels(res$change)
  print(g_volcano)

火山图

KEGG分析

流程:
http://www.biotrainee.com/forum.php?mod=viewthread&tid=1602&extra=page%3D1%26filter%3Dtypeid%26typeid%3D30

https://github.com/shenmengyuan/RNA_seq_Biotrainee

比对建索引参考:
https://www.jianshu.com/p/071c1757ded1

salmon_quant:
https://salmon.readthedocs.io/en/latest/salmon.html#quantifying-in-mapping-based-mode

用tximport将转录组数据导入Rstudio：
https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#import-transcript-level-estimates

注释包：
https://www.jianshu.com/p/ae94178918bc

拟南芥数据库：
https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FTAIR10_genome_release

植物RNA-seq (Salmon)从软件安装到下游分析(纯Ma
写在前面：整个流程是首次从软件安装-数据下载-上游分析-下游分析。代码没毛病，奇怪的是salmon比对后，map...
将salmon生成的数据导入R语言做DESeq2差异分析
用salmon走完RNA-Seq上游分析后，就要把数据导入R里做下游分析使用R包tximport导入数据这样就...
RNA-seq分析实战
在最最最开始！放上借鉴引用的文章链接!(后边还有)RNA-seq分析：从软件安装到富集分析详细过程RNA-seq实...
salmon：从RNA-seq数据中定量转录本
Salmon 主要用于从RNA-seq数据中计算转录本的丰度。 # 安装 # 基于Salmon分析RNAseq数据...
RNA-seq分析简洁版
前面RNA-seq分析：从软件安装到富集分析部分已经把转录组全部流程走完了一遍，这次利用RNA-seq(2)-2:...
2019-05-12 利用salmon进行转录组定量(上)
参考：RNA-Seq基因差异表达分析实战Dawn_天鹏简书RNA_seq(1)植物转录组实战（上)之salmon进...
TCR
TRUST 1、概述 TRUST，是从RNA-seq原始数据组装到TCR CDR3的软件，专门为RNA-seq数据...
RNA-seq分析：从软件安装到富集分析
从安装软件到富集分析，完整流程链接整理： RNA-seq(1) :用conda安装RNA-seq所需要的工具 RN...
转录组分析：RNA-seq pipeline through k
RNA-seq pipeline through kallisto or salmon kallisto 和sal...
salmon分析RNA-seq实战
Salmon应用查看帮助文档 Quasi-mapping-based mode (including light...