ballgown的使用说明

作者: 一只烟酒僧 | 来源:发表于2020-04-25 18:18 被阅读0次

ballgown的使用说明
hisat2+stringtie+ballgown取FPKM
第七步-ballgown的相关参数
rnaseq-R
Stringtie+deseq2分析转录组备忘
RNA-seq分析流程大比较
RNA-seq阅读2016年【HISAT, StringTie
ERROR: dependency ‘XML’ is not a
ballgown下载及使用
3. Ballgown 流程

幕布：https://mubu.com/doc/zIdzn1sNO0
参考说明书：http://bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html

一、ballgown简介

二、使用前的准备

1. 使用前需要做的操作

read需要完成比对
需要完成转录组组装，可以从头组装也可以基于参考gtf组装
需要完成转录组定量，即对每类feature（外显子、内含子、转录本等）都做样本的定量
综上，目前支持该类操作的流程为基于stringtie或cufflinks的那些流程

2. 上述流程最后的输出文件为ballgown的input（每个样本5个文件）

2.1 e_data.ctab ：exon-level expression measurements. One row per exon. Columns are e_id (numericexon id), chr, strand, start, end (genomic location of the exon), and the following expressionmeasurements for each sample

rcount: reads overlapping the exon
ucount: uniquely mapped reads overlapping the exon
mrcount: multi-map-corrected number of reads overlapping the exon
cov： average per-base read coverage
cov_sd: standard deviation of per-base read coverage
mcov: multi-map-corrected average per-base read coverage
mcov_sd: standard deviation of multi-map-corrected per-base coverage

2.2 i_data.ctab ：intron- (i.e., junction-) level expression measurements. One row per intron. Columns arei_id (numeric intron id), chr, strand, start, end (genomic location of the intron), and the followingexpression measurements for each sample

rcount: number of reads supporting the intron
ucount: number of uniquely mapped reads supporting the intron
mrcount: multi-map-corrected number of reads supporting the intron

2.3 t_data.ctab ：transcript-level expression measurements. One row per transcript. Columns are

t_id: numeric transcript id
chr, strand, start, end: genomic location of the transcript
t_name: Cufflinks-generated transcript id
num_exons: number of exons comprising the transcript
length: transcript length, including both exons and introns
gene_id: gene the transcript belongs to
gene_name: HUGO gene name for the transcript, if known
cov: per-base coverage for the transcript (available for each sample)
FPKM: Cufflinks-estimated FPKM for the transcript (available for each sample)

2.4 e2t.ctab ：table with two columns, e_id and t_id, denoting which exons belong to which transcripts.These ids match the ids in the e_data and t_data tables.

2.5 i2t.ctab ：table with two columns, i_id and t_id, denoting which introns belong to which transcripts.These ids match the ids in the i_data and t_data tables

3. 一个保存样本metadata的文件

格式为csv，注意事项为：It is veryimportant that the rows of pData are in the correct order. Each row corresponds to a sample, and therows of pData should be ordered teh same as the tables in the expr slot. You can check that order byrunning sampleNames(bg). The pData component can be added during construction (you can pass a dataframe to the ballgown function), or you can add it later ：举例如：pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))

4. 文件要保存的文件夹状态

ballgown默认支持的状态为，在一个extdata/路径下，每个样本有单独的文件夹（如sample01/ ，sample02/.....）,每个文件夹内放着该样本的上述五个文件，那就可以通过以下代码进行读入

library(ballgown)
data_directory = system.file('extdata', package='ballgown') # automatically finds ballgown's installation directory
# examine data_directory:
data_directory
## [1] "/tmp/RtmpQbXERE/Rinst49e674551b57/ballgown/extdata"
# make the ballgown object:
bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
bg
## ballgown instance with 100 transcripts and 20 samples

标准的文件结构

如果用户没有按照上述文件夹放置文件，ballgown也提供了自定义的导入方法，If your data is stored in a directory structure other than the one specified above, you can use the samples
argument in the ballgown function: samples should be a vector (1-d array) with one entry per sample,where the entry gives the path to the folder containing that sample's .ctab files.

5. 对于大数据的导入

如果用户使用了很多样本，导入会非常慢，那就不建议使用交互模式的语言环境，可以写成脚本后使用Rscript等命令运行，之后再重新load保存的rdata文件，代码如下

library(ballgown)
data_directory = system.file('extdata', package='ballgown')
bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
save(bg, file='bg.rda')

三、ballgown对象的结构

1. ballgown对象包括六个slot：structure、expr、indexes、dirs、mergedData、meas

2. 调用方式分别为 structure（bg），*expr（bg），indexes(bg)$ 或pdata（bg），bg@dirs，bg@mergedData，bg@meas

3. 结构介绍

3.1、structure：

该slot是仿照Genomicranges创建的，官方描述如下：The slot specifies the structure, i.e., genomic locations and relationships between exons, introns, andtranscripts, of the transcriptome assembly. It is convenient to represent exons and introns as intervalsand to represent transcripts as a set of intervals (exons), so assembled exons and introns are available as GRanges objects, and the assembled transcripts are available as a GRangesList object. This means thatuseful range operations, such as findOverlaps and reduce, are readily available for assembled features
用户可以通过structure进行调用，代码如下

structure(bg)$exon
## GRanges object with 633 ranges and 2 metadata columns:
## seqnames ranges strand | id transcripts
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 18 24412069-24412331 * | 12 10
## [2] 22 17308271-17308950 + | 55 25
## [3] 22 17309432-17310226 + | 56 25
## [4] 22 18121428-18121652 + | 88 35
## [5] 22 18138428-18138598 + | 89 35
## ... ... ... ... . ... ...
## [629] 22 51221929-51222113 - | 3777 1294
## [630] 22 51221319-51221473 - | 3782 1297
## [631] 22 51221929-51222162 - | 3783 1297
## [632] 22 51221929-51222168 - | 3784 1301
## [633] 6 31248149-31248334 * | 3794 1312
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths

3.2、expr

该slot是一个list，包含了不同genomic features的表达量的矩阵，这些矩阵的格式非常类似于_data.ctab这类文件的格式
用户需要通过一系列有规律的函数调用不同的genomic feature的表达信息，expr(ballgown_object_name, <EXPRESSION_MEASUREMENT>)
对于函数的解释：where * is either e for exon, i for intron, t for transcript, or g for gene, and is an expressionmeasurement column name from the appropriate .ctab file. Gene-level measurements are calculated byaggregating the transcript-level measurements for that gene .The *expr functions return matrices unless meas = 'all', in which case some additional feature metadatais returned and the result is a data.frame. 即正常情况下输出的都是一个矩阵，如果meas=“all”,由于存在metadata等文本信息，因此会输出一个数据框

transcript_fpkm = texpr(bg, 'FPKM')
transcript_cov = texpr(bg, 'cov')
whole_tx_table = texpr(bg, 'all')
exon_mcov = eexpr(bg, 'mcov')
junction_rcount = iexpr(bg)
whole_intron_table = iexpr(bg, 'all')
gene_expression = gexpr(bg)

3.3、indexes

该slot是一个list，其中包含了样本的metadata信息以及刚才提到的五个文件中的两个文件 e2t and i2t tables ，在slot设计中也有bamfiles的位置，但是ballgown中没有函数处理bam文件，需要通过其它包的函数辅助处理
对于metadata信息，需要通过pData（bg）添加或提取，对于其他信息，可以通过indexes(bg)$e2t的方式提取

exon_transcript_table = indexes(bg)$e2t
transcript_gene_table = indexes(bg)$t2g
head(transcript_gene_table)
## t_id g_id
## 1 10 XLOC_000010
## 2 25 XLOC_000014
## 3 35 XLOC_000017
## 4 41 XLOC_000246
## 5 45 XLOC_000019
## 6 67 XLOC_000255

3.4、dirs

储存了读入的所有样本的绝对路径
通过bg@dirs调用

head(bg@dirs)
## sample01
## "/tmp/RtmpQbXERE/Rinst49e674551b57/ballgown/extdata/sample01"
## sample02
## "/tmp/RtmpQbXERE/Rinst49e674551b57/ballgown/extdata/sample02"
## sample03
## "/tmp/RtmpQbXERE/Rinst49e674551b57/ballgown/extdata/sample03"
## sample04
## "/tmp/RtmpQbXERE/Rinst49e674551b57/ballgown/extdata/sample04"
## sample05
## "/tmp/RtmpQbXERE/Rinst49e674551b57/ballgown/extdata/sample05"
## sample06
## "/tmp/RtmpQbXERE/Rinst49e674551b57/ballgown/extdata/sample06"

3.5、mergedData

储存了对象的创建时间
通过bg@mergedData调用

3.6、meas

储存了导入时导入的测量值的类型
通过bg@meas调用

四、ballgown对转录本的可视化

1. 核心函数为plotTranscripts （）

2. 支持单样本、单基因的不同转录本的可视化，颜色表示转录本的表达水平，如：plotTranscripts(gene='XLOC_000454', gown=bg, samples='sample12',meas='FPKM', colorby='transcript',main='transcripts from gene XLOC_000454: sample 12, FPKM')

单样本可视化

3. 支持多样本、单基因的不同转录本的可视化，如：plotTranscripts('XLOC_000454', bg,samples=c('sample01', 'sample06', 'sample12', 'sample19'),meas='FPKM', colorby='transcript')

多样本可视化

4. 支持多组别、单基因的不同转录本的可视化，颜色表示该转录本在同一组不同样本中表达的平均值，如：plotMeans('XLOC_000454', bg, groupvar='group', meas='FPKM', colorby='transcript')

多组别可视化

五、ballgown的差异分析功能

1.核心函数为stattest（）

2.使用了基于线性模型的有参F检验，十分类似于limma中基于线性模型和经验贝叶斯的检验方法，同时limma包中的统计方法也可以在ballgown对象中的matrix中运行

3. 对于该模型原理的描述：Ballgown's statistical models are implemented with the stattest function. Two models are fit to eachfeature, using expression as the outcome: one including the covariate of interest (e.g., case/controlstatus or time) and one not including that covariate. An F statistic and p-value are calculated using thefits of the two models. A significant p-value means the model including the covariate of interest fitssignificantly better than the model without that covariate, indicating differential expression. We adjustfor multiple testing by reporting q-values (Storey & Tibshirani (2003)) for each transcript in addition top-values: reporting features with, say, q < 0.05 means the false discovery rate should be controlled atabout 5%

4. 应用范围

4.1可以用于两组比较、多组比较和时间序列的比较，对于前两种比较，有显著性说明在至少一个组中存在差异，对于时间序列的比较，有差异说明某个feature的表达在时间上趋于变化而非平稳

4.2 对于分类变量，在stattest中设为协变量参数，如stat_results = stattest(bg, feature='transcript', meas='FPKM', covariate='group') 最终输出结果为一个数据框

4.3 对于连续变量，如时间序列变量，在stattest中除了设置协变量外，还需要将timecourse改为TRUE，如timecourse_results = stattest(bg, feature='transcript', meas='FPKM', covariate='time', timecourse=TRUE)

4.4 作者建议，如果时间变量的数目小于5，则不足以构建拟合模型，需要当作分类变量分析！

stat_results = stattest(bg, feature='transcript', meas='FPKM', covariate='group')
head(stat_results)
## feature id pval qval
## 1 transcript 10 0.01381576 0.10521233
## 2 transcript 25 0.26773622 0.79114975
## 3 transcript 35 0.01085070 0.08951825

5. 干扰因子

5.1 操作及说明：You can adjust for any or all variables in pData when testing for differential expression. Ballgownautomatically adjusts for library size using the sum of all logged nonzero expression measurementsbelow the 75th percentile of those measurements, for each sample. If you would like to adjust for othervariables, just provide those confounders as the adjustvars argument to stattest

5.2 假如说我关心的是time，不关心group：group_adj_timecourse_results = stattest(bg, feature='transcript', meas='FPKM', covariate='time',timecourse=TRUE, adjustvars='group')

6. 自定义模型

6.1 操作及说明：It is also possible to explicitly provide the design matrices for the models to be compared. You canprovide any two models for mod and mod0, provided that mod0 is nested in mod, i.e., that all covariates usedin mod0 also appear in mod. For example, suppose we had sex and age information available, in addition togroup and time, and we wanted to compare a model (mod) including all information (sex, age, group,time) to a model including only group and time (mod).

6.2 实例代码

# create example data:
set.seed(43)
sex = sample(c('M','F'), size=nrow(pData(bg)), replace=TRUE)
age = sample(21:52, size=nrow(pData(bg)), replace=TRUE)
# create design matrices:
mod = model.matrix(~ sex + age + pData(bg)$group + pData(bg)$time)
mod0 = model.matrix(~ pData(bg)$group + pData(bg)$time)
# run differential expression tests:
adjusted_results = stattest(bg, feature='transcript', meas='FPKM', mod0=mod0, mod=mod)
head(adjusted_results)
## feature id pval qval
## 1 transcript 10 0.3506799 0.8831885
## 2 transcript 25 0.4220214 0.8831885
## 3 transcript 35 0.5559817 0.9023309
## 4 transcript 41 0.8012697 0.9230539
## 5 transcript 45 0.8764350 0.9230539
## 6 transcript 67 0.8012261 0.9230539

六、ballgown中支持的其它统计方法

1. ballgown的数据结构足以保证它可以被很多统计软件支持，如limma、deseq等，因为其通过*expr（）导出来的都是表达矩阵

2. ballgown中可以做简单的转录本聚类

2.1 应用背景：Sometimes several very similar transcripts are assembled for the same gene, which might causeexpression estimates for those transcripts to be unreliable: statistically, it can very difficult or impossibleto tell which of two very similar transcript a read came from. This means differential expression resultsmight also be unreliable

2.2 ballgown的策略：As a preliminary attempt at addressing this issue, Ballgown provides some simple transcript clusteringfunctions. The idea is that similar assembled transcripts can be grouped together in clusters, anddifferential expression analysis could be performed on the cluster, whose expression measurementaggregates the expression estimates of the transcripts that compose it.

2.3 原理：These functions measure the distance between transcripts using Jaccard distance, where eachtranscript's “set” is the nucleotides included in its exons. Transcripts can be clustered using either kmeans clustering or hierarchical clustering

2.4 核心函数：clusterTranscripts，plotLatentTranscripts ，collapseTranscripts

2.5 clusterTranscripts(gene='XLOC_000454', gown=bg, k=2, method='kmeans') ##聚类

clusterTranscripts(gene='XLOC_000454', gown=bg, k=2, method='kmeans')
## $clusters
## cluster t_id
## 1 2 1294
## 2 1 1297
## 3 2 1301
##
## $pctvar
## [1] 0.9117737

2.6 plotLatentTranscripts(gene='XLOC_000454', gown=bg, k=2, method='kmeans', returncluster=FALSE) ##可视化

cluster展示

2.7 agg = collapseTranscripts(gene='XLOC_000454', gown=bg, k=2, method='kmeans') ##将相同cluster的转录本整合

2.8 stattest(gowntable=agg$tab, pData=pData(bg), feature='transcript_cluster',covariate='group', libadjust=FALSE) ##按照cluter进行差异分析

agg = collapseTranscripts(gene='XLOC_000454', gown=bg, k=2, method='kmeans')
stattest(gowntable=agg$tab, pData=pData(bg), feature='transcript_cluster',
covariate='group', libadjust=FALSE)
## feature id pval qval
## 1 transcript_cluster 1 0.3332238 0.6664477
## 2 transcript_cluster 2 0.6882119 0.6882119

ballgown的使用说明
幕布：https://mubu.com/doc/zIdzn1sNO0参考说明书：http://bioconduct...
hisat2+stringtie+ballgown取FPKM
require(ballgown) ## Create ballgown object: my.data=ball...
第七步-ballgown的相关参数
转自转录组差异表达分析工具Ballgown | Biochen生物 Ballgown的输入文件 StringTie...
rnaseq-R
library(ballgown) library(RSkittleBrewer) library(genefil...
Stringtie+deseq2分析转录组备忘
Ballgown备忘上述命令生成下面两个文件
RNA-seq分析流程大比较
HISAT2+StringTie+Ballgown Tophat2+cufflinks+cuffdiff HISA...
RNA-seq阅读2016年【HISAT, StringTie
HISAT+StringTie+Ballgown相比于TopHat2+Cufflinks速度要更快，使用的内存更少...
ERROR: dependency ‘XML’ is not a
在做RNA-SEQ时候，用hisat2，stringtie，ballgown流程，在过程中需要'genefilte...
ballgown下载及使用
下载我用的是R3.6.2的版本，对应BiocManager应该是3.10的版本报错解决即可使用将xsh...
3. Ballgown 流程
Loading data ballgown 通过正则表达和自带函数可以实现RNAseq文件夹下所有/部分样本str...