IsoformSwitchAnalyzeR包是2019年才发布的一个分析包,更充分地发挥了RNA-seq数据的价值。学习笔记参考官方文档,示例文件参考R包内嵌数据;回校后再利用资源,根据笔记实操一下吧~
一、简单介绍
1、IsoformSwitch
One of these examples is the pyruvate kinase(丙酮酸激酶). In normal adult homeostasis(体内平衡), cells use the adult isoform (M1), which supports oxidative phosphorylation(氧化磷酸化). However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis(有氧糖酵解), one of the hallmarks(标志) of cancer. Such shifts in isoform usage are termed ‘isoformisoform switching’
如上所述,IsoformSwitch就是同一基因产生的转录本发生变化(量与质)导致基因功能改变,从而最终导致对机体的影响。
2、IsoformSwitchAnalyze
简单理解就是比较两组样本的基因转录本差异与否。这是区别于差异基因分析的一种分析思路。从某些角度,这是有很大意义的。比如两组样本的某一基因表达量(gene expression)差异较小,但是该基因所产生的转录本情况可能发生大的改变(highly significant switch in isoform usage),在差异基因分析时就会被忽略。
3、大致流程workflow
如下图,可以理解成三个步骤--
- (1)导入数据--将前期基础数据导入R中,其中最关键的是Isoform quantification数据。
- (2)对样本isoform从多角度进行注释分析;
- (3)利用转录本注释信息,进行组间IsoformSwitch分析,并将结果可视化。
data:image/s3,"s3://crabby-images/49ccc/49ccc9e59a591f09ee8d1ceb9a62babb48e0deb4" alt=""
步骤一:数据导入
1、Isoform quantification
即RNA-seq中reads比对到转录本中的情况,分析软件有很多,这里推荐使用Salmon(high speed and increased accuracy)。分析后,直接将结果导入到R中即可。importIsoformExpression()
在导入的同时会对counts进行标准化,计算abundance,更有比较意义。
library(IsoformSwitchAnalyzeR)
salmonQuant <- importIsoformExpression(
parentDir = system.file("extdata/", package="IsoformSwitchAnalyzeR"),
addIsofomIdAsColumn = TRUE
)
值得注意的是因为记得salmon分析会将每个样本数据结果储存单独一个文件夹,因此指定
parentDir=
即可。
此外这里因为利用的是包内嵌数据;当使用自己的数据时,直接指定路径/文件即可,下面示例文件操作也是类似的。
2、实验分组设计表
data.frame格式,要有两列:sampleID and condition
myDesign <- data.frame(
sampleID = colnames(salmonQuant$abundance)[-1],
condition = gsub('_.*', '', colnames(salmonQuant$abundance)[-1])
)
如下图,示例实验为三组condition,每组两个重复。
data:image/s3,"s3://crabby-images/06195/06195130fc6d1bb5a0cbd873e2561d94182f89e4" alt=""
3、GTF文件
主要提供以下两方面信息:
- The transcript structure of the isoforms (in genomic coordinates) as well as information about which isoforms originate from the same gene.
即转录本的外显子信息,以及起源基因。 - the CoDing Sequence regions as the as the ORF regions.
其实CDS本质与OPF是两个概念,此时可以一同理解,就是转录本的编码序列。
4、fasta文件
It is highly recommended that you supply the transcript fasta file while constructing the switchAnalyzeRlist. This will result in more accurate and much faster analysis in the entire workflow.
上述GTF与fasta注释文件均可在Ensembl genome browser 99网站下载,如下图。
data:image/s3,"s3://crabby-images/0e4bb/0e4bbe66caa569c4f6b50447893741bbcf04f7cb" alt=""
5、导入合并
准备好上述四个文件,就可以使用importRdata()
函数进行导入合并了:并进行了以下分析:
- 分析转录本abundance,得到基因表达情况;
- 根据分组,summarize gene/isoform expression/usage;
- 根据上一步,计算组间差异(dIF);
- 为每个转录本,注释genomic annotation and the nucleotide sequences。
此时组间差异计算主要是
dIF
,the difference in isoform fraction。IF
即isoform fraction,用来评价该 isoform usage;计算公式为isoform_exp / gene_exp。因此dIF
=IF2 - IF1,用于评价差异大小(effect size)。
sar1 <- importRdata(
isoformCountMatrix = salmonQuant$counts,
isoformRepExpression = salmonQuant$abundance,
designMatrix = myDesign,
isoformExonAnnoation = system.file("extdata/example.gtf.gz" , package="IsoformSwitchAnalyzeR"),
isoformNtFasta = system.file("extdata/example_isoform_nt.fasta.gz", package="IsoformSwitchAnalyzeR")
)
注意一下,此时sar1初步汇总结果仍可理解成:观测/行名为所有的转录本。
head(sar1,2)
head(sar1$isoformFeatures,2)
head(sar1$exons,2)
head(sar1$ntSequence,2)
6、初筛一下数据
这里的过滤主要基于low Gene/Isoform expression筛选,减少接下来不必要的运算时间。
sar2 <- preFilter(
switchAnalyzeRlist = sar1,
geneExpressionCutoff = 10,
isoformExpressionCutoff = 3,
removeSingleIsoformGenes = TRUE #默认为TRUE,删除只有一个转录本的基因
)
#返回:The filtering removed 373 ( 34.16% of ) transcripts. There is now 719 isoforms left
-
preFilter
默认参数geneExpressionCutoff = 1
、isoformExpressionCutoff = 0
sar2 <- preFilter(sar1)
#返回:The filtering removed 273 ( 25% of ) transcripts. There is now 819 isoforms left
步骤二:添加注释
1、注释Q值
这一步主要计算Switches的Q值与dIF,判断isoform的显著性与效应大小
- The alpha argument indicating the FDR corrected P-value (Q-value) cutoff.
- The dIFcutoff argument indicating the minimum (absolute) change in isoform usage (dIF).
由于dIF者前期已经计算,所以这里主要是计算Q值。有不同的的方法/算法,推荐DEXSeq。
DEXSeq was originally designed for testing for differential exon usage but have recently been shown to perform exceptionally well for differential isoform usage.
sar3 <- isoformSwitchTestDEXSeq(
switchAnalyzeRlist = sar2,
reduceToSwitchingGenes=TRUE
)
-
reduceToSwitchingGenes=TRUE
参数设置筛选genes which each contains at least one differential used isoform, as indicated by the alpha and dIFcutoff cutoffs.
head(sar3,2)
data:image/s3,"s3://crabby-images/5a63b/5a63b3f15f4b60de531c1d5312d0ee270574113b" alt=""
2、注释ORF(选做)
由于一般提供GTF文件会提供CDS数据,所以不需要再添加。由于本次示例GTF文件好像未提供相关信息。不过也没关系,因为也可以利用现有的数据集构建每个转录本的ORF。(utilizes the genomic coordinates of each transcript to extract the transcript nucleotide sequence from a reference genome )
sar4 <- analyzeORF(
sar3,
orfMethod = "longest", #有four different methods ,详见官方文档
showProgress=FALSE
)
head(sar4$orfAnalysis, 3)
data:image/s3,"s3://crabby-images/a9fd1/a9fd12a1870017e24c269a690f1491716023c461" alt=""
3、生成转录本核苷酸与对应氨基酸序列
sar5 <- extractSequence(sar4,
filterAALength=TRUE,
alsoSplitFastaFile=TRUE,
pathToOutput = '~/li/test/01',
writeToFile=TRUE)
For easier usage of the Pfam and SignalP web servers,进行以下设置:
-
filterAALength
参数表示氨基酸的有效长度为5~1000; -
alsoSplitFastaFile
参数表示对 the number of sequences。
4、External Sequence Analysis
(1)根据转录本核苷酸序列(Nucleotide/nt Sequences)
- predict whether an isoform is coding or not(CPAT/CPC2二选一)
(2)根据转录本氨基酸序列( Amino Acid/AA Sequences)
- Prediction of protein domains蛋白结构域(Pfam);
- Prediction of Signal Peptides信号肽(SignalP);
- Prediction of Intrinsically Disordered Regions,IDR内在无序区域(IUPred2A/NetSurfP 2二选一)
根据所得序列文件,进行上述四种角度的转录本深入分析,网站链接见官方文档。分析得到结果后,将结果注释到sar5中,方法可参考下述代码:
- CPAT/CPC2
sar5.1 <- analyzeCPAT(
switchAnalyzeRlist = sar5,
pathToCPATresultFile = system.file("extdata/cpat_results.txt", package = "IsoformSwitchAnalyzeR"),
codingCutoff = 0.725, # the coding potential cutoff we suggested for human
removeNoncodinORFs = TRUE # 存疑!!!
)
#返回:Added coding potential to 350 (74.47%) transcripts
### Add CPC2 analysis
exampleSwitchListAnalyzed <- analyzeCPC2(
switchAnalyzeRlist = exampleSwitchListAnalyzed,
pathToCPC2resultFile = system.file("extdata/cpc2_result.txt", package = "IsoformSwitchAnalyzeR"),
codingCutoff = 0.725, # the coding potential cutoff we suggested for human
removeNoncodinORFs = TRUE # because ORF was predicted de novo
)
- Pfam
sar5.2 <- analyzePFAM(
switchAnalyzeRlist = sar5.1,
pathToPFAMresultFile = system.file("extdata/pfam_results.txt", package = "IsoformSwitchAnalyzeR"),
showProgress=FALSE
)
#Added domain information to 102 (74.64%) transcripts
- SignalP
sar5.3 <- analyzeSignalP(
switchAnalyzeRlist = sar5.2,
pathToSignalPresultFile = system.file("extdata/signalP_results.txt", package = "IsoformSwitchAnalyzeR")
)
#Added signal peptide information to 30 (6.38%) transcripts
- IUPred2A/NetSurfP
sar5.4 <- analyzeIUPred2A(
switchAnalyzeRlist = sar5.3,
pathToIUPred2AresultFile = system.file("extdata/iupred2a_result.txt.gz", package = "IsoformSwitchAnalyzeR"),
showProgress = FALSE
)
# Added IDR information to 69 (14.68%) transcripts
友情提示:if you had to submit your data as multiple runs at the Pfam or SignalP website you can just supply a vector of strings indicating the path to each of the resulting files to the functions above and IsoformSwitchAnalyzeR will read them all and integrate them.
5、注释Alternative Splicing可变剪切类型
基于 the exon structure of all isoforms in a given gene (with isoform switching)
sar6 <- analyzeAlternativeSplicing(
switchAnalyzeRlist = sar5.4,
quiet=TRUE
)
tmp=as.data.frame(sar6$AlternativeSplicingAnalysis)
table(sar6$AlternativeSplicingAnalysis$IR )
如下表格显示,61 isoforms contain a single intron retention (IR) and 19 isoforms each contain two or more intron retentions.
data:image/s3,"s3://crabby-images/62aff/62afff98930244b0acb7079772a28afaa3eaaf54" alt=""
6、注释Switch Consequences
- These isoforms are then divided into the isoforms that increase their contribution to gene expression (positive dIF values larger than dIFcutoff) and the isoforms that decrease their contribution (negative dIF values smaller than -dIFcutoff).
- The isoforms with increased contribution are then (in a pairwise manner) compared to the isoform with decreasing contribution.
- In each of these comparisons the isoforms compared are analyzed for differences in their annotation (controlled by the consequencesToAnalyze parameter).
cIn <- c('intron_retention','coding_potential','NMD_status','domains_identified','ORF_seq_similarity')
sar7 <- analyzeSwitchConsequences(
sar6,
consequencesToAnalyze = cIn,
dIFcutoff = 0.4, # very high cutoff for fast runtimes
showProgress=FALSE
)
sar7.df=as.data.frame(sar7$switchConsequence)
sar7.df=sar7$switchConsequence
结合sar7.df及上述定义,我个人的理解是首先筛选基因表达的多个转录本中既存在dIF>0.4的转录本,也存在dIF< -0.4的转录本。然后将同一基因的这些转录本的注释进行比较,观察,这些注释是否存在不同。
data:image/s3,"s3://crabby-images/ade48/ade48aad5ae5095e6cd0b5f524a1d0e689557c6c" alt=""
sar7
如下图所示,步骤二就是不断增加Feature analyzed条目的过程。到这里步骤二就大致完成了。之后的步骤就是结合这些前期数据进行最后结论性得分析、汇总了。
data:image/s3,"s3://crabby-images/05758/05758f95ab39d202fa1e3d9d1c888948c25eae11" alt=""
值得注意的是 IsoformSwitchAnalyzeR包还提供了两个便捷函数:part1,part2,快速实现上述步骤二所有的注释分析
data("exampleSwitchList")
exampleSwitchList #已完成了前期的数据导入
exampleSwitchList <- isoformSwitchAnalysisPart1(
switchAnalyzeRlist = exampleSwitchList,
dIFcutoff = 0.3, # Cutoff for finding switches - set high for short runtime in example data
# pathToOutput = 'path/to/where/output/should/be/'
outputSequences = FALSE, # change to TRUE whan analyzing your own data
prepareForWebServers = FALSE # change to TRUE if you will use webservers for external sequence analysis
)
# 利用产生的序列信息,进行外部软件分析,得到相应结果
exampleSwitchList <- isoformSwitchAnalysisPart2(
switchAnalyzeRlist = exampleSwitchList,
dIFcutoff = 0.3, # Cutoff for finding switches - set high for short runtime in example data
n = 10, # if plotting was enabled, it would only output the top 10 switches
removeNoncodinORFs = TRUE, # Because ORF was predicted de novo
pathToCPC2resultFile = system.file("extdata/cpc2_result.txt" , package = "IsoformSwitchAnalyzeR"),
pathToPFAMresultFile = system.file("extdata/pfam_results.txt" , package = "IsoformSwitchAnalyzeR"),
pathToIUPred2AresultFile = system.file("extdata/iupred2a_result.txt.gz" , package = "IsoformSwitchAnalyzeR"),
pathToSignalPresultFile = system.file("extdata/signalP_results.txt" , package = "IsoformSwitchAnalyzeR"),
outputPlots = FALSE # keeps the function from outputting the plots from this example
)
- 如下结果(Feature analyzed),上述利用
isoformSwitchAnalysisPart1
,isoformSwitchAnalysisPart2
两个函数可直接完成所有的分析,还是挺方便的~
exampleSwitchList
网友评论