RNA-seq学习：No.7IsoformSwitchAnaly

作者: 小贝学生信 | 来源:发表于2020-03-19 17:11 被阅读0次

RNA-seq学习：No.7IsoformSwitchAnaly
深入理解RNA-seq
Transcript quantification/differ
RNA-seq 检测变异之 GATK 最佳实践流程
[RNA-Seq] 学习
RNA-seq无比对直接定量(Kallisto - sleuth
STAR 使用手册
Analysis of single cell RNA-seq
【RSS】[1] FastQC-Quality assessme
39个转录组分析工具，120种组合评估

IsoformSwitchAnalyzeR包是2019年才发布的一个分析包，更充分地发挥了RNA-seq数据的价值。学习笔记参考官方文档，示例文件参考R包内嵌数据；回校后再利用资源，根据笔记实操一下吧~

一、简单介绍

1、IsoformSwitch

One of these examples is the pyruvate kinase（丙酮酸激酶）. In normal adult homeostasis（体内平衡）, cells use the adult isoform (M1), which supports oxidative phosphorylation（氧化磷酸化）. However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis（有氧糖酵解）, one of the hallmarks（标志） of cancer. Such shifts in isoform usage are termed ‘isoformisoform switching’

如上所述，IsoformSwitch就是同一基因产生的转录本发生变化（量与质）导致基因功能改变，从而最终导致对机体的影响。

2、IsoformSwitchAnalyze

简单理解就是比较两组样本的基因转录本差异与否。这是区别于差异基因分析的一种分析思路。从某些角度，这是有很大意义的。比如两组样本的某一基因表达量(gene expression)差异较小，但是该基因所产生的转录本情况可能发生大的改变(highly significant switch in isoform usage)，在差异基因分析时就会被忽略。

3、大致流程workflow

如下图，可以理解成三个步骤--

（1）导入数据--将前期基础数据导入R中，其中最关键的是Isoform quantification数据。
（2）对样本isoform从多角度进行注释分析；
（3）利用转录本注释信息，进行组间IsoformSwitch分析，并将结果可视化。

workflow

步骤一：数据导入

1、Isoform quantification

即RNA-seq中reads比对到转录本中的情况，分析软件有很多，这里推荐使用Salmon(high speed and increased accuracy)。分析后，直接将结果导入到R中即可。importIsoformExpression()在导入的同时会对counts进行标准化，计算abundance，更有比较意义。

library(IsoformSwitchAnalyzeR)
salmonQuant <- importIsoformExpression(
    parentDir = system.file("extdata/", package="IsoformSwitchAnalyzeR"),
    addIsofomIdAsColumn = TRUE
)

值得注意的是因为记得salmon分析会将每个样本数据结果储存单独一个文件夹，因此指定parentDir=即可。
此外这里因为利用的是包内嵌数据；当使用自己的数据时，直接指定路径/文件即可，下面示例文件操作也是类似的。

2、实验分组设计表

data.frame格式，要有两列：sampleID and condition

myDesign <- data.frame(
    sampleID = colnames(salmonQuant$abundance)[-1],
    condition = gsub('_.*', '', colnames(salmonQuant$abundance)[-1])
)

如下图，示例实验为三组condition，每组两个重复。

myDesign

3、GTF文件

主要提供以下两方面信息：

The transcript structure of the isoforms (in genomic coordinates) as well as information about which isoforms originate from the same gene.
即转录本的外显子信息，以及起源基因。
the CoDing Sequence regions as the as the ORF regions.
其实CDS本质与OPF是两个概念，此时可以一同理解，就是转录本的编码序列。

4、fasta文件

It is highly recommended that you supply the transcript fasta file while constructing the switchAnalyzeRlist. This will result in more accurate and much faster analysis in the entire workflow.

上述GTF与fasta注释文件均可在Ensembl genome browser 99网站下载，如下图。

GTF与fasta下载

5、导入合并

准备好上述四个文件，就可以使用importRdata()函数进行导入合并了：并进行了以下分析：

分析转录本abundance，得到基因表达情况；
根据分组，summarize gene/isoform expression/usage；
根据上一步，计算组间差异(dIF)；
为每个转录本，注释genomic annotation and the nucleotide sequences。

此时组间差异计算主要是dIF,the difference in isoform fraction。IF即isoform fraction，用来评价该 isoform usage；计算公式为isoform_exp / gene_exp。因此dIF=IF2 - IF1，用于评价差异大小(effect size)。

sar1 <- importRdata(
    isoformCountMatrix   = salmonQuant$counts,
    isoformRepExpression = salmonQuant$abundance,
    designMatrix         = myDesign,
    isoformExonAnnoation = system.file("extdata/example.gtf.gz"             , package="IsoformSwitchAnalyzeR"),
    isoformNtFasta       = system.file("extdata/example_isoform_nt.fasta.gz", package="IsoformSwitchAnalyzeR")
)

注意一下，此时sar1初步汇总结果仍可理解成：观测/行名为所有的转录本。

head(sar1,2)
head(sar1$isoformFeatures,2)
head(sar1$exons,2)
head(sar1$ntSequence,2)

6、初筛一下数据

这里的过滤主要基于low Gene/Isoform expression筛选，减少接下来不必要的运算时间。

sar2 <- preFilter(
  switchAnalyzeRlist = sar1,
  geneExpressionCutoff = 10,
  isoformExpressionCutoff = 3,
  removeSingleIsoformGenes = TRUE  #默认为TRUE，删除只有一个转录本的基因
)
#返回：The filtering removed 373 ( 34.16% of ) transcripts. There is now 719 isoforms left

preFilter默认参数 geneExpressionCutoff = 1、isoformExpressionCutoff = 0

sar2 <- preFilter(sar1)
#返回：The filtering removed 273 ( 25% of ) transcripts. There is now 819 isoforms left

步骤二：添加注释

1、注释Q值

这一步主要计算Switches的Q值与dIF，判断isoform的显著性与效应大小

The alpha argument indicating the FDR corrected P-value (Q-value) cutoff.
The dIFcutoff argument indicating the minimum (absolute) change in isoform usage (dIF).
由于dIF者前期已经计算，所以这里主要是计算Q值。有不同的的方法/算法，推荐DEXSeq。

DEXSeq was originally designed for testing for differential exon usage but have recently been shown to perform exceptionally well for differential isoform usage.

sar3 <- isoformSwitchTestDEXSeq(
  switchAnalyzeRlist = sar2,
  reduceToSwitchingGenes=TRUE
)

reduceToSwitchingGenes=TRUE参数设置筛选genes which each contains at least one differential used isoform, as indicated by the alpha and dIFcutoff cutoffs.

head(sar3,2)

head(sar3,2)

2、注释ORF（选做）

由于一般提供GTF文件会提供CDS数据，所以不需要再添加。由于本次示例GTF文件好像未提供相关信息。不过也没关系，因为也可以利用现有的数据集构建每个转录本的ORF。(utilizes the genomic coordinates of each transcript to extract the transcript nucleotide sequence from a reference genome )

sar4 <- analyzeORF(
    sar3,
    orfMethod = "longest",  #有four different methods ，详见官方文档
    showProgress=FALSE
)
head(sar4$orfAnalysis, 3)

head(sar4$orfAnalysis, 3)

3、生成转录本核苷酸与对应氨基酸序列

sar5 <- extractSequence(sar4,
                        filterAALength=TRUE,
                        alsoSplitFastaFile=TRUE,
                        pathToOutput = '~/li/test/01',
                        writeToFile=TRUE)

For easier usage of the Pfam and SignalP web servers，进行以下设置：

filterAALength参数表示氨基酸的有效长度为5~1000；
alsoSplitFastaFile参数表示对 the number of sequences。

4、External Sequence Analysis

（1）根据转录本核苷酸序列(Nucleotide/nt Sequences)

predict whether an isoform is coding or not（CPAT/CPC2二选一）

（2）根据转录本氨基酸序列( Amino Acid/AA Sequences)

Prediction of protein domains蛋白结构域（Pfam）；
Prediction of Signal Peptides信号肽（SignalP）；
Prediction of Intrinsically Disordered Regions,IDR内在无序区域（IUPred2A/NetSurfP 2二选一）

根据所得序列文件，进行上述四种角度的转录本深入分析，网站链接见官方文档。分析得到结果后，将结果注释到sar5中，方法可参考下述代码：

CPAT/CPC2

sar5.1 <- analyzeCPAT(
  switchAnalyzeRlist   = sar5,
  pathToCPATresultFile = system.file("extdata/cpat_results.txt", package = "IsoformSwitchAnalyzeR"),
  codingCutoff         = 0.725, # the coding potential cutoff we suggested for human
  removeNoncodinORFs   = TRUE   # 存疑！！！
)
#返回：Added coding potential to 350 (74.47%) transcripts
### Add CPC2 analysis
exampleSwitchListAnalyzed <- analyzeCPC2(
    switchAnalyzeRlist   = exampleSwitchListAnalyzed,
    pathToCPC2resultFile = system.file("extdata/cpc2_result.txt", package = "IsoformSwitchAnalyzeR"),
    codingCutoff         = 0.725, # the coding potential cutoff we suggested for human
    removeNoncodinORFs   = TRUE   # because ORF was predicted de novo
)

Pfam

sar5.2 <- analyzePFAM(
  switchAnalyzeRlist   = sar5.1,
  pathToPFAMresultFile = system.file("extdata/pfam_results.txt", package = "IsoformSwitchAnalyzeR"),
  showProgress=FALSE
)
#Added domain information to 102 (74.64%) transcripts

SignalP

sar5.3 <- analyzeSignalP(
  switchAnalyzeRlist       = sar5.2,
  pathToSignalPresultFile  = system.file("extdata/signalP_results.txt", package = "IsoformSwitchAnalyzeR")
)
#Added signal peptide information to 30 (6.38%) transcripts

IUPred2A/NetSurfP

sar5.4 <- analyzeIUPred2A(
  switchAnalyzeRlist        = sar5.3,
  pathToIUPred2AresultFile = system.file("extdata/iupred2a_result.txt.gz", package = "IsoformSwitchAnalyzeR"),
  showProgress = FALSE
)
# Added IDR information to 69 (14.68%) transcripts

友情提示：if you had to submit your data as multiple runs at the Pfam or SignalP website you can just supply a vector of strings indicating the path to each of the resulting files to the functions above and IsoformSwitchAnalyzeR will read them all and integrate them.

5、注释Alternative Splicing可变剪切类型

基于 the exon structure of all isoforms in a given gene (with isoform switching)

sar6 <- analyzeAlternativeSplicing(
    switchAnalyzeRlist = sar5.4,
    quiet=TRUE
)
tmp=as.data.frame(sar6$AlternativeSplicingAnalysis)
table(sar6$AlternativeSplicingAnalysis$IR )

如下表格显示，61 isoforms contain a single intron retention (IR) and 19 isoforms each contain two or more intron retentions.

table

6、注释Switch Consequences

These isoforms are then divided into the isoforms that increase their contribution to gene expression (positive dIF values larger than dIFcutoff) and the isoforms that decrease their contribution (negative dIF values smaller than -dIFcutoff).
The isoforms with increased contribution are then (in a pairwise manner) compared to the isoform with decreasing contribution.
In each of these comparisons the isoforms compared are analyzed for differences in their annotation (controlled by the consequencesToAnalyze parameter).

cIn <- c('intron_retention','coding_potential','NMD_status','domains_identified','ORF_seq_similarity')
sar7 <- analyzeSwitchConsequences(
  sar6,
  consequencesToAnalyze = cIn, 
  dIFcutoff = 0.4, # very high cutoff for fast runtimes
  showProgress=FALSE
)
sar7.df=as.data.frame(sar7$switchConsequence)
sar7.df=sar7$switchConsequence

结合sar7.df及上述定义，我个人的理解是首先筛选基因表达的多个转录本中既存在dIF>0.4的转录本，也存在dIF< -0.4的转录本。然后将同一基因的这些转录本的注释进行比较，观察，这些注释是否存在不同。

sar7.df

sar7

如下图所示，步骤二就是不断增加Feature analyzed条目的过程。到这里步骤二就大致完成了。之后的步骤就是结合这些前期数据进行最后结论性得分析、汇总了。

sar7

值得注意的是 IsoformSwitchAnalyzeR包还提供了两个便捷函数：part1,part2，快速实现上述步骤二所有的注释分析

data("exampleSwitchList")
exampleSwitchList    #已完成了前期的数据导入
exampleSwitchList <- isoformSwitchAnalysisPart1(
    switchAnalyzeRlist   = exampleSwitchList,
    dIFcutoff            = 0.3, # Cutoff for finding switches - set high for short runtime in example data
    # pathToOutput = 'path/to/where/output/should/be/'
    outputSequences      = FALSE, # change to TRUE whan analyzing your own data 
    prepareForWebServers = FALSE  # change to TRUE if you will use webservers for external sequence analysis
)
# 利用产生的序列信息，进行外部软件分析，得到相应结果
exampleSwitchList <- isoformSwitchAnalysisPart2(
  switchAnalyzeRlist        = exampleSwitchList, 
  dIFcutoff                 = 0.3,   # Cutoff for finding switches - set high for short runtime in example data
  n                         = 10,    # if plotting was enabled, it would only output the top 10 switches
  removeNoncodinORFs        = TRUE,  # Because ORF was predicted de novo
  pathToCPC2resultFile      = system.file("extdata/cpc2_result.txt"         , package = "IsoformSwitchAnalyzeR"),
  pathToPFAMresultFile      = system.file("extdata/pfam_results.txt"        , package = "IsoformSwitchAnalyzeR"),
  pathToIUPred2AresultFile  = system.file("extdata/iupred2a_result.txt.gz"  , package = "IsoformSwitchAnalyzeR"),
  pathToSignalPresultFile   = system.file("extdata/signalP_results.txt"     , package = "IsoformSwitchAnalyzeR"),
  outputPlots               = FALSE  # keeps the function from outputting the plots from this example
)

如下结果(Feature analyzed)，上述利用isoformSwitchAnalysisPart1，isoformSwitchAnalysisPart2两个函数可直接完成所有的分析，还是挺方便的~
exampleSwitchList

RNA-seq学习：No.7IsoformSwitchAnaly
IsoformSwitchAnalyzeR包是2019年才发布的一个分析包，更充分地发挥了RNA-seq数据的价值...
深入理解RNA-seq
RNA-seq数据介绍学习RNA-seq文章https://www.ncbi.nlm.nih.gov/pmc/a...
Transcript quantification/differ
本文为学习RNA-seq中** Transcript quantification / Differential ...
RNA-seq 检测变异之 GATK 最佳实践流程
tags： RNA-seq GATK STAR SNP INDEL RNA-seq 序列比对对 RNA-seq ...
[RNA-Seq] 学习
最近在学RNA-Seq，从收集资料开始，将好用的链接先拷过来，方便阅读： 1、初始数据过滤、质控NGS 数据过滤...
RNA-seq无比对直接定量(Kallisto - sleuth
RNA-seq数据下载数据来自文献：An RNA-Seq transcriptome and splicing ...
STAR 使用手册
Mapping RNA-seq Reads with STAR 基础用法：Mapping RNA-seq read...
Analysis of single cell RNA-seq
前言今天与大家分享学习的是《Analysis of single cell RNA-seq data》这本书，这...
【RSS】[1] FastQC-Quality assessme
【RSS】RNA-seq step by step这个RSS专题将step by step 记录我学习RNA-se...
39个转录组分析工具，120种组合评估
RNA-seq工具哪家强 RNA-seq分析工具知多少 RNA-seq是研究转录组应用最广泛，也最重要的技术之一。...

RNA-seq学习：No.7IsoformSwitchAnaly

一、简单介绍

1、IsoformSwitch

2、IsoformSwitchAnalyze

3、大致流程workflow

步骤一：数据导入

1、Isoform quantification

2、实验分组设计表

3、GTF文件

4、fasta文件

5、导入合并

6、初筛一下数据

步骤二：添加注释

1、注释Q值

2、注释ORF（选做）

3、生成转录本核苷酸与对应氨基酸序列

4、External Sequence Analysis

5、注释Alternative Splicing可变剪切类型

6、注释Switch Consequences

相关文章

RNA-seq学习：No.7IsoformSwitchAnaly

深入理解RNA-seq

Transcript quantification/differ

RNA-seq 检测变异之 GATK 最佳实践流程

[RNA-Seq] 学习

RNA-seq无比对直接定量(Kallisto - sleuth

STAR 使用手册

Analysis of single cell RNA-seq

【RSS】[1] FastQC-Quality assessme

39个转录组分析工具，120种组合评估

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

可变剪切与基因复制

小白学习RNA-seq之旅

可变剪切分析