CoRegNet: 共调控网络的重建和整合分析

作者: Hayley笔记 | 来源:发表于2021-11-16 20:13 被阅读0次

CoRegNet: 共调控网络的重建和整合分析
SCENIC||单细胞基因调控网络推断与聚类
WGCNA实例分析及解读（附代码）
四步完成单细胞数据调控网络流程分析-SCENIC/pySCENI
每周文献|上篇：NLR蛋白参与植物两大免疫机制
培训笔记
跟着Cell学单细胞转录组分析(十二):转录因子分析
关键词词频共现分析——基于BibExcel和Pajek分析工具
第10篇：ATAC-Seq、ChIP-Seq、RNA-Seq整合
10X单细胞（10X空间转录组）基因调控网络分析之DeepSEM

1. 介绍

CoRegNet包的目的是从转录组数据推断大型转录共调控网络，整合外部数据和基因调控信息去推断和分析转录组项目。这个包中独特的网络推断算法可以学习共调控网络。通过识别不同数据类型的基因共合作调节（co-operative regulators of genes）允许不同类型数据的整合到一个共表达分析中。

#安装
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("CoRegNet")

library(CoRegNet)
data(CIT_BLCA_EXP,HumanTF,CIT_BLCA_Subgroup)
dim(CIT_BLCA_EXP)
# [1] 1000  183
# 1000个基因x183个sample的基因表达矩阵
length(intersect(rownames(CIT_BLCA_EXP),HumanTF))
# [1] 67 ##1000个基因中有67个转录因子
head(intersect(rownames(CIT_BLCA_EXP),HumanTF))
# [1] "EEF1A1" "INSM1"  "HOXD1"  "IFI16"  "ID2"  "GRHL1"

2. Quick user guide（CoRegNet的主要功能）

1. 从基因表达数据中重建大规模调控网络

grn = hLICORN(CIT_BLCA_EXP, TFlist=HumanTF) 
#输入的是基因表达矩阵和2020个转录因子的list(包内置的)

2. 推断转录因子活性

influence = regulatorInfluence(grn,CIT_BLCA_EXP)

3. Retrieve inferred co-coregulators

coregs= coregulators(grn)

4. Analyze the network of cooperative regulators using an interactive display

display(grn,CIT_BLCA_EXP,influence,clinicalData=CIT_BLCA_Subgroup)

3详细功能和用法介绍

3.1 Construction of a large-scale co-regulatory network from gene expression data

CoRegNet包中的推理算法是LICORN 算法的hybrid版本。
网络的重构包含四步：
First, the gene expression data is discretized.
Second, all the potential sets of cooperative regulators are extracted using the apriori algorithm of frequent itemset mining.
Third, the best combinations of co-activators and co-inhibitors are identified for each gene.
Finally, a continuous model of regulation using a linear regression method with interaction terms is used to score the local gene regulatory networks of each gene.

使用CoRegNet做共表达至少需要输入两个dataset：

基因表达矩阵/数据框。要求行名为基因名，列名为样本名且列名要唯一。

TF的基因list。CoRegNet包中内置了人类转录因子list，并且有official gene symbol (HumanTF)和EntrezGene IDs (HumanTF_entrezgene)这两种格式。使用的时候直接用data(HumanTF)或data(HumanTF_entrezgene)调用即可。

# An example of how to infer a co-regulation network
grn =hLICORN(CIT_BLCA_EXP, TFlist=HumanTF)
print(grn) #grn是根据输入的表达矩阵和转录因子预测出来的转录因子，靶基因和调控互作
# [1] "63 Transcription Factors.  867 Target Genes.  8072 Regulatory interactions."
# [1] "No added evidences."
head(grn@GRN)
#   Target coact  corep         Coef.Acts          Coef.Reps Coef.coActs Coef.coReps         R2      RMSE
# 1 DIRAS2 NR1H4  IFI16 0.242753364850399  -0.39999852548476          NA          NA 0.24528282 0.9670125
# 2 DIRAS2 NR1H4    MYC 0.225607457464437 -0.273761935521453          NA          NA 0.11392664 1.0525775
# 3 DIRAS2 NR1H4  SNAI2 0.158558921637845 -0.349914647715346          NA          NA 0.23964161 0.9705609
# 4 DIRAS2 NR1H4   IRX3  0.25463526841265 -0.180678166467546          NA          NA 0.12653573 1.0419588
# 5 DIRAS2 NR1H4 SPOCD1  0.29515905804531 -0.212927437477536          NA          NA 0.14379615 1.0316241
# 6 DIRAS2 NR1H4   <NA> 0.308379330917705               <NA>          NA          NA 0.08489369 1.0651439

# 导出所有的Target
Target=unique(grn@GRN$Target) #可以拿出来做GO富集等

63 Transcription Factors x 867 Target Gene x 8072 Regulatory interactions

在上面hLICORN()这一步是自动对输入的矩阵进行了离散化的，也就是说，实际上用于构建调控网络的input矩阵是离散化后的矩阵。下面是对矩阵进行手动离散化的两种方法

#Default discretization. 
#Uses the standard deviation of the whole dataset to set a threshold.
disc1=discretizeExpressionData(CIT_BLCA_EXP)
table(disc1)
# disc1
#     -1      0      1 
#  23482 131019  28499 
boxplot(as.matrix(CIT_BLCA_EXP)~disc1)

#Discretization with a hard threshold
disc2=discretizeExpressionData(CIT_BLCA_EXP, threshold=1)
table(disc2)
# disc2
#    -1     0     1 
# 45956 93063 43981
boxplot(as.matrix(CIT_BLCA_EXP)~disc2)

# more examples here
help(discretizeExpressionData)

使用矩阵前200个基因以方便运算（可以开4线程，速度更快）

# running only on the 200 first gene in the matrix for fast analysis
# Choosing to divide in 4 threads whenever possible
options("mc.cores"=4)
grn =hLICORN(head(CIT_BLCA_EXP,200), TFlist=HumanTF)
print(grn)
# [1] "18 Transcription Factors.  161 Target Genes.  642 Regulatory interactions."
# [1] "No added evidences."
options("mc.cores"=2)
grn =hLICORN(head(CIT_BLCA_EXP,200), TFlist=HumanTF)
print(grn)
# [1] "18 Transcription Factors.  161 Target Genes.  642 Regulatory interactions."
# [1] "No added evidences."

3.2 Refining the inferred regulatory networks

第二步是使用external knowledge来丰富前面得到的inferred regulatory network。有两种类型的外部数据集可以使用：1. TF对基因的调控数据比如Transcription Factor Binding Sites (TFBS) 或ChIP数据；2. 共调控信息比如蛋白-蛋白互作，这类数据可以提供TFs的合作信息。
❗️对于前面两种类型的数据，分别使用addEvidences和addCooperativeEvidences函数来添加。

# 引入四类外部数据
# ChIP data from the CHEA database
data(CHEA_sub)
#ChIP data from the ENCODE project
data(ENCODE_sub)
# Protein protein interactions between TF from the HIPPIE database
data(HIPPIE_sub)
# Protein protein interactions between TF from the STRING database
data(STRING_sub)

enrichedGRN = addEvidences(grn,CHEA_sub,ENCODE_sub)
# CHEA_sub was integrated into the network.
# ENCODE_sub was integrated into the network.
enrichedGRN = addCooperativeEvidences(enrichedGRN,HIPPIE_sub,STRING_sub)
# [1] "HIPPIE_sub"
# [1] "No evidence from HIPPIE_sub were found in the inferred network."
# [1] "STRING_sub"
# [1] "STRING_sub was integrated into the network."

得到的enrichedGRN就是引入external knowledge的grn

print(enrichedGRN)
# [1] "63 Transcription Factors.  867 Target Genes.  8072 Regulatory interactions."
#            commonGene commonReg evidences commonEvidences       enrichment              p.value
# CHEA_sub          567        12      1050             218 1.13521053808365   0.0701008769458297
# ENCODE_sub        724        10      2749             382 1.18076947263415   0.0114797404982399
# STRING_sub       <NA>        39       137              36 3.16656003302599 2.24846204249256e-06

导入外部数据之后就可以对共调控网络进行refine，可以使用默认的refine方法。

# Default unsupervised refinement method
refinedGRN = refine(enrichedGRN)
print(refinedGRN)
# [1] "57 Transcription Factors.  867 Target Genes.  1854 Regulatory interactions."
#            commonGene commonReg evidences commonEvidences       enrichment              p.value
# CHEA_sub          567        12      1050             176 3.55976690089225  3.1172872878133e-32
# ENCODE_sub        724        10      2749             296 4.71677404968161 3.01483641042892e-48
# STRING_sub       <NA>        21        83              17              Inf 4.69277379589639e-08

可以看到refine之后，Transcription Factors和1854 Regulatory interactions的数目都发生了变化

也可以使用指定的数据集来进行refine

# Example of supervised refinement with the CHEA chip data
refinedGRN = refine(enrichedGRN, integration="supervised",referenceEvidence="CHEA_sub")
print(refinedGRN)

3.3 Identification of active transcriptional programs

CoRegNet包的目的是从给定样本或样本集中识别active cooperative TF的集合。Transcriptional activity的衡量方法是估计给定样本中特定转录调节因子的活化程度。这个衡量方法是比较transcriptional network中活化的和抑制的基因表达。它是基于一个样本中这两个基因sets间的divergence的测量(Welch’s t statistics)。从根本上来说，如果可被某个TF激活的基因出现高表达的同时可被抑制的基因出现低表达，这个TF就具有high influence。
使用一个coregnet object的共表达网络和一个基因表达矩阵，不管这个矩阵是用来推断共表达网络的矩阵还是别的矩阵。output的是一个列为样本（和基因表达矩阵一致），行为TF的矩阵。每一行的TF都在转录网络中拥有足够数量的targets(至少10 activated和10 repressed)。

CITinf =regulatorInfluence(grn,CIT_BLCA_EXP)
CITinf[1:6,1:6]
#             CIT100    CIT102     CIT103    CIT105    CIT106    CIT107
# EMX2    -0.6705802 -4.134082   3.433414  8.512250 -2.129228  2.170088
# EGR1     6.2721207 14.563272  -6.805588 -4.347788 11.962142 -3.975751
# TGFB1I1  7.0348603 19.819534  -4.918518 -5.555589 12.702996 -3.141942
# EGR2     4.5810891 14.504279  -8.664420 -7.408075 11.432331 -9.668110
# PLAGL1   5.7439075 18.780449 -10.075245 -7.001515 13.497227 -7.333840
# SOX9     6.1841193 10.627645  -4.749642 -4.485600 12.322927 -1.968706
str(CITinf)
#  num [1:33, 1:183] -0.671 6.272 7.035 4.581 5.744 ...
#  - attr(*, "dimnames")=List of 2
#   ..$ : chr [1:33] "EMX2" "EGR1" "TGFB1I1" "EGR2" ...
#   ..$ : chr [1:183] "CIT100" "CIT102" "CIT103" "CIT105" ...

这个新的transcriptional influence数据集可以被视为整个转录数据集的condensed版本。

# Coregulators of a hLICORN  inferred network
head(coregulators(grn))
#    Reg1    Reg2     Support   NA nGRN    fisherTest adjustedPvalue
# 1 PPARG   VGLL1 0.017881213 1247 1247  1.436958e-68   1.638132e-67
# 2 GATA3   PPARG 0.016891795 1178 1178  2.861003e-87   1.087181e-85
# 3 FOXF1 TGFB1I1 0.013536379  944  944 2.838172e-101   3.235516e-99
# 4  EGR1    EGR2 0.012733373  888  888  4.332173e-72   6.173346e-71
# 5  MSX2   PPARG 0.012288853  857  857  2.767895e-77   6.310800e-76
# 6 GATA3    MSX2 0.009922854  692  692  9.068728e-80   2.584587e-78

也可以引入分组数据和copy number数据等

data(CIT_BLCA_CNV)
data(CIT_BLCA_Subgroup)

display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf)

# Visualizing additional regulatory or co-regulatory evidences in the network
display(enrichedGRN,expressionData=CIT_BLCA_EXP,TFA=CITinf)

# Visualizing sample classification using a named factor
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf,clinicalData=CIT_BLCA_Subgroup)

# Visualizing copy number alteration of regulators
data(CIT_BLCA_CNV)
display(grn,expressionData=CIT_BLCA_EXP,TFA=CITinf,clinicalData=CIT_BLCA_Subgroup,alterationData=CIT_BLCA_CNV)

参考：
http://www.bioconductor.org/packages/release/bioc/vignettes/CoRegNet/inst/doc/CoRegNet.html
差异共表达网络

CoRegNet: 共调控网络的重建和整合分析
1. 介绍 CoRegNet包的目的是从转录组数据推断大型转录共调控网络，整合外部数据和基因调控信息去推断和分析转...
SCENIC||单细胞基因调控网络推断与聚类
SCENIC主要用于基因调控网络的重建和细胞状态的鉴定。（a）使用GENIE3或GRNBoost推断转录因子与候...
WGCNA实例分析及解读（附代码）
WGCNA分析基于两个假设：1.相似表达模式的基因可能存在共调控、功能相关或处于同一通路，2.基因网络符合无尺度分...
四步完成单细胞数据调控网络流程分析-SCENIC/pySCENI
适用背景单细胞转录组调控网络分析是单细胞转录组分析内容的高级分析之一，本文将介绍SCENIC/pySCENIC的...
每周文献|上篇：NLR蛋白参与植物两大免疫机制
在植物免疫领域，2020-2021年爆发了大量整合ETI和PTI免疫通路的研究。探究ETI和PTI的共调控也成为了...
培训笔记
1.未来等待的人才注重设计，会讲故事，善于整合，懂得关怀。会玩乐，重意义。 2.构建和提升学校品质 ...
跟着Cell学单细胞转录组分析(十二):转录因子分析
转录因子分析可以了解细胞异质性背后的基因调控网络的异质性。转录因子分析也是单细胞转录组常见的分析内容，R语言分析一...
关键词词频共现分析——基于BibExcel和Pajek分析工具
概念：关键词共现矩阵的生成共词网络分析
第10篇：ATAC-Seq、ChIP-Seq、RNA-Seq整合
由于基因表达调控机制的复杂性，多种组学数据的整合分析，从不同的层面探究生物问题越来越重要。从RNA-Seq层面，我...
10X单细胞（10X空间转录组）基因调控网络分析之DeepSEM
hello，大家好，今天给大家分享一个新的内容，基因调控网络分析，文章在Modeling gene regulat...