2023-MutationalPatterns包学习笔记（三）-

作者: 信你个鬼 | 来源:发表于2023-10-21 02:18 被阅读0次

单细胞转录组3大R包系列学习笔记
使用Shiny搭建基于Seurat包的单细胞数据可视化平台
单细胞组学系列学习笔记
单细胞多组学系列学习笔记汇总
Python ☞ day 7
大师兄的Python学习笔记(七）: re包与正则表达式
javascript学习笔记（三）——闭包
avaScript 闭包的理解
学习小组笔记Day6-Jocelyn
JavaScript学习笔记(4) 闭包详解（Closure A

老规矩，先奉上学习资料链接：

https://bioconductor.org/packages/release/bioc/vignettes/MutationalPatterns/inst/doc/Introduction_to_MutationalPatterns.html

这一次学习Mutational signatures的构建。
Mutational signatures被认为代表了突变过程，并以具有特定序列上下文的突变类型的特定贡献为特征。

NMF：使用非负矩阵分解(NMF)，可以从突变计数矩阵中重新提取Mutational signatures。
signature refitting：还可以确定突变计数矩阵对先前定义的Mutational signatures的暴露（exposure）。

NMF对大样本最有用，而signature refitting可用于单个样本。
NMF学习笔记见：2023-MutationalPatterns包学习笔记（二）-Mutational signatures之NMF

本次学习signature refitting。

回顾上次得到mut_mat矩阵：

ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)

vcf_files <- list.files(system.file("extdata", package = "MutationalPatterns"), pattern = "sample.vcf", full.names = TRUE)
sample_names <- c("colon1", "colon2", "colon3", "intestine1", "intestine2", "intestine3","liver1", "liver2", "liver3")

grl <- read_vcfs_as_granges(vcf_files, sample_names, ref_genome)

mut_mat <- mut_matrix(vcf_list = grl, ref_genome = ref_genome)
# First add a small pseudocount to your mutation count matrix:
mut_mat <- mut_mat + 0.0001
head(mut_mat)

矩阵内容：

image-20231021014109374.png

已知signatures：

signatures = get_known_signatures()

image-20231021015655115.png

Signature refitting

1）寻找COSMIC signatures的数学最优贡献

Signature refitting量化了任何一组signatures对样本突变谱的贡献。这对于小样本或单样本的突变特征分析特别有用，同时也可以将自己的发现与已知signatures和已发表的发现联系起来。fit_to_signatures函数通过求解一个非负的最小二乘约束问题，找到与突变矩阵重构最接近的突变特征的最优线性组合。它可以处理SNV、indel、DBS或其他类型的计数矩阵。
将突变矩阵拟合到COSMIC突变特征中：

fit_res <- fit_to_signatures(mut_mat, signatures)
str(fit_res)

List of 2
 $ contribution : num [1:60, 1:9] 51.3 0 0 0 33.9 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:60] "SBS1" "SBS2" "SBS3" "SBS4" ...
  .. ..$ : chr [1:9] "colon1" "colon2" "colon3" "intestine1" ...
 $ reconstructed: num [1:96, 1:9] 2.059 1.233 0.366 1.223 2.608 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:9] "colon1" "colon2" "colon3" "intestine1" ...

对fit_res进行可视化的函数有：plot_contribution，plot_contribution_heatmap，plot_compare_profiles，plot_original_vs_reconstructed
现在我们来展示signatures的贡献度：

## opt$od为你自己设置的输出目录
opt$od <- "/yourpath/"
p <- plot_contribution(fit_res$contribution, coord_flip = FALSE, mode = "absolute")
ggsave(filename = paste0(opt$od,"/fit_res.contribution.png"), width = 9, height = 6.5)

结果图：

image-20231021020224229.png

还可以显示与重建谱的余弦相似性，因为这很好地说明了signatures如何很好地解释突变谱：

p <- plot_original_vs_reconstructed(mut_mat, fit_res$reconstructed, y_intercept = 0.95)
ggsave(filename = paste0(opt$od,"/fit_res.original_vs_reconstructed.png"), width = 9, height = 6.5)

image-20231021020659374.png

2）更严格的refitting

在之前的图中，使用了许多signature来解释样本的突变概况。然而，这么多的突变过程似乎不太可能在这些样本中真正活跃。这个问题被称为过拟合，因为fit_to_signatures找到了重构突变pfofile的signature的最佳组合，即使改善性很小也会使用signature。
signature refitting的另一个问题是signature错误归属。突变有时会归因于具有相似突变谱的样品中的不同特征。这可能给人的印象是样品是非常不同的，但他们实际上不是。这通常是“扁平”signature的结果，难以拟合。彼此相似的signature也可能导致这种错误归属问题。
处理过拟合和signature错误归属的一种方法是选择有限数量的signature用于重新拟合。
另一种处理过拟合的方法是，从标准的重新拟合开始，然后删除那些对突变谱重构效果影响不大的特征。这种方法的缺点是，使用的cutoff有些主观，取决于数据。这里我们使用0.004的cutoff。cutoff越小，重新拟合越不严格，越大则越严格。尝试不同的值有时有助于获得最佳结果。

strict_refit <- fit_to_signatures_strict(mut_mat, signatures, max_delta = 0.004)

# 查看结果结构
str(strict_refit, max.level = 2)

List of 2
 $ sim_decay_fig:List of 9
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
 $ fit_res      :List of 2
  ..$ contribution : num [1:60, 1:9] 50 0 0 0 93.5 ...
  .. ..- attr(*, "dimnames")=List of 2
  ..$ reconstructed: num [1:96, 1:9] 2.114 1.291 0.227 1.088 2.897 ...
  .. ..- attr(*, "dimnames")=List of 2

这个函数返回一个包含fit_res对象和数字列表的列表，显示在修改过程中删除signatures的顺序。

这里我们展示了一个样本的图。x轴显示在迭代过程中被删除的签名。红色条表示余弦相似度的差异变得太大了。停止删除签名，并保留SBS1以供最终重新拟合refit。

fig_list <- strict_refit$sim_decay_fig
p <- fig_list[[1]]
ggsave(filename = paste0(opt$od,"/strict_refit.decay_fig.png"), width = 9, height = 6.5)

image-20231021184040831.png

可视化fit_res：

fit_res_strict <- strict_refit$fit_res
p <- plot_contribution(fit_res_strict$contribution,coord_flip = FALSE,mode = "absolute")
ggsave(filename = paste0(opt$od,"/fit_res_strict.contribution.png"), width = 9, height = 6.5)

image-20231021184229684.png

默认情况下，fit_to_signatures_strict使用上面描述的“向后”选择方法。然而，也可以使用“最佳子集”方法。这种方法的好处是，它可以比“向后”的方法更准确。然而，当使用许多signatures时，它在计算上变得不可行的。因此，它应该只用于较小的signature集(最多10-15个签名)，如组织特定signature。
我们为这个示例随机选择了几个签名，以保持较低的运行时间。在实践中，应该基于先验知识来选择签名：

best_subset_refit <- fit_to_signatures_strict(mut_mat, signatures[,1:5], max_delta = 0.002, method = "best_subset")

str(best_subset_refit, max.level = 2)

List of 2
 $ sim_decay_fig:List of 9
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
  ..$ :List of 9
  .. ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
 $ fit_res      :List of 2
  ..$ contribution : num [1:5, 1:9] 58.5 0 0 0 134.6 ...
  .. ..- attr(*, "dimnames")=List of 2
  ..$ reconstructed: num [1:96, 1:9] 1.667 1.404 0.259 0.965 1.018 ...
  .. ..- attr(*, "dimnames")=List of 2

第三种可以减少signature过拟合和错误归属的方法是合并相似的signature。这是通过合并余弦相似度高于某个截止值的signature来实现的。然后可以使用这些合并的signature进行修改。这种方法的好处是你不需要先验知识。对于大多数常见的用例，我们不推荐这种方法，因为它不太传统，而且很难解释。

merged_signatures <- merge_signatures(signatures, cos_sim_cutoff = 0.8)

Combined the following two signatures: SBS26, SBS12
Combined the following two signatures: SBS36, SBS18
Combined the following two signatures: SBS92, SBS5
Combined the following two signatures: SBS40, SBS3
Combined the following two signatures: SBS10d, SBS10a
Combined the following two signatures: SBS10d;SBS10a, SBS10c
Combined the following two signatures: SBS15, SBS6
Combined the following two signatures: SBS29, SBS24
Combined the following two signatures: SBS94, SBS4
Combined the following two signatures: SBS23, SBS19
Combined the following two signatures: SBS26;SBS12, SBS37
Combined the following two signatures: SBS86, SBS39
Combined the following two signatures: SBS40;SBS3, SBS92;SBS5
Combined the following two signatures: SBS94;SBS4, SBS8
Combined the following two signatures: SBS23;SBS19, SBS31

总之，最好的refit方法取决于你的数据和研究问题。可以使用单一方法，但也可以结合使用几种方法。

3）Bootstrapped refitting

由于前面提到的signature错误归属，signature refitting的稳定性可能不是最优的。Bootstrapping可以用来验证refitting的稳定性。
可以使用fit_to_signatures_bootstrapped函数完成这个功能：

contri_boots <- fit_to_signatures_bootstrapped(mut_mat[, c(3, 7)], signatures, n_boots = 50, method = "strict")

str(contri_boots, max.level = 2)

num [1:100, 1:46] 205.9 0 189.9 10.6 180 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:100] "colon3_1" "liver1_1" "colon3_2" "liver1_2" ...
  ..$ : chr [1:46] "SBS1" "SBS2" "SBS5" "SBS6" ...

bootstrapped refitting结果可视化：
每个点代表一次bootstrap迭代

image-20231022010418263.png

可视化相对贡献：
点的颜色表示找到signature的迭代的百分比(贡献> 0)，点的大小表示该signature的平均贡献(在贡献大于0的迭代中)

p <- plot_bootstrapped_contribution(contri_boots, mode = "relative", plot_type = "dotplot")
ggsave(filename = paste0(opt$od,"/bootstrapped_contribution.relative.png"), width = 9, height = 6.5)

我们可以看到，SBS1在第一个样本中是相对稳定的。然而，SBS5在第二个样本中非常不稳定。这种不稳定性很可能是SBS5非常平坦的结果：

image-20231022011246484.png

可视化signatures之间的相似性：
两个signatures之间的负相关意味着它们的贡献在不同的bootstrap迭代中很高。
在这里，我们选择一个样本可视化其的相关性：

fig_list <- plot_correlation_bootstrap(contri_boots)
p <- fig_list[[2]]
ggsave(filename = paste0(opt$od,"/correlation_bootstrap.png"), width = 9, height = 6.5)

在这里我们可以看到SBS5和SBS40是负相关的。这是有道理的，因为它们都是平面signature，彼此非常相似。因此，refitting过程难以区分它们：

image-20231022011753632.png

mutational profiles 和 signatures的相似性

除了执行NMF或拟合特征，你还可以查看它们的相似性。相似性计算如下：
计算两个two mutational profiles和signatures的相似性：

cos_sim(mut_mat[, 1], signatures[, 1])

1] 0.8342898

计算多个two mutational profiles和signatures的相似性：

cos_sim_samples_signatures <- cos_sim_matrix(mut_mat, signatures)
cos_sim_samples_signatures[1:3, 1:3]

            SBS1       SBS2      SBS3
colon1 0.8342898 0.16102621 0.4046108
colon2 0.9015886 0.12142926 0.3628445
colon3 0.9105711 0.07415821 0.3390110

还可以使用plot_cosine_heatmap函数绘制热图可视化相似性结果并进行聚类：

p <- plot_cosine_heatmap(cos_sim_samples_signatures, cluster_rows = TRUE, cluster_cols = TRUE)
ggsave(filename = paste0(opt$od,"/cosine_heatmap.png"), width = 9, height = 6.5)

image-20231022013534394.png

计算样本间cosine相似性：

cos_sim_samples <- cos_sim_matrix(mut_mat, mut_mat)
p <- plot_cosine_heatmap(cos_sim_samples, cluster_rows = TRUE, cluster_cols = TRUE)
ggsave(filename = paste0(opt$od,"/cosine_heatmap.sample.png"), width = 9, height = 6.5)

image-20231022013700715.png

Signature潜在损伤分析

一些Signature比其他Signature更有可能通过引起“stop gain”或“mismatch”突变而产生功能性影响。有了MutationalPatterns，就有可能分析一个特征对一组感兴趣的基因造成“stop gain”、“mismatch”、“synonymous”或“splice site”突变的可能性有多大。请注意，这是一个相对基本的分析，只关注突变环境。其他特征如开/闭染色质没有被考虑在内。这种分析是为了给出一个迹象，而不是一个明确的答案，说明一个Signature可能有多大的破坏性。
首先，需要首先，需要加载一个转录注释数据库，并确保安装了一些依赖项。加载一个转录注释数据库，并确保安装了一些依赖项。

# For example get known genes table from UCSC for hg19 using
# BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
# BiocManager::install("AnnotationDbi")
# BiocManager::install("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

接下来，需要选择一组基因并创建Entrez基因id的向量。在这个例子中，我们使用一个小的集合来保持较低的运行时间，但在实践中，你可以使用一个更大的基因列表，你感兴趣的。(本例中使用的基因是:P53、KRAS、NRAS、BRAF、BRCA2、CDKN2A、ARID1A、PTEN和TERT。)
一个有用的癌症基因列表可以在这里找到:https://cancer.sanger.ac.uk/cosmic/census

gene_ids <- c(7157, 3845, 4893, 673, 675, 1029, 8289, 5728, 7015)

现在，“stop gain”、“mismatch”、“synonymous”和“splice site”突变的比例可以根据基因组背景来确定。还给出了每个上下文的可能突变的总数。最后，对mismatch给出一个blosum62分。分数越低意味着mismatch的氨基酸越不相似。不同的氨基酸越多，就越有可能产生有害的影响

contexts <- rownames(mut_mat)
context_mismatches <- context_potential_damage_analysis(contexts, txdb, ref_genome, gene_ids)
head(context_mismatches)

## # A tibble: 6 × 5
##   type        context     n  ratio blosum62
##   <fct>       <fct>   <dbl>  <dbl>    <dbl>
## 1 Stop_gain   A[C>A]A    46 0.0480  NA     
## 2 Missense    A[C>A]A   879 0.918    0.0501
## 3 Synonymous  A[C>A]A    23 0.0240  NA     
## 4 splice_site A[C>A]A    10 0.0104  NA     
## 5 Stop_gain   A[C>G]A    46 0.0480  NA     
## 6 Missense    A[C>G]A   879 0.918    0.379

然后可以使用每个上下文的比率来获得每个Signature的比率。还给出了归一化比率。这些是通过将每个Signature中的比率除以一个完全“平坦”Signature的比率来计算的。

“stop gain”突变的归一化比率为2，意味着与完全随机的“平坦”Signature相比，一个Signature引起“stop gain”突变的可能性是其两倍。
可能引起的突变数量的度量指标：每个上下文可能的突变总数乘以每个上下文的Signature贡献，并对所有上下文求和

sig_damage <- signature_potential_damage_analysis(signatures, contexts, context_mismatches)
head(sig_damage)

## # A tibble: 6 × 7
##   type        sig     ratio ratio_by_background      n blosum62 blosum62_min_background
##   <fct>       <chr>   <dbl>               <dbl>  <dbl>    <dbl>                   <dbl>
## 1 Stop_gain   SBS1   0.0352               0.814  17.3    NA                      NA    
## 2 Missense    SBS1   0.612                0.862 282.     -0.387                   0.113
## 3 Synonymous  SBS1   0.340                1.57  149.     NA                      NA    
## 4 splice_site SBS1   0.0122               0.405   5.98   NA                      NA    
## 5 Stop_gain   SBS10a 0.156                3.60  166.     NA                      NA    
## 6 Missense    SBS10a 0.690                0.971 725.     -0.846                  -0.346

此外，以上这些所有分析都是针对SNV signatures进行的分析，其实还可以对 indel, DBS 和 transcription strand bias signatures进行分析，详细就去看教程吧~

这个包的内容很多，这次学习只能粗略过一遍了，后面还有一些，下次见~

单细胞转录组3大R包系列学习笔记
Seurat包学习笔记 Seurat包学习笔记（一）：Guided Clustering TutorialSeur...
使用Shiny搭建基于Seurat包的单细胞数据可视化平台
image ▼Seurat包系列学习笔记▼：Seurat包学习笔记（一）：Guided Clustering Tu...
单细胞组学系列学习笔记
scRNA-seq数据分析 Seurat包学习笔记 Seurat包学习笔记（一）：Guided Clusterin...
单细胞多组学系列学习笔记汇总
单细胞转录组数据分析 Seurat包学习笔记 Seurat包学习笔记（一）：Guided Clustering T...
Python ☞ day 7
Python学习笔记之 date & datetime & 模块 & 包 & 第三方模块安装时间 UTC(世界协...
大师兄的Python学习笔记(七）: re包与正则表达式
大师兄的Python学习笔记(六）: 常用库之os包大师兄的Python学习笔记(八）: 异常处理一、关于re包...
javascript学习笔记（三）——闭包
作用域在程序的某个节点上的作用域指的是，该代码节点能够阅读到的所有实体(entity)，也被称为上下文或者执行环...
avaScript 闭包的理解
JavaScript 学习笔记之闭包这篇文章是我学习闭包的笔记与总结，后面会有一些例子，结合画图的形式来理解，什...
学习小组笔记Day6-Jocelyn
#2020年7月8日学习笔记学习R包 ##一、安装R包（dplyr） ##二、使用R包之常用的五个基本函数 ##...
JavaScript学习笔记(4) 闭包详解（Closure A
《JavaScript学习笔记(4) 闭包详解（Closure Are Not Magic）》 ---- 写在开头...

2023-MutationalPatterns包学习笔记（三）-

Signature refitting

1）寻找COSMIC signatures的数学最优贡献

2）更严格的refitting

3）Bootstrapped refitting

mutational profiles 和 signatures的相似性

Signature潜在损伤分析

相关文章

单细胞转录组3大R包系列学习笔记

使用Shiny搭建基于Seurat包的单细胞数据可视化平台

单细胞组学系列学习笔记

单细胞多组学系列学习笔记汇总

Python ☞ day 7

大师兄的Python学习笔记(七）: re包与正则表达式

javascript学习笔记（三）——闭包

avaScript 闭包的理解

学习小组笔记Day6-Jocelyn

JavaScript学习笔记(4) 闭包详解（Closure A

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读