说在前面
Immugent在之前介绍过一个R包IOBR:一个R包带你走进数据挖掘的殿堂,可以一站式解决肿瘤免疫的所有分析,如此多功能的R包其实并不多见,小编对这种包也是情有独钟,一直在寻找这一类的包。果然功夫不负有心人,小编最近就在万包丛中又觅得一个精品--PROGENy。
说PROGENy是一个精品包一点也不为过,其被开发主要是用来对一些经典的肿瘤通路进行分析;这个包好到什么程度呢,就是它能做bulk分析的同时还能对单细胞数据进行分析;除了能分析人的数据外,还可以分析小鼠的数据。最开始小编差点看走眼,因为这个包在2018年就被开发出来,文章发在了Nat Commun(IF:14.91)。而截止到今天只被引用了不到一百次,桥豆麻袋,小编瞅了一眼引用它的文章,竟然都是Nature, Cell, Cancer cell, Cancer discovery等顶尖杂志,还好多留意了一下,搞不好Immugent差点就错失这个包了。
我们在评价一个生信算法好坏时,除了它的实用性,适用性,就要看它的应用程度了。很多被水文引用了上千次的包,可能只是算法适用性广而已;而这种能被大佬看上的算法,才是真正的LV包了。
好了,下面小编就来介绍一下它的用法和应用场景了,文末有代码的获得方式。
PROGENy分析bulk数据
首先我们先用PROGENy分析bulk数据来看看它的分析效果,数据来源是很多药物的治疗数据,大家也可以用这个数据对自己研究的基因进行分析。
setwd("D:\\PROGENy")
rm(list=ls())
library(airway)
library(DESeq2)
data(airway)
# import data to DESeq2 and variance stabilize
dset = DESeqDataSetFromMatrix(assay(airway),
colData=as.data.frame(colData(airway)), design=~dex)
dset = estimateSizeFactors(dset)
dset = estimateDispersions(dset)
gene_expr = getVarianceStabilizedData(dset)
# annotate matrix with HGNC symbols
library(biomaRt)
mart = useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes = getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
values=rownames(gene_expr), mart=mart)
matched = match(rownames(gene_expr), genes$ensembl_gene_id)
rownames(gene_expr) = genes$hgnc_symbol[matched]
dim(gene_expr)
gene_expr[1:5,1:5]
library(progeny)
pathways = progeny(gene_expr, scale=FALSE)
controls = airway$dex == "untrt"
ctl_mean = apply(pathways[controls,], 2, mean)
ctl_sd = apply(pathways[controls,], 2, sd)
pathways = t(apply(pathways, 1, function(x) x - ctl_mean))
pathways = apply(pathways, 1, function(x) x / ctl_sd)
library(dplyr)
result = apply(pathways, 1, function(x) {
broom::tidy(lm(x ~ !controls)) %>%
filter(term == "!controlsTRUE") %>%
dplyr::select(-term)
})
mutate(bind_rows(result), pathway=names(result))
length(names(result))
class(result)
# set up a file cache so we download only once
library(BiocFileCache)
# load the downloaded files
drug_table <- readxl::read_excel("data/f58c5245162f_TableS4A.xlsx", skip=5, na="NA")
drug_table <- replace(drug_table, is.na(drug_table), 0)
gene_table <- readr::read_tsv("data/Cell_line_RMA_proc_basalExp.txt")
# we need drug response with COSMIC IDs
drug_response <- data.matrix(drug_table[,3:ncol(drug_table)])
rownames(drug_response) <- drug_table[[1]]
# we need genes in rows and samples in columns
gene_expr <- data.matrix(gene_table[,3:ncol(gene_table)])
colnames(gene_expr) <- sub("DATA.", "", colnames(gene_expr), fixed=TRUE)
rownames(gene_expr) <- gene_table$GENE_SYMBOLS
library(progeny)
pathways <- progeny(gene_expr,scale = TRUE, organism = "Human", top = 100,
perm = 1, verbose = FALSE)
library(pheatmap)
myColor = colorRampPalette(c("Darkblue", "white","red"))(100)
pheatmap(pathways,fontsize=14, show_rownames = F,
color=myColor, main = "PROGENy", angle_col = 45, treeheight_col = 0,
border_color = NA)
这个热图横轴就是PROGENy分析的14个与肿瘤发生发展的信号了,纵轴是各个样本。在实际应用中,只需要提供自己数据的基因表达矩阵即可。
事实上我们对软件的评估应该更苛刻点,而不只是做出的图好看,下面我们对PROGENy的效能进行评估。曲美替尼是一种MEK抑制剂,所以我们假设MAPK活性较高的细胞株对MEK抑制更敏感,下面对它们的关系来进行验证。
head(pathways)
cell_lines = intersect(rownames(pathways), rownames(drug_response))
trametinib = drug_response[cell_lines, "Trametinib"]
mapk = pathways[cell_lines, "MAPK"]
associations = lm(trametinib ~ mapk)
summary(associations)
从上图的结果我们可以看出,MAPK活性与曲美替尼的敏感性密切相关:Pr(>|t|)远小于传统阈值0.05,这说明这种分析是较为准确的。
PROGENy分析单细胞数据
介绍完PROGENy对bulk数据的分析,下面我们就来演示一下它对单细胞数据分析的效果。其实目前有很多基于基因集对单细胞数据进行分析的算法,小编在之前的推文一文带你了解单细胞数据基因集打分的所有算法也有介绍。但是在一篇文章发表在Genome Biol(IF:13.58)的篇名为Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data的文章,作者评估了多种对bulk和单细胞数据基因集进行打分算法,综合评估后发现PROGENy的效果还是很好的。
下面开始展示:
rm(list=ls())
setwd("D:\\PROGENy")
library(progeny)
library(dplyr)
library(Seurat)
library(ggplot2)
library(tidyr)
library(readr)
library(pheatmap)
library(tibble)
## Load the PBMC dataset
pbmc.data <-
Read10X(data.dir = "hg19/")
## Initialize the Seurat object with the raw (non-normalized data).
pbmc <-
CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3,
min.features = 200)
## Identification of mithocondrial genes
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
## Filtering cells following standard QC criteria.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 &
percent.mt < 5)
## Normalizing the data
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize",
scale.factor = 10000)
pbmc <- NormalizeData(pbmc)
## Identify the 2000 most highly variable genes
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
## In addition we scale the data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <-
RunPCA(pbmc, features = VariableFeatures(object = pbmc), verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:10, umap.method = 'uwot', metric='cosine')
DimPlot(pbmc, reduction = "umap")
演示数据是来自10X官网的单细胞数据,下面我们基于其已经注释好的细胞类型做PROGENy通路的分析。
## Finding differentially expressed features (cluster biomarkers)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25,
logfc.threshold = 0.25, verbose = FALSE)
## Assigning cell type identity to clusters
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T",
"FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
## We create a data frame with the specification of the cells that belong to
## each cluster to match with the Progeny scores.
CellsClusters <- data.frame(Cell = names(Idents(pbmc)),
CellType = as.character(Idents(pbmc)),
stringsAsFactors = FALSE)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
## We compute the Progeny activity scores and add them to our Seurat object
## as a new assay called Progeny.
pbmc <- progeny(pbmc, scale=FALSE, organism="Human", top=500, perm=1,
return_assay = TRUE)
pbmc@assays$progeny
## We can now directly apply Seurat functions in our Progeny scores.
## For instance, we scale the pathway activity scores.
pbmc <- Seurat::ScaleData(pbmc, assay = "progeny")
## We transform Progeny scores into a data frame to better handling the results
progeny_scores_df <-
as.data.frame(t(GetAssayData(pbmc, slot = "scale.data",
assay = "progeny"))) %>%
rownames_to_column("Cell") %>%
gather(Pathway, Activity, -Cell)
dim(progeny_scores_df)
## We match Progeny scores with the cell clusters.
progeny_scores_df <- inner_join(progeny_scores_df, CellsClusters)
## We summarize the Progeny scores by cellpopulation
summarized_progeny_scores <- progeny_scores_df %>%
group_by(Pathway, CellType) %>%
summarise(avg = mean(Activity), std = sd(Activity))
dim(summarized_progeny_scores)
## We prepare the data for the plot
summarized_progeny_scores_df <- summarized_progeny_scores %>%
dplyr::select(-std) %>%
spread(Pathway, avg) %>%
data.frame(row.names = 1, check.names = FALSE, stringsAsFactors = FALSE)
paletteLength = 100
myColor = colorRampPalette(c("Darkblue", "white","red"))(paletteLength)
PROGENy包自带可视化功能,那就画一个热图看一下。
progenyBreaks = c(seq(min(summarized_progeny_scores_df), 0,
length.out=ceiling(paletteLength/2) + 1),
seq(max(summarized_progeny_scores_df)/paletteLength,
max(summarized_progeny_scores_df),
length.out=floor(paletteLength/2)))
progeny_hmap = pheatmap(t(summarized_progeny_scores_df),fontsize=12,
fontsize_row = 10,
color=myColor, breaks = progenyBreaks,
main = "PROGENy (500)", angle_col = 45,
treeheight_col = 0, border_color = NA)
这个热图做出来的效果还是很好看的,很多作者都是直接放到文章中进行展示。
PROGENy应用场景
上面说完了这个包的用法,那下面小编就来个一步到位,再来解读一下它在实际文章中的使用方式。首先引入眼帘的是篇名为 Single-cell RNA sequencing reveals distinct tumor microenvironmental patterns in lung adenocarcinoma(Oncogene,IF:9.86)的文章,作者直接就是将分析结果的原图放主图的,分别对上皮细胞和胶质细胞中的肿瘤信号进行对比进行展示,发现不同患者的这两种细胞介导的这些通路具有较大差异。
紧接着是篇名为Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma(Cancer Cell,IF:31.74)的文章,这篇作者就将PROGENy灵活使用在揭示各种EGFR信号上了,作者通过这种分析发现了在头颈癌中介导肿瘤血管生成的关键基因。
最后是篇名为A proteogenomic portrait of lung squamous cell carcinoma(Cell, IF:41.58)的文章,这里作者也是对EGFR套路进行分析,结果发现其蛋白和酪氨酸磷酸化水平与EGFR拷贝数和EGFR活性评分密切相关,从而找出了在肺癌中介导血管生成的关键配体。
总结
PROGENy有它的好,自然也有值得改进的地方,我们都需要理性看待。首先就是其目前只能对14个肿瘤相关的通路进行分析,如果想对其它通路进行分析需要对算法进行改造。不过这个算法的作者一直都在更新,每一次升级作者都发表了相应文章,大家可以进入其github文件下进行系统学习。此外,小编对小鼠数据也做过同样的分析,然而基于生物学知识对结果进行解读后发现其效果不如人的,这也是很多软件的一个软肋,在这里只能寄托于作者或者是其它大神对PROGENy包进行升级了,Immugent和大家一同拭目以待。
最后,小编已经把上面演示所用的PROGENy代码和配套的数据打包好传到百度网盘了,并且把一些相关文献也放进去了,可以私聊小编获得。
网友评论