利用MAGeCK算法处理CRISPR Screen数据

作者: 凯凯何_Boy | 来源:发表于2022-03-10 08:25 被阅读0次

利用MAGeCK算法处理CRISPR Screen数据
2021-08-25 基于Base editor的CRISPR
03 EM算法 - EM算法流程和直观案例
【重构笔记-2】工作中遇到的问题总结
【机器学习】无监督学习 K-means(K-均值聚类算法）
运用Aggregator模式实现MapReduce
CRISPR、CRISPR screen以及与单细胞测序的结合
一种室内地图解决方案
学习笔记--(移动数据挖掘引言)
hadoop处理数据模版

上次文章结尾时候提到了MAGeCK RRA算法处理，这次我们就来学习一下，Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout(MAGeCK) 是一个可以从全基因组CRISPR-CAS9筛查技术中识别重要基因计算工具。Mageck是由Wei Li 和 Shirley Liu lab共同开发维护的。

Tutorial地址：https://sourceforge.net/p/mageck/wiki/

Paper：Li, et al. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology 15:554 (2014)

安装

Conda安装(推荐)

conda create -c bioconda -n mageckenv mageck  # 环境中python版本要大于3
source activate mageckenv
conda update mageck
source deactivate

Docker安装

可以follow这个流程https://sourceforge.net/p/mageck/wiki/demo/#tutorial-3-run-mageck-on-docker-image

源码安装

最新版的MAGeCK(0.5.9)下载地址为：https://sourceforge.net/projects/mageck/files/latest/download

tar xvzf mageck-0.5.4.tar.gz
cd mageck-0.5.4
python setup.py install
## 如果你想要安装MAGeCK安装到你自己的目录下
python setup.py install --user
## $HOME是你的root目录
python setup.py install --prefix=$HOME
## 设置环境变量
export PATH=$PATH:$HOME/bin

MAGECK RRA算法

下载测试数据

我们直接使用 Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library这篇文章中的数据集中的两个样本为示例，在本文中，作者对小鼠ESC细胞进行了CRISPR / CAS9筛选，并鉴定了小鼠ESC细胞中必需的基因。

准备文件：

测序数据：对照组 ERR376998 (one replicate of plasmid)、处理组 ERR376999 (one replicate of ESC)
sgRNA库文件：yusa_library.csv 包含测序sgRNAid、序列、靶向基因

sgRNA文库

数据下载地址：http://www.ebi.ac.uk/ena/data/view/ERP003292

sgRNA库文件：https://sourceforge.net/projects/mageck/files/libraries/

数据的下载方式可以参考一文。

数据下载后进行解压

gunzip ERR376998.fastq.gz
gunzip ERR376999.fastq.gz

count获取样本比对数据

mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --fastq ERR376998.fastq  ERR376999.fastq

参数	解释
-l yusa_library.csv	提供的sgRNA信息包含sgRNA id,序列和靶向的基因(支持csv与tab分割)
-n demo	输出文件的前缀
--sample-label L1,CTRL	L1 (test1.fastq) and CTRL (test2.fastq)两个样本的标签
--fastq test1.fastq test2.fastq	测序的fastq序列
--pdf-report	对测序数据的统计信息提供pdf报告文档
--trim-5 （可选）	切除5段接头序列

我们打开ERR376998文件可以看到各序列的前23个碱基是一致的，我们可以设置--trim-5参数手动切除这些序列，但是从version 0.5.6版本后, MAGeCK现在可以自动识别fastq文件中一致序列进行切除，因此这个参数是可选的。

image-20220301135529297

test比较样本差异

当我们得到count矩阵后，就可以用test命令进行组间差异分析

mageck test -k escneg.count.txt -t ESC1 -c plasmid -n esccp

参数	解释
-k escneg.count.txt	count命令得到的各样本read count矩阵
-t	指定处理组数据
-c	指定对照组数据
-n demo	输出文件的前缀

如果成功，可以看到"esccp.gene_summary.txt"文件, 可以看到两个核糖体基因 RPS5和RP19在阴性筛选的前几名

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna
GTF2B   5       2.0462e-10      2.5851e-07      0.000707        1       5       1.0     1.0     1.0     19150   0
RPS5    5       5.9353e-10      2.5851e-07      0.000707        2       5       1.0     1.0     1.0     19149   0
RPL19   4       2.695e-09       2.5851e-07      0.000707        3       4       1.0     1.0     1.0     19148   0
KIF18B  5       1.0136e-08      2.5851e-07      0.000707        4       5       1.0     1.0     1.0     19146   0

如果我们想看阳性筛选排序，可以将第11列进行排序，发现 TPR53在前几名

## 排序
sort -k 11,11n esccp.gene_summary.txt | less

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna
ZFP945  5       1.0     1.0     0.999999        19150   0       9.6166e-07      5.4287e-06      0.05198 1  5
TRP53   5       0.95411 0.95409 0.999999        17901   0       1.0347e-06      5.4287e-06      0.05198 2  4
PDAP1   5       0.85937 0.86223 0.999999        15753   1       7.6412e-06      2.8178e-05      0.174505  3       2

文件信息如下：

Column	Content
id	Gene ID
num	The number of targeting sgRNAs for each gene
neg\|score	The RRA lo value of this gene in negative selection
neg\|p-value	The raw p-value (using permutation) of this gene in negative selection
neg\|fdr	The false discovery rate of this gene in negative selection
neg\|rank	The ranking of this gene in negative selection
neg\|goodsgrna	The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in negative selection.
neg\|lfc	The log2 fold change of this gene in negative selection. The way to calculate gene lfc is controlled by the --gene-lfc-method option
pos\|score	The RRA lo value of this gene in positive selection
pos\|p-value	The raw p-value (using permutation) of this gene in positive selection
pos\|fdr	The false discovery rate of this gene in positive selection
pos\|rank	The ranking of this gene in positive selection
pos\|goodsgrna	The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in positive selection.
pos\|lfc	The log fold change of this gene in positive selection

同样我们可以添加--pdf-report参数，生成一个统计报告，便于分享展示。

image-20220301141757379

【关于筛选】：这是一项重要步骤，根据研究目的不同，CRISRP共同量测序分为阳性筛选和阴性筛选。阳性筛选指的是施加一定筛选压力，经过文库扰动后野生型细胞致死，仅有获得抗性的细胞存活，与阳性筛选不同，阳性筛选是通过比较不同筛选时间点的sgRNA丰度差，获得缺失的sgRNA，分析所对应的基因，从而找出可能影响细胞生长的候选基因。

MAGeCK MLE算法

除了传统的RRA算法，从0.5版本以后，MAGeCK提供了一个新的算法MLE，计算一个称为beta score的指标代表基因的重要程度，其实和转录组中差异分析中的log fold change指标类似，相比于RRA算法，这种方法有以下优势：

每个基因只会得到一个分数，而不是RRA中的两个分数：一个用于阳性选择，一个用于阴性选择。
它允许在多种条件(多组情况)下直接比较。
它包含各个sgRNA的干扰效率信息。

构建design matrix文件

这个design matrix文件表明那个样本被哪个条件所影响，通常是一个二进制矩阵，指示哪个样本（第一列指示）受到哪个条件的影响（由第一行表示）。

对于以上的数据对照组 ERR376998 (one replicate of plasmid)、处理组 ERR376999 (one replicate of ESC)，我们可以构建这样的矩阵designmat.txt

Samples baseline        plasmid ESC1
plasmid 1       0       0
ESC1    1       1      0

注意：

矩阵必须包含一个表头代表条件标签
第一列是样本的标签
第二列是必须是一个"baseline"列，所有值必须为1
表里的数字必须是0 或 1
必须设置一个初始状态的样本（比如day 0或者plasmid）

run the module

mageck mle -k escneg.count.txt -d designmat.txt -n yusa

执行成功后，会生成3列文件：log.file、 gene_summary file (包含 gene beta scores)、 sgrna_summary file (包含sgRNA efficiency probability predictions).,查看gene_summary.file：

gene_summary file

文件信息如下：

Column	Content
Gene	Gene ID
sgRNA	The number of targeting sgRNAs for each gene
plasmid\|beta;ESC1\|beta	The beta scores of this gene in conditions "plasmid" and "ESC1", respectively. The conditions are specified in the design matrix as an input of the mle subcommand.
plasmid\|p-value	The raw p-value (using permutation) of this gene
plasmid\|fdr	The false discovery rate of this gene
plasmid\|z	The z-score associated with Wald test
plasmid\|wald-p-value	The p value using Wald test
plasmid\|wald-fdr	The false discovery rate of the Wald test

其他参数

plot

我们可以使用plot命令对单独挑选出的基因绘图

usage: mageck plot [-h] -k COUNT_TABLE -g GENE_SUMMARY [--genes GENES]
                   [-s SAMPLES] [-n OUTPUT_PREFIX]
                   [--norm-method {none,median,total}] [--keep-tmp]

Parameter	Explanation
-k COUNT_TABLE, --count-table COUNT_TABLE	Provide a tab-separated count table.
-g GENE_SUMMARY, --gene-summary GENE_SUMMARY	The gene summary file generated by the test command.
-h, --help	show this help message and exit
--genes GENES	A list of genes to be plotted, separated by comma. Default: none.
-s SAMPLES, --samples SAMPLES	A list of samples to be plotted, separated by comma. Default: using all samples in the count table.
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX	The prefix of the output file(s). Default sample1.
--norm-method {none,median,total}	Method for normalization, default median.
--keep-tmp	Keep intermediate files.

## 举例子 单独绘制 GRHL2 BIK这两个基因
mageck plot -k escneg.count.txt -g esccp.gene_summary.txt --genes GRHL2,BIK

image-20220301151425139

可以看到plot命令将这两个基因每个sgRNA的变化，阳性和阴性筛选中按照P值和RRA score排序的图形绘制了除了，多个基因直接中间以，分割即可。

MAGECKGsea

MAGECKGSEA是使用C ++的基因设定富集分析（GSEA）的快速实现。与官方GSEA软件相比，主要优势是其易于使用和极快的运行速度。

作者这里也提供了gsea示例文件供我们练习：https://bitbucket.org/liulab/mageck/src/master/gsea/，其实只用到两个参数

 mageckGSEA -r demo1.txt -g kegg.ribosome.gmt  -c 1 -p 10000

Parameter	Explanation
-r rank_file, --rank_file rank_file	(required) Rank file. The first column of the rank file must be the gene name.
-g gmt_file, --gmt_file gmt_file	(required) The pathway annotation in GMT format.
-p perm_time, --perm_time perm_time	Permutations, default 1000.
-c score_column	The column for gene scores.

生成GSEA结果文件如下：

Pathway Size    ES  p   p_permutation   FDR Ranking Hits    LFC
KEGG_RIBOSOME   88  0.3262  0.00240772  0.0045  0.0045  0   32  0

Item	Explanation
Pathway	The name of the pathway
Size	The size of the pathway, i.e., the number of genes
ES	Enrichment Score (ES) in GSEA
p	The p value of ES
p_permutation	The permutation p value of ES (usually more accurate than p
FDR	False Discovery Rate of p_permutation
Ranking	The ranking of this pathway
Hits	The number of genes that are ranked before ES score. See "Leading Edge" analysis of GSEA
LFC	Log fold change (not implemented)

速度确实挺快，感觉可作为官方GSEA软件做基因集富集一种替代方式，只需要自己制作好GMT文件即可，不懂该格式的可以参考GSEA官网介绍：

Advanced tutorial

包含允许read mapping误差、纠正拷贝数变异影响、Docker iamge运行MAGeCK及更复杂设计实验条件下的操作，感兴趣看教程https://sourceforge.net/p/mageck/wiki/advanced_tutorial/

输入/出格式

更详细的参数介绍可直接看源文档

输入：https://sourceforge.net/p/mageck/wiki/input/

输出：https://sourceforge.net/p/mageck/wiki/output/

其它工具

除了MAGeCK, 该团队还开发了其他的软件与算法:

MAGeCK-VISPR: 制动CRISPR//Cas9筛查的开发的集质控，分析和可视化一体的workflow
MAGeCKFlute: 一个交互分析的pipelineR包
scMAGeCK: 一个结合单细胞转录组数据与Crispr筛查数据综合分析的计算模型
Vispr-online：在线版分析筛选数据集

后续有时间再来慢慢学习吧~~~

利用MAGeCK算法处理CRISPR Screen数据
上次文章结尾时候提到了MAGeCK RRA算法处理，这次我们就来学习一下，Model-based Analysis...
2021-08-25 基于Base editor的CRISPR
前面我们介绍了CRISPR Screen[https://www.jianshu.com/p/522b7688...
03 EM算法 - EM算法流程和直观案例
02 EM算法 - K-means算法回顾、EM概述利用EM算法处理聚类问题的步骤：样本数据x={x1,x2,...
【重构笔记-2】工作中遇到的问题总结
1. @media \0screen,screen\9利用 @media \0screen,screen\9{}可...
【机器学习】无监督学习 K-means(K-均值聚类算法）
任务利用K-均值聚类算法对未标注数据分组对聚类得到的簇进行后处理二分K-均值聚类算法思想聚类算法几乎可以...
运用Aggregator模式实现MapReduce
MapReduce是更好地利用并行计算资源来提升数据处理能力的重要算法，如今已被主流的大数据分析平台实现，成为了大...
CRISPR、CRISPR screen以及与单细胞测序的结合
CRISPR 首先来介绍一些CRISPR，它首先在大肠杆菌中发现，全称为：Clustered Regularly ...
一种室内地图解决方案
简介：本方案利用智能算法处理3D摄像头的点云数据生成GeoJSON，然后使用leaflet框架将GeoJSON渲染...
学习笔记--(移动数据挖掘引言)
移动数据挖掘的定义移动数据挖掘研究的是基于移动数据的数据挖掘算法。这些数据算法需要更多地利用移动数据特性，挖掘与...
hadoop处理数据模版
1 利用hql处理数据利用hql处理数据需要用到两个文件，handle.py和handle.sql文件，利用py...

利用MAGeCK算法处理CRISPR Screen数据

安装

Conda安装(推荐)

Docker安装

源码安装

MAGECK RRA算法

下载测试数据

count获取样本比对数据

test比较样本差异

MAGeCK MLE算法

构建design matrix文件

run the module

其他参数

plot

MAGECKGsea

Advanced tutorial

输入/出格式

其它工具

相关文章

利用MAGeCK算法处理CRISPR Screen数据

2021-08-25 基于Base editor的CRISPR

03 EM算法 - EM算法流程和直观案例

【重构笔记-2】工作中遇到的问题总结

【机器学习】无监督学习 K-means(K-均值聚类算法）

运用Aggregator模式实现MapReduce

CRISPR、CRISPR screen以及与单细胞测序的结合

一种室内地图解决方案

学习笔记--(移动数据挖掘引言)

hadoop处理数据模版

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读