作者按
本文记述了博主测试R包ExomeDepth的所有过程,mind如下,部分尚待补充。
image.png
概述
ExomeDepth是一款比较基础的检测panel以及全外数据的拷贝数变异的R包。
官方github为
https://github.com/vplagnol/ExomeDepth
上有官方manual,publication,保持更新,latest release is on 2020元旦。
原理
(待有时间补充)
通俗表达就是一批样品,不论control,依次做sample和control,挨个比对深度,计算方差是否显著。
安装
软件作者将源码释放在了R镜像
https://cran.r-project.org/web/packages/ExomeDepth/index.html
可在上述网站下载包、源码,或者直接install
install.packages("ExomeDepth")
博主在安装的时候报了非常多的问题,依次写出来,给后面的姐妹避雷。
- 首先,install的时候报错
Error in if (nzchar(SHLIB_LIBADD)) SHLIB_LIBADD else character() : argument is of length zero
Google百度知乎知乎发现是升级R导致的makeconf的问题,所以:
find ~ -name Makeconf
mv ~/miniconda2/lib/R/etc/Makeconf ~/miniconda2/lib/R/etc/Makeconf.backup
cp ~/miniconda2/pkgs/r-base-3.5.1-h1e0a451_2/lib/R/etc/Makeconf ~/miniconda2/lib/R/etc/Makeconf - 之后显示安装成功,但调用library(ExomeDepth)报错,显示
Error: package or namespace load failed for ‘ExomeDepth’ in library.dynam(lib, package, package.lib):
shared object ‘matrixStats.so’ not found
查找库
find ~ -name matrixStats.so
~/miniconda2/lib/R/library/matrixStats/libs/matrixStats.so
像上面步骤一复制后仍不成功
又一轮的Google必应百度知乎,不知道哪看到的需要重新安装install.packages("matrixStats")
终于安装成功
library("matrixStats")
library("ExomeDepth")
- 按照手册运行的时候发现又会报错
因为CallCNVs之前要ExomeDepth object was created.
即需要额外安包,以及导入数据库
如下
BiocManager::install("GenomicRanges")
help(GenomicRanges)
data(exons.hg19)
library(ExomeDepth)
library(GenomicRanges)
data(exons.hg19)
data(Conrad.hg19)
重要函数的作用
help(count.everted.reads)是计算每个bin的evertedreads的
help(countBam.everted)经常不会用到,是算高水平的上个countevertedreads的
help(countBamInGRanges.exomeDepth)解析bam文件并计算特定target的countdata
C_hmm 用C植入马尔可夫模型
ExomeCount数据集又不存在
过程:
1. create count data from bam files,样品bam,可以是过滤后的high quality,也可以是raw,在参数里过滤。
2. load dataset,外显子数据库。
3. build 最合适的reference set,要么loop【一批次多个样本依次循环做对照和分析样本】循环分析,要么一个分析样本,至少两个的对照bam。
4. CNV calling
5. ranking the CNV calls by confidence level,按可信度排序
6. better annotation of CNV calls,外显子注释和Conrad文献里的注释。
7. visual CNV
运行脚本
针对同一批次多个样本,采取loop比较方便快捷,使用参数samplelist,提交shell命令。
Rscript ~/TMP/20191105_ExomeDepth/191031.loop.R example.samplelist bam/
第一个参数是样本编号,第二个参数是bam所在路径
loop.R的脚本如下:
argv <- commandArgs(TRUE)
print(paste0("argv[1]=", argv[1]))
print(paste0("argv[2]=", argv[2]))
#print(paste0("argv[3]=", argv[3]))
##第一个参数是samplelist
##第二个参数是bam路径
##第三个参数是samplename
library(ExomeDepth)
library(GenomicRanges)
data(exons.hg19)
data(Conrad.hg19)
bed=read.table("panel_gene.bedfile")#panel或者外显子的捕获区域,格式是TAB键分割的chr,start,end.
names(bed) <- c("chromosome","start","end","name")
head(bed)
exons.hg19.GRanges <- GenomicRanges::GRanges(seqnames = exons.hg19$chromosome,IRanges::IRanges(start=exons.hg19$start,end=exons.hg19$end),names = exons.hg19$name)
###############多个样本要跑循环loop over################
#### get the annotation datasets to be used later data(Conrad.hg19)
### prepare the main matrix of read count data
samplelist <- read.table(argv[1])
allbam<-c()
for (i in samplelist$V1)
{
bam=paste(argv[2],'/',i,'.bam',sep='')#定义各个样品的路径,可以根据自己的目录结构调整
allbam<-c(allbam,bam)
}
allbam
class(allbam)
my.count <- getBamCounts(bed.file="panel_gene.bed",bam.files=c(allbam),min.mapq = 30,include.chr=FALSE, read.width = 300,referenceFasta = "~/database/hg19/Genome/ucsc.hg19.fasta")
my.count.dafr <- as(my.count,'data.frame')
my.count.dafr$chromosome<-gsub(as.character(my.count.dafr$space),pattern = 'chr',replacement = '')
head (my.count.dafr)
ExomeCount.mat <- as.matrix(my.count.dafr[,grep(names(my.count.dafr), pattern = '*.bam')])
nsamples <- ncol(ExomeCount.mat)
### start looping over each sample
for (i in 1:nsamples)
{ #### Create the aggregate reference set for this sample
my.choice <- select.reference.set (test.counts = ExomeCount.mat[,i],reference.counts = ExomeCount.mat[,-i], bin.length = (my.count.dafr$end - my.count.dafr$start)/1000, n.bins.reduced = 10000)
my.reference.selected <- apply(X = ExomeCount.mat[, my.choice$reference.choice, drop = FALSE], MAR = 1, FUN = sum)
message('Now creating the ExomeDepth object')
all.exons <- new('ExomeDepth', test = ExomeCount.mat[,i], reference = my.reference.selected, formula = 'cbind(test, reference) ~ 1')
################ Now call the CNVs
all.exons <- CallCNVs(x = all.exons, transition.probability = 10^-4, chromosome = my.count.dafr$chromosome, start = my.count.dafr$start, end = my.count.dafr$end, name = my.count.dafr$names)
########################### Now annotate the ExomeDepth object
all.exons <- AnnotateExtra(x = all.exons, reference.annotation = Conrad.hg19.common.CNVs, min.overlap = 0.5, column.name = 'Conrad.hg19')
all.exons <- AnnotateExtra(x = all.exons, reference.annotation = exons.hg19.GRanges, min.overlap = 0.0001, column.name = 'exons.hg19')
output.file <- paste('Exome_', colnames(ExomeCount.mat)[i], '.csv', sep = '')
write.csv(file = output.file, x = all.exons@CNV.calls, row.names = FALSE) }
结果解析
以下是某一次同panel的7个样本测序数据用上述脚本分析完之后的CNV结果:
example
"start.p","end.p","type","nexons","start","end","chromosome","id","BF","reads.expected","reads.observed","reads.ratio","Conrad.hg19","exons.hg19"
392,503,"duplication",112,156437435,181767972,"1","chr1:156437435-181767972",164,454724,584367,1.29,NA,"NES_4,DUSP27_6,F5_13"
953,960,"deletion",8,152437306,152446719,"2","chr2:152437306-152446719",8.68,3592,2692,0.749,"CNVR1026.1","NEB_104,NEB_103,NEB_102,NEB_101,NEB_100,NEB_99,NEB_98,NEB_97"
965,970,"deletion",6,152452569,152459210,"2","chr2:152452569-152459210",6.82,2297,1698,0.739,"CNVR1026.1","NEB_92,NEB_91,NEB_90,NEB_89,NEB_88,NEB_87"
2109,2109,"duplication",1,32546866,32546886,"6","chr6:32546866-32546886",4.93,426,704,1.65,"CNVR2845.21,CNVR2845.25,CNVR2845.24,CNVR2845.5,CNVR2845.4","HLA-DRB1_6"
2342,2349,"duplication",8,50358653,50468327,"7","chr7:50358653-50468327",8.76,33140,41112,1.24,NA,"IKZF1_2,IKZF1_3,IKZF1_4,IKZF1_5,IKZF1_6,IKZF1_7"
2351,2402,"duplication",52,55209974,82595808,"7","chr7:55209974-82595808",44.1,298065,356315,1.2,NA,"PCLO_5"
2406,2432,"duplication",27,87133557,87229505,"7","chr7:87133557-87229505",15.1,73559,86300,1.17,NA,"ABCB1_29,ABCB1_28,ABCB1_27,ABCB1_26,ABCB1_25,ABCB1_24,ABCB1_23,ABCB1_22,ABCB1_21,ABCB1_20,ABCB1_19,ABCB1_18,ABCB1_17,ABCB1_16,ABCB1_15,ABCB1_14,ABCB1_13,ABCB1_12,ABCB1_11,ABCB1_10,ABCB1_9,ABCB1_8,ABCB1_7,ABCB1_6,ABCB1_5,ABCB1_4,ABCB1_3"
2489,2549,"duplication",61,103118831,103417079,"7","chr7:103118831-103417079",35.4,178421,209001,1.17,NA,"RELN_63,RELN_62,RELN_61,RELN_60,RELN_59,RELN_58,RELN_57,RELN_56,RELN_55,RELN_54,RELN_53,RELN_52,RELN_51,RELN_50,RELN_49,RELN_48,RELN_47,RELN_46,RELN_45,RELN_44,RELN_43,RELN_42,RELN_41,RELN_40,RELN_39,RELN_38,RELN_37,RELN_36,RELN_35,RELN_34,RELN_33,RELN_32,RELN_31,RELN_30,RELN_29,RELN_28,RELN_27,RELN_26,RELN_25,RELN_24,RELN_23,RELN_22,RELN_21,RELN_20,RELN_19,RELN_18,RELN_17,RELN_16,RELN_15,RELN_14,RELN_13,RELN_12,RELN_11,RELN_10,RELN_9,RELN_8,RELN_7,RELN_6,RELN_5,RELN_4"
2554,2598,"duplication",45,104702606,116436180,"7","chr7:104702606-116436180",25.7,159578,187795,1.18,NA,"MLL5_27,PIK3CG_2,GPR22_3,LRRN3_4,PPP1R3A_4,MET_2"
2676,2692,"duplication",17,148504736,148526945,"7","chr7:148504736-148526945",11.2,34871,41239,1.18,NA,"EZH2_20,EZH2_19,EZH2_18,EZH2_17,EZH2_16,EZH2_15,EZH2_14,EZH2_13,EZH2_12,EZH2_11,EZH2_10,EZH2_9,EZH2_8,EZH2_7,EZH2_6,EZH2_5"
2696,2746,"duplication",51,151833912,151949805,"7","chr7:151833912-151949805",27.3,213235,249835,1.17,NA,"MLL3_59,MLL3_58,MLL3_57,MLL3_56,MLL3_55,MLL3_54,MLL3_53,MLL3_52,MLL3_51,MLL3_50,MLL3_49,MLL3_48,MLL3_47,MLL3_46,MLL3_45,MLL3_44,MLL3_43,MLL3_42,MLL3_41,MLL3_40,MLL3_39,MLL3_38,MLL3_37,MLL3_36,MLL3_35,MLL3_34,MLL3_33,MLL3_32,MLL3_31,MLL3_30,MLL3_29,MLL3_28,MLL3_27,MLL3_26,MLL3_25,MLL3_24,MLL3_23,MLL3_22,MLL3_21,MLL3_20,MLL3_19,MLL3_18,MLL3_17,MLL3_16,MLL3_15,MLL3_14,MLL3_13,MLL3_12,MLL3_11,MLL3_10"
3008,3030,"duplication",23,5021983,5126793,"9","chr9:5021983-5126793",12,39011,45476,1.17,NA,"JAK2_3,JAK2_4,JAK2_5,JAK2_6,JAK2_7,JAK2_8,JAK2_9,JAK2_10,JAK2_11,JAK2_12,JAK2_13,JAK2_14,AL161450.1_3,JAK2_15,AL161450.1_2,JAK2_16,JAK2_17,JAK2_18,JAK2_19,AL161450.1_1,JAK2_20,JAK2_21,JAK2_22,JAK2_23,JAK2_24,JAK2_25"
3043,3050,"deletion",8,21968206,22008957,"9","chr9:21968206-22008957",106,9834,1669,0.17,NA,"CDKN2A_3,CDKN2A_2,CDKN2A_1,CDKN2B_2,CDKN2B_1"
3065,3075,"duplication",11,36840524,37034033,"9","chr9:36840524-37034033",10,31564,38071,1.21,NA,"PAX5_10,PAX5_9,PAX5_8,PAX5_7,PAX5_6,PAX5_5,PAX5_4,PAX5_3,PAX5_2,PAX5_1"
3084,3155,"duplication",72,87285659,98279107,"9","chr9:87285659-98279107",57.3,187404,225332,1.2,NA,"DAPK1_26,SPATA31E1_4,S1PR3_1,NFIL3_2,ROR2_9,NOL8_7,ZNF484_4,PTPDC1_6,ZNF169_5"
4050,4071,"duplication",22,73333931,73355975,"13","chr13:73333931-73355975",16.4,44887,53525,1.19,NA,"DIS3_21,DIS3_20,DIS3_19,DIS3_18,DIS3_17,DIS3_16,DIS3_15,DIS3_14,DIS3_13,DIS3_12,DIS3_11,DIS3_10,DIS3_9,DIS3_8,DIS3_7,DIS3_6,DIS3_5,DIS3_4,DIS3_3,DIS3_2,DIS3_1"
4107,4150,"duplication",44,96752090,96852023,"14","chr14:96752090-96852023",27.5,95803,112814,1.18,NA,"ATG2B_42,ATG2B_41,ATG2B_40,ATG2B_39,ATG2B_38,ATG2B_37,ATG2B_36,ATG2B_35,ATG2B_34,ATG2B_33,ATG2B_32,ATG2B_31,ATG2B_30,ATG2B_29,ATG2B_28,ATG2B_27,ATG2B_26,ATG2B_25,ATG2B_24,ATG2B_23,ATG2B_22,ATG2B_21,ATG2B_20,ATG2B_19,ATG2B_18,ATG2B_17,ATG2B_16,ATG2B_15,ATG2B_14,ATG2B_13,ATG2B_12,ATG2B_11,ATG2B_10,ATG2B_9,ATG2B_8,ATG2B_7,ATG2B_6,ATG2B_5,ATG2B_4,ATG2B_3,ATG2B_2,ATG2B_1,GSKIP_3,GSKIP_4"
4165,4178,"duplication",14,105235921,105258985,"14","chr14:105235921-105258985",14.3,36445,44669,1.23,NA,"AKT1_13,AKT1_12,AKT1_11,AKT1_10,AKT1_9,AKT1_8,AKT1_7,AKT1_6,AKT1_5,AKT1_4,AKT1_3,AKT1_2,AKT1_1"
4203,4205,"deletion",3,45003740,45008542,"15","chr15:45003740-45008542",6.75,4891,3272,0.669,NA,"B2M_1,B2M_2,B2M_3"
列信息:
i. start.p编码蛋白起始点
ii. end.p 编码蛋白终止点
iii. type 二选一,duplication和deletion
iv. nexons CNV跨越的外显子个数
v. start 染色体绝对位置起始点
vi. end 染色体绝对位置终止点
vii. chromosome 染色体
viii. id 位置信息格式化
ix. BF Bayes factor贝叶斯因子,是log10标准化的tumor-normal深度比。算法核心统计值,越高代表该cnv片段越可信。官方建议过滤方法不是取threshold,而是解读前按该值排序,从高到低取可信结果。
x. reads.expected 同批次除分析样本外所有样本标准化后的期望reads
xi. reads.observed 该样本的reads数
xii. reads.ratio 可简单理解为reads.observed/reads.expected
xiii. Conrad.hg19 CNV数据库注释结果,参考文献为Conrad et al. 2010,该数据库对11700个CNV进行了验证,联合SNP做了GWAS,有多个明确致病位点的注释,可作参考。
xiv. exons.hg19 按外显子注释,eg.SF3B1_25表示SF3B1基因的25号外显子
过滤标准
过滤参考信息,重要性递减
i. BF,越高越可信
ii. readsratio,type为duplication的比1大的越多越可信,type为deletion的比1小的越多越可信。一般与BF趋向性吻合。
iii. 数据库+本地库+NA12878结果
iv. 跨越外显子数,越多越不可信
可视化结果
(待补充)
参考文献
a. 软件
Plagnol V, Curtis J, Epstein M, et al. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling[J]. Bioinformatics, 2012, 28(21): 2747-2754.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3476336/
b. 数据库
Conrad D F, Pinto D, Redon R, et al. Origins and functional impact of copy number variation in the human genome[J]. Nature, 2010, 464(7289): 704-712.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3330748/
NOTE:
写博文时回溯发现官方网址并没有指导script,我也忘记我是从哪下载的了,源文件我有保存,需要的可以留言,我线下发送。
网友评论