作者按
本文记述了博主测试R包QDNAseq的所有过程,mind如下,部分尚待补充。
概述
QDNAseq是一款检测低深度WGS测序数据的拷贝数变异的R包,一般仅需2X6G数据便可做到0.1M精度的定位。
官方网址为
该R包释放在bioconductor
http://www.bioconductor.org/packages/release/bioc/html/QDNAseq.html
上有官方manual,publication,保持更新,指导手册见下图,下载后可查看示例脚本。
image.png
原理
(待有时间补充)
设定step,window计算reads深度,做方差分析,假设检验,取显著差异的window,最后合并相邻拷贝数一致的window合成region。
安装
按照bioconductor的方式直接install
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("QDNAseq")
博主在安装的时候非常顺利,没有报错,可能是测试过太多拷贝数检测的R包,相关依赖包早都装好的原因。检测是否正确安装:
library(QDNAseq)
运行脚本
不需要对照,只用sample的bam文件读取深度,计算。
library(QDNAseq)
bins <- getBinAnnotations(binSize=15)#获取每一个bin的基因注释,联网下载
bins
readCounts <- binReadCounts(bins,bamfile='input/sample.HQ.bam')
readCounts
pdf("readCounts.pdf")
plot(readCounts, logTransform=FALSE, ylim=c(-50, 200))
highlightFilters(readCounts, logTransform=FALSE,residual=TRUE, blacklist=TRUE)
readCountsFiltered <- applyFilters(readCounts,residual=TRUE, blacklist=TRUE)
pdf("mapping.pdf")
isobarPlot(readCountsFiltered)
readCountsFiltered <- estimateCorrection(readCountsFiltered)
pdf("sd.pdf")
noisePlot(readCountsFiltered)
copyNumbers <- correctBins(readCountsFiltered)
copyNumbers
copyNumbersNormalized <- normalizeBins(copyNumbers)
copyNumbersSmooth <- smoothOutlierBins(copyNumbersNormalized)
pdf("cnr.pdf")
plot(copyNumbersSmooth)
exportBins(copyNumbersSmooth, file="sample.txt")
exportBins(copyNumbersSmooth, file="sample.igv", format="igv")
exportBins(copyNumbersSmooth, file="sample.bed", format="bed")
######1.3Downstream analyses#########
copyNumbersSegmented <- segmentBins(copyNumbersSmooth, transformFun="sqrt")
copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)
pdf("cns.pdf")
plot(copyNumbersSegmented)
copyNumbersCalled <- callBins(copyNumbersSegmented)
pdf("cnc.pdf")
plot(copyNumbersCalled)
exportBins(copyNumbersCalled, format="vcf")
exportBins(copyNumbersCalled, format="seg")
##########CGHregions##############
cgh <- makeCgh(copyNumbersCalled)
cgh
dev.off()
结果解析
分析完之后会生成以下文件:
文本文件五个
- sample.txt,列信息依次是片段的染色体位置,染色体,start,end,该step的平均深度
feature chromosome start end sample.HQ
1:840001-855000 1 840001 855000 0.336
-
sample.igv,列信息依次是chromosome,start,end,片段坐标,平均深度,可以导入igv软件看深度,如图
image.png
#type=COPY_NUMBER
#track coords=1
chromosome start end feature sample.HQ
1 840001 855000 1:840001-855000 0.336
- sample.bed,列信息依次是chromosome,start,end,片段坐标,+(没明白什么意思)
track name="sample" description="copynumber"
1 840000 855000 1:840001-855000 0.336 +
- sample.HQ.vcf,vcf格式的SV结果
##fileformat=VCFv4.2
##source=QDNAseq-1.14.0
##REF=<ID=DIP,Description="CNV call">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##FILTER=<ID=LOWQ,Description="Filtered due to call in low quality region">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of variant: DEL,DUP,INS">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of variant">
##INFO=<ID=BINS,Number=1,Type=Integer,Description="Number of bins in call">
##INFO=<ID=SCORE,Number=1,Type=Integer,Description="Score of calling algorithm">
##INFO=<ID=LOG2CNT,Number=1,Type=Float,Description="Log 2 count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample.HQ
4 69180001 . <DIP> <DEL> 1000 PASS SVTYPE=DEL;END=69855000;SVLEN=675000;BINS=32;SCORE=-2;LOG2CNT=-11 GT 1/1
- sample.HQ.seg,txt格式的SV结果,拷贝数为2*power(2,LOG2_RATIO_MEAN)
SAMPLE_NAME CHROMOSOME START STOP DATAPOINTS LOG2_RATIO_MEAN
sample.HQ 4 69180001 69855000 32 -11
pdf文件6个
1.readCounts.pdf,表示基因组的readcount分布
image.png
2.sd.pdf,体现测序质量,表示每个bin深度与方差的线性关系
image.png
3.mapping.pdf,也是测序质量的指标
image.png
4.cnr.pdf,原始结果,将拷贝数转化为log2Ratio
image.png
5.cns.pdf,中间结果,合并seg
image.png
6.cnc.pdf,最终结果,建设检验后,标注各个SV类型,红色是del,蓝色是dup
image.png
过滤标准
(待定)
可视化结果
(待补充)
参考文献
Scheinin I, Sie D, Bengtsson H, van de Wiel MA, Olshen AB, van Thuijl HF, van Essen HF, Eijk PP, Rustenburg F, Meijer GA, Reijneveld JC, Wesseling P, Pinkel D, Albertson DG, Ylstra B (2014). “DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly.” Genome Research, 24, 2022–2032.
网友评论