- Installation and loading
- Usage
- Testing on SNP data
Installation and loading
if(!require("absCNseq")){
devtools::install_github("ShixiangWang/absCNseq")
}
Load the package:
library(absCNseq)
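Usage
The absCNseq package does not ship a manual, but it does provide some example data, so below I walk through how to use it and check how well it works.
The main function is run.absCNSeq(). It takes quite a few parameters; see the function documentation for the details, which the author has written up clearly.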
args(run.absCNSeq)
#> function (seg.fn, snv.fn = NULL, res.dir, smp.name, seq.type = c("WES",
#> "WGS"), alpha.min = 0.2, alpha.max = 1, tau.min = 1.5, tau.max = 5,
#> min.sol.freq = 0, min.seg.len = 0, qmax = 7, lamda = 0.5,
#> verbose = FALSE)
#> NULL
Run the computation and draw the plot directly on the example data provided with the package.
example_cn = "Data/example.cn.txt"
example_snv = "Data/example.snv.txt"
my.res.list <- run.absCNSeq(example_cn, example_snv, "myResult", "Sample1", seq.type="WES", min.seg.len=200)
i = 1
seg.CN <- compute.absCN(my.res.list$seg.dat, my.res.list$searchRes[i,"alpha"], my.res.list$searchRes[i,"tau"]) # the i-th solution
plot.absCN(seg.CN, chromnum=4)
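To make the role of alpha (tumor purity) and tau (tumor ploidy) in the call above more concrete, here is a minimal sketch of the commonly used purity/ploidy mixture model, r = (alpha*q + 2*(1-alpha)) / (alpha*tau + 2*(1-alpha)), solved for the absolute copy number q. That compute.absCN follows exactly this model is an assumption on my part, not something checked against the package source, and abs_cn_from_ratio is a hypothetical helper that is not part of absCNseq.

```r
# Hypothetical helper (not part of absCNseq): invert the mixture model
#   r = (alpha * q + 2 * (1 - alpha)) / (alpha * tau + 2 * (1 - alpha))
# to recover the tumor copy number q from an observed copy ratio r.
abs_cn_from_ratio <- function(r, alpha, tau) {
  stopifnot(alpha > 0, alpha <= 1, tau > 0)
  # Average ploidy of the tumor/normal mixture (normal cells assumed diploid)
  avg_ploidy <- alpha * tau + 2 * (1 - alpha)
  (r * avg_ploidy - 2 * (1 - alpha)) / alpha
}

# Pure, diploid tumor: ratios 0.5, 1 and 1.5 map to copy numbers 1, 2 and 3
q_pure <- abs_cn_from_ratio(c(0.5, 1, 1.5), alpha = 1, tau = 2)

# 50% purity: the same one-copy gain only shifts the ratio to 1.25
q_mixed <- abs_cn_from_ratio(1.25, alpha = 0.5, tau = 2)
```

The values returned this way are generally not integers on real data, which is why a rounding step and a per-solution fit score are needed to choose among candidate (alpha, tau) pairs.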
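The author does not ship the description of the input files inside the package, so I downloaded it and copied the content here so that I don't forget it.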
1. Input copy ratio file ("example.cn.txt")
The input copy ratio file contains segmented copy ratio data, which can be generated by popular segmentation algorithms like
the "DNAcopy" R package.
It should be a tab-delimited text file with five columns.
Please use the EXACT header names as below.
1) "chrom"
The chromosome number of a segment. Must be an integer from 1 to 22.
2) "loc.start"
The start position of a segment.
3) "loc.end"
The end position of a segment.
4) "eff.seg.len"
For exome sequencing, due to the nature of highly uneven coverage (zero coverage for introns), this column gives the number of base pairs with actual observed coverage.
It can be derived by concatenating all the VARSCAN bins within a segment.
Note that this length is usually much smaller than the length of the segment, which is loc.end - loc.start + 1.
5) "normalized.ratio"
The mean copy ratio (tumor DNA vs. germline DNA) of a segment.
Note that the copy ratio should be normalized to eliminate any sequencing throughput difference between tumor and germline DNA.
For example, samtools can be used to count total reads that were properly paired/aligned for tumor and germline DNA.
The difference then needs to be adjusted accordingly.
2. Input SNV file ("example.snv.txt")
The input SNV file contains allele frequency data for somatic mutations.
It should be a tab-delimited text file with three columns.
Please use the EXACT header names as below.
1) "chrom"
The chromosome number of a somatic SNV. Must be an integer from 1 to 22.
2) "position"
The genomic position of a somatic SNV.
3) "tumor_var_freq"
The proportion of reads supporting the somatic SNV allele. Must be a fraction.
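Because the header names have to match exactly, it can help to check a table before writing it to disk. The following is a minimal sketch of such a check in base R. check_cn_table and check_snv_table are hypothetical helpers, not functions from absCNseq, and reading "fraction" as a value between 0 and 1 is my assumption.

```r
# Hypothetical helpers (not part of absCNseq) that check a data frame
# against the input format described above before it is written out.
check_cn_table <- function(cn) {
  required <- c("chrom", "loc.start", "loc.end", "eff.seg.len", "normalized.ratio")
  missing_cols <- setdiff(required, colnames(cn))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (!all(cn$chrom %in% 1:22)) stop("chrom must be an integer from 1 to 22")
  if (any(cn$loc.end < cn$loc.start)) stop("loc.end must not be smaller than loc.start")
  invisible(TRUE)
}

check_snv_table <- function(snv) {
  required <- c("chrom", "position", "tumor_var_freq")
  missing_cols <- setdiff(required, colnames(snv))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (!all(snv$chrom %in% 1:22)) stop("chrom must be an integer from 1 to 22")
  # Assumption: "fraction" means a value in [0, 1], not a percentage
  if (any(snv$tumor_var_freq < 0 | snv$tumor_var_freq > 1)) {
    stop("tumor_var_freq must lie between 0 and 1")
  }
  invisible(TRUE)
}

# Toy tables, values made up for illustration only
good_cn <- data.frame(chrom = c(1, 2), loc.start = c(100, 500), loc.end = c(900, 2500),
                      eff.seg.len = c(300, 800), normalized.ratio = c(1.0, 1.4))
bad_snv <- data.frame(chrom = c(1, "X"), position = c(1234, 5678),
                      tumor_var_freq = c(0.31, 0.48))

cn_ok <- check_cn_table(good_cn)
snv_check <- try(check_snv_table(bad_snv), silent = TRUE)  # fails: chromosome X
```

Calling the check right before the table is written makes a renamed column or a sex chromosome show up as a readable error instead of a failure inside the package.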
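Testing on SNP data
Since the input format is close to that of SNP array data, can we use the package directly on SNP data? Below I try it on an example dataset provided by ABSOLUTE.
First, read in the TCGA SNP data and convert it to the format absCNseq requires.
Read the files: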
snp_seg = readr::read_tsv("Data/SNP6_solid_tumor.seg.txt")
snp_maf = readr::read_tsv("Data/solid_tumor.maf.txt")
Process the CNV file:
snp_seg = snp_seg[,-1]
seg_header = c("chrom", "loc.start", "loc.end", "eff.seg.len","normalized.ratio")
colnames(snp_seg) = seg_header
snp_seg$normalized.ratio = 2^snp_seg$normalized.ratio
snp_seg$eff.seg.len = snp_seg$loc.end - snp_seg$loc.start + 1
Process the MAF file:
table(snp_maf$Variant_Type)
#>
#> DEL INS SNP
#> 16 11 385
snp_maf = dplyr::filter(snp_maf, Variant_Type == "SNP")
snp_snv = tibble::tibble(
chrom = snp_maf$Chromosome,
position = snp_maf$Start_position,
tumor_var_freq = snp_maf$t_alt_count / (snp_maf$t_alt_count + snp_maf$t_ref_count)
)
Write the files:
readr::write_tsv(snp_seg, "Data/test_snp.cn.txt")
readr::write_tsv(snp_snv, "Data/test_snp.snv.txt")
Test on the SNP files:
library(absCNseq)
test_cn = "Data/test_snp.cn.txt"
test_snv = "Data/test_snp.snv.txt"
my.res.list2 <- run.absCNSeq(test_cn, test_snv, "myResult", "Test1", seq.type="WGS")
i = 1
seg.CN2 <- compute.absCN(my.res.list2$seg.dat, my.res.list2$searchRes[i,"alpha"], my.res.list2$searchRes[i,"tau"]) # the i-th solution
plot.absCN(seg.CN2, chromnum=1)
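There is a problem here: we treat the full length of every segment as its effective length, whereas the original algorithm uses this length to filter out noise. My tentative idea is to compute the number of probes per kb for each segment and drop the segments where that number is too small.
This idea is not fully worked out yet. First, let's see how consistent this SNP-based result is with the official ABSOLUTE result.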
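The probe-density idea can be sketched as follows. This is only an illustration under assumptions: it uses the column names of the SNP6 segment file (Start, End, Num_Probes), so it would have to run before those columns are renamed and overwritten in the conversion step above. The threshold of 0.05 probes per kb is an arbitrary placeholder, not a validated cutoff, and filter_by_probe_density is a hypothetical helper.

```r
# Hypothetical helper: keep only segments with enough probes per kb.
# min_probes_per_kb is an arbitrary placeholder threshold.
filter_by_probe_density <- function(seg, min_probes_per_kb = 0.05) {
  seg_len_kb <- (seg$End - seg$Start + 1) / 1000
  seg$probes_per_kb <- seg$Num_Probes / seg_len_kb
  seg[seg$probes_per_kb >= min_probes_per_kb, , drop = FALSE]
}

# Toy segment table, values made up for illustration only
toy_seg <- data.frame(
  Chromosome   = c(1, 1, 2),
  Start        = c(1000, 500000, 2000),
  End          = c(400000, 600000, 900000),
  Num_Probes   = c(250, 1, 600),
  Segment_Mean = c(0.1, -1.2, 0.4)
)

# The second segment has a single probe over about 100 kb and is dropped
kept_seg <- filter_by_probe_density(toy_seg)
```

A density cutoff is used here instead of a plain probe count so that long, well-covered segments and short, sparsely covered ones are judged on the same scale.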
Load that data:
load("~/Downloads/ABSOLUTE exampledata/1131204/solid_tumor.ABSOLUTE.RData")
The output that RunAbsolute produces in ABSOLUTE is hard to read and needs to be summarized. Let's see how that works.
ABSOLUTE::CreateReviewObject(obj.name = "test1_summary",
absolute.files = "~/Downloads/ABSOLUTE exampledata/1131204/solid_tumor.ABSOLUTE.RData",
indv.results.dir = "myResult/test1_sm/sm",
copy_num_type = "total", verbose = TRUE)
Extract the results:
ABSOLUTE::ExtractReviewedResults(reviewed.pp.calls.fn = "myResult/test1_sm/sm/test1_summary.PP-calls_tab.txt",
analyst.id = "wsx",
modes.fn = "myResult/test1_sm/sm/test1_summary.PP-modes.data.RData",
out.dir.base = "myResult/test1_sm/sm/", obj.name = "absolute",
copy_num_type = "total", verbose = TRUE
)
The result agrees with the official one, but the output does not, which is very strange! Never mind, I will leave it at that.
library(ABSOLUTE)
RunAbsolute(seg.dat.fn = "Data/SNP6_solid_tumor.seg.txt",
            sigma.p = 0, max.sigma.h = 0.2,
            min.ploidy = 0.5, max.ploidy = 8,
            primary.disease = "BLCA", platform = "SNP_6.0",
            sample.name = "test_abs", results.dir = "myResult/test_abs",
            max.as.seg.count = 1500, max.non.clonal = 1, max.neg.genome = 0.005,
            copy_num_type = "total",
            maf.fn = "Data/solid_tumor.maf.txt", min.mut.af = 0.1,
            output.fn.base = "aabb", verbose = TRUE)
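A function for plotting a segment file (from https://www.biostars.org/p/152215/).
plot_segments = function(segments_file){
  # Read the tab-delimited segment file (expects Chromosome, Start, End, Segment_Mean columns)
  seg = read.delim(segments_file, sep="\t")
  # Split into one data frame per chromosome
  seg.spl = split(seg, as.factor(as.character(seg$Chromosome)))
  # Write all panels to <segments_file>.pdf, four chromosomes per page
  pdf(file=paste(segments_file,"pdf",sep="."), paper="special", width=12, onefile=TRUE, pointsize=8)
  par(mfrow=c(4,1))
  for(i in seq_along(seg.spl)){
    x = seg.spl[[i]]
    # Empty frame spanning the first segment start to the last segment end
    plot(x$Start, x$Segment_Mean, xlim = c(x$Start[1], x$End[nrow(x)]), pch = "", ylim = c(-3,3), xlab = paste("chr",names(seg.spl[i]),sep="_"), ylab = "log2 ratio")
    points(x$End, x$Segment_Mean, pch = "")
    # Draw each segment as a horizontal line at its mean log2 ratio
    segments(x0 = x$Start, y0 = x$Segment_Mean, x1 = x$End, y1 = x$Segment_Mean, lwd = 2, col = "maroon")
    abline(h = 0, lty = 1, lwd = 0.5)
  }
  dev.off()
}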