The read-length distribution in the FastQC report looked odd, so I wanted to check whether the problem lies in the raw Fastq files themselves or in FastQC's processing. To do that, I took the first 100,000 reads of each file, counted the occurrences of each read length, and plotted them.
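As background for the sampling step (the demo file and reads below are fabricated for illustration): a Fastq record is exactly 4 lines (header, sequence, `+`, quality), so the first 100,000 reads are the first 400,000 lines, and `zcat ... | head` grabs them without decompressing the whole file:

```shell
# Tiny demo: two fabricated reads, gzipped.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTA\n+\nIIIII\n' | gzip > demo.fastq.gz

# First N reads = first 4*N lines; head closes the pipe early, so
# zcat stops decompressing once enough lines have been produced.
zcat demo.fastq.gz | head -4    # prints the 4 lines of the first read
```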
Fastq files
The R script lives in the QC folder.
R script
setwd("/sibcb2/bioinformatics2/wangjiahao/LungCGI/Fastq/QC")

# Anchor the pattern: a bare ".gz" is a regex matching any character
# followed by "gz", not just the .gz extension.
Ns = list.files("..", "\\.gz$", full.names = TRUE)

pdf(file = "Reads_length_counts.pdf")
for (i in seq_along(Ns)) {
  fastq = Ns[i]
  cat("Processing", fastq, "\n")
  # 100,000 reads = 400,000 lines, since each Fastq record is 4 lines;
  # "head -100000" would keep only the first 25,000 reads, and indexing
  # past the end of the file below would then yield NAs.
  system(paste("zcat", fastq, "| head -400000 > temp"))
  # Sequence lines are every 4th line starting at line 2.
  lines = readLines("temp")
  seqs = lines[seq(2, length(lines), 4)]
  counts = table(nchar(seqs))
  plot(counts, col = "red", type = "l",
       main = sub("\\.(fastq|fq)\\.gz$", "", basename(fastq)),
       xlab = "Length", ylab = "Counts")
  file.remove("temp")
}
dev.off()
Run
Result
At first it didn't occur to me to combine the Linux zcat and head commands; I simply read each entire Fastq file with readLines, which was obviously far too slow. After switching to the piped approach the script runs dozens of times faster.
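In fact, the tally can be done entirely in the shell without R. This is a minimal sketch (demo.fastq and its reads are made up for illustration; for a real file you would stream `zcat sample.fastq.gz | head -400000` into the same awk command):

```shell
# Build a tiny 3-read demo Fastq (placeholder data: two 4 bp reads, one 5 bp).
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTA\n+\nIIIII\n@r3\nACGT\n+\nIIII\n' > demo.fastq

# Sequence lines are every 4th line starting at line 2 (NR % 4 == 2);
# tally their lengths while streaming, then print "length count" pairs.
awk 'NR % 4 == 2 { len[length($0)]++ }
     END { for (l in len) print l, len[l] }' demo.fastq | sort -n
# prints: 4 2
#         5 1
```

Since awk streams line by line, this never holds more than one read in memory, which is the same reason the zcat | head pipeline beats reading the whole file into R.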
Sure enough, the odd length distribution comes from the reads in the Fastq files themselves, not from FastQC.