首先得到每条序列的长度,在这里使用seqkit软件。
seqkit软件是一个强大的序列处理工具,安装方法参见官方网站.
代码如下:
seqkit fx2tab -j 30 -l -n -i -H file.fastq.gz > Length.txt
#参数:
# -j 是线程数
#-B, --base-content value 要输出的碱基含量e.g. -B AT -B N
#-g, --gc print GC content
#-l, --length print sequence length
#-n, --name only print names
#-i, --only-id print ID instead of full head
head Length.txt # 查看Length.txt
结果如下所示:
Length.txt
提取长度分布:
seqkit fx2tab -j 30 -l -n -i -H file.fastq.gz |cut -f 2 > length.txt
将得到结果length.txt导入到R中绘制序列长度分布图,
代码如下:
library(tidyverse)
length <- read_tsv("Length.txt") %>% group_by(length) %>% summarise(Count = n())
sum <- sum(length$Count)
ggplot(length) +
geom_col(aes(length, Count), width = 0.8) +
geom_line(aes(length, Count), group = 1) +
geom_point(aes(length, Count)) +
scale_y_continuous(sec.axis = sec_axis(~.*100/sum, name = "% Relative Abundance")) +
xlab("R1.Length") +
theme_bw() + theme(panel.grid = element_blank(),
axis.title = element_text(size = 15))
ggsave("Length.png", height = 5, width = 8)
ggsave("Length.pdf", height = 5, width = 8)
结果如下:
图1上图由于数据太多,去掉点和连线更清晰
library(tidyverse)
length <- read_tsv("Length.txt") %>% group_by(length) %>% summarise(Count = n())
sum <- sum(length$Count)
ggplot(length) +
geom_col(aes(length, Count), width = 0.8) +
# geom_line(aes(length, Count), group = 1) +
# geom_point(aes(length, Count)) +
scale_y_continuous(sec.axis = sec_axis(~.*100/sum, name = "% Relative Abundance")) +
xlab("Length") +
theme_bw() + theme(panel.grid = element_blank(),
axis.title = element_text(size = 15))
ggsave("Length.png", height = 5, width = 8)
ggsave("Length.pdf", height = 5, width = 8)
结果如下:
图2---------------------------------------------------------------------------------------------------------------------------------------------------I`m a line ! Thanks !-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
网友评论