美文网首页
统计fasta/fastq的序列长度及其分布展示

统计fasta/fastq的序列长度及其分布展示

作者: 笺牒九州的怪咖 | 来源:发表于2022-03-24 14:57 被阅读0次

    首先得到每条序列的长度,在这里使用seqkit软件。
    seqkit软件是一个强大的序列处理工具,安装方法参见官方网站.
    代码如下:

    seqkit fx2tab -j 30 -l  -n -i -H file.fastq.gz  > Length.txt
    #参数:
    # -j 是线程数
    #-B, --base-content value   要输出的碱基含量e.g. -B AT -B N
    #-g, --gc                   print GC content
    #-l, --length               print sequence length
    #-n, --name                only print names
    #-i, --only-id              print ID instead of full head
    head  Length.txt  # 查看Length.txt
    

    结果如下所示:


    Length.txt

    提取长度分布:

     seqkit fx2tab -j 30 -l  -n -i -H file.fastq.gz  |cut -f 2 > length.txt
    

    将得到结果length.txt导入到R中绘制序列长度分布图,
    代码如下:

    library(tidyverse)
    
    length <- read_tsv("Length.txt") %>% group_by(length) %>% summarise(Count = n())
    sum <- sum(length$Count)
    ggplot(length) + 
      geom_col(aes(length, Count), width = 0.8) + 
      geom_line(aes(length, Count), group = 1) + 
      geom_point(aes(length, Count)) + 
      scale_y_continuous(sec.axis = sec_axis(~.*100/sum, name = "% Relative Abundance")) + 
      xlab("R1.Length") +
      theme_bw() + theme(panel.grid = element_blank(), 
                         axis.title = element_text(size = 15))
    
    ggsave("Length.png", height = 5, width = 8)
    ggsave("Length.pdf", height = 5, width = 8)
    
    

    结果如下:

    图1

    上图由于数据太多,去掉点和连线更清晰

    library(tidyverse)
    
    length <- read_tsv("Length.txt") %>% group_by(length) %>% summarise(Count = n())
    sum <- sum(length$Count)
    ggplot(length) + 
      geom_col(aes(length, Count), width = 0.8) + 
    #  geom_line(aes(length, Count), group = 1) + 
    #  geom_point(aes(length, Count)) + 
      scale_y_continuous(sec.axis = sec_axis(~.*100/sum, name = "% Relative Abundance")) + 
      xlab("Length") +
      theme_bw() + theme(panel.grid = element_blank(), 
                         axis.title = element_text(size = 15))
    
    ggsave("Length.png", height = 5, width = 8)
    ggsave("Length.pdf", height = 5, width = 8)
    
    

    结果如下:

    图2

    ---------------------------------------------------------------------------------------------------------------------------------------------------I`m a line ! Thanks !-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    参考:https://www.jianshu.com/p/31244fb42da1

    相关文章

      网友评论

          本文标题:统计fasta/fastq的序列长度及其分布展示

          本文链接:https://www.haomeiwen.com/subject/wrokjrtx.html