美文网首页生信分析流程R生信
4个PG基因组SSRs数量分布

4个PG基因组SSRs数量分布

作者: EZ | 来源:发表于2019-12-19 21:10 被阅读0次

    一、统计tns不同染色体上SSR分布

    将序列碱基改为大写
    $ /home/Pomgroup/gdp/app/Seqkit/seqkit seq tns_wgs.fna -u > tsn.fna
    查看序列名称,也要contig水平的序列
          1 >CM018611.1 Punica granatum isolate Tunisia-2019 chromosome 1, whole genome shotgun sequenc
          2 >CM018612.1 Punica granatum isolate Tunisia-2019 chromosome 2, whole genome shotgun sequenc
          3 >CM018613.1 Punica granatum isolate Tunisia-2019 chromosome 3, whole genome shotgun sequenc
          4 >CM018614.1 Punica granatum isolate Tunisia-2019 chromosome 4, whole genome shotgun sequenc
          5 >CM018615.1 Punica granatum isolate Tunisia-2019 chromosome 5, whole genome shotgun sequenc
          6 >CM018616.1 Punica granatum isolate Tunisia-2019 chromosome 6, whole genome shotgun sequenc
          7 >CM018617.1 Punica granatum isolate Tunisia-2019 chromosome 7, whole genome shotgun sequenc
          8 >CM018618.1 Punica granatum isolate Tunisia-2019 chromosome 8, whole genome shotgun sequenc
          9 >MABG02000231.1 Punica granatum isolate Tunisia-2019 Contig00001, whole genome shotgun sequ
         10 >MABG02000443.1 Punica granatum isolate Tunisia-2019 Contig00002, whole genome shotgun sequ
         11 >MABG02000230.1 Punica granatum isolate Tunisia-2019 Contig00003, whole genome shotgun sequ
    

    重命名序列名,使用sed命令
    (1)先修改染色体序列名称

    $ sed -n '/>/p' tsn.fna | sed 's/CM.*2019 //g' | sed 's/\,.*c//g' |less -SN
          1 >chromosome 1e
          2 >chromosome 2e
          3 >chromosome 3e
          4 >chromosome 4e
          5 >chromosome 5e
          6 >chromosome 6e
          7 >chromosome 7e
          8 >chromosome 8e
          9 >MABG02000231.1 Punica granatum isolate Tunisia-2019 Contig00001e
    有e? 也修改了contig序列的名称
    

    (2)重新查看序列名称 ,因为使用less -SN后一行内容太多不能全部复制

    $ sed -n '/>/p' tsn.fna | head -n 10
    >CM018611.1 Punica granatum isolate Tunisia-2019 chromosome 1, whole genome shotgun sequence
    >CM018612.1 Punica granatum isolate Tunisia-2019 chromosome 2, whole genome shotgun sequence
    >CM018613.1 Punica granatum isolate Tunisia-2019 chromosome 3, whole genome shotgun sequence
    >CM018614.1 Punica granatum isolate Tunisia-2019 chromosome 4, whole genome shotgun sequence
    >CM018615.1 Punica granatum isolate Tunisia-2019 chromosome 5, whole genome shotgun sequence
    >CM018616.1 Punica granatum isolate Tunisia-2019 chromosome 6, whole genome shotgun sequence
    >CM018617.1 Punica granatum isolate Tunisia-2019 chromosome 7, whole genome shotgun sequence
    >CM018618.1 Punica granatum isolate Tunisia-2019 chromosome 8, whole genome shotgun sequence
    >MABG02000231.1 Punica granatum isolate Tunisia-2019 Contig00001, whole genome shotgun sequence
    >MABG02000443.1 Punica granatum isolate Tunisia-2019 Contig00002, whole genome shotgun sequence
    

    (3)重新修改

    $ sed -n '/>/p' tsn.fna | sed 's/CM.*2019 //g' |sed 's/MAB.*2019 //g' | sed 's/\,.*ce//g' | sed 's/\s//g' |head -n 10
    >chromosome1
    >chromosome2
    >chromosome3
    >chromosome4
    >chromosome5
    >chromosome6
    >chromosome7
    >chromosome8
    >Contig00001
    >Contig00002
    先修改保留内容之前的内容,再全部替换(,)后的内容。
    

    (4)直接修改文件

    $ sed -i 's/CM.*2019 //;s/MAB.*2019 //;s/\,.*ce//;s/\s//' tsn.fna
    $ sed -n '/>/p' tsn.fna | head -n 10
    >chromosome1
    >chromosome2
    >chromosome3
    >chromosome4
    >chromosome5
    >chromosome6
    >chromosome7
    >chromosome8
    >Contig00001
    >Contig00002
    修改应该没有问题
    

    (5)使用misa查找修改名称后的文件

    perl /home/Pomgroup/gdp/app/misa/misa.pl tsn.fna 
    检测到到的SSRs与不修改前的数据相同,总SSRs等
    

    (6) 查看 每条染色体上的SSRs数量

    $ cat tsn.fna.misa | awk 'NR>1{print$1}' | sort | uniq -c  | less - SN
          1   20316 chromosome1
          2   20480 chromosome2
          3   16830 chromosome3
          4   19217 chromosome4
          5   14610 chromosome5
          6   13039 chromosome6
          7   13626 chromosome7
          8   13418 chromosome8
          9      18 Contig00001
         10      48 Contig00002
         11      32 Contig00003
    $ grep -c "chromosome1" tsn.fna.misa
    20316
    统计的数量。应该是对的
    

    (7)统计每条染色体上不同SSR的数量

    cat tsn.fna.misa | sed -n '/^chromosome/p' |awk '{print$1,$3}' | sort |uniq -c
       2296 chromosome1 c
        384 chromosome1 c*
       9710 chromosome1 p1
       6005 chromosome1 p2
       1471 chromosome1 p3
        271 chromosome1 p4
        108 chromosome1 p5
         71 chromosome1 p6
    
       2226 chromosome2 c
        360 chromosome2 c*
       9909 chromosome2 p1
       6148 chromosome2 p2
       1434 chromosome2 p3
        270 chromosome2 p4
         71 chromosome2 p5
         62 chromosome2 p6
    
       1792 chromosome3 c
        318 chromosome3 c*
       8064 chromosome3 p1
       5055 chromosome3 p2
       1239 chromosome3 p3
        226 chromosome3 p4
         82 chromosome3 p5
         54 chromosome3 p6
    
       2013 chromosome4 c
        401 chromosome4 c*
       9400 chromosome4 p1
       5676 chromosome4 p2
       1311 chromosome4 p3
        256 chromosome4 p4
         87 chromosome4 p5
         73 chromosome4 p6
    
       1589 chromosome5 c
        295 chromosome5 c*
       6996 chromosome5 p1
       4429 chromosome5 p2
       1007 chromosome5 p3
        189 chromosome5 p4
         68 chromosome5 p5
         37 chromosome5 p6
    
       1404 chromosome6 c
        266 chromosome6 c*
       6361 chromosome6 p1
       3855 chromosome6 p2
        891 chromosome6 p3
        174 chromosome6 p4
         44 chromosome6 p5
         44 chromosome6 p6
    
       1542 chromosome7 c
        231 chromosome7 c*
       6698 chromosome7 p1
       4002 chromosome7 p2
        867 chromosome7 p3
        180 chromosome7 p4
         56 chromosome7 p5
         50 chromosome7 p6
    
       1446 chromosome8 c
        265 chromosome8 c*
       6324 chromosome8 p1
       4154 chromosome8 p2
        969 chromosome8 p3
        168 chromosome8 p4
         51 chromosome8 p5
         41 chromosome8 p6
    
    

    (8) 根据以上数据画图
    参考文章

    不知道为什么要使用reshape2这个包,可能是因为r只能读3行(x,y,fill),因为得到的数据就是这种格式,所以不用把每条染色体上不同的motif写到同一行。

    library(ggplot2)
    library(ggsci)
    dir()
    ssr <- read.csv("tnschrossrs.csv",header = T)
    head(ssr)
    p <- ggplot(ssr,aes(chr,num,fill=motif))+
      geom_bar(stat = "identity",position = "dodge")+
      labs(title="SSRS of tns genome",x="",y="SSRs数量")+
      theme_classic()
    p2 <- p +
      guides(fill=guide_legend(title="motif"))+
      scale_fill_npg()+
      theme(axis.text = element_text(size=20),
            axis.title =element_text(size=25),
            legend.title = element_text(size=25),
            legend.text = element_text(size=20))
    p2
    ggtheme 提供的颜色可能不超过6种,不能使用theme_wsj()+
    scale_fill_wsj() 不过可以使用ggsci的颜色,有必学一下ggsci 和ggthemes。
    堆积柱形图就是 dodge 改为stack ,智商受到暴击
    
    簇状柱状图 堆积柱形图position = "stack" 百分比堆积柱形图 position = "fill"

    8条染色体之间不同motif比例无很大差别

    簇状条形图

    使用coord_flip()函数 获得簇状条形图

    二、四个pg基因组数据SSR数量比较

    1.簇状柱状图

    library(ggsci)
    library(ggplot2)
    library(reshape2)
    df <- read.csv("ssrsfrequency.csv")
    fssr <-  df[1:4,1:7]
    head(fssr)
    fp <- melt(fssr,id.vars="sample",variable.name="motif",value.name = "num")
    head(fp)
    fp1 <- ggplot(fp,aes(sample,num,fill=motif))+
      geom_bar(stat = "identity",position = "fill")+
      labs(title="SSRs Frequency of four pomegranate genomes",x="",y="SSRs Frequency",
           caption = "B")+
      theme_classic()
    fp2 <- fp1 +
      guides(fill=guide_legend(title="motif"))+
      scale_fill_npg()+
      theme(axis.text = element_text(size=20),
            axis.title =element_text(size=25),
            legend.title = element_text(size=25),
            legend.text = element_text(size=20),
            plot.caption = element_text(size = 20,hjust = 0))+
      coord_flip()
    fp2
    
    
    四个基因组SSR数量比较 四个基因组不同SSR所占比例比较

    复合微卫星里的微卫星不知道是是不是算入单个微卫星的数量

    md,太难了,就不计算复合微卫星了

    自己统计到的数量有给出的结果不一致
    $ cat  tsn.fna.misa |awk 'NR>1{print$3}'|sort|uniq -c
      14836 c
       2619 c*
      65824 p1
      40737 p2
       9439 p3
       1784 p4
        591 p5
        447 p6
    $ sed -n '27,33p'  tsn.fna.statistics
    Unit size       Number of SSRs
    1       81777
    2       57745
    3       13592
    4       2837
    5       810
    6       543
    
    c* 到底表示什么意思
    

    用文件测试

    >test
    TTTTTTTTTTTTAAACCCTAAAACCCCTAAAACCCCAAAACCCTAAACCCTAAAACCAAACCCTAAAACCCCTAAAACCCCAAAACCCTCCTAAAAAACCCTAAAACCCCTAAAACCCCAAAACCCTACCCCAAAACCCTAAACCCTAAAACCCCTAAAACCCCAAAACCCTAAAAAAAAAAAAAAAAACCGGTTTTTTTTTT
    默认参数查找结果
    ID  SSR nr. SSR type    SSR size    start   end
    test    1   p1  (T)12   12  1   12
    test    2   c   (A)17ccgg(T)10  31  173 203
    
    Total number of sequences examined:              1
    Total size of examined sequences (bp):           203
    Total number of identified SSRs:                 3
    Number of SSR containing sequences:              1
    Number of sequences containing more than 1 SSR:  1
    Number of SSRs present in compound formation:    1
    
    
    Distribution to different repeat type classes
    ---------------------------------------------
    
    Unit size   Number of SSRs
    1   3
    statistic 统计的结果,即总的SSRS数量为perfect微卫星,复合微卫星里的perfect的数量是算到ferect数量里的
    

    重新misa,只查看完整perfect微卫星,打算是删除查找复合卫星的ini文件的内容,不行,就把两微卫星距离挑到很大查找避免不了复合微卫星还是按初始数据统计吧

    三 统计 p1 p2 p3 p4不同重复单元

    怎么单独将单碱基重复单元提取出来,使用if
    $ awk 'BEGIN{FS=OFS="\t"}{if(length($2)==5){print$0}}' all_four_pg_genome_ssrs_base > four_pg_p2base.txt
    
    
    
    单核苷酸重复型SSRs分布特征 二核苷酸重复型SSRs分布特征 三核苷酸重复型SSRs分布特征

    有的样本里没有特定的四碱基重复序列SSRs,看一下那个四碱基重复出没有出现

    ]$ awk '{print$2}' all_four_pg_genome_ssrs_base | sort | uniq -c | awk '{if($1<4)print$0 }'
          2 ACCC/GGGT  有2个基因组没有出现这个,
          1 ACCT/AGGT  有3个基因组没有出现这个
    $ sed -n '/ACCT\/AGGT/p' all_four_pg_genome_ssrs_base
    tsh     ACCT/AGGT       1  只有tsh出现
    $ sed -n '/ACCC\/GGGT/p' all_four_pg_genome_ssrs_base
    tns     ACCC/GGGT       1
    tsh     ACCC/GGGT       1  只有tns与tsh中出现
    
    
    

    p5p6类型太多,手动判断哪个基因组数据中没有出现慢,

     sed -n '811,$p' tsn.fna.statistics | awk 'BEGIN{FS=OFS="\t"}{print"tns",$1,$NF}' > ../all_four_pg_genome_123456ssrs_base   
    四个基因组数据都在上文件里
    统计没有在4个基因组中全部出现的
    $ awk '{if(length($2==11)){print$0}}' all_four_pg_genome_123456ssrs_base | awk '{print$2}' |sort|uniq -c| awk '{if($1<4)print$0}'
          3 AAAATG/ATTTTC
          1 AAAGC/CTTTG
          1 AAAGGG/CCCTTT
          3 AAATAC/ATTTGT
          1 AACATC/ATGTTG
          1 AACCCC/GGGGTT
          1 AACCG/CGGTT
          3 AACCTG/AGGTTC
          1 AACTCG/AGTTCG
          2 AACTTC/AAGTTG
          2 AAGATC/ATCTTG
          1 AAGTG/ACTTC
          3 AATCCT/AGGATT
          2 AATGC/ATTGC
          1 AATGGC/ATTGCC
          3 ACACAG/CTGTGT
          2 ACACCG/CGGTGT
          3 ACACGC/CGTGTG
          1 ACATC/ATGTG
          1 ACATGG/ATGTCC
          1 ACCATG/ATGGTC
          3 ACCCCC/GGGGGT
          2 ACCC/GGGT
          1 ACCCGT/ACGGGT
          3 ACCCTC/AGGGTG
          3 ACCCTG/AGGGTC
          1 ACCT/AGGT
          1 ACGCCC/CGTGGG
          1 ACGCGC/CGCGTG
          1 ACGGAG/CCGTCT
          2 ACGGCC/CCGTGG
          1 ACGGG/CCCGT
          3 ACGGGG/CCCCGT
          3 ACTATG/AGTCAT
          3 ACTCCG/AGTCGG
          1 ACTGCG/AGTCGC
          1 ACTGGG/AGTCCC
          1 AGAGAT/ATCTCT
          3 AGAGCG/CGCTCT
          1 AGAGCT/AGCTCT
          3 AGCCGC/CGGCTG
          2 AGCGGG/CCCGCT
          1 AGGGCG/CCCTCG
          1 ATCCCG/ATCGGG
          2 CCCCCG/CGGGGG
          1 CCCCG/CGGGG
          3 CCCCGG/CCGGGG
    $ awk '{if(length($2==11)){print$0}}' all_four_pg_genome_123456ssrs_base | awk '{print$2}' |sort|uniq -c| awk '{if($1<4)print$0}'|wc -l
    47  没有全部出现
    仅出现一次的有,统计是哪个仅仅出现了一次,学python
    $ awk '{if(length($2==11)){print$0}}' all_four_pg_genome_123456ssrs_base | awk '{print$2}' |sort|uniq -c| awk '{if($1==1)print$0}'
          1 AAAGC/CTTTG
          1 AAAGGG/CCCTTT
          1 AACATC/ATGTTG
          1 AACCCC/GGGGTT
          1 AACCG/CGGTT
          1 AACTCG/AGTTCG
          1 AAGTG/ACTTC
          1 AATGGC/ATTGCC
          1 ACATC/ATGTG
          1 ACATGG/ATGTCC
          1 ACCATG/ATGGTC
          1 ACCCGT/ACGGGT
          1 ACCT/AGGT
          1 ACGCCC/CGTGGG
          1 ACGCGC/CGCGTG
          1 ACGGAG/CCGTCT
          1 ACGGG/CCCGT
          1 ACTGCG/AGTCGC
          1 ACTGGG/AGTCCC
          1 AGAGAT/ATCTCT
          1 AGAGCT/AGCTCT
          1 AGGGCG/CCCTCG
          1 ATCCCG/ATCGGG
          1 CCCCG/CGGGG
    

    四 tns CDS序列SSRs数量

    library(ggthemes)
    Sys.setlocale ("LC_ALL","Chinese")
    library(ggplot2)
    library(ggsci)
    dir()
    ssr <- read.csv("tsh_cds_diffssrsnum.csv",header = T)[1:6,]
    head(ssr)
    dim(ssr)
    level<- as.vector(ssr$motif[1:6])
    str(level)
    ssr$motif<-factor(ssr$motif,levels = level)
    p <- ggplot(ssr,aes(motif,num,fill=motif))+
      geom_bar(stat = "identity",position = "dodge")+
      labs(x="",y="SSRs Number",
           caption = "A")+
      theme_classic()
    p2 <- p +
      guides(fill=guide_legend(title="motif"))+
      scale_fill_npg()+
      theme(axis.text = element_text(size=20),
            axis.title =element_text(size=25),
            legend.title = element_text(size=25),
            legend.text = element_text(size=20),
            plot.caption = element_text(size = 20,hjust = 0))+
      coord_flip()
      
      
    p2
    
    
    
    tns cds SSRs num tns cds SSRs num

    相关文章

      网友评论

        本文标题:4个PG基因组SSRs数量分布

        本文链接:https://www.haomeiwen.com/subject/yrpvnctx.html