一、统计tns不同染色体上SSR分布
将序列碱基改为大写
$ /home/Pomgroup/gdp/app/Seqkit/seqkit seq tns_wgs.fna -u > tsn.fna
查看序列名称,也要contig水平的序列
1 >CM018611.1 Punica granatum isolate Tunisia-2019 chromosome 1, whole genome shotgun sequenc
2 >CM018612.1 Punica granatum isolate Tunisia-2019 chromosome 2, whole genome shotgun sequenc
3 >CM018613.1 Punica granatum isolate Tunisia-2019 chromosome 3, whole genome shotgun sequenc
4 >CM018614.1 Punica granatum isolate Tunisia-2019 chromosome 4, whole genome shotgun sequenc
5 >CM018615.1 Punica granatum isolate Tunisia-2019 chromosome 5, whole genome shotgun sequenc
6 >CM018616.1 Punica granatum isolate Tunisia-2019 chromosome 6, whole genome shotgun sequenc
7 >CM018617.1 Punica granatum isolate Tunisia-2019 chromosome 7, whole genome shotgun sequenc
8 >CM018618.1 Punica granatum isolate Tunisia-2019 chromosome 8, whole genome shotgun sequenc
9 >MABG02000231.1 Punica granatum isolate Tunisia-2019 Contig00001, whole genome shotgun sequ
10 >MABG02000443.1 Punica granatum isolate Tunisia-2019 Contig00002, whole genome shotgun sequ
11 >MABG02000230.1 Punica granatum isolate Tunisia-2019 Contig00003, whole genome shotgun sequ
重命名序列名,使用sed命令
(1)先修改染色体序列名称
$ sed -n '/>/p' tsn.fna | sed 's/CM.*2019 //g' | sed 's/\,.*c//g' |less -SN
1 >chromosome 1e
2 >chromosome 2e
3 >chromosome 3e
4 >chromosome 4e
5 >chromosome 5e
6 >chromosome 6e
7 >chromosome 7e
8 >chromosome 8e
9 >MABG02000231.1 Punica granatum isolate Tunisia-2019 Contig00001e
有e? 也修改了contig序列的名称
(2)重新查看序列名称 ,因为使用less -SN后一行内容太多不能全部复制
$ sed -n '/>/p' tsn.fna | head -n 10
>CM018611.1 Punica granatum isolate Tunisia-2019 chromosome 1, whole genome shotgun sequence
>CM018612.1 Punica granatum isolate Tunisia-2019 chromosome 2, whole genome shotgun sequence
>CM018613.1 Punica granatum isolate Tunisia-2019 chromosome 3, whole genome shotgun sequence
>CM018614.1 Punica granatum isolate Tunisia-2019 chromosome 4, whole genome shotgun sequence
>CM018615.1 Punica granatum isolate Tunisia-2019 chromosome 5, whole genome shotgun sequence
>CM018616.1 Punica granatum isolate Tunisia-2019 chromosome 6, whole genome shotgun sequence
>CM018617.1 Punica granatum isolate Tunisia-2019 chromosome 7, whole genome shotgun sequence
>CM018618.1 Punica granatum isolate Tunisia-2019 chromosome 8, whole genome shotgun sequence
>MABG02000231.1 Punica granatum isolate Tunisia-2019 Contig00001, whole genome shotgun sequence
>MABG02000443.1 Punica granatum isolate Tunisia-2019 Contig00002, whole genome shotgun sequence
(3)重新修改
$ sed -n '/>/p' tsn.fna | sed 's/CM.*2019 //g' |sed 's/MAB.*2019 //g' | sed 's/\,.*ce//g' | sed 's/\s//g' |head -n 10
>chromosome1
>chromosome2
>chromosome3
>chromosome4
>chromosome5
>chromosome6
>chromosome7
>chromosome8
>Contig00001
>Contig00002
先修改保留内容之前的内容,再全部替换(,)后的内容。
(4)直接修改文件
$ sed -i 's/CM.*2019 //;s/MAB.*2019 //;s/\,.*ce//;s/\s//' tsn.fna
$ sed -n '/>/p' tsn.fna | head -n 10
>chromosome1
>chromosome2
>chromosome3
>chromosome4
>chromosome5
>chromosome6
>chromosome7
>chromosome8
>Contig00001
>Contig00002
修改应该没有问题
(5)使用misa查找修改名称后的文件
perl /home/Pomgroup/gdp/app/misa/misa.pl tsn.fna
检测到到的SSRs与不修改前的数据相同,总SSRs等
(6) 查看 每条染色体上的SSRs数量
$ cat tsn.fna.misa | awk 'NR>1{print$1}' | sort | uniq -c | less - SN
1 20316 chromosome1
2 20480 chromosome2
3 16830 chromosome3
4 19217 chromosome4
5 14610 chromosome5
6 13039 chromosome6
7 13626 chromosome7
8 13418 chromosome8
9 18 Contig00001
10 48 Contig00002
11 32 Contig00003
$ grep -c "chromosome1" tsn.fna.misa
20316
统计的数量。应该是对的
(7)统计每条染色体上不同SSR的数量
cat tsn.fna.misa | sed -n '/^chromosome/p' |awk '{print$1,$3}' | sort |uniq -c
2296 chromosome1 c
384 chromosome1 c*
9710 chromosome1 p1
6005 chromosome1 p2
1471 chromosome1 p3
271 chromosome1 p4
108 chromosome1 p5
71 chromosome1 p6
2226 chromosome2 c
360 chromosome2 c*
9909 chromosome2 p1
6148 chromosome2 p2
1434 chromosome2 p3
270 chromosome2 p4
71 chromosome2 p5
62 chromosome2 p6
1792 chromosome3 c
318 chromosome3 c*
8064 chromosome3 p1
5055 chromosome3 p2
1239 chromosome3 p3
226 chromosome3 p4
82 chromosome3 p5
54 chromosome3 p6
2013 chromosome4 c
401 chromosome4 c*
9400 chromosome4 p1
5676 chromosome4 p2
1311 chromosome4 p3
256 chromosome4 p4
87 chromosome4 p5
73 chromosome4 p6
1589 chromosome5 c
295 chromosome5 c*
6996 chromosome5 p1
4429 chromosome5 p2
1007 chromosome5 p3
189 chromosome5 p4
68 chromosome5 p5
37 chromosome5 p6
1404 chromosome6 c
266 chromosome6 c*
6361 chromosome6 p1
3855 chromosome6 p2
891 chromosome6 p3
174 chromosome6 p4
44 chromosome6 p5
44 chromosome6 p6
1542 chromosome7 c
231 chromosome7 c*
6698 chromosome7 p1
4002 chromosome7 p2
867 chromosome7 p3
180 chromosome7 p4
56 chromosome7 p5
50 chromosome7 p6
1446 chromosome8 c
265 chromosome8 c*
6324 chromosome8 p1
4154 chromosome8 p2
969 chromosome8 p3
168 chromosome8 p4
51 chromosome8 p5
41 chromosome8 p6
(8) 根据以上数据画图
参考文章
不知道为什么要使用reshape2这个包,可能是因为r只能读3行(x,y,fill),因为得到的数据就是这种格式,所以不用把每条染色体上不同的motif写到同一行。
library(ggplot2)
library(ggsci)
dir()
ssr <- read.csv("tnschrossrs.csv",header = T)
head(ssr)
p <- ggplot(ssr,aes(chr,num,fill=motif))+
geom_bar(stat = "identity",position = "dodge")+
labs(title="SSRS of tns genome",x="",y="SSRs数量")+
theme_classic()
p2 <- p +
guides(fill=guide_legend(title="motif"))+
scale_fill_npg()+
theme(axis.text = element_text(size=20),
axis.title =element_text(size=25),
legend.title = element_text(size=25),
legend.text = element_text(size=20))
p2
ggtheme 提供的颜色可能不超过6种,不能使用theme_wsj()+
scale_fill_wsj() 不过可以使用ggsci的颜色,有必学一下ggsci 和ggthemes。
堆积柱形图就是 dodge 改为stack ,智商受到暴击
簇状柱状图
堆积柱形图position = "stack"
百分比堆积柱形图 position = "fill"
簇状条形图8条染色体之间不同motif比例无很大差别
使用coord_flip()函数 获得簇状条形图
二、四个pg基因组数据SSR数量比较
1.簇状柱状图
library(ggsci)
library(ggplot2)
library(reshape2)
df <- read.csv("ssrsfrequency.csv")
fssr <- df[1:4,1:7]
head(fssr)
fp <- melt(fssr,id.vars="sample",variable.name="motif",value.name = "num")
head(fp)
fp1 <- ggplot(fp,aes(sample,num,fill=motif))+
geom_bar(stat = "identity",position = "fill")+
labs(title="SSRs Frequency of four pomegranate genomes",x="",y="SSRs Frequency",
caption = "B")+
theme_classic()
fp2 <- fp1 +
guides(fill=guide_legend(title="motif"))+
scale_fill_npg()+
theme(axis.text = element_text(size=20),
axis.title =element_text(size=25),
legend.title = element_text(size=25),
legend.text = element_text(size=20),
plot.caption = element_text(size = 20,hjust = 0))+
coord_flip()
fp2
四个基因组SSR数量比较
四个基因组不同SSR所占比例比较
复合微卫星里的微卫星不知道是是不是算入单个微卫星的数量
md,太难了,就不计算复合微卫星了
自己统计到的数量有给出的结果不一致
$ cat tsn.fna.misa |awk 'NR>1{print$3}'|sort|uniq -c
14836 c
2619 c*
65824 p1
40737 p2
9439 p3
1784 p4
591 p5
447 p6
$ sed -n '27,33p' tsn.fna.statistics
Unit size Number of SSRs
1 81777
2 57745
3 13592
4 2837
5 810
6 543
c* 到底表示什么意思
用文件测试
>test
TTTTTTTTTTTTAAACCCTAAAACCCCTAAAACCCCAAAACCCTAAACCCTAAAACCAAACCCTAAAACCCCTAAAACCCCAAAACCCTCCTAAAAAACCCTAAAACCCCTAAAACCCCAAAACCCTACCCCAAAACCCTAAACCCTAAAACCCCTAAAACCCCAAAACCCTAAAAAAAAAAAAAAAAACCGGTTTTTTTTTT
默认参数查找结果
ID SSR nr. SSR type SSR size start end
test 1 p1 (T)12 12 1 12
test 2 c (A)17ccgg(T)10 31 173 203
Total number of sequences examined: 1
Total size of examined sequences (bp): 203
Total number of identified SSRs: 3
Number of SSR containing sequences: 1
Number of sequences containing more than 1 SSR: 1
Number of SSRs present in compound formation: 1
Distribution to different repeat type classes
---------------------------------------------
Unit size Number of SSRs
1 3
statistic 统计的结果,即总的SSRS数量为perfect微卫星,复合微卫星里的perfect的数量是算到ferect数量里的
重新misa,只查看完整perfect微卫星,打算是删除查找复合卫星的ini文件的内容,不行,就把两微卫星距离挑到很大查找避免不了复合微卫星还是按初始数据统计吧
三 统计 p1 p2 p3 p4不同重复单元
怎么单独将单碱基重复单元提取出来,使用if
$ awk 'BEGIN{FS=OFS="\t"}{if(length($2)==5){print$0}}' all_four_pg_genome_ssrs_base > four_pg_p2base.txt
单核苷酸重复型SSRs分布特征
二核苷酸重复型SSRs分布特征
三核苷酸重复型SSRs分布特征
有的样本里没有特定的四碱基重复序列SSRs,看一下那个四碱基重复出没有出现
]$ awk '{print$2}' all_four_pg_genome_ssrs_base | sort | uniq -c | awk '{if($1<4)print$0 }'
2 ACCC/GGGT 有2个基因组没有出现这个,
1 ACCT/AGGT 有3个基因组没有出现这个
$ sed -n '/ACCT\/AGGT/p' all_four_pg_genome_ssrs_base
tsh ACCT/AGGT 1 只有tsh出现
$ sed -n '/ACCC\/GGGT/p' all_four_pg_genome_ssrs_base
tns ACCC/GGGT 1
tsh ACCC/GGGT 1 只有tns与tsh中出现
p5p6类型太多,手动判断哪个基因组数据中没有出现慢,
sed -n '811,$p' tsn.fna.statistics | awk 'BEGIN{FS=OFS="\t"}{print"tns",$1,$NF}' > ../all_four_pg_genome_123456ssrs_base
四个基因组数据都在上文件里
统计没有在4个基因组中全部出现的
$ awk '{if(length($2==11)){print$0}}' all_four_pg_genome_123456ssrs_base | awk '{print$2}' |sort|uniq -c| awk '{if($1<4)print$0}'
3 AAAATG/ATTTTC
1 AAAGC/CTTTG
1 AAAGGG/CCCTTT
3 AAATAC/ATTTGT
1 AACATC/ATGTTG
1 AACCCC/GGGGTT
1 AACCG/CGGTT
3 AACCTG/AGGTTC
1 AACTCG/AGTTCG
2 AACTTC/AAGTTG
2 AAGATC/ATCTTG
1 AAGTG/ACTTC
3 AATCCT/AGGATT
2 AATGC/ATTGC
1 AATGGC/ATTGCC
3 ACACAG/CTGTGT
2 ACACCG/CGGTGT
3 ACACGC/CGTGTG
1 ACATC/ATGTG
1 ACATGG/ATGTCC
1 ACCATG/ATGGTC
3 ACCCCC/GGGGGT
2 ACCC/GGGT
1 ACCCGT/ACGGGT
3 ACCCTC/AGGGTG
3 ACCCTG/AGGGTC
1 ACCT/AGGT
1 ACGCCC/CGTGGG
1 ACGCGC/CGCGTG
1 ACGGAG/CCGTCT
2 ACGGCC/CCGTGG
1 ACGGG/CCCGT
3 ACGGGG/CCCCGT
3 ACTATG/AGTCAT
3 ACTCCG/AGTCGG
1 ACTGCG/AGTCGC
1 ACTGGG/AGTCCC
1 AGAGAT/ATCTCT
3 AGAGCG/CGCTCT
1 AGAGCT/AGCTCT
3 AGCCGC/CGGCTG
2 AGCGGG/CCCGCT
1 AGGGCG/CCCTCG
1 ATCCCG/ATCGGG
2 CCCCCG/CGGGGG
1 CCCCG/CGGGG
3 CCCCGG/CCGGGG
$ awk '{if(length($2==11)){print$0}}' all_four_pg_genome_123456ssrs_base | awk '{print$2}' |sort|uniq -c| awk '{if($1<4)print$0}'|wc -l
47 没有全部出现
仅出现一次的有,统计是哪个仅仅出现了一次,学python
$ awk '{if(length($2==11)){print$0}}' all_four_pg_genome_123456ssrs_base | awk '{print$2}' |sort|uniq -c| awk '{if($1==1)print$0}'
1 AAAGC/CTTTG
1 AAAGGG/CCCTTT
1 AACATC/ATGTTG
1 AACCCC/GGGGTT
1 AACCG/CGGTT
1 AACTCG/AGTTCG
1 AAGTG/ACTTC
1 AATGGC/ATTGCC
1 ACATC/ATGTG
1 ACATGG/ATGTCC
1 ACCATG/ATGGTC
1 ACCCGT/ACGGGT
1 ACCT/AGGT
1 ACGCCC/CGTGGG
1 ACGCGC/CGCGTG
1 ACGGAG/CCGTCT
1 ACGGG/CCCGT
1 ACTGCG/AGTCGC
1 ACTGGG/AGTCCC
1 AGAGAT/ATCTCT
1 AGAGCT/AGCTCT
1 AGGGCG/CCCTCG
1 ATCCCG/ATCGGG
1 CCCCG/CGGGG
四 tns CDS序列SSRs数量
library(ggthemes)
Sys.setlocale ("LC_ALL","Chinese")
library(ggplot2)
library(ggsci)
dir()
ssr <- read.csv("tsh_cds_diffssrsnum.csv",header = T)[1:6,]
head(ssr)
dim(ssr)
level<- as.vector(ssr$motif[1:6])
str(level)
ssr$motif<-factor(ssr$motif,levels = level)
p <- ggplot(ssr,aes(motif,num,fill=motif))+
geom_bar(stat = "identity",position = "dodge")+
labs(x="",y="SSRs Number",
caption = "A")+
theme_classic()
p2 <- p +
guides(fill=guide_legend(title="motif"))+
scale_fill_npg()+
theme(axis.text = element_text(size=20),
axis.title =element_text(size=25),
legend.title = element_text(size=25),
legend.text = element_text(size=20),
plot.caption = element_text(size = 20,hjust = 0))+
coord_flip()
p2
tns cds SSRs num
tns cds SSRs num
网友评论