论文

AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication

这个是做植物基因组比对的一个工具，正好最近在看这个论文。论文中公布了部分作图代码，作图使用到的是R语言的ggplot2。跟着其中的代码学习一下。论文本地文件是e2113075119.full.pdf

今天的推文重复一下论文附件中的Fig S1

image.png

这个图是用面积图来展示的拟南芥基因组中变异长度的分布

论文中提供的代码链接是 https://github.com/baoxingsong/genomeAlignment/blob/main/Arabidopsis/pipeline.Rmd

这里涉及到了18个数据，我这里只按照论文提供的代码处理得到了第一个小图的数据，这里就不介绍处理原始数据的代码了，我们只介绍作图的R语言代码

处理数据的代码里用到了一个gean的工具，工具对应的论文是 Complement Genome Annotation Lift Over Using a Weighted Sequence Alignment Strategy
对应的github链接 https://github.com/baoxingsong/GEAN

作图用到的数据是

tair10.fa.fai
bur_0.v7c.sdi.length

首先是读取数据

faidx = read.table("tair10.fa.fai")
genome_size = sum(faidx$V2)
dat = read.table("bur_0.v7c.sdi.length")

加载需要用到的R包

library(ggplot2)
library(data.table)
library(dplyr)

将数据整理成作图要求的格式

dat$length = dat$V1
dat[which(dat$V1 < dat$V2),]$length = dat[which(dat$V1 < dat$V2),]$V2

df<- dat[order(dat$length),] %>% mutate(CUMFREQ=cumsum(length))

作图代码

p <- ggplot(data=df, aes(x=length, y=CUMFREQ)) + 
  geom_area( fill = "#00BFC4", position = 'identity')  + 
  labs(x="Variant length", y="Cumulative length", fill="", title="") + 
  xlim(0, 65000) + ylim(0, 5200000)+
  theme_grey(base_size = 36) + 
  theme(axis.line = element_blank(), 
        panel.background = element_blank(),
        panel.border = element_rect(fill=NA,color="black", 
                                    size=0.5, linetype="solid"))
print(p)

出图如下

image.png

论文中提供的代码是因为有很多相同的数据要处理，所以他把作图代码整理成了函数的形式，大家可以研究一下论文中提供的代码

总共有18个这种图，最后如果要拼图可以借助aplot这个R包,这个R包里有一个函数是plot_list(),画很多个图非常方便

以下拼图代码需要的内存可能比较大，运行的时候需要谨慎

library(patchwork)
library(aplot)

pdf(file = "pnasfigures1.pdf",
    width = 18.8,
    height = 8,
    family = "serif")
plot_list(list(p,p,p,p,p,p,p,p,p,
               p,p,p,p,p,p,p,p,p),ncol=4)+
  plot_annotation(tag_levels = "a")

dev.off()