3K水稻SNP数据集的简单利用

作者: Nuvolar | 来源:发表于2021-09-10 18:21 被阅读0次

    3010份亚洲稻群体重测序项目是由中国农业科学院作物科学研究所牵头,联合国际水稻研究所、华大基因等16家单位共同完成。该研究对来自89个国家的3,010份水稻品种进行了重测序研究(resequence),参考的是日本晴(Nipponbare)基因组。所有3,010个基因组的平均测序深度(average sequencing depth)为14×,平均基因组覆盖率(average genome coverage)和绘图率(average mapping rate)分别为94.0%和92.5%。
    该项目的意义自然不用多说,并且该项目的测序数据都可以在国际水稻所CNGB上下载。我们可以从原始序列开始组装得到大的结构变异 或call到一些稀有SNP来做研究。当然以上想法不在本文讨论范围内,本文主要讨论只想知道几个基因区域上 SNP 及其分化系数等群体指标的计算。

    2021.09.15 更新了单倍型分型、作图和方差分析。

    放在前面

    RFGB可以极大程度上满足查找某个基因上的SNP&indel和单倍型查找。并提供15个表型可以进行单倍型分析,也可以上传表型进行分析。接下来探索一下。

    单倍型分析
    这个页面提供单倍型分析。分别可以用基因名,染色体位置和特定的SNP来进行分析。提供MAF是最小等位基因频率,建议选一下。可以用全部的3024个品种进行查找也可以分亚群进行查找,也提供了表型关联分析。
    结果 结果
    这里是用的LOC_Os08g33530 CDS 区域,选的籼稻群体,共找到20个SNP,和16个单倍型。其中hap11的谷粒长最高。
    这个网站用来做单倍型都是SNP,而且是复等位的,就是没找到在哪下数据集。
    可惜没有LD的图和pi值。
    但我找到了水稻单倍型的数据,具体可看文章
    The landscape of gene–CDS–haplotype diversity in rice: Properties, population organization, footprints of domestication and breeding, and implications for genetic improvement
    数据地址

    也可以用这个单倍型数据和表型做关联分析,用方差分析就可以。表型数据下载

    以下主要是3k构建的SNP的探索以及简单利用。

    工具准备

    R
    R package(LDheatmap,genetics)
    plink1.9(3个操作系统版本都有)下载地址
    vcftools(只有liunx版本,只是用来算群体遗传统计)下载地址

    数据下载

    可以在IRRI上下载到双等位SNP&indel数据(多等位的SNP貌似只能从头call,没找到下载的地方)。bed、bim、fam文件为一组,bed 是存放位点信息的二进制格式文件,用文本打开会发现是乱码;bim是存放位点信息的文件,类似于map格式;fam是存放样本信息的文件,第一列为FID,第二列为IID。

    bim
    fam
    Nipponbare_indel 是双等位indel数据集,共有2.3m个,但是是没有经过筛选的,后续需要加工下。
    双等位的SNP标记共5个。
    NB_final_snp 包含了所有的SNP,共29m个,是没有经过筛选的。
    Base 的SNP是经过基础筛选的,共18m个。
    base_filtered_v0.7 是从Base的SNP进一步筛选的,共4.2m个,本文主要利用这个数据集。
    pruned_v2.1 是2.1版本的筛选方案,包含1m个SNP。
    base_filtered_v0.7_0.8_10kb_1_0.8_50_1 是在base_filtered_v0.7 的基础上进一步筛选 共 404k 个,在用的时候发现很多基因区域上都没有SNP,毕竟个数少,比较稀疏。
    各个数据集的筛选方法可以看看里边的readme文件,不同的筛选方案主要目的就是提高精度,找到因果变异。

    准备目标基因的位置

    其实只需要染色体号和起始位置就好了,基因名字写啥都行,就是拿来做图片标题的。忘记提了,这个数据集中没有线粒体和叶绿体上的标记。建议起始位置可以把上游和下游都囊括进去,多个2kb,3kb 都是可以的,多找几个总是好的,万一和表型关联上了呢。 基因列表,以tab分割

    合并想要的SNP数据集&indel数据集

    其实本可以一个个数据集来,但我想把2.3m个indel和4.2m个SNP标记合在一起分析咋办呢。最简单的办法还是用vcftools或bcftools合并两个数据集。可以参考这篇。bash代码如下:

    plink --bfile base_filtered_v0.7 --recode vcf-iid --out base_filtered_v0.7 #先转到vcf格式
    plink --bfile Nipponbare_indel --recode vcf-iid --out Nipponbare_indel
    ###由于Nipponbare_indel 中少一个样本 IRIS_313-8921 并且这个样本在官网给的分类为NA,空值。所以就需要删掉它
    vcftools --vcf base_filtered_v0.7.vcf --recode --recode-INFO-all --stdout --indv IRIS_313-8921 > rm_na_base_filtered_v0.7.vcf
    
    bcftools view Nipponbare_indel.vcf -Oz -o Nipponbare_indel.vcf.gz  #构建引索
    bcftools index Nipponbare_indel.vcf.gz 
    
    bcftools view rm_na_base_filtered_v0.7.vcf -Oz -o rm_na_base_filtered_v0.7.vcf.gz
    bcftools index rm_na_base_filtered_v0.7.vcf.gz
    
    bcftools concat rm_na_base_filtered_v0.7.vcf.gz Nipponbare_indel.vcf.gz -a -O v -o combine_base_filtered_indel.vcf #合并两个vcf文件
    
    plink --vcf combine_base_filtered_indel.vcf --make-bed --out  combine_base_filtered_indel #再转回bed格式 主要是这样占内存小点,这个vcf文件有55G ,bed文件只有5.23G
    
    

    那么就不想用vcftools合并两个数据集咋办呢?有些时候并不想得到所有位点的信息,只需要一部分就可以了。其实可以用R和plink来实现合并两个数据集,并只提取几个位点,这种方案内存代价小点,且看后续讲解。

    准备分类文件

    群体分类文件如下 原始文件

    这个分类太细了,给它分成五类就可以了,GJ、XI、admix、aus、bas,用excel的分列就能完成,需要把IRIS_313-8921去掉,这个分类是NA。当然也可以用自定义的分类。

    整合成plink能识别的分类格式,分别是FID,IID,class plink 格式

    突然想起踩到的一个坑,plink1.9和2在筛选的时候无法识别样本名称带有"-"字符,所以需要把 分类文件里 "-"换成"_" ,"IRIS_313-8921"换成"IRIS_313_8921"如用 excel 就可以完成,对了fam文件里也要换。

    目标区域SNP位点提取

    首先需要把plink.exe和分类文件(plink_cluster.txt)放到工作目录下,方便调用,也可以把它放到环境目录下。

    R代码如下:

    #基本参数
    gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目标区间
    out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
    data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
    pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
    fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
    #质控参数
    maf <- 0.05 
    miss <- 0.4
    
    color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD
    
    target_gene <- read.table(gene_list,header = T)
    extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    #提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
    #查看样本是否缺少
    tem <- c()
    for (i in 1:length(data_name)) {
        temp <- read.table(paste0(data_name[i],".fam"),header = F)
        temp <- nrow(temp)
        tem <- c(tem,temp)
    }
    #用少了样本的数据集来keep
    keep_file <- data_name[which(tem %in% min(tem))]
    
    for (i in 1:length(data_name)) {
        total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
        system(total_cmd,intern = FALSE)
    }
    temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')
    
    if (length(data_name)>=2){    #合并数据集
        for (i in 2:length(data_name)) {
            temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
            temp_all<-rbind(temp_all ,temp )
        }
        temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
    }
    
    ### 列名
    fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
    vcf_colnames<-c("#CHROM",   "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
    vcf_colnames<-c(vcf_colnames,fam_name)
    ###输出结果,vcf
    names(temp_all)<-vcf_colnames
    write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
    ###筛选
    total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题,还是再转下bed格式
    system(total_cmd,intern = FALSE)
    
    temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
    temp[,1] <- temp[,2]
    write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
    total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
    system(total_cmd,intern = FALSE)
    
    #不同群体提取 按基因提取 
    for(g in target_gene$GENE){
        extract <-dplyr::filter(target_gene ,GENE==g)
        extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
        write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
        
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
        system(total_cmd,intern = FALSE)
        
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --make-bed --out ",paste0(out_name,"_","target_all_",g))
        system(total_cmd,intern = FALSE)
        
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
        system(total_cmd,intern = FALSE)
    
        for (p in pop_name) {
            total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
            system(total_cmd,intern = FALSE)
            total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
            system(total_cmd,intern = FALSE)
    ###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --make-bed --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
    ###        system(total_cmd,intern = FALSE)
        }
    }
    
    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
    }
    

    画LD heatmap

    画LD用Tassel就可以话,上述代码也生成了vcf格式文件可以直接在Tassel中打开。这里用下LDheatmap。LDheatmap具体的教程比较多,自由度也比较高,可自行找些教程。当然也有在线能画LD图的,就是有点麻烦。这里具体就是如何将vcf或bed格式文件转到LDheatmap可用的格式。R代码如下:

    library(LDheatmap)
    library(genetics)
    
    plot_LD <- function(vcf_file_name,title){
        temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
        if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束
    
        temp_snp <- temp_vcf[0,]
        for (i in 1:nrow(temp_vcf)) {
            temp <- temp_vcf[i,]
            for(j in 10:ncol(temp_vcf)){
                if(temp_vcf[i,j]=="0/0"){
                    temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                    next
                }
                if(temp_vcf[i,j]=="0/1"){
                    temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                    next
                }
                if(temp_vcf[i,j]=="1/1"){
                    temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                    next
                }
                if(temp_vcf[i,j]=="./."){
                    temp[1,j] <- NA
                    next
                }
            }
            temp_snp <- rbind(temp_snp,temp)
        }
    
        SNPpos <- temp_snp$POS
        SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))
    
        ### 转换成LDheatmap能用的格式
        for(i in 1:ncol(SNP)){
            SNP[,i]<-as.genotype(SNP[,i])
        }
        ###画图
        LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
    }
    
    for(g in target_gene$GENE){
        plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
        for (p in pop_name) {
            plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
        }
    }
    
    
    结果

    计算群体遗传性指标

    首先推荐用vcftools来计算,因为方便。但理论上知道公式用excel也可以算。
    计算Fst,Pi,Tajimi'D based on SNP vcf file
    Fst详解(具体计算步骤)
    【群体遗传学】 π (pi)的计算

    Fst

    需要分成至少两个群体 每个群体一个文件,可以放所有的5个分类群体,也可以放几个感兴趣的群体。 vcftools 群体文件 只要一列

    bash代码

    vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --out combine_base_filtered_indel.fst ###单位点计算
    
    vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --fst-window-size 5000--out combine_base_filtered_indel.fst ###按windows计算 单位是bp
    

    也可以利用plink 计算单位点的Fst。

    total_pop<-length(fst_pop)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
    system(total_cmd,intern = FALSE)
    
    ###
    for(p in fst_pop){
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
    }
    
    temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
    fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
    fst[,1] <- temp_table$FST
    
    for (i in 1:total_pop) {
        temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
        fst[,i+1] <- temp$FST
    }
    
    temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
    temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)
    
    names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
    write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
    
    

    结果和vcftools是一样的

    pi

    用vcftools计算 可以单位点计算 也可以按windows 计算
    一般认为按windows更合理,如果研究目标是SNP,还是单点算合理。
    在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt

    vcftools --vcf combine_base_filtered_indel.vcf --site-pi --out combine_base_filtered_indel.pi ###单位点
    vcftools --vcf combine_base_filtered_indel.vcf  --window-pi-step 3000  --out combine_base_filtered_indel ###3000个snp为一个windows
    

    双等位的基因 pi值计算较简单,参考下上面计算fst的代码就可以了

    total_pop<-length(pop_name)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
    system(total_cmd,intern = FALSE)
    
    ###
    for(p in pop_name){
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
    }
    
    temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
    pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
    pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)
    
    for (j in 1:total_pop) {
        temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
        pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
    }
    
    temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
    temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)
    
    names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
    write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
    
    pi结果比对

    结果基本相同。

    Tajima's D 中性检测的指标

    在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt

    vcftools --vcf combine_base_filtered_indel.vcf --TajimaD 5000 --out combine_base_filtered_indel.TajimaD ###按5000bp计算
    

    这个用R计算有点麻烦,但是用Tassel计算很方便,把vcf输入就可以了。


    Tassel

    单倍型分析

    这里都拿到SNP数据集了,也就可以进行单倍型分析了,这里用的是Haploview,要先装java环境(但不支持1.8以上版本),有GUI比较友好。输入格式可以用plink格式(ped/map)。
    但Haploview只能做SNP的分析,数量也不能太多。用前面的步骤生成了ped/info 文件,按linkage format 输入就可以了。
    这里是利用了admix群体的48个SNP生成的结果。

    LD 单倍型
    但是我想要画单倍型的地图,这个软件不太方便,最后我选择了用beagle+dnasp+popart来做单倍型分型和画图。
    beagle:用做单倍型推断和填充,需要java环境。
    dnasp:用作单倍型分型,生成nex文件。
    popart:用作单倍型画图,如果有地理信息,也可以加上去。
    由于单倍型分析不允许有缺失,所以需要基因型填充,可以利用上面的vcf文件作为输入文件,输出为vcf.gz,解压后得到的vcf文件可以到后续利用。定相之后vcf里的"/"会变为"|"。另一个原因是3K水稻数据中并没有说明bed中的SNP是定相的,这里为了保险一点。
    填充后的vcf
    在powershell中输入命令
    # gt 是需要的vcf文件,按基因位置分割 out 是输出文件前缀
    java -Xss5m   -jar .\beagle.28Jun21.220.jar gt="C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_target_all.vcf" out=phased
    

    得到定相后的vcf文件然后生成fasta文件,由于dnasp的输入fasta文件需要等长,这里对有indel的地方加了"-"。每个样本能生成两条单链。
    这里用R解决

    phase_file <- "D:\\beagle\\phased.vcf" #vcf文件位置
    out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
    g <- "g" #输出前缀
    
    vcf <- data.table::fread(phase_file,header = T)
    
    temp <- c()
    for (i in 1:nrow(vcf)) {
        temp_value <- vcf[i,10:ncol(vcf)]
        ref <- vcf[i,4]
        alt <- vcf[i,5]
        if(nchar(alt)>nchar(ref)){
        alt <- substr(alt,2,nchar(alt))
        ref <- paste(rep("-",nchar(alt)),collapse = "")
        }
        if(nchar(ref)>nchar(alt)){
            ref <- substr(ref,2,nchar(ref))
            alt <- paste(rep("-",nchar(ref)),collapse = "")
        }
        temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
        temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
        temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
        temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
        temp <- rbind(temp,temp_value)
    }
    temp <- as.data.frame(temp)
    
    cat("",file=paste0(out_name,g,".fasta"),append = FALSE)
    
    for (i in 1:ncol(temp)) {
        c1 <- c()
        c2 <- c()
        for (j in 1:nrow(temp)) {
            c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
            c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
        }
        cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
        cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
        cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
        cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    }
    
    fasta
    dnasp

    输入fasta后,Generate 选Haplotype 生成单倍型,得到单倍型文件(nex)


    结果
    这里一共是得到315个单倍型,但大部分都只有1个样本。

    由于需要统计样本的信息,这里把nex的freq字段另存为test.txt(),用来做统计,并生成TraitLabels字段,这里需要样本的分类信息plink_cluster.txt。

    clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
    hap <- read.table("test.txt",header = F,sep = ":")
    hap[,1] <- gsub("\\[","",hap[,1])
    hap[,2] <- gsub("\\]","",hap[,2])
    
    clut_hap <- clust[,2:3]
    
    tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
    hap[,3] <-t(tem)
    hap[,3] <- gsub("\\.1","",hap[,3])
    hap[,3] <- gsub("\\.2","",hap[,3])
    
    for (i in 1:nrow(clust)) {
        flag <- 0
        v1 <- 0
        v2 <- 0
        for (j in 1:nrow(hap)) {
            if (flag==2) break
            if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
                idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
                if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
                if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
                if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
            }
        }
        clut_hap[i,3] <- v1
        clut_hap[i,4] <- v2
    }
    
    write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
    names(clut_hap) <- c("ID","group"," "," ")
    tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
    tem <- tem[!is.na(tem[,2]),]
    temp <- table(tem, exclude = 0)
    
    out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
    row.names(out_matrix) <- row.names(temp) 
    names(out_matrix) <- gsub(" ","_",colnames(temp))
    
    out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
    out_matrix <- out_matrix[order(out_matrix$idx),]
    out_matrix <- out_matrix[,-ncol(out_matrix)]
    
    cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
    cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
    cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
    cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
    cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
    cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
    write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)
    
    freq字段 生成文件

    把生成的文件全部内容贴到dnasp生成nex文件底部。
    但我在用popart的时后,文件还是导不进去。发现需要在nex文件删掉一段。



    然后就用popart可视化就好了。

    315个单倍型网络图

    这个太多了,手动去了下,只保留17个单倍型(修改TAXA,CHARACTERS,TRAITS三个字段)。

    17个单倍型网络

    也可以把分类信息改成地理信息


    单倍型地理分布图

    我发现popart计算出的中性突变系数和Tassel是一样的。


    单倍型统计

    都有单倍型信息了也顺便做下方差分析好了。
    这里用RFGB下载到的GL表型试一下。
    正常的关联分析应该用的是混合线性模型。


    Statistical methods of gcHap-based GWAS

    但这里只有一个基因上的单倍型信息,用整体的协方差或亲缘关系矩阵会把模型变复杂,就用简单点的方差分析好了。(这方法有待商榷)

    pheno <- read.table("grain_length.txt",header = F,sep = "\t") #表型文件
    genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t") #单倍型文件
    thr <- 50 #去掉样本少的单倍型
    tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
    colnames(tem) <- c("V1","V3","V2")
    tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
    tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
    colnames(tem) <- c("ID","hap","pheno","cluster")
    freq <- as.data.frame(table(tem$hap))
    pass_filt <-  as.character(freq[freq[,2]>=thr,1])
    data <- tem[which(tem$hap %in% pass_filt),]
    
    library(car)
    library(agricolae)
    mod <- aov(pheno~factor(hap),data) #方差分析
    Anova(mod,type=3)
    oneway.test(pheno~factor(hap),data,var.equal = F)#方差不齐的方差分析
    out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
    out$groups 
    plot(out) 
    
    方差分析结果

    简单的方差分析可得hap_101的粒长较长。

    做个总结

    上述主要是提供了一个在所有7.1m的SNP数据集中,提取部分位点的标记,并计算LD和一些群体遗传指标,基本上是R+plink就解决了。
    当然如果手头上有一些品种的重测序数据,也可以和3k水稻的SNP数据集放到一起进行比对,相信这个数据集还有很多信息没有被挖掘,共勉。

    R代码汇总

    提取位点,群体指标计算

    #基本参数
    gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目标区间
    out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
    data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
    pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
    fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
    #质控参数
    maf <- 0.05 
    miss <- 0.4
    
    color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD
    
    target_gene <- read.table(gene_list,header = T)
    extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    ###
    #提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
    #查看样本是否缺少
    tem <- c()
    for (i in 1:length(data_name)) {
        temp <- read.table(paste0(data_name[i],".fam"),header = F)
        temp <- nrow(temp)
        tem <- c(tem,temp)
    }
    #用少了样本的数据集来keep
    keep_file <- data_name[which(tem %in% min(tem))]
    
    for (i in 1:length(data_name)) {
        total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
        system(total_cmd,intern = FALSE)
    }
    temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')
    
    if (length(data_name)>=2){    #合并数据集
        for (i in 2:length(data_name)) {
            temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
            temp_all<-rbind(temp_all ,temp )
        }
        temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
    }
    
    ### 列名
    fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
    vcf_colnames<-c("#CHROM",   "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
    vcf_colnames<-c(vcf_colnames,fam_name)
    ###输出结果,vcf
    names(temp_all)<-vcf_colnames
    write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
    ###筛选
    total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题,还是再转下bed格式
    system(total_cmd,intern = FALSE)
    
    temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
    temp[,1] <- temp[,2]
    write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
    total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
    system(total_cmd,intern = FALSE)
    
    #不同群体提取 按基因提取 
    for(g in target_gene$GENE){
        extract <-dplyr::filter(target_gene ,GENE==g)
        extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
        write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
        
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
        system(total_cmd,intern = FALSE)
        
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g))
        system(total_cmd,intern = FALSE)
        
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
        system(total_cmd,intern = FALSE)
    
        for (p in pop_name) {
            total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
            system(total_cmd,intern = FALSE)
            total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
            system(total_cmd,intern = FALSE)
    ###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
    ###       system(total_cmd,intern = FALSE)
        }
    }
    
    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
    }
    ###
    library(LDheatmap)
    library(genetics)
    
    plot_LD <- function(vcf_file_name,title){
        temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
        if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束
    
        temp_snp <- temp_vcf[0,]
        for (i in 1:nrow(temp_vcf)) {
            temp <- temp_vcf[i,]
            for(j in 10:ncol(temp_vcf)){
                if(temp_vcf[i,j]=="0/0"){
                    temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                    next
                }
                if(temp_vcf[i,j]=="0/1"){
                    temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                    next
                }
                if(temp_vcf[i,j]=="1/1"){
                    temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                    next
                }
                if(temp_vcf[i,j]=="./."){
                    temp[1,j] <- NA
                    next
                }
            }
            temp_snp <- rbind(temp_snp,temp)
        }
    
        SNPpos <- temp_snp$POS
        SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))
    
        ### 转换成LDheatmap能用的格式
        for(i in 1:ncol(SNP)){
            SNP[,i]<-as.genotype(SNP[,i])
        }
        ###画图
        LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
    }
    
    for(g in target_gene$GENE){
        plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
        for (p in pop_name) {
            plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
        }
    }
    ###
    total_pop<-length(fst_pop)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
    system(total_cmd,intern = FALSE)
    
    ###
    for(p in fst_pop){
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
    }
    
    temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
    fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
    fst[,1] <- temp_table$FST
    
    for (i in 1:total_pop) {
        temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
        fst[,i+1] <- temp$FST
    }
    
    temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
    temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)
    
    names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
    write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
    
    ###
    total_pop<-length(pop_name)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
    system(total_cmd,intern = FALSE)
    
    ###
    for(p in pop_name){
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
    }
    
    temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
    pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
    pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)
    
    for (j in 1:total_pop) {
        temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
        pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
    }
    
    temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
    temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)
    
    names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
    write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
    ###
    

    vcf to fasta

    phase_file <- "D:\\beagle\\phased.vcf.vcf"
    g <- g
    
    vcf <- data.table::fread(phase_file,header = T)
    
    temp <- c()
    for (i in 1:nrow(vcf)) {
        temp_value <- vcf[i,10:ncol(vcf)]
        ref <- vcf[i,4]
        alt <- vcf[i,5]
        if(nchar(alt)>nchar(ref)){
        alt <- substr(alt,2,nchar(alt))
        ref <- paste(rep("-",nchar(alt)),collapse = "")
        }
        if(nchar(ref)>nchar(alt)){
            ref <- substr(ref,2,nchar(ref))
            alt <- paste(rep("-",nchar(ref)),collapse = "")
        }
        temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
        temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
        temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
        temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
        temp <- rbind(temp,temp_value)
    }
    temp <- as.data.frame(temp)
    
    cat("",file=paste0(out_name,g,".fasta"),append = FALSE)
    
    for (i in 1:ncol(temp)) {
        c1 <- c()
        c2 <- c()
        for (j in 1:nrow(temp)) {
            c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
            c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
        }
        cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
        cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
        cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
        cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    }
    

    nex traits 文件生成

    clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
    hap <- read.table("test.txt",header = F,sep = ":")
    hap[,1] <- gsub("\\[","",hap[,1])
    hap[,2] <- gsub("\\]","",hap[,2])
    
    clut_hap <- clust[,2:3]
    
    tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
    hap[,3] <-t(tem)
    hap[,3] <- gsub("\\.1","",hap[,3])
    hap[,3] <- gsub("\\.2","",hap[,3])
    
    for (i in 1:nrow(clust)) {
        flag <- 0
        v1 <- 0
        v2 <- 0
        for (j in 1:nrow(hap)) {
            if (flag==2) break
            if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
                idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
                if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
                if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
                if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
            }
        }
        clut_hap[i,3] <- v1
        clut_hap[i,4] <- v2
    }
    
    write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
    names(clut_hap) <- c("ID","group"," "," ")
    tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
    tem <- tem[!is.na(tem[,2]),]
    temp <- table(tem, exclude = 0)
    
    out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
    row.names(out_matrix) <- row.names(temp) 
    names(out_matrix) <- gsub(" ","_",colnames(temp))
    
    out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
    out_matrix <- out_matrix[order(out_matrix$idx),]
    out_matrix <- out_matrix[,-ncol(out_matrix)]
    
    cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
    cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
    cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
    cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
    cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
    cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
    write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)
    

    方差分析

    pheno <- read.table("grain_length.txt",header = F,sep = "\t")
    genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t")
    thr <- 50
    tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
    colnames(tem) <- c("V1","V3","V2")
    tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
    tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
    colnames(tem) <- c("ID","hap","pheno","cluster")
    freq <- as.data.frame(table(tem$hap))
    pass_filt <-  as.character(freq[freq[,2]>=thr,1])
    data <- tem[which(tem$hap %in% pass_filt),]
    
    library(car)
    library(agricolae)
    mod <- aov(pheno~factor(hap),data)
    Anova(mod,type=3)
    oneway.test(pheno~factor(hap),data,var.equal = F)
    out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
    out$groups
    plot(out) 
    

    吐槽一下,简书的这个界面太不好写了,也不好看。代码中难免会有多个空格,逗号不对啥的,请见谅。

    相关文章

      网友评论

        本文标题:3K水稻SNP数据集的简单利用

        本文链接:https://www.haomeiwen.com/subject/gufwwltx.html