3K水稻SNP数据集的简单利用

作者: Nuvolar | 来源:发表于2021-09-10 18:21 被阅读0次

3K水稻SNP数据集的简单利用
使用SNPhylo构建群体系统发生树
计算群体pi
利用RNA数据识别SNP
忽略此条
「群体遗传学实战」第一课: 对SNP位点进行注释
SAS中的数据清洗
KNN算法应用
CS231N学习记录
SR4R数据库：水稻4个SNP集的筛选及其应用

3010份亚洲稻群体重测序项目是由中国农业科学院作物科学研究所牵头，联合国际水稻研究所、华大基因等16家单位共同完成。该研究对来自89个国家的3,010份水稻品种进行了重测序研究（resequence），参考的是日本晴（Nipponbare）基因组。所有3,010个基因组的平均测序深度（average sequencing depth）为14×,平均基因组覆盖率（average genome coverage）和绘图率（average mapping rate）分别为94.0％和92.5％。
该项目的意义自然不用多说，并且该项目的测序数据都可以在国际水稻所或CNGB上下载。我们可以从原始序列开始组装得到大的结构变异或call到一些稀有SNP来做研究。当然以上想法不在本文讨论范围内，本文主要讨论只想知道几个基因区域上 SNP 及其分化系数等群体指标的计算。

2021.09.15 更新了单倍型分型、作图和方差分析。

放在前面

RFGB可以极大程度上满足查找某个基因上的SNP&indel和单倍型查找。并提供15个表型可以进行单倍型分析，也可以上传表型进行分析。接下来探索一下。

单倍型分析
这个页面提供单倍型分析。分别可以用基因名，染色体位置和特定的SNP来进行分析。提供MAF是最小等位基因频率，建议选一下。可以用全部的3024个品种进行查找也可以分亚群进行查找,也提供了表型关联分析。

结果

结果
这里是用的LOC_Os08g33530 CDS 区域，选的籼稻群体，共找到20个SNP，和16个单倍型。其中hap11的谷粒长最高。
这个网站用来做单倍型都是SNP，而且是复等位的，就是没找到在哪下数据集。
可惜没有LD的图和pi值。
但我找到了水稻单倍型的数据，具体可看文章
The landscape of gene–CDS–haplotype diversity in rice: Properties, population organization, footprints of domestication and breeding, and implications for genetic improvement
数据地址

也可以用这个单倍型数据和表型做关联分析，用方差分析就可以。表型数据下载

以下主要是3k构建的SNP的探索以及简单利用。

工具准备

R
R package(LDheatmap,genetics)
plink1.9(3个操作系统版本都有)下载地址
vcftools(只有liunx版本，只是用来算群体遗传统计)下载地址

数据下载

可以在IRRI上下载到双等位SNP&indel数据（多等位的SNP貌似只能从头call，没找到下载的地方)。bed、bim、fam文件为一组，bed 是存放位点信息的二进制格式文件,用文本打开会发现是乱码；bim是存放位点信息的文件，类似于map格式；fam是存放样本信息的文件，第一列为FID,第二列为IID。

bim

fam
Nipponbare_indel 是双等位indel数据集，共有2.3m个，但是是没有经过筛选的，后续需要加工下。
双等位的SNP标记共5个。
NB_final_snp 包含了所有的SNP，共29m个，是没有经过筛选的。
Base 的SNP是经过基础筛选的，共18m个。
base_filtered_v0.7 是从Base的SNP进一步筛选的，共4.2m个，本文主要利用这个数据集。
pruned_v2.1 是2.1版本的筛选方案，包含1m个SNP。
base_filtered_v0.7_0.8_10kb_1_0.8_50_1 是在base_filtered_v0.7 的基础上进一步筛选共 404k 个，在用的时候发现很多基因区域上都没有SNP，毕竟个数少，比较稀疏。
各个数据集的筛选方法可以看看里边的readme文件，不同的筛选方案主要目的就是提高精度，找到因果变异。

准备目标基因的位置

其实只需要染色体号和起始位置就好了，基因名字写啥都行，就是拿来做图片标题的。忘记提了，这个数据集中没有线粒体和叶绿体上的标记。建议起始位置可以把上游和下游都囊括进去，多个2kb，3kb 都是可以的，多找几个总是好的，万一和表型关联上了呢。

基因列表，以tab分割

合并想要的SNP数据集&indel数据集

其实本可以一个个数据集来，但我想把2.3m个indel和4.2m个SNP标记合在一起分析咋办呢。最简单的办法还是用vcftools或bcftools合并两个数据集。可以参考这篇。bash代码如下：

plink --bfile base_filtered_v0.7 --recode vcf-iid --out base_filtered_v0.7 #先转到vcf格式
plink --bfile Nipponbare_indel --recode vcf-iid --out Nipponbare_indel
###由于Nipponbare_indel 中少一个样本 IRIS_313-8921 并且这个样本在官网给的分类为NA，空值。所以就需要删掉它
vcftools --vcf base_filtered_v0.7.vcf --recode --recode-INFO-all --stdout --indv IRIS_313-8921 > rm_na_base_filtered_v0.7.vcf

bcftools view Nipponbare_indel.vcf -Oz -o Nipponbare_indel.vcf.gz  #构建引索
bcftools index Nipponbare_indel.vcf.gz 

bcftools view rm_na_base_filtered_v0.7.vcf -Oz -o rm_na_base_filtered_v0.7.vcf.gz
bcftools index rm_na_base_filtered_v0.7.vcf.gz

bcftools concat rm_na_base_filtered_v0.7.vcf.gz Nipponbare_indel.vcf.gz -a -O v -o combine_base_filtered_indel.vcf #合并两个vcf文件

plink --vcf combine_base_filtered_indel.vcf --make-bed --out  combine_base_filtered_indel #再转回bed格式 主要是这样占内存小点，这个vcf文件有55G ，bed文件只有5.23G

那么就不想用vcftools合并两个数据集咋办呢？有些时候并不想得到所有位点的信息，只需要一部分就可以了。其实可以用R和plink来实现合并两个数据集，并只提取几个位点，这种方案内存代价小点，且看后续讲解。

准备分类文件

群体分类文件如下

原始文件

这个分类太细了，给它分成五类就可以了，GJ、XI、admix、aus、bas，用excel的分列就能完成，需要把IRIS_313-8921去掉，这个分类是NA。当然也可以用自定义的分类。

整合成plink能识别的分类格式，分别是FID,IID,class

plink 格式

突然想起踩到的一个坑，plink1.9和2在筛选的时候无法识别样本名称带有"-"字符，所以需要把分类文件里 "-"换成"_" ,"IRIS_313-8921"换成"IRIS_313_8921"如用 excel 就可以完成，对了fam文件里也要换。

目标区域SNP位点提取

首先需要把plink.exe和分类文件(plink_cluster.txt)放到工作目录下，方便调用，也可以把它放到环境目录下。

R代码如下：

#基本参数
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目标区间
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
#质控参数
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
#提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
#查看样本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了样本的数据集来keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并数据集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",   "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###输出结果，vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###筛选
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题，还是再转下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群体提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --make-bed --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###        system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

画LD heatmap

画LD用Tassel就可以话，上述代码也生成了vcf格式文件可以直接在Tassel中打开。这里用下LDheatmap。LDheatmap具体的教程比较多，自由度也比较高，可自行找些教程。当然也有在线能画LD图的，就是有点麻烦。这里具体就是如何将vcf或bed格式文件转到LDheatmap可用的格式。R代码如下：

library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 转换成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###画图
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}

结果

计算群体遗传性指标

首先推荐用vcftools来计算，因为方便。但理论上知道公式用excel也可以算。
计算Fst,Pi,Tajimi'D based on SNP vcf file
Fst详解（具体计算步骤）
【群体遗传学】 π （pi）的计算

Fst

需要分成至少两个群体每个群体一个文件，可以放所有的5个分类群体，也可以放几个感兴趣的群体。

vcftools 群体文件只要一列

bash代码

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --out combine_base_filtered_indel.fst ###单位点计算

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --fst-window-size 5000--out combine_base_filtered_indel.fst ###按windows计算 单位是bp

也可以利用plink 计算单位点的Fst。

total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果

结果和vcftools是一样的

pi

用vcftools计算可以单位点计算也可以按windows 计算
一般认为按windows更合理，如果研究目标是SNP，还是单点算合理。
在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --site-pi --out combine_base_filtered_indel.pi ###单位点
vcftools --vcf combine_base_filtered_indel.vcf  --window-pi-step 3000  --out combine_base_filtered_indel ###3000个snp为一个windows

双等位的基因 pi值计算较简单，参考下上面计算fst的代码就可以了

total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果

pi结果比对

结果基本相同。

Tajima's D 中性检测的指标

在某个或多个群体中计算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --TajimaD 5000 --out combine_base_filtered_indel.TajimaD ###按5000bp计算

这个用R计算有点麻烦，但是用Tassel计算很方便，把vcf输入就可以了。

Tassel

单倍型分析

这里都拿到SNP数据集了，也就可以进行单倍型分析了，这里用的是Haploview，要先装java环境（但不支持1.8以上版本），有GUI比较友好。输入格式可以用plink格式(ped/map)。
但Haploview只能做SNP的分析，数量也不能太多。用前面的步骤生成了ped/info 文件，按linkage format 输入就可以了。
这里是利用了admix群体的48个SNP生成的结果。

单倍型
但是我想要画单倍型的地图，这个软件不太方便，最后我选择了用beagle+dnasp+popart来做单倍型分型和画图。
beagle：用做单倍型推断和填充，需要java环境。
dnasp：用作单倍型分型，生成nex文件。
popart：用作单倍型画图，如果有地理信息，也可以加上去。
由于单倍型分析不允许有缺失，所以需要基因型填充，可以利用上面的vcf文件作为输入文件，输出为vcf.gz，解压后得到的vcf文件可以到后续利用。定相之后vcf里的"/"会变为"|"。另一个原因是3K水稻数据中并没有说明bed中的SNP是定相的，这里为了保险一点。

填充后的vcf
在powershell中输入命令

# gt 是需要的vcf文件，按基因位置分割 out 是输出文件前缀
java -Xss5m   -jar .\beagle.28Jun21.220.jar gt="C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_target_all.vcf" out=phased

得到定相后的vcf文件然后生成fasta文件，由于dnasp的输入fasta文件需要等长,这里对有indel的地方加了"-"。每个样本能生成两条单链。
这里用R解决

phase_file <- "D:\\beagle\\phased.vcf" #vcf文件位置
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
g <- "g" #输出前缀

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}

fasta

dnasp

输入fasta后，Generate 选Haplotype 生成单倍型，得到单倍型文件(nex)

结果
这里一共是得到315个单倍型，但大部分都只有1个样本。

由于需要统计样本的信息，这里把nex的freq字段另存为test.txt()，用来做统计，并生成TraitLabels字段，这里需要样本的分类信息plink_cluster.txt。

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)

freq字段

生成文件

把生成的文件全部内容贴到dnasp生成nex文件底部。
但我在用popart的时后，文件还是导不进去。发现需要在nex文件删掉一段。

前

后

然后就用popart可视化就好了。

315个单倍型网络图

这个太多了，手动去了下，只保留17个单倍型(修改TAXA,CHARACTERS,TRAITS三个字段)。

17个单倍型网络

也可以把分类信息改成地理信息

单倍型地理分布图

我发现popart计算出的中性突变系数和Tassel是一样的。

单倍型统计

都有单倍型信息了也顺便做下方差分析好了。
这里用RFGB下载到的GL表型试一下。
正常的关联分析应该用的是混合线性模型。

Statistical methods of gcHap-based GWAS

但这里只有一个基因上的单倍型信息，用整体的协方差或亲缘关系矩阵会把模型变复杂，就用简单点的方差分析好了。（这方法有待商榷）

pheno <- read.table("grain_length.txt",header = F,sep = "\t") #表型文件
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t") #单倍型文件
thr <- 50 #去掉样本少的单倍型
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data) #方差分析
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)#方差不齐的方差分析
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
out$groups 
plot(out)

方差分析结果

简单的方差分析可得hap_101的粒长较长。

做个总结

上述主要是提供了一个在所有7.1m的SNP数据集中，提取部分位点的标记，并计算LD和一些群体遗传指标，基本上是R+plink就解决了。
当然如果手头上有一些品种的重测序数据，也可以和3k水稻的SNP数据集放到一起进行比对，相信这个数据集还有很多信息没有被挖掘，共勉。

R代码汇总

提取位点，群体指标计算

#基本参数
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目标区间
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #输出文件前缀
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP数据集,输入多个就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群体
fst_pop <- c("GJ XI","admix aus bas") #计算fst需要的群体 需在pop_name出现过
#质控参数
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)

###
#提取目标区域位点 并输出vcf文件 可以用Tassel 看LD
#查看样本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了样本的数据集来keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并数据集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色体和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",   "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###输出结果，vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###筛选
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf筛选会有点问题，还是再转下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群体提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###       system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}
###
library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###没有就结束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 转换成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###画图
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}
###
total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果

###
total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存结果
###

vcf to fasta

phase_file <- "D:\\beagle\\phased.vcf.vcf"
g <- g

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}

nex traits 文件生成

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存结果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)

方差分析

pheno <- read.table("grain_length.txt",header = F,sep = "\t")
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t")
thr <- 50
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data)
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡数据的多重比较
out$groups
plot(out)

吐槽一下，简书的这个界面太不好写了，也不好看。代码中难免会有多个空格，逗号不对啥的，请见谅。

3K水稻SNP数据集的简单利用
3010份亚洲稻群体重测序项目是由中国农业科学院作物科学研究所牵头，联合国际水稻研究所、华大基因等16家单位共同完...
使用SNPhylo构建群体系统发生树
SNPhylo是一个基于SNP数据构建群体系统发生树的流程工具，它能接受VCF,HapMap和简单SNP数据格式作...
计算群体pi
1、提取3k材料的名称 ''' bcftools query -l 3k.snp.vcf > 3k.list ''...
利用RNA数据识别SNP
今天跟大家分享一篇16年发在核酸研究上的在不借助参考基因组的情况下，利用转录组数据进行SNP分析的文章。一、摘要...
忽略此条
image.png 4.去除一直数据集snp. http://xiajingbo.weebly.com/uploa...
「群体遗传学实战」第一课: 对SNP位点进行注释
我们用于实战的数据集来自于2019年发表于NG的西瓜文章，它提供了GATK过滤后的SNP数据集用于分析，并且附录提...
SAS中的数据清洗
1.数据合并（1）利用set语句纵向合并 data数据集; set数据集1（数据集选项）数据集2（数据集选项...
KNN算法应用
1. 利用Iris数据集来使用KNN算法 1.1 Iris数据集介绍 Iris数据集是常用的分类实验数据集，由F...
CS231N学习记录
数据集：训练集+验证集+测试集交叉验证：当训练数据太小时，为了更好地利用数据，那么将训练数据集划分成n份，其中n...
SR4R数据库：水稻4个SNP集的筛选及其应用
前言王向峰老师2020年发表在《Genomics Proteomics Bioinformatics》（IF=6...

3K水稻SNP数据集的简单利用

放在前面

工具准备

数据下载

准备目标基因的位置

合并想要的SNP数据集&indel数据集

准备分类文件

目标区域SNP位点提取

画LD heatmap

计算群体遗传性指标

Fst

pi

Tajima's D 中性检测的指标

单倍型分析

做个总结

R代码汇总

相关文章

3K水稻SNP数据集的简单利用

使用SNPhylo构建群体系统发生树

计算群体pi

利用RNA数据识别SNP

忽略此条

「群体遗传学实战」第一课: 对SNP位点进行注释

SAS中的数据清洗

KNN算法应用

CS231N学习记录

SR4R数据库：水稻4个SNP集的筛选及其应用

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

生信分析工具

数量遗传或生统

R