GWAS | 2.群体结构 structure

作者: iBioinformatics | 来源:发表于2023-05-10 09:13 被阅读0次

全基因组关联分析（GWAS）-群体结构与亲缘关系
群体结构structure
GWAS基石---混合线性模型
GWAS分析-PCA
【群体结构】CLUMPP软件使用
【群体结构】structure结构图绘制
5. GWAS：群体结构——Admixture
群体遗传 || Pophelper-群体结构可视化学习
c结构体
使用ggplot2绘制折线图

严格意义上说遗传进化中的群体结构应该包括PCA、群体进化树和群体STRUCRURE，传统的PCA和群体进化树只能直观的了解群体中个体的分类关系，但无法解释群体间具体的分群数目和基因交流情况。这就需要群体STRUCRURE作为参考，群体结构分析能够提供个体的血统来源及其组成信息，是一种重要的遗传关系分析手段。目前经典的群体结构分析软件有structure和admixture，而我们常常将群体结构图称为STRUCRURE图。

structure是与PCA、进化树相似的方法，就是利用分子标记的基因型信息对一组样本进行分类，分子标记可以是SNP、indel、SSR。相比于PCA，进化树，群体结构分析可明确各个群之间是否存在交流及交流程度

群体遗传结构 (structure) 指遗传变异在物种或群体中的一种非随机分布。按照地理分布或其他标准可将一个群体分为若干亚群，处于同一亚群内的不同个体亲缘关系较高，而亚群与亚群之间则亲缘关系较远。

群体结构，又称群体分层，指所研究的群体中存在基因频率不同的亚群。

群体结构分析有助于理解进化过程，并且可以通过基因型和表型的关联研究确定个体所属的亚群。

1、群体结构影响因素

哈迪温伯格平衡：如果一个群体符合特定条件，这个群体中各个等位基因、基因型以及表型的比例可以从一代传到下一代维持不变，即世代保持不变。

特定条件指：

1、种群无限大，群体有一个数目巨大的个体组成；
2、群体中个体间可以随机交配，任何个体都有同等的机会与异性进行交配，不受任何遗传学或行为学的限制；
3、群体中的基因库中，没有新的突变产生；
4、群体中的个体没有迁出，也没有迁入，即没有个体迁徙；
5、所有基因型的个体都有同等的机会存活和繁殖后代，没有任何形式的自然选择。

哈迪温伯格平衡两大特点：

1、等位基因的频率在世代遗传中稳定不变；
2、无论群体起始的等位基因频率如何，只要经过一代的随机交配，群体的基因频率和等位基因频率就可以达到平衡状态。

影响群体平衡（哈迪温伯格平衡）的主要因素有突变、选择、迁徙、遗传漂变、交配系统等。

2 群体结构原理及应用

基本原理：
将群体分成 K 个服从 Hardy-Weinberger(HWE) 平衡的亚群，每个材料归到各个亚群，计算每个材料其基因组变异源于第 K 个亚群的可能性(Pritchard et al.2000)，主要用 Q 矩阵表示，Q 值越大，表明某个材料来自某个亚群的可能性越大。无论是structure和admixture 都是这个原理

研究意义:
1、了解研究材料的分群情况，推断祖先群体，评判最优分群;
2、辅助控制关联分析结果的假阳性。多个亚群混合成一个较大群体后，由于各个亚群基因频率不同，群体 LD 发生变化，导致GWAS 分析结果出现假阳性；
3、杂交事件/个体祖先来源。

3 群体结构格式转换

structure 和 admixture 两款群体结构分析的软件都有其特定的格式要求，为了下游分析，需要将 VCF 格式的基因型文件转换为两款软件指定的格式。

1 、添加 ID 信息

perl add_vcf_ID.pl Test.vcf Test_addID.vcf

2、根据 LD( 连锁不平衡), 针对 SNP 位点进行过滤,保留 unlinked SNP 位点

plink \
   --vcf Test_addID.vcf  \
   --indep-pairwise 50 10 0.2  \
   --out Test_addID.maf0.05.int0.8  \
   --maf 0.05  \
   --geno 0.2  \
   --allow-extra-chr

3 、提取Test_addID.maf0.05.int0.8.prune.in 中的标记

plink  \
   --vcf Test_addID.vcf  \
   --extract Test_addID.maf0.05.int0.8.prune.in \
   --out Test_addID.maf0.05.int0.8.prune.in \
   --recode vcf-iid  \
   --allow-extra-chr

4、将筛选的位点转换为 structure 和和 admixture 格式
structure 格式:

plink   \
   --vcf Test_addID.maf0.05.int0.8.prune.in.vcf   \
   --recode structure  \
   --out Test_addID.maf0.05.int0.8.prune.in   \
   --allow-extra-chr

admixture 格式:

plink   \
   --vcf Test_addID.maf0.05.int0.8.prune.in.vcf   \
   --recode 12   \
   --out Test_addID.maf0.05.int0.8.prune.in   \
   --allow-extra-chr

Test_addID.maf0.05.int0.8.prune.in.recode.strct_in 即是structure 要求的格式文件
Test_addID.maf0.05.int0.8.prune.in.ped 即是admixture 要求的格式文件

4 格式介绍

structure 格式：

第一行代表 SNP 的 ID 信息;
第二行代表相邻标记间的距离;
第一列代表样品信息;
第二列代表样品其它信息。
备注:不同软件转换格式可能不一样。

admixture格式：

第一列:Family ID
第二列:Individual ID
第三列:Paternal ID
第四列:Maternal ID
第五列:Sex (1=male; 2=female; other=unknown)
第六列:Phenotype(-9 表示缺失)
图片中 1 2 代表基因型,其中每个标记占两列(二等位),纯合为 1 1 或者 2 2,杂合为 1 2。

5 structure 分析

网址：http://web.stanford.edu/group/pritchardlab/software.html

structure软件两个配置文件：

mainparams_structure.cfg
extraparams_structure.cfg

主配置文件：mainparams_structure.cfg

Basic Program Parameters
#define MAXPOPS 2 // (int) number of populations assumed ### k value
#define BURNIN 5000 // (int) length of burnin period ### at least > 5000, general 30000, maybe
#define NUMREPS 50000 // (int) number of MCMC reps after burnin ### at least > 5000, usually 10000, maybe

Input/Output files
#define INFILE /home/dengdj/bin/Bio_soft/Structure/structure2.3.1_console/testdata1 // (str) name of input data file
#define OUTFILE results //(str) name of output data file

Data file format
#define NUMINDS 200 // (int) number of diploid individuals in data file
#define NUMLOCI 5 // (int) number of loci in data file
#define PLOIDY 2 // (int) ploidy of data
#define MISSING 0 // (int) value given to missing genotype data
#define ONEROWPERIND 1 // (B) store data for individuals in a single line
#define LABEL 1 // (B) Input file contains individual labels
#define POPDATA 1 // (B) Input file contains a population identifier
#define POPFLAG 0 // (B) Input file contains a flag which says

whether to use popinfo when USEPOPINFO==1
#define LOCDATA 0 // (B) Input file contains a location identifier
#define PHENOTYPE 0 // (B) Input file contains phenotype information
#define EXTRACOLS 0 // (int) Number of additional columns of data

before the genotype data start.
#define MARKERNAMES 1 // (B) data file contains row of marker names
#define RECESSIVEALLELES 0 // (B) data file contains dominant markers (eg AFLPs)
// and a row to indicate which alleles are recessive
#define MAPDISTANCES 1 // (B) data file contains row of map distances
// between loci

Advanced data file options
#define PHASED 0 // (B) Data are in correct phase (relevant for linkage model only)
#define PHASEINFO 0 // (B) the data for each individual contains a line

indicating phase (linkage model)
#define MARKOVPHASE 0 // (B) the phase info follows a Markov model.
#define NOTAMBIGUOUS -999 // (int) for use in some analyses of polyploid data

一般情况下，主配置文件 mainparams_structure.cfg 红色标准的需要检查外,其它默认即可。

extraparams_structure.cfg 可以直接使用官方提供的，内容也可以设置为空。

structure 主要参数如下:
-m mainparams 主配置文件
-e extraparams 其它配置文件
-k 推断分群数目
-i 输入文件输入 structure 格式文件
-o 输出文件
-L 位点数,分析的位点数目
-N 样本数
-D 随机种子

命令投递：

输出结果：

6 admixture 分析

网址：http://dalexander.github.io/admixture/index.html
admixtrue 是一款运算速度非常快的群体结构分析软件，操作简单。需要提供的文件是plink中的ped格式。

命令行（批量输出）：

for i in $(seq 1 10); 
    do echo "admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped $i -j5 > test_k_$i.log"; 
done

输出结果：

admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 1 -j5 > test_k_1.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 2 -j5 > test_k_2.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 3 -j5 > test_k_3.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 4 -j5 > test_k_4.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 5 -j5 > test_k_5.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 6 -j5 > test_k_6.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 7 -j5 > test_k_7.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 8 -j5 > test_k_8.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 9 -j5 > test_k_9.log
admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 10 -j5 > test_k_10.log

-C:收敛阈值,设置 0.01
--cv: 不需要提供参数或者数据(默认交叉验证 5 次)
Test_addID.maf0.05.int0.8.prune.in.ped：即admixtrue分析需要的基因型数据
-j5: 表示分析采用5个线程
test_k_*.log:输出的日志文件,里面记录了交叉验证错误率,用来推断最优分群。

7 结果分析

7.1 structure 结果分析

一般情况下,structure分析完毕之后，最优分群主要是基于deltak进行判断。而deltak值的获取，每个k值至少重复3次。

目前在线工具structureHarvester 可以基于structure分析的原始结果进行最优分群数目的推断。

登录网站，上传*.zip格式的压缩文件（即将原始结果zip格式压缩），然后点击Harvest，执行一段时间，即可返回结果。

群体结构图形绘制
R 语言包安装：

install.packages("devtools")
devtools::install_github('royfrancis/pophelper')

备注：最新版本2.3.1

library(pophelper)
files <- list.files(path = ".",full.names = T,pattern = "_f$")
slist <- readQ(files = files, indlabfromfile = T,filetype = "structure")
xlist <- alignK(slist)
xlist2 <- mergeQ(xlist)
plotQ(xlist2, imgoutput = "join",imgtype = "pdf",width = 20,height = 1.2,
showindlab = T,useindlab = T,indlabhjust = 0.5,indlabangle = 0,
indlabsize = 1.2,sortind = "all",sharedindlab = F,
panelspacer = 0.02, indlabspacer = 0,indlabcol = "black",
showtitle = T,titlelab = "",titlecol = "black",
titlehjust = 0.5,showsp = T,
sppos = "left",splab = paste0("K=",1:11),splabangle = 180,dpi = 600,
outputfilename = paste0("structure_barplot"),units = "cm",
exportplot = T,exportpath = getwd())

##draw by group
pop <- read.table("group.txt",header = T,sep = "\t",stringsAsFactors =
F,row.names = 1)
plotQ(xlist2,imgoutput = "join",imgtype = "png",height = 1.2,width = 20,
showindlab = T,useindlab = T,sharedindlab = T,
showsp = T,sppos = "left",splab = paste0("K=",1:11),splabangle = 180,
showlegend = F,ordergrp = T,grplab = pop,
subsetgrp = LETTERS[1:9],dpi = 600,
outputfilename = paste0("structure_barplot_pop"),units = "cm",
exportpath = getwd())

输出最优分群：

write.table(x = xlist2[[11]],file = "Q_score.bestk.txt",append = F,quote = F,
row.names = T,col.names = T,sep = "\t")

pophelper 进行最优分群推断-evanno 法

library(pophelper)
files <- list.files(path = ".",full.names = T,pattern = "_f$")
slist <- readQ(files = files, indlabfromfile = T,filetype = "structure")
tabulateQ(slist, writetable = F, sorttable = TRUE)

summariseQ(tabulateQ(slist, writetable = F, sorttable = TRUE))

evannoMethodStructure(summariseQ(tabulateQ(slist, writetable = F, sorttable =
TRUE)),writetable = F,exportplot = F)

7.2 admixture 结果分析

admixture分析基于交叉验证错误率的结果进行评断最优分群。当然，目前很多文章也会利用admixture软件重复运行k值（>=3），然后基于evanno方法进行最优k值判断。

如果k值仅重复运行一次，则需要CV法判断最优分群。CV值记录在运行的结果日志文件中，可以通过捕获的方法批量查询：

grep -h 'CV' *.log

图形绘制：

x <- 1:10
y<-c(0.52940,0.51765,0.51313,0.52238,0.53772,0.56136,0.59252,0.61948,0.662
29,0.71867)
par(mgp = c(5,1,0),mar = c(8,8,4,4))
plot(x,y,type="b",xlab="K value",ylab="Cross-Validation (CV) errors",
lab = c(10,5,1),
col=c(rep("black",2),"red",rep("black",7)),
bg = c(rep("black",2),"red",rep("black",7)),
bty = "l",las = 1,cex.lab = 2,cex.axis = 1.5,cex = 2,
pch = 21,lwd = 2,font = 2,font.lab = 2,font.axis = 2)
box(lwd = 3,bty = "l")

群体结构图绘制

sfile = paste0("Test_addID.maf0.05.int0.8.prune.in.",1:10,".Q")
slist <- readQ(sfile,indlabfromfile = T)
sample <- read.table("sampleID.txt",header = F,sep = "\t",stringsAsFactors = F)[,1]
for(i in 1:length(slist)){
rownames(slist[[i]]) <-sample
}
plotQ(slist, imgoutput = "join",imgtype = "png",width = 20,height = 1.2,
showindlab = T,useindlab = T,indlabhjust = 0.5,indlabangle = 0,
indlabsize = 1.2,sortind = "all",sharedindlab = F,
panelspacer = 0.02, indlabspacer = 0,indlabcol = "black",
showtitle = T,titlelab = "",titlecol = "black",
titlehjust = 0.5,showsp = T,
sppos = "left",splab = paste0("K=",1:10),splabangle = 180,dpi = 600,
outputfilename = paste0("admixture_barplot"),units = "cm",
exportplot = T,exportpath = getwd())
##draw by group
pop <- read.table("group.txt",header = T,sep = "\t",stringsAsFactors = F,row.names = 1)
plotQ(slist,imgoutput = "join",imgtype = "png",height = 1.2,width = 20,
showindlab = T,useindlab = T,sharedindlab = T,
showsp = T,sppos = "left",splab = paste0("K=",1:10),splabangle = 180,
showlegend = F,ordergrp = T,grplab = pop,
subsetgrp = LETTERS[1:9],dpi = 600,
outputfilename = paste0("admixture_barplot_pop"),units = "cm",
exportpath = getwd())

pophelper 参数说明

补充- 高级分析

clist<-list("shiny"=c("#1D72F5","#DF0101","#77CE61","#FF9326","#A945FF","#0089B2","#FDF060","#FFA6B2",
"#BFF217","#60D5FD","#CC1577","#F2B950","#7FB21D","#EC496F","#326397","#B26314","#027368","#A4A4A
4","#610B5E"),
"strong"=c("#11A4C8","#63C2C5","#1D4F9F","#0C516D","#2A2771","#396D35","#80C342","#725DA8","#B620
25","#ED2224","#ED1943","#ED3995","#7E277C","#F7EC16","#F8941E","#8C2A1C","#808080"),
"oceanfive"=c("#00A0B0", "#6A4A3C", "#CC333F", "#EB6841", "#EDC951"),
"keeled"=c("#48B098", "#91CB62", "#FFEE3B", "#FB9013", "#FF3C28"),
"vintage"=c("#400F13", "#027368", "#A3BF3F", "#F2B950", "#D93A2B"),
"muted"=c("#46BDDD","#82DDCE","#F5F06A","#F5CC6A","#F57E6A"),
"teal"=c("#CFF09E","#A8DBA8","#79BD9A","#3B8686","#0B486B"),
"merry"=c("#5BC0EB","#FDE74C","#9BC53D","#E55934","#FA7921"),
"funky"=c("#A6CEE3", "#3F8EAA", "#79C360", "#E52829", "#FDB762","#ED8F47","#9471B4"),
"retro"=c("#01948E","#A9C4E2","#E23560","#01A7B3","#FDA963","#323665","#EC687D"),
"cb_paired"=c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#C
AB2D6","#6A3D9A","#FFFF99","#B15928"),
"cb_set3"=c("#8DD3C7","#FFFFB3","#BEBADA","#FB8072","#80B1D3","#FDB462","#B3DE69","#FCCDE5","#D9
D9D9","#BC80BD","#CCEBC5","#FFED6F"),
"morris"=c("#4D94CC","#34648A","#8B658A","#9ACD32","#CC95CC","#9ACD32","#8B3A39","#CD6601","#CC
5C5B","#8A4500"),
"wong"=c("#000000","#E69F00","#56B4E9","#009E73","#F0E442","#006699","#D55E00","#CC79A7"),
"krzywinski"=c("#006E82","#8214A0","#005AC8","#00A0FA","#FA78FA","#14D2DC","#AA0A3C","#FA7850","#
0AB45A","#F0F032","#A0FA82","#FAE6BE"))
# add length of palettes
lengths <- sapply(clist,length)
names(clist) <- paste0(names(clist),"_",lengths)
par(mar=c(0.2,6,0.2,0))
par(mfrow=c(length(clist),1))
for(i in 1:length(clist)){
barplot(rep(1,max(lengths)),
col=c(clist[[i]],rep("white",max(lengths)-length(clist[[i]]))),
axes=F,border=F)
text(x=-0.1,y=0.5,adj=1,label=names(clist)[i],xpd=T,cex=1.2)
}

p1 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
imgoutput = "join",basesize = 11,
clustercol=clist$shiny_19,splab=paste0("K=",2:4),sppos = "left")
p2 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
imgoutput = "join",basesize = 11,
clustercol=clist$oceanfive_5,splab=paste0("K=",2:4),sppos = "left")
p3 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
imgoutput = "join",basesize = 11,
clustercol=clist$wong_8,splab=paste0("K=",2:4),sppos = "left")
p4 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
imgoutput = "join",basesize = 11,
clustercol=clist$vintage_5,splab=paste0("K=",2:4),sppos = "left")
cowplot::plot_grid(p1$plot[[1]],p2$plot[[1]],p3$plot[[1]],p4$plot[[1]],
align = "hv",labels = LETTERS[1:4])

8 系统树- 群体结构组合图

当前很多文章，都是将系统发育树的结果和群体结构的结果一起

R 操作代码

library(ggtree)
library(ggplot2)
library(reshape2)
library(ape)
library(RColorBrewer)
library(aplot)
Set1 <- brewer.pal(n = 9,name = "Set1")
Set3 <- brewer.pal(n = 12,name = 'Set3')
Set2 <- brewer.pal(n = 8,name = "Set2")
Dark2 <- brewer.pal(n = 8,name = "Dark2")
Paired <- brewer.pal(n = 12,name = "Paired")
Set <- unique(c(Set1,Set3,Set2,Dark2,Paired))
# 删除黄色
yellow <- c("#FFFF99","#FFED6F","#FFFFB3","#FFFF33")
Set <- Set[!Set %in% yellow]
##tree
tree <- read.tree(file = "Test.nwk")
##root
tree <- root(tree,outgroup = paste0("R",11:20))
##group
info <- read.table("group.txt",header = T,stringsAsFactors = F)
groupInfo <- split(x = as.character(info[,1]),f = as.character(info[,2]))
tree <- groupOTU(.data = tree,.node = groupInfo)
##admixture
data <- read.table("Q_score.bestk.txt",header = T)
df <- data.frame(tree$tip.label, data[,1:ncol(data)])
df <- melt(df, id = "tree.tip.label")
names(df) <- c("Sample","Cluster","Qvalue")
##draw
p <- ggtree(tree, branch.length = "none",mapping = aes(color = group)) +
theme(legend.position='none')
p <- p + scale_x_reverse() +
scale_y_reverse() +
coord_flip()
p <- p + scale_color_manual(values = Set,labels = names(groupInfo))
b <- ggplot(df, aes(x = Sample, y = Qvalue, fill = Cluster,color = Cluster)) +
geom_bar(stat='identity', colour="grey80") +
scale_y_continuous(expand = c(0,0)) +
theme(legend.position="none",
axis.text.x = element_text(size = 6,angle = 90, vjust = 0.5,
face = "bold", colour = "black", hjust=0.95),
axis.text.y = element_text(face = "bold", size = 6, colour = "black"),
axis.title = element_blank()) +
scale_fill_manual(values = Set) +
scale_color_manual(values = Set)
b %>% insert_top(p,height = 2.5)

https://blog.csdn.net/weixin_30273763/article/details/99769659
https://www.jianshu.com/p/3b621b2d6c5f
https://www.jianshu.com/p/d46f27665074
https://www.bilibili.com/read/cv6093335/
https://www.jianshu.com/p/968f34e820ca
https://blog.csdn.net/rainforestist/article/details/114403010

全基因组关联分析（GWAS）-群体结构与亲缘关系
1、群体结构及其对 GWAS 的影响群体结构的来源image.png群体结构对GWAS的影响--混淆变量Ance...
群体结构structure
#格式转换 vcftools --vcf total.final.snp.vcf --plink --out xj...
GWAS基石---混合线性模型
使用混合线性模型（linear mixed model)做GWAS可以有效校正群体结构和群体内复杂的亲缘关系。...
GWAS分析-PCA
PCA方法校正群体结构（群体分层），GWAS该用多少个主成分？ - 实验盒 - 博客园 (cnblogs.com)...
【群体结构】CLUMPP软件使用
最近在做群体结构分析，群体结构三剑客：structure、pac和kinship。这边我主要用的软件是struct...
【群体结构】structure结构图绘制
上一篇文章：【群体结构】CLUMPP软件使用[https://www.jianshu.com/p/c2f25204...
5. GWAS：群体结构——Admixture
群体结构是指材料的亚群分化情况，会导致标记间的非连锁关联，进而导致关联分析结果出现假阳性。地理隔离、人工选择、移...
群体遗传 || Pophelper-群体结构可视化学习
Pophelper基本说明群体遗传下游分析中的一项常规分析即是群体结构的分层展示(STRUCTURE ANALY...
c结构体
1.创建声明结构体输出 2.声明结构体变量2 3.单独定义 structure.h structure.c 4....
使用ggplot2绘制折线图
在做gwas分析的时候，使用admixture计算的群体结构，选择CV error最小的值作为最佳分群数，想用一张...