美文网首页群体遗传学
GWAS | 2.群体结构 structure

GWAS | 2.群体结构 structure

作者: iBioinformatics | 来源:发表于2023-05-10 09:13 被阅读0次

    严格意义上说遗传进化中的群体结构应该包括PCA、群体进化树和群体STRUCRURE,传统的PCA和群体进化树只能直观的了解群体中个体的分类关系,但无法解释群体间具体的分群数目和基因交流情况。这就需要群体STRUCRURE作为参考,群体结构分析能够提供个体的血统来源及其组成信息,是一种重要的遗传关系分析手段。目前经典的群体结构分析软件有structure和admixture,而我们常常将群体结构图称为STRUCRURE图。

    structure是与PCA、进化树相似的方法,就是利用分子标记的基因型信息对一组样本进行分类,分子标记可以是SNP、indel、SSR。相比于PCA,进化树,群体结构分析可明确各个群之间是否存在交流及交流程度

    群体遗传结构 (structure) 指遗传变异在物种或群体中的一种非随机分布。按照地理分布或其他标准可将一个群体分为若干亚群,处于同一亚群内的不同个体亲缘关系较高,而亚群与亚群之间则亲缘关系较远。

    群体结构,又称群体分层,指所研究的群体中存在基因频率不同的亚群。

    群体结构分析有助于理解进化过程,并且可以通过基因型和表型的关联研究确定个体所属的亚群。

    1、群体结构影响因素

    哈迪温伯格平衡:如果一个群体符合特定条件,这个群体中各个等位基因、基因型以及表型的比例可以从一代传到下一代维持不变,即世代保持不变。

    特定条件指:

    • 1、种群无限大,群体有一个数目巨大的个体组成;
    • 2、群体中个体间可以随机交配,任何个体都有同等的机会与异性进行交配,不受任何遗传学或行为学的限制;
    • 3、群体中的基因库中,没有新的突变产生;
    • 4、群体中的个体没有迁出,也没有迁入,即没有个体迁徙;
    • 5、所有基因型的个体都有同等的机会存活和繁殖后代,没有任何形式的自然选择。

    哈迪温伯格平衡两大特点:

    • 1、等位基因的频率在世代遗传中稳定不变;
    • 2、无论群体起始的等位基因频率如何,只要经过一代的随机交配,群体的基因频率和等位基因频率就可以达到平衡状态。

    影响群体平衡(哈迪温伯格平衡)的主要因素有突变、选择、迁徙、遗传漂变、交配系统等。

    2 群体结构原理及应用

    基本原理:
    将群体分成 K 个服从 Hardy-Weinberger(HWE) 平衡的亚群,每个材料归到各个亚群,计算每个材料其基因组变异源于第 K 个亚群的可能性(Pritchard et al.2000),主要用 Q 矩阵表示,Q 值越大,表明某个材料来自某个亚群的可能性越大。无论是structure和admixture 都是这个原理

    研究意义:
    1、了解研究材料的分群情况,推断祖先群体,评判最优分群;
    2、辅助控制关联分析结果的假阳性。多个亚群混合成一个较大群体后,由于各个亚群基因频率不同,群体 LD 发生变化,导致GWAS 分析结果出现假阳性;
    3、杂交事件/个体祖先来源。

    3 群体结构格式转换

    structureadmixture 两款群体结构分析的软件都有其特定的格式要求,为了下游分析,需要将 VCF 格式的基因型文件转换为两款软件指定的格式。

    1 、添加 ID 信息

    perl add_vcf_ID.pl Test.vcf Test_addID.vcf
    

    2、根据 LD( 连锁不平衡), 针对 SNP 位点进行过滤,保留 unlinked SNP 位点

    plink \
       --vcf Test_addID.vcf  \
       --indep-pairwise 50 10 0.2  \
       --out Test_addID.maf0.05.int0.8  \
       --maf 0.05  \
       --geno 0.2  \
       --allow-extra-chr
    

    3 、提取Test_addID.maf0.05.int0.8.prune.in 中的标记

    plink  \
       --vcf Test_addID.vcf  \
       --extract Test_addID.maf0.05.int0.8.prune.in \
       --out Test_addID.maf0.05.int0.8.prune.in \
       --recode vcf-iid  \
       --allow-extra-chr
    

    4、将筛选的位点转换为 structure 和 和 admixture 格式
    structure 格式:

    plink   \
       --vcf Test_addID.maf0.05.int0.8.prune.in.vcf   \
       --recode structure  \
       --out Test_addID.maf0.05.int0.8.prune.in   \
       --allow-extra-chr
    

    admixture 格式:

    plink   \
       --vcf Test_addID.maf0.05.int0.8.prune.in.vcf   \
       --recode 12   \
       --out Test_addID.maf0.05.int0.8.prune.in   \
       --allow-extra-chr
    

    Test_addID.maf0.05.int0.8.prune.in.recode.strct_in 即是structure 要求的格式文件
    Test_addID.maf0.05.int0.8.prune.in.ped 即是admixture 要求的格式文件

    4 格式介绍

    structure 格式:

    第一行代表 SNP 的 ID 信息;
    第二行代表相邻标记间的距离;
    第一列代表样品信息;
    第二列代表样品其它信息。
    备注:不同软件转换格式可能不一样。

    admixture格式:

    第一列:Family ID
    第二列:Individual ID
    第三列:Paternal ID
    第四列:Maternal ID
    第五列:Sex (1=male; 2=female; other=unknown)
    第六列:Phenotype(-9 表示缺失)
    图片中 1 2 代表基因型,其中每个标记占两列(二等位),纯合为 1 1 或者 2 2,杂合为 1 2。

    5 structure 分析

    网址:http://web.stanford.edu/group/pritchardlab/software.html

    structure软件两个配置文件:

    • mainparams_structure.cfg
    • extraparams_structure.cfg

    主配置文件:mainparams_structure.cfg

    Basic Program Parameters
    #define MAXPOPS 2 // (int) number of populations assumed ### k value
    #define BURNIN 5000 // (int) length of burnin period ### at least > 5000, general 30000, maybe
    #define NUMREPS 50000 // (int) number of MCMC reps after burnin ### at least > 5000, usually 10000, maybe
    
    Input/Output files
    #define INFILE /home/dengdj/bin/Bio_soft/Structure/structure2.3.1_console/testdata1 // (str) name of input data file
    #define OUTFILE results //(str) name of output data file
    
    Data file format
    #define NUMINDS 200 // (int) number of diploid individuals in data file
    #define NUMLOCI 5 // (int) number of loci in data file
    #define PLOIDY 2 // (int) ploidy of data
    #define MISSING 0 // (int) value given to missing genotype data
    #define ONEROWPERIND 1 // (B) store data for individuals in a single line
    #define LABEL 1 // (B) Input file contains individual labels
    #define POPDATA 1 // (B) Input file contains a population identifier
    #define POPFLAG 0 // (B) Input file contains a flag which says
    
    whether to use popinfo when USEPOPINFO==1
    #define LOCDATA 0 // (B) Input file contains a location identifier
    #define PHENOTYPE 0 // (B) Input file contains phenotype information
    #define EXTRACOLS 0 // (int) Number of additional columns of data
    
    before the genotype data start.
    #define MARKERNAMES 1 // (B) data file contains row of marker names
    #define RECESSIVEALLELES 0 // (B) data file contains dominant markers (eg AFLPs)
    // and a row to indicate which alleles are recessive
    #define MAPDISTANCES 1 // (B) data file contains row of map distances
    // between loci
    
    Advanced data file options
    #define PHASED 0 // (B) Data are in correct phase (relevant for linkage model only)
    #define PHASEINFO 0 // (B) the data for each individual contains a line
    
    indicating phase (linkage model)
    #define MARKOVPHASE 0 // (B) the phase info follows a Markov model.
    #define NOTAMBIGUOUS -999 // (int) for use in some analyses of polyploid data
    

    一般情况下,主配置文件 mainparams_structure.cfg 红色标准的需要检查外,其它默认即可。

    extraparams_structure.cfg 可以直接使用官方提供的,内容也可以设置为空。

    structure 主要参数如下:
    -m mainparams 主配置文件
    -e extraparams 其它配置文件
    -k 推断分群数目
    -i 输入文件 输入 structure 格式文件
    -o 输出文件
    -L 位点数,分析的位点数目
    -N 样本数
    -D 随机种子

    命令投递:

    输出结果:

    6 admixture 分析

    网址:http://dalexander.github.io/admixture/index.html
    admixtrue 是一款运算速度非常快的群体结构分析软件,操作简单。需要提供的文件是plink中的ped格式。

    命令行(批量输出):

    for i in $(seq 1 10); 
        do echo "admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped $i -j5 > test_k_$i.log"; 
    done
    

    输出结果:

    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 1 -j5 > test_k_1.log
    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 2 -j5 > test_k_2.log
    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 3 -j5 > test_k_3.log
    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 4 -j5 > test_k_4.log
    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 5 -j5 > test_k_5.log
    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 6 -j5 > test_k_6.log
    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 7 -j5 > test_k_7.log
    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 8 -j5 > test_k_8.log
    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 9 -j5 > test_k_9.log
    admixture -C 0.01 --cv Test_addID.maf0.05.int0.8.prune.in.ped 10 -j5 > test_k_10.log
    

    -C:收敛阈值,设置 0.01
    --cv: 不需要提供参数或者数据(默认交叉验证 5 次)
    Test_addID.maf0.05.int0.8.prune.in.ped:即admixtrue分析需要的基因型数据
    -j5: 表示分析采用5个线程
    test_k_*.log:输出的日志文件,里面记录了交叉验证错误率,用来推断最优分群。

    7 结果分析

    7.1 structure 结果分析

    一般情况下,structure分析完毕之后,最优分群主要是基于deltak进行判断。而deltak值的获取,每个k值至少重复3次。

    目前在线工具structureHarvester 可以基于structure分析的原始结果进行最优分群数目的推断。

    登录网站,上传*.zip格式的压缩文件(即将原始结果zip格式压缩),然后点击Harvest,执行一段时间,即可返回结果。

    群体结构图形绘制
    R 语言包安装:

    install.packages("devtools")
    devtools::install_github('royfrancis/pophelper')
    

    备注:最新版本2.3.1

    library(pophelper)
    files <- list.files(path = ".",full.names = T,pattern = "_f$")
    slist <- readQ(files = files, indlabfromfile = T,filetype = "structure")
    xlist <- alignK(slist)
    xlist2 <- mergeQ(xlist)
    plotQ(xlist2, imgoutput = "join",imgtype = "pdf",width = 20,height = 1.2,
    showindlab = T,useindlab = T,indlabhjust = 0.5,indlabangle = 0,
    indlabsize = 1.2,sortind = "all",sharedindlab = F,
    panelspacer = 0.02, indlabspacer = 0,indlabcol = "black",
    showtitle = T,titlelab = "",titlecol = "black",
    titlehjust = 0.5,showsp = T,
    sppos = "left",splab = paste0("K=",1:11),splabangle = 180,dpi = 600,
    outputfilename = paste0("structure_barplot"),units = "cm",
    exportplot = T,exportpath = getwd())
    
    ##draw by group
    pop <- read.table("group.txt",header = T,sep = "\t",stringsAsFactors =
    F,row.names = 1)
    plotQ(xlist2,imgoutput = "join",imgtype = "png",height = 1.2,width = 20,
    showindlab = T,useindlab = T,sharedindlab = T,
    showsp = T,sppos = "left",splab = paste0("K=",1:11),splabangle = 180,
    showlegend = F,ordergrp = T,grplab = pop,
    subsetgrp = LETTERS[1:9],dpi = 600,
    outputfilename = paste0("structure_barplot_pop"),units = "cm",
    exportpath = getwd())
    

    输出最优分群:

    write.table(x = xlist2[[11]],file = "Q_score.bestk.txt",append = F,quote = F,
    row.names = T,col.names = T,sep = "\t")
    

    pophelper 进行最优分群推断-evanno 法

    library(pophelper)
    files <- list.files(path = ".",full.names = T,pattern = "_f$")
    slist <- readQ(files = files, indlabfromfile = T,filetype = "structure")
    tabulateQ(slist, writetable = F, sorttable = TRUE)
    
    summariseQ(tabulateQ(slist, writetable = F, sorttable = TRUE))
    
    evannoMethodStructure(summariseQ(tabulateQ(slist, writetable = F, sorttable =
    TRUE)),writetable = F,exportplot = F)
    
    7.2 admixture 结果分析

    admixture分析基于交叉验证错误率的结果进行评断最优分群。当然,目前很多文章也会利用admixture软件重复运行k值(>=3),然后基于evanno方法进行最优k值判断。

    如果k值仅重复运行一次,则需要CV法判断最优分群。CV值记录在运行的结果日志文件中,可以通过捕获的方法批量查询:

    grep -h 'CV' *.log
    

    图形绘制:

    x <- 1:10
    y<-c(0.52940,0.51765,0.51313,0.52238,0.53772,0.56136,0.59252,0.61948,0.662
    29,0.71867)
    par(mgp = c(5,1,0),mar = c(8,8,4,4))
    plot(x,y,type="b",xlab="K value",ylab="Cross-Validation (CV) errors",
    lab = c(10,5,1),
    col=c(rep("black",2),"red",rep("black",7)),
    bg = c(rep("black",2),"red",rep("black",7)),
    bty = "l",las = 1,cex.lab = 2,cex.axis = 1.5,cex = 2,
    pch = 21,lwd = 2,font = 2,font.lab = 2,font.axis = 2)
    box(lwd = 3,bty = "l")
    

    群体结构图绘制

    sfile = paste0("Test_addID.maf0.05.int0.8.prune.in.",1:10,".Q")
    slist <- readQ(sfile,indlabfromfile = T)
    sample <- read.table("sampleID.txt",header = F,sep = "\t",stringsAsFactors = F)[,1]
    for(i in 1:length(slist)){
    rownames(slist[[i]]) <-sample
    }
    plotQ(slist, imgoutput = "join",imgtype = "png",width = 20,height = 1.2,
    showindlab = T,useindlab = T,indlabhjust = 0.5,indlabangle = 0,
    indlabsize = 1.2,sortind = "all",sharedindlab = F,
    panelspacer = 0.02, indlabspacer = 0,indlabcol = "black",
    showtitle = T,titlelab = "",titlecol = "black",
    titlehjust = 0.5,showsp = T,
    sppos = "left",splab = paste0("K=",1:10),splabangle = 180,dpi = 600,
    outputfilename = paste0("admixture_barplot"),units = "cm",
    exportplot = T,exportpath = getwd())
    ##draw by group
    pop <- read.table("group.txt",header = T,sep = "\t",stringsAsFactors = F,row.names = 1)
    plotQ(slist,imgoutput = "join",imgtype = "png",height = 1.2,width = 20,
    showindlab = T,useindlab = T,sharedindlab = T,
    showsp = T,sppos = "left",splab = paste0("K=",1:10),splabangle = 180,
    showlegend = F,ordergrp = T,grplab = pop,
    subsetgrp = LETTERS[1:9],dpi = 600,
    outputfilename = paste0("admixture_barplot_pop"),units = "cm",
    exportpath = getwd())
    

    pophelper 参数说明

    补充- 高级分析

    clist<-list("shiny"=c("#1D72F5","#DF0101","#77CE61","#FF9326","#A945FF","#0089B2","#FDF060","#FFA6B2",
    "#BFF217","#60D5FD","#CC1577","#F2B950","#7FB21D","#EC496F","#326397","#B26314","#027368","#A4A4A
    4","#610B5E"),
    "strong"=c("#11A4C8","#63C2C5","#1D4F9F","#0C516D","#2A2771","#396D35","#80C342","#725DA8","#B620
    25","#ED2224","#ED1943","#ED3995","#7E277C","#F7EC16","#F8941E","#8C2A1C","#808080"),
    "oceanfive"=c("#00A0B0", "#6A4A3C", "#CC333F", "#EB6841", "#EDC951"),
    "keeled"=c("#48B098", "#91CB62", "#FFEE3B", "#FB9013", "#FF3C28"),
    "vintage"=c("#400F13", "#027368", "#A3BF3F", "#F2B950", "#D93A2B"),
    "muted"=c("#46BDDD","#82DDCE","#F5F06A","#F5CC6A","#F57E6A"),
    "teal"=c("#CFF09E","#A8DBA8","#79BD9A","#3B8686","#0B486B"),
    "merry"=c("#5BC0EB","#FDE74C","#9BC53D","#E55934","#FA7921"),
    "funky"=c("#A6CEE3", "#3F8EAA", "#79C360", "#E52829", "#FDB762","#ED8F47","#9471B4"),
    "retro"=c("#01948E","#A9C4E2","#E23560","#01A7B3","#FDA963","#323665","#EC687D"),
    "cb_paired"=c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#C
    AB2D6","#6A3D9A","#FFFF99","#B15928"),
    "cb_set3"=c("#8DD3C7","#FFFFB3","#BEBADA","#FB8072","#80B1D3","#FDB462","#B3DE69","#FCCDE5","#D9
    D9D9","#BC80BD","#CCEBC5","#FFED6F"),
    "morris"=c("#4D94CC","#34648A","#8B658A","#9ACD32","#CC95CC","#9ACD32","#8B3A39","#CD6601","#CC
    5C5B","#8A4500"),
    "wong"=c("#000000","#E69F00","#56B4E9","#009E73","#F0E442","#006699","#D55E00","#CC79A7"),
    "krzywinski"=c("#006E82","#8214A0","#005AC8","#00A0FA","#FA78FA","#14D2DC","#AA0A3C","#FA7850","#
    0AB45A","#F0F032","#A0FA82","#FAE6BE"))
    # add length of palettes
    lengths <- sapply(clist,length)
    names(clist) <- paste0(names(clist),"_",lengths)
    par(mar=c(0.2,6,0.2,0))
    par(mfrow=c(length(clist),1))
    for(i in 1:length(clist)){
    barplot(rep(1,max(lengths)),
    col=c(clist[[i]],rep("white",max(lengths)-length(clist[[i]]))),
    axes=F,border=F)
    text(x=-0.1,y=0.5,adj=1,label=names(clist)[i],xpd=T,cex=1.2)
    }
    
    p1 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
    imgoutput = "join",basesize = 11,
    clustercol=clist$shiny_19,splab=paste0("K=",2:4),sppos = "left")
    p2 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
    imgoutput = "join",basesize = 11,
    clustercol=clist$oceanfive_5,splab=paste0("K=",2:4),sppos = "left")
    p3 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
    imgoutput = "join",basesize = 11,
    clustercol=clist$wong_8,splab=paste0("K=",2:4),sppos = "left")
    p4 <- plotQ(qlist = slist[2:4],exportplot = F,returnplot = T,
    imgoutput = "join",basesize = 11,
    clustercol=clist$vintage_5,splab=paste0("K=",2:4),sppos = "left")
    cowplot::plot_grid(p1$plot[[1]],p2$plot[[1]],p3$plot[[1]],p4$plot[[1]],
    align = "hv",labels = LETTERS[1:4])
    

    8 系统树- 群体结构组合图

    当前很多文章,都是将系统发育树的结果和群体结构的结果一起

    R 操作代码

    library(ggtree)
    library(ggplot2)
    library(reshape2)
    library(ape)
    library(RColorBrewer)
    library(aplot)
    Set1 <- brewer.pal(n = 9,name = "Set1")
    Set3 <- brewer.pal(n = 12,name = 'Set3')
    Set2 <- brewer.pal(n = 8,name = "Set2")
    Dark2 <- brewer.pal(n = 8,name = "Dark2")
    Paired <- brewer.pal(n = 12,name = "Paired")
    Set <- unique(c(Set1,Set3,Set2,Dark2,Paired))
    # 删除黄色
    yellow <- c("#FFFF99","#FFED6F","#FFFFB3","#FFFF33")
    Set <- Set[!Set %in% yellow]
    ##tree
    tree <- read.tree(file = "Test.nwk")
    ##root
    tree <- root(tree,outgroup = paste0("R",11:20))
    ##group
    info <- read.table("group.txt",header = T,stringsAsFactors = F)
    groupInfo <- split(x = as.character(info[,1]),f = as.character(info[,2]))
    tree <- groupOTU(.data = tree,.node = groupInfo)
    ##admixture
    data <- read.table("Q_score.bestk.txt",header = T)
    df <- data.frame(tree$tip.label, data[,1:ncol(data)])
    df <- melt(df, id = "tree.tip.label")
    names(df) <- c("Sample","Cluster","Qvalue")
    ##draw
    p <- ggtree(tree, branch.length = "none",mapping = aes(color = group)) +
    theme(legend.position='none')
    p <- p + scale_x_reverse() +
    scale_y_reverse() +
    coord_flip()
    p <- p + scale_color_manual(values = Set,labels = names(groupInfo))
    b <- ggplot(df, aes(x = Sample, y = Qvalue, fill = Cluster,color = Cluster)) +
    geom_bar(stat='identity', colour="grey80") +
    scale_y_continuous(expand = c(0,0)) +
    theme(legend.position="none",
    axis.text.x = element_text(size = 6,angle = 90, vjust = 0.5,
    face = "bold", colour = "black", hjust=0.95),
    axis.text.y = element_text(face = "bold", size = 6, colour = "black"),
    axis.title = element_blank()) +
    scale_fill_manual(values = Set) +
    scale_color_manual(values = Set)
    b %>% insert_top(p,height = 2.5)
    

    https://blog.csdn.net/weixin_30273763/article/details/99769659
    https://www.jianshu.com/p/3b621b2d6c5f
    https://www.jianshu.com/p/d46f27665074
    https://www.bilibili.com/read/cv6093335/
    https://www.jianshu.com/p/968f34e820ca
    https://blog.csdn.net/rainforestist/article/details/114403010

    相关文章

      网友评论

        本文标题:GWAS | 2.群体结构 structure

        本文链接:https://www.haomeiwen.com/subject/ffvbjdtx.html