文献笔记五十四：全基因组关联分析鉴定拟南芥中控制种子大小的调节因

作者: 小明的数据分析笔记本 | 来源:发表于2019-12-30 17:56 被阅读0次

论文

A new regulator of seed size control in Arabidopsis identified by a genome-wide association study
New Phytologist 2019
Peking University

两篇简书文章重复论文中GWAS分析的过程

重点关注论文中GWAS的分析过程，争取根据两篇简书文章重复出分析过程

测量种子大小（Seed size measurement）

种子平铺白纸，扫描，使用imageJ软件进行分析

全基因组关联分析

表型数据（种子面积）用平均值
SNP数据下载自1001 Genomes Project website
BeAGLE软件填充数据
数据过滤
非二等位基因位点
最小等位基因频率0.05
数据过滤
plink连锁不平衡
单变量混合线性模型
GEMMA软件默认参数
后面两句话看不明白
A linear mixed univariate model was used for marker association tests with a single phenotype to account for population stratification and sample structure.
The population structure was controlled using the relatedness matrix generated from GEMMA with the -gk 1 parameter.
校正p值
小于0.05显著
岭回归分析（？）

分析过程

下载数据

wget http://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz

初步筛选数据

vcftools --gzvcf 1001genomes_snp-short-indel_only_ACGTN.vcf.gz --remove-indels --min-alleles 2 --max-alleles 2 --recode --keep At_Seed_Size_Samples.txt --out 172sample

删除InDel
只保留二等位位点
挑选指定的样本

基因型填充

java -Xss5m -Xmn25G -Xms50G -Xmx50G -jar ~/mingyan/Bioinformatics_tool/Beagle/beagle.25Nov19.28d.jar nthreads=4 gt=172sample.recode.vcf out=172sample_out ne=172

遇到了报错

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

先跳过这一步

根据最小等位基因来过滤

vcftools --vcf 172sample.recode.vcf --maf 0.05 --recode --out 172sample_maf_filter

给每个snp位点添加ID

bcftools view -h 172sample_maf_filter.recode.vcf > head
bcftools view -H 172sample_maf_filter.recode.vcf | grep "^#" -v | awk '{$3=$1"_"$2;print $0}' | sed 's/ /\t/g' > body
cat head body > 172sample_maf_filter_snpID.vcf

到这一步数据集已经小了很多，试着在这一步对基因型进行填充（先填充在过滤和先过滤再填充的区别？）
基因型填充

java -jar beagle.25Nov19.28d.jar gt=172sample_maf_filter_snpID.vcf out=172sample_out ne=172

基因型填充需要很长时间，这里先用没有填充的做接下来的分析

基于连锁不平衡进行过滤

plink --vcf 172sample_maf_filter_snpID.vcf --recode --out 172sample
plink --file 172sample --indep 50 5 2
plink --file 172sample --extract plink.prune.in --recode --out 172sample_maf_filter_snpID_LD_filter

ped/map转换为fam/bed/bim

plink --file 172sample_maf_filter_snpID_LD_filter --make-bed --out clean_snp

为结果文件添加表型数据，需要在fam文件最后一列（-9）修改为真实表型值，这里我使用https://www.jianshu.com/p/fc628bd1001b 中提到的第二种方法
代码

import pandas as pd
tableA = pd.read_excel("add_Pheno_to_fam.xlsx",sheet_name="Sheet3",converts={'A':str})
tableB = pd.read_excel("add_Pheno_to_fam.xlsx",sheet_name="Sheet2",converts={'A':str})
tableA.head()
tableB.head()
tableA.merge(right=tableB,how="left",left_on="A",right_on="A")
tableC = tableA.merge(right=tableB,how="left",left_on="A",right_on="A")
tableC.to_excel("C.xlsx",index=False)

gemma软件下载
参考文章使用GEMMA进行复杂性状全基因组关联分析(GWAS)
代码

gemma-0.98.1-linux-static -bfile clean_snp -gk -o kinship
gemma-0.98.1-linux-static -bfile clean_snp -lmm -k ./output/kinship.cXX.txt -o GWAS_results

结果文件是GWAS_results.assoc.txt，用这个文件来画图
代码

rawdf<-read.table("GWAS_results.assoc.txt",
                  header=T,sep="\t")
dim(rawdf)
colnames(rawdf)
df<-data.frame(rs=rawdf$rs,chr=rawdf$chr,pos=rawdf$ps,
               pvalue=rawdf$p_wald)
head(df)
install.packages("rMVP")
library(rMVP)
?MVP.Report
MVP.Report(df)
MVP.Report(df,plot.type="m")

Rectangular-Manhattan.pvalue.jpg

QQplot.pvalue.jpg

Circular-Manhattan.pvalue.jpg

SNP_Density.pvalue.jpg

这个rMVP包好厉害，一条命令MVP.Report(df)所有图就全部生成了

文章开头提到的两篇简书文章还用rMVP这个R语言包做了全基因组关联分析，另外找时间来重复这部分内容
还有一部分内容是用R语言的glmnet包做岭回归分析分析，这部分内容暂时还没有思路！

参考文章

本篇文章绝大部分分析代码均来自参考文章1和2，特别感谢文章1和2作者的分享

欢迎大家关注我的公众号
小明的数据分析笔记本
如果需要文章中任何步骤中用到的数据，欢迎在我的公众号留言

公众号二维码.jpg

文献笔记五十四：全基因组关联分析鉴定拟南芥中控制种子大小的调节因

论文

两篇简书文章重复论文中GWAS分析的过程

测量种子大小（Seed size measurement）

全基因组关联分析

分析过程

先跳过这一步

基因型填充需要很长时间，这里先用没有填充的做接下来的分析

参考文章

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

数量遗传或生统

GWAS

群体遗传学

NGS

生信相关

文献笔记五十四：全基因组关联分析鉴定拟南芥中控制种子大小的调节因

论文

两篇简书文章重复论文中GWAS分析的过程

测量种子大小 （Seed size measurement）

全基因组关联分析

分析过程

先跳过这一步

基因型填充需要很长时间，这里先用没有填充的做接下来的分析

参考文章

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

数量遗传或生统

GWAS

群体遗传学

NGS

生信相关

测量种子大小（Seed size measurement）