2020-04-29 eMAGMA 基于基因的关联分析（Part

作者: 程凉皮儿 | 来源:发表于2020-04-28 09:24 被阅读0次

基于基因的关联分析的简单背景说明
玉米全基因关联分析 ( GWAS Analysis Pipeli
表观研究数据库推荐 | 查询疾病/表型与表观遗传之间的联系，原来
重测序分析（14）全基因组关联分析GWAS介绍
Famliy-based关联分析
DEPICT实现基因优化（gene prioritization
第九章数据关联规则分析算法——基于Apriori算法的关联项分
GWAS位点简单进行峰拆分输出QTNs并基于GFF关联基因
基因水平的关联分析
Science | 群体研究新思路：De novo + GWAS

输入数据准备

本教程要求eMAGMA 文件, 软件 (MAGMA) 和辅助文件都在同一个目录下如果你的文件在不同的目录上，请在命令行加入路径信息

cd /path/yourworking folder/eMAGMA

解压缩软件包及辅助文件：magma_v1.07b.zip, NCBI37.3.zip 和MDD2018_excluding23andMe (下载自 PGC website).

unzip [filename].zip

由于我下载的MDD2018_excluding23andMe是一个.gz的文件，解压使用如下代码：

gzip -d MDD2018_ex23andMe.gz

解压完后查看前几行

(base) bogon:eMAGMA chelsea$ head MDD2018_ex23andMe
CHR SNP BP  A1  A2  FRQ_A_59851 FRQ_U_113154    INFO    OR  SE  P   ngt Direction   HetISqt HetDf   HetPVa  Nca Nco Neff
8   rs62513865  101592213   T   C   0.0747  0.0733  0.957   1.01461 0.0153  0.3438  0   ---+++  0.0 5   0.7906  59851   113154  69115.85
8   rs79643588  106973048   A   G   0.092   0.092   0.999   1.02122 0.0136  0.1231  0   ++-+++  0.0 5   0.6847  59851   113154  69115.85
8   rs17396518  108690829   T   G   0.562   0.565   0.98    1.00331 0.008   0.6821  6   --+--+  34.9    5   0.05637 59851   113154  69115.85
8   rs6994300   102569817   A   G   0.00609 0.00556 0.466   0.88126 0.4243  0.7658  0   ?-????  0.0 0   1   16823   25632   20313.61
8   rs138449472 108580746   A   G   0.00965 0.00852 0.734   0.97181 0.0598  0.632   0   -+??-?  0.0 2   0.6663  41253   79756   47868.51
8   rs983166    108681675   A   C   0.565   0.568   0.991   0.99144 0.008   0.2784  0   --+--+  10.1    5   0.1685  59851   113154  69115.85
8   rs28842593  103044620   T   C   0.841   0.843   0.934   0.98926 0.0112  0.3381  1   -++--+  0.0 5   0.51    59851   113154  69115.85
8   rs377046245 105176418   I   D   0.265   0.265   0.994   1.03004 0.0196  0.1311  -   ?????-  0.0 0   1   14260   15480   14844.98
8   rs7014597   104152280   C   G   0.166   0.164   0.996   1.01501 0.0106  0.1591  0   +-+-++  0.0 5   0.4818  59851   113154  69115.85

运行程序需要P-values来自 MDD GWAS summary statistics. 使用下面的代码从 GWAS summary statistics来提取: SNP-ID, chromosome, BP(base pair position), and P-value相关信息并写入一个新的 txt file.

awk '{print $2,$1,$3,$11,$19}' MDD2018_ex23andMe > MDD2018_ex23andMe_emagma.txt

完了之后前3列的信息排列应该是：SNP, CHR 和 BP。

数据分析Analysis

分析需要来自适当参考样本的原始基因型数据来模拟LD结构。为此，我们使用欧洲人口的基因组参考文件[g1000_eur]。由于我们基于eQTL信息将SNP与基因相关联，因此我们使用注释文件，我们先前已将SNP分配给目标eGenes。这些注释文件是基于GTEx中显著的(FDR<0.05)SNP-基因关联生成的(更多细节见Gering 2009b)。注释文件位于文件夹Batch1、Batch2、Batch3与Batch4、Batch5与Batch6中。

出于实际原因，我们将仅使用位于目录Batch1中的杏仁核的注释文件。在Batch1中，有13个注释文件，前缀为组织名称，后缀为genes.annot。要将Batch1下载到名为Brain的目录中，请使用以下命令：

mkdir Brain 
mv Batch1_annot.zip Brain
cd Brain 
unzip Batch1.zip
cd ..

因为我之前已经下载了Batch1_annot.zip,新建Brain文件后将其移动到该目录下直接解压缩Batch1.zip

确保返回到eMAGMA文件夹。要运行关联，请使用以下命令：

./magma --bfile g1000_eur 
--gene-annot Brain/Brain_Amygdala.genes.annot 
--pval MDD2018_ex23andMe_emagma2.txt ncol=Neff 
--gene-settings adap-permp=10000 
--out Amygdala_emagma

运行结果如下所示：

(base) bogon:eMAGMA chelsea$ ./magma --bfile g1000_eur --gene-annot Brain/Batch1/Brain_Amygdala.genes.annot --pval MDD2018_ex23andMe_emagma.txt ncol=Neff --gene-settings adap-permp=10000 --out Amygdala_emagma
Welcome to MAGMA v1.07b (mac)
Using flags:
    --bfile g1000_eur
    --gene-annot Brain/Batch1/Brain_Amygdala.genes.annot
    --pval MDD2018_ex23andMe_emagma.txt ncol=Neff
    --gene-settings adap-permp=10000
    --out Amygdala_emagma

Start time is 20:55:53, Tuesday 28 Apr 2020

Loading PLINK-format data...
Reading file g1000_eur.fam... 503 individuals read
Reading file g1000_eur.bim... 22665064 SNPs read
Preparing file g1000_eur.bed...

Reading SNP synonyms from file g1000_eur.synonyms (auto-detected)
    read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
    WARNING: detected 133 synonymous SNP pairs in the data
             skipped all synonym entries involved, synonymous SNPs are kept in analysis
             writing list of detected synonyms in data to supplementary log file
Reading SNP p-values from file MDD2018_ex23andMe_emagma.txt...
    detected 5 variables in file
    using variable: SNP (SNP id)
    using variable: P (p-value)
    using variable: Neff (sample size; discarding SNPs with N < 50)

ERROR - reading p-value file: too few values on line 9533410 (found 4, expecting at least 5):
    line: rs190562351 21 9502243 0.83543

Terminating program.

由于rs190562351这行有个缺失值，运行终止了，那么我就删除这行

## 先用grep找到这行，没有重复项
$ grep 'rs190562351' MDD2018_ex23andMe_emagma.txt
rs190562351 21 9502243 0.83543
## 再去查找了grep的用法，-v参数是输出反选的数据，再赋值到一个新的文件，这样就删掉了这行。
$ grep -v 'rs190562351' MDD2018_ex23andMe_emagma.txt > MDD2018_ex23andMe_emagma1.txt

再用MDD2018_ex23andMe_emagma1.txt重新运行，但是万万没想到，这种有一个值缺失的行不只这一行，我要是这样一步步删除不知道要到什么时候，那就需要写循环了，尝试了好几次都没能成功

(base) bogon:eMAGMA chelsea$ grep '{cat id2.txt | while read id ; do echo ${id};  done}' MDD2018_ex23andMe_emagma2.txt
(base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done
rs71269378
rs76790510
rs77252136
rs140659530
rs181553715
rs145993931
rs76661226
rs76722126
rs76196084
rs79016859
rs79149161
rs71259347
rs71258884
rs4104602
rs77072381
rs77056430
rs140202858
rs190271206
rs71229564
rs71277812
rs76384163
rs71244393
rs71247672
rs79178122
rs145133116
rs151229293
rs77863943
rs143923139
rs79633665
rs79047427
rs77756107
rs146840680
chr21_9862783_D
rs77110876
rs140066961
rs75310970
rs77482226
rs144043309
rs77690252
rs7378139
rs79407983
rs138722934
rs77061955
rs75645220
rs139338962
rs193153830
rs80297221
rs75702448
rs142407330
rs146496652
rs148968985
rs71259094
rs80035564
rs75780639
rs79300020
rs75625582
rs189257336
rs189188955
rs191838859
(base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done |grep -v
usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
    [-e pattern] [-f file] [--binary-files=value] [--color=when]
    [--context[=num]] [--directories=action] [--label] [--line-buffered]
    [--null] [pattern] [file ...]
(base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done |grep -v ${id}
usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
    [-e pattern] [-f file] [--binary-files=value] [--color=when]
    [--context[=num]] [--directories=action] [--label] [--line-buffered]
    [--null] [pattern] [file ...]
(base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done |grep ${id} MDD2018_ex23andMe_emagma2.txt
(base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done |grep  MDD2018_ex23andMe_emagma2.txt
(base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do grep '${id}' MDD2018_ex23andMe_emagma2.txt;  done
(base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do grep '${id}' MDD2018_ex23andMe_emagma2.txt > MDD2018_ex23andMe_emagmao.txt;  done
^C
(base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do grep -v '${id}' MDD2018_ex23andMe_emagma2.txt > MDD2018_ex23andMe_emagmap.txt;  done

以上是我的探索过程，也记录下来，最终没有成功删除目标项目，看到缺失项还挺集中的，后面没有办法就简单粗爆的打开txt文档通过搜索rs71269378找到位置,手动把这些都删除了，然后再运行，就成功了。

(base) bogon:eMAGMA chelsea$ ./magma --bfile g1000_eur --gene-annot Brain/Batch1/Brain_Amygdala.genes.annot --pval MDD2018_ex23andMe_emagma2.txt ncol=Neff --gene-settings adap-permp=10000 --out Amygdala_emagma
Welcome to MAGMA v1.07b (mac)
Using flags:
    --bfile g1000_eur
    --gene-annot Brain/Batch1/Brain_Amygdala.genes.annot
    --pval MDD2018_ex23andMe_emagma2.txt ncol=Neff
    --gene-settings adap-permp=10000
    --out Amygdala_emagma

Start time is 09:02:45, Wednesday 29 Apr 2020

Loading PLINK-format data...
Reading file g1000_eur.fam... 503 individuals read
Reading file g1000_eur.bim... 22665064 SNPs read
Preparing file g1000_eur.bed...

Reading SNP synonyms from file g1000_eur.synonyms (auto-detected)
    read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
    WARNING: detected 133 synonymous SNP pairs in the data
             skipped all synonym entries involved, synonymous SNPs are kept in analysis
             writing list of detected synonyms in data to supplementary log file
Reading SNP p-values from file MDD2018_ex23andMe_emagma2.txt...
    detected 5 variables in file
    using variable: SNP (SNP id)
    using variable: P (p-value)
    using variable: Neff (sample size; discarding SNPs with N < 50)
    read 13554490 lines from file, containing valid SNP p-values for 10999960 SNPs in data (81.15% of lines, 48.53% of SNPs in data)
    WARNING: file contained 307 SNPs (same IDs or synonyms) with duplications
             dropped all occurrences of each from analysis
             writing list of duplicated IDs to supplementary log file
Filtering phenotype/covariate missing values... 503 individuals remaining
Loading gene annotation from file Brain/Batch1/Brain_Amygdala.genes.annot...
    1301 gene definitions read from file
    found 1258 genes containing valid SNPs in genotype data


Starting gene analysis...
    using model: SNPwise-mean
        writing gene analysis results to file Amygdala_emagma.genes.out
    writing intermediate output to file Amygdala_emagma.genes.raw


End time is 09:05:12, Wednesday 29 Apr 2020 (elapsed: 00:02:27)

成功运行后查看生成文件：

(base) bogon:eMAGMA chelsea$ ls
Amygdala_emagma.genes.out     Batch3_annot.zip              MDD2018_ex23andMe             NCBI37.3.zip                  g1000_eur.fam                 magma
Amygdala_emagma.genes.raw     Batch4_annot.zip              MDD2018_ex23andMe_emagma.txt  README                        g1000_eur.synonyms            magma_v1.07b_mac
Amygdala_emagma.log           Batch5_annot.zip              MDD2018_ex23andMe_emagma1.txt REPORT                        g1000_eur.zip                 magma_v1.07b_mac.zip
Amygdala_emagma.log.suppl     Batch6_annot.zip              MDD2018_ex23andMe_emagma2.txt emagma_Amygdala_script        g1000_eurreadme               manual_v1.07b.pdf
Amygdala_outputs.zip          Brain                         MDD2018_ex23andMe_emagmap.txt g1000_eur.bed                 id.txt                        ncbi_readme
Batch2_annot.zip              CHANGELOG                     NCBI37.3.gene.loc             g1000_eur.bim                 id2.txt                       network_files.zip

上述命令表明：Brain_Amygdala.genes.annot是用于分析的输入文件，P值从MDD GWAS汇总数据[MDD2018_ex23andMe_emagma.txt]中提取。有效样本数在Neff列中。使用10,000个自适应排列(--adap-permp=10,000)进行多次测试。

该程序还会生成日志文件[Amygdala_emagma.log]。日志文件包含运行的汇总信息，即错误(如果有)、读取了多少基因以及有多少基因分配了有效的SNP。对Amygdala_emagma.log的检查显示，从注释文件中读取了1301个基因定义，并且1258个基因在基因型数据中具有有效的SNP。
分析输出genes.raw [Amygdala_emagma.genes.raw]文件和genes.out [Amygdala_emagma.genes.out]文件。genes.raw文件包含分析结果，将在本教程后面的第2部分中使用。

genes.out文件的结果是一种易于阅读的格式。下面是genes.out文件的一个示例，它包含有关基因ID、位置、映射到基因的SNP数量、关联的P值和排列的P值的信息。

(base) bogon:eMAGMA chelsea$ head Amygdala_emagma.genes.out
GENE       CHR      START       STOP  NSNPS  NPARAM      N        ZSTAT            P        PERMP  NPERM
26155        1     879583     894679      6       2  38689     -0.47386       0.6822        0.653   1000
54991        1    1017198    1051736     14       1  61363      0.63601      0.26238        0.263   1000
54973        1    1246965    1260067      1       1  32381      -1.7636       0.9611        0.959   1000
441869       1    1353800    1356824    107       4  59224     -0.23583      0.59322        0.557   1000
83858        1    1407139    1444852     45       4  53085     -0.60178      0.72634          0.7   1000
728661       1    1591486    1624243     51       4  63001     -0.11335      0.54512        0.524   1000
728642       1    1633822    1655999      1       1  52547      -0.9717       0.8344        0.831   1000
388585       1    2460184    2461684      1       1  67392      0.57691        0.282        0.305   1000
9731         1    3728645    3773797      1       1  69116       0.9625       0.1679        0.173   1000

要使用Bonferroni校正校正多重测试(P值<=0.05/测试基因数=3.8431e-5)，并提取重要相关基因列表，请使用：

(base) bogon:eMAGMA chelsea$ awk '{if($9<=3.8431e-5) print $0}' Amygdala_emagma.genes.out | head
10463        4   41992516   42089551      1       1  54271       4.1106    1.973e-05       0.0001  10000
11118        6   26365387   26378548    223       6  68311       5.6024   1.0571e-08       0.0001  10000
64288        6   28292514   28321972     48       3  66367       4.5078   3.2753e-06       0.0001  10000
720          6   31949834   31970457    345       9  65285       4.0559   2.4967e-05       0.0001  10000
(base) bogon:eMAGMA chelsea$ awk '{if($9<=3.8431e-5) print $0}' Amygdala_emagma.genes.out > Amygdala_signif_genes.txt

上面的代码导出 [Amygdala_signif_genes.tx] 包括4个显著性基因(p<=3.8431e-5).

GENE       CHR      START       STOP  NSNPS  NPARAM      N        ZSTAT            P        PERMP  NPERM
10463        4   41992516   42089551      1       1  54271       4.1106    1.973e-05       0.0001  10000
11118        6   26365387   26378548    223       6  68311       5.6024   1.0573e-08       0.0001  10000
64288        6   28292514   28321972     48       3  66367       4.5078   3.2753e-06       0.0001  10000
720          6   31949834   31970457    345       9  65285       4.0559   2.4967e-05       0.0002  10000

基因ID 11118关联最大，P值=1.0573e-08，该基因位于6号染色体：26365387-26378548，由223SNPs定位。6号染色体上的另外两个基因和4号染色体上的一个基因对MDD也有重要意义。

这就完成了！您已经生成了一个在杏仁核中表达并与MDD有显著关联的eGenes列表。从这个分析中产生的信息可以用来研究这些基因的生物学途径和功能。

要进一步进行此分析，请转到第2部分！

基于基因的关联分析的简单背景说明
Burden 分析，是基于基因的关联分析方法中的一种，应该与基于位点的关联分析（GWAS），起到相辅相成，互相补充...
玉米全基因关联分析 ( GWAS Analysis Pipeli
全基因组关联分析是基于数量性状连锁不平衡理论，通过表型与基因型关联分析获得影响表型性状的显著关联标记。 1. 软件...
表观研究数据库推荐 | 查询疾病/表型与表观遗传之间的联系，原来
表观基因组关联分析（Epigenome-wide Association Study，EWAS），即基于全基因组范...
重测序分析（14）全基因组关联分析GWAS介绍
全基因组关联分析（GWAS）全基因组关联分析（Genome wide association study，GWA...
Famliy-based关联分析
为什么要做基于家系的关联分析？关联分析是挖掘复杂疾病致病位点的一个有效方法。在人类的疾病基因研究中，一般有两种方...
DEPICT实现基因优化（gene prioritization
全基因组关联分析除了找到显著的关联位点，我们还可以做基因优化、geneset富集分析、组织富集分析，下面具体讲一讲...
第九章数据关联规则分析算法——基于Apriori算法的关联项分
9.1 基于Apriori算法的关联分析 Aprior算法是关联规则分析中较为经典的频繁项集算法。关联规则反映的是...
GWAS位点简单进行峰拆分输出QTNs并基于GFF关联基因
来自rMVP的关联后的数据；峰拆分输出QTNsgetposQTNs.py 基于GFF关联基因，以水稻RAP注释为例...
基因水平的关联分析
1. 最显著SNP法 1.1 最显著 SNP 法把基因中 P 值最小的 SNP 作为该基因的代表2.1 最大 O...
Science | 群体研究新思路：De novo + GWAS
前言 GWAS，即全基因组关联分析（Genome Wide Association Study），是指在全基因组范...