美文网首页gwasscDRS
2020-04-29 eMAGMA 基于基因的关联分析(Part

2020-04-29 eMAGMA 基于基因的关联分析(Part

作者: 程凉皮儿 | 来源:发表于2020-04-28 09:24 被阅读0次

    输入数据准备

    本教程要求eMAGMA 文件, 软件 (MAGMA) 和辅助文件都在同一个目录下如果你的文件在不同的目录上,请在命令行加入路径信息

    cd /path/yourworking folder/eMAGMA
    

    解压缩软件包及辅助文件:magma_v1.07b.zip, NCBI37.3.zipMDD2018_excluding23andMe (下载自 PGC website).

    unzip [filename].zip 
    

    由于我下载的MDD2018_excluding23andMe是一个.gz的文件,解压使用如下代码:

    gzip -d MDD2018_ex23andMe.gz
    

    解压完后查看前几行

    (base) bogon:eMAGMA chelsea$ head MDD2018_ex23andMe
    CHR SNP BP  A1  A2  FRQ_A_59851 FRQ_U_113154    INFO    OR  SE  P   ngt Direction   HetISqt HetDf   HetPVa  Nca Nco Neff
    8   rs62513865  101592213   T   C   0.0747  0.0733  0.957   1.01461 0.0153  0.3438  0   ---+++  0.0 5   0.7906  59851   113154  69115.85
    8   rs79643588  106973048   A   G   0.092   0.092   0.999   1.02122 0.0136  0.1231  0   ++-+++  0.0 5   0.6847  59851   113154  69115.85
    8   rs17396518  108690829   T   G   0.562   0.565   0.98    1.00331 0.008   0.6821  6   --+--+  34.9    5   0.05637 59851   113154  69115.85
    8   rs6994300   102569817   A   G   0.00609 0.00556 0.466   0.88126 0.4243  0.7658  0   ?-????  0.0 0   1   16823   25632   20313.61
    8   rs138449472 108580746   A   G   0.00965 0.00852 0.734   0.97181 0.0598  0.632   0   -+??-?  0.0 2   0.6663  41253   79756   47868.51
    8   rs983166    108681675   A   C   0.565   0.568   0.991   0.99144 0.008   0.2784  0   --+--+  10.1    5   0.1685  59851   113154  69115.85
    8   rs28842593  103044620   T   C   0.841   0.843   0.934   0.98926 0.0112  0.3381  1   -++--+  0.0 5   0.51    59851   113154  69115.85
    8   rs377046245 105176418   I   D   0.265   0.265   0.994   1.03004 0.0196  0.1311  -   ?????-  0.0 0   1   14260   15480   14844.98
    8   rs7014597   104152280   C   G   0.166   0.164   0.996   1.01501 0.0106  0.1591  0   +-+-++  0.0 5   0.4818  59851   113154  69115.85
    

    运行程序需要P-values来自 MDD GWAS summary statistics. 使用下面的代码从 GWAS summary statistics来提取: SNP-ID, chromosome, BP(base pair position), and P-value相关信息并写入一个新的 txt file.

    awk '{print $2,$1,$3,$11,$19}' MDD2018_ex23andMe > MDD2018_ex23andMe_emagma.txt
    

    完了之后 前3列的信息排列应该是:SNP, CHRBP

    数据分析Analysis

    分析需要来自适当参考样本的原始基因型数据来模拟LD结构。为此,我们使用欧洲人口的基因组参考文件[g1000_eur]。由于我们基于eQTL信息将SNP与基因相关联,因此我们使用注释文件,我们先前已将SNP分配给目标eGenes。这些注释文件是基于GTEx中显著的(FDR<0.05)SNP-基因关联生成的(更多细节见Gering 2009b)。注释文件位于文件夹Batch1、Batch2、Batch3与Batch4、Batch5与Batch6中。

    出于实际原因,我们将仅使用位于目录Batch1中的杏仁核的注释文件。在Batch1中,有13个注释文件,前缀为组织名称,后缀为genes.annot。要将Batch1下载到名为Brain的目录中,请使用以下命令:

    mkdir Brain 
    mv Batch1_annot.zip Brain
    cd Brain 
    unzip Batch1.zip
    cd ..
    

    因为我之前已经下载了Batch1_annot.zip,新建Brain文件后将其移动到该目录下直接解压缩Batch1.zip

    确保返回到eMAGMA文件夹。要运行关联,请使用以下命令:

    ./magma --bfile g1000_eur 
    --gene-annot Brain/Brain_Amygdala.genes.annot 
    --pval MDD2018_ex23andMe_emagma2.txt ncol=Neff 
    --gene-settings adap-permp=10000 
    --out Amygdala_emagma      
    

    运行结果如下所示:

    (base) bogon:eMAGMA chelsea$ ./magma --bfile g1000_eur --gene-annot Brain/Batch1/Brain_Amygdala.genes.annot --pval MDD2018_ex23andMe_emagma.txt ncol=Neff --gene-settings adap-permp=10000 --out Amygdala_emagma
    Welcome to MAGMA v1.07b (mac)
    Using flags:
        --bfile g1000_eur
        --gene-annot Brain/Batch1/Brain_Amygdala.genes.annot
        --pval MDD2018_ex23andMe_emagma.txt ncol=Neff
        --gene-settings adap-permp=10000
        --out Amygdala_emagma
    
    Start time is 20:55:53, Tuesday 28 Apr 2020
    
    Loading PLINK-format data...
    Reading file g1000_eur.fam... 503 individuals read
    Reading file g1000_eur.bim... 22665064 SNPs read
    Preparing file g1000_eur.bed...
    
    Reading SNP synonyms from file g1000_eur.synonyms (auto-detected)
        read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
        WARNING: detected 133 synonymous SNP pairs in the data
                 skipped all synonym entries involved, synonymous SNPs are kept in analysis
                 writing list of detected synonyms in data to supplementary log file
    Reading SNP p-values from file MDD2018_ex23andMe_emagma.txt...
        detected 5 variables in file
        using variable: SNP (SNP id)
        using variable: P (p-value)
        using variable: Neff (sample size; discarding SNPs with N < 50)
    
    ERROR - reading p-value file: too few values on line 9533410 (found 4, expecting at least 5):
        line: rs190562351 21 9502243 0.83543
    
    Terminating program.
    

    由于rs190562351这行有个缺失值,运行终止了,那么我就删除这行

    ## 先用grep找到这行,没有重复项
    $ grep 'rs190562351' MDD2018_ex23andMe_emagma.txt
    rs190562351 21 9502243 0.83543
    ## 再去查找了grep的用法,-v参数是输出反选的数据,再赋值到一个新的文件,这样就删掉了这行。
    $ grep -v 'rs190562351' MDD2018_ex23andMe_emagma.txt > MDD2018_ex23andMe_emagma1.txt
    

    再用MDD2018_ex23andMe_emagma1.txt重新运行,但是万万没想到,这种有一个值缺失的行不只这一行,我要是这样一步步删除不知道要到什么时候,那就需要写循环了,尝试了好几次都没能成功

    (base) bogon:eMAGMA chelsea$ grep '{cat id2.txt | while read id ; do echo ${id};  done}' MDD2018_ex23andMe_emagma2.txt
    (base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done
    rs71269378
    rs76790510
    rs77252136
    rs140659530
    rs181553715
    rs145993931
    rs76661226
    rs76722126
    rs76196084
    rs79016859
    rs79149161
    rs71259347
    rs71258884
    rs4104602
    rs77072381
    rs77056430
    rs140202858
    rs190271206
    rs71229564
    rs71277812
    rs76384163
    rs71244393
    rs71247672
    rs79178122
    rs145133116
    rs151229293
    rs77863943
    rs143923139
    rs79633665
    rs79047427
    rs77756107
    rs146840680
    chr21_9862783_D
    rs77110876
    rs140066961
    rs75310970
    rs77482226
    rs144043309
    rs77690252
    rs7378139
    rs79407983
    rs138722934
    rs77061955
    rs75645220
    rs139338962
    rs193153830
    rs80297221
    rs75702448
    rs142407330
    rs146496652
    rs148968985
    rs71259094
    rs80035564
    rs75780639
    rs79300020
    rs75625582
    rs189257336
    rs189188955
    rs191838859
    (base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done |grep -v
    usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
        [-e pattern] [-f file] [--binary-files=value] [--color=when]
        [--context[=num]] [--directories=action] [--label] [--line-buffered]
        [--null] [pattern] [file ...]
    (base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done |grep -v ${id}
    usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
        [-e pattern] [-f file] [--binary-files=value] [--color=when]
        [--context[=num]] [--directories=action] [--label] [--line-buffered]
        [--null] [pattern] [file ...]
    (base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done |grep ${id} MDD2018_ex23andMe_emagma2.txt
    (base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do echo ${id};  done |grep  MDD2018_ex23andMe_emagma2.txt
    (base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do grep '${id}' MDD2018_ex23andMe_emagma2.txt;  done
    (base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do grep '${id}' MDD2018_ex23andMe_emagma2.txt > MDD2018_ex23andMe_emagmao.txt;  done
    ^C
    (base) bogon:eMAGMA chelsea$ cat id2.txt | while read id ; do grep -v '${id}' MDD2018_ex23andMe_emagma2.txt > MDD2018_ex23andMe_emagmap.txt;  done
    

    以上是我的探索过程,也记录下来,最终没有成功删除目标项目,看到缺失项还挺集中的,后面没有办法就简单粗爆的打开txt文档通过搜索rs71269378找到位置,手动把这些都删除了,然后再运行,就成功了。

    (base) bogon:eMAGMA chelsea$ ./magma --bfile g1000_eur --gene-annot Brain/Batch1/Brain_Amygdala.genes.annot --pval MDD2018_ex23andMe_emagma2.txt ncol=Neff --gene-settings adap-permp=10000 --out Amygdala_emagma
    Welcome to MAGMA v1.07b (mac)
    Using flags:
        --bfile g1000_eur
        --gene-annot Brain/Batch1/Brain_Amygdala.genes.annot
        --pval MDD2018_ex23andMe_emagma2.txt ncol=Neff
        --gene-settings adap-permp=10000
        --out Amygdala_emagma
    
    Start time is 09:02:45, Wednesday 29 Apr 2020
    
    Loading PLINK-format data...
    Reading file g1000_eur.fam... 503 individuals read
    Reading file g1000_eur.bim... 22665064 SNPs read
    Preparing file g1000_eur.bed...
    
    Reading SNP synonyms from file g1000_eur.synonyms (auto-detected)
        read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
        WARNING: detected 133 synonymous SNP pairs in the data
                 skipped all synonym entries involved, synonymous SNPs are kept in analysis
                 writing list of detected synonyms in data to supplementary log file
    Reading SNP p-values from file MDD2018_ex23andMe_emagma2.txt...
        detected 5 variables in file
        using variable: SNP (SNP id)
        using variable: P (p-value)
        using variable: Neff (sample size; discarding SNPs with N < 50)
        read 13554490 lines from file, containing valid SNP p-values for 10999960 SNPs in data (81.15% of lines, 48.53% of SNPs in data)
        WARNING: file contained 307 SNPs (same IDs or synonyms) with duplications
                 dropped all occurrences of each from analysis
                 writing list of duplicated IDs to supplementary log file
    Filtering phenotype/covariate missing values... 503 individuals remaining
    Loading gene annotation from file Brain/Batch1/Brain_Amygdala.genes.annot...
        1301 gene definitions read from file
        found 1258 genes containing valid SNPs in genotype data
    
    
    Starting gene analysis...
        using model: SNPwise-mean
            writing gene analysis results to file Amygdala_emagma.genes.out
        writing intermediate output to file Amygdala_emagma.genes.raw
    
    
    End time is 09:05:12, Wednesday 29 Apr 2020 (elapsed: 00:02:27)
    

    成功运行后查看生成文件:

    (base) bogon:eMAGMA chelsea$ ls
    Amygdala_emagma.genes.out     Batch3_annot.zip              MDD2018_ex23andMe             NCBI37.3.zip                  g1000_eur.fam                 magma
    Amygdala_emagma.genes.raw     Batch4_annot.zip              MDD2018_ex23andMe_emagma.txt  README                        g1000_eur.synonyms            magma_v1.07b_mac
    Amygdala_emagma.log           Batch5_annot.zip              MDD2018_ex23andMe_emagma1.txt REPORT                        g1000_eur.zip                 magma_v1.07b_mac.zip
    Amygdala_emagma.log.suppl     Batch6_annot.zip              MDD2018_ex23andMe_emagma2.txt emagma_Amygdala_script        g1000_eurreadme               manual_v1.07b.pdf
    Amygdala_outputs.zip          Brain                         MDD2018_ex23andMe_emagmap.txt g1000_eur.bed                 id.txt                        ncbi_readme
    Batch2_annot.zip              CHANGELOG                     NCBI37.3.gene.loc             g1000_eur.bim                 id2.txt                       network_files.zip
    

    上述命令表明:Brain_Amygdala.genes.annot是用于分析的输入文件,P值从MDD GWAS汇总数据[MDD2018_ex23andMe_emagma.txt]中提取。有效样本数在Neff列中。使用10,000个自适应排列(--adap-permp=10,000)进行多次测试。

    该程序还会生成日志文件[Amygdala_emagma.log]。日志文件包含运行的汇总信息,即错误(如果有)、读取了多少基因以及有多少基因分配了有效的SNP。对Amygdala_emagma.log的检查显示,从注释文件中读取了1301个基因定义,并且1258个基因在基因型数据中具有有效的SNP。
    分析输出genes.raw [Amygdala_emagma.genes.raw]文件和genes.out [Amygdala_emagma.genes.out]文件。genes.raw文件包含分析结果,将在本教程后面的第2部分中使用。

    genes.out文件的结果是一种易于阅读的格式。下面是genes.out文件的一个示例,它包含有关基因ID、位置、映射到基因的SNP数量、关联的P值和排列的P值的信息。

    (base) bogon:eMAGMA chelsea$ head Amygdala_emagma.genes.out
    GENE       CHR      START       STOP  NSNPS  NPARAM      N        ZSTAT            P        PERMP  NPERM
    26155        1     879583     894679      6       2  38689     -0.47386       0.6822        0.653   1000
    54991        1    1017198    1051736     14       1  61363      0.63601      0.26238        0.263   1000
    54973        1    1246965    1260067      1       1  32381      -1.7636       0.9611        0.959   1000
    441869       1    1353800    1356824    107       4  59224     -0.23583      0.59322        0.557   1000
    83858        1    1407139    1444852     45       4  53085     -0.60178      0.72634          0.7   1000
    728661       1    1591486    1624243     51       4  63001     -0.11335      0.54512        0.524   1000
    728642       1    1633822    1655999      1       1  52547      -0.9717       0.8344        0.831   1000
    388585       1    2460184    2461684      1       1  67392      0.57691        0.282        0.305   1000
    9731         1    3728645    3773797      1       1  69116       0.9625       0.1679        0.173   1000
    

    要使用Bonferroni校正校正多重测试(P值<=0.05/测试基因数=3.8431e-5),并提取重要相关基因列表,请使用:

    (base) bogon:eMAGMA chelsea$ awk '{if($9<=3.8431e-5) print $0}' Amygdala_emagma.genes.out | head
    10463        4   41992516   42089551      1       1  54271       4.1106    1.973e-05       0.0001  10000
    11118        6   26365387   26378548    223       6  68311       5.6024   1.0571e-08       0.0001  10000
    64288        6   28292514   28321972     48       3  66367       4.5078   3.2753e-06       0.0001  10000
    720          6   31949834   31970457    345       9  65285       4.0559   2.4967e-05       0.0001  10000
    (base) bogon:eMAGMA chelsea$ awk '{if($9<=3.8431e-5) print $0}' Amygdala_emagma.genes.out > Amygdala_signif_genes.txt
    

    上面的代码导出 [Amygdala_signif_genes.tx] 包括4个显著性基因(p<=3.8431e-5).

    GENE       CHR      START       STOP  NSNPS  NPARAM      N        ZSTAT            P        PERMP  NPERM
    10463        4   41992516   42089551      1       1  54271       4.1106    1.973e-05       0.0001  10000
    11118        6   26365387   26378548    223       6  68311       5.6024   1.0573e-08       0.0001  10000
    64288        6   28292514   28321972     48       3  66367       4.5078   3.2753e-06       0.0001  10000
    720          6   31949834   31970457    345       9  65285       4.0559   2.4967e-05       0.0002  10000
    

    基因ID 11118关联最大,P值=1.0573e-08,该基因位于6号染色体:26365387-26378548,由223SNPs定位。6号染色体上的另外两个基因和4号染色体上的一个基因对MDD也有重要意义。

    这就完成了!您已经生成了一个在杏仁核中表达并与MDD有显著关联的eGenes列表。从这个分析中产生的信息可以用来研究这些基因的生物学途径和功能。

    要进一步进行此分析,请转到第2部分!

    相关文章

      网友评论

        本文标题:2020-04-29 eMAGMA 基于基因的关联分析(Part

        本文链接:https://www.haomeiwen.com/subject/wzqbwhtx.html