多组学联合分析-Matrix eQTL

作者: 子鹿学生信 | 来源:发表于2018-09-26 09:53 被阅读1908次

    找到Matrix eQTL这个包,看下文章Matrix eQTL: ultra fast eQTL analysis via large matrix operations(https://doi.org/10.1093/bioinformatics/bts163

    eQTL(表达数量性状位点)计算transcript-SNP 的关系,即分析SNP与基因的表达是否相关。由于计算数量巨大,很多人都用较小的数据来做。因此该作者开发了Matrix eQTL,用于处理大数据,支持additive linear and ANOVA models with covariates,并且可以将cis- and trans-eQTLs分开计算。
    Matrix eQTL相较于其他软件如FastMap — 18.4 min, Merlin — 12.3 min, Plink — 9.0 min, Matrix eQTL — 5.7 min and snpMatrix — 3.3 min要快,它设置一个阈值,只有超过这个阈值的p值才会被计算。
    采用的是线型回归模型,g为基因表达情况,s为SNP分型结果。

    说明文档http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html
    示例数据:http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/R.html
    分析过程很简单,首先设置好要分析文件的路径和名称:

    install.packages("MatrixEQTL")
    
    # 设置数据目录,示例数据放在包的安装目录下了。
    base.dir = find.package("MatrixEQTL")
    
    #设置分析的模型
    useModel = modelLINEAR; # modelANOVA or modelLINEAR or modelLINEAR_CROSS 
    #设置SNP文件的名称
    SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
    # 设置表达数据文件的名称
    expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
    # 设置协变量文件的名称
    # 无协变量设置为character()
    covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
    
    output_file_name = tempfile();
    
    • 提供了三种分析模型供选择
      (1) modelLINEAR
    Model: useModel = modelLINEAR
    Equation: expression = α + ∑k βk⋅covariatek + γ⋅genotype_additive
    Testing for significance of: γ
    Test statistic: t-statistic
    

    (2) modelANOVA

    Model: useModel = modelANOVA
    Equation: expression = α + ∑k βk⋅covariatek + γ1⋅genotype_additive + γ2⋅genotype_dominant
    Testing for significance of: (γ1,γ2) pair
    Test statistic: F-statistic
    

    (3) modelLINEAR_CROSS

    Model: useModel = modelLINEAR_CROSS
    Equation:
        expression = α + ∑k βk⋅covariatek + γ⋅genotype_additive + δ⋅genotype_additive⋅covariateK
    Testing for significance of: δ
    Test statistic: t-statistic
    
    • 注意这里要设置一个p值的阈值,一般越大的数据量阈值设的越小,之前说过它会按这个阈值来计算结果,如果设的过大,分析耗时并且输出很多结果。输出的结果都储存在output_file_name里
    pvOutputThreshold = 1e-2
    # 设置协变量矩阵为 numeric(),很少用,默认
    errorCovariance = numeric()
    
    # 这里建立了一个SlicedData的新对象,用于存放martix的数据,并设置存放数据的格式
    snps = SlicedData$new();
    # 设置数据分隔符为tab
    snps$fileDelimiter = "\t";      # the TAB character
    # 设置缺失值为NA
    snps$fileOmitCharacters = "NA"; # denote missing values;
    
    snps$fileSkipRows = 1;          # one row of column labels
    snps$fileSkipColumns = 1;       # one column of row labels
    snps$fileSliceSize = 2000;      # read file in pieces of 2,000 rows
    snps$LoadFile( SNP_file_name );
    
    ## Load gene expression data
    
    gene = SlicedData$new();
    gene$fileDelimiter = "\t";      # the TAB character
    gene$fileOmitCharacters = "NA"; # denote missing values;
    gene$fileSkipRows = 1;          # one row of column labels
    gene$fileSkipColumns = 1;       # one column of row labels
    gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
    gene$LoadFile(expression_file_name);
    
    ## Load covariates
    
    cvrt = SlicedData$new();
    cvrt$fileDelimiter = "\t";      # the TAB character
    cvrt$fileOmitCharacters = "NA"; # denote missing values;
    cvrt$fileSkipRows = 1;          # one row of column labels
    cvrt$fileSkipColumns = 1;       # one column of row labels
    

    看下文件格式,snp文件用0,1,2表示,基因文件是表达量,cvrt是covariates:


    image.png
    image.png
    image.png

    设置好文件后可以用 Matrix_eQTL_engine主函数进行eQTL分析了,参数snps设置SNP文件,gene设置表达量文件,cvrt设置协变量。然后将每行的SNP和gene放到一块进行线性回归的分析。

    me = Matrix_eQTL_engine(
        snps = snps,
        gene = gene,
        cvrt = cvrt,
        output_file_name = output_file_name,
        pvOutputThreshold = pvOutputThreshold,
        useModel = useModel, 
        errorCovariance = errorCovariance, 
        verbose = TRUE,
        pvalue.hist = TRUE,
        min.pv.by.genesnp = FALSE,
        noFDRsaveMemory = FALSE);
    

    运行完后得到的me对象是一个list:


    image.png

    输出文件的每行eqtl为:SNP name, a transcript name, estimate of the effect size, t- or F-statistic, p-value, and FDR。

    Matrix eQTL可以区分顺式(cis,local)和反式(trans,distant)eQTL,主要用Matrix_eQTL_main函数来分析。其包括以下几个参数:

    *   `pvOutputThreshold.cis` – p-value threshold for cis-eQTLs.
    *   `output_file_name.cis` – detected cis-eQTLs are saved in this file.
    *   `cisDist` – maximum distance at which gene-SNP pair is considered local.
    *   `snpspos` – data frame with information about SNP locations, must have 3 columns - SNP name, chromosome, and position. See [sample SNP location file](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/Sample_Data/snpsloc.txt).
    *   `genepos` – data frame with information about gene locations, must have 4 columns - the name, chromosome, and positions of the left and right ends. See [sample gene location file](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/Sample_Data/geneloc.txt).
    
    

    下面来看具体代码:

    # source("Matrix_eQTL_R/Matrix_eQTL_engine.r");
    library(MatrixEQTL)
    
    ## Location of the package with the data files.
    base.dir = find.package('MatrixEQTL');
    # base.dir = '.';
    
    ## Settings
    
    # Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS
    useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS
    
    # Genotype file name
    SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
    snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");
    
    # Gene expression file name
    expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
    gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");
    
    # Covariates file name
    # Set to character() for no covariates
    covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
    
    # Output file name
    output_file_name_cis = tempfile();
    output_file_name_tra = tempfile();
    
    # Only associations significant at this level will be saved
    pvOutputThreshold_cis = 2e-2;
    pvOutputThreshold_tra = 1e-2;
    
    # Error covariance matrix
    # Set to numeric() for identity.
    errorCovariance = numeric();
    # errorCovariance = read.table("Sample_Data/errorCovariance.txt");
    
    # Distance for local gene-SNP pairs
    cisDist = 1e6;
    
    ## Load genotype data
    
    snps = SlicedData$new();
    snps$fileDelimiter = "\t";      # the TAB character
    snps$fileOmitCharacters = "NA"; # denote missing values;
    snps$fileSkipRows = 1;          # one row of column labels
    snps$fileSkipColumns = 1;       # one column of row labels
    snps$fileSliceSize = 2000;      # read file in slices of 2,000 rows
    snps$LoadFile(SNP_file_name);
    
    ## Load gene expression data
    
    gene = SlicedData$new();
    gene$fileDelimiter = "\t";      # the TAB character
    gene$fileOmitCharacters = "NA"; # denote missing values;
    gene$fileSkipRows = 1;          # one row of column labels
    gene$fileSkipColumns = 1;       # one column of row labels
    gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
    gene$LoadFile(expression_file_name);
    
    ## Load covariates
    
    cvrt = SlicedData$new();
    cvrt$fileDelimiter = "\t";      # the TAB character
    cvrt$fileOmitCharacters = "NA"; # denote missing values;
    cvrt$fileSkipRows = 1;          # one row of column labels
    cvrt$fileSkipColumns = 1;       # one column of row labels
    if(length(covariates_file_name)>0) {
    cvrt$LoadFile(covariates_file_name);
    }
    
    ## Run the analysis
    snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);
    genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);
    
    me = Matrix_eQTL_main(
    snps = snps, 
    gene = gene, 
    cvrt = cvrt,
    output_file_name     = output_file_name_tra,
    pvOutputThreshold     = pvOutputThreshold_tra,
    useModel = useModel, 
    errorCovariance = errorCovariance, 
    verbose = TRUE, 
    output_file_name.cis = output_file_name_cis,
    pvOutputThreshold.cis = pvOutputThreshold_cis,
    snpspos = snpspos, 
    genepos = genepos,
    cisDist = cisDist,
    pvalue.hist = "qqplot",
    min.pv.by.genesnp = FALSE,
    noFDRsaveMemory = FALSE);
    
    unlink(output_file_name_tra);
    unlink(output_file_name_cis);
    
    ## Results:
    
    cat('Analysis done in: ', me$time.in.sec, ' seconds', '\n');
    cat('Detected local eQTLs:', '\n');
    show(me$cis$eqtls)
    cat('Detected distant eQTLs:', '\n');
    show(me$trans$eqtls)
    
    ## Plot the Q-Q plot of local and distant p-values
    
    plot(me)
    

    因此,分析自己的数据需要准备
    genotype
    expression
    covariates
    gene location
    SNP location
    这五个文件,前三个需要每列的样本名对应且顺序一致。
    作者也提供了生成模拟数据的代码:

    # Create an artificial dataset and plot the histogram and Q-Q plot of all p-values
    library('MatrixEQTL')
    
    # Number of samples
    n = 100;
    
    # Number of variables
    ngs = 2000;
    
    # Common signal in all variables (population stratification)
    pop = 0.2 * rnorm(n);
    
    # data matrices
    snps.mat = matrix(rnorm(n*ngs), ncol = ngs) + pop;
    gene.mat = matrix(rnorm(n*ngs), ncol = ngs) + pop + snps.mat*((1:ngs)/ngs)^9/2;
    
    # data objects for Matrix eQTL engine
    snps1 = SlicedData$new( t( snps.mat ) );
    gene1 = SlicedData$new( t( gene.mat ) );
    cvrt1 = SlicedData$new( );
    rm(snps.mat, gene.mat)
    
    # Slice data in blocks of 500 variables
    snps1$ResliceCombined(500);
    gene1$ResliceCombined(500);
    
    # name of temporary output file
    filename = tempfile();
    
    # Perform analysis recording information for 
    # a histogram
    meh = Matrix_eQTL_engine(
      snps = snps1,
      gene = gene1,
      cvrt = cvrt1,
      output_file_name = filename, 
      pvOutputThreshold = 1e-100, 
      useModel = modelLINEAR, 
      errorCovariance = numeric(), 
      verbose = TRUE,
      pvalue.hist = 100);
    unlink( filename );
    # png(filename = "histogram.png", width = 650, height = 650)
    plot(meh, col="grey")
    # dev.off();
    
    # Perform the same analysis recording information for 
    # a Q-Q plot
    meq = Matrix_eQTL_engine(
      snps = snps1, 
      gene = gene1, 
      cvrt = cvrt1, 
      output_file_name = filename,
      pvOutputThreshold = 1e-6, 
      useModel = modelLINEAR, 
      errorCovariance = numeric(), 
      verbose = TRUE,
      pvalue.hist = "qqplot");
    unlink( filename );
    # png(filename = "QQplot.png", width = 650, height = 650)
    plot(meq, pch = 16, cex = 0.7)
    # dev.off();
    

    阅读推荐:

    生信技能树公益视频合辑:学习顺序是linux,r,软件安装,geo,小技巧,ngs组学!

    B站链接:https://m.bilibili.com/space/338686099

    YouTube链接:https://m.youtube.com/channel/UC67sImqK7V8tSWHMG8azIVA/playlists

    生信工程师入门最佳指南:https://mp.weixin.qq.com/s/vaX4ttaLIa19MefD86WfUA

    学徒培养:https://mp.weixin.qq.com/s/3jw3_PgZXYd7FomxEMxFmw

    相关文章

      网友评论

        本文标题:多组学联合分析-Matrix eQTL

        本文链接:https://www.haomeiwen.com/subject/ubikoftx.html