无重复RNAseq样本Gfold值的计算

作者: x2yline | 来源:发表于2018-01-19 17:53 被阅读555次

    使用Gfold进行无重复RNAseq数据基因差异分析

    1. 准备资料

    输入的资料为read_count,要对两个样本进行差异分析则必需准备两个文件,每个文件格式如下:

    文件1:Sample2.read_cnt

    hsa-miR-146a-3p 20  48  20  20
    novel_mir217    45  20  20  20
    hsa-miR-548o-3p 20  40  20  20
    hsa-miR-378f    20  17703   20  20
    novel_mir857    20  38  20  20
    novel_mir1083   20  37  20  20
    novel_mir790    20  36  20  20
    novel_mir20 20  31  20  20
    novel_mir1517   20  25  20  20
    

    文件2:Sample1.read_cnt

    hsa-miR-146a-3p 20  0   20  20
    novel_mir217    20  0   20  20
    hsa-miR-548o-3p 20  0   20  20
    hsa-miR-378f    20  16  20  20
    novel_mir857    20  0   20  20
    novel_mir1083   20  0   20  20
    novel_mir790    20  0   20  20
    novel_mir20 20  0   20  20
    novel_mir1517   20  0   20  20
    

    在不同版本中输入文件的要求不同,一般第一列表头为GeneName,必需要有5列(如果只想计算gfold值,就只需要保证 Read Count所属的那列正确,其他列可以随意编造,如需要计算RPKM值则Gene exon length所属的那一列不能随意编造)程序才会正确运行,并且列分隔符必须为tab键('\t')

    在2015-05-24更新的V1.1.4版本中,列依次为GeneSymbol、GeneName、Read Count、Gene exon length、RPKM

    R语言提取得到文件1和文件2代码:

    注意两个文件的第一列必须完全相同

    # raw data input
    file = "H358_Vector.gene.FPKM.xls"
    raw_data_Vec <- read.delim(file, stringsAsFactors=FALSE)
    file <- "H358_LSH.gene.FPKM.xls"
    raw_data_LSH <- read.delim(file, stringsAsFactors=FALSE)
    
    ## merge data into one table
    data_merged <- merge(raw_data_Vec[, c(1, 4)], raw_data_LSH[, c(1, 4)], by.x=1, by.y=1, all=TRUE)
    data_merged[is.na(data_merged)] <- 0
    
    ## prepare the input data for gfold
    data1 <- data.frame(data_merged[, c(1, 1, 2, 1, 2)])
    data2 <- data.frame(data_merged[, c(1, 1, 3, 1, 3)])
    
    colnames(data1) <- c("GeneSymbol", "GeneName1", "Read Count1", "Gene exon length1", "RPKM1")
    colnames(data2) <- c("GeneSymbol", "GeneName2", "Read Count2", "Gene exon length2", "RPKM2")
    
    write.table(data1, file="Sample1Vec.read_cnt", row.names=F, col.names=F, quote=F, sep="\t")
    
    write.table(data2, file="Sample2LSH.read_cnt", row.names=F, col.names=F, quote=F, sep="\t")
    
    ## gfold diff -s1 Sample1Vec.read_cnt -s2 Sample2LSH.read_cnt -o Sample1VSSample2.diff
    
    

    2. 软件安装(只能在linux上使用)

    2.1 安装 gsl-1.15

    # /mnt/e/win_linux/gsl 替换为自己的安装目录
    cd /mnt/e/win_linux/gsl
    
    vi ~/.bashrc
    export CXXFLAGS="-g -O3 -I/mnt/e/win_linux/gsl/include -L/mnt/e/win_linux/gsl/lib"
    export LD_LIBRARY_PATH="/mnt/e/win_linux/gsl/lib:"$LD_LIBRARY_PATH 
    
    source ~/.bashrc
    
    wget ftp://ftp.gnu.org/gnu/gsl/gsl-2.2.tar.gz   
    tar zxf gsl-1.15.tar.gz
    cd gsl-1.15
    ./configure --prefix=/mnt/e/win_linux/gsl
    make
    make install
    

    2.2 安装gfold

    cd ..
    ## 在这里https://bitbucket.org/feeldead/gfold/downloads/?tab=tags选择一个版本下载,
    ## 注意不要选择最上面tag为tip的(安装不了)
    wget https://bitbucket.org/feeldead/gfold/get/V1.1.4.zip
    unzip V1.1.4.zip
    cd feeldead-gfold-1921fd6dc668/
    make
    
    ## 报错
    ## g++ -O3 -Wall -g main.cc -o gfold -lgsl -lgslcblas 
    ## In file included from GeneInfo.hpp:29:0,
    ##                  from main.cc:24:
    ## Utility.hpp:69:36: fatal error: gsl/gsl_statistics_int.h: No such file or directory
    ## compilation terminated.
    ## Makefile:11: recipe for target 'program' failed
    ## make: *** [program] Error 1
    
    ## 运行以下命令
    export CXXFLAGS="-g -O3 -I/mnt/e/win_linux/gsl/include -L/mnt/e/win_linux/gsl/lib"  
    export LD_LIBRARY_PATH="/mnt/e/win_linux/gsl/lib:"$LD_LIBRARY_PATH
    g++ -O3 -Wall -g main.cc -o gfold -lgsl -lgslcblas -I/mnt/e/win_linux/gsl/include -L/mnt/e/win_linux/gsl/lib
    ## 虽然显示warning 但是已经安装好了,有可执行文件gfold了
    ./gfold -h
    
    ##      =============================================================================== 
    ##          gfold    :   Generalized fold change for ranking differentially expressed   
    ##                       genes from RNA-seq data.
    ##
    ##          Author   :   Jianxing Feng (jianxing.tongji@gmail.com)
    ##            Date   :   Sun May 24 07:42:36 CST 2015
    ##          Version  :   V1.1.4
    ##      =============================================================================== 
    ##
    ##      USAGE:   Please go to the source code directory and use command 'man doc/gfold.man' 
    ##               or open doc/gfold.html to find documentation.
    ##
    ##      Quick Examples: 
    ##               gfold count -ann hg19Ref.gtf -tag sample1.sam -o sample1.read_cnt          
    ##               gfold count -ann hg19Ref.gtf -tag sample2.sam -o sample2.read_cnt          
    ##               gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff 
    
    rm -rf *.zip
    mv feeldead-gfold-1921fd6dc668/ gfold.V1.1.4
    echo "alias gfold='/mnt/e/win_linux/gfold.V1.1.4/gfold'" >> ~/.profile
    source ~/.profile
    gfold -h
    

    3. 运行gfold筛选差异基因

    # s1为样本1,s2为样本2,o为output,suf为后缀名(省略则默认表示为空)
    gfold diff -s1 Sample1.read_cnt -s2 Sample2.read_cnt -o Sample1VSSample2.diff
    ## 与该命令等价的命令:
    ## gfold diff -s1 Sample1 -s2 Sample2 -suf .read_cnt -o Sample1VSSample2.diff
    
    ## -VL1 Loading read count for each gene from file Vector.read_cnt ...
    ## -VL1 1935 genes have been scanned
    ## -VL1 Loading read count for each gene from file LSH.read_cnt ...
    ## -VL1 1935 genes have been scanned
    ## -VL1 Calculating normalization constant ...
    ## -VL1 Normalization constant is: 
    ##      Vector.read_cnt    2.3238e+07    1.04573
    ##     LSH.read_cnt    2.80085e+07    1
    ## -VL1 Calculate RPKM ...
    ## -VL1 Calculate accurate GFOLD value ...
    ## 等待一段时间
    ## Job diff is DONE!
    

    4. 结果解释

    # This file is generated by gfold V1.1.4 on Fri Jan 19 21:19:05 2018
    # Normalization constants :
    #    Sample1N.read_cnt  6802895 1.35542
    #    Sample2N.read_cnt  12483845    1
    # The GFOLD value could be considered as a reliable log2 fold change.
    # It is positive/negative if the gene is up/down regulated.
    # A gene with zero GFOLD value should never be considered as 
    # differentially expressed. For a comprehensive description of 
    # GFOLD, please refer to the manual.
    #GeneSymbol GeneName    GFOLD(0.01) E-FDR   log2fdc 1stRPKM 2ndRPKM
    novel_mir1585   NA  6.45001 1   9.19028 0   3444.45
    hsa-miR-4664-5p NA  5.70336 1   8.44996 0   2058.66
    novel_mir252    NA  4.13781 1   6.91447 0   704.911
    novel_mir1060   NA  3.64411 1   6.43874 0   504.652
    
    

    4.1 GFOLD

    A gene with zero GFOLD value should never be considered as differentially expressed.

    The GFOLD value could be considered as a reliable log2 fold change.

    It is positive/negative if the gene is up/down regulated.

    注:如果GFOLD大于0说明Sample2的表达高于sample1,基因上调(表示为Sample1 VS Sample2)

    4.2 E-FDR

    Empirical FDR based on replicates. It is always 1 when no replicates are available

    因此无重复样本的经验FDR均为1

    4.3 log2fdc

    log2 fold change. If no replicate is available, and -acc is T, log2 fold change is based on read counts and normalization constants. Otherwise, log2 fold change is based on the sampled expression level from the posterior distribution.

    -acc参数在无重复样本时使用,表示计算Gfold的校正方法,如果-acc值为T(默认值为T),表示根据测序深度进行校正(熟读较慢),如果-acc值为F,表示使用马尔科夫蒙特卡洛方法,-acc参数仅在diff功能下指定。

    log2fdc表示log2(fold change),该值默认是经校正的count值改变倍数取log2

    4.3 RPKM

    由于我输入的Gene exon length是假的,因此最后两列没有意义

    参考:
    https://kevinzjy.github.io/2017/05/20/Bioinfo-Differential-Expression/
    http://www.biotrainee.com/thread-752-1-1.html
    http://www.chenlianfu.com/?p=1085
    https://bitbucket.org/feeldead/gfold
    https://web.archive.org/web/20161125133249/http://compbio.tongji.edu.cn:80/~fengjx/GFOLD/gfold.html

    相关文章

      网友评论

        本文标题:无重复RNAseq样本Gfold值的计算

        本文链接:https://www.haomeiwen.com/subject/rxlaoxtx.html