Gfold说明书(备份)

作者: x2yline | 来源:发表于2018-01-19 15:45 被阅读42次

本文资料来源

https://web.archive.org/web/20161125133249/http://compbio.tongji.edu.cn:80/~fengjx/GFOLD/gfold.html

也可以参考软件安装目录doc文件夹里的gfold.html

原网页链接(已挂)

http://compbio.tongji.edu.cn/~fengjx/GFOLD/gfold.html

repo地址

https://bitbucket.org/feeldead/gfold/overview

NAME

gfold - Generalized fold change for ranking differentially expressed genes from RNA-seq data.

GFOLD is especially useful when no replicate is available. GFOLD generalizes the fold change by considering the posterior distribution of log fold change, such that each gene is assigned a reliable fold change. It overcomes the shortcoming of p-value that measures the significance of whether a gene is differentially expressed under different conditions instead of measuring relative expression changes, which are more interesting in many studies. It also overcomes the shortcoming of fold change that suffers from the fact that the fold change of genes with low read count are not so reliable as that of genes with high read count, even these two genes show the same fold change.


CITATION

Feng J, Meyer CA, Wang Q, Liu JS, Liu XS, Zhang Y. GFOLD: a generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics 2012


SYNOPSIS

  • gfold
  • JOBS
  • OPTIONS

EXAMPLES

Example 1: Count reads and rank genes

In the following example, hg19Ref.gtf is the ucsc knownGene table for hg19; sample1.sam and sample2.sam are the mapped reads in SAM format.

gfold count -ann hg19Ref.gtf -tag sample1.sam -o sample1.read_cnt
gfold count -ann hg19Ref.gtf -tag sample2.sam -o sample2.read_cnt
gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

example 2: Count reads

This example utilizes samtools to produce mapped reads in SAM format from BAM format.

samtools view sample1.bam | gfold count -ann hg19Ref.gtf -tag stdin -o sample1.read_cnt

Example 3: Identify differentially expressed genes without replicates

Suppose there are two samples: sample1 and sample2 with corresponding read count file being sample1.read_cnt sample2.read_cnt. This example finds differentially expressed genes using default parameters on two samples

gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 4: Identify differentially expressed genes with replicates

This example finds differentially expressed genes using default parameters on two group of samples.

gfold diff -s1 sample1,sample2,sample3 -s2 sample4,sample5,sample6 -suf .read_cnt -o 123VS456.diff

Example 5: Identify differentially expressed genes with replicates only in one condition

This example finds differentially expressed genes using default parameters on two group of samples. Only the first group contains replicates. In this case, the variance estimated based on the first group will be used as the variance of the second group.

gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff**

JOBS

  • -h

    Print help information

  • count

    Given the gene annotation in GTF/GPF/BED format and mapped short reads in SAM/BED format, count the number of reads mapped to each gene. Because of possible overlapping of multiple genes, a read could be mapped to the overlaped region of multiple genes. In this case, a read is counted multiple times with each time for each gene. Furthermore, if a gene is on multiple chromosomes or different strands of the same chromosome, only exons on one strand of one chromosome (the one appear first in the annotation file) will be assigned to this gene. Exons not on this strand of the chromosome will be discarded.

  • diff

    For each gene, calculate GFOLD value and other statistics. diff accepts the output of count as the input. Please refer to the output format of count for more information about the input format. If you are not satisfied with the strategy adopted by count, you can generate gene read counts by yourself. In the input for diff 'GeneSymbol' and 'Read Count' are critical. If you fake 'Gene exon length', the RPKM calculated will not be valid. You can fake other columns to make gfold run. Further more the order of gene names should be the same for different samples. The third column of the output of count (gene length) only influences the RPKM in the output of diff. If it is missing, RPKM will not be generated by diff. diff does not use the forth column of the output of count.


OPTIONS

  • -ann <file>

    Gene annotation file in GTF/GPF/BED format. Note that the knownGene table downloaded from UCSC is in GPF format. For job countonly.

  • -annf <GTF/GPF/BED>

    The format of gene annotation file. Default GTF (Gene Transfer Format). For job count only. For GTF format, in the last column, 'gene_id' and 'transcript_id' are required. 'gene_name' and other IDs are optional. For BED format, please provide format which contains 6 columns with the ID at 4'th column. For GPF (Gene Prediction Format), in short, from UCSC Table Browser, the 'knownGene' table with all fields is in GPF and the 'refGene' table without the first column is in GPF format. Note that for either 'knownGene' or 'refGene' table, the downloaded file would contain a header which should be removed before calling GFOLD. More specifically, a file in GPF format contains 12 columns separated by TABs (adapted from UCSC):nameName of genechromReference sequence chromosome or scaffoldstrand+ or - for strandtxStartTranscription start positiontxEndTranscription end positioncdsStartCoding region startcdsEndCoding region endexonCountNumber of exonsexonStartsExon start positions separated by commasexonEndsExon end positions separated by commasproteinIDUniProt display ID for Known Genes, UniProt accession or RefSeq protein ID for UCSC GenesalignIDUnique identifier for each (known gene, alignment position) pair

  • -tag <file>

    Short reads in SAM format. 'stdin' stands for standard input stream. For job count only.

  • -tagf <SAM/BED>

    The format of short reads. Default SAM. For job count only.

  • -s <T/F>

    Whether is the sequencing data strand specific? T stands for strand specific. Default F. If you are not clear about this, using default parameter should be OK even for the strand specific case. For job count only.

  • -acc <T/F>

    When no replicate is available, whether to use accurate method to calculate GFOLD value. T stands for accurate which depends on sequencing depth and slower, F stands for MCMC. Default T. For job diff only.

  • -o <file>

    The file for output for all jobs. For job diff, two output files will be generated. The first is <file> and the second is <file>.ext

  • -s1 <file>

    The prefix for gene read count of the 1st group output by count. Multiple prefixes are separated by commas. For job diff only. If you have gene read count generated by other ways instead of job count, make sure that the format are the same for all files. Each file contains two columns corresponding to gene names and read counts separated by a TAB. All files are sorted by gene names and have the same number of lines.

  • -s2 <file>

    The prefix for gene read count of the 2st group output by count. Multiple prefixes are separated by commas. For job diff only.

  • -d <file>

    Gene description file. The first two columns of the file will be used. The first column should be gene descriptions; The second column should be gene IDs, e.g. the first column of the output of Count. The gene description file can be downloaded from: http://uswest.ensembl.org/biomart/martview/3e5a3ce70eea4f8d0061a0eb03ef11ad . Please select the correct ID type during downloading such that the IDs match the gene IDs appearing in -ann

  • -suf <string>

    The suffix for gene read count file specified by -s1 and -s2. For job diff only.

  • -sc <num>

    The significant cutoff for fold change. Default 0.01. For job diff only.

  • -bi <num>

    For MCMC, the iterations for burn-in phase. Default 1000. For job diff only.

  • -si <num>

    For MCMC, the iterations for sampling phase. Default 1000. For job diff only.

  • -r <num>

    The maximum number of selected pairs for calculating empirical FDR. Default 20. For job diff only.

  • -v <num>

    Verbos level. A larger value gives more information of the running process. Default 2.

  • -norm <Count/DESeq/NO>

    The way to do normalization. 'Count' stands for normalization by total number of mapped reads. 'DESeq' stands for the normalization proposed by DESeq. 'NO' stands for no normalization. You can also specifiy a list of normalization constant separated by commas. E.g. 1.2,2.1,1.0,2.0. Note that the number of constants should be the same as the total number of samples (group1 and group2) and the order should be for -s1 followed by for -s2. GFOLD using normalization constants not by directly multiplication (scaling up) nor division (scaling down). The normalization constants will be built into the model. In the model, division or multiplication has no difference. Default 'DESeq'.


OUTPUT FORMAT

All fields in a output file are separated by TABs.

  • For JOB count:

    The output file contains 5 columns:

    1. GeneSymbol:For BED file, this is the 4'th column. For GPF file, this is the first column. For GTF format, this corresponds to 'gene_id' if it exists, 'NA' otherwise.
    2. GeneName:For BED file, this is always 'NA'. For GPF file, this is the 12'th column. For GTF format, this corresponds to 'gene_name' if it exists, 'NA' otherwise.
    3. Read Count:The number of reads mapped to this gene.
    4. Gene exon length:The length sum of all the exons of this gene.
    5. RPKM:The expression level of this gene in RPKM.
  • For JOB diff:

    The first output file contains 7 columns:

    1. GeneSymbol:
      Gene symbols. The order of gene symbol is the same as what appearing in the read count file.

    2. GeneName:
      Gene name. The order of gene name is the same as what appearing in the read count file.

    3. GFOLD:
      GFOLD value for every gene. The GFOLD value could be considered as a reliable log2 fold change. It is positive/negative if the gene is up/down regulated. The main usefulness of GFOLD is to provide a biological meanlingful ranking of the genes. The GFOLD value is zero if the gene doesn't show differential expression. If the log2 fold change is treated as a random variable, a positive GFOLD value x means that the probability of the log2 fold change (2nd/1st) being larger than x is (1 - the parameter specified by -sc); A negative GFOLD value x means that the probability of the log2 fold change (2st/1nd) being smaller than x is (1 - the parameter specified by -sc). If this file is sorted by this column in descending order then genes ranked at the top are differentially up-regulated and genes ranked at the bottom are differentially down-regulated. Note that a gene with GFOLD value 0 should never be considered differentially expressed. However, it doesn't mean that all genes with non-negative GFOLD value are differentially expressed. For taking top differentially expressed genes, the user is responsible for selecting the cutoff.

    4. E-FDR:
      Empirical FDR based on replicates. It is always 1 when no replicates are available.

    5. log2fdc:
      log2 fold change. If no replicate is available, and -acc is T, log2 fold change is based on read counts and normalization constants. Otherwise, log2 fold change is based on the sampled expression level from the posterior distribution.

    6. 1st RPKM:
      The RPKM for the first condition. It is available only if gene length is available. If multiple replicates are available, the RPKM is calculated simply by summing over replicates. Because RPKM is acturally using sequencing depth as the normalization constant, log2 fold change based on RPKM could be different from the log2fdc field.

    7. 2nd RPKM:
      The RPKM for the second condition. It is available only if gene length is available. Please refer to 1stRPKM for more information.

The second output file (.ext) contains the normalized read counts

  • GeneSymbol:Gene symbols. The order of gene symbol is the same as what appearing in the read count file.
  • Normalized Read Counts:Multiple columns (the same number of input samples) contains the read count normalized by the calculated normalization constants. The are only muturally comparable and the absolute value is meaningless.
  • Gene Description:The last column contains the associated gene description if the description file -d is provided.

相关文章

  • Gfold说明书(备份)

    本文资料来源 https://web.archive.org/web/20161125133249/http://...

  • 无生物学重复的差异分析—Gfold

    一、安装gfold Gfold软件是做没有生物学重复样本首选的软件https://zhanglab.tongji....

  • GFOLD

    GFOLD:广义倍数变化 有重复的转录组数据,用的比较多的是DEseq2(负二项分布),edgeR(负二项分布),...

  • Gfold安装:Error while loading shar

    conda 下安装gfold 解决: gfold 依赖于gsl库,先确认好已安装好gsl相关软件,which gs...

  • CORNAS摸索

    用到CORNAS是因为我拿到了没有重复的样本,想和GFOLD对比一下有何不同详见这篇:我的GFOLD摸索 其实很简...

  • GFOLD摸索

    用到GFOLD是因为我拿到了没有重复的样本,所以试试,第一次做,欢迎大家和我讨论参考文献是:GFOLD大部分安装参...

  • gfold 分析

    之前分析遇到2个问题:1.样本只有一个,没有replicates,是不是就没有办法进行分析?2.一个样本有3个重复...

  • gfold问题处理记录

    这次在新的服务器上安装gfold的时候,偶然发现GFOLD的官网已经404了,于是索性就conda一套进行安装co...

  • 用五份说明书轻松管理财务

    首先五份投资理财说明书是什么? 风险说明书、投资说明书、资产说明书、保险说明书、安全说明书 风险说明书 简七理财有...

  • 第11章 用五份说明书轻松管理财务

    风险说明书(略) 投资说明书 资产说明书 保险说明书 安全说明书 我们有必要准备一份说明书,详细记录每一笔存款和投...

网友评论

    本文标题:Gfold说明书(备份)

    本文链接:https://www.haomeiwen.com/subject/fvoooxtx.html