CummeRbund: Visualization and Exploration of Cufflinks
4 Reading cuffdiff output
cummeRbund was designed to process the multi-file output format for a ’cuffdiff’ differential expression analysis. In this type of analysis, a user will use a reference .gtf file (either known annotation or a .gtf file created from a cufflinks assembly or merge of assemblies) and quantitate the expression values and differential regulation of the annotation(s) in the .gtf file across two or more SAM/BAM files. By design, cuffdiff produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. It also tracks changes in the relative abundance of transcripts sharing a common transcription start site, and in the relative abundances of the primary transcripts of each gene. Tracking the former allows one to see changes in splicing, and the latter lets one see changes in relative promoter use within a gene. Note:Early versions of Cuffdiff required that transcripts in the input GTF be annotated with certain attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. This is no longer the case with >=v1.1.1 of cummeRbund, however we still recommend the use of both the following attributes in your GTF file to enable all downstream features of cummeRbund.
These attributes are:
- tss id: The ID of this transcript’s inferred start site. Determines which primary transcript this processed
transcript is believed to come from. Cuffcompare appends this attribute to every transcript reported in the
.combined.gtf file.
- p id The ID of the coding sequence this transcript contains. This attribute is attached by Cuffcompare to
the .combined.gtf records only when it is run with a reference annotation that include CDS records. Further,
differential CDS analysis is only performed when all isoforms of a gene have p id attributes, because neither
Cufflinks nor Cuffcompare attempt to assign an open reading frame to transcripts. cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group. The results are output in FPKM tracking files, the structure of which can be found in the cufflinks manual.
There are four FPKM tracking files:
- isoforms.fpkm tracking Transcript FPKMs
- genes.fpkm tracking Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene id
- cds.fpkm tracking Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p id,
independent of tss id
- tss groups.fpkm tracking Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss id
cuffdiff also performs differential expression tests between supplied conditions. This tab delimited file lists the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences. For detailed file structure see cufflinks manual.
Four .diff files are created:
- isoform exp.diff Transcript differential FPKM.
- gene exp.diff Gene differential FPKM. Tests difference sin the summed FPKM of transcripts sharing each gene id
- tss group exp.diff Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss id
- cds exp.diff Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p id independent of tss id
In addition, cuffdiff also performs differential splicing, CDS usage, and promoter usage tests for each gene across conditions:
- splicing.diff Differential splicing tests.
- CDS.diff Differential coding output.
- promoters.diff Differential promoter use.
All of these output files are related to each other through their various tracking ids, but parsing through individual files to query for important result information requires both a good deal of patience and a strong grasp of commandline text manipulation. Enter cummeRbund, an R solution to aggregate, organize, and help visualize this multilayered dataset. One of the principle benefits of using cummeRbund is that data are stored in a SQLite database. This allows for out of-memory analysis of data, quick retrieval, and only a one-time cost to setup the tables. By default, cummeRbund assumes that all output files from cuffdiff are in the current working directory. To read these files, populate the ’cuffData.db’ database backend, and return the CuffSet pointer object, you can do the following.
library(cummeRbund)
cuff<-readCufflinks()
cuff
5. Graphic
disp<-dispersionPlot(genes(cuff))
disp
genes.scv<-fpkmSCVPlot(genes(cuff))
isoforms.scv<-fpkmSCVPlot(isoforms(cuff))
#To assess the distributions of FPKM scores across samples, you can use the csDensity plot (Figure 1).
dens<-csDensity(genes(cuff))
dens
densRep<-csDensity(genes(cuff),replicates=T)
densRep
#Boxplots can be visualized using the csBoxplot method (Figure 2).
b<-csBoxplot(genes(cuff))
b
brep<-csBoxplot(genes(cuff),replicates=T)
brep
#A matrix of pairwise scatterplots can be drawn using the csScatterMatrix() method.
s<-csScatterMatrix(genes(cuff))
s<-csScatter(genes(cuff),"hESC","Fibroblasts",smooth=T)
s
#树状图
dend<-csDendro(genes(cuff))
dend.rep<-csDendro(genes(cuff),replicates=T)
#MvsA plots can be useful to determine any systematic bias that may be present between #conditions. The CuffData method MAplot() can be used to examine these intensity vs fold-#change plots. You must specify the sample names to use for the pairwise comparison with x #and y:
m<-MAplot(genes(cuff),"hESC","Fibroblasts")
m
mCount<-MAplot(genes(cuff),"hESC","Fibroblasts",useCount=T)
mCount
#Volcano plots are also available for the CuffData objects.
v<-csVolcanoMatrix(genes(cuff))
v
#For individual pairwise comparisons, you must specify the comparisons by sample name.
v<-csVolcano(genes(cuff),"hESC","Fibroblasts")
图片.png
图片.png
图片.png
图片.png
图片.png
图片.png
图片.png
图片.png
图片.png
6 Accessing Data
Cuffdiff run information
Run-level information such as run parameters, and sample information can be accessed from a CuffSet object by using the runInfo and replicates methods:
runInfo(cuff)
replicates(cuff)
#Features/Annotation
#Feature-level information can be accessed directly from a CuffData object using the fpkm, #repFpkm, count, diffData,
gene.features<-annotation(genes(cuff))
head(gene.features)
gene.fpkm<-fpkm(genes(cuff))
head(gene.fpkm)
gene.repFpkm<-repFpkm(genes(cuff))
head(gene.repFpkm)
gene.counts<-count(genes(cuff))
head(gene.counts)
isoform.fpkm<-fpkm(isoforms(cuff))
head(isoform.fpkm)
gene.diff<-diffData(genes(cuff))
head(gene.diff)
#Condition and feature names
#Vectors of sample names and feature names are available by using the samples and featureNames #methods:
sample.names<-samples(genes(cuff))
head(sample.names)
gene.featurenames<-featureNames(genes(cuff))
head(gene.featurenames)
#Convenience functions
#To facilitate Bioconductor-like operations, an ’FPKM-matrix’ can be returned easily using the #fpkmMatrix method:
gene.matrix<-fpkmMatrix(genes(cuff))
head(gene.matrix)
gene.rep.matrix<-repFpkmMatrix(genes(cuff))
head(gene.rep.matrix)
gene.count.matrix<-countMatrix(genes(cuff))
head(gene.count.matrix)
7 Creating Gene Sets
Gene Sets (stored in a CuffGeneSet object) can be created using the getGenes method on a CuffSet object. You must first create a vector of ’gene id’ or ’gene short name’ values to identify the genes you wish to select:
data(sampleData)
myGeneIds<-sampleIDs
myGeneIds
myGenes<-getGenes(cuff,myGeneIds)
myGenes
#The same fpkm, repFpkm, count, annotation, diffData, samples, and featureNames methods are #available for instances of the CuffGeneSet class, but additional accessor methods are available #for the promoters, relCDS, and splicing slot data as well.
#FPKM values for genes in gene set
head(fpkm(myGenes))
head(fpkm(isoforms(myGenes)))
head(repFpkm(TSS(myGenes)))
#As of cummeRbund v2.0 CuffGeneSet classes can be created from any type of identifier (’gene #id’,’isoform id’,’TSS group id’,or ’CDS id’). When you pass a list of identifiers that are not #gene id to getGenes(), the function attempts to lookup the parent gene id for each feature and #returns all relevant information for the given genes and all of their subfeatures (not just the #sub-features passed to getGenes()). If you are interested in just retrieving information for a
#given set of features, please use the new getFeatures() method described later.
myGeneset.pluri<-getGenes(cuff,myGeneIds,sampleIdList=c("hESC","iPS"))
myGeneset.pluri
7.1 Geneset level plots
There are several plotting functions available for gene-set-level visualization:
The csHeatmap() function is a plotting wrapper that takes as input either a CuffGeneSet or a CuffFeatureSet object (essentially a collection of genes and/or features) and produces a heatmap of FPKM expression values. The ’cluster’ argument can be used to re-order either ’row’, ’column’, or ’both’ dimensions of this matrix. By default, the Jensen-Shannon distance is used as the clustering metric, however, any function that produces a dist object can be passed to the ’cluster’ argument as well.
h<-csHeatmap(myGenes,cluster= ’ both ’ )
h
h.rep<-csHeatmap(myGenes,cluster= ’ both ’ ,replicates=T)
h.rep
b<-expressionBarplot(myGenes)
b
#The csScatter() method can be used to produce scatter plot comparisons between any two #conditions.
s<-csScatter(myGenes,"Fibroblasts","hESC",smooth=T)
s
#The volcano plot is a useful visualization to compare fold change between any two conditions #and significance (-log P-values).
v<-csVolcano(myGenes,"Fibroblasts","hESC")
v
#Similar plots can be made for all sub-level features of a CuffGeneSet class by specifying which #slot you would like to plot (eg. isoforms(myGenes),TSS(myGenes),CDS(myGenes)).
ih<-csHeatmap(isoforms(myGenes),cluster= ’ both ’ ,labRow=F)
ih
th<-csHeatmap(TSS(myGenes),cluster= ’ both ’ ,labRow=F)
th
#Dendrograms can provide insight into the relationships between conditions for various genesets #(e.g. significant genes used to draw relationships between conditions). As of v1.1.3 the method #csDendro() can be used to plot a dendrogram based on Jensen-Shannon distances between #conditions for a given CuffFeatureSet or CuffGeneSet.
den<-csDendro(myGenes)
图片.png
图片.png
图片.png
图片.png
图片.png
图片.png
8 Individual Genes
An individual CuffGene object can be created by using the getGene() function for a given ’gene id’ or ’gene short name’. As of cummeRbund ≥ v2.0 you can also use isoform id, tss group id, or cds id values to retrieve the corresponding parent gene object.
myGeneId<-"PINK1"
myGene<-getGene(cuff,myGeneId)
myGene
head(fpkm(myGene))
head(fpkm(isoforms(myGene)))
8.1 Gene-level plots
gl<-expressionPlot(myGene)
gl
gl.rep<-expressionPlot(myGene,replicates=TRUE)
gl.rep
gl.iso.rep<-expressionPlot(isoforms(myGene),replicates=T)
gl.iso.rep
gl.cds.rep<-expressionPlot(CDS(myGene),replicates=T)
gl.cds.rep
gb<-expressionBarplot(myGene)
gb
gb.rep<-expressionBarplot(myGene,replicates=T)
gb.rep
igb<-expressionBarplot(isoforms(myGene),replicates=T)
igb
gp<-csPie(myGene,level="isoforms")
gp
8.1.1 Gene Feature plots
If you included both the genome build and gtfFile in your call to readCufflinks() then you will be able to access some of the transcript-structure level features that are now being integrated into cummeRbund. For now, these features are extended only to the single gene, CuffGene objects.
head(features(myGene))
#The Gviz package can be used to display features in a ’track’-like format. In particular, the #GeneRegionTrack class creates a mechanism by which we can start to visualize transcript-level #structures in their genomic context. cummeRbund implements the makeGeneRegionTrack() method to #quickly create a GeneRegionTrack from the gene features.
genetrack<-makeGeneRegionTrack(myGene)
plotTracks(genetrack)
图片.png
图片.png
图片.png
图片.png 图片.png
9 Data Exploration
The cummeRbund package is more than just a visualization tool as well. We are working to implement several different means of data exploration from gene and condition clustering, finding features with similar expression profiles, as well as incorporating Gene Ontology analysis.
9.1 Overview of significant features
The sigMatrix() function can provide you with a “quick–and–dirty” view of the number of significant features of a particular type, and at a given alpha (0.05 by default). For example:
mySigMat<-sigMatrix(cuff,level= ’ genes ’ ,alpha=0.05)
9.2 Creating gene sets from significantly regulated genes
One of the primary roles of a differential expression analysis is to conduct significance tests on each feature (genes,isoforms, TSS, and CDS) for appropriate pairwise comparisons of conditions. The results of these tests (after multiple testing correction of course) can be used to determine which genes are differentially regulated. cummeRbund makes accessing the results of these significance tests simple via getSig(). This function takes a CuffSet object and will scan at various feature levels (’genes’ by default) to produce a vector of feature IDs. By default getSig() outputs a vector of tracking IDs corresponding to all genes that reject the null hypothesis in any condition tested. The default feature type can be changed by adjusting the ’level’ argument to getSig(). In addition, a alpha value can be provided on which to filter the resulting list (the default is 0.05 to match the default of cuffdiff).
mySigGeneIds<-getSig(cuff,alpha=0.05,level= ’ genes ’ )
head(mySigGeneIds)
length(mySigGeneIds)
#By default getSig() outputs a vector of tracking IDs corresponding to all genes that reject the #null hypothesis in any condition tested. The default feature type can be changed by adjusting #the ’level’ argument to getSig(). In addition, a alpha value can be provided on which to filter #the resulting list (the default is 0.05 to match the default of cuffdiff). Significance results #for specific pairwise comparisons can be retrieved as well by specifying the two conditions as #’x’ and ’y’. In this case, p-values are adjusted to reduce the impact of multiple-testing #correction when only one set of tests is being conducted.
hESC_vs_iPS.sigIsoformIds<-getSig(cuff,x= ’ hESC ’ ,y= ’ iPS ’ ,alpha=0.05,level= ’ isoforms ’ )
head(hESC_vs_iPS.sigIsoformIds)
length(hESC_vs_iPS.sigIsoformIds)
#The values returned for each level of this list can be used as an argument to getGenes, to #create a CuffGeneSet object of significantly regulated genes (or features).
mySigGenes<-getGenes(cuff,mySigGeneIds)
mySigGenes
#Alternatively, you can use the getSigTable() method to return a full test-table of ’significant #features’ x ’pairwise tests’ for all comparisons. Only features in which the null hypothesis #can be rejected in at least one test are reported.
mySigTable<-getSigTable(cuff,alpha=0.01,level= ’ genes ’ )
head(mySigTable,20)
9.3 Exploring the relationships between conditions
9.3.1 Distance matrix
Similarities between conditions and/or replicates can provide useful insight into the relationship between various groupings of conditions and can aid in identifying outlier replicates that do not behave as expected. cummeRbund provides the csDistHeat() method to visualize the pairwise similarities between conditions:
myDistHeat<-csDistHeat(genes(cuff))
myRepDistHeat<-csDistHeat(genes(cuff),replicates=T)
9.3.2 Dimensionality reduction
Dimensionality reduction is an informative approach for clustering and exploring the relationships between conditions. It can be useful for feature selection as well as identifying the sources of variability within your data. To this end, we have applied two different dimensionality reduction strategies in cummeRbund: principal component analysis (PCA) and multi-dimensional scaling (MDS). We provide the two wrapper methods, PCAplot and MDSplot
genes.MDS<-MDSplot(genes(cuff))
genes.PCA.rep<-PCAplot(genes(cuff),"PC1","PC2",replicates=T)
genes.MDS.rep<-MDSplot(genes(cuff),replicates=T)
genes.PCA<-PCAplot(genes(cuff),"PC1","PC2")
9.4 Partitioning
K-means clustering is a useful tool that can be helpful in identifying clusters of genes with similar expression profiles. In fact, these profiles are learned from the data during the clustering. csCluster() uses the pam() method from the clustering package to perform the partitioning around medoids. In this case however, the distance metric used by default is the Jensen-Shannon distance instead of the default Euclidean distance. Prior to performing this particular partitioning, the user must choose the number of clusters (K) into which the expression profiles should be divided.
ic<-csCluster(myGenes,k=4)
head(ic$cluster)
icp<-csClusterPlot(ic)
icp
9.5 Specificity
In some cases, a researcher may be interested in identifying features that are ’condition-specific’. Or, more likely, producing an ordered list of genes based on their specificity for a given condition. We define a specificity score (S) as the following:
myGenes.spec<-csSpecificity(myGenes)
head(myGenes.spec)
9.6 Finding similar genes
Another common question in large-scale gene expression analyses is ’How can I find genes with similar expression profiles to gene x?’. We have implemented a method, findSimilar to allow you to identify a fixed number of the most similar genes to a given gene of interest. For example, if you wanted to find the 20 genes most similar to ”PINK1”, you could do the following:
By default, findSimilar will return a CuffGeneSet of similar genes matching your criteria. Recently a few additional features have been added as well to enhance this type of exploration:
- If ’returnGeneSet’ is set to FALSE, then findSimilar returns a data.frame of distance-ranked similar genes with distances. This is useful if you would like to see a rank-ordered list of similar genes.
- The ’distThresh’ argument allows you to pass a value [between 0-1] to be used as a distance threshold instead of an arbitrary ’n’ number of genes. setting distThresh=1.0 will return all genes ranked by their distance to your gene of interest.
You are also able to provide your own expression profile in lieu of a ’gene id’. The vector provided must match the order and length of samples().
myProfile<-c(500,0,400)
mySimilar2<-findSimilar(cuff,myProfile,n=10)
mySimilar2.expression<-expressionPlot(mySimilar2,logMode=T,showErrorbars=F)
findSimilar() also uses the Jensen-Shannon distance between the probability distributions of each gene across conditions to determine the similarity. We have found this to be a more robust way to determine distance between genes using the high dynamic range of FPKM data. Future versions may allow for other dissimilarity measures to be used instead.
图片.png
图片.png
图片.png
图片.png
图片.png
图片.png
图片.png
10 Miscellaneous
- In appropriate plots, using the argument replicates=T will allow you to visualize replicate-level FPKM values either in lieu of or in addition to condition-level FPKMs.
- As of v1.1.3 we attempt to provide new visual cues in most plots that will indicate the quantification status for a particular feature in each given condition. We have enabled this feature by default for most plots to suggest a measure of reliability for each feature in a particular condition. In most cases, this feature can be disabled by setting ’showStatus=FALSE’.
- CummeRbund will now work with the hidden ’–no-diff’ argument for cuffdiff. This will quantify features against .bam files but not do differential testing. This is useful when you want to aggregate very large numbers of conditions, and cannot afford the time or space for the differential test results. (Not recommended unless you have a SPECIFIC need for this).
- All plotting functions return ggplot objects and the resulting objects can be manipulated/faceted/altered using standard ggplot2 methods.
- There are occasional DB connectivity issues that arise. Not entirely sure why yet. If necessary, just readCufflinks again and this should solve connectivity issues with a new RSQLite connection object. If connectivity continues to be a problem, try cuff<-readCufflinks(rebuild=T)
- I am still working on fully documenting each of the methods. There are a good number of arguments that exist, but might be hard to find without looking at the reference manual.
网友评论