1. The computational tools fall into three categories:
(1) basic traditional statistical analysis,
(2) machine learning approaches, and
(3) assignment of functional and biological information to describe and understand protein interaction networks.
2. Guideline for analyzing big data
Step one: Observe your data and perform quality control
Step two: Traditional statistics
Groups identified by the researcher, either during experimental design or during the data observation step, can be compared here using Student’s t test, analysis of variance (ANOVA), and their nonparametric equivalents such as Kruskal-Wallis, in addition to regression modeling and other tests of traditional statistics. When many tests are run simultaneously, the resulting p-values should be adjusted with a multiple-testing correction such as the Benjamini-Hochberg procedure.
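The group comparison and correction described above can be sketched as follows. This is a minimal illustration on simulated data, not the workflow of any particular tool: the matrix sizes, the group names `control` and `treated`, and the helper `benjamini_hochberg` are assumptions made for the example.

```python
import numpy as np
from scipy import stats


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


# Simulated abundances: 200 hypothetical proteins, 6 samples per group.
rng = np.random.default_rng(0)
n_proteins, n_per_group = 200, 6
control = rng.normal(0.0, 1.0, size=(n_proteins, n_per_group))
treated = rng.normal(0.0, 1.0, size=(n_proteins, n_per_group))
treated[:20] += 3.0  # the first 20 proteins are truly shifted

# One Student's t test per protein (row), plus the nonparametric equivalent.
t_stat, p_t = stats.ttest_ind(control, treated, axis=1)
p_kw = np.array([stats.kruskal(c, t).pvalue for c, t in zip(control, treated)])

# Adjust for the 200 simultaneous tests.
q_t = benjamini_hochberg(p_t)
print("proteins with adjusted p < 0.05:", int((q_t < 0.05).sum()))
```

Adjusted values are never smaller than the raw p-values, so fewer proteins pass the threshold after correction than before it.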
Step three: Dimension reduction with machine learning
Use the classification and clustering methods shown in Table 1 to reduce the number of features. These methods fall into two groups, unsupervised and supervised.
(1) Unsupervised
Principal component analysis (PCA)
Independent component analysis (ICA)
K-means
Hierarchical clustering
(2) Supervised
Partial least squares (PLS)
Random forests (RF)
Support vector machine (SVM)
Software tools that support the methods above include Weka [14], Scikit-learn (Machine Learning in Python) [15], and SHOGUN [16].
Table 1 Summary and comparison of classification and clustering methods
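As a rough illustration of step three, the sketch below applies one unsupervised route (PCA followed by K-means) and one supervised route (a random forest ranking features by importance) using Scikit-learn. The data are simulated, and the sample counts, the shifted features, and all variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Simulated data: 40 samples x 50 features, two groups of 20 samples.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 50
labels = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_features))
X[labels == 1, :5] += 3.0  # only the first five features differ between groups

# Put every feature on the same scale before PCA.
X_scaled = StandardScaler().fit_transform(X)

# Unsupervised: project onto two principal components, then cluster.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Supervised: rank features by random forest importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_scaled, labels)
top_features = np.argsort(forest.feature_importances_)[::-1][:5]
print("top-ranked feature indices:", top_features)
```

The unsupervised route never sees `labels`, whereas the random forest uses them to decide which features matter. That is the practical difference between the two groups listed above.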
Step four: Pathway and network analysis
For pathway analysis, we refer to data analysis that aims to identify activated pathways or pathway modules from functional proteomic data.
For network analysis, we refer to data analysis that builds, overlays, visualizes, and infers protein interaction networks from functional proteomics and other systems biology data.
Table 2 Summary of functional and network tools
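One common way to flag an activated pathway is over-representation analysis with a hypergeometric test. The notes above do not name a specific method, so treat the sketch below as an assumption rather than what the tools in Table 2 do internally. The protein identifiers, pathway names, and the helper `hypergeom_enrichment_p` are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_hits, n_overlap):
    """P(overlap >= n_overlap) when n_hits proteins are drawn without
    replacement from a background containing n_pathway pathway members."""
    upper = min(n_pathway, n_hits)
    total = comb(n_background, n_hits)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical background of 100 measured proteins and two pathways.
background = {f"P{i}" for i in range(1, 101)}
pathways = {
    "pathway_A": {f"P{i}" for i in range(1, 11)},   # 10 members
    "pathway_B": {f"P{i}" for i in range(50, 70)},  # 20 members
}
# Hypothetical list of significantly changed proteins.
hits = {"P1", "P2", "P3", "P4", "P5", "P60", "P80", "P90"}

results = {}
for name, members in pathways.items():
    overlap = len(hits & members)
    results[name] = hypergeom_enrichment_p(
        len(background), len(members), len(hits), overlap
    )
    print(name, "overlap:", overlap, "p-value:", results[name])
```

When many pathways are tested at once, these p-values need the same multiple-testing correction as in step two.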
3. Longitudinal or time-series data
Several software tools are available that specifically address the problems associated with time-series data.
TimeClust is a stand-alone tool which is available for different platforms and allows the clustering of gene expression data collected over time with distance-based, model-based, and template-based methods [61]. There are also several other packages available in R such as maSigPro [62], timecourse [63], BAT [64], betr [65], fpca [66], timeclip [67], rnits [68], and STEM [69].
In Python, the probabilistic graphical query language (pGQL) [70] allows its user to interactively define linear HMM queries on time-course data using rectangular graphical widgets called probabilistic time boxes. The analysis is fully interactive, and the graphical display shows the time courses along with the graphical query.
In Java, PESTS [71] and OPTricluster [72], both of which are stand-alone tools with a GUI, are useful for clustering short time-series data.
In MATLAB, DynamiteC is a dynamic modeling and clustering algorithm that interleaves clustering of time-course gene expression data with estimation of dynamic models of their response by biologically meaningful parameters [73].
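To make the idea of distance-based clustering of short time courses concrete, here is a minimal sketch using hierarchical clustering with a correlation distance in SciPy. It does not reproduce the algorithm of TimeClust or of any other tool named above. The simulated profiles and all variable names are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Simulated short time courses: 6 time points, 10 rising and 10 falling genes.
rng = np.random.default_rng(0)
time_points = np.arange(6)
rising = time_points / 5.0
falling = 1.0 - rising
profiles = np.vstack([
    rising + rng.normal(0.0, 0.05, size=(10, 6)),
    falling + rng.normal(0.0, 0.05, size=(10, 6)),
])

# Correlation distance groups profiles by shape rather than by magnitude.
tree = linkage(profiles, method="average", metric="correlation")

# Cut the tree into two clusters.
cluster_ids = fcluster(tree, t=2, criterion="maxclust")
print(cluster_ids)
```

Correlation distance suits expression time courses because two genes with the same trend but different absolute levels end up close together.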