2015-Human genomics-A survey of


Author: 英天 | Published: 2017-08-19 17:29

    1. The computational tools fall into three categories

    (1) basic traditional statistical analysis,
    (2) machine learning approaches, and
    (3) assignment of functional and biological information to describe and understand protein interaction networks.

    2. Guideline for analyzing big data

    Step one: Observe your data, quality control

    Step two: Traditional statistics

    Groups identified by the researcher, either during experimental design or during the data observation step, can be compared here using Student's t test, analysis of variance (ANOVA), and their nonparametric equivalents such as Kruskal-Wallis, in addition to regression modeling and other tests of traditional statistics. When many tests are run simultaneously, the resulting p-values should be corrected with a multiple testing correction such as the Benjamini-Hochberg algorithm.
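
    As a minimal sketch of the multiple testing correction mentioned in this step, the snippet below computes Benjamini-Hochberg adjusted p-values in plain Python. The function name `benjamini_hochberg` and the example p-values are hypothetical and chosen only for illustration; in practice the raw p-values would come from the t tests, ANOVA, or Kruskal-Wallis tests run on each feature.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, e.g. one test per protein.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.27, 0.6]
q = benjamini_hochberg(raw_p)

# Features that remain significant at a 5% false discovery rate.
significant = [i for i, value in enumerate(q) if value <= 0.05]
print(significant)
```

    Note that the third and fourth p-values are below 0.05 before correction but not after it, which is the kind of false positive the correction is meant to guard against.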

    Step three: Dimension reduction with machine learning

    Use the classification algorithms listed in Table 1 to reduce the number of features. These algorithms fall into two groups: unsupervised and supervised.

    (1) Unsupervised

    Principal component analysis (PCA)
    Independent component analysis (ICA)
    K-means
    Hierarchical clustering
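
    A minimal sketch of the unsupervised route, assuming scikit-learn is installed: PCA projects a samples-by-features matrix onto two components, and K-means then groups the samples in that reduced space. The data matrix here is synthetic and the variable names are hypothetical; a real analysis would start from the quality-controlled omics matrix of Step one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an omics matrix: 40 samples x 100 features.
# The second half of the samples is shifted in the first 30 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))
X[20:, :30] += 4.0

# Scale each feature, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# Cluster the samples in the reduced space; no group labels are used.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(scores)

print(scores.shape)
print(pca.explained_variance_ratio_)
```

    Scaling before PCA is a design choice made here so that features measured on larger scales do not dominate the components; whether it is appropriate depends on the dataset.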

    (2) Supervised

    Partial least squares (PLS)
    Random forests (RF)
    Support vector machine (SVM)
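
    A minimal sketch of the supervised route, assuming scikit-learn is installed: a random forest is trained on labeled samples and its feature importances are used to keep only the highest-ranked features. Using importances for feature reduction is one possible reading of this step, not necessarily the procedure in the surveyed paper, and the data, labels, and cutoff of 10 features are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 60 samples x 100 features, two groups of 30 samples.
# Only the first 5 features differ between the groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 2.0

# Train the forest on the group labels.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Rank features by importance and keep the 10 highest-ranked ones.
top_features = np.argsort(forest.feature_importances_)[::-1][:10]
X_reduced = X[:, top_features]

print(X_reduced.shape)
```

    In a real study the importances should be estimated inside cross-validation, since selecting features and evaluating a classifier on the same samples overstates performance.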

    Software tools that support the algorithms above include Weka [14], scikit-learn (Machine Learning in Python) [15], and SHOGUN [16].


    Table 1 Summary and comparison of classification and clustering methods

    Step four: Pathway and network analysis

    For pathway analysis, we refer to data analysis that aims to identify activated pathways or pathway modules from functional proteomic data.

    For network analysis, we refer to data analysis that builds, overlays, visualizes, and infers protein interaction networks from functional proteomics and other systems biology data.


    Table 2 Summary of functional and network tools

    3. Longitudinal or time-series data

    Several software tools are available that specifically address the problems associated with time-series data.
    TimeClust is a stand-alone tool that is available for different platforms and clusters gene expression data collected over time using distance-based, model-based, and template-based methods [61]. Several other packages are available in R, such as maSigPro [62], timecourse [63], BAT [64], betr [65], fpca [66], timeclip [67], rnits [68], and STEM [69].
    In Python, the probabilistic graphical query language (pGQL) [70] lets users interactively define linear HMM queries on time-course data using rectangular graphical widgets called probabilistic time boxes. The analysis is fully interactive, and the graphical display shows the time courses along with the graphical query.
    In Java, PESTS [71] and OPTricluster [72], both stand-alone tools with a GUI, are useful for clustering short time-series data.
    In MATLAB, DynamiteC is a dynamic modeling and clustering algorithm that interleaves the clustering of time-course gene expression data with the estimation of dynamic models of their response using biologically meaningful parameters [73].
