Lecture 7 Gathering and Analyzing Large Data Sets

7.A Experimental Technologies

1. Genomics

Started with the development of microarrays
Extract mRNA → convert to cDNA by reverse transcriptase → couple to dyes → hybridize → visualize → computationally analyze to separate signal from noise
1) Sequencing the whole genome
Deep Sequencing - Repeated sequencing of a DNA fragment (a region of interest in a chromosome), giving a substantial increase in sensitivity and accuracy
SNPs -- Single nucleotide polymorphisms: single base pair variations in the genome that occur with relatively high frequency
CNVs -- Copy number variations: alterations in DNA structure such that a region of the chromosome is abnormally duplicated or deleted
Exome Sequencing -- Sequencing the protein-coding part of the genome: separate the part of the whole genome that codes for proteins (the exons) and then sequence it

ChIP-Seq -- Sequencing transcription factor-bound DNA
ChIP - chromatin immunoprecipitation, using an antibody against a transcription factor of interest
2) RNA-Seq - Sequencing the expressed mRNA
Extract and fragment RNA → convert fragments to cDNA → sequence the cDNA fragments and map them onto a reference genome
3) DNA Methylation: Addition of methyl groups to C in DNA in mammals
Typically the 5 position (carbon 5 of the cytosine ring) of C in CpG dinucleotides is methylated, leading to inhibition of gene expression. This is a subject of epigenetics research.

Detection by genome wide bisulfite sequencing
Bisulfite converts C→ U but not Me-C
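
The bisulfite logic above can be sketched in a few lines of Python. This is a toy illustration, not a real methylation caller: the sequence, the methylated positions, and the function names bisulfite_convert and call_methylation are made up for the example, and it assumes one error-free read from the same strand as the reference.

```python
def bisulfite_convert(seq, methylated):
    """Simulate bisulfite treatment: unmethylated C becomes U, Me-C stays C."""
    return "".join(
        "U" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted):
    """Positions where the reference has C and the treated sequence still has C."""
    return {
        i
        for i, (ref_base, conv_base) in enumerate(zip(reference, converted))
        if ref_base == "C" and conv_base == "C"
    }


reference = "ACGTCGACCG"   # hypothetical sequence
methylated = {1, 8}        # hypothetical methylated cytosines (0-based positions)

converted = bisulfite_convert(reference, methylated)
called = call_methylation(reference, converted)
print(converted, sorted(called))
```

Any C that survives treatment is inferred to have been methylated, which is why comparing treated reads against the untreated reference is enough to recover the methylated positions in this idealized setting.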

MicroRNAs (miRNAs)
Small (21-25 nucleotide) RNAs that regulate gene expression. Can be sequenced using RNA-Seq starting with size-selected RNAs. Around 1100 human miRNAs, named e.g. mir-001, mir-123 or mir-500

2. Proteomics

Phosphoproteomics: measuring peptides phosphorylated on Ser, Thr or Tyr

3. Metabolomics: Study of the full set of metabolites found in a cell, tissue or organism. Useful in understanding how phenotypic changes occur or not

7.B Analyzing Large Data Sets

1. Heatmap:

From HHMI: a free 26-slide tutorial on how to analyze DNA microarray data
http://www.hhmi.org/biointeractive/howanalyze-dna-microarray-data

2. Statistical tests:

T-tests can be used to test whether two sets of data are significantly different from each other. Generally used if the test statistic follows a normal distribution
ANOVA (analysis of variance) - commonly used to test the null hypothesis and determine whether there is a difference between any two groups when there are more than two groups in an experiment. Significance is judged at a user-defined threshold, e.g. a p value of 0.05 or 0.01
Mann-Whitney - a non-parametric test of the null hypothesis. Non-parametric means there is no assumption regarding the distribution of the test statistic
Cluster Analysis - putting entities (e.g. genes) into groups such that entities within a group are more closely related to each other than to entities in another group. Often used to identify groups of genes expressed (or repressed) under a specified condition (perturbation, duration of treatment, etc.)
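
To make the t-test and Mann-Whitney entries above concrete, here is a minimal sketch that computes only the two test statistics with the Python standard library. The expression values are made up, welch_t uses the unequal-variance (Welch) form of the t statistic as an assumption, and turning either statistic into a p value would need distribution tables or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def mann_whitney_u(a, b):
    """U statistic for sample a: number of pairs (x in a, y in b) with x > y.

    Ties count as half a pair. No distributional assumption is used.
    """
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )


control = [5.1, 4.9, 5.3, 5.0, 5.2]   # made-up expression values
treated = [6.0, 6.4, 5.9, 6.3, 6.1]   # made-up expression values

t = welch_t(treated, control)
u = mann_whitney_u(treated, control)
print(round(t, 2), u)
```

The t statistic depends on the means and variances of the two groups, while U depends only on how the values rank against each other, which is what makes the second test non-parametric.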

3. Gene-Set Enrichment Analysis:

4. Cufflinks and Cuffdiff

An open-source program that assembles RNA-Seq reads aligned to a reference genome into transcripts and estimates their relative abundance
Cuffdiff can be used to detect changes in expression levels of individual transcripts: http://cufflinks.cbcb.umd.edu
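
Relative abundance from RNA-Seq is commonly expressed as FPKM (fragments per kilobase of transcript per million mapped fragments), the unit Cufflinks reports. The sketch below shows only that arithmetic and a log2 fold change between two conditions. The counts are invented, the function name fpkm is hypothetical, and neither Cufflinks's abundance model nor Cuffdiff's statistical test is reproduced here.

```python
from math import log2


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript, 2,000 bp long, measured in two conditions.
condition_a = fpkm(fragments=500, transcript_length_bp=2000,
                   total_mapped_fragments=20_000_000)
condition_b = fpkm(fragments=2400, transcript_length_bp=2000,
                   total_mapped_fragments=24_000_000)

log2_fold_change = log2(condition_b / condition_a)
print(condition_a, condition_b, log2_fold_change)
```

Normalizing by transcript length and by sequencing depth is what allows abundances to be compared across transcripts and across samples.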

5. Genome-wide Association Studies (GWAS)

Identification of variations in DNA sequence that are associated with increased risk of a disease
Most often focused on SNPs
Define phenotype: categorical or quantitative

Assemble patient populations for the control and disease groups

Sequence the whole genome, or for better-established cases use SNP chips
Use an appropriate statistical test to establish association of SNPs with increased risk of disease
Bush W.S. and Moore J.H. (2012) PLoS Comput Biol 8(12): e1002822
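
As one example of the statistical association step for a categorical (case/control) phenotype, the sketch below runs a Pearson chi-square test on a 2x2 table of allele counts for a single SNP and reports the odds ratio. The counts are invented and the function names are hypothetical. A real study would test many SNPs and would have to correct for multiple testing, which is not shown.

```python
from math import erfc, sqrt


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square (1 degree of freedom) on a 2x2 allele-count table.

    Returns the statistic and its p value.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    numerator = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    denominator = (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    )
    chi2 = numerator / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


def odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds of carrying the alternate allele in cases relative to controls."""
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


# Invented counts: 1,000 cases and 1,000 controls, two alleles per person.
counts = dict(case_alt=600, case_ref=1400, ctrl_alt=400, ctrl_ref=1600)
chi2, p_value = allelic_chi2(**counts)
ratio = odds_ratio(**counts)
print(round(chi2, 2), p_value, round(ratio, 3))
```

An odds ratio above 1 together with a small p value is the pattern read as "this allele is associated with increased risk".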

6. Proteomics Technologies

7. Gene Ontology

A bioinformatics resource that allows you to categorize genes/gene products (proteins): www.geneontology.org
It contains three categories: ‘Biological Process’, ‘Cellular Component’, ‘Molecular Function’
Each of these categories is organized in a hierarchical manner:

More general terms are called Parents; the more specific terms beneath them are called Children
The relationship between Parents and Children is further characterized by GO relations (e.g.: ‘is a’, ‘part of’, ‘has part’, ‘regulates’)
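
The parent/child structure described above can be sketched as a small graph in which each term lists its parents together with the relation that links them. The terms and edges below are simplified for illustration and are not copied from the actual ontology. The helper name ancestors is hypothetical.

```python
# Simplified, illustrative terms: child -> list of (relation, parent) edges.
parents = {
    "nucleolus": [("part of", "nucleus")],
    "nuclear membrane": [("part of", "nucleus"), ("is a", "membrane")],
    "nucleus": [("is a", "organelle")],
    "membrane": [("is a", "cellular component")],
    "organelle": [("is a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, parents):
    """All more general terms reachable from term by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("nuclear membrane", parents)))
```

Because a child may have more than one parent, as "nuclear membrane" does here, the traversal has to track visited terms. A gene annotated with a specific term is implicitly annotated with every ancestor of that term.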

8.A Network Building & Analysis and Data Organization

Graph Theory
Bayesian Networks
Networks: Undirected graphs, directed graphs, sign-specified directed graphs
Networks relevant to cellular systems biology: Cell signaling networks, protein-protein interaction (PPI) networks, gene regulatory networks
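
The three graph types named above differ only in how much information each edge carries. A minimal sketch, assuming made-up node names and hypothetical helper functions: a PPI network as an undirected graph, a gene regulatory network as a directed graph, and a signaling pathway as a sign-specified directed graph where +1 means activation and -1 means inhibition.

```python
from collections import defaultdict


def undirected_adjacency(edges):
    """Each edge is stored in both directions."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def directed_adjacency(edges):
    """Each edge is stored from source to target only."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
    return adjacency


def path_sign(path, signs):
    """Net effect along a path in a signed directed graph: product of edge signs."""
    sign = 1
    for source, target in zip(path, path[1:]):
        sign *= signs[(source, target)]
    return sign


# All node names below are placeholders.
ppi = undirected_adjacency([("P1", "P2"), ("P2", "P3")])
regulation = directed_adjacency([("TF1", "GeneA"), ("TF1", "GeneB")])
signaling = {
    ("Ligand", "Receptor"): +1,
    ("Receptor", "Kinase"): +1,
    ("Kinase", "TF1"): -1,
}

net_effect = path_sign(["Ligand", "Receptor", "Kinase", "TF1"], signaling)
print(sorted(ppi["P2"]), sorted(regulation["TF1"]), net_effect)
```

With signs on the edges, the overall effect of an upstream node on a downstream one can be read off a path: here two activations followed by one inhibition give a net inhibition.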
Bioinformatics
Genes - Genomics
DNA Sequences and Sequence Analysis - GenBank
Proteins
Database of Protein Structures - PDB
Protein characteristics - UniProtKB
National Center for Biotechnology Information at the National Library of Medicine www.ncbi.nlm.nih.gov
Database of Cell Signaling

KEGG: Kyoto Encyclopedia of Genes and Genomes, a database of biological functions and systems including pathways
Pathway Commons: Biological pathways from multiple organisms
GEO: Gene Expression Omnibus, a genomics database supported by NCBI covering microarray and sequence-based data
OMIM: Online Mendelian Inheritance in Man, a catalog of human genes and genetic disorders and traits
ENCODE: Encyclopedia of DNA Elements, covering all functional elements in the human DNA sequence

8.B Building Networks from Large Datasets

Genes2Networks and Lists2Networks
Combines lists of genes and proteins from an experiment with a background network of all known interactions (for species of interest) to produce a network of interest
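
The idea of combining an experimental list with a background network can be sketched as follows. This is a deliberately simplified stand-in, not the algorithm used by Genes2Networks: it keeps background edges that link two listed genes directly or through one shared intermediate. The function name seed_subnetwork, the node names, and the edges are all invented.

```python
from itertools import combinations


def seed_subnetwork(background_edges, seeds):
    """Edges linking seeds directly, or through one shared non-seed intermediate."""
    adjacency = {}
    for u, v in background_edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Only seeds that appear in the background network can be connected.
    present = {seed for seed in seeds if seed in adjacency}

    kept = set()
    for a, b in combinations(sorted(present), 2):
        if b in adjacency[a]:
            kept.add(frozenset((a, b)))
        for intermediate in (adjacency[a] & adjacency[b]) - present:
            kept.add(frozenset((a, intermediate)))
            kept.add(frozenset((intermediate, b)))
    return kept


# Invented background network and experimental list.
background = [("A", "X"), ("X", "B"), ("A", "B"), ("B", "Y"), ("Y", "Z"), ("C", "Z")]
experiment_list = ["A", "B", "C"]

subnetwork = seed_subnetwork(background, experiment_list)
print(sorted(sorted(edge) for edge in subnetwork))
```

In this toy case A and B end up connected both directly and through the intermediate X, while C stays out of the subnetwork because it is more than one intermediate away from the other listed genes.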
Tracing Pathways with ChEA and KEA
From Expression Patterns to Regulatory Networks - Expression2Kinases (X2K)
Visualization of Networks
Pajek - http://vlado.fmf.uni-lj.si/pub/networks/doc/gd.01/Pajek2.png
Cytoscape
Visualizing Large-scale Dynamics
GATE: Grid Analysis of Time Series Expression