Lecture 7 Gathering and Analyzing Large Data Sets
7.A Experimental Technologies
1. Genomics
Started with development of Microarrays
Extract mRNA → convert to cDNA with reverse transcriptase → couple to dyes → hybridize → visualize → computationally analyze to separate signal from noise
1) Sequencing the whole genome
Deep Sequencing - Repeated sequencing of a DNA fragment (a region of interest in a chromosome). Gives a substantial increase in sensitivity and accuracy
SNPs - Single nucleotide polymorphisms: single base pair variations in the genome that occur with relatively high frequency
CNVs - Copy number variations: alterations in DNA structure such that a region of the chromosome is abnormally duplicated or deleted
Exome Sequencing - Sequencing the expressed genome. Separate the part of the whole genome that codes for proteins (the exons) and then sequence it
ChIP-Seq - Sequencing transcription factor-bound DNA
ChIP - chromatin immunoprecipitation, using an antibody against a transcription factor of interest
2) RNA-Seq - Sequencing the expressed mRNA
Extract and fragment RNA → convert fragments to cDNA → sequence cDNA fragments and map onto reference genome
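The last step (map reads onto a reference, then count per gene) can be sketched as below. This is only a toy illustration: the reference string, gene coordinates and reads are made-up values, and exact string matching stands in for a real aligner, which would also handle mismatches and spliced reads.

```python
# Toy reference sequence and gene annotation (hypothetical values).
REFERENCE = "ATGGCGTACGTTAGCCTAGGCTTAACGGATCCGTA"
GENES = {"geneA": (0, 15), "geneB": (18, 35)}  # half-open [start, end)


def map_read(read, reference=REFERENCE):
    """Return the 0-based position of the first exact match, or None."""
    pos = reference.find(read)
    return pos if pos >= 0 else None


def count_reads(reads, genes=GENES):
    """Count reads that map entirely inside each annotated gene."""
    counts = {name: 0 for name in genes}
    for read in reads:
        pos = map_read(read)
        if pos is None:
            continue  # unmapped read
        for name, (start, end) in genes.items():
            if start <= pos and pos + len(read) <= end:
                counts[name] += 1
    return counts


reads = ["GCGTACG", "GGCTTAA", "CGGATCC", "TTTTTTT"]
counts = count_reads(reads)
print(counts)  # {'geneA': 1, 'geneB': 2}
```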
3) DNA Methylation: Addition of methyl groups to C in DNA in mammals
Typically the 5 position of C in CpG dinucleotides is methylated, leading to inhibition of gene expression. A topic of epigenetics research
Detection by genome wide bisulfite sequencing
Bisulfite converts C → U but leaves Me-C unchanged
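A minimal sketch of how bisulfite conversion reveals methylation, assuming a toy sequence and made-up methylated positions. After amplification and sequencing, U is read as T, so any reference C that still reads as C is inferred to be methylated.

```python
def bisulfite_convert(seq, methylated_positions):
    """Unmethylated C becomes U (read as T after PCR); methylated C is kept."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted):
    """Positions where a reference C is still read as C are inferred methylated."""
    return {
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, converted))
        if ref_base == "C" and read_base == "C"
    }


reference = "ACGTCCGA"   # toy sequence, hypothetical
methylated = {1, 5}      # hypothetical methylated CpG cytosines
converted = bisulfite_convert(reference, methylated)
print(converted)                               # ACGTTCGA
print(call_methylation(reference, converted))  # {1, 5}
```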
MicroRNAs (miRNAs)
Small (21-25 nucleotide) RNAs that regulate gene expression. Can be sequenced using RNA-Seq starting with size-selected RNAs. Around 1100 human miRNAs, named e.g. mir-001, mir-123 or mir-500
2. Proteomics
Phosphoproteomics - measuring peptides phosphorylated on Ser, Thr or Tyr
3. Metabolomics: Study of the full set of metabolites found in a cell, tissue or organism. Useful in understanding how phenotypic changes occur or not
7.B Analyzing Large Data Sets
1. Heatmap:
From HHMI: a free 26-slide tutorial on how to analyze DNA microarray data
http://www.hhmi.org/biointeractive/howanalyze-dna-microarray-data
2. Statistical tests:
T-tests can be used to test whether two sets of data are significantly different from each other. Generally used when the data in each group are approximately normally distributed
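As a rough sketch, the t statistic for two groups can be computed by hand. The version below is Welch's unequal-variance form, which is one common choice and an assumption here, and the expression values are made up. Converting t and the degrees of freedom into a p value needs the t distribution, which is left to a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    va, vb = variance(a), variance(b)   # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb             # squared standard error of the difference
    t = (mean(a) - mean(b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


control = [5.1, 4.9, 5.3, 5.0]   # hypothetical expression values
treated = [6.0, 6.2, 5.9, 6.1]
t, df = welch_t(control, treated)
print(t, df)
```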
ANOVA (analysis of variance) - commonly used when there are more than two groups in an experiment, to test the null hypothesis that all group means are equal, i.e. to determine whether any groups differ. Significance is judged at a user-defined threshold, e.g. a p value of 0.05 or 0.01
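A small sketch of the one-way ANOVA F statistic, using made-up group values. F is the ratio of between-group to within-group variability; turning it into a p value requires the F distribution and is not shown.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # hypothetical measurements
f_stat = one_way_anova_f(groups)
print(f_stat)
```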
Mann-Whitney - non-parametric test of the null hypothesis. Non-parametric means that no assumption is made about the underlying distribution of the data
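The Mann-Whitney U statistic depends only on how the values of the two samples are ordered relative to each other, which is why no distribution needs to be assumed. A minimal sketch with made-up values (the p value lookup is omitted):

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b): U_a counts pairs where a-value > b-value, ties count 0.5."""
    u_a = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


u_low, u_high = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u_low, u_high)  # 0.0 9.0 -> complete separation of the two samples
```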
Cluster Analysis - putting entities (e.g., genes) into groups such that entities within a group are more closely related to each other than to entities in another group. Often used to identify groups of genes expressed (or repressed) under a specified condition (perturbation, duration of treatment, etc.)
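One simple way to form such groups is single-linkage agglomerative clustering on expression profiles. The sketch below assumes Euclidean distance and a fixed merge threshold, and the gene names and values are made up. Other distance measures (e.g. correlation) and linkage rules are equally common.

```python
from math import dist  # Python 3.8+


def cluster_profiles(profiles, threshold):
    """Merge the closest pair of clusters until none is within `threshold`."""
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(dist(profiles[a], profiles[b]) for a in c1 for b in c2)

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return sorted(sorted(c) for c in clusters)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid


profiles = {                      # hypothetical expression over 3 conditions
    "g1": (1.0, 2.0, 3.0),
    "g2": (1.1, 2.1, 2.9),
    "g3": (5.0, 1.0, 0.0),
    "g4": (5.2, 0.9, 0.1),
}
clusters = cluster_profiles(profiles, threshold=1.0)
print(clusters)  # [['g1', 'g2'], ['g3', 'g4']]
```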
3. Gene-Set Enrichment Analysis:
4. Cufflinks and Cuffdiff
An open source program that assembles RNA-Seq reads aligned to a reference genome into transcripts and estimates their relative abundance
Cuffdiff can be used to detect changes in expression levels of individual transcripts
http://cufflinks.cbcb.umd.edu
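Cufflinks reports relative abundance in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows only that normalization and a plain log2 fold change between two conditions. The function names and numbers are illustrative assumptions, and Cuffdiff's actual statistical model for testing differential expression is not reproduced here.

```python
from math import log2


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def log2_fold_change(abundance_a, abundance_b, pseudocount=1.0):
    """log2 ratio of condition B over condition A; pseudocount avoids log(0)."""
    return log2((abundance_b + pseudocount) / (abundance_a + pseudocount))


# Hypothetical transcript: 500 fragments, 2000 bp long, 10 million mapped fragments.
abundance = fpkm(500, 2000, 10_000_000)
print(abundance)                 # 25.0
print(log2_fold_change(7, 31))   # 2.0
```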
5. Genome-wide Association Studies (GWAS)
Identification of variations in DNA sequence that are associated with increased risk of a disease
Most often focused on SNPs
Define phenotype: categorical or quantitative
Assemble populations for the control and disease groups
Genotype: sequence the whole genome or, for better-established cases, use SNP chips
Use an appropriate statistical test to establish association of SNPs with increased risk of disease
Bush W.S. and Moore J.H. (2012) PLoS Comput Biol 8(12): e1002822
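For a categorical (case/control) phenotype, one possible statistical test is a chi-square test on allele counts for a single SNP, sketched below with made-up counts. The choice of test is an assumption for illustration. A real study would also correct for testing many SNPs at once.

```python
def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Return (chi-square statistic with 1 df, odds ratio) for a 2x2 allele table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return chi2, odds_ratio


# Hypothetical allele counts: risk allele seen 60/100 times in cases, 40/100 in controls.
chi2, odds_ratio = allelic_association(60, 40, 40, 60)
print(chi2, odds_ratio)  # 8.0 2.25 (3.841 is the 1-df cutoff for p = 0.05)
```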
6. Proteomics Technologies
7. Gene Ontology
A bioinformatics resource that allows you to categorize genes/gene products (proteins): www.geneontology.org
It contains three categories: ‘Biological Process’, ‘Cellular Component’, ‘Molecular Function’
Each of these categories is organized in a hierarchical manner:
- More general terms are called Parents; the more specific terms beneath them are called Children
- The relationship between Parents and Children is further characterized by GO relations (e.g.: ‘is a’, ‘part of’, ‘has part’, ‘regulates’)
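The Parent/Child structure can be pictured as a graph in which each term points to its parents through a named relation. The sketch below uses a hand-written, simplified toy hierarchy (not actual GO records) to show how all the more general terms of a specific term can be collected.

```python
# Toy hierarchy: child term -> list of (relation, parent term). Hypothetical, simplified.
TOY_GO = {
    "mitochondrial membrane": [("part of", "mitochondrion")],
    "mitochondrion": [("is a", "organelle")],
    "organelle": [("is a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, parents=TOY_GO):
    """Collect every more general term reachable by following parent links."""
    found = []
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


print(ancestors("mitochondrial membrane"))
# ['mitochondrion', 'organelle', 'cellular component']
```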
8.A Network Building & Analysis and Data Organization
- Graph Theory
- Bayesian Networks
- Networks: Undirected graphs, directed graphs, sign-specified directed graphs
- Networks relevant to cellular systems biology: Cell signaling networks, protein-protein interaction (PPI) networks, gene regulatory networks
- Bioinformatics
Genes- Genomics
DNA Sequences and Sequence Analysis - GenBank
Proteins
Database of Protein Structures - PDB
Protein characteristics - UniProtKB
National Center for Biotechnology Information at the National Library of Medicine www.ncbi.nlm.nih.gov
- Database of Cell Signaling
KEGG: Kyoto Encyclopedia of Genes and Genomes: a database of biological functions and systems including pathways
Pathway Commons: Biological pathways from multiple organisms
GEO: Gene Expression Omnibus, a genomics database supported by NCBI - microarray and sequence-based data
OMIM: Online Mendelian Inheritance in Man - catalog of human genes and genetic disorders and traits
ENCODE: Encyclopedia of DNA Elements – all functional elements in the human DNA sequence
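The three graph types named at the start of this section (undirected, directed, sign-specified directed) differ only in what each edge records. A minimal sketch with made-up node names: undirected edges are unordered pairs, directed edges are ordered pairs, and signed directed edges also carry +1 (activation) or -1 (inhibition).

```python
# Hypothetical toy networks; node names are placeholders, not real proteins.
undirected = {frozenset(("P1", "P2")), frozenset(("P2", "P3"))}   # e.g. PPI
directed = {("Receptor", "Kinase"), ("Kinase", "TF")}              # source -> target
signed = {                                                          # +1 / -1 effects
    ("Receptor", "Kinase"): +1,
    ("Kinase", "Phosphatase"): +1,
    ("Phosphatase", "TF"): -1,
    ("Kinase", "TF"): +1,
}


def interacts(a, b, edges=undirected):
    """Undirected lookup: the order of the two nodes does not matter."""
    return frozenset((a, b)) in edges


def path_sign(path, signed_edges=signed):
    """Net effect along a path: the product of the edge signs."""
    sign = 1
    for source, target in zip(path, path[1:]):
        sign *= signed_edges[(source, target)]
    return sign


print(interacts("P2", "P1"))                                    # True
print(("Kinase", "Receptor") in directed)                       # False
print(path_sign(["Receptor", "Kinase", "TF"]))                  # 1
print(path_sign(["Receptor", "Kinase", "Phosphatase", "TF"]))   # -1
```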
8.B Building Networks from Large Datasets
- Genes2Networks and Lists2Networks
Combines lists of genes and proteins from an experiment with a background network of all known interactions (for species of interest) to produce a network of interest
- Tracing Pathways with ChEA and KEA
- From Expression Patterns to Regulatory Networks: Expression2Kinases (X2K)
- Visualization of Networks
Pajek - http://vlado.fmf.uni-lj.si/pub/networks/doc/gd.01/Pajek2.png
Cytoscape
- Visualizing Large-scale Dynamics
GATE: Grid Analysis of Time Series Expression
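The idea behind combining an experimental list with a background network (the Genes2Networks item above) can be sketched as follows. This is a deliberate simplification and not the tool's actual algorithm: it keeps background interactions between listed nodes, plus any unlisted node that directly links at least two listed nodes. All node names are placeholders.

```python
def build_subnetwork(seeds, background_edges):
    """Return background edges connecting seeds directly or via one intermediate."""
    seeds = set(seeds)
    neighbors = {}
    for a, b in background_edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Intermediates: nodes outside the list that touch at least two seeds.
    intermediates = {
        node
        for node, adjacent in neighbors.items()
        if node not in seeds and len(adjacent & seeds) >= 2
    }
    keep = seeds | intermediates
    return {
        tuple(sorted((a, b)))
        for a, b in background_edges
        if a in keep and b in keep and (a in seeds or b in seeds)
    }


background = [("S1", "X"), ("X", "S2"), ("S2", "S3"), ("S3", "Y"), ("Y", "Z")]
subnetwork = build_subnetwork(["S1", "S2", "S3"], background)
print(sorted(subnetwork))  # [('S1', 'X'), ('S2', 'S3'), ('S2', 'X')]
```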