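- Paper: Hi-C analysis: from data generation to integration
The article first reviews why 3D genomics was bound to emerge, how 3C, 4C, 5C and Hi-C developed, and the basic steps of a Hi-C experiment. These points are not repeated here; see the earlier reading notes.
1. Current characteristics of Hi-C
1.1 multiple scales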
(1)large scale: A/B compartment
genome is organized in two distinct compartments
- A compartment: active; open chromatin
- B compartment: inactive; closed chromatin: compact heterochromatin
(2)fine scale: TAD
regions characterized by high intradomain contact frequency and reduced interdomain contacts.
(3)finer scale: loop
- identify specific points of contact between distant chromatin regions
- especially challenging given the resolution limit of Hi-C data
1.2
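As shown above, moving from compartments to TADs and then to loops requires progressively higher resolution (that is, a smaller bin size).
How to improve resolution is therefore an important question; the article mainly discusses the following two aspects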
(1)the restriction enzymes
- ultimate resolution limit of Hi-C data
- The original Hi-C protocol was based on HindIII and NcoI restriction enzymes, both recognizing and cutting a 6 bp long sequence: AAGCTT and CCATGG, respectively
- The fewer bases a restriction enzyme's recognition site contains, the more target restriction sites it finds, which results in smaller fragments and therefore higher resolution.
- Later reports used DpnII, whose restriction site is GATC, or a restriction enzyme recognizing an RCGY motif (with R equal to A or G and Y equal to C or T), to achieve an even smaller average fragment size
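The link between recognition-site length and fragment size can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes the four bases are independent and equally frequent (real genomes are not), and it reads R and Y as the standard IUPAC codes (A/G and C/T). The function name `expected_fragment_size` is made up for this illustration.

```python
# IUPAC codes needed here: each code maps to the bases it can match.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT"}

def expected_fragment_size(motif: str) -> float:
    """Mean spacing between sites, ASSUMING independent, equally frequent bases."""
    p_site = 1.0
    for code in motif:
        # Probability that a random base matches this position of the motif.
        p_site *= len(IUPAC[code]) / 4.0
    return 1.0 / p_site

for name, motif in [("HindIII", "AAGCTT"), ("DpnII", "GATC"), ("RCGY cutter", "RCGY")]:
    print(f"{name:12s} {motif:7s} ~{expected_fragment_size(motif):.0f} bp")
```

Under these assumptions a 6 bp cutter gives fragments of about 4 kb on average, a 4 bp cutter about 256 bp, and the degenerate RCGY motif about 64 bp.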
(2)sequencing depth
- the most striking effort in improving Hi-C data resolution
- the first Hi-C dataset had 95 million reads in total (up to 30 million per sample) .[1kb]
- the recent articles have reached up to 40 billion reads per dataset (up to 7.3 billion per sample)
- More recently, the local chromatin topology of the Drosophila genome has been investigated at a resolution of 500 bp (average fragment size) to characterize domains at sub-kb resolution.
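A rough way to see why finer resolution needs so many more reads is to count how many bin pairs the reads have to be spread over. The sketch below is plain arithmetic, not taken from the paper; the genome size is an assumed round figure and the helper names are hypothetical.

```python
def n_bins(genome_size_bp: int, bin_size_bp: int) -> int:
    """Number of bins needed to tile the genome (ceiling division)."""
    return -(-genome_size_bp // bin_size_bp)

def n_bin_pairs(n: int) -> int:
    """Number of cells in the upper triangle of an n x n matrix, diagonal included."""
    return n * (n + 1) // 2

GENOME_SIZE = 3_000_000_000  # ASSUMPTION: round figure for a human-sized genome

for bin_size in (1_000_000, 40_000, 1_000):
    n = n_bins(GENOME_SIZE, bin_size)
    print(f"bin size {bin_size:>9,} bp: {n:>9,} bins, {n_bin_pairs(n):,} bin pairs")
```

Shrinking the bin size by a factor of 1000 multiplies the number of bin pairs by roughly a million, which is why sequencing depth has had to grow so steeply.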
The contact maps at 40 kb resolution can clearly highlight topological domains
1.3 future
As more and more datasets become available, it will become increasingly important to establish common and standardized procedures to assess data quality and reproducibility of replicates.
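2. Hi-C analysis workflow
Put simply, Hi-C analysis consists of two major steps: raw data → Hi-C contact matrix → downstream analysis
(In that sense it resembles RNA-seq: first derive an expression matrix from the raw sequencing data, then run the downstream analysis.)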
2.1 raw data(fastq) → Hi-C contact matrix
(1)alignment: reads are aligned to the reference genome
- Hi-C paired-end reads are aligned separately, as they are expected to map in different unrelated regions of the genome.
- tools: bowtie, bwa
- challenge: chimeric reads (the read spans the ligation junction, so two portions of the same read match distinct genomic positions).
These require specific strategies that attempt to map the different portions of the read.
approaches: ICE, TADbit, HiCUP, HIPPIE, Juicer, HiC-Pro
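To make the chimeric-read problem concrete, here is a minimal sketch of one possible strategy: look for the ligation junction inside the read and split the read there, so that each part can be aligned on its own. This is not the implementation of any of the tools listed above. The junction sequences assume a protocol with end fill-in followed by blunt-end ligation, and `split_at_junction` is a hypothetical helper.

```python
# ASSUMPTION: junctions produced by end fill-in plus blunt-end ligation.
JUNCTIONS = {"HindIII": "AAGCTAGCTT", "DpnII": "GATCGATC"}

def split_at_junction(read: str, junction: str):
    """Split a read in the middle of the first ligation junction it contains.

    Returns (left, right); right is None when the read has no junction.
    """
    pos = read.find(junction)
    if pos == -1:
        return read, None
    cut = pos + len(junction) // 2
    return read[:cut], read[cut:]

chimeric = "ACGTACGT" + JUNCTIONS["HindIII"] + "TTGGCCAA"
print(split_at_junction(chimeric, JUNCTIONS["HindIII"]))
print(split_at_junction("ACGTACGTACGT", JUNCTIONS["HindIII"]))
```

Each returned piece would then be passed to the aligner separately; very short pieces are usually not worth mapping.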
(2)quality control: filtered to remove spurious signal
- Read level filters include the removal of reads with low alignment quality or PCR artifacts
- i.e. multiple read pairs mapped in the same positions.
- More recently, the MAD-max (maximum allowed median absolute deviation) filter on genomic coverage has been proposed to remove low-coverage bins.
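The two read-level filters can be sketched as follows. The record layout (`ReadPair`) and the MAPQ cutoff of 30 are assumptions chosen for this illustration, not values from the paper, and the MAD-max coverage filter is not covered here.

```python
from typing import NamedTuple

class ReadPair(NamedTuple):
    # Hypothetical minimal record for one aligned Hi-C read pair.
    chrom1: str
    pos1: int
    strand1: str
    mapq1: int
    chrom2: str
    pos2: int
    strand2: str
    mapq2: int

def filter_pairs(pairs, min_mapq=30):
    """Drop low-quality pairs, then keep one pair per set of identical coordinates."""
    seen = set()
    kept = []
    for p in pairs:
        # Both ends must be confidently aligned.
        if min(p.mapq1, p.mapq2) < min_mapq:
            continue
        # Pairs with identical positions and strands are treated as PCR duplicates.
        key = (p.chrom1, p.pos1, p.strand1, p.chrom2, p.pos2, p.strand2)
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept

pairs = [
    ReadPair("chr1", 1000, "+", 42, "chr1", 90000, "-", 40),
    ReadPair("chr1", 1000, "+", 42, "chr1", 90000, "-", 37),  # duplicate coordinates
    ReadPair("chr1", 5000, "+", 3, "chr2", 70000, "+", 42),   # low MAPQ on one end
    ReadPair("chr2", 2000, "-", 40, "chr3", 15000, "+", 41),
]
kept = filter_pairs(pairs)
print(len(kept), "of", len(pairs), "pairs kept")
```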
(3)set bin
- The choice of the bin size used to summarize results is de facto defining the final resolution of analysis results.
- The genomic bins allow achieving a more robust and less noisy signal in the estimation of contact frequencies.
- Recently, two approaches to determine optimal bin size have been proposed (deDoc, QuASAR)
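Binning itself is simple: every read pair is assigned to the pair of bins its two ends fall into, and pairs are counted per bin pair. The sketch below handles a single chromosome and keeps the counts in a sparse dictionary; `bin_contacts` and the example coordinates are hypothetical.

```python
from collections import Counter

def bin_contacts(pairs, bin_size):
    """Count read pairs per (bin_i, bin_j) with bin_i <= bin_j.

    pairs: iterable of (pos1, pos2) coordinates on the same chromosome.
    """
    counts = Counter()
    for pos1, pos2 in pairs:
        i, j = sorted((pos1 // bin_size, pos2 // bin_size))
        counts[(i, j)] += 1
    return counts

pairs = [(100, 45_000), (39_999, 40_000), (81_000, 5_000), (10, 20)]
print(bin_contacts(pairs, bin_size=40_000))
```

Running the same function with a larger `bin_size` merges neighbouring bins, trading resolution for less noisy counts.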
(4)normalization
- Read counts binning and normalization are usually coupled and performed simultaneously by the same tools.
- Two major strategies: explicit and implicit
- Explicit method: compute a correction factor for each bin (computed for each of the considered biases and their combination)
- Implicit method (matrix balancing): assume that each genomic locus should have "equal visibility", i.e., the interaction signal, as measured by Hi-C for each genomic locus, should add up to the same total amount. (tools: SCN, ICE...)
still an open problem: the normalization of Hi-C data originating from genomes with copy number alterations.
An earlier work proposed a solution with an additional scaling factor to be applied on top of ICE normalization to correct for aneuploidies with whole chromosome duplications or deletions.
Recent publications proposed instead more generalizable solutions adding a correction factor to matrix-balancing normalization to model and adjust the effect of local copy number variations.
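The "equal visibility" idea behind the implicit methods can be illustrated with a small iterative balancing loop in the spirit of ICE. This is a simplified sketch in plain Python, not the code of any published tool: it skips the filtering of poorly covered bins that real pipelines apply, and it does not address the copy-number problem described above. The name `ice_balance` and its default parameters are assumptions.

```python
def ice_balance(matrix, n_iter=200, tol=1e-9):
    """Iteratively rescale a symmetric matrix until all row sums are equal.

    Returns (balanced_matrix, bias), where
    matrix[i][j] ~= balanced[i][j] * bias[i] * bias[j].
    """
    n = len(matrix)
    balanced = [[float(v) for v in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in balanced]
        covered = [s for s in sums if s > 0]
        if not covered:
            break
        mean = sum(covered) / len(covered)
        # Rows with no signal are left untouched (scale factor 1).
        scale = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                balanced[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
        if max(abs(s - 1.0) for s in scale) < tol:
            break
    return balanced, bias

raw = [[10, 4, 1], [4, 8, 2], [1, 2, 6]]
balanced, bias = ice_balance(raw)
print([round(sum(row), 6) for row in balanced])
```

After balancing, every row sums to the same value, and the accumulated `bias` vector plays the role of the per-bin visibility.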
2.2 Hi-C contact matrix → Downstream analysis
That is, the three levels of analysis described in 1.1: compartment, TAD, loop
(1)Tools to call compartments
- first level of chromatin organization
- steps
A: build a matrix of observed over expected Hi-C signal ratios, where the expected signal is obtained from a distance-normalized contact matrix;
B: calculate the Pearson correlation of the distance-normalized map;
C: run PCA and use the sign of the first eigenvector (first principal component) to assign each bin to a compartment
- Tools: a similar approach is available in HOMER, whereas loess calculation of distance dependency is implemented in Cworld (https://github.com/dekkerlab/cworld-dekker) and in the HiTC R package
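The three steps can be sketched end to end on a toy matrix. The code below is a plain-Python illustration under simplifying assumptions: the expected signal is the mean of each diagonal (no loess smoothing), the first eigenvector is obtained by power iteration on the correlation matrix, and all function names are hypothetical. The sign of an eigenvector is arbitrary, so deciding which sign corresponds to A needs an external track in practice.

```python
import math

def observed_over_expected(m):
    """Step A: divide each diagonal by its mean (the distance-dependent expectation)."""
    n = len(m)
    oe = [[0.0] * n for _ in range(n)]
    for d in range(n):
        diag = [m[i][i + d] for i in range(n - d)]
        expected = sum(diag) / len(diag)
        for i in range(n - d):
            value = m[i][i + d] / expected if expected > 0 else 0.0
            oe[i][i + d] = oe[i + d][i] = value
    return oe

def pearson_matrix(rows):
    """Step B: Pearson correlation between every pair of rows."""
    centered = []
    for r in rows:
        mu = sum(r) / len(r)
        centered.append([x - mu for x in r])
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    n = len(rows)
    corr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                corr[i][j] = dot / (norms[i] * norms[j])
    return corr

def first_eigenvector(sym, n_iter=500):
    """Step C: leading eigenvector of a symmetric matrix by power iteration."""
    n = len(sym)
    v = [1.0 + 0.1 * i for i in range(n)]  # fixed, non-symmetric starting vector
    for _ in range(n_iter):
        w = [sum(sym[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def compartment_signs(m):
    ev = first_eigenvector(pearson_matrix(observed_over_expected(m)))
    return [1 if x > 0 else -1 for x in ev]

# Toy map: distance decay, with doubled signal inside each of two 4-bin blocks.
labels = [1, 1, 1, 1, -1, -1, -1, -1]
toy = [[100.0 / (abs(i - j) + 1) * (2.0 if labels[i] == labels[j] else 1.0)
        for j in range(8)] for i in range(8)]
print(compartment_signs(toy))
```

On the toy map the first four bins receive one sign and the last four the other, which is the two-compartment split built into the example.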
(2)TAD callers
- first identified by visual inspection of the interaction maps(triangle).
- find the enrichment of insulator proteins binding at TAD boundaries.
- The biggest challenge to a rigorous methods benchmarking is probably the lack of a set of true, experimentally validated TADs.( unambiguous definition)
- An important aspect to review TAD callers is how they deal with data resolution.
- a large number of TAD calling methods have been produced:
- The first methods developed to call TADs were based on one-dimensional scores.
- DI (directionality index) followed by an HMM to derive TADs, in DomainCaller and the 4D Nucleome Analysis Toolbox
- the insulation score quantifies the interactions passing across each genomic bin, and it allows defining boundaries by identifying local minima; also implemented in Cworld
- There are many approaches to TAD identification; see the paper for details.
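As an illustration of a one-dimensional score, here is a simplified insulation score with boundary calling by local minima. It is a sketch of the idea only, not Cworld's implementation: the score is the plain mean of the contacts between the `window` bins upstream and the `window` bins starting at each position, without the log-ratio normalization or smoothing a real caller would use. Function names and the window size are assumptions.

```python
def insulation_scores(m, window):
    """Mean contact signal crossing the left edge of each bin.

    For bin i, average m[a][b] over a in [i - window, i) and b in [i, i + window).
    Bins too close to the matrix ends get None.
    """
    n = len(m)
    scores = [None] * n
    for i in range(window, n - window + 1):
        total = 0.0
        for a in range(i - window, i):
            for b in range(i, i + window):
                total += m[a][b]
        scores[i] = total / (window * window)
    return scores

def local_minima(scores):
    """Indices whose score is strictly lower than both neighbours."""
    minima = []
    for i in range(1, len(scores) - 1):
        left, centre, right = scores[i - 1], scores[i], scores[i + 1]
        if None in (left, centre, right):
            continue
        if centre < left and centre < right:
            minima.append(i)
    return minima

# Toy map with two domains: bins 0-5 and bins 6-11.
domain = [0] * 6 + [1] * 6
toy = [[10.0 if domain[i] == domain[j] else 1.0 for j in range(12)] for i in range(12)]
scores = insulation_scores(toy, window=2)
print(scores)
print("boundaries at bins:", local_minima(scores))
```

The score dips where few contacts cross, so the single minimum falls on the first bin of the second domain.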
(3)Interaction callers
- specific points of contact between distant chromatin regions, such as those occurring between promoters and enhancers.
- The computational identification of interactions requires the definition of a background model in order to discern contacts with an interaction frequency higher than expected.
- The background can be estimated using local signal distribution or modeled using global (chromosome-wide or genome-wide) approaches.
The former: HOMER, HiCCUPS, diffHic
The latter: Fit-Hi-C (uses nonparametric splines), GOTHiC, HIPPIE, HOMER
- For other methods, see the paper.
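The role of the background model can be shown with a deliberately simple global version: the expected count at each genomic distance is the mean of that diagonal, and a cell is called when its count is improbably high under a Poisson model with that mean. This is not how any of the tools above work in detail (Fit-Hi-C, for example, fits splines), there is no multiple-testing correction, and the threshold, `min_distance` and function names are all assumptions.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by direct summation."""
    term = math.exp(-lam)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def call_interactions(m, p_threshold=0.001, min_distance=2):
    """Return (i, j, p) for cells enriched over the distance-matched mean."""
    n = len(m)
    calls = []
    for d in range(min_distance, n):
        diag = [m[i][i + d] for i in range(n - d)]
        expected = sum(diag) / len(diag)
        if expected <= 0:
            continue
        for i, observed in enumerate(diag):
            p = poisson_sf(observed, expected)
            if p < p_threshold:
                calls.append((i, i + d, p))
    return calls

# Toy map: counts depend only on distance, except one enriched cell (1, 3).
by_distance = {0: 50, 1: 20, 2: 5, 3: 3, 4: 2, 5: 1}
toy = [[by_distance[abs(i - j)] for j in range(6)] for i in range(6)]
toy[1][3] = toy[3][1] = 30
calls = call_interactions(toy)
print([(i, j) for i, j, _ in calls])
```

Note that the enriched cell also inflates its own diagonal's mean, one reason real callers estimate the background more robustly.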
3. Data formats and visualization
3.1 Data formats
- The lack of a common standard in data formats has already been reported as a critical issue in the field of Hi-C data analysis.
- Most tools store data in different formats, and only few provide utilities to convert from one format to another.
(1)rawdata:fastq
(2)aligned reads:bam
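(3)contact matrix: see below; this is also the most "chaotic" part
HOMER -- 'dense' format
HiC-Pro -- 'sparse' format
Because the data are so large, matrices are often compressed into binary files; the two commonly used formats are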
the '.cool' format is based on HDF5 and is used by the cooler pipeline;
the '.hic' format is used instead by the Juicer pipeline.
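The difference between the dense and sparse text layouts is easy to show. The sketch below converts a symmetric dense matrix into (bin_i, bin_j, count) triplets and back; it assumes the sparse layout keeps only the non-zero upper triangle with 1-based bin indices, and it does not read or write the actual files of any tool. Both function names are hypothetical.

```python
def dense_to_sparse(matrix):
    """Non-zero upper-triangle cells as (bin_i, bin_j, count), 1-based (ASSUMED layout)."""
    return [
        (i + 1, j + 1, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if j >= i and value != 0
    ]

def sparse_to_dense(triplets, n_bins):
    """Rebuild the full symmetric matrix from the triplets."""
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for i, j, value in triplets:
        matrix[i - 1][j - 1] = value
        matrix[j - 1][i - 1] = value
    return matrix

dense = [[5, 2, 0], [2, 7, 1], [0, 1, 3]]
sparse = dense_to_sparse(dense)
print(sparse)
print(sparse_to_dense(sparse, 3) == dense)
```

The sparse form stores one row per non-zero cell, which is why it scales much better than the dense form at high resolution, where most cells are empty.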
3.2 Visualization
- mainly display Hi-C contact maps as heatmaps
- allow to smoothly browse Hi-C heatmaps interactively, to zoom in and out with different resolutions, to visualize maps together with other genomics data such as ChIP-seq and to compare multiple maps in a synchronous way.
(1)Juicebox is available both as a desktop and a cloud-based web application named Juicebox.js. It loads matrices in '.hic' format and its strengths are its intuitive interface and easy use.
(2)gcMapExplorer is a Python software featuring a GUI that loads data in the '.gcmap' format; it also performs different types of normalizations on raw matrices.
(3)HiGlass is available as a docker container and loads matrices in '.cool' format. It allows sophisticated customization of the layout by juxtaposing panels with multiple maps at the desired zoom levels, along with other genomic data.
Juicebox and HiGlass allow sharing a session via a URL or a JSON representation, respectively, which can also be easily hosted at web sites.