Hi-C literature guide 3: Hi-C analysis

Author: 小贝学生信 | Published: 2020-10-06 20:24


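Paper: Hi-C analysis: from data generation to integration

The paper first reviews why 3D genomics was bound to emerge, the development of 3C, 4C, 5C and Hi-C, and the basic steps of Hi-C. These are not repeated here; see the earlier literature guides in this series.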

1. Current features of Hi-C

1.1 Multiple scales

(1) Large scale: A/B compartments

The genome is organized into two distinct compartments.

A compartment: active; open chromatin
B compartment: inactive; closed chromatin (compact heterochromatin)

(2) Fine scale: TADs

Regions characterized by high intradomain contact frequency and reduced interdomain contacts.

(3) Finer scale: loops

Identify specific points of contact between distant chromatin regions.
Especially challenging given the resolution limit of Hi-C data.

[Figures: multiple scales 1-3]

1.2

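As shown above, going from compartments to TADs and then to loops requires progressively higher resolution (higher resolution means a smaller bin size).
How to improve resolution is therefore an important question; the paper discusses it mainly from the following two angles.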

(1) The restriction enzymes

The restriction enzyme sets the ultimate resolution limit of Hi-C data.
The original Hi-C protocol was based on the HindIII and NcoI restriction enzymes, both recognizing and cutting a 6 bp sequence: AAGCTT and CCATGG, respectively.
[Figure: HindIII]
The fewer bases in the recognition site, the more target restriction sites are found in the genome, which results in smaller fragments and therefore higher resolution.
Later work used DpnII, whose recognition site is the 4 bp sequence GATC; one study used a restriction enzyme recognizing an RGCY motif (with R equal to A or G and Y equal to C or T) to achieve an even smaller average fragment size.
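
As a rough illustration of why shorter recognition sites give smaller fragments: if we assume a random genome in which the four bases are independent and equally frequent (a simplification that real genomes do not satisfy), a fully specified site of length k is expected about once every 4^k bp. The sketch below computes this expected spacing for the sites mentioned above. The function name and the lookup table are illustrative and not taken from the paper; R and Y follow the standard IUPAC nucleotide codes.

```python
# Number of bases each IUPAC code can stand for (standard nucleotide codes).
IUPAC_DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "N": 4}

def expected_site_spacing(motif):
    """Expected distance in bp between occurrences of `motif`,
    assuming independent, equally frequent bases (an idealization)."""
    probability = 1.0
    for code in motif.upper():
        probability *= IUPAC_DEGENERACY[code] / 4
    return 1 / probability

for site in ("AAGCTT", "CCATGG", "GATC", "RGCY"):
    print(site, expected_site_spacing(site))
```

Under this idealization a 6 bp cutter gives fragments of about 4 kb, a 4 bp cutter about 256 bp, and the degenerate four-base motif about 64 bp.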

(2) Sequencing depth

Increasing sequencing depth has been the most striking effort in improving Hi-C data resolution.
The first Hi-C dataset had 95 million reads in total (up to 30 million per sample) and was analyzed at 1 Mb resolution.
Recent articles have reached up to 40 billion reads per dataset (up to 7.3 billion per sample).
More recently, the local chromatin topology of the Drosophila genome has been investigated at a resolution of 500 bp (average fragment size) to characterize domains at sub-kb resolution.

Contact maps at 40 kb resolution can clearly highlight topological domains.
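
A back-of-the-envelope calculation shows why depth has to grow so quickly as bins shrink: the number of bins grows linearly with 1/bin size, but the number of bin pairs in the contact matrix grows quadratically. The sketch below is only an illustration; it assumes a round genome size of 3 Gb, treats every read pair as one contact, and spreads contacts uniformly over all bin pairs, none of which holds for real data. The function name is hypothetical.

```python
def depth_summary(n_contacts, genome_size, bin_size):
    """Count bins and bin pairs, and the average contacts per bin pair
    if contacts were spread uniformly (a deliberately crude assumption)."""
    n_bins = -(-genome_size // bin_size)  # ceiling division
    n_bin_pairs = n_bins * (n_bins + 1) // 2  # upper triangle incl. diagonal
    return {
        "n_bins": n_bins,
        "n_bin_pairs": n_bin_pairs,
        "contacts_per_bin_pair": n_contacts / n_bin_pairs,
    }

# 95 million contacts on an assumed 3 Gb genome at three bin sizes.
for bin_size in (1_000_000, 40_000, 1_000):
    print(bin_size, depth_summary(95_000_000, 3_000_000_000, bin_size))
```

With the same number of contacts, the average count per bin pair drops by a factor of about a million when going from 1 Mb bins to 1 kb bins.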

1.3 future

As more and more datasets become available, it will become increasingly important to establish common and standardized procedures to assess data quality and reproducibility of replicates.

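2. Hi-C analysis workflow

Put simply, Hi-C analysis consists of two major steps: raw data → Hi-C contact matrix → downstream analysis
(Seen this way it is much like RNA-seq: first obtain an expression matrix from the raw sequencing data, then run the downstream analysis.)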

[Figure: 2-1]

2.1 Raw data (fastq) → Hi-C contact matrix

(1) Alignment: reads are aligned to the reference genome

Hi-C paired-end reads are aligned separately, as they are expected to map to different, unrelated regions of the genome.
Tools: bowtie, bwa
Challenge: chimeric reads (the read spans the ligation junction, so two portions of the same read match distinct genomic positions).
These require specific strategies that attempt to map the different portions of the read.
Approaches: ICE, TADbit, HiCUP, HIPPIE, Juicer, HiC-Pro
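
One common idea for chimeric reads is to cut the read at the ligation junction so that each portion can be mapped on its own. The sketch below assumes a HindIII digest followed by end fill-in and blunt-end ligation, which turns two AAGCTT cut ends into the junction AAGCTAGCTT; the function name is hypothetical, and the pipelines listed above each implement their own variant of this strategy (for example truncation or iterative mapping).

```python
def split_at_junction(read, junction="AAGCTAGCTT"):
    """Split a read at the first ligation junction.

    Returns (left, right). If no junction is present the whole read is
    returned as `left` and `right` is empty. The default junction assumes
    HindIII digestion with fill-in; other enzymes need other sequences.
    """
    pos = read.find(junction)
    if pos == -1:
        return read, ""
    cut = pos + len(junction) // 2  # middle of the junction
    return read[:cut], read[cut:]

left, right = split_at_junction("GGGTTTAAGCTAGCTTCCCAAA")
print(left, right)  # the two portions would then be aligned separately
```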

(2) Quality control: reads are filtered to remove spurious signal

Read-level filters include the removal of reads with low alignment quality and of PCR artifacts, i.e. multiple read pairs mapped to the same positions.
More recently, the MAD-max (maximum allowed median absolute deviation) filter on genomic coverage has been proposed to remove low-coverage bins.
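
The two kinds of filter can be sketched as follows. This is a simplified illustration, not the implementation of any particular tool: the tuple layout, the MAPQ cutoff of 30 and the cutoff of 5 MADs are assumptions, mates are assumed to be stored in a consistent order, and the bin filter is only a loose take on the MAD-based idea (bins whose log coverage lies far below the median are flagged).

```python
import numpy as np

def filter_pairs(pairs, min_mapq=30):
    """Drop low-MAPQ pairs and duplicates (same positions and strands).

    Each pair is (chrom1, pos1, strand1, mapq1, chrom2, pos2, strand2, mapq2).
    """
    seen = set()
    kept = []
    for pair in pairs:
        c1, p1, s1, q1, c2, p2, s2, q2 = pair
        if min(q1, q2) < min_mapq:
            continue  # low alignment quality
        key = (c1, p1, s1, c2, p2, s2)
        if key in seen:
            continue  # likely PCR duplicate
        seen.add(key)
        kept.append(pair)
    return kept

def low_coverage_bins(coverage, max_mad=5.0):
    """Flag bins with zero coverage or with log coverage more than
    `max_mad` median absolute deviations below the median."""
    cov = np.asarray(coverage, dtype=float)
    covered = cov > 0
    log_cov = np.log(cov[covered])
    median = np.median(log_cov)
    mad = np.median(np.abs(log_cov - median))
    flagged = np.ones(cov.shape, dtype=bool)  # empty bins stay flagged
    flagged[covered] = log_cov < median - max_mad * mad
    return flagged
```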

(3) Binning

The choice of the bin size used to summarize results de facto defines the final resolution of the analysis.
Genomic bins allow achieving a more robust and less noisy signal in the estimation of contact frequencies.
Recently, two approaches to determine the optimal bin size have been proposed (deDoc, QuASAR).
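
To make the role of the bin size concrete, here is a minimal sketch of binning, assuming a single chromosome and fixed-size bins (some tools bin by restriction fragment instead). The function name and the input layout, a list of (pos1, pos2) coordinates of valid pairs, are illustrative assumptions.

```python
import numpy as np

def bin_contacts(pairs, chrom_length, bin_size):
    """Count read pairs per pair of fixed-size bins (symmetric matrix)."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = np.zeros((n_bins, n_bins), dtype=np.int64)
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i, j] += 1
        if i != j:
            matrix[j, i] += 1  # keep the matrix symmetric
    return matrix

pairs = [(10, 250), (120, 130), (260, 20), (299, 299)]
print(bin_contacts(pairs, chrom_length=300, bin_size=100))
```

Halving `bin_size` doubles the number of bins and spreads the same reads over four times as many cells, which is the trade-off between resolution and noise described above.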

(4) Normalization

Read-count binning and normalization are usually coupled and performed simultaneously by the same tools.
Two major strategies: explicit and implicit.
Explicit methods: compute a correction factor for each bin (computed for each of the considered biases and their combination).
Implicit methods (matrix balancing): assume that each genomic locus should have "equal visibility", i.e., the interaction signal measured by Hi-C for each genomic locus should add up to the same total amount (tools: SCN, ICE, ...).
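
The "equal visibility" idea can be sketched as an iterative correction: repeatedly divide each entry by the product of the relative row sums of its two bins until all row sums agree. This is a minimal sketch in the spirit of ICE-style balancing, not the code of any specific tool; the function name, iteration limit and tolerance are assumptions, and real implementations also mask low-coverage bins and the diagonal.

```python
import numpy as np

def balance(matrix, n_iter=200, tol=1e-8):
    """Balance a symmetric contact matrix so that all non-empty rows
    sum to the same value. Returns (balanced_matrix, bias_vector)."""
    m = np.asarray(matrix, dtype=float).copy()
    bias = np.ones(m.shape[0])
    for _ in range(n_iter):
        row_sums = m.sum(axis=1)
        nonzero = row_sums > 0
        if not nonzero.any():
            break
        scale = row_sums / row_sums[nonzero].mean()
        scale[~nonzero] = 1.0  # leave empty bins untouched
        m /= np.outer(scale, scale)
        bias *= scale
        if np.abs(scale - 1).max() < tol:
            break
    return m, bias

# A toy matrix distorted by per-bin biases of 1, 2 and 0.5.
raw = np.array([[10.0, 4, 2], [4, 8, 3], [2, 3, 6]]) * np.outer([1, 2, 0.5], [1, 2, 0.5])
balanced, bias = balance(raw)
print(balanced.sum(axis=1))
```

The accumulated `bias` vector plays the role of the implicit per-bin correction factor: the raw matrix equals the balanced one multiplied by the outer product of the biases.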

Still an open problem: the normalization of Hi-C data originating from genomes with copy number alterations.
An earlier work proposed a solution with an additional scaling factor applied on top of ICE normalization to correct for aneuploidies with whole-chromosome duplications or deletions.
Recent publications instead proposed more generalizable solutions that add a correction factor to matrix-balancing normalization to model and adjust for the effect of local copy number variations.

2.2 Hi-C contact matrix → downstream analysis

That is, the three levels described in 1.1: compartments, TADs and loops.

(1) Tools to call compartments

The first level of chromatin organization.
Steps:
A: compute the matrix of observed over expected Hi-C signal ratio, where the expected signal is obtained from a distance-normalized contact matrix;
B: calculate the Pearson correlation of the distance-normalized map;
C: run PCA and use the sign of the first eigenvector (first principal component) to assign compartments.
Tools: a similar approach is available in HOMER, whereas loess calculation of the distance dependency is implemented in Cworld (https://github.com/dekkerlab/cworld-dekker) and in the HiTC R package.
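
Steps A to C can be sketched with NumPy as below. This is a minimal sketch under simplifying assumptions: the input is a symmetric, already normalized matrix for one chromosome, the expected signal at each distance is simply the mean of the corresponding diagonal (rather than a loess fit), and the function names are hypothetical. The sign of an eigenvector is arbitrary, so deciding which sign corresponds to A and which to B needs extra information that this sketch does not cover.

```python
import numpy as np

def observed_over_expected(matrix):
    """Step A: divide each diagonal by its mean (distance normalization)."""
    m = np.asarray(matrix, dtype=float)
    n = m.shape[0]
    oe = np.zeros_like(m)
    for d in range(n):
        idx = np.arange(n - d)
        expected = m[idx, idx + d].mean()
        if expected > 0:
            oe[idx, idx + d] = m[idx, idx + d] / expected
            oe[idx + d, idx] = oe[idx, idx + d]  # assumes a symmetric input
    return oe

def first_principal_component(corr):
    """Step C: first principal axis of the (column-centered) matrix."""
    centered = corr - corr.mean(axis=0, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(centered.T @ centered)
    return eigvecs[:, -1]  # eigh returns eigenvalues in ascending order

def compartment_signal(matrix):
    oe = observed_over_expected(matrix)
    corr = np.nan_to_num(np.corrcoef(oe))  # Step B: Pearson correlation
    return first_principal_component(corr)
```

Bins whose entries in the returned vector share a sign would be assigned to the same compartment.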

[Figure: 2.2-1]

(2) TAD callers

TADs were first identified by visual inspection of the interaction maps (where they appear as triangles).
Insulator protein binding was found to be enriched at TAD boundaries.
The biggest challenge to a rigorous benchmarking of methods is probably the lack of a set of true, experimentally validated TADs (i.e. of an unambiguous definition).
An important aspect when reviewing TAD callers is how they deal with data resolution.
A large number of TAD calling methods have been produced:
The first methods developed to call TADs were based on one-dimensional scores.
The directionality index (DI) is computed and an HMM is then used to derive TADs in DomainCaller and in the 4D Nucleome Analysis Toolbox.
The insulation score quantifies the interactions passing across each genomic bin and allows defining boundaries by identifying local minima; it is also implemented in Cworld.
There are many approaches to TAD calling; see the paper for details.
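
As an example of a one-dimensional score, the insulation idea can be sketched as follows: for each position, average the contacts between the bins just upstream and the bins just downstream, then look for local minima. This is a simplified variant for illustration only, with hypothetical function names and window size; it does not reproduce the exact definition or the boundary-calling rules used in Cworld.

```python
import numpy as np

def insulation_score(matrix, window=2):
    """Mean contact between the `window` bins upstream of the start of
    bin i and the `window` bins starting at bin i. Edge bins are NaN."""
    m = np.asarray(matrix, dtype=float)
    n = m.shape[0]
    score = np.full(n, np.nan)
    for i in range(window, n - window + 1):
        score[i] = m[i - window:i, i:i + window].mean()
    return score

def find_boundaries(score):
    """Indices that are strict local minima of the score."""
    return [i for i in range(1, len(score) - 1)
            if score[i] < score[i - 1] and score[i] < score[i + 1]]

# Toy map with two domains: bins 0-3 and bins 4-7.
toy = np.ones((8, 8))
toy[:4, :4] = 10
toy[4:, 4:] = 10
print(find_boundaries(insulation_score(toy)))
```

In the toy map the only local minimum falls at the start of bin 4, where the second domain begins.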

(3) Interaction callers

Specific points of contact between distant chromatin regions, such as those occurring between promoters and enhancers.
The computational identification of interactions requires the definition of a background model in order to discern contacts with an interaction frequency higher than expected.
The background can be estimated using the local signal distribution or modeled using global (chromosome-wide or genome-wide) approaches.
The former: HOMER, HiCCUPS, diffHic
The latter: Fit-Hi-C (uses nonparametric splines), GOTHiC, HIPPIE, HOMER
For other methods, see the paper.
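
To show what "higher than expected" means in practice, the sketch below uses a very simple global background: the expected count at each genomic distance is the mean of the corresponding diagonal, and each observed count is tested against a Poisson distribution with that mean. This is a stand-in chosen for brevity, not the model of Fit-Hi-C or any other tool named above (no splines, no local background, no multiple-testing correction); the function names and the `alpha` and `min_dist` parameters are assumptions.

```python
import math
import numpy as np

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by direct summation."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def call_interactions(matrix, alpha=0.01, min_dist=2):
    """Return (i, j, p_value) for bin pairs enriched over the
    distance-matched mean count. Uses the upper triangle only."""
    m = np.asarray(matrix)
    hits = []
    for d in range(min_dist, m.shape[0]):
        diagonal = np.diagonal(m, d)
        expected = diagonal.mean()
        if expected <= 0:
            continue
        for i, observed in enumerate(diagonal):
            p_value = poisson_sf(int(observed), expected)
            if p_value < alpha:
                hits.append((i, i + d, p_value))
    return hits

# Flat background of 5 counts with one strong contact between bins 2 and 7.
toy = np.full((10, 10), 5)
toy[2, 7] = toy[7, 2] = 40
print([(i, j) for i, j, _ in call_interactions(toy)])
```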

3. Data formats and visualization

3.1 Data formats

The lack of a common standard in data formats has already been reported as a critical issue in the field of Hi-C data analysis.
Most tools store data in different formats, and only a few provide utilities to convert from one format to another.
(1) Raw data: fastq
(2) Aligned reads: bam
(3) Contact matrix: see below; this is also where things are most "messy"
HOMER -- 'dense' format
HiC-Pro -- 'sparse' format
Because the data are so large, they are often compressed into binary files; the two common formats are:
the '.cool' format is based on HDF5 and is used by the cooler pipeline;
the '.hic' format is used instead by the Juicer pipeline.
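
The difference between the two text layouts is easy to see in code: a dense matrix stores every cell, while a sparse representation keeps only the non-zero cells of the upper triangle as (bin_i, bin_j, count) triplets. The sketch below is schematic: it uses 0-based bin indices and plain Python tuples, whereas the actual HOMER and HiC-Pro files have their own headers, index conventions and companion files, which are not reproduced here. The function names are hypothetical.

```python
import numpy as np

def dense_to_sparse(dense):
    """Keep non-zero upper-triangle cells as (i, j, count) triplets."""
    d = np.asarray(dense)
    rows, cols = np.nonzero(np.triu(d))
    return [(int(i), int(j), d[i, j].item()) for i, j in zip(rows, cols)]

def sparse_to_dense(triplets, n_bins):
    """Rebuild the full symmetric matrix from the triplets."""
    dense = np.zeros((n_bins, n_bins))
    for i, j, count in triplets:
        dense[i, j] = count
        dense[j, i] = count
    return dense

dense = np.array([[2, 0, 1], [0, 0, 3], [1, 3, 0]])
triplets = dense_to_sparse(dense)
print(triplets)
```

For high-resolution maps most cells are empty, which is why sparse and compressed binary layouts are preferred for storage.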

3.2 Visualization

Visualization mainly means displaying Hi-C contact maps as heatmaps.
The tools allow users to smoothly browse Hi-C heatmaps interactively, to zoom in and out at different resolutions, to visualize maps together with other genomics data such as ChIP-seq, and to compare multiple maps in a synchronous way.

(1) Juicebox is available both as a desktop application and as a cloud-based web application named Juicebox.js. It loads matrices in '.hic' format, and its strengths are its intuitive interface and ease of use.
(2) gcMapExplorer is a Python software featuring a GUI that loads data in the '.gcmap' format; it also performs different types of normalization on raw matrices.
(3) HiGlass is available as a Docker container and loads matrices in '.cool' format. It allows sophisticated customization of the layout by juxtaposing panels with multiple maps at the desired zoom levels, along with other genomic data.