Diffbind

作者: 余绕 | 来源:发表于2021-10-25 16:25 被阅读0次

2022-11-17 DiffBind使用
Diffbind
DiffBind安装（二）
DiffBind安装（一）
DiffBind简要分析流程
ChIP-Seq: DiffBind无control，无重复样本
awk提取差异分析结果时候的一个小错误
DiffBind分析ATAC-seq（差异peaks）
ATAC-Seq 差异分析 DiffBind
第9篇：差异peaks分析——DiffBind

DiffBind, which provides functions for processing DNA data enriched for genomic loci, including ChIP-seq data enriched for sites where specific protein/DNA binding occurs, or histone marks are enriched, as well as open-chromatin assays such as ATAC-seq.

The primary emphasis of the package is on identifying sites that are differentially bound between sample groups. It includes functions to support the processing of peak sets, including
(1）overlapping and merging peak sets, （2）counting sequencing reads overlapping intervals
in peak sets, and （3） identifying statistically significantly differentially bound sites based on evidence of binding affinity (measured by differences in read densities). To this end it uses
statistical routines developed in an RNA-Seq context (primarily the Bioconductor packages edgeR and DESeq2 ). Additionally, the package builds on Rgraphics routines to provide a set of standardized plots to aid in binding analysis
1) Install the package.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("DiffBind")

2) Prepare the metadata.
The input file is a csv file which ***may include *** the following columns:
[1] SampleID: Identifier string for sample. Must be unique for each sample.
[2] Tissue: Identifier string for tissue type
[3] Factor: Identifier string for factor
[4] Condition: Identifier string for condition
[5]Treatment：Identifier string for treatment
[6] Replicate：Replicate number of sample
[7] bamReads： file path for bam file containing aligned reads for ChIP sample
[8] Control ID： : file path for bam file containing aligned reads for control sample
[9] bamControl： file path for bam file containing aligned reads for control sample
[10] Peaks: path for file containing peaks for sample. Format determined by
PeakCaller field or caller parameter
[11] PeakCaller: Identifier string for peak caller used. If Peaks is not a bed
file, this will determine how the Peaks file is parsed. If missing, will use
default peak caller specified in caller parameter. Possible values:
– “raw”: text file file; peak score is in fourth column
– “bed”: .bed file; peak score is in fifth column
– “narrow”: default peak.format: narrowPeaks file
– “macs”: MACS .xls file
– “swembl”: SWEMBL .peaks file
– “bayes”: bayesPeak file
– “peakset”: peakset written out using pv.writepeakset
– “fp4”: FindPeaks v4

Sample file from diffbind

Sample file from Intro to ChIPseq using HPC

image.png
Judging from the above two sheets, it is obvious that not all fields are necessary for following data analysis.

Due to the large files of bam, it is very time consuming to download the data from servers. Thus, it is better to install Diffbind in the server.
Following is the instruction for Diffbind installation in R on Linux servers.

R.3.5 or 4.1 in conda seems does not work, needs lots of dependent packages.
I choose the R in server which is also R.3.5
a) Tried the installation using following command:

 if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DiffBind")

But it reported error like this:
ERROR: dependency ?.map?.is not available for package ?.iffBind?
THis error indicates that it need amap package.
b)Download the map package in the website :
https://cran.r-project.org/web/packages/amap/index.html
Note: Choose the old version cuasue the latest version is for >=R.4.0.
Here, we downloaded ''amap_0.8-14.tar.gz''.
c） Install amap

R CMD  INSTALL amap_0.8-14.tar.gz

d) Install Diffbind

 if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DiffBind")

3) Data analysis
1. Prepare the metadata sheet.

image.png
2. Reading in the peaksets

#1 Reading in the peaksets
dbObj <- dba(sampleSheet="NF.csv")
#This shows how many peaks are in each peakset, as well as (in the first line) the total
#number of unique peaks after merging overlapping ones (3795), and the dimensions of the 
#default binding matrix of 11 samples by the 2845 sites that overlap in at least two of the samples.

> dbObj 
Samples, 33147 sites in matrix (36024 total):
      ID Factor Replicate Caller Intervals
1 N.Rep1      N         1    bed     32303
2 N.Rep2      N         2    bed     32029
3 F.Rep1      F         1    bed     32388
4 F.Rep2      F         2    bed     32797

3.clustering of the samples

#clustering of the samples 
plot(dbObj)

image.png
4.Counting reads
The next step is to calculate a binding matrix with scores based on read counts for every
sample (affinity scores), rather than confidence scores for only those peaks called in a specific
sample (occupancy scores). These reads are obtained using the dba.count function:

tamoxifen <- dba.count(dbObj)

> tamoxifen
 Samples, 33147 sites in matrix:
      ID Factor Replicate Caller Intervals FRiP
1 N.Rep1      N         1 counts     33147 0.54
2 N.Rep2      N         2 counts     33147 0.59
3 F.Rep1      F         1 counts     33147 0.52
4 F.Rep2      F         2 counts     33147 0.54
# Or
tamoxifen <- dba.count(dbObj, bUseSummarizeOverlaps=TRUE) #This option is better.
#bUseSummarizeOverlaps: logical indicating that summarizeOverlaps should be used for counting instead 
# of the built-in counting code. This option is slower but uses the more standard
# counting function. If TRUE, all read files must be BAM (.bam extension), with
# associated index files (.bam.bai extension). The fragmentSize parameter must
# absent


#plot a new correlation heatmap based on the count scores.
plot(tamoxifen)

image.png

5.Establishing a contrast
Before running the differential analysis, we need to tell DiffBind which cell lines fall in which
groups. This is done using the dba.contrast function, as follows:

#Before running the differential analysis, we need to tell DiffBind which cell lines fall in which
  #groups. This is done using the dba.contrast function, as follows:

tamoxifen <- dba.contrast(tamoxifen, categories=DBA_FACTOR,minMembers = 2)
  #This uses the Condition metadata (Responsive vs. Resistant) to set up a contrast with 4
  #samples in the Resistant group and 7 samples in the Responsive group.2 (This step is optional)

image.png
6.Performing the differential analysis

tamoxifen <- dba.analyze(tamoxifen, method=DBA_ALL_METHODS)
  # if this value is set to DBA_ALL_METHODS, this is equivalent to specifying c(DBA_EDGER,DBA_DESEQ2).
#  summary of results
dba.show(tamoxifen, bContrasts=T)

> dba.show(tamoxifen, bContrasts=T)
  Group1 Members1 Group2 Members2 DB.edgeR DB.DESeq2
1      N        2      F        2     7711      5398

image.png

#  overlapping peaks identified by the two different tools (DESeq2 and edgeR)
dba.plotVenn(tamoxifen,contrast=1,method=DBA_ALL_METHODS)

image.png

7 Retrieving the differentially bound sites
The value columns show the mean read concentration over all the samples (the default calculation uses log2 normalized ChIP read counts with control read counts subtracted) and the mean concentration over the first (Resistant) group and second (Responsive) group(前面/后面).

comp1.edgeR <- dba.report(tamoxifen, method=DBA_EDGER, contrast = 1, th=1)
comp1.deseq <- dba.report(tamoxifen, method=DBA_DESEQ2, contrast = 1, th=1)

image.png
8Save all results

  # EdgeR
out <- as.data.frame(comp1.edgeR)
write.table(out, file="edgeR.txt", sep="\t", quote=F, col.names = NA)
  # DESeq2
out <- as.data.frame(comp1.deseq)
write.table(out, file="deseq2.txt", sep="\t", quote=F, col.names = NA)

9 Save results in BED format

# Create bed files for each keeping only significant peaks (p < 0.05)
# EdgeR
out <- as.data.frame(comp1.edgeR)
edge.bed <- out[ which(out$FDR < 0.05), 
                 c("seqnames", "start", "end", "strand", "Fold")]
write.table(edge.bed, file="results/Nanog_vs_Pou5f1_edgeR_sig.bed", sep="\t", quote=F, row.names=F, col.names=F)

# DESeq2
out <- as.data.frame(comp1.deseq)
deseq.bed <- out[ which(out$FDR < 0.05), 
                  c("seqnames", "start", "end", "strand", "Fold")]
write.table(deseq.bed, file="results/Nanog_vs_Pou5f1_deseq2_sig.bed", sep="\t", quote=F, row.names=F, col.names=F)

#MAPLOT
dba.plotMA(tamoxifen)
#Volcano plot
dba.plotVolcano(tamoxifen)
#Heatmap
corvals <- dba.plotHeatmap(tamoxifen, contrast=1, correlations=FALSE,colScheme="Spectral")

#The sequential palettes names are 
# Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd

#All the sequential palettes are available in variations from 3 different values up to 9 different values.
#The diverging palettes are 
#  BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral

NOTE: THE DIFFBIND is not the latest version. The manual is Edited by the author on: September 18, 2019 and Compiled on: September 15, 2020
FOR the latest version of DIFFBIND , NEXT TIME!
Refs:
https://www.jianshu.com/p/f849bd55ac27
https://hbctraining.github.io/Intro-to-ChIPseq/

2022-11-17 DiffBind使用
##安装DiffBind包 BiocManager::install("DiffBind") library(Di...
Diffbind
DiffBind, which provides functions for processing DNA dat...
DiffBind安装（二）
DiffBind包现在更新了，但是我还是习惯了用旧的版本，由于我的R语言更新到了4.0，默认安装DiffBind的...
DiffBind安装（一）
我的R语言版本是目前最新的4.0，用bioconductor安装DiffBind。出现报错：按照报错要求安装l...
DiffBind简要分析流程
1. 读入文件 2. 计算亲和结合矩阵 3.计算差异分析 4. 提取结果 5-1. 保存文件--DEseq2 结果...
ChIP-Seq: DiffBind无control，无重复样本
DiffBind的使用有前辈已经写的很详细了，可以参考下： https://mp.weixin.qq.com/s?...
awk提取差异分析结果时候的一个小错误
我有时候会对DIffBind的输出结果进行下修改，然后放到deeptools-computeMatrix里面去。跟...
DiffBind分析ATAC-seq（差异peaks）
最近遇到一个问题，用DiffBind找差异peaks，但是网上没有找到合适（直接用）的参考，网上的例子[https...
ATAC-Seq 差异分析 DiffBind
本文是《ATAC-Seq 分析流程》[https://www.jianshu.com/p/e043bc8c1bae...
第9篇：差异peaks分析——DiffBind
学习目标学习用DiffBind流程评估两个样本间的差异结合区域用PCA评估样本间的关系评估不同工具得到的差异...