DiffBind, which provides functions for processing DNA data enriched for genomic loci, including ChIP-seq data enriched for sites where specific protein/DNA binding occurs, or histone marks are enriched, as well as open-chromatin assays such as ATAC-seq.
The primary emphasis of the package is on identifying sites that are differentially bound between sample groups. It includes functions to support the processing of peak sets, including
(1)overlapping and merging peak sets, (2)counting sequencing reads overlapping intervals
in peak sets, and (3) identifying statistically significantly differentially bound sites based on evidence of binding affinity (measured by differences in read densities). To this end it uses
statistical routines developed in an RNA-Seq context (primarily the Bioconductor packages edgeR and DESeq2 ). Additionally, the package builds on Rgraphics routines to provide a set of standardized plots to aid in binding analysis
1) Install the package.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DiffBind")
2) Prepare the metadata.
The input file is a csv file which ***may include *** the following columns:
[1] SampleID: Identifier string for sample. Must be unique for each sample.
[2] Tissue: Identifier string for tissue type
[3] Factor: Identifier string for factor
[4] Condition: Identifier string for condition
[5]Treatment:Identifier string for treatment
[6] Replicate:Replicate number of sample
[7] bamReads: file path for bam file containing aligned reads for ChIP sample
[8] Control ID: : file path for bam file containing aligned reads for control sample
[9] bamControl: file path for bam file containing aligned reads for control sample
[10] Peaks: path for file containing peaks for sample. Format determined by
PeakCaller field or caller parameter
[11] PeakCaller: Identifier string for peak caller used. If Peaks is not a bed
file, this will determine how the Peaks file is parsed. If missing, will use
default peak caller specified in caller parameter. Possible values:
– “raw”: text file file; peak score is in fourth column
– “bed”: .bed file; peak score is in fifth column
– “narrow”: default peak.format: narrowPeaks file
– “macs”: MACS .xls file
– “swembl”: SWEMBL .peaks file
– “bayes”: bayesPeak file
– “peakset”: peakset written out using pv.writepeakset
– “fp4”: FindPeaks v4
Sample file from diffbind
Sample file from Intro to ChIPseq using HPC
image.png
Judging from the above two sheets, it is obvious that not all fields are necessary for following data analysis.
Due to the large files of bam, it is very time consuming to download the data from servers. Thus, it is better to install Diffbind in the server.
Following is the instruction for Diffbind installation in R on Linux servers.
- R.3.5 or 4.1 in conda seems does not work, needs lots of dependent packages.
- I choose the R in server which is also R.3.5
a) Tried the installation using following command:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DiffBind")
But it reported error like this:
ERROR: dependency ?.map?.is not available for package ?.iffBind?
THis error indicates that it need amap package.
b)Download the map package in the website :
https://cran.r-project.org/web/packages/amap/index.html
Note: Choose the old version cuasue the latest version is for >=R.4.0.
Here, we downloaded ''amap_0.8-14.tar.gz''.
c) Install amap
R CMD INSTALL amap_0.8-14.tar.gz
d) Install Diffbind
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DiffBind")
3) Data analysis
1. Prepare the metadata sheet.
2. Reading in the peaksets
#1 Reading in the peaksets
dbObj <- dba(sampleSheet="NF.csv")
#This shows how many peaks are in each peakset, as well as (in the first line) the total
#number of unique peaks after merging overlapping ones (3795), and the dimensions of the
#default binding matrix of 11 samples by the 2845 sites that overlap in at least two of the samples.
> dbObj
Samples, 33147 sites in matrix (36024 total):
ID Factor Replicate Caller Intervals
1 N.Rep1 N 1 bed 32303
2 N.Rep2 N 2 bed 32029
3 F.Rep1 F 1 bed 32388
4 F.Rep2 F 2 bed 32797
3.clustering of the samples
#clustering of the samples
plot(dbObj)
image.png
4.Counting reads
The next step is to calculate a binding matrix with scores based on read counts for every
sample (affinity scores), rather than confidence scores for only those peaks called in a specific
sample (occupancy scores). These reads are obtained using the dba.count function:
tamoxifen <- dba.count(dbObj)
> tamoxifen
Samples, 33147 sites in matrix:
ID Factor Replicate Caller Intervals FRiP
1 N.Rep1 N 1 counts 33147 0.54
2 N.Rep2 N 2 counts 33147 0.59
3 F.Rep1 F 1 counts 33147 0.52
4 F.Rep2 F 2 counts 33147 0.54
# Or
tamoxifen <- dba.count(dbObj, bUseSummarizeOverlaps=TRUE) #This option is better.
#bUseSummarizeOverlaps: logical indicating that summarizeOverlaps should be used for counting instead
# of the built-in counting code. This option is slower but uses the more standard
# counting function. If TRUE, all read files must be BAM (.bam extension), with
# associated index files (.bam.bai extension). The fragmentSize parameter must
# absent
#plot a new correlation heatmap based on the count scores.
plot(tamoxifen)
image.png
5.Establishing a contrast
Before running the differential analysis, we need to tell DiffBind which cell lines fall in which
groups. This is done using the dba.contrast function, as follows:
#Before running the differential analysis, we need to tell DiffBind which cell lines fall in which
#groups. This is done using the dba.contrast function, as follows:
tamoxifen <- dba.contrast(tamoxifen, categories=DBA_FACTOR,minMembers = 2)
#This uses the Condition metadata (Responsive vs. Resistant) to set up a contrast with 4
#samples in the Resistant group and 7 samples in the Responsive group.2 (This step is optional)
image.png
6.Performing the differential analysis
tamoxifen <- dba.analyze(tamoxifen, method=DBA_ALL_METHODS)
# if this value is set to DBA_ALL_METHODS, this is equivalent to specifying c(DBA_EDGER,DBA_DESEQ2).
# summary of results
dba.show(tamoxifen, bContrasts=T)
> dba.show(tamoxifen, bContrasts=T)
Group1 Members1 Group2 Members2 DB.edgeR DB.DESeq2
1 N 2 F 2 7711 5398
image.png
# overlapping peaks identified by the two different tools (DESeq2 and edgeR)
dba.plotVenn(tamoxifen,contrast=1,method=DBA_ALL_METHODS)
image.png
7 Retrieving the differentially bound sites
The value columns show the mean read concentration over all the samples (the default calculation uses log2 normalized ChIP read counts with control read counts subtracted) and the mean concentration over the first (Resistant) group and second (Responsive) group(前面/后面).
comp1.edgeR <- dba.report(tamoxifen, method=DBA_EDGER, contrast = 1, th=1)
comp1.deseq <- dba.report(tamoxifen, method=DBA_DESEQ2, contrast = 1, th=1)
image.png
8Save all results
# EdgeR
out <- as.data.frame(comp1.edgeR)
write.table(out, file="edgeR.txt", sep="\t", quote=F, col.names = NA)
# DESeq2
out <- as.data.frame(comp1.deseq)
write.table(out, file="deseq2.txt", sep="\t", quote=F, col.names = NA)
9 Save results in BED format
# Create bed files for each keeping only significant peaks (p < 0.05)
# EdgeR
out <- as.data.frame(comp1.edgeR)
edge.bed <- out[ which(out$FDR < 0.05),
c("seqnames", "start", "end", "strand", "Fold")]
write.table(edge.bed, file="results/Nanog_vs_Pou5f1_edgeR_sig.bed", sep="\t", quote=F, row.names=F, col.names=F)
# DESeq2
out <- as.data.frame(comp1.deseq)
deseq.bed <- out[ which(out$FDR < 0.05),
c("seqnames", "start", "end", "strand", "Fold")]
write.table(deseq.bed, file="results/Nanog_vs_Pou5f1_deseq2_sig.bed", sep="\t", quote=F, row.names=F, col.names=F)
#MAPLOT
dba.plotMA(tamoxifen)
#Volcano plot
dba.plotVolcano(tamoxifen)
#Heatmap
corvals <- dba.plotHeatmap(tamoxifen, contrast=1, correlations=FALSE,colScheme="Spectral")
#The sequential palettes names are
# Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd
#All the sequential palettes are available in variations from 3 different values up to 9 different values.
#The diverging palettes are
# BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral
NOTE: THE DIFFBIND is not the latest version. The manual is Edited by the author on: September 18, 2019 and Compiled on: September 15, 2020
FOR the latest version of DIFFBIND , NEXT TIME!
Refs:
https://www.jianshu.com/p/f849bd55ac27
https://hbctraining.github.io/Intro-to-ChIPseq/
网友评论