Diffbind

作者: 余绕 | 来源:发表于2021-10-25 16:25 被阅读0次

    DiffBind, which provides functions for processing DNA data enriched for genomic loci, including ChIP-seq data enriched for sites where specific protein/DNA binding occurs, or histone marks are enriched, as well as open-chromatin assays such as ATAC-seq.

    The primary emphasis of the package is on identifying sites that are differentially bound between sample groups. It includes functions to support the processing of peak sets, including
    (1)overlapping and merging peak sets, (2)counting sequencing reads overlapping intervals
    in peak sets, and (3) identifying statistically significantly differentially bound sites based on evidence of binding affinity (measured by differences in read densities).
    To this end it uses
    statistical routines developed in an RNA-Seq context (primarily the Bioconductor packages edgeR and DESeq2 ). Additionally, the package builds on Rgraphics routines to provide a set of standardized plots to aid in binding analysis
    1) Install the package.

    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    
    BiocManager::install("DiffBind")
    

    2) Prepare the metadata.
    The input file is a csv file which ***may include *** the following columns:
    [1] SampleID: Identifier string for sample. Must be unique for each sample.
    [2] Tissue: Identifier string for tissue type
    [3] Factor: Identifier string for factor
    [4] Condition: Identifier string for condition
    [5]Treatment:Identifier string for treatment
    [6] Replicate:Replicate number of sample
    [7] bamReads: file path for bam file containing aligned reads for ChIP sample
    [8] Control ID: : file path for bam file containing aligned reads for control sample
    [9] bamControl: file path for bam file containing aligned reads for control sample
    [10] Peaks: path for file containing peaks for sample. Format determined by
    PeakCaller field or caller parameter
    [11] PeakCaller: Identifier string for peak caller used. If Peaks is not a bed
    file, this will determine how the Peaks file is parsed. If missing, will use
    default peak caller specified in caller parameter. Possible values:
    – “raw”: text file file; peak score is in fourth column
    – “bed”: .bed file; peak score is in fifth column
    – “narrow”: default peak.format: narrowPeaks file
    – “macs”: MACS .xls file
    – “swembl”: SWEMBL .peaks file
    – “bayes”: bayesPeak file
    – “peakset”: peakset written out using pv.writepeakset
    – “fp4”: FindPeaks v4

    Sample file from diffbind


    Sample file from Intro to ChIPseq using HPC
    image.png
    Judging from the above two sheets, it is obvious that not all fields are necessary for following data analysis.

    Due to the large files of bam, it is very time consuming to download the data from servers. Thus, it is better to install Diffbind in the server.
    Following is the instruction for Diffbind installation in R on Linux servers.

    1. R.3.5 or 4.1 in conda seems does not work, needs lots of dependent packages.
    2. I choose the R in server which is also R.3.5
      a) Tried the installation using following command:
     if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install("DiffBind")
    

    But it reported error like this:
    ERROR: dependency ?.map?.is not available for package ?.iffBind?
    THis error indicates that it need amap package.
    b)Download the map package in the website :
    https://cran.r-project.org/web/packages/amap/index.html
    Note: Choose the old version cuasue the latest version is for >=R.4.0.
    Here, we downloaded ''amap_0.8-14.tar.gz''.
    c) Install amap

    R CMD  INSTALL amap_0.8-14.tar.gz 
    

    d) Install Diffbind

     if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install("DiffBind")
    

    3) Data analysis
    1. Prepare the metadata sheet.

    image.png
    2. Reading in the peaksets
    #1 Reading in the peaksets
    dbObj <- dba(sampleSheet="NF.csv")
    #This shows how many peaks are in each peakset, as well as (in the first line) the total
    #number of unique peaks after merging overlapping ones (3795), and the dimensions of the 
    #default binding matrix of 11 samples by the 2845 sites that overlap in at least two of the samples.
    
    > dbObj 
    Samples, 33147 sites in matrix (36024 total):
          ID Factor Replicate Caller Intervals
    1 N.Rep1      N         1    bed     32303
    2 N.Rep2      N         2    bed     32029
    3 F.Rep1      F         1    bed     32388
    4 F.Rep2      F         2    bed     32797
    

    3.clustering of the samples

    #clustering of the samples 
    plot(dbObj)
    
    image.png
    4.Counting reads
    The next step is to calculate a binding matrix with scores based on read counts for every
    sample (affinity scores), rather than confidence scores for only those peaks called in a specific
    sample (occupancy scores). These reads are obtained using the dba.count function:
    tamoxifen <- dba.count(dbObj)
    
    > tamoxifen
     Samples, 33147 sites in matrix:
          ID Factor Replicate Caller Intervals FRiP
    1 N.Rep1      N         1 counts     33147 0.54
    2 N.Rep2      N         2 counts     33147 0.59
    3 F.Rep1      F         1 counts     33147 0.52
    4 F.Rep2      F         2 counts     33147 0.54
    # Or
    tamoxifen <- dba.count(dbObj, bUseSummarizeOverlaps=TRUE) #This option is better.
    #bUseSummarizeOverlaps: logical indicating that summarizeOverlaps should be used for counting instead 
    # of the built-in counting code. This option is slower but uses the more standard
    # counting function. If TRUE, all read files must be BAM (.bam extension), with
    # associated index files (.bam.bai extension). The fragmentSize parameter must
    # absent
    
    
    #plot a new correlation heatmap based on the count scores.
    plot(tamoxifen)
    
    image.png

    5.Establishing a contrast
    Before running the differential analysis, we need to tell DiffBind which cell lines fall in which
    groups. This is done using the dba.contrast function, as follows:

    #Before running the differential analysis, we need to tell DiffBind which cell lines fall in which
      #groups. This is done using the dba.contrast function, as follows:
    
    tamoxifen <- dba.contrast(tamoxifen, categories=DBA_FACTOR,minMembers = 2)
      #This uses the Condition metadata (Responsive vs. Resistant) to set up a contrast with 4
      #samples in the Resistant group and 7 samples in the Responsive group.2 (This step is optional)
    
    image.png
    6.Performing the differential analysis
    tamoxifen <- dba.analyze(tamoxifen, method=DBA_ALL_METHODS)
      # if this value is set to DBA_ALL_METHODS, this is equivalent to specifying c(DBA_EDGER,DBA_DESEQ2).
    #  summary of results
    dba.show(tamoxifen, bContrasts=T)
    
    > dba.show(tamoxifen, bContrasts=T)
      Group1 Members1 Group2 Members2 DB.edgeR DB.DESeq2
    1      N        2      F        2     7711      5398
    
    image.png
    #  overlapping peaks identified by the two different tools (DESeq2 and edgeR)
    dba.plotVenn(tamoxifen,contrast=1,method=DBA_ALL_METHODS)
    
    image.png

    7 Retrieving the differentially bound sites
    The value columns show the mean read concentration over all the samples (the default calculation uses log2 normalized ChIP read counts with control read counts subtracted) and the mean concentration over the first (Resistant) group and second (Responsive) group(前面/后面).

    comp1.edgeR <- dba.report(tamoxifen, method=DBA_EDGER, contrast = 1, th=1)
    comp1.deseq <- dba.report(tamoxifen, method=DBA_DESEQ2, contrast = 1, th=1)
    
    image.png
    8Save all results
      # EdgeR
    out <- as.data.frame(comp1.edgeR)
    write.table(out, file="edgeR.txt", sep="\t", quote=F, col.names = NA)
      # DESeq2
    out <- as.data.frame(comp1.deseq)
    write.table(out, file="deseq2.txt", sep="\t", quote=F, col.names = NA)
    

    9 Save results in BED format

    # Create bed files for each keeping only significant peaks (p < 0.05)
    # EdgeR
    out <- as.data.frame(comp1.edgeR)
    edge.bed <- out[ which(out$FDR < 0.05), 
                     c("seqnames", "start", "end", "strand", "Fold")]
    write.table(edge.bed, file="results/Nanog_vs_Pou5f1_edgeR_sig.bed", sep="\t", quote=F, row.names=F, col.names=F)
    
    # DESeq2
    out <- as.data.frame(comp1.deseq)
    deseq.bed <- out[ which(out$FDR < 0.05), 
                      c("seqnames", "start", "end", "strand", "Fold")]
    write.table(deseq.bed, file="results/Nanog_vs_Pou5f1_deseq2_sig.bed", sep="\t", quote=F, row.names=F, col.names=F)
    
    #MAPLOT
    dba.plotMA(tamoxifen)
    #Volcano plot
    dba.plotVolcano(tamoxifen)
    #Heatmap
    corvals <- dba.plotHeatmap(tamoxifen, contrast=1, correlations=FALSE,colScheme="Spectral")
    
    #The sequential palettes names are 
    # Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd
    
    #All the sequential palettes are available in variations from 3 different values up to 9 different values.
    #The diverging palettes are 
    #  BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral
    

    NOTE: THE DIFFBIND is not the latest version. The manual is Edited by the author on: September 18, 2019 and Compiled on: September 15, 2020
    FOR the latest version of DIFFBIND , NEXT TIME!
    Refs:
    https://www.jianshu.com/p/f849bd55ac27
    https://hbctraining.github.io/Intro-to-ChIPseq/

    相关文章

      网友评论

          本文标题:Diffbind

          本文链接:https://www.haomeiwen.com/subject/qdposltx.html