水稻CUT&Tag数据分析流程

作者: 五香可达鸭 | 来源:发表于2021-06-01 15:04 被阅读0次

    最近实验室的师兄做了几个样本的Cut&Tag数据,公司没有分析流程,只能自己摸索着分析。

    参考了网上的一个流程:https://yezhengstat.github.io/CUTTag_tutorial/#I_Introduction
    中文版的一个帖子:https://www.jianshu.com/p/1a23656a0713 ##这个帖子分成了三篇

    照虎画猫,有不对的地方还请大家指出

    0 样本信息

    一共是4个样本,每个样本2个重复:
    A_treat_rep1 A_Igg_rep1 A_treat_rep2 A_Igg_rep2
    B_treat_rep1 B_Igg_rep1 B_treat_rep2 B_Igg_rep2

    1 质控

    公司的测序报告说数据经过内部软件进行了初步的质控(具体啥软件也没说)


    image.png

    使用fastqc进行质控,使用multiqc批量查看质控结果

    fastqc -t 8 -o out_path sample1_1.fq   ## -t 指定线程 -o 指定输出目录
    multiqc fast_qc -o multiqc ##multiqc会自动寻找指定文件下面fastqc的结果文件,-o 指定输出目录
    

    生成了multiqc_report.html文件,然后看一下结果


    image.png
    image.png

    看了碱基质量和碱基组成成分
    教程说:读取开始处的不一致序列内容是 CUT&Tag 读取的常见现象。 未能通过 Per base seuqnence 内容并不意味着您的数据失败。在任何情况下都不建议对序列在进行trim,因为后续的bowtie2参数将在不进行trim的情况下提供准确的mapping信息。

    2比对

    2.1 比对

    使用bowtie2进行比对,比对之前要先构建index
    我使用的MSU7.0的参考序列

    bowtie2-build --threads 8 Osativa_genome.fa Osativa_index  > Osativa_bowtie2_index.log 2>&1  ##首先构建索引
    bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${ref} -1 ${projPath}/fastq/${histName}_R1.fastq.gz -2 ${projPath}/fastq/${histName}_R2.fastq.gz -S ${projPath}/alignment/sam/${histName}_bowtie2.sam &> ${projPath}/alignment/sam/bowtie2_summary/${histName}_bowtie2.txt
    ### -I 是最短序列  -X 是最长序列 -p 线程 -x 指定参考基因组的index -1 序列1文件 -2 序列2文件 -S 生成的sam文件
    ##如果文件比较多,可以写一个简单的for循环 for i in `cat file_list` ; do done
    

    2.3 spike-in校准

    这一步的目的是根据比对到大肠杆菌DNA上的片段数量对测序序列进行校准,教程里用的是E. coli genome U00096.3 ,ncbi上搜一下就出来了。

    bowtie2-build path/to/Ecoli/fasta/Ecoli.fa /path/to/bowtie2Index/Ecoli ###建立索引
    
    bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${spikeInRef} -1 ${projPath}/fastq/${histName}_R1.fastq.gz -2 ${projPath}/fastq/${histName}_R2.fastq.gz -S $projPath/alignment/sam/${histName}_bowtie2_spikeIn.sam &> $projPath/alignment/sam/bowtie2_summary/${histName}_bowtie2_spikeIn.txt  ##进行比对
    
    seqDepthDouble=`samtools view -F 0x04 $projPath/alignment/sam/${histName}_bowtie2_spikeIn.sam | wc -l`
    seqDepth=$((seqDepthDouble/2))
    echo $seqDepth >$projPath/alignment/sam/bowtie2_summary/${histName}_bowtie2_spikeIn.seqDepth  ##计算出了测序深度,这一步的结果在下面会用到
    

    2.4 比对情况总结及可视化

    这一步是对比对情况的总结和可视化,要在R中进行,本次教程设计的所以R操作建议在一个脚本中进行,因为后面的步骤会调用前面的结果。(个人感觉用处不大,并且还比较浪费时间,很繁琐)
    对下面的几个方面进行了汇总:
    测序深度,比对率、比对上的片段数量、重复率、unique文库大小、片段大小分布

    ##=== R command ===## 
    ###在分析之前,先把该装的包都装上,建议用biocmanager装 省时省力
    library(dplyr)
    library(stringr)
    library(ggplot2)
    library(viridis)
    library(GenomicRanges)
    library(chromVAR) ## For FRiP analysis and differential analysis
    library(DESeq2) ## For differential analysis section
    library(ggpubr) ## For customizing figures
    library(corrplot) ## For correlation plot
    
    ## Path to the project and histone list
    projPath = "/fh/fast/gottardo_r/yezheng_working/cuttag/CUTTag_tutorial"
    sampleList = c("K27me3_rep1", "K27me3_rep2", "K4me3_rep1", "K4me3_rep2", "IgG_rep1", "IgG_rep2")
    histList = c("K27me3", "K4me3", "IgG")
    
    ## Collect the alignment results from the bowtie2 alignment summary files
    alignResult = c()
    for(hist in sampleList){
      alignRes = read.table(paste0(projPath, "/alignment/sam/bowtie2_summary/", hist, "_bowtie2.txt"), header = FALSE, fill = TRUE)
      alignRate = substr(alignRes$V1[6], 1, nchar(as.character(alignRes$V1[6]))-1)
      histInfo = strsplit(hist, "_")[[1]]
      alignResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2], 
                               SequencingDepth = alignRes$V1[1] %>% as.character %>% as.numeric, 
                               MappedFragNum_hg38 = alignRes$V1[4] %>% as.character %>% as.numeric + alignRes$V1[5] %>% as.character %>% as.numeric, 
                               AlignmentRate_hg38 = alignRate %>% as.numeric)  %>% rbind(alignResult, .)
    }
    alignResult$Histone = factor(alignResult$Histone, levels = histList)
    alignResult %>% mutate(AlignmentRate_hg38 = paste0(AlignmentRate_hg38, "%"))
    
    #######Spike-in alignment####
    spikeAlign = c()
    for(hist in sampleList){
      spikeRes = read.table(paste0(projPath, "/alignment/sam/bowtie2_summary/", hist, "_bowtie2_spikeIn.txt"), header = FALSE, fill = TRUE)
      alignRate = substr(spikeRes$V1[6], 1, nchar(as.character(spikeRes$V1[6]))-1)
      histInfo = strsplit(hist, "_")[[1]]
      spikeAlign = data.frame(Histone = histInfo[1], Replicate = histInfo[2], 
                              SequencingDepth = spikeRes$V1[1] %>% as.character %>% as.numeric, 
                              MappedFragNum_spikeIn = spikeRes$V1[4] %>% as.character %>% as.numeric + spikeRes$V1[5] %>% as.character %>% as.numeric, 
                              AlignmentRate_spikeIn = alignRate %>% as.numeric)  %>% rbind(spikeAlign, .)
    }
    spikeAlign$Histone = factor(spikeAlign$Histone, levels = histList)
    spikeAlign %>% mutate(AlignmentRate_spikeIn = paste0(AlignmentRate_spikeIn, "%"))
    
    
    #########Summarize the alignment to hg38 and E.coli########
    alignSummary = left_join(alignResult, spikeAlign, by = c("Histone", "Replicate", "SequencingDepth")) %>%
      mutate(AlignmentRate_hg38 = paste0(AlignmentRate_hg38, "%"), 
             AlignmentRate_spikeIn = paste0(AlignmentRate_spikeIn, "%"))
    alignSummary
    
    ##########       Visualizing the sequencing depth and alignment results    ##########
    ## Generate sequencing depth boxplot
    fig3A = alignResult %>% ggplot(aes(x = Histone, y = SequencingDepth/1000000, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("Sequencing Depth per Million") +
        xlab("") + 
        ggtitle("A. Sequencing Depth")
    
    fig3B = alignResult %>% ggplot(aes(x = Histone, y = MappedFragNum_hg38/1000000, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("Mapped Fragments per Million") +
        xlab("") +
        ggtitle("B. Alignable Fragment (hg38)")
    
    fig3C = alignResult %>% ggplot(aes(x = Histone, y = AlignmentRate_hg38, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("% of Mapped Fragments") +
        xlab("") +
        ggtitle("C. Alignment Rate (hg38)")
    
    fig3D = spikeAlign %>% ggplot(aes(x = Histone, y = AlignmentRate_spikeIn, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("Spike-in Alignment Rate") +
        xlab("") +
        ggtitle("D. Alignment Rate (E.coli)")
    
    ggarrange(fig3A, fig3B, fig3C, fig3D, ncol = 2, nrow=2, common.legend = TRUE, legend="bottom")
    

    2.5 去重复

    这一步的目的是去除重复序列,高质量的CUT&TAG数据集重复率很低,不用去除重复。 在使用非常少量的材料或怀疑 PCR 重复的实验中,可以删除重复项。这里我先对sam文件转为bam格式并排序,检查了重复率之后发现都比较高,所以对所有的文件都进行了去重。

    samtools view -S xx.sam -h -b | samtools sort - -o xx_sort.bam
    
    picardCMD="java -jar picard.jar"
    
    mkdir -p $projPath/alignment/removeDuplicate/picard_summary
    
    picardCMD MarkDuplicates I=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam O=$projPath/alignment/removeDuplicate/${histName}_bowtie2.sorted.rmDup.sam REMOVE_DUPLICATES=true METRICS_FILE=$projPath/alignment/removeDuplicate/picard_summary/${histName}_picard.rmDup.txt
    
    

    下面就对去重结果进行汇总和可视化,在R语言中进行

    ##=== R command ===## 
    ## Summarize the duplication information from the picard summary outputs.
    dupResult = c()
    for(hist in sampleList){
      dupRes = read.table(paste0(projPath, "/alignment/removeDuplicate/picard_summary/", hist, "_picard.rmDup.txt"), header = TRUE, fill = TRUE)
      
      histInfo = strsplit(hist, "_")[[1]]
      dupResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2], MappedFragNum_hg38 = dupRes$READ_PAIRS_EXAMINED[1] %>% as.character %>% as.numeric, DuplicationRate = dupRes$PERCENT_DUPLICATION[1] %>% as.character %>% as.numeric * 100, EstimatedLibrarySize = dupRes$ESTIMATED_LIBRARY_SIZE[1] %>% as.character %>% as.numeric) %>% mutate(UniqueFragNum = MappedFragNum_hg38 * (1-DuplicationRate/100))  %>% rbind(dupResult, .)
    }
    dupResult$Histone = factor(dupResult$Histone, levels = histList)
    alignDupSummary = left_join(alignSummary, dupResult, by = c("Histone", "Replicate", "MappedFragNum_hg38")) %>% mutate(DuplicationRate = paste0(DuplicationRate, "%"))
    alignDupSummary
    
    ##=== R command ===## 
    ## generate boxplot figure for the  duplication rate
    fig4A = dupResult %>% ggplot(aes(x = Histone, y = DuplicationRate, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("Duplication Rate (*100%)") +
        xlab("") 
    
    fig4B = dupResult %>% ggplot(aes(x = Histone, y = EstimatedLibrarySize, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("Estimated Library Size") +
        xlab("") 
    
    fig4C = dupResult %>% ggplot(aes(x = Histone, y = UniqueFragNum, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("# of Unique Fragments") +
        xlab("")
    
    ggarrange(fig4A, fig4B, fig4C, ncol = 3, common.legend = TRUE, legend="bottom")
    
    • 在这些示例数据集中,IgG 对照样本具有相对较高的重复率,因为该样本中的读数源自 CUT&Tag 反应中的非特异性标记。 因此,在下游分析之前从 IgG 数据集中删除重复项是合适的。
    • 估计的文库大小是根据 Picard 计算的 PE 重复估计文库中独特分子的数量。
    • 估计的文库大小与目标表位的丰度和所用抗体的质量成正比,而 IgG 样本的估计文库大小预计非常低。

    2.6 比对片段分布

    这一部分我直接贴代码把

    ##== linux command ==##
    mkdir -p $projPath/alignment/sam/fragmentLen
    
    ## Extract the 9th column from the alignment sam file which is the fragment length
    samtools view -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam | awk -F'\t' 'function abs(x){return ((x < 0.0) ? -x : x)} {print abs($9)}' | sort | uniq -c | awk -v OFS="\t" '{print $2, $1/2}' >$projPath/alignment/sam/fragmentLen/${histName}_fragmentLen.txt
    
    ##=== R command ===## 
    ## Collect the fragment size information
    fragLen = c()
    for(hist in sampleList){
      
      histInfo = strsplit(hist, "_")[[1]]
      fragLen = read.table(paste0(projPath, "/alignment/sam/fragmentLen/", hist, "_fragmentLen.txt"), header = FALSE) %>% mutate(fragLen = V1 %>% as.numeric, fragCount = V2 %>% as.numeric, Weight = as.numeric(V2)/sum(as.numeric(V2)), Histone = histInfo[1], Replicate = histInfo[2], sampleInfo = hist) %>% rbind(fragLen, .) 
    }
    fragLen$sampleInfo = factor(fragLen$sampleInfo, levels = sampleList)
    fragLen$Histone = factor(fragLen$Histone, levels = histList)
    ## Generate the fragment size density plot (violin plot)
    fig5A = fragLen %>% ggplot(aes(x = sampleInfo, y = fragLen, weight = Weight, fill = Histone)) +
        geom_violin(bw = 5) +
        scale_y_continuous(breaks = seq(0, 800, 50)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 20) +
        ggpubr::rotate_x_text(angle = 20) +
        ylab("Fragment Length") +
        xlab("")
    
    fig5B = fragLen %>% ggplot(aes(x = fragLen, y = fragCount, color = Histone, group = sampleInfo, linetype = Replicate)) +
      geom_line(size = 1) +
      scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma") +
      theme_bw(base_size = 20) +
      xlab("Fragment Length") +
      ylab("Count") +
      coord_cartesian(xlim = c(0, 500))
    
    ggarrange(fig5A, fig5B, ncol = 2)
    

    3 比对结果过滤和格式转换

    3.1 通过比对质量过滤比对序列

    ##== linux command ==##
    minQualityScore=2
    samtools view -q $minQualityScore ${projPath}/alignment/sam/${histName}_bowtie2.sam >${projPath}/alignment/sam/${histName}_bowtie2.qualityScore$minQualityScore.sam
    ### 个人感觉没必要给嘴小得分质量赋值,可能是为了更好的说明吧?
    

    4.2 格式转换

    ##== linux command ==##
    ## Filter and keep the mapped read pairs
    samtools view -bS -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam >$projPath/alignment/bam/${histName}_bowtie2.mapped.bam
    
    ## Convert into bed file format
    bedtools bamtobed -i $projPath/alignment/bam/${histName}_bowtie2.mapped.bam -bedpe >$projPath/alignment/bed/${histName}_bowtie2.bed
    
    ## Keep the read pairs that are on the same chromosome and fragment length less than 1000bp.
    awk '$1==$4 && $6-$2 < 1000 {print $0}' $projPath/alignment/bed/${histName}_bowtie2.bed >$projPath/alignment/bed/${histName}_bowtie2.clean.bed
    
    ## Only extract the fragment related columns
    cut -f 1,2,6 $projPath/alignment/bed/${histName}_bowtie2.clean.bed | sort -k1,1 -k2,2n -k3,3n  >$projPath/alignment/bed/${histName}_bowtie2.fragments.bed
    

    4.3 重复再现性

    ##== linux command ==##
    ## We use the mid point of each fragment to infer which 500bp bins does this fragment belong to.
    binLen=500
    awk -v w=$binLen '{print $1, int(($2 + $3)/(2*w))*w + w/2}' $projPath/alignment/bed/${histName}_bowtie2.fragments.bed | sort -k1,1V -k2,2n | uniq -c | awk -v OFS="\t" '{print $2, $3, $1}' |  sort -k1,1V -k2,2n  >$projPath/alignment/bed/${histName}_bowtie2.fragmentsCount.bin$binLen.bed
    
    ##== R command ==##
    reprod = c()
    fragCount = NULL
    for(hist in sampleList){
      
      if(is.null(fragCount)){
        
        fragCount = read.table(paste0(projPath, "/alignment/bed/", hist, "_bowtie2.fragmentsCount.bin500.bed"), header = FALSE) 
        colnames(fragCount) = c("chrom", "bin", hist)
      
      }else{
        
        fragCountTmp = read.table(paste0(projPath, "/alignment/bed/", hist, "_bowtie2.fragmentsCount.bin500.bed"), header = FALSE)
        colnames(fragCountTmp) = c("chrom", "bin", hist)
        fragCount = full_join(fragCount, fragCountTmp, by = c("chrom", "bin"))
        
      }
    }
    
    M = cor(fragCount %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs") 
    
    corrplot(M, method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 3, rect.col = "black", rect.lwd = 3,cl.pos = "b", tl.col = "indianred4", tl.cex = 1, cl.cex = 1, addCoef.col = "black", number.digits = 2, number.cex = 1, col = colorRampPalette(c("midnightblue","white","darkred"))(100))
    

    5 spike-in 校准

    ##== linux command ==##
    if [[ "$seqDepth" -gt "1" ]]; then
        
        mkdir -p $projPath/alignment/bedgraph
    
        scale_factor=`echo "10000 / $seqDepth" | bc -l`
        echo "Scaling factor for $histName is: $scale_factor!"
        bedtools genomecov -bg -scale $scale_factor -i $projPath/alignment/bed/${histName}_bowtie2.fragments.bed -g $chromSize > $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph
        
    fi
    ## 这里的seqDepth是之前设置的变量,我直接查看了所有的seqDepth,然后发现所有的都大于1  就没有执行这个判断,直接循环的下面的命令
    ##!!!!!!注意,这里的chromeSize 是一个文件,包含两列,第一列是参考基因组染色体名称,第二列是染色体的长度,\t分隔 。  使用samtools index genome.fa   会生成一个.fai 结尾的文件 里面会有这些内容。
    
    ##=== R command ===## 
    scaleFactor = c()
    multiplier = 10000
    for(hist in sampleList){
      spikeDepth = read.table(paste0(projPath, "/alignment/sam/bowtie2_summary/", hist, "_bowtie2_spikeIn.seqDepth"), header = FALSE, fill = TRUE)$V1[1]
      
      histInfo = strsplit(hist, "_")[[1]]
      scaleFactor = data.frame(scaleFactor = multiplier/spikeDepth, Histone = histInfo[1], Replicate = histInfo[2])  %>% rbind(scaleFactor, .)
    }
    scaleFactor$Histone = factor(scaleFactor$Histone, levels = histList)
    left_join(alignDupSummary, scaleFactor, by = c("Histone", "Replicate")
    
    ##=== R command ===##
    ## Generate sequencing depth boxplot
    fig6A = scaleFactor %>% ggplot(aes(x = Histone, y = scaleFactor, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 20) +
        ylab("Spike-in Scalling Factor") +
        xlab("")
    
    normDepth = inner_join(scaleFactor, alignResult, by = c("Histone", "Replicate")) %>% mutate(normDepth = MappedFragNum_hg38 * scaleFactor)
    
    fig6B = normDepth %>% ggplot(aes(x = Histone, y = normDepth, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 20) +
        ylab("Normalization Fragment Count") +
        xlab("") + 
        coord_cartesian(ylim = c(1000000, 130000000))
    ggarrange(fig6A, fig6B, ncol = 2, common.legend = TRUE, legend="bottom")
    

    6 call peak

    终于到了激动人心的时刻,这里使用SEACR进行call peak 需要把.sh 和.r 文件同时传到服务器上

    ##== linux command ==##
    seacr="/fh/fast/gottardo_r/yezheng_working/Software/SEACR/SEACR_1.3.sh"
    histControl=$2
    mkdir -p $projPath/peakCalling/SEACR
    
    bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph \
         $projPath/alignment/bedgraph/${histControl}_bowtie2.fragments.normalized.bedgraph \
         non stringent $projPath/peakCalling/SEACR/${histName}_seacr_control.peaks
    
    bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}_seacr_top0.01.peaks  
    

    6.2 peak的数量统计

    ##=== R command ===## 
    peakN = c()
    peakWidth = c()
    peakType = c("control", "top0.01")
    for(hist in sampleList){
      histInfo = strsplit(hist, "_")[[1]]
      if(histInfo[1] != "IgG"){
        for(type in peakType){
          peakInfo = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_seacr_", type, ".peaks.stringent.bed"), header = FALSE, fill = TRUE)  %>% mutate(width = abs(V3-V2))
          peakN = data.frame(peakN = nrow(peakInfo), peakType = type, Histone = histInfo[1], Replicate = histInfo[2]) %>% rbind(peakN, .)
          peakWidth = data.frame(width = peakInfo$width, peakType = type, Histone = histInfo[1], Replicate = histInfo[2])  %>% rbind(peakWidth, .)
        }
      }
    }
    peakN %>% select(Histone, Replicate, peakType, peakN)
    

    6.3 重复中峰的重现性

    比较重复数据集上的峰值调用以定义可重现的峰值。 前 1% 的峰值(按每个块中的总信号排序)被选为高置信度站点。

    ##=== R command ===## 
    histL = c("K27me3", "K4me3")
    repL = paste0("rep", 1:2)
    peakType = c("control", "top0.01")
    peakOverlap = c()
    for(type in peakType){
      for(hist in histL){
        overlap.gr = GRanges()
        for(rep in repL){
          peakInfo = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_", rep, "_seacr_", type, ".peaks.stringent.bed"), header = FALSE, fill = TRUE)
          peakInfo.gr = GRanges(peakInfo$V1, IRanges(start = peakInfo$V2, end = peakInfo$V3), strand = "*")
          if(length(overlap.gr) >0){
            overlap.gr = overlap.gr[findOverlaps(overlap.gr, peakInfo.gr)@from]
          }else{
            overlap.gr = peakInfo.gr
            
          }
        }
        peakOverlap = data.frame(peakReprod = length(overlap.gr), Histone = hist, peakType = type) %>% rbind(peakOverlap, .)
      }
    }
    
    peakReprod = left_join(peakN, peakOverlap, by = c("Histone", "peakType")) %>% mutate(peakReprodRate = peakReprod/peakN * 100)
    peakReprod %>% select(Histone, Replicate, peakType, peakN, peakReprodNum = peakReprod, peakReprodRate)
    

    6.4 峰区的片段比例(FRiPs)

    我们计算峰值读数的分数 (FRiPs) 作为信噪比的度量,并将其与 IgG 对照数据集中的 FRiPs 进行对比以进行说明。 尽管 CUT&Tag 的测序深度通常只有 1-5 百万个reads,但该方法的低背景导致高 FRiP 分数。

    ##=== R command ===## 
    library(chromVAR)
    
    bamDir = paste0(projPath, "/alignment/bam")
    inPeakData = c()
    ## overlap with bam file to get count
    for(hist in histL){
      for(rep in repL){
        peakRes = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_", rep, "_seacr_control.peaks.stringent.bed"), header = FALSE, fill = TRUE)
        peak.gr = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*")
        bamFile = paste0(bamDir, "/", hist, "_", rep, "_bowtie2.mapped.bam")
        fragment_counts <- getCounts(bamFile, peak.gr, paired = TRUE, by_rg = FALSE, format = "bam")
        inPeakN = counts(fragment_counts)[,1] %>% sum
        inPeakData = rbind(inPeakData, data.frame(inPeakN = inPeakN, Histone = hist, Replicate = rep))
      }
    }
    
    frip = left_join(inPeakData, alignResult, by = c("Histone", "Replicate")) %>% mutate(frip = inPeakN/MappedFragNum_hg38 * 100)
    frip %>% select(Histone, Replicate, SequencingDepth, MappedFragNum_hg38, AlignmentRate_hg38, FragInPeakNum = inPeakN, FRiPs = frip)
    

    6.5峰数、峰宽、峰重现性和 FRiP 的可视化

    fig7A = peakN %>% ggplot(aes(x = Histone, y = peakN, fill = Histone)) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        facet_grid(~peakType) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("Number of Peaks") +
        xlab("")
    
    fig7B = peakWidth %>% ggplot(aes(x = Histone, y = width, fill = Histone)) +
        geom_violin() +
        facet_grid(Replicate~peakType) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        scale_y_continuous(trans = "log", breaks = c(400, 3000, 22000)) +
        theme_bw(base_size = 18) +
        ylab("Width of Peaks") +
        xlab("")
    
    fig7C = peakReprod %>% ggplot(aes(x = Histone, y = peakReprodRate, fill = Histone, label = round(peakReprodRate, 2))) +
        geom_bar(stat = "identity") +
        geom_text(vjust = 0.1) +
        facet_grid(Replicate~peakType) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("% of Peaks Reproduced") +
        xlab("")
    
    fig7D = frip %>% ggplot(aes(x = Histone, y = frip, fill = Histone, label = round(frip, 2))) +
        geom_boxplot() +
        geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
        scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
        scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
        theme_bw(base_size = 18) +
        ylab("% of Fragments in Peaks") +
        xlab("")
    
    ggarrange(fig7A, fig7B, fig7C, fig7D, ncol = 2, nrow=2, common.legend = TRUE, legend="bottom")
    

    7可视化

    因为我是想寻找该蛋白特定结合的基因,不像转录因子每个基因都有特定的结合区域,所以这部分就没涉及。

    8 差异分析

    8.1 创建一个主峰列表,合并为每个样本调用的所有峰。

    ##=== R command ===## 
    mPeak = GRanges()
    ## overlap with bam file to get count
    for(hist in histL){
      for(rep in repL){
        peakRes = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_", rep, "_seacr_control.peaks.stringent.bed"), header = FALSE, fill = TRUE)
        mPeak = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*") %>% append(mPeak, .)
      }
    }
    masterPeak = reduce(mPeak)    
    ###  这个masterPeak存了所有的峰的位置
    

    8.2获取主峰列表中每个峰的碎片计数。

    ##=== R command ===## 
    library(DESeq2)
    bamDir = paste0(projPath, "/alignment/bam")
    countMat = matrix(NA, length(masterPeak), length(histL)*length(repL))
    ## overlap with bam file to get count
    i = 1
    for(hist in histL){
      for(rep in repL){
        
        bamFile = paste0(bamDir, "/", hist, "_", rep, "_bowtie2.mapped.bam")
        fragment_counts <- getCounts(bamFile, masterPeak, paired = TRUE, by_rg = FALSE, format = "bam")
        countMat[, i] = counts(fragment_counts)[,1]
        i = i + 1
      }
    }
    colnames(countMat) = paste(rep(histL, 2), rep(repL, each = 2), sep = "_")
    

    8.3 测序深度归一化和差异富集峰检测

    ##=== R command ===## 
    selectR = which(rowSums(countMat) > 5) ## remove low count genes
    dataS = countMat[selectR,]
    condition = factor(rep(histL, each = length(repL)))
    dds = DESeqDataSetFromMatrix(countData = dataS,
                                  colData = DataFrame(condition),
                                  design = ~ condition)
    DDS = DESeq(dds)
    normDDS = counts(DDS, normalized = TRUE) ## normalization with respect to the sequencing depth
    colnames(normDDS) = paste0(colnames(normDDS), "_norm")
    res = results(DDS, independentFiltering = FALSE, altHypothesis = "greaterAbs")
    
    countMatDiff = cbind(dataS, normDDS, res)
    head(countMatDiff)
    
    • DESeq2 要求输入矩阵应该是非标准化计数或测序读数的估计计数。
    • DESeq2 模型在内部校正库大小。
    • countMatDiff 总结了差分分析结果:
      • 前 4 列:过滤掉计数低的峰区域后的原始读数计数
      • 第二个 4 列:标准化读取计数消除了文库大小差异
      • 剩下的就是差异分析结果

    写到最后:
    真的繁琐,一步一步,我人已经傻掉了,脑袋嗡嗡的。

    相关文章

      网友评论

        本文标题:水稻CUT&Tag数据分析流程

        本文链接:https://www.haomeiwen.com/subject/ddmwsltx.html