美文网首页生物信息学
RNA-seq小考核----生信技能树

RNA-seq小考核----生信技能树

作者: LiuYueRR | 来源:发表于2019-07-29 13:28 被阅读0次

    生信技能树----RNA-seq小考核。

    题目链接地址为:http://www.bio-info-trainee.com/3920.html

    教程对应B站:【生信技能树】转录组测序数据分析

    链接:https://www.bilibili.com/video/av28453557?from=search&seid=17370292426064271948
    代码参考:https://github.com/jmzeng1314/GEO/tree/master/airway_RNAseq

    Q1: 参考基因组及注释文件下载地址

    列出人,小鼠,拟南芥的基因组序列,转录组cDNA序列,基因组注释gtf文件下载地址

    参考基因组存储网站三个:ENSEMBL、NCBI、UCSC

    mkdir rnaseq_test
    cd rnaseq_test
    #拟南芥
    wget -c ftp://ftp.ensemblgenomes.org/pub/release-44/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
    wget -c ftp://ftp.ensemblgenomes.org/pub/release-44/plants/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
    wget -c ftp://ftp.ensemblgenomes.org/pub/release-44/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.44.gtf.gz
    #人
    wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz
    wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
    wget -c ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
    #小鼠
    wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna_sm.toplevel.fa.gz
    wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
    wget -c ftp://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.gtf.gz
    

    Q2: 找到文章的测序数据

    2018年12月的NC文章:Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing 使用成熟的单细胞转录组( Smart-seq2 )手段探索了癌相关的成纤维细胞 CAFs的功能和空间异质性。
    在文献中找到了:GSE111229
    然后进图NCBI-GEO数据库中搜索:GSE111229
    下载含有SRR编号的SRR_Acc_List.txt文件
    下载方法步骤如下图所示:
    
    1563964898(1).png 1563964927(1).png
    1563964959(1).png
    1563976393(1).png

    Q3:下载测序数据

    #写一个while循环或者for循环,批量进行测序数据的下载,三种方式均可进行批量操作,实际上取任何一种方法操作即可。
    cat SRR_Acc_List.txt|while read id ;do (prefetch  ${id} );done
    for i in `less SRR_Acc_List.txt`;do (prefetch  ${id} ) ;done
    for i in SRR679{0711..1478};do ( prefetch ${i} );done
    

    Q4: 任意挑选6个样本走标准的RNA-seq上游流程

    即 sra → fastq→bam→counts

    注意每个步骤的质控细节,注意每个步骤的文件格式转换背后的生物学意义。
    代码参考在:code

    sra → fastq→bam→counts
    #当前文件夹内容
    (rna) yliu 22:35:41 ~/rnaseq_test
    $ tree
    .
    ├── sra_data
    │     ├── SRR6790711.sra
    │     ├── SRR6790712.sra
    │     ├── SRR6790713.sra
    │     ├── SRR6790714.sra
    │     ├── SRR6790715.sra
    │     └── SRR6790716.sra
    └── SRR_Acc_List.txt
    
    1 directory, 7 files
    
    #sra → fastq
    ##主要应该注意sratoolkit软件的用法(软件参数的使用,文件的输入和输出等等)
    cd sra_data/
    ls *.sra|while read id;do (fastq-dump --split-3 --gzip  ${id});done
    ls
    
    #fastq→bam
    ##mapping之前需要对raw_data进行质量控制
    ##raw_fastq→clean_fastq
    ##quanlity control
    mkdir fastqc
    ls *.gz|while read id;do (fastqc  -t 2 -o ~/rnaseq_test/fastqc/ ${id});done
    cd fastqc
    multiqc ./*.zip
    
    ##filter data
    ##对raw_data进行质量控制后,要对其进行过滤和截断,确保mapping时数据的可靠性
    cd ../
    mkdir clean_read/
    for i in `ls *_1.fastq.gz`; do i=${i/_*/}; echo "trim_galore -q 25 \ 
     --phred33 --length 36 --stringency 3 --paired \
     -o ./clean_read/ ${i}_1.fastq.gz ${i}_2.fastq.gz"; done > ./clean_read/command_trim_galore.sh
    
    less -S ./clean_read/command_trim_galore.sh 
    cd  ./clean_read/
    ls
    chmod 755 command_trim_galore.sh
    bash command_trim_galore.sh
    
    ##mapping
    mapping之前需要先构建参考基因组index
    这里使用服务器已经构建好的参考基因组索引
    hg38的参考基因组索引位置为:
    /teach/rna_testdata/database/index/hisat/hg38/genome
    mkdir hisat_mapping
    ls *_val_1.fq.gz|while read id; do id=${id%%_*}; echo "hisat2 -p 4 -x /teach/rna_testdata/database/index/hisat/hg38/genome\
      -1 ${id}_1_val_1.fq.gz -2 ${id}_2_val_2.fq.gz\
      -S /home/yliu/rnaseq_test/sra_data/hisat_mapping/${id}.sam"; done > command_hisat_list.sh
    less command_hisat_list.sh
    nohup bash command_hisat_list.sh & 
    
    #sam→bam
    ##sam文件占据内存非常巨大,因此将sam文件合适转化为bam文件格式(二进制格式),缩小文件的所占的空间大小
    cd ../hisat_mapping
    ls
    ls *.sam|while read id;do echo "samtools sort -O bam -@ 4 -o ${id%%.*}.bam ${id} ";done > command_sam_bam.list
    less command_sam_bam.list
    nohup bash command_sam_bam.list &
    top
    
    #bam→counts
    ##从bam文件中获取每一个基因的counts值
    mkdir ../counts
    ls *.sam|while read id;do echo "featureCounts -T 5 -p -t exon -g gene_id -a /teach/rna_testdata/database/gtf/gencode.v25.annotation.gtf.gz -o ../counts/${id%%.*}.count.txt ${id}";done > command.counts.list
    less command.counts.list
    nohup bash command.counts.list &
    top
    

    Q5: 理解RNA-seq上游流程得到的表达矩阵的多种形式

    在RNA-seq中,我们通常使用分析得到的count值,
    以及RPKM、FPKM、TPM等均一化后的基因表达单位,去表示表达矩阵。
    ·count:
    每个基因上比对到的reads的数量
    ·RPKM(Reads Per Kilobase Million)
    每千个碱基的转录每百万映射读取的reads数
    ·FPKM(Fragments Per Kilobase Million)
    每千个碱基的转录每百万映射读取的fragments数
    ·TPM(Transcripts Per Million)
    每百万条reads的转录本

    在RNA-seq数据分析中,对基因或者转录本的read counts数目进行标准化(Normalization)是一个非常重要的步骤。
    因为在转录组数据分析中,落在一个基因上面的count值与基因长度和测序深度有关。
    一个基因的长度越长,落在基因上面的reads数目就越多;测序深度越多,落在基因上面reads数目就越多。因此在做基因表达差异分析的时候,我们必须要对其进行基因长度和测序深度的标准化。

    Q6: 任取6个样本表达矩阵随意分成2组走差异分析代码

    这里使用了Bioconductor中的airway数据进行差异表达分析 (limma包)

    rm(list = ls())
    options(stringsAsFactors = F)
    suppressMessages(library(airway))
    data("airway")
    #获取表达矩阵
    exprSet <- assay(airway)
    exprSet[1:4,1:4]
    #获取分组信息
    group_list <- colData(airway)[,3]
    group_list
    table(group_list)
    #得到表达量矩阵后,需要对表达量进行筛选
    table(apply(exprSet,1,function(x) sum(x>1)))
    table(apply(exprSet,1, function(x) sum(x>1)==0))
    exprSet=exprSet[apply(exprSet,1, function(x) sum(x>1) > 4),]
    dim(exprSet)
    [1] 20384     8
    #计算样本间的相关性
    library(edgeR)
    exprSet1=cpm(exprSet,log = T,prior.count=2)
    dim(exprSet1)
    #取mad前1000的基因,因为这些基因在转录组的结果分析中,非常重要
    tt=sort(apply(exprSet1, 1,mad),decreasing = T)[1:1000]
    exprSet1=exprSet1[names(tt),]
    cor(exprSet1)
    pheatmap::pheatmap(cor(exprSet1))
    
    corplot_pheatmap.png

    差异表达分析---limma

    suppressMessages(library(limma))
    dim(exprSet)
    #1.创建分组信息
    design <- model.matrix(~0+factor(group_list))
    colnames(design)=levels(factor(group_list))
    rownames(design)=colnames(exprSet)
    design
               trt untrt
    SRR1039508   0     1
    SRR1039509   1     0
    SRR1039512   0     1
    SRR1039513   1     0
    SRR1039516   0     1
    SRR1039517   1     0
    SRR1039520   0     1
    SRR1039521   1     0
    attr(,"assign")
    [1] 1 1
    attr(,"contrasts")
    attr(,"contrasts")$`factor(group_list)`
    [1] "contr.treatment"
    
    #2.处理表达矩阵,过滤表达量太低的基因
    dge <- DGEList(counts=exprSet)
    dge <- calcNormFactors(dge)
    dim(dge)
    #3.差异表达分析
    comp='trt-untrt'
    v <- voom(dge,design,plot=TRUE, normalize="quantile")
    fit <- lmFit(v, design)
    cont.matrix=makeContrasts(contrasts=c(comp),levels = design)
    fit2=contrasts.fit(fit,cont.matrix)
    fit2=eBayes(fit2)
    fit2
    #4.分析结束,拿到最终结果
    tempOutput = topTable(fit2, coef=comp, n=Inf)
    DEG_limma_voom = na.omit(tempOutput)
    head(DEG_limma_voom)
    nrDEG=DEG_limma_voom[,c(1,4)]
    colnames(nrDEG)=c('log2FoldChange','pvalue')
    save(nrDEG, DEG_limma_voom, file = "limma_DEG.Rdata")
    dim(nrDEG)
    [1] 20384     2
    write.csv(nrDEG,file = 'limma_DEG.csv')
    #查看上调或者下调基因的个数
    table(abs(nrDEG[,1]) >1 & nrDEG[,2] < 0.05 )
    FALSE  TRUE 
    19119  1265 
    

    差异表达分析---DESEQ2

    #差异表达基因---DESeq2
    suppressMessages(library(DESeq2))
    #1.创建分组信息
    colData <- data.frame(row.names=colnames(exprSet),group_list=group_list)
    colData
    #2.按照DESeq2包的方法处理数据
    dds <- DESeqDataSetFromMatrix(countData = exprSet,colData = colData,
                                  design = ~ group_list)
    dds_2 <- DESeq(dds)
    dds=dds_2
    res <- results(dds,contrast=c("group_list","trt","untrt"))
    resOrdered <- res[order(res$padj),]
    DEG =as.data.frame(resOrdered)
    DESeq2_DEG = na.omit(DEG)
    nrDEG=DESeq2_DEG[,c(2,6)]
    colnames(nrDEG)=c('log2FoldChange','pvalue')  
    save(nrDEG, DESeq2_DEG, file = "DESeq_DEG.Rdata")
    head(nrDEG)
    table(nrDEG$log2FoldChange > 1 & nrDEG$pvalue < 0.05)
    FALSE  TRUE 
    17502   470 
    

    差异表达分析---edgeR

    suppressMessages(library(edgeR))
    d <- DGEList(counts=exprSet,group=factor(group_list))
    keep <- rowSums(cpm(d)>1) >= 2
    table(keep)
    d <- d[keep, , keep.lib.sizes=FALSE]
    d$samples$lib.size <- colSums(d$counts)
    d <- calcNormFactors(d)
    d$samples
    dge=d
    design <- model.matrix(~0+factor(group_list))
    rownames(design)<-colnames(dge)
    colnames(design)<-levels(factor(group_list))
    dge=d
    dge <- estimateGLMCommonDisp(dge,design)
    dge <- estimateGLMTrendedDisp(dge, design)
    dge <- estimateGLMTagwiseDisp(dge, design)
    
    fit <- glmFit(dge, design)
    lrt <- glmLRT(fit, contrast=c(1,-1)) 
    nrDEG=topTags(lrt, n=nrow(dge))
    nrDEG=as.data.frame(nrDEG)
    head(nrDEG)
    edgeR_DEG =nrDEG 
    nrDEG=edgeR_DEG[,c(1,5)]
    colnames(nrDEG)=c('log2FoldChange','pvalue') 
    save(nrDEG, edgeR_DEG, file = "edgeR_DEG.Rdata")
    table(nrDEG$log2FoldChange > 1 & nrDEG$pvalue < 0.05)
    FALSE  TRUE 
    14614   421 
    

    pheatmap 绘制

    #pheatmap绘制
    rm(list = ls())
    options(stringsAsFactors = F)
    #加载limma的差异表达分析后的数据
    load(file = 'limma_DEG.Rdata')
    library(pheatmap)
    dim(nrDEG)
    choose_name <- head(rownames(nrDEG),100)
    choose_matirx <- exprSet[choose_name,]
    choose_matirx=t(scale(t(choose_matirx)))
    head(choose_matirx)
    dim(choose_matirx)
    #设置好分组信息
    table(group_list)
    group_list <- as.data.frame(group_list)
    rownames(group_list) <- colnames(choose_matirx)
    group_list
    pheatmap(choose_matirx,annotation_col = group_list,fontsize_col = 0.5)
    
    pheatmap.png

    火山图绘制

    #火山图的绘制
    rm(list = ls())
    options(stringsAsFactors = F)
    load(file = 'limma_DEG.Rdata')
    library(ggplot2)
    nrDEG[nrDEG$pvalue <0.05 & nrDEG$log2FoldChange >1,ncol(nrDEG)+1]="Up"
    nrDEG[nrDEG$pvalue <0.05 & nrDEG$log2FoldChange < -1,ncol(nrDEG)]="Down"
    nrDEG[nrDEG$pvalue>=0.05 | 1 > abs(nrDEG$log2FoldChange),ncol(nrDEG)]="Normal"
    colnames(nrDEG)[ncol(nrDEG)]="Regulate"
    nrDEG$Regulate=factor(nrDEG$Regulate,levels = c("Up","Down","Normal"),order=T)
    head(nrDEG)
    col=c("red","lightseagreen", "black")
    p_volcano=ggplot(nrDEG,aes(x=log2FoldChange,y=-log10(pvalue)))+
      geom_point(aes(color=nrDEG$Regulate),alpha=0.5)+scale_color_manual(values =col)+
      geom_hline(yintercept=c(-log10(0.05)),linetype=4)+
      geom_vline(xintercept=c(-log2(2),log2(2)),linetype=4)+
      theme(
        legend.text = element_text(size = 10, face = "bold"),
        legend.position = 'right',
        legend.key.size=unit(0.5,'cm'),
        axis.text.x=element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
        axis.text.y=element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
        axis.title.x = element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
        axis.title.y = element_text(size = 10,face = "bold", vjust = 0.5, hjust = 0.5),
        
        panel.background = element_rect(fill = "transparent",colour = "black"), 
        panel.grid.minor = element_line(color="lightgrey",size=0.1),
        panel.grid.major = element_line(color="lightgrey",size=0.1),
        plot.background = element_rect(fill = "transparent",colour = "black")
      )
    p_volcano
    
    volcanno.png

    Q7:挑选差异分析结果的统计学显著上调下调基因集

    在此使用limma包的差异分析结果进行结果的展示

    rm(list = ls())
    options(stringsAsFactors = F)
    load(file = 'limma_DEG.Rdata')
    head(nrDEG)
    nrDEG[nrDEG$log2FoldChange > 1 & nrDEG$pvalue < 0.05, ncol(nrDEG)+1]='UP'
    nrDEG[nrDEG$log2FoldChange < -1 & nrDEG$pvalue < 0.05, ncol(nrDEG)] = 'DOWN'
    nrDEG[abs(nrDEG$log2FoldChange) <1 | nrDEG$pvalue > 0.05,ncol(nrDEG)]='NORMAL'
    table(nrDEG$V3)
    dim(nrDEG)
    head(nrDEG)
    write.csv(nrDEG,file = 'limma_DEG_Regulation.csv')
    

    既然得到了上调和下调的差异表达基因,接下来就要对差异表达的基因,进行GO和KEGG的注释分析

    #kegg
    kk.up <- enrichKEGG(   gene          =  gene_up    ,
                           organism      =  'hsa'      ,
                           universe      =  gene_all   ,
                           pvalueCutoff  =  0.8        ,
                           qvalueCutoff  =  0.8        )
    kk.down <- enrichKEGG( gene          =  gene_down  ,
                           organism      =  'hsa'      ,
                           universe      =  gene_all   ,
                           pvalueCutoff  =  0.05       )
    
    head(kk.up)[,1:6]
    head(kk.down)[,1:6]
    #绘图
    suppressMessages(library(ggplot2))
    kegg_down_dt <- as.data.frame(kk.down)
    kegg_up_dt <- as.data.frame(kk.up)
    down_kegg <- kegg_down_dt[kegg_down_dt$pvalue<0.05,]
    down_kegg$group=-1
    up_kegg <- kegg_up_dt[kegg_up_dt$pvalue<0.05,]
    up_kegg$group=1
    dat=rbind(up_kegg,down_kegg)
    dat$pvalue=-log10(dat$pvalue)
    dat$pvalue = dat$pvalue * dat$group
    dat = dat[ order( dat$pvalue, decreasing = F ), ]
    g_kegg <- ggplot( dat, 
                      aes(x = reorder( Description, order( pvalue, decreasing=F ) ), y = pvalue, fill = group)) + 
      geom_bar( stat = "identity" ) + 
      scale_fill_gradient( low = "blue", high = "red", guide = FALSE ) + 
      scale_x_discrete( name = "Pathway names" ) +
      scale_y_continuous( name = "log10P-value" ) +
      coord_flip() + theme_bw() + theme( plot.title = element_text( hjust = 0.5 ) ) +
      ggtitle( "Pathway Enrichment" )
    g_kegg
    
    g_kegg.png
    #go
    g_list=list(gene_up=gene_up, gene_down=gene_down, gene_diff=gene_diff)
    go_enrich_results <- lapply( g_list , function(gene) {
      lapply( c('BP','MF','CC') , function(ont) {
        cat(paste('Now process ',ont ))
        ego <- enrichGO(gene          = gene,
                        universe      = gene_all,
                        OrgDb         = org.Hs.eg.db,
                        ont           = ont ,
                        pAdjustMethod = "BH",
                        pvalueCutoff  = 0.99,
                        qvalueCutoff  = 0.99,
                        readable      = TRUE)
        
        print( head(ego) )
        return(ego)
      })
    })
    save(go_enrich_results,file = 'go_enrich_results.Rdata')
    
    n1= c('gene_up','gene_down','gene_diff')
    n2= c('BP','MF','CC') 
    for (i in 1:3){
      for (j in 1:3){
        fn=paste0('8.GO_dotplot_',n1[i],'_',n2[j],'.png')
        cat(paste0(fn,'\n'))
        png(fn,res=150,width = 1080)
        print(dotplot(go_enrich_results[[i]][[j]]))
        dev.off()
      }
    }
    
    GO_dotplot_gene_diff_BP.png

    Q8: 直接对任取6个样本表达矩阵做GSVA分析

    最后的GSVA分析,不是很懂这一部分的知识点,要学习一下再回来接着写!

    相关文章

      网友评论

        本文标题:RNA-seq小考核----生信技能树

        本文链接:https://www.haomeiwen.com/subject/jnyfrctx.html