美文网首页
Brief workflow of RNAseq analysi

Brief workflow of RNAseq analysi

作者: wkzhang81 | 来源:发表于2022-07-12 14:37 被阅读0次

1. General workflow for RNAseq analysis

After standard sequencing process (recommended for the illumina high-throughput sequencing protocol), raw data were obtained and processed with python (higher than 3.7 was recommended) for QC, Trimming, Gene alignment, and Expression count. Subsequently, R (higher than 4.2.0 was recommended) was used for the analysis of differential expressions, and web-based tools were used for annotation / enrichment analysis.

the general workflow for RNAseq analysis

2. Preparations

System requirement

Generally speaking, the linux system (ubuntu or deepin OS) was recommended for the python process, while anaconda or miniconda was recommended for the installation of python.
ubuntu: Enterprise Open Source and Linux | Ubuntu

Installation of Python environment

After the installation of conda environment, source files should be added.

conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/
conda config --set show_channel_urls yes

Environment for the bioinformatics should be added to the conda.

conda create -y --name bioinfo

Every time you want to do analysis, the environment bioinfo should be activated, and software should be installed under the bioinfo environment.

source ~/miniconda3/bin/activate
source activate bioinfo
conda install -c bioconda fastqc
conda install -c bioconda trim-galore
conda install -c bioconda hisat2
conda install -c bioconda samtools
conda install -c bioconda subread

Genomic index download and build

Human and mouse comprehensive gene annotation (GTF) and transcript sequences (FASTA) could be downloaded via the following website:
GENCODE - Home page (gencodegenes.org)

Index could then be built via the following command:

extract_exons.py genomic_annotation.gtf > exons.gtf
extract_splice_sites.py genomic_annotation.gtf > splice_site.gtf
hisat2-build --ss splice_site.gtf --exon exon.gtf genomic_fasta.fa genomic_annotation

Installation of R and packages

R could be downloaded directly from the following website.
R: The R Project for Statistical Computing (r-project.org)

After the installation, R should be started and the following libraries should be installed.

install.packages("BiocManager")  ### installing BiocManager packages 
BiocManager::install('DESeq2') ### for differential analysis
BiocManager::install('biomaRt') ### for nomination exchange among ensembl, entrez, and external gene names
install.packages("corrplot") ### for plotting correlation graphs
install.packages("FactoMineR") ### for PCA analysis
BiocManager::install('org.Hs.eg.db') ### org.Hs.eg.db was for human genome while org.Mm.eg.db was for mouse
BiocManager::install('clusterProfiler') ### for GO and KEGG enrichment analysis

3. Raw data acquisition

General procedure for the production of RNAseq raw data

4. Python analysis to get count-matrix of the RNAseq

  • Quarlity check
fastqc -t 12 -o ./ your_rawdata_name_1.fq.gz your_rawdata_name_2.fq.gz
  • Trimming
trim_galore -q 20 --phred33 --stringency 3 --length 20 -e 0.1 --paired your_rawdata_name_1.fq.gz your_rawdata_name_2.fq.gz
  • Gene alignment
hisat2 -p 8 -x ~genome_site -1 your_rawdata_name_1_val_1.fq.gz -2 your_rawdata_name_2_val_2.fq.gz | samtools view -Sb > your_rawdata_name.sam
  • Expression count
featureCounts -t exon -g gene_id -a genome_directory -o your_rawdata_name_counts.txt your_rawdata_name.bam 
  • Batch running

for batch execution of the above commands, save the following as process_batch.sh

#!/bin/bash
for i in your_rawdata_name_1 your_rawdata_name_2 your_rawdata_name_3...
do
   above command; ### ';' was used for interruption between commands
done

then run the following in the bash

bash process_batch.sh
  • Count-Matrix

Combine all the count values for individual genes in one .csv file and saved for differential expression analysis.

5. R process for differential expression analysis

  • Nomination
> setwd ("~/Desktop/...")
> mydata <- read.csv("counts.csv", header=T) ###replace the name of first column into ensembl_transcript_id_version first
> name <- mydata[,1]
> library(biomaRt)
> mart <- useMart("ensembl", "mmusculus_gene_ensembl") ### use hsapiens_gene_ensembl for human data
> gene_name <- getBM(attributes=c("ensembl_transcript_id_version", "external_gene_name", "ensembl_gene_id"), filters="ensembl_transcript_id_version", values=name, mart=mart)
> gene_data <- merge(mydata, gene_name, by="ensembl_transcript_id_version")
> write.csv(gene_data, "count_with_name.csv", row.names=F)
  • TPM calculation
> setwd("~/Desktop/...")
> mydata <- read.csv("Merged_counts.csv")
> mydata <- na.omit(mydata)
> rownames(mydata) <- mydata[,2]
> mydata <- mydata [,-2]
> kb <- mydata$Length / 1000
> countdata <- mydata[, 3:26]
> rpk <- countdata / kb
> tpm <- t(t(rpk)/colSums(rpk) * 1e6)
> mydata <- merge(mydata, tpm, by="row.names", sort=FALSE)
> write.csv(mydata, file="Merged_TPM.csv", row.names=F)
  • DEG analysis
> library("DESeq2")
> mydata <- read.csv("Merged_counts.csv", header=T, row.names=1)
> countData <- mydata
> condition <- factor(c("control","control","control","treated","treated","treated"))
> colData <- data.frame(row.names = colnames(countData), condition)
> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), design=~condition)
> dds <- DESeq(dds)
> res <- results(dds)
> res <- res[order(res$padj), ]
> diff_gene_deseq2 <- subset(res, padj < 0.05 & (log2FoldChange >1 | log2FoldChange < -1))
> diff_gene_deseq2 <- row.names(diff_gene_deseq2)
> resdata <- merge(as.data.frame(res),as.data.frame(counts(dds,normalize=TRUE)),by="row.names",sort=FALSE)
> write.csv(resdata, file="ML-SI3.csv", row.names=F)
  • Correlation and PCA plot
> setwd("~/Desktop/...")
> expr <- read.csv("TPM_matrix.csv", header=T, row.names=1)
> corr <- cor(expr, method='pearson')
> library(corrplot)
> corrplot(corr, type='upper', tl.col='black', order='hclust', tl.srt=45, addCoef.col = 'white')
> library(FactoMineR)
> expr <- t(expr)
> gene.pca <- PCA(expr, ncp=2, scale.unit=TRUE, graph=FALSE)
> plot(gene.pca)
  • GO and KEGG Enrichment analysis
> library(org.Hs.eg.db)
> library("clusterProfiler")
> mydata <- read.csv("..._1.csv")
> erich.go.BP = enrichGO(gene = mydata$down_genes, OrgDb = org.Mm.eg.db, keyType = "ENSEMBL", ont="BP", pAdjustMethod = 'BH', pvalueCutoff = 0.01, qvalueCutoff = 0.05)
> dotplot(erich.go.BP) ### GO analysis included ont = "BP", "CC", and "MF"
> my_entrez_id <- mapIds(x=org.Hs.eg.db, column = "ENTREZID", keys = as.character(mydata$down_genes), keytype = "ENSEMBL")
> erich.kegg = enrichKEGG(gene = my_entrez_id, organism = "hsa", pAdjustMethod = 'BH', pvalueCutoff = 0.01, qvalueCutoff = 0.05)
> dotplot(erich.kegg)
> write.table(as.data.frame(erich.kegg), 'kegg.txt', sep = '\t', row.names = F, quote = F)

6. Geneset annotation or enrichment analysis

mainly performed in the following web site:
DAVID: Functional Annotation Tools (ncifcrf.gov)
WEB-based GEne SeT AnaLysis Toolkit

相关文章

网友评论

      本文标题:Brief workflow of RNAseq analysi

      本文链接:https://www.haomeiwen.com/subject/baoobrtx.html