1. General workflow for RNAseq analysis
After standard sequencing process (recommended for the illumina high-throughput sequencing protocol), raw data were obtained and processed with python (higher than 3.7 was recommended) for QC, Trimming, Gene alignment, and Expression count. Subsequently, R (higher than 4.2.0 was recommended) was used for the analysis of differential expressions, and web-based tools were used for annotation / enrichment analysis.

2. Preparations
System requirement
Generally speaking, the linux system (ubuntu or deepin OS) was recommended for the python process, while anaconda or miniconda was recommended for the installation of python.
ubuntu: Enterprise Open Source and Linux | Ubuntu
- Deepin OS: deepin.org
- miniconda: Anaconda — USTC Mirror Help
Installation of Python environment
After the installation of conda environment, source files should be added.
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/
conda config --set show_channel_urls yes
Environment for the bioinformatics should be added to the conda.
conda create -y --name bioinfo
Every time you want to do analysis, the environment bioinfo should be activated, and software should be installed under the bioinfo environment.
source ~/miniconda3/bin/activate
source activate bioinfo
conda install -c bioconda fastqc
conda install -c bioconda trim-galore
conda install -c bioconda hisat2
conda install -c bioconda samtools
conda install -c bioconda subread
Genomic index download and build
Human and mouse comprehensive gene annotation (GTF) and transcript sequences (FASTA) could be downloaded via the following website:
GENCODE - Home page (gencodegenes.org)
Index could then be built via the following command:
extract_exons.py genomic_annotation.gtf > exons.gtf
extract_splice_sites.py genomic_annotation.gtf > splice_site.gtf
hisat2-build --ss splice_site.gtf --exon exon.gtf genomic_fasta.fa genomic_annotation
Installation of R and packages
R could be downloaded directly from the following website.
R: The R Project for Statistical Computing (r-project.org)
After the installation, R should be started and the following libraries should be installed.
install.packages("BiocManager") ### installing BiocManager packages
BiocManager::install('DESeq2') ### for differential analysis
BiocManager::install('biomaRt') ### for nomination exchange among ensembl, entrez, and external gene names
install.packages("corrplot") ### for plotting correlation graphs
install.packages("FactoMineR") ### for PCA analysis
BiocManager::install('org.Hs.eg.db') ### org.Hs.eg.db was for human genome while org.Mm.eg.db was for mouse
BiocManager::install('clusterProfiler') ### for GO and KEGG enrichment analysis
3. Raw data acquisition

4. Python analysis to get count-matrix of the RNAseq
- Quarlity check
fastqc -t 12 -o ./ your_rawdata_name_1.fq.gz your_rawdata_name_2.fq.gz
- Trimming
trim_galore -q 20 --phred33 --stringency 3 --length 20 -e 0.1 --paired your_rawdata_name_1.fq.gz your_rawdata_name_2.fq.gz
- Gene alignment
hisat2 -p 8 -x ~genome_site -1 your_rawdata_name_1_val_1.fq.gz -2 your_rawdata_name_2_val_2.fq.gz | samtools view -Sb > your_rawdata_name.sam
- Expression count
featureCounts -t exon -g gene_id -a genome_directory -o your_rawdata_name_counts.txt your_rawdata_name.bam
- Batch running
for batch execution of the above commands, save the following as process_batch.sh
#!/bin/bash
for i in your_rawdata_name_1 your_rawdata_name_2 your_rawdata_name_3...
do
above command; ### ';' was used for interruption between commands
done
then run the following in the bash
bash process_batch.sh
- Count-Matrix
Combine all the count values for individual genes in one .csv file and saved for differential expression analysis.
5. R process for differential expression analysis
- Nomination
> setwd ("~/Desktop/...")
> mydata <- read.csv("counts.csv", header=T) ###replace the name of first column into ensembl_transcript_id_version first
> name <- mydata[,1]
> library(biomaRt)
> mart <- useMart("ensembl", "mmusculus_gene_ensembl") ### use hsapiens_gene_ensembl for human data
> gene_name <- getBM(attributes=c("ensembl_transcript_id_version", "external_gene_name", "ensembl_gene_id"), filters="ensembl_transcript_id_version", values=name, mart=mart)
> gene_data <- merge(mydata, gene_name, by="ensembl_transcript_id_version")
> write.csv(gene_data, "count_with_name.csv", row.names=F)
- TPM calculation
> setwd("~/Desktop/...")
> mydata <- read.csv("Merged_counts.csv")
> mydata <- na.omit(mydata)
> rownames(mydata) <- mydata[,2]
> mydata <- mydata [,-2]
> kb <- mydata$Length / 1000
> countdata <- mydata[, 3:26]
> rpk <- countdata / kb
> tpm <- t(t(rpk)/colSums(rpk) * 1e6)
> mydata <- merge(mydata, tpm, by="row.names", sort=FALSE)
> write.csv(mydata, file="Merged_TPM.csv", row.names=F)
- DEG analysis
> library("DESeq2")
> mydata <- read.csv("Merged_counts.csv", header=T, row.names=1)
> countData <- mydata
> condition <- factor(c("control","control","control","treated","treated","treated"))
> colData <- data.frame(row.names = colnames(countData), condition)
> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), design=~condition)
> dds <- DESeq(dds)
> res <- results(dds)
> res <- res[order(res$padj), ]
> diff_gene_deseq2 <- subset(res, padj < 0.05 & (log2FoldChange >1 | log2FoldChange < -1))
> diff_gene_deseq2 <- row.names(diff_gene_deseq2)
> resdata <- merge(as.data.frame(res),as.data.frame(counts(dds,normalize=TRUE)),by="row.names",sort=FALSE)
> write.csv(resdata, file="ML-SI3.csv", row.names=F)
- Correlation and PCA plot
> setwd("~/Desktop/...")
> expr <- read.csv("TPM_matrix.csv", header=T, row.names=1)
> corr <- cor(expr, method='pearson')
> library(corrplot)
> corrplot(corr, type='upper', tl.col='black', order='hclust', tl.srt=45, addCoef.col = 'white')
> library(FactoMineR)
> expr <- t(expr)
> gene.pca <- PCA(expr, ncp=2, scale.unit=TRUE, graph=FALSE)
> plot(gene.pca)
- GO and KEGG Enrichment analysis
> library(org.Hs.eg.db)
> library("clusterProfiler")
> mydata <- read.csv("..._1.csv")
> erich.go.BP = enrichGO(gene = mydata$down_genes, OrgDb = org.Mm.eg.db, keyType = "ENSEMBL", ont="BP", pAdjustMethod = 'BH', pvalueCutoff = 0.01, qvalueCutoff = 0.05)
> dotplot(erich.go.BP) ### GO analysis included ont = "BP", "CC", and "MF"
> my_entrez_id <- mapIds(x=org.Hs.eg.db, column = "ENTREZID", keys = as.character(mydata$down_genes), keytype = "ENSEMBL")
> erich.kegg = enrichKEGG(gene = my_entrez_id, organism = "hsa", pAdjustMethod = 'BH', pvalueCutoff = 0.01, qvalueCutoff = 0.05)
> dotplot(erich.kegg)
> write.table(as.data.frame(erich.kegg), 'kegg.txt', sep = '\t', row.names = F, quote = F)
6. Geneset annotation or enrichment analysis
mainly performed in the following web site:
DAVID: Functional Annotation Tools (ncifcrf.gov)
WEB-based GEne SeT AnaLysis Toolkit
网友评论