1. General workflow for RNAseq analysis

After standard sequencing process (recommended for the illumina high-throughput sequencing protocol), raw data were obtained and processed with python (higher than 3.7 was recommended) for QC, Trimming, Gene alignment, and Expression count. Subsequently, R (higher than 4.2.0 was recommended) was used for the analysis of differential expressions, and web-based tools were used for annotation / enrichment analysis.

the general workflow for RNAseq analysis

2. Preparations

System requirement

Generally speaking, the linux system (ubuntu or deepin OS) was recommended for the python process, while anaconda or miniconda was recommended for the installation of python.
ubuntu: Enterprise Open Source and Linux | Ubuntu

Deepin OS: deepin.org
miniconda: Anaconda — USTC Mirror Help

Installation of Python environment

After the installation of conda environment, source files should be added.

conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/
conda config --set show_channel_urls yes

Environment for the bioinformatics should be added to the conda.

conda create -y --name bioinfo

Every time you want to do analysis, the environment bioinfo should be activated, and software should be installed under the bioinfo environment.

source ~/miniconda3/bin/activate
source activate bioinfo
conda install -c bioconda fastqc
conda install -c bioconda trim-galore
conda install -c bioconda hisat2
conda install -c bioconda samtools
conda install -c bioconda subread

Genomic index download and build

Human and mouse comprehensive gene annotation (GTF) and transcript sequences (FASTA) could be downloaded via the following website:
GENCODE - Home page (gencodegenes.org)

Index could then be built via the following command:

extract_exons.py genomic_annotation.gtf > exons.gtf
extract_splice_sites.py genomic_annotation.gtf > splice_site.gtf
hisat2-build --ss splice_site.gtf --exon exon.gtf genomic_fasta.fa genomic_annotation

Installation of R and packages

R could be downloaded directly from the following website.
R: The R Project for Statistical Computing (r-project.org)

After the installation, R should be started and the following libraries should be installed.

install.packages("BiocManager")  ### installing BiocManager packages 
BiocManager::install('DESeq2') ### for differential analysis
BiocManager::install('biomaRt') ### for nomination exchange among ensembl, entrez, and external gene names
install.packages("corrplot") ### for plotting correlation graphs
install.packages("FactoMineR") ### for PCA analysis
BiocManager::install('org.Hs.eg.db') ### org.Hs.eg.db was for human genome while org.Mm.eg.db was for mouse
BiocManager::install('clusterProfiler') ### for GO and KEGG enrichment analysis

3. Raw data acquisition

General procedure for the production of RNAseq raw data

4. Python analysis to get count-matrix of the RNAseq

Quarlity check

fastqc -t 12 -o ./ your_rawdata_name_1.fq.gz your_rawdata_name_2.fq.gz

Trimming

trim_galore -q 20 --phred33 --stringency 3 --length 20 -e 0.1 --paired your_rawdata_name_1.fq.gz your_rawdata_name_2.fq.gz

Gene alignment

hisat2 -p 8 -x ~genome_site -1 your_rawdata_name_1_val_1.fq.gz -2 your_rawdata_name_2_val_2.fq.gz | samtools view -Sb > your_rawdata_name.sam

Expression count

featureCounts -t exon -g gene_id -a genome_directory -o your_rawdata_name_counts.txt your_rawdata_name.bam

Batch running

for batch execution of the above commands, save the following as process_batch.sh

#!/bin/bash
for i in your_rawdata_name_1 your_rawdata_name_2 your_rawdata_name_3...
do
   above command; ### ';' was used for interruption between commands
done

then run the following in the bash

bash process_batch.sh

Count-Matrix

Combine all the count values for individual genes in one .csv file and saved for differential expression analysis.

5. R process for differential expression analysis

Nomination

> setwd ("~/Desktop/...")
> mydata <- read.csv("counts.csv", header=T) ###replace the name of first column into ensembl_transcript_id_version first
> name <- mydata[,1]
> library(biomaRt)
> mart <- useMart("ensembl", "mmusculus_gene_ensembl") ### use hsapiens_gene_ensembl for human data
> gene_name <- getBM(attributes=c("ensembl_transcript_id_version", "external_gene_name", "ensembl_gene_id"), filters="ensembl_transcript_id_version", values=name, mart=mart)
> gene_data <- merge(mydata, gene_name, by="ensembl_transcript_id_version")
> write.csv(gene_data, "count_with_name.csv", row.names=F)

TPM calculation

> setwd("~/Desktop/...")
> mydata <- read.csv("Merged_counts.csv")
> mydata <- na.omit(mydata)
> rownames(mydata) <- mydata[,2]
> mydata <- mydata [,-2]
> kb <- mydata$Length / 1000
> countdata <- mydata[, 3:26]
> rpk <- countdata / kb
> tpm <- t(t(rpk)/colSums(rpk) * 1e6)
> mydata <- merge(mydata, tpm, by="row.names", sort=FALSE)
> write.csv(mydata, file="Merged_TPM.csv", row.names=F)

DEG analysis

> library("DESeq2")
> mydata <- read.csv("Merged_counts.csv", header=T, row.names=1)
> countData <- mydata
> condition <- factor(c("control","control","control","treated","treated","treated"))
> colData <- data.frame(row.names = colnames(countData), condition)
> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), design=~condition)
> dds <- DESeq(dds)
> res <- results(dds)
> res <- res[order(res$padj), ]
> diff_gene_deseq2 <- subset(res, padj < 0.05 & (log2FoldChange >1 | log2FoldChange < -1))
> diff_gene_deseq2 <- row.names(diff_gene_deseq2)
> resdata <- merge(as.data.frame(res),as.data.frame(counts(dds,normalize=TRUE)),by="row.names",sort=FALSE)
> write.csv(resdata, file="ML-SI3.csv", row.names=F)

Correlation and PCA plot

> setwd("~/Desktop/...")
> expr <- read.csv("TPM_matrix.csv", header=T, row.names=1)
> corr <- cor(expr, method='pearson')
> library(corrplot)
> corrplot(corr, type='upper', tl.col='black', order='hclust', tl.srt=45, addCoef.col = 'white')
> library(FactoMineR)
> expr <- t(expr)
> gene.pca <- PCA(expr, ncp=2, scale.unit=TRUE, graph=FALSE)
> plot(gene.pca)

GO and KEGG Enrichment analysis

> library(org.Hs.eg.db)
> library("clusterProfiler")
> mydata <- read.csv("..._1.csv")
> erich.go.BP = enrichGO(gene = mydata$down_genes, OrgDb = org.Mm.eg.db, keyType = "ENSEMBL", ont="BP", pAdjustMethod = 'BH', pvalueCutoff = 0.01, qvalueCutoff = 0.05)
> dotplot(erich.go.BP) ### GO analysis included ont = "BP", "CC", and "MF"
> my_entrez_id <- mapIds(x=org.Hs.eg.db, column = "ENTREZID", keys = as.character(mydata$down_genes), keytype = "ENSEMBL")
> erich.kegg = enrichKEGG(gene = my_entrez_id, organism = "hsa", pAdjustMethod = 'BH', pvalueCutoff = 0.01, qvalueCutoff = 0.05)
> dotplot(erich.kegg)
> write.table(as.data.frame(erich.kegg), 'kegg.txt', sep = '\t', row.names = F, quote = F)