Mindmap

Overview

RNAseq

A gentle introduction to RNAseq from StatQuest

Three Main Steps

1. Library Preparation

step1. isolate RNA
step2. break RNA into small fragments
RNA transcripts can be thousands of bases long, but the sequencing machine can only sequence short (200-300 bp) fragments.
step3. convert RNA fragments to ds DNA
dsDNA is more stable and easily amplified and modified
step4. add adaptors
adaptors do two things: 1) allow the sequencing machine to recognize the fragments 2) allow you to sequence different samples at the same time, because different samples can use different adaptors -> saves time and money.

But this step doesn't work 100% of the time; some DNA fragments don't get adaptors.
step5. PCR amplify
only fragments with sequencing adapters are amplified aka enriched.
step6. QC
verify library concentration
verify library fragment lengths to make sure they are not too long or too short

2. Sequence

grid -> flow cell

Note: the coloured dots are the machine's probes.

then the machine takes a picture from above -> washes the colour off the probes -> probes bind to the next base -> takes a picture again

Quality scores: reflect how confident the machine is that it correctly called a base (which it reads out as a colour).

You get low quality scores when a lot of probes in a region are the same colour -> low diversity

Low diversity: the overabundance of a single colour makes it hard to identify individual sequences.
Especially a problem when the first few nucleotides are sequenced.
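
One way to see low diversity in data is to tabulate, for each sequencing cycle (position in the read), what fraction of reads show each base. This is only a sketch: the function names, the toy reads and the 0.8 cutoff are made up for illustration and are not from the video.

```python
from collections import Counter


def base_fractions_per_cycle(reads):
    """For each cycle (read position), return the fraction of reads showing A, C, G, T."""
    n_cycles = min(len(read) for read in reads)
    fractions = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads)
        total = sum(counts.values())
        fractions.append({base: counts[base] / total for base in "ACGT"})
    return fractions


def low_diversity_cycles(reads, threshold=0.8):
    """Return cycles where a single base (colour) exceeds the hypothetical threshold."""
    return [
        cycle
        for cycle, fraction in enumerate(base_fractions_per_cycle(reads))
        if max(fraction.values()) > threshold
    ]


# Toy reads: every read starts with "A", so cycle 0 is a single colour.
toy_reads = ["ACGTA", "ATCGA", "AGGCT", "ACTTG", "AAGCC"]
flagged = low_diversity_cycles(toy_reads)
print(flagged)
```

In the toy reads only the first cycle is flagged, which matches the note that the first few nucleotides are where low diversity hurts most.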

Raw Data

how it looks
first line starts with @ : unique ID for the sequence that follows
second line: contains the bases called for the sequenced fragment
third line: +
fourth line: contains the quality score for each base in the sequenced fragment
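
The four-line layout above can be read with a few lines of Python. This is a minimal sketch: `parse_fastq` and the two toy records are made up, and it assumes the quality line uses the common Phred+33 encoding (quality = ASCII code minus 33), which these notes do not state.

```python
def parse_fastq(lines):
    """Yield (read_id, bases, qualities) for each four-line record."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, bases, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(bases) != len(quality):
            raise ValueError(f"bases and qualities differ in length for {header}")
        # Assumption: Phred+33 encoding, so "!" is quality 0 and "I" is quality 40.
        yield header[1:], bases, [ord(char) - 33 for char in quality]


example = [
    "@read1", "ACGT", "+", "II5#",
    "@read2", "GGCA", "+", "!!II",
]
records = list(parse_fastq(example))
for read_id, bases, qualities in records:
    print(read_id, bases, qualities)
```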

filter out garbage reads
garbage reads: reads with low-quality base calls, or reads that are clearly artifacts of the chemistry
e.g. adapters that bind directly to each other
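
A rough sketch of such a filter is below. The mean-quality cutoff of 20, the function name and the toy reads are all made up; the adapter string is only illustrative and should be replaced by the sequence from your own library kit. Real trimming tools do considerably more than this.

```python
ADAPTER = "AGATCGGAAGAGC"  # illustrative only; use the adapter sequence from your kit


def is_garbage(bases, qualities, min_mean_quality=20, adapter=ADAPTER):
    """Flag a read with a low mean quality or one that starts straight with adapter."""
    mean_quality = sum(qualities) / len(qualities) if qualities else 0
    if mean_quality < min_mean_quality:
        return True  # low-quality base calls
    if bases.startswith(adapter):
        return True  # looks like adapters joined directly to each other
    return False


toy_reads = [
    ("good", "ACGTACGTAC", [38] * 10),
    ("low_quality", "ACGTACGTAC", [10] * 10),
    ("adapter_dimer", "AGATCGGAAGAGCACAC", [38] * 17),
]
kept = [name for name, bases, qualities in toy_reads if not is_garbage(bases, qualities)]
print(kept)
```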
align the high quality reads to a genome

why break the sequences up into small fragments? Because it allows reads to be aligned even if they are not exact matches to the reference genome.
count the number of reads per gene

once we know the chromosome and position for a read -> if it falls within the coordinates of a gene (or some other interesting feature), it is counted for that gene
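
The counting idea can be sketched as follows. The gene names, coordinates and aligned positions are invented for illustration, and each read is reduced to a single (chromosome, position) pair, which ignores strand, read length and overlapping genes.

```python
# Hypothetical annotation: gene name -> (chromosome, start, end), coordinates inclusive.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
    "geneC": ("chr2", 50, 300),
}


def count_reads_per_gene(alignments, genes):
    """Count aligned reads whose position falls within each gene's coordinates."""
    counts = {name: 0 for name in genes}
    for chrom, position in alignments:
        for name, (gene_chrom, start, end) in genes.items():
            if chrom == gene_chrom and start <= position <= end:
                counts[name] += 1
    return counts


# Toy alignments: (chromosome, position) for each read.
alignments = [
    ("chr1", 150), ("chr1", 499), ("chr1", 600),
    ("chr1", 900), ("chr2", 60), ("chr2", 1000),
]
counts = count_reads_per_gene(alignments, genes)
print(counts)
```

Reads that land outside every gene (here chr1:600 and chr2:1000) are simply not counted.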

6 to 800+ samples

bulk RNAseq: a sample is the average of a pool of cells (usually 6 million cells); you might have 3 normal samples and 3 disease samples. It's the common, original method.

single-cell RNAseq: treats each cell as an individual sample, so it can generate a lot of samples (e.g. 800+).

that's a huge matrix, and it's only going to get bigger since sequencing keeps getting cheaper and people are running more and more samples.

normalize data

because each sample has a different number of reads, e.g. one sample has more low-quality reads, or another sample has a higher concentration on the flow cell

methods: RPKM/FPKM/TPM
RPKM for single-end, FPKM for paired-end.

RPKM: Reads Per Kilobase Million
FPKM: Fragments Per Kilobase Million
TPM: Transcripts Per Million
TPM is recommended. It applies the same two normalizations as the others, just in a different order: first normalize for gene length, then normalize for sequencing depth. Both gene length and sequencing depth introduce bias.
It's good because the TPM values in every sample add up to the same total, which makes it easier to compare the proportion of reads mapped to a gene across samples.
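
The difference in operation order is easier to see with numbers. Below is a small sketch with made-up counts and gene lengths for two samples; the function names are my own.

```python
def rpkm(counts, lengths_kb):
    """Normalize for sequencing depth first, then for gene length."""
    per_million = sum(counts) / 1_000_000
    return [count / per_million / length for count, length in zip(counts, lengths_kb)]


def tpm(counts, lengths_kb):
    """Normalize for gene length first, then for sequencing depth."""
    reads_per_kb = [count / length for count, length in zip(counts, lengths_kb)]
    per_million = sum(reads_per_kb) / 1_000_000
    return [value / per_million for value in reads_per_kb]


# Toy data: four genes (lengths in kb) and raw read counts in two samples.
lengths_kb = [2, 4, 1, 10]
sample_a = [10, 20, 5, 0]
sample_b = [30, 60, 15, 1]

for name, counts in (("sample_a", sample_a), ("sample_b", sample_b)):
    print(name, "RPKM total:", round(sum(rpkm(counts, lengths_kb))))
    print(name, "TPM total: ", round(sum(tpm(counts, lengths_kb))))
```

The TPM totals are one million in both samples, while the RPKM totals differ, so a TPM value can be read directly as a share of the sample.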

Supplement:
Reads and fragments
reference: clearly explained

An insert is the piece of DNA of interest that you add adapter sequences to.
A fragment is the insert plus adapters.
A read is the sequenced part of a fragment, usually the insert, but can also sequence parts of the adapters as well.
What you are sequencing is the fragment, in either SE (single-end) or PE (paired-end) sequencing; the only difference is the number of reads per fragment.

If only an SE run is performed, then only one of the adapters will be used to sequence. If PE, then both adapters are used. With an SE run, you are getting one read per fragment; in PE you are getting two reads per fragment.

3. Data analysis

step1: Plot the data
use PCA or something similar to plot the data in two dimensions.
Plotting the data tells us 1) whether we can expect to find interesting differences 2) whether we should exclude some samples from any downstream analysis
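
To make the PCA step concrete, here is a dependency-free sketch that projects samples onto two principal components using power iteration. The expression values are invented (3 wt and 3 mutant samples, 4 genes), the function names are my own, and in practice you would use a library PCA rather than this.

```python
import math


def center_columns(matrix):
    """Subtract each gene's mean across samples (rows = samples, columns = genes)."""
    means = [sum(column) / len(matrix) for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def leading_eigenpair(sym, iterations=500):
    """Approximate the largest eigenvalue and eigenvector of a symmetric matrix."""
    n = len(sym)
    start_norm = math.sqrt(sum((i + 1) ** 2 for i in range(n)))
    vec = [(i + 1) / start_norm for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(sym[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm < 1e-12:
            return 0.0, vec
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sym[i][j] * vec[j] for i in range(n) for j in range(n))
    return value, vec


def pca_scores(matrix, n_components=2):
    """Return one (PC1, PC2, ...) coordinate list per sample."""
    centered = center_columns(matrix)
    n = len(centered)
    # Sample-by-sample Gram matrix; its eigenvectors give the sample coordinates.
    gram = [[sum(a * b for a, b in zip(centered[i], centered[j])) for j in range(n)]
            for i in range(n)]
    components = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(gram)
        scale = math.sqrt(max(value, 0.0))
        components.append([scale * v for v in vec])
        # Deflate: remove the component just found before looking for the next one.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return [list(point) for point in zip(*components)]


# Toy log-expression values: rows are wt1, wt2, wt3, mut1, mut2, mut3.
samples = [
    [5.0, 8.0, 2.0, 7.0],
    [5.2, 7.9, 2.1, 7.1],
    [4.9, 8.1, 1.9, 6.9],
    [9.0, 3.0, 2.0, 7.0],
    [9.1, 3.2, 2.1, 7.2],
    [8.9, 2.9, 1.9, 6.8],
]
points = pca_scores(samples)
for label, point in zip(["wt1", "wt2", "wt3", "mut1", "mut2", "mut3"], points):
    print(label, [round(coordinate, 2) for coordinate in point])
```

In this toy data the wt and mutant samples land on opposite sides of PC1, which is the kind of separation that suggests interesting differences exist. A sample sitting far from its own group would be a candidate for exclusion.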

further analysis excludes wt2 too, because it's higher

example of excluding some samples from any downstream analysis (point2)

step2: identify differentially expressed genes (DEGs) between control and mutant samples

[Screenshot 2020-08-15 15.53.25.png: plot of logFC (y-axis) against logCPM (x-axis)]

red dots: genes that differ between normal and mutant samples
black dots: genes that are the same
x-axis (logCPM): how much each gene is transcribed; genes on the left have low transcription
y-axis (logFC): how big the relative difference is between normal and mutant

CPM: counts per million

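fold change: the multiplicative change. Suppose gene A has an expression value of 1 and gene B has a value of 3; then B's expression is 3 times that of A. We usually measure gene expression with counts, TPM or FPKM, so expression values are non-negative and fold change ranges from 0 to positive infinity. Using log2 fold change instead of the difference in expression shows the relative trend better. Drawback: when a gene's total expression is very low (close to 0), the log2 fold change becomes less reliable.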
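
A small sketch of CPM and log2 fold change. The function names and numbers are made up, and the optional `pseudocount` argument is my own addition (a common trick to avoid dividing by zero), not something from the notes above.

```python
import math


def cpm(counts):
    """Counts per million: scale each gene's count by the sample's total reads."""
    total = sum(counts)
    return [count / total * 1_000_000 for count in counts]


def log2_fold_change(control, treated, pseudocount=0.0):
    """log2 of treated / control; the hypothetical pseudocount guards against zeros."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


# The example from the note: expression 1 versus 3 is a 3-fold change.
up = log2_fold_change(1, 3)
down = log2_fold_change(3, 1)
print(up, down)  # same size, opposite sign

# The drawback: 1 -> 2 reads gives the same log2FC as 1000 -> 2000 reads,
# even though the first could easily be noise.
low = log2_fold_change(1, 2)
high = log2_fold_change(1000, 2000)
print(low, high)

print(cpm([10, 30, 60]))
```

On the log2 scale a 3-fold increase and a 3-fold decrease are symmetric around 0, which is why log2 fold change is used for the y-axis.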

if you don't know what you are looking for, you can see if certain pathways are enriched in either the normal or mutant gene sets.