植物长链非编码RNA（lncRNA）鉴定实例（拟南芥数据）

作者: 小明的数据分析笔记本 | 来源:发表于2020-03-13 17:22 被阅读0次

植物长链非编码RNA（lncRNA）鉴定实例（拟南芥数据）
最强攻略2: 史上最全非编码RNA数据库汇总解读
生物信息学基础
IncRNA介绍
使用blastn鉴定lncRNA中可能是miRNA前体的序列
一文读懂LncRNA
LncRNA与miRNA结合预测
Linux下通过RIblast预测lncRNA的作用靶基因
NAR2019数据集——CNIT RNA编码能力预测
lncRNA简介&链特异性

参考资料

Plant Long Non-Coding RNAs

第十三章
An Easy-to-Follow Pipeline for Long Noncoding RNA Identification: A case Study in Diploid Strawberry Fragaria vesca

参考资料里用到的是草莓的数据，我这里换成拟南芥的转录组测序数据
对应论文的数据实验组和对照组分别三个生物学重复，为了减小数据量和缩短计算时间，我这里只下载两个

数据来源

论文 Tapetal Expression of BnaC.MAGL8.a Causes Male Sterility in Arabidopsis

下载数据

直接利用参考文章里的shell脚本

SEQLIBS=(SRR8428909 SRR8428908 SRR8428906 SRR8428905 )

for seqlib in ${SEQLIBS[@]}; do
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/00${seqlib:9:10}/${seqlib}/${seqlib}_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/00${seqlib:9:10}/${seqlib}/${seqlib}_2.fastq.gz
done

执行

bash download_raw_data.sh

解压重命名

bgzip -d SRR8428909_1.fastq.gz
bgzip -d SRR8428909_2.fastq.gz

mv SRR8428909_1.fastq wt_Rep1_R1.fastq
mv SRR8428909_2.fastq wt_Rep1_R2.fastq

bgzip -d SRR8428908_1.fastq.gz
bgzip -d SRR8428908_2.fastq.gz

mv SRR8428908_1.fastq wt_Rep2_R1.fastq
mv SRR8428908_2.fastq wt_Rep2_R2.fastq

bgzip -d SRR8428906_1.fastq.gz
bgzip -d SRR8428906_2.fastq.gz

mv SRR8428906_1.fastq EE_Rep1_R1.fastq
mv SRR8428906_2.fastq EE_Rep1_R2.fastq

bgzip -d SRR8428905_1.fastq.gz
bgzip -d SRR8428905_2.fastq.gz

mv SRR8428905_1.fastq EE_Rep2_R1.fastq
mv SRR8428905_2.fastq EE_Rep2_R2.fastq

使用fastp对数据进行过滤

SEQLIBS=(EE_Rep1 EE_Rep2  wt_Rep1 wt_Rep2 )

for seqlib in ${SEQLIBS[@]}; do
    fastp -i ${seqlib}_R1.fastq -I ${seqlib}_R2.fastq -o ${seqlib}_clean_R1.fastq -O ${seqlib}_clean_R2.fastq
done

下载参考基因组和注释文件

wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
bgzip -d Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
mv  Arabidopsis_thaliana.TAIR10.dna.toplevel.fa At.fa
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.40.gff3.gz
bgzip -d Arabidopsis_thaliana.TAIR10.40.gff3.gz 
mv Arabidopsis_thaliana.TAIR10.40.gff3 At.gff3

bowtie2构建索引

mkdir reference
mv At* reference
cd reference
bowtie2-build At.fa At

tophat2比对

cd ../
tophat2 -p 8 -I 5000 -G reference/At.gff3 -o wt1_thout reference/At ../wt_Rep1_clean_R1.fastq ../wt_Rep1_clean_R2.fastq
tophat2 -p 12 -I 5000 -G reference/At.gff3 -o wt2_thout reference/At ../wt_Rep2_clean_R1.fastq ../wt_Rep2_clean_R2.fastq

tophat2 -p 8 -I 5000 -G reference/At.gff3 -o EE1_thout reference/At ../EE_Rep1_clean_R1.fastq ../EE_Rep1_clean_R2.fastq
tophat2 -p 8 -I 5000 -G reference/At.gff3 -o EE2_thout reference/At ../EE_Rep2_clean_R1.fastq ../EE_Rep2_clean_R2.fastq

-I 参数指定最大内含子长度，这里为什么设置为5000呢？

使用cufflinks组装转录本

cufflinks -p 4 -g reference/At.gff3 -I 5000 -o wt1_clout wt1_thout/accepted_hits.bam
cufflinks -p 4 -g reference/At.gff3 -I 5000 -o wt2_clout wt2_thout/accepted_hits.bam
cufflinks -p 4 -g reference/At.gff3 -I 5000 -o EE1_clout EE1_thout/accepted_hits.bam
cufflinks -p 4 -g reference/At.gff3 -I 5000 -o EE2_clout EE2_thout/accepted_hits.bam

合并以上的结果

find . -name transcripts.gtf > assemblies.txt
cuffmerge -p 4 -g reference/At.gff3 -s reference/At.fa assemblies.txt

计算不同位点和转录本的表达量

cuffdiff -o diff_out -b reference/At.fa -p 8 -T -L EE,wt -u merged_asm/merged.gtf EE1_thout/accepted_hits.bam,EE2_thout/accepted_hits.bam wt1_thout/accepted_hits.bam,wt2_thout/accepted_hits.bam

这之前的步骤和普通的转录组数据分析是一样的

选择转录本

cat merged_asm/merged.gtf | grep 'class_code "[uiox]"' > selected.gtf
gffread -w selected.fa -g reference/At.fa selected.gtf
cat selected.fa | grep ">" | wc -l

u: unknow or intergenic regions
o: generic exonic overlap with a reference transcript
x: natural antisense transcript
i: located entirely within the intronic regions

写一个简单的python脚本选择长度大于200nt的序列

import sys
from Bio import SeqIO

inputfasta = sys.argv[1]
min_len = int(sys.argv[2])
outputfasta = sys.argv[3]

i = 0
j = 0

fw = open(outputfasta,'w')

for rec in SeqIO.parse(inputfasta,'fasta'):
    i += 1
    if len(rec.seq) > min_len:
        j += 1
        fw.write(">%s\n%s\n"%(rec.id,str(rec.seq)))
        
print("The total number of sequence is: ",i)
print("Number of sequence retained is: ",j)

运行脚本

python .\select_seq_according_to_seq_length.py .\selected.fa 200 output.fasta

将转录本上传到cpc2预测转录本的蛋白编码能力

cpc2链接 http://cpc2.cbi.pku.edu.cn/

选择结果中没有蛋白编码能力的转录本

cat result_cpc2.txt | grep 'noncoding' | cut -f 1 > non-coding-transcript.txt
wc -l non-coding-transcript.txt

到这里已经根据3个标准来筛选lncRNA

转录本在染色体上的位置 uoxi
转录本长度
蛋白编码能力

接下来的内容是

去除可能编码miRNA的转录本
研究非编码RNA的表达情况
找到与lncRNA相邻的蛋白编码基因
lncRNA与蛋白编码基因的共表达

这篇文章的内容就暂时先到这里了。
欢迎大家关注我的公众号
小明的数据分析笔记本

公众号二维码.jpg

中国加油！武汉加油！

网友评论

ncRNA

本文标题：植物长链非编码RNA（lncRNA）鉴定实例（拟南芥数据）

本文链接：https://www.haomeiwen.com/subject/mdlmjhtx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

植物长链非编码RNA（lncRNA）鉴定实例（拟南芥数据）

参考资料

数据来源

下载数据

解压重命名

使用fastp对数据进行过滤

下载参考基因组和注释文件

bowtie2构建索引

tophat2比对

使用cufflinks组装转录本

合并以上的结果

计算不同位点和转录本的表达量

这之前的步骤和普通的转录组数据分析是一样的

选择转录本

将转录本上传到cpc2预测转录本的蛋白编码能力

选择结果中没有蛋白编码能力的转录本

到这里已经根据3个标准来筛选lncRNA

接下来的内容是

相关文章

植物长链非编码RNA（lncRNA）鉴定实例（拟南芥数据）

最强攻略2: 史上最全非编码RNA数据库汇总解读

生物信息学基础

IncRNA介绍

使用blastn鉴定lncRNA中可能是miRNA前体的序列

一文读懂LncRNA

LncRNA与miRNA结合预测

Linux下通过RIblast预测lncRNA的作用靶基因

NAR2019数据集——CNIT RNA编码能力预测

lncRNA简介&链特异性

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

ncRNA