数据下载
使用本实验室一个RNA-seq(GSE56087)和一个circRNA-seq(未发布)的数据作为discovery dataset 和 validation dataset。
RNA-seq数据下载
shell命令批量下载
#!/bin/sh
#调用prefetch下载SRA数据库的测序文件,并用pfastq-dump转化为fastq文件。
#需要在当前目录的download_list.txt文件中把SRA编号,每个一行事先准备好,具体格式参考example.txt
#建议使用NCBI的SRA RUN SELECTOR直接导出。
read -p "请输入下载列表文件名:" downloadlist
read -p "请选择单端(SE),双端(PE)测序:" class
read -p "请输入要使用的线程数:" thread
if [ $class = "SE" ]; then
cat $downloadlist | while read line
do
echo $line
prefetch $line -o ./${line}.sra
pfastq-dump --threads $thread --outdir ./ ./${line}.sra
done
elif [ $class = "PE" ]; then
cat $downloadlist | while read line
do
echo $line
prefetch $line -o ./${line}.sra
pfastq-dump --split-3 --threads $thread --outdir ./ ./${line}.sra
done
else echo "错误的输入!"
fi
相应的accession number
SRR1200850
SRR1200851
SRR1200852
SRR1200853
SRR1200854
SRR1200855
SRR1200856
SRR1200857
SRR1200860
SRR1200861
SRR1200858
SRR1200859
SRR1200862
SRR1200863
SRR1200864
SRR1200865
SRR1200866
SRR1200867
SRR1200868
SRR1200869
SRR1200870
SRR1200871
SRR1200872
SRR1200873
SRR1200874
SRR1200875
SRR1200876
SRR1200877
SRR1200878
SRR1200879
SRR1200880
SRR1200881
SRR1200882
SRR1200883
SRR1200884
SRR1200885
数据分析
主要使用RNAEditor软件分析,环境配置按照官网的配置,注意相应软件的版本,命令需要在RNAEditor目录下运行,并且要求不能在SSH界面下运行,会报错(无法画图),需要在图形界面下打开终端运行。
shell批量分析
for ((i=1200851;i<1200886;i++))
do
fastp -w 16 -i ~/RNASEQ/EC_lina/SRR${i}_1.fastq -I ~/RNASEQ/EC_lina/SRR${i}_2.fastq -o ~/RNASEQ/EC_lina/SRR${i}_1_fastpedited.fastq -O ~/RNASEQ/EC_lina/SRR${i}_2_fastpedited.fastq
python RNAEditor.py -i ~/RNASEQ/EC_lina/SRR${i}_1_fastpedited.fastq ~/RNASEQ/EC_lina/SRR${i}_2_fastpedited.fastq -c configuration.txt
done
配置文件configuration.txt如下
#This file is used to configure the behaviour of RNAeditor
#Standard input files
refGenome = /home/zhou/rnaEditor_annotations/human/GRCH38/Homo_sapiens.GRCh38.dna.primary_assembly.fa
gtfFile = /home/zhou/rnaEditor_annotations/human/GRCH38/Homo_sapiens.GRCh38.83.gtf
dbSNP = /home/zhou/rnaEditor_annotations/human/GRCH38/dbSNP.vcf
hapmap = /home/zhou/rnaEditor_annotations/human/GRCH38/HAPMAP.vcf
omni = /home/zhou/rnaEditor_annotations/human/GRCH38/1000GenomeProject.vcf
esp = /home/zhou/rnaEditor_annotations/human/GRCH38/ESP_filtered
aluRegions = /home/zhou/rnaEditor_annotations/human/GRCH38/repeats.bed
output = default
sourceDir = /usr/local/bin/
maxDiff = 0.04
seedDiff = 2
standCall = 0
standEmit = 0
edgeDistance = 3
paired = True
keepTemp = True
overwrite = False
threads = 23
开23个核,每个样本分析时间大概在十个小时左右。
结果解析
结果实例
image.pngVCF文件
VCF文件包含所有的编辑站点以供进一步分析.
image.png
GCF文件
GCF文件保存了每个编辑站点的附加信息,如基因名称、片段、总读数、编辑读数和编辑比。
image.png
summary文件
summary文件显示每个基因的RNA编辑所处基因位置(如3‘UTR,5‘UTR)的数量。
image.png
bed文件
Editing island的bed文件,可供后续可视化等分析。
image.png
打包除了bam,sam这两种文件以外的所有结果
tar -cvf RNAEditor_1.tar ~/RNASEQ/EC_lina/rnaEditor/ --exclude=bam --exclude=sam --exclude=*sai
gzip -9 RNAEditor_1.tar
circRNA-seq数据集的验证工作
shell批量分析
#!/bin/sh
#把fastq文件路径存入数组
c=0
for file in `find ~/RNASEQ/CircularRNA-seq/rawdata/ -name *fastq.gz -print`
do
filelist[$c]=$file
((c++))
done
for ((i=0;i<${#filelist[@]};i=i+2))
do
tmp1=$(echo ${filelist[$i]}|sed 's/\.fastq\.gz/\_fastpedited\.fastq/g')
tmp2=$(echo ${filelist[$i+1]}|sed 's/\.fastq\.gz/\_fastpedited\.fastq/g')
#fastp预处理
fastp -w 16 -i ${filelist[$i]} -I ${filelist[$i+1]} -o $tmp1 -O $tmp2
#pigz -d $tmp1
#pigz -d $tmp2
python RNAEditor.py -i $tmp1 $tmp2 -c configuration.txt
done
configuration.txt不变
后续分析
准备想个统计方法算一下癌与癌旁那些个有差异
网友评论