teprof2转录组数据鉴定转座子融合嵌合转录本

作者: Yu_Lin | 来源:发表于2023-06-06 19:30 被阅读0次

科研干货 | 一文了解RNA序列比对软件HISAT2
融合转录本的检测软件对比（NGS）
生信菜鸟团-专题学习目录
全长转录本结构分析（下）
转录组学习三（数据质控）
转录组学习二（数据下载）
转录组学习五（reads比对）
转录组学习八（功能富集分析）
转录组学习六（reads计数与标准化）
转录组学习四（参考基因组及gtf注释探究）

参考资料

转座因子（TE）是丰富的遗传学资源，其中包含着大量的调控序列。在TE内部存在着潜在的隐性调控元件，在生理和病理中这些元件可以被表观遗传学机制重新激活，从而影响疾病的发生和发展。然而，不同癌症类型中TE相关的癌症外源适应事件的普遍性和影响仍然缺乏充分的了解。
本教程主要翻译上面文章的教程，本人经过测试运行稳定后，进行代码的书写，有些地方没有进行翻译，感觉英文版的更加原汁原味，翻译过来就说不清楚了，

分析步骤

下载文章中的源代码https://github.com/twlab/TEProf2Paper

image.png

一、配置teprof2环境

注意组织好文件

mkdir {sra,raw,clean,align,peaks,motif,qc}
cd sra

需要的软件

stringtie >= 1.3.3
samtools >= 1.3.1
cufflinks >= 2.2.1
python 2.7 (cPickle, pytabix 0.1)
R >= 3.4.1 (ggplot2, bsgenome.hsapiens.ucsc.hg38 (or genome of your choosing), Xmisc, reshape2)

,,,以上可单独安装。该管道仅针对列出的版本进行了测试，可能会与较新的版本发生冲突。

conda配置环境

conda env create -f TEProf2.yml

激活环境

source activate
conda activate teprof2

然后需要手动安装R包Xmisc

install.packages('Xmisc')

如果由于最近从CRAN中删除了Xmisc，因此上述操作不起作用，可以通过devtools安装它。退出R，并运行以下命令。

conda install -c conda-forge r-devtools

开启R

加载devtools并确保环境变量设置正确

>Sys.setenv(TAR = "/bin/tar")
>install_url("https://cran.r-project.org/src/contrib/Archive/Xmisc/Xmisc_0.2.1.tar.gz")

把 bin 文件加入到 PATH

echo 'export PATH=/mnt/d/D/temp/transposable.elements/twlab-TEProf2Paper-3bb4b8d/bin:$PATH'  >> ~/.bashrc
source ~/.bashrc

二、下载Reference Files

作者已经为hg38、hg19和mm10创建了一组默认的参考文件。为了简单使用，下载该目录，将其放在rnapipeline文件夹中，然后解压缩。确保使用这些文件的完整路径更新arguments.txt文件(将在下面的用法中解释)。
Download Link hg38: External Download Link
Download Link hg19: External Download Link
Download Link mm10: External Download Link

设置arguments.txt

参考文件的位置需要在一个以制表符分隔的文件中指定。不应该有额外的行或标题。下面是一个例子:

rmsk /bar/nshah/reference/rmskhg38.bed6.gz
rmskannotationfile /bar/nshah/reference/repeatmasker_description_uniq.lst
gencodeplusdic /bar/nshah/reference/genecode_plus_hg38.dic
gencodeminusdic /bar/nshah/reference/genecode_minus_hg38.dic
focusgenes /bar/nshah/reference/oncogenes_augmented.txt
plusintron /bar/nshah/reference/gencode.v25.annotation.sorted.gtf_introns_plus_sorted.gz
minusintron /bar/nshah/reference/gencode.v25.annotation.sorted.gtf_introns_minus_sorted.gz

Required Arguments

rmsk: tabix formatted bed6 files with repeatmasker or other file that user wants to check start location for
rmskannotationfile: a tab delimitted with 3 columns: subfamily class family
gencodeplusdic: Dictionary of all gencode elements including introns and exons for the plus (+) strand
gencodeminusdic: Dictionary of all gencode elements including introns and exons for the minus (-) strand

三、公共数据库下载输入文件

1、下载公共数据

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/
prefetch -X 300GB -O ./ --option-file SRR_Acc_List.txt
ls /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/sra/SRR*/*.sra | cut -d "/" -f 10 | while read id; do (fasterq-dump --split-files -e 24 /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/sra/$id --include-technical); done

2、trim_galore测序数据的过滤

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/
find . -name "*.fastq" | while read file ; do echo "trim_galore -q 25 --phred33 --length 25 -e 0.1 --stringency 4 -o ./clean "$file >> trim_galore.txt ; done ;
parallel -j 10 < trim_galore.txt

3、STAR进行比对（STAR只能在linux系统盘里运行，不能挂在mnt （占内存比较大建议一个个的运行，不建议使用并行运算Parallel））

cd /0.clear_fq/
for file in $(ls | grep _1_trimmed.fq.gz)
        do
        pre_name=${file%_1_trimmed.fq.gz}
        echo $pre_name
STAR --runThreadN 20 --genomeDir /StarIndex \
 --runMode alignReads \
 --quantMode TranscriptomeSAM GeneCounts \
 --readFilesIn  /0.clear_fq/${pre_name}_1_trimmed.fq.gz \
 --outSAMtype BAM SortedByCoordinate \
 --readFilesCommand gunzip -c \
 --outSAMattrIHstart 0  \
 --outSAMstrandField intronMotif \
 --outFileNamePrefix  ${pre_name}

4、由于STAR只能不能再mnt盘使用，所以将比对的文件移到指定的位置并重新命名

cat sample.id.txt | while read id; do      
 arr=(${id});
 input=${arr[0]};
 sample=${arr[1]};
 str=`echo $sample | sed 's/ //g'`;
 cp ${arr[0]}Aligned.sortedByCoord.out.bam /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/motif/${str}.bam; done

5、stringtie比对转录本(内存大如果足够大，可并行运算)

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/motif/
find . -name "*.bam" | while read file ;do fileid=$(echo "$file" | awk -F "/" '{print $2}'); echo "stringtie -o /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/peaks/"${fileid}".gtf "${file}" -m 100 -c 1" >> stringtie.txt ; done ;
parallel -j 10 < stringtie.txt

6、Annotate GTF Files

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/peaks/
find . -name "*.gtf" | while read file ; do echo "rmskhg38_annotate_gtf_update_test_tpm.py "$file >> 1_annotationCommands.txt ; done ;
echo "(1/8) Annotate GTF Files";
parallel -j 10 < 1_annotationCommands.txt

7、对bam文件建立索引（[sam view] random alignment retrieval only works for indexed BAM or CRAM files）

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/motif/
##更改名字
find . -name "*.Aligned.sortedByCoord.out.bam" | while read file ;do fileid=$(echo "$file" | awk -F "/" '{print $2}'|awk -F "." '{print $1}'); mv ${file} "${fileid}".bam; done
find . -name "*.bam" | while read file ; do echo "samtools index" $file >> samtooks.txt ; done
parallel -j 10 < samtooks.txt

处理注释文件，粗略估计转录本相对于所有其他基因转录本的相对表达量。处理注释文件，粗略估计转录本相对于所有其他基因转录本的相对表达量

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/eu.ec/
find . -name "*_annotated_filtered_test_all" | while read file ; do echo "annotationtpmprocess.py "$file >> 2_processAnnotationCommands.txt ; done ;
echo "(2/8) Annotate GTF Files"
parallel -j 10 < 2_processAnnotationCommands.txt
Output File(s):
(1) <gtffile>annotated_filtered_test(all)_c

Three additional columns are made: (1) Maximum coverage of gene of interest (2) Maximum tpm of gene of interest (3) Total of gene of interest. Note: For those transcripts that are from a TE but do not splice into a gene, these stats will be compared across all the transcripts like this. Thus the fraction will be very low.

跨样本聚合注释样本

echo "(3/8) Aggregate Annotations"
aggregateProcessedAnnotation.R -e "Ec"
Options with Defaults:
-e <ext_treatment> (default: ''): The label in the treatment file names that will identify them. If there is not a treatment versus untreated experimental design, the default will call everything as treatment.
-l <exon 1 length max> (default: 2588): The maximum length of exon 1. We are using the 99th percentile of gencode v25 transcripts
-s <exon skipping max> (default: 2): Based on genomic contamination, assembly can create transcripts that have intron retention. These could be real, but oftentimes these look like noise. A maximum level of 2 is recommended.
-n <sample total min> (default: 1): Minimum number of samples that must have a candidate for it to be considered. With low number of samples 1 is recommended. For larger studies this can be increased.
-t <treatment total min> (default: 0): In case the user would like to only consider candidates that are present in a certain number of treatment samples.
-x <treatment exclusive> (default: no): In case the user would like a maximum for the untreated samples. This could be if the user wants to only consider treatment-exclusive samples. In that case, the user should put yes instead of no.
-k <keep none> (default: no): There can be transcripts from TEs that have splicing, but do not end up splicing into a gene. By default these are removed. This can be changed to yes if the user would like to look at these.
-f <filter for TEs> (default: yes): Repeatmasker files include many repeats besides TE. This could be an interesting benchmark to compare to, but by default they are removed. The user can specify no if they would like to keep these. NOTE: Currently does not work with downstream steps if filter is set to No.
-a <argument file> (default: <directory of pipeline>/arguments.txt): The arguments.txt file location. By default, the arguments.txt file that is in the rnapipeline directory will be used. If another is desired for use then the fullpath of the file can be specified.
Output File(s):
(1) filter_combined_candidates.tsv: A file with every TE-gene transcript. This file is used for calculating read information in subsequent steps.
(2) initial_candidate_list.tsv: A summary of filter_combined_candidates.tsv for each unique candidate. Also lists the treatment and untreated files that the candidate is present in.
(3) Step4.RData: Workspace file with data loaded from R session. Subsequent steps load this to save time.
Note:

This step will give ALL assembled transcripts. Assembly is very noisy, and thus the output of this step will have a high false positive rate. Filtering based on tpm or fraction of tpm can help. It is reccomended that the user looks at initial_candidate_list.tsv and assures that the candidates are in the format that they desire.

Calculate Read Information

It is recommended that candidates are confirmed using read information. We have found this to greatly increase the specificity of the pipeline since assembly can make many errors especially without a reference.
We have developed a fast parser using samtools and bedtools to be able to get read information. There are separate programs for either single and or paired end reads. The only argument needed is the path of the bam files.
Note:

Bam files should be sorted by position and indexed for this to work. Bam file naems should be the same as the gtf files that were the oriignal input for the pipeline except the extention is different. It is recommended that paired-end reads are used. Single-end performance is not as robust.

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/eu.ec/
mkdir filterreadstats
commandsmax_speed_se.py filter_combined_candidates.tsv ../motif/
#双端测序采用这个命令commandsmax_speed.py filter_combined_candidates.tsv ../motif/
echo "(4/8) Filter based on Reads"
parallel -j 10 < filterreadcommands_se.txt

合并所有已读信息文件

find ./filterreadstats -name "*.stats" -type f -maxdepth 1 -print0 | xargs -0 -n128 -P1 grep -H e > resultgrep_filterreadstatsdone.txt
cat resultgrep_filterreadstatsdone.txt | sed 's/\:/\t/g' > filter_read_stats.txt
filterReadCandidates.R

Output File(s):

(1) read_filtered_candidates.tsv: A file with every TE-gene transcript. This file is used for calculating read information in subsequent steps.
(2) candidate_transcripts.gff3: A GFF3 formatted file of all the unique TE-gene transcripts detected. This can be viewed on a genome viewer to see what candidates look like. This is used in subsequent steps.
(3) Step6.RData: Workspace file with data loaded from R session. Subsequent steps load this to save time.
Remove step 4 session data that is no longer needed.

与参考GTF合并

gffread -E candidate_transcripts.gff3 -T -o candidate_transcripts.gtf
echo candidate_transcripts.gtf > cuffmergegtf.list
mkdir merged_asm_full
cuffmerge -o ./merged_asm_full -g /mnt/d/D/linux/genome/gencode-h/gencode.v42.chr_patch_hapl_scaff.annotation.gtf cuffmergegtf.list
mv ./merged_asm_full/merged.gtf reference_merged_candidates.gtf
gffread -E reference_merged_candidates.gtf -o- > reference_merged_candidates.gff3
echo "(5/8) Annotate Merged Transcripts"
rmskhg38_annotate_gtf_update_test_tpm_cuff.py reference_merged_candidates.gff3

Output File(s):
(1) reference_merged_candidates.gff3_annotated_test_all
(2) reference_merged_candidates.gff3_annotated_filtered_test_all <(Final List of all TE-gene transcripts)
(3) reference_merged_candidates.gff3_annotated_test*
(4) reference_merged_candidates.gff3_annotated_filtered_test
Note:

We recommend users to view the reference_merged_candidates.gff3 on a genome browser to confirm that the merged transcript file has the expected candidates. The user can map transcripts back to their annotation using the transcript ID and confirm proper identification.

计算转录本

find ../motif/ -name "*bam" | while read file ; do xbase=${file##*/}; echo "samtools view -q 255 -h "$file" | stringtie - -o "${xbase%.*}".gtf -e -b "${xbase%.*}"_stats -p 2 --fr -m 100 -c 1 -G reference_merged_candidates.gtf" >> quantificationCommands.txt ; done ;
echo "(6/8) Transcript Quantification"
parallel -j 10 < quantificationCommands.txt
mergeAnnotationProcess.R

Options with Defaults:
-f <merged gtf annotation file> (default: reference_merged_candidates.gff3_annotated_filtered_test_all): Reference transcript file to be processed. Default will work if all the previous steps have been done as described with the same names.
Output File(s):
(1) candidate_introns.txt: A large one-line string with all intron locations. This is used in susequent steps to extract intron read information
(2) candidate_names.txt: The trnascript names for all candidate transcripts that are left
(3) Step10.RData: Workspace file with data loaded from R session. Subsequent steps load this to save time.
Remove step 6 session data that is no longer needed.

处理stringtie transcript注释文件，获取相关信息并进行汇总

（Th stringtie output needs to be formmated into a table, and thus we have a few helper scripts and bash commands to make these tables.）

ls ./*stats/i_data.ctab > ctab.txt
cat ctab.txt | while read ID ; do fileid=$(echo "$ID" | awk -F "/" '{print $2}'| awk -F "_" '{print $1}'); cp $ID ./${fileid}_i_data.ctab; done;

find . -name "*i_data.ctab" > ctab_i.txt
cat ctab_i.txt | while read ID ; do fileid=$(echo "$ID" | awk -F "/" '{print $2}'); cat <(printf 'chr\tstrand\tstart\tend\t'${fileid/_stats/}'\n') <(grep -f candidate_introns.txt $ID | awk -F'\t' '{ print $2"\t"$3"\t"$4"\t"$5"\t"$6 }') > ${ID}_cand ; done ;

cat <(find . -name "*i_data.ctab_cand" | head -1 | while read file ; do cat $file | awk '{print $1"\t"$2"\t"$3"\t"$4}' ; done;) > table_i_all

find . -name "*i_data.ctab_cand" | while read file ; do paste -d'\t' <(cat table_i_all) <(cat $file | awk '{print $5}') > table_i_all_temp; mv table_i_all_temp table_i_all; done ;

ls ./*stats/t_data.ctab > ctablist.txt

cat ctablist.txt | while read file ; do echo "stringtieExpressionFrac.py $file" >> stringtieExpressionFracCommands.txt ; done;

echo "(7/8) Transcript Quantification比较耗时间2小时"
parallel -j 24 < stringtieExpressionFracCommands.txt

ls ./*stats/t_data.ctab_frac_tot > ctab_frac_tot_files.txt
ls ./*stats/t_data.ctab_tpm > ctab_tpm_files.txt

cat <(echo "TranscriptID") <(find . -name "*ctab_frac_tot" | head -1 | while read file ; do sort $file | awk '{print $1}' ; done;) > table_frac_tot
cat ctab_frac_tot_files.txt | while read file ; do fileid=$(echo "$file" | awk -F "/" '{print $2}') ; paste -d'\t' <(cat table_frac_tot) <(cat <(echo ${fileid/_stats/}) <(sort $file | awk '{print $2}')) > table_frac_tot_temp; mv table_frac_tot_temp table_frac_tot; done ;

cat <(echo "TranscriptID") <(find . -name "*ctab_tpm" | head -1 | while read file ; do sort $file | awk '{print $1}' ; done;) > table_tpm
cat ctab_tpm_files.txt | while read file ; do fileid=$(echo "$file" | awk -F "/" '{print $2}') ; paste -d'\t' <(cat table_tpm) <(cat <(echo ${fileid/_stats/}) <(sort $file | awk '{print $2}')) > table_tpm_temp; mv table_tpm_temp table_tpm; done ;

cat <(head -1 table_frac_tot) <(grep -Ff candidate_names.txt table_frac_tot) > table_frac_tot_cand
cat <(head -1 table_tpm) <(grep -Ff candidate_names.txt table_tpm) > table_tpm_cand

样品识别，定量和创建最后的表格

聚合stringtie注释和内含子覆盖率信息。它将根据用户设置的参数预测是否存在。用户还可以使用此步骤输出的表进行更高级的统计分析。

finalStatisticsOutput.R -e "Eu"

Output File(s):

(1) All TE-derived Alternative Isoforms Statistics.xlsx: A file with the final statistics calculated for each candidate. There is also data on the gene expression in both groups (treatment and normal), fraction of gene expression (treatment and normal), the number of reads to main splicing intron (treatment and normal), and treatment enrichment.
Note:

The Treatment Count and Normal Count columns are calculated based on the number of files in which the candidate passes the thresholds set on fraction of expression, total gene expression, and number of reads spanning the main splicing intron. The final table has all the data used for this, so the user can try using different thresholds to optimize candidate presence based on their data.

(2) allCandidateStatistics.tsv: file with gene expression, fraction expression, transcript expression, and intron junction read information across all the samples for all the candidates.
(3) Step11_FINAL.RData: Workspace file with data loaded from R session. Can be loaded by user to save time to do more advanced analysis.
Remove old RData, the final one will have all data from previous ones.

用Kozak法翻译转录本，生成RNa序列的FASTA

科研干货 | 一文了解RNA序列比对软件HISAT2
转录组测序自问世以来，在研究基因表达、转录本结构、基因融合、非编码RNA鉴定等方面发挥了重要的作用。在有参考基因组...
融合转录本的检测软件对比（NGS）
融合转录本的产生融合转录本由融合基因（DNA水平）或可变剪切事件产生（RNA水平）。融合转录本的检测术语 Fi...
生信菜鸟团-专题学习目录
生信菜鸟团-专题学习目录原文链接转录组专题转录组专题-可变剪接转录组专题-融合基因转录组专题-WGCNA分析...
全长转录本结构分析（下）
作者：Arno审稿：童蒙编辑：angelica 前面我们介绍了PacBio三代全长转录组测序相关的全长转录本鉴定、...
转录组学习三（数据质控）
转录组学习一（软件安装）转录组学习二（数据下载）转录组学习三（数据质控）转录组学习四（参考基因组及gt...
转录组学习二（数据下载）
转录组学习一（软件安装）转录组学习二（数据下载）转录组学习三（数据质控）转录组学习四（参考基因组及gt...
转录组学习五（reads比对）
转录组学习一（软件安装）转录组学习二（数据下载）转录组学习三（数据质控）转录组学习四（参考基因组及gt...
转录组学习八（功能富集分析）
转录组学习一（软件安装）转录组学习二（数据下载）转录组学习三（数据质控）转录组学习四（参考基因组及gt...
转录组学习六（reads计数与标准化）
转录组学习一（软件安装）转录组学习二（数据下载）转录组学习三（数据质控）转录组学习四（参考基因组及gt...
转录组学习四（参考基因组及gtf注释探究）
转录组学习一（软件安装）转录组学习二（数据下载）转录组学习三（数据质控）转录组学习四（参考基因组及gt...