TE分析工具TRACKPOSON测试

作者: 花生学生信 | 来源:发表于2023-06-15 09:38 被阅读0次

GO学习笔记(17) -测试
16.手撕Go语言-测试
EDTA-重复序列注释
生信分析软件总结合集
NPDP拆书（二十三）：市场研究（市场研究工具）
php入门－－性能测试
Golang Notes
简单的App启动速度优化
爬虫练习-斗鱼颜值频道
06_01_bash脚本编程之四、整数测试及特殊变量、06_

转座子(transposable element)：或称跳跃基因，是一种可以改变其在基因组中位置的DNA序列，有时会产生或逆转突变，改变细胞的遗传特性和基因组大小。
转座子按照转座方式的不同，可分为两大类：I型转座子(Class I elements)，II型转座子(Class II elements)。
I型转座子又叫反转座子(retrotransposon)，其在转座时，会先以DNA为模板，在RNA聚合酶II的作用下，转录成一段mRNA，然后再以这段mRNA为模板反转录成cDNA，最后在整合酶的作用下将这段cDNA整合到基因组上新的位置。根据反转座子转座机制，人们形象地称其为“复制-粘贴”型转座原件。
II型转座子也叫做转座子(transposon)，在转座酶的作用下，II型转座子从原来的位置解离下来，再重新整合到染色体上的其他位置，原来的位置由于转座子解离形成的断链，会在DNA修复的机制下被修复完整。故II型转座子转座的机制被形象地称为“剪切-粘贴”。

参考文献

文献方法，使用已有的TE库

挑选32个反转坐子做代表

32个TE家族

测试：，没有用数据库，所以结果少

0、数据准备

###以TE做索引
bowtie2-build 32_TE_families_TRACKPOSON_NC_Carpentier_et_al.fa 32

makeblastdb -in IRGSP-1.0_genome.fasta -dbtype nucl -title NIP

bowtie2 --time --end-to-end -k 1 --very-fast -p 6 -x 32 -1 H6991.fq -2 H6992.fq |samtools view -bS -@ 2 -> 32.bam

1、keep only unmap reads with flag unmap/map

###构建te
samtools view DL104-1.mk.bam | awk -F "\t" '{if ( ($1!~/^@/) && (($2==69) || ($2==133) || ($2==165) || ($2==181) || ($2==101) || ($2==117)) ) {print ">"$1"\n"$10}}' > te.fa

et fasta creation

2、blast fa against reference genome (IRGSP1.0) for identification insertion point

###构建参考基因组索引
 makeblastdb -in IRGSP-1.0_genome.fasta -dbtype nucl -out NIP
###比对
 blastn -db NIP.db -query te.fa -out te.fa.bl -num_threads 8 -evalue 1e-20

3、parse blast to find TE insertion point

perl /public/home/fengting/work/6.14_TE/TRACKPOSON/test/find_insertion_point.pl te.fa.bl out.te 
###排序
sort -k1,1 -k2,2n out.te.bed > te.sort.bed

结果参考

4、coveragebed by 10kb windows

###索引，保留前两行
samtools faidx IRGSP-1.0_genome.fasta
###划窗
bedtools makewindows -g IRGSP-1.0_genome.fasta.fai -w 10000 >10kb.bed

###coveragebed by 10kb windows
bedtools coverage -counts -nonamecheck -a 10kb.bed -b te.sort.bed | awk -F "\t" '{if ($4>=2){print $0}}' > per10kb.bed

后面看看文献提供的代码

可能是根据TE家族分类去筛选

脚本已修改：

####You need to change the variables
###################################样本名改成变量形式，这是测试
fq1=/public/home/fengting/database/111pan/hhz_ils_ill/CRR366177_f1.fastq.gz
fq2=/public/home/fengting/database/111pan/hhz_ils_ill/CRR366177_r2.fastq.gz
out=CRR366177
#te=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/fam/fam31.fa
te=fam31
DIR=CRR366177_fam31
DB=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/fam/fam31/fam31
blast_ref_database=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/test1/IRGSP-1.0_genome.fasta
perl_script=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/find_insertion_point.pl
ref_genome_10kbpwindows.bed=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/00.data/10kb.bed
########################
#######################

mkdir $DIR
cd $DIR

######mapping reads against TE reference

bowtie2 --time --end-to-end  -k 1 --very-fast -p 6 -x $DB  -1 $fq1 -2 $fq2  | samtools view -bS -@ 2 - > "$out"-vs-"$te".bam

#######keep only unmap reads with flag unmap/map
#et fasta creation
samtools view "$out"-vs-"$te".bam | awk -F "\t" '{if ( ($1!~/^@/) && (($2==69) || ($2==133) || ($2==165) || ($2==181) || ($2==101) || ($2==117)) ) {print ">"$1"\n"$10}}' > $out-vs-$te.fa

######blast fa against reference genome (IRGSP1.0) for identification insertion point
blastn -db $blast_ref_database -query $out-vs-$te.fa -out $out-vs-$te.fa.bl -num_threads 8 -evalue 1e-20

######parse blast to find TE insertion point 
perl $perl_script $out-vs-$te.fa.bl $out-vs-$te

######sort bed output
sort -k1,1 -k2,2n $out-vs-$te.bed > $out-vs-$te.sort.bed

######coveragebed by 10kb windows
bedtools coverage -counts -nonamecheck -a $ref_genome_10kbpwindows.bed -b $out-vs-$te.sort.bed | awk -F "\t" '{if ($4>=2){print $0}}' > coveragebed_$out-vs-$te\_per10kb.bed

######cleaning temporary files
rm $out-vs-$te.bam
rm $out-vs-$te.fa*
rm $out-vs-$te.bed

#####Analyse_pipeline.sh
# Automatic analysis of TRACKPOSON output
#####

#!/usr/bin/bash


#name of TE familly
te=fam31
echo $te

#only coveragebed analyzed
mkdir final_cov2
mv coveragebed_* final_cov2
cd final_cov2

#all TE insertion 
for file in *bed;do awk -F "\t" '{print $1"_"$2"_"$3}' $file;done | sort -u > all_insertion_$te.names
#umber of TE insertion
wc -l all_insertion_$te.names

#Retrieve insertion with minimum of coverage at 5
for file in coveragebed_*;do awk -F "\t" '{if ($4>=5){print $1"_"$2"_"$3}}' $file;done | sort -u  > all_position_cov5_$te.names
wc -l all_position_cov5_$te.names

#reformat output
for file in *bed;do n=$(echo $file | sed -e "s/coveragebed_//" | sed -e "s/\-vs-.*_per10kb.bed//"); awk -F "\t" '{print $1"_"$2"_"$3}' $file > $n.txt;done

#delete empty files
find ../final_cov2/ -size 0 -exec rm -f {} \;

#Create final matrice with R 
cp ../Analyse_pipeline.R ../final_cov2/
R CMD BATCH "--args $te" Analyse_pipeline.R

###Reformat the matrix
sed -e "s/\.txt//g" matrice_final.csv | sed -e "s/FALSE/0/g" | sed -e "s/TRUE/1/g" | sed -e "s/\-/\_/g" | sed -e "s/\./\_/g" | awk 'NR<2{print $0;next}{print $0| "sort -k1,1V"}'  > matrice_final_$te.csv


#script R traditionel + histo
R CMD BATCH "--args $te" Analyse_tradi.R

#####Analysis_pipline.R
# Automatic analysis of TRACKPOSON output
#Create final matrix 
#####

#warning eliminations
options(warn=-1)

#arguments retrieve
args <- commandArgs(trailingOnly = TRUE)

#TE family name
te<-as.character(args[1])

inputFile<-try(system("ls *txt",intern=TRUE))
files <- unlist(strsplit(inputFile,split=" "))

length(files)
M<-data.frame()

all<-paste("all_insertion_",te,".names",sep="")
M<-read.table(all,sep="\n",h=F)

for (i in 1:length(files)){
 res<-read.table(files[i],sep="\n",h=F)
 tmp<-M$V1 %in% res$V1
 M<-cbind(M,tmp)
 }
colnames(M)<-c("insertion",files)

#TE insertion with minimum coverage at 5
pos<-paste("all_position_cov5_",te,".names",sep="")
POS<-read.table(pos,sep="\n",h=F)

#FInal matrice
M2<-M[which(M$insertion %in% POS$V1),]
write.csv(M2,file="matrice_final.csv",sep="\t",row.names=F,col.names=T,quote=F)