美文网首页
TE分析工具TRACKPOSON测试

TE分析工具TRACKPOSON测试

作者: 花生学生信 | 来源:发表于2023-06-15 09:38 被阅读0次

转座子(transposable element):或称跳跃基因,是一种可以改变其在基因组中位置的DNA序列,有时会产生或逆转突变,改变细胞的遗传特性和基因组大小。
  转座子按照转座方式的不同,可分为两大类:I型转座子(Class I elements),II型转座子(Class II elements)。
  I型转座子又叫反转座子(retrotransposon),其在转座时,会先以DNA为模板,在RNA聚合酶II的作用下,转录成一段mRNA,然后再以这段mRNA为模板反转录成cDNA,最后在整合酶的作用下将这段cDNA整合到基因组上新的位置。根据反转座子转座机制,人们形象地称其为“复制-粘贴”型转座原件。
  II型转座子也叫做转座子(transposon),在转座酶的作用下,II型转座子从原来的位置解离下来,再重新整合到染色体上的其他位置,原来的位置由于转座子解离形成的断链,会在DNA修复的机制下被修复完整。故II型转座子转座的机制被形象地称为“剪切-粘贴”。

参考文献
文献方法,使用已有的TE库 挑选32个反转坐子做代表 32个TE家族

测试:,没有用数据库,所以结果少

0、数据准备


###以TE做索引
bowtie2-build 32_TE_families_TRACKPOSON_NC_Carpentier_et_al.fa 32

makeblastdb -in IRGSP-1.0_genome.fasta -dbtype nucl -title NIP

bowtie2 --time --end-to-end -k 1 --very-fast -p 6 -x 32 -1 H6991.fq -2 H6992.fq |samtools view -bS -@ 2 -> 32.bam


1、keep only unmap reads with flag unmap/map

###构建te
samtools view DL104-1.mk.bam | awk -F "\t" '{if ( ($1!~/^@/) && (($2==69) || ($2==133) || ($2==165) || ($2==181) || ($2==101) || ($2==117)) ) {print ">"$1"\n"$10}}' > te.fa
et fasta creation

2、blast fa against reference genome (IRGSP1.0) for identification insertion point

###构建参考基因组索引
 makeblastdb -in IRGSP-1.0_genome.fasta -dbtype nucl -out NIP
###比对
 blastn -db NIP.db -query te.fa -out te.fa.bl -num_threads 8 -evalue 1e-20

3、parse blast to find TE insertion point

perl /public/home/fengting/work/6.14_TE/TRACKPOSON/test/find_insertion_point.pl te.fa.bl out.te 
###排序
sort -k1,1 -k2,2n out.te.bed > te.sort.bed
结果参考

4、coveragebed by 10kb windows

###索引,保留前两行
samtools faidx IRGSP-1.0_genome.fasta
###划窗
bedtools makewindows -g IRGSP-1.0_genome.fasta.fai -w 10000 >10kb.bed

###coveragebed by 10kb windows
bedtools coverage -counts -nonamecheck -a 10kb.bed -b te.sort.bed | awk -F "\t" '{if ($4>=2){print $0}}' > per10kb.bed
后面看看文献提供的代码

可能是根据TE家族分类去筛选

脚本已修改:


####You need to change the variables
###################################样本名改成变量形式,这是测试
fq1=/public/home/fengting/database/111pan/hhz_ils_ill/CRR366177_f1.fastq.gz
fq2=/public/home/fengting/database/111pan/hhz_ils_ill/CRR366177_r2.fastq.gz
out=CRR366177
#te=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/fam/fam31.fa
te=fam31
DIR=CRR366177_fam31
DB=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/fam/fam31/fam31
blast_ref_database=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/test1/IRGSP-1.0_genome.fasta
perl_script=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/find_insertion_point.pl
ref_genome_10kbpwindows.bed=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/00.data/10kb.bed
########################
#######################

mkdir $DIR
cd $DIR

######mapping reads against TE reference

bowtie2 --time --end-to-end  -k 1 --very-fast -p 6 -x $DB  -1 $fq1 -2 $fq2  | samtools view -bS -@ 2 - > "$out"-vs-"$te".bam

#######keep only unmap reads with flag unmap/map
#et fasta creation
samtools view "$out"-vs-"$te".bam | awk -F "\t" '{if ( ($1!~/^@/) && (($2==69) || ($2==133) || ($2==165) || ($2==181) || ($2==101) || ($2==117)) ) {print ">"$1"\n"$10}}' > $out-vs-$te.fa

######blast fa against reference genome (IRGSP1.0) for identification insertion point
blastn -db $blast_ref_database -query $out-vs-$te.fa -out $out-vs-$te.fa.bl -num_threads 8 -evalue 1e-20

######parse blast to find TE insertion point 
perl $perl_script $out-vs-$te.fa.bl $out-vs-$te

######sort bed output
sort -k1,1 -k2,2n $out-vs-$te.bed > $out-vs-$te.sort.bed

######coveragebed by 10kb windows
bedtools coverage -counts -nonamecheck -a $ref_genome_10kbpwindows.bed -b $out-vs-$te.sort.bed | awk -F "\t" '{if ($4>=2){print $0}}' > coveragebed_$out-vs-$te\_per10kb.bed

######cleaning temporary files
rm $out-vs-$te.bam
rm $out-vs-$te.fa*
rm $out-vs-$te.bed
#####Analyse_pipeline.sh
# Automatic analysis of TRACKPOSON output
#####

#!/usr/bin/bash


#name of TE familly
te=fam31
echo $te

#only coveragebed analyzed
mkdir final_cov2
mv coveragebed_* final_cov2
cd final_cov2

#all TE insertion 
for file in *bed;do awk -F "\t" '{print $1"_"$2"_"$3}' $file;done | sort -u > all_insertion_$te.names
#umber of TE insertion
wc -l all_insertion_$te.names

#Retrieve insertion with minimum of coverage at 5
for file in coveragebed_*;do awk -F "\t" '{if ($4>=5){print $1"_"$2"_"$3}}' $file;done | sort -u  > all_position_cov5_$te.names
wc -l all_position_cov5_$te.names

#reformat output
for file in *bed;do n=$(echo $file | sed -e "s/coveragebed_//" | sed -e "s/\-vs-.*_per10kb.bed//"); awk -F "\t" '{print $1"_"$2"_"$3}' $file > $n.txt;done

#delete empty files
find ../final_cov2/ -size 0 -exec rm -f {} \;

#Create final matrice with R 
cp ../Analyse_pipeline.R ../final_cov2/
R CMD BATCH "--args $te" Analyse_pipeline.R

###Reformat the matrix
sed -e "s/\.txt//g" matrice_final.csv | sed -e "s/FALSE/0/g" | sed -e "s/TRUE/1/g" | sed -e "s/\-/\_/g" | sed -e "s/\./\_/g" | awk 'NR<2{print $0;next}{print $0| "sort -k1,1V"}'  > matrice_final_$te.csv


#script R traditionel + histo
R CMD BATCH "--args $te" Analyse_tradi.R
#####Analysis_pipline.R
# Automatic analysis of TRACKPOSON output
#Create final matrix 
#####

#warning eliminations
options(warn=-1)

#arguments retrieve
args <- commandArgs(trailingOnly = TRUE)

#TE family name
te<-as.character(args[1])

inputFile<-try(system("ls *txt",intern=TRUE))
files <- unlist(strsplit(inputFile,split=" "))

length(files)
M<-data.frame()

all<-paste("all_insertion_",te,".names",sep="")
M<-read.table(all,sep="\n",h=F)

for (i in 1:length(files)){
 res<-read.table(files[i],sep="\n",h=F)
 tmp<-M$V1 %in% res$V1
 M<-cbind(M,tmp)
 }
colnames(M)<-c("insertion",files)

#TE insertion with minimum coverage at 5
pos<-paste("all_position_cov5_",te,".names",sep="")
POS<-read.table(pos,sep="\n",h=F)

#FInal matrice
M2<-M[which(M$insertion %in% POS$V1),]
write.csv(M2,file="matrice_final.csv",sep="\t",row.names=F,col.names=T,quote=F)



结果文件,csv格式 最终结果,01矩阵

相关文章

网友评论

      本文标题:TE分析工具TRACKPOSON测试

      本文链接:https://www.haomeiwen.com/subject/ozzhydtx.html