转座子(transposable element)
:或称跳跃基因,是一种可以改变其在基因组中位置的DNA序列,有时会产生或逆转突变,改变细胞的遗传特性和基因组大小。
转座子按照转座方式的不同,可分为两大类:I型转座子(Class I elements),II型转座子(Class II elements)。
I型转座子又叫反转座子(retrotransposon)
,其在转座时,会先以DNA为模板,在RNA聚合酶II的作用下,转录成一段mRNA,然后再以这段mRNA为模板反转录成cDNA,最后在整合酶的作用下将这段cDNA整合到基因组上新的位置。根据反转座子转座机制,人们形象地称其为“复制-粘贴”型转座原件。
II型转座子也叫做转座子(transposon)
,在转座酶的作用下,II型转座子从原来的位置解离下来,再重新整合到染色体上的其他位置,原来的位置由于转座子解离形成的断链,会在DNA修复的机制下被修复完整。故II型转座子转座的机制被形象地称为“剪切-粘贴”。




测试:,没有用数据库,所以结果少
0、数据准备
![]()
###以TE做索引
bowtie2-build 32_TE_families_TRACKPOSON_NC_Carpentier_et_al.fa 32
makeblastdb -in IRGSP-1.0_genome.fasta -dbtype nucl -title NIP
bowtie2 --time --end-to-end -k 1 --very-fast -p 6 -x 32 -1 H6991.fq -2 H6992.fq |samtools view -bS -@ 2 -> 32.bam
1、keep only unmap reads with flag unmap/map
###构建te
samtools view DL104-1.mk.bam | awk -F "\t" '{if ( ($1!~/^@/) && (($2==69) || ($2==133) || ($2==165) || ($2==181) || ($2==101) || ($2==117)) ) {print ">"$1"\n"$10}}' > te.fa

2、blast fa against reference genome (IRGSP1.0) for identification insertion point
###构建参考基因组索引
makeblastdb -in IRGSP-1.0_genome.fasta -dbtype nucl -out NIP
###比对
blastn -db NIP.db -query te.fa -out te.fa.bl -num_threads 8 -evalue 1e-20
3、parse blast to find TE insertion point
perl /public/home/fengting/work/6.14_TE/TRACKPOSON/test/find_insertion_point.pl te.fa.bl out.te
###排序
sort -k1,1 -k2,2n out.te.bed > te.sort.bed

4、coveragebed by 10kb windows
###索引,保留前两行
samtools faidx IRGSP-1.0_genome.fasta
###划窗
bedtools makewindows -g IRGSP-1.0_genome.fasta.fai -w 10000 >10kb.bed
###coveragebed by 10kb windows
bedtools coverage -counts -nonamecheck -a 10kb.bed -b te.sort.bed | awk -F "\t" '{if ($4>=2){print $0}}' > per10kb.bed

可能是根据TE家族分类去筛选
脚本已修改:

####You need to change the variables
###################################样本名改成变量形式,这是测试
fq1=/public/home/fengting/database/111pan/hhz_ils_ill/CRR366177_f1.fastq.gz
fq2=/public/home/fengting/database/111pan/hhz_ils_ill/CRR366177_r2.fastq.gz
out=CRR366177
#te=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/fam/fam31.fa
te=fam31
DIR=CRR366177_fam31
DB=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/fam/fam31/fam31
blast_ref_database=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/test1/IRGSP-1.0_genome.fasta
perl_script=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/find_insertion_point.pl
ref_genome_10kbpwindows.bed=/public/home/fengting/work/6.14_TE/TRACKPOSON/test/00.data/10kb.bed
########################
#######################
mkdir $DIR
cd $DIR
######mapping reads against TE reference
bowtie2 --time --end-to-end -k 1 --very-fast -p 6 -x $DB -1 $fq1 -2 $fq2 | samtools view -bS -@ 2 - > "$out"-vs-"$te".bam
#######keep only unmap reads with flag unmap/map
#et fasta creation
samtools view "$out"-vs-"$te".bam | awk -F "\t" '{if ( ($1!~/^@/) && (($2==69) || ($2==133) || ($2==165) || ($2==181) || ($2==101) || ($2==117)) ) {print ">"$1"\n"$10}}' > $out-vs-$te.fa
######blast fa against reference genome (IRGSP1.0) for identification insertion point
blastn -db $blast_ref_database -query $out-vs-$te.fa -out $out-vs-$te.fa.bl -num_threads 8 -evalue 1e-20
######parse blast to find TE insertion point
perl $perl_script $out-vs-$te.fa.bl $out-vs-$te
######sort bed output
sort -k1,1 -k2,2n $out-vs-$te.bed > $out-vs-$te.sort.bed
######coveragebed by 10kb windows
bedtools coverage -counts -nonamecheck -a $ref_genome_10kbpwindows.bed -b $out-vs-$te.sort.bed | awk -F "\t" '{if ($4>=2){print $0}}' > coveragebed_$out-vs-$te\_per10kb.bed
######cleaning temporary files
rm $out-vs-$te.bam
rm $out-vs-$te.fa*
rm $out-vs-$te.bed
#####Analyse_pipeline.sh
# Automatic analysis of TRACKPOSON output
#####
#!/usr/bin/bash
#name of TE familly
te=fam31
echo $te
#only coveragebed analyzed
mkdir final_cov2
mv coveragebed_* final_cov2
cd final_cov2
#all TE insertion
for file in *bed;do awk -F "\t" '{print $1"_"$2"_"$3}' $file;done | sort -u > all_insertion_$te.names
#umber of TE insertion
wc -l all_insertion_$te.names
#Retrieve insertion with minimum of coverage at 5
for file in coveragebed_*;do awk -F "\t" '{if ($4>=5){print $1"_"$2"_"$3}}' $file;done | sort -u > all_position_cov5_$te.names
wc -l all_position_cov5_$te.names
#reformat output
for file in *bed;do n=$(echo $file | sed -e "s/coveragebed_//" | sed -e "s/\-vs-.*_per10kb.bed//"); awk -F "\t" '{print $1"_"$2"_"$3}' $file > $n.txt;done
#delete empty files
find ../final_cov2/ -size 0 -exec rm -f {} \;
#Create final matrice with R
cp ../Analyse_pipeline.R ../final_cov2/
R CMD BATCH "--args $te" Analyse_pipeline.R
###Reformat the matrix
sed -e "s/\.txt//g" matrice_final.csv | sed -e "s/FALSE/0/g" | sed -e "s/TRUE/1/g" | sed -e "s/\-/\_/g" | sed -e "s/\./\_/g" | awk 'NR<2{print $0;next}{print $0| "sort -k1,1V"}' > matrice_final_$te.csv
#script R traditionel + histo
R CMD BATCH "--args $te" Analyse_tradi.R
#####Analysis_pipline.R
# Automatic analysis of TRACKPOSON output
#Create final matrix
#####
#warning eliminations
options(warn=-1)
#arguments retrieve
args <- commandArgs(trailingOnly = TRUE)
#TE family name
te<-as.character(args[1])
inputFile<-try(system("ls *txt",intern=TRUE))
files <- unlist(strsplit(inputFile,split=" "))
length(files)
M<-data.frame()
all<-paste("all_insertion_",te,".names",sep="")
M<-read.table(all,sep="\n",h=F)
for (i in 1:length(files)){
res<-read.table(files[i],sep="\n",h=F)
tmp<-M$V1 %in% res$V1
M<-cbind(M,tmp)
}
colnames(M)<-c("insertion",files)
#TE insertion with minimum coverage at 5
pos<-paste("all_position_cov5_",te,".names",sep="")
POS<-read.table(pos,sep="\n",h=F)
#FInal matrice
M2<-M[which(M$insertion %in% POS$V1),]
write.csv(M2,file="matrice_final.csv",sep="\t",row.names=F,col.names=T,quote=F)


网友评论