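We start by extracting a fragment of the genome with seqkit:
seqkit faidx ref.fa chr8:25234310-25266151 > target.fa
We then run AUGUSTUS gene prediction on that fragment:
augustus --species=arabidopsis --gff3=on target.fa > target.gff
The resulting target.gff has the following structure: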
# This output was generated with AUGUSTUS (version 3.2.3).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer and L. Romoth.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# arabidopsis version. Using default transition matrix.
# We have hints for 0 sequences and for 0 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 31842, name = chr8:25234310-25266151) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 1 on both strands
# start gene g1
chr8:25234310-25266151 AUGUSTUS gene 2896 5265 0.03 + . g1
chr8:25234310-25266151 AUGUSTUS transcript 2896 5265 0.03 + . g1.t1
chr8:25234310-25266151 AUGUSTUS tss 2896 2896 . + . transcript_id "g1.t1"; gene_id "g1";
...
Suppose we want to view the predictions in IGV on top of the original genome, like this:
![](https://img.haomeiwen.com/i2013053/f782228f3f1a2570.png)
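To get that view, the GFF file produced by AUGUSTUS has to be post-processed. The sequence name chr8:25234310-25266151 must be replaced with chr8, and every coordinate must be shifted by the start position of the extracted region. Both coordinate systems are 1-based, so the local position 2896 becomes 25234310 + 2896 - 1 = 25237205.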
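The offset arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `to_genome_coord` is made up for illustration, and the numbers are taken from the region and the gene `g1` shown in the AUGUSTUS output.

```python
def to_genome_coord(region_start: int, local_pos: int) -> int:
    """Map a 1-based position inside the extracted fragment to the genome."""
    # Both systems are 1-based, so local position 1 is region_start itself.
    return region_start + local_pos - 1


region_start, region_end = 25234310, 25266151

gene_start = to_genome_coord(region_start, 2896)
gene_end = to_genome_coord(region_start, 5265)
# The last base of the fragment (length 31842) should land on the region end.
last_base = to_genome_coord(region_start, 31842)

print(gene_start, gene_end, last_base)
# → 25237205 25239574 25266151
```

The last value matching the region end is a quick sanity check that the `- 1` belongs in the formula.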
Please write the conversion with whatever tool you are most comfortable with. My own Python 3 script is below.
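#!/usr/bin/env python3
# Shift AUGUSTUS GFF coordinates from an extracted region back to the genome.
import sys

fn = sys.argv[1]
with open(fn, "r") as fh:
    for line in fh:
        # Strip first, so that blank lines really are empty strings.
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        col = line.split("\t")
        # Column 1 looks like "chr8:25234310-25266151".
        seq = col[0]
        seq_start = int(seq.split(":")[1].split("-")[0])
        col[0] = seq.split(":")[0]
        # GFF coordinates are 1-based, hence the -1.
        col[3] = str(int(col[3]) + seq_start - 1)
        col[4] = str(int(col[4]) + seq_start - 1)
        print("\t".join(col))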
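As a possible variant, the per-line logic can be wrapped in a function so that it can be tried on a single record before running it over a whole file. The sketch below makes several assumptions. The name `shift_gff_line` is hypothetical. The first column is assumed to follow the `name:start-end` pattern from the seqkit command above. Comment lines and records without that pattern are passed through unchanged rather than dropped. The sample record is the `g1` gene line from the output, rebuilt with tab separators.

```python
def shift_gff_line(line: str) -> str:
    """Return a GFF line with a genome-wide seqname and coordinates.

    Comment lines, blank lines, and records whose first column has no
    'name:start-end' suffix are returned unchanged.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return line
    col = line.split("\t")
    if len(col) < 5:
        return line
    # rpartition splits on the LAST colon, so names containing ':' survive.
    name, sep, span = col[0].rpartition(":")
    start, dash, end = span.partition("-")
    if not (sep and dash and start.isdigit() and end.isdigit()):
        return line
    offset = int(start) - 1  # both coordinate systems are 1-based
    col[0] = name
    col[3] = str(int(col[3]) + offset)
    col[4] = str(int(col[4]) + offset)
    return "\t".join(col)


record = "chr8:25234310-25266151\tAUGUSTUS\tgene\t2896\t5265\t0.03\t+\t.\tg1"
print(shift_gff_line(record))
```

Only columns 1, 4 and 5 are rewritten. The attribute column is left as it is, which seems sufficient here because it carries IDs rather than coordinates.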
The test data can be downloaded here (link: https://share.weiyun.com/5E3Gk2u password: zxe3q4)