[Bioinformatics Exercise] Adjusting Coordinates in a GFF File

Author: xuzhougeng | Published 2020-04-07 16:39

Suppose we extract a fragment of the genome with seqkit faidx ref.fa chr8:25234310-25266151 > target.fa, and then run AUGUSTUS gene prediction on that fragment:

augustus --species=arabidopsis --gff3=on target.fa > target.gff

The resulting target.gff has the following structure:

# This output was generated with AUGUSTUS (version 3.2.3).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer and L. Romoth.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# arabidopsis version. Using default transition matrix.
# We have hints for 0 sequences and for 0 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 31842, name = chr8:25234310-25266151) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 1 on both strands
# start gene g1
chr8:25234310-25266151  AUGUSTUS        gene    2896    5265    0.03    +       .       g1
chr8:25234310-25266151  AUGUSTUS        transcript      2896    5265    0.03    +       .       g1.t1
chr8:25234310-25266151  AUGUSTUS        tss     2896    2896    .       +       .       transcript_id "g1.t1"; gene_id "g1";
...
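
As a quick consistency check, the length that AUGUSTUS reports in its header (31842) should equal the size of the region passed to seqkit faidx, assuming the region is 1-based and inclusive at both ends. A minimal Python sketch follows; the helper name region_length is made up for illustration.

```python
def region_length(region):
    """Return the length of a 'name:start-end' region (1-based, inclusive)."""
    # rsplit keeps any ':' inside the sequence name intact
    name, span = region.rsplit(":", 1)
    start, end = (int(x) for x in span.split("-"))
    return end - start + 1

print(region_length("chr8:25234310-25266151"))  # 31842
```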

Now suppose we want to view the predictions in IGV against the original genome, like this:

[Figure: IGV display]

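To do that, we need to post-process the GFF file produced by AUGUSTUS: replace the sequence name chr8:25234310-25266151 with chr8, and shift every coordinate by the start position of the extracted region. For example, 2896 becomes 25234310 + 2896 - 1 = 25237205. The -1 is needed because both the region start and the GFF coordinates are 1-based.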
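
The offset arithmetic can be checked against the gene line shown earlier (start 2896, end 5265). This is only a sketch: to_genome is a hypothetical helper name, and it assumes that both the region start and the local coordinate are 1-based.

```python
def to_genome(region_start, local_pos):
    """Map a 1-based position inside the extracted fragment onto the genome."""
    return region_start + local_pos - 1

region_start = 25234310  # start of chr8:25234310-25266151

print(to_genome(region_start, 2896))  # 25237205
print(to_genome(region_start, 5265))  # 25239574
print(to_genome(region_start, 1))     # 25234310, the first base of the fragment
```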

Write a short piece of code to do the conversion, using whatever tool you are most comfortable with. Here is my Python 3 script:

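#!/usr/bin/env python3

import sys

fn = sys.argv[1]

with open(fn, "r") as fh:
    for line in fh:
        # Strip first, so that blank lines really become empty strings
        line = line.strip()
        # Skip blank lines and the many "#" comment lines AUGUSTUS writes
        if not line or line.startswith("#"):
            continue
        col = line.split("\t")
        # Column 1 looks like "chr8:25234310-25266151": name, then start-end
        seq = col[0]
        seq_start = int(seq.split(":")[1].split("-")[0])
        col[0] = seq.split(":")[0]
        # Shift the 1-based start and end onto the original genome
        col[3] = str(int(col[3]) + seq_start - 1)
        col[4] = str(int(col[4]) + seq_start - 1)
        print("\t".join(col))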
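
To try the logic without downloading anything, the same transformation can be wrapped in a function and applied to the gene line from the AUGUSTUS output above. This is a sketch, not the author's script: shift_gff_line is a hypothetical name, and the sample line is re-typed with tab separators, as a real GFF file would have.

```python
def shift_gff_line(line):
    """Rename the sequence and shift start/end for one tab-separated GFF line."""
    col = line.rstrip("\n").split("\t")
    # "chr8:25234310-25266151" -> name "chr8", offset 25234310
    name, span = col[0].rsplit(":", 1)
    offset = int(span.split("-")[0])
    col[0] = name
    col[3] = str(int(col[3]) + offset - 1)
    col[4] = str(int(col[4]) + offset - 1)
    return "\t".join(col)

gene = "\t".join(["chr8:25234310-25266151", "AUGUSTUS", "gene",
                  "2896", "5265", "0.03", "+", ".", "g1"])
print(shift_gff_line(gene))
```

The printed line should start with chr8 and carry 25237205 and 25239574 in columns 4 and 5, with the remaining columns unchanged.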

The test data can be downloaded here (link: https://share.weiyun.com/5E3Gk2u password: zxe3q4)