美文网首页ggplot集锦
Python脚本之从gtf文件中提取splice sites

Python脚本之从gtf文件中提取splice sites

作者: 中午打个盹 | 来源:发表于2020-09-05 22:16 被阅读0次

    代码如下:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # extract_splice_sites.py
    # @author Aaron Liu
    # @description 从gtf文件中提取剪切位点(参考hsiat2_extract_splice_sites.py)
    # @created 2020-09-04T19:26:12.427Z+08:00
    # @last-modified 2020-09-05T17:33:50.935Z+08:00
    
    from argparse import ArgumentParser, FileType
    
    def extract_splice_sites(gtf_file):
        trans = {}
        for line in gtf_file:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            if '#' in line:
                line = line.strip('#')[0].strip()
    
            chr, source, feature, start, end, score, strand, \
                frame, values = line.split('\t')
            start, end = int(start), int(end)
    
            if feature != 'exon' or start >= end:
                continue
    
            value_attr = {}
            for attr in values.split(';'):
                gene, _, transcript = attr.strip().partition(' ')
                value_attr[gene] = transcript.strip('"')
    
            if "gene_id" not in value_attr \
                or "transcript_id" not in value_attr:
                continue
    
            transcript_id = value_attr['transcript_id']
            gene_id = value_attr['gene_id']
            if transcript_id not in trans:
                trans[transcript_id] = [gene_id, chr, strand, [[start, end]]]
            else:
                trans[transcript_id][3].append([start, end])
    
        for transid, [gene_id, chr, strand, exon] in trans.items():
            exon.sort()
            tem_exon = [exon[0]]
            for i in range(1, len(exon)):
                if exon[i][0] - tem_exon[-1][1] <= 5:
                    tem_exon[-1][1] = exon[i][1]
                else:
                    tem_exon.append(exon[i])
            trans[transid] = [gene_id, chr, strand, tem_exon]
    
        junction = set()
        for gene, chrom, strand, exon in trans.values():
            exon.sort()
            for i in range(1, len(exon)):
                junction.add(
                    (gene, chrom, exon[i - 1][1] + 1, exon[i][0] - 1, strand))
    
        junction = sorted(junction)
        return junction
    
    
    if __name__ == "__main__":
        parse = ArgumentParser(description='Extract splice sites from a GTF file')
        parse.add_argument('-i',
                           '--gtf_file',
                           required=True,
                           nargs='?',
                           type=FileType('r'),
                           help="a gtf file is required")
        parse.add_argument('-o',
                           '--out_file',
                           required=True,
                           nargs='?',
                           type=FileType('w'),
                           help='output file is required')
    
        args = parse.parse_args()
    
        splicesites = extract_splice_sites(args.gtf_file)
        for row in splicesites:
            args.out_file.write(row[0] + '\t' + row[1] + '\t' + str(row[2]) +
                                '\t' + str(row[3]) + '\t' + row[4] + '\n')
    
    

    实例:

    ./extract_splice_sites.py -i merge.gtf -o ss.txt
    

    输出:

    ...
    MSTRG.10013 chr04   2534745 2535071 +
    MSTRG.10013 chr04   2535164 2535818 +
    MSTRG.10013 chr04   2535988 2536609 +
    MSTRG.10013 chr04   2536732 2537314 +
    MSTRG.10013 chr04   2536732 2537381 +
    MSTRG.10013 chr04   2537444 2537540 +
    MSTRG.10013 chr04   2537444 2537551 +
    MSTRG.10013 chr04   2537651 2537730 +
    MSTRG.10013 chr04   2537789 2537870 +
    ...
    

    相关文章

      网友评论

        本文标题:Python脚本之从gtf文件中提取splice sites

        本文链接:https://www.haomeiwen.com/subject/fafqektx.html