「像读文献一样读代码」如何解析GTF文件进行统计分析？

作者: 陈有朴 | 来源:发表于2022-08-13 20:50 被阅读0次

「像读文献一样读代码」如何解析GTF文件进行统计分析？
miRNA array笔记1
Java文件读写
gffread软件使用教程
教你如何在 Python 中读，写和解析 CSV 文
填表记
21天（js高程）-第2天
如何读文献
如何读文献
JVM虚拟机学习(1)

测试数据下载

wget -c ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr.gtf.gz

先验知识

代码解读

实现了哪些功能？

1）读取GTF文件

2）根据feature，将其分为3种数据类型进行保存，即gene、transcript、exon。

每一种数据类型，分别构建一种对象，每种对象所具有的属性定义如下，

Object gene包含的属性有：所属的染色体，起始位置，终止位置，正负链信息，gene id（对应GTF文件中的gene_id）
Object transcript包含的属性有：所属的染色体，起始位置，终止位置，transcript id（对应GTF文件中的transcript_id），parent id（对应GTF文件中的gene_id）
Object exon包含的属性有：所属的染色体，起始位置，终止位置，parent id（对应GTF文件中的transcript_id）

3）数据处理

统计每一条染色体上的gene总数
计算每一个gene的长度
统计每个gene的转录本数
记录每个exon的坐标

代码设计思路

将GTF文件的每一行看作一个对象来处理，构建不同类型的对象对信息进行存储，同时由于不同的对象之间存在信息交集，这就是类继承的应用之处。

Python代码

在正则表达式上，我进行了一定修改。

'''
Date: 2022.8.14
Description: Kind of a collection of tools to parse and analyze GTF
Main Function:
- count gene number of each chromosome
- count transcritp number of each gene
- calculate the total number of exon and intron of each transcript
- plot the result
'''
import sys
import re

args = sys.argv  # Set argv to grep the ipnut. argv is a list, e.g. [, , ]

class Genome_info():
    def __init__(self):
        self.chr = ""
        self.start = 0
        self.end = 0

class Gene(Genome_info):
    '''
    Note: Object Gene is inherited from Genome_info
    Gene has two attributes,
    1) orientation
    2) id (gene id)
    '''
    def __init__(self):
        Genome_info.__init__(self)
        self.orientation = ""
        self.id = ""

class Transcript(Genome_info):
    '''
    Note: Object Transcript is inherited from Genome_info
    Transcript has two attributes,
    1) id (transcript id)
    2) parent (gene id)
    '''
    def __init__(self):
        Genome_info.__init__(self)
        self.id = ""
        self.parent = ""

class Exon(Genome_info):
    '''
    Note: Object Exon is inherited from Genome_info
    Transcript has one attributes,
    2) parent (?)
    '''
    def __init__(self):
        Genome_info.__init__(self)
        self.parent = ""


'''
Define a main function to handle gtf file
'''
def main(args):

    list_chr = []
    list_gene = {}
    list_transcript = {}
    list_exon = []
    # l_n = 0
    with open(args[1]) as fp_gtf:
        for line in fp_gtf:
            if line.startswith("#"):
                continue
            # print ("in %s" % l_n)
            # l_n += 1
            lines = line.strip("\n").split("\t")
            chr = lines[0]
            type = lines[2]
            start = int(lines[3])
            end = int(lines[4])
            orientation = lines[6]
            attr = lines[8]
            if not re.search(r'protein_coding', attr):   # 只针对protein_coding进行统计分析
                continue

            if not chr in list_chr :
                list_chr.append(chr)

            if type == "gene":
                gene = Gene()
                id = re.search(r'gene_id "([^;]+)";?', attr).group(1)  # Original version, could be adapted to r'gene_id "(\S+)";'
                # id = re.search(r'gene_id "(\S+)";', attr).group(1)      # My adaptation
                gene.chr = chr
                gene.start = start
                gene.end = end
                gene.id = id
                gene.orientation = orientation
                list_gene[id] = gene  # list_gene is a dict to depost gene id and its correspoding Gene Object
                # print(id)
            
            elif type == "transcript":
                transcript = Transcript()
                id = re.search(r'transcript_id "([^;]+)";?', attr).group(1)
                # id = re.search(r'transcript_id "(\S+)";', attr).group(1)      # My adaptation
                parent = re.search(r'gene_id "([^;]+)";?', attr).group(1)
                # id = re.search(r'gene_id "(\S+)";', attr).group(1)      # My adaptation
                
                if not parent in list_gene:
                    continue
                transcript.chr = chr
                transcript.start = start
                transcript.end = end
                transcript.id = id
                transcript.parent = parent
                list_transcript[id] = transcript

            elif type == "exon":
                exon = Exon()
                parent = re.search(r'transcript_id "([^;]+)";?', attr).group(1)
                # parent = re.search(r'transcript_id "(\S+)";', attr).group(1) # My adaptation
                
                if not parent in list_transcript:
                    continue
                exon.chr = chr
                exon.start = start
                exon.end = end
                exon.parent = parent
                list_exon.append(exon)

    chr_gene(list_gene)
    gene_len(list_gene)
    gene_transcript(list_transcript)
    transcript_exon(list_exon)
    exon_pos(list_exon)

def chr_gene(list_gene):
    print("-------------------------------- Count gene number of each chromosome --------------------------------")
    count_gene = {}
    for info in list_gene.values():
        chr = info.chr
        if chr in count_gene:
            count_gene[info.chr] += 1
        else:
            count_gene[info.chr] = 1
    with open("chr_gene.txt", 'w') as fp_out:
        for chr, num in count_gene.items():
            print("\t".join([chr, str(num)]) + "\n")
            fp_out.write("\t".join([chr, str(num)]) + "\n")


def gene_len(list_gene):
    print("-------------------------------- Calculate the length of each gene --------------------------------")
    with open("gene_len.txt", 'w') as fp_out:
        for gene_id, info in list_gene.items():
            len = info.end - info.start + 1
            fp_out.write("\t".join([gene_id, str(len)]) + "\n")
            # print("\t".join([gene_id, str(len)]) + "\n")

def gene_transcript(list_transcript):
    print("-------------------------------- Count transcript number of each gene --------------------------------")
    count_transcript = {}
    for info in list_transcript.values():
        gene_id = info.parent
        if gene_id in count_transcript:
            count_transcript[gene_id] += 1
        else:
            count_transcript[gene_id] = 1
    with open("gene_transcript.txt", 'w') as fp_out:
        for gene_id, num in count_transcript.items():
            # print("\t".join([gene_id, str(num)]) + "\n")
            fp_out.write("\t".join([gene_id, str(num)]) + "\n")


def transcript_exon(list_exon):
    print("-------------------------------- Count exon number of each transcript --------------------------------")
    count_exon = {}
    for exon in list_exon:
        transcript_id = exon.parent
        if transcript_id in count_exon:
            count_exon[transcript_id] += 1
        else:
            count_exon[transcript_id] = 1
    with open("transcript_exon.txt", 'w') as fp_out:
        for transcript_id, num in count_exon.items():
            # print("\t".join([transcript_id, str(num)]) + "\n")
            fp_out.write("\t".join([transcript_id, str(num)]) + "\n")

def exon_pos(list_exon):
    print("-------------------------------- Save the position info of each exon --------------------------------")
    count_exon = {}
    for exon in list_exon:
        transcript_id = exon.parent
        if transcript_id in count_exon:
            count_exon[transcript_id] += ",%s-%s" % (str(exon.start), str(exon.end))
        else:
            count_exon[transcript_id] = "%s-%s" % (str(exon.start), str(exon.end))
    with open("exon_pos.txt", 'w') as fp_out:
        for transcript_id, pos in count_exon.items():
            print("\t".join([transcript_id, pos]) + "\n")
            fp_out.write("\t".join([transcript_id, pos]) + "\n")


def gene_exon_pos(list_gene, list_transcript, list_exon):
    pass


if __name__ == "__main__":
    main(args)

代码运行

python save.py input.gtf

小记

暑假在编写脚本的过程中，感觉自己Python其实进步了不少，但是还没能把数据结构、面向对象编程很好的整合起来。

因为本科阶段确实没有什么自己编写脚本去分析数据的需求，就算自己创造需求了，也没有实际应用到工作当中，因此刻意练习是很重要的。

有意思的故事是，去年有幸听到了坦哥面试学生问的问题，“解释一下类”，

要是当时的我，碰到了这个问题，那我是真答不上来。

参考资料

[1] http://www.biotrainee.com/thread-626-1-1.html【作者：dongye】

「像读文献一样读代码」如何解析GTF文件进行统计分析？
测试数据下载先验知识代码解读实现了哪些功能？ 1）读取GTF文件 2）根据feature，将其分为3种数据类...
miRNA array笔记1
1. 对gtf文件进行处理： cat GSE52313_transcripts.gtf | grep "gene_...
Java文件读写
1.知识点读文件写文件释放文件资源如何优雅的遍历 finally的用法 2.代码功能代码测试代码 3....
gffread软件使用教程
gffread 不仅可以实现GTF与GFF的互相转换，而且还可以对GFF文件进行过滤处理。可以直接读取GTF文件。...
教你如何在 Python 中读，写和解析 CSV 文
摘要：在这篇文章中关于“在Python如何阅读CSV文件”中，我们将学习如何读，写和解析的CSV文件的Python...
填表记
读文件、填写表格、整理资料都是我的弱项：读文件对我来说，就像是天书，实在没办法的话就要像精读文本一样多读几遍；...
21天（js高程）-第2天
（续···）与解析嵌入式JavaScript 代码一样，在解析外部JavaScript文件（包括下载文件）...
如何读文献
**我们很多的时候，闷在实验室闭门造车，实在不如稍抽出一点时间看看文献，看看别人是否有同样的困惑。我们的大老板说...
如何读文献
阅读学术论文，需要掌握论文的大概内容、作者、相关其他的论文、文献如何管理与如何评价论文查找文献的途径：图书馆、万...
JVM虚拟机学习(1)
从类加载解析开始 javac编译完的class文件虚拟机读取的时候通过ClassFileStream类来进行读...