Learning Python: using the pyRanges module's rea

Author: 小明的数据分析笔记本 | Published 2022-07-27 04:26


The pyRanges documentation

https://biocore-ntnu.github.io/pyranges/loadingcreating-pyranges.html


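In my own GTF file, each key in the last column (ID, for example) is joined to its value by an equals sign, whereas a GTF file usually separates them with a space. The function pyRanges defines for splitting that string therefore splits on spaces, so here we change the reading code slightly and add the equals sign as a second delimiter.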
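As a minimal sketch of that idea (the helper name `split_pair` is hypothetical and not part of pyRanges), a single regular expression can split a key/value pair once on either a space or an equals sign:

```python
import re

def split_pair(kv):
    """Split one key/value pair once, on either a space or an equals sign."""
    # Drop the double quotes first, then split at the first " " or "=".
    key, value = re.split(r" |=", kv.replace('"', ""), maxsplit=1)
    return key, value

print(split_pair('gene_id "STRG.1"'))  # space-separated, as in a typical GTF
print(split_pair("ID=STRG.1"))         # equals-separated, as in the file discussed here
```

Both calls return a (key, value) tuple, so the same code path can serve either style of attribute string.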

First, define the function that splits the last column

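import re
import pandas as pd

def to_rows(anno):
    rowdicts = []
    # Sanity check on the first attribute string: a non-string value
    # (for example NaN) has no .replace method and raises AttributeError.
    try:
        l = anno.head(1)
        for l in l:
            l.replace('"', '').replace(";", "").split()
    except AttributeError:
        raise Exception("Invalid attribute string: {l}. If the file is in GFF3 format, use pr.read_gff3 instead.".format(l=l))

    for l in anno:
        # Split the attributes on ";" (or "; "), skip empty pieces left by a
        # trailing semicolon, then split each pair once on a space or "=".
        rowdicts.append({kk[0]: kk[-1]
                         for kk in [re.split(' |=', kv.replace('""', '"NA"').replace('"', ''), maxsplit=1)
                                    for kv in re.split('; |;', l) if kv.strip()]})

    return pd.DataFrame.from_dict(rowdicts).set_index(anno.index)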

The function that reads the GTF file

def read_gtf_full(f, as_df=False, nrows=None, skiprows=0):
    # Store the low-cardinality columns as categoricals to save memory.
    dtypes = {
        "Chromosome": "category",
        "Feature": "category",
        "Strand": "category"
    }

    names = "Chromosome Source Feature Start End Score Strand Frame Attribute".split()

    # Read the file in chunks of 100,000 rows; comment lines starting with "#" are ignored.
    df_iter = pd.read_csv(
        f,
        sep="\t",
        header=None,
        names=names,
        dtype=dtypes,
        chunksize=int(1e5),
        skiprows=skiprows,
        nrows=nrows,
        comment="#")

    dfs = []
    for df in df_iter:
        # Expand the Attribute column into one column per key.
        extra = to_rows(df.Attribute)
        df = df.drop("Attribute", axis=1)
        ndf = pd.concat([df, extra], axis=1, sort=False)
        dfs.append(ndf)

    df = pd.concat(dfs, sort=False)
    # Shift Start down by one (1-based start -> 0-based start).
    df.loc[:, "Start"] = df.Start - 1

    if not as_df:
        return PyRanges(df)
    else:
        return df
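
The `Start - 1` step in the function above presumably converts the file's 1-based, inclusive coordinates into the 0-based, half-open intervals that PyRanges works with. A small sketch of that arithmetic, assuming this interpretation (the helper name `to_half_open` is hypothetical):

```python
def to_half_open(start, end):
    """Convert a 1-based inclusive interval to a 0-based half-open one."""
    # Only the start moves; the end is unchanged because it becomes exclusive.
    return start - 1, end

start, end = to_half_open(72141, 73256)
# The feature length is the same under both conventions.
print(start, end, end - start)
```

Under the half-open convention the length of a feature is simply `end - start`, which matches `End - Start + 1` in the original file.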

Reading the GTF file

import pyranges as pr
from pyranges import PyRanges
read_gtf_full("example02.gtf")

Contents of example02.gtf

##gff-version 3
# gffread v0.12.7
# gffread -E --keep-genes /mnt/shared/scratch/wguo/barkeRTD/stringtie/B1/Stringtie_B1.gtf -o 00.newgtf/B1/Stringtie_B1_new.gtf
chr1H_part_1    StringTie   gene    72141   73256   .   +   .   ID=STRG.1
chr1H_part_1    StringTie   transcript  72141   73256   1000    +   .   ID=STRG.1.1;Parent=STRG.1
chr1H_part_1    StringTie   exon    72141   72399   1000    +   .   Parent=STRG.1.1
chr1H_part_1    StringTie   exon    72822   73256   1000    +   .   Parent=STRG.1.1
chr1H_part_1    StringTie   gene    102332  103882  .   +   .   ID=STRG.2
chr1H_part_1    StringTie   transcript  102332  103882  1000    +   .   ID=STRG.2.1;Parent=STRG.2
chr1H_part_1    StringTie   exon    102332  103882  1000    +   .   Parent=STRG.2.1
chr1H_part_1    StringTie   transcript  102332  103750  1000    +   .   ID=STRG.2.2;Parent=STRG.2
chr1H_part_1    StringTie   exon    102332  103533  1000    +   .   Parent=STRG.2.2
chr1H_part_1    StringTie   exon    103640  103750  1000    +   .   Parent=STRG.2.2
chr1H_part_1    StringTie   gene    104391  108013  .   -   .   ID=STRG.3
chr1H_part_1    StringTie   transcript  104391  108013  1000    -   .   ID=STRG.3.4;Parent=STRG.3
