pyRanges documentation:
https://biocore-ntnu.github.io/pyranges/loadingcreating-pyranges.html
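In my own GTF file, each attribute key (for example ID) is joined to its value with an equals sign. Usually the key and value are separated by a space, so the function pyranges defines to split the attribute string uses the space as its delimiter. To read my file, we change the reading code slightly: we add the equals sign as a second delimiter.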
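To illustrate why one extra delimiter is enough, here is a minimal sketch that uses only the standard re module. The helper name split_attributes and the quoted GTF-style sample string are assumptions made for illustration; they are not part of pyranges. Each field is split on the first space or equals sign only, so both styles produce the same kind of dictionary.

```python
import re

def split_attributes(attribute):
    """Hypothetical helper: parse one attribute string into a dict."""
    pairs = {}
    for field in re.split(r";\s*", attribute):
        if not field:  # empty piece left behind by a trailing ";"
            continue
        # Split on the first space or "=" only, then drop surrounding quotes.
        parts = re.split(r" |=", field, maxsplit=1)
        pairs[parts[0]] = parts[-1].strip('"')
    return pairs

space_style = 'gene_id "STRG.1"; transcript_id "STRG.1.1";'  # illustrative sample
equals_style = "ID=STRG.1.1;Parent=STRG.1"                    # from example02.gtf

print(split_attributes(space_style))
print(split_attributes(equals_style))
```

The key=value style with ";" and no trailing space is what GFF3 files use, which is why the field separator pattern also has to accept a bare ";".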
First, define the function that splits the last (attribute) column:
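import re
import pandas as pd

def to_rows(anno):
    rowdicts = []
    # Sanity check: the first attribute value must be a string.
    try:
        for l in anno.head(1):
            l.replace('"', '').replace(";", "").split()
    except AttributeError:
        raise Exception("Invalid attribute string: {l}. If the file is in GFF3 format, use pr.read_gff3 instead.".format(l=l))
    for l in anno:
        # Split fields on ";" and each key/value pair on the first space or "=".
        # Empty fields (left by a trailing ";") are skipped.
        rowdicts.append({kk[0]: kk[-1]
                         for kk in [re.split(' |=', kv.replace('""', '"NA"').replace('"', ''), maxsplit=1)
                                    for kv in re.split('; |;', l) if kv.strip()]})
    return pd.DataFrame.from_dict(rowdicts).set_index(anno.index)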
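Each row can carry a different set of keys (in the example file, gene rows have only ID and exon rows have only Parent). When the list of dictionaries becomes a DataFrame, the columns are the union of all keys and pandas fills the missing values with NaN. The plain-Python sketch below mimics that alignment with None in place of NaN; the function name align_rows is hypothetical and is here only to show the idea.

```python
def align_rows(rowdicts):
    """Return (columns, rows) with every row padded to the same columns."""
    columns = []
    for row in rowdicts:
        for key in row:
            if key not in columns:
                columns.append(key)  # keep keys in first-seen order
    # Missing keys become None, standing in for the NaN that pandas would use.
    rows = [[row.get(col) for col in columns] for row in rowdicts]
    return columns, rows

rowdicts = [
    {"ID": "STRG.1"},                        # gene row
    {"ID": "STRG.1.1", "Parent": "STRG.1"},  # transcript row
    {"Parent": "STRG.1.1"},                  # exon row
]
columns, rows = align_rows(rowdicts)
print(columns)
for row in rows:
    print(row)
```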
The function that reads the GTF file:
def read_gtf_full(f, as_df=False, nrows=None, skiprows=0):
    dtypes = {
        "Chromosome": "category",
        "Feature": "category",
        "Strand": "category"
    }
    names = "Chromosome Source Feature Start End Score Strand Frame Attribute".split()
    # Read the file in chunks; text after "#" is ignored, so header lines are skipped.
    df_iter = pd.read_csv(
        f,
        sep="\t",
        header=None,
        names=names,
        dtype=dtypes,
        chunksize=int(1e5),
        skiprows=skiprows,
        nrows=nrows,
        comment="#")
    dfs = []
    for df in df_iter:
        # Expand the Attribute column into one column per key.
        extra = to_rows(df.Attribute)
        df = df.drop("Attribute", axis=1)
        ndf = pd.concat([df, extra], axis=1, sort=False)
        dfs.append(ndf)
    df = pd.concat(dfs, sort=False)
    # Shift the 1-based start to the 0-based start used by PyRanges.
    df.loc[:, "Start"] = df.Start - 1
    if not as_df:
        return PyRanges(df)
    else:
        return df
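The Start - 1 step is there because GTF/GFF coordinates are 1-based with an inclusive end, while PyRanges stores intervals as 0-based and half-open. A minimal sketch of that conversion in plain Python follows; the function name gtf_to_zero_based is hypothetical.

```python
def gtf_to_zero_based(start, end):
    """Convert a 1-based inclusive interval to a 0-based half-open one."""
    # Only the start shifts: the inclusive 1-based end equals the exclusive 0-based end.
    return start - 1, end

# First exon in example02.gtf: 72141 to 72399
start, end = gtf_to_zero_based(72141, 72399)
print(start, end)   # 72140 72399
print(end - start)  # 259, the exon length in bases
```

With half-open intervals the length is simply end minus start, with no extra + 1.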
Read the GTF file:
import pyranges as pr
from pyranges import PyRanges
read_gtf_full("example02.gtf")
Contents of example02.gtf:
##gff-version 3
# gffread v0.12.7
# gffread -E --keep-genes /mnt/shared/scratch/wguo/barkeRTD/stringtie/B1/Stringtie_B1.gtf -o 00.newgtf/B1/Stringtie_B1_new.gtf
chr1H_part_1 StringTie gene 72141 73256 . + . ID=STRG.1
chr1H_part_1 StringTie transcript 72141 73256 1000 + . ID=STRG.1.1;Parent=STRG.1
chr1H_part_1 StringTie exon 72141 72399 1000 + . Parent=STRG.1.1
chr1H_part_1 StringTie exon 72822 73256 1000 + . Parent=STRG.1.1
chr1H_part_1 StringTie gene 102332 103882 . + . ID=STRG.2
chr1H_part_1 StringTie transcript 102332 103882 1000 + . ID=STRG.2.1;Parent=STRG.2
chr1H_part_1 StringTie exon 102332 103882 1000 + . Parent=STRG.2.1
chr1H_part_1 StringTie transcript 102332 103750 1000 + . ID=STRG.2.2;Parent=STRG.2
chr1H_part_1 StringTie exon 102332 103533 1000 + . Parent=STRG.2.2
chr1H_part_1 StringTie exon 103640 103750 1000 + . Parent=STRG.2.2
chr1H_part_1 StringTie gene 104391 108013 . - . ID=STRG.3
chr1H_part_1 StringTie transcript 104391 108013 1000 - . ID=STRG.3.4;Parent=STRG.3
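You are welcome to follow my WeChat public account, 小明的数据分析笔记本. It mainly shares: 1. simple examples of data analysis and data visualization in R and Python; 2. reading notes on transcriptomics, genomics, and population genetics papers related to horticultural plants; 3. introductory bioinformatics learning materials and my own study notes.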