以下是分析和学习过程中遇到的值得记录一下的生信习题库
题目1:下载最新版的KEGG注释文本文件,编写脚本整理成kegg的pathway的ID与基因ID的对应格式。
#数据下载参考技能树教程http://www.bio-info-trainee.com/2550.html
我的解题思路:因为C开头的是KEGG pathway ID, D开头的是基因ID,C开头的第二列拿出来当作变量值赋给a,碰到D开头的第二列就把这俩一块输出,\t分隔,直到遇见下一个C开头的KEGG pathway ID,循环直到文件末
awk '{if ($0 ~ /^C/) a=$2; else if ($0 ~ /^D/) print a"\t"$2}' hsa00001.keg | less
00010 3101
00010 3098
00010 3099
00010 80201
00010 2645
00010 2821
00010 5213
00010 5214
00010 5211
00010 2203
00010 8789
00010 230
00010 226
00010 229
00010 7167
00010 2597
00010 26330
题目2: 解析go-basic.obo为tab键分隔的三列表
原始go-basic.obo是这样
format-version: 1.2
data-version: releases/2020-08-11
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
remark: cvs version: use data-version
remark: Includes Ontology(OntologyID(OntologyIRI(<http://purl.obolibrary.org/obo/go/never_in_taxon.owl>))) [Axioms: 18 Logical Axioms: 0]
ontology: go
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria
and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms." [GOC:go_curators, GOC:isa_complete, GOC:jl, ISBN:0198506732]
subset: goslim_agr
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
synonym: "reproductive physiological process" EXACT []
xref: Wikipedia:Reproduction
is_a: GO:0008150 ! biological_process
[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw
]
is_a: GO:0007005 ! mitochondrion organization
[Term]
id: GO:0000003
想要转换成这样
Go | Decription | Level |
---|---|---|
GO:0000001 | mitochondrion inheritance | biological_process |
GO:0000002 | mitochondrial genome maintenance | biological_process |
GO:0000003 | reproduction | biological_process |
GO:0000005 | obsolete ribosomal chaperone activity | molecular_function |
GO:0000006 | high-affinity zinc transmembrane transporter activity | molecular_function |
GO:0000007 | low-affinity zinc ion transmembrane transporter activity | molecular_function |
- 我的思路很简单,找一下规律就可以看出来:想要转换成的第一列的内容在原文件的id行;想要转换成的第二列的内容在原文件的name行;想要转换成的第三列的内容在原文件的namespace行;
依次提出来,前两个按tab键分隔,最后一个默认换行符'\n'分隔就行了。 - But, 这里有一坑。。如果以python的startswith语句匹配的话,,startswith("name")直接把namespace包含了。。
很简单,以name:开头就行了🤢
#!/usr/bin/env python
#-*-utf-8-*-
import argparse
parser = argparse.ArgumentParser(description = "\nThis python script is used to parse the go-basic.obo file into tab delimited table", add_help = False, usage = "\n python3 parse_go.py -i [go-basic.obo] -o [go.txt] \n python3 parse_go.py --input [go-basic.obo] --output [go.txt]")
required = parser.add_argument_group("Required options")
optional = parser.add_argument_group("Optional options")
required.add_argument("-i", "--input", metavar = "[go-basic.obo]", help = "input file format: go_obo", required = True)
required.add_argument("-o", "--output", metavar = "[stat.txt]", help = "output file format: tab delimited table", required = True)
optional.add_argument("-h", "--help", action = "help", help = "help.info")
args = parser.parse_args()
with open(args.input,"r") as fi:
with open(args.output,"w") as fo:
print("GO\tDescription\tLevel",file = fo)
for line in fi:
if line.startswith("id"):
print(''.join(line.split(" ")[1]).strip(), file = fo, end = '\t')
elif line.startswith("name:"):
print(' '.join(line.split(" ")[1:]).strip(), file = fo, end = '\t')
elif line.startswith("namespace"):
print(line.split(" ")[1].strip(), file = fo)
统计人类外显子区域的长度
#!/usr/bin/env python3
#2020.8
#-*-utf-8-*-
import argparse
parser = argparse.ArgumentParser(description = '\nThis is a python3 script used to count the length of human genome extron', add_help = False, usage = '\npython3 extron_len.py -i [human_extron.txt]')
required = parser.add_argument_group('Required options')
optional = parser.add_argument_group('Optional options')
required.add_argument('-i', metavar = '[human_extron.txt]', help = 'input: extron can be downloaded at https://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS.current.txt', required = True)
optional.add_argument('-h','--help', action = 'help', help = 'help.info')
args = parser.parse_args()
extron = {}
with open(args.i,'r') as f:
next(f) #跳过第一行
for line in f:
line = line.split('\t')
if line[9] != '-':
extron_cor = line[9].lstrip('[').rstrip(']').split(', ')
for i in range(len(extron_cor)):
start = extron_cor[i].split('-')[0]
extron[start] = extron_cor[i].split('-')[1]
length = 0
for i,j in extron.items():
length += int(j.strip("'")) - int(i.strip("'")) + 1
print(length)
网友评论