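Machine learning work often involves datasets stored in formats such as JSON, MAT, and HDF5 (.h5), each with its own nested label structure.
Knowing a dataset's structure, format, and data types makes it much easier to process.
I wrote a small, general-purpose program for this.
Here it is used to inspect the JSON annotations of the MSCOCO dataset; data at the same nesting level is printed with the same indentation.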
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 6 22:23:17 2018
@author: BigFly
"""
import json


def process_dict(obj, level):
    # Print every key, then recurse into its value one level deeper.
    print("<dict>")
    for key in obj.keys():
        print(" " * level, "\"%s\"" % key, end=": ")
        process(obj[key], level + 1)


def process_list(obj, level):
    print("<list>", " len=", len(obj))
    samplenum = 1  # how many items of each list to show
    for idx in range(min(samplenum, len(obj))):
        print(" " * level, "item", idx, end=": ")
        process(obj[idx], level + 1)
    if len(obj) > samplenum:
        print(" " * level, "item ...")


def process_str(obj, level):
    print("<str>", obj)


def process_num(obj, level):
    print("<num>", obj)


# Dispatch table: maps the exact type of an object to its handler.
switch = {
    dict: process_dict,
    list: process_list,
    str: process_str,
    int: process_num,
    float: process_num,
}


def process(obj, level=0):
    obj_typ = type(obj)
    handler = switch.get(obj_typ)
    if handler is None:
        print("ERROR: NO ", obj_typ)
        return
    handler(obj, level + 1)


# The output shown in this post comes from the val2017 file.
path = "E:\\dataset\\MSCOCO\\annotations_trainval2017\\annotations\\instances_val2017.json"
# path = "E:\\dataset\\MSCOCO\\annotations_trainval2017\\annotations\\instances_train2017.json"

# Read the whole file; readline() would stop at the first newline.
with open(path, encoding="utf-8") as f:
    jsonstr = f.read()
print("jsonstr", type(jsonstr), len(jsonstr))
annotations = json.loads(jsonstr)

# Inspect the structure of annotations
process(annotations)  # ['licenses', 'categories', 'annotations', 'info', 'images']
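Five types are handled here; to handle another type, add a handler for it in the same way.
Python has no switch-case statement (the match statement only arrived in Python 3.10), so a dict that maps types to handler functions serves the same purpose.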
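As a rough sketch of how further types could be added to such a dict-based dispatch, the example below handles `bool` and `None`, which is what JSON `true`/`false`/`null` become after `json.loads`. The names `describe`, `HANDLERS`, and the `describe_*` functions are hypothetical and not part of the program above. They return strings instead of printing so the result is easy to check.

```python
# Sketch only: a dict used as a switch-case, extended with two more JSON types.
# All names here (describe, HANDLERS, describe_*) are illustrative assumptions.

def describe_bool(obj):
    return "<bool> %s" % obj


def describe_none(obj):
    return "<null>"


def describe_num(obj):
    return "<num> %s" % obj


def describe_str(obj):
    return "<str> %s" % obj


HANDLERS = {
    bool: describe_bool,        # exact-type lookup: bool needs its own entry
    type(None): describe_none,  # JSON null
    int: describe_num,
    float: describe_num,
    str: describe_str,
}


def describe(obj):
    # Look up the handler by exact type; report unknown types instead of failing.
    handler = HANDLERS.get(type(obj))
    if handler is None:
        return "ERROR: NO %s" % type(obj)
    return handler(obj)


print(describe(True))
print(describe(None))
print(describe(3.5))
print(describe({1, 2}))
```

Because the lookup uses `type(obj)` rather than `isinstance`, `True` is not treated as an `int`; without the `bool` entry it would fall through to the error branch.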
Output:
<dict>
  licenses: <list> len= 8
    item 0: <dict>
      name: <str> Attribution-NonCommercial-ShareAlike License
      id: <num> 1
      url: <str> http://creativecommons.org/licenses/by-nc-sa/2.0/
    item ...
  categories: <list> len= 80
    item 0: <dict>
      supercategory: <str> person
      name: <str> person
      id: <num> 1
    item ...
  annotations: <list> len= 36781
    item 0: <dict>
      id: <num> 1768
      bbox: <list> len= 4
        item 0: <num> 473.07
        item ...
      image_id: <num> 289343
      iscrowd: <num> 0
      area: <num> 702.1057499999998
      category_id: <num> 18
      segmentation: <list> len= 1
        item 0: <list> len= 134
          item 0: <num> 510.66
          item ...
    item ...
  info: <dict>
    version: <str> 1.0
    date_created: <str> 2017/09/01
    description: <str> COCO 2017 Dataset
    year: <num> 2017
    contributor: <str> COCO Consortium
    url: <str> http://cocodataset.org
  images: <list> len= 5000
    item 0: <dict>
      file_name: <str> 000000397133.jpg
      id: <num> 397133
      date_captured: <str> 2013-11-14 17:02:52
      license: <num> 4
      height: <num> 427
      flickr_url: <str> http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg
      coco_url: <str> http://images.cocodataset.org/val2017/000000397133.jpg
      width: <num> 640
    item ...
The output makes it clear that annotations is a dict with 5 keys, and it shows the type and a sample of the contents of each entry.
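The same top-level observation can also be checked programmatically. The sketch below is only an illustration: `summarize_top_level` is a hypothetical helper, and `sample` is a tiny made-up dict that reuses the five key names from the output, not the real COCO file.

```python
def summarize_top_level(data):
    """Return (key, type name, length or None) for each top-level key of a dict."""
    summary = []
    for key, value in data.items():
        # Only containers and strings have a meaningful length here.
        size = len(value) if isinstance(value, (list, dict, str)) else None
        summary.append((key, type(value).__name__, size))
    return summary


# Stand-in data: same top-level keys as the COCO annotations, values are made up.
sample = {
    "licenses": [{"id": 1}],
    "categories": [{"id": 1, "name": "person"}],
    "annotations": [],
    "info": {"year": 2017},
    "images": [{"id": 397133}],
}

for key, type_name, size in summarize_top_level(sample):
    print(key, type_name, size)
```

Passing the dict returned by `json.loads` in place of `sample` would list the real keys with their types and lengths.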