Preface
In the first part of this series, we introduced how to read and write JSON files with Python's standard library json.
In this part, we cover the JSON file and string read/write operations provided by pandas.
1 Writing JSON
A Series or DataFrame can be converted to a valid JSON string with the to_json method.
The optional parameters are:
- path_or_buf: the file path or buffer to write to; if omitted, the result is returned as a string
- orient:
  - Series: defaults to index; allowed values are [split, records, index, table]
  - DataFrame: defaults to columns; allowed values are [split, records, index, columns, values, table]
- date_format: type of date conversion; epoch means timestamp, iso means ISO8601
- double_precision: the number of decimal places to use when encoding floating-point values; default is 10
- force_ascii: force encoded strings to be ASCII; default is True
- date_unit: the time unit to encode to; governs timestamp and ISO8601 precision. 's', 'ms', 'us' and 'ns' represent seconds, milliseconds, microseconds and nanoseconds respectively; default is 'ms'
- default_handler: the handler called when an object cannot otherwise be converted to a suitable format for JSON; it takes the object to be converted and returns a serializable version of it
- lines: if orient=records, write each record as a separate line of JSON

Note: NaN, NaT and None will be converted to null, and datetime objects will be converted according to the date_format and date_unit parameters.
In [197]: json = dfj.to_json()
In [198]: json
Out[198]: '{"A":{"0":-1.2945235903,"1":0.2766617129,"2":-0.0139597524,"3":-0.0061535699,"4":0.8957173022},"B":{"0":0.4137381054,"1":-0.472034511,"2":-0.3625429925,"3":-0.923060654,"4":0.8052440254}}'
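The note above (NaN/None become null) and the double_precision parameter can be sketched in a couple of lines, assuming only the standard pandas API:

```python
import numpy as np
import pandas as pd

# NaN and None are written as null, and double_precision controls
# how many decimal places are kept when encoding floats.
df = pd.DataFrame({"A": [3.14159265, np.nan], "B": [None, 2.0]})
print(df.to_json())                    # NaN/None appear as null
print(df.to_json(double_precision=3))  # floats rounded to 3 decimals
```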
1.1 The orient option
There are a number of different options for the format of the resulting JSON file or string. Consider the following DataFrame and Series:
In [199]: dfjo = pd.DataFrame(
.....: dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
.....: columns=list("ABC"),
.....: index=list("xyz"),
.....: )
.....:
In [200]: dfjo
Out[200]:
A B C
x 1 4 7
y 2 5 8
z 3 6 9
In [201]: sjo = pd.Series(dict(x=15, y=16, z=17), name="D")
In [202]: sjo
Out[202]:
x 15
y 16
z 17
Name: D, dtype: int64
- columns: the default for DataFrame; serializes the data as nested JSON objects keyed by column
In [203]: dfjo.to_json(orient="columns")
Out[203]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'
# Not available for Series
- index: the default for Series; serializes the data keyed by index labels, similar to column oriented
In [204]: dfjo.to_json(orient="index")
Out[204]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'
In [205]: sjo.to_json(orient="index")
Out[205]: '{"x":15,"y":16,"z":17}'
- records: serializes the data to a JSON array of column -> value records; index labels are not included. Useful for passing data to JavaScript plotting libraries
In [206]: dfjo.to_json(orient="records")
Out[206]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'
In [207]: sjo.to_json(orient="records")
Out[207]: '[15,16,17]'
- values: a bare-bones option that serializes only the values to a nested JSON array; column and index labels are not included
In [208]: dfjo.to_json(orient="values")
Out[208]: '[[1,4,7],[2,5,8],[3,6,9]]'
# Not available for Series
- split: serializes to a JSON object containing separate entries for values, index and columns; the Series name is also included
In [209]: dfjo.to_json(orient="split")
Out[209]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'
In [210]: sjo.to_json(orient="split")
Out[210]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'
- table: serializes to the JSON Table Schema, allowing the preservation of metadata including, but not limited to, dtypes and index names
>>> sjo.to_json(orient='table')
'{"schema":{"fields":[{"name":"index","type":"string"},{"name":"D","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":"x","D":15},{"index":"y","D":16},{"index":"z","D":17}]}'
>>> dfjo.to_json(orient='table')
'{"schema":{"fields":[{"name":"index","type":"string"},{"name":"A","type":"integer"},{"name":"B","type":"integer"},{"name":"C","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":"x","A":1,"B":4,"C":7},{"index":"y","A":2,"B":5,"C":8},{"index":"z","A":3,"B":6,"C":9}]}'
Note: these serialization options do not preserve the order of index and column labels across a round trip. If you wish to preserve label ordering, use the split option, as it uses ordered containers.
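A minimal sketch of that note: round-tripping through orient="split" keeps a non-alphabetical column order intact.

```python
from io import StringIO

import pandas as pd

# Columns deliberately not in sorted order.
df = pd.DataFrame([[1, 2], [3, 4]], columns=["b", "a"])
s = df.to_json(orient="split")
back = pd.read_json(StringIO(s), orient="split")
print(list(back.columns))  # ['b', 'a']
```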
1.2 Date handling
Writing in ISO date format:
In [211]: dfd = pd.DataFrame(np.random.randn(5, 2), columns=list("AB"))
In [212]: dfd["date"] = pd.Timestamp("20130101")
In [213]: dfd = dfd.sort_index(axis=1, ascending=False)
In [214]: json = dfd.to_json(date_format="iso")
In [215]: json
Out[215]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
Writing in ISO date format, with microseconds:
In [216]: json = dfd.to_json(date_format="iso", date_unit="us")
In [217]: json
Out[217]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z","2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-01T00:00:00.000000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
Epoch timestamps, in seconds:
In [218]: json = dfd.to_json(date_format="epoch", date_unit="s")
In [219]: json
Out[219]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4":1356998400},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
Writing to a file, with a date index and date columns:
In [220]: dfj2 = dfj.copy()
In [221]: dfj2["date"] = pd.Timestamp("20130101")
In [222]: dfj2["ints"] = list(range(5))
In [223]: dfj2["bools"] = True
In [224]: dfj2.index = pd.date_range("20130101", periods=5)
In [225]: dfj2.to_json("test.json")
In [226]: with open("test.json") as fh:
.....: print(fh.read())
.....:
{"A":{"1356998400000":-1.2945235903,"1357084800000":0.2766617129,"1357171200000":-0.0139597524,"1357257600000":-0.0061535699,"1357344000000":0.8957173022},"B":{"1356998400000":0.4137381054,"1357084800000":-0.472034511,"1357171200000":-0.3625429925,"1357257600000":-0.923060654,"1357344000000":0.8052440254},"date":{"1356998400000":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,"1357344000000":true}}
1.3 Fallback behavior
If the JSON serializer cannot handle the container contents directly, it falls back in the following manner:

- if the dtype is unsupported (e.g. np.complex_) then the default_handler, if provided, will be called for each value; otherwise an exception is raised
- if an object is unsupported, it will attempt the following:
  - check if the object has defined a toDict method and call it; a toDict method should return a dict, which will then be JSON serialized
  - invoke the default_handler if one was provided
  - convert the object to a dict by traversing its contents; however, this will often fail with an OverflowError or give unexpected results

In general, the best approach for unsupported objects or dtypes is to provide a default_handler:
>>> pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json()  # raises
RuntimeError: Unhandled numpy dtype 15
This can be handled by specifying a simple default_handler:
In [227]: pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)
Out[227]: '{"0":{"0":"(1+0j)","1":"(2+0j)","2":"(1+2j)"}}'
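The handler is not limited to str; it can return any serializable object. A sketch with a hypothetical complex_handler (the function name and mapping are illustrative, not part of the pandas API) that turns complex numbers into a dict:

```python
import pandas as pd

def complex_handler(obj):
    # For an unsupported dtype (here complex128) the handler is
    # applied to every value in the frame.
    if isinstance(obj, complex):
        return {"real": obj.real, "imag": obj.imag}
    return str(obj)

df = pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)])
print(df.to_json(default_handler=complex_handler))
```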
2 Reading JSON
Reading a JSON string into a pandas object can take a number of parameters.
The parser will attempt to parse a DataFrame if typ is not supplied or is None.
Setting typ=series forces the result to be a Series object.
- filepath_or_buffer: a string, file handle, or StringIO
- typ: the type of object to recover, Series or DataFrame
- orient: same as above
- dtype: if True, infer dtypes; if False, don't infer dtypes at all; default is True
- convert_axes: boolean, try to convert the axes to the proper dtypes; default is True
- convert_dates: a list of columns to parse for dates; if True, attempt to parse date-like columns; default is True
- keep_default_dates: boolean, default True; if parsing dates, parse the default date-like columns
- numpy: direct decoding to NumPy arrays; default is False, and only numeric data is supported. Note also that the JSON ordering must be the same for each record if numpy=True
- precise_float: boolean, whether to use the higher-precision string-to-double conversion function; default is False
- date_unit: same as above
- lines: read the file as one JSON object per line
- encoding: the encoding to use
- chunksize: when used in combination with lines=True, return a JsonReader which reads in chunksize lines per iteration
Note: the numpy parameter is deprecated as of version 1.0.0 and will raise a FutureWarning.
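As a quick sketch of the typ parameter described above:

```python
from io import StringIO

import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
j = s.to_json()  # '{"a":1,"b":2,"c":3}'

# Default parsing would yield a DataFrame; typ="series" forces a Series.
ser = pd.read_json(StringIO(j), typ="series")
print(type(ser).__name__)  # Series
```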
2.1 Data conversion
Reading from a JSON string:
In [228]: pd.read_json(json)
Out[228]:
date B A
0 2013-01-01 2.565646 -1.206412
1 2013-01-01 1.340309 1.431256
2 2013-01-01 -0.226169 -1.170299
3 2013-01-01 0.813850 0.410835
4 2013-01-01 -0.827317 0.132003
Reading from a file:
In [229]: pd.read_json("test.json")
Out[229]:
A B date ints bools
2013-01-01 -1.294524 0.413738 2013-01-01 0 True
2013-01-02 0.276662 -0.472035 2013-01-01 1 True
2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
2013-01-05 0.895717 0.805244 2013-01-01 4 True
Don't convert any data:
In [230]: pd.read_json("test.json", dtype=object).dtypes
Out[230]:
A object
B object
date object
ints object
bools object
dtype: object
Specify the dtypes to convert to:
In [231]: pd.read_json("test.json", dtype={"A": "float32", "bools": "int8"}).dtypes
Out[231]:
A float32
B float64
date datetime64[ns]
ints int64
bools int8
dtype: object
Preserving string indices:
In [232]: si = pd.DataFrame(
.....: np.zeros((4, 4)), columns=list(range(4)), index=[str(i) for i in range(4)]
.....: )
.....:
In [233]: si
Out[233]:
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
In [234]: si.index
Out[234]: Index(['0', '1', '2', '3'], dtype='object')
In [235]: si.columns
Out[235]: Int64Index([0, 1, 2, 3], dtype='int64')
In [236]: json = si.to_json()
In [237]: sij = pd.read_json(json, convert_axes=False)
In [238]: sij
Out[238]:
0 1 2 3
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
In [239]: sij.index
Out[239]: Index(['0', '1', '2', '3'], dtype='object')
In [240]: sij.columns
Out[240]: Index(['0', '1', '2', '3'], dtype='object')
Dates written in nanoseconds need to be read back in nanoseconds:
In [241]: json = dfj2.to_json(date_unit="ns")
# Try to parse timestamps as milliseconds -> Won't Work
In [242]: dfju = pd.read_json(json, date_unit="ms")
In [243]: dfju
Out[243]:
A B date ints bools
1356998400000000000 -1.294524 0.413738 1356998400000000000 0 True
1357084800000000000 0.276662 -0.472035 1356998400000000000 1 True
1357171200000000000 -0.013960 -0.362543 1356998400000000000 2 True
1357257600000000000 -0.006154 -0.923061 1356998400000000000 3 True
1357344000000000000 0.895717 0.805244 1356998400000000000 4 True
# Let pandas detect the correct precision
In [244]: dfju = pd.read_json(json)
In [245]: dfju
Out[245]:
A B date ints bools
2013-01-01 -1.294524 0.413738 2013-01-01 0 True
2013-01-02 0.276662 -0.472035 2013-01-01 1 True
2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
2013-01-05 0.895717 0.805244 2013-01-01 4 True
# Or specify that all timestamps are in nanoseconds
In [246]: dfju = pd.read_json(json, date_unit="ns")
In [247]: dfju
Out[247]:
A B date ints bools
2013-01-01 -1.294524 0.413738 2013-01-01 0 True
2013-01-02 0.276662 -0.472035 2013-01-01 1 True
2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
2013-01-05 0.895717 0.805244 2013-01-01 4 True
3 Normalization
pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into a flat table:
In [257]: data = [
.....: {"id": 1, "name": {"first": "Coleen", "last": "Volk"}},
.....: {"name": {"given": "Mose", "family": "Regner"}},
.....: {"id": 2, "name": "Faye Raker"},
.....: ]
.....:
In [258]: pd.json_normalize(data)
Out[258]:
id name.first name.last name.given name.family name
0 1.0 Coleen Volk NaN NaN NaN
1 NaN NaN NaN Mose Regner NaN
2 2.0 NaN NaN NaN NaN Faye Raker
In [259]: data = [
.....: {
.....: "state": "Florida",
.....: "shortname": "FL",
.....: "info": {"governor": "Rick Scott"},
.....: "county": [
.....: {"name": "Dade", "population": 12345},
.....: {"name": "Broward", "population": 40000},
.....: {"name": "Palm Beach", "population": 60000},
.....: ],
.....: },
.....: {
.....: "state": "Ohio",
.....: "shortname": "OH",
.....: "info": {"governor": "John Kasich"},
.....: "county": [
.....: {"name": "Summit", "population": 1234},
.....: {"name": "Cuyahoga", "population": 1337},
.....: ],
.....: },
.....: ]
.....:
In [260]: pd.json_normalize(data, "county", ["state", "shortname", ["info", "governor"]])
Out[260]:
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
The max_level parameter provides more control over which level to stop normalization at. With max_level=1, the following snippet normalizes up to the first nesting level of the supplied dict:
In [261]: data = [
.....: {
.....: "CreatedBy": {"Name": "User001"},
.....: "Lookup": {
.....: "TextField": "Some text",
.....: "UserField": {"Id": "ID001", "Name": "Name001"},
.....: },
.....: "Image": {"a": "b"},
.....: }
.....: ]
.....:
In [262]: pd.json_normalize(data, max_level=1)
Out[262]:
CreatedBy.Name Lookup.TextField Lookup.UserField Image.a
0 User001 Some text {'Id': 'ID001', 'Name': 'Name001'} b
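A related knob (a sketch, assuming the standard json_normalize signature): the sep argument controls how nested keys are joined into column names, defaulting to ".".

```python
import pandas as pd

data = [{"CreatedBy": {"Name": "User001"}}]
# Join nested keys with "_" instead of the default "."
print(pd.json_normalize(data, sep="_").columns.tolist())  # ['CreatedBy_Name']
```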
4 Line-delimited JSON
pandas is able to read and write line-delimited JSON files, which are common in data processing pipelines using Hadoop or Spark.
For line-delimited JSON files, pandas can also return an iterator which reads in chunksize lines at a time. This can be useful for large files or for reading from a stream.
In [263]: jsonl = """
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: """
.....:
In [264]: df = pd.read_json(jsonl, lines=True)
In [265]: df
Out[265]:
a b
0 1 2
1 3 4
In [266]: df.to_json(orient="records", lines=True)
Out[266]: '{"a":1,"b":2}\n{"a":3,"b":4}\n'
# reader is an iterator that returns ``chunksize`` lines each iteration
In [267]: with pd.read_json(StringIO(jsonl), lines=True, chunksize=1) as reader:
.....: reader
.....: for chunk in reader:
.....: print(chunk)
.....:
Empty DataFrame
Columns: []
Index: []
a b
0 1 2
a b
1 3 4
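Alongside chunksize, a sketch of the nrows parameter (assuming pandas >= 1.1), which reads only the first n lines of a line-delimited source:

```python
from io import StringIO

import pandas as pd

jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n{"a": 5, "b": 6}\n'
# nrows requires lines=True and stops after the first two records.
df = pd.read_json(StringIO(jsonl), lines=True, nrows=2)
print(len(df))  # 2
```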
5 Table schema
The table option of the orient parameter describes tabular data as a JSON object; the resulting object contains two fields, schema and data:
In [268]: df = pd.DataFrame(
.....: {
.....: "A": [1, 2, 3],
.....: "B": ["a", "b", "c"],
.....: "C": pd.date_range("2016-01-01", freq="d", periods=3),
.....: },
.....: index=pd.Index(range(3), name="idx"),
.....: )
.....:
In [269]: df
Out[269]:
A B C
idx
0 1 a 2016-01-01
1 2 b 2016-01-02
2 3 c 2016-01-03
In [270]: df.to_json(orient="table", date_format="iso")
Out[270]: '{"schema":{"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"},"data":[{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'
The schema field contains the fields key, which itself contains a list of column-name-to-type pairs, including the index or multi-index.
If the index is unique, the schema field will also contain a primaryKey field.
The second field, data, contains the serialized data with orient=records.
A few notes on the table schema:
- The schema object contains a pandas_version field. This contains the version of pandas' dialect of the schema and will be incremented with each revision.
- All dates are converted to UTC when serializing.
In [271]: from pandas.io.json import build_table_schema
In [272]: s = pd.Series(pd.date_range("2016", periods=4))
In [273]: build_table_schema(s)
Out[273]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
- datetimes with a timezone add an additional field, tz, with the time zone name:
In [274]: s_tz = pd.Series(pd.date_range("2016", periods=12, tz="US/Central"))
In [275]: build_table_schema(s_tz)
Out[275]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime', 'tz': 'US/Central'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
- Categoricals use the any type and an enum constraint listing the set of possible values. In addition, an ordered field is included:
In [278]: s_cat = pd.Series(pd.Categorical(["a", "b", "a"]))
In [279]: build_table_schema(s_cat)
Out[279]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values',
'type': 'any',
'constraints': {'enum': ['a', 'b']},
'ordered': False}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
- A primaryKey field is included if the index is unique; otherwise it is omitted:
In [280]: s_dupe = pd.Series([1, 2], index=[1, 1])
In [281]: build_table_schema(s_dupe)
Out[281]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'pandas_version': '0.20.0'}
- primaryKey behaves the same with MultiIndexes; in this case the primaryKey is an array:
In [282]: s_multi = pd.Series(1, index=pd.MultiIndex.from_product([("a", "b"), (0, 1)]))
In [283]: build_table_schema(s_multi)
Out[283]:
{'fields': [{'name': 'level_0', 'type': 'string'},
{'name': 'level_1', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': FrozenList(['level_0', 'level_1']),
'pandas_version': '0.20.0'}
- Specifying orient="table" preserves metadata, keeping dtypes consistent between writing and reading:
In [284]: df = pd.DataFrame(
.....: {
.....: "foo": [1, 2, 3, 4],
.....: "bar": ["a", "b", "c", "d"],
.....: "baz": pd.date_range("2018-01-01", freq="d", periods=4),
.....: "qux": pd.Categorical(["a", "b", "c", "c"]),
.....: },
.....: index=pd.Index(range(4), name="idx"),
.....: )
.....:
In [285]: df
Out[285]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [286]: df.dtypes
Out[286]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
In [287]: df.to_json("test.json", orient="table")
In [288]: new_df = pd.read_json("test.json", orient="table")
In [289]: new_df
Out[289]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [290]: new_df.dtypes
Out[290]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
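The dtype round trip above can be checked in a few lines; a sketch using a categorical column, which the default orient would read back as a plain object column:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"qux": pd.Categorical(["a", "b", "a"])})
s = df.to_json(orient="table")
# The table schema carries the categorical metadata back.
new_df = pd.read_json(StringIO(s), orient="table")
print(new_df["qux"].dtype)  # category
```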