Python Data Processing (Part 21) — HDF5 Queries

Author: 名本无名 | Published 2021-02-24 11:12

    7 Queries

    7.1 Querying a table

    The select and delete operations take an optional where condition that can be specified to select/delete only a subset of the data.

    This allows one to retrieve a small portion of the data from a very large on-disk table.

    Comparison operations

    =, ==, !=, >, >=, <, <=
    

    Boolean expressions

    • |: or
    • &: and
    • (): for grouping

    Note

    • = will be automatically expanded to the comparison operator ==
    • ~ is the not operator, but can only be used in very limited circumstances
    • If a list or tuple of expressions is passed, they will be combined via &
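
    For instance, here is a minimal sketch of the last point (the file name and data are illustrative, and the PyTables backend is assumed to be installed): a list of terms is implicitly joined with &.

```python
import numpy as np
import pandas as pd

# small table with A and B stored as queryable data columns
df = pd.DataFrame(
    np.random.randn(10, 2),
    columns=["A", "B"],
    index=pd.date_range("20130101", periods=10),
)

with pd.HDFStore("tmp_terms.h5", mode="w") as store:
    store.append("df", df, format="table", data_columns=True)
    # a list of terms is combined via &
    r1 = store.select("df", ["A > 0", "B < 0"])
    # ... so it is equivalent to a single &-joined expression
    r2 = store.select("df", "A > 0 & B < 0")

assert r1.equals(r2)
```

    Either spelling selects exactly the rows where both conditions hold.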

    The following are valid expressions:

    'index >= date'
    "columns = ['A', 'D']"
    "columns in ['A', 'D']"
    'columns = A'
    'columns == A'
    "~(columns = ['A', 'B'])"
    'index > df.index[3] & string = "bar"'
    '(index > df.index[3] & index <= df.index[6]) | string = "bar"'
    "ts >= Timestamp('2012-02-01')"
    "major_axis>=20130101"
    

    Indexers are on the left-hand side of the sub-expression:

    • columns, major_axis, ts

    The right-hand side of the sub-expression can be:

    • A function that evaluates, e.g. Timestamp('2012-02-01')
    • A string, e.g. "bar"
    • A date-like value, e.g. 20130101, or "20130101"
    • A list, e.g. "['A', 'B']"
    • A variable defined in the local name space, e.g. date
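
    As a short sketch of the last point (the file name and cutoff value are illustrative), a locally defined variable is referenced directly by name on the right-hand side:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randn(8, 2),
    columns=["A", "B"],
    index=pd.date_range("20130101", periods=8),
)

# `cutoff` is a local variable, referenced by name in the expression
cutoff = pd.Timestamp("20130105")

with pd.HDFStore("tmp_rhs.h5", mode="w") as store:
    store.append("df", df, format="table")
    result = store.select("df", "index >= cutoff")

assert len(result) == 4               # 2013-01-05 through 2013-01-08
assert (result.index >= cutoff).all()
```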

    Note

    Passing a string to be evaluated by interpolating it into the query expression is not recommended. Instead, assign the string of interest to a variable and use that variable directly in the expression. For example:

    string = "HolyMoly'"
    store.select("df", "index == string")
    

    If you instead write

    string = "HolyMoly'"
    store.select('df', f'index == {string}')
    

    a SyntaxError will be raised, because the string variable itself contains a single quote inside the double quotes.

    If you must interpolate, use the '%r' format specifier:

    store.select("df", "index == %r" % string)
    

    We have the following data:

    In [400]: dfq = pd.DataFrame(
       .....:     np.random.randn(10, 4),
       .....:     columns=list("ABCD"),
       .....:     index=pd.date_range("20130101", periods=10),
       .....: )
       .....: 
    
    In [401]: store.append("dfq", dfq, format="table", data_columns=True)
    

    Use boolean expressions, with in-line function evaluation:

    In [402]: store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")
    Out[402]: 
                       A         B
    2013-01-05 -1.083889  0.811865
    2013-01-06 -0.402227  1.618922
    2013-01-07  0.948196  0.183573
    2013-01-08 -1.043530 -0.708145
    2013-01-09  0.813949  1.508891
    2013-01-10  1.176488 -1.246093
    

    Use inline column references:

    In [403]: store.select("dfq", where="A>0 or C>0")
    Out[403]: 
                       A         B         C         D
    2013-01-01  0.620028  0.159416 -0.263043 -0.639244
    2013-01-04 -0.536722  1.005707  0.296917  0.139796
    2013-01-05 -1.083889  0.811865  1.648435 -0.164377
    2013-01-07  0.948196  0.183573  0.145277  0.308146
    2013-01-08 -1.043530 -0.708145  1.430905 -0.850136
    2013-01-09  0.813949  1.508891 -1.556154  0.187597
    2013-01-10  1.176488 -1.246093 -0.002726 -0.444249
    

    The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing 'columns=list_of_columns_to_filter':

    In [404]: store.select("df", "columns=['A', 'B']")
    Out[404]: 
                       A         B
    2000-01-01  1.334065  0.521036
    2000-01-02 -1.613932  1.088104
    2000-01-03 -0.585314 -0.275038
    2000-01-04  0.632369 -1.249657
    2000-01-05  1.060617 -0.143682
    2000-01-06  3.050329  1.317933
    2000-01-07 -0.539452 -0.771133
    2000-01-08  0.649464 -1.736427
    

    The start and stop parameters can be specified to limit the range of rows returned. These are interpreted in terms of the total number of rows in the table:

    >>> store.select("dfq", "columns=['A', 'B']", start=3, stop=5)
                A           B
    2013-01-04  -0.483155   1.143564
    2013-01-05  0.218290    -1.391789
    
    7.2 Querying timedelta64[ns]

    You can store and query using the timedelta64[ns] type.

    Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional), and unit can be D, s, ms, us, or ns.

    Here is an example:

    In [405]: from datetime import timedelta
    
    In [406]: dftd = pd.DataFrame(
       .....:     {
       .....:         "A": pd.Timestamp("20130101"),
       .....:         "B": [
       .....:             pd.Timestamp("20130101") + timedelta(days=i, seconds=10)
       .....:             for i in range(10)
       .....:         ],
       .....:     }
       .....: )
       .....: 
    
    In [407]: dftd["C"] = dftd["A"] - dftd["B"]
    
    In [408]: dftd
    Out[408]: 
               A                   B                  C
    0 2013-01-01 2013-01-01 00:00:10  -1 days +23:59:50
    1 2013-01-01 2013-01-02 00:00:10  -2 days +23:59:50
    2 2013-01-01 2013-01-03 00:00:10  -3 days +23:59:50
    3 2013-01-01 2013-01-04 00:00:10  -4 days +23:59:50
    4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
    5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
    6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
    7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
    8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
    9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
    
    In [409]: store.append("dftd", dftd, data_columns=True)
    
    In [410]: store.select("dftd", "C<'-3.5D'")
    Out[410]: 
               A                   B                  C
    4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
    5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
    6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
    7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
    8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
    9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
    
    7.3 Querying a MultiIndex

    Selecting from a MultiIndex can be achieved by using the name of the level:

    In [411]: df_mi.index.names
    Out[411]: FrozenList(['foo', 'bar'])
    
    In [412]: store.select("df_mi", "foo=baz and bar=two")
    Out[412]: 
                    A         B         C
    foo bar                              
    baz two  1.064908  1.778161 -0.913867
    

    If the MultiIndex level names are None, the levels are automatically made available via the default names level_n, where n is the level of the MultiIndex you want to select from:

    In [413]: index = pd.MultiIndex(
       .....:     levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
       .....:     codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
       .....: )
       .....: 
    
    In [414]: df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
    
    In [415]: df_mi_2
    Out[415]: 
                      A         B         C
    foo one    0.856838  1.491776  0.001283
        two    0.701816 -1.097917  0.102588
        three  0.661740  0.443531  0.559313
    bar one   -0.459055 -1.222598 -0.455304
        two   -0.781163  0.826204 -0.530057
    baz two    0.296135  1.366810  1.073372
        three -0.994957  0.755314  2.119746
    qux one   -2.628174 -0.089460 -0.133636
        two    0.337920 -0.634027  0.421107
        three  0.604303  1.053434  1.109090
    
    In [416]: store.append("df_mi_2", df_mi_2)
    
    # the levels are automatically included as data columns with keyword level_n
    In [417]: store.select("df_mi_2", "level_0=foo and level_1=two")
    Out[417]: 
                    A         B         C
    foo two  0.701816 -1.097917  0.102588
    
    7.4 Indexing

    You can create/modify an index for a table with create_table_index after data is already in the table.

    Creating a table index is highly encouraged. It can greatly speed up your queries when you use an indexed column in the where of a select.

    Note

    Indexes are automatically created on the indexables and on any data columns you specify. This behavior can be turned off by passing index=False to append.

    # we have automagically already created an index (in the first section)
    In [418]: i = store.root.df.table.cols.index.index
    
    In [419]: i.optlevel, i.kind
    Out[419]: (6, 'medium')
    
    # change an index by passing new parameters
    In [420]: store.create_table_index("df", optlevel=9, kind="full")
    
    In [421]: i = store.root.df.table.cols.index.index
    
    In [422]: i.optlevel, i.kind
    Out[422]: (9, 'full')
    

    Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate the index at the end:

    In [423]: df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
    
    In [424]: df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
    
    In [425]: st = pd.HDFStore("appends.h5", mode="w")
    
    In [426]: st.append("df", df_1, data_columns=["B"], index=False)
    
    In [427]: st.append("df", df_2, data_columns=["B"], index=False)
    
    In [428]: st.get_storer("df").table
    Out[428]: 
    /df/table (Table(20,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
    

    Then create the index when finished appending:

    In [429]: st.create_table_index("df", columns=["B"], optlevel=9, kind="full")
    
    In [430]: st.get_storer("df").table
    Out[430]: 
    /df/table (Table(20,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoindex := True
      colindexes := {
        "B": Index(9, full, shuffle, zlib(1)).is_csi=True}
    
    In [431]: st.close()
    
    7.5 Query via data columns

    Sometimes you may want to perform query operations only on certain columns. You can specify data_columns=True to force all columns to be data_columns:

    In [432]: df_dc = df.copy()
    
    In [433]: df_dc["string"] = "foo"
    
    In [434]: df_dc.loc[df_dc.index[4:6], "string"] = np.nan
    
    In [435]: df_dc.loc[df_dc.index[7:9], "string"] = "bar"
    
    In [436]: df_dc["string2"] = "cool"
    
    In [437]: df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0
    
    In [438]: df_dc
    Out[438]: 
                       A         B         C string string2
    2000-01-01  1.334065  0.521036  0.930384    foo    cool
    2000-01-02 -1.613932  1.000000  1.000000    foo    cool
    2000-01-03 -0.585314  1.000000  1.000000    foo    cool
    2000-01-04  0.632369 -1.249657  0.975593    foo    cool
    2000-01-05  1.060617 -0.143682  0.218423    NaN    cool
    2000-01-06  3.050329  1.317933 -0.963725    NaN    cool
    2000-01-07 -0.539452 -0.771133  0.023751    foo    cool
    2000-01-08  0.649464 -1.736427  0.197288    bar    cool
    
    # on-disk operations
    In [439]: store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])
    
    In [440]: store.select("df_dc", where="B > 0")
    Out[440]: 
                       A         B         C string string2
    2000-01-01  1.334065  0.521036  0.930384    foo    cool
    2000-01-02 -1.613932  1.000000  1.000000    foo    cool
    2000-01-03 -0.585314  1.000000  1.000000    foo    cool
    2000-01-06  3.050329  1.317933 -0.963725    NaN    cool
    
    # getting creative
    In [441]: store.select("df_dc", "B > 0 & C > 0 & string == foo")
    Out[441]: 
                       A         B         C string string2
    2000-01-01  1.334065  0.521036  0.930384    foo    cool
    2000-01-02 -1.613932  1.000000  1.000000    foo    cool
    2000-01-03 -0.585314  1.000000  1.000000    foo    cool
    
    # this is in-memory version of this type of selection
    In [442]: df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")]
    Out[442]: 
                       A         B         C string string2
    2000-01-01  1.334065  0.521036  0.930384    foo    cool
    2000-01-02 -1.613932  1.000000  1.000000    foo    cool
    2000-01-03 -0.585314  1.000000  1.000000    foo    cool
    
    # we have automagically created this index and the B/C/string/string2
    # columns are stored separately as ``PyTables`` columns
    In [443]: store.root.df_dc.table
    Out[443]: 
    /df_dc/table (Table(8,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2),
      "C": Float64Col(shape=(), dflt=0.0, pos=3),
      "string": StringCol(itemsize=3, shape=(), dflt=b'', pos=4),
      "string2": StringCol(itemsize=4, shape=(), dflt=b'', pos=5)}
      byteorder := 'little'
      chunkshape := (1680,)
      autoindex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "B": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "C": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "string": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "string2": Index(6, medium, shuffle, zlib(1)).is_csi=False}
    
    7.6 Iterator

    You can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to return an iterator over the results. The default is 50,000 rows returned per chunk.

    In [444]: for df in store.select("df", chunksize=3):
       .....:     print(df)
       .....: 
                       A         B         C
    2000-01-01  1.334065  0.521036  0.930384
    2000-01-02 -1.613932  1.088104 -0.632963
    2000-01-03 -0.585314 -0.275038 -0.937512
                       A         B         C
    2000-01-04  0.632369 -1.249657  0.975593
    2000-01-05  1.060617 -0.143682  0.218423
    2000-01-06  3.050329  1.317933 -0.963725
                       A         B         C
    2000-01-07 -0.539452 -0.771133  0.023751
    2000-01-08  0.649464 -1.736427  0.197288
    

    Note

    You can also use the iterator with read_hdf, which will automatically close the store when finished iterating:

    for df in pd.read_hdf("store.h5", "df", chunksize=3):
        print(df)
    

    Note

    The chunksize keyword applies to the source rows. So if you are doing a query, the total rows will be subdivided into chunks and the query applied to each, so the returned iterator may yield chunks of unequal size.

    Here is a recipe for generating equally-sized chunks:

    In [445]: dfeq = pd.DataFrame({"number": np.arange(1, 11)})
    
    In [446]: dfeq
    Out[446]: 
       number
    0       1
    1       2
    2       3
    3       4
    4       5
    5       6
    6       7
    7       8
    8       9
    9      10
    
    In [447]: store.append("dfeq", dfeq, data_columns=["number"])
    
    In [448]: def chunks(l, n):
       .....:     return [l[i: i + n] for i in range(0, len(l), n)]
       .....: 
    
    In [449]: evens = [2, 4, 6, 8, 10]
    
    In [450]: coordinates = store.select_as_coordinates("dfeq", "number=evens")
    
    In [451]: for c in chunks(coordinates, 2):
       .....:     print(store.select("dfeq", where=c))
       .....: 
       number
    1       2
    3       4
       number
    5       6
    7       8
       number
    9      10
    
    7.7 Advanced queries
    7.7.1 Select a single column

    To retrieve a single indexable or data column, use the select_column method. This will, for example, enable you to get the index very quickly. The result is a Series, indexed by an increasing row number.

    The method does not currently accept the where selector:

    In [452]: store.select_column("df_dc", "index")
    Out[452]: 
    0   2000-01-01
    1   2000-01-02
    2   2000-01-03
    3   2000-01-04
    4   2000-01-05
    5   2000-01-06
    6   2000-01-07
    7   2000-01-08
    Name: index, dtype: datetime64[ns]
    
    In [453]: store.select_column("df_dc", "string")
    Out[453]: 
    0    foo
    1    foo
    2    foo
    3    foo
    4    NaN
    5    NaN
    6    foo
    7    bar
    Name: string, dtype: object
    
    7.7.2 Selecting coordinates

    Sometimes you may want to get the coordinates (a.k.a. the index locations) of your query. This returns an Int64Index of the resulting locations, which can be passed to where for subsequent operations:

    In [454]: df_coord = pd.DataFrame(
       .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
       .....: )
       .....: 
    
    In [455]: store.append("df_coord", df_coord)
    
    In [456]: c = store.select_as_coordinates("df_coord", "index > 20020101")
    
    In [457]: c
    Out[457]: 
    Int64Index([732, 733, 734, 735, 736, 737, 738, 739, 740, 741,
                ...
                990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
               dtype='int64', length=268)
    
    In [458]: store.select("df_coord", where=c)
    Out[458]: 
                       0         1
    2002-01-02 -0.165548  0.646989
    2002-01-03  0.782753 -0.123409
    2002-01-04 -0.391932 -0.740915
    2002-01-05  1.211070 -0.668715
    2002-01-06  0.341987 -0.685867
    ...              ...       ...
    2002-09-22  1.788110 -0.405908
    2002-09-23 -0.801912  0.768460
    2002-09-24  0.466284 -0.457411
    2002-09-25 -0.364060  0.785367
    2002-09-26 -1.463093  1.187315
    
    [268 rows x 2 columns]
    
    7.7.3 Selecting using a where mask

    Sometimes your query can involve creating a list of rows to select. Usually this mask would be a resulting index from an indexing operation:

    In [459]: df_mask = pd.DataFrame(
       .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
       .....: )
       .....: 
    
    In [460]: store.append("df_mask", df_mask)
    
    In [461]: c = store.select_column("df_mask", "index")
    
    In [462]: where = c[pd.DatetimeIndex(c).month == 5].index
    
    In [463]: store.select("df_mask", where=where)
    Out[463]: 
                       0         1
    2000-05-01  1.735883 -2.615261
    2000-05-02  0.422173  2.425154
    2000-05-03  0.632453 -0.165640
    2000-05-04 -1.017207 -0.005696
    2000-05-05  0.299606  0.070606
    ...              ...       ...
    2002-05-27  0.234503  1.199126
    2002-05-28 -3.021833 -1.016828
    2002-05-29  0.522794  0.063465
    2002-05-30 -1.653736  0.031709
    2002-05-31 -0.968402 -0.393583
    
    [93 rows x 2 columns]
    
    7.7.4 Storer object

    If you want to inspect the stored object, retrieve it via get_storer. You could use this programmatically to, say, get the number of rows in an object:

    In [464]: store.get_storer("df_dc").nrows
    Out[464]: 8
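
    A hedged sketch of one way to use this (the file name, table, and chunk size below are illustrative): nrows can drive manual fixed-size reads via the start/stop parameters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(25)})

with pd.HDFStore("tmp_nrows.h5", mode="w") as store:
    store.append("df", df, format="table")
    # total number of stored rows, obtained from the storer
    nrows = store.get_storer("df").nrows
    chunksize = 10
    # slice the table into fixed-size pieces using start/stop
    pieces = [
        store.select("df", start=i, stop=i + chunksize)
        for i in range(0, nrows, chunksize)
    ]

assert nrows == 25
assert [len(p) for p in pieces] == [10, 10, 5]
assert pd.concat(pieces).equals(df)
```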
    
    7.8 Multiple table queries

    The methods append_to_multiple and select_as_multiple can perform appending/selecting from multiple tables at once.

    The idea is to have one table (call it the selector table) that indexes most/all of the columns, and to perform your queries on it. The other tables are data tables with an index matching the selector table's index.

    You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.

    The append_to_multiple method splits a given single DataFrame into multiple tables according to a dict that maps table names to a list of the columns you want in that table. If None is used in place of a list, that table will have the remaining unspecified columns of the given DataFrame.

    The selector argument defines which table is the selector table (i.e. the one you can make queries from). The dropna argument drops rows from the input DataFrame to ensure the tables are synchronized: if a row written to one of the tables is entirely np.nan, that row will be dropped from all tables.

    If dropna is False, the user is responsible for synchronizing the tables. Remember that entirely np.nan rows are not written to the HDFStore, so if you choose dropna=False, some tables may have more rows than others, and therefore select_as_multiple may not work, or it may return unexpected results.

    In [465]: df_mt = pd.DataFrame(
       .....:     np.random.randn(8, 6),
       .....:     index=pd.date_range("1/1/2000", periods=8),
       .....:     columns=["A", "B", "C", "D", "E", "F"],
       .....: )
       .....: 
    
    In [466]: df_mt["foo"] = "bar"
    
    In [467]: df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan
    
    # you can also create the tables individually
    In [468]: store.append_to_multiple(
       .....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
       .....: )
       .....: 
    
    In [469]: store
    Out[469]: 
    <class 'pandas.io.pytables.HDFStore'>
    File path: store.h5
    
    # individual tables were created
    In [470]: store.select("df1_mt")
    Out[470]: 
                       A         B
    2000-01-01  1.251079 -0.362628
    2000-01-02       NaN       NaN
    2000-01-03  0.719421 -0.448886
    2000-01-04  1.140998 -0.877922
    2000-01-05  1.043605  1.798494
    2000-01-06 -0.467812 -0.027965
    2000-01-07  0.150568  0.754820
    2000-01-08 -0.596306 -0.910022
    
    In [471]: store.select("df2_mt")
    Out[471]: 
                       C         D         E         F  foo
    2000-01-01  1.602451 -0.221229  0.712403  0.465927  bar
    2000-01-02 -0.525571  0.851566 -0.681308 -0.549386  bar
    2000-01-03 -0.044171  1.396628  1.041242 -1.588171  bar
    2000-01-04  0.463351 -0.861042 -2.192841 -1.025263  bar
    2000-01-05 -1.954845 -1.712882 -0.204377 -1.608953  bar
    2000-01-06  1.601542 -0.417884 -2.757922 -0.307713  bar
    2000-01-07 -1.935461  1.007668  0.079529 -1.459471  bar
    2000-01-08 -1.057072 -0.864360 -1.124870  1.732966  bar
    
    # as a multiple
    In [472]: store.select_as_multiple(
       .....:     ["df1_mt", "df2_mt"],
       .....:     where=["A>0", "B>0"],
       .....:     selector="df1_mt",
       .....: )
       .....: 
    Out[472]: 
                       A         B         C         D         E         F  foo
    2000-01-05  1.043605  1.798494 -1.954845 -1.712882 -0.204377 -1.608953  bar
    2000-01-07  0.150568  0.754820 -1.935461  1.007668  0.079529 -1.459471  bar
    


        Source: https://www.haomeiwen.com/subject/fozjfltx.html