Python Data Processing (Part 21) — HDF5 Queries

Author: 名本无名 | Published 2021-02-24 11:12

    7 Queries

    7.1 Querying a table

    The select and delete operations take an optional where condition that can be specified to select/delete only a subset of the data.

    This allows one to retrieve a small portion of the data from a very large on-disk table.

    Comparison operations

    =, ==, !=, >, >=, <, <=
    

    Boolean expressions

    • |: or
    • &: and
    • (): for grouping

    Note

    • = will be automatically expanded to the comparison operator ==
    • ~ is the not operator, but can only be used in very limited circumstances
    • If a list or tuple of expressions is passed, they will be combined via &
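
    For instance, here is a minimal sketch of the last point (the file name and data are illustrative, and the PyTables backend is assumed to be installed): a list of terms is implicitly joined with &.

```python
import numpy as np
import pandas as pd

# small table with A and B stored as queryable data columns
df = pd.DataFrame(
    np.random.randn(10, 2),
    columns=["A", "B"],
    index=pd.date_range("20130101", periods=10),
)

with pd.HDFStore("tmp_terms.h5", mode="w") as store:
    store.append("df", df, format="table", data_columns=True)
    # a list of terms is combined via &
    r1 = store.select("df", ["A > 0", "B < 0"])
    # ... so it is equivalent to a single &-joined expression
    r2 = store.select("df", "A > 0 & B < 0")

assert r1.equals(r2)
```

    Either spelling selects exactly the rows where both conditions hold.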

    The following are valid expressions:

    'index >= date'
    "columns = ['A', 'D']"
    "columns in ['A', 'D']"
    'columns = A'
    'columns == A'
    "~(columns = ['A', 'B'])"
    'index > df.index[3] & string = "bar"'
    '(index > df.index[3] & index <= df.index[6]) | string = "bar"'
    "ts >= Timestamp('2012-02-01')"
    "major_axis>=20130101"
    

    Indexers are on the left-hand side of the sub-expression:

    • columns, major_axis, ts

    The right-hand side of the sub-expression can be:

    • A function that evaluates, e.g. Timestamp('2012-02-01')
    • A string, e.g. "bar"
    • A date-like value, e.g. 20130101, or "20130101"
    • A list, e.g. "['A', 'B']"
    • A variable defined in the local name space, e.g. date
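
    As a short sketch of the last point (the file name and cutoff value are illustrative), a locally defined variable is referenced directly by name on the right-hand side:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randn(8, 2),
    columns=["A", "B"],
    index=pd.date_range("20130101", periods=8),
)

# `cutoff` is a local variable, referenced by name in the expression
cutoff = pd.Timestamp("20130105")

with pd.HDFStore("tmp_rhs.h5", mode="w") as store:
    store.append("df", df, format="table")
    result = store.select("df", "index >= cutoff")

assert len(result) == 4               # 2013-01-05 through 2013-01-08
assert (result.index >= cutoff).all()
```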

    Note

    Passing a string to be evaluated by interpolating it into the query expression is not recommended. Instead, assign the string of interest to a variable and use that variable directly in the expression. For example:

    string = "HolyMoly'"
    store.select("df", "index == string")
    

    If you instead write

    string = "HolyMoly'"
    store.select('df', f'index == {string}')
    

    a SyntaxError will be raised, because the string variable itself contains a single quote inside the double quotes.

    If you must interpolate, use the '%r' format specifier:

    store.select("df", "index == %r" % string)
    

    We have the following data:

    In [400]: dfq = pd.DataFrame(
       .....:     np.random.randn(10, 4),
       .....:     columns=list("ABCD"),
       .....:     index=pd.date_range("20130101", periods=10),
       .....: )
       .....: 
    
    In [401]: store.append("dfq", dfq, format="table", data_columns=True)
    

    Use boolean expressions, with in-line function evaluation:

    In [402]: store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")
    Out[402]: 
                       A         B
    2013-01-05 -1.083889  0.811865
    2013-01-06 -0.402227  1.618922
    2013-01-07  0.948196  0.183573
    2013-01-08 -1.043530 -0.708145
    2013-01-09  0.813949  1.508891
    2013-01-10  1.176488 -1.246093
    

    Use inline column references:

    In [403]: store.select("dfq", where="A>0 or C>0")
    Out[403]: 
                       A         B         C         D
    2013-01-01  0.620028  0.159416 -0.263043 -0.639244
    2013-01-04 -0.536722  1.005707  0.296917  0.139796
    2013-01-05 -1.083889  0.811865  1.648435 -0.164377
    2013-01-07  0.948196  0.183573  0.145277  0.308146
    2013-01-08 -1.043530 -0.708145  1.430905 -0.850136
    2013-01-09  0.813949  1.508891 -1.556154  0.187597
    2013-01-10  1.176488 -1.246093 -0.002726 -0.444249
    

    The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing 'columns=list_of_columns_to_filter':

    In [404]: store.select("df", "columns=['A', 'B']")
    Out[404]: 
                       A         B
    2000-01-01  1.334065  0.521036
    2000-01-02 -1.613932  1.088104
    2000-01-03 -0.585314 -0.275038
    2000-01-04  0.632369 -1.249657
    2000-01-05  1.060617 -0.143682
    2000-01-06  3.050329  1.317933
    2000-01-07 -0.539452 -0.771133
    2000-01-08  0.649464 -1.736427
    

    The start and stop parameters can be specified to limit the range of rows returned. These are interpreted in terms of the total number of rows in the table:

    >>> store.select("dfq", "columns=['A', 'B']", start=3, stop=5)
                A           B
    2013-01-04  -0.483155   1.143564
    2013-01-05  0.218290    -1.391789
    
    7.2 Querying timedelta64[ns]

    You can store and query using the timedelta64[ns] type.

    Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional), and unit can be D, s, ms, us, or ns.

    Here is an example:

    In [405]: from datetime import timedelta
    
    In [406]: dftd = pd.DataFrame(
       .....:     {
       .....:         "A": pd.Timestamp("20130101"),
       .....:         "B": [
       .....:             pd.Timestamp("20130101") + timedelta(days=i, seconds=10)
       .....:             for i in range(10)
       .....:         ],
       .....:     }
       .....: )
       .....: 
    
    In [407]: dftd["C"] = dftd["A"] - dftd["B"]
    
    In [408]: dftd
    Out[408]: 
               A                   B                  C
    0 2013-01-01 2013-01-01 00:00:10  -1 days +23:59:50
    1 2013-01-01 2013-01-02 00:00:10  -2 days +23:59:50
    2 2013-01-01 2013-01-03 00:00:10  -3 days +23:59:50
    3 2013-01-01 2013-01-04 00:00:10  -4 days +23:59:50
    4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
    5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
    6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
    7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
    8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
    9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
    
    In [409]: store.append("dftd", dftd, data_columns=True)
    
    In [410]: store.select("dftd", "C<'-3.5D'")
    Out[410]: 
               A                   B                  C
    4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
    5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
    6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
    7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
    8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
    9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
    
    7.3 Querying a MultiIndex

    Selecting from a MultiIndex can be achieved by using the name of the level:

    In [411]: df_mi.index.names
    Out[411]: FrozenList(['foo', 'bar'])
    
    In [412]: store.select("df_mi", "foo=baz and bar=two")
    Out[412]: 
                    A         B         C
    foo bar                              
    baz two  1.064908  1.778161 -0.913867
    

    If the MultiIndex level names are None, the levels are automatically made available via the default names level_n, where n is the level of the MultiIndex you want to select from:

    In [413]: index = pd.MultiIndex(
       .....:     levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
       .....:     codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
       .....: )
       .....: 
    
    In [414]: df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
    
    In [415]: df_mi_2
    Out[415]: 
                      A         B         C
    foo one    0.856838  1.491776  0.001283
        two    0.701816 -1.097917  0.102588
        three  0.661740  0.443531  0.559313
    bar one   -0.459055 -1.222598 -0.455304
        two   -0.781163  0.826204 -0.530057
    baz two    0.296135  1.366810  1.073372
        three -0.994957  0.755314  2.119746
    qux one   -2.628174 -0.089460 -0.133636
        two    0.337920 -0.634027  0.421107
        three  0.604303  1.053434  1.109090
    
    In [416]: store.append("df_mi_2", df_mi_2)
    
    # the levels are automatically included as data columns with keyword level_n
    In [417]: store.select("df_mi_2", "level_0=foo and level_1=two")
    Out[417]: 
                    A         B         C
    foo two  0.701816 -1.097917  0.102588
    
    7.4 Indexing

    You can create/modify an index for a table with create_table_index after data is already in the table.

    Creating a table index is highly encouraged. It can greatly speed up your queries when you use an indexed column in the where of a select.

    Note

    Indexes are automatically created on the indexables and on any data columns you specify. This behavior can be turned off by passing index=False to append.

    # we have automagically already created an index (in the first section)
    In [418]: i = store.root.df.table.cols.index.index
    
    In [419]: i.optlevel, i.kind
    Out[419]: (6, 'medium')
    
    # change an index by passing new parameters
    In [420]: store.create_table_index("df", optlevel=9, kind="full")
    
    In [421]: i = store.root.df.table.cols.index.index
    
    In [422]: i.optlevel, i.kind
    Out[422]: (9, 'full')
    

    Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate the index at the end:

    In [423]: df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
    
    In [424]: df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
    
    In [425]: st = pd.HDFStore("appends.h5", mode="w")
    
    In [426]: st.append("df", df_1, data_columns=["B"], index=False)
    
    In [427]: st.append("df", df_2, data_columns=["B"], index=False)
    
    In [428]: st.get_storer("df").table
    Out[428]: 
    /df/table (Table(20,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
    

    Then create the index when finished appending:

    In [429]: st.create_table_index("df", columns=["B"], optlevel=9, kind="full")
    
    In [430]: st.get_storer("df").table
    Out[430]: 
    /df/table (Table(20,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoindex := True
      colindexes := {
        "B": Index(9, full, shuffle, zlib(1)).is_csi=True}
    
    In [431]: st.close()
    
    7.5 Query via data columns

    Sometimes you may want to perform query operations only on certain columns. You can specify data_columns=True to force all columns to be data_columns:

    In [432]: df_dc = df.copy()
    
    In [433]: df_dc["string"] = "foo"
    
    In [434]: df_dc.loc[df_dc.index[4:6], "string"] = np.nan
    
    In [435]: df_dc.loc[df_dc.index[7:9], "string"] = "bar"
    
    In [436]: df_dc["string2"] = "cool"
    
    In [437]: df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0
    
    In [438]: df_dc
    Out[438]: 
                       A         B         C string string2
    2000-01-01  1.334065  0.521036  0.930384    foo    cool
    2000-01-02 -1.613932  1.000000  1.000000    foo    cool
    2000-01-03 -0.585314  1.000000  1.000000    foo    cool
    2000-01-04  0.632369 -1.249657  0.975593    foo    cool
    2000-01-05  1.060617 -0.143682  0.218423    NaN    cool
    2000-01-06  3.050329  1.317933 -0.963725    NaN    cool
    2000-01-07 -0.539452 -0.771133  0.023751    foo    cool
    2000-01-08  0.649464 -1.736427  0.197288    bar    cool
    
    # on-disk operations
    In [439]: store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])
    
    In [440]: store.select("df_dc", where="B > 0")
    Out[440]: 
                       A         B         C string string2
    2000-01-01  1.334065  0.521036  0.930384    foo    cool
    2000-01-02 -1.613932  1.000000  1.000000    foo    cool
    2000-01-03 -0.585314  1.000000  1.000000    foo    cool
    2000-01-06  3.050329  1.317933 -0.963725    NaN    cool
    
    # getting creative
    In [441]: store.select("df_dc", "B > 0 & C > 0 & string == foo")
    Out[441]: 
                       A         B         C string string2
    2000-01-01  1.334065  0.521036  0.930384    foo    cool
    2000-01-02 -1.613932  1.000000  1.000000    foo    cool
    2000-01-03 -0.585314  1.000000  1.000000    foo    cool
    
    # this is in-memory version of this type of selection
    In [442]: df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")]
    Out[442]: 
                       A         B         C string string2
    2000-01-01  1.334065  0.521036  0.930384    foo    cool
    2000-01-02 -1.613932  1.000000  1.000000    foo    cool
    2000-01-03 -0.585314  1.000000  1.000000    foo    cool
    
    # we have automagically created this index and the B/C/string/string2
    # columns are stored separately as ``PyTables`` columns
    In [443]: store.root.df_dc.table
    Out[443]: 
    /df_dc/table (Table(8,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2),
      "C": Float64Col(shape=(), dflt=0.0, pos=3),
      "string": StringCol(itemsize=3, shape=(), dflt=b'', pos=4),
      "string2": StringCol(itemsize=4, shape=(), dflt=b'', pos=5)}
      byteorder := 'little'
      chunkshape := (1680,)
      autoindex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "B": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "C": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "string": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "string2": Index(6, medium, shuffle, zlib(1)).is_csi=False}
    
    7.6 Iterator

    You can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to return an iterator over the results. The default is 50,000 rows returned per chunk.

    In [444]: for df in store.select("df", chunksize=3):
       .....:     print(df)
       .....: 
                       A         B         C
    2000-01-01  1.334065  0.521036  0.930384
    2000-01-02 -1.613932  1.088104 -0.632963
    2000-01-03 -0.585314 -0.275038 -0.937512
                       A         B         C
    2000-01-04  0.632369 -1.249657  0.975593
    2000-01-05  1.060617 -0.143682  0.218423
    2000-01-06  3.050329  1.317933 -0.963725
                       A         B         C
    2000-01-07 -0.539452 -0.771133  0.023751
    2000-01-08  0.649464 -1.736427  0.197288
    

    Note

    You can also use the iterator with read_hdf, which will automatically close the store when finished iterating:

    for df in pd.read_hdf("store.h5", "df", chunksize=3):
        print(df)
    

    Note

    The chunksize keyword applies to the source rows. So if you are doing a query, the total rows will be subdivided into chunks and the query applied to each, so the returned iterator may yield chunks of unequal size.

    Here is a recipe for generating equally-sized chunks:

    In [445]: dfeq = pd.DataFrame({"number": np.arange(1, 11)})
    
    In [446]: dfeq
    Out[446]: 
       number
    0       1
    1       2
    2       3
    3       4
    4       5
    5       6
    6       7
    7       8
    8       9
    9      10
    
    In [447]: store.append("dfeq", dfeq, data_columns=["number"])
    
    In [448]: def chunks(l, n):
       .....:     return [l[i: i + n] for i in range(0, len(l), n)]
       .....: 
    
    In [449]: evens = [2, 4, 6, 8, 10]
    
    In [450]: coordinates = store.select_as_coordinates("dfeq", "number=evens")
    
    In [451]: for c in chunks(coordinates, 2):
       .....:     print(store.select("dfeq", where=c))
       .....: 
       number
    1       2
    3       4
       number
    5       6
    7       8
       number
    9      10
    
    7.7 Advanced queries
    7.7.1 Select a single column

    To retrieve a single indexable or data column, use the select_column method. This will, for example, enable you to get the index very quickly. The result is a Series, indexed by an increasing row number.

    The method does not currently accept the where selector:

    In [452]: store.select_column("df_dc", "index")
    Out[452]: 
    0   2000-01-01
    1   2000-01-02
    2   2000-01-03
    3   2000-01-04
    4   2000-01-05
    5   2000-01-06
    6   2000-01-07
    7   2000-01-08
    Name: index, dtype: datetime64[ns]
    
    In [453]: store.select_column("df_dc", "string")
    Out[453]: 
    0    foo
    1    foo
    2    foo
    3    foo
    4    NaN
    5    NaN
    6    foo
    7    bar
    Name: string, dtype: object
    
    7.7.2 Selecting coordinates

    Sometimes you may want to get the coordinates (a.k.a. the index locations) of your query. This returns an Int64Index of the resulting locations, which can be passed to where for subsequent operations:

    In [454]: df_coord = pd.DataFrame(
       .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
       .....: )
       .....: 
    
    In [455]: store.append("df_coord", df_coord)
    
    In [456]: c = store.select_as_coordinates("df_coord", "index > 20020101")
    
    In [457]: c
    Out[457]: 
    Int64Index([732, 733, 734, 735, 736, 737, 738, 739, 740, 741,
                ...
                990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
               dtype='int64', length=268)
    
    In [458]: store.select("df_coord", where=c)
    Out[458]: 
                       0         1
    2002-01-02 -0.165548  0.646989
    2002-01-03  0.782753 -0.123409
    2002-01-04 -0.391932 -0.740915
    2002-01-05  1.211070 -0.668715
    2002-01-06  0.341987 -0.685867
    ...              ...       ...
    2002-09-22  1.788110 -0.405908
    2002-09-23 -0.801912  0.768460
    2002-09-24  0.466284 -0.457411
    2002-09-25 -0.364060  0.785367
    2002-09-26 -1.463093  1.187315
    
    [268 rows x 2 columns]
    
    7.7.3 Selecting using a where mask

    Sometimes your query can involve creating a list of rows to select. Usually this mask would be a resulting index from an indexing operation:

    In [459]: df_mask = pd.DataFrame(
       .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
       .....: )
       .....: 
    
    In [460]: store.append("df_mask", df_mask)
    
    In [461]: c = store.select_column("df_mask", "index")
    
    In [462]: where = c[pd.DatetimeIndex(c).month == 5].index
    
    In [463]: store.select("df_mask", where=where)
    Out[463]: 
                       0         1
    2000-05-01  1.735883 -2.615261
    2000-05-02  0.422173  2.425154
    2000-05-03  0.632453 -0.165640
    2000-05-04 -1.017207 -0.005696
    2000-05-05  0.299606  0.070606
    ...              ...       ...
    2002-05-27  0.234503  1.199126
    2002-05-28 -3.021833 -1.016828
    2002-05-29  0.522794  0.063465
    2002-05-30 -1.653736  0.031709
    2002-05-31 -0.968402 -0.393583
    
    [93 rows x 2 columns]
    
    7.7.4 Storer object

    If you want to inspect the stored object, retrieve it via get_storer. You could use this programmatically to, say, get the number of rows in an object:

    In [464]: store.get_storer("df_dc").nrows
    Out[464]: 8
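
    A hedged sketch of one way to use this (the file name, table, and chunk size below are illustrative): nrows can drive manual fixed-size reads via the start/stop parameters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(25)})

with pd.HDFStore("tmp_nrows.h5", mode="w") as store:
    store.append("df", df, format="table")
    # total number of stored rows, obtained from the storer
    nrows = store.get_storer("df").nrows
    chunksize = 10
    # slice the table into fixed-size pieces using start/stop
    pieces = [
        store.select("df", start=i, stop=i + chunksize)
        for i in range(0, nrows, chunksize)
    ]

assert nrows == 25
assert [len(p) for p in pieces] == [10, 10, 5]
assert pd.concat(pieces).equals(df)
```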
    
    7.8 Multiple table queries

    The methods append_to_multiple and select_as_multiple can perform appending/selecting from multiple tables at once.

    The idea is to have one table (call it the selector table) that indexes most/all of the columns, and to perform your queries on it. The other tables are data tables with an index matching the selector table's index.

    You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.

    The append_to_multiple method splits a given single DataFrame into multiple tables according to a dict that maps table names to a list of the columns you want in that table. If None is used in place of a list, that table will have the remaining unspecified columns of the given DataFrame.

    The selector argument defines which table is the selector table (i.e. the one you can make queries from). The dropna argument drops rows from the input DataFrame to ensure the tables are synchronized: if a row written to one of the tables is entirely np.nan, that row will be dropped from all tables.

    If dropna is False, the user is responsible for synchronizing the tables. Remember that entirely np.nan rows are not written to the HDFStore, so if you choose dropna=False, some tables may have more rows than others, and therefore select_as_multiple may not work, or it may return unexpected results.

    In [465]: df_mt = pd.DataFrame(
       .....:     np.random.randn(8, 6),
       .....:     index=pd.date_range("1/1/2000", periods=8),
       .....:     columns=["A", "B", "C", "D", "E", "F"],
       .....: )
       .....: 
    
    In [466]: df_mt["foo"] = "bar"
    
    In [467]: df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan
    
    # you can also create the tables individually
    In [468]: store.append_to_multiple(
       .....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
       .....: )
       .....: 
    
    In [469]: store
    Out[469]: 
    <class 'pandas.io.pytables.HDFStore'>
    File path: store.h5
    
    # individual tables were created
    In [470]: store.select("df1_mt")
    Out[470]: 
                       A         B
    2000-01-01  1.251079 -0.362628
    2000-01-02       NaN       NaN
    2000-01-03  0.719421 -0.448886
    2000-01-04  1.140998 -0.877922
    2000-01-05  1.043605  1.798494
    2000-01-06 -0.467812 -0.027965
    2000-01-07  0.150568  0.754820
    2000-01-08 -0.596306 -0.910022
    
    In [471]: store.select("df2_mt")
    Out[471]: 
                       C         D         E         F  foo
    2000-01-01  1.602451 -0.221229  0.712403  0.465927  bar
    2000-01-02 -0.525571  0.851566 -0.681308 -0.549386  bar
    2000-01-03 -0.044171  1.396628  1.041242 -1.588171  bar
    2000-01-04  0.463351 -0.861042 -2.192841 -1.025263  bar
    2000-01-05 -1.954845 -1.712882 -0.204377 -1.608953  bar
    2000-01-06  1.601542 -0.417884 -2.757922 -0.307713  bar
    2000-01-07 -1.935461  1.007668  0.079529 -1.459471  bar
    2000-01-08 -1.057072 -0.864360 -1.124870  1.732966  bar
    
    # as a multiple
    In [472]: store.select_as_multiple(
       .....:     ["df1_mt", "df2_mt"],
       .....:     where=["A>0", "B>0"],
       .....:     selector="df1_mt",
       .....: )
       .....: 
    Out[472]: 
                       A         B         C         D         E         F  foo
    2000-01-05  1.043605  1.798494 -1.954845 -1.712882 -0.204377 -1.608953  bar
    2000-01-07  0.150568  0.754820 -1.935461  1.007668  0.079529 -1.459471  bar
    


        Source: https://www.haomeiwen.com/subject/fozjfltx.html