美文网首页《Pandas 1.x Cookbook·第二版》
《Pandas 1.x Cookbook · 第二版》第03章

《Pandas 1.x Cookbook · 第二版》第03章

作者: SeanCheney | 来源:发表于2021-02-06 21:41 被阅读0次

    第01章 Pandas基础
    第02章 DataFrame基础运算
    第03章 创建和持久化DataFrame
    第04章 开始数据分析
    第05章 探索性数据分析
    第06章 选取数据子集
    第07章 过滤行
    第08章 索引对齐


    3.1 创建DataFrame

    使用平行的列表创建DataFrame

    >>> import pandas as pd
    >>> import numpy as np
    >>> fname = ["Paul", "John", "Richard", "George"]
    >>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
    >>> birth = [1942, 1940, 1940, 1943]
    

    创建字典:

    >>> people = {"first": fname, "last": lname, "birth": birth}
    

    用该字典创建DataFrame:

    >>> beatles = pd.DataFrame(people)
    >>> beatles
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
    

    原理

    当调用DataFrame构造器时,Pandas会创建一个RangeIndex对象:

    >>> beatles.index
    RangeIndex(start=0, stop=4, step=1)
    

    重新指定索引:

    >>> pd.DataFrame(people, index=["a", "b", "c", "d"])
         first       last  birth
    a     Paul  McCartney   1942
    b     John     Lennon   1940
    c  Richard    Starkey   1940
    d   George   Harrison   1943
    

    更多

    还可以用字典构成的列表构建DataFrame:

    >>> pd.DataFrame(
    ...     [
    ...         {
    ...             "first": "Paul",
    ...             "last": "McCartney",
    ...             "birth": 1942,
    ...         },
    ...         {
    ...             "first": "John",
    ...             "last": "Lennon",
    ...             "birth": 1940,
    ...         },
    ...         {
    ...             "first": "Richard",
    ...             "last": "Starkey",
    ...             "birth": 1940,
    ...         },
    ...         {
    ...             "first": "George",
    ...             "last": "Harrison",
    ...             "birth": 1943,
    ...         },
    ...     ]
    ... )
       birth    first       last
    0   1942     Paul  McCartney
    1   1940     John     Lennon
    2   1940  Richard    Starkey
    3   1943   George   Harrison
    

    使用columns参数指定列的顺序:

    >>> pd.DataFrame(
    ...     [
    ...         {
    ...             "first": "Paul",
    ...             "last": "McCartney",
    ...             "birth": 1942,
    ...         },
    ...         {
    ...             "first": "John",
    ...             "last": "Lennon",
    ...             "birth": 1940,
    ...         },
    ...         {
    ...             "first": "Richard",
    ...             "last": "Starkey",
    ...             "birth": 1940,
    ...         },
    ...         {
    ...             "first": "George",
    ...             "last": "Harrison",
    ...             "birth": 1943,
    ...         },
    ...     ],
    ...     columns=["last", "first", "birth"],
    ... )
            last    first  birth
    0  McCartney     Paul   1942
    1     Lennon     John   1940
    2    Starkey  Richard   1940
    3   Harrison   George   1943
    

    3.2 写入CSV

    将DataFrame写入CSV文件:

    使用.to_csv方法将DataFrame写入CSV文件:

    >>> beatles
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
    >>> from io import StringIO
    >>> fout = StringIO()
    >>> beatles.to_csv(fout)  # 使用文件名
    

    查看文件内容:

    >>> print(fout.getvalue())
    ,first,last,birth
    0,Paul,McCartney,1942
    1,John,Lennon,1940
    2,Richard,Starkey,1940
    3,George,Harrison,1943
    

    更多

    如果读取刚刚保存的CSV,会读入列名为Unnamed: 0的冗余列:

    >>> _ = fout.seek(0)
    >>> pd.read_csv(fout)
       Unnamed: 0    first       last  birth
    0           0     Paul  McCartney   1942
    1           1     John     Lennon   1940
    2           2  Richard    Starkey   1940
    3           3   George   Harrison   1943
    

    使用index_col参数,可以指定列:

    >>> _ = fout.seek(0)
    >>> pd.read_csv(fout, index_col=0)
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
    

    如果存CSV文件时,不想保存行索引,可以将index参数设为False

    >>> fout = StringIO()
    >>> beatles.to_csv(fout, index=False)
    >>> print(fout.getvalue())
    first,last,birth
    Paul,McCartney,1942
    John,Lennon,1940
    Richard,Starkey,1940
    George,Harrison,1943
    

    3.3 读取大CSV文件

    Pandas是在内存中处理文件的,通常来讲,内存的大小需要是文件大小的3至10倍。

    这里使用的是diamonds数据集。使用nrows参数读取1000行数据。

    >>> diamonds = pd.read_csv("data/diamonds.csv", nrows=1000)
    >>> diamonds
         carat      cut color clarity  ...  price     x     y     z
    0     0.23    Ideal     E     SI2  ...    326  3.95  3.98  2.43
    1     0.21  Premium     E     SI1  ...    326  3.89  3.84  2.31
    2     0.23     Good     E     VS1  ...    327  4.05  4.07  2.31
    3     0.29  Premium     I     VS2  ...    334  4.20  4.23  2.63
    4     0.31     Good     J     SI2  ...    335  4.34  4.35  2.75
    ..     ...      ...   ...     ...  ...    ...   ...   ...   ...
    995   0.54    Ideal     D    VVS2  ...   2897  5.30  5.34  3.26
    996   0.72    Ideal     E     SI1  ...   2897  5.69  5.74  3.57
    997   0.72     Good     F     VS1  ...   2897  5.82  5.89  3.48
    998   0.74  Premium     D     VS2  ...   2897  5.81  5.77  3.58
    999   1.12  Premium     J     SI2  ...   2898  6.68  6.61  4.03
    

    使用.info方法查看消耗的内存量:

    >>> diamonds.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 10 columns):
    carat      1000 non-null float64
    cut        1000 non-null object
    color      1000 non-null object
    clarity    1000 non-null object
    depth      1000 non-null float64
    table      1000 non-null float64
    price      1000 non-null int64
    x          1000 non-null float64
    y          1000 non-null float64
    z          1000 non-null float64
    dtypes: float64(6), int64(1), object(3)
    memory usage: 78.2+ KB
    

    可以看到1000行数据使用了78.2KB内存。如果有10亿行数据,则要占用78GB的内存。

    使用dtype参数,设置读取的数值类型:

    >>> diamonds2 = pd.read_csv(
    ...     "data/diamonds.csv",
    ...     nrows=1000,
    ...     dtype={
    ...         "carat": np.float32,
    ...         "depth": np.float32,
    ...         "table": np.float32,
    ...         "x": np.float32,
    ...         "y": np.float32,
    ...         "z": np.float32,
    ...         "price": np.int16,
    ...     },
    ... )
    >>> diamonds2.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 10 columns):
    carat      1000 non-null float32
    cut        1000 non-null object
    color      1000 non-null object
    clarity    1000 non-null object
    depth      1000 non-null float32
    table      1000 non-null float32
    price      1000 non-null int16
    x          1000 non-null float32
    y          1000 non-null float32
    z          1000 non-null float32
    dtypes: float32(6), int16(1), object(3)
    memory usage: 49.0+ KB
    

    改变了数值类型,对比下新的DataFrame和原先的DataFrame:

    >>> diamonds.describe()
                 carat        depth  ...            y            z
    count  1000.000000  1000.000000  ...  1000.000000  1000.000000
    mean      0.689280    61.722800  ...     5.599180     3.457530
    std       0.195291     1.758879  ...     0.611974     0.389819
    min       0.200000    53.000000  ...     3.750000     2.270000
    25%       0.700000    60.900000  ...     5.630000     3.450000
    50%       0.710000    61.800000  ...     5.760000     3.550000
    75%       0.790000    62.600000  ...     5.910000     3.640000
    max       1.270000    69.500000  ...     7.050000     4.330000
    >>> diamonds2.describe()
                 carat        depth  ...            y            z
    count  1000.000000  1000.000000  ...  1000.000000  1000.000000
    mean      0.689453    61.718750  ...     5.601562     3.457031
    std       0.195312     1.759766  ...     0.611816     0.389648
    min       0.199951    53.000000  ...     3.750000     2.269531
    25%       0.700195    60.906250  ...     5.628906     3.449219
    50%       0.709961    61.812500  ...     5.761719     3.550781
    75%       0.790039    62.593750  ...     5.910156     3.640625
    max       1.269531    69.500000  ...     7.050781     4.328125
    

    可以看到通过改变数据类型,节省了38%的内存。

    使用dtype参数,将数据类型改为category。使用.value_counts先统计数据个数:

    >>> diamonds2.cut.value_counts(
    Ideal       333
    Premium     290
    Very Good   226
    Good         89
    Fair         62
    Name: cut, dtype: int64
    >>> diamonds2.color.value_counts()
    E    240
    F    226
    G    139
    D    129
    H    125
    I     95
    J     46
    Name: color, dtype: int64
    >>> diamonds2.clarity.value_counts()
    SI1     306
    VS2     218
    VS1     159
    SI2     154
    VVS2     62
    VVS1     58
    I1       29
    IF       14
    Name: clarity, dtype: int64
    

    因为是低基数,将其转换为category,可以节省约37%的内存:

    >>> diamonds3 = pd.read_csv(
    ...     "data/diamonds.csv",
    ...     nrows=1000,
    ...      dtype={
    ...        "carat": np.float32,
    ...         "depth": np.float32,
    ...         "table": np.float32,
    ...         "x": np.float32,
    ...         "y": np.float32,
    ...         "z": np.float32,
    ...         "price": np.int16,
    ...         "cut": "category",
    ...         "color": "category",
    ...         "clarity": "category",
    ...     },
    ... )
    >>> diamonds3.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 10 columns):
    carat      1000 non-null float32
    cut        1000 non-null category
    color      1000 non-null category
    clarity    1000 non-null category
    depth      1000 non-null float32
    table      1000 non-null float32
    price      1000 non-null int16
    x          1000 non-null float32
    y          1000 non-null float32
    z          1000 non-null float32
    dtypes: category(3), float32(6), int16(1)
    memory usage: 29.4 KB
    

    使用参数usecols,可以指定加载哪些列。这里忽略了xyz三列:

    >>> cols = [
    ...     "carat",
    ...     "cut",
    ...     "color",
    ...     "clarity",
    ...     "depth",
    ...     "table",
    ...     "price",
    ... ]
    >>> diamonds4 = pd.read_csv(
    ...     "data/diamonds.csv",
    ...     nrows=1000,
    ...     dtype={
    ...         "carat": np.float32,
    ...        "depth": np.float32,
    ...         "table": np.float32,
    ...         "price": np.int16,
    ...         "cut": "category",
    ...         "color": "category",
    ...         "clarity": "category",
    ...     },
    ...     usecols=cols,
    ... )
    >>> diamonds4.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 7 columns):
    carat      1000 non-null float32
    cut        1000 non-null category
    color      1000 non-null category
    clarity    1000 non-null category
    depth      1000 non-null float32
    table      1000 non-null float32
    price      1000 non-null int16
    dtypes: category(3), float32(3), int16(1)
    memory usage: 17.7 KB
    

    现在使用的内存只有原始的21%。

    如果每次只处理数据的一部分,可以使用chunksize参数:

    >>> cols = [
    ...     "carat",
    ...     "cut",
    ...     "color",
    ...     "clarity",
    ...     "depth",
    ...     "table",
    ...     "price",
    ... ]
    >>> diamonds_iter = pd.read_csv(
    ...     "data/diamonds.csv",
    ...     nrows=1000,
    ...     dtype={
    ...         "carat": np.float32,
    ...         "depth": np.float32,
    ...         "table": np.float32,
    ...         "price": np.int16,
    ...         "cut": "category",
    ...         "color": "category",
    ...         "clarity": "category",
    ...     },
    ...     usecols=cols,
    ...     chunksize=200,
    ... )
    >>> def process(df):
    ...     return f"processed {df.size} items"
    >>> for chunk in diamonds_iter:
    ...     process(chunk)
    

    因为CSV文件不保存数据类型,Pandas需要推断每列的数据类型是什么。如果一列的值都是整数,并且没有缺失值,则Pandas将其认定为int64。如果一列是数值类型,但不是整数,或存在缺失值,Pandas使用的是float64。这两种数据类型占用的内存比较大。例如,如果所有数都在200以下,你可以使用一个小的数据类型,比如np.int16(或np.int8,如果都是正数)。

    如果某列都是非数值类型,Pandas会将其转换为object类型。object类型占用内存很多,因为它是将数据以Python字符串存储的,将类型改为category,可以大大节省空间,因为它对每个字符串只存储一次。

    更多

    如果价格使用int8,会导致丢失信息。你可以使用NumPy的iinfo函数列出NumPy整数类型的范围:

    >>> np.iinfo(np.int8)
    iinfo(min=-128, max=127, dtype=int8)
    

    使用.finfo可以查看浮点数类型的范围:

    >>> np.finfo(np.float16)
    finfo(resolution=0.001, min=-6.55040e+04,
          max=6.55040e+04, dtype=float16)
    

    还可以用.memory_usage方法查询DataFrame或Series使用了多少字节。注意,其中是包含行索引的。另外,传入deep=True,可以查询带有对象类型的Series的内存用量:

    >>> diamonds.price.memory_usage()
    8080
    >>> diamonds.price.memory_usage(index=False)
    8000
    >>> diamonds.cut.memory_usage()
    8080
    >>> diamonds.cut.memory_usage(deep=True)
    63413
    

    一旦确定了数据类型,可以将其以二进制并带有数据类型的形式保存下来,比如Feather格式,Pandas使用的是pyarrow库。

    >>> diamonds4.to_feather("d.arr")
    >>> diamonds5 = pd.read_feather("d.arr")
    

    另一种方法是使用Parquet格式。

    >>> diamonds4.to_parquet("/tmp/d.pqt")
    

    3.4 使用Excel文件

    需要安装xlwtopenpyxl来写入XLSX文件。

    使用.to_excel方法,进行存储:

    >>> beatles.to_excel("beat.xls")
    >>> beatles.to_excel("beat.xlsx")
    

    使用read_excel读取Excel文件:

    >>> beat2 = pd.read_excel("/tmp/beat.xls")
    >>> beat2
       Unnamed: 0    first       last  birth
    0           0     Paul  McCartney   1942
    1           1     John     Lennon   1940
    2           2  Richard    Starkey   1940
    3           3   George   Harrison   1943
    

    用参数index_col,指定行索引:

    >>> beat2 = pd.read_excel("/tmp/beat.xls", index_col=0)
    >>> beat2
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
    

    Excel保存了数据类型:

    >>> beat2.dtypes
    first    object
    last     object
    birth     int64
    dtype: object
    

    更多

    使用sheet_name参数命名表单:

    >>> xl_writer = pd.ExcelWriter("beat2.xlsx")
    >>> beatles.to_excel(xl_writer, sheet_name="All")
    >>> beatles[beatles.birth < 1941].to_excel(
    ...     xl_writer, sheet_name="1940"
    ... )
    >>> xl_writer.save()
    

    这个Excel文件包含两个表单,一个名字是All,一个名字是1940。


    3.5 使用ZIP文件

    如果CSV文件是ZIP文件中的唯一文件,可以直接使用read_csv函数:

    >>> autos = pd.read_csv("data/vehicles.csv.zip")
    >>> autos
           barrels08  barrelsA08  ...  phevHwy  phevComb
    0      15.695714         0.0  ...        0         0
    1      29.964545         0.0  ...        0         0
    2      12.207778         0.0  ...        0         0
    3      29.964545         0.0  ...        0         0
    4      17.347895         0.0  ...        0         0
    ...          ...         ...  ...      ...       ...
    41139  14.982273         0.0  ...        0         0
    41140  14.330870         0.0  ...        0         0
    41141  15.695714         0.0  ...        0         0
    41142  15.695714         0.0  ...        0         0
    41143  18.311667         0.0  ...        0         0
    >>> autos.modifiedOn.dtype
    dtype('O')
    

    因为CSV文件中包含日期的列,它是字符串。可以在使用read_csv时使用parse_dates加载文件,另一种方法是加载文件后用to_datetime方法解析:

    >>> autos= pd.read_csv(
    ...     "data/vehicles.csv.zip", parse_dates=["modifiedOn"]
    ... )
    >>> autos.modifiedOn
    0       2013-01-0...
    1       2013-01-0...
    2       2013-01-0...
    3       2013-01-0...
    4       2013-01-0...
                ...     
    41139   2013-01-0...
    41140   2013-01-0...
    41141   2013-01-0...
    41142   2013-01-0...
    41143   2013-01-0...
    Name: modifiedOn, Length: 41144, dtype: datetime64[ns, tzlocal()]
    
    >>> autos.modifiedOn
    0        Tue Jan 01 00:00:00 EST 2013
    1        Tue Jan 01 00:00:00 EST 2013
    2        Tue Jan 01 00:00:00 EST 2013
    3        Tue Jan 01 00:00:00 EST 2013
    4        Tue Jan 01 00:00:00 EST 2013
                         ...
    39096    Tue Jan 01 00:00:00 EST 2013
    39097    Tue Jan 01 00:00:00 EST 2013
    39098    Tue Jan 01 00:00:00 EST 2013
    39099    Tue Jan 01 00:00:00 EST 2013
    39100    Tue Jan 01 00:00:00 EST 2013
    Name: modifiedOn, Length: 39101, dtype: object
    >>> pd.to_datetime(autos.modifiedOn)
    0       2013-01-01
    1       2013-01-01
    2       2013-01-01
    3       2013-01-01
    4       2013-01-01
               ...
    39096   2013-01-01
    39097   2013-01-01
    39098   2013-01-01
    39099   2013-01-01
    39100   2013-01-01
    Name: modifiedOn, Length: 39101, dtype: datetime64[ns]
    

    如果ZIP文件中有多个文件,可以使用zipfile模块。因为数据集第二行中包含问题,将其存入kag_questions

    >>> import zipfile
    >>> with zipfile.ZipFile(
    ...     "data/kaggle-survey-2018.zip"
    ... ) as z:
    ...     print("\n".join(z.namelist()))
    ...     kag = pd.read_csv(
    ...         z.open("multipleChoiceResponses.csv")
    ...     )
    ...     kag_questions = kag.iloc[0]
    ...     survey = kag.iloc[1:]
    multipleChoiceResponses.csv
    freeFormResponses.csv
    SurveySchema.csv
    >>> survey.head(2).T
    1          2
    Time from...          710        434
    Q1                 Female       Male
    Q1_OTHER_...           -1         -1
    Q2                  45-49      30-34
    Q3            United S...  Indonesia
    ...                   ...        ...
    Q50_Part_5            NaN        NaN
    Q50_Part_6            NaN        NaN
    Q50_Part_7            NaN        NaN
    Q50_Part_8            NaN        NaN
    Q50_OTHER...           -1         -1
    

    更多

    如果压缩文件中只有一个文件,则read_csv方法还可以读取GZIPBZ2XZ文件。


    3.6 使用数据库

    创建SQLite数据库,存储Beatles信息:

    >>> import sqlite3
    >>> con = sqlite3.connect("data/beat.db")
    >>> with con:
    ...     cur = con.cursor()
    ...     cur.execute("""DROP TABLE Band""")
    ...     cur.execute(
    ...         """CREATE TABLE Band(id INTEGER PRIMARY KEY,
    ...         fname TEXT, lname TEXT, birthyear INT)"""
    ...     )
    ...     cur.execute(
    ...         """INSERT INTO Band VALUES(
    ...         0, 'Paul', 'McCartney', 1942)"""
    ...     )
    ...     cur.execute(
    ...         """INSERT INTO Band VALUES(
    ...         1, 'John', 'Lennon', 1940)"""
    ...     )
    ...     _ = con.commit()
    

    从DataFrame读取数据库,这里使用的是SQLAlchemy:

    >>> import sqlalchemy as sa
    >>> engine = sa.create_engine(
    ...     "sqlite:///data/beat.db", echo=True
    ... )
    >>> sa_connection = engine.connect()
    >>> beat = pd.read_sql(
    ...     "Band", sa_connection, index_col="id"
    ... )
    >>> beat
       fname      lname  birthyear
    id                            
    0   Paul  McCartney       1942
    1   John     Lennon       1940
    

    使用SQL语句读取数据。可以使用SQLite或SQLAlchemy连接:

    >>> sql = """SELECT fname, birthyear from Band"""
    >>> fnames = pd.read_sql(sql, con)
    >>> fnames
      fname  birthyear
    0  Paul       1942
    1  John       1940
    

    3.7 读取JSON

    JSON数据的编码和加载:

    >>> import json
    >>> encoded = json.dumps(people)
    >>> encoded
    '{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
    >>> json.loads(encoded)
    {'first': ['Paull', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}
    

    使用.read_json读取json数据,使用字典中的键名作为列名。

    >>> beatles = pd.read_json(encoded)
    >>> beatles
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
    

    读取JSON时,Pandas支持一些特定的方式:

    • columns —— (默认)将列名映射为列中的值的列表;
    • records —— 行的列表。每行是一个字典,一行映射到一个值;
    • split —— columns映射到列名,index映射到行索引值,data映射到每行数据组成的列表;
    • index —— 将索引映射到行,每行是一个列映射到值的字典;
    • values —— 数据行构成的列表(每行也是列表)。不包含列和行索引的值;
    • table —— 将schema映射到DataFrame的纲要,data映射为字典的列表。

    参考下面的代码:

    >>> records = beatles.to_json(orient="records")
    >>> records
    '[{"first":"Paul","last":"McCartney","birth":1942},{"first":"John","last":"Lennon","birth":1940},{"first":"Richard","last":"Starkey","birth":1940},{"first":"George","last":"Harrison","birth":1943}]'
    >>> pd.read_json(records, orient="records")
       birth    first       last
    0   1942     Paul  McCartney
    1   1940     John     Lennon
    2   1940  Richard    Starkey
    3   1943   George   Harrison
    >>> split = beatles.to_json(orient="split")
    >>> split
    '{"columns":["first","last","birth"],"index":[0,1,2,3],"data":[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]}'
    >>> pd.read_json(split, orient="split")
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
    >>> index = beatles.to_json(orient="index")
    >>> index
    '{"0":{"first":"Paul","last":"McCartney","birth":1942},"1":{"first":"John","last":"Lennon","birth":1940},"2":{"first":"Richard","last":"Starkey","birth":1940},"3":{"first":"George","last":"Harrison","birth":1943}}'
    >>> pd.read_json(index, orient="index")
       birth    first       last
    0   1942     Paul  McCartney
    1   1940     John     Lennon
    2   1940  Richard    Starkey
    3   1943   George   Harrison
    >>> values = beatles.to_json(orient="values")
    >>> values
    '[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]'
    >>> pd.read_json(values, orient="values")
             0          1     2
    0     Paul  McCartney  1942
    1     John     Lennon  1940
    2  Richard    Starkey  1940
    3   George   Harrison  1943
    >>> (
    ...     pd.read_json(values, orient="values").rename(
    ...         columns=dict(
    ...             enumerate(["first", "last", "birth"])
    ...         )
    ...     )
    ... )
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
    >>> table = beatles.to_json(orient="table")
    >>> table
    '{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"first","type":"string"},{"name":"last","type":"string"},{"name":"birth","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"first":"Paul","last":"McCartney","birth":1942},{"index":1,"first":"John","last":"Lennon","birth":1940},{"index":2,"first":"Richard","last":"Starkey","birth":1940},{"index":3,"first":"George","last":"Harrison","birth":1943}]}'
    >>> pd.read_json(table, orient="table")
         first       last  birth
    0     Paul  McCartney   1942
    1     John     Lennon   1940
    2  Richard    Starkey   1940
    3   George   Harrison   1943
    

    更多

    如果要将数据转换为字典,可以使用.to_dict方法,方便添加数据:

    >>> output = beat.to_dict()
    >>> output
    {'fname': {0: 'Paul', 1: 'John'}, 'lname': {0: 'McCartney', 1: 'Lennon'}, 'birthyear': {0: 1942, 1: 1940}}
    >>> output["version"] = "0.4.1"
    >>> json.dumps(output)
    '{"fname": {"0": "Paul", "1": "John"}, "lname": {"0": "McCartney", "1": "Lennon"}, "birthyear": {"0": 1942, "1": 1940}, "version": "0.4.1"}'
    

    3.8 读取HTML表格

    可以使用Pandas读取HTML中的表格:

    Wikipedia中的表格
    # 译者注:en.wikipedia.org 需要科学上网
    >>> url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
    >>> dfs = pd.read_html(url)
    >>> len(dfs)
    51
    

    一共读取了51个df,检查一下第一个:

    >>> dfs[0]
      The Beatles discography The Beatles discography.1
    0  The Beat...             The Beat...
    1  Studio a...                      23
    2  Live albums                       5
    3  Compilat...                      53
    4  Video al...                      15
    5  Music vi...                      64
    6          EPs                      21
    7      Singles                      63
    8     Mash-ups                       2
    9     Box sets                      15
    

    .read_html有一个match参数,可以是字符串或正则表达式。还有一个attrs参数,用于定位HTML标签。

    检查下HTML的table元素:

    <table class="wikitable plainrowheaders" style="text-align:center;">
      <caption>List of studio albums,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[A]</a></sup> with selected chart positions and certifications
      </caption>
      <tbody>
        <tr>
          <th scope="col" rowspan="2" style="width:20em;">Title
          </th>
          <th scope="col" rowspan="2" style="width:20em;">Release
           ...
    

    虽然没有属性,但可以使用字符串List of studio albums来匹配,缺失值na_values用"—"表示:

    >>> url = https://en.wikipedia.org/wiki/The_Beatles_discography
    >>> dfs = pd.read_html(
    ...     url, match="List of studio albums", na_values="—"
    ... )
    >>> len(dfs)
    1
    >>> dfs[0].columns
    Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
    

    使用前两行作为列名,但还是很乱:

    >>> url = https://en.wikipedia.org/wiki/The_Beatles_discography
    >>> dfs = pd.read_html(
    ...     url,
    ...     match="List of studio albums",
    ...     na_values="—",
    ...     header=[0, 1],
    ... )
    >>> len(dfs)
    1
    >>> dfs[0]
              Title      Release  ... Peak chart positions Certifications
              Title      Release  ...             US[8][9] Certifications
    0   Please P...  Released...  ...          NaN          BPI: Gol...
    1   With the...  Released...  ...          NaN          BPI: Gol...
    2   Introduc...  Released...  ...            2          RIAA: Pl...
    3   Meet the...  Released...  ...            1          MC: Plat...
    4   Twist an...  Released...  ...          NaN          MC: 3× P...
    ..          ...          ...  ...          ...                  ...
    22  The Beat...  Released...  ...            1          BPI: 2× ...
    23  Yellow S...  Released...  ...            2          BPI: Gol...
    24   Abbey Road  Released...  ...            1          BPI: 2× ...
    25    Let It Be  Released...  ...            1          BPI: Gol...
    26  "—" deno...  "—" deno...  ...  "—" deno...          "—" deno...
    >>> dfs[0].columns
    MultiIndex(levels=[['Certifications', 'Peak chart positions', 'Release', 'Title'], ['AUS[3]', 'CAN[4]', 'Certifications', 'FRA[5]', 'GER[6]', 'NOR[7]', 'Release', 'Title', 'UK[1][2]', 'US[8][9]']],
      codes=[[3, 2, 1, 1, 1, 1, 1, 1, 1, 0], [7, 6, 8, 0, 1, 3, 4, 5, 9, 2]])
    

    对于这种情况,最容易的方法是更新列索引:

    >>> df = dfs[0]
    >>> df.columns = [
    ...     "Title",
    ...     "Release",
    ...     "UK",
    ...     "AUS",
    ...     "CAN",
    ...     "FRA",
    ...     "GER",
    ...     "NOR",
    ...     "US",
    ...     "Certifications",
    ... ]
    >>> df
              Title      Release  ...           US Certifications
    0   Please P...  Released...  ...          NaN  BPI: Gol...
    1   With the...  Released...  ...          NaN  BPI: Gol...
    2   Introduc...  Released...  ...            2  RIAA: Pl...
    3   Meet the...  Released...  ...            1  MC: Plat...
    4   Twist an...  Released...  ...          NaN  MC: 3× P...
    ..          ...          ...  ...          ...          ...
    22  The Beat...  Released...  ...            1  BPI: 2× ...
    23  Yellow S...  Released...  ...            2  BPI: Gol...
    24   Abbey Road  Released...  ...            1  BPI: 2× ...
    25    Let It Be  Released...  ...            1  BPI: Gol...
    26  "—" deno...  "—" deno...  ...  "—" deno...  "—" deno...
    

    继续清理Release这列,在HTML中,代码是这样的:

    <th scope="row" rowspan="2">
      <i><a href="/wiki/A_Hard_Day%27s_Night_(album)" title="A Hard Day's Night (album)">A Hard Day's Night</a></i>
      <img alt="double-dagger" src="//upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" decoding="async" width="9" height="14" data-file-width="9" data-file-height="14">
    </th>
    

    将release列分别存入release_datelabel两列:

    >>> res = (
    ...     df.pipe(
    ...         lambda df_: df_[
    ...             ~df_.Title.str.startswith("Released")
    ...         ]
    ...     )
    ...     .assign(
    ...         release_date=lambda df_: pd.to_datetime(
    ...             df_.Release.str.extract(
    ...                 r"Released: (.*) Label"
    ...             )[0].str.replace(r"\[E\]", "")
    ...         ),
    ...         label=lambda df_: df_.Release.str.extract(
    ...             r"Label: (.*)"
    ...         ),
    ...     )
    ...     .loc[
    ...         :,
    ...         [
    ...             "Title",
    ...             "UK",
    ...             "AUS",
    ...             "CAN",
    ...             "FRA",
    ...             "GER",
    ...             "NOR",
    ...             "US",
    ...             "release_date",
    ...             "label",
    ...         ],
    ...     ]
    ... )
    >>> res
              Title   UK  ... release_date        label
    0   Please P...    1  ...   1963-03-22  Parlopho...
    1   With the...    1  ...   1963-11-22  Parlopho...
    2   Introduc...  NaN  ...   1964-01-10  Vee-Jay ...
    3   Meet the...  NaN  ...   1964-01-20  Capitol ...
    4   Twist an...  NaN  ...   1964-02-03  Capitol ...
    ..          ...  ...  ...          ...          ...
    21  Magical ...   31  ...   1967-11-27  Parlopho...
    22  The Beat...    1  ...   1968-11-22        Apple
    23  Yellow S...    3  ...   1969-01-13  Apple (U...
    24   Abbey Road    1  ...   1969-09-26        Apple
    25    Let It Be    1  ...   1970-05-08        Apple
    

    更多

    直接读取线上的csv文件:

    >>> url = https://github.com/mattharrison/datasets/blob/master/data/anscombes.csv
    >>> dfs = pd.read_html(url, attrs={"class": "csv-data"})
    >>> len(dfs)
    1
    >>> dfs[0]
        Unnamed: 0 quadrant     x     y
    0          NaN        I  10.0  8.04
    1          NaN        I  14.0  9.96
    2          NaN        I   6.0  7.24
    3          NaN        I   9.0  8.81
    4          NaN        I   4.0  4.26
    ..         ...      ...   ...   ...
    39         NaN       IV   8.0  6.58
    40         NaN       IV   8.0  7.91
    41         NaN       IV   8.0  8.47
    42         NaN       IV   8.0  5.25
    43         NaN       IV   8.0  6.89
    

    第01章 Pandas基础
    第02章 DataFrame基础运算
    第03章 创建和持久化DataFrame
    第04章 开始数据分析
    第05章 探索性数据分析
    第06章 选取数据子集
    第07章 过滤行
    第08章 索引对齐

    相关文章

      网友评论

        本文标题:《Pandas 1.x Cookbook · 第二版》第03章

        本文链接:https://www.haomeiwen.com/subject/xizotltx.html