Python Data Processing (18): HTML Tables


Author: 名本无名 | Published 2021-02-21 11:30


1 Reading HTML Content

The top-level read_html() function accepts an HTML string, a file, or a URL, and parses the HTML tables it finds into a list of pandas DataFrames.

Note: even if the HTML content contains only a single table, read_html still returns a list of DataFrame objects.

Let's look at a few examples.

In [295]: url = (
   .....:     "https://raw.githubusercontent.com/pandas-dev/pandas/master/"
   .....:     "pandas/tests/io/data/html/spam.html"
   .....: )
   .....: 

In [296]: dfs = pd.read_html(url)

In [297]: dfs
Out[297]: 
[                              Nutrient        Unit Value per 100.0g oz 1 NLEA serving  56g  Unnamed: 4  Unnamed: 5
 0                           Proximates  Proximates       Proximates             Proximates  Proximates  Proximates
 1                                Water           g            51.70                  28.95         NaN         NaN
 2                               Energy        kcal              315                    176         NaN         NaN
 3                              Protein           g            13.40                   7.50         NaN         NaN
 4                    Total lipid (fat)           g            26.60                  14.90         NaN         NaN
 ..                                 ...         ...              ...                    ...         ...         ...
 32  Fatty acids, total monounsaturated           g           13.505                  7.563         NaN         NaN
 33  Fatty acids, total polyunsaturated           g            2.019                  1.131         NaN         NaN
 34                         Cholesterol          mg               71                     40         NaN         NaN
 35                               Other       Other            Other                  Other       Other       Other
 36                            Caffeine          mg                0                      0         NaN         NaN
 
 [37 rows x 6 columns]]

Read in the contents of the banklist.html file and pass it to read_html as a string:

In [298]: with open(file_path, "r") as f:
   .....:     dfs = pd.read_html(f.read())
   .....: 

In [299]: dfs
Out[299]: 
[                                    Bank Name          City  ...       Closing Date       Updated Date
 0    Banks of Wisconsin d/b/a Bank of Kenosha       Kenosha  ...       May 31, 2013       May 31, 2013
 1                        Central Arizona Bank    Scottsdale  ...       May 14, 2013       May 20, 2013
 2                                Sunrise Bank      Valdosta  ...       May 10, 2013       May 21, 2013
 3                       Pisgah Community Bank     Asheville  ...       May 10, 2013       May 14, 2013
 4                         Douglas County Bank  Douglasville  ...     April 26, 2013       May 16, 2013
 ..                                        ...           ...  ...                ...                ...
 500                        Superior Bank, FSB      Hinsdale  ...      July 27, 2001       June 5, 2012
 501                       Malta National Bank         Malta  ...        May 3, 2001  November 18, 2002
 502           First Alliance Bank & Trust Co.    Manchester  ...   February 2, 2001  February 18, 2003
 503         National State Bank of Metropolis    Metropolis  ...  December 14, 2000     March 17, 2005
 504                          Bank of Honolulu      Honolulu  ...   October 13, 2000     March 17, 2005
 
 [505 rows x 7 columns]]

If you prefer, you can even pass in an instance of StringIO:

In [300]: with open(file_path, "r") as f:
   .....:     sio = StringIO(f.read())
   .....: 

In [301]: dfs = pd.read_html(sio)

In [302]: dfs
Out[302]: 
[                                    Bank Name          City  ...       Closing Date       Updated Date
 0    Banks of Wisconsin d/b/a Bank of Kenosha       Kenosha  ...       May 31, 2013       May 31, 2013
 1                        Central Arizona Bank    Scottsdale  ...       May 14, 2013       May 20, 2013
 2                                Sunrise Bank      Valdosta  ...       May 10, 2013       May 21, 2013
 3                       Pisgah Community Bank     Asheville  ...       May 10, 2013       May 14, 2013
 4                         Douglas County Bank  Douglasville  ...     April 26, 2013       May 16, 2013
 ..                                        ...           ...  ...                ...                ...
 500                        Superior Bank, FSB      Hinsdale  ...      July 27, 2001       June 5, 2012
 501                       Malta National Bank         Malta  ...        May 3, 2001  November 18, 2002
 502           First Alliance Bank & Trust Co.    Manchester  ...   February 2, 2001  February 18, 2003
 503         National State Bank of Metropolis    Metropolis  ...  December 14, 2000     March 17, 2005
 504                          Bank of Honolulu      Honolulu  ...   October 13, 2000     March 17, 2005
 
 [505 rows x 7 columns]]
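The same pattern works on any HTML string. As a hedged, self-contained sketch (the table markup below is invented for illustration, and an HTML parser such as lxml must be installed):

```python
from io import StringIO

import pandas as pd

# A minimal, made-up HTML table for illustration
html = """
<table>
  <thead>
    <tr><th>Bank Name</th><th>City</th></tr>
  </thead>
  <tbody>
    <tr><td>First Example Bank</td><td>Springfield</td></tr>
    <tr><td>Second Example Bank</td><td>Shelbyville</td></tr>
  </tbody>
</table>
"""

# Wrapping the markup in StringIO also sidesteps the deprecation of
# passing literal HTML strings in newer pandas versions
dfs = pd.read_html(StringIO(html))

print(len(dfs))       # even one table comes back as a one-element list
print(dfs[0].shape)   # (2, 2)
```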

Read a URL and match a table that contains specific text:

match = "Metcalf Bank"
df_list = pd.read_html(url, match=match)

Specify a header row (by default, <th> or <td> elements located within a <thead> are used to form the column index; if <thead> contains more than one row, a MultiIndex is created):

dfs = pd.read_html(url, header=0)
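The MultiIndex behavior can be seen without fetching a URL. A hedged, self-contained sketch (the markup is invented; an HTML parser such as lxml must be installed):

```python
from io import StringIO

import pandas as pd

# A made-up table whose <thead> contains two rows; read_html turns
# them into a two-level MultiIndex on the columns
html = """
<table>
  <thead>
    <tr><th>Group</th><th>Group</th></tr>
    <tr><th>x</th><th>y</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>2</td></tr>
  </tbody>
</table>
"""

df = pd.read_html(StringIO(html))[0]
print(isinstance(df.columns, pd.MultiIndex))  # True
print(df.columns.nlevels)                     # 2
```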

Specify an index column:

dfs = pd.read_html(url, index_col=0)

Specify the number of rows to skip:

dfs = pd.read_html(url, skiprows=0)

Specify the rows to skip as a list (the range function also works):

dfs = pd.read_html(url, skiprows=range(2))

Specify an HTML attribute:

dfs1 = pd.read_html(url, attrs={"id": "table"})
dfs2 = pd.read_html(url, attrs={"class": "sortable"})
print(np.array_equal(dfs1[0], dfs2[0]))  # Should be True
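Here is a self-contained illustration of attrs (the two tables and their id values are invented; an HTML parser such as lxml must be installed):

```python
from io import StringIO

import pandas as pd

# Two made-up tables; attrs restricts the parse to tables whose HTML
# attributes match the given dict
html = """
<body>
  <table id="first"><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table id="second"><tr><th>b</th></tr><tr><td>2</td></tr></table>
</body>
"""

dfs = pd.read_html(StringIO(html), attrs={"id": "second"})
print(len(dfs))              # only the matching table is returned
print(list(dfs[0].columns))  # ['b']
```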

Specify values that should be converted to NaN:

dfs = pd.read_html(url, na_values=["No Acquirer"])

Specify whether to keep the default set of NaN values:

dfs = pd.read_html(url, keep_default_na=False)
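A small sketch of na_values in action (the table is invented; an HTML parser such as lxml must be installed):

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Bank</th><th>Acquirer</th></tr>
  <tr><td>Example Bank</td><td>No Acquirer</td></tr>
</table>
"""

# The literal string "No Acquirer" is treated as missing data
dfs = pd.read_html(StringIO(html), na_values=["No Acquirer"])
print(bool(dfs[0]["Acquirer"].isna().all()))  # True
```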

Converters can be specified for columns. This is useful for numeric text data that has leading zeros.

By default, numeric columns are converted to numeric types, and any leading zeros are lost. To avoid this, we can convert those columns to strings:

url_mcc = "https://en.wikipedia.org/wiki/Mobile_country_code"
dfs = pd.read_html(
    url_mcc,
    match="Telekom Albania",
    header=0,
    converters={"MNC": str},
)
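The leading-zero effect can be reproduced without the network. A hedged sketch (the table and its value are invented; an HTML parser such as lxml must be installed):

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>MNC</th></tr>
  <tr><td>01</td></tr>
</table>
"""

# Without a converter the column is inferred as numeric and the
# leading zero is lost; with str the original text is preserved
as_number = pd.read_html(StringIO(html))[0]
as_string = pd.read_html(StringIO(html), converters={"MNC": str})[0]

print(as_number["MNC"].iloc[0])  # 1
print(as_string["MNC"].iloc[0])  # 01
```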

Combine the options above:

dfs = pd.read_html(url, match="Metcalf Bank", index_col=0)

Read in the output of to_html (with some loss of floating point precision):

df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format="{0:.40g}".format)
dfin = pd.read_html(s, index_col=0)
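The round trip can be checked end to end. A minimal sketch (wrapping the markup in StringIO is only needed on newer pandas versions; an HTML parser such as lxml must be installed):

```python
from io import StringIO

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(2, 2))

# 40 significant digits is more than enough to represent a double
# exactly, so the parsed values match the originals
s = df.to_html(float_format="{0:.40g}".format)
dfin = pd.read_html(StringIO(s), index_col=0)[0]

print(np.allclose(df.values, dfin.values))  # True
```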

When only a single parser is provided, the lxml parser raises an exception if parsing fails, so it is best to specify a list of parsers:

dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml"])
# or
dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor="lxml")

However, if you have bs4 and html5lib installed and pass either None or ['lxml', 'bs4'], the parse will very likely succeed:

dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"])

2 Writing HTML Files

DataFrame objects have an instance method to_html, which renders the contents of the DataFrame as an HTML table.

The function arguments are the same as those of the to_string method described above.

In [303]: df = pd.DataFrame(np.random.randn(2, 2))

In [304]: df
Out[304]: 
          0         1
0 -0.184744  0.496971
1 -0.856240  1.857977

In [305]: print(df.to_html())  # raw html
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-0.184744</td>
      <td>0.496971</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-0.856240</td>
      <td>1.857977</td>
    </tr>
  </tbody>
</table>

The columns argument will limit the columns shown:

In [306]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-0.184744</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-0.856240</td>
    </tr>
  </tbody>
</table>

The float_format argument controls the precision of floating point values:

In [307]: print(df.to_html(float_format="{0:.10f}".format))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-0.1847438576</td>
      <td>0.4969711327</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-0.8562396763</td>
      <td>1.8579766508</td>
    </tr>
  </tbody>
</table>

bold_rows makes the row labels bold by default, but you can turn that off:

In [308]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>-0.184744</td>
      <td>0.496971</td>
    </tr>
    <tr>
      <td>1</td>
      <td>-0.856240</td>
      <td>1.857977</td>
    </tr>
  </tbody>
</table>

The classes argument provides the ability to give the resulting HTML table CSS classes.

Note that these classes are appended after the existing dataframe class:

In [309]: print(df.to_html(classes=["awesome_table_class", "even_more_awesome_class"]))
<table border="1" class="dataframe awesome_table_class even_more_awesome_class">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>-0.184744</td>
      <td>0.496971</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-0.856240</td>
      <td>1.857977</td>
    </tr>
  </tbody>
</table>

The render_links argument provides the ability to add hyperlinks to cells that contain URLs:

In [310]: url_df = pd.DataFrame(
   .....:     {
   .....:         "name": ["Python", "pandas"],
   .....:         "url": ["https://www.python.org/", "https://pandas.pydata.org"],
   .....:     }
   .....: )
   .....: 

In [311]: print(url_df.to_html(render_links=True))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>name</th>
      <th>url</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Python</td>
      <td><a href="https://www.python.org/" target="_blank">https://www.python.org/</a></td>
    </tr>
    <tr>
      <th>1</th>
      <td>pandas</td>
      <td><a href="https://pandas.pydata.org" target="_blank">https://pandas.pydata.org</a></td>
    </tr>
  </tbody>
</table>
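The hyperlink wrapping can be verified directly on the generated markup:

```python
import pandas as pd

url_df = pd.DataFrame(
    {
        "name": ["Python", "pandas"],
        "url": ["https://www.python.org/", "https://pandas.pydata.org"],
    }
)

html = url_df.to_html(render_links=True)

# Cells that look like URLs are wrapped in anchor tags
print('<a href="https://www.python.org/"' in html)  # True
```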

Finally, the escape argument allows you to control whether the characters "<", ">", and "&" are escaped in the resulting HTML (it defaults to True).

So, to get HTML without escaped characters, pass escape=False:

In [312]: df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})

Escaped:

In [313]: print(df.to_html())
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>&amp;</td>
      <td>-0.474063</td>
    </tr>
    <tr>
      <th>1</th>
      <td>&lt;</td>
      <td>-0.230305</td>
    </tr>
    <tr>
      <th>2</th>
      <td>&gt;</td>
      <td>-0.400654</td>
    </tr>
  </tbody>
</table>

Not escaped:

In [314]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>&</td>
      <td>-0.474063</td>
    </tr>
    <tr>
      <th>1</th>
      <td><</td>
      <td>-0.230305</td>
    </tr>
    <tr>
      <th>2</th>
      <td>></td>
      <td>-0.400654</td>
    </tr>
  </tbody>
</table>

In some browsers these two HTML tables may not display any differently.
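The difference is easy to verify on the raw markup rather than in a browser:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})

escaped = df.to_html()
raw = df.to_html(escape=False)

# With escaping on, the special characters become HTML entities
print("&amp;" in escaped and "&lt;" in escaped and "&gt;" in escaped)  # True
# With escaping off, they are emitted verbatim
print("<td>&</td>" in raw)  # True
```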

3 HTML Table Parsing Gotchas

There are some issues with the libraries used to parse HTML tables in the top-level pandas io function read_html:

  1. Issues with lxml
  • Benefits
    • Very fast
    • Depends on Cython
  • Drawbacks
    • lxml makes no guarantees about the results of its parse unless it is given strictly valid markup. You can elect to use the lxml backend, but that backend will fall back to html5lib if lxml fails to parse
    • It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will still get a valid result even if lxml fails
  2. Issues with using lxml as the backend for BeautifulSoup4
  • Because BeautifulSoup4 is essentially just a wrapper around a parser backend, the issues above apply here as well
  3. Issues with using html5lib as the backend for BeautifulSoup4
  • Benefits
    • html5lib is far more lenient than lxml, and consequently handles real-life markup in a much saner way, rather than just dropping an element without notifying you
    • html5lib automatically generates valid HTML5 markup from invalid markup. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does not mean the result is necessarily correct, since the process of fixing markup has no single definition
    • html5lib is pure Python and requires no additional build step
  • Drawbacks
    • The biggest drawback of using html5lib is that it is very slow
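The trade-off described above can be sketched in a single call: pass the fast parser first and the lenient one second (the markup below is invented; lxml, or bs4 with html5lib, must be installed):

```python
from io import StringIO

import pandas as pd

html = "<table><tr><th>a</th></tr><tr><td>1</td></tr></table>"

# lxml is tried first for speed; if it fails to parse, pandas falls
# back to the bs4/html5lib flavor, which tolerates messier markup
dfs = pd.read_html(StringIO(html), flavor=["lxml", "bs4"])
print(dfs[0].shape)  # (1, 1)
```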
