HTML
1 Reading HTML content
The top-level read_html() function can accept an HTML string, a file, or a URL, and parses HTML tables into a list of pandas DataFrames.
Note: read_html returns a list of DataFrame objects even if the HTML content contains only a single table.
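Since the return value is always a list, you normally index into it to get at a table. A minimal sketch of that (html_string is a hypothetical single-table document, not from the original examples):
import pandas as pd
from io import StringIO

# Even a document with exactly one table comes back as a one-element list.
html_string = "<table><tr><th>a</th></tr><tr><td>1</td></tr></table>"
dfs = pd.read_html(StringIO(html_string))
df = dfs[0]  # the single DataFrame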
Let's look at a few examples.
In [295]: url = (
.....: "https://raw.githubusercontent.com/pandas-dev/pandas/master/"
.....: "pandas/tests/io/data/html/spam.html"
.....: )
.....:
In [296]: dfs = pd.read_html(url)
In [297]: dfs
Out[297]:
[ Nutrient Unit Value per 100.0g oz 1 NLEA serving 56g Unnamed: 4 Unnamed: 5
0 Proximates Proximates Proximates Proximates Proximates Proximates
1 Water g 51.70 28.95 NaN NaN
2 Energy kcal 315 176 NaN NaN
3 Protein g 13.40 7.50 NaN NaN
4 Total lipid (fat) g 26.60 14.90 NaN NaN
.. ... ... ... ... ... ...
32 Fatty acids, total monounsaturated g 13.505 7.563 NaN NaN
33 Fatty acids, total polyunsaturated g 2.019 1.131 NaN NaN
34 Cholesterol mg 71 40 NaN NaN
35 Other Other Other Other Other Other
36 Caffeine mg 0 0 NaN NaN
[37 rows x 6 columns]]
Read in the content of the banklist.html file and pass it to read_html as a string (file_path is assumed to point to that banklist.html file):
In [298]: with open(file_path, "r") as f:
.....:     dfs = pd.read_html(f.read())
.....:
In [299]: dfs
Out[299]:
[ Bank Name City ... Closing Date Updated Date
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha ... May 31, 2013 May 31, 2013
1 Central Arizona Bank Scottsdale ... May 14, 2013 May 20, 2013
2 Sunrise Bank Valdosta ... May 10, 2013 May 21, 2013
3 Pisgah Community Bank Asheville ... May 10, 2013 May 14, 2013
4 Douglas County Bank Douglasville ... April 26, 2013 May 16, 2013
.. ... ... ... ... ...
500 Superior Bank, FSB Hinsdale ... July 27, 2001 June 5, 2012
501 Malta National Bank Malta ... May 3, 2001 November 18, 2002
502 First Alliance Bank & Trust Co. Manchester ... February 2, 2001 February 18, 2003
503 National State Bank of Metropolis Metropolis ... December 14, 2000 March 17, 2005
504 Bank of Honolulu Honolulu ... October 13, 2000 March 17, 2005
[505 rows x 7 columns]]
If you so desire, you can even pass in an instance of StringIO (from the io standard-library module):
In [300]: with open(file_path, "r") as f:
.....:     sio = StringIO(f.read())
.....:
In [301]: dfs = pd.read_html(sio)
In [302]: dfs
Out[302]:
[ Bank Name City ... Closing Date Updated Date
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha ... May 31, 2013 May 31, 2013
1 Central Arizona Bank Scottsdale ... May 14, 2013 May 20, 2013
2 Sunrise Bank Valdosta ... May 10, 2013 May 21, 2013
3 Pisgah Community Bank Asheville ... May 10, 2013 May 14, 2013
4 Douglas County Bank Douglasville ... April 26, 2013 May 16, 2013
.. ... ... ... ... ...
500 Superior Bank, FSB Hinsdale ... July 27, 2001 June 5, 2012
501 Malta National Bank Malta ... May 3, 2001 November 18, 2002
502 First Alliance Bank & Trust Co. Manchester ... February 2, 2001 February 18, 2003
503 National State Bank of Metropolis Metropolis ... December 14, 2000 March 17, 2005
504 Bank of Honolulu Honolulu ... October 13, 2000 March 17, 2005
[505 rows x 7 columns]]
Read a URL and match a table that contains specific text (note: url must point to a page containing this text; the original example uses the FDIC failed bank list):
match = "Metcalf Bank"
df_list = pd.read_html(url, match=match)
Specify a header row (by default <th> or <td> elements located within a <thead> are used to form the column index; if multiple rows are contained within the <thead>, a MultiIndex is created):
dfs = pd.read_html(url, header=0)
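As a hedged illustration of the MultiIndex case, here is a hypothetical inline table whose <thead> holds two rows (the table content is made up for this sketch):
from io import StringIO

html = (
    "<table><thead>"
    "<tr><th>group</th><th>group</th></tr>"  # first header row
    "<tr><th>a</th><th>b</th></tr>"          # second header row
    "</thead><tbody>"
    "<tr><td>1</td><td>2</td></tr>"
    "</tbody></table>"
)
dfs = pd.read_html(StringIO(html))
# The two <thead> rows become MultiIndex columns:
# expect MultiIndex([('group', 'a'), ('group', 'b')])
print(dfs[0].columns)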
Specify an index column:
dfs = pd.read_html(url, index_col=0)
Specify the number of rows to skip:
dfs = pd.read_html(url, skiprows=0)
Specify the number of rows to skip using a list (the range function works as well):
dfs = pd.read_html(url, skiprows=range(2))
Specify an HTML attribute:
dfs1 = pd.read_html(url, attrs={"id": "table"})
dfs2 = pd.read_html(url, attrs={"class": "sortable"})
print(np.array_equal(dfs1[0], dfs2[0])) # Should be True
Specify values that should be converted to NaN:
dfs = pd.read_html(url, na_values=["No Acquirer"])
Specify whether to keep the default set of NaN values:
dfs = pd.read_html(url, keep_default_na=False)
You can specify converters for columns. This is useful for numerical text data that has leading zeros. By default, numerical columns are cast to a numeric dtype and the leading zeros are lost. To avoid this, we can convert those columns to strings:
url_mcc = "https://en.wikipedia.org/wiki/Mobile_country_code"
dfs = pd.read_html(
    url_mcc,
    match="Telekom Albania",
    header=0,
    converters={"MNC": str},
)
Combine the options above:
dfs = pd.read_html(url, match="Metcalf Bank", index_col=0)
Read in the output of to_html (with some loss of floating point precision):
df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format="{0:.40g}".format)
dfin = pd.read_html(s, index_col=0)
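The 40-digit float_format above keeps that loss to a minimum; with the default formatting the truncation is easy to see. A small sketch (s_default and dfin_default are names introduced here):
# The default to_html formatting rounds the displayed values, so reading
# them back does not reproduce the original floats exactly.
s_default = df.to_html()
dfin_default = pd.read_html(s_default, index_col=0)[0]
print((df - dfin_default).abs().max())  # small but nonzero differences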
If parsing fails when only one parser is provided, the lxml backend will raise an error; the best approach is to specify a list of parsers:
dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml"])
# or
dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor="lxml")
However, if you have bs4 and html5lib installed and pass None or ['lxml', 'bs4'], the parse will most likely succeed:
dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"])
2 Writing to HTML files
DataFrame objects have an instance method, to_html, which renders the contents of the DataFrame as an HTML table. The function arguments are the same as those of the to_string method described above.
In [303]: df = pd.DataFrame(np.random.randn(2, 2))
In [304]: df
Out[304]:
0 1
0 -0.184744 0.496971
1 -0.856240 1.857977
In [305]: print(df.to_html()) # raw html
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.184744</td>
<td>0.496971</td>
</tr>
<tr>
<th>1</th>
<td>-0.856240</td>
<td>1.857977</td>
</tr>
</tbody>
</table>
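The examples in this section print the raw markup; to write it to an actual file, pass a path or an open handle as the buf argument (a brief sketch; "frame.html" is an arbitrary name):
# With a buffer or path as buf, to_html writes instead of returning a string.
df.to_html("frame.html")
# Equivalently, via an open file handle:
with open("frame.html", "w") as f:
    df.to_html(buf=f)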
The columns argument will limit the columns shown:
In [306]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.184744</td>
</tr>
<tr>
<th>1</th>
<td>-0.856240</td>
</tr>
</tbody>
</table>
The float_format argument takes a Python callable and controls the precision of floating point values:
In [307]: print(df.to_html(float_format="{0:.10f}".format))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.1847438576</td>
<td>0.4969711327</td>
</tr>
<tr>
<th>1</th>
<td>-0.8562396763</td>
<td>1.8579766508</td>
</tr>
</tbody>
</table>
bold_rows makes the row labels bold by default, but you can turn that off:
In [308]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>-0.184744</td>
<td>0.496971</td>
</tr>
<tr>
<td>1</td>
<td>-0.856240</td>
<td>1.857977</td>
</tr>
</tbody>
</table>
The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these classes are appended after the existing 'dataframe' class:
In [309]: print(df.to_html(classes=["awesome_table_class", "even_more_awesome_class"]))
<table border="1" class="dataframe awesome_table_class even_more_awesome_class">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.184744</td>
<td>0.496971</td>
</tr>
<tr>
<th>1</th>
<td>-0.856240</td>
<td>1.857977</td>
</tr>
</tbody>
</table>
The render_links argument provides the ability to add hyperlinks to cells that contain URLs:
In [310]: url_df = pd.DataFrame(
.....: {
.....: "name": ["Python", "pandas"],
.....: "url": ["https://www.python.org/", "https://pandas.pydata.org"],
.....: }
.....: )
.....:
In [311]: print(url_df.to_html(render_links=True))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>url</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Python</td>
<td><a href="https://www.python.org/" target="_blank">https://www.python.org/</a></td>
</tr>
<tr>
<th>1</th>
<td>pandas</td>
<td><a href="https://pandas.pydata.org" target="_blank">https://pandas.pydata.org</a></td>
</tr>
</tbody>
</table>
Finally, the escape argument allows you to control whether the "<", ">" and "&" characters are escaped in the resulting HTML (True by default). To get HTML without escaped characters, pass escape=False.
In [312]: df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})
Escaped:
In [313]: print(df.to_html())
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&amp;</td>
<td>-0.474063</td>
</tr>
<tr>
<th>1</th>
<td>&lt;</td>
<td>-0.230305</td>
</tr>
<tr>
<th>2</th>
<td>&gt;</td>
<td>-0.400654</td>
</tr>
</tbody>
</table>
Not escaped:
In [314]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>-0.474063</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>-0.230305</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>-0.400654</td>
</tr>
</tbody>
</table>
Some browsers may not show a difference in the rendering of these two HTML tables.
3 HTML Table Parsing Gotchas
There are some issues with the libraries that are used to parse HTML tables in the top-level pandas io function read_html.

Issues with lxml
- Benefits
  - lxml is very fast.
  - lxml requires Cython to install correctly.
- Drawbacks
  - lxml does not make any guarantees about the results of its parse unless it is given strictly valid markup. You may choose to use the lxml backend, but that backend will use html5lib if lxml fails to parse.
  - It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will still get a valid result even if lxml fails.

Issues with BeautifulSoup4 using lxml as a backend
- The issues above hold here as well, since BeautifulSoup4 is essentially just a wrapper around a parser backend.

Issues with BeautifulSoup4 using html5lib as a backend
- Benefits
  - html5lib is far more lenient than lxml and consequently deals with real-life markup in a much saner way rather than, for example, dropping an element without notifying you.
  - html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does not mean the result is necessarily "correct", since the process of fixing markup does not have a single definition.
  - html5lib is pure Python and requires no additional build steps.
- Drawbacks
  - The biggest drawback to using html5lib is that it is very slow.
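In terms of the flavor argument, these libraries map as follows: "lxml" selects lxml, while "bs4" and "html5lib" are synonyms that both select BeautifulSoup4 backed by html5lib. A minimal sketch (url is a placeholder):
# Fast, but strict about markup:
dfs = pd.read_html(url, flavor="lxml")
# Lenient but slow ("bs4" and "html5lib" are synonymous):
dfs = pd.read_html(url, flavor="bs4")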