Fetching web page table data with Python

Author: 生信探索 | Published 2024-05-12 18:13


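Requirement

I need the genes (the Gene Symbol column) listed in a table on a web page, 371 in total.

Reading the web page table with pandas

read_html returns a list of DataFrames, one for each table it finds on the page, so the table of interest has to be picked out by index.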

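import pandas as pd
import bioquest as bq

# ExoCarta browse results, filtered to bladder cancer cells
url = "http://exocarta.org/browse_results?org_name=&cont_type=&tissue=Bladder%20cancer%20cells&gene_symbol="

# read_html returns a list of DataFrames; the first table is the one we want
df = pd.read_html(url, encoding='utf-8', header=0, index_col=0)[0]

# keep only the columns of interest and save them to a CSV file
bq.tl.select(df, columns=["Gene Name", "Gene Symbol", "Species"]).to_csv("gene.csv", index=False)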
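For readers who do not have the bioquest package, the same steps can probably be reproduced with pandas alone. The following is an offline sketch, not the author's code: the inline HTML and its two rows are made-up stand-ins for the real ExoCarta page, and it assumes that bq.tl.select does nothing more than select the named columns. It also assumes an HTML parser that pandas can use (for example lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the result page (placeholder rows, not real data).
html = """
<table>
  <thead>
    <tr><th>No.</th><th>Gene Name</th><th>Gene Symbol</th><th>Species</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>example gene A</td><td>GENEA</td><td>Homo sapiens</td></tr>
    <tr><td>2</td><td>example gene B</td><td>GENEB</td><td>Homo sapiens</td></tr>
  </tbody>
</table>
<table>
  <tr><th>Other</th></tr>
  <tr><td>unrelated table</td></tr>
</table>
"""

# read_html returns one DataFrame per <table>, in document order.
tables = pd.read_html(StringIO(html))
df = tables[0]

# Plain pandas column selection (assumed equivalent of bq.tl.select).
genes = df[["Gene Name", "Gene Symbol", "Species"]]

# Without a path, to_csv returns the CSV text instead of writing a file.
csv_text = genes.to_csv(index=False)

print(len(tables))
print(df["Gene Symbol"].nunique())
```

On the real page, counting the unique values in the Gene Symbol column this way is a quick check that all 371 genes were captured.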

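I have never studied web scraping, so I was curious how read_html manages this and how it parses the page. The pandas documentation describes it as follows: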

This function searches for <table> elements and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for “table data”. This function attempts to properly handle colspan and rowspan attributes. If the function has a <thead> argument, it is used to construct the header, otherwise the function attempts to find the header within the body (by putting rows with only <th> elements into the header).
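Two of the behaviours described in that passage, header detection without a <thead> and colspan handling, can be checked on a tiny table. This is a minimal sketch with made-up cell values, assuming pandas and one of its HTML parsers (for example lxml) are installed.

```python
from io import StringIO

import pandas as pd

# No <thead> here: the first row contains only <th> cells,
# so read_html is expected to use it as the header.
# The second row has a cell spanning two columns.
html = """
<table>
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td colspan="2">x</td><td>y</td></tr>
  <tr><td>p</td><td>q</td><td>r</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]

columns = list(df.columns)        # header taken from the all-<th> row
first_row = df.iloc[0].tolist()   # the spanning cell fills both of its columns

print(columns)
print(first_row)
```

The spanning cell is not left empty in the second column: its text is copied into every column it covers, which keeps the DataFrame rectangular.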

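The HTML syntax of a table in a web page looks roughly like this:

tr: defines a table row

th: defines a header cell

td: defines a data cell

<table>
  <thead>
    <tr>
      <th>...</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>...</td>
    </tr>
    ...
  </tbody>
</table>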

So read_html relies on a parsing library such as lxml to locate the tables through their HTML tags, and then converts each one into a DataFrame.
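To make the idea concrete, here is a rough sketch of tag-based table extraction that uses only the Python standard library. It is not how pandas or lxml are implemented; the TableParser class and the sample rows are hypothetical, and nested tables, colspan, and rowspan are ignored.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a <th> or <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Made-up sample table (placeholder values, not real data).
html = """
<table>
  <thead><tr><th>Gene Symbol</th><th>Species</th></tr></thead>
  <tbody>
    <tr><td>GENEA</td><td>Homo sapiens</td></tr>
    <tr><td>GENEB</td><td>Homo sapiens</td></tr>
  </tbody>
</table>
"""

parser = TableParser()
parser.feed(html)

# Treat the first row as the header and the rest as records.
header, *rows = parser.tables[0]
records = [dict(zip(header, row)) for row in rows]
print(records)
```

The list of dictionaries in records could be passed to pd.DataFrame to obtain the same kind of result that read_html gives directly.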

Reference

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

https://zhuanlan.zhihu.com/p/51968879

https://blog.csdn.net/qq_40478273/article/details/103980288
