Scraping Web Page Content with Python: A Hands-On Example

Author: ironman_ | Published: 2018-03-14 18:34

urllib3 handles the networking part.
beautifulsoup4 parses the page content.

Install the following Python packages:

# the bs4 package, used to parse page content
pip3 install beautifulsoup4

# CA certificates for HTTPS verification; without them, HTTPS requests trigger a warning
pip3 install certifi

# install urllib3
pip3 install urllib3

Using urllib3

>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'http://httpbin.org/robots.txt')
>>> r.status
200
>>> r.data
b'User-agent: *\nDisallow: /deny\n'
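
Since r.data is a bytes object in Python 3, it usually needs decoding before it can be treated as text. The following is a minimal sketch of one way to do that. The helper names decode_body and fetch_text, the UTF-8 fallback, and the timeout values are assumptions made for illustration and are not part of the original article.

```python
import urllib3


def decode_body(data: bytes, content_type: str = "") -> str:
    """Decode a response body, using the charset from Content-Type if present."""
    charset = "utf-8"  # assumed default when the header names no charset
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            charset = value.strip('"')
    return data.decode(charset, errors="replace")


def fetch_text(url: str) -> str:
    """GET a URL and return its body as text; raise on a non-200 status."""
    # Illustrative timeouts so a stalled server does not hang the script
    http = urllib3.PoolManager(timeout=urllib3.Timeout(connect=5.0, read=10.0))
    r = http.request("GET", url)
    if r.status != 200:
        raise RuntimeError(f"unexpected status {r.status} for {url}")
    return decode_body(r.data, r.headers.get("Content-Type", ""))


if __name__ == "__main__":
    # Network access is needed only for this demo call
    print(fetch_text("http://httpbin.org/robots.txt"))
```

Keeping the decoding step in its own function makes it possible to check it without touching the network.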

Using BeautifulSoup4

A few methods I find most useful:

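# Create the soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.data, 'html.parser')

# Return the document as an indented, formatted string
soup.prettify()

# Dot access such as soup.table only reaches the first matching node;
# to get every <tr> in a table, use find_all()
tables = soup.find_all('table')  # all <table> tags
table = tables[0]  # assumes the page contains at least one table
table.find_all('tr')  # all <tr> tags under that table

# Get a node's string
title_tag = soup.title
title_tag.string
# e.g. "The Dormouse's story"

# Get all the text under a node
th_all_str = table.tr.get_text()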
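
To try these calls without fetching a page, the methods above can be run against an inline HTML string. This is only a sketch: the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, small enough to inspect by eye
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Elsie</td><td>3</td></tr>
  <tr><td>Lacie</td><td>4</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                  # string of a single node
table = soup.find_all("table")[0]          # find_all returns a list of tags
rows = table.find_all("tr")                # every <tr> under the table
header_text = table.tr.get_text(" ", strip=True)  # all text under the first <tr>
first_cells = [td.string for td in rows[1].find_all("td")]

print(title)
print(len(rows), header_text, first_cells)
```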

A complete example of scraping a page

import urllib3
import certifi
from bs4 import BeautifulSoup


def parse_table(table):
    ths = table.tr.find_all('th')
    headers = []
    for index, value in enumerate(ths):
        th_str = value.string
        if th_str and th_str.strip():
            headers.append((index, th_str.strip()))
    result = []
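    for index, tr in enumerate(table.find_all('tr')):
        if index == 0:
            continue  # skip the header row
        tds = tr.find_all('td')
        ele = {}
        for idx, val in headers:
            # .string is None when a cell is missing or has several children
            cell = tds[idx].string if idx < len(tds) else None
            ele[val] = cell.strip() if cell else cell
        result.append(ele)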
    return result


http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where()
)
# url = 'https://etherscan.io/address/0x379516f90c4ff1cb2bcffa1f24d366855e67f40c'
url = 'https://etherscan.io/token/generic-tokentxns2?contractAddress=0xb5a5f22694352c15b00323844ad545abb2b11028&a=0xd551234ae421e3bcba99a0da6d736074f22192ff'
r = http.request('GET', url)

soup = BeautifulSoup(r.data, 'html.parser')
tables = soup.find_all('table')
for table in tables:
    if table.tr is None:
        continue  # skip tables without any rows
    th_all_str = table.tr.get_text()
    print(th_all_str)
    print('----------')
    # pick the transaction table by the text of its header row
    if 'TxHash' in th_all_str and 'ParentTxHash' not in th_all_str:
        result = parse_table(table)
        print(len(result))
        print(result)
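
One thing to watch in parse_table is that .string returns None for any cell with more than one child node, for example a cell holding two separate tags. The sketch below is a possible variant that reads cells with get_text() instead. The function name parse_table_text and the sample markup are assumptions for illustration, not the article's code and not the real etherscan page structure.

```python
from bs4 import BeautifulSoup


def parse_table_text(table):
    """Return one dict per data row, keyed by non-empty header text."""
    rows = table.find_all("tr")
    if not rows:
        return []
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    result = []
    for tr in rows[1:]:
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        if len(cells) < len(headers):
            continue  # skip rows that do not line up with the header
        # zip pairs each header with its cell; columns with empty headers are dropped
        result.append({h: c for h, c in zip(headers, cells) if h})
    return result


# Hypothetical sample: nested tags in a cell and a short trailing row
sample = """
<table>
  <tr><th>TxHash</th><th></th><th>Value</th></tr>
  <tr><td><a href="#">0xabc</a></td><td>IN</td><td><b>1</b> <i>ETH</i></td></tr>
  <tr><td colspan="3">No more rows</td></tr>
</table>
"""
table = BeautifulSoup(sample, "html.parser").find("table")
records = parse_table_text(table)
print(records)
```

The trade-off is that get_text() flattens any markup inside a cell, so link targets and similar attributes would have to be read separately if they are needed.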

References:

beautifulsoup4 official documentation
urllib3 official documentation

Article link: https://www.haomeiwen.com/subject/boqqqftx.html
