网络爬虫简介

作者: 强_f6db | 来源:发表于2018-06-02 16:09 被阅读0次

网络爬虫

网络爬虫是一种按照一定的规则，自动地抓取万维网信息的程序或者脚本。

通用爬虫的一般步骤

1.抓取页面
2.解析页面
3.数据存储

抓取页面

下载页面数据用到的库有urllib / requests / aiohttp等。
🌰栗子：
使用requests的get方法抓取知乎首页(requests用法请参考官方文档)

url = 'https://www.zhihu.com/'  # 网页的url
headers = {'user-agent': 'Baiduspider'} 请求头 
proxies = {'http': 'http://122.114.31.177:808'}  # 设置代理
response = requests.get(url,
                headers=headers,
                proxies=proxies)

页面解析

在Python可以调用很多能做页面解析的库：
re/lxml/ Beautiful Soup4(bs4)/pyquery
举个🌰栗子：
用bs4+re解析页面（Beautiful Soup4的使用参见官方文档）

soup = BeautifulSoup(response.text, 'lxml')
regex = re.compile(r'^/question')
for a in soup.find_all('a', {'href': regex}):
    if 'href' in a.attrs:
        path = a_tag.attrs['href']

数据存储

爬虫的最终目的是从网络上获取数据，这就需要做数据持久化的工作了。
这里我们同样有多种选择：
MySQL/Redis/MongoDB
一些ORM：pymysql/sqlalchemy/peewee/redis/pymongo
再举个🌰：

client = pymongo.MongoClient('47.98.56.23', 27017)
pages = client.zhihu.pages
pages.insert({'_id': page_id,
              'html_page': html_page,})

网友评论

本文标题：网络爬虫简介

本文链接：https://www.haomeiwen.com/subject/uvursftx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！