Crawler Series (12): A Crawler Using BeautifulSoup4

Author: 文子轩 | Published: 2018-01-30 10:58

We use Tencent's experienced-hire recruitment page as the demo: http://hr.tencent.com/position.php?&start=10#a

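Using the BeautifulSoup4 parser, we extract each posting's job title, job category, number of openings, work location, and publication date from the recruitment page, together with the link to each job's detail page, and save them.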
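
To make the selectors concrete before the full script, here is a minimal sketch run against a hand-written HTML fragment. The fragment, its values, and the helper name `parse_rows` are assumptions for illustration only; the live page's markup may differ. It uses Python's built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the row layout the script expects;
# the real page markup may differ.
SAMPLE_HTML = """
<table>
  <tr class="h"><td>Position</td><td>Category</td><td>Openings</td><td>Location</td><td>Published</td></tr>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-01-30</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Designer</a></td>
    <td>Design</td><td>1</td><td>Beijing</td><td>2018-01-29</td>
  </tr>
</table>
"""


def parse_rows(html_text):
    """Return one dict per job row found in html_text."""
    # html.parser ships with Python, so this sketch does not need lxml.
    soup = BeautifulSoup(html_text, 'html.parser')
    rows = []
    # A comma in a CSS selector matches rows of either class.
    for tr in soup.select('tr.even, tr.odd'):
        cells = tr.select('td')
        link = cells[0].select('a')[0]
        rows.append({
            'name': link.get_text(),
            'detailLink': link.attrs['href'],
            'catalog': cells[1].get_text(),
            'recruitNumber': cells[2].get_text(),
            'workLocation': cells[3].get_text(),
            'publishTime': cells[4].get_text(),
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row['name'], row['workLocation'])
```

The header row has neither class, so it is skipped, which is why selecting on `even` and `odd` is enough to isolate the job rows.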

bs4_tencent.py

    import json    # results are stored as JSON
    import urllib.request

    from bs4 import BeautifulSoup

    def tencent():
        url = 'http://hr.tencent.com/'
        request = urllib.request.Request(url + 'position.php?&start=10#a')
        response = urllib.request.urlopen(request)
        resHtml = response.read()

        html = BeautifulSoup(resHtml, 'lxml')

        # Use CSS selectors to collect the job rows, which carry
        # either the "even" or the "odd" class
        result = html.select('tr[class="even"]')
        result2 = html.select('tr[class="odd"]')
        result += result2

        items = []
        for site in result:
            item = {}

            # The first cell holds the job title and the detail-page link
            name = site.select('td a')[0].get_text()
            detailLink = site.select('td a')[0].attrs['href']
            catalog = site.select('td')[1].get_text()
            recruitNumber = site.select('td')[2].get_text()
            workLocation = site.select('td')[3].get_text()
            publishTime = site.select('td')[4].get_text()

            item['name'] = name
            item['detailLink'] = url + detailLink
            item['catalog'] = catalog
            item['recruitNumber'] = recruitNumber
            item['workLocation'] = workLocation
            item['publishTime'] = publishTime

            items.append(item)

        # Disable ASCII escaping so the Chinese text is written as UTF-8
        with open('tencent.json', 'w', encoding='utf-8') as output:
            json.dump(items, output, ensure_ascii=False)

    if __name__ == "__main__":
        tencent()
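
The script passes `ensure_ascii=False` when serializing. The following sketch shows what that flag changes, using only the standard library. The record and its values are made up for illustration.

```python
import json

# Hypothetical record; the field values are made up for illustration.
item = {'name': '后台开发工程师', 'workLocation': '深圳'}

# Default behaviour: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps([item])
# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps([item], ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same data.
same_data = json.loads(escaped) == json.loads(readable)
print(same_data)
```

Either form is valid JSON; keeping the characters unescaped mainly makes the output file readable when opened in an editor.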
