[雪峰磁针石博客] Python Scraping Cookbook 1: Getting Started with Scraping

Author: oychw | Published 2018-07-04 21:16

    Chapter 1: Getting Started with Scraping

    • Scraping python.org with Requests and Beautiful Soup
    • Scraping python.org with urllib3 and Beautiful Soup
    • Scraping python.org with Scrapy
    • Scraping Python.org with Selenium and PhantomJS


    First make sure you can open: https://www.python.org/events/python-events/
    Install requests and bs4, and then we can begin Example 1: scraping python.org with Requests and Beautiful Soup. If installation fails, try to resolve it yourself with Google first; if you really cannot, ask in the QQ group or consult qq 37391319 privately (consulting is paid, starting from a 10-yuan QQ red packet).

    #!python
    
    # pip3 install requests bs4
    
    
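    To confirm the installation, a quick import check is enough (a minimal sketch; both packages expose a __version__ attribute):

    #!python
    
    import requests
    import bs4
    
    # Print the installed versions to verify both imports succeed
    print(requests.__version__)
    print(bs4.__version__)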

    Scraping python.org with Requests and Beautiful Soup

    01_events_with_requests.py

    #!python
    
    import requests
    from bs4 import BeautifulSoup
    
    def get_upcoming_events(url):
        # Fetch the events page
        req = requests.get(url)
    
        # Parse the HTML using the lxml parser
        soup = BeautifulSoup(req.text, 'lxml')
    
        # Each upcoming event is an <li> inside the list-recent-events <ul>
        events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    
        for event in events:
            event_details = dict()
            event_details['name'] = event.find('h3').find("a").text
            event_details['location'] = event.find('span', {'class': 'event-location'}).text
            event_details['time'] = event.find('time').text
            print(event_details)
    
    get_upcoming_events('https://www.python.org/events/python-events/')
    
    
    

    Output:

    #!python
    
    $ python3 01_events_with_requests.py 
    {'name': 'PyCon US 2018', 'location': 'Cleveland, Ohio, USA', 'time': '09 May – 18 May  2018'}
    {'name': 'DjangoCon Europe 2018', 'location': 'Heidelberg, Germany', 'time': '23 May – 28 May  2018'}
    {'name': 'PyCon APAC 2018', 'location': 'NUS School of Computing / COM1, 13 Computing Drive, Singapore 117417, Singapore', 'time': '31 May – 03 June  2018'}
    {'name': 'PyCon CZ 2018', 'location': 'Prague, Czech Republic', 'time': '01 June – 04 June  2018'}
    {'name': 'PyConTW 2018', 'location': 'Taipei, Taiwan', 'time': '01 June – 03 June  2018'}
    {'name': 'PyLondinium', 'location': 'London, UK', 'time': '08 June – 11 June  2018'}
    
    

    Note: the listed events change over time, so the output will differ from run to run.
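    Because the page can also be temporarily unavailable, it is worth failing fast on HTTP errors before parsing. A minimal sketch of a guarded fetch (fetch_html is a hypothetical helper; the timeout value is an arbitrary choice):

    #!python
    
    import requests
    
    def fetch_html(url):
        # timeout keeps the request from hanging forever;
        # raise_for_status raises an exception on 4xx/5xx responses
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text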

    Exercise: use requests to scrape the blog post titles on the home page of https://china-testing.github.io/ (10 in total).

    Reference solution:

    01_blog_title.py

    #!python
    
    import requests
    from bs4 import BeautifulSoup
    
    def get_blog_titles(url):
        req = requests.get(url)
    
        soup = BeautifulSoup(req.text, 'lxml')
    
        # Each blog post on the home page is an <article> element
        posts = soup.findAll('article')
    
        for post in posts:
            details = {}
            # The title is the link inside the post's <h1>
            details['name'] = post.find('h1').find("a").text
            print(details)
    
    get_blog_titles('https://china-testing.github.io/')
    
    
    

    Output:

    #!python
    
    $ python3 01_blog_title.py 
    {'name': '10分钟学会API测试'}
    {'name': 'python数据分析快速入门教程4-数据汇聚'}
    {'name': 'python数据分析快速入门教程6-重整'}
    {'name': 'python数据分析快速入门教程5-处理缺失数据'}
    {'name': 'python库介绍-pytesseract: OCR光学字符识别'}
    {'name': '软件自动化测试初学者忠告'}
    {'name': '使用opencv转换3d图片'}
    {'name': 'python opencv3实例(对象识别和增强现实)2-边缘检测和应用图像过滤器'}
    {'name': 'numpy学习指南3rd3:常用函数'}
    {'name': 'numpy学习指南3rd2:NumPy基础'}
    
    

    Scraping python.org with urllib3 and Beautiful Soup

    Code: 02_events_with_urllib3.py

    #!python
    
    import urllib3
    from bs4 import BeautifulSoup
    
    def get_upcoming_events(url):
        # urllib3 manages connections through a PoolManager
        req = urllib3.PoolManager()
        res = req.request('GET', url)
    
        # res.data is raw bytes; BeautifulSoup handles the decoding
        soup = BeautifulSoup(res.data, 'html.parser')
    
        events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    
        for event in events:
            event_details = dict()
            event_details['name'] = event.find('h3').find("a").text
            event_details['location'] = event.find('span', {'class': 'event-location'}).text
            event_details['time'] = event.find('time').text
            print(event_details)
    
    get_upcoming_events('https://www.python.org/events/python-events/')
    
    
    

    requests is a higher-level wrapper around urllib3, so in most cases you simply use requests directly.
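    One place where that relationship shows through is retry handling: a requests Session can mount an adapter configured with urllib3's Retry class (a minimal sketch of one common setup):

    #!python
    
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry
    
    session = requests.Session()
    # Retry up to 3 times with exponential backoff on common server errors
    retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503])
    session.mount('https://', HTTPAdapter(max_retries=retries))
    
    res = session.get('https://www.python.org/events/python-events/')
    print(res.status_code)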

    Scraping python.org with Scrapy

    Scrapy is a very popular open-source Python scraping framework for extracting data. It covers everything the previous examples did, adds many other built-in modules and extensions, and is our tool of choice when scraping with Python.
    Scrapy has a number of powerful features worth mentioning:

    • Built-in extensions for making HTTP requests and handling compression, authentication, caching, user agents, and HTTP headers
    • Built-in support for selector languages such as CSS and XPath for selecting and extracting data, as well as support for selecting content and links with regular expressions
    • Encoding support for handling languages and non-standard encoding declarations
    • Flexible APIs for reusing and writing custom middlewares and pipelines, which give you a clean, simple way to automate tasks such as downloading assets (for example, images or media) and storing data in file systems, S3, databases, and so on (a minimal pipeline sketch follows below)

    There are several ways to use Scrapy. One is the programmatic mode, in which we create the crawler and spider in code. It is also possible to configure a Scrapy project from templates or generators and then run the scraper from the command line. This book follows the programmatic mode, since it keeps the code in a single file.
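    As an aside on the pipelines just mentioned, an item pipeline is simply a class with a process_item method that Scrapy calls for every item a spider yields. A minimal sketch (the class name and print behavior are illustrative; a real project would enable it through the ITEM_PIPELINES setting):

    #!python
    
    class PrintItemsPipeline:
        # Called once for every item the spider produces
        def process_item(self, item, spider):
            print(item)
            return item  # pass the item on to any later pipelines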

    Code: 03_events_with_scrapy.py

    #!python
    
    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class PythonEventsSpider(scrapy.Spider):
        name = 'pythoneventsspider'
    
        start_urls = ['https://www.python.org/events/python-events/',]
        found_events = []
    
        def parse(self, response):
            # Each upcoming event is an <li> inside the list-recent-events <ul>
            for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
                event_details = dict()
                event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
                event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
                event_details['time'] = event.xpath('p/time/text()').extract_first()
                self.found_events.append(event_details)
    
    if __name__ == "__main__":
        process = CrawlerProcess({ 'LOG_LEVEL': 'ERROR'})
        process.crawl(PythonEventsSpider)
        # Keep a reference to the spider instance so the results can be read back
        spider = next(iter(process.crawlers)).spider
        process.start()
    
        for event in spider.found_events: print(event)
    
    
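    Note that process.start() blocks until the crawl finishes, and because CrawlerProcess runs a Twisted reactor it can only be started once per Python process.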

    Exercise: use Scrapy to scrape the blog post titles on the home page of https://china-testing.github.io/ (10 in total).

    Reference solution:

    03_blog_with_scrapy.py

    #!python
    
    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class PythonEventsSpider(scrapy.Spider):
        name = 'pythoneventsspider'
    
        start_urls = ['https://china-testing.github.io/',]
        found_events = []
    
        def parse(self, response):
            for event in response.xpath('//article//h1'):
                event_details = dict()
                event_details['name'] = event.xpath('a/text()').extract_first()
                self.found_events.append(event_details)
    
    if __name__ == "__main__":
        process = CrawlerProcess({ 'LOG_LEVEL': 'ERROR'})
        process.crawl(PythonEventsSpider)
        spider = next(iter(process.crawlers)).spider
        process.start()
    
        for event in spider.found_events: print(event)
    
    
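    A small design note on both spiders above: found_events is a class attribute, so results are collected on the class and read back after the crawl. That is convenient for a one-off script, but the list is shared across all instances of the spider.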

    Scraping Python.org with Selenium and PhantomJS

    04_events_with_selenium.py

    #!python
    
    from selenium import webdriver
    
    def get_upcoming_events(url):
        # Launch a Chrome browser; requires the chromedriver binary on your PATH
        driver = webdriver.Chrome()
        driver.get(url)
    
        events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    
        for event in events:
            event_details = dict()
            event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
            event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
            event_details['time'] = event.find_element_by_xpath('p/time').text
            print(event_details)
    
        driver.close()
    
    get_upcoming_events('https://www.python.org/events/python-events/')
    
    
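    If chromedriver is not on your PATH, Selenium 3 also accepts an explicit path to the binary (a sketch; the path shown is a placeholder, not a real location):

    #!python
    
    from selenium import webdriver
    
    # executable_path is a placeholder; point it at your chromedriver binary
    driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
    driver.get('https://www.python.org/events/python-events/')
    print(driver.title)
    driver.quit()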

    Switching to driver = webdriver.PhantomJS('phantomjs') runs the browser without a visible window; the code is as follows:

    05_events_with_phantomjs.py

    #!python
    
    from selenium import webdriver
    
    def get_upcoming_events(url):
        # PhantomJS runs headlessly; the 'phantomjs' binary must be on your PATH
        driver = webdriver.PhantomJS('phantomjs')
        driver.get(url)
    
        events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    
        for event in events:
            event_details = dict()
            event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
            event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
            event_details['time'] = event.find_element_by_xpath('p/time').text
            print(event_details)
    
        driver.close()
    
    get_upcoming_events('https://www.python.org/events/python-events/')
    
    

    However, Selenium's own headless mode is now a better replacement for PhantomJS.

    04_events_with_selenium_headless.py

    #!python
    
    from selenium import webdriver
    
    def get_upcoming_events(url):
    
        # Configure Chrome to run without opening a browser window
        options = webdriver.ChromeOptions()
        options.add_argument('headless')
        driver = webdriver.Chrome(chrome_options=options)
        driver.get(url)
    
        events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    
        for event in events:
            event_details = dict()
            event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
            event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
            event_details['time'] = event.find_element_by_xpath('p/time').text
            print(event_details)
    
        driver.close()
    
    get_upcoming_events('https://www.python.org/events/python-events/')
    
    
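    Note that in newer Selenium releases the chrome_options argument has been renamed to options, so recent versions will emit a deprecation warning for the call above.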

