Course Assignment - Crawler Intro 03 - Crawler Basics - WilliamZeng-201


Author: amoyyean | Published 2017-07-30 20:54

    Assignment

    • Updated on August 9 with some additions after teacher Zeng's walkthrough in Crawler Intro 04: the code and its driver were changed to first crawl the article links under the 解密大数据 (Decrypting Big Data) collection, and then crawl those pages
    • Used the pages of the first two assignments in the Jianshu 解密大数据 collection, 爬虫入门01 and 爬虫入门02, as the pages to crawl
    • Crawled every element on those pages that can be crawled; I chose the article body text, plus the images and text links inside the body, including their captions and labels
    • Also tried crawling with lxml
    References

    Thanks to teacher Zeng for sharing and introducing these tools, which saved us a lot of time. Getting to know them still takes focused time reading documentation and practicing, so I hope the teacher has not overestimated how quickly we absorb things and how much time we can invest, and will be patient with those of us who fall behind.


    Code, part one: the beautifulsoup4 implementation
    1. Module imports
    2. The basic download function: download
    3. Crawling the article body from an article page: crawl_page
    4. Crawling the image captions and links inside an article: crawl_article_images
    5. Crawling the text-link labels and URLs inside an article: crawl_article_text_link
    6. Crawling the (article) title links on a collection page: crawl_links

    Every result is written to a file named after the article title. The logic that grabs the title, creates the file and writes the crawled content is shared by the last three functions above; in the limited homework time, and without formal training, I did not factor those shared statements out into a single function. A little redundant code remains that may come in handy later, so I left it unchanged.
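For what it's worth, the shared grab-title / build-file-name / write logic could be factored into one helper along these lines. This is only a Python 3 sketch (the original code is Python 2); the character replacements mirror the ones crawl_page applies to titles, and the names `sanitize_title` / `save_to_file` are my own:

```python
import os

def sanitize_title(title):
    # Replace characters that are illegal or awkward in file names,
    # mirroring the replacements crawl_page performs on the title
    for ch in '|"<>':
        title = title.replace(ch, ' ')
    return title.replace('/', ',').replace('\x08', '')

def save_to_file(title, suffix, content, out_dir='spider_output'):
    """Shared save logic: build a file name from the article title and
    write the crawled content, skipping files that already exist."""
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)
    file_name = os.path.join(out_dir, sanitize_title(title) + suffix)
    if os.path.exists(file_name):
        return False  # already crawled; do not overwrite
    with open(file_name, 'wb') as f:
        f.write(content.encode('utf-8', errors='ignore'))
    return True
```

Each of the three crawl functions could then end with a single `save_to_file(title, '_images.txt', image_content)`-style call.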

    Module imports
    import os
    import time
    import urllib2
    import urlparse # needed by crawl_links below for urljoin
    from bs4 import BeautifulSoup # parses the downloaded pages; install with: pip install beautifulsoup4
    

    There has been little discussion of where to run pip; most classmates probably install modules through a Python IDE rather than calling pip or easy_install directly. On Windows I found I had to run it from the command line, and only from the directory where pip is installed, e.g. D:\Python27\Scripts; I did not have time to look into configuring the PATH environment variable this time. Because the module was already installed another way, pip install beautifulsoup4 returned: Requirement already satisfied: beautifulsoup4 in d:\python27\lib\site-packages. I am not entirely sure the command-line installation itself is error-free.
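One way around the PATH problem, and around the question of which Python a given pip belongs to, is to run pip through the interpreter itself with `python -m pip`; that works from any directory. A small sketch (it only checks the pip version here, but the same pattern runs an install):

```python
import subprocess
import sys

# sys.executable is the running interpreter, so "-m pip" is guaranteed to
# invoke the pip that belongs to it, regardless of the current directory.
result = subprocess.run([sys.executable, '-m', 'pip', '--version'],
                        capture_output=True, text=True)
print(result.stdout.strip())

# Installing would look like:
# subprocess.run([sys.executable, '-m', 'pip', 'install', 'beautifulsoup4'])
```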

    The download function
    def download(url, retry=2):
        """
        Download a complete page.
        :param url: the url to download
        :param retry: number of retries
        :return: raw html
        """
        print "downloading: ", url
        # set header info to mimic a browser request
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
        }
        try: # the fetch may fail; catch errors with try-except
            request = urllib2.Request(url, headers=header) # build the request
            html = urllib2.urlopen(request).read() # fetch the url
        except urllib2.URLError as e: # error handling
            print "download error: ", e.reason
            html = None
            if retry > 0: # retries left, keep trying
                if hasattr(e, 'code') and 500 <= e.code < 600: # only retry on server-side (5xx) errors
                    print e.code
                    return download(url, retry - 1)
        time.sleep(1) # wait 1s to go easy on the server and avoid being blocked
        return html
    

    Apart from changing the header, this part is copied from the teacher's code. I did not study urllib2 closely, so I will not elaborate on it.
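The retry policy in download, which tries again only for server-side 5xx errors and at most `retry` more times, can be isolated and exercised without the network. A Python 3 sketch with a stand-in fetch function (`flaky` and its 504 behavior are invented for the demo; urllib2 became urllib.request in Python 3):

```python
def fetch_with_retry(fetch, url, retry=2):
    """Call fetch(url); on a server-side (5xx) error, retry up to `retry` times."""
    try:
        return fetch(url)
    except IOError as e:
        code = getattr(e, 'code', None)  # HTTPError-style errors carry a status code
        if retry > 0 and code is not None and 500 <= code < 600:
            return fetch_with_retry(fetch, url, retry - 1)
        return None  # give up: retries exhausted or not a retryable error

# A fake fetch that fails with a 504 twice, then succeeds
calls = []
def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        err = IOError('Gateway Time-out')
        err.code = 504
        raise err
    return '<html>ok</html>'

result = fetch_with_retry(flaky, 'http://example.com/p/x', retry=2)
```

With `retry=2` the function makes up to three attempts in total, matching the original's behavior of one initial try plus two retries.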

    The crawl_page function
    def crawl_page(crawled_url):
        """
        Crawl article contents
        :param crawled_url: the set of page urls to crawl
        """
        for link in crawled_url: # crawl article by article
            html = download(link)
            soup = BeautifulSoup(html, "html.parser")
            title = soup.find('h1', {'class': 'title'}).text # get the article title
            """
            replace special characters, otherwise building a file name
            from the title fails
            """
            title = title.replace('|', ' ')
            title = title.replace('"', ' ')
            title = title.replace('/', ',')
            title = title.replace('<', ' ')
            title = title.replace('>', ' ')
            title = title.replace('\x08', '')
            # print (title)
            content = soup.find('div', {'class': 'show-content'}).text # get the article body
    
            if not os.path.exists('spider_output/'): # make sure the output directory exists
                os.mkdir('spider_output/')
    
            file_name = 'spider_output/' + title + '.txt' # build the file name
            if os.path.exists(file_name):
                # os.remove(file_name) # delete the file
                continue  # skip files that already exist
            file = open(file_name, 'wb') # write the file
            content = unicode(content).encode('utf-8', errors='ignore')
            file.write(content)
            file.close()
    

    This part is also lightly trimmed and modified from the teacher's code.

    The crawl_article_images function
    def crawl_article_images(post_url):
        """
        Crawl the image links in an article
        :param post_url: the article page
        """
        image_url = set()  # distinct image links crawled
        flag = True # whether to keep crawling
        while flag:
            html = download(post_url) # download the page
            if html is None:
                break
    
            soup = BeautifulSoup(html, "html.parser") # parse the downloaded page
            title = soup.find('h1', {'class': 'title'}).text  # get the article title
            image_div = soup.find_all('div', {'class': 'image-package'}) # get the article's image div elements
            if len(image_div) == 0: # no image divs on the page; stop crawling
                break
    
            i = 1
            image_content = ''
            for image in image_div:
                image_link = image.img.get('data-original-src') # get the original image link
                image_caption = image.find('div', {'class': 'image-caption'}).text # get the image caption
                image_content += str(i) + '. ' + (unicode(image_caption).encode('utf-8', errors='ignore')) + ' : '+ (unicode(image_link).encode('utf-8', errors='ignore')) + '\n'
                image_url.add(image_link)  # record each image link not yet seen
                i += 1
    
            if not os.path.exists('spider_output/'):  # make sure the output directory exists
                os.mkdir('spider_output')
    
            file_name = 'spider_output/' + title + '_images.txt'  # build the file name
            if not os.path.exists(file_name):
                file = open(file_name, 'wb')  # write the file
                file.write(image_content)
                file.close()
            flag = False
    
        image_num = len(image_url)
        print 'total number of images in the article: ', image_num
    
    

    This part was written from the teacher's demo code and an inspection of the target page. To grab only the image-specific information in an article, I first select the image div elements and then pull the link and caption contained in each. Suggestions for a quicker or clearer approach are welcome.
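The same two pieces of information can also be pulled with the standard library's html.parser, with no third-party dependency. A Python 3 sketch over a made-up fragment shaped like the image-package markup described above (the markup string is invented for the demo):

```python
from html.parser import HTMLParser

class ImageInfoParser(HTMLParser):
    """Collect data-original-src links and image-caption texts."""
    def __init__(self):
        super().__init__()
        self.links, self.captions = [], []
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'img' and 'data-original-src' in attrs:
            self.links.append(attrs['data-original-src'])
        if tag == 'div' and attrs.get('class') == 'image-caption':
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self._in_caption = False

    def handle_data(self, data):
        if self._in_caption:
            self.captions.append(data)

page = ('<div class="image-package">'
        '<img data-original-src="//upload-images.example/fig1.png">'
        '<div class="image-caption">figure 1</div></div>')
parser = ImageInfoParser()
parser.feed(page)
```

BeautifulSoup is much more convenient for ad-hoc navigation, but a streaming parser like this avoids building the tree at all.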

    The crawl_article_text_link function
    def crawl_article_text_link(post_url):
        """
        Crawl the text links in an article
        :param post_url: the article page
        """
        text_link_url = set()  # distinct text links crawled
        flag = True # whether to keep crawling
        while flag:
            html = download(post_url) # download the page
            if html is None:
                break
    
            soup = BeautifulSoup(html, "html.parser") # parse the downloaded page
            title = soup.find('h1', {'class': 'title'}).text  # get the article title
            article_content = soup.find('div', {'class': 'show-content'}) # get the article body div
            text_links = article_content.find_all('a', {'target': '_blank'})
            if len(text_links) == 0: # no text links on the page; stop crawling
                break
    
            i = 1
            text_links_content = ''
            for link in text_links:
                link_url = link.get('href') # get the link url
                link_label = link.text # get the link text
                text_links_content += str(i) + '. ' + (unicode(link_label).encode('utf-8', errors='ignore')) + ' : '+ (unicode(link_url).encode('utf-8', errors='ignore')) + '\n'
                text_link_url.add(link_url)  # record each link not yet seen
                i += 1
    
            if not os.path.exists('spider_output/'):  # make sure the output directory exists
                os.mkdir('spider_output')
    
            file_name = 'spider_output/' + title + '_article_text_links.txt'  # build the file name
            if not os.path.exists(file_name):
                file = open(file_name, 'wb')  # write the file
                file.write(text_links_content)
                file.close()
            flag = False
    
        link_num = len(text_link_url)
        print 'total number of text links in the article: ', link_num
    
    

    Here I first select the article body, then the link elements inside it. Again, suggestions for a simpler or clearer approach are welcome.
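A side note on the numbered-line format: the manual counter `i` can be replaced with enumerate. A Python 3 sketch of the same format, using the label/url pairs of the two assignment pages from this post:

```python
links = [('爬虫入门01', 'http://www.jianshu.com/p/10b429fd9c4d'),
         ('爬虫入门02', 'http://www.jianshu.com/p/faf2f4107b9b')]

# enumerate(..., 1) starts the numbering at 1, replacing the manual counter
text_links_content = ''.join('%d. %s : %s\n' % (n, label, url)
                             for n, (label, url) in enumerate(links, 1))
```

Building the string once with join also avoids the repeated `+=` concatenation.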

    The crawl_links function
    def crawl_links(url_seed, url_root):
        """
        Crawl article links
        :param url_seed: the seed page url pattern to download
        :param url_root: the root of the site being crawled
        :return: the set of page links to crawl
        """
        crawled_url = set()  # pages to crawl
        i = 1
        flag = True  # whether to keep crawling
        while flag:
            url = url_seed % i  # the page actually being crawled
            i += 1  # the next page to crawl
    
            html = download(url)  # download the page
            if html is None:  # an empty page means we have reached the end
                break
    
            soup = BeautifulSoup(html, "html.parser")  # parse the downloaded page
            links = soup.find_all('a', {'class': 'title'})  # get the title elements
            if len(links) == 0:  # no more valid data on the page; stop crawling
                flag = False
    
            for link in links:  # collect the valid article urls
                link = link.get('href')
                if link not in crawled_url:
                    realUrl = urlparse.urljoin(url_root, link)
                    crawled_url.add(realUrl)  # record each page not yet seen
                else:
                    print 'end'
                    flag = False  # stop crawling
    
        paper_num = len(crawled_url)
        print 'total paper num: ', paper_num
        return crawled_url
    

    This is the same as the code teacher Zeng provided in class; note that it relies on urlparse.urljoin, so the urlparse module must be imported.
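The termination logic of crawl_links, stopping when a page comes back empty or repeats a link already seen, can be checked in isolation. A Python 3 sketch where `pages`, a dict of page number to link lists, stands in for the live downloads:

```python
def collect_links(pages):
    """Walk numbered pages; stop on an empty page or on the first repeated
    link, mirroring crawl_links' termination conditions."""
    seen, i = set(), 1
    while True:
        links = pages.get(i, [])  # stand-in for downloading page i
        i += 1
        if not links:  # empty page: we have reached the end
            break
        stop = False
        for link in links:
            if link in seen:
                stop = True  # repeated link: the listing has wrapped around
            else:
                seen.add(link)
        if stop:
            break
    return seen

demo = {1: ['/p/a', '/p/b'], 2: ['/p/c'], 3: ['/p/a']}
found = collect_links(demo)
```

On the demo data the walk stops at page 3 because '/p/a' repeats, so page 4 is never requested.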

    Calling the functions to crawl the pages
    crawl_article_images('http://www.jianshu.com/p/10b429fd9c4d')
    crawl_article_images('http://www.jianshu.com/p/faf2f4107b9b')
    crawl_article_images('http://www.jianshu.com/p/111')
    crawl_article_text_link('http://www.jianshu.com/p/10b429fd9c4d')
    crawl_article_text_link('http://www.jianshu.com/p/faf2f4107b9b')
    crawl_page(['http://www.jianshu.com/p/10b429fd9c4d'])
    crawl_page(['http://www.jianshu.com/p/faf2f4107b9b'])
    

    I crawled the pages of my previous two crawler assignments, and also pointed crawl_article_images at a page that does not exist.

    The Python Console output is as follows

    downloading:  http://www.jianshu.com/p/10b429fd9c4d
    total number of images in the article:  2
    downloading:  http://www.jianshu.com/p/faf2f4107b9b
    total number of images in the article:  0
    downloading:  http://www.jianshu.com/p/111
    download error:  Not Found
    total number of images in the article:  0
    downloading:  http://www.jianshu.com/p/10b429fd9c4d
    total number of text links in the article:  2
    downloading:  http://www.jianshu.com/p/faf2f4107b9b
    total number of text links in the article:  2
    downloading:  http://www.jianshu.com/p/10b429fd9c4d
    downloading:  http://www.jianshu.com/p/faf2f4107b9b
    
    The result files produced are shown below. (Screenshots in the original post: the list of result files, and three samples of result-file contents.)

    After taking Crawler 04 I understood that the Crawler 03 assignment was meant to crawl in two steps: first call crawl_links to collect the article links in the 解密大数据 collection, then call crawl_page to crawl the article content on each collected link. The new driver code is as follows

    url_root = 'http://www.jianshu.com/'
    url_seed = 'http://www.jianshu.com/c/9b4685b6357c/?page=%d'
    crawled_url = crawl_links(url_seed, url_root)
    crawl_page(crawled_url)
    

    The Python Console output is as follows

    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=1
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=2
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=3
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=4
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=5
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=6
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=7
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=8
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=9
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=10
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=11
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=12
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=13
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=14
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=15
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=16
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=17
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=18
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=19
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=20
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=21
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=22
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=23
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=24
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=25
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=26
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=27
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=28
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=29
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=30
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=31
    downloading:  http://www.jianshu.com/c/9b4685b6357c/?page=32
    total paper num:  305
    downloading:  http://www.jianshu.com/p/45df7e3ecc78
    downloading:  http://www.jianshu.com/p/99ae5b28a51f
    downloading:  http://www.jianshu.com/p/d6243f087bd9
    downloading:  http://www.jianshu.com/p/ea40c6da9fec
    downloading:  http://www.jianshu.com/p/59e0da43136e
    downloading:  http://www.jianshu.com/p/e71e5d7223bb
    downloading:  http://www.jianshu.com/p/dc07545c6607
    downloading:  http://www.jianshu.com/p/99fd951a0b8b
    downloading:  http://www.jianshu.com/p/02f33063c258
    downloading:  http://www.jianshu.com/p/ad10d79255f8
    downloading:  http://www.jianshu.com/p/062b8dfca144
    downloading:  http://www.jianshu.com/p/cb4f8ab1b380
    downloading:  http://www.jianshu.com/p/2c557a1bfa04
    downloading:  http://www.jianshu.com/p/8f7102c74a4f
    downloading:  http://www.jianshu.com/p/77876ef45ab4
    downloading:  http://www.jianshu.com/p/e5475131d03f
    downloading:  http://www.jianshu.com/p/e0bd6bfad10b
    downloading:  http://www.jianshu.com/p/a425acdaf77e
    downloading:  http://www.jianshu.com/p/729edfc613aa
    downloading:  http://www.jianshu.com/p/e50c863bb465
    downloading:  http://www.jianshu.com/p/7107b67c47bc
    downloading:  http://www.jianshu.com/p/6585d58f582a
    downloading:  http://www.jianshu.com/p/4f38600dae7c
    downloading:  http://www.jianshu.com/p/1292d7a3805e
    downloading:  http://www.jianshu.com/p/7cb84cfa56fa
    downloading:  http://www.jianshu.com/p/41c14ef3e59a
    downloading:  http://www.jianshu.com/p/1a2a07611fd8
    downloading:  http://www.jianshu.com/p/217a4578f9ab
    downloading:  http://www.jianshu.com/p/d234a015fa90
    downloading:  http://www.jianshu.com/p/e08d1a03045f
    downloading:  http://www.jianshu.com/p/41b1ee54d766
    downloading:  http://www.jianshu.com/p/6f4a7a1ef85c
    downloading:  http://www.jianshu.com/p/faf2f4107b9b
    downloading:  http://www.jianshu.com/p/9dee9886b140
    downloading:  http://www.jianshu.com/p/e2ee86a8a32b
    downloading:  http://www.jianshu.com/p/9258b0495021
    downloading:  http://www.jianshu.com/p/7e2fccb4fad9
    downloading:  http://www.jianshu.com/p/f21f01a92521
    downloading:  http://www.jianshu.com/p/d882831868fb
    downloading:  http://www.jianshu.com/p/872a67eed7af
    downloading:  http://www.jianshu.com/p/2e64c2045be5
    downloading:  http://www.jianshu.com/p/565500cfb5a4
    downloading:  http://www.jianshu.com/p/1729787990e7
    downloading:  http://www.jianshu.com/p/8ca518b3b2d5
    downloading:  http://www.jianshu.com/p/9c7fbcac3461
    downloading:  http://www.jianshu.com/p/13d76e7741c0
    downloading:  http://www.jianshu.com/p/81d17436f29e
    downloading:  http://www.jianshu.com/p/148b7cc83bcd
    downloading:  http://www.jianshu.com/p/70b7505884e9
    downloading:  http://www.jianshu.com/p/ba4100af215a
    downloading:  http://www.jianshu.com/p/333dacb0e1b2
    downloading:  http://www.jianshu.com/p/ff2d4eadebde
    downloading:  http://www.jianshu.com/p/eb01f9002091
    downloading:  http://www.jianshu.com/p/ba43beaa186a
    downloading:  http://www.jianshu.com/p/14967ec6e954
    downloading:  http://www.jianshu.com/p/d44cc7e9a0a9
    downloading:  http://www.jianshu.com/p/d0de8ee83ea1
    downloading:  http://www.jianshu.com/p/b4670cb9e998
    downloading:  http://www.jianshu.com/p/9f9fb337be0c
    downloading:  http://www.jianshu.com/p/542f41879879
    downloading:  http://www.jianshu.com/p/e9f6b15318be
    downloading:  http://www.jianshu.com/p/f1ef93a6c033
    downloading:  http://www.jianshu.com/p/92a66ccc8998
    downloading:  http://www.jianshu.com/p/f0063d735a5c
    downloading:  http://www.jianshu.com/p/856c8d648e20
    downloading:  http://www.jianshu.com/p/b9407b2c22a4
    downloading:  http://www.jianshu.com/p/a36e997b8e59
    downloading:  http://www.jianshu.com/p/c28207b3c71d
    downloading:  http://www.jianshu.com/p/8448ac374dc1
    downloading:  http://www.jianshu.com/p/4a3fbcb06981
    downloading:  http://www.jianshu.com/p/d7267956035a
    downloading:  http://www.jianshu.com/p/b1a9daef3423
    downloading:  http://www.jianshu.com/p/5eb037498c48
    downloading:  http://www.jianshu.com/p/f756bf0beb26
    downloading:  http://www.jianshu.com/p/673b768c6084
    downloading:  http://www.jianshu.com/p/6233788a8abb
    downloading:  http://www.jianshu.com/p/087ce1951647
    downloading:  http://www.jianshu.com/p/7240db1ba0af
    downloading:  http://www.jianshu.com/p/289e51eb6446
    downloading:  http://www.jianshu.com/p/39d6793a6554
    downloading:  http://www.jianshu.com/p/0565cd673282
    downloading:  http://www.jianshu.com/p/873613065502
    downloading:  http://www.jianshu.com/p/605644d688ff
    downloading:  http://www.jianshu.com/p/1ea730c97aae
    downloading:  http://www.jianshu.com/p/bab0c09416ee
    downloading:  http://www.jianshu.com/p/c6591991d1ca
    downloading:  http://www.jianshu.com/p/fd9536a0acfb
    downloading:  http://www.jianshu.com/p/ed8dc3802927
    downloading:  http://www.jianshu.com/p/f89c4032a0b2
    downloading:  http://www.jianshu.com/p/1fa23219270d
    downloading:  http://www.jianshu.com/p/defeeb920c3a
    downloading:  http://www.jianshu.com/p/412f8eab2599
    downloading:  http://www.jianshu.com/p/05c15b9f16f1
    downloading:  http://www.jianshu.com/p/4931d66276c3
    downloading:  http://www.jianshu.com/p/b5165468a32b
    downloading:  http://www.jianshu.com/p/2c02a7b0b382
    downloading:  http://www.jianshu.com/p/dffdaf11bd4c
    downloading:  http://www.jianshu.com/p/71c02ef761ac
    downloading:  http://www.jianshu.com/p/6920d5e48b31
    downloading:  http://www.jianshu.com/p/71b968bd8abb
    downloading:  http://www.jianshu.com/p/6450dce856fd
    downloading:  http://www.jianshu.com/p/c1163e39a42e
    downloading:  http://www.jianshu.com/p/bd9a27c4e2a8
    downloading:  http://www.jianshu.com/p/88d0addf64fa
    downloading:  http://www.jianshu.com/p/6a7afc98c868
    downloading:  http://www.jianshu.com/p/733475b6900d
    downloading:  http://www.jianshu.com/p/f75128ec3ea3
    downloading:  http://www.jianshu.com/p/9ee12067f35e
    downloading:  http://www.jianshu.com/p/c41624a83b71
    downloading:  http://www.jianshu.com/p/8318f5b722cf
    downloading:  http://www.jianshu.com/p/b5c292e093a2
    downloading:  http://www.jianshu.com/p/0a6977eb686d
    downloading:  http://www.jianshu.com/p/456ab3a6ef71
    downloading:  http://www.jianshu.com/p/d578d5e2755f
    downloading:  http://www.jianshu.com/p/616642976ded
    downloading:  http://www.jianshu.com/p/c9e1dffad756
    downloading:  http://www.jianshu.com/p/81819f27a7d8
    downloading:  http://www.jianshu.com/p/a4beefd8cfc2
    downloading:  http://www.jianshu.com/p/799c51fbe5f1
    downloading:  http://www.jianshu.com/p/5e4a86f8025c
    downloading:  http://www.jianshu.com/p/7acf291b2a5e
    downloading:  http://www.jianshu.com/p/6ef6b9a56b50
    downloading:  http://www.jianshu.com/p/210aacd31ef7
    downloading:  http://www.jianshu.com/p/9a9280de68f8
    downloading:  http://www.jianshu.com/p/d5bc50d8e0a2
    downloading:  http://www.jianshu.com/p/39eb230e6f15
    downloading:  http://www.jianshu.com/p/c0c0a3ed35d4
    downloading:  http://www.jianshu.com/p/74db357c7252
    downloading:  http://www.jianshu.com/p/6a91f948b62d
    downloading:  http://www.jianshu.com/p/bc75ab89fac0
    downloading:  http://www.jianshu.com/p/8088d1bede8d
    downloading:  http://www.jianshu.com/p/8ca88a90ea17
    downloading:  http://www.jianshu.com/p/a8037a38e219
    downloading:  http://www.jianshu.com/p/979b4c5c1857
    downloading:  http://www.jianshu.com/p/3dfedf60de62
    downloading:  http://www.jianshu.com/p/ada67bd7c56f
    downloading:  http://www.jianshu.com/p/486afcd4c36c
    downloading:  http://www.jianshu.com/p/2841c81d57fc
    downloading:  http://www.jianshu.com/p/e492d3acfe38
    downloading:  http://www.jianshu.com/p/b4e2e5e31154
    downloading:  http://www.jianshu.com/p/75fc36aec98e
    downloading:  http://www.jianshu.com/p/545581b0c7dd
    downloading:  http://www.jianshu.com/p/a015b756a803
    downloading:  http://www.jianshu.com/p/29062bca16aa
    downloading:  http://www.jianshu.com/p/3a95a09cda40
    downloading:  http://www.jianshu.com/p/8fbe3a7b4764
    downloading:  http://www.jianshu.com/p/0329f87c9ae4
    downloading:  http://www.jianshu.com/p/e1b28de0a1e4
    download error:  Gateway Time-out
    504
    downloading:  http://www.jianshu.com/p/e1b28de0a1e4
    downloading:  http://www.jianshu.com/p/b5c31a2eeb8b
    downloading:  http://www.jianshu.com/p/7e556f17021a
    downloading:  http://www.jianshu.com/p/23144099e9f8
    downloading:  http://www.jianshu.com/p/a91c54f96ded
    downloading:  http://www.jianshu.com/p/74ef104a9f45
    downloading:  http://www.jianshu.com/p/afa17bc391b7
    downloading:  http://www.jianshu.com/p/90914aef3636
    downloading:  http://www.jianshu.com/p/0c0e3ace0da1
    downloading:  http://www.jianshu.com/p/b7eef4033a09
    downloading:  http://www.jianshu.com/p/7b2e81589a4f
    downloading:  http://www.jianshu.com/p/2f7d10b2e508
    downloading:  http://www.jianshu.com/p/ed499f4ecdd1
    downloading:  http://www.jianshu.com/p/11c103c03d4a
    downloading:  http://www.jianshu.com/p/97ff0beca873
    downloading:  http://www.jianshu.com/p/7c54cd046d4b
    downloading:  http://www.jianshu.com/p/cfaf85b24281
    downloading:  http://www.jianshu.com/p/356a579062aa
    downloading:  http://www.jianshu.com/p/460a8eed5cfa
    downloading:  http://www.jianshu.com/p/46e82e4fe324
    downloading:  http://www.jianshu.com/p/ba00a9852a02
    downloading:  http://www.jianshu.com/p/b6359185fc26
    downloading:  http://www.jianshu.com/p/a1a2dabb4bc2
    downloading:  http://www.jianshu.com/p/4077cbc4dd37
    downloading:  http://www.jianshu.com/p/90efe88727fe
    downloading:  http://www.jianshu.com/p/17f99100525a
    downloading:  http://www.jianshu.com/p/01385e2dd129
    downloading:  http://www.jianshu.com/p/ec3c57d6a4c7
    downloading:  http://www.jianshu.com/p/9632ba906ca2
    downloading:  http://www.jianshu.com/p/85da47fddad7
    downloading:  http://www.jianshu.com/p/3b47b36cc8e8
    downloading:  http://www.jianshu.com/p/29e304a61d32
    downloading:  http://www.jianshu.com/p/649167e0e2f4
    downloading:  http://www.jianshu.com/p/13840057782d
    downloading:  http://www.jianshu.com/p/11b3dbb05c39
    downloading:  http://www.jianshu.com/p/3a5975d6ac55
    downloading:  http://www.jianshu.com/p/394856545ab0
    downloading:  http://www.jianshu.com/p/0ee1f0bfc8cb
    downloading:  http://www.jianshu.com/p/2364064e0bc9
    downloading:  http://www.jianshu.com/p/09b19b8f8886
    downloading:  http://www.jianshu.com/p/50a2ba489685
    downloading:  http://www.jianshu.com/p/f0436668cb72
    downloading:  http://www.jianshu.com/p/c0f3d36d0c7a
    downloading:  http://www.jianshu.com/p/be0192aa6486
    downloading:  http://www.jianshu.com/p/ee43c55123f8
    downloading:  http://www.jianshu.com/p/af4765b703f0
    downloading:  http://www.jianshu.com/p/ff772050bd96
    downloading:  http://www.jianshu.com/p/e121b1a420ad
    downloading:  http://www.jianshu.com/p/ed93f7f344d0
    downloading:  http://www.jianshu.com/p/8f6ee3b1efeb
    downloading:  http://www.jianshu.com/p/3f06c9f69142
    downloading:  http://www.jianshu.com/p/887889c6daee
    downloading:  http://www.jianshu.com/p/ce0e0773c6ec
    downloading:  http://www.jianshu.com/p/be384fd73bdb
    downloading:  http://www.jianshu.com/p/acc47733334f
    downloading:  http://www.jianshu.com/p/bf5984fb299a
    downloading:  http://www.jianshu.com/p/1a935c2dc911
    downloading:  http://www.jianshu.com/p/8982ad63eb85
    downloading:  http://www.jianshu.com/p/d1acbed69f45
    downloading:  http://www.jianshu.com/p/98cc73755a22
    downloading:  http://www.jianshu.com/p/bb736600b483
    downloading:  http://www.jianshu.com/p/3c71839bc660
    downloading:  http://www.jianshu.com/p/23a905cf936b
    downloading:  http://www.jianshu.com/p/169403f7e40c
    downloading:  http://www.jianshu.com/p/a9c7970bc949
    downloading:  http://www.jianshu.com/p/ed9ec88e71e4
    downloading:  http://www.jianshu.com/p/5057ab6f9ad5
    downloading:  http://www.jianshu.com/p/1b42a12dac14
    downloading:  http://www.jianshu.com/p/5dc5dfe26148
    downloading:  http://www.jianshu.com/p/c88a4453dd6d
    downloading:  http://www.jianshu.com/p/cd971afcb207
    downloading:  http://www.jianshu.com/p/2ccd37ae73e2
    downloading:  http://www.jianshu.com/p/926013888e3e
    downloading:  http://www.jianshu.com/p/888a580b2384
    downloading:  http://www.jianshu.com/p/8a0479f55b21
    downloading:  http://www.jianshu.com/p/e72c8ef71e49
    downloading:  http://www.jianshu.com/p/bb4a81624af1
    downloading:  http://www.jianshu.com/p/4b944b22fe83
    downloading:  http://www.jianshu.com/p/b3e8e9cb0141
    downloading:  http://www.jianshu.com/p/bfd9b3954038
    downloading:  http://www.jianshu.com/p/f6c26ef0f4cc
    downloading:  http://www.jianshu.com/p/56967004f8c4
    downloading:  http://www.jianshu.com/p/ae5f78b40f17
    downloading:  http://www.jianshu.com/p/aed64f7e647b
    downloading:  http://www.jianshu.com/p/a32f27199846
    downloading:  http://www.jianshu.com/p/4b4e0c343d3e
    downloading:  http://www.jianshu.com/p/8f6b5a1bb3fa
    downloading:  http://www.jianshu.com/p/f7354d1c5abf
    downloading:  http://www.jianshu.com/p/1fe31cbddc78
    downloading:  http://www.jianshu.com/p/f7dc92913f33
    downloading:  http://www.jianshu.com/p/296ae7538d1f
    downloading:  http://www.jianshu.com/p/d43125a4ff44
    downloading:  http://www.jianshu.com/p/0b0b7c33be57
    downloading:  http://www.jianshu.com/p/b4ac4473a55d
    downloading:  http://www.jianshu.com/p/4b57424173a0
    downloading:  http://www.jianshu.com/p/e0ae002925bd
    downloading:  http://www.jianshu.com/p/5250518f5cc5
    downloading:  http://www.jianshu.com/p/de3455ed089c
    downloading:  http://www.jianshu.com/p/7b946e6d6861
    downloading:  http://www.jianshu.com/p/62e127dbb73c
    downloading:  http://www.jianshu.com/p/430b5bea974d
    downloading:  http://www.jianshu.com/p/e5d13e351320
    downloading:  http://www.jianshu.com/p/5d8a3205e28e
    downloading:  http://www.jianshu.com/p/1099c3a74336
    downloading:  http://www.jianshu.com/p/761a73b7eea2
    downloading:  http://www.jianshu.com/p/83cc892eb24a
    downloading:  http://www.jianshu.com/p/b223e54fe5ee
    downloading:  http://www.jianshu.com/p/366c2594f24b
    downloading:  http://www.jianshu.com/p/cc3b5d76c587
    downloading:  http://www.jianshu.com/p/6dbadc78d231
    downloading:  http://www.jianshu.com/p/d32d7ab5063a
    downloading:  http://www.jianshu.com/p/020f0281f1df
    downloading:  http://www.jianshu.com/p/f26085aadd47
    downloading:  http://www.jianshu.com/p/df7b35249975
    downloading:  http://www.jianshu.com/p/68423bfc4c4e
    downloading:  http://www.jianshu.com/p/601d3a488a58
    downloading:  http://www.jianshu.com/p/1d6fc1a9406b
    downloading:  http://www.jianshu.com/p/76238014a03f
    downloading:  http://www.jianshu.com/p/9e7cfcc85a57
    downloading:  http://www.jianshu.com/p/819a202adecd
    downloading:  http://www.jianshu.com/p/4a8749704ebf
    downloading:  http://www.jianshu.com/p/d2dc5aa9bf8f
    downloading:  http://www.jianshu.com/p/4dda2425314a
    downloading:  http://www.jianshu.com/p/8baa664ea613
    downloading:  http://www.jianshu.com/p/cbfab5db7f6f
    downloading:  http://www.jianshu.com/p/bd78a49c9d23
    downloading:  http://www.jianshu.com/p/cf2edecdba77
    downloading:  http://www.jianshu.com/p/3b3bca4281aa
    downloading:  http://www.jianshu.com/p/f382741c2736
    downloading:  http://www.jianshu.com/p/4ffca0a43476
    downloading:  http://www.jianshu.com/p/e04bcac99c8d
    downloading:  http://www.jianshu.com/p/5a6c4b8e7700
    downloading:  http://www.jianshu.com/p/37e927476dfe
    downloading:  http://www.jianshu.com/p/67ae9d87cf3c
    downloading:  http://www.jianshu.com/p/4981df2eefe7
    downloading:  http://www.jianshu.com/p/86117613b7a6
    downloading:  http://www.jianshu.com/p/233ff48d668e
    downloading:  http://www.jianshu.com/p/13a68ac7afdd
    downloading:  http://www.jianshu.com/p/aa1121232dfd
    downloading:  http://www.jianshu.com/p/e99dacbf5c44
    downloading:  http://www.jianshu.com/p/74042ba10c0d
    downloading:  http://www.jianshu.com/p/40cc7d239513
    downloading:  http://www.jianshu.com/p/5a8b8ce0a395
    downloading:  http://www.jianshu.com/p/59ca82a11f87
    downloading:  http://www.jianshu.com/p/8266f0c736f9
    downloading:  http://www.jianshu.com/p/fa7dd359d7a8
    downloading:  http://www.jianshu.com/p/87f36332b707
    downloading:  http://www.jianshu.com/p/10b429fd9c4d
    downloading:  http://www.jianshu.com/p/9086d0300d1a
    downloading:  http://www.jianshu.com/p/e76c242c7d6a
    downloading:  http://www.jianshu.com/p/910662d6e881
    downloading:  http://www.jianshu.com/p/f68d28d3b862
    downloading:  http://www.jianshu.com/p/9457100d8763
    downloading:  http://www.jianshu.com/p/62c0a5122fa8
    downloading:  http://www.jianshu.com/p/f6420cce3040
    downloading:  http://www.jianshu.com/p/27a78b2016e0
    downloading:  http://www.jianshu.com/p/0c007dbbf728
    downloading:  http://www.jianshu.com/p/f20bc50ad0e8
    

    294 files were generated in total, most of them articles from the 解密大数据 collection. The output above records a run in which one page returned a 504 error.


    Code, part two: the lxml implementation
    1. Module imports
    2. The basic download function: download
    3. Crawling the article body from an article page: crawl_page
    4. Crawling the image captions and links inside an article: crawl_article_images
    5. Crawling the text-link labels and URLs inside an article: crawl_article_text_link

    The overall structure matches the beautifulsoup4 code. Two kinds of problems came up that I hope a teacher or someone more experienced can comment on someday. The official English lxml documentation is extensive, but it is not easy to find answers there to the specific problems we hit; I mostly relied on web searches, so my fixes are not systematic.

    1. Decoding garbled Chinese. I have not found a way to make lxml.html.fromstring(), which teacher Zeng demonstrated in class, return results that display Chinese characters correctly, so I fell back on the related etree functions.
    2. lxml functions essentially always return a list. When crawling text spread across several levels and places under one element, how do you merge the crawled list of text fragments into a single string variable for writing to a file? I have solved it for now, but I am still curious how the teacher handles the list return values from lxml.
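On question 2, the usual answer is str.join: an xpath('…/text()') query returns a list of fragments, and ''.join() merges them into one string; zip() pairs up parallel href and label lists. A Python 3 sketch with made-up fragments (and href/label values taken from this post); note that the two parallel lists can fall out of step when a link has no text node, which is one reason iterating over the <a> elements themselves is safer:

```python
# Merging the fragment list that an xpath text() query returns
fragments = ['第一段', ', ', 'second fragment']
merged = ''.join(fragments)

# Pairing the parallel href and label lists, as produced by the two
# xpath queries in crawl_article_text_link
labels = ['爬虫入门01', '爬虫入门02']
hrefs = ['/p/10b429fd9c4d', '/p/faf2f4107b9b']
lines = ['%s : %s' % (label, href) for label, href in zip(labels, hrefs)]
```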

    The lxml code, including the driver calls that run the crawl, is shown in full below rather than split into blocks.

    # coding: utf-8
    """
    爬虫课练习代码lxml版本
    课程作业-爬虫入门03-爬虫基础-WilliamZeng-20170716
    """
    
    import os
    import time
    import urllib2
    import lxml.html # lxml中的HTML返回结果解析模块
    import lxml.etree # 为了解决中文乱码而专门引入的lxml模块
    
    def download(url, retry=2):
        """
        下载页面的函数,会下载完整的页面信息
        :param url: 要下载的url
        :param retry: 重试次数
        :return: 原生html
        """
        print "downloading: ", url
        # 设置header信息,模拟浏览器请求
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
        }
        try: #爬取可能会失败,采用try-except方式来捕获处理
            request = urllib2.Request(url, headers=header) #设置请求数据
            html = urllib2.urlopen(request).read() #抓取url
        except urllib2.URLError as e: #异常处理
            print "download error: ", e.reason
            html = None
            if retry > 0: #未超过重试次数,可以继续爬取
                if hasattr(e, 'code') and 500 <= e.code < 600: #错误码范围,是请求出错才继续重试爬取
                    print e.code
                    return download(url, retry - 1)
        time.sleep(1) #等待1s,避免对服务器造成压力,也避免被服务器屏蔽爬取
        return html
    
    def crawl_article_images(post_url):
        """
        抓取文章中图片链接
        :param post_url: 文章页面
        """
        image_link = []
        flag = True # 标记是否需要继续爬取
        while flag:
            page = download(post_url) # 下载页面
            if page == None:
                break
            my_parser = lxml.etree.HTMLParser(encoding="utf-8")
            html_content = lxml.etree.HTML(page, parser=my_parser) # 格式化爬取的页面数据
            # html_content = lxml.html.fromstring(page) # 格式化爬取的页面数据,fromstring函数未找到解决中文乱码的办法
            title = html_content.xpath('//h1[@class="title"]/text()')  # 获取文章标题
            image_link = html_content.xpath('//div/img/@data-original-src') # 获取图片的原始链接
            image_caption = html_content.xpath('//div[@class="image-caption"]/text()') # 获取图片的标题
            if len(image_link) == 0: # 爬取的页面中无图片div元素,终止爬取
                break

            image_content = ''
            for i in range(len(image_link)): # 注意:若有图片缺少图注,image_caption 会比 image_link 短,这里会越界
                image_content += str(i + 1) + '. ' + (unicode(image_caption[i]).encode('utf-8', errors='ignore')) + ' : ' + image_link[i] + '\n'
    
            if not os.path.exists('spider_output/'):  # 检查保存文件的地址
                os.mkdir('spider_output')
    
            file_name = 'spider_output/' + title[0] + '_images_by_lxml.txt'  # 设置要保存的文件名
            if not os.path.exists(file_name):
                file = open(file_name, 'wb')  # 写文件,复用上面拼好的文件名
                file.write(image_content)
                file.close()
            flag = False
    
        image_num = len(image_link)
        print 'total number of images in the article: ', image_num
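    上面用两条 xpath 分别取图片链接和图注。下面用一段静态 HTML 片段(示意标记,并非真实的简书页面结构)演示同样的取值方式:

```python
import lxml.etree

# 模拟文章页中带图注的图片标记(示意片段)
snippet = b'''
<div class="show-content">
  <div class="image-package">
    <img data-original-src="//upload.example.com/a.png"/>
    <div class="image-caption">caption-1</div>
  </div>
</div>
'''
my_parser = lxml.etree.HTMLParser(encoding="utf-8")
html_content = lxml.etree.HTML(snippet, parser=my_parser)
image_link = html_content.xpath('//div/img/@data-original-src')       # 图片原始链接
image_caption = html_content.xpath('//div[@class="image-caption"]/text()')  # 图注文字
print(image_link)     # ['//upload.example.com/a.png']
print(image_caption)  # ['caption-1']
```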
    
    def crawl_article_text_link(post_url):
        """
        抓取文章中的文字链接
        :param post_url: 文章页面
        """
        text_links = [] # 先初始化,防止下载失败跳出循环后末尾引用未定义变量
        flag = True # 标记是否需要继续爬取
        while flag:
            page = download(post_url) # 下载页面
            if page is None:
                break
    
            my_parser = lxml.etree.HTMLParser(encoding="utf-8")
            html_content = lxml.etree.HTML(page, parser=my_parser)  # 格式化爬取的页面数据
            title = html_content.xpath('//h1[@class="title"]/text()')  # 获取文章标题
            text_links = html_content.xpath('//div[@class="show-content"]//a/@href')
            text_links_label = html_content.xpath('//div[@class="show-content"]//a/text()')
            if len(text_links) == 0: # 爬取的页面中没有文字链元素,终止爬取
                break

            text_links_content = ''
            for i in range(len(text_links)): # 注意:若某个 <a> 没有文字,text_links_label 会比 text_links 短,这里会越界
                text_links_content += str(i + 1) + '. ' + (unicode(text_links_label[i]).encode('utf-8', errors='ignore')) + ' : ' + text_links[i] + '\n'
    
            if not os.path.exists('spider_output/'):  # 检查保存文件的地址
                os.mkdir('spider_output')
    
            file_name = 'spider_output/' + title[0] + '_article_text_links_by_lxml.txt'  # 设置要保存的文件名
            if not os.path.exists(file_name):
                file = open(file_name, 'wb')  # 写文件,复用上面拼好的文件名
                file.write(text_links_content)
                file.close()
            flag = False
    
        link_num = len(text_links)
        print 'total number of text links in the article: ', link_num
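    上面的实现按下标 i 同时访问 text_links 和 text_links_label,一旦某个 <a> 里只有图片没有文字,两个列表长度不等就会出错。一个示意性的改法是逐个取 <a> 元素,再分别读它的 href 和文字:

```python
import lxml.etree

# 示意片段:第二个 <a> 只含图片、没有文字
snippet = b'''
<div class="show-content">
  <a href="http://example.com/1">first</a>
  <a href="http://example.com/2"><img src="x.png"/></a>
</div>
'''
html_content = lxml.etree.HTML(snippet)
lines = []
for i, a in enumerate(html_content.xpath('//div[@class="show-content"]//a'), 1):
    label = (a.text or '').strip() or '(no text)'  # 无文字时用占位符,链接和标签始终一一对应
    lines.append('%d. %s : %s' % (i, label, a.get('href')))
print('\n'.join(lines))
# 1. first : http://example.com/1
# 2. (no text) : http://example.com/2
```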
    
    def crawl_page(crawled_url):
        """
        爬取文章内容
        :param crawled_url: 需要爬取的页面地址集合
        """
        for link in crawled_url: #按地址逐篇文章爬取
            page = download(link)
            my_parser = lxml.etree.HTMLParser(encoding="utf-8")
            html_content = lxml.etree.HTML(page, parser=my_parser)
            title = html_content.xpath('//h1[@class="title"]/text()') #获取文章标题
            contents = html_content.xpath('//div[@class="show-content"]//text()') #获取文章内容
            content = ''.join(contents)
    
            if not os.path.exists('spider_output/'): #检查保存文件的地址
                os.mkdir('spider_output/')
    
            file_name = 'spider_output/' + title[0] + '_by_lxml.txt' #设置要保存的文件名
            if os.path.exists(file_name):
                # os.remove(file_name) # 删除文件
                continue  # 已存在的文件不再写
            file = open(file_name, 'wb') #写文件,复用上面拼好的文件名
            content = unicode(content).encode('utf-8', errors='ignore')
            file.write(content)
            file.close()
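    crawl_page 用 //text() 取正文,它会把标签之间的换行等空白文本节点一并取出来,这可能就是其结果与 beautifulsoup4 版本格式略有差异的原因。一个最小示意:

```python
import lxml.etree

snippet = b'<div class="show-content"><p>hello</p>\n<p>world</p></div>'
html_content = lxml.etree.HTML(snippet)
contents = html_content.xpath('//div[@class="show-content"]//text()')
print(contents)           # ['hello', '\n', 'world'],标签间的换行也是文本节点
print(''.join(contents))  # 拼接后换行保留在正文里
```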
    
    
    crawl_article_images('http://www.jianshu.com/p/10b429fd9c4d')
    crawl_article_images('http://www.jianshu.com/p/faf2f4107b9b')
    crawl_article_images('http://www.jianshu.com/p/111') # 一个不存在的页面,用于验证下载失败时的处理
    crawl_article_text_link('http://www.jianshu.com/p/10b429fd9c4d')
    crawl_article_text_link('http://www.jianshu.com/p/faf2f4107b9b')
    crawl_page(['http://www.jianshu.com/p/10b429fd9c4d'])
    crawl_page(['http://www.jianshu.com/p/faf2f4107b9b'])
    

    crawl_article_images和crawl_article_text_link函数的返回结果和beautifulsoup4版本的一致;crawl_page函数的返回结果在格式上略有不同,如下图所示。

    结果文件内容示例1

    分步调用crawl_links和crawl_page函数,先抓取解密大数据专题里的文章链接,再抓取各链接对应网页上的文章内容,这部分新的执行代码尚未编写。主体逻辑除了用xpath选取元素的代码外不会有太大差异,潜在的问题可能还是文章标题中特殊字符的处理。
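    文章标题用作文件名时,冒号、问号、斜杠等字符在 Windows 上是非法的。一个示意性的清洗函数(safe_filename 是假设的名字,并非简书或标准库的 API):

```python
import re

def safe_filename(title):
    """把文件名里 Windows/Unix 下非法的字符替换成下划线(示意实现)"""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('a:b/c?d'))  # a_b_c_d
```

    在打开文件前先对 title[0] 做一次这样的清洗,就可以避免因标题带特殊字符而写文件失败。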


    这次内容和代码注释比较多,文字上可能有些错误或忘了修改的地方,但代码运行结果没有问题。

        本文标题:课程作业-爬虫入门03-爬虫基础-WilliamZeng-201

        本文链接:https://www.haomeiwen.com/subject/wmzhlxtx.html