Python Crawler Intro Course, Assignment 4 - Building a Crawler

Author: 不忘初心2017 | Published 2017-08-09 09:47, read 15 times

    Course assignment

    Crawl the article list of the entire Big Data collection and write the results to a file.
    Data to collect for each article: author, title, article URL, abstract, thumbnail URL, read count, comment count, like count, and reward count.

    Assignment URL

    http://www.jianshu.com/c/9b4685b6357c

    Building the crawler

    I originally planned to build on the framework from the previous assignments, but ran into a problem: determining the number of pages leads to an infinite loop, because jianshu returns the last page unchanged no matter how large page= gets. The fix I came up with is to compare the timestamp of the first article on each page, which works. However, a design that fetches every page up front just to count them increases the number of requests to the target site, which seems undesirable, so I set that aside for now and will rethink the framework design later.
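
    Distilled, the stop condition looks roughly like this (just a sketch; the full, hard-coded version is in the crawler code below, and common.download / url_seed are the same helpers used there):

    prev_stamp = None
    page = 1
    while True:
        tree = etree.HTML(common.download(url_seed % page))
        # timestamp of the first article on this page
        spans = tree.xpath('//*[@id="list-container"]/ul/li[1]//span')
        stamp = spans[0].get('data-shared-at') if spans else None
        # jianshu keeps serving the last page for any larger page= value,
        # so an empty page or a repeated timestamp means we are done
        if stamp is None or stamp == prev_stamp:
            break
        # ... extract this page's articles here ...
        prev_stamp = stamp
        page += 1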

    This assignment is implemented with lxml, with everything hardcoded in one function, so it is a bit rough.

    Crawler code

    # spiderlxml.py -- needs lxml.etree and the shared download helper
    from lxml import etree
    import spidercomm as common

    def crawl_page_jianshu(url_root, url_seed):
        count = 1
        temptime = '2010-1-1'
        contents = []
        while True:
            html = common.download(url_seed % count)
            tree = etree.HTML(html)
            notelist = tree.xpath('//*[@id="list-container"]/ul/li')
            if notelist == []:
                # nothing on this page at all, stop crawling
                break
            timestamp = notelist[0].xpath('descendant-or-self::*/span')
            # compare with the previous page: exit if the first item has the
            # same timestamp, which means jianshu served the last page again
            if not isinstance(timestamp[0], basestring):
                timestamp = timestamp[0].get('data-shared-at')
            else:
                timestamp = timestamp[0]
            print timestamp
            if timestamp == temptime:
                break
            # loop over notelist and extract the expected data
            for note in notelist:
                nodes_a = note.xpath("child::*//a")
                author = unicode(nodes_a[1].text).encode('utf-8', 'ignore')
                title = unicode(nodes_a[2].text).encode('utf-8', 'ignore')
                # author profile link
                author_link = url_root + nodes_a[3].get('href')
                # thumbnail image url
                note_image_src = note.xpath("child::*//a/img")
                note_image_src = url_root + note_image_src[0].get('src')
                # abstract text
                abstract = note.xpath('descendant::*//*[@class="abstract"]')
                abstract = abstract[0].text
                # read / like / reward counts: the numbers live in the tail text
                # of the icon elements, not in .text (see Problem 1 below)
                number_of_read = note.xpath('descendant::*//*[@class="iconfont ic-list-read"]')[0].tail.replace("\n", "")
                number_of_likes = note.xpath('descendant::*//*[@class="iconfont ic-list-like"]')[0].tail.replace("\n", "")
                number_of_money = note.xpath('descendant::*//*[@class="iconfont ic-list-money"]')
                if len(number_of_money) != 0:
                    number_of_money = number_of_money[0].tail.replace("\n", "")
                else:
                    number_of_money = 0
                page_content = {'author': author,
                                'title': title,
                                'image_src': note_image_src,
                                'abstract': abstract,
                                'number_of_read': number_of_read,
                                'number_of_likes': number_of_likes,
                                'number_of_money': number_of_money
                                }
                contents.append(page_content)
            temptime = timestamp
            count += 1
        # count has already been advanced past the last real page
        print "Total number of pages: %d" % (count - 1)
        return contents
    

    Client code

    # -*- coding: utf-8 -*-
    import os
    import spiderlxml as lxml
    import spidercomm as common
    import codecs
    
    # set up             
    url_root = 'http://www.jianshu.com'
    url_seed = 'http://www.jianshu.com/c/9b4685b6357c?page=%d'
    spider_path='spider_res/lxml/ex4'
    if not os.path.exists(spider_path):
        os.makedirs(spider_path)
        
    # get expected contents from crawled pages
    contents = lxml.crawl_page_jianshu(url_root,url_seed)
    #print contents
    # write contents to file
    with codecs.open(spider_path+"/_all_contents.txt",'a','utf-8') as file:
        file.write('author,title,image_src,abstract,number_of_read,number_of_likes,number_of_money\n')
        for content in contents:
            print "\n"+ "*"*50 +"\n"
            for key in content.keys():
                print "%s:%s" % (key, content.get(key))
                #file.write(content.get(key))
    

    Run results (partial):

    **************************************************
    
    title:python第三课进阶作业
    abstract:
          数据的集中趋势• 均值、中位数、众数 • 偏度 数据的离散程度• 全距Range • 四分位距IQR& 箱图 • 方差、标准差 • 拇指规则& 切比雪夫定理 两个变量的关系 ...
        
    author:_bobo_
    number_of_read: 10
    number_of_likes: 1
    number_of_money:0
    image_src:http://www.jianshu.com//upload.jianshu.io/users/upload_avatars/4421285/c3d8c27a-9f75-4070-982f-7e12c0fe16a6.jpg?imageMogr2/auto-orient/strip|imageView2/1/w/96/h/96
    
    **************************************************
    
    title:爬虫作业3
    abstract:
          课程作业 选择第二次课程作业中选中的网址 爬取该页面中的所有可以爬取的元素,至少要求爬取文章主体内容 可以尝试用lxml爬取 在完成这节课的过程中遇到许多问题: 环境问题:电...
        
    author:mudu86
    number_of_read: 6
    number_of_likes: 0
    number_of_money:0
    image_src:http://www.jianshu.com//cdn2.jianshu.io/assets/default_avatar/11-4d7c6ca89f439111aff57b23be1c73ba.jpg?imageMogr2/auto-orient/strip|imageView2/1/w/96/h/96
    
    **************************************************
    

    Problems:

    I ran into two problems, solved neither, and settled for workarounds instead.

    Problem 1:

    When extracting the read count and the other numbers, the text inside the <a> element is broken across lines, so number.text always prints just the first line "\n        ". I don't know how to solve this:

    numbers = note.xpath('descendant::*//*[@class="meta"]/*')
    for number in numbers:
        print "number is: %r" % number.text
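
    The workaround used in the crawler above is to read the .tail of the icon element instead. An alternative (a sketch, not verified against the current jianshu markup): ask lxml for the string value of the whole node, which concatenates every text fragment under it, then strip the whitespace:

    numbers = note.xpath('descendant::*//*[@class="meta"]/*')
    for number in numbers:
        # string(.) returns all text under the element, including the part
        # after the <i> icon child, whereas .text stops at the first child
        full_text = number.xpath('string(.)').strip()
        # equivalent: full_text = ''.join(number.itertext()).strip()
        print "number is: %r" % full_text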
    

    Problem 2:

    After collecting all the content I wanted to write it out to a file, but it always fails with:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
    

    The code is:

    with codecs.open(spider_path+"/_all_contents.txt",'a','utf-8') as file:
        file.write('author,title,image_src,abstract,number_of_read,number_of_likes,number_of_money\n')
        for content in contents:
            print "\n"+ "*"*50 +"\n"
            for key in content.keys():
                print "%s:%s" % (key, content.get(key))
                file.write(content.get(key))
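
    The cause, as far as I can tell: the values stored in page_content were already encoded to UTF-8 byte strings in the crawler, while a file opened with codecs.open(..., 'utf-8') expects unicode objects, so Python 2 first tries to decode the byte string with the default ascii codec and fails on byte 0xe6. A possible workaround (a sketch; decoding back to unicode before writing, or simply never encoding in the crawler, avoids the error):

    with codecs.open(spider_path + "/_all_contents.txt", 'a', 'utf-8') as file:
        file.write(u'author,title,image_src,abstract,number_of_read,number_of_likes,number_of_money\n')
        for content in contents:
            for key in content.keys():
                value = content.get(key)
                if isinstance(value, str):
                    # UTF-8 byte string (author, title) -> unicode
                    value = value.decode('utf-8', 'ignore')
                else:
                    # unicode fields pass through, numbers become unicode
                    value = unicode(value)
                file.write(u"%s:%s\n" % (key, value))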
    
