Coursework
Crawl the full article list of the Big Data topic and save the results to a file.
Data to collect for each article: author, title, article URL, abstract, thumbnail URL, read count, comment count, like count, and reward count.
Assignment URL
http://www.jianshu.com/c/9b4685b6357c
Building the crawler
I originally planned to build on my earlier framework, but hit a problem: the page-counting step falls into an infinite loop, because jianshu returns the content of the last page no matter how large page= gets. A workable fix is to compare the timestamp of the first article on each page against the previous page. However, the framework's design of fetching the total page count before every crawl adds extra requests to the target site, which seems ill-advised, so I have set it aside for now and will rethink the framework design later.
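A minimal sketch of that stop condition (my own illustration, assuming the same page markup and a download helper like the common.download used below):

from lxml import etree

def is_duplicate_page(html, prev_timestamp):
    # hypothetical helper: jianshu serves the last page for any larger
    # page= value, so a repeated first-article timestamp means we are done
    tree = etree.HTML(html)
    spans = tree.xpath('//*[@id="list-container"]/ul/li[1]//span[@data-shared-at]')
    if not spans:
        return True, prev_timestamp  # empty page: treat as the end
    stamp = spans[0].get('data-shared-at')
    return stamp == prev_timestamp, stamp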
This assignment is implemented with lxml, with everything hardcoded in one function, so it is a bit rough.
Crawler code
from lxml import etree

import spidercomm as common

def crawl_page_jianshu(url_root, url_seed):
    count = 1
    flag = True
    temptime = '2010-1-1'
    contents = []
    while flag:
        html = common.download(url_seed % count)
        tree = etree.HTML(html)
        notelist = tree.xpath('//*[@id="list-container"]/ul/li')
        if notelist:
            timestamp = notelist[0].xpath('descendant-or-self::*/span')
            # compare with the previous page: the last page comes back for
            # any larger page=, so a repeated first-item timestamp means we
            # are looking at the same page and can stop
            if not isinstance(timestamp[0], basestring):
                timestamp = timestamp[0].get('data-shared-at')
            else:
                timestamp = timestamp[0]
            print timestamp
            if timestamp == temptime:
                flag = False
                continue
            # loop over notelist and extract the expected data
            for note in notelist:
                nodes_a = note.xpath("child::*//a")
                # Python 2 encode() takes no keyword arguments, so pass
                # 'ignore' positionally
                author = unicode(nodes_a[1].text).encode('utf-8', 'ignore')
                title = unicode(nodes_a[2].text).encode('utf-8', 'ignore')
                # get author info
                author_link = url_root + nodes_a[3].get('href')
                # get title image url
                note_image_src = note.xpath("child::*//a/img")
                note_image_src = url_root + note_image_src[0].get('src')
                # get abstract info
                abstract = note.xpath('descendant::*//*[@class="abstract"]')
                abstract = abstract[0].text
                # get numbers: each count is the tail text after an <i> icon
                number_of_read = note.xpath('descendant::*//*[@class="iconfont ic-list-read"]')[0].tail.replace("\n", "")
                number_of_likes = note.xpath('descendant::*//*[@class="iconfont ic-list-like"]')[0].tail.replace("\n", "")
                number_of_money = note.xpath('descendant::*//*[@class="iconfont ic-list-money"]')
                if len(number_of_money) != 0:
                    number_of_money = number_of_money[0].tail.replace("\n", "")
                else:
                    # not every article has rewards
                    number_of_money = 0
                page_content = {'author': author,
                                'title': title,
                                'image_src': note_image_src,
                                'abstract': abstract,
                                'number_of_read': number_of_read,
                                'number_of_likes': number_of_likes,
                                'number_of_money': number_of_money}
                contents.append(page_content)
            temptime = timestamp
            count += 1
        else:
            # empty result: stop instead of re-requesting the same page forever
            flag = False
    # the last fetch was a duplicate, so the real page count is count - 1
    print "Total number of pages: %d" % (count - 1)
    return contents
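Note: the assignment also asks for the article URL and comment count, which the function above does not collect. Assuming the title anchor nodes_a[2] carries the link, a hypothetical one-liner inside the note loop might be:

note_link = url_root + nodes_a[2].get('href')  # hypothetical, not in the original run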
Client code
# -*- coding: utf-8 -*-
import os
import codecs

import spiderlxml as lxml
import spidercomm as common

# set up
url_root = 'http://www.jianshu.com'
url_seed = 'http://www.jianshu.com/c/9b4685b6357c?page=%d'
spider_path = 'spider_res/lxml/ex4'
if not os.path.exists(spider_path):
    os.makedirs(spider_path)

# get expected contents from crawled pages
contents = lxml.crawl_page_jianshu(url_root, url_seed)

# write contents to file
with codecs.open(spider_path + "/_all_contents.txt", 'a', 'utf-8') as file:
    file.write('author,title,image_src,abstract,number_of_read,number_of_likes,number_of_money\n')
    for content in contents:
        print "\n" + "*" * 50 + "\n"
        for key in content.keys():
            print "%s:%s" % (key, content.get(key))
            # file.write(content.get(key))  # raises UnicodeDecodeError, see Problem 2 below
Run output (partial):
**************************************************
title:python第三课进阶作业
abstract:
数据的集中趋势• 均值、中位数、众数 • 偏度 数据的离散程度• 全距Range • 四分位距IQR& 箱图 • 方差、标准差 • 拇指规则& 切比雪夫定理 两个变量的关系 ...
author:_bobo_
number_of_read: 10
number_of_likes: 1
number_of_money:0
image_src:http://www.jianshu.com//upload.jianshu.io/users/upload_avatars/4421285/c3d8c27a-9f75-4070-982f-7e12c0fe16a6.jpg?imageMogr2/auto-orient/strip|imageView2/1/w/96/h/96
**************************************************
title:爬虫作业3
abstract:
课程作业 选择第二次课程作业中选中的网址 爬取该页面中的所有可以爬取的元素,至少要求爬取文章主体内容 可以尝试用lxml爬取 在完成这节课的过程中遇到许多问题: 环境问题:电...
author:mudu86
number_of_read: 6
number_of_likes: 0
number_of_money:0
image_src:http://www.jianshu.com//cdn2.jianshu.io/assets/default_avatar/11-4d7c6ca89f439111aff57b23be1c73ba.jpg?imageMogr2/auto-orient/strip|imageView2/1/w/96/h/96
**************************************************
Problems:
I ran into two problems, solved neither, and had to settle for workarounds.
Problem 1:
When extracting the read count and the other numbers, the text inside the <a> element contains line breaks, and number.text always prints only the first line "\n        " — how can this be solved?
numbers = note.xpath('descendant::*//*[@class="meta"]/*')
for number in numbers:
    print "number is: %r" % number.text
Problem 2:
After collecting all the content I tried to write it to a file, but it kept raising:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
The code is:
with codecs.open(spider_path + "/_all_contents.txt", 'a', 'utf-8') as file:
    file.write('author,title,image_src,abstract,number_of_read,number_of_likes,number_of_money\n')
    for content in contents:
        print "\n" + "*" * 50 + "\n"
        for key in content.keys():
            print "%s:%s" % (key, content.get(key))
            file.write(content.get(key))
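A likely cause, with a sketch of a fix (not verified against the original environment): the dict values were already encoded to UTF-8 byte strings, and a codecs writer expects unicode, so Python 2 implicitly decodes the str with the ascii codec and fails on the first non-ASCII byte (0xe6 opens a UTF-8-encoded Chinese character). number_of_money can also be the int 0. Decoding explicitly before writing avoids both:

for content in contents:
    for key in content.keys():
        value = content.get(key)
        if isinstance(value, str):
            value = value.decode('utf-8')  # bytes -> unicode for the codecs file
        file.write(u"%s:%s\n" % (key, value))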