Week 2 Assignment (2): Jianshu Reading and Douban Books

Author: 谁占了我的一年的称号 | Published 2017-04-28 14:31

    This assignment is to scrape Douban Books and the Jianshu "Reading" topic, then do some simple analysis and comparison.

    https://book.douban.com/ (Douban Books)
    http://www.jianshu.com/c/yD9GAd (Jianshu "Reading" topic)

    From Douban I mainly scraped the "most followed" book chart, which is split into two categories: fiction and non-fiction.



    From Jianshu Reading I scraped the reading notes on the topic's "top" page, and then extracted the recommended book titles from the post titles. The scraping itself needs little explanation; both pages are straightforward.

    Douban Books: most-followed chart

    import requests
    from lxml import etree
    import csv

    # utf-8-sig so the Chinese titles open cleanly in Excel on Windows
    fp = open('d:\\豆瓣.csv', 'wt', newline='', encoding='utf-8-sig')
    writer = csv.writer(fp)
    writer.writerow(('name', 'day', 'author', 'point', 'comment'))
    url = 'https://book.douban.com/'
    # note: the original headers included Host/Origin/Content-Length values
    # copied from an unrelated analytics request; they are dropped here
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Referer': 'https://book.douban.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    }
    html = requests.get(url, headers=headers).content
    sel = etree.HTML(html)
    # relative links to the "most followed" fiction and non-fiction charts
    fiction = sel.xpath('//div[@class="section popular-books"]/div/h2/span[2]/a/@href')[0]
    non_fiction = sel.xpath('//div[@class="section popular-books"]/div/h2/span[3]/a/@href')[0]
    print(fiction, non_fiction)
    colls = [url + fiction, url + non_fiction]
    for coll in colls:
        html1 = requests.get(coll, headers=headers).content
        sel = etree.HTML(html1)
        infos = sel.xpath('//ul[@class="chart-dashed-list"]/li/div[2]')
        print(len(infos))
        for info in infos:
            name = info.xpath('h2/a/text()')[0].strip()
            days = info.xpath('h2/span/text()')[0].strip()
            author = info.xpath('p[1]/text()')[0].strip()
            point = info.xpath('p[2]/span[2]/text()')[0].strip()
            comment_num = info.xpath('p[2]/span[3]/text()')[0].strip()
            print(name, days, author, point, comment_num)
            writer.writerow((name, days, author, point, comment_num))
    fp.close()
    print('done')
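    The article mentions "simple analysis" of the scraped data but does not show that code. A minimal sketch of what loading the chart CSV back and ranking it could look like, using only the standard library and made-up sample rows in place of the real file (a real run would open `d:\豆瓣.csv` instead):

    ```python
    import csv
    import io

    # Inline sample mimicking the scraped CSV; the rows are invented for illustration.
    csv_text = (
        "name,day,author,point,comment\n"
        "书A,4月1日,作者甲,8.9,1200\n"
        "书B,4月2日,作者乙,7.5,300\n"
    )

    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Rank by rating ('point' column), highest first.
    top = max(rows, key=lambda r: float(r['point']))
    print(top['name'])  # → 书A
    ```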
    

    For the Jianshu Reading topic, the main problem is that scraping too aggressively gets you banned, so picking a random User-Agent per request lets the crawl run longer. I also added a delay, but bans still happen, so rotating IPs is next on the list. In the end I collected 16,000 rows.

    import requests
    from lxml import etree
    import csv
    import time
    import random

    # Pool of User-Agent strings; one is picked at random per request
    # to reduce the chance of being banned.
    header_list = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"]

    fp = open('d:\\简书读书2.csv', 'wt', newline='', encoding='GB18030')
    writer = csv.writer(fp)
    # columns: author, publish time, title, reads, comments, likes, rewards
    writer.writerow(('作者', '发表时间', '标题', '阅读量', '评论量', '点赞量', '打赏量'))

    for i in range(1, 10000):
        url = 'http://www.jianshu.com/c/yD9GAd?order_by=top&page=%s' % i
        a = random.choice(header_list)
        print(a)
        header = {
            'Accept': 'text/html, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Connection': 'keep-alive',
            'Cookie': 'UM_distinctid=15b37d3ac97314-07c83256049285-1571466f-1fa400-15b37d3ac989d; remember_user_token=W1s0MzI0MzI2XSwiJDJhJDEwJGxXbTNyTXg3UHA5UTFHdGR3NWlWdi4iLCIxNDkyNzY2MDAyLjgzMjI0NzciXQ%3D%3D--377e6cf673717abdbd3e45bdea36f27479473fbe; CNZZDATA1258679142=1667859620-1491287654-https%253A%252F%252Fwww.baidu.com%252F%7C1493188772; _ga=GA1.2.1819733996.1491290271; _session_id=SGNyaVNIV0F6VUVRcXJpN1A5ZHhKaGJidmRiSG1jL3oxYm5qaW93N0tUbXJlMkorWW1CSFhTY2VCWEtjSWZPVjE0RWJwNlBTYVpWS1NoVmVIZ2tOWjJxcHY2LzBJdGRhejFYM0xDZHYwSFBoc09OMkNUOHpHVUdTQTJDQm96VjdXRC9kQVRGRnhTejNVZkF3eTFxWDU5b1J4N0F6ZTllSm5Pd2VnRHVHcUg5RU9NK0dsbnQ1NW5hSTJEN0NYYWtKbzhPSnJDNWt5QlJLOWQxM211d3IrUnJzemJoZzQ0dDk5VXJzZXVHWktVdExvamFtRTBYZ0s0V1B4UE5NTXM3bG1BQ1VlTWcva2dVWFp2S0JNQ0lkQnF2RUxIS3NIaEVIYi9KbTIvVjI3NUExZ3QySEcrQ0lLWDdNV1dFMzBXN2pDRWtwcWFHN016Zk5EVDdkMnQzTm55RSsyWmorRmdkNTh4bkg5aVk5a3BVeFV1ZkJXS1pkY1hHQzFIc3JQb2VEbk9sTXlhcG5WOGFheC9SWDdnRkNDbnM3UWkzcVd5bENxVmp2eDA2VmtObz0tLW03RDBpWHlkVmIzVTJmKyt3c3YrZkE9PQ%3D%3D--5c873af39872ea4dda72db314ed3920cf0328201',
            'Host': 'www.jianshu.com',
            'Referer': 'http://www.jianshu.com/c/yD9GAd',
            'User-Agent': a
        }
        time.sleep(3)  # throttle requests to lower the chance of a ban
        html = requests.get(url, headers=header).content
        sel = etree.HTML(html)
        infos = sel.xpath('//ul[@class="note-list"]/li/div[@class="content"]')
        if not infos:
            print('reached the end')
            break
        for info in infos:
            try:
                author = info.xpath('div[@class="author"]/div/a/text()')[0]
                get_time = info.xpath('div[@class="author"]/div/span/@data-shared-at')[0].replace('T', ' ').replace('+08:00', '')
                title = info.xpath('a[@class="title"]/text()')[0]
                read_num = info.xpath('div[@class="meta"]/a[1]/text()')[1][:-1]
                comment_num = info.xpath('div[@class="meta"]/a[2]/text()')[1][:-1]
                point_num = info.xpath('div[@class="meta"]/span[1]/text()')[0]
                # the reward count is missing when a post received no rewards
                reward_num = info.xpath('div[@class="meta"]/span[2]/text()')
                reward_num = reward_num[0] if reward_num else '0'
            except IndexError:
                print('parse error, skipping this item')
                continue
            print(author, get_time, title, read_num, comment_num, point_num, reward_num)
            writer.writerow((author, get_time, title, read_num, comment_num, point_num, reward_num))
    fp.close()
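    The IP rotation mentioned above could be bolted onto this scraper through the `proxies` argument of `requests.get`. A minimal sketch; the proxy pool below contains placeholder addresses, not working proxies:

    ```python
    import random

    import requests

    # Placeholder proxy pool -- substitute real proxy endpoints before use.
    PROXY_POOL = [
        'http://127.0.0.1:8001',
        'http://127.0.0.1:8002',
        'http://127.0.0.1:8003',
    ]

    def get_with_random_proxy(url, headers=None):
        """Issue requests.get through a proxy picked at random from the pool."""
        proxy = random.choice(PROXY_POOL)
        return requests.get(url, headers=headers,
                            proxies={'http': proxy, 'https': proxy},
                            timeout=10)
    ```

    Combining a random proxy with the random User-Agent per request spreads the traffic across both dimensions that Jianshu can key a ban on.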
    

    The results look like this:


    I used a regular expression to extract the book titles from the post titles, then drew a simple word cloud with wordcloud.
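    The title-matching step can be sketched as follows. Chinese book titles are conventionally wrapped in 《…》, so a single regex plus `collections.Counter` covers both extraction and the ranking; the sample titles here are made up for illustration:

    ```python
    import re
    from collections import Counter

    # Hypothetical sample of scraped Jianshu post titles.
    titles = [
        "读《活着》有感",
        "《活着》:苦难中的温情",
        "推荐《如何阅读一本书》",
    ]

    # Book names sit between the CJK quotation marks 《 and 》.
    book_pat = re.compile(r"《([^》]+)》")

    counter = Counter()
    for title in titles:
        counter.update(book_pat.findall(title))

    print(counter.most_common(2))  # → [('活着', 2), ('如何阅读一本书', 1)]
    ```

    Feeding `counter` into `wordcloud.WordCloud().generate_from_frequencies(counter)` would reproduce the word-cloud step (a Chinese font must be supplied via `font_path`, or the glyphs render as boxes).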


    Top 10 books recommended on Jianshu Reading:

    《菜根谭》: 74
    《如何阅读一本书》: 66
    《红楼梦》: 64
    《解忧杂货店》: 58
    《活着》: 56
    《平凡的世界》: 43
    《追风筝的人》: 39
    《白夜行》: 36
    《围城》: 33
    《成为作家》: 33

    Of these top ten, I have read only five. 《解忧杂货店》 became my bedside reading, I finished 《平凡的世界》 on a train, and 《活着》 left me deeply depressed by the end.

    Douban Books:


    To my surprise, the top 10 most-anticipated list even includes a geography book. Embarrassingly, I haven't read a single one of them.

    Top 10 by reads:

    ![阅读量前十名](https://img.haomeiwen.com/i4324326/51d0ab5d0e1bc4f6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

    Top 10 by articles written:


    Top 5 by rewards received:



      Source: https://www.haomeiwen.com/subject/tauozttx.html