Session 2 Python Crawler Assignment No. 1: Jianshu

Author: 只是不在意 | Published 2017-05-23 21:44 | 0 views

I have already scraped something similar to Qiushibaike (糗事百科) before:
http://www.jianshu.com/p/a191726ed66d
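To stay focused, this time I mainly scraped Jianshu. The classmates in group 2 have been busy scraping Jianshu, but I had never scraped it myself. At the opening session, teacher 向右 singled me out for praise, and sure enough, by the law of conservation of luck, I struggled badly with a site this representative and with this many reference articles... (in truth, my skills just aren't there yet).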

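Going by the screenshot, the first level is jianshu.com.
For the second level I chose class=note-list (I also tried id=list-container, which works too). For the third level I followed the loop pattern I had used on other sites and selected author, title, column, and so on.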

000.png

The code I wrote myself:

# Assumes requests, lxml's etree, and getReqHeaders() as in the full listing further down
url = 'http://www.jianshu.com'
html = requests.get(url, headers=getReqHeaders()).content
selector = etree.HTML(html)
infos = selector.xpath('//*[@class="note-list"]/li')
print(infos)

for info in infos:
    title = info.xpath('//a[@class="title"]/text()')[0]
    author = info.xpath('//div[@class="name"]/text()')[0]
    collection = info.xpath('//div[2]/a[1]/text()')[0]
    print(title, '      ', author, '      ', collection)

It did produce output, but the same result repeats on every line, like walking in circles and never getting anywhere...

图片.png

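If I change the [0] after text() to [1], [2], [3] in turn, each item does get listed one per line, but that is not how I did it before, and I can't see anything in the page structure that explains why it would be needed this time.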

So the main problem is still that my XPath selection is wrong.
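One likely explanation for the repetition, sketched here on made-up markup rather than the real Jianshu page: in lxml, an expression that starts with `//` is evaluated from the document root even when it is called on a single `<li>`, so every pass through the loop sees the same full list and `[0]` always picks the first entry. Prefixing the path with `.` makes it relative to the current element. The `SAMPLE_HTML` string and the names in it are assumptions for illustration only.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the homepage list (not the real markup).
SAMPLE_HTML = """
<div id="list-container">
  <ul class="note-list">
    <li><div class="content">
      <div class="author"><div class="name"><a href="/u/1">Alice</a></div></div>
      <a class="title" href="/p/1">First post</a>
    </div></li>
    <li><div class="content">
      <div class="author"><div class="name"><a href="/u/2">Bob</a></div></div>
      <a class="title" href="/p/2">Second post</a>
    </div></li>
  </ul>
</div>
"""

selector = etree.HTML(SAMPLE_HTML)
items = selector.xpath('//ul[@class="note-list"]/li')

# '//' searches from the document root, so every <li> returns the same first title.
absolute_titles = [li.xpath('//a[@class="title"]/text()')[0] for li in items]

# './/' searches only below the current <li>, so each item returns its own title.
relative_titles = [li.xpath('.//a[@class="title"]/text()')[0] for li in items]

print(absolute_titles)
print(relative_titles)
```

This would also account for why swapping `[0]` for `[1]`, `[2]`, `[3]` walks through the items: the index is stepping through the page-wide result list, not through anything inside the current `<li>`.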
Here is the code as later revised by 程工 (I kept my original lines as comments underneath his):

import random
import requests
from lxml import etree

def getReqHeaders():  # Return request headers with a randomly chosen User-Agent
    user_agents = ["Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"]
    user_agent = random.choice(user_agents)
    req_headers = {'User-Agent': user_agent}
    return req_headers

url = 'http://www.jianshu.com'
html = requests.get(url, headers=getReqHeaders()).content
selector = etree.HTML(html)
infos=selector.xpath('//div[@id="list-container"]/ul[@class="note-list"]/li')
#infos= selector.xpath('//*[@class="note-list"]/li')
print(infos)

for info in infos:
    author = info.xpath('div/div[1]/div/a/text()')[0]
    #author=info.xpath('//div[@class="name"]/text()')[0]
    authorurl = 'http://www.jianshu.com' + info.xpath('div/div[1]/div/a/@href')[0]
    title = info.xpath('div/a/text()')[0] if len(info.xpath('div/a/text()')) > 0 else ""
    #title=info.xpath('//a[@class="title"]/text()')[0]
    print(title, '      ', author)

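程工's code comes straight from the browser's "Copy XPath". I tried that at first too, but for some reason it didn't work, and reading the paths by eye was tiring (they are not very intuitive), so I went back to class=name. In short, my skills still aren't up to the task.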
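The title line in the revised code guards against an empty result by running the same XPath twice. A minimal sketch of one way to factor that guard out is shown below; `first_text` is a hypothetical helper name, not part of lxml, and the sample markup is invented so the snippet runs without touching the network.

```python
from lxml import etree


def first_text(node, path, default=""):
    """Return the first result of a relative XPath, stripped, or a default."""
    results = node.xpath(path)
    return results[0].strip() if results else default


# Hypothetical markup: the second item deliberately has no title link.
SAMPLE_HTML = """
<ul class="note-list">
  <li><div>
    <div class="author"><div class="name"><a href="/u/1">Alice</a></div></div>
    <a class="title" href="/p/1">First post</a>
  </div></li>
  <li><div>
    <div class="author"><div class="name"><a href="/u/2">Bob</a></div></div>
  </div></li>
</ul>
"""

rows = []
for li in etree.HTML(SAMPLE_HTML).xpath('//ul[@class="note-list"]/li'):
    title = first_text(li, 'div/a/text()')
    author = first_text(li, 'div/div[1]/div/a/text()')
    author_url = 'http://www.jianshu.com' + first_text(li, 'div/div[1]/div/a/@href')
    rows.append((title, author, author_url))

for row in rows:
    print(row)
```

With a helper like this, a list item that lacks one of the fields yields an empty string instead of raising `IndexError`.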

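程工 also dug up teacher 向右's reference article, 《再谈Scrapy抓取结构化数据》 ("Revisiting Structured Data Extraction with Scrapy"):
http://www.jianshu.com/p/3d52e6046782
It is about Scrapy, but it also covers the structure of the Jianshu homepage, so I compared the three versions side by side. (In each group, from top to bottom: teacher 向右, 程工, and me, the beginner.)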

infos = selector.xpath('//ul[@class="note-list"]/li')
infos=selector.xpath('//div[@id="list-container"]/ul[@class="note-list"]/li')
#infos= selector.xpath('//*[@class="note-list"]/li')

for info in infos:
    title = info.xpath('div/a/text()').extract()[0]
    title = info.xpath('div/a/text()')[0]  # if-guard omitted for now
    #title=info.xpath('//a[@class="title"]/text()')[0]

    author = info.xpath('div/div[1]/div/a/text()').extract()[0]
    author = info.xpath('div/div[1]/div/a/text()')[0]
    #author=info.xpath('//div[@class="name"]/text()')[0]
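To check where the three versions actually differ, the sketch below runs the three outer expressions against a small invented page. Under that assumption they select the same `<li>` elements, which suggests the difference in behaviour comes from the inner paths (relative `div/...` versus absolute `//...`) rather than from how the list itself is located. The markup is hypothetical, and the Scrapy version's `.extract()` is left out because lxml already returns plain strings.

```python
from lxml import etree

# Hypothetical stand-in for the homepage list (not the real markup).
SAMPLE_HTML = """
<div id="list-container">
  <ul class="note-list">
    <li><div class="content">
      <div class="author"><div class="name"><a href="/u/1">Alice</a></div></div>
      <a class="title" href="/p/1">First post</a>
    </div></li>
    <li><div class="content">
      <div class="author"><div class="name"><a href="/u/2">Bob</a></div></div>
      <a class="title" href="/p/2">Second post</a>
    </div></li>
  </ul>
</div>
"""

selector = etree.HTML(SAMPLE_HTML)
tree = selector.getroottree()

outer_expressions = [
    '//ul[@class="note-list"]/li',
    '//div[@id="list-container"]/ul[@class="note-list"]/li',
    '//*[@class="note-list"]/li',
]

# Record the canonical path of every matched element for each expression.
matched_paths = [
    [tree.getpath(li) for li in selector.xpath(expr)]
    for expr in outer_expressions
]
same_items = matched_paths[0] == matched_paths[1] == matched_paths[2]
print(same_items)

# Relative inner paths, as in the revised code, read from the current <li> only.
rows = [
    (li.xpath('div/a/text()')[0], li.xpath('div/div[1]/div/a/text()')[0])
    for li in selector.xpath(outer_expressions[2])
]
print(rows)
```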

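From now on I should use right-click "Copy XPath" as much as possible, get familiar with how the structure is written, and spell out each level of the path more completely.
I also need to keep practicing with BeautifulSoup. This month is still going to be a big challenge!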