Level 1: HTML and CSS Selector
The code provided on the official site does not produce correct results. The modified version is as follows:
import re
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="https"]').items():
            # Follow only movie detail links, dropping the "?ref_=..." suffix.
            movie_group = re.match(r'(https://www\.imdb\.com/title/tt\d+/)\?ref_=', each.attr.href)
            if movie_group:
                self.crawl(movie_group.group(1), callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # Take the text of the <h1>, dropping any &nbsp; entity and surrounding whitespace.
        title = re.search(r'<h1 class="">(.*?)<', response.text).group(1).replace('&nbsp;', '').strip()
        item_list = response.doc('.credit_summary_item').items()
        director = []
        for item in item_list:
            # The credit block whose <h4> mentions "Director" holds the director links.
            if "Director" in item('h4').text():
                director = [x.text() for x in item('a').items()]
        return {
            "url": response.url,
            "title": title,
            "rating": response.doc('[itemprop="ratingValue"]').text(),
            "director": director,
        }
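The link filtering in index_page can be checked in isolation, without running pyspider. The following is a minimal sketch using only the standard library; the helper name movie_url and the sample hrefs are hypothetical and only serve to illustrate the regular expression used above.

```python
import re

# Matches a movie detail link and captures the part before "?ref_=".
MOVIE_LINK = re.compile(r'(https://www\.imdb\.com/title/tt\d+/)\?ref_=')


def movie_url(href):
    """Return the detail-page URL without its query string, or None."""
    if not href:
        return None
    match = MOVIE_LINK.match(href)
    return match.group(1) if match else None


# Hypothetical sample links, in the shape the index page is expected to contain.
samples = [
    "https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt",
    "https://www.imdb.com/name/nm0000209/?ref_=adv_li_st_0",
    None,
]
for href in samples:
    print(href, "->", movie_url(href))
```

Only the first sample is a title link, so it is the only one that would be passed on to detail_page.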
Level 2: AJAX and More HTTP
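Simulating the request with Postman failed. To be completed!
The original XHR data endpoint can no longer be found by tracing requests in the browser: http://api.twitch.tv/kraken/streams?limit=20&offset=0&game=Dota+2&broadcaster_language=&on_site=1
The new JSON data request is: https://gql.twitch.tv/gql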
header:
POST /gql HTTP/1.1
Host: gql.twitch.tv
Connection: keep-alive
Content-Length: 255
Pragma: no-cache
Cache-Control: no-cache
Origin: https://www.twitch.tv
Accept-Language: zh-CN
Client-Id: kimne78kx3ncx6brgo4mv6wki5h1ko
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36
X-Device-Id: 326baa0403887e01
Content-Type: text/plain;charset=UTF-8
Accept: */*
Referer: https://www.twitch.tv/directory/game/Dota%202
Accept-Encoding: gzip, deflate, br
payload:
[{"operationName":"DirectoryPage_Game","variables":{"name":"dota 2","limit":30,"sort":"VIEWER_COUNT","tags":[],"cursor":"Nzc="},"extensions":{"persistedQuery":{"version":1,"sha256Hash":"f7c5ea69517715f8ab06d30ce66f6355af61593ac0ff806b518286932d177cc7"}}}]
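For reference, the captured request can be rebuilt in Python so it is easier to experiment with than in Postman. This is only a sketch: it constructs the request object from the header and payload values shown above but does not send it, and there is no guarantee that the server accepts it (the Postman attempt failed, and the Client-Id or the persisted query hash may no longer be valid). The function name build_gql_request and its parameters are hypothetical.

```python
import json
import urllib.request

GQL_URL = "https://gql.twitch.tv/gql"


def build_gql_request(game="dota 2", limit=30, cursor="Nzc="):
    """Build (but do not send) the POST request captured from the browser."""
    payload = [{
        "operationName": "DirectoryPage_Game",
        "variables": {
            "name": game,
            "limit": limit,
            "sort": "VIEWER_COUNT",
            "tags": [],
            "cursor": cursor,
        },
        "extensions": {
            "persistedQuery": {
                "version": 1,
                # Hash copied from the captured payload; it may have changed since.
                "sha256Hash": "f7c5ea69517715f8ab06d30ce66f6355af61593ac0ff806b518286932d177cc7",
            }
        },
    }]
    # A subset of the captured headers; values are copied from the capture above.
    headers = {
        "Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko",
        "Content-Type": "text/plain;charset=UTF-8",
        "Origin": "https://www.twitch.tv",
        "Referer": "https://www.twitch.tv/directory/game/Dota%202",
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(GQL_URL, data=body, headers=headers, method="POST")


req = build_gql_request()
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the result to urllib.request.urlopen(req) would actually send it. Content-Length is filled in automatically at that point, so it is not set by hand here.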
Level 3: Render with PhantomJS
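Fetching the page http://www.twitch.tv/directory/game/Dota%202 with PhantomJS failed!
Succeeded with: pyspider crawler tutorial (3): using PhantomJS to render pages that rely on JS
Problem: DOM elements that are available in the browser cannot be obtained with pyspider + PhantomJS
Solution: in the project list, set the project's status to debug or running, then run the project again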