In this section we use Scrapy to crawl the whole Jianshu site. On an article's detail page, much of the content is loaded asynchronously via Ajax, so the response returned by a plain request does not contain the data we want, such as the comment count and like count. To load this dynamic data we use Selenium together with chromedriver.
1. Download the chromedriver matching your Chrome version from the Taobao mirror https://npm.taobao.org/mirrors/chromedriver, then put the unzipped chromedriver.exe into Chrome's installation directory.
2. Install Selenium: pip install selenium
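To confirm that Selenium can actually drive Chrome, a quick check such as the following can be run first (a minimal sketch; the executable_path is an assumption and should point to wherever chromedriver.exe was placed):
# check_driver.py (sketch): verify the Selenium + chromedriver setup
from selenium import webdriver

# The path below is an assumption; adjust it to your own chromedriver location.
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver.get("https://www.jianshu.com/")
print(driver.title)  # a page title is printed if the setup works
driver.quit()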
Database design
Before designing the spider we need to know what content will be crawled; the database fields are shown below. The id column is the primary key and is set to auto-increment.
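Based on the fields defined later in items.py and the INSERT statement in the pipeline, the table can be created with a script along these lines (a sketch; the column types and connection settings are assumptions):
# create_table.py (sketch): create the js table used by the pipeline below
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', database='jianshu', charset='utf8')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS js (
        id INT PRIMARY KEY AUTO_INCREMENT,  -- primary key, auto-increment
        title VARCHAR(255),
        content LONGTEXT,
        author VARCHAR(255),
        avatar VARCHAR(255),
        pub_time VARCHAR(64),
        origin_url VARCHAR(255),
        article_id VARCHAR(32),
        read_count INT,
        like_count INT,
        word_count INT,
        subjects TEXT,
        comment_count INT
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()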

Overall execution flow of the spider
- The request URLs to be fetched are first taken from start_urls
- The returned content is processed by the downloader middleware (Selenium loads the dynamic data)
- The response processed by the middleware is handed back to the spider for data extraction
- The extracted data is passed to the pipeline for storage
Creating the Scrapy project
1. Enter the virtual environment and create the project:
scrapy startproject jianshu
2. Inside the project directory (jianshu), create a new spider. Since we are crawling the whole site, we can use the crawl template and rely on its Rules to make crawling easier:
scrapy genspider -t crawl js jianshu.com
A js.py file now appears under the spiders folder.
Defining the fields in items.py
import scrapy

class JianshuItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()
    read_count = scrapy.Field()
    like_count = scrapy.Field()
    word_count = scrapy.Field()
    subjects = scrapy.Field()
    comment_count = scrapy.Field()
Defining the downloader middleware
Every request and every response passes through the downloader middleware, so this is where we plug Selenium into Scrapy's downloader. After defining the middleware, be sure to enable it in settings.py.
# middlewares.py
from scrapy import signals
from selenium import webdriver
import time
from scrapy.http.response.html import HtmlResponse

class SeleniumDownloadMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")

    def process_request(self, request, spider):
        self.driver.get(request.url)
        time.sleep(2)
        try:
            while True:
                # keep clicking the "show more" button to load all comments
                showMore = self.driver.find_element_by_class_name('show-more')
                showMore.click()
                time.sleep(0.5)
                if not showMore:
                    break
        except:
            # find_element_by_class_name raises NoSuchElementException once the
            # button is gone, which is what actually ends the loop
            pass
        source = self.driver.page_source
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response
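The middleware above never closes the browser. One common pattern (a sketch, not part of the original code) is to hook Scrapy's spider_closed signal via from_crawler and quit the driver there; the two methods below would be added to the SeleniumDownloadMiddleware class, reusing the signals import already at the top of middlewares.py:
    # Added inside SeleniumDownloadMiddleware (sketch): quit Chrome when the spider finishes.
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # call spider_closed() when Scrapy fires the spider_closed signal
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.driver.quit()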
Designing the spider
Article pages on Jianshu follow a fixed URL pattern:
https://www.jianshu.com/p/d65909d2173a
The domain comes first, followed by the article id, so a Rule can be defined via the crawl template; a quick check of the pattern used in that Rule is shown below.
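As a sanity check (an illustrative snippet, not part of the spider), the regular expression used in the Rule matches a typical article URL:
# Quick check of the Rule pattern against a sample article URL
import re

pattern = re.compile(r'.*/p/[0-9a-z]{12}.*')
print(bool(pattern.match('https://www.jianshu.com/p/d65909d2173a')))  # True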
# js.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from jianshu.items import JianshuItem

class JsSpider(CrawlSpider):
    name = 'js'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']

    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='title']/text()").get()
        avatar = response.xpath("//a[@class='avatar']/img/@src").get()
        author = response.xpath("//span[@class='name']/a/text()").get()
        pub_time = response.xpath("//span[@class='publish-time']/text()").get().replace("*", "")
        url = response.url
        url1 = url.split("?")[0]
        article_id = url1.split('/')[-1]
        content = response.xpath("//div[@class='show-content']").get()
        word_count_list = response.xpath("//span[@class='wordage']/text()").get().split(' ')  # e.g. "字数 10000"
        word_count = int(word_count_list[-1])
        comment_count_list = response.xpath("//span[@class='comments-count']/text()").get().split(' ')  # e.g. "评论 427"
        comment_count = int(comment_count_list[-1])
        read_count_list = response.xpath("//span[@class='views-count']/text()").get().split(' ')  # e.g. "阅读 427"
        read_count = int(read_count_list[-1])
        like_count_list = response.xpath("//span[@class='likes-count']/text()").get().split(' ')  # e.g. "喜欢 3"
        like_count = int(like_count_list[-1])
        subjects = ",".join(response.xpath("//div[@class='include-collection']/a/div/text()").getall())
        item = JianshuItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=response.url,
            article_id=article_id,
            content=content,
            subjects=subjects,
            word_count=word_count,
            comment_count=comment_count,
            read_count=read_count,
            like_count=like_count,
        )
        print('y' * 100)
        yield item
Designing the pipeline
Here the items yielded by the spider (js.py) are saved to the database. The code below operates on the database synchronously.
import pymysql

class JianshuPipeline(object):
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'jianshu',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        self.cursor.execute(self.sql, (item['title'], item['content'],
                                       item['author'], item['avatar'],
                                       item['pub_time'], item['origin_url'],
                                       item['article_id'], item['read_count'],
                                       item['like_count'], item['word_count'],
                                       item['subjects'], item['comment_count']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
                insert into js(id,title,content,author,avatar,pub_time,
                origin_url,article_id,read_count,like_count,word_count,subjects,comment_count)
                values (null,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
            """
        return self._sql
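Because the pipeline above writes to MySQL synchronously, each insert blocks the Twisted reactor. A possible alternative (a sketch, not the original article's code) is to make the write asynchronous with twisted.enterprise.adbapi; the column list matches the synchronous version, and the connection settings are the same assumptions:
# pipelines.py (sketch): asynchronous variant using Twisted's connection pool
import pymysql
from twisted.enterprise import adbapi

class JianshuTwistedPipeline(object):
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'jianshu',
            'charset': 'utf8',
            'cursorclass': pymysql.cursors.DictCursor,
        }
        # adbapi runs the blocking pymysql calls in a thread pool
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)

    def process_item(self, item, spider):
        defer = self.dbpool.runInteraction(self._insert_item, item)
        defer.addErrback(self._handle_error, item, spider)
        return item

    def _insert_item(self, cursor, item):
        sql = """
            insert into js(id,title,content,author,avatar,pub_time,origin_url,
            article_id,read_count,like_count,word_count,subjects,comment_count)
            values (null,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        cursor.execute(sql, (item['title'], item['content'], item['author'],
                             item['avatar'], item['pub_time'], item['origin_url'],
                             item['article_id'], item['read_count'], item['like_count'],
                             item['word_count'], item['subjects'], item['comment_count']))

    def _handle_error(self, failure, item, spider):
        # log the failure instead of silently dropping the item
        spider.logger.error(failure)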
settings.py
For the middleware defined above to take effect, it must be enabled in settings.py. In addition:
- Set the User-Agent
- Disable the robots.txt protocol
- Set a reasonable download delay, otherwise the server will ban us
- Enable the downloader middleware
- Enable the item pipeline
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'
}
DOWNLOADER_MIDDLEWARES = {
'jianshu.middlewares.SeleniumDownloadMiddleware': 543,
}
ITEM_PIPELINES = {
'jianshu.pipelines.JianshuPipeline': 300,
}
......
Running the spider
The spider can be run from the terminal, or we can create a start.py and run the command from a Python file:
# start.py
from scrapy import cmdline
cmdline.execute(['scrapy', 'crawl', 'js'])