Collection: Python + Scrapy + MongoDB Examples
Environment requirements
Local environment: 32-bit Windows + Python 3 + Scrapy + MongoDB; Firefox browser
Installing Scrapy
Start cmd as administrator and run:
pip install scrapy
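Installing MongoDB
See: "Crawling the Maoyan Movies Top 100 - Request and MongoDB".
Creating the project
Create the directory E:\pycodes to hold project code. In cmd, change to E:\pycodes and enter the command: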
scrapy startproject zhihuuser
This generates the zhihuuser crawler project under E:\pycodes. Enter the zhihuuser folder and create the zhihu spider:
cd zhihuuser
scrapy genspider zhihu www.zhihu.com
Getting past anti-crawler checks
Disabling ROBOTSTXT_OBEY
In settings.py, change ROBOTSTXT_OBEY to False so that the crawler does not obey the robots.txt protocol:
ROBOTSTXT_OBEY = False
Modifying the request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
}
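Page analysis
Open the browser's developer console and switch to the Network tab. Pick a high-profile user (a "big V"); take vczh (nicknamed 轮子哥) as an example. His profile page is: https://www.zhihu.com/people/excited-vczh.
- First open vczh's profile page and click "Followers" (关注者).
- Clicking page "2" in the pagination adds a batch of new requests to the Network panel. One is a POST request to batch, and another is a GET request whose name starts with followers?. The GET request's headers include a Referer attribute with the value: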
https://www.zhihu.com/people/excited-vczh/followers?page=2
The URL of the GET request is:
https://www.zhihu.com/api/v4/members/excited-vczh/followers?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=20&limit=20
![](https://img.haomeiwen.com/i1780773/d2eb21485c377c51.png)
The information it returns looks like this:
![](https://img.haomeiwen.com/i1780773/f4bd3b7acfa09d81.png)
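As a rough illustration of how the request above breaks down, the following sketch splits the observed GET URL into its user, endpoint, and query parameters with Python's standard urllib.parse. The helper name split_followers_url is made up for this example and is not part of the project.

```python
from urllib.parse import urlsplit, parse_qs

# The GET URL observed in the Network panel when clicking page 2
OBSERVED_URL = (
    'https://www.zhihu.com/api/v4/members/excited-vczh/followers'
    '?include=data[*].answer_count,articles_count,gender,follower_count,'
    'is_followed,is_following,badge[?(type=best_answerer)].topics'
    '&offset=20&limit=20'
)


def split_followers_url(url):
    """Return (user, endpoint, params) for a followers-style API URL."""
    parts = urlsplit(url)
    segments = parts.path.strip('/').split('/')
    # The path ends with .../members/<user>/<endpoint>
    user, endpoint = segments[-2], segments[-1]
    # parse_qs returns a list per key; every key occurs once in this URL
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return user, endpoint, params


user, endpoint, params = split_followers_url(OBSERVED_URL)
print(user, endpoint, params['offset'], params['limit'])
```

Only three query parameters come out of the split (include, offset, limit), which is what makes the URL easy to turn into a template.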
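So the URL for fetching a user's followers is:
https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}
Here user is the user's 'url_token', include is a fixed query parameter, offset is the pagination offset, and limit is the number of entries per page.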
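A minimal sketch of filling in this template follows. It assumes that offset = (page - 1) * limit, which is inferred only from the single observation above (page 2 maps to offset=20 with limit=20); the function names and the total of 45 followers are hypothetical.

```python
FOLLOWERS_URL = (
    'https://www.zhihu.com/api/v4/members/{user}/followers'
    '?include={include}&offset={offset}&limit={limit}'
)
FOLLOWS_QUERY = (
    'data[*].answer_count,articles_count,gender,follower_count,'
    'is_followed,is_following,badge[?(type=best_answerer)].topics'
)


def page_to_offset(page, limit=20):
    """Assumed mapping from a 1-based page number to the offset parameter."""
    return (page - 1) * limit


def follower_page_urls(user, total, limit=20):
    """Yield one followers URL per page for a user with `total` followers."""
    for offset in range(0, total, limit):
        yield FOLLOWERS_URL.format(
            user=user, include=FOLLOWS_QUERY, offset=offset, limit=limit)


urls = list(follower_page_urls('excited-vczh', total=45))
for url in urls:
    print(url)
```

In the spider itself the offsets are not computed by hand; it follows the next-page URL that the API returns.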
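Similarly, hovering the mouse over one of the followers adds one POST request and one GET request to the Network panel. From these we can see that
the URL for fetching a user's detailed information is:
https://www.zhihu.com/api/v4/members/{user}?include={include}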
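Workflow and implementation
Workflow
Putting the analysis together: starting from a big V, we collect the 'url_token' of every user in the follow list, use each 'url_token' to fetch that user's information, and then fetch that user's follow list in turn.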
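The traversal described here can be sketched without any network access. The follow graph below is invented data standing in for the API, and crawl is a hypothetical helper. It visits users breadth-first for simplicity; Scrapy schedules requests in its own order and filters duplicate requests by default, which plays the role of the seen set.

```python
from collections import deque

# Hypothetical follow graph: url_token -> url_tokens in that user's follow list
FAKE_FOLLOWS = {
    'excited-vczh': ['user-a', 'user-b'],
    'user-a': ['user-b', 'excited-vczh'],
    'user-b': ['user-c'],
    'user-c': [],
}


def crawl(start_user, fetch_follows, max_users=1000):
    """Visit every user reachable from start_user exactly once."""
    seen = {start_user}
    queue = deque([start_user])
    visited = []
    while queue and len(visited) < max_users:
        user = queue.popleft()
        visited.append(user)  # the real spider fetches and stores the profile here
        for token in fetch_follows(user):
            if token not in seen:  # skip users that are already queued or visited
                seen.add(token)
                queue.append(token)
    return visited


visited = crawl('excited-vczh', lambda user: FAKE_FOLLOWS.get(user, []))
print(visited)
```

Without the seen set the mutual follow between excited-vczh and user-a would loop forever, which is why deduplication matters for this kind of crawl.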
Code
zhihu.py
----------------------------------------------------------
# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request

from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    start_user = 'excited-vczh'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        # Generator yielding Request(url, callback); without a callback,
        # Scrapy falls back to the spider's parse() method
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
                      self.parse_follows)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()
        # Copy only the fields declared on UserItem
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item
        # Then crawl this user's follow list, starting from the first page
        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            self.parse_follows)

    def parse_follows(self, response):
        results = json.loads(response.text)
        if 'data' in results.keys():
            # Request the detailed profile of every user on this page
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)
        if 'paging' in results.keys() and results.get('paging').get('is_end') is False:
            # paging's 'previous' and 'next' hold the URLs of the adjacent pages,
            # so the offset is advanced by following 'next' rather than computed here
            next_page = results.get('paging').get('next')
            yield Request(next_page,
                          self.parse_follows)
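To make the pagination logic in parse_follows easier to follow, here is a sketch of the same extraction as a plain function that can be run without Scrapy. The sample payload is invented: only the key names (data, url_token, paging, is_end, next) come from the spider, and extract_follows is a hypothetical helper.

```python
import json

# Invented sample shaped like the keys the spider reads; values are made up
SAMPLE_PAGE = json.dumps({
    'data': [{'url_token': 'user-a'}, {'url_token': 'user-b'}],
    'paging': {
        'is_end': False,
        'next': 'https://www.zhihu.com/api/v4/members/excited-vczh/followees?offset=20&limit=20',
    },
})
SAMPLE_LAST_PAGE = json.dumps({
    'data': [{'url_token': 'user-c'}],
    'paging': {'is_end': True},
})


def extract_follows(text):
    """Return (url_tokens, next_page_url or None) from a follow-list response body."""
    results = json.loads(text)
    tokens = [entry.get('url_token') for entry in results.get('data', [])]
    paging = results.get('paging', {})
    # Follow 'next' only when the API explicitly says this is not the last page
    next_page = paging.get('next') if paging.get('is_end') is False else None
    return tokens, next_page


tokens, next_page = extract_follows(SAMPLE_PAGE)
last_tokens, last_next = extract_follows(SAMPLE_LAST_PAGE)
print(tokens, next_page)
print(last_tokens, last_next)
```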
items.py
----------------------------------------------------------
# -*- coding: utf-8 -*-
from scrapy import Item, Field
class UserItem(Item):
    # define the fields for your item here like:
    id = Field()
    name = Field()
    avatar_url = Field()
    headline = Field()
    description = Field()
    url = Field()
    url_token = Field()
    gender = Field()
    cover_url = Field()
    type = Field()
    badge = Field()
    answer_count = Field()
    articles_count = Field()
    commercial_question_count = Field()
    favorite_count = Field()
    favorited_count = Field()
    follower_count = Field()
    following_columns_count = Field()
    following_count = Field()
    pins_count = Field()
    question_count = Field()
    thank_from_count = Field()
    thank_to_count = Field()
    thanked_count = Field()
    vote_from_count = Field()
    vote_to_count = Field()
    voteup_count = Field()
    following_favlists_count = Field()
    following_question_count = Field()
    following_topic_count = Field()
    marked_answers_count = Field()
    mutual_followees_count = Field()
    hosted_live_count = Field()
    participated_live_count = Field()
    locations = Field()
    educations = Field()
    employments = Field()
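A Scrapy Item raises KeyError when an undeclared field is assigned, which is why parse_user copies only the keys declared on UserItem. The sketch below shows that filtering step with plain dicts; USER_FIELDS is a shortened, illustrative subset of the fields above, and both pick_fields and the sample API result are hypothetical.

```python
# Illustrative subset of the fields declared on UserItem
USER_FIELDS = ('id', 'name', 'url_token', 'gender', 'follower_count')


def pick_fields(result, fields=USER_FIELDS):
    """Keep only the declared fields that are actually present in the API result."""
    return {field: result[field] for field in fields if field in result}


# Invented API result: it has an extra key and lacks 'gender'
api_result = {
    'id': 'abc123',
    'name': 'Example User',
    'url_token': 'example-user',
    'follower_count': 42,
    'is_org': False,
}
picked = pick_fields(api_result)
print(picked)
```

Keys the API omits are simply left unset, so the stored documents may not all have the same set of fields.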
settings.py
----------------------------------------------------------
# -*- coding: utf-8 -*-
# Scrapy settings for zhihuuser project
BOT_NAME = 'zhihuuser'
SPIDER_MODULES = ['zhihuuser.spiders']
NEWSPIDER_MODULE = 'zhihuuser.spiders'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
}
ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}
MONGODB_SERVER = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DB = 'zhuser'
MONGODB_COLLECTION = 'zhu'
pipelines.py
----------------------------------------------------------
# -*- coding: utf-8 -*-
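import pymongo


class MongoPipeline(object):
    def __init__(self, server, port, db, collection):
        # Connect to MongoDB and select the target collection
        self.client = pymongo.MongoClient(server, port)
        self.collection = self.client[db][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from settings.py.
        # scrapy.conf is deprecated; crawler.settings is the supported way in.
        settings = crawler.settings
        return cls(
            settings.get('MONGODB_SERVER'),
            settings.getint('MONGODB_PORT'),
            settings.get('MONGODB_DB'),
            settings.get('MONGODB_COLLECTION'),
        )

    def process_item(self, item, spider):
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()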
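Because the pipeline inserts every item, running the crawl a second time would store the same users again. One possible variation, not used in the original project, is an upsert keyed on url_token through pymongo's update_one(filter, update, upsert=True). The sketch below exercises that idea against FakeCollection, a made-up in-memory stand-in that mimics only this one method, so it runs without a MongoDB server; save_user is also hypothetical.

```python
class FakeCollection:
    """In-memory stand-in that imitates only update_one() with a '$set' update."""

    def __init__(self):
        self.docs = {}

    def update_one(self, filter, update, upsert=False):
        key = filter['url_token']
        if key in self.docs or upsert:
            # Create the document on first sight, then merge the new values in
            self.docs.setdefault(key, {}).update(update['$set'])


def save_user(collection, item):
    """Insert the user if new, otherwise overwrite the stored fields."""
    collection.update_one(
        {'url_token': item['url_token']},
        {'$set': dict(item)},
        upsert=True,
    )


collection = FakeCollection()
save_user(collection, {'url_token': 'example-user', 'follower_count': 10})
save_user(collection, {'url_token': 'example-user', 'follower_count': 12})
print(collection.docs)
```

With a real pymongo collection, save_user would be called from process_item in place of insert_one.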
Results and open issues
The data is saved to MongoDB, but there are fewer than 400 records.
![](https://img.haomeiwen.com/i1780773/227d9526401a2f5a.png)
Cause: Zhihu's anti-crawler restrictions, which make requests fail with 403 and 404 errors.
![](https://img.haomeiwen.com/i1780773/ca86c3742af25680.png)
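A common first step against this kind of blocking is to slow the crawl down. The fragment below is an untested sketch of additions to settings.py using Scrapy's built-in throttling settings. The setting names are real Scrapy settings, but the values are guesses, and whether they are enough to avoid Zhihu's restrictions has not been verified here.

```python
# Possible additions to settings.py (values are illustrative, not tuned)

# Wait about one second between requests to the same site
DOWNLOAD_DELAY = 1.0

# Keep only a couple of requests to the domain in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```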