Scraping Zhihu User Information - Scrapy + MongoDB

Author: 马淑 | Source: published 2017-10-05 21:02, read 156 times

`Part of the collection: Python + Scrapy + MongoDB Examples`

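Environment requirements

Local environment: 32-bit Windows + Python 3 + Scrapy + MongoDB; Firefox browser

Installing Scrapy

Start cmd as administrator and run:
pip install scrapy

Installing MongoDB

See the article "Scraping the Maoyan Top 100 Movies - Request and MongoDB".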

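Creating the project

Create the directory E:\pycodes to hold project code, change into E:\pycodes in cmd, and run:
scrapy startproject zhihuuser

This generates the zhihuuser crawler project under E:\pycodes. Change into the zhihuuser folder and create the zhihu spider:

cd zhihuuser
scrapy genspider zhihu www.zhihu.com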

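Disguising the crawler

Disabling ROBOTSTXT_OBEY

In settings.py, set ROBOTSTXT_OBEY to False so that the crawler does not obey the site's robots.txt rules:
ROBOTSTXT_OBEY = False

Modifying the request headers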

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
}

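Page analysis

Open the browser's developer console and switch to the Network tab. Pick a well-known user (a "big V"); taking 轮子哥 (vczh) as the example, his profile page is https://www.zhihu.com/people/excited-vczh.

First open vczh's home page and click "Followers" (关注者).
Clicking page "2" of the list triggers a batch of new requests. One is a POST request named batch, and another is a GET request whose name starts with followers?. The GET request's headers include a Referer attribute with the value:
https://www.zhihu.com/people/excited-vczh/followers?page=2
The GET request URL is:
https://www.zhihu.com/api/v4/members/excited-vczh/followers?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=20&limit=20

Inspecting its response shows JSON data; the spider code later reads its 'data' and 'paging' keys.

So the URL interface for fetching a user's followers is:
https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}

Here user is the user's 'url_token', include is a fixed query parameter, offset is the paging offset, and limit is the number of entries per page.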
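
As a quick check of how these parameters fit together, here is a minimal sketch that fills in the URL template for a given page. The helper name followers_page_url and the 1-based page argument are assumptions made for illustration; only the template and the include string come from the request observed above.

```python
FOLLOWERS_URL = ('https://www.zhihu.com/api/v4/members/{user}/followers'
                 '?include={include}&offset={offset}&limit={limit}')
INCLUDE = ('data[*].answer_count,articles_count,gender,follower_count,'
           'is_followed,is_following,badge[?(type=best_answerer)].topics')


def followers_page_url(user, page, limit=20):
    """Build the followers URL for a 1-based page number (hypothetical helper)."""
    offset = (page - 1) * limit  # page 1 -> offset 0, page 2 -> offset 20
    return FOLLOWERS_URL.format(user=user, include=INCLUDE,
                                offset=offset, limit=limit)


url = followers_page_url('excited-vczh', page=2)
print(url)
```

With page=2 and limit=20 the offset is 20, which matches the GET request captured in the Network tab.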

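Similarly, hovering the mouse over one of the followers adds one POST request and one GET request to the Network panel. From this we learn that

the URL interface for fetching a user's detailed information is:
https://www.zhihu.com/api/v4/members/{user}?include={include}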

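Workflow and implementation

Workflow

Putting the analysis together: start from a big V, collect every 'url_token' in its following list, use each 'url_token' to fetch that user's information, and then fetch that user's following list in turn. Note that the analysis above looked at the followers endpoint, while the spider requests followees (the accounts a user follows); the URL takes the same include, offset and limit parameters.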
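
The traversal described here can be sketched without Scrapy or the network. The follow graph below is made up for illustration, and the breadth-first order is chosen only for the sketch; in the real spider, Scrapy's scheduler decides the order, and its default duplicate request filter plays the role of the seen set.

```python
from collections import deque

# Hypothetical follow graph: url_token -> url_tokens that user follows.
FOLLOWING = {
    'excited-vczh': ['user-a', 'user-b'],
    'user-a': ['user-b', 'user-c'],
    'user-b': ['excited-vczh'],
    'user-c': [],
}


def crawl(start_user, following):
    """Visit every user reachable from start_user exactly once."""
    seen = {start_user}
    queue = deque([start_user])
    visited = []
    while queue:
        user = queue.popleft()
        visited.append(user)                    # stands in for "fetch user info"
        for token in following.get(user, []):   # stands in for "fetch following list"
            if token not in seen:
                seen.add(token)
                queue.append(token)
    return visited


order = crawl('excited-vczh', FOLLOWING)
print(order)
```

Even though user-b follows the start user back, each user is visited once, so the loop terminates.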

Code

zhihu.py
----------------------------------------------------------
# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request
from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    start_user = 'excited-vczh'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
    
    def start_requests(self):
        # Generator yielding Request(url, callback); the callback defaults to the spider's parse() method
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
                      self.parse_follows)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()

        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item

        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            self.parse_follows)

    def parse_follows(self, response):
        results = json.loads(response.text)

        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)

        if 'paging' in results and results.get('paging').get('is_end') is False:

            # paging's 'previous' and 'next' fields hold the URLs of the adjacent
            # pages, so the offset is already updated in the 'next' URL
            next_page = results.get('paging').get('next')
            yield Request(next_page,
                          self.parse_follows)
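
The logic of parse_follows can be tried offline with a small sketch. The sample payload below is hand-written to match the keys the spider reads ('data', 'url_token', 'paging', 'is_end', 'next'); it is not real Zhihu output, and split_follows_page is a hypothetical helper name.

```python
import json

# Hand-written sample shaped like the keys parse_follows reads; not real output.
SAMPLE = json.dumps({
    'data': [{'url_token': 'user-a'}, {'url_token': 'user-b'}],
    'paging': {
        'is_end': False,
        'next': ('https://www.zhihu.com/api/v4/members/excited-vczh/'
                 'followees?offset=20&limit=20'),
    },
})


def split_follows_page(text):
    """Return (url_tokens, next_page_url or None) for one page of results."""
    results = json.loads(text)
    tokens = [entry.get('url_token') for entry in results.get('data', [])]
    paging = results.get('paging', {})
    # Only follow 'next' while the API says this is not the last page.
    next_page = paging.get('next') if paging.get('is_end') is False else None
    return tokens, next_page


tokens, next_page = split_follows_page(SAMPLE)
print(tokens)
print(next_page)
```

Each token would become a parse_user request and next_page, when present, another parse_follows request.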

items.py
----------------------------------------------------------
# -*- coding: utf-8 -*-

from scrapy import Item, Field
 
class UserItem(Item):
    # define the fields for your item here like:
    id = Field()
    name = Field()
    avatar_url = Field()
    headline = Field()
    description = Field()
    url = Field()
    url_token = Field()
    gender = Field()
    cover_url = Field()
    type = Field()
    badge = Field()
 
    answer_count = Field()
    articles_count = Field()
    commercial_question_count = Field()
    favorite_count = Field()
    favorited_count = Field()
    follower_count = Field()
    following_columns_count = Field()
    following_count = Field()
    pins_count = Field()
    question_count = Field()
    thank_from_count = Field()
    thank_to_count = Field()
    thanked_count = Field()
    vote_from_count = Field()
    vote_to_count = Field()
    voteup_count = Field()
    following_favlists_count = Field()
    following_question_count = Field()
    following_topic_count = Field()
    marked_answers_count = Field()
    mutual_followees_count = Field()
    hosted_live_count = Field()
    participated_live_count = Field()
 
    locations = Field()
    educations = Field()
    employments = Field()

settings.py
----------------------------------------------------------
# -*- coding: utf-8 -*-

# Scrapy settings for zhihuuser project

BOT_NAME = 'zhihuuser'

SPIDER_MODULES = ['zhihuuser.spiders']
NEWSPIDER_MODULE = 'zhihuuser.spiders'

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
}

ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}

MONGODB_SERVER = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DB = 'zhuser'
MONGODB_COLLECTION = 'zhu'

pipelines.py
----------------------------------------------------------
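# -*- coding: utf-8 -*-
import pymongo


class MongoPipeline(object):
    def __init__(self, server, port, db, collection):
        # Open the database connection
        self.client = pymongo.MongoClient(server, port)
        self.collection = self.client[db][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from settings.py via crawler.settings
        settings = crawler.settings
        return cls(
            server=settings.get('MONGODB_SERVER'),
            port=settings.getint('MONGODB_PORT'),
            db=settings.get('MONGODB_DB'),
            collection=settings.get('MONGODB_COLLECTION'),
        )

    def close_spider(self, spider):
        # Release the connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() is the current PyMongo call for storing a single document
        self.collection.insert_one(dict(item))
        return item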

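Results and open issues

The data was saved to MongoDB, but fewer than 400 documents were stored.

Cause: Zhihu's anti-crawler limits, which return 403 and 404 errors.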
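
One direction that might reduce these errors is to slow the crawler down. The fragment below is a sketch of possible additions to settings.py using standard Scrapy settings; the values are illustrative guesses, and this has not been tested against Zhihu, so it may not remove the 403 and 404 responses.

```python
# Possible additions to settings.py; the values are illustrative guesses.
DOWNLOAD_DELAY = 1                   # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # fewer parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server latency
```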
