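1. Define the target

Target page: https://careers.tencent.com/search.html?index=1
This page is dynamic: its content is rendered after the initial load, so start by inspecting the network requests with Chrome's developer tools (F12):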

[Screenshot: 1.png]
The inspection shows that the page content is loaded from: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1565961876427&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn
Note that pageIndex= is followed by the page number and pageSize= by the number of entries per page.
This gives the general form of the URL to crawl.
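
The query string can also be assembled programmatically rather than edited by hand. The following is a minimal sketch using only the standard library; the helper name build_query_url is hypothetical, and it assumes the API keeps accepting the parameters seen in the captured URL (the timestamp value is simply copied from it).

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://careers.tencent.com/tencentcareer/api/post/Query"


def build_query_url(page_index, page_size=10):
    """Hypothetical helper: build the API URL for one page of results."""
    params = {
        "timestamp": "1565961876427",  # copied from the captured request
        "countryId": "",
        "cityId": "",
        "bgIds": "",
        "productId": "",
        "categoryId": "",
        "parentCategoryId": "",
        "attrId": "",
        "keyword": "",
        "pageIndex": page_index,  # page number
        "pageSize": page_size,    # entries per page
        "language": "zh-cn",
        "area": "cn",
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(build_query_url(2))
```

Calling build_query_url with a different page_index changes only the pageIndex part of the URL, which is the part the spider needs to vary.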

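Next, work out how to extract the data.

The API response is JSON, so a sample of it converts directly into a Python dictionary; the required data sits under the "Posts" key, which is nested inside "Data".
This gives a first outline of the spider:
1 Fetch the page content
2 Convert it to a dictionary
3 Extract the fields from the dictionary
4 Pass the data to the pipeline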
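
Steps 2 and 3 can be tried outside Scrapy first. The sketch below uses a made-up, heavily shortened response body (the real response carries more fields and entries); only the key names Data, Posts, RecruitPostName and LocationName come from the spider code in this article, and extract_posts is a hypothetical helper name.

```python
from json import loads

# Hypothetical sample body; the values are invented for illustration.
sample_body = """
{"Data": {"Posts": [
    {"RecruitPostName": "Example Engineer", "LocationName": "Shenzhen"},
    {"RecruitPostName": "Example Designer", "LocationName": "Beijing"}
]}}
"""


def extract_posts(body):
    """Convert the JSON text to a dict and return the list under Data -> Posts."""
    return loads(body)["Data"]["Posts"]


for post in extract_posts(sample_body):
    print(post["RecruitPostName"], post["LocationName"])
```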

2. Create the Scrapy project

In cmd, type the following command and press Enter:
scrapy startproject tencent
Once the project has been created, type:
cd tencent

[Screenshot: 2.png]
Scrapy has now generated a project folder with the following structure:

tencent
  │  scrapy.cfg
  │
  └─tencent
      │  items.py
      │  middlewares.py
      │  pipelines.py
      │  settings.py
      │  __init__.py
      │
      ├─spiders    # spider files go here
      │  │  __init__.py
      │  │
      │  └─__pycache__
      └─__pycache__

3. Write items.py

import scrapy


class TencentItem(scrapy.Item):
    # Declare a Field for each piece of data to scrape
    CareerTitle = scrapy.Field()  # job title
    Classify = scrapy.Field()  # job classification
    Location = scrapy.Field()  # work location
    CareerType = scrapy.Field()  # job category
    Time = scrapy.Field()  # publication time
    Text = scrapy.Field()  # job details

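4. Create and write the spider

From the Scrapy project directory, run the following in cmd:

scrapy genspider one "tencent.com"
# "one" is the spider name; "tencent.com" restricts the domains the spider may crawl

This generates one.py in the spiders folder.
Edit one.py as follows: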

import scrapy
from json import loads

from tencent.items import TencentItem


class OneSpider(scrapy.Spider):
    name = 'one'
    allowed_domains = ['tencent.com']
    base_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1565872500295&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    # offset tracks the current pageIndex
    offset = 1
    # Start from page 1 of the same API URL
    start_urls = [base_url.format(offset)]

    def parse(self, response):
        # Convert the JSON returned by the API into a dictionary
        model_dict = loads(response.text)
        for elem in model_dict['Data']['Posts']:
            # Copy the matching fields into an item
            item = TencentItem()
            item['CareerTitle'] = elem['RecruitPostName']
            item['Classify'] = elem['PostURL']  # note: filled from the post URL
            item['Location'] = elem['LocationName']
            item['CareerType'] = elem['CategoryName']
            item['Time'] = elem['LastUpdateTime']
            item['Text'] = elem['Responsibility']
            yield item  # return the record extracted in this iteration

        # Once the current page is done, use offset to build the next page's URL
        # and request it (take care not to go past the last page)
        if self.offset < 486:
            self.offset += 1
            url = self.base_url.format(self.offset)
            yield scrapy.Request(url, callback=self.parse)
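
The upper bound 486 in the spider is hard-coded and only matches the number of listings at the time of writing. As a hedged sketch of an alternative, the last page can be derived from a total record count, assuming such a count is available (for example from the API response; the source does not show such a field, so treat that as an assumption). The helper names last_page and next_page are hypothetical.

```python
import math


def last_page(total_count, page_size=10):
    """Number of pages needed to cover total_count entries."""
    return math.ceil(total_count / page_size)


def next_page(current_page, total_count, page_size=10):
    """Return the next page index, or None once the last page is reached."""
    if current_page < last_page(total_count, page_size):
        return current_page + 1
    return None


# Assumed total of 4860 entries at 10 per page
print(last_page(4860))
print(next_page(486, 4860))
```

Inside parse, the same check would replace the comparison against 486, so the spider stops by itself when the listing grows or shrinks.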

5. Write pipelines.py

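import json


class TencentPipeline(object):
    # Open the output file tencent.json when the pipeline is created
    def __init__(self):
        # UTF-8 so the non-ASCII text kept by ensure_ascii=False is stored correctly
        self.f = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        # Write one record per line
        self.f.write(content)
        return item

    # Close the file when the spider finishes
    def close_spider(self, spider):
        self.f.close()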
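
Scrapy only runs a pipeline that is enabled in settings.py. The source does not show this step, so the fragment below is an assumption about how the project is configured; the dotted path follows the project and class names used above, and the priority 300 is an arbitrary choice (pipelines with lower numbers run first).

```python
# settings.py (fragment): enable the pipeline defined in pipelines.py
ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}
```

With the pipeline enabled, the spider can be started from the project directory with: scrapy crawl one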