Part 1: Preface

With axios + cheerio alone you can only scrape static pages. To simulate DOM operations and fetch dynamically loaded resources you need other techniques, such as Headless Chromium.

Part 2: The libraries

axios

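What is axios?
Axios is a promise-based HTTP library that works both in the browser and in Node.js.
In the browser: we use axios to call backend APIs. It wraps methods such as get and post, which is far more convenient than hand-writing our own ajax helper. Since this article uses Node.js, we will not go into the browser side.
In Node: axios.get sends a GET request to a website, and the response contains the HTML we are after. The rest of this article covers axios in Node in detail.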

cheerio

A jQuery for Node. It cannot trigger events on elements, because what it works on is only the HTML of the response as a string.

Part 3: Hands-on example

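Before writing any code, install the axios and cheerio libraries, for example by running npm install axios cheerio. See the official documentation for details.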

var axios = require('axios');
var cheerio = require('cheerio'); // jQuery-style HTML parsing only; it parses, so no event handling
axios.get('https://www.bilibili.com/')
    .then(function (res) {
        // the response for the page
        console.log(res);
    })
    .catch(function (err) {
        console.log('failed', err);
    });

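Start the script with node filename.js.
Unfortunately, the first attempt fails.

Note the reason given in the output: Error: Request failed with status code 403
403 is an HTTP status code.

403 (Forbidden) is a common error when accessing websites: the server understood the request but refuses to process it. It is often caused by file or directory permission settings on the server, but a site may also return it when a request does not look like it comes from a browser.
If you know HTTP, you know that a request message includes request headers, and our code clearly does not configure any.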
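
Printing the whole error object makes the 403 easy to miss. The following is a minimal sketch of a more focused catch handler, assuming the error shape described in the axios documentation (err.response when the server replied, err.request when no reply arrived). describeRequestError is a hypothetical helper name, not part of axios.

```javascript
// Hypothetical helper: summarise the error passed to .catch() after axios.get().
function describeRequestError(err) {
  if (err.response) {
    // The server answered, but with a status code outside the 2xx range.
    return "server responded with status " + err.response.status;
  }
  if (err.request) {
    // The request was sent but no response arrived.
    return "no response received: " + err.message;
  }
  // Something failed before the request was sent.
  return "request could not be sent: " + err.message;
}

// Usage with a hand-built object that mimics the shape of an axios error:
console.log(
  describeRequestError({
    response: { status: 403 },
    message: "Request failed with status code 403",
  })
);
```

In the script it could be used as .catch(function (err) { console.log(describeRequestError(err)); }).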

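So how do we find out what the request headers should contain?
Use the browser's developer tools (Inspect Element): open the Network panel and select the request for the page.

Scroll down to Request Headers. This is the header information we need.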
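
Copying the headers by hand is error-prone. On the wire, header names are hyphenated (user-agent, accept-language), and names that begin with a colon (:authority, :method, :path, :scheme) are HTTP/2 pseudo-headers that the HTTP client generates itself. The following is a minimal sketch that turns the text copied from the Request Headers section into an object, assuming one "name: value" pair per line. parseRawHeaders and the sample values are hypothetical.

```javascript
// Hypothetical helper: convert header text copied from the browser's
// developer tools ("name: value" per line) into a plain object.
function parseRawHeaders(raw) {
  const headers = {};
  raw.split("\n").forEach(function (line) {
    const trimmed = line.trim();
    // Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
    if (trimmed === "" || trimmed.startsWith(":")) return;
    const index = trimmed.indexOf(":");
    if (index === -1) return; // not a "name: value" line
    const name = trimmed.slice(0, index).trim().toLowerCase();
    headers[name] = trimmed.slice(index + 1).trim();
  });
  return headers;
}

// Usage with made-up sample values:
const sample = [
  ":authority: www.bilibili.com",
  ":method: GET",
  "accept-language: zh-CN,zh;q=0.9",
  "user-agent: Mozilla/5.0 (example)",
].join("\n");
console.log(parseRawHeaders(sample));
```

The returned object can be passed to axios as the headers option of the request configuration.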

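The second argument of axios.get accepts a request configuration object. Here we use ES6 shorthand property syntax to pass the headers object directly: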

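var axios = require('axios');
var cheerio = require('cheerio'); // jQuery-style HTML parsing only; it parses, so no event handling
// Replace each "x" with the value shown under Request Headers in the browser.
// Header names are written as they are sent (hyphenated), so they must be quoted.
// The HTTP/2 pseudo-headers (:authority, :method, :path, :scheme) are left out:
// the HTTP client generates them from the URL.
let headers = {
    accept: "x",
    "accept-encoding": "x",
    "accept-language": "x",
    "cache-control": "x",
    cookie: "x",
    "sec-fetch-dest": "x",
    "sec-fetch-mode": "x",
    "sec-fetch-site": "x",
    "sec-fetch-user": "x",
    "upgrade-insecure-requests": "x",
    "user-agent": "x",
};
axios.get('https://www.bilibili.com/', { headers })
    .then(function (res) {
        // the response for the page
        console.log(res);
    })
    .catch(function (err) {
        console.log('failed', err);
    });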

This time the request succeeds!


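But something still looks odd.
The output is full of strings of the form \uxxxx.
These are Unicode escape sequences, and they only need one more conversion step.
Change the body of the first then to: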

let text = unescape(res.data.replace(/\\u/g, '%u')); // convert \uXXXX escapes into the characters they encode

Now you can see the correct content.
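
Note that unescape is deprecated, and it also decodes %XX sequences, so percent-encoded text elsewhere in the page is rewritten too. The following is a minimal sketch of an alternative that touches only \uXXXX escapes. decodeUnicodeEscapes is a hypothetical helper name, not part of axios or cheerio.

```javascript
// Hypothetical helper: turn literal \uXXXX sequences into the characters they encode.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
    // hex holds the four hexadecimal digits that follow "\u"
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Usage: the input holds a literal backslash followed by "u4e2d", and so on.
console.log(decodeUnicodeEscapes("\\u4e2d\\u6587")); // prints 中文
```

Inside the then callback it would be called as let text = decodeUnicodeEscapes(res.data);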


Next, let's use cheerio to extract the information we want.
Step one:
Give $ its meaning by calling cheerio.load.
The code is as follows:

let $ = cheerio.load(text, {
      decodeEntities: false,
    });

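decodeEntities is an option of cheerio.load. It is also set here to avoid garbled Chinese: with it set to false, cheerio leaves the text as it is instead of converting characters to and from HTML entities. Without it, the scraped data typically still comes out garbled.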

Here we want to scrape the video information on the homepage, so we inspect the elements again.


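From this we can narrow it down to the .title elements under .info-box, which contain the information we want.
With jQuery syntax we can get the data.
The final code is as follows: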

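var axios = require("axios");
var cheerio = require("cheerio"); // jQuery-style HTML parsing only; it parses, so no event handling
// Replace each "x" with the value shown under Request Headers in the browser.
// Header names are written as they are sent (hyphenated), so they must be quoted.
// The HTTP/2 pseudo-headers (:authority, :method, :path, :scheme) are left out:
// the HTTP client generates them from the URL.
let headers = {
    accept: "x",
    "accept-encoding": "x",
    "accept-language": "x",
    "cache-control": "x",
    cookie: "x",
    "sec-fetch-dest": "x",
    "sec-fetch-mode": "x",
    "sec-fetch-site": "x",
    "sec-fetch-user": "x",
    "upgrade-insecure-requests": "x",
    "user-agent": "x",
};
axios
  .get("https://www.bilibili.com/", { headers })
  .then(function (res) {
    // res.data is the page HTML as a string
    let text = unescape(res.data.replace(/\\u/g, "%u")); // convert \uXXXX escapes into the characters they encode
    let $ = cheerio.load(text, {
      decodeEntities: false, // keep Chinese text readable instead of entity-encoded
    });
    console.log("Homepage recommendations at " + new Date().toString());
    // print the inner HTML of every recommended video title
    $(".video-card-reco .info .title").each(function () {
      let title = $(this).html();
      console.log(title);
    });
  })
  .catch(function (err) {
    console.log("failed", err);
  });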

Output:

[Screenshot: the scraped video titles printed to the console]