Scraping 丁香园 (DXY) with Node

Author: 暖男Gatsby | Published: 2020-02-28 11:52

First, install the dependencies with yarn add superagent cheerio (if yarn is not installed, run npm install -g yarn first). Notes:

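- superagent (http://visionmedia.github.io/superagent/) is an HTTP library that can send GET or POST requests.

- cheerio (https://github.com/cheeriojs/cheerio) is a fast, flexible, lean implementation of core jQuery built for the server. It pulls data out of a page with CSS selectors, and it is used the same way as jQuery.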

Next, create a file named spider.js.

Steps

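const superagent = require('superagent')
const cheerio = require('cheerio')

const url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia'

superagent
  .get(url)
  .then(res => {
    const $ = cheerio.load(res.text) // parse the HTML so it can be queried with jQuery-style CSS selectors
    console.log(res.text) // raw HTML of the response
  })
  .catch(err => { console.error(err) })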

Full code (each id is the id of the element that wraps the key content)

// Goal: scrape the epidemic data from the 丁香园 (DXY) site.
// Node needs a library that requests the DXY page on our behalf.
const superagent = require('superagent')
const cheerio = require('cheerio')
const fs = require('fs')
const path = require('path')

// 1. Request the target site
const url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia'

superagent
  .get(url)
  .then(res => {
    // res.text holds the response body.
    // A browser can parse HTML, but Node cannot do so on its own.

    // 2. Parse the HTML string and extract the epidemic data from it
    const $ = cheerio.load(res.text) // the DOM can now be queried with jQuery-style methods

    // Get the nationwide epidemic data
    var $getListByCountryTypeService1 = $('#getListByCountryTypeService1').html()
    var $getAreaStat = $('#getAreaStat').html()
    var $getStatisticsService = $('#getStatisticsService').html()
    // console.log($getListByCountryTypeService1)

    // Options here are string slicing, regex matching, or eval; this version uses eval.
    // In the DXY page source, window.getListByCountryTypeService1 holds the corresponding JSON data.
    // Replacing `window` with `dataObj` makes each script assign its data onto dataObj when it is eval'd.
    var dataObj = {}
    eval($getListByCountryTypeService1.replace(/window/g, 'dataObj'))
    eval($getAreaStat.replace(/window/g, 'dataObj'))
    eval($getStatisticsService.replace(/window/g, 'dataObj'))
    // console.log(dataObj)

    // 3. Write the data to a local file with fs
    fs.writeFile(path.join(__dirname, './data.json'), JSON.stringify(dataObj), err => {
      if (err) throw err
      console.log('Data written successfully')
    })
  })
  .catch(err => {
    console.error(err)
  })
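Step 3 can also be written with the promise-based `fs` API so that it composes with the `.then` chain. The following is a minimal sketch rather than the author's code: `saveJson` and `loadJson` are hypothetical helper names, the sample object is placeholder data, and the temporary directory is used only to keep the example self-contained.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: serialize `data` and write it to `file`.
// Indenting with 2 spaces keeps the saved file readable.
async function saveJson(file, data) {
  await fs.promises.writeFile(file, JSON.stringify(data, null, 2), 'utf8')
  return file
}

// Hypothetical helper: read a JSON file back into an object.
async function loadJson(file) {
  return JSON.parse(await fs.promises.readFile(file, 'utf8'))
}

async function demo() {
  // Placeholder data; the real script would pass dataObj here.
  const sample = { getStatisticsService: { note: 'placeholder' } }
  const dir = await fs.promises.mkdtemp(path.join(os.tmpdir(), 'spider-'))
  const file = await saveJson(path.join(dir, 'data.json'), sample)
  return loadJson(file)
}

demo().then(roundTripped => console.log(roundTripped))
```

In the scraper itself, returning `saveJson(path.join(__dirname, 'data.json'), dataObj)` from the `.then` callback would let the final `.catch` see write errors as well as request errors.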

(Most of the work in a web crawler like this is converting whatever format the data arrives in into the corresponding JSON.)
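The conversion to JSON can also be done without `eval`, which avoids executing whatever the remote page happens to send. The sketch below assumes the script text contains an assignment of the form `window.NAME = <JSON value>` surrounded by other code such as a `try`/`catch`; that shape and the sample string are assumptions for illustration, and `extractWindowJson` is a hypothetical helper, not part of superagent or cheerio.

```javascript
// Hypothetical helper: pull the JSON value assigned to window[name]
// out of a script's source text, without executing the script.
function extractWindowJson(source, name) {
  const marker = `window.${name}`
  const at = source.indexOf(marker)
  if (at === -1) throw new Error(`${marker} not found`)

  // The value is assumed to start at the first { or [ after the marker.
  const rest = source.slice(at + marker.length)
  const offset = rest.search(/[\[{]/)
  if (offset === -1) throw new Error(`no JSON value after ${marker}`)

  // Walk forward to the bracket that closes the value, ignoring
  // brackets that appear inside string literals.
  let depth = 0
  let inString = false
  for (let i = offset; i < rest.length; i++) {
    const ch = rest[i]
    if (inString) {
      if (ch === '\\') i++ // skip the escaped character
      else if (ch === '"') inString = false
    } else if (ch === '"') {
      inString = true
    } else if (ch === '{' || ch === '[') {
      depth++
    } else if (ch === '}' || ch === ']') {
      depth--
      if (depth === 0) return JSON.parse(rest.slice(offset, i + 1))
    }
  }
  throw new Error(`unterminated JSON value after ${marker}`)
}

// Made-up sample in the assumed shape.
const sample = 'try { window.getAreaStat = [{"name":"a}b","count":1}] }catch(e){}'
const areaStat = extractWindowJson(sample, 'getAreaStat')
console.log(areaStat)
```

With a helper like this, each `eval(...)` line could become something like `dataObj.getAreaStat = extractWindowJson($getAreaStat, 'getAreaStat')`. If the page stops using plain JSON on the right-hand side, `JSON.parse` throws instead of silently running unknown code.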


Article link: https://www.haomeiwen.com/subject/vkkxhhtx.html
