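(I) Tools used
This walkthrough uses a user-agent switcher add-on for Firefox, so that the browser is served the mobile version of the site. If the add-on is new to you, see the guide "Using Firefox add-ons".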
![](https://img.haomeiwen.com/i2577413/17ced045a0348b70.png)
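The same idea can be reproduced outside the browser by sending a mobile User-Agent header with the request. The snippet below is only a sketch using the standard library: the `MOBILE_UA` string and the `build_mobile_request` helper are illustrative assumptions, not something taken from the original post, and no request is actually sent.

```python
from urllib.request import Request

# Assumption: an example mobile Safari User-Agent string, chosen for illustration.
# Any mobile browser string would serve the same purpose.
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) "
    "AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 "
    "Mobile/15A372 Safari/604.1"
)


def build_mobile_request(url: str) -> Request:
    """Return a urllib Request that identifies itself as a mobile browser."""
    return Request(url, headers={"User-Agent": MOBILE_UA})


req = build_mobile_request("http://3g.163.com/touch/news/")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

In a Scrapy project the equivalent is usually the `USER_AGENT` setting rather than a hand-built request.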
(II) Crawling steps:
Search Baidu for 网易新闻 (NetEase News) and open the result.
Step 1:
![](https://img.haomeiwen.com/i2577413/960cd15b0689225c.png)
Step 2:
![](https://img.haomeiwen.com/i2577413/44037d74eb5e0be7.png)
Step 3:
![](https://img.haomeiwen.com/i2577413/e3ff92e221f30944.png)
Step 4:
![](https://img.haomeiwen.com/i2577413/d660169b3287b9c4.png)
Final step:
![](https://img.haomeiwen.com/i2577413/dcfa120c82b13bf5.png)
Notes:
(1) NetEase News channel IDs. These are the ones found:
{"BBM54PGAwangning","BCR1UC1Qwangning","BD29LPUBwangning","BD29MJTVwangning","C275ML7Gwangning"}
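(2) Pagination:
Moving to the next page changes the URL suffix from 0-10 ------> 10-10.
The first number is the offset: it starts at 0 and increases in steps of 10.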
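The two notes above can be combined into a small URL generator. This is a sketch under assumptions: the URL template follows the list URL that appears in this post, the names `BASE_URL`, `CHANNELS` and `list_urls` are made up for illustration, and the trailing number in the suffix is assumed to be the page size.

```python
from typing import Iterator

# Template modelled on the list URL shown in this post.
# Assumption: the second number in the suffix is the page size.
BASE_URL = (
    "http://3g.163.com/touch/reconstruct/article/list/"
    "{channel}/{offset}-{size}.html"
)

# Channel IDs listed in note (1).
CHANNELS = [
    "BBM54PGAwangning",
    "BCR1UC1Qwangning",
    "BD29LPUBwangning",
    "BD29MJTVwangning",
    "C275ML7Gwangning",
]


def list_urls(channel: str, pages: int, size: int = 10) -> Iterator[str]:
    """Yield list URLs for one channel, with offsets 0, size, 2*size, ..."""
    for offset in range(0, pages * size, size):
        yield BASE_URL.format(channel=channel, offset=offset, size=size)


urls = list(list_urls("BD29LPUBwangning", pages=3))
for url in urls:
    print(url)
```

Looping over `CHANNELS` and calling `list_urls` for each entry would cover every channel rather than a single one.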
Let's try fetching from offset 0 by entering this URL in the browser:
http://3g.163.com/touch/reconstruct/article/list/BD29LPUBwangning/0-10.html
![](https://img.haomeiwen.com/i2577413/cfa7d31f41f4928c.png)
The response is clean, well-structured JSON. Without further ado, on to the code.
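Before the spider itself, here is a minimal sketch of how such a response could be unwrapped. The spider's own code strips an `artiList(` prefix, so the body is treated here as JSONP. The `unwrap_jsonp` helper and the sample payload are assumptions made up for illustration; only the field names (`title`, `ptime`, `url`, `source`) come from the spider in this post.

```python
import json


def unwrap_jsonp(text: str) -> dict:
    """Parse a JSONP body such as artiList({...}) into a Python dict.

    Only the text between the first '(' and the last ')' is kept, so
    parentheses inside article titles are left untouched.
    """
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])


# Made-up sample payload, shaped like the response described in this post.
sample = (
    'artiList({"BD29LPUBwangning":[{"title":"Example (demo)",'
    '"ptime":"2018-05-23 13:56:00",'
    '"url":"http://example.com/a.html","source":"Example"}]})'
)

articles = unwrap_jsonp(sample)["BD29LPUBwangning"]
for article in articles:
    print(article["title"], article["ptime"], article["url"], article["source"])
```

Slicing between the outermost parentheses is a little safer than removing every `)` from the text, because a title may itself contain parentheses.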
(III) Writing the code:
Environment: Windows 10, Python 3, Scrapy
Only the spider file is shown here:
# -*- coding: utf-8 -*-
# @Time : 2018/5/23 13:56
# @Author : 蛇崽
# @Email : 643435675@QQ.com
# @File : wangyi3g.py
import json
import re

import scrapy
from bs4 import BeautifulSoup


class Wangyi3GSpider(scrapy.Spider):
    name = 'wangyi3g'
    allowed_domains = ['3g.163.com']
    start_urls = ['http://3g.163.com/touch/news/']
    # Article-list endpoint: {} is the starting offset, the trailing 10 stays fixed.
    baseurl = 'http://3g.163.com/touch/reconstruct/article/list/BD29LPUBwangning/{}-10.html'

    def parse(self, response):
        # Request offsets 0, 10, ..., 70.
        for page in range(0, 80, 10):
            jsonurl = self.baseurl.format(page)
            yield scrapy.Request(jsonurl, callback=self.parse_li_json)

    def parse_li_json(self, response):
        res = response.body.decode('utf-8')
        print(res)
        # The body is wrapped as artiList({...}). Keep only the text between the
        # first '(' and the last ')' so parentheses inside titles are preserved.
        res = res[res.index('(') + 1:res.rindex(')')]
        j = json.loads(res)
        datas = j['BD29LPUBwangning']
        print(datas)
        for data in datas:
            title = data['title']
            ptime = data['ptime']
            url = data['url']
            source = data['source']
            print(title, ptime, url, source)
            if url:
                # Follow each article link to its detail page.
                yield scrapy.Request(url, callback=self.parse_detail)

    def parse_detail(self, response):
        soup = BeautifulSoup(response.body, 'lxml')
        # The article body lives in <div class="content">.
        content = soup.find('div', 'content')
        # Image addresses are read from the data-src attributes.
        image_urls = re.findall(r'data-src="(.*?)"', str(content))
        # print(image_urls)
![](https://img.haomeiwen.com/i2577413/0e2423ec0e271b9a.png)
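The image extraction in `parse_detail` can be tried in isolation. The sketch below applies the same `data-src` regular expression to a made-up HTML fragment. The `extract_image_urls` helper and the sample markup are illustrative assumptions, and the duplicate filtering is an addition that the spider itself does not perform.

```python
import re
from typing import List

# Same pattern as in the spider: capture the value of every data-src attribute.
DATA_SRC_RE = re.compile(r'data-src="(.*?)"')


def extract_image_urls(html: str) -> List[str]:
    """Return data-src URLs in document order, skipping duplicates."""
    urls: List[str] = []
    for url in DATA_SRC_RE.findall(html):
        if url not in urls:
            urls.append(url)
    return urls


# Made-up article fragment, for illustration only.
sample_html = (
    '<div class="content">'
    '<p>text</p>'
    '<img data-src="http://example.com/a.jpg">'
    '<img data-src="http://example.com/b.jpg">'
    '<img data-src="http://example.com/a.jpg">'
    '</div>'
)

image_urls = extract_image_urls(sample_html)
print(image_urls)
```

A regular expression is enough for a quick look at the data. For markup that varies more, reading the attribute through BeautifulSoup, which the spider already imports, would likely be more robust.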
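That is the complete crawler for the NetEase News app. For more technical discussion, see the author's profile page to join the study group. Let's learn together.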