Target URL: https://www.jianshu.com/recommendations/collections?order_by=hot
Data to collect: collection name, collection description, number of articles included, number of followers
Crawling method: the Scrapy framework
Storage: a CSV file
Open the browser developer tools (F12) and watch the dynamically loaded URLs; the "hot collections" list spans 37 pages:
https://www.jianshu.com/recommendations/collections?page=1&order_by=hot
https://www.jianshu.com/recommendations/collections?page=2&order_by=hot
https://www.jianshu.com/recommendations/collections?page=3&order_by=hot
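The page URLs differ only in the page parameter, so the full list can be generated with a format string (a small pure-Python sketch matching the pattern above):

```python
# The paginated "hot collections" endpoint seen in the network panel
BASE = "https://www.jianshu.com/recommendations/collections?page={}&order_by=hot"

# 37 pages were observed, so generate page=1 through page=37
page_urls = [BASE.format(page) for page in range(1, 38)]

print(len(page_urls))   # 37
print(page_urls[-1])
```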
1. items.py

import scrapy

class ZhuantiItem(scrapy.Item):
    # define the fields for your item here:
    name = scrapy.Field()
    content = scrapy.Field()
    article = scrapy.Field()
    fans = scrapy.Field()
2. zhuantispider.py

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from zhuanti.items import ZhuantiItem

class ZhuantiSpider(CrawlSpider):
    name = "zhuanti"
    start_urls = ["https://www.jianshu.com/recommendations/collections?page=1&order_by=hot"]

    def parse(self, response):
        selector = Selector(response)
        infos = selector.xpath('//div[@class="collection-wrap"]')
        for info in infos:
            try:
                name = info.xpath('a/h4/text()').extract()[0]
                content = info.xpath('a/p/text()').extract()[0].replace('\n', '')
                article = info.xpath('div[@class="count"]/a/text()').extract()[0]
                fans = info.xpath('div[@class="count"]/text()').extract()[0].strip('· ')
                # Create a fresh item for each collection so earlier results
                # are not overwritten before they are exported
                item = ZhuantiItem()
                item['name'] = name
                item['content'] = content
                item['article'] = article
                item['fans'] = fans
                yield item
            except IndexError:
                pass

        # Build the 'hot collections' URLs for pages 2 through 37, request each
        # one, and call back into parse() (range(2, 38) covers all 37 pages)
        urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(str(i))
                for i in range(2, 38)]
        for url in urls:
            yield Request(url, callback=self.parse)
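The two string cleanups in parse() (the replace() on the description and the strip() on the follower count) can be checked on their own; the raw values below are illustrative samples shaped like the scraped text, not actual page output:

```python
# Illustrative raw strings shaped like the scraped text (not real page data)
raw_content = "\n短篇小说、诗歌、散文爱好者的家园\n"
raw_fans = "· 2459.7k人关注"

content = raw_content.replace('\n', '')   # drop the surrounding newlines
fans = raw_fans.strip('· ')               # strip the leading "· " bullet characters

print(content)
print(fans)
```

Note that strip('· ') removes any mix of the characters '·' and ' ' from both ends, which is why the bullet and its trailing space disappear together.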
3. settings.py: this project uses Scrapy's built-in storage (Feed exports), so no pipelines.py is needed to process and store the data.

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'  # request header
DOWNLOAD_DELAY = 0.5  # wait 0.5 seconds between requests
FEED_URI = 'file:F:/zhuanti.csv'
FEED_FORMAT = 'csv'  # export the scraped items as a CSV file
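Once the feed exporter has written the file, the rows can be read back with the standard-library csv module; the sample row below is made up for illustration and stands in for real exported data:

```python
import csv
import io

# A snippet in the same shape the feed exporter writes (made-up sample row)
sample = io.StringIO(
    "name,content,article,fans\n"
    "短篇小说,短篇小说爱好者的家园,12.3k,2459.7k\n"
)

rows = list(csv.DictReader(sample))
print(rows[0]["name"], rows[0]["fans"])
```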
4. main.py

from scrapy import cmdline

cmdline.execute("scrapy crawl zhuanti".split())
Run main.py to start the crawl; the result can then be inspected by opening the CSV file in a plain-text editor such as Notepad.
To open the file in Excel, first re-encode it in Notepad++ ("Encode in UTF-8") and re-save it, so that the Chinese text displays correctly.
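As an alternative to re-saving by hand, the exported file can be rewritten with a UTF-8 byte-order mark (Python's utf-8-sig codec), which Excel uses to detect the encoding; the helper name and paths below are placeholders for illustration:

```python
def add_utf8_bom(src_path, dst_path):
    """Rewrite a UTF-8 CSV with a byte-order mark so Excel detects the encoding."""
    with open(src_path, encoding="utf-8") as src:
        data = src.read()
    with open(dst_path, "w", encoding="utf-8-sig") as dst:
        dst.write(data)
```

Calling it on the exported file, e.g. add_utf8_bom('F:/zhuanti.csv', 'F:/zhuanti_bom.csv') with a destination path of your choosing, produces a copy that Excel opens cleanly.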