The articles in this collection keep piling up, so I want to skim the abstracts on the list page to get a rough idea of what each article is about and save the article links, so that I can read with a purpose and work more efficiently.
This exercise uses Scrapy to crawl the collection and stores the scraped data in MongoDB.
1. Decide what to crawl:
Article title + abstract + author + publish time + word count + (view count + comment count + like count). PS: the last three could not be scraped with either re or XPath; the extracted values are always empty, and I haven't found the cause yet...


items.py:
import scrapy


class CrazydataanalyzeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    article_url = scrapy.Field()
    title = scrapy.Field()
    abstract = scrapy.Field()
    nickname = scrapy.Field()
    publish_time = scrapy.Field()
    wordage = scrapy.Field()
    views_count = scrapy.Field()
    comments_count = scrapy.Field()
    likes_count = scrapy.Field()
2. Write the spider
Pull title/abstract/nickname from the list page and build each article's URL as base_url + href.

Then crawl the publish date and word count from each article page.
PS: wordage has exactly the same structure as views-count/comments-count/likes-count, so why can the word count be scraped while the view/comment/like counts always come back empty??? (See the debugging sketch below.)
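One way to narrow this down is to check the raw HTML that Scrapy actually downloads, rather than what the browser shows after its scripts have run. A minimal scrapy shell session (the article URL is a placeholder; any article from the collection will do):

# Start an interactive shell against one article page:
#   scrapy shell "https://www.jianshu.com/p/<article-id>"
# Then inspect the meta block in the downloaded HTML:
response.xpath('//div[@class="meta"]').extract_first()          # raw HTML of the block
response.xpath('//div[@class="meta"]/span/text()').extract()    # text of every <span>

If the views/comments/likes spans turn out to be empty in the downloaded HTML, the numbers are most likely filled in by JavaScript in the browser, which neither Scrapy nor requests executes; in that case they would have to come from the site's API or a headless browser instead.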

CrazyData.py
from scrapy.spiders import CrawlSpider  # note: "spiders", with an s
from scrapy.selector import Selector
from scrapy.http import Request
from CrazyDataAnalyze.items import CrazydataanalyzeItem


class CrazyData(CrawlSpider):
    name = 'CrazyData'
    start_urls = ['https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page=1']

    def parse(self, response):
        base_url = 'https://www.jianshu.com'
        selector = Selector(response)
        infos = selector.xpath('//ul[@class="note-list"]/li')
        for info in infos:
            # Build the article URL and pull the list-page fields
            article_url = base_url + info.xpath('div/a/@href').extract()[0]
            title = info.xpath('div/a/text()').extract()[0]
            abstract = info.xpath('div/p/text()').extract()[0]
            nickname = info.xpath('div/div/a/text()').extract()[0]
            # Carry the list-page fields along to the article-page callback
            yield Request(article_url,
                          meta={'article_url': article_url, 'title': title,
                                'abstract': abstract, 'nickname': nickname},
                          callback=self.parse_item)
        # Queue the remaining list pages
        urls = ['https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page={}'.format(i)
                for i in range(2, 20)]
        for url in urls:
            yield Request(url, callback=self.parse)

    def parse_item(self, response):
        item = CrazydataanalyzeItem()
        item['article_url'] = response.meta['article_url']
        item['title'] = response.meta['title']
        item['abstract'] = response.meta['abstract']
        item['nickname'] = response.meta['nickname']
        try:
            selector = Selector(response)
            publish_time = selector.xpath('//div[@class="meta"]/span[1]/text()').extract()[0]
            wordage = selector.xpath('//div[@class="meta"]/span[2]/text()').extract()[0]
            # The three counts below come back empty; re over requests.get()
            # gave the same result (see the PS above).
            views_count = selector.xpath('//div[@class="meta"]/span[3]/text()').extract()
            comments_count = selector.xpath('//div[@class="meta"]/span[4]/text()').extract()
            likes_count = selector.xpath('//div[@class="meta"]/span[5]/text()').extract()
            item['publish_time'] = publish_time
            item['wordage'] = wordage
            item['views_count'] = views_count
            item['comments_count'] = comments_count
            item['likes_count'] = likes_count
            yield item
        except IndexError:
            pass
3. Store in MongoDB (just to practice data storage)
pipelines.py
import pymongo


class CrazydataanalyzePipeline(object):
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        test = client['test']
        self.post = test['CrazyData']

    def process_item(self, item, spider):
        info = dict(item)
        # insert_one() replaces the deprecated Collection.insert()
        self.post.insert_one(info)
        return item
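Once the pipeline has run, the stored documents can be read back with pymongo to skim the abstracts directly. A minimal sketch, assuming the same local test database and CrazyData collection as above:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
collection = client['test']['CrazyData']

# Print title, abstract and link for each stored article
for doc in collection.find({}, {'title': 1, 'abstract': 1, 'article_url': 1, '_id': 0}):
    print(doc['title'].strip())
    print(doc['abstract'].strip())
    print(doc['article_url'])
    print('-' * 40)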
4. Settings
settings.py
BOT_NAME = 'CrazyDataAnalyze'
SPIDER_MODULES = ['CrazyDataAnalyze.spiders']
NEWSPIDER_MODULE = 'CrazyDataAnalyze.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'CrazyDataAnalyze.pipelines.CrazydataanalyzePipeline': 300,
}
5. Main file
main.py
from scrapy import cmdline
cmdline.execute("scrapy crawl CrazyData".split())
6. MongoDB
The data can be exported to a CSV file from the command line:
mongoexport -d test -c CrazyData --type=csv -f article_url,title,abstract,nickname,publish_time,wordage -o crazydata.csv
Then Excel's =HYPERLINK() function can turn the URLs directly into clickable links.
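The exported CSV can also be filtered in Python instead of Excel. A minimal sketch with pandas, assuming crazydata.csv was produced by the mongoexport command above (the keyword is just an example):

import pandas as pd

# Load the CSV exported by mongoexport
df = pd.read_csv('crazydata.csv')

# Keep only articles whose abstract mentions a keyword of interest
hits = df[df['abstract'].str.contains('MongoDB', na=False)]
print(hits[['title', 'article_url']])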

Now I can get a rough idea of each article just from its abstract. I also hope everyone adopts this get-straight-to-the-point approach more often. Keep it up! New Year's Day is almost here!