Today I wrote a crawler that scrapes content from a well-known Chinese football news site (dongqiudi.com).
First, create the project:
scrapy startproject dongqiudi
Then generate a spider:
scrapy genspider DQD "dongqiudi.com"
This produces the following directory layout:
(Screenshot: project directory structure)
I won't analyze the page structure in detail here; see my previous post on debugging with Chrome's Network panel.
Straight to the code. First, spider.py:
# -*- coding: utf-8 -*-
import json

import scrapy


class DqdSpider(scrapy.Spider):
    name = "DQD"
    allowed_domains = ["dongqiudi.com"]
    start_urls = ['http://dongqiudi.com/archives/1?page=1']

    def parse(self, response):
        # The archive endpoint returns JSON, not HTML
        text = json.loads(response.text)
        for data in text['data']:
            yield data
        # Fetch the first 50 pages for now; Scrapy's default duplicate
        # filter drops the requests re-scheduled on later pages.
        for i in range(2, 51):
            new_url = "http://dongqiudi.com/archives/1?page={}".format(i)
            yield scrapy.Request(url=new_url, callback=self.parse)
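The parse logic above is plain JSON handling, so it can be sketched and tested without Scrapy at all. The sample payload below is hypothetical, shaped like the `{"data": [...]}` response the spider expects:

```python
import json

def parse_archive_page(body):
    # Mirrors the spider's parse(): the endpoint returns a JSON object
    # whose "data" key holds a list of news entries.
    return json.loads(body)['data']

# Hypothetical payload with the same top-level shape as the real response
sample = '{"data": [{"id": 1, "title": "Match report"}, {"id": 2, "title": "Transfer news"}]}'
entries = parse_archive_page(sample)
print(len(entries))  # 2
```

Keeping the JSON handling in a small function like this also makes it easy to swap in a saved response body while developing.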
Next, items.py:
import scrapy


class DongqiudiItem(scrapy.Item):
    # define the fields for your item here:
    id = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    user_id = scrapy.Field()
    type = scrapy.Field()
    display_time = scrapy.Field()
    thumb = scrapy.Field()
    comments_total = scrapy.Field()
    web_url = scrapy.Field()
    official_account = scrapy.Field()
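Note that the spider yields the raw dicts from the API directly, so these Field definitions are never actually populated. To route data through the item class, the raw payload has to be filtered down to the declared fields first, since populating a scrapy.Item with an undeclared key raises a KeyError. A minimal sketch (the raw payload here is hypothetical):

```python
# A subset of the Field names declared in DongqiudiItem
WANTED = {"id", "title", "user_id", "display_time", "web_url"}

def to_item_dict(raw):
    # Drop any keys the API returns that the item class does not declare
    return {k: v for k, v in raw.items() if k in WANTED}

raw = {"id": 42, "title": "Derby preview", "unexpected_key": "dropped"}
print(to_item_dict(raw))  # {'id': 42, 'title': 'Derby preview'}
```

The filtered dict can then be passed straight to `DongqiudiItem(**filtered)` if you want Scrapy's field validation.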
Now pipelines.py, which stores every item as a line of JSON:
import json


class DongqiudiPipeline(object):
    def process_item(self, item, spider):
        with open("DQD.json", "a", encoding="utf-8") as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to any later pipelines
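Because each item lands on its own line, DQD.json is JSON Lines rather than one big JSON document, so reading it back means parsing line by line. A small sketch that round-trips two records through a temporary file:

```python
import json
import os
import tempfile

def load_items(path):
    # One JSON object per line (JSON Lines), as the pipeline writes it
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip demo with a temporary file standing in for DQD.json
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as f:
    f.write(json.dumps({"id": 1, "title": "标题"}, ensure_ascii=False) + "\n")
    f.write(json.dumps({"id": 2, "title": "另一条"}, ensure_ascii=False) + "\n")
    tmp = f.name

items = load_items(tmp)
os.remove(tmp)
print(len(items))  # 2
```

`ensure_ascii=False` keeps the Chinese titles readable in the file, which is why the pipeline opens it with an explicit utf-8 encoding.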
Finally, the relevant parts of settings.py:
BOT_NAME = 'dongqiudi'

SPIDER_MODULES = ['dongqiudi.spiders']
NEWSPIDER_MODULE = 'dongqiudi.spiders'

ROBOTSTXT_OBEY = False  # ignore robots.txt

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/71.0.3578.98 Safari/537.36'),
}

ITEM_PIPELINES = {
    'dongqiudi.pipelines.DongqiudiPipeline': 300,
}
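With robots.txt ignored and 50 pages requested back to back, it may be worth throttling the crawl a little. These are standard Scrapy settings that could be added here (the values are arbitrary, not something the original post uses):

```python
# Optional politeness settings (hypothetical values)
DOWNLOAD_DELAY = 1                  # wait 1 s between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap parallel requests per domain
```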
That's all the code needed. Run the spider with `scrapy crawl DQD` and check the results.
(Screenshot: crawl results)