Scraping target
The photos, names, titles, departments, email addresses, detail-page links, and bios of all full-time teachers at the School of Public Administration, Sichuan University.
1. Create the project
Create a project named teachers:
scrapy startproject teachers
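startproject generates a skeleton project. The layout looks like this (newer Scrapy versions also add a middlewares.py):
teachers/
    scrapy.cfg            # deploy configuration
    teachers/             # the project's Python module
        __init__.py
        items.py          # Item definitions (step 2)
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings (step 5)
        spiders/          # spiders live here (step 3)
            __init__.py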
2. Define the Item
In items.py, define the data we want to scrape:
import scrapy

class TeachersItem(scrapy.Item):
    name = scrapy.Field()        # teacher's name
    position = scrapy.Field()    # title/rank
    department = scrapy.Field()  # department
    email = scrapy.Field()       # email address
    link = scrapy.Field()        # link to the detail page
    desc = scrapy.Field()        # bio from the detail page
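An Item behaves like a dict with a fixed set of keys: assigning to a field that was not declared raises a KeyError. A quick sanity check (the value is made up, just to show the dict-style access):
item = TeachersItem()
item['name'] = 'Zhang San'  # hypothetical value
print(item['name'])         # Zhang San
print(dict(item))           # {'name': 'Zhang San'}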
3. Write the Spider class
A Spider has three members that must be defined:
name: the identifier for this spider
start_urls: a list of URLs the spider starts crawling from
parse(): the method called with the downloaded response for each start URL; it parses the page and yields items, and/or Requests for further pages to crawl
So create a new spider in the spiders directory, teacher_spider.py:
import scrapy
from teachers.items import TeachersItem

class TeachersSpider(scrapy.Spider):
    name = "tspider"
    allowed_domains = ["ggglxy.scu.edu.cn"]
    start_urls = ['http://ggglxy.scu.edu.cn/index.php?c=article&a=type&tid=18']

    def parse(self, response):
        # One <li> per teacher on the list page.
        for teacher in response.xpath("//ul[@class='teachers_ul mt20 cf']/li"):
            item = TeachersItem()
            item['name'] = teacher.xpath("div[@class='r fr']/h3/text()").extract_first()
            item['position'] = teacher.xpath("div[@class='r fr']/p/text()").extract_first()
            item['department'] = teacher.xpath("div[@class='r fr']/div[@class='desc']/p[1]/text()").extract_first()
            item['email'] = teacher.xpath("div[@class='r fr']/div[@class='desc']/p[2]/text()").extract_first()
            # The left column holds the photo wrapped in a link to the
            # teacher's detail page; the 'l fl' class is assumed from the
            # page layout, adjust if the markup differs.
            href = teacher.xpath("div[@class='l fl']/a/@href").extract_first()
            if href:
                # Carry the half-filled item along in the request meta.
                yield scrapy.Request(response.urljoin(href),
                                     meta={'item': item},
                                     callback=self.parse_detail)
            else:
                yield item
        # The second-to-last <li> of the pager is the "next page" link;
        # follow it until it no longer exists (i.e. on the last page).
        next_page = response.xpath("//div[@class='pager cf tc pt10 pb10 mobile_dn']/li[last()-1]/a/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_detail(self, response):
        # Fill in the remaining fields from the detail page.
        item = response.meta['item']
        item['link'] = response.url
        item['desc'] = response.xpath("//div[@class='desc']/text()").extract_first()
        yield item
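If one of the XPath expressions comes back empty, it is easiest to test selectors interactively with scrapy shell against the list page before editing the spider:
scrapy shell "http://ggglxy.scu.edu.cn/index.php?c=article&a=type&tid=18"
# then, at the shell prompt:
response.xpath("//ul[@class='teachers_ul mt20 cf']/li//h3/text()").extract_first()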
4. Save the scraped items to a file as JSON
Items returned by the spider are serialized to JSON, one object per line. Put the following pipeline in pipelines.py:
import json
import codecs

class TeachersPipeline(object):
    def __init__(self):
        # One JSON object per line ("JSON lines" format).
        self.file = codecs.open('items.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
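Since each line of items.json is a standalone JSON object, the result can be read back with a few lines of Python (file name as used in the pipeline above):
import json

with open('items.json', encoding='utf-8') as f:
    teachers = [json.loads(line) for line in f]

print(len(teachers))
print(teachers[0]['name'])  # assumes at least one item was scraped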
5. Activate the Item Pipeline component
In settings.py, add the pipeline class to ITEM_PIPELINES to activate it. The integer (0-1000) controls the order in which pipelines run; lower values run first:
ITEM_PIPELINES = {
    'teachers.pipelines.TeachersPipeline': 300,
}
6. Start the crawl
scrapy crawl tspider
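Alternatively, for a quick dump without writing any pipeline at all, Scrapy's built-in feed export can write the items straight to a file with the -o flag:
scrapy crawl tspider -o teachers.json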