Creating the project
For the web crawler, the first step is to create a Scrapy project from the command line; here we name it cast:
scrapy startproject cast
cd cast
scrapy genspider ast itcast.cn
The concrete steps are shown in the figure below:

Edit the generated items.py file:
import scrapy

class CastItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    position = scrapy.Field()
    detail = scrapy.Field()
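A Scrapy Item behaves like a dictionary whose keys are restricted to the declared fields; assigning an undeclared key raises KeyError. That dict-like behavior can be sketched without Scrapy itself (a minimal stand-in for illustration, not the real scrapy.Item implementation):

```python
class FieldRestrictedItem(dict):
    """Minimal stand-in for scrapy.Item: only declared fields are accepted."""
    fields = ('name', 'position', 'detail')

    def __setitem__(self, key, value):
        if key not in self.fields:
            # scrapy.Item raises KeyError for undeclared fields too
            raise KeyError('%s does not support field: %s'
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)

item = FieldRestrictedItem()
item['name'] = 'Alice'     # declared field: accepted
try:
    item['age'] = 30       # undeclared field: rejected
except KeyError as e:
    print(e)
```

This is why the spider below can only fill in name, position, and detail on a CastItem.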
Modify the ast spider file:
# -*- coding: utf-8 -*-
import scrapy
from cast.items import CastItem

class AstSpider(scrapy.Spider):
    name = 'ast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        node_list = response.xpath('//div[@class="li_txt"]')
        for node in node_list:
            item = CastItem()
            # query relative to the current node ('./'), not the whole
            # response; otherwise every item would get the first match
            name = node.xpath('./h3/text()').extract()
            position = node.xpath('./h4/text()').extract()
            detail = node.xpath('./p/text()').extract()
            item['name'] = name[0].encode('utf-8')
            item['position'] = position[0].encode('utf-8')
            item['detail'] = detail[0].encode('utf-8')
            yield item
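The key point in parse() is that the queries inside the loop must be relative to the current node. The difference can be demonstrated with the standard library's ElementTree, which supports a subset of XPath (the sample markup below is made up for illustration):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<root>'
    '<div class="li_txt"><h3>Alice</h3><h4>Lecturer</h4></div>'
    '<div class="li_txt"><h3>Bob</h3><h4>Assistant</h4></div>'
    '</root>'
)

names_relative = []
for node in doc.findall(".//div[@class='li_txt']"):
    # './h3' searches only inside this node -> one name per teacher
    names_relative.append(node.find('./h3').text)

# a whole-document query ignores the loop variable entirely
first_name_always = doc.find('.//h3').text

print(names_relative)      # one entry per div
print(first_name_always)   # always the first h3 in the document
```

Using response.xpath('//h3/text()') inside the loop corresponds to the second form: each iteration would see the entire page and pick up the first teacher's data every time.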
Modify the pipeline file:
import json

class CastPipeline(object):
    def __init__(self):
        self.f = open("1.json", "w")

    def process_item(self, item, spider):
        # dump the item dict directly; wrapping it in str() first would
        # write a Python repr string instead of valid JSON
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
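Note that Scrapy will not call this pipeline unless it is registered in the project's settings.py; the number is a priority from 0 to 1000 that orders pipelines when several are enabled (300 here is an arbitrary middle value):

```python
# settings.py
ITEM_PIPELINES = {
    'cast.pipelines.CastPipeline': 300,
}
```

After running scrapy crawl ast, the file 1.json will contain one JSON object per line, each followed by a comma.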