Project setup workflow:
1. cmd: cd PyCharmProject  (the directory that will contain the project)
2. cmd: scrapy startproject movie
3. cmd: cd movie
4. cmd: scrapy genspider meiju meijutt.com
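After step 4 the commands above have generated Scrapy's standard project skeleton, which should look roughly like this (standard Scrapy template; middlewares.py appears in recent Scrapy versions):

```
movie/                      # project root created by startproject
├── scrapy.cfg              # deploy/config entry point
└── movie/                  # the project's Python package
    ├── __init__.py
    ├── items.py            # item definitions (step 5 below)
    ├── middlewares.py      # spider / downloader middlewares
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/
        ├── __init__.py
        └── meiju.py        # spider created by genspider
```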
5. Open the project in the IDE (PyCharm):
items.py -- defines the storage template used to structure the scraped data

    import scrapy

    class MovieItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()
meiju.py -- contains the actual spider code

    import scrapy
    from movie.items import MovieItem

    class MeijuSpider(scrapy.Spider):
        name = 'meiju'
        allowed_domains = ['meijutt.com']
        start_urls = ['http://www.meijutt.com/new100.html']

        # Overriding start_requests() is equivalent to setting start_urls:
        # def start_requests(self):
        #     urls = ['http://www.meijutt.com/new100.html']
        #     for url in urls:
        #         yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            movies = response.xpath('//ul[@class="top-list fn-clear"]/li')
            for each_movie in movies:
                item = MovieItem()
                # extract_first() returns None instead of raising an
                # IndexError when the node is missing (safer than extract()[0])
                item['name'] = each_movie.xpath('./h5/a/@title').extract_first()
                yield item
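The XPath logic in parse() can be tried out without running the crawler. The sketch below mimics the same selection on a hand-written HTML fragment; the fragment and show titles are invented, and it uses the stdlib xml.etree.ElementTree (Scrapy's own selectors are lxml-based), whose limited XPath subset is enough here:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mirroring the structure the spider targets
# (class name and element path taken from the XPath above; titles invented).
html = """
<div>
  <ul class="top-list fn-clear">
    <li><h5><a title="Show A">Show A</a></h5></li>
    <li><h5><a title="Show B">Show B</a></h5></li>
  </ul>
</div>
"""

root = ET.fromstring(html)
# Same path as the spider: each <li> under the target <ul> ...
movies = root.findall(".//ul[@class='top-list fn-clear']/li")
# ... then the title attribute of the <a> inside its <h5>
names = [li.find('./h5/a').get('title') for li in movies]
print(names)  # ['Show A', 'Show B']
```

If the real page's markup changes, this kind of offline check is the quickest way to see why the spider suddenly yields empty items.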
pipelines.py -- defines how items are stored: a file, a database, or elsewhere

    class MoviePipeline(object):
        def process_item(self, item, spider):
            # open with an explicit encoding so Chinese titles are written correctly
            with open('my_meiju.txt', 'a', encoding='utf-8') as fp:
                fp.write(item['name'])
                fp.write('\n------------\n')
            # a pipeline should return the item so later pipelines can see it
            return item
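Because process_item() takes plain arguments, the pipeline can be exercised outside Scrapy. A minimal sketch, assuming plain dicts stand in for MovieItem instances (this pipeline never touches the spider argument, so None is fine; the temp-directory dance just keeps my_meiju.txt out of the real project):

```python
import os
import tempfile

class MoviePipeline(object):
    """Same logic as the pipelines.py shown above."""
    def process_item(self, item, spider):
        with open('my_meiju.txt', 'a', encoding='utf-8') as fp:
            fp.write(item['name'])
            fp.write('\n------------\n')
        return item

old_cwd = os.getcwd()
tmp = tempfile.mkdtemp()
os.chdir(tmp)  # write my_meiju.txt into a scratch directory
try:
    pipe = MoviePipeline()
    for title in ['Show A', 'Show B']:
        pipe.process_item({'name': title}, None)
    with open('my_meiju.txt', encoding='utf-8') as fp:
        content = fp.read()
    print(content)
finally:
    os.chdir(old_cwd)
```

Opening the file once per item is simple but slow for large crawls; Scrapy's open_spider/close_spider hooks are the usual place to open the file once and keep it open.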
settings.py -- configuration file: user agent, download delay, etc.; enable the pipeline here:

    ITEM_PIPELINES = {'movie.pipelines.MoviePipeline': 100}
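The number after the pipeline path is its order value: when several pipelines are enabled, Scrapy runs them from lowest value to highest (conventionally in the 0-1000 range). A sketch with a hypothetical second pipeline, DedupPipeline, used only to illustrate the ordering:

```python
# settings.py -- DedupPipeline is a made-up name for illustration
ITEM_PIPELINES = {
    'movie.pipelines.DedupPipeline': 100,  # lower number: runs first
    'movie.pipelines.MoviePipeline': 300,  # higher number: runs later
}
```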
6. cmd: cd movie
7. cmd: scrapy crawl meiju --nolog (suppress log output), or scrapy crawl meiju