Target site: http://www.umei.cc
Spider code:
import scrapy

from douban_movie.items import DoubanMovieItem


class MovieSpider(scrapy.Spider):
    # Spider name
    name = 'meinv'
    # Start URLs
    start_urls = [
        'http://www.umei.cc/',
    ]

    def parse(self, response):
        # Horizontal crawl: collect the detail-page links from the list page
        urls = response.xpath("//div[@class='PicListTxt']/ul/li/a/@href").extract()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)

    # Process the photos on each detail page
    def parse_item(self, response):
        next_pic = response.xpath("//div[@class='NewPages']/ul/li/a[contains(text(),'下一页')]/@href").extract_first()
        # extract_first() returns None when the page has no "下一页" (next page) link
        if next_pic and next_pic != '#':
            url = 'http://www.umei.cc/p/gaoqing/cn/' + next_pic
            yield scrapy.Request(url, callback=self.parse_item)
        item = DoubanMovieItem()
        item['name'] = response.xpath("//div[@class='ArticleTitle']/strong/text()").extract_first()
        item['imgurl'] = response.xpath("//div[@class='ImageBody']//p/a/img/@src").extract_first()
        yield item
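The spider builds the next-page URL by prepending a fixed prefix to the extracted href. As a minimal sketch of an alternative, the helper below resolves the href against the URL of the current page using only the standard library. The function name `next_page_url` and the page names in the example are hypothetical and are not taken from the site.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Return the absolute URL of the next page, or None if there is none."""
    # A missing link (None) or a '#' placeholder means this is the last page.
    if not href or href == '#':
        return None
    # urljoin resolves a relative href against the URL of the current page.
    return urljoin(page_url, href)


# Hypothetical page names, for illustration only.
current = 'http://www.umei.cc/p/gaoqing/cn/1234.htm'
print(next_page_url(current, '1234_2.htm'))
print(next_page_url(current, '#'))
```

Inside a Scrapy callback, `response.urljoin(href)` performs the same resolution against the response URL, so the prefix would not need to be hardcoded.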
Pipeline code:
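import os
import urllib.request


class DoubanMoviePipeline(object):
    def process_item(self, item, spider):
        # Make sure the target directory exists before writing
        os.makedirs("download", exist_ok=True)
        # Download the image and save it under the page title
        conn = urllib.request.urlopen(item['imgurl'])
        with open("download/" + item['name'] + '.jpeg', 'wb') as file:
            file.write(conn.read())
        # Return the item so that any later pipeline can process it
        return item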
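The pipeline names each file after the page title alone. If a title contains characters such as `/`, or if several pages of one gallery share the same title (an assumption about the site, not something verified here), the write can fail or overwrite an earlier image. A minimal sketch of one way to build a safer path follows; the helper name `image_path` and the example URL are hypothetical.

```python
import os
import re
from urllib.parse import urlparse


def image_path(name, imgurl, directory='download'):
    """Build a file path from the item title and the image URL."""
    # Replace characters that are not allowed in file names on common systems.
    safe_name = re.sub(r'[\\/:*?"<>|\s]+', '_', name or '').strip('_') or 'untitled'
    # Append the image's own base name so pages with the same title stay apart.
    base = os.path.basename(urlparse(imgurl).path)
    stem, ext = os.path.splitext(base)
    return os.path.join(directory, '{}_{}{}'.format(safe_name, stem, ext or '.jpeg'))


# Hypothetical title and URL, for illustration only.
print(image_path('a/b c', 'http://example.com/img/001.jpg'))
```

In `process_item`, the result could replace the hand-built `"download/" + item['name'] + '.jpeg'` string.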
Settings file:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'douban_movie.pipelines.DoubanMoviePipeline': 300,
}
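The settings above only set the request headers and enable the pipeline. As a hedged sketch, a few more standard Scrapy settings can slow the crawl down so that it puts less load on the site. The values below are assumptions chosen for illustration and are not part of the original project.

```python
# Optional throttling settings for settings.py (illustrative values only).

# Wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Limit how many requests run in parallel against one domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True
```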
Run output