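Analyzing the site structure
The target is blog.jobbole.com.
The site exposes the URLs of all its articles, so the crawl can start from that listing.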
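Create a virtual environment (with a specific Python interpreter)
mkvirtualenv --python=<path to the Python interpreter> <environment name>
Install Scrapy (using the Douban mirror)
Run workon <environment name> to enter the virtual environment, then:
pip install -i https://pypi.douban.com/simple/ scrapy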
Create a Scrapy project
scrapy startproject <project name>
Generate a spider file from the template
From inside the project directory:
scrapy genspider jobbole blog.jobbole.com
The generated spider class inherits from scrapy.Spider. start_urls is a list that can hold every URL you want to crawl.
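Scrapy iterates over start_urls and yields a Request for each URL to its downloader. Once a download finishes, Scrapy calls the parse function and passes it a response object.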
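Add a custom main file that invokes the Scrapy command line, so the spider can be debugged in PyCharm
import os
import sys
from scrapy.cmdline import execute
os.path.abspath(__file__) # absolute path of the current file
os.path.dirname(os.path.abspath(__file__)) # directory that contains the current file
sys.path.append(os.path.dirname(os.path.abspath(__file__))) # make the project importable
execute(["scrapy", "crawl", "jobbole"]) # start the jobbole spider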
Note: set ROBOTSTXT_OBEY to False in settings.py. Otherwise Scrapy skips any URL that the site's robots.txt disallows.
If you hit the error: No module named 'win32api'
This package is missing on Windows. Install it with pip:
pip install -i https://pypi.douban.com/simple/ pypiwin32
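Extracting values with XPath
XPath uses path expressions to navigate XML and HTML documents.
XPath syntax
1. article: selects all article elements that are children of the current node
2. /article: selects the root element article
3. article/a: selects a elements that are children of article
4. //div: selects all div elements, wherever they are in the document
5. article//div: selects all div elements below article, at any depth
6. //@class: selects all attributes named class
7. /article/div[1]: selects the first div child of article
8. /article/div[last()]: selects the last div child of article
9. //div[@lang='eng']: selects div elements whose lang attribute is eng
10. /div/*: selects all child elements of div
11. //*: selects all elements
12. //div[@*]: selects all div elements that have at least one attribute
response.xpath('//*[@id="post-110287"]/div[1]/h1/text()') # extract the title with XPath; note the parentheses in text()
response.xpath('//span[contains(@class, "vote-post-up")]') # find a span whose class contains vote-post-up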
Extracting values with CSS selectors
response.css('.entry-header h1::text').extract() # ::text selects the text inside the matched h1
Key code
from scrapy.http import Request
from urllib import parse # in Python 2 this module is called urlparse
Item
Configuring Scrapy's pipeline for automatic image downloads
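A possible settings.py fragment for this is sketched below. ImagesPipeline, IMAGES_URLS_FIELD and IMAGES_STORE are Scrapy's own names; the field name front_image_url, the images folder and the priority value 1 are assumptions for the example.

```python
import os

# Enable Scrapy's built-in image pipeline; lower numbers run earlier.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Item field that holds the image URLs (assumed name). Its value must be a list.
IMAGES_URLS_FIELD = "front_image_url"

# Store downloaded files in an "images" folder next to settings.py.
project_dir = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(project_dir, "images")
```

Because the pipeline iterates over the field, a spider that stores a single URL string there should wrap it in a list first.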
If you get a "No module named PIL" error, install Pillow:
pip install pillow
Saving the data to a JSON file
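One way to do this is a small item pipeline that writes one JSON object per line. This is a sketch, not the original code: the class name and the default file name article.json are assumptions, and the lines at the bottom only exercise the pipeline by hand with a plain dict.

```python
import json
import os
import tempfile


class JsonExportPipeline:
    def __init__(self, path="article.json"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to later pipelines

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Manual run outside Scrapy, writing to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "article.json")
pipeline = JsonExportPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "标题", "praise_nums": 2}, spider=None)
pipeline.close_spider(spider=None)
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

To use it in a crawl, the class would also need an entry in ITEM_PIPELINES in settings.py, with the dotted path of wherever it lives in your project.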
Inserting the data into a database
Install the MySQL driver: pip install mysqlclient
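A minimal synchronous pipeline built on that driver might look like the sketch below. The table name, column names, database name and credentials are all placeholders, and the code assumes a reachable MySQL server with a matching table already created.

```python
# Parameterized statement; the driver escapes the values safely.
INSERT_SQL = (
    "INSERT INTO jobbole_article (title, url, create_date, praise_nums) "
    "VALUES (%s, %s, %s, %s)"
)


def build_params(item):
    # Order must match the column list in INSERT_SQL.
    return (item["title"], item["url"], item["create_date"], item["praise_nums"])


class MysqlPipeline:
    def open_spider(self, spider):
        # MySQLdb is the module installed by mysqlclient; imported here so the
        # file can still be loaded on machines without the driver.
        import MySQLdb
        self.conn = MySQLdb.connect(
            host="127.0.0.1", user="root", passwd="change-me",  # placeholders
            db="article_spider", charset="utf8", use_unicode=True,
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(INSERT_SQL, build_params(item))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


sample = {"title": "Sample title", "url": "http://blog.jobbole.com/110287/",
          "create_date": "2017-01-01", "praise_nums": 2}
print(build_params(sample))
```

Committing after every item is simple but blocks the crawl while MySQL responds, so this form suits small crawls; a larger one would want batching or an asynchronous connection pool.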