Install the scrapy package:
pip install scrapy
The installation may fail... On Python 3 (Windows) you may need to download the Twisted dependency manually.
Download it from: https://pypi.org/simple/twisted/
or from: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
After downloading, place the wheel somewhere convenient, e.g. on the desktop as Twisted-19.2.1-cp37-cp37m-win_amd64.whl, and install it:
pip install C:\Users\Administrator\Desktop\Twisted-19.2.1-cp37-cp37m-win_amd64.whl
Then run pip install scrapy again; output like the following means all dependencies are installed:
D:\>pip install scrapy
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Requirement already satisfied: scrapy in d:\python\lib\site-packages (1.6.0)
Requirement already satisfied: Twisted>=13.1.0 in d:\python\lib\site-packages (from scrapy) (19.2.1)
Requirement already satisfied: parsel>=1.5 in d:\python\lib\site-packages (from scrapy) (1.5.1)
Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python\lib\site-packages (from scrapy) (2.0.5)
Requirement already satisfied: w3lib>=1.17.0 in d:\python\lib\site-packages (from scrapy) (1.20.0)
Requirement already satisfied: queuelib in d:\python\lib\site-packages (from scrapy) (1.5.0)
Requirement already satisfied: cssselect>=0.9 in d:\python\lib\site-packages (from scrapy) (1.0.3)
Requirement already satisfied: pyOpenSSL in d:\python\lib\site-packages (from scrapy) (19.0.0)
Requirement already satisfied: lxml in d:\python\lib\site-packages (from scrapy) (4.3.4)
Requirement already satisfied: service-identity in d:\python\lib\site-packages (from scrapy) (18.1.0)
Requirement already satisfied: six>=1.5.2 in d:\python\lib\site-packages (from scrapy) (1.12.0)
Requirement already satisfied: hyperlink>=17.1.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (19.0.0)
Requirement already satisfied: zope.interface>=4.4.2 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (4.6.0)
Requirement already satisfied: attrs>=17.4.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (19.1.0)
Requirement already satisfied: PyHamcrest>=1.9.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (1.9.0)
Requirement already satisfied: constantly>=15.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (15.1.0)
Requirement already satisfied: incremental>=16.10.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (17.5.0)
Requirement already satisfied: Automat>=0.3.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (0.7.0)
Requirement already satisfied: cryptography>=2.3 in d:\python\lib\site-packages (from pyOpenSSL->scrapy) (2.7)
Requirement already satisfied: pyasn1-modules in d:\python\lib\site-packages (from service-identity->scrapy) (0.2.5)
Requirement already satisfied: pyasn1 in d:\python\lib\site-packages (from service-identity->scrapy) (0.4.5)
Requirement already satisfied: idna>=2.5 in d:\python\lib\site-packages (from hyperlink>=17.1.1->Twisted>=13.1.0->scrapy) (2.8)
Requirement already satisfied: setuptools in d:\python\lib\site-packages (from zope.interface>=4.4.2->Twisted>=13.1.0->scrapy) (40.8.0)
Requirement already satisfied: asn1crypto>=0.21.0 in d:\python\lib\site-packages (from cryptography>=2.3->pyOpenSSL->scrapy) (0.24.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in d:\python\lib\site-packages (from cryptography>=2.3->pyOpenSSL->scrapy) (1.12.3)
Requirement already satisfied: pycparser in d:\python\lib\site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.3->pyOpenSSL->scrapy) (2.19)
If you see output like this, the Scrapy installation is complete.
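As a quick optional check, you can also confirm the installation from the command line; scrapy version prints the installed version (1.6.0 here, matching the pip output above):
D:\>scrapy version
Scrapy 1.6.0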
A Scrapy project can be created in any directory on any drive.
First switch to a directory, e.g. the D: drive, and run scrapy startproject Tencent.
The project configuration files are in the inner Tencent directory, i.e. D:\Tencent\Tencent.
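For reference, a project created with scrapy startproject Tencent follows the standard Scrapy template layout, roughly:

Tencent/                  # project root
    scrapy.cfg            # deploy/config file
    Tencent/              # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py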
This directory contains the main files of the Scrapy framework:
- items.py: defines the items to crawl, i.e. the target fields such as job title and work location (you write this yourself); a minimal sketch follows.
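A minimal sketch of items.py (assumed; the field names are taken from the spider code shown later, and the class name TencentItem matches the import in tencent.py):

# -*- coding: utf-8 -*-
import scrapy


class TencentItem(scrapy.Item):
    positionName = scrapy.Field()   # job title
    positionLink = scrapy.Field()   # link to the job detail page
    positionType = scrapy.Field()   # responsibilities / job description
    worklocation = scrapy.Field()   # work location
    publishTime = scrapy.Field()    # last update / publish time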
- middlewares.py: spider and downloader middleware; rarely used. The generated defaults are already fine and normally do not need changes (no configuration needed).
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class TencentSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class TencentDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
- pipelines.py: the pipeline file; modify it to control how and in what format items are stored (you configure this yourself).
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentPipeline(object):
    def __init__(self):
        # Open the output file once when the pipeline is created
        self.f = open("tencent.csv", "w", encoding='utf8')

    def process_item(self, item, spider):
        # Serialize each item as a JSON line and append it to the file
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.f.close()
- settings.py: turn the settings you need on or off; most are off by default. For example, the item pipeline is disabled by default and must be enabled, as shown below.
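The relevant part of settings.py after enabling the pipeline looks like this (300 is the default priority value from the generated template; the pipeline path matches the one shown in the crawl log later):

ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}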
Create the spider file: scrapy genspider tencent "tencent.com"
The spider file we need to edit is tencent.py inside the spiders directory.
tencent.py:
# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem
import json


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    baseurl = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # allowed_domains = ['tencent.com']
    offset = 1
    # url='https://careers.tencent.com/tencentcareer/api/post/Query?'
    start_urls = [baseurl.format(offset)]

    def parse(self, response):
        job_items = json.loads(response.body.decode())['Data']['Posts']
        for job_item in job_items:
            item = TencentItem()
            item['positionName'] = job_item["RecruitPostName"]
            item['positionLink'] = job_item["PostURL"] + job_item["PostId"]
            item['positionType'] = job_item["Responsibility"]
            item['worklocation'] = job_item["LocationName"]
            item['publishTime'] = job_item["LastUpdateTime"]
            yield item

        if self.offset < 430:
            self.offset += 1
            url = self.baseurl.format(self.offset)
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )
Run the spider: scrapy crawl tencent
D:\Tencent\Tencent\spiders>scrapy crawl tencent
2019-07-05 13:56:28 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Tencent)
2019-07-05 13:56:28 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2019-07-05 13:56:28 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Tencent', 'NEWSPIDER_MODULE': 'Tencent.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Tencent.spiders']}
2019-07-05 13:56:28 [scrapy.extensions.telnet] INFO: Telnet Password: 1f4cb6e4d1fc4caa
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled item pipelines:
['Tencent.pipelines.TencentPipeline']
2019-07-05 13:56:28 [scrapy.core.engine] INFO: Spider opened
2019-07-05 13:56:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-05 13:56:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-05 13:56:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://careers.tencent.com/404.html> from <GET https://careers.tencent.com/robots.txt>
2019-07-05 13:56:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://careers.tencent.com/404.html> (referer: None)
2019-07-05 13:56:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn> (referer: None)
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=01147013579229106176',
'positionName': '22989-Serverless前端架构师',
'positionType': '负责腾讯 Serverless 平台战略目标规划、整体平台产品能力设计;\n'
'负责探索前端技术与 Serverless 的结合落地,包括不限于腾讯大前端架构建设,公共组件的设计, '
'Serverless 的前端应用场景落地;\n'
'负责分析 Serverless 客户复杂应用场景的具体实现(小程序,Node.js);\n'
'负责 Serverless 场景中 Node.js 以及微信小程序相关生态建设。',
'publishTime': '2019年07月05日',
'worklocation': '深圳'}
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=01147013576054018048',
'positionName': '22989-语音通信研发工程师(深圳)',
'positionType': '负责腾讯云通信号码保护、企业总机、呼叫中心、融合通信产品开发;\n'
'负责融合通信PaaS平台的构建和优化;\n'
'负责通话质量分析和调优;',
'publishTime': '2019年07月05日',
'worklocation': '深圳'}
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=11231766955960606721123176695596060672',
'positionName': '18435-合规反洗钱岗',
'positionType': '1、根据反洗钱法律法规及监管规定的要求,完善落实反洗钱工作,指导各业务部门、分支机构开展反洗钱工作,支 持反洗钱监管沟通及监管报告反馈工作;\n'
'2、制定与完善内部反洗钱配套制度与流程,推动公司反洗钱标准化及流程化建设;\n'
'3、熟悉监管部门各项反洗钱政策制度要求,能就日常产品业务及合同及时进行反洗钱合规评审;\n'
'4、开展对各业务部门、分支机构的反洗钱合规自查工作,跟进缺陷问题;\n'
'5、根据反洗钱法律法规及监管规定的更新情况,及时对各业务部门进行法规解读,并追踪落实;\n'
'6、重点项目的跟进及推动工作;\n'
'7、领导交办的其他工作。',
'publishTime': '2019年07月05日',
'worklocation': '深圳总部'}
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=11231779032200683521123177903220068352',
'positionName': '25927-游戏测试项目经理',
'positionType': '负责项目计划和迭代计划的制定、跟进和总结回顾,推动产品需求、运营需求和技术需求的落地执行,排除障碍,确保交付时间和质量;\n'
'负责跟合作有关部门和团队对接,确保内部外部团队高效协同工作;\n'
'不断优化项目流程规范;,及时发现并跟踪解决项目问题,有效管理项目风险。',
'publishTime': '2019年07月05日',
'worklocation': '深圳总部'}
The remaining crawl output is omitted here for space.
Summary:
Steps for writing a Scrapy crawler:
scrapy startproject XXXX
scrapy genspider xxxx "xxx.com"
Edit items.py to define the data you want to extract.
Edit xxxx.py under the spiders directory to handle requests and responses and to extract the data (yield item).
Edit pipelines.py to process the item data returned by the spider, e.g. persist it locally.
Edit settings.py to enable the pipeline component, ITEM_PIPELINES = {...}, plus any other settings you need.
Run the crawler.