Scrapy Basics

Author: kingloongMagic | Published 2024-03-20 15:04

I. Common Commands

  • Install: pip install scrapy
  • Create a project: scrapy startproject myproject
  • Generate a spider: scrapy genspider mydomain mydomain.com
  • Run a spider: scrapy crawl mydomain
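
For reference, scrapy startproject myproject generates the following layout; spiders created with genspider land in the spiders/ directory:

myproject/
    scrapy.cfg            # deploy configuration
    myproject/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader/spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules
            __init__.py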

II. Basic Features

1. Running a Spider from a script

import subprocess

def run_scrapy_spider(spider_name):
    try:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code; without it, a failed crawl passes silently.
        subprocess.run(['scrapy', 'crawl', spider_name], check=True)
        print(f"Spider {spider_name} finished.")
    except Exception as e:
        print(f"Error running spider: {e}")

# Run the crawl; replace 'your_spider_name' with your spider's name
run_scrapy_spider('your_spider_name')
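
Scrapy also offers an in-process alternative to shelling out, via its CrawlerProcess API; a minimal sketch, assuming it is run from the project root so that settings.py can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py and run the spider in this process.
process = CrawlerProcess(get_project_settings())
process.crawl('your_spider_name')  # the spider's `name` attribute
process.start()                    # blocks until the crawl finishes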

2. Matching multiple URL types in one Spider

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
 
class MySpider(CrawlSpider):
    name = 'my_spider'
    
    # Initial seed URLs
    start_urls = ['http://example.com/', 'http://example.com/posts']

    # Crawl rules. Note: if there is only one Rule, the trailing comma must
    # not be omitted, otherwise Python builds a bare value instead of a tuple.
    rules = (
        # Rule for the first page type; follow=True means new links
        # extracted from matching pages are followed and crawled too.
        Rule(LinkExtractor(allow=r'/details/\d+'), callback='parse_details', follow=True),
        # Rule for the second page type
        Rule(LinkExtractor(allow=r'/posts/\d+'), callback='parse_post', follow=True),
    )

    def parse_details(self, response):
        # Parse the first page type
        pass

    def parse_post(self, response):
        # Parse the second page type
        pass
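
To check what an allow pattern will actually match before running a full crawl, the LinkExtractor can be exercised directly against a response; a minimal self-contained sketch, using a tiny in-memory page made up for illustration:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A tiny hand-written page to exercise the allow pattern against.
html = b'<a href="/details/42">detail</a> <a href="/about">about</a>'
response = HtmlResponse(url='http://example.com/', body=html, encoding='utf-8')

links = LinkExtractor(allow=r'/details/\d+').extract_links(response)
print([link.url for link in links])  # only .../details/42 matches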

3. Handling multiple item types in one pipeline

# assumes the items live in the myproject package created in section I
from myproject.items import FirstItem, SecondItem

class MultipleItemPipeline:
    def process_item(self, item, spider):
        # Map each item class to its handler and dispatch by type.
        rules = {
            FirstItem: self.handle_first_item,
            SecondItem: self.handle_second_item,
        }
        for item_class, handler in rules.items():
            if isinstance(item, item_class):
                handler(item)
                break
        return item

    def handle_first_item(self, item):
        # Process FirstItem
        pass

    def handle_second_item(self, item):
        # Process SecondItem
        pass
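
For the isinstance dispatch above to work, the two item types must be distinct classes; a minimal items.py sketch (FirstItem, SecondItem, and their fields are placeholder names):

import scrapy

class FirstItem(scrapy.Item):
    title = scrapy.Field()

class SecondItem(scrapy.Item):
    body = scrapy.Field()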

4. Setting the User-Agent

Set a random User-Agent on each outgoing request so that the crawler is harder to fingerprint and block.
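
A minimal sketch of a downloader middleware that picks a random User-Agent per request; the USER_AGENTS pool and the module path in settings.py are assumptions to adapt to your project:

import random

# Assumed hand-picked pool of User-Agent strings; extend as needed.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is sent.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

# Enable it in settings.py (the module path is an assumption):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RandomUserAgentMiddleware': 543,
# }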

5. Common settings

# Log level (only CRITICAL messages are emitted)
LOG_LEVEL = 'CRITICAL'
# Proxy address. Note: HTTP_PROXY is not a built-in Scrapy setting; Scrapy
# applies proxies per request via request.meta['proxy'], so a custom
# middleware has to read this value (see the sketch below)
HTTP_PROXY = 'http://xxx'
# Ignore the robots.txt protocol
ROBOTSTXT_OBEY = False
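
A minimal sketch of a middleware that reads the HTTP_PROXY setting above and applies it to every request (register it in DOWNLOADER_MIDDLEWARES, as in section 4):

class CustomProxyMiddleware:
    def __init__(self, proxy):
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        # Read the HTTP_PROXY value from the project settings.
        return cls(crawler.settings.get('HTTP_PROXY'))

    def process_request(self, request, spider):
        if self.proxy:
            # Scrapy's HttpProxyMiddleware honors request.meta['proxy'].
            request.meta['proxy'] = self.proxy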
