Scrapy Project Basic Workflow -- Using China Banking and Insurance News (中银保报) as an Example

Author: 朝朝朝朝朝落 | Published 2022-01-04 13:53

    Taking news crawling from the China Banking and Insurance News (中国银行保险报) website, with MongoDB storage, as the example. No theory here; this only serves routine, programmatic work. With Scrapy, each new website only needs a few small changes to get its content scraped.



    I. Preparation

    1. Create the project directory

    In the terminal, run: scrapy startproject general_spider (the name is up to you)



    2. Go into the project; the file names already hint at what each file is for (see the layout below):

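    The layout that scrapy startproject generates looks roughly like this:

    general_spider/
        scrapy.cfg                # project config / deploy entry point
        general_spider/
            __init__.py
            items.py              # item field definitions
            middlewares.py        # downloader / spider middlewares
            pipelines.py          # item pipelines (e.g. database storage)
            settings.py           # project settings
            spiders/
                __init__.py       # spider modules go in this package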

    Right-click the project and mark it as the root directory in your IDE.



    3. Edit the configuration file settings.py:

    # Not much to change at the start; this is enough for local debugging, more can be added later:
    ROBOTSTXT_OBEY = False
    
    ITEM_PIPELINES = {
        'general_spider.pipelines.PipelineDB': 300,
    }
    MONGO_URL = 'xxxxxx'
    DOWNLOADER_MIDDLEWARES = {
        'general_spider.middlewares.MyUserAgentMiddleware': 600,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'general_spider.middlewares.ProxyMiddleware': 350,  # comment out if you have no proxy
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,  # comment out if you have no proxy
    }
    # Logging
    LOG_LEVEL = 'WARNING'  # or DEBUG, INFO, ERROR, CRITICAL
    LOG_FILE = './log.log'
    
    # User-Agent pool: plenty of these online, just copy some over
    # (a list here, consumed by the custom MyUserAgentMiddleware)
    USER_AGENT = [
        'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Mobile Safari/537.36',
        'Mozilla/5.0 (iPhone; CPU iPhone OS 10_2 like Mac OS X) AppleWebKit/602.3.12 (KHTML, like Gecko) Mobile/14C92 MicroMessenger/6.5.16 NetType/WIFI Language/zh_CN',
        'Mozilla/5.0 (Linux; U; Android 5.1.1; zh-cn; MI 4S Build/LMY47V) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.146 Mobile Safari/537.36 XiaoMi/MiuiBrowser/9.1.3',
        'Mozilla/5.0 (Linux; U; Android 7.0; zh-CN; SM-G9550 Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36',
        'Mozilla/5.0 (Linux; U; Android 6.0.1; zh-CN; SM-C7000 Build/MMB29M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.6.2.948 Mobile Safari/537.36',
    ]
    

    4. Edit the middlewares file middlewares.py:

    import logging
    import random
    import time
    
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    
    
    # Keep this class; it rotates the User-Agent
    class MyUserAgentMiddleware(UserAgentMiddleware):
        '''
        Set a random User-Agent on each request
        '''
    
        def __init__(self, user_agent):
            self.user_agent = user_agent
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                user_agent=crawler.settings.get('USER_AGENT')
            )
    
        def process_request(self, request, spider):
            agent = random.choice(self.user_agent)
            request.headers['User-Agent'] = agent
    
    
    class ProxyMiddleware():
        '''Rotate the IP; Abuyun (阿布云) is used as the example here'''
        def __init__(self, settings):
            self.logger = logging.getLogger(__name__)
            self.settings = settings
            self.proxy_type = self.settings.get('GET_PROXY_TYPE', 'no')
            self.last_fetch_proxy_time = time.time()
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings)
    
        def process_request(self, request, spider):
            # set the header Connection: close
            request.headers['Connection'] = 'close'
            
            proxyServer, proxyAuth = get_abuyun_pro()
            request.meta["proxy"] = proxyServer
            request.headers["Proxy-Authorization"] = proxyAuth
            request.headers["Proxy-Switch-Ip"] = 'yes'
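
    The get_abuyun_pro() call above refers to a project helper that is not shown in the article; it returns the proxy endpoint and the value for the Proxy-Authorization header. A minimal sketch of such a helper, assuming an Abuyun dynamic HTTP proxy account (the endpoint and the credentials below are placeholders):

    import base64
    
    def get_abuyun_pro():
        '''Hypothetical helper: return (proxyServer, proxyAuth) for an Abuyun dynamic proxy.'''
        proxy_server = 'http://http-dyn.abuyun.com:9020'  # placeholder endpoint
        proxy_user = 'YOUR_PROXY_USER'                    # placeholder credentials
        proxy_pass = 'YOUR_PROXY_PASS'
        proxy_auth = 'Basic ' + base64.urlsafe_b64encode(
            f'{proxy_user}:{proxy_pass}'.encode('ascii')).decode('ascii')
        return proxy_server, proxy_auth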
            
    

    5. Edit pipelines.py and add the database-storage step (MongoDB in this example):

    from pymongo import MongoClient
    
    from general_spider import settings
    
    
    class PipelineDB(object):
    
        def __init__(self, logger, client):
            self.logger = logger
            self.client = client
            self.collection = self.client['pa_crawler']  # 'pa_crawler' is the database; collections are picked per table below
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                logger=crawler.spider.logger,
                client=MongoClient(settings.MONGO_URL, maxPoolSize=10)
            )
    
        def process_item(self, item, spider):
            table_name = 'xxxx'  # collection (table) name
    
            # if the article already exists, do not store this one
            if self.collection[table_name].count_documents({'article_title': item["article_title"]}) == 0:
                # what gets stored here is the already formatted data
                self.collection[table_name].insert_one(dict(item))
                self.logger.info(f'{table_name}-{item["article_title"]}: submitted to MongoDB successfully')
                return item
            else:
                self.logger.info(f'{table_name}-{item["article_title"]}: already in MongoDB, discarded')
                return item
    
        def close_spider(self, spider):
            self.client.close()
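
    Optionally, the title-based dedup can be backed by a unique index, so that two concurrent crawls cannot insert the same title twice. A minimal sketch under that assumption, reusing the same placeholder collection name as above (note that with the index in place a duplicate insert_one() raises DuplicateKeyError, which would then need to be caught):

        def open_spider(self, spider):
            # optional, hypothetical addition: enforce title uniqueness at the database level
            self.collection['xxxx'].create_index('article_title', unique=True)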
    

    6. Edit items.py with the fields to be stored:

    
    import scrapy
    
    
    class BadouItem(scrapy.Item):
        # define the fields for your item here like:
        article_title = scrapy.Field()         # title
        article_content_html = scrapy.Field()  # raw HTML content
        article_content_raw = scrapy.Field()   # plain text content, HTML tags stripped
        article_content = scrapy.Field()       # plain text content + image URLs (inserted in place in the article)
        article_cover = scrapy.Field()         # cover image
        crawl_date = scrapy.Field()            # crawl time
        ref_url = scrapy.Field()               # article URL
        publish_time = scrapy.Field()          # publish time
        site = scrapy.Field()                  # platform
        pictures = scrapy.Field()              # original images in the article
        _id = scrapy.Field()                   # MongoDB document id
    

    II. Writing the spider code

    Create a .py file under the spiders directory.


    
    import datetime
    
    import scrapy
    
    from general_spider.items import BadouItem
    # parse_content() is the HTML-cleaning helper shown in the appendix at the end of this article;
    # it is assumed to be defined in (or imported into) this module.
    
    
    class Baoxianbao(scrapy.Spider):
        '''
        China Banking and Insurance News (中国银行保险报) -- news
        '''
        name = 'baoxianbao'  # the spider's name
        start_urls = ['http://www.cbimc.cn/']  # site root (unused here, since start_requests is overridden)
    
        def start_requests(self):
    
            self.headers = {
                'Connection': 'keep-alive',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                'Referer': 'http://www.cbimc.cn/node_7037.html',
                'Accept-Language': 'zh-CN,zh;q=0.9',
            }
            for category in ['07', '08', 10, 12, 16, 18, 29, 32, 33, 34, 35, 36, 37, 38, 39]:  # each category
                url = f'http://www.cbimc.cn/node_70{category}.html'
                # for a POST request use scrapy.FormRequest(...)
                yield scrapy.Request(url, headers=self.headers, callback=self.parse)
    
        def parse(self, response):
            # article link, title, cover image, publish time
            urls = response.xpath('//div[@class="list nav_left_a1"]//li[@class="list_item"]/a/@href').extract()
            titles = response.xpath('//div[@class="list nav_left_a1"]//li[@class="list_item"]//h1/text()').extract()
            article_covers = response.xpath('//div[@class="list nav_left_a1"]//li[@class="list_item"]//img/@src').extract()
            publish_times = response.xpath('//div[@class="list nav_left_a1"]//li[@class="list_item"]//span/text()').extract()
    
            for url, title, article_cover, publish_time in zip(urls, titles, article_covers, publish_times):
                item = BadouItem()
    
                item['publish_time'] = publish_time
                item['article_cover'] = article_cover
                item['article_title'] = title
                item['ref_url'] = url
                item['crawl_date'] = str(datetime.datetime.now()).split('.')[0]  # crawl time
                item['_id'] = url.split('_')[-1].split('.')[0]  # document id derived from the article URL
                # meta passes data between callbacks; wrap the item under its own key so it is not
                # mixed with Scrapy's internal meta keys (depth, proxy, ...)
                yield scrapy.Request(url, headers=self.headers, callback=self.parse_content, meta={'item': item})
    
        def parse_content(self, response):
            item = response.meta['item']
            text = response.text
            content = scrapy.Selector(text=text).css('.detail-d').extract_first()
            item['article_content_html'] = content  # raw HTML content
            article_content_raw, pictures, article_content = parse_content(content)
            item['article_content_raw'] = article_content_raw  # plain text, HTML tags stripped
            item['article_content'] = article_content  # plain text + image URLs inserted in place
    
            item['site'] = '中国银行保险报'
            item['pictures'] = pictures  # images
    
            yield item
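
    The XPath and CSS expressions above are easiest to work out interactively in the Scrapy shell against one of the category pages, before they go into parse(), for example:

    scrapy shell 'http://www.cbimc.cn/node_7037.html'
    >>> response.xpath('//div[@class="list nav_left_a1"]//li[@class="list_item"]/a/@href').extract()[:3]
    >>> response.xpath('//div[@class="list nav_left_a1"]//li[@class="list_item"]//h1/text()').extract()[:3]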
    

    III. Writing run.py

    # Method 1 (pick one of the two methods)
    from scrapy.cmdline import execute
    
    execute(['scrapy', 'crawl', 'baoxianbao'])
    # note: execute() exits the process when the crawl finishes,
    # so to run several spiders from one script use Method 2 below
    
    # Method 2
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    process = CrawlerProcess(get_project_settings())
    process.crawl('baoxianbao')
    # process.crawl('second_spider_name')  # for multiple spiders, just duplicate this line
    
    process.start()
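
    Both variants are just programmatic equivalents of running the spider from the terminal inside the project directory:

    scrapy crawl baoxianbao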
    

    Run it: this round scraped exactly 100 items (a few old news articles were filtered out).



    A look at the badou table (collection) in MongoDB shows the stored articles.



    Appendix: there is an article-parsing helper, parse_content(), whose job is to extract the plain text and the images from the HTML:

    
    import re
    
    from lxml import etree
    
    
    def parse_content(content):
        '''Strip HTML tags, keeping the text and the images'''
        html_ = etree.HTML(content)  # kept only for the xpath('string(.)') alternative noted below
        # plain text
        article_content_raw = re.sub('<.*?>', '', content).replace('&nbsp;', '\n')  # or: html_.xpath('string(.)')
        # images
        pictures_html = re.findall('<img.*?>', content)
        pictures = []
        # text + images
        for pic_html in pictures_html:
            pic = re.findall(r'img.*?(http.*?)\"', pic_html)[0]
            pictures.append(pic)
            content = content.replace(pic_html, pic)
        # content = etree.HTML(content)
        article_content = re.sub('<.*?>', '', content).replace('&nbsp;', '')
        return article_content_raw, pictures, article_content
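
    A quick illustration of what the helper returns, using a made-up snippet (the values in the comments are what the code above produces):

    sample = '<p>Hello&nbsp;world</p><img src="http://example.com/a.jpg">'
    raw, pics, mixed = parse_content(sample)
    # raw   -> 'Hello\nworld'                        tags removed, &nbsp; becomes a newline
    # pics  -> ['http://example.com/a.jpg']          image URLs collected
    # mixed -> 'Helloworldhttp://example.com/a.jpg'  tags removed, image URL kept in place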
    
