
Building a Python Proxy IP Pool (IP Proxy Pool)

Author: 拾柒丶_8257 | Published on 2019-11-12 23:48

    This article is reposted from: https://blog.csdn.net/qq_42415326/article/details/95044280

    Main modules used:

    • requests, lxml, pymongo, Flask

    Proxy pool workflow


    Description

    • Proxy IP collection module: scrape proxy IPs -> check each proxy's availability -> store the usable ones in the database
    • Verification module: read the proxy IPs from the database -> re-check their availability -> update or delete them
    • Proxy API module: fetch stable, usable proxy IPs from the database for other crawlers to use (a consumer sketch follows this list)
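
    As a quick illustration of the last bullet, here is a minimal, hypothetical sketch of how a crawler might consume the pool. It assumes the Flask API from step 10 is running locally on port 16888 and that its /random endpoint returns a plain 'protocol://ip:port' string, which matches the proxy_api.py code shown later; the target URL is only a placeholder:

    import requests

    def fetch_with_pool_proxy(target_url):
        # Ask the pool for a random usable https proxy for this domain;
        # /random returns a plain "https://ip:port" string on success.
        domain = target_url.split('/')[2]
        proxy_url = requests.get(
            'http://127.0.0.1:16888/random',
            params={'protocol': 'https', 'domain': domain},
            timeout=5,
        ).text
        proxies = {'http': proxy_url, 'https': proxy_url}
        # Use the proxy for the real request; the target URL is only a placeholder.
        return requests.get(target_url, proxies=proxies, timeout=10)

    if __name__ == '__main__':
        print(fetch_with_pool_proxy('https://httpbin.org/get').status_code)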

    Proxy pool project structure (a directory layout sketch follows the module list):

    • mongo_pool module: CRUD operations for proxy IPs
    • proxy_spider package: collects proxy IPs
    • httpbin_validator module: checks a proxy's availability, i.e. speed, protocol type and anonymity level (reason: the protocol type and anonymity level advertised on proxy sites are not reliable)
    • proxy_api module: gives crawlers an interface for getting stable, usable proxy IPs and for marking a domain as unusable for a given IP
    • proxy_test module: reads the proxy IPs from the database and periodically re-checks their availability
    • dbmodle module: proxy IP data model
    • main module: program entry point
    • http module: provides request headers with a random User-Agent
    • log module: logging
    • settings module: project configuration file
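
    The directory layout below is inferred from the import statements in the code that follows; the exact folder and file names (and the __init__.py files each package would need) are an assumption rather than something stated in the original project:

    proxy_pool/
    ├── main.py                      # program entry point
    ├── settings.py                  # project configuration
    ├── dbmodle.py                   # Proxy data model
    ├── proxy_test.py                # periodic re-check of stored proxies
    ├── proxy_api.py                 # Flask API for crawlers
    ├── proxies_db/
    │   └── mongo_pool.py            # MongoDB CRUD for proxies
    ├── proxy_utils/
    │   ├── log.py                   # logging helper
    │   └── random_headers.py        # random User-Agent headers
    ├── proxy_validate/
    │   └── httpbin_validator.py     # proxy availability checks
    └── proxy_spider/
        ├── base_spider.py           # generic spider
        ├── proxy_spiders.py         # site-specific spiders
        └── run_spider.py            # spider runner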

    Implementation approach

    Implement the basic modules that do not depend on anything else first, then build the concrete functional modules on top of them.

    1. The proxy IP data model class (dbmodle.py)

    '''
    Proxy IP data model module.
    Define a class that inherits from object and implement __init__ with the following fields:
        ip: the proxy's IP address
        port: the proxy's port number
        protocol: protocol types supported by the proxy; http is 0, https is 1, both http and https is 2
        nick_type: anonymity level of the proxy; elite (high anonymity): 0, anonymous: 1, transparent: 2
        speed: response speed of the proxy, in seconds
        area: region where the proxy is located
        score: score of the proxy; the default value comes from the configuration file. During availability checks the score is decreased by 1 for every failed request, and the proxy is removed from the pool when it reaches 0; if a check succeeds the score is restored to the default.
        disable_domains: list of unusable domains; some proxies fail on certain domains but still work on others
    Create the configuration file settings.py and define MAX_SCORE = 50 in it.
    '''
    from settings import MAX_SCORE

    class Proxy(object):

        def __init__(self, ip, port, protocol=-1, nick_type=-1, speed=-1,
                     area=None, score=MAX_SCORE, disable_domains=None):
            # IP address of the proxy
            self.ip = ip
            # port number of the proxy
            self.port = port
            # supported protocol types: http is 0, https is 1, both is 2
            self.protocol = protocol
            # anonymity level: elite is 0, anonymous is 1, transparent is 2
            self.nick_type = nick_type
            # response speed of the proxy, in seconds
            self.speed = speed
            # region where the proxy is located
            self.area = area
            # score of the proxy, used to measure its usability
            self.score = score
            # list of domains the proxy cannot be used for
            # (default to None to avoid a mutable default argument)
            self.disable_domains = disable_domains if disable_domains is not None else []

        def __str__(self):
            # return the instance attributes as a string
            return str(self.__dict__)
    

    2. The logging module (log.py)

    Purpose:

    Make it easy to debug the program
    Record the program's running state
    Record error messages
    Implementation: there are plenty of ready-made logging implementations online, so during development we usually do not write our own; we simply take one and use it.

    Put the logging configuration into the configuration file
    Change the logging code to read those settings from the configuration file

    '''
    Logging module
    '''
     
    import sys, os
    # Python's standard logging module
    import logging
    # add the parent directory to the module search path
    sys.path.append("../")

    from settings import LOG_LEVEL, LOG_FMT, LOG_DATEFMT, LOG_FILENAME

    class Logger(object):

        def __init__(self):
            # get a logger object
            self._logger = logging.getLogger()
            # create the Formatter object
            self.formatter = logging.Formatter(fmt=LOG_FMT, datefmt=LOG_DATEFMT)
            # add a handler that writes log records to a file
            self._logger.addHandler(self._get_file_handler(LOG_FILENAME))
            # add a handler that writes log records to the console
            self._logger.addHandler(self._get_console_handler())
            # set the log level
            self._logger.setLevel(LOG_LEVEL)
     
        def _get_file_handler(self, filename):
            '''
            Return a handler that logs to a file
            '''
            # get a handler that writes log output to a file
            filehandler = logging.FileHandler(filename=filename, encoding="utf-8")
            # set the log format
            filehandler.setFormatter(self.formatter)
            # return the handler
            return filehandler

        def _get_console_handler(self):
            '''
            Return a handler that logs to the console
            '''
            # get a handler that writes log output to the terminal
            console_handler = logging.StreamHandler(sys.stdout)
            # set the log format
            console_handler.setFormatter(self.formatter)
            # return the handler
            return console_handler

        # property that exposes the configured logger object
        @property
        def logger(self):
            return self._logger

    # initialise and configure a single logger object (acts as a singleton);
    # other modules just import `logger` and use it directly
    logger = Logger().logger
     
    if __name__ == '__main__':
        print(logger)
        logger.debug("debug message")
        logger.info("info message")
        logger.warning("warning message")
        logger.error("error message")
        logger.critical("critical message")
    

    3. Checking a proxy IP's protocol type, anonymity level and speed (httpbin_validator.py)

    '''
    Proxy IP speed check: the time between sending the request and receiving the response.
    Anonymity check:
            send a request to http://httpbin.org/get or https://httpbin.org/get
            if 'origin' contains two IPs separated by ',' the proxy is transparent
            elif 'headers' contains Proxy-Connection the proxy is anonymous
            else the proxy is elite (high anonymity)
    Protocol type check:
        if a request to http://httpbin.org/get succeeds, the proxy supports http
        if a request to https://httpbin.org/get succeeds, the proxy supports https
    '''
    import sys
    import time
    import requests
    import json
    sys.path.append("..")
    from proxy_utils import random_headers
    from settings import CHECK_TIMEOUT
    from proxy_utils.log import logger
    from dbmodle import Proxy
     
     
    def check_proxy(proxy):
        '''
        Check whether http and https requests succeed through the proxy
        '''
        # proxies dict used by requests
        proxies = {
            'http': 'http://{}:{}'.format(proxy.ip, proxy.port),
            'https': 'https://{}:{}'.format(proxy.ip, proxy.port),
        }

        http, http_nick_type, http_speed = http_check_proxies(proxies)
        https, https_nick_type, https_speed = http_check_proxies(proxies, False)

        if http and https:
            proxy.protocol = 2   # supports both http and https
            proxy.nick_type = http_nick_type
            proxy.speed = http_speed
        elif http:
            proxy.protocol = 0   # supports http only
            proxy.nick_type = http_nick_type
            proxy.speed = http_speed
        elif https:
            proxy.protocol = 1   # supports https only
            proxy.nick_type = https_nick_type
            proxy.speed = https_speed
        else:
            proxy.protocol = -1
            proxy.nick_type = -1
            proxy.speed = -1

        #logger.debug(proxy)

        return proxy
     
    def http_check_proxies(proxies, isHttp=True):
        '''
        Send a test request through the proxy and report success, anonymity level and speed
        '''
        nick_type = -1   # anonymity level
        speed = -1       # response speed
        if isHttp:
            test_url = 'http://httpbin.org/get'
        else:
            test_url = 'https://httpbin.org/get'
        # request test_url with the requests library
        try:
            # measure the response time
            start_time = time.time()
            res = requests.get(test_url, headers=random_headers.get_request_headers(),
                               proxies=proxies, timeout=CHECK_TIMEOUT)
            end_time = time.time()
            cost_time = end_time - start_time

            if res.status_code == 200:
                # response speed, rounded to two decimals
                speed = round(cost_time, 2)
                # parse the response body as a dict
                res_dict = json.loads(res.text)
                # the origin IP the test site saw
                origin_ip = res_dict['origin']
                # 'Proxy-Connection' in the echoed headers indicates an anonymous proxy
                proxy_connection = res_dict['headers'].get('Proxy-Connection', None)

                if "," in origin_ip:
                    # two IPs separated by ',' in origin means a transparent proxy
                    nick_type = 2   # transparent
                elif proxy_connection:
                    # 'Proxy-Connection' present means an anonymous proxy
                    nick_type = 1   # anonymous
                else:
                    nick_type = 0   # elite (high anonymity)
                return True, nick_type, speed
            else:
                return False, nick_type, speed
        except Exception as e:
            #logger.exception(e)
            return False, nick_type, speed
     
    if __name__ == '__main__':
        proxy = Proxy('60.13.42.94','9999')
        result = check_proxy(proxy)
        print(result)
    

    4. The database module (CRUD plus the queries used by the API, mongo_pool.py)

    Define a MongoPool class that inherits from object:

    1. In __init__, open the database connection and get the collection to work on; close the connection in the __del__ method
    2. Provide the basic CRUD operations
      1. insert a proxy
      2. update a proxy
      3. delete a proxy, keyed by its IP
      4. query all proxy IPs
    3. Provide the queries used by the proxy API module
      1. query by condition with an optional result count, sorted by score descending and then speed ascending, so the best proxy IPs come first
      2. get a list of proxy IPs for a given protocol type and target domain
      3. get one random proxy IP for a given protocol type and target domain
      4. add a given domain to a given IP's disable_domains list
    '''
    CRUD operations on the proxies collection, also used by the proxy API
    '''
    import random
    import pymongo
    import sys
     
    sys.path.append("..")
    from settings import MONGO_URL
    from proxy_utils.log import logger
    from dbmodle import Proxy
     
    class MongoPool(object):
     
        def __init__(self):
            # connect to the database
            self.client = pymongo.MongoClient(MONGO_URL)
            # get the collection to operate on
            self.proxies = self.client['proxy_pool']['proxies']

        def __del__(self):
            # close the database connection
            self.client.close()

        def insert(self, proxy):
            '''
            Insert a proxy IP
            '''
            count = self.proxies.count_documents({'_id': proxy.ip})
            if count == 0:
                # convert the Proxy object to a dict
                proxy_dict = proxy.__dict__
                # use the IP as the primary key
                proxy_dict['_id'] = proxy.ip
                # insert the proxy into the proxies collection
                self.proxies.insert_one(proxy_dict)
                logger.info('inserted new proxy: {}'.format(proxy))
            else:
                logger.warning('proxy already exists: {}'.format(proxy))

        def update(self, proxy):
            '''
            Update a proxy IP in the database
            '''
            self.proxies.update_one({'_id': proxy.ip}, {'$set': proxy.__dict__})
            logger.info('updated proxy: {}'.format(proxy))

        def delete(self, proxy):
            '''
            Delete a proxy IP from the database
            '''
            self.proxies.delete_one({'_id': proxy.ip})
            logger.info('deleted proxy: {}'.format(proxy))

        def find_all(self):
            '''
            Query all proxy IPs in the database
            '''
            cursor = self.proxies.find()

            for item in cursor:
                # drop the _id key so the dict maps onto Proxy's constructor arguments
                item.pop('_id')
                proxy = Proxy(**item)
                # yield each Proxy object (generator)
                yield proxy
     
        def limit_find(self, conditions={}, count=0):
            '''Query by condition with an optional result count,
            sorted by score descending and then speed ascending,
            so the best proxy IPs come first'''
            cursor = self.proxies.find(conditions, limit=count).sort([
                ('score', pymongo.DESCENDING), ('speed', pymongo.ASCENDING)])
            # collect the matching proxies
            proxy_list = []

            for item in cursor:
                item.pop('_id')
                proxy = Proxy(**item)
                proxy_list.append(proxy)
            return proxy_list
     
        def get_proxies(self, protocol=None, domain=None, nick_type=0, count=0):
            '''
            Get a list of proxy IPs for a given protocol type and target domain
            '''
            conditions = {'nick_type': nick_type}
            if protocol is None:
                conditions['protocol'] = 2
            elif protocol.lower() == 'http':
                conditions['protocol'] = {'$in': [0, 2]}
            else:
                conditions['protocol'] = {'$in': [1, 2]}

            if domain:
                conditions['disable_domains'] = {'$nin': [domain]}

            return self.limit_find(conditions, count=count)

        def random_proxy(self, protocol=None, domain=None, count=0, nick_type=0):
            '''
            Get one random proxy IP for a given protocol type and target domain
            '''
            proxy_list = self.get_proxies(protocol=protocol, domain=domain, count=count, nick_type=nick_type)

            return random.choice(proxy_list)

        def add_disable_domain(self, ip, domain):
            '''
            Add the given domain to the given IP's disable_domains list, only if it is not there yet
            '''
            count = self.proxies.count_documents({'_id': ip, 'disable_domains': domain})
            if count == 0:
                self.proxies.update_one({'_id': ip}, {'$push': {'disable_domains': domain}})
     
     
    if __name__ == '__main__':
        mongo = MongoPool()
        # insert test
        #proxy = Proxy('202.104.113.32','53281')
        #mongo.insert(proxy)

        # update test
        #proxy = Proxy('202.104.113.32','8888')
        #mongo.update(proxy)

        # delete test
        #proxy = Proxy('202.104.113.32','8888')
        #mongo.delete(proxy)

        # find_all test
        #for proxy in mongo.find_all():
            #print(proxy)
    

    5. Request headers with a random User-Agent (random_headers.py)

    
    '''
    Build request headers with a random User-Agent
    '''
    import random
     
    # list of User-Agent strings
    USER_AGENTS = [
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
        "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
        "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
        "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
        "UCWEB7.0.2.37/28/999",
        "NOKIA5700/ UCWEB7.0.2.37/28/999",
        "Openwave/ UCWEB7.0.2.37/28/999",
        "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
        # iPhone 6:
        "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
    ]
     
    # build request headers with a randomly chosen User-Agent
    def get_request_headers():
        headers = {
            'User-Agent': random.choice(USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Referer': 'https://www.baidu.com',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
        }
        return headers
     
        
     
    if __name__ == '__main__':
        # check that the User-Agent actually varies between calls
        print(get_request_headers())
        print("------------"*20)
        print(get_request_headers())
    

    6. A generic spider used as the parent class of the concrete spiders (base_spider.py)

    '''
    Generic spider: extracts proxy IPs from different sites given a URL list, a group XPATH and the in-group XPATHs
    Define a BaseSpider class that inherits from object
      - three class attributes: urls, group_xpath, detail_xpath (ip, port, area)
      - an __init__ that accepts the URL list, the group XPATH and the detail (in-group) XPATHs
      - a public method that yields the scraped proxy IPs
    '''
     
    import requests
    import sys
    import time,random
    from lxml import etree
    sys.path.append('..')
    from proxy_utils.random_headers import get_request_headers
    from dbmodle import Proxy
     
    class BaseSpider(object):
        # class attributes
        # list of URLs of pages that list proxy IPs
        urls = []
        # group XPATH: selects the elements that each contain one proxy IP entry
        group_xpath = ''
        # in-group XPATHs for the proxy details, in the form {'ip': 'xx', 'port': 'xx', 'area': 'xx'}
        detail_xpath = {}
        def __init__(self, urls=[], group_xpath='', detail_xpath={}):
            # allow the URL list, group XPATH and detail (in-group) XPATHs to be passed in
            if urls:
                self.urls = urls
            if group_xpath:
                self.group_xpath = group_xpath
            if detail_xpath:
                self.detail_xpath = detail_xpath


        def get_proxies(self):
            # fetch each listing page and yield the proxies found on it
            for url in self.urls:
                page_html = self.get_page(url)
                proxies = self.get_html_proxies(page_html)
                # `yield from` re-yields the items produced by the generator
                yield from proxies

        def get_page(self, url):
            # request the page content
            res = requests.get(url, headers=get_request_headers())
            # sleep 1 to 5 seconds between requests
            time.sleep(random.uniform(1, 5))
            return res.content

        def get_html_proxies(self, page_html):
            element = etree.HTML(page_html)
            trs = element.xpath(self.group_xpath)
            for tr in trs:
                ip = self.get_list_first(tr.xpath(self.detail_xpath['ip']))
                port = self.get_list_first(tr.xpath(self.detail_xpath['port']))
                area = self.get_list_first(tr.xpath(self.detail_xpath['area']))
                proxy = Proxy(ip, port, area=area)
                yield proxy

        def get_list_first(self, lst):
            # return the first element of the list, or '' if it is empty
            return lst[0] if len(lst) != 0 else ''
     
    if __name__ == '__main__':
        config = {
        'urls':['http://www.ip3366.net/free/?stype=1&page={}'.format(i) for i in range(1,3)],
        'group_xpath':'//*[@id="list"]/table/tbody/tr',
        'detail_xpath':{
        'ip':'./td[1]/text()',
        'port':'./td[2]/text()',
        'area':'./td[5]/text()',
        }
        }
        spider = BaseSpider(**config)
        for proxy in spider.get_proxies():
            print(proxy)
    

    7. The concrete spider classes (proxy_spiders.py)

    '''
    Concrete spider classes
    '''
    #import time
    #import random
    import requests
    import sys
    sys.path.append('../')
    #import re
    #import js2py
    from proxy_spider.base_spider import BaseSpider
     
    class XiciSpider(BaseSpider):
        '''Xici (xicidaili.com) proxy spider'''
        urls = ['http://www.xicidaili.com/nn/{}'.format(i) for i in range(1,21)]

        group_xpath = '//*[@id="ip_list"]//tr[position()>1]'
        detail_xpath = {
        'ip':'./td[2]/text()',
        'port':'./td[3]/text()',
        'area':'./td[4]/a/text()',
        }
     
    class Ip3366Spider(BaseSpider):
        '''
        ip3366.net proxy spider
        '''
     
        urls = ['http://www.ip3366.net/free/?stype={}&page={}'.format(i,j) for i in range(1,4,2) for j in range(1,8)]
     
        group_xpath = '//*[@id="list"]/table/tbody/tr'
        detail_xpath = {
        'ip':'./td[1]/text()',
        'port':'./td[2]/text()',
        'area':'./td[5]/text()',
        }
     
     
    class kuaiSpider(BaseSpider):
        '''
        kuaidaili.com proxy spider
        '''

        urls = ['http://www.kuaidaili.com/free/in{}/{}'.format(i,j) for i in ['ha','tr'] for j in range(1,21)]

        group_xpath = '//*[@id="list"]/table/tbody/tr'
        detail_xpath = {
        'ip':'./td[1]/text()',
        'port':'./td[2]/text()',
        'area':'./td[5]/a/text()',
        }

        '''
        def get_page(self,url):
            # extra random wait before requesting the page
            time.sleep(random.uniform(1,2))
            return super().get_page(url)
        '''
     
    class Free89ipSpider(BaseSpider):
        '''
        89ip.cn proxy spider
        '''
        urls = ['http://www.89ip.cn/index{}.html'.format(i) for i in range(1,17)]

        group_xpath = '//div[3]//table/tbody/tr'
        detail_xpath = {
        'ip':'./td[1]/text()',
        'port':'./td[2]/text()',
        'area':'./td[3]/text()',
        }

        def get_page(self,url):
            return super().get_page(url).decode()

        def get_proxies(self):
            proxies = super().get_proxies()
            for item in proxies:
                item.ip = str(item.ip).replace("\n","").replace("\t","")
                item.area = str(item.area).replace("\n","").replace("\t","")
                item.port = str(item.port).replace("\n","").replace("\t","")
                # yield the cleaned Proxy object
                yield item
     
    if __name__ == '__main__':
        spider = Free89ipSpider()
        count= 0
        for proxy in spider.get_proxies():
            count+=1
            print(proxy)
    

    8. The spider runner module (run_spider.py)

    '''
    Define a RunSpider class
        run() is the entry point: it builds the spider list, runs every spider, checks each scraped proxy IP, writes the usable ones to the database and handles exceptions raised inside a spider
        each spider runs in a gevent coroutine so proxy IPs are scraped concurrently
        the schedule module re-runs the whole crawl at a fixed interval
    '''
     
    from gevent import monkey
    monkey.patch_all()
    from gevent.pool import Pool
     
    import importlib
    import sys,time
    import schedule
    sys.path.append('../')
    from settings import PROXIES_SPIDERS,SPIDERS_RUN_INTERVAL
    from proxy_validate.httpbin_validator import check_proxy
    from proxies_db.mongo_pool import MongoPool
    from proxy_utils.log import logger
     
    class RunSpider(object):
     
        def __init__(self):
     
            self.mongo_pool = MongoPool()
            self.coroutine_pool = Pool()
     
        def get_spider_from_settings(self):
            '''
            Create spider objects from the list of spider classes in the configuration file
            '''
            for full_class_name in PROXIES_SPIDERS:
                module_name, class_name = full_class_name.rsplit('.', maxsplit=1)
                # import the module dynamically
                module = importlib.import_module(module_name)

                cls = getattr(module, class_name)
                spider = cls()
                yield spider

        def run(self):
            '''
            Iterate over the spider objects and run each one's get_proxies method
            '''
            spiders = self.get_spider_from_settings()
            for spider in spiders:
                self.coroutine_pool.apply_async(self.__run_one_spider, args=(spider,))
            # the current thread waits until all spiders have finished
            self.coroutine_pool.join()

        def __run_one_spider(self, spider):
            try:
                for proxy in spider.get_proxies():
                    time.sleep(0.1)
                    checked_proxy = check_proxy(proxy)
                    if proxy.speed != -1:
                        self.mongo_pool.insert(checked_proxy)
            except Exception as er:
                logger.exception(er)
                logger.exception("spider {} raised an error".format(spider))

        @classmethod
        def start(cls):
            '''
            Class method: run the spiders every SPIDERS_RUN_INTERVAL hours (set in the configuration file)
            '''
            rs = RunSpider()
            rs.run()
            schedule.every(SPIDERS_RUN_INTERVAL).hours.do(rs.run)

            while 1:
                schedule.run_pending()
                time.sleep(60)
     
     
     
    if __name__ == '__main__':
        # start via the class method
        RunSpider.start()
        #app = RunSpider()
        #app.run()

        # quick test of schedule
        '''def task():
            print("haha")
        schedule.every(10).seconds.do(task)
        while 1:
            schedule.run_pending()
            time.sleep(1)'''
    

    9. The proxy IP re-check module (proxy_test.py)

    '''
    Periodically re-check the availability of the proxy IPs stored in the database, adjust their scores and update the database
    '''
    from gevent import monkey
    monkey.patch_all()
    from gevent.pool import Pool
    from queue import Queue
    import schedule
    import sys
    import time
    sys.path.append('../')
    from proxy_validate.httpbin_validator import check_proxy
    from proxies_db.mongo_pool import MongoPool
    from settings import TEST_PROXIES_ASYNC_COUNT,MAX_SCORE,TEST_RUN_INTERVAL
     
     
    class DbProxiesCheck(object):

        def __init__(self):
            # object used to access the database
            self.mongo_pool = MongoPool()
            # queue of proxy IPs waiting to be checked
            self.queue = Queue()
            # coroutine pool
            self.coroutine_pool = Pool()

        # async callback: when one check finishes, schedule the next one
        def __check_callback(self, temp):
            self.coroutine_pool.apply_async(self.__check_one, callback=self.__check_callback)

        def run(self):
            # core logic for re-checking the proxy IPs stored in the database
            proxies = self.mongo_pool.find_all()

            for proxy in proxies:
                self.queue.put(proxy)

            # start several asynchronous check tasks
            for i in range(TEST_PROXIES_ASYNC_COUNT):
                # pass the callback itself (not its result) so the checks keep looping
                self.coroutine_pool.apply_async(self.__check_one, callback=self.__check_callback)
            # the current thread waits until every queued task is done
            self.queue.join()

        def __check_one(self):
            # check the availability of one proxy IP
            # take a proxy from the queue
            proxy = self.queue.get()

            checked_proxy = check_proxy(proxy)

            if checked_proxy.speed == -1:
                # the check failed: lower the score and delete the proxy when it reaches 0
                checked_proxy.score -= 1
                if checked_proxy.score == 0:
                    self.mongo_pool.delete(checked_proxy)
                else:
                    self.mongo_pool.update(checked_proxy)
            else:
                # the check succeeded: restore the default score
                checked_proxy.score = MAX_SCORE
                self.mongo_pool.update(checked_proxy)
            # report one completed task to the queue
            self.queue.task_done()
     
     
        @classmethod
        def start(cls):
            '''
            Class method: re-check the proxies in the database every TEST_RUN_INTERVAL hours (set in the configuration file)
            '''
            test = DbProxiesCheck()
            test.run()
            schedule.every(TEST_RUN_INTERVAL).hours.do(test.run)

            while 1:
                schedule.run_pending()
                time.sleep(60)
     
     
    if __name__ == '__main__':
        DbProxiesCheck.start()
        #test = DbProxiesCheck()
        #test.run()
    

    10. The proxy pool API module (proxy_api.py)

    '''
    An interface that provides crawlers with stable, usable proxy IPs:
        return one random stable, usable proxy IP for a given protocol type and domain
        return several highly available proxy IPs for a given protocol type and domain
        add an unusable domain to a given IP
    '''
    from flask import Flask
    from flask import request
    import json
     
    from proxies_db.mongo_pool import MongoPool
     
    from settings import PROXIES_MAX_COUNT
     
     
    class ProxyApi(object):

        def __init__(self):

            self.app = Flask(__name__)

            # object used to access the database
            self.mongo_pool = MongoPool()

            # read the parameters from the request's query string
            @self.app.route('/random')
            def random():
                protocol = request.args.get('protocol')
                domain = request.args.get('domain')
                proxy = self.mongo_pool.random_proxy(protocol, domain, count=PROXIES_MAX_COUNT)

                if protocol:
                    return '{}://{}:{}'.format(protocol, proxy.ip, proxy.port)
                else:
                    return '{}:{}'.format(proxy.ip, proxy.port)

            @self.app.route('/proxies')
            def proxies():
                protocol = request.args.get('protocol')
                domain = request.args.get('domain')
                proxies = self.mongo_pool.get_proxies(protocol, domain, count=PROXIES_MAX_COUNT)
                # proxies is a list of Proxy objects; convert it to a list of dicts
                proxies_dict_list = [proxy.__dict__ for proxy in proxies]
                return json.dumps(proxies_dict_list)

            @self.app.route('/disable_domain')
            def disable_domain():
                ip = request.args.get('ip')
                domain = request.args.get('domain')

                if ip is None:
                    return 'please provide the ip parameter'
                if domain is None:
                    return 'please provide the domain parameter'
                self.mongo_pool.add_disable_domain(ip, domain)
                return 'domain {} disabled for {}'.format(domain, ip)
     
        def run(self,debug):
            self.app.run('0.0.0.0',port = 16888,debug = debug)
     
        @classmethod
        def start(cls,debug = None):
            proxy_api = cls()
            proxy_api.run(debug = debug)
     
    if __name__ == '__main__':
        ProxyApi.start(debug = True)
        #proxy_api = ProxyApi()
        #proxy_api.run(debug = True)
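
    With the API running (the code above listens on 0.0.0.0:16888), the three endpoints can be exercised with requests like the following; the domain and ip values are placeholders:

    http://127.0.0.1:16888/random?protocol=https&domain=jianshu.com
    http://127.0.0.1:16888/proxies?protocol=http&domain=jianshu.com
    http://127.0.0.1:16888/disable_domain?ip=202.104.113.32&domain=jianshu.com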
    

    11. The proxy pool entry point (main.py)

    '''
    Single entry point for the proxy pool:
       start separate processes for the spiders, the proxy re-check and the web service
    '''
     
    from multiprocessing import Process
    from proxy_spider.run_spider import RunSpider
    from proxy_test import DbProxiesCheck
    from proxy_api import ProxyApi
     
    def run():
        process_list = []
        # start the spider process
        process_list.append(Process(target=RunSpider.start))
        # start the proxy re-check process
        process_list.append(Process(target=DbProxiesCheck.start))
        # start the web service process
        process_list.append(Process(target=ProxyApi.start))

        for process in process_list:
            # run the child as a daemon process
            process.daemon = True
            process.start()
        # the main process waits for the children to finish
        for process in process_list:
            process.join()
     
     
    if __name__ == '__main__':
        run()
    

    12. The configuration module (settings.py)

    # default (maximum) score for a proxy IP
    MAX_SCORE = 50

    import logging

    # default logging configuration:
    # default log level
    LOG_LEVEL = logging.DEBUG
    # default log format
    LOG_FMT = '%(asctime)s %(filename)s [line:%(lineno)d] %(levelname)s: %(message)s'
    # default date format
    LOG_DATEFMT = '%Y-%m-%d %H:%M:%S'
    # default log file name
    LOG_FILENAME = 'log.log'

    # request timeout used by the proxy checks
    CHECK_TIMEOUT = 10

    # MongoDB connection URL
    MONGO_URL = 'mongodb://127.0.0.1:27017/'

    # list of concrete spider classes to run
    PROXIES_SPIDERS = [
    "proxy_spider.proxy_spiders.XiciSpider",
    "proxy_spider.proxy_spiders.Ip3366Spider",
    "proxy_spider.proxy_spiders.kuaiSpider",
    "proxy_spider.proxy_spiders.Free89ipSpider",
    ]

    # interval (hours) between spider runs
    SPIDERS_RUN_INTERVAL = 4

    # number of concurrent async tasks used when re-checking proxies
    TEST_PROXIES_ASYNC_COUNT = 10

    # interval (hours) between re-checks of the proxies in the database
    TEST_RUN_INTERVAL = 2

    # maximum number of proxy IPs returned by the API
    PROXIES_MAX_COUNT = 50
    
