
Learning Python web scraping - day 6 - IP pool

Author: 光小月 | Published: 2019-05-15 23:34 | Views: 26

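    Contents

    1. Learning Python web scraping - day 1
    2. Learning Python web scraping - day 2 - regular expressions
    3. Learning Python web scraping - day 3 - BeautifulSoup
    4. Learning Python web scraping - day 4 - extracting content with lxml + XPath
    5. Learning Python web scraping - day 5 - selenium
    6. Learning Python web scraping - day 6 - IP pool
    7. Learning Python web scraping - day 7 - hands-on project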

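    Learning about IPs

    1. Learn what an IP address is, why IPs get banned, and how to deal with a ban.

    2. Scrape the Xici proxy site (西刺代理) and build your own proxy pool.

    3. Direct link to Xici: https://www.xicidaili.com/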

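    1. Why IPs get banned, and how to deal with it

    To keep their content from being scraped, websites deploy anti-scraping mechanisms. When a single IP address sends a large number of similar requests, the site blocks that IP, and access only resumes after some time has passed.
    Common anti-scraping strategies:

    0. Checking browser headers such as User-Agent
    1. IP bans
    2. Image CAPTCHAs
    3. Slider CAPTCHAs
    4. JavaScript-based tracking of mouse trajectories
    5. Certificate-based encryption
    6. AI-based detection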
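
    To make the IP-ban behaviour described above concrete, here is a minimal sketch of how a server might count requests per IP and block an address for a cooling-off period. The class name IpBanTracker, the thresholds, and the ban duration are hypothetical; real sites apply their own rules, which are usually not published.

```python
import time
from collections import defaultdict, deque


class IpBanTracker:
    """Toy per-IP limiter: too many hits inside a window triggers a timed ban."""

    def __init__(self, max_requests=5, window_seconds=10.0, ban_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.ban_seconds = ban_seconds
        self._hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self._banned_until = {}           # ip -> time at which the ban expires

    def allow(self, ip, now=None):
        """Return True if the request is served, False if the IP is blocked."""
        now = time.monotonic() if now is None else now
        if self._banned_until.get(ip, float('-inf')) > now:
            return False
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_requests:
            # Too many requests in the window: ban the IP for a while.
            self._banned_until[ip] = now + self.ban_seconds
            hits.clear()
            return False
        return True
```

    A crawler that sends every request from one address trips a limiter like this quickly, while requests spread over several addresses or over a longer time stay under the threshold.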
    

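    2. How to deal with an IP ban

    1. Set up proxy IPs and rotate between them
    2. Add an interval between requests
    3. Change the User-Agent dynamically
    4. Disable cookies
    5. Set a download delay
    6. Use Google Cache
    7. Use an IP address pool (proxy IPs, VPNs, etc.)
    8. Use Crawlera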
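
    Items 1 to 3 of the list above can be combined in one small helper. The sketch below cycles through a list of proxies, picks a random User-Agent, and waits a random interval between requests. The names RequestRotator, next_request_kwargs, and wait, the delay range, and any proxy addresses passed in are hypothetical placeholders, not part of the original article.

```python
import itertools
import random
import time


class RequestRotator:
    """Rotate proxies round-robin and pick a random User-Agent per request."""

    def __init__(self, proxy_urls, user_agents, delay_range=(1.0, 3.0)):
        if not proxy_urls or not user_agents:
            raise ValueError('need at least one proxy and one user agent')
        self._proxies = itertools.cycle(list(proxy_urls))
        self._user_agents = list(user_agents)
        self.delay_range = delay_range

    def next_request_kwargs(self):
        """Build keyword arguments suitable for requests.get(url, **kwargs)."""
        proxy = next(self._proxies)
        return {
            # requests looks proxies up by the lowercase scheme of the target URL.
            'proxies': {'http': proxy, 'https': proxy},
            'headers': {'User-Agent': random.choice(self._user_agents)},
            'timeout': 10,
        }

    def wait(self):
        """Sleep for a random interval within delay_range; return the delay."""
        delay = random.uniform(*self.delay_range)
        time.sleep(delay)
        return delay
```

    A crawl loop would call rotator.wait() and then requests.get(url, **rotator.next_request_kwargs()). Calling requests.get directly, rather than through a shared requests.Session, also means cookies are not carried from one request to the next, which covers item 4.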
    

    Reference: https://desmonday.github.io/2019/03/06/python%E7%88%AC%E8%99%AB%E5%AD%A6%E4%B9%A0-day6-IP%E4%BB%A3%E7%90%86/

    3. Getting proxy IP addresses

    Site: https://www.xicidaili.com/

    Example

    import requests
    from bs4 import BeautifulSoup as bs

    USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36')
    HEADERS = {'User-Agent': USER_AGENT}


    def get_html(url):
        """Fetch a page and return its text, or None if the request fails."""
        try:
            r = requests.get(url, headers=HEADERS, timeout=10)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            print('error, could not open page: ' + url)
            return None


    def get_proxy_ip(html):
        """Parse the table with id='ip_list' into 'protocol://ip:port' strings."""
        proxy_ip_list = []
        table = bs(html, 'html.parser').find(id='ip_list')
        if table is None:
            return proxy_ip_list
        for row in table.find_all('tr'):
            cells = row.select('td')
            # Header rows have no <td>; data rows need at least six cells.
            if len(cells) > 5:
                ip = cells[1].text.strip()
                port = cells[2].text.strip()
                protocol = cells[5].text.strip()
                if protocol.lower() in ('http', 'https'):
                    proxy_ip_list.append(f'{protocol}://{ip}:{port}')
        return proxy_ip_list


    def check_proxy_avaliability(ip):
        """Return True if a request routed through the proxy gets HTTP 200."""
        scheme, _, address = ip.lower().partition('://')
        # requests matches proxies by lowercase target scheme ('http', 'https');
        # uppercase keys are never matched and the request bypasses the proxy.
        # Assumption: the listed protocol is the kind of traffic the proxy
        # carries, and the proxy itself is reached over plain HTTP.
        proxy_url = 'http://' + address
        proxies = {'http': proxy_url, 'https': proxy_url}
        url = scheme + '://www.baidu.com'
        try:
            r = requests.get(url, headers=HEADERS, proxies=proxies, timeout=10)
            r.raise_for_status()
        except requests.RequestException:
            print('error: %s' % ip)
            return False
        if r.status_code == 200:
            print('valid IP: %s' % ip)
            return True
        print('invalid IP: %s' % ip)
        return False


    if __name__ == '__main__':
        url = 'https://www.xicidaili.com/'
        html = get_html(url)
        ips = get_proxy_ip(html) if html else []
        print(ips)
        # Keep only the proxies that pass the availability check.
        use_ip_list = [ip for ip in ips if check_proxy_avaliability(ip)]
        print('valid proxy IPs')
        print(use_ip_list)
    
    

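    Result

    This output was recorded with the keys 'HTTP' and 'HTTPS' in the proxies dict and with the message and the address passed to print as separate arguments, which is why a literal %s appears in each line. requests matches proxies only by lowercase scheme keys ('http', 'https'), so these requests most likely went to Baidu directly instead of through the proxy, which would explain why every address is reported as valid.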

    ['HTTP://120.83.98.192:9999', 'HTTP://115.239.25.244:9999', 'HTTP://120.83.111.221:9999', 'HTTP://1.198.72.153:9999', 'HTTP://180.126.169.6:8118', 'HTTPS://122.137.4.230:8118', 'HTTP://163.204.244.150:9999', 'HTTPS://112.85.170.172:9999', 'HTTPS://112.85.131.34:9999', 'HTTP://112.85.171.8:9999', 'HTTPS://112.85.164.213:9999', 'HTTPS://117.95.98.120:42704', 'HTTP://115.225.49.188:8118', 'HTTPS://119.162.37.165:8118', 'HTTP://123.139.28.36:8118', 'HTTPS://171.80.3.169:9999', 'HTTPS://112.85.164.161:9999', 'HTTPS://115.53.16.222:9999', 'HTTPS://36.99.212.233:9999', 'HTTPS://113.121.46.78:9999', 'HTTPS://221.225.147.58:8118', 'HTTP://222.186.45.145:57273', 'HTTP://121.238.82.201:8118', 'HTTP://114.221.20.254:8118', 'HTTPS://58.250.23.210:1080', 'HTTPS://180.160.139.246:8118', 'HTTPS://112.80.41.86:8888', 'HTTP://218.64.69.79:8080', 'HTTPS://59.38.61.164:9797', 'HTTPS://182.18.13.149:53281', 'HTTP://113.251.221.143:8118', 'HTTP://117.90.5.64:9000', 'HTTPS://125.32.80.52:8080', 'HTTPS://58.247.127.145:53281', 'HTTPS://175.23.40.250:8080', 'HTTP://211.162.70.229:3128', 'HTTPS://182.149.157.168:8118', 'HTTPS://218.22.7.62:53281', 'HTTP://121.79.131.58:8080', 'HTTP://14.115.106.178:808', 'HTTPS://122.137.4.230:8118', 'HTTPS://112.85.170.172:9999', 'HTTPS://112.85.131.34:9999', 'HTTPS://221.225.147.58:8118', 'HTTPS://112.85.164.213:9999', 'HTTPS://117.95.98.120:42704', 'HTTPS://119.162.37.165:8118', 'HTTPS://171.80.3.169:9999', 'HTTPS://112.85.164.161:9999', 'HTTPS://115.53.16.222:9999', 'HTTPS://36.99.212.233:9999', 'HTTPS://113.121.46.78:9999', 'HTTPS://123.163.96.141:9999', 'HTTPS://222.137.4.96:8118', 'HTTPS://115.53.20.2:9999', 'HTTPS://124.94.199.204:9999', 'HTTPS://112.87.71.206:9999', 'HTTPS://120.83.105.77:9999', 'HTTPS://112.85.151.97:9999', 'HTTPS://58.250.23.210:1080', 'HTTP://120.83.98.192:9999', 'HTTP://115.239.25.244:9999', 'HTTP://120.83.111.221:9999', 'HTTP://1.198.72.153:9999', 'HTTP://180.126.169.6:8118', 'HTTP://163.204.244.150:9999', 
'HTTP://112.85.171.8:9999', 'HTTP://115.225.49.188:8118', 'HTTP://222.186.45.145:57273', 'HTTP://123.139.28.36:8118', 'HTTP://112.85.169.44:9999', 'HTTP://121.238.82.201:8118', 'HTTP://1.198.72.48:9999', 'HTTP://112.85.129.140:9999', 'HTTP://171.37.157.61:8123', 'HTTP://119.162.150.192:8118', 'HTTP://114.221.20.254:8118', 'HTTP://49.86.176.110:9999', 'HTTP://114.230.69.201:9999', 'HTTP://112.255.118.2:8118']
    Valid IP, %s HTTP://120.83.98.192:9999
    Valid IP, %s HTTP://115.239.25.244:9999
    Valid IP, %s HTTP://120.83.111.221:9999
    Valid IP, %s HTTP://1.198.72.153:9999
    Valid IP, %s HTTP://180.126.169.6:8118
    Valid IP, %s HTTPS://122.137.4.230:8118
    Valid IP, %s HTTP://163.204.244.150:9999
    Valid IP, %s HTTPS://112.85.170.172:9999
    Valid IP, %s HTTPS://112.85.131.34:9999
    Valid IP, %s HTTP://112.85.171.8:9999
    Valid IP, %s HTTPS://112.85.164.213:9999
    Valid IP, %s HTTPS://117.95.98.120:42704
    Valid IP, %s HTTP://115.225.49.188:8118
    Valid IP, %s HTTPS://119.162.37.165:8118
    Valid IP, %s HTTP://123.139.28.36:8118
    Valid IP, %s HTTPS://171.80.3.169:9999
    Valid IP, %s HTTPS://112.85.164.161:9999
    Valid IP, %s HTTPS://115.53.16.222:9999
    Valid IP, %s HTTPS://36.99.212.233:9999
    Valid IP, %s HTTPS://113.121.46.78:9999
    Valid IP, %s HTTPS://221.225.147.58:8118
    Valid IP, %s HTTP://222.186.45.145:57273
    Valid IP, %s HTTP://121.238.82.201:8118
    Valid IP, %s HTTP://114.221.20.254:8118
    Valid IP, %s HTTPS://58.250.23.210:1080
    Valid IP, %s HTTPS://180.160.139.246:8118
    Valid IP, %s HTTPS://112.80.41.86:8888
    Valid IP, %s HTTP://218.64.69.79:8080
    Valid IP, %s HTTPS://59.38.61.164:9797
    Valid IP, %s HTTPS://182.18.13.149:53281
    Valid IP, %s HTTP://113.251.221.143:8118
    Valid IP, %s HTTP://117.90.5.64:9000
    Valid IP, %s HTTPS://125.32.80.52:8080
    Valid IP, %s HTTPS://58.247.127.145:53281
    Valid IP, %s HTTPS://175.23.40.250:8080
    Valid IP, %s HTTP://211.162.70.229:3128
    Valid IP, %s HTTPS://182.149.157.168:8118
    Valid IP, %s HTTPS://218.22.7.62:53281
    Valid IP, %s HTTP://121.79.131.58:8080
    Valid IP, %s HTTP://14.115.106.178:808
    Valid IP, %s HTTPS://122.137.4.230:8118
    Valid IP, %s HTTPS://112.85.170.172:9999
    Valid IP, %s HTTPS://112.85.131.34:9999
    Valid IP, %s HTTPS://221.225.147.58:8118
    Valid IP, %s HTTPS://112.85.164.213:9999
    Valid IP, %s HTTPS://117.95.98.120:42704
    Valid IP, %s HTTPS://119.162.37.165:8118
    Valid IP, %s HTTPS://171.80.3.169:9999
    Valid IP, %s HTTPS://112.85.164.161:9999
    Valid IP, %s HTTPS://115.53.16.222:9999
    Valid IP, %s HTTPS://36.99.212.233:9999
    Valid IP, %s HTTPS://113.121.46.78:9999
    Valid IP, %s HTTPS://123.163.96.141:9999
    Valid IP, %s HTTPS://222.137.4.96:8118
    Valid IP, %s HTTPS://115.53.20.2:9999
    Valid IP, %s HTTPS://124.94.199.204:9999
    Valid IP, %s HTTPS://112.87.71.206:9999
    Valid IP, %s HTTPS://120.83.105.77:9999
    Valid IP, %s HTTPS://112.85.151.97:9999
    Valid IP, %s HTTPS://58.250.23.210:1080
    Valid IP, %s HTTP://120.83.98.192:9999
    Valid IP, %s HTTP://115.239.25.244:9999
    Valid IP, %s HTTP://120.83.111.221:9999
    Valid IP, %s HTTP://1.198.72.153:9999
    Valid IP, %s HTTP://180.126.169.6:8118
    Valid IP, %s HTTP://163.204.244.150:9999
    Valid IP, %s HTTP://112.85.171.8:9999
    Valid IP, %s HTTP://115.225.49.188:8118
    Valid IP, %s HTTP://222.186.45.145:57273
    Valid IP, %s HTTP://123.139.28.36:8118
    Valid IP, %s HTTP://112.85.169.44:9999
    Valid IP, %s HTTP://121.238.82.201:8118
    Valid IP, %s HTTP://1.198.72.48:9999
    Valid IP, %s HTTP://112.85.129.140:9999
    Valid IP, %s HTTP://171.37.157.61:8123
    Valid IP, %s HTTP://119.162.150.192:8118
    Valid IP, %s HTTP://114.221.20.254:8118
    Valid IP, %s HTTP://49.86.176.110:9999
    Valid IP, %s HTTP://114.230.69.201:9999
    Valid IP, %s HTTP://112.255.118.2:8118
    Valid proxy IPs
    ['HTTP://120.83.98.192:9999', 'HTTP://115.239.25.244:9999', 'HTTP://120.83.111.221:9999', 'HTTP://1.198.72.153:9999', 'HTTP://180.126.169.6:8118', 'HTTPS://122.137.4.230:8118', 'HTTP://163.204.244.150:9999', 'HTTPS://112.85.170.172:9999', 'HTTPS://112.85.131.34:9999', 'HTTP://112.85.171.8:9999', 'HTTPS://112.85.164.213:9999', 'HTTPS://117.95.98.120:42704', 'HTTP://115.225.49.188:8118', 'HTTPS://119.162.37.165:8118', 'HTTP://123.139.28.36:8118', 'HTTPS://171.80.3.169:9999', 'HTTPS://112.85.164.161:9999', 'HTTPS://115.53.16.222:9999', 'HTTPS://36.99.212.233:9999', 'HTTPS://113.121.46.78:9999', 'HTTPS://221.225.147.58:8118', 'HTTP://222.186.45.145:57273', 'HTTP://121.238.82.201:8118', 'HTTP://114.221.20.254:8118', 'HTTPS://58.250.23.210:1080', 'HTTPS://180.160.139.246:8118', 'HTTPS://112.80.41.86:8888', 'HTTP://218.64.69.79:8080', 'HTTPS://59.38.61.164:9797', 'HTTPS://182.18.13.149:53281', 'HTTP://113.251.221.143:8118', 'HTTP://117.90.5.64:9000', 'HTTPS://125.32.80.52:8080', 'HTTPS://58.247.127.145:53281', 'HTTPS://175.23.40.250:8080', 'HTTP://211.162.70.229:3128', 'HTTPS://182.149.157.168:8118', 'HTTPS://218.22.7.62:53281', 'HTTP://121.79.131.58:8080', 'HTTP://14.115.106.178:808', 'HTTPS://122.137.4.230:8118', 'HTTPS://112.85.170.172:9999', 'HTTPS://112.85.131.34:9999', 'HTTPS://221.225.147.58:8118', 'HTTPS://112.85.164.213:9999', 'HTTPS://117.95.98.120:42704', 'HTTPS://119.162.37.165:8118', 'HTTPS://171.80.3.169:9999', 'HTTPS://112.85.164.161:9999', 'HTTPS://115.53.16.222:9999', 'HTTPS://36.99.212.233:9999', 'HTTPS://113.121.46.78:9999', 'HTTPS://123.163.96.141:9999', 'HTTPS://222.137.4.96:8118', 'HTTPS://115.53.20.2:9999', 'HTTPS://124.94.199.204:9999', 'HTTPS://112.87.71.206:9999', 'HTTPS://120.83.105.77:9999', 'HTTPS://112.85.151.97:9999', 'HTTPS://58.250.23.210:1080', 'HTTP://120.83.98.192:9999', 'HTTP://115.239.25.244:9999', 'HTTP://120.83.111.221:9999', 'HTTP://1.198.72.153:9999', 'HTTP://180.126.169.6:8118', 'HTTP://163.204.244.150:9999', 
'HTTP://112.85.171.8:9999', 'HTTP://115.225.49.188:8118', 'HTTP://222.186.45.145:57273', 'HTTP://123.139.28.36:8118', 'HTTP://112.85.169.44:9999', 'HTTP://121.238.82.201:8118', 'HTTP://1.198.72.48:9999', 'HTTP://112.85.129.140:9999', 'HTTP://171.37.157.61:8123', 'HTTP://119.162.150.192:8118', 'HTTP://114.221.20.254:8118', 'HTTP://49.86.176.110:9999', 'HTTP://114.230.69.201:9999', 'HTTP://112.255.118.2:8118']
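
    The goal stated at the top is to build your own proxy pool, and the list printed above contains repeated entries. The sketch below is one possible way to hold the validated proxies: it removes duplicates, hands out a random proxy, evicts a proxy after repeated failures, and saves the pool as JSON. The class name ProxyPool, its methods, and the max_failures threshold are hypothetical and not from the original article.

```python
import json
import random


class ProxyPool:
    """Keep validated proxies, hand out random ones, evict repeated failures."""

    def __init__(self, proxies=(), max_failures=3):
        # A dict keyed by proxy removes duplicates and tracks failure counts.
        self._failures = {proxy: 0 for proxy in proxies}
        self.max_failures = max_failures

    def __len__(self):
        return len(self._failures)

    def get(self):
        """Return a randomly chosen proxy from the pool."""
        if not self._failures:
            raise LookupError('proxy pool is empty')
        return random.choice(list(self._failures))

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        if proxy in self._failures:
            self._failures[proxy] = 0

    def report_failure(self, proxy):
        """Count a failure and drop the proxy once it reaches max_failures."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self.max_failures:
            del self._failures[proxy]

    def save(self, path):
        """Write the current proxies to a JSON file."""
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(list(self._failures), f)

    @classmethod
    def load(cls, path, max_failures=3):
        """Rebuild a pool from a JSON file written by save()."""
        with open(path, encoding='utf-8') as f:
            return cls(json.load(f), max_failures=max_failures)
```

    With the script above, the pool could be seeded from the validated list, for example ProxyPool(use_ip_list), and each crawl request would then call get() and report the outcome back to the pool.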
    

    References:

    1. https://blog.csdn.net/weixin_43720396/article/details/88218204
    2. https://desmonday.github.io/2019/03/06/python%E7%88%AC%E8%99%AB%E5%AD%A6%E4%B9%A0-day6-IP%E4%BB%A3%E7%90%86/

    PS: If you found this okay, decent, passable, or even just not too bad, feel free to "follow". Thanks in advance!
