Crawling Proxies & Verifying Them with Telnet

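In the last section we walked through how the Scrapy framework runs as a whole. In this section we crawl free proxies from the web and check whether they actually work.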

Creating the spider

Run this on the command line:

    scrapy genspider <name> <url>    # <name> is the name of the spider file
    For example: scrapy genspider proxy www.xicidaili.com/wt/1

The result is shown in the screenshot below:

image.png

Parsing the response with XPath

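Page content is usually parsed with regular expressions, bs4, or XPath. If you are really good at regex, go ahead and parse with it directly. Between bs4 and XPath, I personally find XPath more intuitive.
Below we use XPath to parse the response. The response is passed to a callback, which is parse by default (you can also define your own):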

# parse() is a method of the generated spider class.
# Requires: from scrapy.selector import Selector
def parse(self, response):
    sel = Selector(response)
    tr_list = sel.xpath('//table[@id="ip_list"]/tr')
    for row in tr_list[2:]:
        cells = row.xpath('td')
        # Skip rows that contain no <td> cells
        if not cells:
            continue
        origin_ip = cells[1].xpath('text()').extract()[0]
        port = cells[2].xpath('text()').extract()[0]
        proxy_type = cells[5].xpath('text()').extract()[0].lower()
        # Only keep http and https proxies
        if proxy_type in ('http', 'https'):
            item = ProxycrawlerItem()
            item['ip'] = proxy_type + '://' + origin_ip + ':' + port
            item['port'] = port
            item['type'] = proxy_type
            item['origin_ip'] = origin_ip
            # Yield the item only if the proxy passes the check
            if self.telnet(item):
                yield item
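
To try the row-and-cell logic above without running a crawl, here is a minimal sketch that uses only the standard library. It relies on the limited XPath subset in xml.etree.ElementTree rather than Scrapy's Selector, and it assumes well-formed markup. The SAMPLE fragment, its column layout, and the parse_rows helper are made up for illustration and are not taken from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the proxy table; the addresses are placeholders.
SAMPLE = """
<html><body>
<table id="ip_list">
  <tr><th>flag</th><th>ip</th><th>port</th><th>city</th><th>anon</th><th>type</th></tr>
  <tr><td></td><td>192.0.2.10</td><td>8080</td><td>x</td><td>high</td><td>HTTP</td></tr>
  <tr><td></td><td>192.0.2.11</td><td>1080</td><td>x</td><td>high</td><td>socks4/5</td></tr>
  <tr><td></td><td>192.0.2.12</td><td>443</td><td>x</td><td>high</td><td>HTTPS</td></tr>
</table>
</body></html>
"""


def parse_rows(markup):
    """Return proxy URLs for the http/https rows of the ip_list table."""
    root = ET.fromstring(markup.strip())
    proxies = []
    for row in root.findall(".//table[@id='ip_list']/tr"):
        cells = row.findall("td")
        if not cells:  # the header row has <th> cells only
            continue
        origin_ip = cells[1].text
        port = cells[2].text
        proxy_type = cells[5].text.lower()
        # Only keep http and https proxies, as in parse()
        if proxy_type in ("http", "https"):
            proxies.append(proxy_type + "://" + origin_ip + ":" + port)
    return proxies


proxies = parse_rows(SAMPLE)
print(proxies)
```

Real pages are rarely well-formed XML, which is one reason to stay with Scrapy's Selector in the spider itself. The sketch is only meant to make the column indexes (1 for the IP, 2 for the port, 5 for the type) easy to experiment with.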

Using Telnet the right way

So how do we check that a proxy works? Online you will often see people do it like this:

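import requests

try:
    # Fetch any page through the proxy; an exception means the proxy failed
    requests.get('http://wenshu.court.gov.cn/', proxies={"http": "http://121.31.154.12:8123"})
except requests.RequestException:
    print('connect failed')
else:
    print('success')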

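That is, use the proxy to open some arbitrary page (Baidu, Taobao, and so on): if the page opens, the proxy is considered valid, and if it fails, invalid. This approach has always felt a bit crude to me, so I thought of using Telnet to check the proxy directly. The code is as follows: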

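# Requires: import telnetlib
def telnet(self, item):
    try:
        # Open a connection to the proxy's IP and port, waiting at most 10 seconds
        conn = telnetlib.Telnet(item['origin_ip'], port=item['port'], timeout=10.0)
    except OSError:
        print('connect failure')
        return False
    else:
        # Close the connection so checked proxies do not leak sockets
        conn.close()
        print('connect success')
        return True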
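
The telnetlib module was deprecated in Python 3.11 and removed in Python 3.13, so on newer interpreters the check above needs a replacement. Below is a minimal sketch of the same idea using the socket module. The port_is_open name and the local throwaway listener used for the demo are assumptions made for illustration. Like the Telnet check, it only confirms that the port accepts TCP connections.

```python
import socket


def port_is_open(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # int() accepts the port as a string, which is how it is scraped from the page
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and DNS failures;
        # ValueError covers a port that is not a number
        return False


# Demo against a throwaway listener on the local machine (not a real proxy)
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
open_port = server.getsockname()[1]

result_open = port_is_open("127.0.0.1", str(open_port), timeout=2.0)
server.close()
result_closed = port_is_open("127.0.0.1", str(open_port), timeout=2.0)

print(result_open, result_closed)
```

Inside the spider, the telnet method could call something like port_is_open(item['origin_ip'], item['port']) and return its result.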