Crawling a proxy site, validating the proxy IPs, and writing the results to a file
Main takeaways: building request headers, making proxied requests with requests, and multithreading.
1. xicidaili checks the request headers: with a browser-like headers dict it returns a success status code, otherwise a 5xx error (a minimal sketch follows this list).
2. The CSS select kept returning empty lists at first; after a lot of trial and error it finally worked, and the failures taught me quite a bit (a more tolerant selector is sketched after the code below).
3. Validate the proxy IPs concurrently, using baidu as the test URL, since Baidu can take the load. Don't crawl the proxy site itself with multiple threads, or the IP gets banned quickly (a thread-pool variant of the validation step is sketched at the end).
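To make point 1 concrete, here is a minimal sketch that fetches the same list page twice, once without headers and once with the browser User-Agent used in the full code below (the exact 5xx code returned without headers may vary):

import requests

page = "http://www.xicidaili.com/nn/1"
ua = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}

bare = requests.get(page, timeout=5)                 # no headers: the site answers with a 5xx error
masked = requests.get(page, timeout=5, headers=ua)   # with a browser User-Agent: 200
print(bare.status_code, masked.status_code)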
import time
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

proxies = []
urls = ["http://www.xicidaili.com/nn/{}".format(i) for i in range(1, 1000)]

def test_ip(proxy):
    # Use Baidu as the test target; a proxy that fetches it within 3 seconds is kept.
    url = "http://www.baidu.com"
    try:
        res = requests.get(url, timeout=3, proxies={'http': proxy})
        if res.status_code != 200:
            print(proxy + " failed")
        else:
            print(proxy + " ok")
            # Append working proxies to the result file.
            with open('ip_pool.csv', 'a', encoding='utf-8') as f:
                f.write(proxy + "\n")
    except requests.RequestException:
        print(proxy + " timeout")

def get_ip(url):
    # Without a browser-like User-Agent, xicidaili answers with a 5xx error.
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
    }
    html = requests.get(url, headers=head)
    soup = BeautifulSoup(html.text, 'lxml')
    # Column 2 holds the IP and column 3 the port; only rows with class "odd" are matched here.
    ips = soup.select('#ip_list > tr.odd > td:nth-of-type(2)')
    ports = soup.select('#ip_list > tr.odd > td:nth-of-type(3)')
    for ip, port in zip(ips, ports):
        proxie = ip.get_text() + ":" + port.get_text()
        proxies.append(proxie)
    return proxies

if __name__ == "__main__":
    # Crawl the list pages one by one; crawling them concurrently gets the IP banned.
    for url in urls:
        get_ip(url)
    # Validate the collected proxies with 8 worker processes.
    pool = Pool(processes=8)
    pool.map(test_ip, proxies)
    time.sleep(1)
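On the empty-list problem from point 2: the child combinator in '#ip_list > tr.odd' breaks whenever the parser wraps the rows in a tbody, and matching only tr.odd skips every other row. A more tolerant sketch, assuming the table keeps its ip_list id and the IP and port stay in columns 2 and 3:

def get_ip_tolerant(soup):
    # Descendant combinator works with or without an implicit <tbody>;
    # rows with fewer than 3 cells (e.g. the header row) are skipped.
    result = []
    for row in soup.select('#ip_list tr'):
        cells = row.find_all('td')
        if len(cells) < 3:
            continue
        result.append(cells[1].get_text(strip=True) + ":" + cells[2].get_text(strip=True))
    return result

This drops in for the two select calls in get_ip above, taking the same soup object.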
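On point 3: the validation step above uses a process pool, but test_ip is purely I/O-bound, so a thread pool does the same job with less overhead. A sketch using multiprocessing.dummy, which exposes the same Pool API backed by threads (reusing test_ip from above):

from multiprocessing.dummy import Pool as ThreadPool

def check_all(proxy_list, workers=16):
    # Same call shape as the process pool above, but with threads.
    pool = ThreadPool(workers)
    pool.map(test_ip, proxy_list)
    pool.close()
    pool.join()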
The working proxies end up appended to ip_pool.csv.