Crawler Examples
(1) Changing the User-Agent
When crawling, the default user-agent is 'python-requests/2.18.4':
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        print(r.request.headers)       # show the headers actually sent
        r.raise_for_status()           # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding
        return r.text[500:800]
    except requests.RequestException:
        return "wrong connection..."

if __name__ == "__main__":
    this_url = "http://www.amazon.cn/gp/product/B01M8L5Z3Y"
    print(getHTMLText(this_url))
{'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
ue = {};
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].
The site may recognize from the user-agent that the request comes from a crawler rather than a normal user; in that case you can change the user-agent before crawling:
kv = {'user-agent':'Mozilla/5.0'}
r = requests.get(url, headers = kv)
print(r.request.headers)
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
Here the user-agent is replaced with Mozilla/5.0 (or another browser's string), and the replacement is made in the headers of the get request.
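The effect of the headers argument can be checked without hitting the site at all, by preparing the request and inspecting the headers it would send (a minimal sketch; the product URL is the one from above):

```python
import requests

# Prepare (but do not send) a GET request with a custom User-Agent,
# then inspect the headers that would go over the wire.
kv = {'user-agent': 'Mozilla/5.0'}
req = requests.Request('GET', 'http://www.amazon.cn/gp/product/B01M8L5Z3Y', headers=kv)
prepared = req.prepare()
print(prepared.headers['user-agent'])  # Mozilla/5.0
```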
(2) Searching by Keyword
Baidu's keyword-search interface:
http://www.baidu.com/s?wd=keyword
360 Search's keyword interface:
http://www.so.com/s?q=keyword
kv = {'wd':'python'}
r = requests.get(url, params = kv)
Note that here the keyword is passed via params. Also, when printing the result, watch the length of the returned data — it can be very large and may need further filtering.
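How params is folded into the final URL can likewise be seen by preparing the request without sending it (a sketch using the Baidu interface above):

```python
import requests

# `params` is URL-encoded into the query string of the final URL.
kv = {'wd': 'python'}
req = requests.Request('GET', 'http://www.baidu.com/s', params=kv).prepare()
print(req.url)  # http://www.baidu.com/s?wd=python
```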
(3) Crawling and Saving Images
Format of an image URL on the web: http://www.example.com/picture.jpg
import requests
import os

url = "http://i0.hdslb.com/bfs/article/7cdff66e4d44de434f7096fcda11f05505a8f831.jpg"
root = "./pics/"
kv = {'user-agent': 'Mozilla/5.0'}
path = root + url.split('/')[-1]   # reuse the file name from the URL
print(path)
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url, headers=kv)
        print(r.status_code)
        with open(path, 'wb') as f:   # binary mode for image data
            f.write(r.content)
        print("File saved successfully.")
    else:
        print("File already exists.")
except Exception:
    print("Failed.")
(4) IP Address Geolocation Lookup
IP138 can be used to look up the region an IP address belongs to:
url = 'http://m.ip138.com/ip.asp?ip='+ipaddress
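A minimal sketch of the lookup, keeping the same user-agent trick and a timeout (the example IP is hypothetical, and the slice assumes the result sits near the end of the returned page):

```python
import requests

ipaddress = '202.204.80.112'  # a hypothetical example IP
url = 'http://m.ip138.com/ip.asp?ip=' + ipaddress
try:
    # Use a browser-like user-agent here too, plus a timeout.
    r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'}, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # print the tail of the page, where the result appears
except requests.RequestException:
    print("Failed.")
```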