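UA is short for User Agent. Put simply, it describes the client's environment, for example: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1
It is a special string sent as a request header that lets the server identify the client's operating system and version, CPU type, browser and version, rendering engine, browser language, browser plugins, and so on.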
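That, however, is only the ua parameter in the general sense. In practice, the ua identifies a terminal, while an ID identifies a user. As web security has matured, the ua is often no longer just a string of environment information: it has become a terminal-identification algorithm that developers build separately and ship as a standalone JS file.
When sending requests, we usually attach a User-Agent entry to the headers so that the backend does not flag the request as coming from a crawler. As security has tightened, though, some large sites no longer rely on the ua in the request header. Instead, a separate JS file generates a ua parameter locally in the browser and sends it to the server, which uses it to identify the current terminal. This clearly makes crawling harder.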
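As a minimal sketch of the header-based approach described above, the snippet below attaches a browser-like User-Agent to a request. It uses the standard library's `urllib.request` so that it runs without extra dependencies; with `requests`, the same dictionary would be passed as `headers=`. The URL `https://example.com/` and the helper name `build_request` are placeholders chosen for illustration, not something from the sites discussed here.

```python
import urllib.request

# A desktop Chrome User-Agent string; any real browser's value would do.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36"
)


def build_request(url: str) -> urllib.request.Request:
    """Return a request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


# Placeholder target; nothing is sent until urlopen() is called on the request.
req = build_request("https://example.com/")

# urllib stores header names as "User-agent" internally.
print(req.get_header("User-agent"))
```

This only covers sites that check the request header. It does nothing against a ua parameter computed by JS in the page.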
Printing the ua in the browser console:
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">> navigator.userAgent
< "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5603.400 QQBrowser/10.1.1775.400"
</pre>
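To show how a server might read the fields mentioned earlier out of such a string, here is a rough sketch that splits the value printed above into its platform part and its product/version tokens. The function name `parse_user_agent` and the two regular expressions are illustrative assumptions; real detection libraries are far more thorough.

```python
import re

# The value printed by navigator.userAgent in the console output above.
ua = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 "
    "Core/1.63.5603.400 QQBrowser/10.1.1775.400"
)


def parse_user_agent(ua_string):
    """Split a User-Agent string into a platform part and product versions."""
    # The first parenthesised group normally describes the OS / platform.
    platform_match = re.search(r"\(([^)]*)\)", ua_string)
    platform = platform_match.group(1) if platform_match else ""
    # Every "Name/1.2.3" token is a product with its version.
    products = dict(re.findall(r"([A-Za-z]+)/([\d.]+)", ua_string))
    return {"platform": platform, "products": products}


info = parse_user_agent(ua)
print(info["platform"])
print(info["products"])
```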
Taobao's ua:
When you visit Taobao, the login page automatically sends a POST request like the following:
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Request URL:https://login.taobao.com/member/request_nick_check.do?_input_charset=utf-8
Request Method:POST
Status Code:200
Remote Address:203.119.244.126:443
Referrer Policy:no-referrer-when-downgrade
</pre>
The form data it carries is as follows:
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">username:
ua:115#1YaWif1O1TN2HmW0G1If1Csof11GBwAa1g2u169nlCbF4vNoOMeF81dnEX11hB8GyzeUhsfPe1L8ykwUV7WJhUU4AWNcaTpbyW1QOSAyeK1G1MKpITibhZz8AWd2aLBXOZrQbvhletT8DkNQiQWJhZz8OWaKaTBfuzFQASvHeK/8uWZpg8Xg6LSHEZNSZTM61JHJdPYMTNTWAFt3Pjv3KQLEb3zEdlttBasGXGXfBq9FWBuF2Yh0wf6IeMfFKAV77t2HLbmhFm5ZtK1uFY2LtxsV1ywVAHMKmrP6gwS9b6n/nGUPwB7tx7C2P0UEYw5HvMUxcok+2cmBC5jmBODKipmTgIg9MKKok8CzoUMoieKwoUBra+tqBaFIyCs2a0OGVgPuoWKWM24WGi7jrtLVwoOnFMBbg4CdXX/3SSYKSLNwq05OV3Xz0/zlGS5F2/7mAbMBqbFUWBtvszktk3q5SREAXuRfdKnEY4KelWeiP4hUQdX+nl+ipyO/Lt49lTvDmiWMzKTNHmYwtFsQLNiA/c0FbTm5f9LJOKkcIx6iOEPnT+gGCYwUsU34FdYEDD2Xvh5LtRfdSixmouqlEU9i48W2B3gake3F28wwFwsHGveJtCMs1Ec/LJYU1cI9RuA6OE9BHJky89ZlgIHHeDa+fU/y1IMXKrIGq8zVWNmQS/IUnQ90KkByvfHBJfOt7l2pGVYNVSgBhVAhk7+HnHUHuyd7S1l5jxnMnbU8aqr/7vkNWONmrb9+tEot/KjQRQMG4SiVIMdKMf08FgqQjMn6TrFgZCBa/Pz4nFru5VCAjiKVYPwPYzs+lBf1YF2xBxk7S5emNkmzClyV0Mt+NPKx
</pre>
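The ua parameter is a very long string that at first glance looks like RSA ciphertext, and it changes constantly. Printing it twice in a row in the console gives two values in which only the first few characters match; the rest is completely different. The value lives in a variable on the window object and is updated by a background method. I have not yet worked out how it is computed, but a reasonable guess is that it depends on the local environment, on the user's identity, and on whether verification has been passed.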
Another example
The site is https://www.pm9b.com/. Look at its POST parameters:
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">method: login
values: 18328496888<a>666666<a>3ff70d13213246e3c19a6b2d1161d29a
type: Web
lang: zh-CN
</pre>
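The values field is the account and password followed by a fixed string, 3ff70d13213246e3c19a6b2d1161d29a. This string is not an MD5 of the password but a value computed by the page's JS. The JS that generates this field is: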
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">var i = this;
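this.fontsKey(t, function(t) {
    var n = [];
    // Collect every component value; array values are flattened with ";"
    i.each(t, function(e) {
        var t = e.value;
        void 0 !== e.value.join && (t = e.value.join(";")),
        n.push(t)
    });
    // Join all values with "~~~" and compute a 128-bit hash with seed 31
    var r = i.x64hash128(n.join("~~~"), 31);
    // (excerpt ends here)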
</pre>
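Here this is a global object, in this case equal to window, which holds the global methods and variables. The t parameter in the JS is made up of a collection of environment information, which is then hashed. So this site's ua value is generated from environment variables alone, with no account information mixed in, which means we can simply substitute a fixed value for it.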
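The flow in the JS excerpt (flatten each component, join with "~~~", hash) can be sketched in Python as follows. This is only an illustration under stated assumptions: `hashlib.md5` is used as a stand-in because the standard library has no equivalent of the site's `x64hash128`, so the digests will not match the site's. The component names, the sample values, and the helper `build_login_form` are hypothetical. Only the field names and the `<a>` separator are taken from the POST parameters shown earlier.

```python
import hashlib


def fingerprint(components):
    """Hash a dict of environment values into a 32-character hex string."""
    values = []
    for value in components.values():
        # Array-like values are flattened with ";", as in the JS excerpt.
        if isinstance(value, (list, tuple)):
            value = ";".join(str(v) for v in value)
        values.append(str(value))
    joined = "~~~".join(values)
    # Stand-in hash (assumption): the site uses x64hash128 with seed 31.
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def build_login_form(account, password, fp):
    """Assemble a form shaped like the POST parameters shown earlier."""
    return {
        "method": "login",
        "values": "<a>".join([account, password, fp]),
        "type": "Web",
        "lang": "zh-CN",
    }


# Hypothetical environment components, for illustration only.
env = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)",
    "language": "zh-CN",
    "screen_resolution": [1920, 1080],
    "fonts": ["Arial", "SimSun"],
}

fp = fingerprint(env)
form = build_login_form("ACCOUNT", "PASSWORD", fp)
print(form["values"])
```

Because the result depends only on the environment, the same machine always produces the same string. That is why a crawler can reuse one captured value instead of recomputing it.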
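The point of this article is to explain how ua information is used in anti-crawler mechanisms. Taobao's ua algorithm is the hardest one I have come across so far. The ua is effectively a terminal's ID card, and the server can use that identity to filter out many crawlers.
Two libraries that generate random User-Agent strings are also worth recommending: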
fake_useragent
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from fake_useragent import UserAgent
ua = UserAgent()
# Each attribute access returns a random string; the comments show sample output.
ua.ie
# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US);
ua.msie
# Mozilla/5.0 (compatible; MSIE 10.0; Macintosh; Intel Mac OS X 10_7_3; Trident/6.0)
ua.opera
# Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11
</pre>
faker
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from faker import Faker
ua = Faker()
ua.user_agent()
Out[24]: 'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_9_1; rv:1.9.2.20) Gecko/2016-12-13 00:29:52 Firefox/4.0'
ua.user_agent()
Out[25]: 'Mozilla/5.0 (Windows; U; Windows 98) AppleWebKit/531.46.5 (KHTML, like Gecko) Version/4.0 Safari/531.46.5'
ua.chrome()
Out[26]: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/5362 (KHTML, like Gecko) Chrome/60.0.850.0 Safari/5362'
</pre>
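Note: specify the browser when generating a ua. Because of page compatibility handling, a site may return different source for different browser headers. Also, in my test only a little over 50 distinct ua strings remained after deduplication.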
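One way to follow both points in the note is to keep your own deduplicated pool and pick only from entries that belong to one browser family. The sketch below uses only the standard library. The pool entries are strings that already appear in this article, and the helper name `pick_user_agent` and the substring-based filter are assumptions made for illustration.

```python
import random

# Sample pool built from strings shown in this article; one entry is duplicated.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5603.400 QQBrowser/10.1.1775.400",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/5362 (KHTML, like Gecko) "
    "Chrome/60.0.850.0 Safari/5362",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/533.21.1 "
    "(KHTML, like Gecko) Version/5.0.5 Safari/533.21.1",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/5362 (KHTML, like Gecko) "
    "Chrome/60.0.850.0 Safari/5362",
]


def pick_user_agent(pool, browser_token, rng=random):
    """Pick a random UA containing browser_token from the deduplicated pool."""
    # A set removes duplicates; sorting keeps the candidate order stable.
    candidates = sorted({ua for ua in pool if browser_token in ua})
    if not candidates:
        raise ValueError("no User-Agent in the pool matches " + browser_token)
    return rng.choice(candidates)


unique_count = len(set(UA_POOL))
chosen = pick_user_agent(UA_POOL, "Chrome")
print(unique_count, chosen)
```

Filtering on a plain substring is crude, but it keeps every request within one browser family, so the pages returned stay comparable.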