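First came Guazi, then Renrenche; used cars are hot right now. A broke guy like Brother G wants a new car, doesn't want just a grocery-getter, and can't afford the payments on anything fancier... Can't afford to buy, but we can afford to look, so today Brother G shares how to scrape used-car listings from the Renrenche site.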
Analyzing the requests
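Open the Renrenche used-car list URL: https://www.renrenche.com/dg/ershouche/p1/. At first glance, neither the list page nor the detail page loads its data via ajax, so extracting the fields with XPath is enough.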
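Here is a minimal sketch of the list-page step: build the paginated list URL and collect detail-page links from the HTML. The post uses XPath; this sketch swaps in Python's standard-library html.parser so it runs without extra packages. The link pattern is an assumption based on the single example URL in this post, the names list_page_url and extract_detail_links are made up for illustration, and the HTML fragment is invented, not copied from the site.

```python
from html.parser import HTMLParser
import re

LIST_URL = "https://www.renrenche.com/dg/ershouche/p{page}/"
# Assumed shape of a detail link, inferred from the one example URL in the post.
DETAIL_RE = re.compile(r"^/[a-z]+/car/[0-9a-f]+$")


class DetailLinkParser(HTMLParser):
    """Collects hrefs of <a> tags that look like detail-page links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep each matching link once, in page order.
        if DETAIL_RE.match(href) and href not in self.links:
            self.links.append(href)


def list_page_url(page):
    """Return the list URL for a given page number (p1, p2, ...)."""
    return LIST_URL.format(page=page)


def extract_detail_links(html):
    """Return absolute detail-page URLs found in a list page's HTML."""
    parser = DetailLinkParser()
    parser.feed(html)
    return ["https://www.renrenche.com" + href for href in parser.links]


# Invented list-page fragment, only for illustration.
sample = """
<ul>
  <li><a href="/dg/car/86ed37707e9fd266">car A</a></li>
  <li><a href="/dg/ershouche/p2/">next page</a></li>
</ul>
"""
print(list_page_url(1))
print(extract_detail_links(sample))
```

With a real XPath library the same filter would just be a selector on the anchor's href; the structure of the loop stays the same.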
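Detail page URLs look like https://www.renrenche.com/dg/car/86ed37707e9fd266. The 86ed37707e9fd266 part is not the listing ID; only after opening the detail page do you find out it is just a hash, and the real ID is inside, something like plog_id=e32c532242267dafd73d27f2554ff68b. Without browser rendering this ID does not show up in the source, so there seems to be some anti-scraping at work, but a scraper built on a browser engine can basically handle it.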
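Once you have the rendered page source from whatever browser-engine scraper you use, pulling out the ID is a one-line regex. A rough sketch, assuming the ID always appears as plog_id= followed by hex characters (only one example is shown in this post, so that is a guess); find_plog_id and the sample markup are made up for illustration.

```python
import re

# Assumption: the ID is a run of lowercase hex characters after "plog_id=".
PLOG_ID_RE = re.compile(r"plog_id=([0-9a-f]+)")


def find_plog_id(rendered_html):
    """Return the first plog_id found in the rendered source, or None."""
    match = PLOG_ID_RE.search(rendered_html)
    return match.group(1) if match else None


# Invented snippet of rendered source, only for illustration.
rendered = '<a href="/some/path?plog_id=e32c532242267dafd73d27f2554ff68b&x=1">detail</a>'
print(find_plog_id(rendered))
```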
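The tricky part
On the detail page nothing looks off until you view the source. Open the source and you'll find it has been damn well "poisoned": the data in the source does not match what is displayed. Looks like another anti-scraping measure, one Brother G hadn't seen before.
(Figure: the source values have been replaced)
Still, the data does display correctly on the page, and even if it were decoded dynamically the decoded values should show up in the source, but they don't. Odd. A closer look turns up a CSS rule like this: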
.rrcttf71b47a07ccdb90665a3223e59c4229ef {font-family: "rrc-font-defender"!important;}
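Clearly a font is being loaded. After removing the style, Brother G found that the displayed content reverted to what's in the source, so the font is the culprit...
So Brother G pulled the font download link out of the source and opened it: it turns out to be the data of a normal font with its order switched around. Pretty creative...
(Figure: the forged font)
So, as you've probably figured out, replacing each digit in the source with the digit from its original position restores the real data. Brother G won't paste the actual implementation. Here is the content scraped after restoring: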
{
"title":"起亚-K22012款三厢1.4LMTGLS纪念版",
"carId":"dg-231924",
"city":"杭州",
"price":"3.50万",
"newPrice":"8.12万",
"date":"2012-12上牌",
"kilometer":"11.86万公里",
"brandCity":"佛山",
"paifang":"国四",
"biansu":"手动",
"guohu":"0次",
"pics":[
"https://img2.rrcimg.com/o_1cnolrudj3113211902293183393121012.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrudj145262648158253860331966.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrudj71011087714907518699524.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrudj198295900417354428133584.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj109444063521438774942632.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrum4362691854917893134488223.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj161644419743836183614430.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrudj5826345003128522969362695.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj965434405219005879416815.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru524759614512511962104634906.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru523367464186131772801956264.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru52739139589489402008342976.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj3332128392229513784042747.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru5222626768739545825248120.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrudj19714756326776531421641.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru522249013949175493526945292.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru523978035522291331000523160.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj437894446390509916389.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru5212795616610592449221733.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru5246982248016485236216632.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj575691765516907624326418.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrum427273459346636210929321.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrudj351119170464826541064386.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru52246155187638090413985516.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj930121502225294339324993.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj5147449930488381850634406.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolru5229936196416444351355027.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrudj218814199358833593757409.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrum43123747882273722735629842.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrudj20420509143883089758915.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj5891364654435723885040029.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrtjj4577436231191193009644904.jpg?imageView/4/w/600/h/400",
"https://img2.rrcimg.com/o_1cnolrudj263646010614971198636093.jpg?imageView/4/w/600/h/400"
]
}
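The digit restoration described before the JSON output is not shown in this post, but the idea can be sketched as a plain character substitution. This is only an illustration under stated assumptions: the mapping below is hypothetical (a straight reversal of 0 to 9), the real one has to be read out of the downloaded font file, and the names build_digit_table and restore are made up.

```python
def build_digit_table(displayed_order):
    """Build a translation table from source digits to displayed digits.

    displayed_order[i] is the digit the custom font draws when the
    page source contains the digit i.
    """
    if sorted(displayed_order) != list("0123456789"):
        raise ValueError("expected a permutation of the ten digits")
    return str.maketrans("0123456789", displayed_order)


def restore(text, table):
    """Replace every obfuscated digit; other characters pass through unchanged."""
    return text.translate(table)


# Hypothetical mapping: source "0" is drawn as "9", "1" as "8", and so on.
# The actual order must come from the font the page loads.
HYPOTHETICAL_ORDER = "9876543210"
table = build_digit_table(HYPOTHETICAL_ORDER)
print(restore("6.49万", table))
```

If the site serves a different font per page or per day, the table would need to be rebuilt for each font; nothing in this post says whether it does.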
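That's basically the hard part of Renrenche's anti-scraping: even if you scrape through a rendering browser you won't get the correct data; only after decoding and restoring do you get the real values. peace~
This post is for technical exchange about crawlers only. Do not use it for anything illegal; you bear the consequences, and the author is not responsible.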