Scraping Lagou's Asynchronously Loaded Pages with Python

Author: 大象爱着丁小姐 | Published: 2018-01-06 16:36


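Below is a simple walkthrough of how I fetched data from Lagou's asynchronously loaded pages.
It collects every Python job posting in Shenzhen and saves the results to MongoDB.
(On recognizing asynchronous loading: some people suggest downloading the whole page you want to scrape, saving it locally, and checking whether the data you need is in it. If it is, the page is not loaded asynchronously; if it is missing, it is. That approach works fine. My own test is different: if I click somewhere on the page and the URL stays the same while the content changes, I call that asynchronous loading. There are many ways to load content asynchronously, so each site has to be analyzed on its own terms.)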
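The first check described above (download the raw page and see whether the data is in it) can be sketched roughly as follows. This is only an illustration, not the author's code: the function names `appears_in_static_html` and `fetch_html` are hypothetical, and the two HTML snippets are hand-written stand-ins rather than real Lagou markup.

```python
def appears_in_static_html(html, marker):
    """Return True if `marker` occurs in the raw HTML source.

    False suggests the content is filled in later by JavaScript.
    """
    return marker in html


def fetch_html(url, headers=None):
    """Download the page source as the server sends it (no JavaScript is run)."""
    import requests  # imported lazily so the check above has no dependencies

    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text


# Offline demonstration with two hand-written snippets (not real Lagou markup).
static_page = '<ul><li>Python Engineer</li></ul>'
dynamic_page = '<ul id="jobs"></ul><script src="app.js"></script>'
print(appears_in_static_html(static_page, 'Python Engineer'))   # True
print(appears_in_static_html(dynamic_page, 'Python Engineer'))  # False
```

In practice you would pass the result of `fetch_html(...)` for the listing page, with a string you can see in the browser as the marker.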
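All of this could easily be wrapped into classes, each with its own responsibility (which leads naturally to the Scrapy framework). I will post a tutorial on scraping with Scrapy later.
Selenium with PhantomJS is, of course, another option.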
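As a rough idea of what "wrapping it into classes" might look like, here is a minimal sketch. Every class and function name in it (`JobFetcher`, `MemoryStore`, `Crawler`, `fake_post`) is hypothetical, and the HTTP call is replaced by a stand-in so the example runs offline. A real version would pass in a function built on `requests.post` and a store that writes to MongoDB.

```python
class JobFetcher:
    """Requests one page of results for a keyword."""

    def __init__(self, post_func, url, headers):
        # post_func(url, headers, payload) must return a list of job dicts.
        self.post_func = post_func
        self.url = url
        self.headers = headers

    def fetch(self, keyword, page):
        payload = {'kd': keyword, 'pn': page}
        return self.post_func(self.url, self.headers, payload)


class MemoryStore:
    """Keeps records in a list; a MongoDB-backed store would expose the same save()."""

    def __init__(self):
        self.records = []

    def save(self, results):
        self.records.extend(results)


class Crawler:
    """Ties a fetcher and a store together."""

    def __init__(self, fetcher, store):
        self.fetcher = fetcher
        self.store = store

    def run(self, keyword, pages):
        for page in pages:
            self.store.save(self.fetcher.fetch(keyword, page))


def fake_post(url, headers, payload):
    """Stand-in for the real HTTP call so the sketch runs offline."""
    return [{'positionName': '%s job, page %d' % (payload['kd'], payload['pn'])}]


store = MemoryStore()
fetcher = JobFetcher(fake_post, 'https://example.invalid/jobs.json', {})
Crawler(fetcher, store).run('Python', range(1, 4))
print(len(store.records))  # 3
```

Keeping fetching and storage behind small interfaces like this is also roughly how Scrapy separates spiders from item pipelines.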

Straight to the code:

import requests
import json
import pymongo

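headers = {
    'Referer': 'https://www.lagou.com/jobs/list_Python?px=default&city=%E6%B7%B1%E5%9C%B3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
# The Referer header is required. Everything before the '?' in its URL is needed;
# the query string after it can be dropped without affecting the result.
pagenum = 1
key = 'Python'  # This could be a list: first collect all the technology names from the page, save them, then loop over them (nested with the page loop) when scraping job listings
first = 'true'  # Optional; I have not found it to have any effect
post_data = {'first': first, 'kd': key, 'pn': pagenum}
# first: whether this is the first page; kd: the search keyword; pn: the page number
json_url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false&isSchoolJob=0'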

# Fetch the JSON and return its 'content' section
def get_content(post_data):
    r = requests.post(json_url, headers=headers, data=post_data)
    datas = json.loads(r.text)
    return datas['content']
# Connect to MongoDB and return the target collection
def get_connect():
    client = pymongo.MongoClient('localhost', 27017)
    lagou = client['panpan']    # database
    lagoudt = lagou['lagou']    # collection
    return lagoudt
# Write the results to the database
def to_mongo(results):
    lagou = get_connect()
    for result in results:
        # insert_one replaces Collection.insert, which is deprecated in
        # PyMongo 3 and removed in PyMongo 4
        lagou.insert_one(
            {'positionName': result['positionName'],
             # stored under the same name as the source field
             'positionLables': ','.join(result['positionLables']),
             'workYear': result['workYear'],
             'education': result['education'],
             'salary': result['salary'],
             'city': result['city'],
             'financeStage': result['financeStage'],
             'industryField': result['industryField'],
             'createTime': result['createTime'],
             'positionAdvantage': result['positionAdvantage'],
             'companySize': result['companySize'],
             'district': result['district'],
             'companyShortName': result['companyShortName'],
             'companyFullName': result['companyFullName'],
             'firstType': result['firstType'],
             'secondType': result['secondType'],
             'subwayline': result['subwayline'],
             'stationname': result['stationname'],
             'linestaion': result['linestaion']})

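# Total number of pages. Note: this treats 'pageSize' as the page count; check it
# against the actual JSON, since it may instead be the number of results per page.
total_page = get_content(post_data)['pageSize']
# Loop over every page
for page in range(1, total_page + 1):
    print(page)  # log the current page number
    post_data = {'kd': key, 'pn': page}  # 'first' is omitted; it is optional
    data = get_content(post_data)
    to_mongo(data['positionResult']['result'])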
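The comment on `key` above suggests looping over a list of keywords, nested with the page loop. A minimal sketch of that idea follows. It only builds the POST payloads and sends no requests. The helper names `page_count` and `build_payloads` are hypothetical, and `page_count` assumes you know the total number of results and the number of results per page, which the original code does not compute.

```python
import math


def page_count(total_results, per_page):
    """Number of pages needed to show total_results items, per_page at a time."""
    if per_page <= 0:
        raise ValueError('per_page must be positive')
    return math.ceil(total_results / per_page)


def build_payloads(keywords, pages_per_keyword):
    """Yield one POST payload per (keyword, page) pair.

    pages_per_keyword maps each keyword to its number of pages.
    """
    for keyword in keywords:
        for page in range(1, pages_per_keyword[keyword] + 1):
            yield {'kd': keyword, 'pn': page}


print(page_count(31, 15))  # 3
print(page_count(30, 15))  # 2

payloads = list(build_payloads(['Python', 'Java'], {'Python': 2, 'Java': 1}))
print(payloads)
# [{'kd': 'Python', 'pn': 1}, {'kd': 'Python', 'pn': 2}, {'kd': 'Java', 'pn': 1}]
```

Each payload could then be passed to `get_content` and the results handed to `to_mongo`, exactly as in the single-keyword loop.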

This is clearly an example of asynchronous loading. I won't go into more detail here, since it is covered above.