python 自动爬取文章数据 (Automatically Crawling Article Data with Python)

Author: 水落斜阳 | Published 2019-10-08 16:08

    pip reference: https://www.runoob.com/w3cnote/python-pip-install-usage.html

    1. Setup
    1.1 Technologies used

    python3
    requests: fetch JSON data over HTTP

    1.2 Installation
    pip install requests
    
    
    1.3 Import
    import requests
    
    2. URL analysis
    2.1 Copy the JSON API request URL

    Open your Juejin profile page with the browser developer tools open, click the "专栏" (Posts) tab, then in DevTools navigate to Network -> XHR -> Name -> get_entry_by_self -> Headers -> Request URL and copy the URL.


    Assign the URL you just copied to the juejin_youliang_api_full_url variable in the code below.

    import requests
    
    juejin_youliang_api_full_url = 'https://timeline-merger-ms.juejin.im/v1/get_entry_by_self?src=web&uid=59bf12c05188256c6d77d0db&device_id=1568894045269&token=eyJhY2Nlc3NfdG9rZW4iOiJ2cVFna25YNU9oM2VFSVM4IiwicmVmcmVzaF90b2tlbiI6ImlwYXJUazFzUGJyNlZVQUgiLCJ0b2tlbl90eXBlIjoibWFjIiwiZXhwaXJlX2luIjoyNTkyMDAwfQ%3D%3D&targetUid=59bf12c05188256c6d77d0db&type=post&limit=20&order=createdAt'
    
    def decode_url(url):
        """Split a URL into its base address and a dict of query parameters."""
        adr, query = url.split("?")
        params = {kv.split("=")[0]: kv.split("=")[1] for kv in query.split("&")}
        return adr, params

    print(decode_url(juejin_youliang_api_full_url))
    
    

    Output:

    ('https://timeline-merger-ms.juejin.im/v1/get_entry_by_self',
     {'src': 'web',
      'uid': '59bf12c05188256c6d77d0db',
      'device_id': '1568894045269',
      'token': 'eyJhY2Nlc3NfdG9rZW4iOiJ2cVFna25YNU9oM2VFSVM4IiwicmVmcmVzaF90b2tlbiI6ImlwYXJUazFzUGJyNlZVQUgiLCJ0b2tlbl90eXBlIjoibWFjIiwiZXhwaXJlX2luIjoyNTkyMDAwfQ%3D%3D',
      'targetUid': '59bf12c05188256c6d77d0db',
      'type': 'post',
      'limit': '20',
      'order': 'createdAt'})
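    The hand-rolled decode_url above works for this particular URL, but the standard library's urllib.parse does the same job more robustly and also percent-decodes values (note the token above ends in %3D%3D, an encoded ==). A minimal sketch, with a hypothetical name decode_url_stdlib:

    ```python
    from urllib.parse import parse_qs, urlsplit

    def decode_url_stdlib(url):
        """Split a URL into its base address and a dict of query parameters."""
        parts = urlsplit(url)
        base = "{}://{}{}".format(parts.scheme, parts.netloc, parts.path)
        # parse_qs maps each key to a list of values; keep the first of each
        params = {k: v[0] for k, v in parse_qs(parts.query).items()}
        return base, params

    base, params = decode_url_stdlib(
        "https://timeline-merger-ms.juejin.im/v1/get_entry_by_self?src=web&limit=20")
    print(base)    # https://timeline-merger-ms.juejin.im/v1/get_entry_by_self
    print(params)  # {'src': 'web', 'limit': '20'}
    ```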
    
    3. Fetching the data

    Helper functions:

    def encode_url(url, params):
        query = "&".join(["{}={}".format(k, v) for k, v in params.items()])
        return "{}?{}".format(url, query)
    
    def get_juejin_url(uid, device_id, token):
        url = "https://timeline-merger-ms.juejin.im/v1/get_entry_by_self"
        params = {'src': 'web',
                  'uid': uid,
                  'device_id': device_id,
                  'token': token,
                  'targetUid': uid,
                  'type': 'post',
                  'limit': 20,
                  'order': 'createdAt'}
        return encode_url(url, params)
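    For the reverse direction, urllib.parse.urlencode can replace the manual join in encode_url; it additionally percent-escapes characters that are not URL-safe, which the plain string join does not. A minimal sketch, with a hypothetical name encode_url_stdlib:

    ```python
    from urllib.parse import urlencode

    def encode_url_stdlib(url, params):
        """Join base URL and params; urlencode percent-escapes unsafe values."""
        return "{}?{}".format(url, urlencode(params))

    print(encode_url_stdlib(
        "https://timeline-merger-ms.juejin.im/v1/get_entry_by_self",
        {"src": "web", "limit": 20, "order": "createdAt"}))
    # https://timeline-merger-ms.juejin.im/v1/get_entry_by_self?src=web&limit=20&order=createdAt
    ```

    One caveat: the token copied from DevTools is already percent-encoded (it ends in %3D%3D), so passing it through urlencode would double-encode it; either keep the article's plain join for that value or decode the token first.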
    

    Fetch the data with requests:

    uid='59bf12c05188256c6d77d0db'
    device_id='1568894045269'
    token='eyJhY2Nlc3NfdG9rZW4iOiJ2cVFna25YNU9oM2VFSVM4IiwicmVmcmVzaF90b2tlbiI6ImlwYXJUazFzUGJyNlZVQUgiLCJ0b2tlbl90eXBlIjoibWFjIiwiZXhwaXJlX2luIjoyNTkyMDAwfQ%3D%3D'
    
    url = get_juejin_url(uid, device_id, token)
    
    headers = {
      'Origin': 'https://juejin.im',
      'Referer': 'https://juejin.im/user/5bd2b8b25188252a784d19d7/posts',
      'Sec-Fetch-Mode': 'cors',
      'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }
    
    res = requests.get(url, headers=headers)
    
    if res.status_code == 200:
        json_data = res.json()
        print('Data fetched successfully')
        print(json_data)
    else:
        print('Fetch failed; check whether the token has expired')
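    Note that the requests.get call above sets no timeout, so a stalled connection can hang the script indefinitely (with requests, `requests.get(url, headers=headers, timeout=10)` fixes this). For reference, a hedged sketch of the same fetch using only the standard library, with a hypothetical helper name fetch_json:

    ```python
    import json
    from urllib.request import Request, urlopen

    def fetch_json(url, headers=None, timeout=10):
        """GET a URL and parse the body as JSON; fail fast if the server stalls."""
        req = Request(url, headers=headers or {})
        with urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    ```

    In the article's flow this would be called as fetch_json(url, headers) with the URL and headers defined above.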
    
    4. Analyzing the data
    4.1 Inspect the JSON to find the article-list field

    Print each top-level field of the response:

    json_data = res.json()

    for k, v in json_data.items():
        print(k, ':', v)
    
    Output:

    s : 1
    m : ok
    d : {'total': 2, 'entrylist': [{'collectionCount': 0, 'userRankIndex': 0, 'buildTime': 1568894959.9655, 'commentsCount': 0 .... .... .... 
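    The output suggests the API wraps its payload in an envelope: s (a status code), m (a message), and d (the data), with the articles under d['entrylist']. A sketch of unwrapping it, assuming those field meanings (inferred from the sample output, not documented), with a hypothetical helper name get_entries:

    ```python
    def get_entries(json_data):
        """Return the article list from the response envelope, or raise on error."""
        if json_data.get("m") != "ok":
            raise RuntimeError("API error: {}".format(json_data.get("m")))
        return json_data["d"]["entrylist"]

    # Sample shaped like the output above (only fields visible there)
    sample = {"s": 1, "m": "ok",
              "d": {"total": 1,
                    "entrylist": [{"collectionCount": 0, "commentsCount": 3}]}}
    for entry in get_entries(sample):
        print(entry["commentsCount"], entry["collectionCount"])  # 3 0
    ```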
    

Original title: python 自动爬取文章数据

Original link: https://www.haomeiwen.com/subject/ttnructx.html