A Beginner's Python Crawler Exercise: A Simple 50 Lines (Part 1)

Author: 顾四秋 | Published 2019-10-11 19:50

Hi! I recently had some free time and skimmed some material on Python web scraping, so I decided to organize my notes and practice code into an article to share with everyone.

1. First, let's get to know what this thing called a crawler actually is

The goal of crawling: data
            1. Internal data
                Data generated by a company's own servers.
            2. External data
                Structured data collected from the web with crawling techniques.

Crawlers come in two kinds:
            1. General-purpose crawlers
                The core component of a search engine's page-fetching system.
                Main goal: download web pages to build a local content mirror of the internet.
                A general-purpose crawler cannot crawl everything; it must obey the robots exclusion protocol, published as robots.txt (see the short sketch right after this list).
            2. Focused crawlers
                Network crawlers written for a specific topic or need.
                The difference from search-engine crawlers is that a focused crawler filters and processes pages as it fetches them, trying to keep only the pages relevant to our needs.
                What I am playing with here is a focused crawler, which does not follow the robots protocol.
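Since robots.txt came up above, here is a minimal sketch of how you could check it programmatically with Python's standard-library urllib.robotparser. The URL and the '*' user agent below are illustrative choices of mine, not part of the original exercise:

from urllib.robotparser import RobotFileParser

#Parse the site's robots.txt
rp = RobotFileParser()
rp.set_url('https://www.lagou.com/robots.txt')
rp.read()

#May a generic crawler ('*') fetch this path?
print(rp.can_fetch('*', 'https://www.lagou.com/jobs/list_python'))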

Next, let's get hands-on with a simple code exercise.

Exercise: scrape job-posting data from a recruitment site

import requests
import xlwt
import time

#Impersonate a browser: these headers are sent with each request so the site returns a normal response we can capture and save

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'Cookie': 'user_trace_token=20191011152218-c9f0615c-912c-43d9-9405-9ee5dab19c11; _ga=GA1.2.591106157.1570778543; _gat=1; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1570778543; PRE_UTM=m_cf_cpt_baidu_pcbt; PRE_HOST=sp0.baidu.com; PRE_SITE=https%3A%2F%2Fsp0.baidu.com%2F9q9JcDHa2gU2pMbgoY3K%2Fadrc.php%3Ft%3D06KL00c00fZNKw_0PpN-0FNkUsa-QpKI00000AZkiNC00000V-xYmg.THL0oUh11x60UWdBmy-bIfK15yDzuHm1uAPWnj0srjc1mH60IHYYnHckfbDsn1wjn19anjbYPYPKwHwaf1bdn197nRNaw6K95gTqFhdWpyfqn1c4Pj6krHfvPBusThqbpyfqnHm0uHdCIZwsT1CEQLILIz4_myIEIi4WUvYEUA78uA-8uzdsmyI-QLKWQLP-mgFWpa4CIAd_5LNYUNq1ULNzmvRqUNqWu-qWTZwxmh7GuZNxTAPBI0KWThnqPWfkP16%26tpl%3Dtpl_11534_19968_16032%26l%3D1514795361%26attach%3Dlocation%253D%2526linkName%253D%2525E6%2525A0%252587%2525E5%252587%252586%2525E5%2525A4%2525B4%2525E9%252583%2525A8-%2525E6%2525A0%252587%2525E9%2525A2%252598-%2525E4%2525B8%2525BB%2525E6%2525A0%252587%2525E9%2525A2%252598%2526linkText%253D%2525E3%252580%252590%2525E6%25258B%252589%2525E5%25258B%2525BE%2525E7%2525BD%252591%2525E3%252580%252591-%252520%2525E9%2525AB%252598%2525E8%252596%2525AA%2525E5%2525A5%2525BD%2525E5%2525B7%2525A5%2525E4%2525BD%25259C%2525EF%2525BC%25258C%2525E5%2525AE%25259E%2525E6%252597%2525B6%2525E6%25259B%2525B4%2525E6%252596%2525B0%21%2526xp%253Did%28%252522m3294819466_canvas%252522%29%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FH2%25255B1%25255D%25252FA%25255B1%25255D%2526linkType%253D%2526checksum%253D219%26ie%3Dutf-8%26f%3D8%26tn%3Dbaidu%26wd%3D%25E6%258B%2589%25E9%2592%25A9%25E7%25BD%2591%26rqlang%3Dcn%26inputT%3D2549; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Flanding-page%2Fpc%2Fsearch.html%3Futm_source%3Dm_cf_cpt_baidu_pcbt; LGSID=20191011152224-df01e141-ebf7-11e9-9a31-525400f775ce; LGUID=20191011152224-df01e363-ebf7-11e9-9a31-525400f775ce; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216db9b2acab3cb-048cc083f3e7e1-67e1b3f-1049088-16db9b2acac608%22%2C%22%24device_id%22%3A%2216db9b2acab3cb-048cc083f3e7e1-67e1b3f-1049088-16db9b2acac608%22%7D; sajssdk_2015_cross_new_user=1; _gid=GA1.2.1123494377.1570778558; gate_login_token=e8b6d7058b440e1fe1fb6323e98f0de0646cdd593b57ec5b; LG_LOGIN_USER_ID=5c82fabcf6385aabf76ba85cfa7388d6d2a47cb053137cd0; LG_HAS_LOGIN=1; _putrc=17B45A575BBA096A; JSESSIONID=ABAAABAAAFCAAEG039EDFAF80460596C420D86F6F888DFA; login=true; unick=%E6%B2%88%E5%B0%8F%E7%BF%94; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; index_location_city=%E6%B7%B1%E5%9C%B3; WEBTJ-ID=20191011152333-16db9b38860a2-030f2352645797-67e1b3f-1049088-16db9b388612af; privacyPolicyPopup=false; TG-TRACK-CODE=index_navigation; hasDeliver=1; X_HTTP_TOKEN=9ba0614357a88eb89568770751ce2112c7b6159499; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1570778662; LGRID=20191011152422-2547076e-ebf8-11e9-a57b-5254005c3644; SEARCH_ID=d4111bd4c0e3465fb0b3a2b838bba950'
}
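#A side note, and an assumption of mine rather than part of the original
#exercise: the Cookie value above is tied to my own browser session and goes
#stale quickly, so anyone running this needs to capture their own via F12.
#A common alternative is to let requests.Session collect fresh cookies by
#visiting the HTML list page first, then reuse that session for the Ajax
#POST, e.g.:
#
#    session = requests.Session()
#    session.headers.update({k: v for k, v in headers.items() if k != 'Cookie'})
#    session.get('https://www.lagou.com/jobs/list_python')   #response sets the cookies
#    res = session.post(ajax_url, data=data)                 #ajax_url as below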

def getJobList(page):
    data = {
        'first': 'true',
        'pn': page,      #page number; the original hard-coded '1', so every request refetched page 1
        'kd': 'python'
    }

    #Send the request to the site and parse the JSON it returns
    res = requests.post("https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false",data=data,headers=headers)
    result = res.json()
    job = result['content']['positionResult']['result']
    return job
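#A hedged aside, not part of the original exercise: this endpoint is known to
#rate-limit aggressively. When that happens the JSON has no 'content' key and
#the line above raises a KeyError. A more defensive variant (the 'msg' field
#is my assumption about the blocked-response shape) could look like this:
def getJobListSafe(page):
    data = {'first': 'true', 'pn': page, 'kd': 'python'}
    res = requests.post("https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false",data=data,headers=headers)
    res.raise_for_status()              #fail loudly on HTTP-level errors
    result = res.json()
    content = result.get('content')     #missing when we are blocked or rate-limited
    if content is None:
        print('blocked or rate-limited:', result.get('msg'))
        return []
    return content['positionResult']['result']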

#Create an Excel workbook to hold the data
excel = xlwt.Workbook()
#Add a worksheet (cell_overwrite_ok allows rewriting a cell without an error)
sheet1 = excel.add_sheet('lagou',cell_overwrite_ok=True)
#Write the header row
sheet1.write(0,0,'companyFullName')
sheet1.write(0,1,'companySize')
sheet1.write(0,2,'secondType')
sheet1.write(0,3,'salary')

#Start walking through the pages of results
n = 1  #next empty row in the sheet (row 0 is the header)

'''
The outer loop walks over all of the result pages.

The inner loop walks over the job records on each page.

'''
for page in range(1,30):
    print(page)
    for job in getJobList(page):
        print(job)
        #Write one job record into the spreadsheet
        sheet1.write(n,0,job['companyFullName'])
        sheet1.write(n,1,job['companySize'])
        sheet1.write(n,2,job['secondType'])
        sheet1.write(n,3,job['salary'])
        n+=1
    time.sleep(1)  #pause between pages so we do not hammer the server

excel.save('lagou_shuju.xls')
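As a design note: if you do not specifically need an .xls file, the standard-library csv module does the same job with less machinery. A minimal sketch reusing getJobList and headers from above (the filename is my own choice):

import csv

with open('lagou_shuju.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['companyFullName', 'companySize', 'secondType', 'salary'])
    for page in range(1, 30):
        for job in getJobList(page):
            writer.writerow([job['companyFullName'], job['companySize'],
                             job['secondType'], job['salary']])
        time.sleep(1)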

A reminder for the audience: headers, data, and the other request parameters are all captured from the browser's developer tools after pressing F12.
Also, the code is deliberately not provided in full, to keep the site's data from being scraped maliciously. This is only a practice note! Only a practice note! It has no other purpose~
