A First Look at Web Scraping with Python

Author: Picidae | Published: 2018-01-12 15:57

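    It has been a while since my last scraper. This time I'll use requests to fetch a web page and store the extracted data in a database. Pretty cool, right?
    First, the stack used here:
    Python3 : Python 3.x
    requests: an HTTP client library for Python
    xpath : a query language for selecting nodes in a document tree; here lxml parses the HTML into an element tree and evaluates the XPath expressions against it
    pymongo: connects to a MongoDB database to store the data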
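To make the role of lxml and XPath concrete, here is a minimal sketch that parses a small HTML string and selects nodes from it. The HTML fragment, its class names, and the `extract_items` helper are made up for illustration; they are not taken from the target site.

```python
from lxml import etree

# Made-up HTML fragment for illustration; it is not the real page markup.
SAMPLE_HTML = """
<ul class="goods-list">
  <li><h3><a href="/item/1">Red wine</a></h3><span class="price">59.00</span></li>
  <li><h3><a href="/item/2">White wine</a></h3><span class="price">45.50</span></li>
</ul>
"""


def extract_items(html):
    """Parse HTML into an element tree and pull out title/price pairs with XPath."""
    tree = etree.HTML(html)
    items = []
    # An absolute expression (//...) searches the whole tree.
    for li in tree.xpath('//ul[@class="goods-list"]/li'):
        # A relative expression (./...) searches below the current node.
        # xpath() returns a list even when there is a single match.
        title = li.xpath('./h3/a/text()')[0]
        price = li.xpath('./span[@class="price"]/text()')[0]
        items.append({"title": title, "price": price})
    return items


items = extract_items(SAMPLE_HTML)
print(items)
```

The same pattern (one absolute expression to find the list items, then relative expressions inside each item) is what the crawler in this post relies on.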

    • requests + xpath + pymongo
      Straight to the code:
    import requests
    from lxml import etree
    from pymongo import MongoClient
    
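    # Connect to the local MongoDB instance and select the target collection
    client = MongoClient('mongodb://localhost:27017/')
    db = client['mydb']
    coupon = db['coupon']
    
    # First listing page to crawl, and the site root that next-page links are appended to
    startUrl = "https://www.haoshsh.com/jiu/index/cid/4/p/1.html"
    headerUrl = "https://www.haoshsh.com"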
    
    def VisitUrl(url):
        # Loop over the pages instead of recursing, so a long listing
        # cannot run into Python's recursion limit.
        while True:
            r = requests.get(url, timeout=10)  # the 10 s timeout is an arbitrary choice
            r.raise_for_status()
            r.encoding = "UTF-8"
            tree = etree.HTML(r.text)
    
            # Each <li> under the goods list is one coupon item
            items = tree.xpath('//ul[@class="goods-list clear"]/li')
            for item in items:
                # title
                title = item.xpath('./div/h3/a/text()')[0]
                # original price
                OriginalPrice = item.xpath('./div/div[@class="good-price"]/span[@class="des-other"]/span[@class="price-old"]/text()')[0]
                # price after coupon
                PostPrice = item.xpath('./div/div[@class="good-price"]/span[@class="price-current"]/text()')[0]
                # image path
                imgsrc = item.xpath('./div/div[@class="good-pic"]/a/img/@d-src')[0]
                record = {
                    "title": title,
                    "OriginalPrice": OriginalPrice,
                    "PostPrice": PostPrice,
                    "imgsrc": imgsrc
                }
                # store the record in the database
                saveDataBase(record)
    
            # href of the "next page" (下一页) link; the list is empty on the last page
            nextUrl = tree.xpath('//div[@class="page"]/div/a[text()="下一页"]/@href')
            if not nextUrl:
                print('\033[1;31m No more data, crawl finished \033[0m')
                return url
            url = headerUrl + nextUrl[0]
            print('\033[1;32m Crawling next page: ' + url + '\033[0m')
    
    
    def saveDataBase(data):
        coupon.insert_one(data)
    
    
    VisitUrl(startUrl)
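The crawler above builds the next-page address by concatenating `headerUrl` with the extracted `href`, and it indexes `[0]` into every XPath result, which raises `IndexError` as soon as one item lacks a field. The sketch below shows two small helpers that could make both steps more forgiving. `next_page_url` and `first_or_default` are hypothetical names, and the example `href` values are assumptions about what the site might return, not observed data.

```python
from urllib.parse import urljoin


def next_page_url(current_url, hrefs):
    """Return the absolute URL of the next page, or None when hrefs is empty.

    urljoin resolves root-relative, page-relative and absolute hrefs alike,
    so the result does not depend on how the site writes its links.
    """
    if not hrefs:
        return None
    return urljoin(current_url, hrefs[0])


def first_or_default(nodes, default=None):
    """Return the first XPath match, or a default when nothing matched."""
    return nodes[0] if nodes else default


base = "https://www.haoshsh.com/jiu/index/cid/4/p/1.html"
# Hypothetical href values, used only to show how urljoin resolves them.
print(next_page_url(base, ["/jiu/index/cid/4/p/2.html"]))
print(next_page_url(base, ["2.html"]))
print(first_or_default([], default="n/a"))
```

With helpers like these, a missing price or image would be stored as a default value instead of aborting the whole crawl. Whether that is preferable to failing loudly depends on how the data is used later.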
    
