Scraping Douban Books and Saving to a CSV File


Author: MingSha | Published 2017-04-30 13:42, read 243 times

    Knowledge points:

    1. Saving a CSV file
    2. The difference between requests' text and content
    3. Using XPath

    The target is Douban's most-followed book chart, which is split into two categories: fiction and non-fiction.
    The code was copied from elsewhere and lightly modified.

    import requests
    from lxml import etree
    import csv
    
    # newline='' avoids blank rows on Windows; utf-8-sig keeps Chinese text readable in Excel
    fp = open('d:\\豆瓣.csv', 'wt', newline='', encoding='utf-8-sig')
    writer = csv.writer(fp)
    writer.writerow(('name','days','author','date','publisher','price','booktype','point','comment_num'))
    url = 'https://book.douban.com/'
    headers = {
        'Accept':'*/*',
        'Accept-Encoding':'gzip, deflate',
        'Accept-Language':'zh-CN,zh;q=0.8',
        'Connection':'keep-alive',
        'Origin':'https://book.douban.com',
        'Referer':'https://book.douban.com/',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    }
    # .content returns raw bytes; lxml handles the encoding detection itself
    html = requests.get(url, headers=headers).content
    sel = etree.HTML(html)
    # The two chart links on the homepage: fiction first, non-fiction second
    fiction = sel.xpath('//div[@class="section popular-books"]/div/h2/span/a/@href')[0]
    non_fiction = sel.xpath('//div[@class="section popular-books"]/div/h2/span/a/@href')[1]
    print(fiction, non_fiction)
    colls = []
    colls.append(url + fiction)
    colls.append(url + non_fiction)
    for coll in colls:
        html1 = requests.get(coll, headers=headers).content
        sel = etree.HTML(html1)
        infos = sel.xpath('//ul[@class="chart-dashed-list"]/li/div[@class="media__body"]')
        print(len(infos))
        for info in infos:
            name = info.xpath('h2/a/text()')[0].strip()
            days = info.xpath('h2/span/text()')[0].strip()
            # The abstract line looks like "author / date / publisher / price / type"
            bookinfo = [x.strip() for x in info.xpath('p[@class="subject-abstract color-gray"]/text()')[0].split('/')]
            author = bookinfo[0]
            date = bookinfo[1]
            publisher = bookinfo[2]
            price = bookinfo[3]
            booktype = bookinfo[4]
            point = info.xpath('p[@class="clearfix w250"]/span[2]/text()')[0].strip()
            comment_num = info.xpath('p[@class="clearfix w250"]/span[3]/text()')[0].strip()
            print(name, days, author, date, publisher, price, booktype, point, comment_num)
            writer.writerow((name, days, author, date, publisher, price, booktype, point, comment_num))
    fp.close()
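
The subject-abstract line is parsed by splitting on `/`. A minimal sketch of that step in isolation, using a made-up sample line for illustration (real Douban pages vary, and some entries may have fewer fields):

```python
# Sample subject-abstract line, assumed for illustration only
line = "余华 / 2012-8 / 作家出版社 / 20.00元 / 小说"
bookinfo = [part.strip() for part in line.split("/")]
author, date, publisher, price, booktype = bookinfo
print(author, publisher)
```

If a page omits a field (e.g. no price), the direct indexing in the main script would raise an `IndexError`, so a length check before unpacking is a sensible hardening step.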
    

    The main takeaway is how to write and save a CSV file.
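    The write-then-read round trip of the `csv.writer` pattern used above can be sketched without touching the disk, using an in-memory buffer instead of a file:

```python
import csv
import io

# In-memory stand-in for the file handle; newline="" matches the csv docs' advice
buf = io.StringIO(newline="")
writer = csv.writer(buf)
writer.writerow(("name", "point"))
writer.writerow(("活着", "9.4"))

# Rewind and read the rows back to confirm the format
buf.seek(0)
rows = list(csv.reader(buf))
print(rows)
```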


    Also note:

    requests' `.text` returns Unicode (str) data.
    requests' `.content` returns bytes, i.e. raw binary data.
    In other words, use `r.text` when you want text,
    and `r.content` when you want images or other files.
    requests' `.json()` returns the response body parsed as JSON.
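    The relationship between the two can be simulated without a network call: `.content` holds the raw response bytes, and `.text` is (roughly) those bytes decoded with the detected encoding:

```python
# What r.content would hold: raw UTF-8 bytes
raw = "豆瓣读书".encode("utf-8")
# Roughly what r.text does: decode the bytes into a str
text = raw.decode("utf-8")
print(type(raw).__name__, type(text).__name__)
```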

    The image-saving code below, for example, must use `.content`:

    import requests
    jpg_url = 'https://img.haomeiwen.com/i2744623/55f59803c7aa7301.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240'
    # .content gives the raw image bytes, which are written in binary mode
    content = requests.get(jpg_url).content
    with open('demo.png', 'wb') as fp:
        fp.write(content)
    


        Article link: https://www.haomeiwen.com/subject/fwcxtxtx.html