Goal

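Scrape the book information from the Douban Books Top 250 list: title (name), the book's URL (url), author (author), publisher (publisher), publication date (date), price (price), rating (rate), and a short comment (comment).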

URL

https://book.douban.com/top250

Approach

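(1) Browse the list manually, watch how the URL changes from page to page, and build the list of URLs. The start parameter increases from 0 in steps of 25, for 10 pages in total. For example, the second and third pages are: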

https://book.douban.com/top250?start=25

https://book.douban.com/top250?start=50

(2) Scrape the relevant information from each page.

(3) Write the scraped information to a CSV file.
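Step (1) can be checked with a short sketch before any request is sent. The names BASE_URL, PAGE_SIZE, TOTAL_BOOKS and build_page_urls are hypothetical and used only for illustration; the sketch assumes the list holds 250 books at 25 per page.

```python
# Hypothetical constants for illustration; they are not part of any library.
BASE_URL = "https://book.douban.com/top250?start={}"
PAGE_SIZE = 25     # assumed number of books per page
TOTAL_BOOKS = 250  # assumed number of books in the ranking


def build_page_urls():
    """Return one URL per page, with start = 0, 25, ..., 225."""
    return [BASE_URL.format(start)
            for start in range(0, TOTAL_BOOKS, PAGE_SIZE)]


urls = build_page_urls()
print(len(urls))   # 10
print(urls[0])     # https://book.douban.com/top250?start=0
print(urls[-1])    # https://book.douban.com/top250?start=225
```

Deriving the range from the page size and the total keeps the page count (10) consistent with the step (25) instead of hard-coding the upper bound.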

The complete code is as follows:

import csv
from lxml import etree
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}

urls = ["https://book.douban.com/top250?start={}".format(i)
        for i in range(0, 226, 25)]   # build the list of page URLs

# Create the CSV file; the with block closes it even if a request fails
with open('doubanTop250.csv', 'wt', newline='', encoding='UTF-8') as f:
    writer = csv.writer(f)
    writer.writerow(('name', 'url', 'author', 'publisher', 'date', 'price',
                     'rate', 'comment'))

    for url in urls:
        print('Scraping ' + url)
        r = requests.get(url, headers=headers)
        selector = etree.HTML(r.text)
        # Select the enclosing row of each book, then loop over the rows
        infos = selector.xpath('//tr[@class="item"]')
        for info in infos:
            name = info.xpath('td/div/a/@title')[0]
            # Use a separate name so the page URL of the outer loop is not shadowed
            book_url = info.xpath('td/p/../div/a/@href')[0] if False else info.xpath('td/div/a/@href')[0]
            # Split the info string once and trim the whitespace around each field
            book_infos = [s.strip()
                          for s in info.xpath('td/p/text()')[0].split('/')]
            author = book_infos[0]
            publisher = book_infos[-3]
            date = book_infos[-2]
            price = book_infos[-1]
            rate = info.xpath('td/div/span[2]/text()')[0]
            comments = info.xpath('td/p/span/text()')
            # "空" ("empty") marks books that have no comment
            comment = comments[0] if comments else "空"

            writer.writerow((name, book_url, author, publisher, date, price,
                             rate, comment))
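The script above indexes the split info string from the end, which raises an IndexError if a book lists fewer than three fields. A minimal sketch of a more defensive parser follows. The helper name parse_book_info, the dictionary layout, and the sample string are hypothetical, and the sketch assumes the info string has the form "author / ... / publisher / date / price".

```python
def parse_book_info(info):
    """Split an info string into author, publisher, date and price.

    Hypothetical helper: assumes the last three "/"-separated fields are
    publisher, date and price, and that the first field is the author.
    """
    # Split on "/" and trim the whitespace around each field.
    parts = [part.strip() for part in info.split('/')]
    if len(parts) < 4:
        # Too few fields to assign reliably: keep the raw text as the
        # author and leave the other fields blank rather than guessing.
        return {'author': info.strip(), 'publisher': '',
                'date': '', 'price': ''}
    return {
        'author': parts[0],      # any translators in between are dropped
        'publisher': parts[-3],
        'date': parts[-2],
        'price': parts[-1],
    }


# Made-up sample string in the assumed format, not real scraped data.
sample = "Author Name / Translator Name / Example Press / 2006-5 / 29.00元"
print(parse_book_info(sample))
```

Returning blank fields for short strings keeps one malformed entry from aborting the whole crawl; whether to skip such rows instead is a design choice left to the reader.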