I. Preparation

Install two very handy libraries for scraping web pages, Requests and Beautiful Soup, plus pip, Python's package manager, which is used to download both of them.

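If you installed Python from the official Python website, every version from 2.7.9 onward already ships with pip, so there is nothing to install; just upgrade it to the latest version.

Updating pip
Open PowerShell and type: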

python -m pip install -U pip

Installing requests and beautifulsoup4 with pip

pip install BeautifulSoup4
pip install requests

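These two lines are the commands used in every tutorial I came across, but for some reason they did not work on my machine. I tried them first in the shell and then at Python's own interactive prompt, and neither recognized the command. (The Python prompt cannot run pip at all, because pip is a shell command rather than Python code. In the shell, the likely cause is that the directory containing pip.exe was not on PATH, which the warning further down also hints at.) In the end, what worked was entering the following in PowerShell: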

python -m pip install requests
python -m pip install beautifulsoup4

Both installed successfully, although the following warning appeared once requests had finished installing:

The script chardetect.exe is installed in 'C:\Python27\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed certifi-2018.4.16 chardet-3.0.4 idna-2.6 requests-2.18.4 urllib3-1.22

So far this has not affected anything I have done since.
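To confirm that both libraries ended up somewhere Python can find them, a quick check like the one below may help. It is only a sketch, assuming Python 3.4 or later: `is_installed` is a helper name made up for this example, and it relies on the standard library's `importlib.util.find_spec`. Note that beautifulsoup4 is imported under the name `bs4`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    # find_spec returns None when the module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # beautifulsoup4 is the package name on PyPI; the importable module is bs4.
    for name in ("requests", "bs4"):
        status = "installed" if is_installed(name) else "missing"
        print(name + ": " + status)
```

If either line prints "missing", re-running the `python -m pip install` command for that package is the first thing to try.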

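II. A quick look at HTML

Theory

See HTML Basic. Skipping it is fine, though: it is enough to know that a piece of content is always wrapped in a matching opening and closing tag, such as <p>....</p> or <div>....</div>.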
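As a rough illustration of what "matching opening and closing tags" means, the sketch below walks through a snippet with the standard library's `html.parser.HTMLParser` and checks that every closing tag matches the most recently opened one. `TagPairChecker` and `tags_pair_up` are names invented for this example, and the check is deliberately simplified: it ignores elements such as <br> that legitimately have no closing tag.

```python
from html.parser import HTMLParser


class TagPairChecker(HTMLParser):
    """Track opening and closing tags to see whether they pair up."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags opened but not yet closed
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # A closing tag should match the most recently opened tag.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.balanced = False


def tags_pair_up(html):
    """Return True if every opening tag in html has a matching closing tag."""
    checker = TagPairChecker()
    checker.feed(html)
    checker.close()
    return checker.balanced and not checker.open_tags


print(tags_pair_up("<div><p>hello</p></div>"))  # properly nested
print(tags_pair_up("<div><p>hello</div>"))      # <p> is never closed
```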

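Example:

Open the Douban page 开智正典：心智升级百本经典导读 and switch to the source view.
Note: right-click and choose "Inspect", not "View page source".
Click the arrow icon in the top-left corner of the code panel to turn on the mode that links page content to its code.
Click anything in the page and the corresponding HTML shows up in the code panel on the right. It works the other way round too: click a piece of code on the right to see the matching content on the left.
For example, the title of a book: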

After some exploring, I found that in a Douban doulist, the information for each book sits inside an element with class=doulist-item.

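III. Start scraping

The core steps of a scraper are:

Request: connect to and fetch the page you need (requests)
Parse: analyze the page content (beautifulsoup)
Store: save the parts you need (Python's file read/write functions)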
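The three steps can be sketched as three small functions. This is a minimal outline, not Neo's code: the function names `fetch`, `parse`, and `store`, the 10-second timeout, and the choice to collect only the text of each `<div class="title">` are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch(url):
    """Request: download the raw page (needs network access)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content


def parse(html):
    """Parse: return the text of every <div class="title"> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.find_all("div", class_="title")]


def store(titles, path):
    """Store: write one title per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        for title in titles:
            f.write(title + "\n")
```

Chained together, a run would look like `store(parse(fetch(url)), "books.txt")`, where `books.txt` is just a placeholder file name.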

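Neo's code is still quite complicated for me at this point, so I added a lot of comments and rearranged it into an order I can follow. See 开智正典爬虫粗糙版本 (the rough version of the 开智正典 scraper).

What follows only explains the key points, and the code has been simplified.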

1. Requesting the page

import requests

url = 'https://www.douban.com/doulist/41691053/?start=0&sort=seq&sub_type='
content = requests.get(url).content  # raw bytes of the page

Step 1: aa = requests.get('the URL you want to scrape')
Step 2: bb = aa.content, which pulls out the page's source (as bytes) so that it can be parsed in the next step.
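In practice a request can fail or hang, so a slightly more defensive version of these two steps might look like the sketch below. The function name `fetch_html`, the 10-second timeout, and the `User-Agent` value are assumptions for illustration; whether a particular site needs a browser-like header is something to check for yourself.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    # Some sites reject requests that lack a browser-like User-Agent;
    # this header value is only an illustrative placeholder.
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as error:
        print("request failed: {}".format(error))
        return None
    return response.content
```

Returning None keeps the caller simple: it can skip parsing when there is nothing to parse.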

2. Parsing the page

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
bookList = soup.find_all(name="div", attrs={"class": "doulist-item"})

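Beautiful Soup can work with several parsers; html.parser is the one that needs no extra installation, since it ships with Python's standard library.
Beautiful Soup comes with many methods. One of them, find_all, returns every element that matches the given criteria; here the criteria are a div tag with class=doulist-item. find, by contrast, stops after the first match.
At this point bookList is a list in which each entry holds all the information for a single book.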
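The difference between find_all and find is easy to see on a tiny hand-written fragment. The HTML below is made up for this example and only mimics the doulist-item structure described earlier.

```python
from bs4 import BeautifulSoup

# A made-up page fragment with two book entries.
sample_html = """
<div class="doulist-item"><div class="title"><a>Book One</a></div></div>
<div class="doulist-item"><div class="title"><a>Book Two</a></div></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list of every match; find returns only the first one.
all_items = soup.find_all(name="div", attrs={"class": "doulist-item"})
first_item = soup.find(name="div", attrs={"class": "doulist-item"})

print(len(all_items))     # number of matching elements
print(first_item.a.text)  # link text inside the first match
```

When nothing matches, find_all returns an empty list while find returns None, which is worth checking before reaching into the result.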

Now extract the book titles; each book is one item.

for item in bookList:
    title0 = item.find(name="div", attrs={"class": "title"})
    title = title0.a.text.strip().replace('"', '')
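Because find returns None when an item has no title block, the loop above would raise an error on such an item. A guarded variant could look like the following sketch; `extract_titles` is a name chosen for this example, and skipping items without a title is just one possible way to handle them.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Collect the cleaned title of every doulist-item that has one."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all(name="div", attrs={"class": "doulist-item"}):
        title_div = item.find(name="div", attrs={"class": "title"})
        # find returns None when nothing matches, so guard before using .a
        if title_div is None or title_div.a is None:
            continue
        titles.append(title_div.a.text.strip().replace('"', ''))
    return titles
```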

The author, publisher, and publication date are a little more involved, because they all sit inside the abstract block.

[Image: abstact.png]

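# Runs inside the same "for item in bookList" loop.
# book is assumed to be the per-item record object from the full script.
abstract = item.find(name="div", attrs={"class": "abstract"})
if abstract is not None:
    for line in abstract.text.strip().split('\n'):
        if line.strip() != '':
            # Split at the first colon only, so values containing ':' stay intact.
            theMap = line.strip().split(':', 1)
            if theMap[0].strip() == '作者':  # author
                book.author = theMap[1].strip()
            elif theMap[0].strip() == '出版社':  # publisher
                book.publisher = theMap[1].strip()
            elif theMap[0].strip() == '出版年':  # publication year
                book.pubDate = theMap[1].strip()
            else:
                print(">>> not process: " + line.strip())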

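So the code first extracts the whole abstract text and splits it into lines. Each line is then split at the colon into a list: the element at index 0 is the field label, and the element at index 1 is its value.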
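The same line-by-line splitting can be tried without any web page at all. The sketch below applies it to a hand-written string and collects the result in a dict; `parse_abstract` and the sample values are made up for this example, and using a dict instead of the `book` object is a simplification.

```python
def parse_abstract(text):
    """Turn 'label: value' lines into a dict keyed by label."""
    fields = {}
    for line in text.strip().split('\n'):
        line = line.strip()
        if not line:
            continue
        # Split only at the first colon so values containing ':' stay whole.
        parts = line.split(':', 1)
        if len(parts) == 2:
            fields[parts[0].strip()] = parts[1].strip()
    return fields


# Made-up sample in the same shape as the abstract block.
sample = """
    作者: Example Author
    出版社: Example Press
    出版年: 2018-4
"""
info = parse_abstract(sample)
print(info['作者'], info['出版社'], info['出版年'])
```

Lines without a colon are silently skipped here, whereas the code above prints them as "not process"; either choice works as long as unexpected lines do not crash the loop.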