My First Python Crawler and Some Notes on Character Encodings

Author: 张松_48a5 | Published: 2018-08-03 11:00


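Work has been quiet these past few days, so I tried writing a crawler. I originally wanted to scrape Meituan's food-delivery merchant data, but I could not work out the _token parameter in its AJAX requests quickly, so I started with something simpler. As a Python beginner, I still learned a lot from it.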

Code first: this program downloads a novel from a website.

# -*- coding:utf-8 -*-
from urllib import request
from bs4 import BeautifulSoup


def getResponse(url):
    # Send a browser-like User-Agent header with the request
    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }

    url_request = request.Request(url, headers=head)
    url_response = request.urlopen(url_request)
    # The response headers declare the content as UTF-8 (the headers are listed at the end of the post)
    return url_response.read().decode('utf-8')


# Biquge (笔趣阁)? This one looks like a knockoff site too
url = 'https://www.qu.la/book/61600/'
response = getResponse(url)
# print(response)
soup = BeautifulSoup(response, 'lxml')
p = soup.find('dt', text='《太初》正文卷')
print(p)
chapter = p.find_next('dd')
# for chapter in chapters:
print(chapter.a['href'])
print(chapter.a.text)

url = "https://www.qu.la" + chapter.a['href']
print('url is ' + url)
response = getResponse(url)
soup = BeautifulSoup(response, 'lxml')
content = soup.find('div', id="content")

strs = content.text.split("\n")
print('!' * 50)
with open('test.txt', 'w', encoding='gbk', errors='ignore') as f:
    for str_ in strs:
        str_ = str_.strip()
        # strip() has already removed newlines, so an emptiness check is enough
        if not str_:
            continue
        print(type(str_))
        # Write each paragraph on its own line
        f.write(str_ + '\n')
        print('*' * 50)

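The main points in this code are how to use BeautifulSoup and how character encodings work. There are plenty of BeautifulSoup examples online, so I will not repeat them here; I will mainly talk about encodings.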

I have read many articles on Python encodings, but they differ somewhat from my earlier understanding, so here is my own take.

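In my view, an encoding is a property of a binary bytes sequence, not of a str. Inside a Python 3 program, a str is a sequence of Unicode characters (code points); it is the meaning that the bytes express. When a str has to leave the program (written to disk, sent over the network, printed to a console), that is output: it must first be converted to bytes, and that conversion is encode. On input, bytes are converted to str, and that conversion is decode. So it makes sense to say that a byte stream is in a particular encoding, but not that a str is. It also follows that bytes in one encoding cannot be turned directly into bytes in another encoding; they have to be decoded to str first and then encoded again.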
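To make the distinction concrete, here is a minimal sketch that uses only the built-in str and bytes methods. The sample text is an arbitrary string chosen for illustration; it does not come from the crawled page.

```python
# Illustrative sample text (an assumption, not taken from the crawled page)
text = '中文'

# Output direction: str -> bytes. The same str yields different bytes per encoding.
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')
print(len(utf8_bytes), len(gbk_bytes))  # 6 4

# Input direction: bytes -> str. The bytes must be decoded with the
# encoding that produced them.
assert utf8_bytes.decode('utf-8') == text

# Converting UTF-8 bytes to GBK bytes goes through str: decode, then encode.
converted = utf8_bytes.decode('utf-8').encode('gbk')
assert converted == gbk_bytes
```

The str itself never carries an encoding label here; only the two bytes objects differ.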

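In this code, the crawled content is UTF-8, so it is decoded once with decode('utf-8'). It is then saved to a GBK file, which encodes it once: because the file is opened with encoding='gbk', every write encodes the str to GBK bytes. The response headers for the page are reproduced at the end of this post; the Content-Type line carries the charset.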
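The implicit encode on write can be observed by reading the file back as raw bytes. The sketch below is an illustration under assumptions: the sample text and the temporary file path are made up, and the emoji is included only because GBK cannot represent it, which shows what errors='ignore' does.

```python
import os
import tempfile

# Illustrative text: the trailing emoji has no GBK representation
text = '太初\U0001F600'

# Hypothetical output path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# In text mode, open() encodes each str to GBK as it is written;
# errors='ignore' silently drops characters GBK cannot represent.
with open(path, 'w', encoding='gbk', errors='ignore') as f:
    f.write(text)

# Reading in binary mode returns the stored bytes without any decoding.
with open(path, 'rb') as f:
    raw = f.read()

# Only the two GBK-encodable characters survive (2 bytes each).
recovered = raw.decode('gbk')
print(len(raw), recovered)
```

Dropping characters silently is convenient for a quick test, but it also hides data loss; errors='strict' (the default) would raise UnicodeEncodeError instead.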

Date: Fri, 03 Aug 2018 02:04:49 GMT
Cache-Control: private
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/7.5
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Content-Length: 145886
Age: 149
X-Via: 1.1 zhjhzhdx19:1 (Cdn Cache Server V2.0), 1.1 PSmgnyNY2li89:0 (Cdn Cache Server V2.0)
Connection: close
X-Dscp-Value: 0
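Since the Content-Type header above already names the charset, it could be read from the header instead of being hard-coded in decode('utf-8'). The following is a minimal sketch using the standard library's email.message.Message; the helper name charset_from_content_type and the 'utf-8' fallback are assumptions made for illustration, not part of the original program.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset parameter of a Content-Type value, or `default`.

    Hypothetical helper for illustration; the fallback value is an assumption.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the charset in lower case, or None if absent
    return msg.get_content_charset() or default


print(charset_from_content_type('text/html; charset=utf-8'))  # utf-8
print(charset_from_content_type('text/html'))                 # utf-8 (fallback)
```

With urllib, the headers attribute of the response is a message object of this same family, so url_response.headers.get_content_charset() should give the same result without a helper. Pages that declare their encoding only in an HTML meta tag would still need separate handling.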