I had long wanted to learn Python scripting but never got around to it. What finally pushed me to start today:
While casually browsing BOSS 直聘 I noticed that iOS developer job postings explicitly list "proficiency in at least one scripting language" as a requirement;
The 知识星球 (Knowledge Planet) group I recently joined has far too much content; even reading only the digest section takes a long time, and there is no way to bookmark where I left off, so every visit starts from the top. Very inconvenient, hence the idea of crawling all the data;
Even 大云 can write Python now; it would be embarrassing if I still couldn't 😶
Today I explored a Python script that crawls the 知识星球 digest section. Thanks to 96chh for contributing the source code: 👇
https://github.com/96chh/crawl-zsxq
Environment:
macOS 10.13.6, Python 2.7
The problems I ran into are recorded below:
1. 'NoneType' object is not iterable
File "/Users/hegao/Downloads/crawl-zsxq-master-2/crawl.py", line 42, in get_data
for topic in json.loads(f.read()).get('resp_data').get('topics'):
TypeError: 'NoneType' object is not iterable
Search results all say this error typically occurs when you try to iterate over (or unpack) a value that is None. Given where the error is raised, my guess was that no data had been fetched, and test.json confirmed it with a 401:
{
  "info": "",
  "code": 401,
  "resp_data": {},
  "succeeded": false
}
Generally a 401 means you need to authenticate first (log in with valid credentials).
Inspecting the page's requests showed that the request headers no longer carry an Authorization field; a Cookie is used instead. Changing headers to the following form fixes it:
headers = {
    'Cookie': 'zsxq_access_token=E56C5E4D-C79D-F758-12B6-947D0F9A43AA',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
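To make this kind of failure obvious next time, a small guard could sit right after the request. This is only a sketch of my own, not part of the original crawl.py; it reuses the url and headers variables from the script and checks the same code / resp_data fields seen in test.json:
rsp = requests.get(url, headers=headers)
data = rsp.json()
# fail loudly instead of letting the later .get('topics') return None
if data.get('code') == 401 or not data.get('resp_data', {}).get('topics'):
    raise RuntimeError('No topics returned - check the zsxq_access_token cookie: %s' % data)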
2. Non-ASCII character '\xe5' in file
SyntaxError: Non-ASCII character '\xe5' in file /Users/hegao/Downloads/crawl-zsxq-master-2/crawl.py on line 33, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Fix: see 👇
https://blog.csdn.net/qq_39521554/article/details/79920374
Step 1: don't forget to add this to the very first line of the file:
# -*- coding: utf-8 -*-
Step 2: reload the sys module and set the default encoding to utf-8:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
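For the record, the setdefaultencoding trick is a Python 2-only workaround. A minimal alternative sketch (assuming Python 2.7; out.txt is just a made-up filename) keeps unicode literals and encodes explicitly at the I/O boundary:
# -*- coding: utf-8 -*-
title = u"电子书"                    # unicode literal, allowed thanks to the coding declaration
with open("out.txt", "w") as f:      # out.txt is only an illustrative file
    f.write(title.encode("utf-8"))   # encode explicitly instead of patching the default encoding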
3. ImportError: No module named parse
File "/Users/hegao/Downloads/crawl-zsxq-master-2/crawl.py", line 9, in <module>
from urllib.parse import quote
ImportError: No module named parse
At first I thought the parse library simply needed installing, but the error persisted after running sudo pip install parse.
Only after more searching did I realize it is a Python version issue: I am on Python 2.7, while the author presumably used Python 3+. Changing the import as follows fixes it:
from urllib import quote
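If the script needed to run on both interpreters, a try/except import would cover the two cases; a small sketch, not what I ended up using:
try:
    from urllib.parse import quote   # Python 3
except ImportError:
    from urllib import quote         # Python 2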
4. 'encoding' is an invalid keyword argument for this function
File "/Users/hegao/Downloads/crawl-zsxq-master-2/crawl.py", line 38, in get_data
with open('test.json', 'w', encoding='utf-8') as f: # 将返回数据写入 test.json 方便查看
TypeError: 'encoding' is an invalid keyword argument for this function
My fix here was quick and dirty: simply delete encoding="utf-8" from the script.
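A gentler alternative (a sketch assuming Python 2.7; I did not adopt it) is io.open, which does accept the encoding argument:
import io
import json
data = {"title": u"电子书"}          # placeholder data, not the real API response
with io.open("test.json", "w", encoding="utf-8") as f:
    # with unicode input and ensure_ascii=False, json.dumps returns unicode, which io.open's text mode expects
    f.write(json.dumps(data, indent=2, ensure_ascii=False))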
5. The script finishes and prints 已制作电子书在当前目录 ("the e-book has been created in the current directory"), yet no PDF shows up locally
Tracking this one down took a lot of time, because several possible causes had to be ruled out:
1. Compared the response payload with the fields the script reads, ruling out the possibility that the response fields had changed;
2. Ran "wkhtmltopdf www.baidu.com ***/baidu.pdf" to rule out a broken wkhtmltopdf installation; on a Mac, downloading and installing the pkg is enough, with no further configuration needed;
3. Commented out os.remove(file) and found the crawled HTML files were indeed on disk. Converting a single HTML file succeeded, and so did a batch of 50, at which point it dawned on me that too many HTML files were making the PDF conversion fail. Adding the following check solved it:
if index % 60 == 0 and index > 0:
    pdfkit.from_file(html_files, str(index) + "电子书——辉哥.pdf", options=options)
    for file in html_files:
        os.remove(file)
    html_files = []
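The same idea could be factored into a helper so the flush inside the loop and the final flush for the remainder share one code path; a sketch only (flush_batch is my own name, not part of the script):
import os
import pdfkit
def flush_batch(html_files, pdf_name, options):
    # convert one batch of HTML files to a single PDF, then delete them,
    # so wkhtmltopdf never receives too many inputs at once
    if not html_files:
        return
    pdfkit.from_file(html_files, pdf_name, options=options)
    for file in html_files:
        os.remove(file)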
6. Python syntax
- Python's syntax really is unlike any language I have worked with so far: for a start, function bodies are delimited by indentation (tabs/spaces), and losing the curly braces takes some getting used to;
- Logical AND/OR/NOT are written as the keywords and, or and not. Eye-opening. Being used to & | !, it took me quite a while to figure out that I needed and (see the short example after this list).
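A tiny illustration of both points (my own example, not taken from the script):
def can_publish(title, text):
    # the function body is delimited by indentation, not braces
    if not title or text is None:                    # 'not' / 'or' instead of '!' / '||'
        return False
    return len(title) > 0 and len(text) <= 10000     # 'and' instead of '&&'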
7. The complete, modified Python script:
# -*- coding: utf-8 -*-
import re
import requests
import json
import os
import pdfkit
from bs4 import BeautifulSoup
from urllib import quote
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
</head>
<body>
<h1>{title}</h1>
<p>{text}</p>
</body>
</html>
"""
htmls = []
num = 0
def get_data(url):
    global htmls, num
    headers = {
        'Cookie': 'zsxq_access_token=E56C5E4D-C79D-F758-12B6-947D0F9A43AA',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    }
    rsp = requests.get(url, headers=headers)
    with open('test.json', 'w') as f:  # write the response to test.json for easy inspection
        f.write(json.dumps(rsp.json(), indent=2, ensure_ascii=False))
    with open('test.json') as f:
        for topic in json.loads(f.read()).get('resp_data').get('topics'):
            content = topic.get('question', topic.get('talk', topic.get('task', topic.get('solution'))))
            # print(content)
            text = content.get('text', '')
            text = re.sub(r'<[^>]*>', '', text).strip()
            text = text.replace('\n', '<br>')
            title = str(num) + text[:9]
            num += 1
            if content.get('images'):
                soup = BeautifulSoup(html_template, 'html.parser')
                for img in content.get('images'):
                    url = img.get('large').get('url')
                    img_tag = soup.new_tag('img', src=url)
                    soup.body.append(img_tag)
                html_img = str(soup)
                html = html_img.format(title=title, text=text)
            else:
                html = html_template.format(title=title, text=text)
            if topic.get('question'):
                answer = topic.get('answer').get('text', "")
                soup = BeautifulSoup(html, 'html.parser')
                answer_tag = soup.new_tag('p')
                answer_tag.string = answer
                soup.body.append(answer_tag)
                html_answer = str(soup)
                html = html_answer.format(title=title, text=text)
            htmls.append(html)
    # paginate: build the next request from the create_time of the last topic on this page
    next_page = rsp.json().get('resp_data').get('topics')
    if next_page:
        create_time = next_page[-1].get('create_time')
        if create_time[20:23] == "000":
            end_time = create_time[:20] + "999" + create_time[23:]
        else:
            res = int(create_time[20:23]) - 1
            end_time = create_time[:20] + str(res).zfill(3) + create_time[23:]  # zfill pads with leading zeros so the result is always 3 digits
        end_time = quote(end_time)
        if len(end_time) == 33:
            end_time = end_time[:24] + '0' + end_time[24:]
        next_url = start_url + '&end_time=' + end_time
        print(next_url)
        get_data(next_url)
    return htmls
def make_pdf(htmls):
    options = {
        "user-style-sheet": "test.css",
        "page-size": "Letter",
        "margin-top": "0.75in",
        "margin-right": "0.75in",
        "margin-bottom": "0.75in",
        "margin-left": "0.75in",
        "encoding": "UTF-8",
        "custom-header": [("Accept-Encoding", "gzip")],
        "cookie": [
            ("cookie-name1", "cookie-value1"), ("cookie-name2", "cookie-value2")
        ],
        "outline-depth": 10,
    }
    html_files = []
    for index, html in enumerate(htmls):
        file = str(index) + ".html"
        html_files.append(file)
        with open(file, "w") as f:
            f.write(html)
        if index % 60 == 0 and index > 0:
            # flush every 60 files so wkhtmltopdf never gets too many inputs at once (see issue 5)
            pdfkit.from_file(html_files, str(index) + "电子书.pdf", options=options)
            for file in html_files:
                os.remove(file)
            html_files = []
    try:
        pdfkit.from_file(html_files, "电子书.pdf", options=options)
    except Exception as e:
        pass
    for file in html_files:
        os.remove(file)
    print("已制作电子书在当前目录!")
if __name__ == '__main__':
    start_url = 'https://api.zsxq.com/v1.10/groups/4224842218/topics?scope=digests&count=20'
    make_pdf(get_data(start_url))