I had long wanted to learn Python scripting but never got around to it. What finally pushed me to start today:
While casually browsing BOSS 直聘 I noticed that iOS developer job postings explicitly list "proficiency in at least one scripting language" as a requirement;
The 知识星球 (Knowledge Planet) group I recently joined has far too much content; even reading only the digest section takes a long time, and there is no way to bookmark where I left off, so every visit starts from the top. Very inconvenient, hence the idea of crawling all the data;
Even 大云 can write Python now; it would be embarrassing if I still couldn't 😶
Today I explored a Python script that crawls the 知识星球 digest section. Thanks to 96chh for contributing the source code: 👇
https://github.com/96chh/crawl-zsxq
Environment:
macOS 10.13.6, Python 2.7
The problems I ran into are recorded below:
1. 'NoneType' object is not iterable
File "/Users/hegao/Downloads/crawl-zsxq-master-2/crawl.py", line 42, in get_data
for topic in json.loads(f.read()).get('resp_data').get('topics'):
TypeError: 'NoneType' object is not iterable
Search results all say this error typically occurs when you try to iterate over (or unpack) a value that is None. Given where the error is raised, my guess was that no data had been fetched, and test.json confirmed it with a 401:
{
  "info": "",
  "code": 401,
  "resp_data": {},
  "succeeded": false
}
Generally a 401 means you need to authenticate first (log in with valid credentials).
Inspecting the page's requests showed that the request headers no longer carry an Authorization field; a Cookie is used instead. Changing headers to the following form fixes it:
headers = {
    'Cookie': 'zsxq_access_token=E56C5E4D-C79D-F758-12B6-947D0F9A43AA',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
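To make this kind of failure obvious next time, a small guard could sit right after the request. This is only a sketch of my own, not part of the original crawl.py; it reuses the url and headers variables from the script and checks the same code / resp_data fields seen in test.json:
rsp = requests.get(url, headers=headers)
data = rsp.json()
# fail loudly instead of letting the later .get('topics') return None
if data.get('code') == 401 or not data.get('resp_data', {}).get('topics'):
    raise RuntimeError('No topics returned - check the zsxq_access_token cookie: %s' % data)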
2. Non-ASCII character '\xe5' in file
SyntaxError: Non-ASCII character '\xe5' in file /Users/hegao/Downloads/crawl-zsxq-master-2/crawl.py on line 33, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Fix: see 👇
https://blog.csdn.net/qq_39521554/article/details/79920374
Step 1: don't forget to add this to the very first line of the file:
# -*- coding: utf-8 -*-
Step 2: reload the sys module and set the default encoding to utf-8:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
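For the record, the setdefaultencoding trick is a Python 2-only workaround. A minimal alternative sketch (assuming Python 2.7; out.txt is just a made-up filename) keeps unicode literals and encodes explicitly at the I/O boundary:
# -*- coding: utf-8 -*-
title = u"电子书"                    # unicode literal, allowed thanks to the coding declaration
with open("out.txt", "w") as f:      # out.txt is only an illustrative file
    f.write(title.encode("utf-8"))   # encode explicitly instead of patching the default encoding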
3. ImportError: No module named parse
File "/Users/hegao/Downloads/crawl-zsxq-master-2/crawl.py", line 9, in <module>
from urllib.parse import quote
ImportError: No module named parse
At first I thought the parse library simply needed installing, but the error persisted after running sudo pip install parse.
Only after more searching did I realize it is a Python version issue: I am on Python 2.7, while the author presumably used Python 3+. Changing the import as follows fixes it:
from urllib import quote
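If the script needed to run on both interpreters, a try/except import would cover the two cases; a small sketch, not what I ended up using:
try:
    from urllib.parse import quote   # Python 3
except ImportError:
    from urllib import quote         # Python 2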
4. 'encoding' is an invalid keyword argument for this function
File "/Users/hegao/Downloads/crawl-zsxq-master-2/crawl.py", line 38, in get_data
with open('test.json', 'w', encoding='utf-8') as f: # 将返回数据写入 test.json 方便查看
TypeError: 'encoding' is an invalid keyword argument for this function
My fix here was quick and dirty: simply delete encoding="utf-8" from the script.
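A gentler alternative (a sketch assuming Python 2.7; I did not adopt it) is io.open, which does accept the encoding argument:
import io
import json
data = {"title": u"电子书"}          # placeholder data, not the real API response
with io.open("test.json", "w", encoding="utf-8") as f:
    # with unicode input and ensure_ascii=False, json.dumps returns unicode, which io.open's text mode expects
    f.write(json.dumps(data, indent=2, ensure_ascii=False))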
5. The script finishes and prints 已制作电子书在当前目录 ("the e-book has been created in the current directory"), yet no PDF shows up locally
Tracking this one down took a lot of time, because several possible causes had to be ruled out:
1. Compared the response payload with the fields the script reads, ruling out the possibility that the response fields had changed;
2. Ran "wkhtmltopdf www.baidu.com ***/baidu.pdf" to rule out a broken wkhtmltopdf installation; on a Mac, downloading and installing the pkg is enough, with no further configuration needed;
3. Commented out os.remove(file) and found the crawled HTML files were indeed on disk. Converting a single HTML file succeeded, and so did a batch of 50, at which point it dawned on me that too many HTML files were making the PDF conversion fail. Adding the following check solved it:
if index % 60 == 0 and index > 0:
    pdfkit.from_file(html_files, str(index) + "电子书——辉哥.pdf", options=options)
    for file in html_files:
        os.remove(file)
    html_files = []
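The same idea could be factored into a helper so the flush inside the loop and the final flush for the remainder share one code path; a sketch only (flush_batch is my own name, not part of the script):
import os
import pdfkit
def flush_batch(html_files, pdf_name, options):
    # convert one batch of HTML files to a single PDF, then delete them,
    # so wkhtmltopdf never receives too many inputs at once
    if not html_files:
        return
    pdfkit.from_file(html_files, pdf_name, options=options)
    for file in html_files:
        os.remove(file)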
6. Python syntax
- Python's syntax really is unlike any language I have worked with so far: for a start, function bodies are delimited by indentation (tabs/spaces), and losing the curly braces takes some getting used to;
- Logical AND/OR/NOT are written as the keywords and, or and not. Eye-opening. Being used to & | !, it took me quite a while to figure out that I needed and (see the short example after this list).
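A tiny illustration of both points (my own example, not taken from the script):
def can_publish(title, text):
    # the function body is delimited by indentation, not braces
    if not title or text is None:                    # 'not' / 'or' instead of '!' / '||'
        return False
    return len(title) > 0 and len(text) <= 10000     # 'and' instead of '&&'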
7. The complete, modified Python script:
# -*- coding: utf-8 -*-
import re
import requests
import json
import os
import pdfkit
from bs4 import BeautifulSoup
from urllib import quote
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
</head>
<body>
<h1>{title}</h1>
<p>{text}</p>
</body>
</html>
"""
htmls = []
num = 0
def get_data(url):
    global htmls, num
    headers = {
        'Cookie': 'zsxq_access_token=E56C5E4D-C79D-F758-12B6-947D0F9A43AA',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    }
    rsp = requests.get(url, headers=headers)
    with open('test.json', 'w') as f:  # write the response to test.json for easy inspection
        f.write(json.dumps(rsp.json(), indent=2, ensure_ascii=False))
    with open('test.json') as f:
        for topic in json.loads(f.read()).get('resp_data').get('topics'):
            content = topic.get('question', topic.get('talk', topic.get('task', topic.get('solution'))))
            # print(content)
            text = content.get('text', '')
            text = re.sub(r'<[^>]*>', '', text).strip()
            text = text.replace('\n', '<br>')
            title = str(num) + text[:9]
            num += 1
            if content.get('images'):
                soup = BeautifulSoup(html_template, 'html.parser')
                for img in content.get('images'):
                    url = img.get('large').get('url')
                    img_tag = soup.new_tag('img', src=url)
                    soup.body.append(img_tag)
                html_img = str(soup)
                html = html_img.format(title=title, text=text)
            else:
                html = html_template.format(title=title, text=text)
            if topic.get('question'):
                answer = topic.get('answer').get('text', "")
                soup = BeautifulSoup(html, 'html.parser')
                answer_tag = soup.new_tag('p')
                answer_tag.string = answer
                soup.body.append(answer_tag)
                html_answer = str(soup)
                html = html_answer.format(title=title, text=text)
            htmls.append(html)
    # paginate: build the next request from the create_time of the last topic on this page
    next_page = rsp.json().get('resp_data').get('topics')
    if next_page:
        create_time = next_page[-1].get('create_time')
        if create_time[20:23] == "000":
            end_time = create_time[:20] + "999" + create_time[23:]
        else:
            res = int(create_time[20:23]) - 1
            end_time = create_time[:20] + str(res).zfill(3) + create_time[23:]  # zfill pads with leading zeros so the result is always 3 digits
        end_time = quote(end_time)
        if len(end_time) == 33:
            end_time = end_time[:24] + '0' + end_time[24:]
        next_url = start_url + '&end_time=' + end_time
        print(next_url)
        get_data(next_url)
    return htmls
def make_pdf(htmls):
    options = {
        "user-style-sheet": "test.css",
        "page-size": "Letter",
        "margin-top": "0.75in",
        "margin-right": "0.75in",
        "margin-bottom": "0.75in",
        "margin-left": "0.75in",
        "encoding": "UTF-8",
        "custom-header": [("Accept-Encoding", "gzip")],
        "cookie": [
            ("cookie-name1", "cookie-value1"), ("cookie-name2", "cookie-value2")
        ],
        "outline-depth": 10,
    }
    html_files = []
    for index, html in enumerate(htmls):
        file = str(index) + ".html"
        html_files.append(file)
        with open(file, "w") as f:
            f.write(html)
        if index % 60 == 0 and index > 0:
            # flush every 60 files so wkhtmltopdf never gets too many inputs at once (see issue 5)
            pdfkit.from_file(html_files, str(index) + "电子书.pdf", options=options)
            for file in html_files:
                os.remove(file)
            html_files = []
    try:
        pdfkit.from_file(html_files, "电子书.pdf", options=options)
    except Exception as e:
        pass
    for file in html_files:
        os.remove(file)
    print("已制作电子书在当前目录!")
if __name__ == '__main__':
    start_url = 'https://api.zsxq.com/v1.10/groups/4224842218/topics?scope=digests&count=20'
    make_pdf(get_data(start_url))