Zhihu Question Answer Image Crawler (Part 3)

Author: 江山斜睨 | Published 2017-08-15 09:01 | Read 77 times


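After a successful login, the next step is to call Zhihu's API URL to fetch data. The function saveImagesFromQuestionId, defined below, saves the images for a given question ID. Its apiUrl carries three query parameters: offset, used for paging; limit, the number of answers per page; and sort_by, for which default is fine.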
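To make the three query parameters concrete, here is a minimal sketch that assembles the same first-page URL from its parts. It assumes Python 3; build_answers_api_url is a hypothetical helper name that does not appear in the original code, and the question ID is made up.

```python
from urllib.parse import urlencode

API_BASE = "https://www.zhihu.com/api/v4/questions/"


def build_answers_api_url(question_id, offset=0, limit=20, sort_by="default"):
    """Hypothetical helper: build the answers API URL for one page."""
    # A list of pairs keeps the parameters in a fixed order.
    query = urlencode([("offset", offset), ("limit", limit), ("sort_by", sort_by)])
    return API_BASE + str(question_id) + "/answers?" + query


# 12345678 is a made-up question ID used only for illustration.
print(build_answers_api_url(12345678))
```

With the defaults this yields the same query string that the article hard-codes (offset=0, limit=20, sort_by=default).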

import json

def saveImagesFromQuestionId(questionId, filePath):
    # URL prefix of a single answer page; the answer ID is appended below.
    baseQuestionUrl = 'https://www.zhihu.com/question/' + str(questionId) + '/answer/'
    # First page of the answers API.
    apiUrl = ('https://www.zhihu.com/api/v4/questions/' + str(questionId)
              + '/answers?offset=0&limit=20&sort_by=default')
    while True:
        pageCode = getPageCode(apiUrl)
        if not pageCode:
            print("Failed to open the page URL..")
            return None
        pageCodeJson = json.loads(pageCode)
        # An empty data list means there are no more answers.
        if not pageCodeJson['data']:
            break
        for i in pageCodeJson['data']:
            answerId = i['id']
            answerUrl = baseQuestionUrl + str(answerId)
            saveImagesFromUrl(answerUrl, filePath)
        # Continue with the next page.
        apiUrl = pageCodeJson['paging']['next']

The function getPageCode fetches the content of a URL. The content returned by apiUrl is JSON, so it is parsed with json.loads.
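The paging logic can be tried without touching the network. The sketch below is only an illustration under stated assumptions: fake_get_page_code and the canned FAKE_PAGES responses are hypothetical stand-ins for getPageCode and for real API output, and only the data and paging.next fields that the article's code reads are modeled.

```python
import json

# Hypothetical canned responses standing in for what getPageCode would return.
FAKE_PAGES = {
    "page1": json.dumps({"data": [{"id": 1}, {"id": 2}], "paging": {"next": "page2"}}),
    "page2": json.dumps({"data": [{"id": 3}], "paging": {"next": "page3"}}),
    "page3": json.dumps({"data": [], "paging": {"next": "page4"}}),
}


def fake_get_page_code(url):
    """Stand-in for getPageCode: returns a JSON string, or None on failure."""
    return FAKE_PAGES.get(url)


def collect_answer_ids(first_url, fetch=fake_get_page_code):
    """Follow paging.next until a page comes back with an empty data list."""
    answer_ids = []
    url = first_url
    while True:
        page_code = fetch(url)
        if not page_code:
            return None  # same behavior as the article: give up on a failed fetch
        page = json.loads(page_code)  # loads parses a string; load expects a file object
        if not page["data"]:
            break
        answer_ids.extend(item["id"] for item in page["data"])
        url = page["paging"]["next"]
    return answer_ids


print(collect_answer_ids("page1"))
```

Passing the fetch function as a parameter is a design choice made for this sketch so the loop can be exercised with fake data; the article's version calls getPageCode directly.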

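The data field holds the actual payload, namely the list of answers to the question. For each answer we use its id to fetch the full answer page from Zhihu, then call saveImagesFromUrl to grab the images in it.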
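As a small illustration of that step, the sketch below turns a data list into answer page URLs using the same URL pattern as baseQuestionUrl. The helper name answer_urls and the sample IDs are made up for the example.

```python
def answer_urls(question_id, data):
    """Hypothetical helper: map each answer entry to its answer page URL."""
    base = "https://www.zhihu.com/question/" + str(question_id) + "/answer/"
    return [base + str(item["id"]) for item in data]


# Made-up question and answer IDs, shaped like the entries under "data".
sample_data = [{"id": 111}, {"id": 222}]
for url in answer_urls(12345678, sample_data):
    print(url)
```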

import os

def saveImagesFromUrl(pageUrl, filePath):
    imagesUrl = getImageUrlFirstPage(pageUrl)
    if not imagesUrl:
        print('imagesUrl is empty')
        return
    # Create the target directory if it does not exist yet.
    if not os.path.exists(filePath):
        os.makedirs(filePath)
    write2File(imagesUrl, filePath)
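On Python 3, the existence check before os.makedirs can be folded into a single call. This is a minimal sketch, assuming Python 3.2 or later; ensure_directory is a hypothetical helper name, and the temporary directory only keeps the demo away from real paths.

```python
import os
import tempfile


def ensure_directory(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)  # exist_ok is available from Python 3.2
    return path


base = tempfile.mkdtemp()
target = os.path.join(base, "zhihu", "images")
ensure_directory(target)
ensure_directory(target)  # the second call is a no-op rather than an error
print(os.path.isdir(target))
```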

That completes a small crawler that, given a question ID, downloads the images from all of its answers.

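The next step is to keep improving the crawler by saving images with multiple threads, which should speed things up considerably because downloading is I/O-bound.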
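The article does not show the multithreaded version, so the following is only one possible shape for it, not the author's implementation. It is a sketch using the standard library's ThreadPoolExecutor on Python 3; download_all and fake_download are hypothetical names, and fake_download merely stands in for a real function that fetches and saves one image.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(image_urls, download_one, max_workers=8):
    """Run download_one(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, image_urls))


def fake_download(url):
    """Placeholder for a real download-and-save function; returns only the file name."""
    return url.rsplit("/", 1)[-1]


urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
print(download_all(urls, fake_download))
```

Executor.map returns results in the order of the input URLs, which keeps any later bookkeeping simple even though the downloads themselves run concurrently.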


Article link: https://www.haomeiwen.com/subject/gjturxtx.html
