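A crawler for books on Douban
Preface
It has been a while since I last posted on Jianshu, so this time I am writing about something practical.
First of all, this crawler is written in Python. This article is for academic reference only and serves no other purpose.
Pitfalls when running the script
The script requires a Python 3.7 interpreter and uses BeautifulSoup4 to parse HTML. If you get a runtime error saying the bs4 or BeautifulSoup package cannot be found, configure the Project Interpreter first. If you develop in PyCharm, open Settings > Project Interpreter, click '+', then search for the BeautifulSoup4 package online and install it.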
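If you are not using PyCharm, a quick check from Python itself can tell you whether the modules are importable before you run the crawler. This is only a minimal sketch: the helper name missing_packages is made up for illustration, and the module list assumes the script needs nothing beyond bs4 and requests.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "bs4" is the import name of the BeautifulSoup4 package.
    missing = missing_packages(["bs4", "requests"])
    if missing:
        print("Missing modules: " + ", ".join(missing))
        print("Try: python -m pip install beautifulsoup4 requests")
    else:
        print("All required modules are available.")
```

Run it with the same interpreter that the project uses, otherwise the result says nothing about the crawler's environment.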
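(I am not publishing the full script here, to prevent malicious use. If you run into problems, feel free to message me privately and I will help as best I can.)
Here is a brief walkthrough of the script.
1. Start from Douban's tag index page and crawl the books by looping over the tags listed there (this is the overall idea)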
main_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-hot'
2. Loop over the tags
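import requests
from bs4 import BeautifulSoup

# tag_head_url and isDebug are module-level settings from the full script (not shown here).
def get_book_tags():
    req_result = requests.get(main_url)
    if req_result.status_code == 200:
        html_str = req_result.content.decode('utf-8')
        soup = BeautifulSoup(html_str, 'html.parser')
        tags = soup.select('#content > div > div.article > div:nth-child(2) > div')
        for div in tags:
            trs = div.select('.tagCol > tbody > tr')
            for tr in trs:
                # Note: tr.a is only the first <a> in the row;
                # use tr.select('a') to visit every link in a row.
                print(tr.a.attrs['href'] + ' ' + tr.a.text.strip())
                get_book_list(tag_head_url + tr.a.attrs['href'])
                if isDebug:
                    return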
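[get_book_list()] fetches all the books listed under the current tag page.
3. Get all the books under a tag. One thing to watch for is the next-page link: when a page has one, the function calls itself recursively on that page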
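The script builds each tag URL by concatenating tag_head_url with the href taken from the page. A slightly more forgiving way to do the same thing is urljoin from the standard library, which also copes with an href that is already absolute. This is a sketch under assumptions: the value of TAG_HEAD_URL below is my guess, since the post does not show what tag_head_url is set to, and absolute_url is a made-up helper name.

```python
from urllib.parse import urljoin

# Assumed value; the original script does not show tag_head_url.
TAG_HEAD_URL = "https://book.douban.com"


def absolute_url(href, base=TAG_HEAD_URL):
    """Resolve a link taken from a page against the site's base URL."""
    return urljoin(base, href)


if __name__ == "__main__":
    print(absolute_url("/tag/小说"))
    print(absolute_url("https://book.douban.com/tag/小说"))
```

Plain concatenation would turn the second href into a broken URL, while urljoin leaves it unchanged.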
import time

def get_book_list(url):
    print('get_book_list: ' + url)
    req_result = requests.get(url)
    # print('get_book_list: ' + str(req_result.status_code))
    if req_result.status_code == 200:
        html_str = req_result.content.decode('utf-8')
        soup = BeautifulSoup(html_str, 'html.parser')
        book_list = soup.select('#subject_list > ul > li')
        for book in book_list:
            book_detail_url = book.select('.info > h2')[0].a.attrs['href']
            book_detail_name = book.select('.info > h2')[0].a.attrs['title']
            get_book_detail(book_detail_url, book_detail_name)
            if isDebug:
                return
        # Next-page link; stop when the page has none (for example on the last page)
        next_spans = soup.select('#subject_list > div.paginator > span.next')
        if next_spans and next_spans[0].a is not None:
            next_tags = next_spans[0].a.attrs['href']
            time.sleep(1)
            get_book_list(tag_head_url + next_tags)
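Because get_book_list() calls itself once per page, a tag with many pages could in principle run into Python's recursion limit. The same walk can be written as a loop. The sketch below is not the script's actual code: crawl_all_pages and the fetch_page callback are hypothetical names, and fetch_page stands in for the "download and parse one page" step so the example runs without touching the network.

```python
import time


def crawl_all_pages(first_url, fetch_page, delay_seconds=1.0, max_pages=1000):
    """Follow next-page links in a loop instead of recursing.

    fetch_page(url) must return (items, next_url); next_url is None on the last page.
    """
    collected = []
    seen = set()
    url = first_url
    # Stop on the last page, on a link we already visited, or at the page cap.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, next_url = fetch_page(url)
        collected.extend(items)
        url = next_url
        if url is not None:
            time.sleep(delay_seconds)  # be polite between requests
    return collected


if __name__ == "__main__":
    # Fake two-page site used only to demonstrate the control flow.
    fake_site = {
        "page1": (["book A", "book B"], "page2"),
        "page2": (["book C"], None),
    }
    print(crawl_all_pages("page1", fake_site.__getitem__, delay_seconds=0))
```

In the real crawler, fetch_page would wrap the requests.get and soup.select calls shown above and return the next-page href (or None) together with the books found on the page.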
4. Next, fetch the details of each book in the list [get_book_detail()]
def get_book_detail(url, name):
    global book_count
    book_count = book_count + 1
    print('Book #' + str(book_count) + ': ' + name + ' url: ' + url)
    if is_test_url is False:
        if start_index >= book_count:
            return
    try:
        req_result = requests.get(url)
    except requests.exceptions.ConnectionError:
        print('Failed url: ' + url)
        return
    if req_result.status_code == 200:
        html_str = req_result.content.decode('utf-8')
        soup = BeautifulSoup(html_str, 'html.parser')
        # Cover image URL
        main_pic = soup.select('#mainpic')[0].a.attrs['href']
        print('Cover: ' + main_pic)
        # Author ID ??
        articles = soup.select('#content > div > div.article > div')
        for article_item in articles:
            pos = str(article_item).find('collect_form_')
            if pos > -1:
                start_pos = str(article_item).rfind('_') + 1
                end_pos = str(article_item).rfind('"')
                author_id = str(article_item)[start_pos:end_pos]
                print('Author ID: ' + author_id)
                break
        # Content summary
        # intro = ''
        # link_report_list = soup.select("#link-report > div")
        # link_len = len(link_report_list)
        # if link_len > 0:
        #     link_report = link_report_list[0]
        #     intro = link_report.select(".intro")[0].text.strip()
        # else:
        #     intro_div = soup.select('#link-report > span.all.hidden > div > div')[0]
        #     intro = intro_div.text.strip()
        # print('Content summary: ' + str(intro))
        deal_content_intro(soup)
        # Author introduction
        deal_author_intro(soup)
        # Book information
        book_detail = soup.select('#info')[0]
        pl_list = book_detail.select('.pl')
        for pl in pl_list:
            deal_with_key_map(pl.text, str(pl), str(book_detail))
        # Common tags
        tags_section_span = soup.select("#db-tags-section > div.indent > span")
        tag_value = ''
        for span in tags_section_span:
            # print(span.a.text.strip())
            tag_value = tag_value + span.a.text.strip() + ' '
        print('Common tags: ' + tag_value)
        time.sleep(1)
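The ID lookup in get_book_detail() slices the HTML between the last underscore and the last double quote of the whole block, which can break as soon as another underscore or quote appears later in that block. A regular expression is one possible alternative. This is a sketch, assuming the ID is a run of digits directly after collect_form_; the function name and the sample HTML are invented for illustration and are not taken from Douban's page.

```python
import re

# Assumption: the ID is made of digits that directly follow "collect_form_".
_COLLECT_FORM_ID = re.compile(r"collect_form_(\d+)")


def extract_collect_form_id(html_text):
    """Return the digits after 'collect_form_', or None if the marker is absent."""
    match = _COLLECT_FORM_ID.search(html_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up snippet, used only to show the call.
    sample = '<form id="collect_form_1234567" method="post"></form>'
    print(extract_collect_form_id(sample))
```

Returning None when nothing matches lets the caller skip the book instead of printing a wrong ID.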