A long time ago I posted some notes from when I first learned web scraping by crawling the Doupo Cangqiong novel. Then, because of a job change, I didn't touch scraping for quite a while. Now I want to pick the skill back up and relearn it from scratch.
This time the scrape starts directly from the book's index page:
http://book.doupoxs.com/doupocangqiong/
The main idea is:
1- Scrape the hyperlinks of all chapters from the index page, keeping them in chapter order.
2- Fetch each chapter's title and body text.
3- Save the scraped content to a text file.
The code is as follows:
import requests
from bs4 import BeautifulSoup
import time
import random
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
def get_chapter_urls(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # Method 1: collect the chapter links from the chapter-list box
    chapters = soup.find('div', class_='xsbox clearfix').find_all('a')
    urls = [a.get('href') for a in chapters]
    # Method 2: collect all <a> tags, keep only chapter links, then sort by chapter number
    ## chapters = soup.find_all('a')
    ## urls = [a.get('href') for a in chapters if a.get('href').startswith('/doupocangqiong/') and a.get('href').split('/')[-1].split('.')[0].isdigit()]
    ## urls.sort(key=lambda x: int(x.split('/')[-1].split('.')[0]))  # sort in chapter order
    return ['http://book.doupoxs.com' + u for u in urls]
def get_chapter_content(url):
    time.sleep(random.randint(1, 3))  # random 1-3 second delay between requests
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.find('div', class_='entry-tit').text  # chapter title
    content = soup.find('div', class_='m-post').text.replace('\xa0'*8, '\n\n')  # chapter body; 8 non-breaking spaces mark a paragraph break
    return title, content
def save_to_file(title, content, filename):
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(content + '\n\n')
def main(url, filename):
    chapter_urls = get_chapter_urls(url)
    for chapter_url in chapter_urls:
        title, content = get_chapter_content(chapter_url)
        save_to_file(title, content, filename)
if __name__ == "__main__":
    main('http://book.doupoxs.com/doupocangqiong/', 'doupo.txt')
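On a long run the site can throttle requests or time out part-way through. As a rough, hypothetical variation (not part of the original script), the main loop could retry each chapter a few times and print progress, reusing the functions defined above; requests.RequestException covers the usual network errors:

def main_with_retries(url, filename, max_retries=3):
    # Same flow as main(), but each chapter is retried a few times
    # so a single timeout does not abort the whole run.
    chapter_urls = get_chapter_urls(url)
    total = len(chapter_urls)
    for i, chapter_url in enumerate(chapter_urls, start=1):
        for attempt in range(max_retries):
            try:
                title, content = get_chapter_content(chapter_url)
                save_to_file(title, content, filename)
                print(f'[{i}/{total}] saved: {title}')
                break
            except requests.RequestException as exc:
                print(f'attempt {attempt + 1} failed for {chapter_url}: {exc}')
                time.sleep(5)  # back off briefly before retrying

Calling main_with_retries('http://book.doupoxs.com/doupocangqiong/', 'doupo.txt') in the __main__ block would be a drop-in replacement for main().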
The final scraped novel is shown in the screenshot below:
[Screenshot: 2023-08-01_12-01-21.png]