Preface:
A small example to deepen understanding of web scraping, done mainly with bs4.
-
1. Scraping the 100 Python practice examples
-
Page analysis
Main page: (screenshot)
Sub-page: (screenshot)
-
Code implementation
1. Request the main page source
from bs4 import BeautifulSoup
import requests

url = "http://www.runoob.com/python/python-100-examples.html"
# send the request and decode the response body
content = requests.get(url).content.decode("utf-8")
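As an optional sanity check (not in the original), you could confirm the request succeeded before parsing; this is a minimal sketch using only standard requests calls:

response = requests.get(url)
response.raise_for_status()      # stop early on a non-2xx status
response.encoding = "utf-8"      # the page is utf-8, matching the decode above
content = response.text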
-
2. Get the URL of each sub-page
Find the id of the tag that leads to each sub-page URL:
(screenshot)
html = BeautifulSoup(content, "html.parser")
# print(type(html))
# find the href attribute of each exercise's a tag
a = html.find(id="content").ul.find_all("a")
# create a list to hold the sub-page urls
url_list = []
for x in a:
    url_list.append("http://www.runoob.com" + x["href"])
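A quick check, not part of the original code, to confirm the links were collected as expected:

# there should be 100 links, one per exercise
print(len(url_list))
print(url_list[:3])   # peek at the first few absolute urls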
-
3. Scrape the content of each sub-page
Find the id of the tag that holds the corresponding content:
(screenshot)
datas = []
for i in range(100):
    dic = {}
    html01 = requests.get(url_list[i]).content.decode("utf-8")
    soup02 = BeautifulSoup(html01, "html.parser")
    # page title
    dic['title'] = soup02.find(id="content").h1.text
    # problem statement
    dic['content01'] = soup02.find(id="content").p.next_sibling.next_sibling.text
    # print(content01)
    # program analysis
    dic['content02'] = soup02.find(id="content").p.next_sibling.next_sibling.next_sibling.next_sibling.text
    # source code: some pages use a highlighted block, others a plain pre tag
    try:
        dic['content03'] = soup02.find(class_="hl-main").text
    except Exception:
        dic["content03"] = soup02.find("pre").text
    datas.append(dic)
4. Save the content to a csv file
with open("100_py.csv", "a+", encoding="utf-8") as file:
    for dic in datas:
        file.write(dic['title'] + "\n")
        file.write(dic['content01'] + "\n")
        file.write(dic['content02'] + "\n")
        file.write(dic['content03'] + "\n")
        file.write("*" * 60 + "\n")
Result:
The output file contains over four thousand lines of data.
(screenshot)
Postscript:
Locating tags with bs4's find method is quite cumbersome; xpath is still the recommended approach.
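For comparison, a sketch of the step-2 link extraction using lxml's xpath; the //*[@id="content"]/ul//a path mirrors the bs4 selectors above and is an assumption about the page structure:

from lxml import etree

tree = etree.HTML(content)
# collect the relative hrefs of every exercise link under #content > ul
hrefs = tree.xpath('//*[@id="content"]/ul//a/@href')
url_list = ["http://www.runoob.com" + href for href in hrefs]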