Scraping Web Pages with Python and Beautifulsoup4

Author: 九千年小妖 | Published: 2022-08-22 18:33

A simple introduction to scraping web page data.

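1. Install the libraries

# Install the requests HTTP library
pip install requests

# Install the bs4 parsing library
pip install beautifulsoup4

# Install the lxml HTML parser
pip install lxml

# Install either xlutils3 or xlwt (only one is needed; the code below uses xlwt)
pip install xlutils3
pip install xlwt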

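2. Open the page and fetch its content

import requests
from bs4 import BeautifulSoup

# Send a GET request and fetch the HTML page (dataurl is the page address)
htmlCon = requests.get(dataurl)
# Set the encoding to match the actual page
htmlCon.encoding = "UTF-8"
# Parse the whole page with bs4, using the lxml parser
soup = BeautifulSoup(htmlCon.text, 'lxml')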
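
The fetch step can also be wrapped in a small helper. This is a minimal sketch rather than the author's code: the names make_soup and fetch_soup, the 10 second timeout, and the call to raise_for_status() are assumptions added for illustration. It uses Python's built-in html.parser so that it runs without lxml; pass 'lxml' instead to match the snippet above.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(html_text, parser="html.parser"):
    """Parse an HTML string into a BeautifulSoup tree."""
    return BeautifulSoup(html_text, parser)


def fetch_soup(url, encoding="UTF-8", timeout=10):
    """Download a page with GET and parse it (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()    # fail on 4xx/5xx instead of parsing an error page
    response.encoding = encoding   # set this to the page's real encoding
    return make_soup(response.text)


# Offline demonstration on a literal string, so no network access is needed.
demo = make_soup("<html><head><title>demo</title></head><body></body></html>")
print(demo.title.string)
```

Calling fetch_soup(dataurl) would then replace the three statements in the snippet above.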

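3. Parse according to the page's actual tags

Finding by tag name

soup.a returns only the first matching node. The result is an object of bs4's own Tag class.

Getting attributes

soup.a.attrs returns all attributes and their values as a dictionary
soup.a.attrs['href'] returns the value of one attribute
soup.a['href'] is the shorthand form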
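
A short sketch of tag and attribute access follows. The HTML string is made up for illustration, and html.parser is used here only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<div><a href="/first" title="one">First</a><a href="/second">Second</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a              # only the first <a> in the document
print(type(first_link).__name__) # the bs4 class that wraps the node
print(first_link.attrs)          # dictionary of all attributes
print(first_link["href"])        # shorthand for first_link.attrs["href"]

# Indexing raises KeyError when the attribute is missing; .get() returns None instead.
print(first_link.get("target"))
```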

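Getting text

soup.a.string
soup.a.text
soup.a.get_text()

[Note] If the tag has more than one child node (for example, text mixed with a nested tag), string returns None, while the other two return the plain text of the tag and everything inside it.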
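
The difference between the three accessors is easiest to see side by side. The markup below is invented for the demonstration.

```python
from bs4 import BeautifulSoup

html = "<p id='plain'>hello</p><p id='mixed'>hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

plain = soup.find("p", id="plain")
mixed = soup.find("p", id="mixed")

print(plain.string)      # a single text child: the text itself
print(mixed.string)      # text plus a nested <b>: None
print(mixed.text)        # all nested text joined together
print(mixed.get_text())  # same result as .text
```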

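The find_all method

Returns a list whose items are all node objects.

soup.find_all('a', limit=2) # the first two matching a tags
soup.find_all(['a', 'li']) # all a and li tags
soup.find_all('a', class_='xxx') # all a tags whose class is xxx
soup.find_all('li', class_=re.compile(r'^xiao')) # all li tags with a class starting with xiao (requires import re)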
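
The four calls can be tried on a small made-up document. The class names xxx and xiao... come from the examples above; the rest of the markup is an assumption for the demonstration.

```python
import re

from bs4 import BeautifulSoup

html = """
<ul>
  <li class="xiaoming"><a class="xxx" href="/1">one</a></li>
  <li class="xiaohong"><a class="xxx" href="/2">two</a></li>
  <li class="other"><a href="/3">three</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("a", limit=2)                        # stop after two matches
links_and_items = soup.find_all(["a", "li"])                   # either tag name
xxx_links = soup.find_all("a", class_="xxx")                   # exact class match
xiao_items = soup.find_all("li", class_=re.compile(r"^xiao"))  # class matched by regex

print(len(first_two), len(links_and_items), len(xxx_links), len(xiao_items))
```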

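The select method

ID selector: #dudu
Class selector: .xixi
Tag selector: div a h1

Examples:
div #dudu .xixi a
A space means the node on the right is a child or deeper descendant of the node on the left.
div > #dudu > a > .xixi
A > means the node on the right must be a direct child of the node on the left.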
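
Both selectors can be checked against a small document. The markup is invented so that each selector matches exactly one element.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <div id="dudu">
    <p class="xixi"><a href="/nested">nested</a></p>
    <a href="/direct"><span class="xixi">direct</span></a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space): any <a> somewhere inside .xixi inside #dudu inside a div.
descendants = soup.select("div #dudu .xixi a")

# Child combinator (>): every step must be a direct parent-child relationship.
children = soup.select("div > #dudu > a > .xixi")

print([a["href"] for a in descendants])
print([tag.get_text() for tag in children])
```

select always returns a list; select_one returns the first match or None.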

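4. Save the worksheet

# Create a new workbook
workbook = xlwt.Workbook()
# Add a sheet to the workbook
sheet = workbook.add_sheet(sheet_name, cell_overwrite_ok=True)
style = xlwt.XFStyle()  # initialize a style
sheet.write(rol, col, data)  # write data at the given row (rol) and column (col)
workbook.save("xxx.xls")  # save; xlwt writes the legacy .xls format, not .xlsx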
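
A complete, runnable version of these steps might look like the sketch below. The sheet name, the sample rows, and the temporary output path are all placeholders chosen for the demonstration; the bold header style shows one use of the XFStyle object created above.

```python
import os
import tempfile

import xlwt

# Placeholder data: a header row followed by two records.
rows = [
    ("name", "url"),
    ("first", "https://example.com/1"),
    ("second", "https://example.com/2"),
]

workbook = xlwt.Workbook(encoding="utf-8")
sheet = workbook.add_sheet("demo", cell_overwrite_ok=True)

header_style = xlwt.XFStyle()
header_style.font.bold = True  # make the header row bold

for row_index, row in enumerate(rows):
    for col_index, value in enumerate(row):
        if row_index == 0:
            sheet.write(row_index, col_index, value, header_style)
        else:
            sheet.write(row_index, col_index, value)

# xlwt produces the legacy .xls format, so the file name uses that extension.
output_path = os.path.join(tempfile.mkdtemp(), "demo.xls")
workbook.save(output_path)
print(output_path)
```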

5. Practice

Fetch the data from every list page of a traditional Chinese medicine site, category by category.

import requests
import xlwt
from bs4 import BeautifulSoup

# https://m.cnkang.com/cm/zcy/jry/
url = 'https://m.cnkang.com/'
sort = 1  # number of pages
rol = 0  # current row, starting at 0
types = []  # page categories, e.g. {"title": "解表药", "urlPath": "cm/zcy/jby"}
urlPath = "cm/zcy/jby/"
workbook = xlwt.Workbook()  # create a new workbook
sheet = None
# Collect the categories listed in the page's navigation
def getHtmlType():
    typeCon = soup.select('#mainNav > li')
    print(typeCon)
    for item in typeCon:
        title = item.find('a').get_text()
        path = item.find('a').get('href')
        typedata = {}
        typedata["title"] = str(title)
        typedata["urlPath"] = str(path)
        types.append(typedata)
    print(types)
    return types

# Get the number of pages from the page selector
def getHtmlSort():
    sortCon = soup.select('#touch_page > option')
    sort = len(sortCon)
    return sort

# Fetch the list data and links from every page of the current category
def getHtml():
    # Build one URL per page
    for num in range(sort):
        dataurl = url + urlPath + "List_" + str(num + 1) + '.html'
        print(dataurl)
        # Send a GET request and fetch the HTML page
        htmlCon = requests.get(dataurl)
        # Set the encoding to match the actual page
        htmlCon.encoding = "UTF-8"
        # Parse the whole page with bs4, using the lxml parser
        soup = BeautifulSoup(htmlCon.text, 'lxml')
        li_list = soup.select('body > div:nth-child(6) > div > a')
        # Walk each entry and read its child elements
        data = {}
        dataKey = ["title", "content", "contentUrl", "imgUrl", "sourceUrl"]
        for li in li_list:
            title = li.find("dt").get_text()
            content = li.find("dd").get_text()
            contentUrl = li.get("href")
            imgUrl = li.find("img").get("src")
            data["title"] = str(title)
            data["content"] = str(content).strip()
            data["contentUrl"] = str(url + contentUrl)
            data["imgUrl"] = str(imgUrl)
            data["sourceUrl"] = str(dataurl)
            setDataBeau(data, dataKey)

# Write one record to the next row of the current sheet
def setDataBeau(data, dataKey):
    global rol
    rol = rol + 1
    for num in range(len(data)):
        sheet.write(rol, num, data.get(dataKey[num]))


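if __name__ == '__main__':
    dataurl = url + urlPath
    htmlCon = requests.get(dataurl)
    # Set the encoding to match the actual page
    htmlCon.encoding = "UTF-8"
    # Parse the whole page with bs4, using the lxml parser
    soup = BeautifulSoup(htmlCon.text, 'lxml')
    types = getHtmlType()
    for index in range(len(types)):
        category = types[index]
        urlPath = category.get("urlPath")
        sheet_name = category.get("title")
        # Stop at the navigation entry labelled "展开" ("expand")
        if sheet_name == "展开":
            break
        sheet = workbook.add_sheet(sheet_name, cell_overwrite_ok=True)
        style = xlwt.XFStyle()  # initialize a style
        heads = ['Name', 'Summary', 'Details', 'Cover', 'Source']
        for head in range(len(heads)):
            sheet.write(0, head, heads[head])
        dataurl = url + urlPath
        print(dataurl)
        htmlCon = requests.get(dataurl)
        # Set the encoding to match the actual page
        htmlCon.encoding = "UTF-8"
        # Parse the whole page with bs4, using the lxml parser
        soup = BeautifulSoup(htmlCon.text, 'lxml')
        # Get the number of pages
        sort = getHtmlSort()
        getHtml()
        rol = 0  # restart at the top row for the next sheet
    # xlwt writes the legacy .xls format, so save with that extension
    workbook.save('ZY.xls')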
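
The script joins URLs by plain string concatenation. That works only if the href values read from the navigation are relative paths without a leading slash; whether they are depends on the site's markup, which is not shown in this article. A more forgiving alternative, offered here as a sketch and not as the author's method, is urllib.parse.urljoin. The helper name build_page_urls is hypothetical.

```python
from urllib.parse import urljoin


def build_page_urls(base_url, category_path, page_count):
    """Return the list-page URLs for one category (hypothetical helper)."""
    # urljoin copes with relative paths, root-relative paths, and absolute URLs alike.
    category_url = urljoin(base_url, category_path)
    if not category_url.endswith("/"):
        category_url += "/"
    return [
        urljoin(category_url, f"List_{page}.html")
        for page in range(1, page_count + 1)
    ]


print(build_page_urls("https://m.cnkang.com/", "/cm/zcy/jby/", 2))
print(build_page_urls("https://m.cnkang.com/", "cm/zcy/jby", 1))
```

The same call could replace the url + urlPath + "List_" + ... expression in getHtml.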

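6. Problems to solve next

Problem: some categories are hidden on the page and only become visible after a click triggers JavaScript.

Solutions:
Option 1: hard-code the hidden categories in the script
Option 2: fetch the page with selenium, then pass the result to bs4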
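
Option 2 could be sketched roughly as follows. This is an untested outline under several assumptions: Chrome and a matching driver are installed, the function names render_page and soup_from_html are hypothetical, and the click that reveals the hidden categories is left out because the article does not give the selector of the element to click. The demonstration at the bottom uses a made-up HTML string in place of a real rendered page.

```python
from bs4 import BeautifulSoup


def render_page(url):
    """Load a page in a real browser and return the HTML after JavaScript has run."""
    # Imported here so the parsing helper below still works without selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)
        # A click on the expand control would go here; its selector is not known.
        return driver.page_source
    finally:
        driver.quit()


def soup_from_html(html_text):
    """Hand HTML (for example the result of render_page) to bs4."""
    return BeautifulSoup(html_text, "html.parser")


# Offline stand-in for rendered HTML; this markup is invented for illustration.
rendered = '<ul id="mainNav"><li><a href="/a/">A</a></li><li><a href="/b/">B</a></li></ul>'
soup = soup_from_html(rendered)
print([a["href"] for a in soup.select("#mainNav > li > a")])
```

In the script from section 5, soup_from_html(render_page(dataurl)) would take the place of the requests.get call that feeds getHtmlType.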

Article link: https://www.haomeiwen.com/subject/djyggrtx.html
