一、The scrapy framework
info1 = response.xpath('//div[contains(@class,"store-info blk blk-reg")]/div//text()').extract()
二、The selenium framework: drives a real browser to scrape dynamic pages
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
divcon = browser.find_element(By.XPATH, '//div[@class="pagehero__content"]')
三、Parse pages with etree, without a framework, or from a local HTML file (in practice, if the project is small or the site is dynamic, prefer this approach: it lets you focus on understanding the project itself and effectively avoids network problems)
from lxml import etree

html = etree.HTML(htmlContent)
itemlist = html.xpath('//div[@class="box--list"]/div[@class="box--list-item"]')
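A self-contained sketch of this pattern; the `box--list` fragment below is invented for illustration:

```python
from lxml import etree

# Hypothetical page fragment matching the selectors above
htmlContent = '''
<div class="box--list">
  <div class="box--list-item"><a href="/a">Item A</a></div>
  <div class="box--list-item"><a href="/b">Item B</a></div>
</div>
'''

html = etree.HTML(htmlContent)
itemlist = html.xpath('//div[@class="box--list"]/div[@class="box--list-item"]')
# Each element in itemlist can be queried further with a relative .xpath()
first_link_text = itemlist[0].xpath('.//a/text()')
print(len(itemlist), first_link_text)
```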
Three helper functions:
def etreeWebElemToOuterHTML(webitem):
    # Serialize an lxml element back to its outerHTML string
    outerHTML = etree.tostring(webitem)
    outerHTML = outerHTML.decode('utf-8')
    return outerHTML

def etreeWebElemGetAttributeValue(webitem, attributeid):
    # Read an attribute (e.g. href, class) from an lxml element
    return webitem.get(attributeid)
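A quick usage sketch of the two element helpers, restated here so the snippet runs on its own; the sample markup is made up:

```python
from lxml import etree

def etreeWebElemToOuterHTML(webitem):
    # Serialize an lxml element back to its outerHTML string
    return etree.tostring(webitem).decode('utf-8')

def etreeWebElemGetAttributeValue(webitem, attributeid):
    # Read an attribute value from an lxml element
    return webitem.get(attributeid)

html = etree.HTML('<div id="main"><a href="/x">link</a></div>')
link = html.xpath('//a')[0]

print(etreeWebElemGetAttributeValue(link, 'href'))
print(etreeWebElemToOuterHTML(link))
```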
def loadpage(filepath, pagename):
    try:
        pagepath = filepath + '/' + pagename + '.html'
        # with-statement closes the file even if read() fails
        with open(pagepath, 'r', encoding='utf-8') as htmlf:
            return htmlf.read()
    except Exception as excpt:
        print(excpt)
        logging.error('loadpage failed: ' + pagename)
        return ''
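A round-trip sketch of loadpage, restated here (with a temporary directory) so it runs standalone:

```python
import logging
import tempfile

def loadpage(filepath, pagename):
    try:
        pagepath = filepath + '/' + pagename + '.html'
        with open(pagepath, 'r', encoding='utf-8') as htmlf:
            return htmlf.read()
    except Exception as excpt:
        print(excpt)
        logging.error('loadpage failed: ' + pagename)
        return ''

with tempfile.TemporaryDirectory() as tmpdir:
    # Write a sample page, then read it back with the helper
    with open(tmpdir + '/demo.html', 'w', encoding='utf-8') as fp:
        fp.write('<html><body>ok</body></html>')
    content = loadpage(tmpdir, 'demo')
    # A missing page logs the error and returns '' instead of raising
    missing = loadpage(tmpdir, 'nope')
print(content)
```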
def savepage(browser, filepath, pagename):
    try:
        # outerHTML of the <html> element is the full rendered page source
        textContent = browser.find_element(By.XPATH, '//html').get_attribute('outerHTML')
        pagepath = filepath + '/' + pagename + '.html'
        with open(pagepath, 'w', encoding='utf-8') as fp:
            fp.write(textContent)
    except Exception as excpt:
        print(excpt)
        logging.error('savepage failed: ' + pagename)
四、beautifulsoup (not used yet)
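For reference, a minimal BeautifulSoup sketch with invented markup; it uses CSS selectors as an alternative to the XPath style above:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment reusing the box--list structure for comparison
html = '<div class="box--list"><div class="box--list-item">Item A</div></div>'
soup = BeautifulSoup(html, 'html.parser')  # stdlib-based parser, no extra C dependency
items = soup.select('div.box--list > div.box--list-item')
print([i.get_text() for i in items])
```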
1. Use Python's logging module to record crawler errors.
2. Use try-except to make the crawler more robust.
try:
    # code
    pass
except Exception as exx:
    logging.error('error found: %s', exx)
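A runnable sketch combining both points: basicConfig writes errors to a file (the filename is an arbitrary example), and try-except keeps the crawler running while each failure is recorded:

```python
import logging

# Example configuration; 'spider.log' is an arbitrary filename
logging.basicConfig(filename='spider.log', level=logging.ERROR,
                    format='%(asctime)s %(levelname)s %(message)s')

def parse_price(text):
    # A scraping step that can fail on bad input
    return float(text)

prices = []
for raw in ['12.5', 'N/A', '7']:
    try:
        prices.append(parse_price(raw))
    except Exception as exx:
        # The bad record is logged and skipped; the loop continues
        logging.error('error found parsing %r: %s', raw, exx)

print(prices)
```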