一、The scrapy framework
info1 = response.xpath('//div[contains(@class,"store-info blk blk-reg")]/div//text()').extract()
二、The selenium framework: drives a real browser to scrape dynamic pages
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
divcon = browser.find_element(By.XPATH, '//div[@class="pagehero__content"]')
三、Parse pages with etree, without a framework, or from a local HTML file (in practice, if the project is small or the site is dynamic, prefer this approach: it lets you focus on understanding the project itself and effectively avoids network problems)
from lxml import etree

html = etree.HTML(htmlContent)
itemlist = html.xpath('//div[@class="box--list"]/div[@class="box--list-item"]')
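A self-contained sketch of this pattern; the `box--list` fragment below is invented for illustration:

```python
from lxml import etree

# Hypothetical page fragment matching the selectors above
htmlContent = '''
<div class="box--list">
  <div class="box--list-item"><a href="/a">Item A</a></div>
  <div class="box--list-item"><a href="/b">Item B</a></div>
</div>
'''

html = etree.HTML(htmlContent)
itemlist = html.xpath('//div[@class="box--list"]/div[@class="box--list-item"]')
# Each element in itemlist can be queried further with a relative .xpath()
first_link_text = itemlist[0].xpath('.//a/text()')
print(len(itemlist), first_link_text)
```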
Three helper functions:
def etreeWebElemToOuterHTML(webitem):
    # Serialize an lxml element back to its outerHTML string
    outerHTML = etree.tostring(webitem)
    outerHTML = outerHTML.decode('utf-8')
    return outerHTML

def etreeWebElemGetAttributeValue(webitem, attributeid):
    # Read an attribute (e.g. href, class) from an lxml element
    return webitem.get(attributeid)
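A quick usage sketch of the two element helpers, restated here so the snippet runs on its own; the sample markup is made up:

```python
from lxml import etree

def etreeWebElemToOuterHTML(webitem):
    # Serialize an lxml element back to its outerHTML string
    return etree.tostring(webitem).decode('utf-8')

def etreeWebElemGetAttributeValue(webitem, attributeid):
    # Read an attribute value from an lxml element
    return webitem.get(attributeid)

html = etree.HTML('<div id="main"><a href="/x">link</a></div>')
link = html.xpath('//a')[0]

print(etreeWebElemGetAttributeValue(link, 'href'))
print(etreeWebElemToOuterHTML(link))
```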
def loadpage(filepath, pagename):
    try:
        pagepath = filepath + '/' + pagename + '.html'
        # with-statement closes the file even if read() fails
        with open(pagepath, 'r', encoding='utf-8') as htmlf:
            return htmlf.read()
    except Exception as excpt:
        print(excpt)
        logging.error('loadpage failed: ' + pagename)
        return ''
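A round-trip sketch of loadpage, restated here (with a temporary directory) so it runs standalone:

```python
import logging
import tempfile

def loadpage(filepath, pagename):
    try:
        pagepath = filepath + '/' + pagename + '.html'
        with open(pagepath, 'r', encoding='utf-8') as htmlf:
            return htmlf.read()
    except Exception as excpt:
        print(excpt)
        logging.error('loadpage failed: ' + pagename)
        return ''

with tempfile.TemporaryDirectory() as tmpdir:
    # Write a sample page, then read it back with the helper
    with open(tmpdir + '/demo.html', 'w', encoding='utf-8') as fp:
        fp.write('<html><body>ok</body></html>')
    content = loadpage(tmpdir, 'demo')
    # A missing page logs the error and returns '' instead of raising
    missing = loadpage(tmpdir, 'nope')
print(content)
```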
def savepage(browser, filepath, pagename):
    try:
        # outerHTML of the <html> element is the full rendered page source
        textContent = browser.find_element(By.XPATH, '//html').get_attribute('outerHTML')
        pagepath = filepath + '/' + pagename + '.html'
        with open(pagepath, 'w', encoding='utf-8') as fp:
            fp.write(textContent)
    except Exception as excpt:
        print(excpt)
        logging.error('savepage failed: ' + pagename)
四、beautifulsoup (not used yet)
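For reference, a minimal BeautifulSoup sketch with invented markup; it uses CSS selectors as an alternative to the XPath style above:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment reusing the box--list structure for comparison
html = '<div class="box--list"><div class="box--list-item">Item A</div></div>'
soup = BeautifulSoup(html, 'html.parser')  # stdlib-based parser, no extra C dependency
items = soup.select('div.box--list > div.box--list-item')
print([i.get_text() for i in items])
```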
1. Use Python's logging module to record crawler errors.
2. Use try-except to make the crawler more robust.
try:
    # code
    pass
except Exception as exx:
    logging.error('error found: %s', exx)
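A runnable sketch combining both points: basicConfig writes errors to a file (the filename is an arbitrary example), and try-except keeps the crawler running while each failure is recorded:

```python
import logging

# Example configuration; 'spider.log' is an arbitrary filename
logging.basicConfig(filename='spider.log', level=logging.ERROR,
                    format='%(asctime)s %(levelname)s %(message)s')

def parse_price(text):
    # A scraping step that can fail on bad input
    return float(text)

prices = []
for raw in ['12.5', 'N/A', '7']:
    try:
        prices.append(parse_price(raw))
    except Exception as exx:
        # The bad record is logged and skipped; the loop continues
        logging.error('error found parsing %r: %s', raw, exx)

print(prices)
```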