python3 [Crawler in Practice] Scraping with selenium and xpath

Author: 简书用户9527 | Source: published 2017-10-11 22:28, read 152 times


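This time the goal is to scrape the review content for a phone sold by one specific store. I call it a JD (京东) store here, although the URL in the code below points at a Tmall product page (detail.tmall.com).

I originally wrote this together with the previous post but did not have time to write both up at once, so I am a little rusty on the details now. For the moment it only fetches the reviews of a single product.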

[Image: 爬取的字段内容.png (the fields being scraped)]

Fields scraped: review text, phone model purchased, reviewer.

Here is the code:

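# -*- coding: utf-8 -*-
# @Time    : 2017/9/18 23:16
# @Author  : 蛇崽
# @Email   : 17193337679@163.com
# @File    : TaoBaoZUK1Detail.py  (ZUK Z1 detail page content)

import time
from selenium import webdriver
from lxml import etree

# Raw string so the backslashes in the Windows path are not read as escapes
chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
# Selenium 3 style call; newer Selenium 4 releases expect the path via a Service object
browser = webdriver.Chrome(chromedriver)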

# Fetch the first page of data
def gethtml():
    url = "https://detail.tmall.com/item.htm?id=531993957001&skuId=3609796167425&user_id=268451883&cat_id=2&is_b=1&rn=71b9b0aeb233411c4f59fe8c610bc34b"
    browser.get(url)
    time.sleep(5)
    # Scroll down in two steps, pausing so the page can load more content
    browser.execute_script('window.scrollBy(0,3000)')
    time.sleep(2)
    browser.execute_script('window.scrollBy(0,5000)')
    time.sleep(2)

    # Open the "累计评价" (cumulative reviews) tab
    reviews_tab = browser.find_element_by_xpath('//*[@id="J_TabBar"]/li[3]/a')
    reviews_tab.click()
    # Give the review list time to load before reading the page source
    time.sleep(2)
    html = browser.page_source
    return html



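def getcomments(html):
    # Read the total review count from the tab label and turn it into a page count
    source = etree.HTML(html)
    counts = source.xpath("//*[@id='J_TabBar']/li[3]/a/em/text()")
    print('raw review count:', counts)
    # 20 reviews per page, rounded up so a partly filled last page is included
    pages = (int(counts[0]) + 19) // 20
    print('review pages:', pages)
    return pages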



# print(html)
def parseHtml(html):
    html = etree.HTML(html)
    commentlist = html.xpath("//*[@class='rate-grid']/table/tbody")
    for comment in commentlist:
        # Review text
        vercomment = comment.xpath(
            "./tr/td[@class='tm-col-master']/div[@class='tm-rate-content']/div[@class='tm-rate-fulltxt']/text()")
        # Phone model purchased
        verphone = comment.xpath("./tr/td[@class='col-meta']/div[@class='rate-sku']/p[@title]/text()")
        print(vercomment)
        print(verphone)
        # Reviewer (first and last characters shown, the middle masked with ****)
        veruser = comment.xpath("./tr/td[@class='col-author']/div[@class='rate-user-info']/text()")
        print(veruser)
    print(len(commentlist))

# parseHtml(html)
# print('*'*20)

def nextbuttonwork(num):
    # Click "next page" and parse the page that loads
    if num != 0:
        browser.execute_script('window.scrollBy(0,3000)')
        time.sleep(2)
        try:
            # The last link in the pager is expected to be "下一页" (next page)
            browser.find_element_by_css_selector('#J_Reviews > div > div.rate-page > div > a:last-child').click()
            # Alternative tried earlier:
            # browser.find_element_by_xpath('//*[@id="J_Reviews"]/div/div[7]/div/a[3][contains(text(), "下一页")]').click()
        except Exception as e:
            # Report the failure instead of silently parsing the same page again
            print('next button not clicked:', e)
            return
        time.sleep(2)
        browser.execute_script('window.scrollBy(0,3000)')
        time.sleep(2)
        browser.execute_script('window.scrollBy(0,5000)')
        time.sleep(2)
        html = browser.page_source
        parseHtml(html)
        print('nextclick finish')


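def selenium_work(html):
    # Parse the first page, which is already loaded
    print('selenium start ... ')
    parseHtml(html)
    print('selenium end....')


def gettotalpagecomments(pages, html):
    # Parse page 1 once, then click "next page" for each remaining page;
    # nextbuttonwork parses every page it lands on
    selenium_work(html)
    for i in range(1, pages):
        nextbuttonwork(i)

data = gethtml()
# Number of review pages
pages = getcomments(data)
# Walk through every review page
gettotalpagecomments(pages, data)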
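
getcomments turns the review total into a page count by dividing by 20. There are two common ways to write that division, and they differ at exact multiples of the page size. The sketch below is only a worked check, assuming 20 reviews per page as that division implies; pages_truncate and pages_ceil are names made up for this illustration.

```python
def pages_truncate(total, per_page=20):
    # Truncate the quotient, then add one
    return int(total / per_page) + 1


def pages_ceil(total, per_page=20):
    # Ceiling division: a partly filled last page still counts as a page
    return (total + per_page - 1) // per_page


# Print both results side by side for a few review totals
for total in (1, 19, 20, 21, 40, 152):
    print(total, pages_truncate(total), pages_ceil(total))
```

The two agree except when the total is an exact multiple of 20, where truncate-and-add-one asks for one page more than exists.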

This part still works fine.

[Image: scrape results]
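
parseHtml prints three separate lists per tbody, so matching a review to its reviewer is left to the reader. A per-row variant might look like the sketch below. It runs against a small hand-written sample rather than the real page, and it uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra installs. The class names are copied from the XPath expressions in the code above; SAMPLE, its values, first_text and parse_rows are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the review table; not real page data.
SAMPLE = """
<div class="rate-grid"><table><tbody>
  <tr>
    <td class="tm-col-master"><div class="tm-rate-content">
      <div class="tm-rate-fulltxt">Battery lasts all day</div></div></td>
    <td class="col-meta"><div class="rate-sku"><p title="64GB white">64GB white</p></div></td>
    <td class="col-author"><div class="rate-user-info">a****z</div></td>
  </tr>
  <tr>
    <td class="tm-col-master"><div class="tm-rate-content">
      <div class="tm-rate-fulltxt">Screen is sharp</div></div></td>
    <td class="col-meta"><div class="rate-sku"><p title="32GB grey">32GB grey</p></div></td>
    <td class="col-author"><div class="rate-user-info">b****y</div></td>
  </tr>
</tbody></table></div>
"""


def first_text(row, path):
    # Return the stripped text of the first match, or '' when nothing matches
    node = row.find(path)
    return (node.text or '').strip() if node is not None else ''


def parse_rows(markup):
    # Build one dict per table row so the three fields stay together
    root = ET.fromstring(markup.strip())
    records = []
    for row in root.findall("./table/tbody/tr"):
        records.append({
            'comment': first_text(row, "./td[@class='tm-col-master']/div[@class='tm-rate-content']/div[@class='tm-rate-fulltxt']"),
            'phone': first_text(row, "./td[@class='col-meta']/div[@class='rate-sku']/p[@title]"),
            'user': first_text(row, "./td[@class='col-author']/div[@class='rate-user-info']"),
        })
    return records


for record in parse_rows(SAMPLE):
    print(record)
```

Keeping one dict per row also makes it easier to save the results later, since each record can be written out as one line or one document.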

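Shortcomings:

This mainly scrapes a single page. I still could not get hold of the next-page button and I do not know why; it may be because of AJAX. I also want to say that the big companies really are impressive, though as a crawler engineer this is something you cannot avoid at work. If anyone has scraped JD product reviews, I would appreciate some pointers for a beginner.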
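
If the pager really is filled in by AJAX, the fixed time.sleep(2) calls may simply fire before the next-page link exists. A common alternative is to poll until a condition holds, with a timeout; Selenium offers this idea as WebDriverWait together with expected_conditions. The sketch below shows the polling in plain Python so that it runs without a browser. wait_until and fake_find_next_button are hypothetical names, and the fake lookup only stands in for a call such as browser.find_element_by_css_selector; whether this cures the problem on the real page is untested.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    # Call condition() repeatedly until it returns a truthy value,
    # or raise TimeoutError once `timeout` seconds have passed.
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            # Treat "element not found" style errors as "not ready yet"
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(interval)


# Stand-in for an element lookup: fails twice, then "finds" the button.
attempts = {'count': 0}


def fake_find_next_button():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise LookupError('next-page link not rendered yet')
    return 'next-page link'


button = wait_until(fake_find_next_button, timeout=5.0, interval=0.1)
print(button, 'found after', attempts['count'], 'attempts')
```

With a real driver the condition would be a lambda wrapping the element lookup, and the click would happen only after wait_until returns.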
