Java Crawler in Practice: Scraping Page Information with XPath Expressions

Author: 测试开发栈 | Source: published 2017-09-12 09:35, 1270 reads

Preface

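I previously wrote about scraping page information with Jsoup in 《Java爬虫实战——利用Jsoup爬取网页资源》 ("Java Crawler in Practice: Scraping Web Resources with Jsoup"). That article relied on Jsoup's selector syntax to locate and filter page content, which has some limitations and is not very convenient: without practice, working out the right selector can take a long time. Anyone with a Web automation background who has written WebDriver element locators will already know XPath, which can locate almost any element. This article therefore combines Jsoup with XPath to extract the information we need from a page.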

How to do it

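The solution combines Jsoup, javax.xml, and htmlcleaner: Jsoup fetches the page's HTML, htmlcleaner together with javax.xml turns it into a DOM, and XPath expressions then locate the information on the page. The concrete steps follow.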
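The XPath half of this pipeline can be tried without any network access or third-party library. The following is a minimal sketch, not the article's helper class: it assumes the markup is already well-formed, so the JDK's own DocumentBuilder stands in for the Jsoup fetch and the htmlcleaner clean-up (real pages are rarely well-formed, which is why the article uses htmlcleaner). The class name XPathPipelineSketch, the method selectText, and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPipelineSketch {

    // Hypothetical stand-in for a fetched page. It is already well-formed,
    // so no cleaning step is needed in this sketch.
    static final String SAMPLE_HTML =
            "<html><body><ul class='note-list'>"
          + "<li><a class='title' href='/p/1'>First post</a></li>"
          + "<li><a class='title' href='/p/2'>Second post</a></li>"
          + "</ul></body></html>";

    // Parses the markup into a W3C DOM and returns the text of every node
    // matched by the XPath expression, in document order.
    static List<String> selectText(String markup, String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(expression, dom, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup or evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Prints: [First post, Second post]
        System.out.println(selectText(SAMPLE_HTML, "//ul[@class='note-list']/li//a[@class='title']"));
    }
}
```

Swapping the DocumentBuilder call for an htmlcleaner TagNode serialized through DomSerializer gives the same Document type, so the XPath part stays unchanged.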

Implementation steps

1. Required dependencies
Maven is used here:

<dependency>    
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
<dependency>  
    <groupId>javax.xml</groupId>  
    <artifactId>jaxp-api</artifactId>  
    <version>1.4.2</version>  
</dependency>  
<dependency>  
    <groupId>net.sourceforge.htmlcleaner</groupId>  
    <artifactId>htmlcleaner</artifactId>  
    <version>2.9</version>  
</dependency>

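2. Wrapping Jsoup
There is not much to say about the code; it was written with the API documentation open. Here is the finished JsoupHelper class for reference.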

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class JsoupHelper {
    
    public static Object fecthNode(String url, String xpath) throws Exception {
        String html = null;
        try {
            Connection connect = Jsoup.connect(url);
            html = connect.get().body().html();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
        
        HtmlCleaner hc = new HtmlCleaner();
        TagNode tn = hc.clean(html);
        Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);
        XPath xPath = XPathFactory.newInstance().newXPath();
        
        Object result = xPath.evaluate(xpath, dom, XPathConstants.NODESET);
        
        return result;
    }
    /**
     * Returns the text and href attribute value of each a element matched by the XPath.
     */
    public static Map<String, String> fecthByMap(String url, String xpath) throws Exception {
        Map<String, String> nodeMap = new LinkedHashMap<>();
        
        Object result = fecthNode(url, xpath);
        
        if (result instanceof NodeList) {
            NodeList nodeList = (NodeList) result;
            
            for (int i = 0; i < nodeList.getLength(); i++) {
                Node node = nodeList.item(i);
                if(node == null){
                    continue;
                }
                nodeMap.put(node.getTextContent(), node.getAttributes().getNamedItem("href")!=null ? 
                        node.getAttributes().getNamedItem("href").getTextContent() : "");
                
                System.out.println(node.getTextContent() + " : " + node.getAttributes().getNamedItem("href"));
            }
        }
        
        return nodeMap;
    }
    /**
     * Returns the value of the given attribute for each node matched by the XPath.
     */
    public static List<String> fecthAttr(String url, String xpath, String attr) throws Exception {
        List<String> list = new ArrayList<>();
        
        Object result = fecthNode(url, xpath);
        
        if (result instanceof NodeList) {
            NodeList nodeList = (NodeList) result;
            
            for (int i = 0; i < nodeList.getLength(); i++) {
                Node node = nodeList.item(i);
                if(node == null){
                    continue;
                }
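                // Skip nodes that lack the requested attribute instead of throwing a NullPointerException.
                if (node.getAttributes() == null || node.getAttributes().getNamedItem(attr) == null) {
                    continue;
                }
                list.add(node.getAttributes().getNamedItem(attr).getTextContent());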
                
                //System.out.println(node.getTextContent() + " : " + node.getAttributes().getNamedItem("href"));
            }
        }
        
        return list;
    }
}

Let's test it and see the result:

public static void main(String[] args) throws Exception {
    JsoupHelper.fecthByMap("http://www.jianshu.com/u/bf7b9c013c55", "//ul[@class='note-list']/li//a[@class='title']");
}

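The call scrapes every article title and URL from a Jianshu user's home page. The XPath expression selects the a elements with class title inside the note-list list.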

Running the main method above prints the following log:

Java爬虫实战&mdash;&mdash;利用jsoup爬取网页资源 : href="/p/3c23726ab833"
WebDriver对象管理之PageObject与PageFactory对比 : href="/p/de643bbf534b"
测试er如何开启自己的第二事业？ : href="/p/2fa1555a34d0"
自动化测试之随机化测试思想 : href="/p/2fc1b73bacda"
SDK、JAR等Library API形式的发布包该如何测试？ : href="/p/de3a3e7b4ef7"
Android自定义Dialog对话框 : href="/p/42cebea746e7"
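The href values in this log are site-relative paths, so they usually need to be resolved against the page URL before they can be fetched. Below is a minimal sketch of that step using java.net.URI from the JDK; it assumes a title-to-href map shaped like the one fecthByMap returns. The class HrefResolverSketch and the method toAbsolute are hypothetical names and are not part of the article's code.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class HrefResolverSketch {

    // Resolves each (possibly relative) href against the URL of the page it was scraped from.
    // Empty hrefs, which fecthByMap stores when the attribute is missing, are left empty.
    static Map<String, String> toAbsolute(String pageUrl, Map<String, String> titleToHref) {
        URI base = URI.create(pageUrl);
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : titleToHref.entrySet()) {
            String href = entry.getValue();
            resolved.put(entry.getKey(), href.isEmpty() ? "" : base.resolve(href).toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> scraped = new LinkedHashMap<>();
        scraped.put("Android自定义Dialog对话框", "/p/42cebea746e7");
        // The relative path replaces the path of the page URL.
        System.out.println(toAbsolute("http://www.jianshu.com/u/bf7b9c013c55", scraped));
    }
}
```

Keeping this as a separate step leaves the scraping helper free of any knowledge about how the links will be used later.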

Summary

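To wrap up, the advantages of this approach are:
1. XPath expressions are more flexible and cover a wider range; almost any element can be located.
2. Anyone familiar with Web automation testing will find it easy to pick up, with almost no learning curve.
And what can it be used for?
1. Web automation testing: traverse the page elements obtained through Jsoup to verify page functionality.
2. Data collection: traverse the page content, filter it with XPath, and collect the information needed.
The advantages and uses of this combination certainly go beyond the points above (I currently use it for data collection). These are just the ones I have run into so far, and you are welcome to explore further.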
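To illustrate the flexibility mentioned above, here is a small sketch of a few XPath 1.0 features that go beyond simple class matching: the functions contains(), starts-with(), count(), and last(), plus direct attribute selection. It uses only the JDK's javax.xml.xpath on a hypothetical, well-formed sample fragment; the class name XPathFeatureSketch and the markup are assumptions made for the example.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

class XPathFeatureSketch {

    // Hypothetical sample markup; note the first link carries two class names.
    static final String SAMPLE =
            "<div>"
          + "<a class='title big' href='/p/1'>Jsoup basics</a>"
          + "<a class='avatar' href='/u/9'>me</a>"
          + "<a class='title' href='/p/2'>XPath basics</a>"
          + "</div>";

    // Evaluates the expression against SAMPLE and returns the result as a string.
    static String eval(String expression) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            return xPath.evaluate(expression, new InputSource(new StringReader(SAMPLE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // contains() matches links whose class attribute includes "title": prints 2
        System.out.println(eval("count(//a[contains(@class,'title')])"));
        // last() picks the final matching link, /@href selects its attribute: prints /p/2
        System.out.println(eval("//a[contains(@class,'title')][last()]/@href"));
        // starts-with() filters on the attribute value: prints me
        System.out.println(eval("//a[starts-with(@href,'/u/')]/text()"));
    }
}
```

An exact match such as @class='title' would miss the first link because its class attribute holds two names, which is where contains() helps.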

This article originally appeared on the WeChat official account 测试开发栈. Please contact the author before reposting, and keep the attribution.