Randall | 4. Jsoup

Author: mrzhqiang | Published: 2017-10-15 21:26

1. What is Jsoup?

Quoting the introduction on the Jsoup website:

jsoup: Java HTML Parser

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

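What does it actually do? Much like a browser's F12 developer tools, it splits a page's source into nodes and child nodes, so you can easily extract individual elements from the page. Combined with a capable networking library, this is powerful enough to write your own crawler for scraping small websites. For example, the Retrofit samples use Jsoup together with Retrofit to implement a simple crawler.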
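As a small illustration of "extracting elements", the sketch below parses an HTML string and lists every link with its absolute URL. The class name `LinkExtractor`, the sample HTML, and the `example.com` base URI are made up for this example; only `Jsoup.parse`, `select`, `text`, and `absUrl` are Jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {

  // Collects "text -> absolute URL" for every <a> that has an href attribute.
  // baseUri (an assumption of this sketch) is used to resolve relative links.
  public static List<String> extractLinks(String html, String baseUri) {
    Document doc = Jsoup.parse(html, baseUri);
    List<String> links = new ArrayList<>();
    for (Element a : doc.select("a[href]")) {
      links.add(a.text() + " -> " + a.absUrl("href"));
    }
    return links;
  }

  public static void main(String[] args) {
    // Hypothetical sample markup, not taken from any real site
    String html = "<p><a href=\"/docs\">Docs</a> <a href=\"http://example.org/\">Home</a></p>";
    extractLinks(html, "http://example.com/").forEach(System.out::println);
  }
}
```

A CSS selector such as `a[href]` is often shorter than walking the tree by hand, which is why it is used here.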

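2. I'd like to try it. Is there a getting-started guide?

Yes: the Jsoup website (jsoup.org) has a cookbook for getting started.

A Chinese translation is available as well.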

3. How do you parse haowanba.com with Jsoup?

[Figure: the haowanba home page]
Let's get started:

Create a URL for the resource

  URL url = new URL("http://haowanba.com");

Parse the URL with Jsoup

  Document home = Jsoup.parse(url, (int) TimeUnit.SECONDS.toMillis(10));
  System.out.println(home);

Output of this run

<!--?xml version="1.0" encoding="utf-8"?-->
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>好玩吧</title>
 </head>
 <body>
  <p> 
    <!-- image (the img element was not preserved) -->
    <br> 
    <a href="cardh.php?action=register">注册</a>.
    <a href="loginh.php?exuid=">登录</a>
    .
    <a href="linkh.php?action=coop">加盟</a>
    <br>
    <!-- image (the img element was not preserved) -->
    <br>
    <a href="http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy61.jsp&_do=game">[轮回]</a>
    <a href="http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy60.jsp&_do=game">[御龙]</a>
    <a href="http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy59.jsp&_do=game">[通灵]</a>
    <br>
    <a href="http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy50.jsp&_do=game">[战魂]</a>
    <a href="http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy29.jsp&_do=game">[天启]</a>
    <a href="http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy5.jsp&_do=game">[永恒]</a>
    <br>
    =================
    <br>
    <a href="bannerh.php?page=0">友链</a>
    |
    <a href="linkh.php?action=intro">简介</a>
    |
    <a href="linkh.php?action=touch">联系</a>
    |
    <a href="linkh.php?action=coop">合作</a>
    <br>
    荣唐科技 10-15 12:20
    <br>
    <a href="http://www.miitbeian.gov.cn">京ICP备17055155号</a>
     | 京ICP证100435号
    <br>
  </p> 
 </body>
</html>

Get content through the standard HTML element names (title, body)

  String title = home.title();
  System.out.println(title);
  Element body = home.body();
  System.out.println(body);
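The same accessors work on a document parsed from a string, which makes it easy to experiment without network access. The following is a minimal sketch; the class name `TitleAndBody` and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleAndBody {

  // Parses a small hard-coded page (hypothetical sample markup)
  public static Document parseSample() {
    String html = "<html><head><title>Sample</title></head>"
        + "<body><p>Hello</p></body></html>";
    return Jsoup.parse(html);
  }

  public static void main(String[] args) {
    Document doc = parseSample();
    System.out.println(doc.title());        // text of the <title> element
    System.out.println(doc.body().text());  // combined text inside <body>

    // A page without a <title> yields an empty string rather than null
    System.out.println(Jsoup.parse("<p>no title</p>").title().isEmpty());
  }
}
```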

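Get all of an element's children or ancestors (neither call returns null; an empty list is returned when there are none)

  Elements bodyList = body.children();
  Elements bodyParents = body.parents();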

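Get individual child elements

  // First child element
  Element first = bodyList.first();
  // Child element by index (combine with size() to loop over all of them)
  Element index = bodyList.get(1);
  // Last child element; may be the same element as the first
  Element last = bodyList.last();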

Note: bodyList contains only one p element, so get(1) throws an IndexOutOfBoundsException.
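One way to avoid the exception is to check size() before indexing, or to iterate with a for-each loop. The sketch below assumes a helper named `childAt`, which is not part of Jsoup and exists only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SafeIndexAccess {

  // Hypothetical helper: returns the child at index, or null when out of range
  public static Element childAt(Elements children, int index) {
    if (index < 0 || index >= children.size()) {
      return null;
    }
    return children.get(index);
  }

  public static void main(String[] args) {
    // Sample markup with a single child, mirroring the one-<p> body above
    Elements children = Jsoup.parse("<body><p>only child</p></body>").body().children();

    System.out.println(children.size());
    System.out.println(childAt(children, 1)); // out of range, so no exception is thrown

    // for-each never indexes past the end
    for (Element child : children) {
      System.out.println(child.tagName());
    }
  }
}
```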

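[Figure: the index-out-of-bounds exception]

Use next() to get the next sibling elements

[Figure: calling next()]

Use prev() to get the previous sibling elements

[Figure: calling prev()]

Note: on the same instance, next() and prev() always return the same result, because each call asks the elements in the list whether they have a neighbouring sibling; it does not advance through the list.

You can chain next() to get the sibling after next, and the one after that. Chaining is convenient, but for a large number of sibling elements a for-each loop is more comfortable.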
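To make the behaviour of next() concrete, here is a small sketch that calls it twice on the same list and then walks all following siblings with a loop instead of chaining. The class `SiblingWalk`, the method `followingTags`, and the sample markup are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SiblingWalk {

  // Collects the tag names of all element siblings that follow start
  public static List<String> followingTags(Element start) {
    List<String> tags = new ArrayList<>();
    for (Element e = start.nextElementSibling(); e != null; e = e.nextElementSibling()) {
      tags.add(e.tagName());
    }
    return tags;
  }

  public static void main(String[] args) {
    // Hypothetical markup with three sibling elements
    Element body = Jsoup.parse("<p>one</p><div>two</div><span>three</span>").body();
    Elements first = body.select("p");

    // next() leaves `first` unchanged, so both calls report the same sibling
    System.out.println(first.next().first().tagName());
    System.out.println(first.next().first().tagName());

    // Walking the siblings in a loop reaches every one of them
    System.out.println(followingTags(first.first()));
  }
}
```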

The equivalent methods on a single element

  bodyList.first().children().first().nextElementSibling();
  bodyList.first().children().first().previousElementSibling();

Convert to a Markdown document (Element is a subclass of Node)

  public static final SimpleDateFormat DATE_NORMAL =
      new SimpleDateFormat("yyyyMMdd", Locale.getDefault());
  public static final SimpleDateFormat DATE_HMS =
      new SimpleDateFormat("HH-mm-ss_SSS", Locale.getDefault());

  public static String parseNode(Node node) {
    StringBuilder contentSb = new StringBuilder();
    String nodeName = node.nodeName();

    switch (nodeName) {
      case "p":
        for (Node child : node.childNodes()) {
          contentSb.append(parseNode(child));
        }
        return contentSb.toString();
      case "a":
        Element a = (Element) node;
        return "[" + a.text() + "]" + "(" + a.absUrl("href") + ")";
      case "img":
        return "![img.png]" + "(" + node.absUrl("src") + ")";
      case "br":
        return "\r\n   \r\n";
      case "#text":
        // Text content: according to the Jsoup source, a text node's nodeName is "#text"
        return node.toString();
    }

    return "";
  }

  public static void createFileByString(String title, String content) throws IOException {
    Date now = new Date();
    // Output directory: ./wml2md/<yyyyMMdd>
    File dir = new File("wml2md", DATE_NORMAL.format(now));
    if (!dir.exists() && !dir.mkdirs()) {
      throw new IOException("Could not create directory: " + dir);
    }
    File file = new File(dir, title + DATE_HMS.format(now) + ".md");
    // FileWriter creates the file if needed; try-with-resources closes it even on failure
    try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
      writer.write(content);
    }
  }
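The file-writing step can also be written with java.nio.file and java.time, which avoids the shared SimpleDateFormat instances (they are not thread-safe) and makes the encoding explicit. This is only a sketch of an alternative, not the author's method; the class name `MarkdownFileWriter`, the `baseDir` parameter, and the choice of UTF-8 are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MarkdownFileWriter {

  // Same patterns as DATE_NORMAL and DATE_HMS above; DateTimeFormatter is thread-safe
  private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");
  private static final DateTimeFormatter TIME = DateTimeFormatter.ofPattern("HH-mm-ss_SSS");

  // Writes content to <baseDir>/<yyyyMMdd>/<title><HH-mm-ss_SSS>.md and returns the path
  public static Path write(Path baseDir, String title, String content) throws IOException {
    LocalDateTime now = LocalDateTime.now();
    Path dir = baseDir.resolve(DAY.format(now));
    Files.createDirectories(dir); // no error if the directory already exists
    Path file = dir.resolve(title + TIME.format(now) + ".md");
    Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    return file;
  }

  public static void main(String[] args) throws IOException {
    // A temporary directory stands in for the wml2md folder in this sketch
    Path out = write(Files.createTempDirectory("wml2md"), "home", "# Hello\n");
    System.out.println(out);
  }
}
```

Note that a page title may contain characters that are not valid in file names, so real input would likely need sanitising before being passed as `title`.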

--- Document content

 ![img.png](https://img.haomeiwen.com/i7426376/fdf73d8c9cd674b4.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
   
 [注册](http://haowanba.com/cardh.php?action=register).[登录](http://haowanba.com/loginh.php?exuid=).[加盟](http://haowanba.com/linkh.php?action=coop)
   
![img.png](https://img.haomeiwen.com/i7426376/91897469a522e76b.gif?imageMogr2/auto-orient/strip)
   
[[轮回]](http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy61.jsp&_do=game)[[御龙]](http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy60.jsp&_do=game)[[通灵]](http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy59.jsp&_do=game)
   
[[战魂]](http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy50.jsp&_do=game)[[天启]](http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy29.jsp&_do=game)[[永恒]](http://haowanba.com/cardh.php?action=register&url=http://dy.haowanba.com/dy/dy5.jsp&_do=game)
   
=================
   
[友链](http://haowanba.com/bannerh.php?page=0)|[简介](http://haowanba.com/linkh.php?action=intro)|[联系](http://haowanba.com/linkh.php?action=touch)|[合作](http://haowanba.com/linkh.php?action=coop)
   
荣唐科技 10-15 12:44
   
[京ICP备17055155号](http://www.miitbeian.gov.cn) | 京ICP证100435号

--- Rendered directly