Scraping Images and Text with jsoup: A Hands-On Walkthrough


Author: Ktry | Published: 2020-03-21 08:50

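    jsoup scraping in practice

    Target URL: http://wufazhuce.com/

    This is a nice site that posts one inspirational quote and an accompanying image every day. Below is a hands-on walkthrough of scraping the quotes and images from the last 7 days.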

    • Add the Maven dependency

      <dependency>
          <groupId>org.jsoup</groupId>
          <artifactId>jsoup</artifactId>
          <version>1.10.2</version>
      </dependency>
      
    • Fetch the page with jsoup

      Document doc = Jsoup.connect("http://wufazhuce.com/").get();
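
      The one-liner above relies on jsoup's default request settings. As a rough sketch, the same fetch can set an explicit timeout (in milliseconds) and a user agent. The class name PageFetcher, the 10-second timeout, and the user-agent string are assumptions made for illustration; they are not part of the original project.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

class PageFetcher {

    // Assumed values for illustration; tune them for your environment.
    static final int TIMEOUT_MILLIS = 10_000;
    static final String USER_AGENT = "Mozilla/5.0 (compatible; jsoup-demo)";

    /** Builds the request without sending it, so it can be inspected or reused. */
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(TIMEOUT_MILLIS);
    }

    /** Sends a GET request and parses the response body into a Document. */
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetch("http://wufazhuce.com/");
        System.out.println(doc.title());
    }
}
```

      Splitting prepare from fetch keeps the network call in one place, which makes the configuration easy to check without touching the site.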
      
    • Inspect the page source. Viewing the source, I found that the content we need is concentrated in this block

      <div id="carousel-one" class="carousel slide">
      
                      <div class="carousel-inner">
                                          <div class="item active">
                              <a href="http://wufazhuce.com/one/2763"><img class="fp-one-imagen" src="http://image.wufazhuce.com/FtwQJesJhVV0Ho_iaanwPF4QnDPw" alt="" /></a>                        <div class="fp-one-imagen-footer">
                                  插画                        </div>
                              <div class="fp-one-cita-wrapper">
                                  <div class="fp-one-titulo-pubdate">
                                      <p class="titulo">VOL.2723</p>
                                      <p class="dom">21</p>
                                      <p class="may">Mar 2020</p>
                                  </div>
                                  <div class="fp-one-cita">
                                  <a href="http://wufazhuce.com/one/2763">我并不期待人生可以过得很顺利,但我希望碰到人生难关的时候,自己可以是它的对手。</a>                            </div>
                                  <div class="clearfix"></div>
                              </div>
                          </div>
                                          <div class="item">
                              <a href="http://wufazhuce.com/one/2762"><img class="fp-one-imagen" src="http://image.wufazhuce.com/FobC3u_uHKxmnc8gf_kOc6loL-gv" alt="" /></a>                        <div class="fp-one-imagen-footer">
                                  摄影                        </div>
                              <div class="fp-one-cita-wrapper">
                                  <div class="fp-one-titulo-pubdate">
                                      <p class="titulo">VOL.2722</p>
                                      <p class="dom">20</p>
                                      <p class="may">Mar 2020</p>
                                  </div>
                                  <div class="fp-one-cita">
                                  <a href="http://wufazhuce.com/one/2762">当一个人不能拥有的时候,他唯一能做的便是不要忘记。</a>                            </div>
                                  <div class="clearfix"></div>
                              </div>
                          </div>
                                          <div class="item">
                              <a href="http://wufazhuce.com/one/2761"><img class="fp-one-imagen" src="http://image.wufazhuce.com/Fp-WZpBGvXVtnDTpIH3IuQDtnAQN" alt="" /></a>                        <div class="fp-one-imagen-footer">
                                  摄影                        </div>
                              <div class="fp-one-cita-wrapper">
                                  <div class="fp-one-titulo-pubdate">
                                      <p class="titulo">VOL.2721</p>
                                      <p class="dom">19</p>
                                      <p class="may">Mar 2020</p>
                                  </div>
                                  <div class="fp-one-cita">
                                  <a href="http://wufazhuce.com/one/2761">改变心态只需一分钟,而这一分钟却能改变一整天。</a>                            </div>
                                  <div class="clearfix"></div>
                              </div>
                          </div>
                                          <div class="item">
                              <a href="http://wufazhuce.com/one/2760"><img class="fp-one-imagen" src="http://image.wufazhuce.com/Fm-faU1mWIBGdREYoq_SxbueMx8q" alt="" /></a>                        <div class="fp-one-imagen-footer">
                                  摄影                        </div>
                              <div class="fp-one-cita-wrapper">
                                  <div class="fp-one-titulo-pubdate">
                                      <p class="titulo">VOL.2720</p>
                                      <p class="dom">18</p>
                                      <p class="may">Mar 2020</p>
                                  </div>
                                  <div class="fp-one-cita">
                                  <a href="http://wufazhuce.com/one/2760">我们每个人都是宇宙的囚徒。</a>                            </div>
                                  <div class="clearfix"></div>
                              </div>
                          </div>
                                          <div class="item">
                              <a href="http://wufazhuce.com/one/2759"><img class="fp-one-imagen" src="http://image.wufazhuce.com/Frjvh22RpfARajcvPKinwhwsPHOM" alt="" /></a>                        <div class="fp-one-imagen-footer">
                                  摄影                        </div>
                              <div class="fp-one-cita-wrapper">
                                  <div class="fp-one-titulo-pubdate">
                                      <p class="titulo">VOL.2719</p>
                                      <p class="dom">17</p>
                                      <p class="may">Mar 2020</p>
                                  </div>
                                  <div class="fp-one-cita">
                                  <a href="http://wufazhuce.com/one/2759">对世间的一切事物报以虚无的态度其实是轻松的,真正困难的是如何勇敢地介入其中。​​​</a>                            </div>
                                  <div class="clearfix"></div>
                              </div>
                          </div>
                                          <div class="item">
                              <a href="http://wufazhuce.com/one/2758"><img class="fp-one-imagen" src="http://image.wufazhuce.com/Fnpd4sv1WSdFfTZ7pFO-I9fD2610" alt="" /></a>                        <div class="fp-one-imagen-footer">
                                  摄影                        </div>
                              <div class="fp-one-cita-wrapper">
                                  <div class="fp-one-titulo-pubdate">
                                      <p class="titulo">VOL.2718</p>
                                      <p class="dom">16</p>
                                      <p class="may">Mar 2020</p>
                                  </div>
                                  <div class="fp-one-cita">
                                  <a href="http://wufazhuce.com/one/2758">有人总说:已经晚了。实际上,现在就是最好的时光。对于一个真正有所追求的人来说,生命的每个时期都是年轻的、及时的。</a>                            </div>
                                  <div class="clearfix"></div>
                              </div>
                          </div>
                                          <div class="item">
                              <a href="http://wufazhuce.com/one/2757"><img class="fp-one-imagen" src="http://image.wufazhuce.com/Fj8isdfGOFm9RQULX4p41wPsG9JW" alt="" /></a>                        <div class="fp-one-imagen-footer">
                                  摄影                        </div>
                              <div class="fp-one-cita-wrapper">
                                  <div class="fp-one-titulo-pubdate">
                                      <p class="titulo">VOL.2717</p>
                                      <p class="dom">15</p>
                                      <p class="may">Mar 2020</p>
                                  </div>
                                  <div class="fp-one-cita">
                                  <a href="http://wufazhuce.com/one/2757">维持日常生活,就是抗压的最好药方。</a>                            </div>
                                  <div class="clearfix"></div>
                              </div>
                          </div>
                                      </div>
      

      So, first grab this block

      Element body = doc.getElementById("carousel-one");
      
    • Then extract the images, text, and dates, and collect them into a list in a loop

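      // Each call returns the matching elements in document order. Every carousel
      // item contains exactly one of each, so index i lines up across all four lists.
      Elements imgs = body.getElementsByClass("fp-one-imagen");
      Elements texts = body.getElementsByClass("fp-one-cita");
      Elements times1 = body.getElementsByClass("dom");  // day of month, e.g. "21"
      Elements times2 = body.getElementsByClass("may");  // month and year, e.g. "Mar 2020"

      ArrayList<LinkedHashMap<String, String>> arrayList = new ArrayList<>();
      for (int i = 0; i < imgs.size(); i++) {
          // LinkedHashMap keeps the keys in insertion order: img, text, time
          LinkedHashMap<String, String> map = new LinkedHashMap<>();
          map.put("img", imgs.get(i).attr("src"));
          map.put("text", texts.get(i).text());
          map.put("time", times1.get(i).text() + " " + times2.get(i).text());
          arrayList.add(map);
      }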
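
      The loop above walks four parallel lists by index, which only works while every carousel item contains exactly one image, one quote, and one date. A possible alternative, sketched here under that caveat, is to iterate over each .item and query inside it with CSS selectors. The class name CarouselParser and the trimmed HTML in main are hypothetical and only serve to make the sketch runnable offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CarouselParser {

    /** Extracts one map per carousel item; returns an empty list if the carousel is missing. */
    static List<Map<String, String>> parse(Document doc) {
        List<Map<String, String>> result = new ArrayList<>();
        // select() returns an empty Elements (never null) when nothing matches
        for (Element item : doc.select("#carousel-one .item")) {
            Map<String, String> map = new LinkedHashMap<>();
            // Elements.attr() and Elements.text() return "" when the inner selection is empty
            map.put("img", item.select("img.fp-one-imagen").attr("src"));
            map.put("text", item.select(".fp-one-cita").text());
            map.put("time", item.select(".dom").text() + " " + item.select(".may").text());
            result.add(map);
        }
        return result;
    }

    public static void main(String[] args) {
        // Trimmed-down copy of one carousel item from the page source shown earlier
        String html = "<div id=\"carousel-one\"><div class=\"carousel-inner\">"
                + "<div class=\"item active\">"
                + "<img class=\"fp-one-imagen\" src=\"http://image.wufazhuce.com/FtwQJesJhVV0Ho_iaanwPF4QnDPw\">"
                + "<p class=\"dom\">21</p><p class=\"may\">Mar 2020</p>"
                + "<div class=\"fp-one-cita\"><a href=\"http://wufazhuce.com/one/2763\">quote text</a></div>"
                + "</div></div></div>";
        System.out.println(parse(Jsoup.parse(html)));
    }
}
```

      Because each field is looked up inside its own item, a missing element yields an empty string for that entry instead of shifting the indexes of every later entry.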
      
    • Complete endpoint example

      import com.example.api.util.Res;
      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.nodes.Element;
      import org.jsoup.select.Elements;
      import org.springframework.stereotype.Controller;
      import org.springframework.web.bind.annotation.RequestMapping;
      import org.springframework.web.bind.annotation.ResponseBody;
      
      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.LinkedHashMap;
      
      /**
       * @ClassName: OneGe
       * @Author: Ktry
       * @Date: 2020/3/19 23:29
       */
      @Controller
      public class OneGe {

          /**
           * @return the quotes and images from the last 7 days
           */
          @ResponseBody
          @RequestMapping("imgText")
          public Res imgText() {
              ArrayList<LinkedHashMap<String, String>> arrayList = new ArrayList<>();

              Document doc;
              try {
                  doc = Jsoup.connect("http://wufazhuce.com/").get();
              } catch (IOException e) {
                  // Network or HTTP failure: return an error code instead of throwing
                  return new Res("-1", "access error");
              }
              Element body = doc.getElementById("carousel-one");

              // Matching elements come back in document order, so index i
              // refers to the same carousel item in all four lists.
              Elements imgs = body.getElementsByClass("fp-one-imagen");
              Elements texts = body.getElementsByClass("fp-one-cita");
              Elements times1 = body.getElementsByClass("dom");
              Elements times2 = body.getElementsByClass("may");

              for (int i = 0; i < imgs.size(); i++) {
                  LinkedHashMap<String, String> map = new LinkedHashMap<>();
                  map.put("img", imgs.get(i).attr("src"));
                  map.put("text", texts.get(i).text());
                  map.put("time", times1.get(i).text() + " " + times2.get(i).text());
                  arrayList.add(map);
              }

              return new Res("200", arrayList);
          }
      }
      

      Result wrapper class

      import lombok.AllArgsConstructor;
      import lombok.Data;
      import lombok.NoArgsConstructor;
      import lombok.ToString;
      
      /**
       * @ClassName: Res
       * @Author: Ktry
       * @Date: 2020/3/19 23:57
       */
      @Data
      @AllArgsConstructor
      @NoArgsConstructor
      @ToString
      public class Res{
          private String code;
          private Object data;
      }
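
      For readers who do not use Lombok, here is a hand-written sketch of roughly what the annotations above provide. The class name PlainRes is hypothetical. It covers only the two constructors, the accessors, and toString; Lombok's @Data also generates equals and hashCode, which are left out here for brevity.

```java
class PlainRes {
    private String code;
    private Object data;

    // Counterpart of @NoArgsConstructor
    public PlainRes() {
    }

    // Counterpart of @AllArgsConstructor
    public PlainRes(String code, Object data) {
        this.code = code;
        this.data = data;
    }

    // Getters and setters, which @Data would generate
    public String getCode() {
        return code;
    }

    public void setCode(String code) {
        this.code = code;
    }

    public Object getData() {
        return data;
    }

    public void setData(Object data) {
        this.data = data;
    }

    // Follows the ClassName(field=value, ...) layout that Lombok uses for toString
    @Override
    public String toString() {
        return "PlainRes(code=" + code + ", data=" + data + ")";
    }

    public static void main(String[] args) {
        System.out.println(new PlainRes("200", "ok"));
    }
}
```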
      

      Output

      {
        "code": "200",
        "data": [
          {
            "img": "http://image.wufazhuce.com/FtwQJesJhVV0Ho_iaanwPF4QnDPw",
            "text": "我并不期待人生可以过得很顺利,但我希望碰到人生难关的时候,自己可以是它的对手。",
            "time": "21 Mar 2020"
          },
          {
            "img": "http://image.wufazhuce.com/FobC3u_uHKxmnc8gf_kOc6loL-gv",
            "text": "当一个人不能拥有的时候,他唯一能做的便是不要忘记。",
            "time": "20 Mar 2020"
          },
          {
            "img": "http://image.wufazhuce.com/Fp-WZpBGvXVtnDTpIH3IuQDtnAQN",
            "text": "改变心态只需一分钟,而这一分钟却能改变一整天。",
            "time": "19 Mar 2020"
          },
          {
            "img": "http://image.wufazhuce.com/Fm-faU1mWIBGdREYoq_SxbueMx8q",
            "text": "我们每个人都是宇宙的囚徒。",
            "time": "18 Mar 2020"
          },
          {
            "img": "http://image.wufazhuce.com/Frjvh22RpfARajcvPKinwhwsPHOM",
            "text": "对世间的一切事物报以虚无的态度其实是轻松的,真正困难的是如何勇敢地介入其中。​​​",
            "time": "17 Mar 2020"
          },
          {
            "img": "http://image.wufazhuce.com/Fnpd4sv1WSdFfTZ7pFO-I9fD2610",
            "text": "有人总说:已经晚了。实际上,现在就是最好的时光。对于一个真正有所追求的人来说,生命的每个时期都是年轻的、及时的。",
            "time": "16 Mar 2020"
          },
          {
            "img": "http://image.wufazhuce.com/Fj8isdfGOFm9RQULX4p41wPsG9JW",
            "text": "维持日常生活,就是抗压的最好药方。",
            "time": "15 Mar 2020"
          }
        ]
      }
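
      The time field in this output is free text such as "21 Mar 2020". If a real date is needed downstream, one option is to parse it with java.time. This is a small sketch, not part of the original endpoint; the class name QuoteDates is hypothetical, and the pattern assumes the site keeps using English month abbreviations.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class QuoteDates {

    // Matches strings such as "21 Mar 2020"; Locale.ENGLISH makes the month
    // names independent of the JVM's default locale.
    private static final DateTimeFormatter SITE_FORMAT =
            DateTimeFormatter.ofPattern("d MMM yyyy", Locale.ENGLISH);

    /** Parses the scraped "day month year" text into a LocalDate. */
    static LocalDate parse(String time) {
        return LocalDate.parse(time, SITE_FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parse("21 Mar 2020")); // prints 2020-03-21
    }
}
```

      LocalDate.parse throws a DateTimeParseException on text that does not match the pattern, so a caller may want to catch that and fall back to the raw string.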
      
