Using the xxl-crawler Crawler Framework

Author: 西5d | Published: 2020-06-17 19:33


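Introduction to the XxlCrawler Framework

XxlCrawler is a member of the XXL (xuxueli) family of projects and a Java crawler framework that is fairly easy to get started with. Details are available on its project page: https://github.com/xuxueli/xxl-crawler/. According to its own description, its main features include asynchronous and multithreaded crawling, an object-oriented API, dynamic proxy support, and a modular design, which together cover the functionality a typical crawler framework needs.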

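Framework Structure

Crawling generally involves technologies such as Jsoup, HtmlUnit, Selenium, and PhantomJS, and XxlCrawler builds on them as well. As I understand it, the framework has three main parts. The first is the executor, which runs crawl tasks and is built around a thread pool and a task queue. The second is the loader, which fetches page content from a URL. The third is the parser, which parses and processes the page content. When parsing, xxl-crawler can map page content directly onto a target object. This approach is fairly novel and convenient to use, but it may not directly handle some complex pages well.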
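To make the three-part split concrete, here is a minimal sketch in plain Java. It is not xxl-crawler code: the names MiniCrawlerSketch, Loader, and Parser are hypothetical, the loader is a stub with no network access, and the real framework is considerably more elaborate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical three-part crawler skeleton; none of these names come from xxl-crawler. */
class MiniCrawlerSketch {

    /** Loader: turns a URL into raw page content, or null to reject the page. */
    interface Loader { String load(String url); }

    /** Parser: turns raw page content into a result object. */
    interface Parser<T> { T parse(String url, String content); }

    /** Executor: drains the URL queue with a fixed thread pool and collects parsed results. */
    static <T> List<T> run(List<String> seeds, int threads, Loader loader, Parser<T> parser) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(seeds);
        List<T> results = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String url;
                while ((url = queue.poll()) != null) {
                    String content = loader.load(url);
                    if (content == null) {
                        continue; // the loader rejected this page
                    }
                    results.add(parser.parse(url, content));
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        // Stub loader: rejects URLs containing "special", fakes a page otherwise.
        Loader stub = url -> url.contains("special") ? null : "<h1>Title of " + url + "</h1>";
        // Stub parser: strips tags to get the text.
        Parser<String> parser = (url, content) -> content.replaceAll("<[^>]+>", "");
        List<String> out = run(Arrays.asList("a", "special-b", "c"), 2, stub, parser);
        System.out.println(out); // result order depends on thread scheduling
    }
}
```

The point of the split is that each part can be swapped independently, which is what the custom loader and parser later in this post rely on.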

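Using the Framework

Configuring Dependencies

Besides xxl-crawler and the crawler-related dependencies, the Elasticsearch and MongoDB clients are included because the results will ultimately be stored in ES/MongoDB.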

       <dependency>
           <groupId>org.jsoup</groupId>
           <artifactId>jsoup</artifactId>
           <version>1.10.3</version>
       </dependency>
       <dependency>
           <groupId>com.google.code.gson</groupId>
           <artifactId>gson</artifactId>
           <version>2.8.6</version>
       </dependency>
       <dependency>
           <groupId>com.xuxueli</groupId>
           <artifactId>xxl-crawler</artifactId>
           <version>1.2.2</version>
       </dependency>
       <dependency>
           <groupId>net.sourceforge.htmlunit</groupId>
           <artifactId>htmlunit</artifactId>
           <version>2.36.0</version>
           <scope>provided</scope>
       </dependency>
       <dependency>
           <groupId>commons-io</groupId>
           <artifactId>commons-io</artifactId>
           <version>2.6</version>
       </dependency>
       <dependency>
           <groupId>org.mongodb</groupId>
           <artifactId>mongo-java-driver</artifactId>
           <version>3.12.5</version>
       </dependency>
       <dependency>
           <groupId>org.seleniumhq.selenium</groupId>
           <artifactId>selenium-java</artifactId>
           <version>3.141.59</version>
       </dependency>
       <dependency>
           <groupId>org.seleniumhq.selenium</groupId>
           <artifactId>selenium-chrome-driver</artifactId>
           <version>3.141.59</version>
       </dependency>
       <dependency>
           <groupId>com.codeborne</groupId>
           <artifactId>phantomjsdriver</artifactId>
           <version>1.4.4</version>
       </dependency>
       <dependency>
           <groupId>org.elasticsearch.client</groupId>
           <artifactId>elasticsearch-rest-client</artifactId>
           <version>6.7.2</version>
       </dependency>
       <dependency>
           <groupId>redis.clients</groupId>
           <artifactId>jedis</artifactId>
           <version>3.1.0</version>
       </dependency>

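Defining the Crawl Target Object

Define a class to hold the parsed result. The example here crawls news articles from the ifeng site. @PageFieldSelect specifies where an element is located, always as a CSS selector. Note that the selectType parameter defaults to text; it can be set to HTML so that some articles keep their correct indentation and layout. The last two fields (url and timestamp) are extras added to make the data easier to index.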

@Data
@PageSelect(cssQuery = "html")
public class IfengVo {
    @PageFieldSelect(cssQuery = ".caption-b2hvOK2k h1")
    private String title;
    @PageFieldSelect(cssQuery = ".source-2rumsCOj")
    private String source;
    @PageFieldSelect(cssQuery = "#root > div.main-3MFAZ6wn > div > div.content-2ddxT7Uc > div.caption-b2hvOK2k > div > div.source_box-ksRj2PYP > div > span:nth-child(2)")
    private String createTime;
    @PageFieldSelect(cssQuery = ".main_content-fdgs0kGw", selectType = XxlCrawlerConf.SelectType.HTML)
    private String content;

    private String url;

    private long timestamp;

    public boolean isValid() {
        return null != title && source != null && content != null;
    }
}

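Custom Page Loader

This is a simple extension of HtmlUnitPageLoader that filters out some invalid URLs and responses. Crawl results are not very controllable overall, so we only keep the results we care about. How the PageLoader is plugged in is shown later.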

public class CusPageLoader extends HtmlUnitPageLoader {

    @Override
    public Document load(PageRequest pageRequest) {
        // Skip URLs containing "special" without loading them
        if (pageRequest.getUrl().contains("special")) {
            return null;
        }
        Document document = super.load(pageRequest);
        // Guard against a null result from the parent loader (e.g. a failed request)
        if (document == null) {
            return null;
        }
        document.setBaseUri(pageRequest.getUrl());
        // Skip video pages
        if (document.select("head meta[name=og:type]").attr("content").equals("video")) {
            return null;
        }
        // Skip error pages
        if (document.select(".errorImg-1MmxoDF-").size() != 0) {
            return null;
        }
        return document;
    }
}

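Custom Page Parser

This is effectively the result-handling step. The IfengVo object has already been populated by the framework, so once the extra field values are filled in it can be persisted. Simple examples of writing to MongoDB and ES are included here.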

    // Persistence
    public void saveMongo(Object obj) {
        MongoDatabase database = mongoClient.getDatabase("ifeng");
        // One collection per day, named by the current date
        String collectionName = LocalDate.now().toString();
        MongoCollection<org.bson.Document> collection = database.getCollection(collectionName);
        collection.insertOne(org.bson.Document.parse(GsonUtil.toJson(obj)));
    }

    public void saveES(IfengVo obj) {
        try {
            Request request = new Request("POST", "/ifeng/_doc/");
            request.setJsonEntity(GsonUtil.toJson(obj));
            Response response = elasticRestClient.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Constants and clients
    private static final String IFENG_CRAWLER_SETS = "IFENG_CRAWLER_SETS";
    private static final String UA =
            "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1";

    private MongoClient mongoClient;
    private RestClient elasticRestClient;
    private Jedis jedis;

    // Open the connections before each test
    @Before
    public void init() {
        mongoClient = new MongoClient("172.17.0.4", 27017);
        elasticRestClient = RestClient.builder(
                new HttpHost("172.17.0.3", 9200, "http")).build();
        jedis = new Jedis("172.17.0.2", 6379);
    }

    // Close the connections after each test
    @After
    public void after() throws IOException {
        mongoClient.close();
        elasticRestClient.close();
        jedis.close();
    }

How the VO object is created and populated can be seen here:

//com.xuxueli.crawler.thread.CrawlerThread.processPage
       // pagevo class-field info
        Class pageVoClassType = Object.class;

        Type pageVoParserClass = crawler.getRunConf().getPageParser().getClass().getGenericSuperclass();
        if (pageVoParserClass instanceof ParameterizedType) {
            Type[] pageVoClassTypes = ((ParameterizedType)pageVoParserClass).getActualTypeArguments();
            pageVoClassType = (Class) pageVoClassTypes[0];
        }
        //...
                Object pageVo = pageVoClassType.newInstance();

The code then parses the page according to the PageFieldSelect settings and assigns the results to the pageVo object.
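The two steps above (reading the type argument from the parser's generic superclass, then filling annotated fields) can be sketched in isolation with plain reflection. This is only an illustration under stated assumptions: @Select, Parser, and NewsVo are hypothetical stand-ins for @PageFieldSelect, PageParser, and IfengVo, and a simple map lookup replaces the real CSS-selector extraction.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.HashMap;
import java.util.Map;

class VoMappingSketch {

    /** Hypothetical stand-in for @PageFieldSelect: names the key to look up. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Select { String value(); }

    /** Hypothetical stand-in for PageParser<T>; a subclass fixes T. */
    abstract static class Parser<T> { }

    static class NewsVo {
        @Select("h1") String title;
        @Select(".source") String source;
        String url; // not annotated, so it is left untouched
    }

    /** Reads T from the parser's generic superclass, falling back to Object. */
    static Class<?> resolveVoType(Parser<?> parser) {
        Type superType = parser.getClass().getGenericSuperclass();
        if (superType instanceof ParameterizedType) {
            Type arg = ((ParameterizedType) superType).getActualTypeArguments()[0];
            if (arg instanceof Class) {
                return (Class<?>) arg;
            }
        }
        return Object.class;
    }

    /** Instantiates the VO and fills every annotated String field from the lookup table. */
    static Object populate(Class<?> voType, Map<String, String> selected) {
        try {
            Object vo = voType.getDeclaredConstructor().newInstance();
            for (Field field : voType.getDeclaredFields()) {
                Select select = field.getAnnotation(Select.class);
                if (select == null || field.getType() != String.class) {
                    continue;
                }
                field.setAccessible(true);
                field.set(vo, selected.get(select.value()));
            }
            return vo;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + voType.getName(), e);
        }
    }

    public static void main(String[] args) {
        // The anonymous subclass is what makes the type argument visible at runtime.
        Parser<NewsVo> parser = new Parser<NewsVo>() { };
        Map<String, String> selected = new HashMap<>();
        selected.put("h1", "Hello");
        selected.put(".source", "ifeng");
        NewsVo vo = (NewsVo) populate(resolveVoType(parser), selected);
        System.out.println(vo.title + " / " + vo.source + " / " + vo.url);
    }
}
```

This also suggests why the parser is passed as an anonymous subclass such as new PageParser<IfengVo>() { ... }: the type argument is recorded in the subclass's generic superclass, so it can be read back via reflection.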

new PageParser<IfengVo>() {
    @Override
    public void parse(Document html, Element pageVoElement, IfengVo pageVo) {
        if (null == html) {
            return;
        }
        // Finish populating the PageVo object, then persist it
        String pageUrl = html.baseUri();
        System.out.println(pageUrl + "：" + pageVo.getTitle());
        if (pageVo.isValid()) {
            pageVo.setTimestamp(System.currentTimeMillis());
            pageVo.setUrl(pageUrl);
            // saveMongo(pageVo);
            saveES(pageVo);
        }
    }
}

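Creating the Crawler

Now to actually create the crawler: set the relevant parameters, the loader, and the parser. Note the custom RunData added here, implemented on top of Redis, which makes it possible to deploy multiple crawler instances. AllowSpread means the crawler is allowed to spread beyond the seed URLs. One caveat: if the RunData implementation is replaced, setUrls must be called last; otherwise a check that calls getUrlNum during initialization fails and the crawler cannot start.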


    @Test
    public void ifengTest() {
        XxlCrawler crawler = new XxlCrawler.Builder()
                .setWhiteUrlRegexs("https://[a-z]+\\.ifeng\\.com/c/[0-9A-Za-z]{11}$")
                .setThreadCount(1)
                .setFailRetryCount(10)
                .setAllowSpread(true)
                .setPageLoader(new CusPageLoader())
                .setUserAgent(UA)
                .setRunData(new RunData() {
                    @Override
                    public boolean addUrl(String link) {
                        return jedis.sadd(IFENG_CRAWLER_SETS, link) > 0;
                    }

                    @Override
                    public String getUrl() {
                        return jedis.spop(IFENG_CRAWLER_SETS);
                    }

                    @Override
                    public int getUrlNum() {
                        return jedis.scard(IFENG_CRAWLER_SETS).intValue();
                    }
                })
                .setPageParser(new PageParser<IfengVo>() {
                    @Override
                    public void parse(Document html, Element pageVoElement, IfengVo pageVo) {
                        if (null == html) {
                            return;
                        }
                        // Finish populating the PageVo object, then persist it
                        String pageUrl = html.baseUri();
                        System.out.println(pageUrl + "：" + pageVo.getTitle());
                        if (pageVo.isValid()) {
                            pageVo.setTimestamp(System.currentTimeMillis());
                            pageVo.setUrl(pageUrl);
                            // saveMongo(pageVo);
                            saveES(pageVo);
                        }
                    }
                })
                .setUrls("https://i.ifeng.com/?srctag=xzydh10")
                .build();
        crawler.start(true);
        crawler.stop();
    }
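One thing worth noting about the Redis-backed RunData above: with a single set and SPOP, a URL leaves the set once it is handed out, so a later SADD of the same link succeeds again and the page may be crawled more than once. A possible refinement is to track seen URLs separately from pending ones. The sketch below shows the idea in memory only; DedupRunDataSketch is a hypothetical class that mirrors the three methods used above but does not extend the real RunData, and with Redis the two collections would presumably become two keys (an assumption, not something tested here).

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** In-memory stand-in for a RunData-like URL store that never hands out a URL twice. */
class DedupRunDataSketch {
    // Every URL ever accepted; entries are never removed.
    private final Set<String> seen = new HashSet<>();
    // URLs accepted but not yet handed to a crawler thread (FIFO).
    private final Queue<String> pending = new ArrayDeque<>();

    /** Returns true only the first time a given link is offered. */
    public synchronized boolean addUrl(String link) {
        if (link == null || !seen.add(link)) {
            return false;
        }
        return pending.offer(link);
    }

    /** Returns the next pending URL, or null when none is left. */
    public synchronized String getUrl() {
        return pending.poll();
    }

    /** Number of URLs still waiting to be crawled. */
    public synchronized int getUrlNum() {
        return pending.size();
    }

    public static void main(String[] args) {
        DedupRunDataSketch data = new DedupRunDataSketch();
        System.out.println(data.addUrl("https://example.com/a")); // first offer
        System.out.println(data.getUrl());
        System.out.println(data.addUrl("https://example.com/a")); // offered again after being taken
    }
}
```

The methods are synchronized because several crawler threads may call them concurrently; the same concern applies to sharing one Jedis connection if setThreadCount is raised above 1.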

Results

Finally, here is the crawled content. Overall, getting this far was fairly easy.

[Image: News]

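Closing Notes

That's all for this post. A few open questions remain for now:

In my experiments, some pages could not be loaded with their JavaScript fully executed, and switching to a Selenium + PhantomJS PageLoader did not help either. I am new to this area and need to dig deeper.
Besides xxl-crawler, there is also the very popular webmagic framework, likewise implemented in Java, which is worth comparing.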
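On the first open question, one generic approach (not verified against the pages mentioned above) is to poll for a page-specific readiness condition with a timeout instead of relying on a fixed delay. The helper below is a minimal sketch: WaitSketch and waitUntil are hypothetical names, and the condition to check, such as whether the article element is present in the rendered page, would have to be supplied by the loader.

```java
import java.util.function.BooleanSupplier;

class WaitSketch {

    /**
     * Polls `ready` every `intervalMs` milliseconds until it returns true
     * or `timeoutMs` milliseconds have elapsed. Returns whether it became ready.
     */
    static boolean waitUntil(BooleanSupplier ready, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (ready.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 200 ms after start.
        boolean ok = waitUntil(() -> System.currentTimeMillis() - start > 200, 2000, 20);
        System.out.println("ready: " + ok);
    }
}
```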
