webmagic+springboot+mybatis-plus


By 飞鹰雪玉 | Published 2021-04-25 15:56

    Background

    A recent project required a web crawler: fetch article pages from various sites and collect metrics such as like, repost, and comment counts. Never having used Python, I searched OSChina for existing frameworks and found quite a few, in several languages; the Java options include Spiderman2, WebCollector, and webmagic. I chose webmagic.
    Unfortunately, in the end I could not crawl Toutiao, Weibo, and similar sites; I do not know whether the framework is simply too old or something else is to blame. Only blog sites such as Jianshu worked, so this post records what I did for future reference.
    First I downloaded the webmagic source and got the project building and running. GitHub repository: https://github.com/code4craft/webmagic
    Documentation (Chinese): http://webmagic.io/docs/zh/

    Overview

    The code lives mainly in two modules, webmagic-core and webmagic-extension: a core part and an extension part.
    The core (webmagic-core) is a lean, modular crawler implementation; the extension adds convenient, practical features. WebMagic's architecture is modeled on Scrapy, aiming for maximal modularity while reflecting how a crawler actually works. The core exposes a very simple, flexible API, so you can write a crawler with essentially no change to your usual development style.
    The extension (webmagic-extension) provides conveniences such as annotation-based crawlers, plus a set of commonly used built-in components.

    Overall architecture

    WebMagic consists of four components, Downloader, PageProcessor, Scheduler, and Pipeline, organized by a Spider. They correspond to the download, processing, URL management, and persistence stages of the crawl life cycle. The design follows Scrapy, but the implementation is more Java-flavored.
    The Spider ties the components together so they can interact and execute as a flow; think of it as a big container, and the core of WebMagic's logic.
    The overall architecture diagram:


    [WebMagic architecture diagram]
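The wiring in the diagram can be sketched as follows (webmagic 0.7.x class names; the seed URL and XPath are placeholders, not from this project; the three explicit setters spell out what the defaults already do):

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.QueueScheduler;

public class WiringSketch implements PageProcessor {

    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // PageProcessor: extract data (a real crawler would also discover new links)
        page.putField("title", page.getHtml().xpath("//title/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new WiringSketch())                  // Spider: the container
                .setDownloader(new HttpClientDownloader()) // Downloader: fetch pages
                .setScheduler(new QueueScheduler())        // Scheduler: in-memory queue + dedup
                .addPipeline(new ConsolePipeline())        // Pipeline: print results
                .addUrl("https://example.com/")            // placeholder seed URL
                .thread(5)
                .run();
    }
}
```

Since HttpClientDownloader, QueueScheduler, and ConsolePipeline are the defaults anyway, the setters here are purely illustrative of where each component plugs in.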

    WebMagic's four components

    1. Downloader

    The Downloader fetches pages from the internet for subsequent processing. By default WebMagic uses Apache HttpClient as the download tool.

    2. PageProcessor

    The PageProcessor parses pages, extracts useful data, and discovers new links. WebMagic uses Jsoup as its HTML parser and, on top of it, provides Xsoup, a tool for evaluating XPath expressions.

    Of the four components, the PageProcessor is different for every site and every page; it is the part the user must customize.

    3. Scheduler

    The Scheduler manages the URLs waiting to be fetched and handles deduplication. By default WebMagic manages URLs with a JDK in-memory queue and deduplicates with a set; Redis-based distributed management is also supported.

    Unless the project has special distributed requirements, there is no need to write a custom Scheduler.
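Swapping schedulers is a single call on the Spider. A sketch (RedisScheduler comes from webmagic-extension; the host is a placeholder and requires a reachable Redis instance):

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.scheduler.RedisScheduler;

public class SchedulerChoice {

    // Default behavior: in-memory queue with set-based deduplication.
    static Spider local(PageProcessor processor) {
        return Spider.create(processor).setScheduler(new QueueScheduler());
    }

    // Distributed crawl: the URL queue and dedup set are shared through Redis,
    // so several crawler processes can cooperate on one job.
    static Spider distributed(PageProcessor processor) {
        return Spider.create(processor).setScheduler(new RedisScheduler("127.0.0.1"));
    }
}
```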

    4. Pipeline

    The Pipeline processes extraction results: computation, persistence to files or a database, and so on. WebMagic ships with two built-in result handlers, "print to console" and "save to file".

    The Pipeline defines how results are saved; to save to a specific database, write a matching Pipeline. One Pipeline per class of requirement is usually enough.

    Objects that carry data through the flow

    1. Request

    A Request is a wrapper around a URL; one Request corresponds to one URL.

    It is the carrier through which the PageProcessor and the Downloader interact, and the only way the PageProcessor controls the Downloader.

    Besides the URL itself, a Request carries a key-value field called extra. You can store custom attributes in extra and read them elsewhere, for example to pass along information from a previous page.
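A minimal sketch of extra in use (the URL and the listTitle key are illustrative, not from this project):

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;

public class ExtraSketch {

    // On a list page: attach data to the follow-up request before queueing it.
    static void enqueueDetailPage(Page listPage) {
        Request detail = new Request("https://example.com/detail/1"); // placeholder URL
        detail.putExtra("listTitle", listPage.getHtml().xpath("//title/text()").toString());
        listPage.addTargetRequest(detail);
    }

    // On the detail page: read the attached data back from the request.
    static String readListTitle(Page detailPage) {
        return (String) detailPage.getRequest().getExtra("listTitle");
    }
}
```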

    2. Page

    A Page represents a single page fetched by the Downloader; it may be HTML, JSON, or some other text format.

    The Page is the core object of WebMagic's extraction phase; it offers methods for extraction, saving results, and more.

    3. ResultItems

    ResultItems is essentially a Map: it holds the results produced by the PageProcessor for the Pipeline to consume. Its API is Map-like; notably, it carries a skip flag, and when skip is true the result should not be processed by any Pipeline.
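A sketch of the skip flag (the XPath is a placeholder); Page.setSkip(true) forwards to the underlying ResultItems:

```java
import us.codecraft.webmagic.Page;

public class SkipSketch {

    // If a page yields nothing useful, mark it so that no Pipeline processes it.
    static void extract(Page page) {
        String title = page.getHtml().xpath("//h1/text()").toString();
        if (title == null) {
            page.setSkip(true);            // ResultItems.skip = true -> pipelines skip this page
        } else {
            page.putField("title", title); // stored into ResultItems for the pipelines
        }
    }
}
```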

    Usage

    pom dependencies

        <!-- webmagic dependencies -->
            <dependency>
                <groupId>us.codecraft</groupId>
                <artifactId>webmagic-core</artifactId>
                <version>0.7.3</version>
            </dependency>
            <dependency>
                <groupId>us.codecraft</groupId>
                <artifactId>webmagic-extension</artifactId>
                <version>0.7.3</version>
            </dependency>
    

    Project structure

    [project structure screenshot]

    Key points

    1. Define a JianshuPageProcessor implementing webmagic's PageProcessor and override its process(Page page) method, filling in the details of the site you want to crawl. The mybatis-plus DAO interface is used inside this class; since it is not a service implementation class, the bean is looked up with Objects.requireNonNull(ContextLoader.getCurrentWebApplicationContext()).getBean(SpiderJianshuDao.class), and the crawled data is inserted into the database.
    2. The latest webmagic release on both Gitee and GitHub is 0.7.3, which still contains an unfixed bug. The author (黄sir) has long said 0.7.4 will fix it, but it has not shipped yet; he did, however, post a workaround: https://github.com/code4craft/webmagic/issues/701
      Following that thread, I implemented my own HttpClientGenerator, adding new DefaultHostnameVerifier() inside its private SSLConnectionSocketFactory buildSSLConnectionSocketFactory() method.
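Note that ContextLoader.getCurrentWebApplicationContext() can return null when no web ApplicationContext has been published. A common alternative, sketched here (the SpringContextHolder class name is mine, not from this project), is a static holder that Spring fills in at startup:

```java
import org.springframework.beans.BeansException;
import org.springframework.context.ApplicationContext;
import org.springframework.context.ApplicationContextAware;
import org.springframework.stereotype.Component;

// Hypothetical helper: Spring calls setApplicationContext during startup,
// after which even objects created with `new` can look beans up statically.
@Component
public class SpringContextHolder implements ApplicationContextAware {

    private static ApplicationContext context;

    @Override
    public void setApplicationContext(ApplicationContext ctx) throws BeansException {
        context = ctx;
    }

    public static <T> T getBean(Class<T> clazz) {
        return context.getBean(clazz);
    }
}
```

With this in place, the processor could call SpringContextHolder.getBean(SpiderJianshuDao.class) regardless of how it was constructed.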

    Code

    HttpClientGenerator

    package com.tongxing.spider.downloader;
    
    import org.apache.http.HttpException;
    import org.apache.http.HttpRequest;
    import org.apache.http.HttpRequestInterceptor;
    import org.apache.http.client.CookieStore;
    import org.apache.http.config.Registry;
    import org.apache.http.config.RegistryBuilder;
    import org.apache.http.config.SocketConfig;
    import org.apache.http.conn.socket.ConnectionSocketFactory;
    import org.apache.http.conn.socket.PlainConnectionSocketFactory;
    import org.apache.http.conn.ssl.DefaultHostnameVerifier;
    import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
    import org.apache.http.impl.client.*;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.apache.http.impl.cookie.BasicClientCookie;
    import org.apache.http.protocol.HttpContext;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.downloader.CustomRedirectStrategy;
    
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.TrustManager;
    import javax.net.ssl.X509TrustManager;
    import java.io.IOException;
    import java.security.KeyManagementException;
    import java.security.NoSuchAlgorithmException;
    import java.security.cert.CertificateException;
    import java.security.cert.X509Certificate;
    import java.util.Map;
    
    /**
     * @author code4crafter@gmail.com <br>
     * @since 0.4.0
     */
    public class HttpClientGenerator {
        
        private transient Logger logger = LoggerFactory.getLogger(getClass());
        
        private PoolingHttpClientConnectionManager connectionManager;
    
        public HttpClientGenerator() {
            Registry<ConnectionSocketFactory> reg = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register("http", PlainConnectionSocketFactory.INSTANCE)
                    .register("https", buildSSLConnectionSocketFactory())
                    .build();
            connectionManager = new PoolingHttpClientConnectionManager(reg);
            connectionManager.setDefaultMaxPerRoute(100);
        }
    
        private SSLConnectionSocketFactory buildSSLConnectionSocketFactory() {
            try {
                return new SSLConnectionSocketFactory(createIgnoreVerifySSL(), new String[]{"SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2"},
                        null,
                        new DefaultHostnameVerifier()); // use the default hostname verifier (fix from webmagic issue #701)
            } catch (KeyManagementException e) {
                logger.error("ssl connection fail", e);
            } catch (NoSuchAlgorithmException e) {
                logger.error("ssl connection fail", e);
            }
            return SSLConnectionSocketFactory.getSocketFactory();
        }
    
        private SSLContext createIgnoreVerifySSL() throws NoSuchAlgorithmException, KeyManagementException {
            // An X509TrustManager that bypasses certificate validation; the empty methods are intentional
            X509TrustManager trustManager = new X509TrustManager() {
    
                @Override
                public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                }
    
                @Override
                public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                }
    
                @Override
                public X509Certificate[] getAcceptedIssuers() {
                    return null;
                }
                
            };
            
            SSLContext sc = SSLContext.getInstance("SSLv3");
            sc.init(null, new TrustManager[] { trustManager }, null);
            return sc;
        }
        
        public HttpClientGenerator setPoolSize(int poolSize) {
            connectionManager.setMaxTotal(poolSize);
            return this;
        }
    
        public CloseableHttpClient getClient(Site site) {
            return generateClient(site);
        }
    
        private CloseableHttpClient generateClient(Site site) {
            HttpClientBuilder httpClientBuilder = HttpClients.custom();
            
            httpClientBuilder.setConnectionManager(connectionManager);
            if (site.getUserAgent() != null) {
                httpClientBuilder.setUserAgent(site.getUserAgent());
            } else {
                httpClientBuilder.setUserAgent("");
            }
            if (site.isUseGzip()) {
                httpClientBuilder.addInterceptorFirst(new HttpRequestInterceptor() {
    
                    public void process(
                            final HttpRequest request,
                            final HttpContext context) throws HttpException, IOException {
                        if (!request.containsHeader("Accept-Encoding")) {
                            request.addHeader("Accept-Encoding", "gzip");
                        }
                    }
                });
            }
        // fix the post/redirect/post 302 redirect problem
            httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());
    
            SocketConfig.Builder socketConfigBuilder = SocketConfig.custom();
            socketConfigBuilder.setSoKeepAlive(true).setTcpNoDelay(true);
            socketConfigBuilder.setSoTimeout(site.getTimeOut());
            SocketConfig socketConfig = socketConfigBuilder.build();
            httpClientBuilder.setDefaultSocketConfig(socketConfig);
            connectionManager.setDefaultSocketConfig(socketConfig);
            httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));
            generateCookie(httpClientBuilder, site);
            return httpClientBuilder.build();
        }
    
        private void generateCookie(HttpClientBuilder httpClientBuilder, Site site) {
            if (site.isDisableCookieManagement()) {
                httpClientBuilder.disableCookieManagement();
                return;
            }
            CookieStore cookieStore = new BasicCookieStore();
            for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
                BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
                cookie.setDomain(site.getDomain());
                cookieStore.addCookie(cookie);
            }
            for (Map.Entry<String, Map<String, String>> domainEntry : site.getAllCookies().entrySet()) {
                for (Map.Entry<String, String> cookieEntry : domainEntry.getValue().entrySet()) {
                    BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
                    cookie.setDomain(domainEntry.getKey());
                    cookieStore.addCookie(cookie);
                }
            }
            httpClientBuilder.setDefaultCookieStore(cookieStore);
        }
    
    }
    

    HttpClientDownloader

    package com.tongxing.spider.downloader;
    
    import org.apache.commons.io.IOUtils;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.util.EntityUtils;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Request;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.downloader.AbstractDownloader;
    import us.codecraft.webmagic.downloader.HttpClientRequestContext;
    import us.codecraft.webmagic.downloader.HttpUriRequestConverter;
    import us.codecraft.webmagic.proxy.Proxy;
    import us.codecraft.webmagic.proxy.ProxyProvider;
    import us.codecraft.webmagic.selector.PlainText;
    import us.codecraft.webmagic.utils.CharsetUtils;
    import us.codecraft.webmagic.utils.HttpClientUtils;
    
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.util.HashMap;
    import java.util.Map;
    
    
    /**
     * The http downloader based on HttpClient.
     *
     * @author code4crafter@gmail.com <br>
     * @since 0.1.0
     */
    public class HttpClientDownloader extends AbstractDownloader {
    
        private Logger logger = LoggerFactory.getLogger(getClass());
    
        private final Map<String, CloseableHttpClient> httpClients = new HashMap<String, CloseableHttpClient>();
    
        private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
    
        private HttpUriRequestConverter httpUriRequestConverter = new HttpUriRequestConverter();
        
        private ProxyProvider proxyProvider;
    
        private boolean responseHeader = true;
    
        public void setHttpUriRequestConverter(HttpUriRequestConverter httpUriRequestConverter) {
            this.httpUriRequestConverter = httpUriRequestConverter;
        }
    
        public void setProxyProvider(ProxyProvider proxyProvider) {
            this.proxyProvider = proxyProvider;
        }
    
        private CloseableHttpClient getHttpClient(Site site) {
            if (site == null) {
                return httpClientGenerator.getClient(null);
            }
            String domain = site.getDomain();
            CloseableHttpClient httpClient = httpClients.get(domain);
            if (httpClient == null) {
                synchronized (this) {
                    httpClient = httpClients.get(domain);
                    if (httpClient == null) {
                        httpClient = httpClientGenerator.getClient(site);
                        httpClients.put(domain, httpClient);
                    }
                }
            }
            return httpClient;
        }
    
        @Override
        public Page download(Request request, Task task) {
            if (task == null || task.getSite() == null) {
                throw new NullPointerException("task or site can not be null");
            }
            CloseableHttpResponse httpResponse = null;
            CloseableHttpClient httpClient = getHttpClient(task.getSite());
            Proxy proxy = proxyProvider != null ? proxyProvider.getProxy(task) : null;
            HttpClientRequestContext requestContext = httpUriRequestConverter.convert(request, task.getSite(), proxy);
            Page page = Page.fail();
            try {
                httpResponse = httpClient.execute(requestContext.getHttpUriRequest(), requestContext.getHttpClientContext());
                page = handleResponse(request, request.getCharset() != null ? request.getCharset() : task.getSite().getCharset(), httpResponse, task);
                onSuccess(request);
                logger.info("downloading page success {}", request.getUrl());
                return page;
            } catch (IOException e) {
                logger.warn("download page {} error", request.getUrl(), e);
                onError(request);
                return page;
            } finally {
                if (httpResponse != null) {
                    //ensure the connection is released back to pool
                    EntityUtils.consumeQuietly(httpResponse.getEntity());
                }
                if (proxyProvider != null && proxy != null) {
                    proxyProvider.returnProxy(proxy, page, task);
                }
            }
        }
    
        @Override
        public void setThread(int thread) {
            httpClientGenerator.setPoolSize(thread);
        }
    
        protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
            byte[] bytes = IOUtils.toByteArray(httpResponse.getEntity().getContent());
            String contentType = httpResponse.getEntity().getContentType() == null ? "" : httpResponse.getEntity().getContentType().getValue();
            Page page = new Page();
            page.setBytes(bytes);
            if (!request.isBinaryContent()){
                if (charset == null) {
                    charset = getHtmlCharset(contentType, bytes);
                }
                page.setCharset(charset);
                page.setRawText(new String(bytes, charset));
            }
            page.setUrl(new PlainText(request.getUrl()));
            page.setRequest(request);
            page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
            page.setDownloadSuccess(true);
            if (responseHeader) {
                page.setHeaders(HttpClientUtils.convertHeaders(httpResponse.getAllHeaders()));
            }
            return page;
        }
    
        private String getHtmlCharset(String contentType, byte[] contentBytes) throws IOException {
            String charset = CharsetUtils.detectCharset(contentType, contentBytes);
            if (charset == null) {
                charset = Charset.defaultCharset().name();
                logger.warn("Charset autodetect failed, use {} as charset. Please specify charset in Site.setCharset()", Charset.defaultCharset());
            }
            return charset;
        }
    }
    

    JianshuPageProcessor

    package com.tongxing.spider.processer;
    
    import com.tongxing.spider.dao.SpiderJianshuDao;
    import com.tongxing.spider.entity.SpiderJianshu;
    import org.springframework.stereotype.Component;
    import org.springframework.web.context.ContextLoader;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;
    
    import javax.annotation.Resource;
    import java.util.Objects;
    
    /**
     * PageProcessor that crawls Jianshu article pages
     *
     * @author 刘鹏
     * @date 2021/4/16 15:29
     */
    @Component
    public class JianshuPageProcessor implements PageProcessor {
    
        @Resource
        private SpiderJianshuDao spiderJianshuDao;
    
        private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);
    
        /**
         * process the page, extract urls to fetch, extract the data and store
         *
         * @param page page
         */
        @Override
        public void process(Page page) {
    
        // The processor is created with `new` (see JianshuService), so Spring field
        // injection does not happen; fetch the bean from the application context instead.
        spiderJianshuDao = Objects.requireNonNull(ContextLoader.getCurrentWebApplicationContext()).getBean(SpiderJianshuDao.class);
            SpiderJianshu spiderJianshu = new SpiderJianshu();
            spiderJianshu.setTitle(page.getHtml().xpath("//h1[@class='_1RuRku']/text()").toString());
            page.putField("jianshu", spiderJianshu);
            int x = spiderJianshuDao.insert(spiderJianshu);
            System.out.println("==========x:" + x);
        }
    
        /**
         * get the site settings
         *
         * @return site
         * @see Site
         */
        @Override
        public Site getSite() {
            return site;
        }
    }
    

    JianshuService

    package com.tongxing.spider.service;
    
    import com.tongxing.spider.downloader.HttpClientDownloader;
    import com.tongxing.spider.pipeline.SpiderPipeLine;
    import com.tongxing.spider.processer.JianshuPageProcessor;
    import org.springframework.stereotype.Service;
    import us.codecraft.webmagic.Spider;
    
    /**
     * Crawls Jianshu data and persists it to the database
     *
     * @author 刘鹏
     * @date 2021/4/16 15:51
     */
    @Service
    public class JianshuService {
    
        public static void main(String[] args) {
            Spider.create(new JianshuPageProcessor())
                    .setDownloader(new HttpClientDownloader())
                    .addUrl("https://www.jianshu.com/p/85a3004b5c06")
                    .addPipeline(new SpiderPipeLine())
                    .thread(5)
                    .run();
        }
    }
    

    SpiderPipeLine

    package com.tongxing.spider.pipeline;
    
    import com.tongxing.spider.entity.SpiderJianshu;
    import org.springframework.stereotype.Component;
    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;
    
    /**
     * Custom Pipeline handling the crawled results (this one just prints them; the DB insert happens in the processor)
     *
     * @author 刘鹏
     * @date 2021/4/16 16:07
     */
    @Component("SpiderPipeLine")
    public class SpiderPipeLine implements Pipeline {
    
        public SpiderPipeLine(){
    
        }
    
        /**
         * Process extracted results.
         *
         * @param resultItems resultItems
         * @param task        task
         */
        @Override
        public void process(ResultItems resultItems, Task task) {
            SpiderJianshu spiderJianshu = resultItems.get("jianshu");
            System.out.println("=============spider: " + spiderJianshu);
        }
    }
    

    SpiderJianshu

    package com.tongxing.spider.entity;
    
    import com.baomidou.mybatisplus.annotation.IdType;
    import com.baomidou.mybatisplus.annotation.TableId;
    import com.baomidou.mybatisplus.annotation.TableName;
    import lombok.Data;
    
    import java.io.Serializable;
    
    /**
     * Crawled-data table entity
     *
     * @author 刘鹏
     * @date 2021/4/16 11:44
     */
    @Data
    @TableName("spider_jianshu")
    public class SpiderJianshu implements Serializable {
        private static final long serialVersionUID = -3411930825161068742L;
        /**
         * Primary key
         */
        @TableId(type = IdType.ASSIGN_ID)
        private String id;
        /**
         * Title
         */
        private String title;
        /**
         * Like count
         */
        private String numLikes;
        /**
         * Comment count
         */
        private String numComments;
        /**
         * Repost count
         */
        private String numRelay;
    }
    

    SpiderJianshuDao

    package com.tongxing.spider.dao;
    
    import com.baomidou.mybatisplus.core.mapper.BaseMapper;
    import com.tongxing.spider.entity.SpiderJianshu;
    import org.apache.ibatis.annotations.Mapper;
    
    /**
     * Mapper interface for the crawled Jianshu data
     *
     * @author 刘鹏
     * @date 2021/4/16 15:52
     */
    @Mapper
    public interface SpiderJianshuDao extends BaseMapper<SpiderJianshu> {
    }
    
