Using WebMagic + Selenium + ChromeDriver

Author: MrL槑槑 | Published: 2021-03-20 15:52

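    I. WebMagic architecture overview

    WebMagic is built from four components, Downloader, PageProcessor, Scheduler and Pipeline, which Spider ties together. They correspond to the download, processing, URL-management and persistence stages of a crawler's life cycle.

    Spider wires these components together so they can interact and run as one flow. You can think of Spider as a large container; it is also the core of WebMagic's logic.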
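As a rough mental model of how the four roles cooperate, here is a minimal sketch in plain Java. Everything in it (`MiniSpider`, `Result`, `crawl`, `demo`, the fake two-page site) is hypothetical and written for illustration only; it is not WebMagic's API, which offers far more (threads, retries, site settings).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Conceptual model only: every type here is hypothetical and is NOT WebMagic's API.
class MiniSpider {
    /** What a processor produces for one page: an extracted value plus discovered links. */
    static class Result {
        final String value;
        final List<String> links;

        Result(String value, List<String> links) {
            this.value = value;
            this.links = links;
        }
    }

    static List<String> crawl(String startUrl,
                              Function<String, String> downloader,
                              Function<String, Result> processor) {
        Queue<String> pending = new ArrayDeque<>(); // Scheduler: URLs still to fetch
        Set<String> seen = new HashSet<>();         // Scheduler: de-duplication
        List<String> stored = new ArrayList<>();    // Pipeline: an in-memory list here
        pending.add(startUrl);
        seen.add(startUrl);
        while (!pending.isEmpty()) {
            String raw = downloader.apply(pending.poll()); // Downloader
            if (raw == null) {
                continue; // download failed, skip the page
            }
            Result result = processor.apply(raw);          // PageProcessor
            stored.add(result.value);
            for (String link : result.links) {
                if (seen.add(link)) {
                    pending.add(link);
                }
            }
        }
        return stored;
    }

    /** Crawls a fake two-page site whose page bodies look like "value:link". */
    static List<String> demo() {
        Map<String, String> web = new HashMap<>();
        web.put("/a", "A:/b");
        web.put("/b", "B:/a");
        return crawl("/a", web::get, raw -> {
            String[] parts = raw.split(":", 2);
            return new Result(parts[0], Collections.singletonList(parts[1]));
        });
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The two fake pages link to each other, so the `seen` set is what stops the loop from running forever.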


    [Figure: WebMagic architecture diagram]

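    II. The four WebMagic components

    1. Downloader

    The Downloader fetches pages from the Internet for later processing. By default WebMagic uses Apache HttpClient as its download tool.

    2. PageProcessor

    The PageProcessor parses pages, extracts useful information and discovers new links. WebMagic uses Jsoup as its HTML parser and builds Xsoup, an XPath parsing tool, on top of it.

    Of the four components, the PageProcessor differs for every site and every page, so it is the part users have to write themselves.

    3. Scheduler

    The Scheduler manages the URLs waiting to be crawled and handles de-duplication. By default WebMagic manages URLs with an in-memory JDK queue and de-duplicates them with a set. Redis is also supported for distributed management.

    Unless the project has special distributed requirements, there is no need to write a custom Scheduler.

    4. Pipeline

    The Pipeline handles the extracted results: computation, persistence to files or databases, and so on. Out of the box WebMagic provides two result handlers, "print to console" and "save to file".

    The Pipeline defines how results are saved. To save to a particular database you need to write a matching Pipeline. One Pipeline per kind of requirement is usually enough.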
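The "queue plus set" idea behind the default Scheduler can be sketched in a few lines of plain Java. `SimpleUrlScheduler` and its methods are hypothetical names used for illustration; this is not WebMagic's own QueueScheduler code.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration class, not part of WebMagic.
class SimpleUrlScheduler {
    // URLs waiting to be fetched, in FIFO order.
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    // Every URL ever accepted, used for de-duplication.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Queues the URL unless it was seen before; returns true if it was queued. */
    boolean push(String url) {
        if (seen.add(url)) {
            pending.add(url);
            return true;
        }
        return false;
    }

    /** Returns the next URL to fetch, or null when the queue is empty. */
    String poll() {
        return pending.poll();
    }

    public static void main(String[] args) {
        SimpleUrlScheduler scheduler = new SimpleUrlScheduler();
        scheduler.push("https://example.com/a");
        scheduler.push("https://example.com/b");
        scheduler.push("https://example.com/a"); // duplicate, ignored
        String url;
        while ((url = scheduler.poll()) != null) {
            System.out.println(url);
        }
    }
}
```

A set that remembers every URL grows with the crawl, which is why a Bloom filter is a common replacement when the number of URLs is large.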

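    III. Simulating login with Selenium

    Selenium is an automated testing tool that drives a real browser to load pages. Its advantage is that operations such as logging in or fetching AJAX content can be carried out automatically from code.

    It is especially useful for dynamic content generated by AJAX. An ordinary crawler only gets the static HTML of the page and does not load dynamically generated content, whereas Selenium renders the page in a browser and so makes that content available.

    It also has a drawback: before using Selenium you must set up the browser driver, and the page's dynamic content has to be rendered through Selenium before it can be crawled.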

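    IV. Downloading the browser and driver

    To crawl pages with Selenium you first need to download the matching browser driver; different browser versions require different driver versions.

    Installing Google Chrome on CentOS:
        1.1 Install Chrome: yum install https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
        1.2 Check the Chrome version: google-chrome --version
        1.3 Download the driver that matches the Chrome version (from http://chromedriver.storage.googleapis.com/index.html):
            For example, for Chrome 89.0.4389.82 (the first three parts of the version, 89.0.4389, must match; the last part may differ):
            http://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_linux64.zip
        1.4 Unzip the downloaded driver package into /home/chrome/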
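The version check in step 1.3 can be automated. The sketch below assumes the matching rule that applies to the ChromeDriver releases hosted at the URL above: the major, minor and build numbers must be equal. `ChromeVersionCheck` and its methods are hypothetical helpers, not part of Selenium or ChromeDriver.

```java
// Hypothetical helper, not part of Selenium or ChromeDriver.
class ChromeVersionCheck {
    /** Returns the first three dot-separated parts, e.g. "89.0.4389" for "89.0.4389.82". */
    static String buildPrefix(String version) {
        String[] parts = version.trim().split("\\.");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Unexpected version: " + version);
        }
        return parts[0] + "." + parts[1] + "." + parts[2];
    }

    /** Extracts the version from output such as "Google Chrome 89.0.4389.82 ". */
    static String parseChromeOutput(String output) {
        String[] tokens = output.trim().split("\\s+");
        return tokens[tokens.length - 1];
    }

    /** True when browser and driver agree on major.minor.build. */
    static boolean compatible(String chromeVersion, String driverVersion) {
        return buildPrefix(chromeVersion).equals(buildPrefix(driverVersion));
    }

    public static void main(String[] args) {
        // Sample text in the format printed by `google-chrome --version`.
        String chrome = parseChromeOutput("Google Chrome 89.0.4389.82 ");
        System.out.println(compatible(chrome, "89.0.4389.23")); // same build prefix
        System.out.println(compatible(chrome, "88.0.4324.96")); // different major version
    }
}
```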
    

    V. Project setup

    1. Add dependencies

            <!-- Selenium Java bindings -->
            <dependency>
                <groupId>org.seleniumhq.selenium</groupId>
                <artifactId>selenium-java</artifactId>
                <version>3.141.59</version>
            </dependency>
            <!-- ChromeDriver support for Selenium -->
            <dependency>
                <groupId>org.seleniumhq.selenium</groupId>
                <artifactId>selenium-chrome-driver</artifactId>
                <version>3.141.59</version>
            </dependency>
            <dependency>
                <groupId>us.codecraft</groupId>
                <artifactId>webmagic-core</artifactId>
                <version>0.7.4</version>
            </dependency>
            <dependency>
                <groupId>us.codecraft</groupId>
                <artifactId>webmagic-extension</artifactId>
                <version>0.7.4</version>
            </dependency>
            <!-- commons-collections -->
            <dependency>
                <groupId>commons-collections</groupId>
                <artifactId>commons-collections</artifactId>
                <version>3.2.2</version>
            </dependency>
            <dependency>
                <groupId>us.codecraft</groupId>
                <artifactId>webmagic-selenium</artifactId>
                <version>0.7.4</version>
            </dependency>
    

    2. Modify WebDriverPool

    package com.nieyue.news.webmagic.downloader;
    
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriverService;
    import org.openqa.selenium.remote.DesiredCapabilities;
    import org.openqa.selenium.remote.RemoteWebDriver;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.BlockingDeque;
    import java.util.concurrent.LinkedBlockingDeque;
    import java.util.concurrent.atomic.AtomicInteger;
    
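    /**
     * @author lsj
     * WebDriverPool: starting and stopping browser driver processes (originally PhantomJS) is expensive,
     * so a pool of WebDriver instances is kept and reused to limit resource usage.
     *
     */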
    class WebDriverPool {
    
        private Logger logger=  LoggerFactory.getLogger(this.getClass());
    
        private final static int DEFAULT_CAPACITY = 5;
    
        private final int capacity;
    
        private final static int STAT_RUNNING = 1;
    
        private final static int STAT_CLODED = 2;
    
        private AtomicInteger stat = new AtomicInteger(STAT_RUNNING);
    
        /*
         * new fields for configuring phantomJS
         */
        private WebDriver mDriver = null;
        private boolean mAutoQuitDriver = true;
    
        private static final String DEFAULT_CONFIG_FILE = "selenium.properties";
        private static final String DRIVER_FIREFOX = "firefox";
        private static final String DRIVER_CHROME = "chrome";
        private static final String DRIVER_PHANTOMJS = "phantomjs";
    
        protected static Properties sConfig;
        protected static DesiredCapabilities sCaps;
    
        /**
         * Configure the GhostDriver, and initialize a WebDriver instance. This part
         * of code comes from GhostDriver.
         * https://github.com/detro/ghostdriver/tree/master/test/java/src/test/java/ghostdriver
         *
         * @throws IOException
         */
        public void configure() throws IOException {
            // Read config file
            sConfig = new Properties();
            String configFile = DEFAULT_CONFIG_FILE;
            if (System.getProperty("selenuim_config")!=null){
                configFile = System.getProperty("selenuim_config");
            }
            sConfig.load(Thread.currentThread().getContextClassLoader().getResourceAsStream(configFile));
    //        sConfig.load(new FileReader(configFile));
    
            // Prepare capabilities
            sCaps = new DesiredCapabilities();
            sCaps.setJavascriptEnabled(true);
            sCaps.setCapability("takesScreenshot", false);
    
            String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);
    
            // Fetch PhantomJS-specific configuration parameters
            if (driver.equals(DRIVER_PHANTOMJS)) {
                // "phantomjs_exec_path"
                if (sConfig.getProperty("phantomjs_exec_path") != null) {
                    sCaps.setCapability(
                            PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
                            sConfig.getProperty("phantomjs_exec_path"));
                } else {
                    throw new IOException(
                            String.format(
                                    "Property '%s' not set!",
                                    PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY));
                }
                // "phantomjs_driver_path"
                if (sConfig.getProperty("phantomjs_driver_path") != null) {
                    System.out.println("Test will use an external GhostDriver");
                    sCaps.setCapability(
                            PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_PATH_PROPERTY,
                            sConfig.getProperty("phantomjs_driver_path"));
                } else {
                    System.out
                            .println("Test will use PhantomJS internal GhostDriver");
                }
            }
    
            // Disable "web-security", enable all possible "ssl-protocols" and
            // "ignore-ssl-errors" for PhantomJSDriver
            // sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new
            // String[] {
            // "--web-security=false",
            // "--ssl-protocol=any",
            // "--ignore-ssl-errors=true"
            // });
    
            ArrayList<String> cliArgsCap = new ArrayList<String>();
            cliArgsCap.add("--web-security=false");
            cliArgsCap.add("--ssl-protocol=any");
            cliArgsCap.add("--ignore-ssl-errors=true");
            sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
                    cliArgsCap);
    
            // Control LogLevel for GhostDriver, via CLI arguments
            sCaps.setCapability(
                    PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_CLI_ARGS,
                    new String[] { "--logLevel="
                            + (sConfig.getProperty("phantomjs_driver_loglevel") != null ? sConfig
                            .getProperty("phantomjs_driver_loglevel")
                            : "INFO") });
    
            // String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);
    
            // Start appropriate Driver
            if (isUrl(driver)) {
                sCaps.setBrowserName("phantomjs");
                mDriver = new RemoteWebDriver(new URL(driver), sCaps);
            } else if (driver.equals(DRIVER_FIREFOX)) {
                mDriver = new FirefoxDriver(sCaps);
            } else if (driver.equals(DRIVER_CHROME)) {
                ChromeOptions options = new ChromeOptions();
                // Google's documentation mentions that this flag is needed to work around a bug
                options.addArguments("headless");
                options.addArguments("disable-gpu");
                options.addArguments("disable-dev-shm-usage");
                options.addArguments("disable-plugins");
                // Disable Java
                options.addArguments("disable-java");
                // Disable the sandbox (needed when Chrome is started as root)
                options.addArguments("no-sandbox");
    //            options.addArguments("user-agent=\"Mozilla/5.0 (iPod; U; CPU iPhone OS 2_1 like Mac OS X; ja-jp) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5F137 Safari/525.20\"");
                // Headless mode: no browser window is shown
                options.setHeadless(true);
                mDriver = new ChromeDriver(options);
            } else if (driver.equals(DRIVER_PHANTOMJS)) {
                mDriver = new PhantomJSDriver(sCaps);
            }
        }
    
        /**
         * check whether input is a valid URL
         *
         * @param urlString urlString
         * @return true means yes, otherwise no.
         */
        private boolean isUrl(String urlString) {
            try {
                new URL(urlString);
                return true;
            } catch (MalformedURLException mue) {
                return false;
            }
        }
    
        /**
         * store webDrivers created
         */
        private List<WebDriver> webDriverList = Collections
                .synchronizedList(new ArrayList<WebDriver>());
    
        /**
         * store webDrivers available
         */
        private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>();
    
        public WebDriverPool(int capacity) {
            this.capacity = capacity;
        }
    
        public WebDriverPool() {
            this(DEFAULT_CAPACITY);
        }
    
        /**
         *
         * @return
         * @throws InterruptedException
         */
        public WebDriver get() throws InterruptedException {
            checkRunning();
            WebDriver poll = innerQueue.poll();
            if (poll != null) {
                return poll;
            }
            if (webDriverList.size() < capacity) {
                synchronized (webDriverList) {
                    if (webDriverList.size() < capacity) {
    
                        // add new WebDriver instance into pool
                        try {
                            configure();
                            innerQueue.add(mDriver);
                            webDriverList.add(mDriver);
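                        } catch (IOException e) {
                            // Fail fast: if no driver could be created, innerQueue.take() below could block forever
                            throw new IllegalStateException("Failed to configure WebDriver", e);
                        }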
    
                        // ChromeDriver e = new ChromeDriver();
                        // WebDriver e = getWebDriver();
                        // innerQueue.add(e);
                        // webDriverList.add(e);
                    }
                }
    
            }
            return innerQueue.take();
        }
    
        public void returnToPool(WebDriver webDriver) {
            checkRunning();
            innerQueue.add(webDriver);
        }
    
        protected void checkRunning() {
            if (!stat.compareAndSet(STAT_RUNNING, STAT_RUNNING)) {
                throw new IllegalStateException("Already closed!");
            }
        }
    
        public void closeAll() {
            boolean b = stat.compareAndSet(STAT_RUNNING, STAT_CLODED);
            if (!b) {
                throw new IllegalStateException("Already closed!");
            }
            for (WebDriver webDriver : webDriverList) {
                logger.info("Quit webDriver" + webDriver);
                webDriver.quit();
                webDriver = null;
            }
        }
    }
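The borrow/return pattern that WebDriverPool implements can be seen more easily without the driver configuration around it. Below is a minimal sketch of a bounded pool in plain Java. `BoundedPool` is a hypothetical class written for this illustration, not the class above, and it makes two different choices: a newly created object is handed straight to the caller instead of going through the queue, and a failing factory throws to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.function.Supplier;

// Hypothetical generic pool, shown only to illustrate the borrow/return pattern.
class BoundedPool<T> {
    private final int capacity;
    private final Supplier<T> factory;
    private final List<T> created = new ArrayList<>();                 // everything ever created
    private final BlockingDeque<T> idle = new LinkedBlockingDeque<>(); // currently available

    BoundedPool(int capacity, Supplier<T> factory) {
        this.capacity = capacity;
        this.factory = factory;
    }

    /** Borrows an idle object, creates one if under capacity, otherwise waits for a return. */
    T get() {
        T item = idle.poll();
        if (item != null) {
            return item;
        }
        synchronized (created) {
            if (created.size() < capacity) {
                T fresh = factory.get(); // an exception here propagates to the caller
                created.add(fresh);
                return fresh;
            }
        }
        try {
            return idle.take(); // pool exhausted: block until something is returned
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a pooled object", e);
        }
    }

    void returnToPool(T item) {
        idle.add(item);
    }

    int createdCount() {
        synchronized (created) {
            return created.size();
        }
    }

    public static void main(String[] args) {
        BoundedPool<StringBuilder> pool = new BoundedPool<>(2, StringBuilder::new);
        StringBuilder a = pool.get();
        StringBuilder b = pool.get();
        pool.returnToPool(a);
        StringBuilder c = pool.get(); // reuses a instead of creating a third object
        System.out.println(a == c);
        System.out.println(pool.createdCount());
    }
}
```

Callers should return the borrowed object in a `finally` block, otherwise one failed page load permanently shrinks the pool.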
    

    3. Modify SeleniumDownloader

    package com.nieyue.news.webmagic.downloader;
    
    import org.openqa.selenium.By;
    import org.openqa.selenium.Cookie;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Request;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.downloader.Downloader;
    import us.codecraft.webmagic.selector.PlainText;
    
    import java.io.Closeable;
    import java.io.IOException;
    import java.util.Map;
    
    /**
     * Renders pages in a browser driven by Selenium. Currently only Chrome is supported.
     * Requires the Selenium driver to be downloaded.
     */
    public class SeleniumDownloader implements Downloader, Closeable {
    
        private volatile WebDriverPool webDriverPool;
    
        private Logger logger=  LoggerFactory.getLogger(this.getClass());
    
        private int sleepTime = 0;
    
        private int poolSize = 1;
    
        private static final String DRIVER_PHANTOMJS = "phantomjs";
    
        /**
         * Creates a downloader that uses the given ChromeDriver executable.
         *
         * @param chromeDriverPath chromeDriverPath
         */
        public SeleniumDownloader(String chromeDriverPath) {
            System.getProperties().setProperty("webdriver.chrome.driver",
                    chromeDriverPath);
        }
    
        /**
         * Constructor without any field. Constructs a PhantomJS browser
         */
        public SeleniumDownloader() {
            // System.setProperty("phantomjs.binary.path",
            // "/Users/Bingo/Downloads/phantomjs-1.9.7-macosx/bin/phantomjs");
        }
    
        /**
         * set sleep time to wait until load success
         *
         * @param sleepTime sleepTime
         * @return this
         */
        public SeleniumDownloader setSleepTime(int sleepTime) {
            this.sleepTime = sleepTime;
            return this;
        }
    
        @Override
        public Page download(Request request, Task task) {
            checkInit();
            WebDriver webDriver;
            try {
                webDriver = webDriverPool.get();
            } catch (InterruptedException e) {
                logger.warn("interrupted", e);
                return null;
            }
            logger.info("downloading page " + request.getUrl());
            webDriver.get(request.getUrl());
            try {
                Thread.sleep(sleepTime);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            WebDriver.Options manage = webDriver.manage();
            Site site = task.getSite();
            if (site.getCookies() != null) {
                for (Map.Entry<String, String> cookieEntry : site.getCookies()
                        .entrySet()) {
                    Cookie cookie = new Cookie(cookieEntry.getKey(),
                            cookieEntry.getValue());
                    manage.addCookie(cookie);
                }
            }
    
            /*
             * TODO You can add mouse event or other processes
             *
             */
    
            WebElement webElement = webDriver.findElement(By.xpath("/html"));
            String content = webElement.getAttribute("outerHTML");
            Page page = new Page();
            page.setRawText(content);
    //        page.setHtml(new Html(content, request.getUrl()));
            page.setUrl(new PlainText(request.getUrl()));
            page.setRequest(request);
            webDriverPool.returnToPool(webDriver);
            return page;
        }
    
        private void checkInit() {
            if (webDriverPool == null) {
                synchronized (this) {
                    // Re-check inside the lock so that only one pool is ever created
                    if (webDriverPool == null) {
                        webDriverPool = new WebDriverPool(poolSize);
                    }
                }
            }
        }
    
        @Override
        public void setThread(int thread) {
            this.poolSize = thread;
        }
    
        @Override
        public void close() throws IOException {
            // The pool is created lazily, so it may still be null here
            if (webDriverPool != null) {
                webDriverPool.closeAll();
            }
        }
    }
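The lazy creation of the pool in checkInit() is an instance of double-checked locking. For reference, here is a minimal sketch of that idiom in isolation. `LazyHolder` is a hypothetical class used only for illustration. The field must be `volatile`, and the null check has to be repeated inside the synchronized block, otherwise two threads can each create an instance.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical holder class illustrating double-checked locking.
class LazyHolder<T> {
    private final Supplier<T> factory;
    private volatile T value; // volatile is required for safe publication

    LazyHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T local = value;                 // first check, without locking
        if (local == null) {
            synchronized (this) {
                local = value;           // second check, under the lock
                if (local == null) {
                    local = factory.get();
                    value = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        LazyHolder<String> holder = new LazyHolder<>(() -> "pool#" + created.incrementAndGet());
        System.out.println(holder.get());
        System.out.println(holder.get());
        System.out.println(created.get()); // the factory ran only once
    }
}
```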
    

    4. Add the selenium.properties configuration file

    # What WebDriver to use for the tests
    #driver=phantomjs
    #driver=firefox
    driver=chrome
    #driver=http://localhost:8910
    #driver=http://localhost:4444/wd/hub
    
    # PhantomJS specific config (change according to your installation)
    #phantomjs_exec_path=/Users/Bingo/bin/phantomjs-qt5
    #phantomjs_exec_path=d:/phantomjs.exe
    #chrome_exec_path=E:\\demo\\crawler\\chromedriver.exe
    #phantomjs_driver_path=/Users/Bingo/Documents/workspace/webmagic/webmagic-selenium/src/main.js
    #phantomjs_driver_loglevel=DEBUG
    chrome_driver_loglevel=DEBUG
    
    # Local
    #chrome_driver_path=D://MyProject//chromedriver.exe
    # Test environment
    chrome_driver_path=/home/chrome/chromedriver
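To see how such a file is read, here is a small sketch using `java.util.Properties`. It parses the text from a string rather than from the classpath so that it runs on its own. `SeleniumConfigDemo` and `parse` are hypothetical names; the `phantomjs` fallback mirrors the default that WebDriverPool passes to `getProperty("driver", ...)`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo class; the keys are the ones used in selenium.properties above.
class SeleniumConfigDemo {
    /** Parses properties-file text; lines starting with '#' are comments. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text = "#driver=phantomjs\n"
                + "driver=chrome\n"
                + "chrome_driver_path=/home/chrome/chromedriver\n";
        Properties props = parse(text);
        // The second argument is the fallback used when the key is absent.
        System.out.println(props.getProperty("driver", "phantomjs"));
        System.out.println(props.getProperty("chrome_driver_path"));
        System.out.println(parse("").getProperty("driver", "phantomjs"));
    }
}
```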
    

    5. Usage example

    package com.nieyue.news.webmagic.processor;
    
    import com.nieyue.news.bean.ArticleWebmagic;
    import com.nieyue.news.webmagic.downloader.SeleniumDownloader;
    import com.nieyue.news.webmagic.pipeline.ArticlePipeline;
    import com.nieyue.news.webmagic.utils.WebmagicRedisUtil;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Component;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;
    import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
    import us.codecraft.webmagic.scheduler.QueueScheduler;
    import us.codecraft.webmagic.selector.Html;
    import java.util.*;
    
    /**
     * ifeng.com (Phoenix New Media)
     * PageProcessor implementation
     */
    @Component
    public class ArticleProcessor implements PageProcessor {
    
        // Start page URL (recommended)
        private static final String URL = "https://finance.ifeng.com/c/84cRLNKrrar";
    
        private Logger logger=  LoggerFactory.getLogger(this.getClass());
    
        @Autowired
        private WebmagicRedisUtil webmagicRedisUtil;
    
        // Path to chromedriver.exe
        private static final String address = "D:\\MyProject\\chromedriver.exe";
    
        private Site site;
    
        @Override
        public Site getSite() {
            if (site == null) {
                site = Site.me()
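                        .setCharset("utf8")  // Character set; use whatever the target site uses
                        .setSleepTime(3 * 1000)  // Interval between requests; all times are in milliseconds
                        .setTimeOut(5 * 1000)  // Timeout
                        .setRetrySleepTime(3 * 1000)  // Interval between retries
                        .setRetryTimes(3);  // Number of retries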
            }
            return site;
        }
    
        /**
         * The actual parsing logic
         * @param page Page, built automatically by WebMagic after the Downloader fetches it
         */
        @Override
        public void process(Page page) {
            Html html = page.getHtml();
            Document document = html.getDocument();
            int select = document.select("div#root").select("div.layout-u18-agac").size();
            if (select == 0 ){
                // Detail page
                // getElementsByAttributeValueContaining finds the elements whose attribute `key` has a value containing `match`
                Elements elements = document.getElementsByAttributeValueContaining("class", "main_content-");
                if (elements != null && elements.size()>0){
                    Element element = elements.get(0);
                    // Skip articles containing these keywords (原创 "original", 不得转载 / 禁止转载 / 禁止任何方式转载 "no reproduction", 未经允许 "without permission")
                    Elements words1 = element.getElementsContainingText("原创");
                    Elements words2 = element.getElementsContainingText("不得转载");
                    Elements words3 = element.getElementsContainingText("禁止转载");
                    Elements words4 = element.getElementsContainingText("禁止任何方式转载");
                    Elements words5 = element.getElementsContainingText("未经允许");
                    if ((words1 != null && words1.size()>0)||(words2 != null && words2.size()>0)|| (words3 != null && words3.size()>0)
                            ||(words4 != null && words4.size()>0)||(words5 != null && words5.size()>0)){
                        logger.info("ifeng.com -> page contains blocked keywords: "+page.getRequest().getUrl());
                        return;
                    }
                    // Remove ads inside the article
                    element.select("div#embed_hzh_div").remove();
                    // Remove auto-playing video
                    element.getElementsByAttributeValueContaining("class", "video_box-").remove();
                    // Remove the bottom ad
                    element.getElementsByAttributeValue("style","position: relative;").remove();
                    // Object to save
                    ArticleWebmagic articleWebmagic = new ArticleWebmagic();
                    // Title
                    // String title = document.getElementsByAttributeValueContaining("class", "leftContent-").select("h1").text();
                    String title = document.select("h1").text();
                    // Images
                    Elements img = element.getElementsByTag("img");
                    if (img != null && img.size()>0){
                        StringBuilder sb = new StringBuilder();
                        // Three-image layout
                        if (img.size()>=3){
                            for (int i = 0;i<3;i++){
                                articleWebmagic.setImgMode(6);
                                String src = img.get(i).attr("src");
                                sb.append(src).append(",");
                            }
                        } else {  // Single small image on the right
                            articleWebmagic.setImgMode(4);
                            String src = img.get(0).attr("src");
                            sb.append(src).append(",");
                        }
                        articleWebmagic.setImgAddress(sb.substring(0,sb.length()-1));
                    }
                    String url = page.getRequest().getUrl();
                    articleWebmagic.setTitle(title);
                    articleWebmagic.setContent(element.toString());
                    articleWebmagic.setUrl(url);
                    // Store the data
                    page.putField("articleWebmagic",articleWebmagic);
                } else {
                    logger.info("ifeng.com -> page does not match the expected layout: "+page.getRequest().getUrl());
                }
            } else {
                // Hot news
                Elements elements = document.select("div.hot_box-1yXFLW7e").select("div.news_list-1dYUdgWQ").get(0).select("a");
                // Top stories
                elements.addAll(document.select("div.center_box-2F8qYPeE").select("div.tabBodyItemActive-H7rMJtKB").select("a"));
                // Military
                elements.addAll(document.select("div.left_box-aXjri-Gu").select("div.news_list-1dYUdgWQ").select("a"));
                // Technology
                elements.addAll(document.select("div.center_box-_l_Nle8B").select("div.news_list-1dYUdgWQ").select("a"));
                // Sports
                elements.addAll(document.select("div.left_box-7AdOw5gz").select("div.news_list-1dYUdgWQ").select("a"));
                // Entertainment
                elements.addAll(document.select("div.center_box-2d2syNWk").select("div.news_list-1dYUdgWQ").select("a"));
                // Fashion
                elements.addAll(document.select("div.center_box-39hkxdBA").select("div.news_list-1dYUdgWQ").select("a"));
                // Education
                elements.addAll(document.select("div.left_box-3iQHsHjU").select("div.news_list-1dYUdgWQ").select("a"));
                // Culture and books
                elements.addAll(document.select("div.center_box-2ghWH00s").select("div.news_list-1dYUdgWQ").select("a"));
                // New list
                List<String> article = new ArrayList<>();
                // Backup copy of the new list
                List<String> article1 = new ArrayList<>();
                for(Element element : elements){
                    String url = element.attr("href");
                    article.add(url);
                    article1.add(url);
                }
                // Fetch the previously seen articles from Redis
    //            webmagicRedisUtil.del("articleWebmagic");
                List<String> articlebmagic = (List<String>) webmagicRedisUtil.get("articleWebmagic");
                if (articlebmagic == null || articlebmagic.size() ==0){
                    webmagicRedisUtil.set("articleWebmagic",article);
                    for(String a : article){
                        page.addTargetRequest(a);
                    }
                } else {
                    for(String article0 : articlebmagic){
                        Iterator<String> iterator1 = article.iterator();
                        while (iterator1.hasNext()){
                            String next = iterator1.next();
                            if (article0.equals(next)){
                                iterator1.remove();
                            }
                        }
                    }
                    // Request only the URLs that are new
                    if (article !=null && article.size() > 0){
                        for(String url : article){
                            page.addTargetRequest(url);
                        }
                    }
                    // Update Redis
                    webmagicRedisUtil.del("articleWebmagic");
                    webmagicRedisUtil.set("articleWebmagic",article1);
                }
            }
        }
    
    //    public static void main(String[] args) {
            // Run
    //        Spider.create(new ArticleProcessor())
    //                .addUrl(URL)
    //                .thread(5)
    //                // Custom Pipeline that saves to the database
    //                .addPipeline(new ArticlePipeline())
    //                /**
    //                 * Set a sleep time on SeleniumDownloader:
    //                 * when a page loads dynamically, some data may not be ready yet; sleeping gives it time to finish loading
    //                */
    //                .setDownloader(new SeleniumDownloader(address).setSleepTime(3 * 1000))
    //                // Set the scheduler and de-duplication strategy (Bloom filter sized for about 10,000 URLs: 10 * 1000)
    //                .setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(10 * 1000)))
    //                .run();
    //    }
    
    }
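The list-page branch above works out which URLs are new by removing every previously stored URL from the freshly scraped list with nested loops. The same result can be computed with a set lookup. The sketch below is a hypothetical standalone helper (`UrlDiff.newUrls`), not a drop-in patch for the processor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the "only crawl what is new" step.
class UrlDiff {
    /** Returns the URLs in current that are not in previous, preserving order. */
    static List<String> newUrls(List<String> current, List<String> previous) {
        Set<String> old = new HashSet<>();
        if (previous != null) {
            old.addAll(previous); // a null previous list means nothing was stored yet
        }
        List<String> result = new ArrayList<>();
        for (String url : current) {
            if (!old.contains(url)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> scraped = Arrays.asList("/c/1", "/c/2", "/c/3");
        List<String> stored = Arrays.asList("/c/2");
        System.out.println(newUrls(scraped, stored));
        System.out.println(newUrls(scraped, null));
    }
}
```

With a `HashSet` each lookup takes constant time on average, so the cost grows with the sum of the two list sizes rather than their product.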
    

    Crawling with a scheduled task

    package com.nieyue.news.webmagic.schedule;
    
    import com.nieyue.common.comments.lock.LockTemplate;
    import com.nieyue.common.comments.lock.LockedCallback;
    import com.nieyue.common.util.RedisUtil;
    import com.nieyue.news.webmagic.downloader.SeleniumDownloader;
    import com.nieyue.news.webmagic.pipeline.ArticlePipeline;
    import com.nieyue.news.webmagic.processor.ArticleProcessor;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
    import org.springframework.scheduling.annotation.EnableScheduling;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;
    import org.springframework.util.StringUtils;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
    import us.codecraft.webmagic.scheduler.QueueScheduler;
    
    import java.io.IOException;
    import java.util.Properties;
    
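    /**
     * Scheduled task that crawls ifeng.com articles automatically
     */
    @Component
    // Enable scheduling
    @EnableScheduling
    // Enable this bean only when the configuration property is set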
    @ConditionalOnProperty(prefix = "scheduling", name = "enabled", havingValue = "true")
    public class FengArticleScheduled {
    
        private static final String URL = "https://www.ifeng.com/";
    
        @Autowired
        private ArticleProcessor articleProcessor;
    
        @Autowired
        private ArticlePipeline articlePipeline;
    
        @Value("${server.port}")
        private int serverPort;
    
        private Logger logger=  LoggerFactory.getLogger(this.getClass());
    
        @Autowired
        private LockTemplate lockTemplate;
    
        // Runs three times a day, at 09:00, 12:00 and 15:00
        @Scheduled(cron = "0 0 9,12,15 * * ?")
        public void updateArticle() {
            lockTemplate.doBiz(new LockedCallback<String>() {
                @Override
                public String callback() {
                    Properties sConfig = new Properties();
                    try {
                        sConfig.load(Thread.currentThread().getContextClassLoader().getResourceAsStream("selenium.properties"));
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
    
                    // Run
                    Spider.create(articleProcessor)
                        .addUrl(URL)
                        // Custom Pipeline that saves to the database
                        .addPipeline(articlePipeline)
                        .thread(5)
                        /*
                         * Set a sleep time on SeleniumDownloader:
                         * when a page loads dynamically, some data may not be ready yet; sleeping gives it time to finish loading
                         */
                        .setDownloader(new SeleniumDownloader((String)sConfig.get("chrome_driver_path")).setSleepTime(3000))
                        // Set the scheduler and de-duplication strategy (Bloom filter sized for about 10,000 URLs: 10 * 1000)
                        .setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(10 * 1000)))
                        .run();
    
                    logger.info("ifeng.com article scheduled task finished, port: "+serverPort);
                    return "";
                }
            },"campusNewFengArticleScheduledUpdateArticleScene","","campusNewFengArticleScheduledUpdateArticleKey",10L,"锁异常campusNewFengArticleScheduledUpdateArticleSceneUniqueId");
        }
    
    }
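The cron expression `0 0 9,12,15 * * ?` uses Spring's six-field layout (second, minute, hour, day of month, month, day of week), so the task fires at 09:00, 12:00 and 15:00 every day. The sketch below reproduces only that one schedule with `java.time` to make the firing times concrete. `NextRun` is a hypothetical helper, not Spring's scheduler.

```java
import java.time.LocalDateTime;

// Hypothetical helper that mimics the schedule "0 0 9,12,15 * * ?" and nothing else.
class NextRun {
    private static final int[] HOURS = {9, 12, 15};

    /** Returns the first firing time strictly after now. */
    static LocalDateTime next(LocalDateTime now) {
        for (int hour : HOURS) {
            LocalDateTime candidate = now.toLocalDate().atTime(hour, 0);
            if (candidate.isAfter(now)) {
                return candidate;
            }
        }
        // All of today's slots have passed: first slot of the next day.
        return now.toLocalDate().plusDays(1).atTime(HOURS[0], 0);
    }

    public static void main(String[] args) {
        System.out.println(next(LocalDateTime.of(2021, 3, 20, 10, 30)));
        System.out.println(next(LocalDateTime.of(2021, 3, 20, 15, 0)));
    }
}
```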
    

    References:

    http://webmagic.io/docs/zh/
    https://blog.csdn.net/qixinbruce/article/details/71105444
    https://blog.csdn.net/panchang199266/article/details/85413746
