Web Crawlers: A Technical Survey

Author: ScratchPad | Published: 2018-11-06 16:38

Purely static pages

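This case involves only HTTP/HTTPS requests and text processing, so the approaches are largely the same across platforms; the differences mostly come from the language environment. The basic recipe is to issue an HTTP request and parse the response. Redirects are the main complication you may run into.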
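The request step, including the redirect handling mentioned above, could look roughly like the following. This is a minimal sketch that assumes Java 11 or later (for java.net.http); the class name StaticPageFetcher, the User-Agent value and the timeouts are placeholders chosen for illustration, and parsing the returned text is left to a library such as jsoup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class StaticPageFetcher {

    // Build a client that follows redirects, except from HTTPS to HTTP.
    public static HttpClient newClient() {
        return HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    // Build a GET request; the User-Agent value is a placeholder.
    public static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "example-crawler/0.1")
                .timeout(Duration.ofSeconds(20))
                .GET()
                .build();
    }

    // Fetch the page body as text after any redirects have been followed.
    public static String fetch(String url) throws Exception {
        HttpResponse<String> response =
                newClient().send(newRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling StaticPageFetcher.fetch with a page URL returns the HTML of the final page in the redirect chain, which can then be handed to the parser of your choice.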

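Java: jsoup. It is lightweight and provides selectors with a jQuery-like (CSS) syntax. The version current when this was written has no XPath support; jsoup added XPath queries later, in 1.14.3.

Python: Scrapy. Unlike jsoup, it supports XPath queries in addition to CSS selectors.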

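Dynamic pages

Dynamic pages are more troublesome. The discussion below is split along two lines: approaches on the Android side and approaches on the server side.

Android-side approaches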

WebView

The main idea is to use a WebView, override the WebViewClient callbacks, and combine them with some JavaScript calls.

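protected void executeJS(String js) {
    // Accept the script with or without the "javascript:" URL prefix.
    String script = js.startsWith("javascript:") ? js.substring("javascript:".length()) : js;
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.KITKAT) {
        // API 19+: evaluate the script asynchronously; no result callback is needed here.
        webView.evaluateJavascript(script, null);
    } else {
        // Older versions: run the script through a javascript: URL.
        webView.loadUrl("javascript:" + script);
    }
}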

Now look at the WebViewClient side:

webView.setWebViewClient(new WebViewClient() {

  /**
   * Notify the host application that a page has finished loading. This method
   * is called only for main frame. When onPageFinished() is called, the
   * rendering picture may not be updated yet. To get the notification for the
   * new Picture, use {@link WebView.PictureListener#onNewPicture}.
   *
   * @param view The WebView that is initiating the callback.
   * @param url The url of the page.
   */
  @Override
  public void onPageFinished(WebView view, String url) {
      // we may be able to retrieve the DOM from here
  }

  /**
   * Notify the host application of a resource request and allow the
   * application to return the data.  If the return value is null, the WebView
   * will continue to load the resource as usual.  Otherwise, the return
   * response and data will be used.  NOTE: This method is called on a thread
   * other than the UI thread so clients should exercise caution
   * when accessing private data or the view system.
   *
   * @param view The {@link android.webkit.WebView} that is requesting the
   *             resource.
   * @param request Object containing the details of the request.
   * @return A {@link android.webkit.WebResourceResponse} containing the
   *         response information or null if the WebView should load the
   *         resource itself.
   */
  @Override
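  public WebResourceResponse shouldInterceptRequest(WebView view, final WebResourceRequest request) {
      final String url = request.getUrl().toString();
      // The idea is to capture the request URL here and then request that address again
      // to obtain the data. Unlike on iOS, on Android this is the only place to get the request URL.
      final WebResourceResponse response = super.shouldInterceptRequest(view, request);
      // response is null here because the default implementation returns null, so reading
      // the response body via response.getData() is not possible. See how WebViewClient
      // implements this method for details.
      return response;
  }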
});

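Note that onPageFinished() fires as soon as the HTML has loaded. For a dynamic page this does not mean that all resources have loaded or that the JavaScript has finished running. Consider dumping the DOM from this callback as follows: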

protected static class MyJavaScriptInterface {

    // Invoked from page JavaScript on a background thread, not the UI thread.
    @JavascriptInterface
    public void html(String html) {
        // deal with html here
    }
}

// Register the bridge before the page loads so that window.HtmlViewer exists in the page.
webView.addJavascriptInterface(new MyJavaScriptInterface(), "HtmlViewer");

executeJS("javascript:window.HtmlViewer.html" +
        "('<html>'+document.getElementsByTagName('html')[0].innerHTML+'</html>');");

This does not guarantee that you get the DOM you want, for the reason given above.
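One way to work around this timing problem is to let the page itself poll until the content of interest exists, and only then hand the DOM to the HtmlViewer bridge. The following is a minimal sketch of a helper that builds such a script as a string; the class name DomReadyScript, the CSS selector and the polling parameters are hypothetical, and the helper assumes the selector contains no line breaks.

```java
public class DomReadyScript {

    // Escape a value for use inside a single-quoted JavaScript string literal.
    public static String jsQuote(String s) {
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    // Build a script that checks every intervalMs milliseconds whether cssSelector
    // matches an element. Once it matches, or after maxAttempts checks, the script
    // stops polling and passes the current DOM to window.HtmlViewer.html(...).
    public static String waitForSelector(String cssSelector, int intervalMs, int maxAttempts) {
        return "(function(){"
                + "var n=0;"
                + "var t=setInterval(function(){"
                + "n++;"
                + "var found=document.querySelector(" + jsQuote(cssSelector) + ");"
                + "if(found||n>=" + maxAttempts + "){"
                + "clearInterval(t);"
                + "window.HtmlViewer.html(document.documentElement.outerHTML);"
                + "}"
                + "}," + intervalMs + ");"
                + "})();";
    }
}
```

The resulting string could be run from onPageFinished(), for example with DomReadyScript.waitForSelector("div.item", 500, 20) as the script to execute. The attempt limit means the DOM is still delivered if the element never appears, so the receiving code should check whether the expected content is actually present.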

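In addition, depending on how a page is implemented, some data never appears in the DOM and lives only in the memory of the JavaScript engine. For example, when crawling a Taobao product list, the product ID is not present in the DOM. In such cases we can work from the data requests instead: when the page loads its data through API calls, we can take the intercepted API URL, attach the Cookie and other headers to imitate the WebView, and request it once more to obtain the page's raw data.

Parsing this raw data with plain Gson is usually cumbersome, because the returned JSONP is large and deeply nested. A query mechanism similar to XPath or selectors makes the parsing easier.

Recent versions of fastjson support JSONPath out of the box, whereas Gson has no such built-in mechanism and needs a third-party library: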

implementation 'com.google.code.gson:gson:2.8.2'
implementation 'com.jayway.jsonpath:json-path:2.3.0'

The json-path library provides the following API for querying a specific node:

JsonPath.parse(json).read("$.data.models.content.item.itemPriceDTO.price.item_price")

There is a Chrome extension called JsonPath Finder that helps work out the JSONPath expression.
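Because the raw data mentioned above is JSONP rather than plain JSON, the callback wrapper generally has to be removed before a JSON parser or a JSONPath query can read it. A minimal sketch of that step, assuming the usual callbackName(...) form; the class name JsonpUnwrapper is made up for illustration, and responses that use other wrappers (for example a leading comment) are not handled.

```java
public class JsonpUnwrapper {

    // Return the JSON payload inside "callback(...)". Input that does not look
    // like JSONP (for example plain JSON) is returned trimmed but otherwise unchanged.
    public static String unwrap(String body) {
        String text = body.trim();
        int open = text.indexOf('(');
        int close = text.lastIndexOf(')');
        if (open < 0 || close < open) {
            return text;
        }
        // The text before the first '(' must look like a callback name.
        String name = text.substring(0, open).trim();
        if (!name.matches("[A-Za-z_$][\\w$.]*")) {
            return text;
        }
        return text.substring(open + 1, close).trim();
    }
}
```

The unwrapped string can then be passed to JsonPath.parse(...) as in the call shown earlier.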

Selenium

http://selendroid.io/

Currently, on Android this has to run together with a PC-side environment and cannot work on its own. The same is true for iOS.

DSpider

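This is also a WebView-based approach. The idea is to write a JavaScript script specific to the target page and load it through WebView.loadUrl() in order to extract information from the page.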

https://dspider.dtworkroom.com

This project appears to have been discontinued.

Server-side approaches

Selenium

Selenium needs a browser-specific driver to work; for Chrome, that is ChromeDriver. It also provides API libraries for a number of languages.

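Approaches based on the Chrome DevTools Protocol

Chrome DevTools Protocol

awesome-chrome-devtools

Because this approach is more tightly integrated with Chrome, it can offer additional capabilities such as network monitoring, similar to what the debugger in Chrome DevTools provides. See the Chrome DevTools Protocol documentation for details.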
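To make the starting point of such a setup concrete, the sketch below launches a headless Chrome with its remote debugging port open, which is what protocol clients connect to. It is only an illustration under stated assumptions: the class name ChromeLauncherSketch, the binary name, the port and the profile directory are placeholders, and client libraries usually perform this launch step themselves.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

public class ChromeLauncherSketch {

    // Build the command line for a headless Chrome that listens for
    // DevTools Protocol clients on the given port.
    public static List<String> buildCommand(String chromeBinary, int debugPort, Path profileDir) {
        return List.of(
                chromeBinary,
                "--headless",
                "--disable-gpu",
                "--remote-debugging-port=" + debugPort,
                "--user-data-dir=" + profileDir,
                "about:blank");
    }

    // Start Chrome and discard its console output so the child process
    // never blocks on a full output pipe.
    public static Process launch(String chromeBinary, int debugPort, Path profileDir)
            throws IOException {
        return new ProcessBuilder(buildCommand(chromeBinary, debugPort, profileDir))
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
    }
}
```

Once Chrome is running, a request to http://localhost:<port>/json/version returns the browser's WebSocket debugger URL, over which the protocol messages are exchanged.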

Ubuntu + Chrome + cdp4j in detail

Installing google-chrome on Ubuntu

or

Simply download the .deb package from the official Chrome website.

cdp4j

The 2018-04-23 release of cdp4j has a bug on macOS:
calling the following methods

session.close()
launcher.kill()

does not terminate the Chrome process correctly. The problem lies in the ProcessManager mechanism: the LinuxProcessManager implementation is not compatible with macOS.
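One possible workaround, independent of the library, is to terminate the browser through the JDK's own process API, which behaves the same on Linux and macOS. The sketch below is not the library's fix. It assumes Java 9 or later and that your code holds the java.lang.Process for Chrome, for example because it started Chrome itself rather than through the launcher; the class name ProcessTreeKiller is made up for illustration.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class ProcessTreeKiller {

    // Forcibly terminate the process and all of its descendants.
    // Returns true if the root process has exited within the timeout.
    public static boolean killTree(Process process, long timeoutMillis) {
        ProcessHandle root = process.toHandle();
        // Take a snapshot of the descendants before killing the root,
        // because they are harder to find once the parent is gone.
        List<ProcessHandle> descendants = root.descendants().collect(Collectors.toList());
        descendants.forEach(ProcessHandle::destroyForcibly);
        root.destroyForcibly();
        try {
            return process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report that termination was not confirmed.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Chrome starts several helper processes, which is why the descendants are terminated along with the root process.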

Useful links

GoogleChrome GitHub home page

Chrome command line options
