# Scraping with jsoup

Author: bingoc | Published 2016-04-27 20:38 | Read 2493 times

jsoup is essentially an HTML parser that helps Java programmers work with page elements. It replaces matching information out of pages with regular expressions, and is both more efficient and easier to write.

### Requirements

Scrape job postings from 51job and record them.

### Steps

A crawler boils down to: download a page, parse it, extract the information, then go crawl new pages. Let's look at the jsoup code.
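That loop can be sketched abstractly like this. This is a schematic skeleton with a pluggable "fetch and extract links" step, not the actual Spider51job code; the names `CrawlLoop` and `crawl` are mine.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Schematic crawl loop: download -> parse/extract -> enqueue new links.
// The Function stands in for "fetch the page and return the links on it".
public class CrawlLoop {
    public static Set<String> crawl(String seed, Function<String, List<String>> fetchLinks) {
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) continue;          // skip pages we've already seen
            for (String next : fetchLinks.apply(url)) // download + extract step
                if (!visited.contains(next))
                    queue.add(next);
        }
        return visited;
    }

    public static void main(String[] args) {
        // a tiny in-memory "web" standing in for real HTTP fetches
        var web = java.util.Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a"),
                "c", List.<String>of());
        System.out.println(crawl("a", u -> web.getOrDefault(u, List.of())).size()); // 3 pages visited
    }
}
```

In the real crawler the lambda would be a jsoup fetch-and-parse, as shown in the sections that follow.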

### Downloading the page

jsoup wraps the whole download step: you set a few HTTP parameters, execute the request, and get back a Document. The Document is essentially a DOM tree, and it extends Element.

```java
Document document = Jsoup.connect(url)  // connect to the URL
        .userAgent("ie7:mozilla/4.0 (compatible; msie 7.0b; windows nt 6.0)") // pretend to be a browser
        .timeout(3000)                  // request timeout in ms
        .cookie("guide", "1")           // the pitfall, see below
        .followRedirects(false)         // do not follow redirects
        .execute().parse();             // execute and parse
```

While scraping this site I hit a pitfall: jsoup kept returning the wrong page, which turned out to be the site's guide page. So I captured the browser's HTTP traffic and looked at the cookies. Guessing that the `guide` parameter was the culprit, I added `.cookie("guide", "1")`, fetched the page again, and got the real content.

(Screenshot: the captured HTTP packet)

### Extracting information with jsoup

Without regular expressions this feels so much cleaner. Since the Document holds every tag, you just look elements up directly:

```java
job.setJobID(element.getElementsByClass("checkbox").first().attr("value"));
Element t1 = element.getElementsByClass("t1").first();
job.setJobTitle(t1.getElementsByTag("a").attr("title"));
job.setJobDetailUrl(t1.getElementsByTag("a").attr("href"));
Element t2 = element.getElementsByClass("t2").first();
job.setCompanyName(t2.getElementsByTag("a").attr("title"));
job.setLocation(element.getElementsByClass("t3").text());
job.setSalary(element.getElementsByClass("t4").text());
job.setDate(element.getElementsByClass("t5").text());
jobs.add(job);
```
jsoup also supports CSS-selector-based extraction, which the code above doesn't use; the rough usage looks like this:
    

```java
// select elements with the select method; the argument is a CSS Selector expression
Elements links = doc.select("a[href]");

print("\nLinks: (%d)", links.size());
for (Element link : links) {
    // the abs: prefix resolves the absolute URL
    print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
}
```
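To make the selector API concrete without hitting the network, here is a small self-contained sketch run against an in-memory HTML snippet (the class name `SelectDemo` and the sample markup are mine; the jsoup jar is required):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Parses a fixed HTML string and extracts data with CSS selectors,
// mirroring the 51job markup (classes t1/t4) used earlier in the post.
public class SelectDemo {
    public static void main(String[] args) {
        String html = "<div class='t1'><a href='/job/1' title='Java Developer'>Java Developer</a></div>"
                    + "<div class='t4'>10k-15k</div>";
        Document doc = Jsoup.parse(html);

        // select() takes a CSS Selector, much like querySelectorAll in a browser
        Elements links = doc.select("div.t1 a[href]");
        for (Element link : links)
            System.out.println(link.attr("title") + " -> " + link.attr("href"));

        System.out.println("salary: " + doc.select("div.t4").text());
    }
}
```

`Jsoup.parse(String)` builds the same Document as `connect(...).execute().parse()`, so selectors tested this way carry over to live pages unchanged.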
    
The source is on GitHub: https://github.com/bingochaos/Spider51job
    
# Reading the jsoup source

### How the HTML is parsed into a DOM tree

Parsing HTML is really a lexing problem: read a tag's opening and store it, then close the current tag when its end is found. When I saw how jsoup implements this, I was stunned.
    

```java
enum TokeniserState {
    Data {
        // in data state, gather characters until a character reference or tag is found
        void read(Tokeniser t, CharacterReader r) {
            switch (r.current()) {
                case '&':
                    t.advanceTransition(CharacterReferenceInData);
                    break;
                case '<':
                    t.advanceTransition(TagOpen);
                    break;
                case nullChar:
                    t.error(this); // NOT replacement character (oddly?)
                    t.emit(r.consume());
                    break;
                case eof:
                    t.emit(new Token.EOF());
                    break;
                default:
                    String data = r.consumeData();
                    t.emit(data);
                    break;
            }
        }
    },
    CharacterReferenceInData {
        // from & in data
        void read(Tokeniser t, CharacterReader r) {
            char[] c = t.consumeCharacterReference(null, false);
            if (c == null)
                t.emit('&');
            else
                t.emit(c);
            t.transition(Data);
        }
    },
    // ... most of the other states omitted
}
```

jsoup represents each state as an enum constant, puts the code for that state inside the enum body, and drives the state machine with the loop below.
    

```java
while (!isEmitPending)
    state.read(this, reader);
```
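The pattern is easy to reuse. Here is a toy tokenizer in the same style, where each enum constant carries its state's logic and performs its own transitions (the names `EnumStateMachine`, `Lexer`, `State` are mine, and the token format is simplified; this is not jsoup's code):

```java
import java.util.ArrayList;
import java.util.List;

// A toy lexer in the jsoup style: enum constants with constant-specific
// method bodies act as states; the driver just calls state.read(...) in a loop.
public class EnumStateMachine {

    enum State {
        DATA {
            void read(Lexer l) {
                char c = l.current();
                if (c == '<') {           // a tag begins: flush text, switch state
                    l.emitPendingText();
                    l.advance();
                    l.transition(TAG);
                } else {
                    l.buffer(c);
                    l.advance();
                }
            }
        },
        TAG {
            void read(Lexer l) {
                char c = l.current();
                if (c == '>') {           // tag ends: emit it, back to DATA
                    l.emitTag();
                    l.advance();
                    l.transition(DATA);
                } else {
                    l.buffer(c);
                    l.advance();
                }
            }
        };

        abstract void read(Lexer l);
    }

    static class Lexer {
        private final String input;
        private int pos = 0;
        private State state = State.DATA;
        private final StringBuilder buf = new StringBuilder();
        final List<String> tokens = new ArrayList<>();

        Lexer(String input) { this.input = input; }

        char current() { return input.charAt(pos); }
        void advance() { pos++; }
        void buffer(char c) { buf.append(c); }
        void transition(State s) { state = s; }

        void emitPendingText() {
            if (buf.length() > 0) { tokens.add("TEXT:" + buf); buf.setLength(0); }
        }
        void emitTag() { tokens.add("TAG:" + buf); buf.setLength(0); }

        List<String> run() {
            while (pos < input.length())
                state.read(this);   // the same driving loop as jsoup's tokeniser
            emitPendingText();
            return tokens;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Lexer("hi<b>bold</b>!").run());
        // [TEXT:hi, TAG:b, TEXT:bold, TAG:/b, TEXT:!]
    }
}
```

Because each state owns its code, adding a new state (say, for `&` character references as jsoup does) means adding one enum constant, not growing a central switch.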

Here is a diagram (borrowed) illustrating the lexing process:
    
    ![jsoup词法分析过程](https://img.haomeiwen.com/i1677321/2a3886f9fd8c765d.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
    
##### What happens when a tag is missing?

The answer is that many such cases are treated as recoverable parse errors and handled gracefully, as in this snippet:
    
```java
case EndTag:
    if (StringUtil.in(name, "div", "dl", "fieldset", "figcaption", "figure", "footer",
            "header", "pre", "section", "summary", "ul")) {
        if (!tb.inScope(name)) {
            tb.error(this);
            return false;
        }
    }
```
    
##### Finally, jsoup's classification of tags
    
```java
// internal static initialisers:
// prepped from http://www.w3.org/TR/REC-html40/sgml/dtd.html and other sources
private static final String[] blockTags = {
        "html", "head", "body", "frameset", "script", "noscript", "style", "meta", "link", "title", "frame",
        "noframes", "section", "nav", "aside", "hgroup", "header", "footer", "p", "h1", "h2", "h3", "h4", "h5", "h6",
        "ul", "ol", "pre", "div", "blockquote", "hr", "address", "figure", "figcaption", "form", "fieldset", "ins",
        "del", "s", "dl", "dt", "dd", "li", "table", "caption", "thead", "tfoot", "tbody", "colgroup", "col", "tr", "th",
        "td", "video", "audio", "canvas", "details", "menu", "plaintext", "template", "article", "main",
        "svg", "math"
};
private static final String[] inlineTags = {
        "object", "base", "font", "tt", "i", "b", "u", "big", "small", "em", "strong", "dfn", "code", "samp", "kbd",
        "var", "cite", "abbr", "time", "acronym", "mark", "ruby", "rt", "rp", "a", "img", "br", "wbr", "map", "q",
        "sub", "sup", "bdo", "iframe", "embed", "span", "input", "select", "textarea", "label", "button", "optgroup",
        "option", "legend", "datalist", "keygen", "output", "progress", "meter", "area", "param", "source", "track",
        "summary", "command", "device", "area", "basefont", "bgsound", "menuitem", "param", "source", "track",
        "data", "bdi"
};
private static final String[] emptyTags = {
        "meta", "link", "base", "frame", "img", "br", "wbr", "embed", "hr", "input", "keygen", "col", "command",
        "device", "area", "basefont", "bgsound", "menuitem", "param", "source", "track"
};
private static final String[] formatAsInlineTags = {
        "title", "a", "p", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "address", "li", "th", "td", "script", "style",
        "ins", "del", "s"
};
private static final String[] preserveWhitespaceTags = {
        "pre", "plaintext", "title", "textarea"
        // script is not here as it is a data node, which always preserve whitespace
};
// todo: I think we just need submit tags, and can scrub listed
private static final String[] formListedTags = {
        "button", "fieldset", "input", "keygen", "object", "output", "select", "textarea"
};
private static final String[] formSubmitTags = {
        "input", "keygen", "object", "select", "textarea"
};
```
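Conceptually, these arrays feed a static lookup table of per-tag properties. The following is a simplified sketch of that idea with a handful of tags, not jsoup's exact code (the names `TagRegistry` and `TagInfo` are mine; jsoup's real `Tag` class builds its table in static initialisers much like this):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: turn tag-name arrays into a lookup table of tag properties.
public class TagRegistry {
    static final class TagInfo {
        final String name;
        boolean isBlock;
        boolean isEmpty;       // void tags like <br>, <img>
        TagInfo(String name) { this.name = name; }
    }

    private static final Map<String, TagInfo> tags = new HashMap<>();

    private static final String[] blockTags  = { "html", "body", "div", "p", "ul", "li" };
    private static final String[] inlineTags = { "a", "span", "b", "i", "img", "br" };
    private static final String[] emptyTags  = { "img", "br" };

    static {
        for (String n : blockTags)  { TagInfo t = new TagInfo(n); t.isBlock = true; tags.put(n, t); }
        for (String n : inlineTags) tags.putIfAbsent(n, new TagInfo(n));
        for (String n : emptyTags)  tags.get(n).isEmpty = true;   // refine already-registered tags
    }

    static TagInfo valueOf(String name) {
        // unknown tags default to inline, roughly as jsoup does
        return tags.computeIfAbsent(name.toLowerCase(), TagInfo::new);
    }

    public static void main(String[] args) {
        System.out.println(valueOf("div").isBlock);    // true
        System.out.println(valueOf("br").isEmpty);     // true
        System.out.println(valueOf("custom").isBlock); // false
    }
}
```

The parser then consults this table during tree construction, e.g. to decide that a `<br>` never gets children or that an unclosed block tag should be closed implicitly.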
    

Original post: https://www.haomeiwen.com/subject/wwicrttx.html