Web Page Content Cleaning

Author: 艾剪疏 | Published 2018-08-25 10:55

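1 Extraction by HTML tags (Java, Python)
2 Regex-based web page extraction
3 Machine learning methods

1 Extraction by HTML tags (Java, Python)

After a browser receives the HTML source returned by the server, it parses the page into a DOM tree. HTML tag extraction relies on features of the DOM tree and is widely used for web page extraction. The most popular extraction components today, Jsoup (Java) and BeautifulSoup (Python), both support CSS selectors.
This section covers the HTML tag extractors I have used in Java and Python crawlers.

1.1 Java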

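HTMLParser
An HTML parser. It is one of the older tools and is small and fast. The official API documentation:
http://htmlparser.sourceforge.net/javadoc/index.html
Usage example:
http://free0007.iteye.com/blog/1131163
jsoup
A Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
Official site:
https://jsoup.org/
User manual (Chinese):
http://www.open-open.com/jsoup/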

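1.2 Python

XPath-based web page extraction
The Selector in Python's Scrapy crawler can extract HTML tags with either XPath or CSS expressions. To test an XPath quickly, open the Elements panel in Chrome and press Ctrl+F; the search box accepts XPath expressions and highlights the matching nodes.
A good book for getting started quickly is 刘硕, 精通Scrapy网络爬虫 (Mastering Scrapy Web Crawlers):
https://download.csdn.net/download/oceaman/10387949
This repository of Scrapy code examples is also a very good resource:
https://github.com/geekan/scrapy-examples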
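
Since the other code in this article is Java, here is a minimal sketch of the same XPath idea using the JDK's built-in javax.xml.xpath package rather than Scrapy's Selector. It assumes well-formed XHTML input (the JDK XML parser rejects typical messy HTML); the class name XPathDemo and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathDemo {
    /** Evaluates an XPath expression and returns the text of every matching node. */
    static List<String> select(String xhtml, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xhtml)),
                            XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions so callers stay simple.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div class=\"main\">"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<a href=\"http://example.com/b\">B</a>"
                + "</div></body></html>";
        // Select the href attribute of every link inside the main div.
        System.out.println(select(page, "//div[@class='main']/a/@href"));
    }
}
```

For real-world HTML, a tolerant parser such as jsoup is the safer choice; the XPath expressions themselves carry over.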

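2 Regex-based web page extraction

Regex-based extraction performs string-level searches directly on the HTML source. It requires only a working knowledge of regular expressions and does not depend on the structure of the page.
Here are some regex techniques I have used. The example below comes from an optimization I made to Heritrix: it uses regular expressions to pull out the URLs that meet the requirements and puts them into the crawl queue.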

public class Extractor_Cnpc extends Extractor{
    public static final String patternString1 = ".*href\\s*=\\s*(\"|'|)http://.*";
    public static final String patternString2 = "<\\s*[aA][\\s\\S]+(href\\s*=\\s*(\"|'|)(http|https)://[^>]+\\s*)>";
    public static final String patternString3 = "<\\s*[aA][\\s\\S]+(href\\s*=\\s*(\"|'|)((http|https)://[^\"]+(\")))";
    public static final String patternString4 = "<a(.*)href\\s*=\\s*(.*)>";// original Extractor_Cnpc pattern: "<a(.*)href\\s*=\\s*(\"([^\"]*)\"|[^\\s>])(.*)>"
    public static final String patternString5 = "(http://|https://)+((\\w|\\.|\\/|-|=|\\?|&)+)+";// URL feature pattern

    public Extractor_Cnpc(String name, String description) {
        super(name, description);
    }

    public Extractor_Cnpc(String name) {
        super(name, "Cnpc news extractor");
    }

    private static String URL_REGEX  = null;

    public static Pattern pattern1 = Pattern.compile(patternString1,
            Pattern.DOTALL);
    public static Pattern pattern2 = Pattern.compile(patternString2,
            Pattern.DOTALL);
    public static Pattern pattern3 = Pattern.compile(patternString3,
            Pattern.DOTALL);
    public static Pattern pattern4 = Pattern.compile(patternString4,
            Pattern.CASE_INSENSITIVE);
    public static Pattern pattern5 = Pattern.compile(patternString5,
            Pattern.DOTALL);

    @Override
    protected void extract(CrawlURI curi) {
        String url = "";
        try {
            HttpRecorder hr = curi.getHttpRecorder();
            if(hr == null){
                throw new IOException("HttpRecorder is null");
            }
            ReplayCharSequence cs = hr.getReplayCharSequence();
            if(cs == null){
                return;
            }
            String context = cs.toString();
            Pattern pattern = Pattern.compile(patternString4, Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(context);
            while(matcher.find()){
                url = matcher.group();// first cut out the matched link text
                writeUriIntoTxt(url,"F:\\data\\test\\output.txt");
                url = url.replace("\"", "");// remove the double quotes
                URL_REGEX = extractor_Feature(readCrawlUrl());
                if(url.matches(URL_REGEX)){// test the URL against the feature regex
                    curi.createAndAddLinkRelativeToBase(url, context, Link.NAVLINK_HOP);
                    writeUriIntoTxt(url,getWriteUrlPath());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
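
The URL pattern can be tried outside Heritrix. The sketch below reuses the expression from patternString5 with plain java.util.regex; the class name UrlPatternDemo, the method findUrls, and the sample HTML are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlPatternDemo {
    // Same expression as patternString5 in the extractor above.
    static final Pattern URL =
            Pattern.compile("(http://|https://)+((\\w|\\.|\\/|-|=|\\?|&)+)+");

    /** Returns every substring of the input that the URL pattern matches. */
    static List<String> findUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.google.cn/news?id=1&p=2\">x</a> "
                + "<a href='https://example.com/a-b'>y</a>";
        System.out.println(findUrls(html));
    }
}
```

Compiling the pattern once as a static field, rather than on every call, avoids repeated compilation when many pages are processed.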

Here is my understanding of how these regular expressions extract URLs.

regex1 = ".*href\\s*=\\s*(\"|'|)http://.*";
regex2 = "[^\\s]*((<\\s*[aA]\\s+(href\\s*=[^>]+\\s*)>)(.*)</[aA]>).*"

The test string is:

String ss = "这是测试<a href=http://www.google.cn>www.google.cn</a>真的是测试了";

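regex1 only checks whether the text contains a link. It matches href, =, an optional " or ', and http:// appearing in that order, with any text before and after.

The key part is regex2. The code uses this call: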

matcher.group(matcher.groupCount() - 3);

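matcher.groupCount() returns the number of capturing groups in the pattern. regex2 contains four pairs of parentheses, and each pair is one group, so groupCount() is 4 and groupCount() - 3 refers to group 1. Groups are numbered by the position of their opening parenthesis, from left to right. Group 1 is the outermost pair and covers the whole <a>...</a> element. Group 2 sits inside group 1 and covers the opening <a ...> tag. Group 3 sits inside group 2 and covers the href attribute. Group 4 follows group 2 inside group 1 and covers the link text. Subtracting 3, 2, 1, and 0 from groupCount() therefore selects groups 1 through 4.
So, after a successful match,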

matcher.group(matcher.groupCount() - 3);
matcher.group(matcher.groupCount() - 2);
matcher.group(matcher.groupCount() - 1);
matcher.group(matcher.groupCount());

return, in order:

//<a href=http://www.google.cn>www.google.cn</a>
//<a href=http://www.google.cn>
//href=http://www.google.cn
//www.google.cn

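Also, [^>] matches any single character other than >, so [^>]+ consumes everything up to the end of the tag.
Regular expressions are powerful, and used well they save a lot of effort.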
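
The group numbering can be checked with a short program. This is a sketch that applies regex2 to an ASCII variant of the test string; the class name GroupDemo and the helper groups are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    static final String REGEX2 =
            "[^\\s]*((<\\s*[aA]\\s+(href\\s*=[^>]+\\s*)>)(.*)</[aA]>).*";

    /** Returns groups 1..groupCount() for a full match, or an empty array if there is none. */
    static String[] groups(String input) {
        Matcher matcher = Pattern.compile(REGEX2).matcher(input);
        if (!matcher.matches()) {
            return new String[0];
        }
        String[] result = new String[matcher.groupCount()];
        for (int g = 1; g <= matcher.groupCount(); g++) {
            result[g - 1] = matcher.group(g);
        }
        return result;
    }

    public static void main(String[] args) {
        String ss = "before<a href=http://www.google.cn>www.google.cn</a>after";
        for (String group : groups(ss)) {
            System.out.println(group);
        }
    }
}
```

Note that group() may only be called after matches() or find() has succeeded; otherwise it throws IllegalStateException.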

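3 Machine learning methods

Extraction based on regular expressions or CSS selectors (or XPath) is wrapper-based extraction. The common weakness of these methods is that pages with different structures need different extraction rules. If a public-opinion monitoring system has to watch 10,000 heterogeneous websites, someone has to write and maintain 10,000 sets of rules. Since around 2000, researchers have studied how to use machine learning so that a program can extract the required information from web pages without hand-written rules.
Machine-learning-based extraction focuses mostly on automatic extraction from news pages: given a news page, the program outputs the title, body text, time, and so on. News, blog, and encyclopedia sites contain fairly uniform structured data that fits the {title, time, body} pattern, so the extraction target is clear and the algorithms are easier to design. E-commerce and job-listing pages contain far more complex structured data, sometimes nested, with no uniform extraction target, so designing machine-learning extraction for them is much harder.

Machine-learning-based extraction algorithms fall roughly into these categories:

Extraction based on automatically generated page templates
Extraction based on heuristic rules and unsupervised learning
Extraction based on classifiers

I have mainly used the first two.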

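3.1 Extraction based on automatically generated page templates

With template-based extraction, the core is still an HTML tag parser. The idea is to configure the HTML tags for every field to be parsed on a page, and then apply that template repeatedly to clean out the required content. The drawback is that every site has to be configured separately, which is tedious.

Alternatively, an admin module can analyze the structure of a page automatically. A front-end input page is used to adjust the analysis result, and the system then generates an HTML tag parsing template (XML) for that site, so that different sites can be parsed.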

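However, this approach scales poorly and does not work for every page. The code may need repeated changes to keep up with requirements, and the implementation effort is high.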
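
To make the template idea concrete, here is a minimal sketch in which a template is simply a map from field name to an XPath expression. It uses the JDK's XML and XPath APIs and assumes well-formed XHTML; the class TemplateExtractor, the field names, and the expressions are hypothetical and are not the template format the system above actually uses.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class TemplateExtractor {
    /** Applies each field's XPath to the page and returns field name -> extracted text. */
    static Map<String, String> extract(String xhtml, Map<String, String> template) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> rule : template.entrySet()) {
                // evaluate(String, Object) returns the string value of the first match.
                fields.put(rule.getKey(), xpath.evaluate(rule.getValue(), doc).trim());
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("template extraction failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> template = new LinkedHashMap<>();
        template.put("title", "//h1");
        template.put("time", "//span[@class='time']");
        template.put("body", "//div[@id='content']");
        String page = "<html><body><h1>Title</h1><span class=\"time\">2018-08-25</span>"
                + "<div id=\"content\"><p>Hello</p></div></body></html>";
        System.out.println(extract(page, template));
    }
}
```

One template per site is exactly the maintenance cost described above: the code stays the same, but every new site needs its own map of rules.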

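3.2 Extraction based on heuristic rules and unsupervised learning

Extraction based on heuristic rules and unsupervised learning (the second category listed above) is currently the simplest approach and also the one that works best. It is also highly general: the algorithms usually work across languages and page structures.
Most algorithms of this kind do not parse the page into a DOM tree. They parse it into a token sequence instead. Take this HTML source as an example: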

<body>
    <div>Ad...(8 characters)</div>
    <div class="main">Body text...(500 characters)</div>
    <div class="foot">Footer...(6 characters)</div>
</body>

The program converts it into a token sequence:

tag(body), tag(div), text, text ... (8 times), tag(/div), tag(div),
text, text ... (500 times), tag(/div), tag(div), text, text ... (6 times), tag(/div), tag(/body)
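
A tokenizer of this kind can be sketched with a single regular expression. The version below is an illustration only: it assumes that each non-whitespace character outside a tag counts as one text token (which matches the character counts in the example), and the class name HtmlTokenizer is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlTokenizer {
    // Alternative 1: an opening or closing tag (group 1 = tag name, with "/" if closing).
    // Alternative 2: a single non-whitespace character outside any tag.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?[a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<\\s])");

    /** Converts HTML source into a flat list of "tag(name)" and "text" tokens. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(html);
        while (matcher.find()) {
            if (matcher.group(1) != null) {
                tokens.add("tag(" + matcher.group(1) + ")");
            } else {
                tokens.add("text");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<body><div class=\"main\">ab c</div></body>"));
    }
}
```

A production tokenizer would also need to skip comments, scripts, and styles, which this sketch does not handle.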

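An early algorithm of this kind, MSS (Maximum Subsequence Segmentation), works on the token sequence and scores each token by some rule, for example:

A tag scores -3.25
A text token scores 1

Applying these rules to the token sequence above gives a score sequence:

-3.25,-3.25,1,1,1...(8 times),-3.25,-3.25,1,1,1...(500 times),-3.25,-3.25,1,1,1...(6 times),-3.25,-3.25

MSS looks for the contiguous subsequence of tokens whose scores sum to the maximum, and treats that subsequence as the body text of the page.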
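
Finding the maximum-sum contiguous subsequence is the classic maximum subarray problem, which Kadane's algorithm solves in one pass. The sketch below applies it to the example score sequence; it illustrates the idea only and is not the published MSS implementation, and the class and method names are made up.

```java
import java.util.Arrays;

final class MaxSubsequence {
    /** Returns {start, end} (inclusive) of the contiguous run with the largest score sum. */
    static int[] find(double[] scores) {
        double best = Double.NEGATIVE_INFINITY;
        double current = 0;
        int start = 0, bestStart = 0, bestEnd = -1;
        for (int i = 0; i < scores.length; i++) {
            if (current <= 0) {          // a non-positive prefix never helps: restart here
                current = scores[i];
                start = i;
            } else {
                current += scores[i];
            }
            if (current > best) {
                best = current;
                bestStart = start;
                bestEnd = i;
            }
        }
        return new int[] {bestStart, bestEnd};
    }

    static double sum(double[] scores, int start, int end) {
        double total = 0;
        for (int i = start; i <= end; i++) {
            total += scores[i];
        }
        return total;
    }

    /** Score sequence of the example page: tag = -3.25, text = 1. */
    static double[] exampleScores() {
        double[] scores = new double[522];   // 8 tags + 8 + 500 + 6 text tokens
        Arrays.fill(scores, 1.0);
        for (int tagIndex : new int[] {0, 1, 10, 11, 512, 513, 520, 521}) {
            scores[tagIndex] = -3.25;
        }
        return scores;
    }

    public static void main(String[] args) {
        double[] scores = exampleScores();
        int[] range = find(scores);
        System.out.println(range[0] + ".." + range[1] + " sum=" + sum(scores, range[0], range[1]));
    }
}
```

With these particular scores the best run covers the 8 ad tokens as well as the 500 body tokens, because 8 - 2 * 3.25 = 1.5 is still positive, while the 6 footer tokens are left out because 6 - 6.5 is negative. The result is therefore quite sensitive to the tag penalty.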

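The text-to-tag-path ratio method builds on the same idea.
The text-to-tag-path ratio algorithm is an effective way to extract news body text. Its core is to compare body content and noise on features such as tag paths and text content, where the two differ markedly, and to extract the body text on that basis.

(2) Two key principles

The content part of a news page has similar tag paths and contains longer text with more punctuation.
The noise part of a news page has similar tag paths and contains shorter text with less punctuation.

(3) The extraction process has three stages

HTML parsing module: given an HTML page, remove script, comment, and style tags, since they are invisible on the page and need not be counted. Then parse the page into a DOM tree; any existing HTML parser, such as HTMLParser, will do.
Tag path feature computation module: given the parse tree of the HTML page, traverse the DOM tree, collect all tag paths, and gather statistics for each path.
News content extraction module based on tag path feature fusion: this module has two parts, combined feature selection and extraction. Feature selection filters out redundant features and keeps a set of relatively independent ones. Extraction first fuses the selected features into one composite feature, then extracts the news content nodes from the parse tree according to an extraction control strategy, and returns the extracted news content.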

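Take the following news page as an example:

[Figure: screenshot of an example news page]

The main content of a news page forms a single block, and its paragraphs share a similar display format.
The noise is split into many blocks, and the noise within one block shares a similar display format.
Most of the noise sits at the edges of the page.
The main content and the noise differ markedly in how their text is presented.
Pieces of information in the same block have tag paths with a similar structure.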

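(4) Design of the tag path feature set

Tag path: the body content of a news page has identical or similar tag paths, and the noise has identical or similar tag paths.
Node: all text nodes are leaf nodes.
Node text: content nodes usually have long text, and noise nodes usually have short text.
Node punctuation: the text of content nodes contains more punctuation, and the text of noise nodes contains less.
Page text: the total text length of the body content is generally greater than the total text length of the noise.
Page punctuation: the body content of a news page generally contains more punctuation marks than the noise.
Decoration: the body content of a news page carries little decoration, so its tag paths are shallow. The noise carries more decoration, such as hyperlinks, backgrounds, and font colors, which involves more tags, so its tag paths are deeper.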
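
The per-path statistics behind these features can be sketched with the JDK's DOM parser. The example below walks a well-formed XHTML snippet and accumulates text length and punctuation count for each tag path. It is an illustration under assumptions: the class TagPathStats, the path format, and the punctuation set are made up, and the published algorithm combines more features than these two.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TagPathStats {
    // ASCII punctuation plus common full-width Chinese punctuation (assumed set).
    private static final String PUNCTUATION =
            ",.;:!?\u3001\u3002\uFF0C\uFF1B\uFF1A\uFF01\uFF1F";

    /** Returns tag path -> {total text length, total punctuation count}. */
    static Map<String, int[]> collect(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Map<String, int[]> stats = new LinkedHashMap<>();
            Node root = doc.getDocumentElement();
            walk(root, root.getNodeName(), stats);
            return stats;
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    private static void walk(Node node, String path, Map<String, int[]> stats) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (text.isEmpty()) {
                    continue;
                }
                int[] totals = stats.computeIfAbsent(path, key -> new int[2]);
                totals[0] += text.length();
                totals[1] += countPunctuation(text);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk(child, path + "/" + child.getNodeName(), stats);
            }
        }
    }

    static int countPunctuation(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (PUNCTUATION.indexOf(text.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }
}
```

On a page with a navigation bar and a few paragraphs, the paragraph path ends up with far more text and punctuation than the link path, which is the contrast the two principles rely on.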

(5) Content extraction code

1 Convert the HTML into a DOM object and remove the tags that are not needed.

public WebExtractor(Document doc) {
        this.doc = doc;
}
protected void clean() {
    doc.select("script,noscript,style,iframe,br").remove();
}

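2 Starting from the body tag, recursively collect statistics for each tag: text length, link text length, total number of tag nodes, number of <a> tags, leaf text lengths, and so on. These statistics serve as the tokens. (This method is the key part.)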

protected CountWebInfo computeInfo(Node node) { 
        // the node is an element
        if (node instanceof Element) {
            Element tag = (Element) node;

            CountWebInfo countInfo = new CountWebInfo();
            for (Node childNode : tag.childNodes()) {
                CountWebInfo childCountInfo = computeInfo(childNode);
                countInfo.textCount += childCountInfo.textCount;// total text length
                countInfo.linkTextCount += childCountInfo.linkTextCount;// length of the text inside <a> tags
                countInfo.tagCount += childCountInfo.tagCount;// number of tag nodes
                countInfo.linkTagCount += childCountInfo.linkTagCount;// number of <a> tags
                countInfo.leafList.addAll(childCountInfo.leafList);// text lengths of the leaf nodes
                countInfo.densitySum += childCountInfo.density;// sum of the children's text densities
                countInfo.pCount += childCountInfo.pCount;// number of <p> tags
                countInfo.punctuation += childCountInfo.punctuation;// number of punctuation marks
                countInfo.strongCount += childCountInfo.strongCount;// number of <strong> tags
                countInfo.imageCount += childCountInfo.imageCount;// number of <img> tags
            }
            countInfo.tagCount++;
            String tagName = tag.tagName();
            if (tagName.equals("a")) {
                countInfo.linkTextCount = countInfo.textCount;
                countInfo.linkTagCount++;
            } else if (tagName.equals("p")) {
                countInfo.pCount++;
            }else if (tagName.equals("strong")){
                countInfo.strongCount++;
            } else if(tagName.equals("img")){
                countInfo.imageCount++;
            }

            int pureLen = countInfo.textCount - countInfo.linkTextCount;// text length minus link text length
            int len = countInfo.tagCount - countInfo.linkTagCount;// tag count minus <a> tag count
            if (pureLen == 0 || len == 0) {
                countInfo.density = 0;
            } else {
                countInfo.density = (pureLen + 0.0) / len;// non-link text length / non-link tag count
            }

            infoMap.put(tag, countInfo);

            return countInfo;
        } else if (node instanceof TextNode) {// the node is text
            TextNode tn = (TextNode) node;
            CountWebInfo countInfo = new CountWebInfo();
            String text = tn.text();
            int pLen = calPunctuation(text);
            countInfo.punctuation = pLen;// number of punctuation marks
            int len = text.length();
            countInfo.textCount = len;
            countInfo.leafList.add(len);
            return countInfo;
        } else {
            return new CountWebInfo();
        }
    }
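
The CountWebInfo class is not shown in the article. A plausible reconstruction, based only on the fields that computeInfo and the later methods read and write, might look like the sketch below. The field types and the computeDensity helper are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reconstruction of the per-tag statistics holder. */
final class CountWebInfo {
    int textCount;        // total text length under this tag
    int linkTextCount;    // length of the text inside <a> tags
    int tagCount;         // number of tag nodes
    int linkTagCount;     // number of <a> tags
    int pCount;           // number of <p> tags
    int punctuation;      // number of punctuation marks
    int strongCount;      // number of <strong> tags
    int imageCount;       // number of <img> tags
    double density;       // text density of this tag
    double densitySum;    // sum of the children's densities
    double score;         // filled in later by computeScore
    List<Integer> leafList = new ArrayList<>();  // text lengths of the leaf nodes
    Object tag;           // an org.jsoup.nodes.Element in the article's code

    /** Same rule as in computeInfo: non-link text length per non-link tag. */
    double computeDensity() {
        int pureLen = textCount - linkTextCount;
        int len = tagCount - linkTagCount;
        return (pureLen == 0 || len == 0) ? 0 : (double) pureLen / len;
    }
}
```

For example, a tag with 100 characters of text, 20 of them inside links, and 5 tags of which 1 is a link has a density of 80 / 4 = 20.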

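3 Compute the score of each tag.
score = log(sqrt(leaf-length variance + 1)) * sum of text densities * log(non-link text length + 1) * log10(number of <p> tags + 2)
The score rises with greater variation among leaf text lengths, a larger density sum, more non-link text, and more <p> tags.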

protected double computeScore(Element tag) {
        CountWebInfo countInfo = infoMap.get(tag);
        double var = Math.sqrt(computeVar(countInfo.leafList) + 1);//计算叶子节点方差
        double score = Math.log(var) * countInfo.densitySum * Math.log(countInfo.textCount - countInfo.linkTextCount + 1) * Math.log10(countInfo.pCount + 2);
        countInfo.score = score;
        countInfo.tag = tag;
        sortCount.add(countInfo);
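        //score = log(sqrt(leaf-length variance + 1)) * density sum * log(non-link text length + 1) * log10(<p> count + 2)
        //a higher score means more variation among leaf text lengths, a larger density sum, more non-link text, and more <p> tags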
        return score;
    }
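
computeVar is called above but not shown. The sketch below assumes it is the population variance of the leaf text lengths (dividing by n) and restates the score formula on plain numbers, so the effect of each factor can be tried in isolation. The class ScoreSketch and its method signatures are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ScoreSketch {
    /** Population variance of the leaf text lengths; 0 for an empty list (assumed definition). */
    static double computeVar(List<Integer> lengths) {
        if (lengths.isEmpty()) {
            return 0;
        }
        double mean = 0;
        for (int length : lengths) {
            mean += length;
        }
        mean /= lengths.size();
        double sumOfSquares = 0;
        for (int length : lengths) {
            sumOfSquares += (length - mean) * (length - mean);
        }
        return sumOfSquares / lengths.size();
    }

    /** Same formula as computeScore, with the inputs passed in directly. */
    static double score(List<Integer> leafLengths, double densitySum, int pureTextLen, int pCount) {
        double var = Math.sqrt(computeVar(leafLengths) + 1);
        return Math.log(var) * densitySum * Math.log(pureTextLen + 1) * Math.log10(pCount + 2);
    }

    public static void main(String[] args) {
        // Uneven leaf lengths (one long paragraph among short leaves) score higher.
        System.out.println(score(Arrays.asList(1, 200, 3), 50.0, 204, 3));
        System.out.println(score(Arrays.asList(1, 2, 3), 50.0, 204, 3));
    }
}
```

One consequence of the formula is that a tag whose leaves all have the same length scores 0, because the variance is 0 and log(sqrt(0 + 1)) = 0.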

4 Score every tag and return the tag with the highest score

    public Element getContentElement() throws Exception {
        clean();
        computeInfo(doc.body());
        double maxScore = 0;
        Element content = null;
        sortCount = new ArrayList<CountWebInfo>();
        for (Map.Entry<Element, CountWebInfo> entry : infoMap.entrySet()) {
            Element tag = entry.getKey();
            if (tag.tagName().equals("a") || tag == doc.body()) {
                continue;
            }
            double score = computeScore(tag);
            if (score > maxScore) {
                maxScore = score;
                content = tag;
            }
        }
        return content;
    }

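(6) Title extraction code

The highest-scoring tag block from the previous step is passed in, and the title is chosen by three rules in priority order (marked in the code comments). The method also relies on a string-similarity measure that the author describes as based on an improved edit distance; the strSim method computes it from the length of the longest common subsequence (LCS).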

public String getTitle(final Element contentElement) throws Exception {
        final ArrayList<Element> titleList = new ArrayList<Element>();
        final ArrayList<Double> titleSim = new ArrayList<Double>();
        final AtomicInteger contentIndex = new AtomicInteger();
        final String metaTitle = doc.title().trim();
        if (!metaTitle.isEmpty()) {
            doc.body().traverse(new NodeVisitor() {

                public void head(Node node, int i) {
                    if (node instanceof Element) {
                        Element tag = (Element) node;
                        if (tag == contentElement) {
                            contentIndex.set(titleList.size());
                            return;
                        }
                        String tagName = tag.tagName();
                        //for each h1-h6 tag, compute the similarity between its text and the page title
                        if (Pattern.matches("h[1-6]", tagName)) {
                            String title = tag.text().trim();
                            double sim = strSim(title, metaTitle);//计算两个字符串之间的相似度
                            titleSim.add(sim);
                            titleList.add(tag);
                        }
                    }
                }

                public void tail(Node node, int i) {
                }
            });
            //among the h1-h6 tags that appear before the content element, take the one with the highest weighted similarity as the title
            int index = contentIndex.get();
            if (index > 0) {
                double maxScore = 0;
                int maxIndex = -1;
                for (int i = 0; i < index; i++) {
                    double score = (i + 1) * titleSim.get(i);
                    if (score > maxScore) {
                        maxScore = score;
                        maxIndex = i;
                    }
                }
                if (maxIndex != -1) {
                    return titleList.get(maxIndex).text();
                }
            }
        }
        //if that fails, select the elements whose id or class starts or ends with "title" and use the first one as the title
        Elements titles = doc.body().select("*[id^=title],*[id$=title],*[class^=title],*[class$=title]");
        if (titles.size() > 0) {
            String title = titles.first().text();
            if (title.length() > 5 && title.length()<40) {
                return titles.first().text();
            }
        }
        return getTitleByEditDistance(contentElement);
    }

/**
     * @Description: compare every text node in the HTML with metaTitle by string similarity and take the most similar one as the title
     * @return:
     * @date: 2017-9-29  
     */
    protected String getTitleByEditDistance(Element contentElement) throws Exception {
        final String metaTitle = doc.title();

        final ArrayList<Double> max = new ArrayList<Double>();
        max.add(0.0);
        final StringBuilder sb = new StringBuilder();
        doc.body().traverse(new NodeVisitor() {

            public void head(Node node, int i) {

                if (node instanceof TextNode) {
                    TextNode tn = (TextNode) node;
                    String text = tn.text().trim();
                    double sim = strSim(text, metaTitle);
                    if (sim > 0) {
                        if (sim > max.get(0)) {
                            max.set(0, sim);
                            sb.setLength(0);
                            sb.append(text);
                        }
                    }

                }
            }

            public void tail(Node node, int i) {
            }
        });
        if (sb.length() > 0) {
            return sb.toString();
        }
        return null;
    }




    protected double strSim(String a, String b) {
        int len1 = a.length();
        int len2 = b.length();
        if (len1 == 0 || len2 == 0) {
            return 0;
        }
        double ratio;
        if (len1 > len2) {
            ratio = (len1 + 0.0) / len2;
        } else {
            ratio = (len2 + 0.0) / len1;
        }
        if (ratio >= 3) {
            return 0;
        }
        return (lcs(a, b) + 0.0) / Math.max(len1, len2);
    }



    protected int lcs(String x, String y) {

        int M = x.length();
        int N = y.length();
        if (M == 0 || N == 0) {
            return 0;
        }
        int[][] opt = new int[M + 1][N + 1];

        for (int i = M - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                if (x.charAt(i) == y.charAt(j)) {
                    opt[i][j] = opt[i + 1][j + 1] + 1;
                } else {
                    opt[i][j] = Math.max(opt[i + 1][j], opt[i][j + 1]);
                }
            }
        }

        return opt[0][0];
    }
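
To see what strSim returns in practice, here is a compact, self-contained restatement of the two methods with a worked example. The class name TitleSimilarity and the sample titles are made up; the logic follows the code above.

```java
final class TitleSimilarity {
    /** Length of the longest common subsequence of x and y (bottom-up DP). */
    static int lcs(String x, String y) {
        int[][] opt = new int[x.length() + 1][y.length() + 1];
        for (int i = x.length() - 1; i >= 0; i--) {
            for (int j = y.length() - 1; j >= 0; j--) {
                opt[i][j] = x.charAt(i) == y.charAt(j)
                        ? opt[i + 1][j + 1] + 1
                        : Math.max(opt[i + 1][j], opt[i][j + 1]);
            }
        }
        return opt[0][0];
    }

    /** LCS length divided by the longer length; 0 when one string is at least 3x the other. */
    static double strSim(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0;
        }
        int longer = Math.max(a.length(), b.length());
        int shorter = Math.min(a.length(), b.length());
        if ((double) longer / shorter >= 3) {
            return 0;
        }
        return (double) lcs(a, b) / longer;
    }

    public static void main(String[] args) {
        // A heading compared with a page title that carries a site suffix.
        System.out.println(strSim("Web Cleaning Guide", "Web Cleaning Guide - MySite"));
    }
}
```

In this example the heading (18 characters) is fully contained in the page title (27 characters), so the similarity is 18 / 27, roughly 0.67. The length-ratio cutoff keeps very short strings such as menu labels from matching a long title by accident.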

(7) Time extraction code

The input is again the scored contentElement. This step relies mainly on regular expressions.

public String getTime(Element contentElement) throws Exception {
        String regex = "([1-2][0-9]{3})[^0-9]{1,5}?([0-1]?[0-9])[^0-9]{1,5}?([0-9]{1,2})[^0-9]{1,5}?([0-2]?[0-9])[^0-9]{1,5}?([0-9]{1,2})[^0-9]{1,5}?([0-9]{1,2})";
        Pattern pattern = Pattern.compile(regex);
        Element current = contentElement;
        for (int i = 0; i < 2; i++) {
            if (current != null && current != doc.body()) {
                Element parent = current.parent();
                if (parent != null) {
                    current = parent;
                }
            }
        }
        for (int i = 0; i < 6; i++) {
            if (current == null) {
                break;
            }
            String currentHtml = current.outerHtml();
            Matcher matcher = pattern.matcher(currentHtml);
            if (matcher.find()) {// full date and time found
                return matcher.group(1) + "-" + matcher.group(2) + "-" + matcher.group(3) + " " + matcher.group(4) + ":" + matcher.group(5) + ":" + matcher.group(6);
            }
            if (current != doc.body()) {
                current = current.parent();
            }
        }

        try {
            return getDate(contentElement);
        } catch (Exception ex) {
            throw new Exception("time not found");
        }

    }

// fallback: date only (year, month, day), no time of day
    protected String getDate(Element contentElement) throws Exception {
        String regex = "([1-2][0-9]{3})[^0-9]{1,5}?([0-1]?[0-9])[^0-9]{1,5}?([0-9]{1,2})";
        Pattern pattern = Pattern.compile(regex);
        Element current = contentElement;
        for (int i = 0; i < 2; i++) {
            if (current != null && current != doc.body()) {
                Element parent = current.parent();
                if (parent != null) {
                    current = parent;
                }
            }
        }
        for (int i = 0; i < 6; i++) {
            if (current == null) {
                break;
            }
            String currentHtml = current.outerHtml();
            Matcher matcher = pattern.matcher(currentHtml);
            if (matcher.find()) {
                return matcher.group(1) + "-" + matcher.group(2) + "-" + matcher.group(3);
            }
            if (current != doc.body()) {
                current = current.parent();
            }
        }
        return null;
        //      throw new Exception("date not found");
    }
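
The two patterns can be exercised on plain strings without jsoup. The sketch below tries the date-and-time pattern first and falls back to the date-only pattern. The hour group here is written as [0-2]?[0-9] so that hours such as 10 and 20 match; the class name TimeRegexDemo and the sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TimeRegexDemo {
    private static final String DATE_PART =
            "([1-2][0-9]{3})[^0-9]{1,5}?([0-1]?[0-9])[^0-9]{1,5}?([0-9]{1,2})";
    private static final Pattern DATE = Pattern.compile(DATE_PART);
    private static final Pattern DATE_TIME = Pattern.compile(DATE_PART
            + "[^0-9]{1,5}?([0-2]?[0-9])[^0-9]{1,5}?([0-9]{1,2})[^0-9]{1,5}?([0-9]{1,2})");

    /** Returns "y-M-d H:m:s" if a full timestamp is found, "y-M-d" if only a date, else null. */
    static String extract(String html) {
        Matcher full = DATE_TIME.matcher(html);
        if (full.find()) {
            return full.group(1) + "-" + full.group(2) + "-" + full.group(3) + " "
                    + full.group(4) + ":" + full.group(5) + ":" + full.group(6);
        }
        Matcher dateOnly = DATE.matcher(html);
        if (dateOnly.find()) {
            return dateOnly.group(1) + "-" + dateOnly.group(2) + "-" + dateOnly.group(3);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extract("<span>2018-08-25 10:55:30</span>"));
        System.out.println(extract("<span>2018/8/25</span>"));
    }
}
```

Because the separators are matched as "one to five non-digits", the same patterns accept formats such as 2018-08-25, 2018/8/25, and dates written with Chinese year, month, and day characters. The digits are returned as found, without zero padding.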

END