
Extracting the Main Content of Web Pages with Python

Author: 泊牧 | Published 2017-09-18 11:25

1. Goose Extractor

1.1 About Python Goose

Goose Extractor is an open-source article extraction library for Python. It can extract an article's body text, images, videos, metadata, and tags. Goose was originally a Java library written by Gravity.com and was later ported to Scala.

The Goose Extractor project describes itself as follows:

Goose Extractor is a complete rewrite in Python. Given any news article or article-style web page, the goal is to extract not only the article body but also all of its metadata and images.

Goose Extractor is built on NLTK and Beautiful Soup, the leading libraries for text processing and HTML parsing respectively. For article extraction in Python, you can use Python Goose.

Note that Goose currently supports Python 2 only.

1.2 Installing Python Goose
 pip install goose-extractor

Example: extracting directly from a URL:

from goose import Goose

url = 'https://www.fireeye.com/blog/executive-perspective/2017/08/fireeye-provides-update-on-allegations-of-breach.html'
g = Goose()
article = g.extract(url=url)      # Goose fetches and parses the page itself

print article.title               # article headline
print article.meta_description    # description from the page's meta tags
print article.cleaned_text[:150]  # first 150 characters of the extracted body text
print article.top_image.src       # URL of the main image

Example: extracting from an HTML document you have already fetched:

# -*- coding: utf-8 -*-
import goose, urllib2, sys

# Force the process-wide default encoding to UTF-8 (a common Python 2 workaround)
reload(sys)
sys.setdefaultencoding("utf-8")

#url = "https://www.fireeye.com/blog/executive-perspective/2017/08/anti-encryption-and-cyber-sovereignty.html"

url = "https://krebsonsecurity.com/2017/09/equifax-hackers-stole-200k-credit-card-accounts-in-one-fell-swoop/"

# fetch the raw HTML ourselves, then hand it to Goose
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open(url)
raw_html = response.read()

g = goose.Goose()
article = g.extract(raw_html=raw_html)

# re-encode to GBK for printing on a Chinese Windows console, dropping unmappable characters
print article.title.encode('gbk', 'ignore')
print article.meta_description.encode('gbk', 'ignore')
print article.cleaned_text.encode('gbk', 'ignore')

1.3 Garbled HTML when fetching pages with urllib2

The page may be compressed; check whether the response headers contain Content-Encoding: gzip (or similar).
If the body is compressed, you have to decompress it yourself, because urllib2 will not do it for you.

Workaround:

#-*- encoding: utf-8 -*-
import urllib2, gzip, StringIO

url = r'https://krebsonsecurity.com/2017/09/equifax-hackers-stole-200k-credit-card-accounts-in-one-fell-swoop/'
response = urllib2.urlopen(url)

# the body is gzip-compressed: wrap it in a file-like object and gunzip it manually
stream = StringIO.StringIO(response.read())
with gzip.GzipFile(fileobj=stream) as f:
    data = f.read()
print(data)
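
As a slightly more defensive variant (my own sketch, not from the original article), you can check the Content-Encoding response header first and only decompress when the server actually sent a gzipped body:

# -*- coding: utf-8 -*-
import urllib2, gzip, StringIO

url = 'https://krebsonsecurity.com/2017/09/equifax-hackers-stole-200k-credit-card-accounts-in-one-fell-swoop/'
response = urllib2.urlopen(url)
raw = response.read()

# response.info() exposes the HTTP headers; gunzip only if the server says the body is gzipped
if response.info().get('Content-Encoding') == 'gzip':
    raw = gzip.GzipFile(fileobj=StringIO.StringIO(raw)).read()

print raw[:200]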

For a related discussion of handling Chinese encodings in Python, see the article: 也谈Python的中文编码处理

2. Boilerpipe

GitHub repository: Boilerpipe
Among open-source tools, Boilerpipe beats Goose on both precision and recall, and even outperforms the commercial Alchemy API. Boilerpipe itself is written in Java; to call it from Python you need the python-boilerpipe wrapper, which is built on JPype. You can also call it through JCC. The code is shown below.
Installation:

git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe
pip install -r requirements.txt
python setup.py install

Usage:

from boilerpipe.extract import Extractor 

url = "https://krebsonsecurity.com/2017/09/equifax-hackers-stole-200k-credit-card-accounts-in-one-fell-swoop/"
extractor = Extractor(extractor='ArticleExtractor', url=url)
print extractor.getText().encode('gbk', 'ignore')

Or pass an HTML string as the parameter instead of a URL:

extractor = Extractor(extractor='ArticleExtractor', html=myWebPage)

Use getText() or getHTML() to get back the processed plain text, or an HTML version with the main content highlighted:

processed_plaintext = extractor.getText() 
highlighted_html = extractor.getHTML()
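
python-boilerpipe ships several extraction strategies besides ArticleExtractor, such as DefaultExtractor and KeepEverythingExtractor. A quick sketch of my own (reusing the same URL as above) for comparing how much text each strategy keeps:

from boilerpipe.extract import Extractor

url = "https://krebsonsecurity.com/2017/09/equifax-hackers-stole-200k-credit-card-accounts-in-one-fell-swoop/"

# try a few of the built-in strategies and compare the amount of text each one keeps
for name in ('ArticleExtractor', 'DefaultExtractor', 'KeepEverythingExtractor'):
    extractor = Extractor(extractor=name, url=url)
    print name, len(extractor.getText())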

You can also use JCC to compile the Java package into a module that Python can call:

wget http://boilerpipe.googlecode.com/files/boilerpipe-1.2.0-bin.tar.gz
tar xvzf boilerpipe-*.tar.gz
cd boilerpipe-1.2.0
sudo python -m jcc \
    --jar boilerpipe-1.2.0.jar \
    --classpath lib/nekohtml-1.9.13.jar \
    --classpath lib/xerces-2.9.1.jar \
    --package java.net \
    java.net.URL \
    --python boilerpipe --build --install

Usage:

import boilerpipe

# start the JVM with boilerpipe and its HTML-parsing dependencies on the classpath
jars = ':'.join(('lib/nekohtml-1.9.13.jar', 'lib/xerces-2.9.1.jar'))
boilerpipe.initVM(boilerpipe.CLASSPATH + ':' + jars)

extractor = boilerpipe.ArticleExtractor.getInstance()
url = boilerpipe.URL('http://readthedocs.org/docs/jcc')
extractor.getText(url)

3. Comparing Python Content Extraction Tools

For a detailed comparison of the various tools, see the article: 各种Python正文抽取工具比较
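
As a starting point for such a comparison, here is a minimal sketch of my own (not from the linked article) that runs Goose and Boilerpipe over the same page used in the earlier examples and prints how much text each recovers; it assumes both libraries are installed as described above:

# -*- coding: utf-8 -*-
# Minimal side-by-side run of Goose and Boilerpipe on the same page (illustrative sketch)
from goose import Goose
from boilerpipe.extract import Extractor

url = 'https://krebsonsecurity.com/2017/09/equifax-hackers-stole-200k-credit-card-accounts-in-one-fell-swoop/'

goose_text = Goose().extract(url=url).cleaned_text
boiler_text = Extractor(extractor='ArticleExtractor', url=url).getText()

print 'goose      :', len(goose_text), 'characters extracted'
print 'boilerpipe :', len(boiler_text), 'characters extracted'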
