Scraping All Kinds of News with Python: Extracting Body Text and Titles with the newspaper Package

Author: 俊采星驰_87e0 | Published 2018-10-13 12:02

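A while ago, when I was scraping Baidu News, I ran into a problem: Baidu News links to so many different kinds of sites that there is no uniform way to extract their content, and each site handles requests a little differently. Writing an extractor for every site one by one is too tedious, so I looked for a general-purpose package. It turns out there is one, and it is very powerful. After using it, I found that it correctly extracts the body text of the vast majority of news articles. If you have the same need, give it a try.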

Newspaper can be used to extract news and articles and to analyze their content. It uses multithreading and supports more than 10 languages.

Its author was inspired by the simplicity and power of the requests library and wrote this article-extraction program in Python.

It supports more than 10 languages, and everything is handled as Unicode.

>>> import newspaper
>>> newspaper.languages()
 
Your available langauges are:
input code      full name
 
  ar              Arabic
  ru              Russian
  nl              Dutch
  de              German
  en              English
  es              Spanish
  fr              French
  it              Italian
  ko              Korean
  no              Norwegian
  pt              Portuguese
  sv              Swedish
  hu              Hungarian
  fi              Finnish
  da              Danish
  zh              Chinese

Here is a simple usage example:

>>> import newspaper
 
>>> cnn_paper = newspaper.build('http://cnn.com')
 
>>> for article in cnn_paper.articles:
...     print(article.url)
...
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...
 
>>> for category in cnn_paper.category_urls():
...     print(category)
...
http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...  
 
>>> article = cnn_paper.articles[0]
>>> article.download()
 
>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()
 
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']
 
>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
 
>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'
 
>>> article.movies
['http://youtube.com/path/to/link.com', ...]
>>> article.nlp()
 
>>> article.keywords
['New Years', 'resolution', ...]
 
>>> article.summary
'The study shows that 93% of people ...'
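
The download, parse, and nlp steps shown in the session can be wrapped into one helper. The following is only a minimal sketch: the function name `article_to_dict` and the choice of dictionary keys are my own, and it assumes the object passed in behaves like a newspaper article (it has `download()`, `parse()`, and `nlp()` methods plus the attributes read below).

```python
def article_to_dict(article):
    """Run download -> parse -> nlp on an article and collect the results.

    `article` is assumed to behave like a newspaper Article object.
    The function name and the dictionary keys are illustrative only.
    """
    article.download()  # fetch the raw HTML
    article.parse()     # extract title, authors, body text, top image
    article.nlp()       # compute keywords and summary
    return {
        "url": article.url,
        "title": article.title,
        "authors": list(article.authors),
        "text": article.text,
        "top_image": article.top_image,
        "keywords": list(article.keywords),
        "summary": article.summary,
    }
```

With a source built as in the session, this would be called as `article_to_dict(cnn_paper.articles[0])`. As far as I know, `nlp()` also depends on NLTK tokenizer data being installed, so if only the title and body are needed, that call can be dropped.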

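If no language is specified, Newspaper tries to detect it automatically. Even so, it is better to set the language explicitly to reduce the chance of a wrong guess.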

>>> from newspaper import Article
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
 
>>> a = Article(url, language='zh') # Chinese
 
>>> a.download()
>>> a.parse()
 
>>> print(a.text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建
筑（僭建）问题到立法会接受质询，并向香港民众道歉。
梁振英在星期二（12月10日）的答问大会开始之际
在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的
意图和动机。 一些亲北京阵营议员欢迎梁振英道歉，
且认为应能获得香港民众接受，但这些议员也质问梁振英有
 
>>> print(a.title)
港特首梁振英就住宅违建事件道歉
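
Since the goal here is just the title and the body text, the steps above can be sketched as one small function. This is a rough sketch under assumptions: `extract_title_and_text` and its `article_cls` parameter are hypothetical names of mine, not part of newspaper. The `article_cls` hook exists only so the function can be tried with a stand-in class; by default it uses `newspaper.Article` exactly as in the session.

```python
def extract_title_and_text(url, language=None, article_cls=None):
    """Return (title, text) for one article URL.

    `language` is passed through only when given, so automatic
    detection still applies otherwise. `article_cls` is a hypothetical
    hook for substituting a stand-in class; it defaults to
    newspaper.Article.
    """
    if article_cls is None:
        # Imported lazily so the function can be defined without the package.
        from newspaper import Article
        article_cls = Article

    if language:
        article = article_cls(url, language=language)
    else:
        article = article_cls(url)

    article.download()  # fetch the page
    article.parse()     # extract title and body text
    return article.title, article.text
```

For the BBC page above, the call would be `extract_title_and_text(url, language='zh')`.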

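If you are sure that all the articles on a site are in the same language, you can pass the language through the same API when building the source.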

>>> import newspaper
>>> sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')
 
>>> for category in sina_paper.category_urls():
...     print(category)
...
http://health.sina.com.cn
http://eladies.sina.com.cn
http://english.sina.com
...
 
>>> article = sina_paper.articles[0]
>>> article.download()
>>> article.parse()
 
>>> print(article.text)
新浪武汉汽车综合 随着汽车市场的日趋成熟，
传统的“集全家之力抱得爱车归”的全额购车模式已然过时，
另一种轻松的新兴 车模式――金融购车正逐步成为时下消费者购
买爱车最为时尚的消费理念，他们认为，这种新颖的购车
模式既能在短期内
...
 
>>> print(article.title)
两年双免0手续0利率 科鲁兹掀背金融轻松购_武汉车市_武汉汽
车网_新浪汽车_新浪网
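
When a whole source is processed rather than a single article, some pages will probably fail to download or yield no text. The sketch below shows one way to loop over a built source and keep only the usable results. `collect_articles` and its `limit` parameter are names I made up for illustration; the only things it assumes about `paper` are the `articles` list and the per-article `download()` and `parse()` calls used above.

```python
def collect_articles(paper, limit=10):
    """Try the first `limit` articles of a built source.

    Returns a list of dicts with url, title and text. Articles that
    raise during download/parse, or that yield no body text, are
    skipped. The function name and `limit` are illustrative only.
    """
    results = []
    for article in paper.articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:
            # Broad on purpose: one bad page should not stop the batch.
            print(f"skipped {article.url}: {exc}")
            continue
        if not article.text:
            continue  # nothing was extracted from this page
        results.append(
            {"url": article.url, "title": article.title, "text": article.text}
        )
    return results
```

For the Sina source built above, this would be called as `collect_articles(sina_paper, limit=5)`.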

Documentation: http://newspaper.readthedocs.org/en/latest/
GitHub page: https://github.com/codelucas/newspaper