Scraping News with the Newspaper Framework

Author: SeanCheney | Published 2019-01-21 11:01

    The Newspaper framework is the third most-starred Python crawler framework on GitHub and is well suited to scraping news pages.

    Install the Python 3 version: pip3 install newspaper3k (pip install newspaper installs the old Python 2 version).
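
    A quick way to confirm the installation is to import the package in the interpreter you intend to use. This is just a minimal sanity check, not part of the original walkthrough:

    # Minimal install check: the import succeeds only if newspaper3k
    # was installed for this Python 3 interpreter.
    import newspaper
    print(newspaper.__file__)  # shows which installation is being used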

    1. Basic usage
    from newspaper import Article
    
    url = 'https://www.washingtonpost.com/powerpost/trump-to-make-new-offer-to-democrats-as-government-shutdown-drags-on/2019/01/19/2cde029e-1bf3-11e9-9ebf-c5fed1b7a081_story.html?utm_term=.4db5c2055c6d'
    
    # Create the article object
    article = Article(url)
    
    # Download the page
    article.download()
    
    # Print the raw HTML
    print(article.html)
    
    # Parse the page
    article.parse()
    
    # Title
    print(article.title)
    
    # Authors
    print(article.authors)
    
    # Publish date
    print(article.publish_date)
    
    # Article body text
    print(article.text)
    
    # Top image
    print(article.top_image)
    
    # Videos
    print(article.movies)
    
    # Natural language processing (requires download() and parse() first)
    article.nlp()
    
    # Keywords
    print(article.keywords)
    
    # Summary
    print(article.summary)
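
    Newspaper also handles non-English sources if you pass a language code to Article, and article.nlp() relies on NLTK's "punkt" tokenizer data, which may need to be downloaded once. The snippet below is a minimal sketch; the Chinese news URL is a placeholder, not from the original post:

    from newspaper import Article
    import nltk
    
    # nlp() uses NLTK's sentence tokenizer; fetch the data once if it is missing
    nltk.download('punkt')
    
    # 'language' selects the tokenizer and stopwords; the URL is a placeholder
    zh_article = Article('http://news.example.com/some-story.html', language='zh')
    zh_article.download()
    zh_article.parse()
    zh_article.nlp()
    print(zh_article.keywords)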
    
    2. Crawling an entire news site
    import newspaper
    
    # Build a news source from the site's home page
    washingtonpost_paper = newspaper.build('https://www.washingtonpost.com')
    
    # URLs of all discovered articles
    for article in washingtonpost_paper.articles:
        print(article.url)
    
    # Category URLs of the site
    for category in washingtonpost_paper.category_urls():
        print(category)
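
    By default, build() caches articles it has already seen and skips them on later runs; passing memoize_articles=False disables that cache. A minimal sketch (the language argument is an assumption that the site is English):

    import newspaper
    
    # Rebuild the source with the article cache disabled, so every run
    # returns the full list of discovered articles
    paper = newspaper.build('https://www.washingtonpost.com',
                            memoize_articles=False, language='en')
    print(paper.size())  # number of articles found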
    
    3. Combining Requests and Newspaper to extract article text
    import requests
    from newspaper import fulltext
    
    html = requests.get('https://www.washingtonpost.com/business/economy/2019/01/17/19662748-1a84-11e9-9ebf-c5fed1b7a081_story.html?utm_term=.26198c91916f').text
    text = fulltext(html)
    
    print(text)
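
    This combination is useful when the request itself needs customization (headers, cookies, timeouts) that requests handles, while fulltext() only does the text extraction. A small sketch under those assumptions; the User-Agent value and the article URL below are placeholders:

    import requests
    from newspaper import fulltext
    
    # Fetch with a custom User-Agent and a timeout, then let Newspaper
    # extract only the article text from the returned HTML
    headers = {'User-Agent': 'Mozilla/5.0'}
    resp = requests.get('https://www.washingtonpost.com/business/economy/example-story.html',
                        headers=headers, timeout=10)
    text = fulltext(resp.text, language='en')
    print(text)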
    
    4. Google Trends information
    import newspaper
    
    # Trending news terms from Google
    print(newspaper.hot())
    
    # Popular news site URLs
    print(newspaper.popular_urls())
    
    5. Multithreaded downloading
    import newspaper
    from newspaper import news_pool
    
    # Build the sources to download in parallel
    slate_paper = newspaper.build('http://slate.com')
    tc_paper = newspaper.build('http://techcrunch.com')
    espn_paper = newspaper.build('http://espn.com')
    
    papers = [slate_paper, tc_paper, espn_paper]
    news_pool.set(papers, threads_per_source=2)  # 3 sources * 2 = 6 threads in total
    
    news_pool.join()
    
    # After join(), each article's HTML has been downloaded
    print(slate_paper.articles[10].html)
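
    Note that news_pool only downloads the HTML; each article still needs parse() before its title, text, and other fields are populated. A minimal, self-contained sketch of that follow-up step:

    import newspaper
    from newspaper import news_pool
    
    # Download one source through the thread pool
    paper = newspaper.build('http://slate.com')
    news_pool.set([paper], threads_per_source=2)
    news_pool.join()
    
    # join() only guarantees the HTML is downloaded; parse() is still
    # required before title/text/authors are available
    article = paper.articles[0]
    article.parse()
    print(article.title)
    print(article.text)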
    
