Python Web Scraping 4: [Complete Code] Getting the Titles, Sources, and Dates of Baidu News

Author: 0清婉0 | Published 2020-12-28 21:07

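Starting today I am teaching myself hands-on Python web scraping. I bought a good book and want to share what I learn with everyone; I also recommend writing and practicing a lot. I feel I learned a great deal today, and I am finding Python more and more fun. I worked through the book's exercises and then tried things out myself. Some of the code in the book no longer matches what the live page needs, so after a day of experimenting with the book's approach, I finally managed to extract 10 pages of news listings into a Word document ^_^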

I. Getting the headers for Baidu News

II. Fetching the page source

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
url = 'https://www.baidu.com/s?tn=news&rtt=1&bsst=1&cl=2&wd=考察'
res = requests.get(url, headers=headers).text
print(res)
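The keyword in the URL above contains non-ASCII characters, so it can help to build the query string explicitly. The following is a minimal sketch using only the standard library. The helper name `build_news_url` is hypothetical, and treating `pn` as a result offset of 10 per page is an assumption about the site's paging that should be checked against the live page.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.baidu.com/s'

def build_news_url(keyword, page=1):
    """Build a Baidu News search URL (hypothetical helper).

    The first five parameters mirror the URL used above. Using 'pn' as a
    result offset (10 results per page) is an assumption, not confirmed here.
    """
    params = {
        'tn': 'news',
        'rtt': '1',
        'bsst': '1',
        'cl': '2',
        'wd': keyword,               # urlencode percent-encodes this as UTF-8
        'pn': str((page - 1) * 10),  # assumed paging offset
    }
    return BASE_URL + '?' + urlencode(params)

first_page = build_news_url('考察')
third_page = build_news_url('考察', page=3)
print(first_page)
print(third_page)
```

The resulting string can be passed to `requests.get` in place of the hand-written URL; looping `page` from 1 to 10 would give one URL per results page under the same assumption.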

III. Writing a regular expression to extract the news info

import re

res = '''
<div class="news-source">
<div class="c-img c-img1 c-img-circle news-source-icon_1tdlx c-gap-right-xsmall">
<span class="c-img-border c-img-circle"></span>
<img class="source-img_33bs5" src="https://timg01.bdimg.com/timg?pacompress=&imgtype=0&sec=1439619614&autorotate=1&di=834cbb7d72ef5d6290e356c3a9b82679&quality=90&size=b870_10000&src=http%3A%2F%2Fpic.rmb.bdstatic.com%2Fb5aef7a1e77791d0387d001e5fa2d184.png">
</div>
<span class="c-color-gray c-font-normal c-gap-right">网易新闻</span>
<span class="c-color-gray2 c-font-normal">2020年12月27日 18:37</span>
</div>
'''

# A non-greedy (.*?) followed by only </div> would stop at the inner icon </div>,
# before the source and date, so the pattern ends on the last </span> before </div>.
p_info = r'<div class="news-source">(.*?)</span>\s*</div>'
info = re.findall(p_info, res, re.S)  # re.S lets '.' match newlines as well
print(info)
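The patterns in this post rely on two details of Python's `re` module: the non-greedy quantifier `.*?` and the `re.S` flag. Here is a small sketch on made-up HTML (the `html` string is illustrative only, not taken from Baidu) showing how each one changes the result.

```python
import re

# Illustrative markup, not real page source
html = '<p>first</p>\n<p>second\nline</p>'

# Greedy: .* runs to the LAST </p>, so both paragraphs merge into one match.
greedy = re.findall('<p>(.*)</p>', html, re.S)

# Non-greedy: .*? stops at the FIRST </p> after each <p>.
lazy = re.findall('<p>(.*?)</p>', html, re.S)

# Without re.S, '.' does not match '\n', so the two-line paragraph is missed.
no_dotall = re.findall('<p>(.*?)</p>', html)

print(greedy)
print(lazy)
print(no_dotall)
```

Because a non-greedy match ends at the first closing tag it meets, nested tags of the same name (such as a `<div>` inside a `<div>`) cut the match short at the inner closing tag.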

IV. Writing a regular expression to extract the news links

import re

res = '''
<h3 class="news-title_1YtI1">
<a href="https://finance.ifeng.com/c/82Z0Nx2QiJ6" target="_blank" class="news-title-font_1xS-F" data-click="{
'f0':'77A717EA',
'f1':'9F63F1E4',
'f2':'4CA6DE6E',
'f3':'54E5243F',
't':'1609115182',
}"><em>阿里巴巴</em>某某某某某某,由...</a>
</h3>
'''

p_href = '<h3 class="news-title_1YtI1">.*?<a href="(.*?)"'
href = re.findall(p_href, res, re.S)
print(href) # ['https://finance.ifeng.com/c/82Z0Nx2QiJ6']
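Links pulled out of raw HTML with a regular expression may still contain HTML entities such as `&amp;`. The sketch below, on invented sample markup, shows one way to decode them with the standard library's `html.unescape`. Whether Baidu's links actually need this step is an assumption, so treat it as an optional safeguard.

```python
import html
import re

# Invented sample markup for illustration only
sample = '<h3 class="t"><a href="https://example.com/news?id=1&amp;from=search">Title</a></h3>'

links = re.findall('<a href="(.*?)"', sample, re.S)

# Decode entities so the URL can be requested as-is
clean_links = [html.unescape(link) for link in links]
print(clean_links)
```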

V. Writing a regular expression to extract the news titles

import re

res = '''
<h3 class="news-title_1YtI1">
<a href="https://finance.ifeng.com/c/82Z0Nx2QiJ6" target="_blank" class="news-title-font_1xS-F" data-click="{
'f0':'77A717EA',
'f1':'9F63F1E4',
'f2':'4CA6DE6E',
'f3':'54E5243F',
't':'1609115182',
}"><em>阿里巴巴</em>在港公告:董事会已授权增加本公司的股份回购计划总额,由...</a>
</h3>
'''

p_title = '<h3 class="news-title_1YtI1">.*?>(.*?)</a>'
title = re.findall(p_title, res, re.S)
print(title) # ['<em>阿里巴巴</em>在港公告:董事会已授权增加本公司的股份回购计划总额,由...']
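The link and the title come from the same `<h3>` block, so they can also be captured in one pass, which keeps each link paired with its own title. This is only a sketch of that idea with named groups and `re.finditer`; the sample markup and the combined pattern `p_item` are illustrative and are not the patterns used elsewhere in this post.

```python
import re

# Simplified sample markup for illustration only
sample = '''
<h3 class="news-title_1YtI1">
<a href="https://example.com/a" target="_blank">First <em>headline</em></a>
</h3>
<h3 class="news-title_1YtI1">
<a href="https://example.com/b" target="_blank">Second headline</a>
</h3>
'''

# One pattern with two named groups: the link, then the text inside <a>...</a>
p_item = re.compile(
    r'<h3 class="news-title_1YtI1">.*?<a href="(?P<href>.*?)".*?>(?P<title>.*?)</a>',
    re.S,
)

items = [(m.group('href'), m.group('title')) for m in p_item.finditer(sample)]
for link, text in items:
    print(link, text)
```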

VI. Cleaning the data and printing the output

1. Cleaning the news titles

import re

res = '''
<h3 class="news-title_1YtI1">
<a href="https://finance.ifeng.com/c/82Z0Nx2QiJ6" target="_blank" class="news-title-font_1xS-F" data-click="{
'f0':'77A717EA',
'f1':'9F63F1E4',
'f2':'4CA6DE6E',
'f3':'54E5243F',
't':'1609115182',
}"> <em>阿里巴巴</em>在港公告:董事会已授权增加本公司的股份回购计划总额,由...</a>
</h3>
'''

p_title = '<h3 class="news-title_1YtI1">.*?>(.*?)</a>'
title = re.findall(p_title, res, re.S)

# strip() removes spaces and newlines.
# It only removes characters at the start or end of the string, not in the middle.
for i in range(len(title)):  # len(title) is the number of titles found
    title[i] = title[i].strip()
    print(title[i])
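Since `strip()` only trims the ends of a string, tags inside the title such as `<em>` survive it. A minimal sketch of combining it with `re.sub`, the same two calls the complete code at the end of this post uses, wrapped in a hypothetical helper named `clean_title`:

```python
import re

def clean_title(raw):
    """Remove HTML tags, then trim surrounding whitespace (hypothetical helper)."""
    no_tags = re.sub('<.*?>', '', raw)  # drop anything that looks like <...>
    return no_tags.strip()

raw = ' <em>阿里巴巴</em>在港公告 \n'
stripped_only = raw.strip()
cleaned = clean_title(raw)
print(repr(stripped_only))
print(repr(cleaned))
```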

2. Cleaning the news source and date

import re

res = '''
<div class="news-source">
<div class="c-img c-img1 c-img-circle news-source-icon_1tdlx c-gap-right-xsmall">
<span class="c-img-border c-img-circle"></span>
<img class="source-img_33bs5" src="https://timg01.bdimg.com/timg?pacompress=&imgtype=0&sec=1439619614&autorotate=1&di=834cbb7d72ef5d6290e356c3a9b82679&quality=90&size=b870_10000&src=http%3A%2F%2Fpic.rmb.bdstatic.com%2Fb5aef7a1e77791d0387d001e5fa2d184.png">
</div>
<span class="c-color-gray c-font-normal c-gap-right">网易新闻</span>
<span class="c-color-gray2 c-font-normal">2020年12月27日 18:37</span>
</div>
'''

p_source = '<span class="c-color-gray c-font-normal c-gap-right">(.*?)</span>'
source = re.findall(p_source, res, re.S)
for i in range(len(source)):
    source[i] = re.sub('<.*?>', '', source[i])  # remove any leftover tags
    print(source[i])

p_date = '<span class="c-color-gray2 c-font-normal">(.*?)</span>'
date = re.findall(p_date, res, re.S)
for j in range(len(date)):
    print(date[j])
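The extracted date is still a plain string. If it needs to be sorted or compared, it can be converted with `datetime.strptime`. The sketch below assumes the exact format shown in the sample above ("2020年12月27日 18:37"); the helper name `parse_news_date` is hypothetical, and any other format (for example a relative time) is simply returned as `None` rather than guessed at.

```python
from datetime import datetime

def parse_news_date(text):
    """Parse a date like '2020年12月27日 18:37' (hypothetical helper).

    Returns None when the text does not match that format.
    """
    try:
        return datetime.strptime(text.strip(), '%Y年%m月%d日 %H:%M')
    except ValueError:
        return None

parsed = parse_news_date('2020年12月27日 18:37')
relative = parse_news_date('3小时前')  # not in the assumed format
print(parsed)
print(relative)
```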

The complete code is as follows:

import requests
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
url = 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&rsv_dl=ns_pc&word=考察'
res = requests.get(url, headers=headers).text

p_href = '<h3 class="news-title_1YtI1">.*?<a href="(.*?)"'
p_title = '<h3 class="news-title_1YtI1">.*?>(.*?)</a>'
p_source = '<span class="c-color-gray c-font-normal c-gap-right">(.*?)</span>'
p_date = '<span class="c-color-gray2 c-font-normal">(.*?)</span>'

href = re.findall(p_href, res, re.S)
title = re.findall(p_title, res, re.S)
source = re.findall(p_source, res, re.S)
date = re.findall(p_date, res, re.S)

# Clean the data and print the output
for i in range(len(title)):
    title[i] = title[i].strip()
    title[i] = re.sub('<.*?>', '', title[i])
    print(str(i+1) + '.' + title[i] + '(' + date[i] + '-' + source[i] + ')')
    print(href[i])
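The loop above indexes four separate lists by position, so it raises an `IndexError` if one pattern matches fewer items than the titles do (for instance, when an entry has no date in the expected markup). As a hedged alternative, the lists could be combined with `zip`, which stops at the shortest one. The function name `combine_results` and the demo data are made up for illustration.

```python
def combine_results(titles, dates, sources, hrefs):
    """Pair up the four lists and format them like the loop above.

    zip() stops at the shortest list, so a missing date or source
    shortens the output instead of raising IndexError.
    """
    lines = []
    rows = zip(titles, dates, sources, hrefs)
    for i, (title, date, source, href) in enumerate(rows, start=1):
        lines.append(str(i) + '.' + title + '(' + date + '-' + source + ')')
        lines.append(href)
    return lines

# Demo data: two titles and links, but only one date and one source
demo = combine_results(
    ['Title A', 'Title B'],
    ['2020年12月27日 18:37'],
    ['网易新闻'],
    ['https://example.com/a', 'https://example.com/b'],
)
for line in demo:
    print(line)
```

Note that truncating silently can also misalign entries when the missing item is in the middle of the page, so comparing the list lengths first is a reasonable sanity check.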