Building a Hot-Topics Platform (Part 1): Scraping Hot-List Data with Python

Author: 请不要问我是谁 | Published: 2020-02-25 21:30

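Choosing a hot list to scrape

Before scraping a hot list we have to decide which one. This article uses the Hupu Buxingjie (虎扑步行街) hot list as the example. URL: https://bbs.hupu.com/all-gambia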

We need the following fields:

Rank
Title
Link to the detail page
Popularity index (for example click count or reply count)
Detail content
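
The fields above can be gathered into one record type. This is a minimal sketch, assuming a hypothetical `HotItem` dataclass; the class name, the field names and the sample values are illustrative only, and the original script keeps separate lists instead.

```python
from dataclasses import dataclass, astuple


@dataclass
class HotItem:
    """One hot-list entry (hypothetical container, not part of the original script)."""
    rank: int         # position on the hot list, starting at 1
    title: str        # thread title
    link: str         # absolute URL of the detail page
    hot_number: int   # popularity index, e.g. likes plus replies
    detail: str = ""  # body text of the thread, filled in later


# Placeholder values, made up for the example.
item = HotItem(rank=1, title="example title",
               link="https://bbs.hupu.com/0.html", hot_number=42)
row = astuple(item)  # tuple form, convenient for parameterized SQL inserts
```

A tuple per entry matches the shape that a parameterized bulk insert expects, which is why `astuple` is shown here.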

Fetching the page

This step uses Python's requests library to send the request and receive the returned page.

import requests


def get_html(url, headers):
    # Send the request and return the decoded page text.
    r = requests.get(url, headers=headers, timeout=30)
    # Decode with the encoding detected from the response body.
    r.encoding = r.apparent_encoding
    return r.text


url = 'https://bbs.hupu.com/all-gambia'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
html = get_html(url, headers)

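r.encoding controls how the returned HTML bytes are decoded into text. Setting it to r.apparent_encoding uses the encoding detected from the body; on some pages the detection is wrong and the value has to be forced to utf-8.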
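
To see why the encoding matters, the sketch below decodes the same bytes two ways. It uses only the standard library, sends no request, and the sample string is made up for illustration.

```python
# Bytes as a server would send them for a UTF-8 page.
raw = "虎扑热榜".encode("utf-8")

# Decoding with the wrong charset yields mojibake. ISO-8859-1 is the default
# requests applies to text/* responses whose header declares no charset.
wrong = raw.decode("iso-8859-1")

# Decoding with the right charset restores the original text.
right = raw.decode("utf-8")

print(wrong == "虎扑热榜")  # False
print(right == "虎扑热榜")  # True
```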

Parsing the page and extracting the elements

BeautifulSoup parses the page. It offers many ways to pick out elements, for example by class, by the href attribute, or with get_text().

import time

from bs4 import BeautifulSoup

# Use the current time as the timestamp of this crawl.
now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
soup = BeautifulSoup(html, 'html.parser')
# Class of the elements that hold the hot-list entries.
all_topic = soup.find_all(class_="textSpan")
# The hot list has 20 entries; drop everything after them.
main_topic = all_topic[:20]
# print(main_topic)
title = []
link = []
hot_number = []
detail = []
for t in main_topic:
    # Title
    title.append(t.a.get("title"))
    # Link; the href is relative, so prepend the site root.
    link.append("https://bbs.hupu.com" + t.a.get("href"))
    # Text used to compute the popularity, e.g. "<n>亮<m>回复" (likes, replies).
    hot_number_str = t.em.get_text().strip()
    index_l = hot_number_str.find("亮")
    index_h = hot_number_str.find("回复")
    # Popularity = likes + replies; a missing part counts as 0.
    ln = 0
    hn = 0
    if index_l >= 0:
        ln = int(hot_number_str[:index_l])
    if index_h >= 0:
        if index_l >= 0:
            hn = int(hot_number_str[index_l + 1:index_h])
        else:
            hn = int(hot_number_str[:index_h])
    hot_number.append(ln + hn)
for i in link:
    # Fetch the detail content of each entry.
    detail.append(get_detail(i))
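
The index arithmetic in the loop can also be written with regular expressions. This is a sketch, assuming the text looks like `12亮34回复` or `34回复` as the loop above implies; `parse_hot_number` is a hypothetical helper, not part of the original script.

```python
import re

# Two independent patterns: "<n>亮" (likes) and "<n>回复" (replies).
_LIKES = re.compile(r"(\d+)\s*亮")
_REPLIES = re.compile(r"(\d+)\s*回复")


def parse_hot_number(text: str) -> int:
    """Return likes + replies found in text; a missing part counts as 0."""
    likes = _LIKES.search(text)
    replies = _REPLIES.search(text)
    likes_count = int(likes.group(1)) if likes else 0
    replies_count = int(replies.group(1)) if replies else 0
    return likes_count + replies_count


print(parse_hot_number("12亮34回复"))  # 46
print(parse_hot_number("34回复"))      # 34
print(parse_hot_number(""))            # 0
```

Compared with slicing by index, this version never raises on text that has neither marker; it simply returns 0.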

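Fetching the detail content

The detail content needs the detail link, which comes from the previous step. The detail page is again parsed with BeautifulSoup to pull out the content.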

def get_detail(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    html = get_html(url, headers)
    # print(html)
    # html is already a str, so from_encoding would be ignored and is omitted.
    soup = BeautifulSoup(html, 'lxml')
    # Extract the body text of the post.
    content = soup.find(class_='quote-content').get_text().strip()
    return content

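Writing the data to the databases

Three data stores are involved. The first is MySQL: when a new hot topic arrives, users look it up in MySQL.

The second is Redis. After a user's first lookup the data is cached in Redis. So that users always read the latest data, the stale Redis cache must be removed once new data has been inserted into MySQL.

The third is Elasticsearch: the titles are written to it for later search queries.

If the network connection fails, the content is written to a local file instead.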
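
The read and invalidate flow described above is the cache-aside pattern. Here is a minimal sketch, with plain dictionaries standing in for MySQL and Redis so that it runs without any server; `HotListStore` and its method names are hypothetical.

```python
class HotListStore:
    """Cache-aside sketch: dicts stand in for MySQL (db) and Redis (cache)."""

    def __init__(self):
        self.db = {}       # stand-in for the MySQL table
        self.cache = {}    # stand-in for the Redis cache
        self.db_reads = 0  # counts how often the "database" is queried

    def read(self, key):
        # Serve from the cache when possible; otherwise load from the
        # database and remember the result for the next reader.
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)
        self.cache[key] = value
        return value

    def write(self, key, value):
        # Write to the database first, then drop the stale cache entry so
        # the next read fetches the new data.
        self.db[key] = value
        self.cache.pop(key, None)


store = HotListStore()
store.write("hupu", ["topic A"])
first = store.read("hupu")    # miss: loaded from db, then cached
second = store.read("hupu")   # hit: served from the cache
store.write("hupu", ["topic B"])
third = store.read("hupu")    # miss again: the cache entry was invalidated
```

Deleting the cached entry rather than updating it keeps the writer simple: the next reader repopulates the cache from the database.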

import pymysql
import redis
from elasticsearch import Elasticsearch, helpers

# mysql_host, redis_host, es_host, datas, es_datas, logger and now are
# assumed to be defined earlier in the script.
try:
    # Keyword arguments: PyMySQL 1.0+ no longer accepts positional ones.
    db = pymysql.connect(host=mysql_host, user="root", password="0", database="social_network_data")
    cursor = db.cursor()
    try:
        # `rank` is a reserved word in MySQL 8.0, so it is quoted with backticks.
        cursor.executemany('insert into hupu_hot_list(`rank`, title, link, hot_number, detail, timestamp) '
                           'VALUES (%s, %s, %s, %s, %s, %s)', datas)
        db.commit()
        logger.info("Hupu hot list written to MySQL: " + now)
        r = redis.StrictRedis(host=redis_host, port=6379, password='0')
        try:
            # Drop the stale cache entry so the next read hits MySQL.
            if r.exists('hupuData::SimpleKey []'):
                r.delete('hupuData::SimpleKey []')
            logger.info("Hupu Redis cache cleared: " + now)
        except Exception as e:
            logger.error("Failed to clear Hupu Redis cache: " + str(e))
        finally:
            r.close()
        try:
            # Connect to Elasticsearch.
            es = Elasticsearch(
                [es_host],
                port=9200
            )
            actions = []
            for d in es_datas:
                # Build one bulk action per record.
                action = {
                    "_index": "hupu_data",
                    "_source": {
                        "title_text": d[0],
                        "title_keyword": d[1],
                        "rank": d[2],
                        "hot_number": d[3],
                        "timestamp": d[4]
                    }
                }
                actions.append(action)
            # Bulk insert.
            helpers.bulk(es, actions)
            logger.info("Hupu data written to ES: " + now)
        except Exception as e:
            logger.error("Failed to write Hupu data to ES: " + str(e))
    except Exception as e:
        # The MySQL insert failed: roll back and fall back to a local file.
        logger.error("Failed to write Hupu data to MySQL: " + str(e))
        db.rollback()
        with open("../data/hupu_hot_list.txt", "a", encoding="utf-8") as f:
            for data in datas:
                f.writelines(str(data[0]) + "\t" + data[1] + "\t" + data[2] + "\t" + str(data[3]) + "\t" + now + "\n")
        logger.info("Hupu hot list written to file: " + now)
    finally:
        db.close()
except Exception as e:
    # The MySQL connection itself failed: fall back to a local file.
    logger.error("Failed to connect to the database: " + str(e))
    with open("../data/hupu_hot_list.txt", "a", encoding="utf-8") as f:
        for data in datas:
            f.writelines(str(data[0]) + "\t" + data[1] + "\t" + data[2] + "\t" + str(data[3]) + "\t" + now + "\n")
    logger.info("Hupu hot list written to file: " + now)
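
The tab-separated file fallback appears twice in the code above and could be factored into one helper. This is a sketch using the standard `csv` module, which also quotes any tab or newline that appears inside a title; `write_fallback` and the demo row are hypothetical, and the row layout (rank, title, link, hot_number, then further fields) is inferred from the code above.

```python
import csv
import os
import tempfile


def write_fallback(path, datas, now):
    """Append rank, title, link, hot_number and timestamp as tab-separated rows."""
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for data in datas:
            writer.writerow([data[0], data[1], data[2], data[3], now])


# Demo with a made-up row, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "hupu_hot_list.txt")
write_fallback(demo_path,
               [(1, "title", "https://bbs.hupu.com/x", 7, "detail")],
               "2020-02-25 21:30:00")
with open(demo_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```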

Project GitHub repository: https://github.com/wanggangkun/get_social_network_data