[Web Scraping] A Simple Way to Scrape Bilibili's Danmaku List

Author: MarcoHorse | Source: published 2018-04-17 16:20, read 27 times

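A friend in a group chat recently suggested building a danmaku (bullet comment) tally for a Bilibili video and picking out the comment that appears most often. How do we go about it? First we have to get the danmaku list, and then analyze it. Finding the most frequent comment is the easy part, since collections can handle it, so the real problem is how to obtain Bilibili's danmaku list.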
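As a minimal sketch of the counting step, assuming the danmaku have already been collected into a plain list of strings, collections.Counter can rank them by frequency. The function name most_common_danmaku and the sample list are made up for illustration:

```python
from collections import Counter


def most_common_danmaku(danmaku, top_n=1):
    """Return the top_n (text, count) pairs, most frequent first."""
    return Counter(danmaku).most_common(top_n)


# Hypothetical sample data, not real danmaku.
sample = ["233", "awsl", "233", "hello", "233", "awsl"]
print(most_common_danmaku(sample, top_n=2))
```

Calling most_common() with no argument returns every distinct comment, sorted from most to least frequent, which is what the full script relies on.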

Development environment:
Windows 7 + Chrome
IDEA + the Python plugin
requests + json + beautifulsoup + collections

The steps are as follows:

Parse the video playback page URL
Locate the danmaku resource
Analyze the data (collections.Counter)
Store the data (file)
Export an exe

Parsing the video playback page and locating the danmaku resource

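Bilibili video URLs all look like https://www.bilibili.com/video/av22068969/
that is, https://www.bilibili.com/video/ + the video's av number
Press Ctrl+U to view the HTML page source and search it for the text of one of the danmaku to see whether the source contains them; it does not
Press F12 and open the Network tab to watch the requests made while the page loads
The danmaku turn out to be contained in a single XML file, with no pagination at all
That makes things simple: the danmaku URL is https://comment.bilibili.com/ + a number + .xml
So the next step is to find the endpoint that returns this number for a given video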
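To make the XML layout concrete before looking up that number, here is a small offline sketch. It assumes each danmaku sits in a d element, which is what the script later in this article searches for. The sample document, its p attribute values, and the helper names are illustrative assumptions, and the standard library's xml.etree is swapped in for BeautifulSoup so the sketch has no third-party dependencies:

```python
import xml.etree.ElementTree as ET

# URL pattern described in the article; the number is filled in later.
COMMENT_URL = "https://comment.bilibili.com/{}.xml"


def danmaku_url(cid):
    """Build the danmaku XML URL for one cid."""
    return COMMENT_URL.format(cid)


def parse_danmaku_xml(xml_text):
    """Return the text of every <d> element in document order."""
    root = ET.fromstring(xml_text)
    return [d.text or "" for d in root.iter("d")]


# Hypothetical sample shaped like the file seen in the Network tab.
SAMPLE_XML = """<i>
  <d p="1.0,1,25,16777215">first</d>
  <d p="2.5,1,25,16777215">second</d>
  <d p="3.0,1,25,16777215">first</d>
</i>"""

print(danmaku_url(123))
print(parse_danmaku_xml(SAMPLE_XML))
```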

"https://api.bilibili.com/x/player/pagelist?aid={}&jsonp=jsonp".format(av_number)
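The lookup can be sketched offline as follows. The only things taken from the article are the URL pattern and the fact that the response carries a data list whose entries have a cid field; the sample payload, its other fields, and the function names are hypothetical:

```python
import json

# Endpoint pattern from the article.
PAGELIST_URL = "https://api.bilibili.com/x/player/pagelist?aid={}&jsonp=jsonp"


def pagelist_url(av_number):
    """Build the page-list URL for a video's av number."""
    return PAGELIST_URL.format(av_number)


def extract_cids(payload_text):
    """Pull every cid out of a page-list JSON response body."""
    payload = json.loads(payload_text)
    # "data" may be missing or null when the lookup fails.
    pages = payload.get("data") or []
    return [page["cid"] for page in pages if "cid" in page]


# Hypothetical response body for a two-part video.
SAMPLE = '{"code": 0, "data": [{"cid": 111, "page": 1}, {"cid": 222, "page": 2}]}'

print(pagelist_url(22068969))
print(extract_cids(SAMPLE))
```

A video split into several parts yields several cids, which is why the script loops over data rather than reading a single value.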

Code skeleton:

def get_movie_url(av): pass  # look up the cids
def get_barrage_list(cid): pass  # fetch the danmaku list
def get_barrage_count(danmaku): pass  # count and rank the danmaku
def write_text(av, content): pass  # write to a text file

Writing the Python code

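import collections
import json

import requests
from bs4 import BeautifulSoup as bs

url_get_comment_cid = "https://api.bilibili.com/x/player/pagelist?aid={}&jsonp=jsonp"
url_get_comment = "https://comment.bilibili.com/{}.xml"

l_list = []  # danmaku text collected across all parts of the video


def get_movie_url(av):
    # Look up the cid of every part of the video, then collect its danmaku
    response = requests.get(url_get_comment_cid.format(str(av)))
    content = json.loads(response.content.decode(response.encoding or 'utf-8'))
    for d in content.get('data') or []:
        get_barrage_list(d.get('cid'))
    print("{} danmaku in total".format(len(l_list)))
    # most_common() returns (text, count) pairs, most frequent first
    count = collections.Counter(l_list).most_common()
    write_text(av, l_list)
    write_text(str(av) + "count", count)


def get_barrage_list(cid):
    response = requests.get(url_get_comment.format(str(cid)))
    # Pass the raw bytes so the parser uses the encoding declared in the XML
    b = bs(response.content, 'xml')
    for i in b.find_all('d'):
        l_list.append(i.text)


def write_text(av, l):
    with open(str(av) + '.txt', 'w', encoding='utf-8') as f:
        for i in l:
            f.write(str(i) + "\n")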
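One design note on the code above: the module-level list keeps growing if get_movie_url is called for more than one video in the same run. A possible variation, offered only as a sketch, passes the fetching function in as a parameter and returns the results instead of storing them globally. The names collect_and_count and fake_fetch are hypothetical, and the fake fetcher stands in for the network call so the example runs offline:

```python
from collections import Counter


def collect_and_count(cids, fetch_danmaku):
    """Gather danmaku for every cid and rank them by frequency.

    fetch_danmaku is any callable mapping a cid to a list of strings.
    """
    collected = []
    for cid in cids:
        collected.extend(fetch_danmaku(cid))
    return collected, Counter(collected).most_common()


# Hypothetical stand-in for the real request, keyed by made-up cids.
FAKE_DATA = {111: ["233", "awsl"], 222: ["233"]}


def fake_fetch(cid):
    return FAKE_DATA.get(cid, [])


danmaku, ranking = collect_and_count([111, 222], fake_fetch)
print(len(danmaku), ranking)
```

Swapping fake_fetch for a function that downloads and parses the XML would give the same behaviour as the script, with a fresh list on every call.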

Exporting an exe with PyInstaller

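Install PyInstaller with pip install pyinstaller (3.3.1 was the latest version at the time of writing)
pyinstaller -F <path to the .py file>
This generates an exe file that you can open and run. As a next step, I would like to export the results to Excel.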
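One low-effort route to Excel, sketched here as an assumption rather than something the script does, is to write the ranking as a CSV file that Excel can open. The function name write_counts_csv and the column headers are made up; the utf-8-sig encoding adds a byte order mark, which helps Excel detect UTF-8 so that Chinese text displays correctly:

```python
import csv
import os
import tempfile


def write_counts_csv(path, counts):
    """Write (text, count) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["danmaku", "count"])
        writer.writerows(counts)


# Demo with made-up counts, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "counts.csv")
write_counts_csv(demo_path, [("233", 3), ("awsl", 2)])
print(demo_path)
```

The counts argument has the same shape as the list returned by Counter.most_common(), so the ranking can be passed in directly.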

OK, that's all there is to it...

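Reader comments

小白白白白_d9d5: A question: after entering the xml or json URL, "open in new tab" gives a page-not-found error. What is going on?
共在远方: The method of looking for the xml no longer works; try looking for "list" instead.
MarcoHorse: What do you mean? After entering the json? What exactly did you do?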

Article link: https://www.haomeiwen.com/subject/dsxpkftx.html
