Scraping YouTube Video Comments

Author: 克里斯托弗的梦想 | Published 2019-05-17 16:52

    At someone's request, I needed the comment data under a YouTube video. Since YouTube restricts crawlers, my first attempt used Selenium to simulate manual clicking, but the number of comments it scraped was far below the count YouTube displays. After some digging I found that Google provides its own YouTube Data API, which conveniently serves up the comment data you need.

    URL: https://developers.google.com/youtube/v3/quickstart/python
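
    The quickstart also covers installing the Python client libraries. Assuming pip, the code in Step 4 needs roughly the following (plus PySocks if you use the commented-out proxy lines):

    pip install google-api-python-client google-auth-oauthlib numpy pandas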

    Step 1:
    Open the URL above and register for a Google Cloud Platform account, then create a project and OAuth client credentials. (The original post walks through this with screenshots.)


    Step 2:
    From the main documentation page https://developers.google.com/youtube/v3/
    navigate to the CommentThreads: list reference page, which has an interactive "Try this API" panel. (Screenshot omitted.)
    Step 3:
    Test it out.

    Click Execute and check the result that appears on the right; if no error is reported, the setup works.
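
    If you would rather verify access from code than from the browser panel, a quick smoke test against the REST endpoint also works. This is a minimal sketch, assuming you have created a plain API key (a key is enough for reading public comments; OAuth is only set up below because the quickstart uses it). YOUR_API_KEY is a placeholder:

    import requests

    # Hypothetical placeholders: substitute your own API key and any public video ID.
    API_KEY = "YOUR_API_KEY"
    VIDEO_ID = "5YGc4zOqozo"

    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/commentThreads",
        params={
            "part": "snippet",
            "videoId": VIDEO_ID,
            "maxResults": 5,
            "key": API_KEY,
        },
    )
    resp.raise_for_status()
    for item in resp.json()["items"]:
        snippet = item["snippet"]["topLevelComment"]["snippet"]
        print(snippet["authorDisplayName"], "->", snippet["textDisplay"][:60])
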
    Step 4:
    Below is the Python code I used to fetch the comments.
    import os
    import numpy as np
    import google_auth_oauthlib.flow
    import googleapiclient.discovery
    import googleapiclient.errors
    from googleapiclient.errors import HttpError
    import pandas as pd
    import json
    import socket
    import socks
    import requests
    ## Proxy settings: if you need a proxy to reach Google's servers, uncomment
    ## and adjust the lines below (the socks import requires the PySocks package).
    ## headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    ## socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 1080)
    ## socket.socket = socks.socksocket
    scopes = ["https://www.googleapis.com/auth/youtube.force-ssl"]
    
    def main():
        # Disable OAuthlib's HTTPS verification when running locally.
        # *DO NOT* leave this option enabled in production.
        os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    
        api_service_name = "youtube"
        api_version = "v3"
        # Download this client-secrets file from your own Google Cloud project
        # (APIs & Services -> Credentials) after registering.
        client_secrets_file = "client_secret_961831598513-urdgliumr9j4ab4g68jtocc30dimqb9g.apps.googleusercontent.com.json"
        # Get credentials and create an API client
        flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(
            client_secrets_file, scopes)
        # Note: newer versions of google-auth-oauthlib drop run_console();
        # use flow.run_local_server() there instead.
        credentials = flow.run_console()
    
        youtube = googleapiclient.discovery.build(
            api_service_name, api_version, credentials=credentials)
        videoId = '5YGc4zOqozo'
        # Request plain text here too, so the first page matches the later ones.
        request = youtube.commentThreads().list(
            part="snippet,replies",
            videoId=videoId,
            maxResults=100,
            textFormat='plainText'
        )
        response = request.execute()
        # print(response)
    
        totalResults = int(response['pageInfo']['totalResults'])
    
        count = 0
        nextPageToken = ''
        comments = []
        first = True
        further = True
        while further:
            halt = False
            if not first:
                print('..')
                try:
                    response = youtube.commentThreads().list(
                        part="snippet,replies",
                        videoId=videoId,
                        maxResults=100,
                        textFormat='plainText',
                        pageToken=nextPageToken
                    ).execute()
                    totalResults = int(response['pageInfo']['totalResults'])
                except HttpError as e:
                    print("An HTTP error %d occurred:\n%s" % (e.resp.status, e.content))
                    halt = True
                    further = False  # stop rather than retry the same page forever

            if not halt:
                count += totalResults
                for item in response["items"]:
                    # Only a subset of the fields is extracted here; print the raw
                    # response first to see what is available, then pick what you need.
                    comment = item["snippet"]["topLevelComment"]
                    author = comment["snippet"]["authorDisplayName"]
                    text = comment["snippet"]["textDisplay"]
                    likeCount = comment["snippet"]['likeCount']
                    publishtime = comment['snippet']['publishedAt']
                    comments.append([author, publishtime, likeCount, text])
                first = False
                if totalResults < 100:
                    further = False
                else:
                    try:
                        nextPageToken = response["nextPageToken"]
                        further = True
                    except KeyError as e:
                        print("A KeyError occurred: %s" % e)
                        further = False
        print('get data count: ', str(count))
        ### write to csv file
        data = np.array(comments)
        df = pd.DataFrame(data, columns=['author', 'publishtime', 'likeCount', 'comment'])
        df.to_csv('google_comments.csv', index=False, encoding='utf-8')
    
        ### write to json file
        result = []
        for name, time, vote, comment in comments:
            temp = {}
            temp['author'] = name
            temp['publishtime'] = time
            temp['likeCount'] = vote
            temp['comment'] = comment
            result.append(temp)
        print('result: ', len(result))
    
        json_str = json.dumps(result, indent=4, ensure_ascii=False)  # keep non-ASCII comment text readable
        with open('google_comments.json', 'w', encoding='utf-8') as f:
            f.write(json_str)
    
    if __name__ == "__main__":
        main()
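
    The requests above ask for part="snippet,replies", but the loop only keeps top-level comments. Each returned item can also carry a replies.comments list (the API embeds only some of the replies per thread; a complete set would need the separate comments.list endpoint). Below is a minimal sketch of a helper for this; extract_replies is a name of my own choosing, not part of the API:

    def extract_replies(response, comments):
        # Append reply comments (if any) from one commentThreads.list response page.
        # Each reply is a comment resource with the same snippet fields used above.
        for item in response["items"]:
            for reply in item.get("replies", {}).get("comments", []):
                r = reply["snippet"]
                comments.append([
                    r["authorDisplayName"],
                    r["publishedAt"],
                    r["likeCount"],
                    r["textDisplay"],
                ])

    Calling extract_replies(response, comments) after each .execute() in main() would fold the replies into the same CSV and JSON output.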
    
