Hands-On: Scraping NBA 2017-2018 Season Data with Python

Author: 东阿王 | Published: 2018-04-28 12:26

Goal

Scrape the game data of every team for the NBA 2017-2018 season
and save it to a .csv file

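Challenges

Analysis shows that the target data is generated dynamically, so the real data URL had to be found by capturing the page's network requests
Working out the nesting levels of the JSON data in order to extract the fields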
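
One way to work out the nesting is to list the key paths of the decoded JSON before writing any extraction code. The helper below is a minimal sketch, not necessarily how the author did it; `describe_keys` and the `sample` dict are hypothetical, and the sample only mirrors a few of the keys that the script in this article reads.

```python
def describe_keys(obj, prefix="", depth=0, max_depth=6):
    """Return the key paths of nested dicts/lists in traversal order.

    For a list, only its first element is inspected; it is shown as '[]'.
    """
    paths = []
    if depth > max_depth:
        return paths
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.append(path)
            paths.extend(describe_keys(value, path, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        paths.extend(describe_keys(obj[0], prefix + "[]", depth + 1, max_depth))
    return paths


# Hypothetical sample; the values are made up, only the key layout matters
sample = {
    "payload": {
        "monthGroups": [
            {"number": 10, "name": "10月", "games": [{"winOrLoss": "Won"}]}
        ]
    }
}

for path in describe_keys(sample):
    print(path)
```

In practice the same helper could be pointed at the result of `json.loads()` for one team's schedule, which makes the `payload` -> `monthGroups` -> `games` chain visible at a glance.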

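Limitations

The game time is not fully worked out: only the year and month are recorded, without the day or the tip-off time
The code has not been optimized much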
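
If each game record also carries an epoch timestamp in milliseconds, the full date and tip-off time could be recovered from it. The sketch below rests on that assumption: the field name `utcMillis` and the `game` dict are hypothetical and are not confirmed by this article, and `format_game_time` is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone

# Beijing time (UTC+8), chosen here because the data comes from china.nba.com
BEIJING = timezone(timedelta(hours=8))


def format_game_time(utc_millis, tz=BEIJING):
    """Convert an epoch timestamp in milliseconds to 'YYYY-MM-DD HH:MM'."""
    seconds = int(utc_millis) / 1000
    return datetime.fromtimestamp(seconds, tz=tz).strftime("%Y-%m-%d %H:%M")


# Hypothetical game record; 'utcMillis' is an assumed field name
game = {"profile": {"utcMillis": "1508466600000"}}
print(format_game_time(game["profile"]["utcMillis"]))
```

Passing an explicit time zone keeps the result independent of the machine the script runs on.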

Code

The code below is for learning and reference only. Do not use it for any illegal purpose.

# -*- coding: utf-8 -*-
# @Date:   2018-04-27 20:13:28
# @Last Modified by:   Happydong
# @Last Modified time: 2018-04-28 10:19:34
# Imports
import requests
from bs4 import BeautifulSoup
import json
import csv

#  get_Html(): fetch the page source of a URL
#  @param url string
#  @return string
     
def get_Html(url):
    # User-Agent string used to mimic a browser
    user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36"
    # Put the User-Agent into the request headers
    headers = {'User-Agent': user_agent}
    # Fetch the page source of the target URL
    r = requests.get(url, headers=headers, timeout=10)
    return r.text

# Target URL
srcUrl = 'http://china.nba.com/teams/schedule/#!/clippers'
# Call get_Html()
ghtml = get_Html(srcUrl)
# Build the BeautifulSoup object; ghtml is already a decoded str
soup = BeautifulSoup(ghtml, 'html.parser')


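# Build the list of target links
# List that holds the target links
team_links = []
# Collect the link of every <a> tag in the east and west boxes
for box in ["east-box", 'west-box']:
    for a in soup.find(class_=box).find_all('a'):
        # English team name, taken from the href
        box_name = a.get('href').strip('/')
        # Build the target JSON URL from the team name
        target_url = "http://china.nba.com/static/data/team/schedule_"+box_name+".json"
        # Append the link to the list defined above
        team_links.append(target_url)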


# Fetch the target data
# List that holds the game records
match_target = []
for link in team_links:
    r = requests.get(link, timeout=10)
    # The response body is a JSON string
    json_response = r.content.decode()
    # Decode the JSON string into a Python object
    dict_json = json.loads(json_response)

    # Process the JSON data
    for item in dict_json['payload']['monthGroups']:
        # Work out the year: months 10-12 fall in 2017, the rest in 2018
        if 9 < item['number'] < 13:
            scheduleYear = '2017'
        else:
            scheduleYear = '2018'
        # Game time (year + month name)
        match_scheduleYM = scheduleYear + item['name']
        # Process each game
        for i in item['games']:
            # Away team name
            awayTeam = i['awayTeam']['profile']['displayAbbr']
            # Home team name
            homeTeam = i['homeTeam']['profile']['displayAbbr']
            # Game result
            scoreStatus = i['winOrLoss']
            # Opponent's score
            oppTeamScore = i['oppTeamScore']
            # Score of the team whose schedule is being read
            teamScore = i['teamScore']
            # Arena name
            arenaName = i['profile']['arenaName']
            # Assemble the record
            match_info = (match_scheduleYM, homeTeam+'vs'+awayTeam, str(teamScore)+'-'+str(oppTeamScore), scoreStatus, arenaName)
            # Append the record to the list
            match_target.append(match_info)

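# Save the data to a CSV file
# Header row
header_info = ['Game time', 'Home vs Away', 'Score', 'Result', 'Arena']
# Write the data; the csv module expects newline='', and an explicit
# encoding keeps non-ASCII team names intact
with open('match_box.csv', 'w', newline='', encoding='utf-8') as f:
    f_csv = csv.writer(f)
    # Write the header
    f_csv.writerow(header_info)
    # Write the records
    f_csv.writerows(match_target)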
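
As one possible optimization, the parsing loop of the script can be pulled into a pure function so that it can be checked without any network access. This is only a sketch: `season_year`, `extract_rows` and the `sample` payload are hypothetical names and made-up values, and only the key layout follows the script.

```python
def season_year(month_number):
    """Months 10-12 belong to 2017, every other month to 2018."""
    return '2017' if 9 < int(month_number) < 13 else '2018'


def extract_rows(dict_json):
    """Flatten one team's schedule JSON into a list of row tuples."""
    rows = []
    for item in dict_json['payload']['monthGroups']:
        year_month = season_year(item['number']) + item['name']
        for game in item['games']:
            home = game['homeTeam']['profile']['displayAbbr']
            away = game['awayTeam']['profile']['displayAbbr']
            score = f"{game['teamScore']}-{game['oppTeamScore']}"
            rows.append((year_month, home + 'vs' + away, score,
                         game['winOrLoss'], game['profile']['arenaName']))
    return rows


# Hand-built sample with made-up values; only the key layout follows the script
sample = {
    'payload': {
        'monthGroups': [{
            'number': 10,
            'name': '10月',
            'games': [{
                'homeTeam': {'profile': {'displayAbbr': 'AAA'}},
                'awayTeam': {'profile': {'displayAbbr': 'BBB'}},
                'teamScore': 100,
                'oppTeamScore': 90,
                'winOrLoss': 'Won',
                'profile': {'arenaName': 'Sample Arena'},
            }],
        }]
    }
}

print(extract_rows(sample))
```

With this split, the download loop only has to call `extract_rows(json.loads(...))` for each link and extend `match_target` with the result.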