探索性数据分析-RedCard-EDA

作者: ForgetThatNight | 来源:发表于2018-07-03 22:54 被阅读67次

探索性数据分析-RedCard-EDA
数据的探索性分析
新闻推荐02——探索性数据分析
探索性数据分析和数据可视化（上）
ML入门——EDA探索性数据分析（中）(Seaborn)
ML入门——EDA探索性数据分析（下）(特征工程)
ML入门——EDA探索性数据分析（上）
非参数探索性空间数据分析法（ESDA）笔记
如何用4行 R 语句，快速探索你的数据集？
R for data science ||探索性数据分析

1-Redcard-Dataset

探索性数据分析（EDA）

有趣的数据集，包含球员和裁判之间的故事！

数据集介绍点击： here.

任务：

探索性数据分析（EDA）. 挑战目标: 这些裁判在给红牌的时候咋想的呢，会不会被跟球员的肤色有关?

%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

import missingno as msno
import pandas_profiling

from sklearn.datasets import make_blobs
import time

数据简介：

数据包含球员和裁判的信息，2012-2013年的比赛数据，总共设计球员2053名，裁判3147名，特征列表如下：

-- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit

# Uncomment one of the following lines and run the cell:

df = pd.read_csv("redcard.csv.gz", compression='gzip')

df.shape

输出 (146028, 28)

df.head()

df.describe().T

df.dtypes

all_columns = df.columns.tolist()
all_columns

输出:
['playerShort',
'player',
'club',
'leagueCountry',
'birthday',
'height',
'weight',
'position',
'games',
'victories',
'ties',
'defeats',
'goals',
'yellowCards',
'yellowReds',
'redCards',
'photoID',
'rater1',
'rater2',
'refNum',
'refCountry',
'Alpha_3',
'meanIAT',
'nIAT',
'seIAT',
'meanExp',
'nExp',
'seExp']

Challenge

Before looking below, try to answer some high level questions about the dataset.

How do we operationalize the question of referees giving more red cards to dark skinned players?

Counterfactual: if the player were lighter, a ref is more likely to have given a yellow or no card for the same offense under the same conditions
Regression: accounting for confounding, darker players have positive coefficient on regression against proportion red/total card

Potential issues

How to combine rater1 and rater2? Average them? What if they disagree? Throw it out?
Is data imbalanced, i.e. red cards are very rare?
Is data biased, i.e. players have different amounts of play time? Is this a summary of their whole career?
How do I know I've accounted for all forms of confounding?

First, is there systematic discrimination across all refs?

Exploration/hypotheses:

Distribution of games played
red cards vs games played
Reds per game played vs total cards per game played by skin color
Distribution of # red, # yellow, total cards, and fraction red per game played for all players by avg skin color
How many refs did players encounter?
Do some clubs play more aggresively and get carded more? Or are more reserved and get less?
Does carding vary by leagueCountry?
Do high scorers get more slack (fewer cards) for the same position?
Are there some referees that give more red/yellow cards than others?
how consistent are raters? Check with Cohen's kappa.
how do red cards vary by position? e.g. defenders get more?
Do players with more games get more cards, and is there difference across skin color?
indication of bias depending on refCountry?

Understand how the data's organized

The dataset is a single csv where it aggregated every interaction between referee and player into a single row. In other words: Referee A refereed Player B in, say, 10 games, and gave 2 redcards during those 10 games. Then there would be a unique row in the dataset that said:

Referee A, Player B, 2 redcards, ...

This has several implications that make this first step to understanding and dealing with this data a bit tricky. First, is that the information about Player B is repeated each time -- meaning if we did a simple average of some metric of we would likely get a misleading result.

For example, asking "what is the average weight of the players?"

df['height'].mean()

输出 181.93593798236887

df['height'].mean()

输出 181.93593798236887

np.mean(df.groupby('playerShort').height.mean())

输出 181.74372848007872

Tidy Data

Hadley Wickham's concept of a tidy dataset summarized as:

Each variable forms a column

Each observation forms a row

Each type of observational unit forms a table

A longer paper describing this can be found in this pdf.

Having datasets in this form allows for much simpler analyses. So the first step is to try and clean up the dataset into a tidy dataset.

The first step that I am going to take is to break up the dataset into the different observational units. By that I'm going to have separate tables (or dataframes) for:

players
clubs
referees
countries
dyads

Create Tidy Players Table

df2 = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
     'key2':['one', 'two', 'one', 'two', 'one'],
     'data1':np.random.randn(5),
     'data2':np.random.randn(5)})
df2

grouped = df2['data1'].groupby(df['key1'])
grouped.mean()

输出:
key1
a -0.093686
b -0.322711
Name: data1, dtype: float64

player_index = 'playerShort'
player_cols = [#'player', # drop player name, we have unique identifier
               'birthday',
               'height',
               'weight',
               'position',
               'photoID',
               'rater1',
               'rater2',
              ]

# Count the unique variables (if we got different weight values, 
# for example, then we should get more than one unique value in this groupby)
all_cols_unique_players = df.groupby('playerShort').agg({col:'nunique' for col in player_cols})

all_cols_unique_players.head()

# If all values are the same per player then this should be empty (and it is!)
all_cols_unique_players[all_cols_unique_players > 1].dropna().head()

# A slightly more elegant way to test the uniqueness
all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0

输出 True

def get_subgroup(dataframe, g_index, g_columns):
    
    """Helper function that creates a sub-table from the columns and runs a quick uniqueness test."""
    g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
    if g[g > 1].dropna().shape[0] != 0:
        print("Warning: you probably assumed this had all unique values but it doesn't.")
    return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})

players = get_subgroup(df, player_index, player_cols)
players.head()

def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test!!! Double-check.")

players = get_subgroup(df, player_index, player_cols)
players.head()

save_subgroup(players, player_index, "players")

输出 Test-passed: we recover the equivalent subgroup dataframe.

Create Tidy Clubs Table

Create the clubs table.

club_index = 'club'
club_cols = ['leagueCountry']
clubs = get_subgroup(df, club_index, club_cols)
clubs.head()

clubs['leagueCountry'].value_counts()

输出 :
England 48
Spain 27
France 22
Germany 21
Name: leagueCountry, dtype: int64

save_subgroup(clubs, club_index, "clubs", )

输出 Test-passed: we recover the equivalent subgroup dataframe.

Create Tidy Referees Table

referee_index = 'refNum'
referee_cols = ['refCountry']
referees = get_subgroup(df, referee_index, referee_cols)
referees.head()

referees.refCountry.nunique()

输出 161

referees.tail()

referees.shape

输出 (3147, 1)

save_subgroup(referees, referee_index, "referees")

输出 Test-passed: we recover the equivalent subgroup dataframe.

Create Tidy Countries Table

country_index = 'refCountry'
country_cols = ['Alpha_3', # rename this name of country
                'meanIAT',
                'nIAT',
                'seIAT',
                'meanExp',
                'nExp',
                'seExp',
               ]
countries = get_subgroup(df, country_index, country_cols)
countries.head()

rename_columns = {'Alpha_3':'countryName', }
countries = countries.rename(columns=rename_columns)
countries.head()

countries.shape

输出 (161, 7)

save_subgroup(countries, country_index, "countries")

输出 Warning -- equivalence test!!! Double-check.

Create separate (not yet Tidy) Dyads Table

This is one of the more complex tables to reason about -- so we'll save it for a bit later.

dyad_index = ['refNum', 'playerShort']
dyad_cols = ['games',
             'victories',
             'ties',
             'defeats',
             'goals',
             'yellowCards',
             'yellowReds',
             'redCards',
            ]

dyads = get_subgroup(df, g_index=dyad_index, g_columns=dyad_cols)
dyads.head()

dyads.shape

输出 (146028, 8)

dyads[dyads.redCards > 1].head(10)

save_subgroup(dyads, dyad_index, "dyads")

输出 Test-passed: we recover the equivalent subgroup dataframe.

dyads.redCards.max()

输出 2

2-Redcard-Players

%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

import missingno as msno
import pandas_profiling

from sklearn.datasets import make_blobs
import time

def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test!!! Double-check.")

def load_subgroup(filename, index_col=[0]):
    return pd.read_csv(filename, compression='gzip', index_col=index_col)

Players

# players = load_subgroup("../data/redcard/raw_players.csv.gz")
players = load_subgroup("raw_players.csv.gz")

players.head()

players.shape

输出 (2053, 7)

Visualize the missing-ness of data

https://github.com/ResidentMario/missingno

msno.matrix(players.sample(500),
            figsize=(16, 7),
            width_ratios=(15, 1))

msno.heatmap(players.sample(500),
            figsize=(16, 7),)

ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).

print("All players:", len(players))
print("rater1 nulls:", len(players[(players.rater1.isnull())]))
print("rater2 nulls:", len(players[players.rater2.isnull()]))
print("Both nulls:", len(players[(players.rater1.isnull()) & (players.rater2.isnull())]))

输出:
All players: 2053
rater1 nulls: 468
rater2 nulls: 468
Both nulls: 468

# modifying dataframe
players = players[players.rater1.notnull()]
players.shape[0]

输出 1585

2053-1585

输出 468
We've removed 468 players from the table who had no skintone rating.
Let's look again at the missing data in this table.

msno.matrix(players.sample(500),
            figsize=(16, 7),
            width_ratios=(15, 1))

Each remaining player in the table has 2 skintone ratings -- are they sufficiently close that they can be combined?

pd.crosstab(players.rater1, players.rater2)

fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(pd.crosstab(players.rater1, players.rater2), cmap='Blues', annot=True, fmt='d', ax=ax)
ax.set_title("Correlation between Rater 1 and Rater 2\n")
fig.tight_layout()

Create useful new columns

This looks correlated enough to me -- let's combine the rater's skintone ratings into a new column that is the average rating.

players.head()

# modifying dataframe
players['skintone'] = players[['rater1', 'rater2']].mean(axis=1)
players.head()

Visualize distributions of univariate features

What is the skintone distribution?

sns.distplot(players.skintone, kde=False);

Positions

Might the player's position correlate with the baseline susceptibility to redcards? Likely that a defender would have a higher rate than a keeper, for example.

MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players.position.value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("Position")
ax.set_xlabel("Counts")
fig.tight_layout()

Create higher level categories

Intuitively, the different positions in the field probably have different redcard rates, but we have data that's very granular.

Recommendation:

create a new column
Don't overwrite the original data in case you need it or decide later that the higher level category is not useful

I chose to split up the position types by their primary roles (you can disagree with my categorization and do it differently if you feel).

position_types = players.position.unique()
position_types

输出:
array(['Center Back', 'Attacking Midfielder', 'Right Midfielder',
'Center Midfielder', 'Goalkeeper', 'Defensive Midfielder',
'Left Fullback', nan, 'Left Midfielder', 'Right Fullback',
'Center Forward', 'Left Winger', 'Right Winger'], dtype=object)

defense = ['Center Back','Defensive Midfielder', 'Left Fullback', 'Right Fullback', ]
midfield = ['Right Midfielder', 'Center Midfielder', 'Left Midfielder',]
forward = ['Attacking Midfielder', 'Left Winger', 'Right Winger', 'Center Forward']
keeper = 'Goalkeeper'

# modifying dataframe -- adding the aggregated position categorical position_agg
players.loc[players['position'].isin(defense), 'position_agg'] = "Defense"
players.loc[players['position'].isin(midfield), 'position_agg'] = "Midfield"
players.loc[players['position'].isin(forward), 'position_agg'] = "Forward"
players.loc[players['position'].eq(keeper), 'position_agg'] = "Keeper"

MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players['position_agg'].value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("position_agg")
ax.set_xlabel("Counts")
fig.tight_layout()

Examine pair-wise relationships

Take a look at measures that will let you quickly see if there are problems or opportunities in the data.

from pandas.tools.plotting import scatter_matrix
fig, ax = plt.subplots(figsize=(10, 10))
scatter_matrix(players[['height', 'weight', 'skintone']], alpha=0.2, diagonal='hist', ax=ax);

Perhaps you want to see a particular relationship more clearly

fig, ax = plt.subplots(figsize=MIDSIZE)
sns.regplot('weight', 'height', data=players, ax=ax)
ax.set_ylabel("Height [cm]")
ax.set_xlabel("Weight [kg]")
fig.tight_layout()

There aren't obvious outliers in the height vs weight relationship. Things that I check for:

Are there zeros?
Are there strange correlations?
Are there separate clusters (something that could come from recording the height in inches for some and in cm for others, for example)?

Create quantile bins for continuous variables

weight_categories = ["vlow_weight",
                     "low_weight",
                     "mid_weight",
                     "high_weight",
                     "vhigh_weight",
                    ]

players['weightclass'] = pd.qcut(players['weight'],
                                 len(weight_categories),
                                 weight_categories)

players.head()

height_categories = ["vlow_height",
                     "low_height",
                     "mid_height",
                     "high_height",
                     "vhigh_height",
                    ]

players['heightclass'] = pd.qcut(players['height'],
                                 len(height_categories),
                                 height_categories)

print (players['skintone'])
pd.qcut(players['skintone'], 3)

输出

playerShort
aaron-hughes              0.125
aaron-hunt                0.125
aaron-lennon              0.250
aaron-ramsey              0.000
abdelhamid-el-kaoutari    0.250
abdou-traore_2            0.750
abdoulaye-diallo_2        0.875
abdoulaye-keita_2         0.875
abdoulwhaid-sissoko       1.000
adil-rami                 0.125
adrian-colunga            0.250
adrian-mutu               0.250
adrian_2                  0.125
adrian_7                  0.250
                          ...  
yoann-gourcuff            0.125
yohan-cabaye              0.000
yohandry-orozco           0.500
yossi-benayoun            0.250
younes-belhanda           0.250
younes-kaboul             0.500
youssef-el-arabi          0.625
yunus-malli               0.125
zdenk-pospch              0.125
zdravko-kuzmanovic        0.000
ze-castro                 0.250
zhi-gin-lam               0.250
zubikarai                 0.000
zurutuza                  0.000
Name: skintone, Length: 1585, dtype: float64
Out[78]:
playerShort
aaron-hughes              (-0.001, 0.125]
aaron-hunt                (-0.001, 0.125]
                               ...       
zlatan-alomerovic         (-0.001, 0.125]
zlatan-ibrahimovic          (0.125, 0.25]
zlatko-junuzovic          (-0.001, 0.125]
zoltan-gera                 (0.125, 0.25]
zoltan-stieber            (-0.001, 0.125]
zoumana-camara                (0.25, 1.0]
zubikarai                 (-0.001, 0.125]
zurutuza                  (-0.001, 0.125]
Name: skintone, Length: 1585, dtype: category
Categories (3, interval[float64]): [(-0.001, 0.125] < (0.125, 0.25] < (0.25, 1.0]]

players['skintoneclass'] = pd.qcut(players['skintone'], 3)

Pandas profiling

There is a library that gives a high level overview -- https://github.com/JosPolfliet/pandas-profiling

pandas_profiling.ProfileReport(players)

Notice a few things -- it automatically finds highly correlated features -- is that what we want?
In this case no -- we want the skintone column, and the fact that it's highly correlated with rater1 and rater2, 1, makes sense, but 2, should be consciously selected.

Question -- What to do with birthday column?

Some points to consider:

this is a career-long dataset of redcards as of 2012-2013 season
the age of the player at 2012 does not (necessarily) correspond to the date of receiving a redcard

players.birthday.head()

输出:
playerShort
aaron-hughes 08.11.1979
aaron-hunt 04.09.1986
aaron-lennon 16.04.1987
aaron-ramsey 26.12.1990
abdelhamid-el-kaoutari 17.03.1990
Name: birthday, dtype: object

# modifying dataframe
players['birth_date'] = pd.to_datetime(players.birthday, format='%d.%m.%Y')
players['age_years'] = ((pd.to_datetime("2013-01-01") - players['birth_date']).dt.days)/365.25
players['age_years']

输出

playerShort
aaron-hughes              33.149897
aaron-hunt                26.327173
aaron-lennon              25.713895
abdoulwhaid-sissoko       22.787132
abdul-rahman-baba         18.502396
abdul-razak               20.139630
                            ...    
xisco_2                   26.518823
yacine-brahimi            22.896646
yann-mvila                22.518823
zubikarai                 28.848734
zurutuza                  26.455852
Name: age_years, Length: 1585, dtype: float64

Select variables to (possibly) use

players.head()

players_cleaned_variables = players.columns.tolist()
players_cleaned_variables

输出:
['position',
'rater2',
'birthday',
'height',
'weight',
'photoID',
'rater1',
'skintone',
'position_agg',
'weightclass',
'heightclass',
'skintoneclass',
'birth_date',
'age_years']

players_cleaned_variables = [#'birthday',
                             'height',
                             'weight',
#                              'position',
#                              'photoID',
#                              'rater1',
#                              'rater2',
                             'skintone',
                             'position_agg',
                             'weightclass',
                             'heightclass',
                             'skintoneclass',
#                              'birth_date',
                             'age_years']
pandas_profiling.ProfileReport(players[players_cleaned_variables])

players[players_cleaned_variables].to_csv("cleaned_players.csv.gz", compression='gzip')

Challenge: Do the same process we did above but for the Referee, Clubs, and Country dataframes

3-Redcard-Dyads

%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

import missingno as msno
import pandas_profiling

from sklearn.datasets import make_blobs
import time

def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test!!! Double-check.")

def load_subgroup(filename, index_col=[0]):
    return pd.read_csv(filename, compression='gzip', index_col=index_col)

Tidy Dyads and Starting Joins

clean_players = load_subgroup("cleaned_players.csv.gz")
players = load_subgroup("raw_players.csv.gz", )
countries = load_subgroup("raw_countries.csv.gz")
referees = load_subgroup("raw_referees.csv.gz")
agg_dyads = pd.read_csv("raw_dyads.csv.gz", compression='gzip', index_col=[0, 1])

agg_dyads.head(10)

Test if the number of games is equal to the victories + ties + defeats in the dataset

all(agg_dyads['games'] == agg_dyads.victories + agg_dyads.ties + agg_dyads.defeats)

输出 True
Sanity check passes

len(agg_dyads.reset_index().set_index('playerShort'))

输出 146028

agg_dyads['totalRedCards'] = agg_dyads['yellowReds'] + agg_dyads['redCards']
agg_dyads.rename(columns={'redCards': 'strictRedCards'}, inplace=True)

agg_dyads.head()

Remove records that come from players who don't have a skintone rating

There are a couple of ways to do this -- set operations and joins are two ways demonstrated below:

clean_players.head()

agg_dyads.head()

agg_dyads.reset_index().head()

agg_dyads.reset_index().set_index('playerShort').head()

player_dyad = (clean_players.merge(agg_dyads.reset_index().set_index('playerShort'),
                                   left_index=True,
                                   right_index=True))
player_dyad.head()

clean_dyads = (agg_dyads.reset_index()[agg_dyads.reset_index()
                                   .playerShort
                                   .isin(set(clean_players.index))
                                  ]).set_index(['refNum', 'playerShort'])
clean_dyads.head()

clean_dyads.shape, agg_dyads.shape, player_dyad.shape

输出 ((124621, 9), (146028, 9), (124621, 18))

Disaggregate

The dyads are currently an aggregated metric summarizing all times a particular referee-player pair play were matched. To properly handle the data, we have to disaggregate the data into a tidy/long format. This means that each game is a row.

# inspired by https://github.com/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb
colnames = ['games', 'totalRedCards']
j = 0
out = [0 for _ in range(sum(clean_dyads['games']))]

for index, row in clean_dyads.reset_index().iterrows():
    n = row['games']
    d = row['totalRedCards']
    ref = row['refNum']
    player = row['playerShort']
    for _ in range(n):
        row['totalRedCards'] = 1 if (d-_) > 0 else 0
        rowlist=list([ref, player, row['totalRedCards']])
        out[j] = rowlist
        j += 1

tidy_dyads = pd.DataFrame(out, columns=['refNum', 'playerShort', 'redcard'],).set_index(['refNum', 'playerShort'])

# 3092
tidy_dyads.redcard.sum()

输出 3092

# Notice this is longer than before
clean_dyads.games.sum()

输出 373067

tidy_dyads.shape

输出 (373067, 1)

# Ok, this is a bit crazy... tear it apart and figure out what each piece is doing if it's not clear
clean_referees = (referees.reset_index()[referees.reset_index()
                                                 .refNum.isin(tidy_dyads.reset_index().refNum
                                                                                       .unique())
                                        ]).set_index('refNum')

clean_referees.shape, referees.shape

输出 ((2978, 1), (3147, 1))

clean_countries = (countries.reset_index()[countries.reset_index()
                                           .refCountry
                                           .isin(clean_referees.refCountry
                                                 .unique())
                                          ].set_index('refCountry'))

clean_countries.shape, countries.shape

输出 ((160, 7), (161, 7))

tidy_dyads.head()

tidy_dyads.to_csv("cleaned_dyads.csv.gz", compression='gzip')
tidy_dyads.shape

4-Redcard-final-joins

%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

import missingno as msno
import pandas_profiling

from sklearn.datasets import make_blobs
import time

# Uncomment one of the following lines and run the cell:

# df = pd.read_csv("redcard.csv.gz", compression='gzip')
# df = pd.read_csv("https://github.com/cmawer/pycon-2017-eda-tutorial/raw/master/data/redcard/redcard.csv.gz", compression='gzip')

def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test!!! Double-check.")

def load_subgroup(filename, index_col=[0]):
    return pd.read_csv(filename, compression='gzip', index_col=index_col)

clean_players = load_subgroup("cleaned_players.csv.gz")
players = load_subgroup("raw_players.csv.gz", )
countries = load_subgroup("raw_countries.csv.gz")
referees = load_subgroup("raw_referees.csv.gz")
agg_dyads = pd.read_csv("raw_dyads.csv.gz", compression='gzip', index_col=[0, 1])
# tidy_dyads = load_subgroup("cleaned_dyads.csv.gz")
tidy_dyads = pd.read_csv("cleaned_dyads.csv.gz", compression='gzip', index_col=[0, 1])

Joining and further considerations

clean_players = load_subgroup("cleaned_players.csv.gz")
temp = tidy_dyads.reset_index().set_index('playerShort').merge(clean_players, left_index=True, right_index=True)
temp.shape

输出 (373067, 10)

tidy_dyads.head()

(tidy_dyads.groupby(level=0)
           .sum()
           .sort_values('redcard', ascending=False)
           .rename(columns={'redcard':'total redcards given'})).head()

(tidy_dyads.groupby(level=1)
           .sum()
           .sort_values('redcard', ascending=False)
           .rename(columns={'redcard':'total redcards received'})).head()

tidy_dyads.head()

total_ref_games = tidy_dyads.groupby(level=0).size().sort_values(ascending=False)
total_player_games = tidy_dyads.groupby(level=1).size().sort_values(ascending=False)
total_ref_given = tidy_dyads.groupby(level=0).sum().sort_values(ascending=False,by='redcard')
total_player_received = tidy_dyads.groupby(level=1).sum().sort_values(ascending=False, by='redcard')

sns.distplot(total_player_received, kde=False);

sns.distplot(total_ref_given, kde=False);

tidy_dyads.groupby(level=1).sum().sort_values(ascending=False, by='redcard').head()

tidy_dyads.sum()

输出:
redcard 3092
dtype: int64

tidy_dyads.sum(), tidy_dyads.count(), tidy_dyads.sum()/tidy_dyads.count()

输出:
(redcard 3092
dtype: int64, redcard 373067
dtype: int64, redcard 0.008288
dtype: float64)

player_ref_game = (tidy_dyads.reset_index()
                               .set_index('playerShort')
                                       .merge(clean_players,
                                              left_index=True,
                                              right_index=True)
                  )
player_ref_game.head()

player_ref_game.shape

输出 (373067, 10)

bootstrap = pd.concat([player_ref_game.sample(replace=True, 
                                              n=10000).groupby('skintone').mean() 
                       for _ in range(100)])
player_ref_game.sample(replace=True,n=10000).groupby('skintone').mean()

ax = sns.regplot(bootstrap.index.values,
                 y='redcard',
                 data=bootstrap,
                 lowess=True,
                 scatter_kws={'alpha':0.4,},
                 x_jitter=(0.125 / 4.0))
ax.set_xlabel("Skintone");

探索性数据分析-RedCard-EDA
1-Redcard-Dataset 探索性数据分析（EDA）有趣的数据集，包含球员和裁判之间的故事！数据集介绍...
数据的探索性分析
数据的探索性分析探索性数据分析（Exploratory Data Analysis，EDA）是指对已有数据在尽量...
新闻推荐02——探索性数据分析
1. 探索性数据分析探索性数据分析( Exp lorat。可Data Analysis , EDA) 是采用各种...
探索性数据分析和数据可视化（上）
1、什么是探索性数据分析？探索性数据分析的概念（1）简称----Exploratory Data Analysi...
ML入门——EDA探索性数据分析（中）(Seaborn)
本系列其他全部链接：ML入门——EDA探索性数据分析（上）ML入门——EDA探索性数据分析（中）(Seaborn)...
ML入门——EDA探索性数据分析（下）(特征工程)
本系列其他全部链接：ML入门——EDA探索性数据分析（上）ML入门——EDA探索性数据分析（中）(Seaborn)...
ML入门——EDA探索性数据分析（上）
本系列其他全部链接：ML入门——EDA探索性数据分析（上）ML入门——EDA探索性数据分析（中）(Seaborn)...
非参数探索性空间数据分析法（ESDA）笔记
基础概念： ①探索性数据分析数据分析包括探索阶段和实证阶段。探索性数据分析是在一组数据中寻求重要信息的过程，利...
如何用4行 R 语句，快速探索你的数据集？
用最简单的方式，完成探索性分析。痛点实践中，大量数据分析时间，都会花在数据清洗与探索性数据分析（Explora...
R for data science ||探索性数据分析
什么是探索性数据分析参看之前的文章：数量生态学笔记||数据探索环境与生态统计||探索性数据分析环境与生态统计||...

探索性数据分析-RedCard-EDA

1-Redcard-Dataset

探索性数据分析（EDA）

任务：

数据简介：

Challenge

Understand how the data's organized

Tidy Data

Create Tidy Players Table

Create Tidy Clubs Table

Create Tidy Referees Table

Create Tidy Countries Table

Create separate (not yet Tidy) Dyads Table

2-Redcard-Players

Players

Visualize the missing-ness of data

Create useful new columns

Visualize distributions of univariate features

Positions

Create higher level categories

Examine pair-wise relationships

Perhaps you want to see a particular relationship more clearly

Create quantile bins for continuous variables

Pandas profiling

Question -- What to do with birthday column?

Select variables to (possibly) use

Challenge: Do the same process we did above but for the Referee, Clubs, and Country dataframes

3-Redcard-Dyads

Tidy Dyads and Starting Joins

Remove records that come from players who don't have a skintone rating

Disaggregate

4-Redcard-final-joins

Joining and further considerations

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读