pandas 2

Learn to handle missing data using pandas and a data set on Titanic survival.

Introduction

import pandas as pd
titanic_survival = pd.read_csv("titanic_survival.csv")

Finding the Missing Data

pandas uses NaN, which stands for "not a number", to indicate a missing value.

To see which values are NaN, we can use the pd.isnull() function (pandas is imported as pd above). It takes a pandas Series and returns a Series of True and False values, much like the boolean arrays NumPy returned when we compared arrays.

sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)

We can use this resultant series to select only the rows that have null values.

sex_null_true = sex[sex_is_null]
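
As a minimal sketch of the same two steps, the example below uses a small made-up Series instead of the Titanic data, so it runs without the CSV file. The name sex_demo and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from titanic_survival.csv
sex_demo = pd.Series(["male", np.nan, "female", None])

# True where the value is missing, False everywhere else
sex_demo_is_null = pd.isnull(sex_demo)
print(sex_demo_is_null.tolist())  # [False, True, False, True]

# Indexing with the boolean Series keeps only the rows marked True
sex_demo_null_true = sex_demo[sex_demo_is_null]
print(sex_demo_null_true.index.tolist())  # [1, 3]
```

Note that pd.isnull() treats both NaN and None as missing.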

We'll use this structure to look at the null values for the "age" column.

Instructions

Count how many values in the "age" column are null:

Use pd.isnull() on the age variable to create a Series of True and False values.
Use the resulting Series to select only the elements in age that are null, and assign the result to age_null_true.
Assign the length of age_null_true to age_null_count.

Print age_null_count to see how many null values are in the "age" column.

age = titanic_survival["age"]
print(age.loc[10:20])
age_is_null = pd.isnull(age)
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)
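
The count can also be cross-checked by summing the boolean Series directly, since True counts as 1 and False as 0. The sketch below uses a made-up age_demo Series (an assumption, not the real "age" column) to show that both approaches agree.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values
age_demo = pd.Series([29.0, np.nan, 2.0, np.nan, 30.0])

# Approach from the exercise: filter, then take the length
count_by_len = len(age_demo[pd.isnull(age_demo)])

# Alternative: sum the True/False values of the mask
count_by_sum = int(pd.isnull(age_demo).sum())

print(count_by_len, count_by_sum)  # 2 2
```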

Easier Ways to Do Math

Luckily, missing data is so common that many pandas methods automatically filter it out. For example, if we use the Series.mean() method to calculate the mean of a column, missing values are not included in the calculation.

To calculate the mean age that we computed earlier, we can replace all of our code with one line:

correct_mean_age = titanic_survival["age"].mean()
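
# The longer, manual calculation that the line above replaces:
age_is_null = pd.isnull(titanic_survival["age"])
# ~ inverts the mask, keeping only the rows where age is present
good_ages = titanic_survival["age"][~age_is_null]
correct_mean_age = sum(good_ages) / len(good_ages)

# The same one-line approach works for the "fare" column:
correct_mean_fare = titanic_survival["fare"].mean()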
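
To make the difference concrete, here is a small sketch on made-up data (age_demo is a hypothetical Series, not the Titanic column). It compares a naive sum/len calculation, the manual filtered version, and Series.mean().

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
age_demo = pd.Series([20.0, np.nan, 40.0])

# Naive approach: the NaN propagates through the sum, so the result is NaN
naive_mean = sum(age_demo) / len(age_demo)

# Manual approach: drop the missing values first, then average
good_ages_demo = age_demo[~pd.isnull(age_demo)]
manual_mean = sum(good_ages_demo) / len(good_ages_demo)

# Series.mean() skips missing values by default
auto_mean = age_demo.mean()

print(naive_mean, manual_mean, auto_mean)  # nan 30.0 30.0
```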

Calculating Summary Statistics

Let's calculate more summary statistics for the data.

The pclass column indicates the cabin class for each passenger, which was either first class (1), second class (2), or third class (3).

passenger_classes = [1, 2, 3]

You'll use the list passenger_classes, which contains these values, in the following exercise.

Instructions

Use a for loop to iterate over passenger_classes. Within the for loop:

Select just the rows in titanic_survival where the pclass value equals the current loop value (this_class).

for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]

Select just the fare column for the current subset of rows.

    pclass_fares = pclass_rows["fare"]

Use the Series.mean() method to calculate the mean of this subset.

    fare_for_class = pclass_fares.mean()

Add the mean of the class to the fares_by_class dictionary with this_class as the key.

    fares_by_class[this_class] = fare_for_class

Once the loop completes, the dictionary fares_by_class should have 1, 2, and 3 as keys, with the average fares as the corresponding values.

passenger_classes = [1, 2, 3]
fares_by_class = {}

for this_class in passenger_classes:
    # Rows for passengers in the current class
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    # Fares for those rows only
    pclass_fares = pclass_rows["fare"]
    # Series.mean() skips missing fares automatically
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
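
The loop above can be tried without the CSV file on a small made-up DataFrame. In the sketch below, df_demo and its values are assumptions for illustration. The last lines cross-check the result with DataFrame.groupby(), a different technique from the loop the exercise asks for, which should give the same per-class means.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: class and fare, with one missing fare
df_demo = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [100.0, np.nan, 20.0, 10.0, 6.0],
})

fares_demo = {}
for this_class in [1, 2, 3]:
    # Select the rows for this class, then the fare column, then average
    class_rows = df_demo[df_demo["pclass"] == this_class]
    fares_demo[this_class] = class_rows["fare"].mean()

print(fares_demo[1], fares_demo[2], fares_demo[3])  # 100.0 20.0 8.0

# Cross-check: group rows by pclass and average the fares in each group
grouped_demo = df_demo.groupby("pclass")["fare"].mean()
print(grouped_demo.to_dict() == fares_demo)  # True
```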