pandas 2
Learn to handle missing data using pandas and a data set on Titanic survival.
Introduction
import pandas as pd
titanic_survival = pd.read_csv("titanic_survival.csv")
Finding the Missing Data
The pandas library uses NaN, which stands for "not a number", to indicate a missing value.
To see which values are NaN, we can use the pandas.isnull() function. It takes a pandas Series and returns a Series of True and False values, much as NumPy did when we compared arrays.
sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)
We can use this resultant series to select only the rows that have null values.
sex_null_true = sex[sex_is_null]
We'll use this structure to look at the null values for the "age" column.
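The mask-then-select pattern above can be tried without the CSV file. The following is a minimal sketch on a hypothetical toy Series (the values and the names ages, ages_is_null, and ages_null_only are made up for illustration and are not from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for one column of the Titanic file.
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 26.0])

# Boolean mask: True wherever a value is missing.
ages_is_null = pd.isnull(ages)

# Boolean indexing keeps only the positions where the mask is True.
ages_null_only = ages[ages_is_null]

print(ages_is_null.tolist())  # [False, True, False, True, False]
print(len(ages_null_only))    # 2
```

The selected Series keeps the original index labels (1 and 3 here), which is useful for tracing missing values back to their rows.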
Instructions
Count how many values in the "age" column are null:
- Use pandas.isnull() on the age variable to create a Series of True and False values.
- Use the resulting Series to select only the elements in age that are null, and assign the result to age_null_true.
- Assign the length of age_null_true to age_null_count.
Print age_null_count to see how many null values are in the "age" column.
age = titanic_survival["age"]
print(age.loc[10:20])
age_is_null = pd.isnull(age)
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)
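As an aside, the same count can be obtained by summing the Boolean mask, since True counts as 1 and False as 0. This is an alternative to the exercise's approach, not a replacement for it; the sketch below uses a hypothetical toy Series rather than the real "age" column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; three of the six values are missing.
toy_age = pd.Series([29.0, np.nan, 2.0, np.nan, np.nan, 48.0])

# Approach from the exercise: select the null elements, then take the length.
null_count_via_len = len(toy_age[pd.isnull(toy_age)])

# Alternative: sum the mask directly (True is counted as 1).
null_count_via_sum = int(toy_age.isnull().sum())

print(null_count_via_len, null_count_via_sum)  # 3 3
```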
Easier Ways to Do Math
Luckily, missing data is so common that many pandas methods automatically filter for it. For example, if we use the Series.mean() method to calculate the mean of a column, missing values will not be included in the calculation.
To calculate the mean age as we did earlier, we can replace all of our code with one line:
correct_mean_age = titanic_survival["age"].mean()
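# For comparison, the longer version from earlier, which filters out the nulls by hand:
age_is_null = pd.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][~age_is_null]
correct_mean_age = sum(good_ages) / len(good_ages)
# The same one-line approach works for the "fare" column:
correct_mean_fare = titanic_survival["fare"].mean()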
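To make the difference concrete, here is a small sketch on a hypothetical three-value Series (the data and variable names are invented for illustration). It compares Series.mean(), the hand-filtered mean, and a naive mean that does not filter at all:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
toy_age = pd.Series([20.0, np.nan, 40.0])

# Series.mean() skips missing values by default.
auto_mean = toy_age.mean()

# Hand-filtered version: drop the nulls, then average what is left.
good_ages = toy_age[~pd.isnull(toy_age)]
manual_mean = sum(good_ages) / len(good_ages)

# Naive version: any arithmetic involving NaN yields NaN.
naive_mean = sum(toy_age) / len(toy_age)

print(auto_mean, manual_mean)  # 30.0 30.0
print(math.isnan(naive_mean))  # True
```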
Calculating Summary Statistics
Let's calculate more summary statistics for the data.
The pclass column indicates the cabin class for each passenger, which was either first class (1), second class (2), or third class (3).
passenger_classes = [1, 2, 3]
You'll use the list passenger_classes, which contains these values, in the following exercise.
Instructions
Use a for loop to iterate over passenger_classes. Within the for loop:
- Select just the rows in titanic_survival where the pclass value equals the current loop value (this_class).
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
- Select just the fare column for the current subset of rows.
    pclass_fares = pclass_rows["fare"]
- Use the Series.mean() method to calculate the mean of this subset.
    fare_for_class = pclass_fares.mean()
- Add the mean of the class to the fares_by_class dictionary with this_class as the key.
    fares_by_class[this_class] = fare_for_class
Once the loop completes, the dictionary fares_by_class should have 1, 2, and 3 as keys, with the average fares as the corresponding values.
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    # Rows for passengers in the current class only
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    pclass_fares = pclass_rows["fare"]
    # mean() ignores missing fares automatically
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
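For readers who want to check the loop against another method, pandas can also compute per-group means with DataFrame.groupby. This is a different technique from the one the exercise asks for, shown here only as a cross-check. The sketch assumes a hypothetical toy DataFrame with the same "pclass" and "fare" column names; the values are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using the same column names as the Titanic file.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [100.0, 50.0, 20.0, 8.0, np.nan],
})

# Loop-based approach, as in the exercise.
fares_by_class = {}
for this_class in [1, 2, 3]:
    pclass_rows = toy[toy["pclass"] == this_class]
    fares_by_class[this_class] = pclass_rows["fare"].mean()

# groupby-based cross-check: one mean per distinct pclass value.
grouped_fares = toy.groupby("pclass")["fare"].mean().to_dict()

print(fares_by_class)  # {1: 75.0, 2: 20.0, 3: 8.0}
```

Both approaches skip the missing fare in class 3, so the two dictionaries should agree.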