pandas 3
Making Pivot Tables
Pivot tables provide an easy way to group data by one column and then apply a calculation, such as a sum or a mean, to another column.
In the previous screen, we actually made a pivot table manually by grouping by the column "pclass" and then calculating the mean of the "fare" column for each class.
Luckily, we can use the DataFrame.pivot_table() method instead, which simplifies the kind of work we did on the last screen. To produce the same data, we need only one line:
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)
The index parameter tells the method which column to group by.
The values parameter is the column that we want to apply the calculation to, and aggfunc specifies the calculation we want to perform.
The default for the aggfunc parameter is the mean, so we can omit this parameter when a mean is what we want to calculate.
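As a rough illustration of the equivalence described above, the sketch below compares the manual group-then-mean approach with pivot_table(). The toy DataFrame and its values are made up for the example and only stand in for titanic_survival.

```python
import pandas as pd

# Hypothetical stand-in for titanic_survival; the values are invented.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 30.0, 8.0, 12.0],
})

# Manual approach: group by class, then average the fare column.
manual = toy.groupby("pclass")["fare"].mean()

# pivot_table approach: aggfunc is omitted, so the default (mean) is used.
pivot = toy.pivot_table(index="pclass", values="fare")

print(manual)
print(pivot)
```

The manual version returns a Series while pivot_table() returns a DataFrame with a "fare" column, but the per-class averages are the same.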
Instructions
- Use the DataFrame.pivot_table() method to calculate the mean age for each passenger class ("pclass").
- Assign the result to passenger_age.
- Display the passenger_age pivot table using the print() function.
import numpy as np
passenger_survival = titanic_survival.pivot_table(index="pclass", values="survived")
passenger_age = titanic_survival.pivot_table(index="pclass", values="age")
print(passenger_age)
If we pass a list of column names to the values parameter instead of a single value, we can perform calculations on multiple columns at once.
We can also specify a custom calculation to be made. For instance, if we pass np.sum to the aggfunc parameter it will total the values in each column.
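A small sketch of both ideas, again on an invented toy DataFrame rather than the real titanic_survival data. It passes a list to values and uses the string alias "sum" for aggfunc, which pandas also accepts; np.sum as described in the text should behave the same way, though newer pandas versions may emit a warning for it.

```python
import pandas as pd

# Hypothetical data; column names mirror the Titanic set, values are invented.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "fare": [10.0, 20.0, 30.0, 40.0, 5.0],
    "survived": [1, 1, 0, 1, 0],
})

# Default aggfunc: the mean of each listed column per group.
means = toy.pivot_table(index="sex", values=["fare", "survived"])

# Custom aggfunc: the total of each listed column per group.
totals = toy.pivot_table(index="sex", values=["fare", "survived"], aggfunc="sum")

print(means)
print(totals)
```

Each column named in values becomes a column of the result, with one row per distinct value of the index column.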
Instructions
- Make a pivot table that calculates the total fares collected ("fare") and total number of survivors ("survived") for each embarkation port ("embarked").
- Assign the result to port_stats.
- Display port_stats using the print() function.
import numpy as np
port_stats = titanic_survival.pivot_table(index="embarked", values=["fare", "survived"], aggfunc=np.sum)
print(port_stats)
Drop Missing Values
We can use the DataFrame.dropna() method to remove missing values from a pandas DataFrame. By default, the method drops any rows that contain missing values.
The dropna() method takes an axis parameter, which indicates whether you would like to drop rows or columns.
Specifying axis=0 or axis='index' will drop any rows that have null values, while specifying axis=1 or axis='columns' will drop any columns that have null values.
Instructions
- Drop all columns in titanic_survival that have missing values and assign the result to drop_na_columns.
- Drop all rows in titanic_survival where the columns "age" or "sex" have missing values and assign the result to new_titanic_survival.
drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["sex", "age"])
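To see how axis and subset change the result, here is a minimal sketch on a four-row toy DataFrame. The data is invented for illustration; only the column names echo the Titanic set.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values scattered across columns.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [22.0, np.nan, 35.0, 41.0],
    "sex": ["male", "female", None, "female"],
    "cabin": [np.nan, "C85", np.nan, "E46"],
})

# axis=0: drop every row that has a missing value in any column.
rows_dropped = toy.dropna(axis=0)

# axis=1: drop every column that has a missing value in any row.
cols_dropped = toy.dropna(axis=1)

# subset: only look at "age" and "sex" when deciding which rows to drop,
# so a missing "cabin" alone does not remove a row.
subset_dropped = toy.dropna(axis=0, subset=["age", "sex"])

print(rows_dropped)
print(cols_dropped)
print(subset_dropped)
```

In this toy data only row D is complete, only the "name" column has no gaps, and rows A and D survive the subset-based drop. dropna() returns a new DataFrame each time and leaves toy unchanged.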