1.NaN (null value) and None (missing value)
Missing data can take a few different forms:
- In Python, the None keyword and type indicate that there is no value.
- The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.
In general terms, both NaN and None can be called null values.
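A minimal sketch of how the two kinds of null value behave, using a made-up three-element Series (the variable names here are illustrative and not part of the Titanic data):

```python
import pandas

# A toy Series: one real value, one None, and one NaN
values = pandas.Series([1.0, None, float("nan")])

# pandas.isnull() treats both None and NaN as null
null_mask = pandas.isnull(values)
print(null_mask.tolist())  # [False, True, True]

# NaN never compares equal to itself, so == cannot be used to find it
nan = float("nan")
nan_equals_itself = (nan == nan)
print(nan_equals_itself)  # False
```

This is why the sections below rely on pandas.isnull() rather than on an equality test.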
2.Checking for missing (null) values: pandas.isnull(XXX)
If we want to see which values are NaN, we can use the pandas.isnull() function, which takes a pandas Series and returns a Series of True and False values, the same way that NumPy did when we compared arrays.
input
age = titanic_survival["age"]
print(age.loc[10:20])
age_is_null = pandas.isnull(age)  # True where the value is NaN or None, False otherwise
age_null_true = age[age_is_null]  # keep only the entries whose age is missing
age_null_count = len(age_null_true)
print(age_null_count)
output
10 47.0
11 18.0
12 24.0
13 26.0
14 80.0
15 NaN
16 24.0
17 50.0
18 32.0
19 36.0
20 37.0
Name: age, dtype: float64
264
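The same count can be obtained by summing the Boolean mask directly. The sketch below uses a small made-up Series as a stand-in for the "age" column, so the numbers are not from the Titanic data:

```python
import pandas

# Toy stand-in for the "age" column (values are made up)
toy_age = pandas.Series([47.0, None, 24.0, None, 80.0])

# True counts as 1 and False as 0, so summing the mask counts the nulls
toy_null_count = int(pandas.isnull(toy_age).sum())
print(toy_null_count)  # 2

# Same count as filtering with the mask and taking the length
filtered_count = len(toy_age[pandas.isnull(toy_age)])
print(filtered_count)  # 2
```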
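3.Arithmetic when null values are present
# Compute the mean age when the column contains null values
age_is_null = pandas.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][~age_is_null]  # keep only the non-null ages
correct_mean_age1 = sum(good_ages) / len(good_ages)
# Alternatively, use Series.mean(), which skips null values by default
correct_mean_age2 = titanic_survival["age"].mean()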
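To see why the null values have to be filtered out first, here is a small sketch on a made-up Series. It assumes nothing about the real data beyond the fact that a missing age is stored as NaN:

```python
import math
import pandas

toy_age = pandas.Series([20.0, None, 40.0])  # made-up ages, one missing

# Built-in sum() does not skip NaN, so the naive mean is NaN
naive_mean = sum(toy_age) / len(toy_age)
print(math.isnan(naive_mean))  # True

# Series.mean() ignores null values by default (skipna=True)
skipna_mean = float(toy_age.mean())
print(skipna_mean)  # 30.0
```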
4.Using a dictionary to compute the mean fare for each passenger class
input
passenger_classes = [1, 2, 3]  # Titanic passenger classes are 1, 2 and 3
fares_by_class = {}  # start with an empty dictionary
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival['pclass'] == this_class]  # all rows for this class
    mean_fares = pclass_rows['fare'].mean()  # mean fare for this class
    fares_by_class[this_class] = mean_fares  # store the mean under the class number
print(fares_by_class)
output
{1: 87.508991640866881, 2: 21.179196389891697, 3: 13.302888700564973}
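The same loop can be written more compactly as a dictionary comprehension. The sketch below runs on a small made-up frame (the name toy and its values are hypothetical), so it can be tried without the Titanic file:

```python
import pandas

# Toy stand-in for titanic_survival (values are made up)
toy = pandas.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 10.0, 14.0],
})

# One dictionary entry per class: class number -> mean fare of its rows
toy_fares_by_class = {
    this_class: float(toy.loc[toy["pclass"] == this_class, "fare"].mean())
    for this_class in [1, 2, 3]
}
print(toy_fares_by_class)  # {1: 90.0, 2: 20.0, 3: 12.0}
```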
5.Using DataFrame.pivot_table()
Pivot tables provide an easy way to subset by one column and then apply a calculation like a sum or a mean.
The problem from point 4 can also be solved with DataFrame.pivot_table():
- The first parameter of the method, index, tells the method which column to group by.
- The second parameter, values, is the column that we want to apply the calculation to.
- aggfunc specifies the calculation we want to perform. The default for the aggfunc parameter is actually the mean.
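The three parameters above can be sketched on a small made-up frame (toy and its values are hypothetical). Passing values as a list is a choice made here so that the result is a DataFrame regardless of pandas version, and the string "sum" is used in place of a NumPy function:

```python
import pandas

# Toy stand-in for titanic_survival (values are made up)
toy = pandas.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 10.0, 14.0],
})

# aggfunc omitted: pivot_table() falls back to the mean
mean_fares = toy.pivot_table(index="pclass", values=["fare"])

# aggfunc can also be given as a string name such as "sum"
total_fares = toy.pivot_table(index="pclass", values=["fare"], aggfunc="sum")

print(float(mean_fares.loc[1, "fare"]))   # 90.0
print(float(total_fares.loc[3, "fare"]))  # 24.0
```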
input1
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=numpy.mean)
print(passenger_class_fares)
output1
pclass
1.0 87.508992
2.0 21.179196
3.0 13.302889
Name: fare, dtype: float64
input2
passenger_age = titanic_survival.pivot_table(index="pclass", values="age",aggfunc=numpy.mean)
print(passenger_age)
output2
pclass
1.0 39.159918
2.0 29.506705
3.0 24.816367
Name: age, dtype: float64
input3
import numpy
port_stats = titanic_survival.pivot_table(index='embarked', values=["fare", "survived"], aggfunc=numpy.sum)
print(port_stats)
output3
                fare  survived
embarked
C         16830.7922     150.0
Q          1526.3085      44.0
S         25033.3862     304.0
6.Removing missing values: DataFrame.dropna()
By default, the method DataFrame.dropna() will drop any rows that contain missing values.
drop_na_rows = titanic_survival.dropna(axis=0)  # drop every row that contains a missing value
drop_na_columns = titanic_survival.dropna(axis=1)  # drop every column that contains a missing value
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])  # drop rows with a missing value in 'age' or 'sex'
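A small sketch of the three calls on a made-up frame (toy and its values are hypothetical) shows which rows and columns survive in each case:

```python
import pandas

# Toy frame: row 1 lacks an age, row 2 lacks a sex
toy = pandas.DataFrame({
    "age": [22.0, None, 30.0],
    "sex": ["male", "female", None],
    "fare": [7.25, 71.28, 8.05],
})

rows_kept = toy.dropna(axis=0)                  # only row 0 has no missing value
cols_kept = toy.dropna(axis=1)                  # only "fare" has no missing value
age_known = toy.dropna(axis=0, subset=["age"])  # rows 0 and 2 have an age

print(rows_kept.index.tolist())    # [0]
print(cols_kept.columns.tolist())  # ['fare']
print(age_known.index.tolist())    # [0, 2]
```

Note that the index labels of the surviving rows are unchanged, which matters for the loc and iloc lookups in the next section.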
7.DataFrame.loc[4] vs. DataFrame.iloc[4]
input
# We have already sorted new_titanic_survival by age
first_five_rows_1 = new_titanic_survival.iloc[5]  # the row at integer position 5 (the sixth row in the current order)
first_five_rows_2 = new_titanic_survival.loc[5]  # the row whose index label is 5
row_index_25_survived = new_titanic_survival.loc[25, 'survived']  # the 'survived' value of the row whose index label is 25
print(first_five_rows_1)
print('------------------------------------------')
print(first_five_rows_2)
output
pclass                          3
survived                        0
name         Connors, Mr. Patrick
sex                          male
age                          70.5
sibsp                           0
parch                           0
ticket                     370369
fare                         7.75
cabin                         NaN
embarked                        Q
boat                          NaN
body                          171
home.dest                     NaN
Name: 727, dtype: object
------------------------------------------
pclass                        1
survived                      1
name        Anderson, Mr. Harry
sex                        male
age                          48
sibsp                         0
parch                         0
ticket                    19952
fare                      26.55
cabin                       E12
embarked                      S
boat                          3
body                        NaN
home.dest          New York, NY
Name: 5, dtype: object
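The difference between position and label is easiest to see on a tiny sorted frame. The following is a sketch with made-up values, assuming a descending sort by age:

```python
import pandas

toy = pandas.DataFrame({"age": [30.0, 70.5, 48.0]})   # index labels 0, 1, 2
toy_sorted = toy.sort_values("age", ascending=False)  # labels are now 1, 2, 0

by_position = toy_sorted.iloc[0]  # first row in the current order
by_label = toy_sorted.loc[0]      # row whose index label is 0

print(toy_sorted.index.tolist())   # [1, 2, 0]
print(float(by_position["age"]))   # 70.5
print(float(by_label["age"]))      # 30.0
```

After sorting, iloc[0] returns the oldest passenger, while loc[0] still returns whichever row was originally labelled 0.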
8.Resetting the index: DataFrame.reset_index(drop=True)
input
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:5,0:3])
output
   pclass  survived                                               name
0     1.0       1.0               Barkworth, Mr. Algernon Henry Wilson
1     1.0       1.0  Cavendish, Mrs. Tyrell William (Julia Florence...
2     3.0       0.0                                Svensson, Mr. Johan
3     1.0       0.0                          Goldschmidt, Mr. George B
4     1.0       0.0                            Artagaveytia, Mr. Ramon
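A short sketch of what drop=True changes, again on a made-up sorted frame (names and values are hypothetical):

```python
import pandas

toy = pandas.DataFrame({"age": [30.0, 70.5, 48.0]})
toy_sorted = toy.sort_values("age", ascending=False)  # index labels: 1, 2, 0

dropped = toy_sorted.reset_index(drop=True)  # old labels are discarded
kept = toy_sorted.reset_index()              # old labels are saved in an "index" column

print(dropped.index.tolist())        # [0, 1, 2]
print(kept.columns.tolist())         # ['index', 'age']
print(float(dropped.loc[0, "age"]))  # 70.5
```

Once the index has been reset, loc[0] and iloc[0] refer to the same row again.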
9.Apply Functions Over a DataFrame
DataFrame.apply() will iterate through each column in a DataFrame and perform a function on each one. When we create our function, we give it one parameter; the apply() method passes each column to that parameter as a pandas Series.
A DataFrame can call apply() to apply a function to every column (or row).
input
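def null_count(column):
    column_null = pandas.isnull(column)  # Boolean mask: True where the value is null
    null = column[column_null]  # keep only the null entries
    return len(null)  # number of null values in this column
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)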
output
pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
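For comparison, here is a sketch of the same column-wise count on a made-up frame, next to the isnull().sum() shortcut that gives the same numbers without apply(). The names toy and count_nulls are hypothetical:

```python
import pandas

# Toy frame: one missing age, no missing fares
toy = pandas.DataFrame({
    "age": [22.0, None, 30.0],
    "fare": [7.25, 71.28, 8.05],
})

def count_nulls(column):
    # column arrives as a pandas Series, one call per DataFrame column
    return int(pandas.isnull(column).sum())

per_column = {name: int(n) for name, n in toy.apply(count_nulls).items()}
print(per_column)  # {'age': 1, 'fare': 0}

# Shortcut: build the Boolean mask for the whole frame, then sum each column
shortcut = {name: int(n) for name, n in toy.isnull().sum().items()}
print(shortcut)  # {'age': 1, 'fare': 0}
```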
10.Applying a Function to a Row
input
def age_label(row):
    age = row['age']
    if pandas.isnull(age):
        return 'unknown'
    elif age < 18:
        return 'minor'
    else:
        return 'adult'
# Use axis=1 so that the apply() method applies the function over the rows
age_labels = titanic_survival.apply(age_label, axis=1)
print(age_labels[0:5])
output
0    adult
1    minor
2    minor
3    adult
4    adult
dtype: object
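To check all three branches of the labelling function, it can be run on a tiny made-up frame with one adult, one minor and one missing age. The frame below is hypothetical and not taken from the Titanic data:

```python
import pandas

def age_label(row):
    # row arrives as a Series holding one passenger's values
    age = row["age"]
    if pandas.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

toy = pandas.DataFrame({"age": [29.0, 0.9, None], "fare": [211.3, 151.6, 7.9]})

# axis=1 hands the function one row at a time instead of one column
toy_labels = toy.apply(age_label, axis=1)
print(toy_labels.tolist())  # ['adult', 'minor', 'unknown']
```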
11.Calculating Survival Percentage by Age Group
Now that we have age labels for everyone, let's make a pivot table to find the probability of survival for each age group.
We have added an "age_labels" column to the dataframe containing the age_labels variable from the previous step.
input
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="survived")
print(age_group_survival)
output
age_labelsadult 0.387892minor 0.525974unknown 0.277567Name: survived, dtype: float64
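Because survived holds only 0 and 1, its mean within each group is the fraction of survivors, which is why the default aggfunc is enough here. A minimal sketch on a made-up frame that already carries the label column (all values are hypothetical):

```python
import pandas

# Toy frame with the label column already attached, as if by
# titanic_survival["age_labels"] = age_labels  (values below are made up)
toy = pandas.DataFrame({
    "age_labels": ["adult", "adult", "minor", "unknown"],
    "survived": [1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate
toy_survival = toy.pivot_table(index="age_labels", values=["survived"])
rates = {label: float(rate) for label, rate in toy_survival["survived"].items()}
print(rates)  # {'adult': 0.5, 'minor': 1.0, 'unknown': 0.0}
```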