The Kaggle Titanic competition asks you to predict, from each passenger's information, whether that passenger survived; it is a classic binary classification problem.
Step 1: Inspect the data and use intuition to get a first feel for it.
import pandas as pd
dataset = pd.read_csv("train.csv")
print(dataset.head())        # peek at the first few rows
print(dataset.isna().sum())  # count the missing values in each column
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
Cabin has 687 missing values, and the cabin number has no obvious relationship to survival anyway, so that column is not used as a feature. Embarked, the port of embarkation, also has no obvious relationship to survival and is dropped. Ticket (the ticket number) is largely redundant with Fare (the ticket price), and Fare is the better feature of the two.
Name has no obvious predictive value either, so it is not used; the same goes for PassengerId.
Age has 177 missing values, so two schemes are considered later: one with Age and one without it.
This yields two candidate feature vectors:
Option 1: ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']
Option 2: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
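To put numbers behind these intuitions, a quick group-by on the training data (a sanity check added here for illustration, not part of the original pipeline) shows how strongly Sex and Pclass correlate with survival:

import pandas as pd
dataset = pd.read_csv("train.csv")
# the mean of the 0/1 Survived column is the survival rate of each group
print(dataset.groupby('Sex')['Survived'].mean())
print(dataset.groupby('Pclass')['Survived'].mean())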
Step 2: Data cleaning. The 4 C's of Data Cleaning are Correcting, Completing, Creating, and Converting (a pandas sketch of each follows this list):
- Correcting: fix values that are clearly invalid or unreasonable.
- Completing: fill in missing data. For quantitative data, the basic options are imputing with the mean, the median, or the mean plus a randomized standard deviation; here, for example, Age will be imputed with the median.
- Creating: feature engineering, i.e. deriving new features from the existing ones.
- Converting: remove the effect of units, e.g. by normalization.
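A minimal pandas sketch of what the four steps might look like on this dataset (the zero-fare correction and the FamilySize feature are hypothetical illustrations; the model below uses neither):

import pandas as pd
df = pd.read_csv("train.csv")
# Correcting: e.g. treat a fare of zero as invalid and blank it out
df.loc[df['Fare'] == 0, 'Fare'] = None
# Completing: impute quantitative columns with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
# Creating: derive a new feature from existing ones
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
# Converting: z-score normalization removes the effect of units
df['Fare'] = (df['Fare'] - df['Fare'].mean()) / df['Fare'].std()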
Step 3: Since this is a binary classification problem, once the feature vectors have been extracted by hand we can build and train a model with Keras. The Python code is as follows:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras import regularizers
raw_train_dataset = pd.read_csv("train.csv")
raw_test_dataset = pd.read_csv("test.csv")
# PassengerId is kept so it can be written to the submission file; note that,
# unlike the options listed above, SibSp and Parch are not used here
features_test = ['PassengerId','Pclass','Sex','Fare','Age']
features_train = features_test + ['Survived']
def preprocess(raw_dataset, features, train=True):
    # Completing: impute missing quantitative values with the median
    if 'Age' in features:
        raw_dataset['Age'] = raw_dataset['Age'].fillna(raw_dataset['Age'].median())
    raw_dataset['Fare'] = raw_dataset['Fare'].fillna(raw_dataset['Fare'].median())
    dataset = raw_dataset[features].copy()
    # Converting: turn the categorical Sex column into a numeric one
    dataset['Sex'] = dataset['Sex'].map({'male': 1, 'female': 0})
    if train:
        dataset = dataset.dropna()
        labels = dataset.pop('Survived')
        dataset.pop('PassengerId')
        # z-score normalization using the training statistics
        train_stats = dataset.describe().transpose()
        normed_train_data = (dataset - train_stats['mean']) / train_stats['std']
        return np.array(normed_train_data), np.array(labels)
    else:
        passenger_id = dataset.pop('PassengerId')
        dataset = dataset.fillna(0)
        # note: the test set is normalized with its own statistics here
        test_stats = dataset.describe().transpose()
        normed_test_data = (dataset - test_stats['mean']) / test_stats['std']
        return np.array(normed_test_data), np.array(passenger_id)
train_dataset, labels = preprocess(raw_train_dataset, features_train)
print(train_dataset.shape, labels.shape)
test_dataset, passenger_id = preprocess(raw_test_dataset, features_test, train=False)
print(test_dataset.shape, passenger_id.shape)
model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(64, activation='relu', input_shape=(train_dataset.shape[1],), kernel_regularizer=regularizers.l2(0.001)),
        tf.keras.layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
        tf.keras.layers.Dense(16, activation='relu'),
        # no activation: this layer outputs logits, matched by from_logits=True below
        tf.keras.layers.Dense(1, name='prediction')
    ]
)
base_lr = 0.001
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=base_lr),
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
#early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
history = model.fit(
train_dataset,labels,
epochs=400,
validation_split=0.2,
batch_size=32,
verbose=0,
#callbacks=[early_stop]
)
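# Added note (not in the original code): with 400 epochs and no early
# stopping, the network can overfit. If the val_loss curve plotted below
# turns upward while loss keeps falling, re-enable the EarlyStopping
# callback above, e.g. with patience=10 and restore_best_weights=True.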
def plot_history(history):
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    plt.figure()
    plt.xlabel('Num of Epochs')
    plt.ylabel('value')
    plt.plot(hist['epoch'], hist['loss'], label='loss')
    plt.plot(hist['epoch'], hist['val_loss'], label='val_loss')
    plt.ylim([0, 5])
    plt.legend()
    plt.show()
# note: evaluating on the training set overstates real-world accuracy
loss, accuracy = model.evaluate(train_dataset, labels, verbose=2)
print("Accuracy:", accuracy)
plot_history(history)
predictions = model.predict(test_dataset)
# convert logits to probabilities, then threshold at 0.5 to get 0/1 labels
predictions = (tf.sigmoid(predictions).numpy().flatten() > 0.5).astype(int)
print(predictions.shape, predictions)
output = pd.DataFrame({'PassengerId': passenger_id, 'Survived': predictions})
output.to_csv("submission.csv", index=False)
print("Your submission was successfully saved!")