Completing the Kaggle Titanic Competition with TensorFlow

Author: LabVIEW_Python | Published 2021-10-03 14:33

    The Kaggle Titanic competition asks you to predict, from passenger information, whether each passenger survived. It is a classic binary classification problem.

    Step 1: Look at the data and use intuition to gain insight into it.

    import pandas as pd 
    dataset = pd.read_csv("train.csv")
    print(dataset.head())
    print(dataset.isna().sum())
    

    PassengerId 0
    Survived 0
    Pclass 0
    Name 0
    Sex 0
    Age 177
    SibSp 0
    Parch 0
    Ticket 0
    Fare 0
    Cabin 687
    Embarked 2

    Cabin has many missing values, and the cabin number has no obvious connection to survival, so this column is not used as a feature. Embarked (port of embarkation) likewise shows no strong connection and is dropped. Ticket (ticket number) is redundant with Fare (ticket price), and Fare is the better feature of the two.
    Name has no obvious connection to survival either, so it is not used; the same goes for PassengerId.

    Age has 177 missing values, so two variants are tried later: one with Age and one without.
    The resulting feature vectors:
    Variant 1: ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']
    Variant 2: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
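    Before committing to a feature set, the intuition above can be spot-checked with a quick groupby. The snippet below uses a tiny hand-made sample in place of train.csv so it runs standalone; on the real data the same two lines reveal the large survival gaps between sexes and between classes.

```python
import pandas as pd

# Tiny hand-made stand-in for train.csv (for illustration only).
sample = pd.DataFrame({
    'Sex':      ['male', 'female', 'male', 'female', 'male', 'female'],
    'Pclass':   [3, 1, 3, 2, 1, 3],
    'Survived': [0, 1, 0, 1, 1, 0],
})

# Survival rate per group: a large gap suggests the column is informative.
rate_by_sex = sample.groupby('Sex')['Survived'].mean()
rate_by_class = sample.groupby('Pclass')['Survived'].mean()
print(rate_by_sex)
print(rate_by_class)
```

    Run against the real train.csv, this kind of check is what justifies keeping Sex and Pclass while dropping columns such as Cabin.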

    Step 2: Data cleaning. The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting.

    • Correcting: fix implausible or invalid values
    • Completing: fill in missing values. For quantitative data the basic approach is imputation with the mean, the median, or the mean plus random noise scaled by the standard deviation; here, Age is imputed with the median.
    • Creating: feature engineering, i.e. deriving new features from existing ones
    • Converting: remove the effect of units, e.g. by normalization
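    Applied to Titanic-style columns, the four steps might look like the sketch below. This is a minimal illustration on a synthetic frame, not the article's full pipeline; the negative fare and the engineered FamilySize column are hypothetical examples, not values or features used in the code later.

```python
import pandas as pd

df = pd.DataFrame({
    'Age':   [22.0, None, 35.0, None],
    'Fare':  [7.25, 71.28, -1.0, 8.05],   # -1.0 stands in for a bad record
    'Sex':   ['male', 'female', 'female', 'male'],
    'SibSp': [1, 1, 0, 0],
    'Parch': [0, 0, 0, 2],
})

# Correcting: an impossible negative fare becomes missing, to be imputed.
df.loc[df['Fare'] < 0, 'Fare'] = None

# Completing: impute quantitative gaps with the median.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Creating: engineer a new feature from existing ones (hypothetical example).
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Converting: encode Sex numerically and z-score the fares.
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})
df['Fare'] = (df['Fare'] - df['Fare'].mean()) / df['Fare'].std()

print(df)
```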

    Step 3: Since this is a binary classification problem, once the feature vector has been extracted by hand we can build and train a model with Keras. The Python code is as follows:

    import pandas as pd 
    import numpy as np 
    import tensorflow as tf
    import matplotlib.pyplot as plt
    from tensorflow.keras import regularizers
    
    raw_train_dataset = pd.read_csv("train.csv")
    raw_test_dataset = pd.read_csv("test.csv")
    
    features_test = ['PassengerId','Pclass','Sex','Fare','Age']
    features_train = features_test + ['Survived']
    
    def preprocess(raw_dataset, features, train=True):
        if 'Age' in features:
            raw_dataset['Age'] = raw_dataset['Age'].fillna(raw_dataset['Age'].median())
        raw_dataset['Fare'] = raw_dataset['Fare'].fillna(raw_dataset['Fare'].median())
    
        dataset = raw_dataset[features].copy()  # work on a copy to avoid SettingWithCopyWarning
        dataset.replace('male', 1, inplace=True)
        dataset.replace('female', 0, inplace=True)
        dataset_withoutna = dataset.dropna()
    
        if train:
            labels = dataset_withoutna['Survived']
            dataset_withoutna.pop('PassengerId')
            dataset_withoutna.pop('Survived')
            #print(dataset_withoutna.head())
            train_stats = dataset_withoutna.describe()
            train_stats = train_stats.transpose()
            normed_train_data = (dataset_withoutna - train_stats['mean']) / train_stats['std']
            #print(normed_train_data.head())
            return np.array(normed_train_data), np.array(labels)
        else:
            labels = dataset.pop('PassengerId')
            dataset.fillna(0, inplace=True)
            train_stats = dataset.describe()
            train_stats = train_stats.transpose()
            normed_test_data = (dataset - train_stats['mean']) / train_stats['std']
            return np.array(normed_test_data), np.array(labels)
    
    train_dataset, labels = preprocess(raw_train_dataset, features_train)
    print(train_dataset.shape, labels.shape)
    
    test_dataset, passenger_id = preprocess(raw_test_dataset, features_test, train=False)
    print(test_dataset.shape, passenger_id.shape)
    
    
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(64, activation='relu', input_shape=(train_dataset.shape[1],), kernel_regularizer=regularizers.l2(0.001)),
            tf.keras.layers.Dense(32, activation='relu',kernel_regularizer=regularizers.l2(0.001)),
            tf.keras.layers.Dense(16, activation='relu'),
            tf.keras.layers.Dense(1, name='prediction')
        ]
    )
    
    base_lr = 0.001
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=base_lr), 
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        metrics=['accuracy'])
    
    #early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    
    history = model.fit(
        train_dataset,labels,
        epochs=400,
        validation_split=0.2,
        batch_size=32,
        verbose=0,
        #callbacks=[early_stop]
    )
    
    def plot_history(history):
        hist = pd.DataFrame(history.history)
        hist['epoch'] = history.epoch
    
        plt.figure()
        plt.xlabel('Num of Epochs')
        plt.ylabel('value')
        plt.plot(hist['epoch'], hist['loss'],
               label='Loss')
        plt.plot(hist['epoch'], hist['val_loss'],
               label = 'val_loss')
        plt.ylim([0,5])
        plt.legend()
        plt.show()
    
    # Note: this evaluates on the training set, so it reports training accuracy.
    loss, accuracy = model.evaluate(train_dataset, labels, verbose=2)
    print("Training accuracy:", accuracy)
    plot_history(history)
    
    predictions = model.predict(test_dataset)
    predictions = (tf.sigmoid(predictions).numpy().flatten() > 0.5).astype(int)
    print(predictions.shape, predictions)
    
    output = pd.DataFrame({'PassengerId':passenger_id, 'Survived':predictions})
    output.to_csv("submission.csv", index=False)
    print("Your submission was successfully saved!")
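    One detail worth flagging in the code above: the test set is normalized with its own mean and standard deviation, while the training set uses the training statistics. A common refinement is to compute the statistics on the training data only and reuse them for the test data, so both splits land in the same coordinate system. A minimal sketch of that idea, standalone and using synthetic arrays rather than the Titanic frames:

```python
import numpy as np

# Synthetic stand-ins for the preprocessed train/test feature matrices.
rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
test = rng.normal(loc=10.0, scale=3.0, size=(20, 4))

# Fit the normalization on the training data only...
mean = train.mean(axis=0)
std = train.std(axis=0)

# ...and apply the same statistics to both splits.
train_normed = (train - mean) / std
test_normed = (test - mean) / std

# Training columns are exactly zero-mean/unit-std; test columns are only
# approximately so, which is expected when reusing training statistics.
print(train_normed.mean(axis=0), test_normed.mean(axis=0))
```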
    

          Original link: https://www.haomeiwen.com/subject/qbfmnltx.html