美文网首页
Kaggle实战: Ion Switching

Kaggle实战: Ion Switching

作者: 海盗船长_coco | 来源:发表于2020-04-10 10:10 被阅读0次

    比赛地址:https://www.kaggle.com/c/liverpool-ion-switching
    参考notebook:https://www.kaggle.com/cdeotte/one-feature-model-0-930
    完整代码:https://github.com/872699467/kaggle_Ion_Switching

    数据描述

    训练集数据是50秒内进行10 kHz的不连续批次采样,并记录打开的离子通道数。故一个bacth的长度为50x10k=500k。即0.0001-50.0000的数据与50.0001-100.0000的数据不同,因此在50.0000和50.0001之间是不连续的。我们的任务是建立一个模型,该模型可以预测每个时间步长信号的开放通道数。训练集数据包含10个batch,测试集数据包含4个batch。

    EDA

    一、训练集分为10个batch,每个batch的信号变化

    fig = plt.figure(figsize=(20, 5))
    step = 1000
    plt.plot(range(0, train_length, step), train_signal[0::step])
    for i in range(11):
        plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
    for i in range(10):
        plt.text(i * 500000 + 200000, 10, str(i + 1), size=20)
    plt.xlabel('Row',size=16)
    plt.ylabel('Signal',size=16)
    plt.title('Training Data Signal - 10 batches',size=20)
    plt.savefig('EDA/time_signal.png')
    plt.show()
    
    训练集signal
    二、训练集分为10个batch,每个batch的开放通道数变化
    开放的通道数即我们需要预测的数据。
    fig = plt.figure(figsize=(20, 5))
    step = 1000
    plt.plot(range(0, train_length, step), train_channel[0::step])
    for i in range(11):
        plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
    for i in range(10):
        plt.text(i * 500000 + 200000, 10, str(i + 1), size=20)
    plt.xlabel('Row', size=16)
    plt.ylabel('Channels Open', size=16)
    plt.title('Training Data Open Channels - 10 batches', size=20)
    plt.savefig('EDA/time_channel.png')
    plt.show()
    
    训练集Open Channels变化
    对训练集的数据进行分析
    从上面的图中可以看出,可以使用了5种不同的合成模型。 1、模型为通道打开最大数为1,且打开的可能性很小(batch1和2)。 2、模型为通道打开最大数为1,且打开的可能性很大(batch3和7)。 3、模型为通道打开最大数为3(batch4和8)。 4、模型为通道打开最大数为5(batch6和9)。5、模型为通道打开最大数为10(batch5和10)。 此外,第一幅图可以看到在第7、8、9、10 batch中存在噪声,造成了数据漂移,并在第2个batch的开始处产生了漂移。
    根据此处的论文,数据是综合的。 还添加了“电生理”噪声和漂移。 漂移是一种信号偏置,从而导致上面的batch 2、7、8、9、10的信号呈现为水平线形状。
    三、信号和开放通道数的关系
    for k in range(5):
        start = int(np.random.uniform(0, train_length - 5000))
        end = start + 5000
        step = 10
        print('#' * 25)
        print('### Random {} to {} '.format(start, end))
        print('#' * 25)
        plt.figure(figsize=(20, 5))
        plt.plot(range(start, end, step), train_signal[start:end][0::step])
        plt.plot(range(start, end, step), train_channel[start:end][0::step])
        # plt.savefig('EDA/signal_channel.png')
    plt.show()
    
    image.png
    image.png
    image.png
    image.png
    image.png
    黄色的线为开放的通道数量,蓝色的线为信号量。仔细查看信号和开放通道的随机间隔,以观察它们之间的关系。 我们注意到它们是高度相关的,并且一起上下移动。 因此,我们可以根据一个特征信号来预测开放信道。 唯一的麻烦是信号数据中添加了噪声造成了漂移。 因此,我们需要将其删除。
    四、测试集数据信号量变化
    plt.figure(figsize=(20, 5))
    step = 1000
    label = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
    plt.plot(range(0, test_length, step), test_signal[0::step])
    for i in range(5):
        plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
    for j in range(21):
        plt.plot([j * 100000, j * 100000], [-5, 12.5], 'r:')
    for k in range(4):
        plt.text(k * 500000 + 200000, 10, str(k + 1), size=20)
    for k in range(10):
        plt.text(k * 100000 + 40000, 7, label[k], size=16)
    plt.xlabel('Row', size=16)
    plt.ylabel('Channels Open', size=16)
    plt.title('Test Data Signal - 4 batches - 10 subsamples', size=20)
    plt.savefig('EDA/test_data.png')
    plt.show()
    
    测试集信号量变化
    对测试集数据进行分析
    从上图中可以找到对应的5个模型。而且 我们可以识别出数据添加的噪声。 batch 1似乎是5个子样本,其中分别由模型1s,max 3、max 5、1s和1f创建了A,B,C,D,E。 模型1s是具有低概率打开最多1个通道的模型。 模型1f是具有高概率打开最多1个通道的模型。 max 3、max 5、max 10分别是最多具有3、5、10个通道数的模型。 我们在子样本A,B,E,G,H,I 中观察到倾斜漂移。在batch 3中观察到抛物线的噪声。

    移除训练集噪声

    下图为倾斜漂移消除的效果。 我们还可以在第7、8、9、10 batch 中消除抛物线漂移。原本我们将只训练具有批次1、3、4、5、6的模型。但是,在删除训练漂移之后,我们可以将批次2、7、8、9、10中的数据包括在训练数据中。

    # ==========Remove Train Data Drift=============
    train_copy = train_df.copy()
    
    a = 500000
    b = 600000  # CLEAN TRAIN BATCH 2
    signal = train_copy['signal'][a:b].values
    time = train_copy['time'][a:b].values
    train_copy.loc[train_copy.index[a:b], 'signal'] = signal - 3 * (time - 50) / 10.
    
    #可视化
    batch = 2
    a = 500000 * (batch - 1)
    b = 500000 * batch
    res = 50
    plt.figure(figsize=(20, 5))
    plt.plot(range(a, b, res), train_copy['signal'][a:b][0::res])
    plt.title('Training Batch 2 without Slant Drift', size=16)
    plt.savefig('EDA/without Slant Drift.png')
    plt.figure(figsize=(20, 5))
    plt.plot(range(a, b, res), train_signal[a:b][0::res])
    plt.title('Training Batch 2 with Slant Drift', size=16)
    plt.savefig('EDA/with Slant Drift.png')
    plt.show()
    
    batch 2中含有倾斜漂移
    batch 2中删除了倾斜漂移

    移除batch 7、8、9、10噪声

    # 一元二次方程
    def f(x, low, high, mid):
        return -((-low + high) / 625) * (x - mid) ** 2 + high - low
    
    
    # CLEAN TRAIN BATCH 7
    batch = 7
    a = 500000 * (batch - 1)
    b = 500000 * batch
    train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], -1.817, 3.186, 325)
    # CLEAN TRAIN BATCH 8
    batch = 8
    a = 500000 * (batch - 1)
    b = 500000 * batch
    train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], -0.094, 4.936, 375)
    # CLEAN TRAIN BATCH 9
    batch = 9
    a = 500000 * (batch - 1)
    b = 500000 * batch
    train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], 1.715, 6.689, 425)
    # CLEAN TRAIN BATCH 10
    batch = 10
    a = 500000 * (batch - 1)
    b = 500000 * batch
    train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], 3.361, 8.45, 475)
    
    #可视化
    plt.figure(figsize=(20, 5))
    plt.plot(train_time[::1000], train_signal[::1000])
    plt.title('Training Batches 7-10 with Parabolic Drift', size=16)
    plt.savefig('EDA/with Parabolic Drift.png')
    plt.figure(figsize=(20, 5))
    plt.plot(train_copy['time'][::1000], train_copy['signal'][::1000])
    plt.title('Training Batches 7-10 without Parabolic Drift', size=16)
    plt.savefig('EDA/without Parabolic Drift.png')
    plt.show()
    
    batch 7、8、9、10含有抛物线漂移
    batch 7、8、9、10不含有抛物线漂移

    预测模型搭建

    1、开放通道数最大为1且概率低的模型(1 Slow Open Channel)
    使用训练集的batch 1和batch 2进行搭建决策树

    batch = 1
    a = 500000 * (batch - 1);
    b = 500000 * batch
    batch = 2
    c = 500000 * (batch - 1)
    d = 500000 * batch
    temp = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]])
    X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
    y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
        (-1, 1))
    
    clf1s = tree.DecisionTreeClassifier(max_depth=1)
    clf1s = clf1s.fit(X_train, y_train)
    print('Training model 1s channel')
    preds = clf1s.predict(X_train)
    print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
    
    tree_graph = tree.export_graphviz(clf1s, out_file=None, max_depth=10,
                                      impurity=False, feature_names=['signal'], class_names=['0', '1'],
                                      rounded=True, filled=True)
    graph = graphviz.Source(tree_graph)
    graph.view('Slow Open Channel')
    
    Training model 1s channel
    has f1 validation score = 0.984750462652431
    
    1 Slow Open Channel 决策树模型

    2、开放通道数最大为1且概率高的模型(1 Fast Open Channel)
    使用训练集的batch 3和batch 7进行搭建决策树

    batch = 3
    a = 500000 * (batch - 1)
    b = 500000 * batch
    batch = 7
    c = 500000 * (batch - 1)
    d = 500000 * batch
    X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
    y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
        (-1, 1))
    
    clf1f = tree.DecisionTreeClassifier(max_depth=1)
    clf1f = clf1f.fit(X_train, y_train)
    print('Training model 1f channel')
    preds = clf1f.predict(X_train)
    print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
    
    tree_graph = tree.export_graphviz(clf1f, out_file=None, max_depth=10,
                                      impurity=False, feature_names=['signal'], class_names=['0', '1'],
                                      rounded=True, filled=True)
    graph = graphviz.Source(tree_graph)
    graph.view('Fast Open Channel')
    
    Training model 1f channel
    has f1 validation score = 0.9879287335590583
    
    1 Fast Open Channel 决策树模型

    3、开放通道数最大为3的模型(maximum 3 Open Channel)
    使用训练集的batch 4和batch 8进行搭建决策树

    batch = 4
    a = 500000 * (batch - 1)
    b = 500000 * batch
    batch = 8
    c = 500000 * (batch - 1)
    d = 500000 * batch
    X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
    y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
        (-1, 1))
    
    clf3 = tree.DecisionTreeClassifier(max_leaf_nodes=4)
    clf3 = clf3.fit(X_train, y_train)
    print('Training model 3 channel')
    preds = clf3.predict(X_train)
    print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
    
    tree_graph = tree.export_graphviz(clf3, out_file=None, max_depth=10,
                                      impurity=False, feature_names=['signal'], class_names=['0', '1', '2', '3'],
                                      rounded=True, filled=True)
    graph = graphviz.Source(tree_graph)
    graph.view('Maximum 3 Open Channel')
    
    Training model 3 channel
    has f1 validation score = 0.9321291153218245
    
    maximum3 Open Channel 决策树模型

    4、开放通道数最大为5的模型(maximum 5 Open Channel)
    使用训练集的batch 6和batch 9进行搭建决策树

    batch = 6
    a = 500000 * (batch - 1)
    b = 500000 * batch
    batch = 9
    c = 500000 * (batch - 1)
    d = 500000 * batch
    X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
    y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
        (-1, 1))
    
    clf5 = tree.DecisionTreeClassifier(max_leaf_nodes=6)
    clf5 = clf5.fit(X_train, y_train)
    print('Trained model 5 channel')
    preds = clf5.predict(X_train)
    print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
    
    tree_graph = tree.export_graphviz(clf5, out_file=None, max_depth=10,
                                      impurity=False, feature_names=['signal'], class_names=['0', '1', '2', '3', '4', '5'],
                                      rounded=True, filled=True)
    graph = graphviz.Source(tree_graph)
    graph.view('Maximum 5 Open Channel')
    
    Trained model 5 channel
    has f1 validation score = 0.9538294064086686
    
    maximum 5 Open Channel决策树模型

    5、开放通道数最大为10的模型(maximum 10 Open Channel)
    使用训练集的batch 5和batch 10进行搭建决策树

    batch = 5
    a = 500000 * (batch - 1)
    b = 500000 * batch
    batch = 10
    c = 500000 * (batch - 1)
    d = 500000 * batch
    X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
    y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
        (-1, 1))
    
    clf10 = tree.DecisionTreeClassifier(max_leaf_nodes=8)
    clf10 = clf10.fit(X_train, y_train)
    print('Trained model 10 channel')
    preds = clf10.predict(X_train)
    print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
    
    tree_graph = tree.export_graphviz(clf10, out_file=None, max_depth=10,
                                      impurity=False, feature_names=['signal'], class_names=[str(x) for x in range(11)],
                                      rounded=True, filled=True)
    graph = graphviz.Source(tree_graph)
    graph.view('Maximum 10 Open Channel')
    
    Trained model 10 channel
    has f1 validation score = 0.617082087312885
    
    maximum 10 Open Channel 决策树模型

    移除测试集噪声

    一、训练集噪声移除效果

    # ORIGINAL TRAIN DATA
    plt.figure(figsize=(20, 5))
    r = train_df['signal'].rolling(30000).mean()
    plt.plot(train_time, r)
    for i in range(11):
        plt.plot([i * 50, i * 50], [-3, 8], 'r:')
    for j in range(10):
        plt.text(j * 50 + 20, 6, str(j + 1), size=20)
    plt.title('Training Signal Rolling Mean. Has Drift wherever plot is not horizontal line', size=16)
    plt.savefig('EDA/Training Signal Rolling Mean with Drift.png')
    
    # TRAIN DATA WITHOUT DRIFT
    plt.figure(figsize=(20, 5))
    r = train_copy['signal'].rolling(30000).mean()
    plt.plot(train_copy['time'].values, r)
    for i in range(11):
        plt.plot([i * 50, i * 50], [-3, 8], 'r:')
    for j in range(10):
        plt.text(j * 50 + 20, 6, str(j + 1), size=20)
    plt.title('Training Signal Rolling Mean without Drift', size=16)
    plt.savefig('EDA/Training Signal Rolling Mean without Drift.png')
    plt.show()
    
    原始数据
    移除漂移后的效果
    二、测试集数据噪声移除
    plt.figure(figsize=(20, 5))
    let = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
    r = test_df['signal'].rolling(30000).mean()
    plt.plot(test_df['time'].values, r)
    for i in range(21):
        plt.plot([500 + i * 10, 500 + i * 10], [-3, 6], 'r:')
    for i in range(5):
        plt.plot([500 + i * 50, 500 + i * 50], [-3, 6], 'r')
    for k in range(4):
        plt.text(525 + k * 50, 5.5, str(k + 1), size=20)
    for k in range(10):
        plt.text(505 + k * 10, 4, let[k], size=16)
    plt.title('Test Signal Rolling Mean. Has Drift wherever plot is not horizontal line', size=16)
    plt.savefig('Test Signal Rolling Mean.png')
    plt.show()
    
    测试集样本
    可以看到在测试集的子样本A,B,E,G,H,I和测试集 batch 3中观察到数据漂移。
    移除噪声
    # Remove Test Data Drift
    test_copy = test_df.copy()
    
    # REMOVE BATCH 1 DRIFT
    start = 500
    a = 0
    b = 100000
    
    test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
    start = 510
    a = 100000
    b = 200000
    test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
    start = 540
    a = 400000
    b = 500000
    test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
    
    # REMOVE BATCH 2 DRIFT
    start = 560
    a = 600000
    b = 700000
    test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
    start = 570
    a = 700000
    b = 800000
    test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
    start = 580
    a = 800000
    b = 900000
    test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
    
    
    # REMOVE BATCH 3 DRIFT
    def f(x):
        return -(0.00788) * (x - 625) ** 2 + 2.345 + 2.58
    
    
    a = 1000000
    b = 1500000
    test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - f(test_time[a:b])
    

    移除效果可视化

    let = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
    plt.figure(figsize=(20, 5))
    res = 1000
    plt.plot(range(0, test_copy.shape[0], res), test_copy['signal'][0::res])
    for i in range(5):
        plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
    for i in range(21):
        plt.plot([i * 100000, i * 100000], [-5, 12.5], 'r:')
    for k in range(4):
        plt.text(k * 500000 + 250000, 10, str(k + 1), size=20)
    for k in range(10):
        plt.text(k * 100000 + 40000, 7.5, let[k], size=16)
    plt.title('Test Signal without Drift', size=16)
    
    plt.figure(figsize=(20, 5))
    r = test_df['signal'].rolling(30000).mean()
    plt.plot(test_time, r)
    for i in range(21):
        plt.plot([500 + i * 10, 500 + i * 10], [-2, 6], 'r:')
    for i in range(5):
        plt.plot([500 + i * 50, 500 + i * 50], [-2, 6], 'r')
    for k in range(4):
        plt.text(525 + k * 50, 5.5, str(k + 1), size=20)
    for k in range(10):
        plt.text(505 + k * 10, 4, let[k], size=16)
    plt.title('Test Signal Rolling Mean without Drift', size=16)
    plt.show()
    
    测试集移除噪声前数据
    测试集移除噪声后数据

    测试集预测

    # =====================Predict Test==================
    sub = pd.read_csv('data/sample_submission.csv')
    
    a = 0  # SUBSAMPLE A, Model 1s
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1s.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    a = 1  # SUBSAMPLE B, Model 3
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf3.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    a = 2  # SUBSAMPLE C, Model 5
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf5.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    a = 3  # SUBSAMPLE D, Model 1s
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1s.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    a = 4  # SUBSAMPLE E, Model 1f
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1f.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    a = 5  # SUBSAMPLE F, Model 10
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf10.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    a = 6  # SUBSAMPLE G, Model 5
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf5.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    a = 7  # SUBSAMPLE H, Model 10
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf10.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    a = 8  # SUBSAMPLE I, Model 1s
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1s.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    a = 9  # SUBSAMPLE J, Model 3
    sub.iloc[100000 * a:100000 * (a + 1), 1] = clf3.predict(
        test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
    
    # BATCHES 3 AND 4, Model 1s
    sub.iloc[1000000:2000000, 1] = clf1s.predict(test_copy['signal'].values[1000000:2000000].reshape((-1, 1)))
    

    预测效果可视化

    # ===============Display Test Predictions===============
    let = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
    plt.figure(figsize=(20, 5))
    res = 1000
    plt.plot(range(0, test_df.shape[0], res), sub.open_channels[0::res])
    for i in range(5):
        plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
    for i in range(21):
        plt.plot([i * 100000, i * 100000], [-5, 12.5], 'r:')
    for k in range(4):
        plt.text(k * 500000 + 250000, 10, str(k + 1), size=20)
    for k in range(10):
        plt.text(k * 100000 + 40000, 7.5, let[k], size=16)
    plt.title('Test Data Predictions', size=16)
    plt.show()
    
    测试集预测的开放通道数

    保存预测文件

    sub.to_csv('submission.csv', index=False, float_format='%.4f')
    

    提交结果

    提交结果

    补充说明

    上述说到batch 2的500000-600000行的数据中存在噪声,因为正常情况应像后半段那样是水平线,可以看出前半段为斜线,为了拟合该段数据,我们需要求得线段的斜率k和偏置b,且由两点便可确定一条直线。故现在的任务是在前半段的斜线中如何确定该两个点。


    batch 2中含有倾斜漂移

    一、计算每个滑动窗口的均值,通过rolling方法计算每1000个数据的均值。下图可看到平均值于数据段拟合的不错。

    # train data batch 2
    start = 500000
    end = 600000
    window = 1000
    mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
    plt.figure(figsize=(20, 5))
    step = 100
    plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
    plt.plot(range(start, end, step), mean[0:(end - start):step], zorder=1)
    plt.show()
    
    原始数据与rolling平均值的拟合线

    二、在rolling的平均线上取两个点

    start = 500000
    end = 600000
    window = 1000
    mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
    
    x_1 = start + 3000
    x_2 = end - 3000
    y_1 = mean[x_1]
    y_2 = mean[x_2]
    plt.figure(figsize=(20, 5))
    step = 100
    plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
    plt.plot(range(start, end, step), mean[0:(end - start):step], zorder=1)
    plt.scatter([x_1, x_2], [y_1, y_2], s=20, c='b', zorder=3)
    plt.show()
    
    image.png

    三、通过两个点计算斜率k和偏置b,下图红线为最终拟合的斜线。

    start = 500000
    end = 600000
    window = 1000
    mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
    
    x_1 = start + 3000
    x_2 = end - 3000
    y_1 = mean[x_1]
    y_2 = mean[x_2]
    k = (y_2 - y_1) / (x_2 - x_1)
    b = y_2 - k * x_2
    x = np.array([(start + i) for i in range(1, 100001)])
    y = k * x + b
    plt.figure(figsize=(20, 5))
    step = 100
    plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
    plt.plot(range(start, end, step), mean[0:(end - start):step], zorder=1)
    plt.plot(range(start, end, step), y[0:(end - start):step], 'r:', linewidth=5, zorder=2)
    plt.scatter([x_1, x_2], [y_1, y_2], s=20, c='b', zorder=3)
    plt.show()
    
    image.png

    四、添加常数constant
    如果我们现在的斜线直接减拟合的线,其均值理论上为0,所以我们还要计算无噪声的后半段的均值,将其添加到相减之后的函数中。

    start = 500000
    end = 600000
    window = 1000
    mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
    
    x_1 = start + 3000
    x_2 = end - 3000
    y_1 = mean[x_1]
    y_2 = mean[x_2]
    k = (y_2 - y_1) / (x_2 - x_1)
    b = y_2 - k * x_2
    x = np.array([(start + i) for i in range(1, 100001)])
    y = k * x + b
    constant = np.mean(train_df['signal'].values[end:1000000])
    train_df.loc[train_df.index[start:end], 'signal'] = train_df['signal'].values[start:end] - (
            k * ((train_df['time'].values[start:end] - 50) * 10000 + start) + b) + constant
    
    plt.figure(figsize=(20, 5))
    step = 100
    plt.plot(range(start, 1000000, step), train_df['signal'].values[start:1000000:step], zorder=1)
    plt.show()
    
    消去噪声后的batch 2 图像

    五、整理曲线公式
    我们得到的拟合曲线的x每个步长为1,但是表格中time的步长为0.0001,所以在第4步中得到time的值后要乘以10000,且x又加了一个start。现希望能够直接化成关于time的公式。


    转化公式

    得到K=0.2957577872340394,B=-0.044442693920208054。K约等于0.3,B约等于0。所以有上述代码

    train_copy.loc[train_copy.index[a:b], 'signal'] = signal - 3 * (time - 50) / 10.
    

    同样对于batch 7也存在噪声,但是其为抛物线型,需要3个点才能确定一个一元二次方程。


    batch 7的signal分布

    一、计算每个滑动窗口的均值,通过rolling方法计算每1000个数据的均值。下图可看到平均值于数据段拟合的不错。

    start = 3000000
    end = 3500000
    window = 1000
    mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
    
    plt.figure(figsize=(20, 5))
    step = 100
    plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
    plt.plot(range(start, end, step), mean[0:end - start:step], zorder=1)
    plt.show()
    
    窗口平均值

    二、在rolling的平均线上取三个点

    start = 3000000
    end = 3500000
    window = 1000
    mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
    
    x_1 = start + 3000
    x_2 = end - 3000
    x_3 = (start + end) // 2
    y_1 = mean[x_1]
    y_2 = mean[x_2]
    y_3 = mean[x_3]
    
    plt.figure(figsize=(20, 5))
    step = 100
    plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
    plt.plot(range(start, end, step), mean[0:end - start:step], zorder=1)
    plt.scatter([x_1, x_2, x_3], [y_1, y_2, y_3], s=20, c='b', zorder=3)
    plt.show()
    
    在平均值上取3个点

    三、通过三个点计算拟合曲线,下图红线为最终拟合的斜线。
    根据下列公式,求得一元二次方程得a,b,c。


    求得a,b,c
    start = 3000000
    end = 3500000
    window = 1000
    mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
    
    x_1 = start + 3000
    x_2 = end - 3000
    x_3 = (start + end) // 2
    y_1 = mean[x_1]
    y_2 = mean[x_2]
    y_3 = mean[x_3]
    
    a = ((x_3 - x_1) * (y_2 - y_1) - (x_2 - x_1) * (y_3 - y_1)) / (x_3 - x_1) / (x_2 - x_1) / (x_2 - x_3)
    b = ((y_3 - y_1) - a * (x_3 - x_1) * (x_3 + x_1)) / (x_3 - x_1)
    c = y_1 - a * x_1 * x_1 - b * x_1
    
    x = np.array([i + start for i in range(0, end - start)])
    y = a * x * x + b * x + c
    
    plt.figure(figsize=(20, 5))
    step = 100
    plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
    plt.plot(range(start, end, step), mean[0:end - start:step], zorder=1)
    plt.plot(range(start, end, step), y[0:end - start:step], 'r:', linewidth=5, zorder=2)
    plt.scatter([x_1, x_2, x_3], [y_1, y_2, y_3], s=20, c='b', zorder=3)
    plt.show()
    

    红色为最终拟合的曲线


    拟合曲线

    四、添加常数constant
    如果现在的抛物线直接减拟合的一元二次方程,其均值理论上为0,所以我们还要计算对应无噪声batch 3的均值,将其添加到相减之后的函数中。因为batch 3的open channels与batch 7的open channels非常接近。


    训练集的通道数分布
    start = 3000000
    end = 3500000
    window = 1000
    mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
    
    x_1 = start + 3000
    x_2 = end - 3000
    x_3 = (start + end) // 2
    y_1 = mean[x_1]
    y_2 = mean[x_2]
    y_3 = mean[x_3]
    
    a = ((x_3 - x_1) * (y_2 - y_1) - (x_2 - x_1) * (y_3 - y_1)) / (x_3 - x_1) / (x_2 - x_1) / (x_2 - x_3)
    b = ((y_3 - y_1) - a * (x_3 - x_1) * (x_3 + x_1)) / (x_3 - x_1)
    c = y_1 - a * x_1 * x_1 - b * x_1
    
    constant = np.mean(train_df['signal'].values[500000 * 2:500000 * 3])
    
    x = (train_df['time'].values[start:end] - 50 * 6) * 10000 + start
    train_df.loc[train_df.index[start:end], 'signal'] = train_df['signal'].values[start:end] \
                                                        - (a * x * x + b * x + c) + constant
    
    plt.figure(figsize=(20, 5))
    step = 100
    plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
    plt.show()
    
    拟合后的水平数据

    五、整理曲线公式
    根据上述步骤,还未将代码中用到的该方程推理出来,so sad~~~

    # 一元二次方程
    def f(x, low, high, mid):
        return -((-low + high) / 625) * (x - mid) ** 2 + high - low
    

    总结

    该kernel的亮点是充分观察了数据,挖掘特征之间的联系,在不同的数据段中使用不同的预测模型,并且手工去除了添加的噪声点,真可谓让人眼前一亮。

    相关文章

      网友评论

          本文标题:Kaggle实战: Ion Switching

          本文链接:https://www.haomeiwen.com/subject/jzllmhtx.html