Kaggle实战： Ion Switching

作者: 海盗船长_coco | 来源:发表于2020-04-10 10:10 被阅读0次

比赛地址：https://www.kaggle.com/c/liverpool-ion-switching
参考notebook：https://www.kaggle.com/cdeotte/one-feature-model-0-930
完整代码：https://github.com/872699467/kaggle_Ion_Switching

数据描述

训练集数据是50秒内进行10 kHz的不连续批次采样，并记录打开的离子通道数。故一个bacth的长度为50x10k=500k。即0.0001-50.0000的数据与50.0001-100.0000的数据不同，因此在50.0000和50.0001之间是不连续的。我们的任务是建立一个模型，该模型可以预测每个时间步长信号的开放通道数。训练集数据包含10个batch，测试集数据包含4个batch。

EDA

一、训练集分为10个batch，每个batch的信号变化

fig = plt.figure(figsize=(20, 5))
step = 1000
plt.plot(range(0, train_length, step), train_signal[0::step])
for i in range(11):
    plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for i in range(10):
    plt.text(i * 500000 + 200000, 10, str(i + 1), size=20)
plt.xlabel('Row',size=16)
plt.ylabel('Signal',size=16)
plt.title('Training Data Signal - 10 batches',size=20)
plt.savefig('EDA/time_signal.png')
plt.show()

训练集signal
二、训练集分为10个batch，每个batch的开放通道数变化
开放的通道数即我们需要预测的数据。

fig = plt.figure(figsize=(20, 5))
step = 1000
plt.plot(range(0, train_length, step), train_channel[0::step])
for i in range(11):
    plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for i in range(10):
    plt.text(i * 500000 + 200000, 10, str(i + 1), size=20)
plt.xlabel('Row', size=16)
plt.ylabel('Channels Open', size=16)
plt.title('Training Data Open Channels - 10 batches', size=20)
plt.savefig('EDA/time_channel.png')
plt.show()

训练集Open Channels变化
对训练集的数据进行分析
从上面的图中可以看出，可以使用了5种不同的合成模型。 1、模型为通道打开最大数为1，且打开的可能性很小（batch1和2）。 2、模型为通道打开最大数为1，且打开的可能性很大（batch3和7）。 3、模型为通道打开最大数为3（batch4和8）。 4、模型为通道打开最大数为5（batch6和9）。5、模型为通道打开最大数为10（batch5和10）。此外，第一幅图可以看到在第7、8、9、10 batch中存在噪声，造成了数据漂移，并在第2个batch的开始处产生了漂移。
根据此处的论文，数据是综合的。还添加了“电生理”噪声和漂移。漂移是一种信号偏置，从而导致上面的batch 2、7、8、9、10的信号呈现为水平线形状。
三、信号和开放通道数的关系

for k in range(5):
    start = int(np.random.uniform(0, train_length - 5000))
    end = start + 5000
    step = 10
    print('#' * 25)
    print('### Random {} to {} '.format(start, end))
    print('#' * 25)
    plt.figure(figsize=(20, 5))
    plt.plot(range(start, end, step), train_signal[start:end][0::step])
    plt.plot(range(start, end, step), train_channel[start:end][0::step])
    # plt.savefig('EDA/signal_channel.png')
plt.show()

image.png

image.png
黄色的线为开放的通道数量，蓝色的线为信号量。仔细查看信号和开放通道的随机间隔，以观察它们之间的关系。我们注意到它们是高度相关的，并且一起上下移动。因此，我们可以根据一个特征信号来预测开放信道。唯一的麻烦是信号数据中添加了噪声造成了漂移。因此，我们需要将其删除。
四、测试集数据信号量变化

plt.figure(figsize=(20, 5))
step = 1000
label = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
plt.plot(range(0, test_length, step), test_signal[0::step])
for i in range(5):
    plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for j in range(21):
    plt.plot([j * 100000, j * 100000], [-5, 12.5], 'r:')
for k in range(4):
    plt.text(k * 500000 + 200000, 10, str(k + 1), size=20)
for k in range(10):
    plt.text(k * 100000 + 40000, 7, label[k], size=16)
plt.xlabel('Row', size=16)
plt.ylabel('Channels Open', size=16)
plt.title('Test Data Signal - 4 batches - 10 subsamples', size=20)
plt.savefig('EDA/test_data.png')
plt.show()

测试集信号量变化
对测试集数据进行分析
从上图中可以找到对应的5个模型。而且我们可以识别出数据添加的噪声。 batch 1似乎是5个子样本，其中分别由模型1s，max 3、max 5、1s和1f创建了A，B，C，D，E。模型1s是具有低概率打开最多1个通道的模型。模型1f是具有高概率打开最多1个通道的模型。 max 3、max 5、max 10分别是最多具有3、5、10个通道数的模型。我们在子样本A，B，E，G，H，I 中观察到倾斜漂移。在batch 3中观察到抛物线的噪声。

移除训练集噪声

下图为倾斜漂移消除的效果。我们还可以在第7、8、9、10 batch 中消除抛物线漂移。原本我们将只训练具有批次1、3、4、5、6的模型。但是，在删除训练漂移之后，我们可以将批次2、7、8、9、10中的数据包括在训练数据中。

# ==========Remove Train Data Drift=============
train_copy = train_df.copy()

a = 500000
b = 600000  # CLEAN TRAIN BATCH 2
signal = train_copy['signal'][a:b].values
time = train_copy['time'][a:b].values
train_copy.loc[train_copy.index[a:b], 'signal'] = signal - 3 * (time - 50) / 10.

#可视化
batch = 2
a = 500000 * (batch - 1)
b = 500000 * batch
res = 50
plt.figure(figsize=(20, 5))
plt.plot(range(a, b, res), train_copy['signal'][a:b][0::res])
plt.title('Training Batch 2 without Slant Drift', size=16)
plt.savefig('EDA/without Slant Drift.png')
plt.figure(figsize=(20, 5))
plt.plot(range(a, b, res), train_signal[a:b][0::res])
plt.title('Training Batch 2 with Slant Drift', size=16)
plt.savefig('EDA/with Slant Drift.png')
plt.show()

batch 2中含有倾斜漂移

batch 2中删除了倾斜漂移

移除batch 7、8、9、10噪声

# 一元二次方程
def f(x, low, high, mid):
    return -((-low + high) / 625) * (x - mid) ** 2 + high - low


# CLEAN TRAIN BATCH 7
batch = 7
a = 500000 * (batch - 1)
b = 500000 * batch
train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], -1.817, 3.186, 325)
# CLEAN TRAIN BATCH 8
batch = 8
a = 500000 * (batch - 1)
b = 500000 * batch
train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], -0.094, 4.936, 375)
# CLEAN TRAIN BATCH 9
batch = 9
a = 500000 * (batch - 1)
b = 500000 * batch
train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], 1.715, 6.689, 425)
# CLEAN TRAIN BATCH 10
batch = 10
a = 500000 * (batch - 1)
b = 500000 * batch
train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], 3.361, 8.45, 475)

#可视化
plt.figure(figsize=(20, 5))
plt.plot(train_time[::1000], train_signal[::1000])
plt.title('Training Batches 7-10 with Parabolic Drift', size=16)
plt.savefig('EDA/with Parabolic Drift.png')
plt.figure(figsize=(20, 5))
plt.plot(train_copy['time'][::1000], train_copy['signal'][::1000])
plt.title('Training Batches 7-10 without Parabolic Drift', size=16)
plt.savefig('EDA/without Parabolic Drift.png')
plt.show()

batch 7、8、9、10含有抛物线漂移

batch 7、8、9、10不含有抛物线漂移

预测模型搭建

1、开放通道数最大为1且概率低的模型（1 Slow Open Channel）
使用训练集的batch 1和batch 2进行搭建决策树

batch = 1
a = 500000 * (batch - 1);
b = 500000 * batch
batch = 2
c = 500000 * (batch - 1)
d = 500000 * batch
temp = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]])
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
    (-1, 1))

clf1s = tree.DecisionTreeClassifier(max_depth=1)
clf1s = clf1s.fit(X_train, y_train)
print('Training model 1s channel')
preds = clf1s.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))

tree_graph = tree.export_graphviz(clf1s, out_file=None, max_depth=10,
                                  impurity=False, feature_names=['signal'], class_names=['0', '1'],
                                  rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Slow Open Channel')

Training model 1s channel
has f1 validation score = 0.984750462652431

1 Slow Open Channel 决策树模型

2、开放通道数最大为1且概率高的模型（1 Fast Open Channel）
使用训练集的batch 3和batch 7进行搭建决策树

batch = 3
a = 500000 * (batch - 1)
b = 500000 * batch
batch = 7
c = 500000 * (batch - 1)
d = 500000 * batch
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
    (-1, 1))

clf1f = tree.DecisionTreeClassifier(max_depth=1)
clf1f = clf1f.fit(X_train, y_train)
print('Training model 1f channel')
preds = clf1f.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))

tree_graph = tree.export_graphviz(clf1f, out_file=None, max_depth=10,
                                  impurity=False, feature_names=['signal'], class_names=['0', '1'],
                                  rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Fast Open Channel')

Training model 1f channel
has f1 validation score = 0.9879287335590583

1 Fast Open Channel 决策树模型

3、开放通道数最大为3的模型（maximum 3 Open Channel）
使用训练集的batch 4和batch 8进行搭建决策树

batch = 4
a = 500000 * (batch - 1)
b = 500000 * batch
batch = 8
c = 500000 * (batch - 1)
d = 500000 * batch
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
    (-1, 1))

clf3 = tree.DecisionTreeClassifier(max_leaf_nodes=4)
clf3 = clf3.fit(X_train, y_train)
print('Training model 3 channel')
preds = clf3.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))

tree_graph = tree.export_graphviz(clf3, out_file=None, max_depth=10,
                                  impurity=False, feature_names=['signal'], class_names=['0', '1', '2', '3'],
                                  rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Maximum 3 Open Channel')

Training model 3 channel
has f1 validation score = 0.9321291153218245

maximum3 Open Channel 决策树模型

4、开放通道数最大为5的模型（maximum 5 Open Channel）
使用训练集的batch 6和batch 9进行搭建决策树

batch = 6
a = 500000 * (batch - 1)
b = 500000 * batch
batch = 9
c = 500000 * (batch - 1)
d = 500000 * batch
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
    (-1, 1))

clf5 = tree.DecisionTreeClassifier(max_leaf_nodes=6)
clf5 = clf5.fit(X_train, y_train)
print('Trained model 5 channel')
preds = clf5.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))

tree_graph = tree.export_graphviz(clf5, out_file=None, max_depth=10,
                                  impurity=False, feature_names=['signal'], class_names=['0', '1', '2', '3', '4', '5'],
                                  rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Maximum 5 Open Channel')

Trained model 5 channel
has f1 validation score = 0.9538294064086686

maximum 5 Open Channel决策树模型

5、开放通道数最大为10的模型（maximum 10 Open Channel）
使用训练集的batch 5和batch 10进行搭建决策树

batch = 5
a = 500000 * (batch - 1)
b = 500000 * batch
batch = 10
c = 500000 * (batch - 1)
d = 500000 * batch
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
    (-1, 1))

clf10 = tree.DecisionTreeClassifier(max_leaf_nodes=8)
clf10 = clf10.fit(X_train, y_train)
print('Trained model 10 channel')
preds = clf10.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))

tree_graph = tree.export_graphviz(clf10, out_file=None, max_depth=10,
                                  impurity=False, feature_names=['signal'], class_names=[str(x) for x in range(11)],
                                  rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Maximum 10 Open Channel')

Trained model 10 channel
has f1 validation score = 0.617082087312885

maximum 10 Open Channel 决策树模型

移除测试集噪声

一、训练集噪声移除效果

# ORIGINAL TRAIN DATA
plt.figure(figsize=(20, 5))
r = train_df['signal'].rolling(30000).mean()
plt.plot(train_time, r)
for i in range(11):
    plt.plot([i * 50, i * 50], [-3, 8], 'r:')
for j in range(10):
    plt.text(j * 50 + 20, 6, str(j + 1), size=20)
plt.title('Training Signal Rolling Mean. Has Drift wherever plot is not horizontal line', size=16)
plt.savefig('EDA/Training Signal Rolling Mean with Drift.png')

# TRAIN DATA WITHOUT DRIFT
plt.figure(figsize=(20, 5))
r = train_copy['signal'].rolling(30000).mean()
plt.plot(train_copy['time'].values, r)
for i in range(11):
    plt.plot([i * 50, i * 50], [-3, 8], 'r:')
for j in range(10):
    plt.text(j * 50 + 20, 6, str(j + 1), size=20)
plt.title('Training Signal Rolling Mean without Drift', size=16)
plt.savefig('EDA/Training Signal Rolling Mean without Drift.png')
plt.show()

原始数据

移除漂移后的效果
二、测试集数据噪声移除

plt.figure(figsize=(20, 5))
let = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
r = test_df['signal'].rolling(30000).mean()
plt.plot(test_df['time'].values, r)
for i in range(21):
    plt.plot([500 + i * 10, 500 + i * 10], [-3, 6], 'r:')
for i in range(5):
    plt.plot([500 + i * 50, 500 + i * 50], [-3, 6], 'r')
for k in range(4):
    plt.text(525 + k * 50, 5.5, str(k + 1), size=20)
for k in range(10):
    plt.text(505 + k * 10, 4, let[k], size=16)
plt.title('Test Signal Rolling Mean. Has Drift wherever plot is not horizontal line', size=16)
plt.savefig('Test Signal Rolling Mean.png')
plt.show()

测试集样本
可以看到在测试集的子样本A，B，E，G，H，I和测试集 batch 3中观察到数据漂移。
移除噪声

# Remove Test Data Drift
test_copy = test_df.copy()

# REMOVE BATCH 1 DRIFT
start = 500
a = 0
b = 100000

test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
start = 510
a = 100000
b = 200000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
start = 540
a = 400000
b = 500000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.

# REMOVE BATCH 2 DRIFT
start = 560
a = 600000
b = 700000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
start = 570
a = 700000
b = 800000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
start = 580
a = 800000
b = 900000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.


# REMOVE BATCH 3 DRIFT
def f(x):
    return -(0.00788) * (x - 625) ** 2 + 2.345 + 2.58


a = 1000000
b = 1500000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - f(test_time[a:b])

移除效果可视化

let = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
plt.figure(figsize=(20, 5))
res = 1000
plt.plot(range(0, test_copy.shape[0], res), test_copy['signal'][0::res])
for i in range(5):
    plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for i in range(21):
    plt.plot([i * 100000, i * 100000], [-5, 12.5], 'r:')
for k in range(4):
    plt.text(k * 500000 + 250000, 10, str(k + 1), size=20)
for k in range(10):
    plt.text(k * 100000 + 40000, 7.5, let[k], size=16)
plt.title('Test Signal without Drift', size=16)

plt.figure(figsize=(20, 5))
r = test_df['signal'].rolling(30000).mean()
plt.plot(test_time, r)
for i in range(21):
    plt.plot([500 + i * 10, 500 + i * 10], [-2, 6], 'r:')
for i in range(5):
    plt.plot([500 + i * 50, 500 + i * 50], [-2, 6], 'r')
for k in range(4):
    plt.text(525 + k * 50, 5.5, str(k + 1), size=20)
for k in range(10):
    plt.text(505 + k * 10, 4, let[k], size=16)
plt.title('Test Signal Rolling Mean without Drift', size=16)
plt.show()

测试集移除噪声前数据

测试集移除噪声后数据

测试集预测

# =====================Predict Test==================
sub = pd.read_csv('data/sample_submission.csv')

a = 0  # SUBSAMPLE A, Model 1s
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1s.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

a = 1  # SUBSAMPLE B, Model 3
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf3.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

a = 2  # SUBSAMPLE C, Model 5
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf5.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

a = 3  # SUBSAMPLE D, Model 1s
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1s.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

a = 4  # SUBSAMPLE E, Model 1f
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1f.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

a = 5  # SUBSAMPLE F, Model 10
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf10.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

a = 6  # SUBSAMPLE G, Model 5
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf5.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

a = 7  # SUBSAMPLE H, Model 10
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf10.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

a = 8  # SUBSAMPLE I, Model 1s
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1s.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

a = 9  # SUBSAMPLE J, Model 3
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf3.predict(
    test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))

# BATCHES 3 AND 4, Model 1s
sub.iloc[1000000:2000000, 1] = clf1s.predict(test_copy['signal'].values[1000000:2000000].reshape((-1, 1)))

预测效果可视化

# ===============Display Test Predictions===============
let = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
plt.figure(figsize=(20, 5))
res = 1000
plt.plot(range(0, test_df.shape[0], res), sub.open_channels[0::res])
for i in range(5):
    plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for i in range(21):
    plt.plot([i * 100000, i * 100000], [-5, 12.5], 'r:')
for k in range(4):
    plt.text(k * 500000 + 250000, 10, str(k + 1), size=20)
for k in range(10):
    plt.text(k * 100000 + 40000, 7.5, let[k], size=16)
plt.title('Test Data Predictions', size=16)
plt.show()

测试集预测的开放通道数

保存预测文件

sub.to_csv('submission.csv', index=False, float_format='%.4f')

提交结果

补充说明

上述说到batch 2的500000-600000行的数据中存在噪声，因为正常情况应像后半段那样是水平线，可以看出前半段为斜线，为了拟合该段数据，我们需要求得线段的斜率k和偏置b，且由两点便可确定一条直线。故现在的任务是在前半段的斜线中如何确定该两个点。

batch 2中含有倾斜漂移

一、计算每个滑动窗口的均值，通过rolling方法计算每1000个数据的均值。下图可看到平均值于数据段拟合的不错。

# train data batch 2
start = 500000
end = 600000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:(end - start):step], zorder=1)
plt.show()

原始数据与rolling平均值的拟合线

二、在rolling的平均线上取两个点

start = 500000
end = 600000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()

x_1 = start + 3000
x_2 = end - 3000
y_1 = mean[x_1]
y_2 = mean[x_2]
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:(end - start):step], zorder=1)
plt.scatter([x_1, x_2], [y_1, y_2], s=20, c='b', zorder=3)
plt.show()

image.png

三、通过两个点计算斜率k和偏置b，下图红线为最终拟合的斜线。

start = 500000
end = 600000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()

x_1 = start + 3000
x_2 = end - 3000
y_1 = mean[x_1]
y_2 = mean[x_2]
k = (y_2 - y_1) / (x_2 - x_1)
b = y_2 - k * x_2
x = np.array([(start + i) for i in range(1, 100001)])
y = k * x + b
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:(end - start):step], zorder=1)
plt.plot(range(start, end, step), y[0:(end - start):step], 'r:', linewidth=5, zorder=2)
plt.scatter([x_1, x_2], [y_1, y_2], s=20, c='b', zorder=3)
plt.show()

image.png

四、添加常数constant
如果我们现在的斜线直接减拟合的线，其均值理论上为0，所以我们还要计算无噪声的后半段的均值，将其添加到相减之后的函数中。

start = 500000
end = 600000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()

x_1 = start + 3000
x_2 = end - 3000
y_1 = mean[x_1]
y_2 = mean[x_2]
k = (y_2 - y_1) / (x_2 - x_1)
b = y_2 - k * x_2
x = np.array([(start + i) for i in range(1, 100001)])
y = k * x + b
constant = np.mean(train_df['signal'].values[end:1000000])
train_df.loc[train_df.index[start:end], 'signal'] = train_df['signal'].values[start:end] - (
        k * ((train_df['time'].values[start:end] - 50) * 10000 + start) + b) + constant

plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, 1000000, step), train_df['signal'].values[start:1000000:step], zorder=1)
plt.show()

消去噪声后的batch 2 图像

五、整理曲线公式
我们得到的拟合曲线的x每个步长为1，但是表格中time的步长为0.0001，所以在第4步中得到time的值后要乘以10000，且x又加了一个start。现希望能够直接化成关于time的公式。

转化公式

得到K=0.2957577872340394，B=-0.044442693920208054。K约等于0.3，B约等于0。所以有上述代码

train_copy.loc[train_copy.index[a:b], 'signal'] = signal - 3 * (time - 50) / 10.

同样对于batch 7也存在噪声，但是其为抛物线型，需要3个点才能确定一个一元二次方程。

batch 7的signal分布

一、计算每个滑动窗口的均值，通过rolling方法计算每1000个数据的均值。下图可看到平均值于数据段拟合的不错。

start = 3000000
end = 3500000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()

plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:end - start:step], zorder=1)
plt.show()

窗口平均值

二、在rolling的平均线上取三个点

start = 3000000
end = 3500000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()

x_1 = start + 3000
x_2 = end - 3000
x_3 = (start + end) // 2
y_1 = mean[x_1]
y_2 = mean[x_2]
y_3 = mean[x_3]

plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:end - start:step], zorder=1)
plt.scatter([x_1, x_2, x_3], [y_1, y_2, y_3], s=20, c='b', zorder=3)
plt.show()

在平均值上取3个点

三、通过三个点计算拟合曲线，下图红线为最终拟合的斜线。
根据下列公式，求得一元二次方程得a，b，c。

求得a,b,c

start = 3000000
end = 3500000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()

x_1 = start + 3000
x_2 = end - 3000
x_3 = (start + end) // 2
y_1 = mean[x_1]
y_2 = mean[x_2]
y_3 = mean[x_3]

a = ((x_3 - x_1) * (y_2 - y_1) - (x_2 - x_1) * (y_3 - y_1)) / (x_3 - x_1) / (x_2 - x_1) / (x_2 - x_3)
b = ((y_3 - y_1) - a * (x_3 - x_1) * (x_3 + x_1)) / (x_3 - x_1)
c = y_1 - a * x_1 * x_1 - b * x_1

x = np.array([i + start for i in range(0, end - start)])
y = a * x * x + b * x + c

plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:end - start:step], zorder=1)
plt.plot(range(start, end, step), y[0:end - start:step], 'r:', linewidth=5, zorder=2)
plt.scatter([x_1, x_2, x_3], [y_1, y_2, y_3], s=20, c='b', zorder=3)
plt.show()

红色为最终拟合的曲线

拟合曲线

四、添加常数constant
如果现在的抛物线直接减拟合的一元二次方程，其均值理论上为0，所以我们还要计算对应无噪声batch 3的均值，将其添加到相减之后的函数中。因为batch 3的open channels与batch 7的open channels非常接近。

训练集的通道数分布

start = 3000000
end = 3500000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()

x_1 = start + 3000
x_2 = end - 3000
x_3 = (start + end) // 2
y_1 = mean[x_1]
y_2 = mean[x_2]
y_3 = mean[x_3]

a = ((x_3 - x_1) * (y_2 - y_1) - (x_2 - x_1) * (y_3 - y_1)) / (x_3 - x_1) / (x_2 - x_1) / (x_2 - x_3)
b = ((y_3 - y_1) - a * (x_3 - x_1) * (x_3 + x_1)) / (x_3 - x_1)
c = y_1 - a * x_1 * x_1 - b * x_1

constant = np.mean(train_df['signal'].values[500000 * 2:500000 * 3])

x = (train_df['time'].values[start:end] - 50 * 6) * 10000 + start
train_df.loc[train_df.index[start:end], 'signal'] = train_df['signal'].values[start:end] \
                                                    - (a * x * x + b * x + c) + constant

plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.show()

拟合后的水平数据

五、整理曲线公式
根据上述步骤，还未将代码中用到的该方程推理出来，so sad~~~

# 一元二次方程
def f(x, low, high, mid):
    return -((-low + high) / 625) * (x - mid) ** 2 + high - low

总结

该kernel的亮点是充分观察了数据，挖掘特征之间的联系，在不同的数据段中使用不同的预测模型，并且手工去除了添加的噪声点，真可谓让人眼前一亮。

Kaggle实战： Ion Switching

数据描述

EDA

移除训练集噪声

预测模型搭建

移除测试集噪声

测试集预测

预测效果可视化

保存预测文件

提交结果

补充说明

总结

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读