比赛地址:https://www.kaggle.com/c/liverpool-ion-switching
参考notebook:https://www.kaggle.com/cdeotte/one-feature-model-0-930
完整代码:https://github.com/872699467/kaggle_Ion_Switching
数据描述
训练集数据是50秒内进行10 kHz的不连续批次采样,并记录打开的离子通道数。故一个bacth的长度为50x10k=500k。即0.0001-50.0000的数据与50.0001-100.0000的数据不同,因此在50.0000和50.0001之间是不连续的。我们的任务是建立一个模型,该模型可以预测每个时间步长信号的开放通道数。训练集数据包含10个batch,测试集数据包含4个batch。
EDA
一、训练集分为10个batch,每个batch的信号变化
fig = plt.figure(figsize=(20, 5))
step = 1000
plt.plot(range(0, train_length, step), train_signal[0::step])
for i in range(11):
plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for i in range(10):
plt.text(i * 500000 + 200000, 10, str(i + 1), size=20)
plt.xlabel('Row',size=16)
plt.ylabel('Signal',size=16)
plt.title('Training Data Signal - 10 batches',size=20)
plt.savefig('EDA/time_signal.png')
plt.show()
训练集signal
二、训练集分为10个batch,每个batch的开放通道数变化
开放的通道数即我们需要预测的数据。
fig = plt.figure(figsize=(20, 5))
step = 1000
plt.plot(range(0, train_length, step), train_channel[0::step])
for i in range(11):
plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for i in range(10):
plt.text(i * 500000 + 200000, 10, str(i + 1), size=20)
plt.xlabel('Row', size=16)
plt.ylabel('Channels Open', size=16)
plt.title('Training Data Open Channels - 10 batches', size=20)
plt.savefig('EDA/time_channel.png')
plt.show()
训练集Open Channels变化
对训练集的数据进行分析
从上面的图中可以看出,可以使用了5种不同的合成模型。 1、模型为通道打开最大数为1,且打开的可能性很小(batch1和2)。 2、模型为通道打开最大数为1,且打开的可能性很大(batch3和7)。 3、模型为通道打开最大数为3(batch4和8)。 4、模型为通道打开最大数为5(batch6和9)。5、模型为通道打开最大数为10(batch5和10)。 此外,第一幅图可以看到在第7、8、9、10 batch中存在噪声,造成了数据漂移,并在第2个batch的开始处产生了漂移。
根据此处的论文,数据是综合的。 还添加了“电生理”噪声和漂移。 漂移是一种信号偏置,从而导致上面的batch 2、7、8、9、10的信号呈现为水平线形状。
三、信号和开放通道数的关系
for k in range(5):
start = int(np.random.uniform(0, train_length - 5000))
end = start + 5000
step = 10
print('#' * 25)
print('### Random {} to {} '.format(start, end))
print('#' * 25)
plt.figure(figsize=(20, 5))
plt.plot(range(start, end, step), train_signal[start:end][0::step])
plt.plot(range(start, end, step), train_channel[start:end][0::step])
# plt.savefig('EDA/signal_channel.png')
plt.show()
image.png
image.png
image.png
image.png
image.png
黄色的线为开放的通道数量,蓝色的线为信号量。仔细查看信号和开放通道的随机间隔,以观察它们之间的关系。 我们注意到它们是高度相关的,并且一起上下移动。 因此,我们可以根据一个特征信号来预测开放信道。 唯一的麻烦是信号数据中添加了噪声造成了漂移。 因此,我们需要将其删除。
四、测试集数据信号量变化
plt.figure(figsize=(20, 5))
step = 1000
label = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
plt.plot(range(0, test_length, step), test_signal[0::step])
for i in range(5):
plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for j in range(21):
plt.plot([j * 100000, j * 100000], [-5, 12.5], 'r:')
for k in range(4):
plt.text(k * 500000 + 200000, 10, str(k + 1), size=20)
for k in range(10):
plt.text(k * 100000 + 40000, 7, label[k], size=16)
plt.xlabel('Row', size=16)
plt.ylabel('Channels Open', size=16)
plt.title('Test Data Signal - 4 batches - 10 subsamples', size=20)
plt.savefig('EDA/test_data.png')
plt.show()
测试集信号量变化
对测试集数据进行分析
从上图中可以找到对应的5个模型。而且 我们可以识别出数据添加的噪声。 batch 1似乎是5个子样本,其中分别由模型1s,max 3、max 5、1s和1f创建了A,B,C,D,E。 模型1s是具有低概率打开最多1个通道的模型。 模型1f是具有高概率打开最多1个通道的模型。 max 3、max 5、max 10分别是最多具有3、5、10个通道数的模型。 我们在子样本A,B,E,G,H,I 中观察到倾斜漂移。在batch 3中观察到抛物线的噪声。
移除训练集噪声
下图为倾斜漂移消除的效果。 我们还可以在第7、8、9、10 batch 中消除抛物线漂移。原本我们将只训练具有批次1、3、4、5、6的模型。但是,在删除训练漂移之后,我们可以将批次2、7、8、9、10中的数据包括在训练数据中。
# ==========Remove Train Data Drift=============
train_copy = train_df.copy()
a = 500000
b = 600000 # CLEAN TRAIN BATCH 2
signal = train_copy['signal'][a:b].values
time = train_copy['time'][a:b].values
train_copy.loc[train_copy.index[a:b], 'signal'] = signal - 3 * (time - 50) / 10.
#可视化
batch = 2
a = 500000 * (batch - 1)
b = 500000 * batch
res = 50
plt.figure(figsize=(20, 5))
plt.plot(range(a, b, res), train_copy['signal'][a:b][0::res])
plt.title('Training Batch 2 without Slant Drift', size=16)
plt.savefig('EDA/without Slant Drift.png')
plt.figure(figsize=(20, 5))
plt.plot(range(a, b, res), train_signal[a:b][0::res])
plt.title('Training Batch 2 with Slant Drift', size=16)
plt.savefig('EDA/with Slant Drift.png')
plt.show()
batch 2中含有倾斜漂移
batch 2中删除了倾斜漂移
移除batch 7、8、9、10噪声
# 一元二次方程
def f(x, low, high, mid):
return -((-low + high) / 625) * (x - mid) ** 2 + high - low
# CLEAN TRAIN BATCH 7
batch = 7
a = 500000 * (batch - 1)
b = 500000 * batch
train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], -1.817, 3.186, 325)
# CLEAN TRAIN BATCH 8
batch = 8
a = 500000 * (batch - 1)
b = 500000 * batch
train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], -0.094, 4.936, 375)
# CLEAN TRAIN BATCH 9
batch = 9
a = 500000 * (batch - 1)
b = 500000 * batch
train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], 1.715, 6.689, 425)
# CLEAN TRAIN BATCH 10
batch = 10
a = 500000 * (batch - 1)
b = 500000 * batch
train_copy.loc[train_copy.index[a:b], 'signal'] = train_signal[a:b] - f(train_time[a:b], 3.361, 8.45, 475)
#可视化
plt.figure(figsize=(20, 5))
plt.plot(train_time[::1000], train_signal[::1000])
plt.title('Training Batches 7-10 with Parabolic Drift', size=16)
plt.savefig('EDA/with Parabolic Drift.png')
plt.figure(figsize=(20, 5))
plt.plot(train_copy['time'][::1000], train_copy['signal'][::1000])
plt.title('Training Batches 7-10 without Parabolic Drift', size=16)
plt.savefig('EDA/without Parabolic Drift.png')
plt.show()
batch 7、8、9、10含有抛物线漂移
batch 7、8、9、10不含有抛物线漂移
预测模型搭建
1、开放通道数最大为1且概率低的模型(1 Slow Open Channel)
使用训练集的batch 1和batch 2进行搭建决策树
batch = 1
a = 500000 * (batch - 1);
b = 500000 * batch
batch = 2
c = 500000 * (batch - 1)
d = 500000 * batch
temp = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]])
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
(-1, 1))
clf1s = tree.DecisionTreeClassifier(max_depth=1)
clf1s = clf1s.fit(X_train, y_train)
print('Training model 1s channel')
preds = clf1s.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
tree_graph = tree.export_graphviz(clf1s, out_file=None, max_depth=10,
impurity=False, feature_names=['signal'], class_names=['0', '1'],
rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Slow Open Channel')
Training model 1s channel
has f1 validation score = 0.984750462652431
1 Slow Open Channel 决策树模型
2、开放通道数最大为1且概率高的模型(1 Fast Open Channel)
使用训练集的batch 3和batch 7进行搭建决策树
batch = 3
a = 500000 * (batch - 1)
b = 500000 * batch
batch = 7
c = 500000 * (batch - 1)
d = 500000 * batch
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
(-1, 1))
clf1f = tree.DecisionTreeClassifier(max_depth=1)
clf1f = clf1f.fit(X_train, y_train)
print('Training model 1f channel')
preds = clf1f.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
tree_graph = tree.export_graphviz(clf1f, out_file=None, max_depth=10,
impurity=False, feature_names=['signal'], class_names=['0', '1'],
rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Fast Open Channel')
Training model 1f channel
has f1 validation score = 0.9879287335590583
1 Fast Open Channel 决策树模型
3、开放通道数最大为3的模型(maximum 3 Open Channel)
使用训练集的batch 4和batch 8进行搭建决策树
batch = 4
a = 500000 * (batch - 1)
b = 500000 * batch
batch = 8
c = 500000 * (batch - 1)
d = 500000 * batch
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
(-1, 1))
clf3 = tree.DecisionTreeClassifier(max_leaf_nodes=4)
clf3 = clf3.fit(X_train, y_train)
print('Training model 3 channel')
preds = clf3.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
tree_graph = tree.export_graphviz(clf3, out_file=None, max_depth=10,
impurity=False, feature_names=['signal'], class_names=['0', '1', '2', '3'],
rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Maximum 3 Open Channel')
Training model 3 channel
has f1 validation score = 0.9321291153218245
maximum3 Open Channel 决策树模型
4、开放通道数最大为5的模型(maximum 5 Open Channel)
使用训练集的batch 6和batch 9进行搭建决策树
batch = 6
a = 500000 * (batch - 1)
b = 500000 * batch
batch = 9
c = 500000 * (batch - 1)
d = 500000 * batch
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
(-1, 1))
clf5 = tree.DecisionTreeClassifier(max_leaf_nodes=6)
clf5 = clf5.fit(X_train, y_train)
print('Trained model 5 channel')
preds = clf5.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
tree_graph = tree.export_graphviz(clf5, out_file=None, max_depth=10,
impurity=False, feature_names=['signal'], class_names=['0', '1', '2', '3', '4', '5'],
rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Maximum 5 Open Channel')
Trained model 5 channel
has f1 validation score = 0.9538294064086686
maximum 5 Open Channel决策树模型
5、开放通道数最大为10的模型(maximum 10 Open Channel)
使用训练集的batch 5和batch 10进行搭建决策树
batch = 5
a = 500000 * (batch - 1)
b = 500000 * batch
batch = 10
c = 500000 * (batch - 1)
d = 500000 * batch
X_train = np.concatenate([train_copy['signal'].values[a:b], train_copy['signal'].values[c:d]]).reshape((-1, 1))
y_train = np.concatenate([train_copy['open_channels'].values[a:b], train_copy['open_channels'].values[c:d]]).reshape(
(-1, 1))
clf10 = tree.DecisionTreeClassifier(max_leaf_nodes=8)
clf10 = clf10.fit(X_train, y_train)
print('Trained model 10 channel')
preds = clf10.predict(X_train)
print('has f1 validation score =', f1_score(y_train, preds, average='macro'))
tree_graph = tree.export_graphviz(clf10, out_file=None, max_depth=10,
impurity=False, feature_names=['signal'], class_names=[str(x) for x in range(11)],
rounded=True, filled=True)
graph = graphviz.Source(tree_graph)
graph.view('Maximum 10 Open Channel')
Trained model 10 channel
has f1 validation score = 0.617082087312885
maximum 10 Open Channel 决策树模型
移除测试集噪声
一、训练集噪声移除效果
# ORIGINAL TRAIN DATA
plt.figure(figsize=(20, 5))
r = train_df['signal'].rolling(30000).mean()
plt.plot(train_time, r)
for i in range(11):
plt.plot([i * 50, i * 50], [-3, 8], 'r:')
for j in range(10):
plt.text(j * 50 + 20, 6, str(j + 1), size=20)
plt.title('Training Signal Rolling Mean. Has Drift wherever plot is not horizontal line', size=16)
plt.savefig('EDA/Training Signal Rolling Mean with Drift.png')
# TRAIN DATA WITHOUT DRIFT
plt.figure(figsize=(20, 5))
r = train_copy['signal'].rolling(30000).mean()
plt.plot(train_copy['time'].values, r)
for i in range(11):
plt.plot([i * 50, i * 50], [-3, 8], 'r:')
for j in range(10):
plt.text(j * 50 + 20, 6, str(j + 1), size=20)
plt.title('Training Signal Rolling Mean without Drift', size=16)
plt.savefig('EDA/Training Signal Rolling Mean without Drift.png')
plt.show()
原始数据
移除漂移后的效果
二、测试集数据噪声移除
plt.figure(figsize=(20, 5))
let = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
r = test_df['signal'].rolling(30000).mean()
plt.plot(test_df['time'].values, r)
for i in range(21):
plt.plot([500 + i * 10, 500 + i * 10], [-3, 6], 'r:')
for i in range(5):
plt.plot([500 + i * 50, 500 + i * 50], [-3, 6], 'r')
for k in range(4):
plt.text(525 + k * 50, 5.5, str(k + 1), size=20)
for k in range(10):
plt.text(505 + k * 10, 4, let[k], size=16)
plt.title('Test Signal Rolling Mean. Has Drift wherever plot is not horizontal line', size=16)
plt.savefig('Test Signal Rolling Mean.png')
plt.show()
测试集样本
可以看到在测试集的子样本A,B,E,G,H,I和测试集 batch 3中观察到数据漂移。
移除噪声
# Remove Test Data Drift
test_copy = test_df.copy()
# REMOVE BATCH 1 DRIFT
start = 500
a = 0
b = 100000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
start = 510
a = 100000
b = 200000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
start = 540
a = 400000
b = 500000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
# REMOVE BATCH 2 DRIFT
start = 560
a = 600000
b = 700000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
start = 570
a = 700000
b = 800000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
start = 580
a = 800000
b = 900000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - 3 * (test_time[a:b] - start) / 10.
# REMOVE BATCH 3 DRIFT
def f(x):
return -(0.00788) * (x - 625) ** 2 + 2.345 + 2.58
a = 1000000
b = 1500000
test_copy.loc[test_copy.index[a:b], 'signal'] = test_signal[a:b] - f(test_time[a:b])
移除效果可视化
let = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
plt.figure(figsize=(20, 5))
res = 1000
plt.plot(range(0, test_copy.shape[0], res), test_copy['signal'][0::res])
for i in range(5):
plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for i in range(21):
plt.plot([i * 100000, i * 100000], [-5, 12.5], 'r:')
for k in range(4):
plt.text(k * 500000 + 250000, 10, str(k + 1), size=20)
for k in range(10):
plt.text(k * 100000 + 40000, 7.5, let[k], size=16)
plt.title('Test Signal without Drift', size=16)
plt.figure(figsize=(20, 5))
r = test_df['signal'].rolling(30000).mean()
plt.plot(test_time, r)
for i in range(21):
plt.plot([500 + i * 10, 500 + i * 10], [-2, 6], 'r:')
for i in range(5):
plt.plot([500 + i * 50, 500 + i * 50], [-2, 6], 'r')
for k in range(4):
plt.text(525 + k * 50, 5.5, str(k + 1), size=20)
for k in range(10):
plt.text(505 + k * 10, 4, let[k], size=16)
plt.title('Test Signal Rolling Mean without Drift', size=16)
plt.show()
测试集移除噪声前数据
测试集移除噪声后数据
测试集预测
# =====================Predict Test==================
sub = pd.read_csv('data/sample_submission.csv')
a = 0 # SUBSAMPLE A, Model 1s
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1s.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
a = 1 # SUBSAMPLE B, Model 3
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf3.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
a = 2 # SUBSAMPLE C, Model 5
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf5.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
a = 3 # SUBSAMPLE D, Model 1s
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1s.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
a = 4 # SUBSAMPLE E, Model 1f
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1f.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
a = 5 # SUBSAMPLE F, Model 10
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf10.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
a = 6 # SUBSAMPLE G, Model 5
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf5.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
a = 7 # SUBSAMPLE H, Model 10
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf10.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
a = 8 # SUBSAMPLE I, Model 1s
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf1s.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
a = 9 # SUBSAMPLE J, Model 3
sub.iloc[100000 * a:100000 * (a + 1), 1] = clf3.predict(
test_copy['signal'].values[100000 * a:100000 * (a + 1)].reshape((-1, 1)))
# BATCHES 3 AND 4, Model 1s
sub.iloc[1000000:2000000, 1] = clf1s.predict(test_copy['signal'].values[1000000:2000000].reshape((-1, 1)))
预测效果可视化
# ===============Display Test Predictions===============
let = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
plt.figure(figsize=(20, 5))
res = 1000
plt.plot(range(0, test_df.shape[0], res), sub.open_channels[0::res])
for i in range(5):
plt.plot([i * 500000, i * 500000], [-5, 12.5], 'r')
for i in range(21):
plt.plot([i * 100000, i * 100000], [-5, 12.5], 'r:')
for k in range(4):
plt.text(k * 500000 + 250000, 10, str(k + 1), size=20)
for k in range(10):
plt.text(k * 100000 + 40000, 7.5, let[k], size=16)
plt.title('Test Data Predictions', size=16)
plt.show()
测试集预测的开放通道数
保存预测文件
sub.to_csv('submission.csv', index=False, float_format='%.4f')
提交结果
提交结果补充说明
上述说到batch 2的500000-600000行的数据中存在噪声,因为正常情况应像后半段那样是水平线,可以看出前半段为斜线,为了拟合该段数据,我们需要求得线段的斜率k和偏置b,且由两点便可确定一条直线。故现在的任务是在前半段的斜线中如何确定该两个点。
batch 2中含有倾斜漂移
一、计算每个滑动窗口的均值,通过rolling方法计算每1000个数据的均值。下图可看到平均值于数据段拟合的不错。
# train data batch 2
start = 500000
end = 600000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:(end - start):step], zorder=1)
plt.show()
原始数据与rolling平均值的拟合线
二、在rolling的平均线上取两个点
start = 500000
end = 600000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
x_1 = start + 3000
x_2 = end - 3000
y_1 = mean[x_1]
y_2 = mean[x_2]
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:(end - start):step], zorder=1)
plt.scatter([x_1, x_2], [y_1, y_2], s=20, c='b', zorder=3)
plt.show()
image.png
三、通过两个点计算斜率k和偏置b,下图红线为最终拟合的斜线。
start = 500000
end = 600000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
x_1 = start + 3000
x_2 = end - 3000
y_1 = mean[x_1]
y_2 = mean[x_2]
k = (y_2 - y_1) / (x_2 - x_1)
b = y_2 - k * x_2
x = np.array([(start + i) for i in range(1, 100001)])
y = k * x + b
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:(end - start):step], zorder=1)
plt.plot(range(start, end, step), y[0:(end - start):step], 'r:', linewidth=5, zorder=2)
plt.scatter([x_1, x_2], [y_1, y_2], s=20, c='b', zorder=3)
plt.show()
image.png
四、添加常数constant
如果我们现在的斜线直接减拟合的线,其均值理论上为0,所以我们还要计算无噪声的后半段的均值,将其添加到相减之后的函数中。
start = 500000
end = 600000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
x_1 = start + 3000
x_2 = end - 3000
y_1 = mean[x_1]
y_2 = mean[x_2]
k = (y_2 - y_1) / (x_2 - x_1)
b = y_2 - k * x_2
x = np.array([(start + i) for i in range(1, 100001)])
y = k * x + b
constant = np.mean(train_df['signal'].values[end:1000000])
train_df.loc[train_df.index[start:end], 'signal'] = train_df['signal'].values[start:end] - (
k * ((train_df['time'].values[start:end] - 50) * 10000 + start) + b) + constant
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, 1000000, step), train_df['signal'].values[start:1000000:step], zorder=1)
plt.show()
消去噪声后的batch 2 图像
五、整理曲线公式
我们得到的拟合曲线的x每个步长为1,但是表格中time的步长为0.0001,所以在第4步中得到time的值后要乘以10000,且x又加了一个start。现希望能够直接化成关于time的公式。
转化公式
得到K=0.2957577872340394,B=-0.044442693920208054。K约等于0.3,B约等于0。所以有上述代码
train_copy.loc[train_copy.index[a:b], 'signal'] = signal - 3 * (time - 50) / 10.
同样对于batch 7也存在噪声,但是其为抛物线型,需要3个点才能确定一个一元二次方程。
batch 7的signal分布
一、计算每个滑动窗口的均值,通过rolling方法计算每1000个数据的均值。下图可看到平均值于数据段拟合的不错。
start = 3000000
end = 3500000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:end - start:step], zorder=1)
plt.show()
窗口平均值
二、在rolling的平均线上取三个点
start = 3000000
end = 3500000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
x_1 = start + 3000
x_2 = end - 3000
x_3 = (start + end) // 2
y_1 = mean[x_1]
y_2 = mean[x_2]
y_3 = mean[x_3]
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:end - start:step], zorder=1)
plt.scatter([x_1, x_2, x_3], [y_1, y_2, y_3], s=20, c='b', zorder=3)
plt.show()
在平均值上取3个点
三、通过三个点计算拟合曲线,下图红线为最终拟合的斜线。
根据下列公式,求得一元二次方程得a,b,c。
求得a,b,c
start = 3000000
end = 3500000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
x_1 = start + 3000
x_2 = end - 3000
x_3 = (start + end) // 2
y_1 = mean[x_1]
y_2 = mean[x_2]
y_3 = mean[x_3]
a = ((x_3 - x_1) * (y_2 - y_1) - (x_2 - x_1) * (y_3 - y_1)) / (x_3 - x_1) / (x_2 - x_1) / (x_2 - x_3)
b = ((y_3 - y_1) - a * (x_3 - x_1) * (x_3 + x_1)) / (x_3 - x_1)
c = y_1 - a * x_1 * x_1 - b * x_1
x = np.array([i + start for i in range(0, end - start)])
y = a * x * x + b * x + c
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.plot(range(start, end, step), mean[0:end - start:step], zorder=1)
plt.plot(range(start, end, step), y[0:end - start:step], 'r:', linewidth=5, zorder=2)
plt.scatter([x_1, x_2, x_3], [y_1, y_2, y_3], s=20, c='b', zorder=3)
plt.show()
红色为最终拟合的曲线
拟合曲线
四、添加常数constant
如果现在的抛物线直接减拟合的一元二次方程,其均值理论上为0,所以我们还要计算对应无噪声batch 3的均值,将其添加到相减之后的函数中。因为batch 3的open channels与batch 7的open channels非常接近。
训练集的通道数分布
start = 3000000
end = 3500000
window = 1000
mean = train_df.loc[start:end, 'signal'].rolling(window).mean()
x_1 = start + 3000
x_2 = end - 3000
x_3 = (start + end) // 2
y_1 = mean[x_1]
y_2 = mean[x_2]
y_3 = mean[x_3]
a = ((x_3 - x_1) * (y_2 - y_1) - (x_2 - x_1) * (y_3 - y_1)) / (x_3 - x_1) / (x_2 - x_1) / (x_2 - x_3)
b = ((y_3 - y_1) - a * (x_3 - x_1) * (x_3 + x_1)) / (x_3 - x_1)
c = y_1 - a * x_1 * x_1 - b * x_1
constant = np.mean(train_df['signal'].values[500000 * 2:500000 * 3])
x = (train_df['time'].values[start:end] - 50 * 6) * 10000 + start
train_df.loc[train_df.index[start:end], 'signal'] = train_df['signal'].values[start:end] \
- (a * x * x + b * x + c) + constant
plt.figure(figsize=(20, 5))
step = 100
plt.plot(range(start, end, step), train_df['signal'].values[start:end:step], zorder=1)
plt.show()
拟合后的水平数据
五、整理曲线公式
根据上述步骤,还未将代码中用到的该方程推理出来,so sad~~~
# 一元二次方程
def f(x, low, high, mid):
return -((-low + high) / 625) * (x - mid) ** 2 + high - low
总结
该kernel的亮点是充分观察了数据,挖掘特征之间的联系,在不同的数据段中使用不同的预测模型,并且手工去除了添加的噪声点,真可谓让人眼前一亮。
网友评论