1. Algorithm Overview
TrAdaboost was developed from the AdaBoost learning method by Wenyuan Dai. It addresses the problem of a training set and a test set that follow different distributions. In some transfer learning settings, the training set contains a large number of auxiliary training samples and only a small number of source training samples; the two differently distributed sets are combined and trained together, an approach also known as instance-based transfer learning.
2. Algorithm Principle
- TrAdaboost evolved from AdaBoost, so first recall the basic idea of AdaBoost: when a training sample is misclassified, the algorithm regards it as hard to classify and increases its sample weight accordingly, so that the probability of misclassifying it in the next round is reduced.
- Similarly, on a training set that mixes source training data and auxiliary training data, TrAdaboost [1] also adjusts the sample weights. For source samples, the strategy is essentially the same as in AdaBoost: if a source sample is misclassified, its weight is increased according to the error rate on the source samples in that round, which lowers the classification error in the next round. For auxiliary samples, the opposite holds: when they are misclassified, the algorithm takes this as evidence that they are very different from the target data and decreases their weights, following the Hedge(β) rule (see the sketch below).
- TrAdaboost boosts a sequence of weak classifiers and makes the final decision by a weighted vote over the second half (iterations N/2 to N) of those weak classifiers (Figure 1).
Figure 1
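To make the two update rules concrete, the following is a minimal sketch of a single round of weight updates, assuming binary 0/1 labels; the function and variable names (update_weights, w_aux, w_src, pred_aux, pred_src) are chosen for illustration only and are not part of the original implementation:

import numpy as np

def update_weights(w_aux, w_src, pred_aux, pred_src, y_aux, y_src, error_rate, n, N):
    # fixed Hedge(beta) factor for the auxiliary samples, beta < 1
    beta = 1 / (1 + np.sqrt(2 * np.log(n) / N))
    # AdaBoost-style factor for this round, beta_t < 1 when error_rate < 0.5
    beta_t = error_rate / (1 - error_rate)

    # source samples: a misclassification (|pred - y| = 1) raises the weight,
    # because beta_t < 1 and the exponent is negative
    w_src = w_src * np.power(beta_t, -np.abs(pred_src - y_src))
    # auxiliary samples: a misclassification lowers the weight,
    # because beta < 1 and the exponent is positive
    w_aux = w_aux * np.power(beta, np.abs(pred_aux - y_aux))
    return w_aux, w_src

The full implementation in the next section applies exactly these two rules inside the boosting loop.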
3. Code Implementation
GitHub repository: https://github.com/chenchiwei/tradaboost/tree/master
# -*- coding: UTF-8 -*-
import numpy as np
from sklearn import tree
# H       predicted labels for the test samples
# TrainS  source training samples (numpy array)
# TrainA  auxiliary training samples
# LabelS  source training labels
# LabelA  auxiliary training labels
# Test    test samples
# N       number of boosting iterations
def tradaboost(trans_S, trans_A, label_S, label_A, test, N):
    trans_data = np.concatenate((trans_A, trans_S), axis=0)
    trans_label = np.concatenate((label_A, label_S), axis=0)

    row_A = trans_A.shape[0]
    row_S = trans_S.shape[0]
    row_T = test.shape[0]

    test_data = np.concatenate((trans_data, test), axis=0)

    # initialize the weights
    weights_A = np.ones([row_A, 1]) / row_A
    weights_S = np.ones([row_S, 1]) / row_S
    weights = np.concatenate((weights_A, weights_S), axis=0)

    # Hedge(beta) factor for the auxiliary samples: beta = 1 / (1 + sqrt(2 * ln(n) / N))
    bata = 1 / (1 + np.sqrt(2 * np.log(row_A) / N))

    # store the beta_t of each iteration and the predictions of each weak classifier
    bata_T = np.zeros([1, N])
    result_label = np.ones([row_A + row_S + row_T, N])

    predict = np.zeros([row_T])

    print('params initial finished.')
    trans_data = np.asarray(trans_data, order='C')
    trans_label = np.asarray(trans_label, order='C')
    test_data = np.asarray(test_data, order='C')

    for i in range(N):
        P = calculate_P(weights, trans_label)

        result_label[:, i] = train_classify(trans_data, trans_label,
                                            test_data, P)
        print('result,', result_label[:, i], row_A, row_S, i, result_label.shape)

        # weighted error rate, measured on the source samples only
        error_rate = calculate_error_rate(label_S, result_label[row_A:row_A + row_S, i],
                                          weights[row_A:row_A + row_S, :])
        print('Error rate:', error_rate)
        if error_rate > 0.5:
            error_rate = 0.5
        if error_rate == 0:
            N = i
            break  # guard against overfitting
            # error_rate = 0.001

        bata_T[0, i] = error_rate / (1 - error_rate)

        # adjust the source-sample weights: beta_t < 1, so the negative exponent
        # raises the weight of misclassified source samples
        for j in range(row_S):
            weights[row_A + j] = weights[row_A + j] * np.power(bata_T[0, i],
                                                               -np.abs(result_label[row_A + j, i] - label_S[j]))

        # adjust the auxiliary-sample weights: beta < 1, so the positive exponent
        # lowers the weight of misclassified auxiliary samples (Hedge(beta))
        for j in range(row_A):
            weights[j] = weights[j] * np.power(bata, np.abs(result_label[j, i] - label_A[j]))
    # print(bata_T)

    for i in range(row_T):
        # vote over the second half (N/2 ~ N) of the weak classifiers;
        # the first row_A + row_S rows of result_label hold training-data predictions and are skipped
        left = np.sum(
            result_label[row_A + row_S + i, int(np.ceil(N / 2)):N] * np.log(1 / bata_T[0, int(np.ceil(N / 2)):N]))
        right = 0.5 * np.sum(np.log(1 / bata_T[0, int(np.ceil(N / 2)):N]))

        if left >= right:
            predict[i] = 1
        else:
            predict[i] = 0
        # print(left, right, predict[i])
    return predict

def calculate_P(weights, label):
    total = np.sum(weights)
    return np.asarray(weights / total, order='C')


def train_classify(trans_data, trans_label, test_data, P):
    clf = tree.DecisionTreeClassifier(criterion="gini", max_features="log2", splitter="random")
    clf.fit(trans_data, trans_label, sample_weight=P[:, 0])
    return clf.predict(test_data)


def calculate_error_rate(label_R, label_H, weight):
    total = np.sum(weight)

    print(weight[:, 0] / total)
    print(np.abs(label_R - label_H))
    return np.sum(weight[:, 0] / total * np.abs(label_R - label_H))
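For completeness, a minimal usage sketch with synthetic data; the data shapes, the 0/1 labeling rule, and the choice of 20 iterations are assumptions for illustration and are not part of the original repository:

if __name__ == '__main__':
    np.random.seed(0)
    # auxiliary data: many samples drawn from a shifted distribution
    trans_A = np.random.randn(200, 5) + 1.0
    label_A = (trans_A[:, 0] > 1.0).astype(float)
    # source data: few samples from the same distribution as the test set
    trans_S = np.random.randn(20, 5)
    label_S = (trans_S[:, 0] > 0.0).astype(float)
    # test data drawn from the source distribution
    test = np.random.randn(50, 5)

    pred = tradaboost(trans_S, trans_A, label_S, label_A, test, 20)
    print(pred)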
References:
[1] Dai W, Yang Q, Xue G R, et al. Boosting for transfer learning[C]//Proceedings of the 24th international conference on Machine learning. ACM, 2007: 193-200.
