Machine Learning Foundations: Homework 1

Author: ThomasYoungK | Published 2018-10-03 22:29

    Coursera's Machine Learning Foundations course by Hsuan-Tien Lin (林轩田) is a lot of fun. I've written up some of the programming assignments here, with reference to mac Jiang's solutions: https://blog.csdn.net/a1015553840/article/details/51085129

    Homework 1

    Problems 15-17 use the naive PLA (perceptron learning algorithm). The algorithm is as follows, with a toy single-update example right after the pseudocode:

    1. Initialize w
      repeat {
      1. Find the next point (x, y) that w(t) misclassifies, i.e. sign(w(t)' * x) != y;
      2. Correct the mistake: w(t+1) = w(t) + y * x;
      } until (no training sample is misclassified)
    2. Return w
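
    To make the update rule concrete, here is a single toy update with made-up numbers (purely illustrative; this is not part of the homework data):

    import numpy as np

    # a made-up point: x0 = 1 is prepended to the raw features, as in the homework code
    w = np.zeros(3)                  # current weights, all zero at the start
    x = np.array([1.0, 0.5, -0.2])   # input with the bias coordinate first
    y = 1                            # true label
    score = w.dot(x)                 # 0.0; the PLA convention treats sign(0) as -1
    if (1 if score > 0 else -1) != y:  # so this point counts as misclassified
        w = w + y * x                # one PLA correction
    print(w)                         # [ 1.   0.5 -0.2], which now gives this point a positive score

    The full implementation for problem 15, which visits the samples in the order they appear in hw1_15_train.dat, is below.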
    import csv
    import numpy as np


    def sign(value):
        """PLA convention: treat sign(0) as -1."""
        return 1 if value > 0 else -1


    def naive_PLA():
        updates = 0
        w = np.zeros(5)  # 4 features plus the bias weight for x0 = 1
        while True:
            halt = True
            with open('hw1_15_train.dat') as csvfile:
                reader = csv.reader(csvfile, delimiter='\t')
                for line in reader:
                    x = np.asarray(line[0].split(), dtype=float)
                    x = np.insert(x, 0, 1)  # prepend x0 = 1
                    y = int(line[1])
                    if sign(w.dot(x)) != y:  # misclassified point
                        updates += 1
                        w += y * x  # correct the mistake
                        halt = False
            if halt:  # a full pass with no mistakes: every sample is classified correctly
                break
        return updates


    The final result is 45 updates.


    For convenience, I defined a DataSet class and a PLA class.

    import csv
    import sys
    from random import shuffle

    import numpy as np

    # reuses the sign() helper defined in the previous snippet


    class DataSet:
        def __init__(self, filename):
            self.input = []
            self.output = []
            self.load_data(filename)

        def load_data(self, filename):
            """Read the tab-separated data file; features and labels are kept as strings."""
            with open(filename) as csvfile:
                reader = csv.reader(csvfile, delimiter='\t')
                for line in reader:
                    x = line[0].split()
                    x.insert(0, '1')  # prepend x0 = 1
                    y = line[1]
                    self.input.append(x)
                    self.output.append(y)


    class PLA:
        def __init__(self, train_name='hw1_15_train.dat', test_name=None):
            self.train_set = DataSet(train_name)
            if test_name:
                self.test_set = DataSet(test_name)

        def random_cycle_pla(self, times=2000, eta=1, print_out=False):
            """Run naive PLA `times` times on randomly shuffled data and return
            the average number of updates."""
            total_updates = 0
            data_set = list(zip(self.train_set.input, self.train_set.output))
            for _ in range(times):
                shuffle(data_set)
                current_updates = self.naive_pla(data_set, eta, print_out)
                total_updates += current_updates
            return total_updates / times

        def naive_pla(self, data_set=None, eta=1, print_out=False):
            """Naive perceptron learning algorithm; returns the number of updates."""
            current_updates = 0
            w = np.zeros(5)
            if not data_set:
                data_set = list(zip(self.train_set.input, self.train_set.output))
            while True:
                halt = True
                for item in data_set:
                    x = np.array(item[0], dtype=float)
                    y = np.array(item[1], dtype=int)
                    if sign(w.dot(x)) != y:
                        current_updates += 1
                        w += eta * y * x
                        halt = False
                if halt:
                    break
            if print_out:
                print(f'halted after {current_updates} updates')
            return current_updates


    Averaged over 200 runs, the mean number of updates is 38.145.
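
    For reference, problems 15 and 16 boil down to calls like these (a minimal usage sketch, assuming the classes above and hw1_15_train.dat in the working directory):

    pla = PLA('hw1_15_train.dat')
    print(pla.naive_pla())                  # problem 15: samples visited in file order
    print(pla.random_cycle_pla(times=200))  # problem 16: random order, averaged over 200 runs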


    For problem 17, simply multiply the update by eta = 0.5, i.e. w += eta * y * x.
    Averaged over 200 runs, the result is 40.245 updates.
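
    In code, that is just a different eta argument (a sketch, reusing the pla object from the snippet above):

    print(pla.random_cycle_pla(times=200, eta=0.5))  # problem 17: learning rate 0.5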

    Problems 18-20 involve data that is not linearly separable, so the pocket PLA algorithm is used:

    1. Initialize w and pocket_w
      repeat {
      1. Find a point (x, y) that w(t) misclassifies;
      2. Correct the mistake: w(t+1) = w(t) + y * x;
      3. If w(t+1) makes fewer errors on the training set than pocket_w, replace pocket_w with w(t+1);
      } until (enough updates have been made)
    2. Return pocket_w

    After every update, this algorithm has to count the errors the new w makes on all training samples, so it is computationally heavier than the naive PLA; the benefit is that it also produces a usable solution when the data is not linearly separable.
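
    Because this error count is recomputed after every single update, it dominates the running time; one easy way to speed it up is a vectorized count (a sketch under the assumption that the inputs have been stacked into numpy arrays X of shape (N, 5) and Y of shape (N,)):

    import numpy as np

    def errors_count_vectorized(w, X, Y):
        """Count misclassifications in one shot; X is (N, 5) floats, Y holds +/-1 labels."""
        predictions = np.where(X.dot(w) > 0, 1, -1)  # same convention: sign(0) counts as -1
        return int(np.sum(predictions != Y))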


    We only need to add an errors_count function that counts misclassified samples and a pocket_algorithm function that maintains pocket_w, and then compute the error rates.

        # The methods below extend the PLA class defined above (sys, shuffle,
        # numpy and the sign() helper are the ones imported/defined earlier).

        def errors_count(self, w, data_set):
            """Count how many samples in data_set are misclassified by w."""
            count = 0
            for x, y in data_set:
                x = np.array(x, dtype=float)
                y = np.array(y, dtype=int)
                if sign(w.dot(x)) != y:
                    count += 1
            return count

        def pocket_algorithm(self, update_times=50, pocket=True):
            """Run pocket PLA for at most update_times updates.
            pocket=True: return pocket_weight; otherwise return the final w."""
            data_set = list(zip(self.train_set.input, self.train_set.output))
            updates = 0
            w = np.zeros(5)
            pocket_weight = np.zeros(5)  # same dimension as w
            min_errors = sys.maxsize
            halt = False
            while not halt:
                shuffle(data_set)
                for item in data_set:
                    x = np.array(item[0], dtype=float)
                    y = np.array(item[1], dtype=int)
                    if sign(w.dot(x)) != y:
                        w = w + y * x  # w is updated on every mistake
                        updates += 1
                        errors = self.errors_count(w, data_set)
                        if errors < min_errors:
                            min_errors = errors
                            pocket_weight = w  # pocket_weight only changes when w improves
                    if updates >= update_times or min_errors == 0:
                        halt = True
                        break
            return pocket_weight if pocket else w

        def cal_test_error_rate(self, update_times=50, times=2000, pocket=True):
            """Average the training and test error rates of pocket PLA over `times` runs."""
            train_set = list(zip(self.train_set.input, self.train_set.output))
            test_set = list(zip(self.test_set.input, self.test_set.output))
            train_avg_rate, test_avg_rate = 0, 0
            for _ in range(times):
                w = self.pocket_algorithm(update_times=update_times, pocket=pocket)
                train_error_counts = self.errors_count(w, train_set)
                test_error_counts = self.errors_count(w, test_set)
                train_avg_rate += train_error_counts / len(train_set)
                test_avg_rate += test_error_counts / len(test_set)
            return train_avg_rate / times, test_avg_rate / times


    For problem 18, the average test error rate over 2000 runs is 0.13131999999999994.



    For problem 19, pocket_algorithm returns w instead of the best weights pocket_weight (pocket=False); the average error rate over 200 runs is 0.3798.


    For problem 20, increase the number of pocket-algorithm updates from 50 to 100; the error rate drops slightly, and the average over 200 runs is 0.11630000000000004.
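
    For reference, the runs above can be reproduced with calls like the following (a sketch; the data-file names hw1_18_train.dat and hw1_18_test.dat are assumptions, so adjust them to your local copies):

    # each call returns (average train error rate, average test error rate)
    pla18 = PLA('hw1_18_train.dat', 'hw1_18_test.dat')
    print(pla18.cal_test_error_rate(update_times=50, times=2000))               # problem 18
    print(pla18.cal_test_error_rate(update_times=50, times=200, pocket=False))  # problem 19
    print(pla18.cal_test_error_rate(update_times=100, times=200))               # problem 20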

    Links to all of the homework write-ups: https://www.jianshu.com/p/c8d06e7cb3c4
