2020-08-22--梯度下降法02

作者: program_white | 来源:发表于2020-08-24 23:20 被阅读0次

1.随机梯度下降法

1.1随机梯度下降法介绍

批量梯度下降法带来的一个问题是η的值需要设置的比较小，在样本数比较多的时候导致速度特别慢。
这时候观察随机梯度下降法损失函数的求导公式，可以发现，我们对每一项数据都做了求和操作，又在最外面除以了m，那么可以考虑将求和和除以m的两个运算约掉，采用每次使用一个随机的一行样本。

上式推导过程之前说过，这里不在赘述。计算结果是一个(n+1) * 1的向量。

这样计算也就是说每次计算样本点中某一个点的DJ，之前的DJ是所有点的DJ，但是要除以m，为了效率，使用这种方式用来表示每次梯度下降值。

随机梯度下降法俯视图例：

它的轨迹相对曲折，因为每次计算下降值使用的都可能是不同的样本点。所以导致我们的最终结果不会像批量梯度下降法一样准确的朝着一个方向运算，而是曲线行下降。
问题：
这样的话当下降到接近于最佳theta的位置时，由于学习率eta不变的，这就导致他一直在底部徘徊，也就是上图俯视图的中央部分徘徊，类似于跳棋中在接进于终点的地方，如果跳的过大，会往回跳出来，所以这时候除非摇到刚好道终点的步数，否则只能一点一点的走。
还有之前学习一元线性回归中η值不合理的时候，会学习过头，跑到对面去了。
解决：
跟跳棋一个道理，在接近最佳值的时候，由于DJ是随机的，改变不了，那么这时候我们就希望，越到下面，η值相应减小，让运算次数变多，从而慢慢一步一步的精确计算结果。

为了使下降值变化的不那么剧烈，使用退火的方式：

学习率 = t/（迭代次数+t1）,t0,t1为超参数：

1.2 随机梯度下降法

1.生成数据

import numpy as np
import matplotlib.pyplot as plt

m = 100000

'''生成数据'''
x = np.random.normal(size=m)
X = x.reshape(-1,1)
print(X.shape)
# (100000, 1)
y = 4.*x + 3. + np.random.normal(0, 3, size=m)
print(y.shape)
# (100000,)

plt.scatter(x, y)
plt.show()

运行：

2.设置参数，使用批量梯度下降法

from Tidudown.manyTidu.Test_01 import gradient_descent_many

'''设置参数'''
X_b = np.hstack([np.ones((len(X),1)),X])
print(X_b.shape)
# (100000, 2)
init_theta = np.zeros(X_b.shape[1])
eta = 0.01

'''批量梯度下降'''
start = time.time()
theta = gradient_descent_many(X_b,y,init_theta,eta)
end = time.time()
print(end-start)     # 0.5734329223632812
print(theta)
# [3.00082657 4.00464813]

可以看到100000个样本使用批量梯度下降法的时间，大约0.57秒。

3.使用随机梯度下降法

1.DJ_sgd()函数
根据公式计算某一个随机样本的DJ：

'''随机梯度下降法的DJ函数'''
def dJ_sgd(theta,X_b_i,y_i):
    return 2 * (X_b_i.T).dot(X_b_i.dot(theta) - y_i)

2.随机梯度逻辑函数
参数：

X_b:处理后的X
y：x对应的正确值
init_theta:初始theta值
n_iter:循环次数

没有eta，因为eta式变化的，在函数内部定义生成。

没有最小差距值判断，因为下降值为随机点计算得值，所以其对应损失函数J得差距值也是不确定的，当设置了最小值判断后，如果在没找到最佳值得情况下，连续的两点的J差距值很小的话，会跳出循环，影响结果。

'''随机梯度下降函数'''
def sgd(X_b,y,init_theta,n_iters=10000):

    # 学习率函数
    t0, t1 = 5, 50
    def learning_rate(i_iter):
        return t0 / (i_iter + t1)

    theta = init_theta
    for i_iter in range(n_iters):
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta,X_b[rand_i],y[rand_i])
        theta = theta - learning_rate(i_iter) * gradient

    return theta

调用：

init_theta1 = np.zeros(X_b.shape[1])
start = time.time()
theta = sgd(X_b,y,init_theta)
end = time.time()
print(end-start)     # 0.11940121650695801
print(theta)
# [2.93440611 4.06238553]

比较两种方式，随机梯度下降效率高，但是结果不如批量梯度下降。

1.3随机梯度下降法的封装和测试

我们还是封装在线性回归类--LinearRegression中：

'''随机梯度下降寻找theta'''
    def fit_sgd(self,X_train,y_train,n_iters=1e4,t0=5,t1=50):

        '''随机梯度下降法的DJ函数'''
        def dJ_sgd(theta, X_b_i, y_i):
            return 2 * (X_b_i.T).dot(X_b_i.dot(theta) - y_i)

        '''学习率函数'''
        def learning_rate(i_iter):
            return t0 / (i_iter + t1)

        '''随机梯度下降函数'''
        def sgd(X_b, y, init_theta):
            # 寻找n_iter次
            theta = init_theta
            for i_iter in range(int(n_iters)):
                rand_i = np.random.randint(len(X_b))
                gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
                theta = theta - learning_rate(i_iter) * gradient

            return theta

        # 生成X_b,init_theta
        X_b = np.hstack([np.ones((len(X), 1)), X_train])
        init_theta = np.zeros(X_b.shape[1])
        # 调用sgd函数
        theta = sgd(X_b, y_train, init_theta)
        # 生成结果
        self._theta = theta
        self.coef_ = theta[1:]
        self.interception_ = theta[0]

        return self

问题：
我们每次计算的下降值都是由学习率和随机点的DJ决定的，因为是随机的点，所以偶然性比较大。不能保证每个点都取到，所以这样效果并不好。

解决：

所以为了保证每个点都可以取得到，兵器而还要保证效率，对随机梯度下降函数2：

'''随机梯度下降寻找theta--2'''
    def fit_sgd2(self,X_train,y_train,n_iters=4,t0=5,t1=50):

        '''随机梯度下降法的DJ函数'''
        def dJ_sgd(theta, X_b_i, y_i):
            return 2 * (X_b_i.T).dot(X_b_i.dot(theta) - y_i)

        '''学习率函数'''
        def learning_rate(i_iter):
            return t0 / (i_iter + t1)

        '''随机梯度下降函数'''
        def sgd(X_b, y, init_theta):

            theta = init_theta

            for i_iter in range(int(n_iters)):
                ## 对X_b进行一个乱序的排序
                indexes = np.random.permutation(len(X_b))
                X_b_new = X_b[indexes]
                y_new = y[indexes]

                for i in range(len(X_b)):

                    gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
                    theta = theta - learning_rate(i_iter) * gradient

            return theta

        X_b = np.hstack([np.ones((len(X), 1)), X_train])
        init_theta = np.zeros(X_b.shape[1])

        theta = sgd(X_b, y_train, init_theta)

        self._theta = theta
        self.coef_ = theta[1:]
        self.interception_ = theta[0]

        return self

对sgd()中进行改造，逻辑：

对于n_iters参数的含义，在随机梯度下降中值得是循环查找次数，而在这里指的是查找几遍X_b。
在sgd中首先获取初始值theta，设置两层循环，遍数和遍历X_b。
在遍历X_b之前打乱X_b以及对应的y，每遍历一次X_b中的数据项，就计算一次theta，也就是下降一次theta，一直到n_iters遍X_b完为止。

测试：

x = 2 * np.random.random(size=100000)
# 修改为100行1列的二维数组,也就是100个数据，每个数据1个特征
X = x.reshape(-1,1)
# 生成100个靠近y = 3x + 4 直线的点
y = x * 3. + 4. + np.random.normal(size=100000)

# 实例化算法模型
lin = LinearRegression()
lin.fit_sgd2(X,y,n_iters=2)
print(lin.coef_)
print(lin._theta)
print(lin.interception_)
# [3.10241542]
# [4.16381465 3.10241542]
# 4.163814652107887

这两种随机梯度下降法在数据量多的情况下为了提高效率可以使用，但是如果数据量不多，效果就并不是很好了，只是效率快多了。

还是建议使用公式法 (fit_normal) 和批量随机梯度下降法(fit_gd)。

1.4 boston房价真实使用随机下降法

数据处理：

from sklearn import datasets
from KNN.Knn_2iris02.train_split_def import train_test_split
from KNN.Knn_4_.knn_StandardScaler import StandardScaler
from Tidudown.manyTidu.tuduClass import LinearRegression

boston = datasets.load_boston()
X = boston.data
y = boston.target

'''消除错误数据'''
X = X[y < 50.0]
y = y[y < 50.0]

'''数据分割'''
X_train,y_train,x_test,y_test = train_test_split(X,y)

'''数据归一化'''
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(x_test)

训练模型：

lin = LinearRegression()

lin.fit_sgd2(X_train_standard,y_train)
print(lin.coef_)
print(lin._theta)
print(lin.interception_)
# [-0.72740041  1.11363104 -1.23455594 -0.01967048 -1.02025919  2.58628092
#  -0.92394561 -2.11695113  0.99512401 -1.1331838  -1.54966651  0.85111487
#  -3.09143846]
# [ 2.16848867e+01 -7.27400414e-01  1.11363104e+00 -1.23455594e+00
#  -1.96704786e-02 -1.02025919e+00  2.58628092e+00 -9.23945608e-01
#  -2.11695113e+00  9.95124008e-01 -1.13318380e+00 -1.54966651e+00
#   8.51114874e-01 -3.09143846e+00]
# 21.6848867402812

1.5 sklearn中的随机梯度下降法

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor()
sgd_reg.fit(X_train_standard, y_train)

score = sgd_reg.score(X_test_standard, y_test)
print(score)         # 0.7493196823981358

需要注意的是sklearn中的梯度下降法比我们自己的算法要复杂的多，性能和计算准确度上都比我们的要好，我们的算法只是用来演示过程，具体生产上的使用还是应该使用Sklearn提供的

2.梯度下降法的调试

可能我们计算出梯度下降法的公式，并使用python编程实现，预测的过程中程序并没有报错，但是可能我们需要求的梯度的结果是错误的，这个时候需要怎么样去调试发现错误呢。

首先以二维坐标平面为例，一个点（O）的导数就是曲线在这个点的切线的斜率，在这个点两侧各取一个点（AB），那么AB两点对应的直线的斜率应该大体等于O的切线的斜率，并且这A和B的距离越近，那么两条直线的斜率就越接近。

事实上，这也正是导数的定义，当函数y=f(x)的自变量x在一点x0上产生一个增量Δx时，函数输出值的增量Δy与自变量增量Δx的比值在Δx趋于0时的极限a如果存在，a即为在x0处的导数，记作f'(x0)或df(x0)/dx

右侧公式推导也很简单，J的倒数DJ的值就是某一点沿着J函数的切线斜率，我们把斜率计算的原理法国大出来，就是该公式。

扩展到多维维度则如下：

我们在定义：

那么最后公式就为(以theta(0)为例)：

2.1 梯度下降调试代码实现

这个调试debug代码与之前的批量梯度下降的不同之处只是DJ的计算方式不同，之前使用的是公式推导出来的，而调试相对简单，使用上述推导方式。

J函数：

# 损失函数J
def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta))**2) / len(X_b)
    except:
        return float('inf')

dj_math()以及dJ_debug()如下：

# 批量梯度下降法公式计算DJ
def dJ_math(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def dJ_debug(theta,X_b,y,epsilon=0.01):
    result = np.empty(len(theta))

    for i in range(len(theta)):
        # theta(+)
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        # theta(-)
        theta_2 = theta.copy()
        theta_2[i] -= epsilon

        # 按照公式计算result的第i项
        result[i] =(J(theta_1,X_b,y) - J(theta_2,X_b,y)) / (2*epsilon)

    return result

梯度下降函数：

该函数的参数较之前的函数多了一个dJ参数，指的是传进来的计算DJ值的函数，这样就可以随时切换计算DJ值得方式了。

def gradient_descent(dJ,X_b,y,init_theta,eta,n_iter = 1e5,epsilon = 0.001):

    theta = init_theta
    i_iter = 0
    while i_iter < n_iter:
        gradient = dJ(theta,X_b,y)
        last = theta
        theta = theta - eta * gradient
        if abs(J(theta,X_b,y) - J(last,X_b,y)) < epsilon:
            break
        i_iter +=1

    return theta

测试数据：

import numpy as np

np.random.seed(666)
X = np.random.random(size=(1000, 10))
X_b = np.hstack([np.ones((len(X), 1)), X])
print(X_b.shape)
# (1000, 11)
true_theta = np.arange(1, 12, dtype=float)
y = X_b.dot(true_theta) + np.random.normal(size=1000)

init_theta = np.zeros(X_b.shape[1])
eta = 0.01

# 调用梯度下降函数
theta = gradient_descent(dJ_math,X_b,y,init_theta,eta)
print(theta)
# [ 1.2238063   1.9929646   2.99955613  3.74077918  5.05718132  5.87397064
#   7.061624    7.87575554  8.94828238 10.11907834 11.00453172]
theta = gradient_descent(dJ_debug,X_b,y,init_theta,eta)
print(theta)
# [ 1.2238063   1.9929646   2.99955613  3.74077918  5.05718132  5.87397064
#   7.061624    7.87575554  8.94828238 10.11907834 11.00453172]

由此可以看出，我们的dJ_debug和dJ_math的结果是相近的，所以我们的dJ_math的数学推导是没问题的。我们可以在真正的机器学习之前，先使用dJ_debug这种调试方式来验证一下我们的dJ_math的结果是否正确，然后再进行机器学习。

dJ_debug的效率不如dJ_math。

dJ_debug是通用的，可以放在任何求导的debug过程中，所以可以作为我们机器学习的工具箱来使用。

附录

LinearRegression--线性回归类：

class LinearRegression:

    def __init__(self):
        """初始化Linear Regression模型"""

        ## 系数向量（θ1,θ2,.....θn）
        self.coef_ = None
        ## 截距 (θ0)
        self.interception_ = None
        ## θ向量
        self._theta = None

    '''线性回归公式训练模型，计算theta'''
    def fit_normal(self, X_train, y_train):
        # 拼接为X(b)格式的数据，-----在每行的第一列之前加上1.
        ones_vector = np.ones((len(X_train), 1))
        X_b = np.hstack([ones_vector, X_train])

        # 根据X_b带入公式计算w
        # arr.dot(arr):点乘
        # np.linalg.inv(arr):矩阵的逆
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.coef_ = self._theta[1:]
        self.interception_ = self._theta[0]

        return self

    '''使用梯度下降寻找theta'''
    def fit_gd(self, X_train, y_train, eta=0.0001, n_iters=1e5):
        """根据训练数据集X_train, y_train, 使用梯度下降法训练Linear Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        def J(theta, X_b, y):
            try:
                return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
            except:
                return float('inf')

        '''向量化计算'''
        def dJ(theta, X_b, y):
            # res = np.empty(len(theta))
            # res[0] = np.sum(X_b.dot(theta) - y)
            # for i in range(1, len(theta)):
            #     res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
            # return res * 2 / len(X_b)
            return X_b.T.dot(X_b.dot(theta) - y) * 2 / len(X_b)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):

            theta = initial_theta
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                    break

                cur_iter += 1

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    '''随机梯度下降寻找theta'''
    def fit_sgd(self,X_train,y_train,n_iters=1e4,t0=5,t1=50):

        '''随机梯度下降法的DJ函数'''
        def dJ_sgd(theta, X_b_i, y_i):
            return 2 * (X_b_i.T).dot(X_b_i.dot(theta) - y_i)

        '''学习率函数'''
        def learning_rate(i_iter):
            return t0 / (i_iter + t1)

        '''随机梯度下降函数'''
        def sgd(X_b, y, init_theta,):

            theta = init_theta
            for i_iter in range(int(n_iters)):
                rand_i = np.random.randint(len(X_b))
                gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
                theta = theta - learning_rate(i_iter) * gradient

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        init_theta = np.zeros(X_b.shape[1])

        theta = sgd(X_b, y_train, init_theta)

        self._theta = theta
        self.coef_ = theta[1:]
        self.interception_ = theta[0]

        return self


    '''随机梯度下降寻找theta--2'''
    def fit_sgd2(self,X_train,y_train,n_iters=2,t0=5,t1=50):

        '''随机梯度下降法的DJ函数'''
        def dJ_sgd(theta, X_b_i, y_i):
            return 2 * (X_b_i.T).dot(X_b_i.dot(theta) - y_i)

        '''学习率函数'''
        def learning_rate(i_iter):
            return t0 / (i_iter + t1)

        '''随机梯度下降函数'''
        def sgd(X_b, y, init_theta):

            theta = init_theta

            for i_iter in range(int(n_iters)):
                ## 对X_b进行一个乱序的排序
                indexes = np.random.permutation(len(X_b))
                X_b_new = X_b[indexes]
                y_new = y[indexes]

                for i in range(len(X_b)):

                    gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
                    theta = theta - learning_rate( i_iter*len(X_b_new) + i) * gradient

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        init_theta = np.zeros(X_b.shape[1])

        theta = sgd(X_b, y_train, init_theta)

        self._theta = theta
        self.coef_ = theta[1:]
        self.interception_ = theta[0]

        return self

    '''测试集'''
    def perdict(self, X_perdict):
        ones_vector = np.ones((len(X_perdict), 1))
        X_b = np.hstack([ones_vector, X_perdict])
        y_head = X_b.dot(self._theta)
        return y_head

    '''MSE'''
    def mean_squared_error(self, y_true, y_predict):
        """计算y_true和y_predict之间的MSE"""
        assert len(y_true) == len(y_predict), \
            "the size of y_true must be equal to the size of y_predict"

        return np.sum((y_true - y_predict) ** 2) / len(y_true)

    '''R^2'''
    def r2_score(self, y_true, y_perdict):
        r2_score = 1 - self.mean_squared_error(y_true, y_perdict) / np.var(y_true)
        return r2_score

    '''正确率'''
    def score(self,X_test,y_test):
        y_perdict = self.perdict(X_test)
        score = self.r2_score(y_perdict,y_test)
        return score


    def __repr__(self):
        return 'LinearRegression'

2020-08-22--梯度下降法02

1.随机梯度下降法

1.1随机梯度下降法介绍

1.2 随机梯度下降法

1.生成数据

2.设置参数，使用批量梯度下降法

3.使用随机梯度下降法

1.3随机梯度下降法的封装和测试

1.4 boston房价真实使用随机下降法

1.5 sklearn中的随机梯度下降法

2.梯度下降法的调试

2.1 梯度下降调试代码实现

附录

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读