
TensorFlow ML Cookbook, Chapter 3, Sections 4 and 5: Understanding Loss Functions in Linear Regression and Implementing Deming Regression

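Reading guide:
1. How do loss functions affect convergence in linear regression?
2. What is Deming regression, and how is it implemented?
3. How does the learning rate affect the L1 and L2 losses?
4. How does regular linear regression differ from Deming regression?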



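Previous: TensorFlow ML Cookbook, Chapter 3, Sections 1-3: Using the Matrix Inverse Method, Implementing a Decomposition Method, and Learning the TensorFlow Way of Linear Regression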

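Understanding Loss Functions in Linear Regression

It is important to know how the choice of loss function affects algorithm convergence. Here we illustrate how the L1 and L2 loss functions affect convergence in linear regression.

Getting ready
We will use the same iris dataset as in the prior recipe, but we will change the loss function and the learning rate to see how convergence changes.

How to do it
1. The start of the program is unchanged from before until we reach the loss function. We load the necessary libraries, start a session, load the data, create placeholders, and define our variables and model. One thing to note is that we pull the learning rate and the number of iterations out into their own variables, because we want to show the effect of quickly changing these parameters. The code uses the TensorFlow 1.x API (tf.Session, tf.placeholder). Use the following code: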
[mw_shl_code=python,true]import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([x[3] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
batch_size = 25
learning_rate = 0.1 # Will not converge with learning rate at 0.4
iterations = 50
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b)[/mw_shl_code]

2. Our loss function will change to the L1 loss, as follows:
[mw_shl_code=python,true]loss_l1 = tf.reduce_mean(tf.abs(y_target - model_output))[/mw_shl_code]
Note that we can change this back to the L2 loss by substituting the following expression:
[mw_shl_code=python,true]tf.reduce_mean(tf.square(y_target - model_output))[/mw_shl_code]

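3. Now we declare our optimizer, initialize the variables, and loop through the training part. Note that we also save the loss at every generation to measure convergence. The plot at the end needs an L2 loss history as well, so the code trains once with each loss and resets A and b in between. Use the following code:
[mw_shl_code=python,true]# Name the L2 loss too, so that both curves can be plotted
loss_l2 = tf.reduce_mean(tf.square(y_target - model_output))
init = tf.global_variables_initializer()
my_opt_l1 = tf.train.GradientDescentOptimizer(learning_rate)
train_step_l1 = my_opt_l1.minimize(loss_l1)
my_opt_l2 = tf.train.GradientDescentOptimizer(learning_rate)
train_step_l2 = my_opt_l2.minimize(loss_l2)

def train(train_step, loss):
    # Reset A and b, then record the batch loss at every generation
    sess.run(init)
    loss_vec = []
    for i in range(iterations):
        rand_index = np.random.choice(len(x_vals), size=batch_size)
        rand_x = np.transpose([x_vals[rand_index]])
        rand_y = np.transpose([y_vals[rand_index]])
        feed = {x_data: rand_x, y_target: rand_y}
        sess.run(train_step, feed_dict=feed)
        loss_vec.append(sess.run(loss, feed_dict=feed))
        if (i+1) % 25 == 0:
            print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)) + ' b = ' + str(sess.run(b)))
    return loss_vec

loss_vec_l1 = train(train_step_l1, loss_l1)
loss_vec_l2 = train(train_step_l2, loss_l2)
plt.plot(loss_vec_l1, 'k-', label='L1 Loss')
plt.plot(loss_vec_l2, 'r--', label='L2 Loss')
plt.title('L1 and L2 Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.show()[/mw_shl_code]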
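
The two losses from step 2 can be checked by hand on a few numbers. The following is a minimal sketch in plain Python, without TensorFlow; the helper names `l1_loss` and `l2_loss` and the sample values are made up for illustration and are not part of the recipe:

```python
def l1_loss(targets, predictions):
    """Mean absolute error, a plain-Python analogue of loss_l1."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def l2_loss(targets, predictions):
    """Mean squared error, an analogue of the L2 expression in step 2."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Made-up targets and predictions; only the last prediction is wrong (off by 2)
targets = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 5.0]
print('L1 loss:', l1_loss(targets, predictions))
print('L2 loss:', l2_loss(targets, predictions))
```

An error of 2 contributes 2 to the L1 sum but 4 to the L2 sum, which is why the two losses react differently to the same learning rate.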

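How it works
When choosing a loss function, we must also choose a learning rate that works with our problem. Here we illustrate two situations: one in which L2 is preferred and one in which L1 is preferred.

If the learning rate is small, convergence takes more time. If it is too large, the algorithm may never converge. Here is a plot of the L1 and L2 losses for the iris linear regression problem when the learning rate is 0.05:
2018-05-22_102421.jpg
Figure 5: The L1 and L2 losses for the iris linear regression problem with a learning rate of 0.05.
With a learning rate of 0.05, L2 loss appears to be preferred, as it converges to a lower loss on the data. Here is a plot of the loss functions when we increase the learning rate to 0.4:
2018-05-22_102458.jpg

Figure 6: The L1 and L2 losses for the iris linear regression problem with a learning rate of 0.4. Note that the L1 loss is not visible because of the large scale of the y-axis.
Here we can see that the large learning rate can overshoot with the L2 norm, whereas the L1 norm converges.

There's more
To understand what is happening, we should look at how a large and a small learning rate act on the L1 and L2 norms. To visualize this, we look at a one-dimensional representation of the learning steps on both norms, as follows:
2018-05-22_102532.jpg
Figure 7: What can happen with the L1 and L2 norms with larger and smaller learning rates.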
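
A one-dimensional toy problem is one way to reproduce this behaviour numerically. The sketch below is an illustration under stated assumptions, not the code behind Figure 7: it minimizes |w| (L1) and w^2 (L2) by plain gradient descent starting from w = 1.0, and the function names and the learning rates 0.1 and 1.1 are chosen only for the demonstration.

```python
def descend(grad, w0, learning_rate, steps):
    """Run plain gradient descent in one dimension and return every iterate."""
    path = [w0]
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        path.append(w)
    return path


def grad_l1(w):
    # Derivative of |w|: the sign of w (0 exactly at the minimum)
    return (w > 0) - (w < 0)


def grad_l2(w):
    # Derivative of w^2
    return 2.0 * w


for lr in (0.1, 1.1):
    l1_path = descend(grad_l1, 1.0, lr, 20)
    l2_path = descend(grad_l2, 1.0, lr, 20)
    print('learning rate', lr,
          '| final |w| with L1:', abs(l1_path[-1]),
          '| final |w| with L2:', abs(l2_path[-1]))
```

The L2 gradient grows with the distance from the minimum, so once the learning rate is too large every step lands farther away than it started. The L1 gradient has constant magnitude, so the iterate can only bounce around the minimum within one step length. The rate at which L2 starts to diverge depends on the scale of the data, so the threshold in this toy problem is not the 0.4 seen in the recipe.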

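Implementing Deming Regression
In this recipe, we implement Deming regression (total regression), which means we need a different way to measure the distance between the model line and the data points.

Getting ready
Where least squares linear regression minimizes the vertical distance to the line, Deming regression minimizes the total distance to the line. This type of regression minimizes the error in both the y values and the x values. See the following figure for a comparison:

2018-05-22_102604.jpg
Figure 8: The difference between regular linear regression and Deming regression. Linear regression, on the left, minimizes the vertical distance to the line; Deming regression minimizes the total distance to the line.
To implement Deming regression, we have to modify the loss function. The loss function in regular linear regression minimizes the vertical distance. Here, we want to minimize the total distance. Given the slope and intercept of a line, the perpendicular distance to a point is a known geometric formula. We just have to substitute this formula in and tell TensorFlow to minimize it.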
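
To make the two distances concrete, here is a small sketch in plain Python. It assumes the standard point-to-line distance for a line y = m*x + b; the function names and the sample line and point are hypothetical and chosen only for illustration:

```python
import math


def vertical_distance(m, b, x0, y0):
    """Vertical offset from the point (x0, y0) to the line y = m*x + b."""
    return abs(y0 - (m * x0 + b))


def perpendicular_distance(m, b, x0, y0):
    """Shortest (perpendicular) distance from (x0, y0) to the line y = m*x + b."""
    return abs(y0 - (m * x0 + b)) / math.sqrt(m * m + 1.0)


# Hypothetical example: the line y = x and the point (0, 1)
print('vertical:', vertical_distance(1.0, 0.0, 0.0, 1.0))
print('perpendicular:', perpendicular_distance(1.0, 0.0, 0.0, 1.0))
```

The perpendicular distance is never larger than the vertical one, and the two coincide only when the line is horizontal (m = 0).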

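How to do it
1. Everything stays the same until we get to the loss function. We begin by loading the libraries, starting a session, loading the data, declaring the batch size, and creating the placeholders, variables, and model output, as follows: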
[mw_shl_code=python,true]import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([x[3] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
batch_size = 50
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b) [/mw_shl_code]

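2. The loss function is a geometric formula with a numerator and a denominator. For clarity we write these out separately. Given a line y = mx + b and a point (x0, y0), the perpendicular distance between the two is d = |y0 - (m*x0 + b)| / sqrt(m^2 + 1):
2018-05-22_102646.jpg
[mw_shl_code=python,true]# Numerator: |y0 - (m*x0 + b)|, the vertical offset from the line
demming_numerator = tf.abs(tf.subtract(y_target, tf.add(tf.matmul(x_data, A), b)))
# Denominator: sqrt(m^2 + 1), which turns the vertical offset into a perpendicular one
demming_denominator = tf.sqrt(tf.add(tf.square(A), 1))
loss = tf.reduce_mean(tf.truediv(demming_numerator, demming_denominator))[/mw_shl_code]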

3. We now initialize our variables, declare our optimizer, and loop through the training set to arrive at our parameters, as follows:
[mw_shl_code=python,true]init = tf.global_variables_initializer()
sess.run(init)
my_opt = tf.train.GradientDescentOptimizer(0.1)
train_step = my_opt.minimize(loss)
loss_vec = []
for i in range(250):
    # Sample a random batch (with replacement) and reshape it into columns
    rand_index = np.random.choice(len(x_vals), size=batch_size)
    rand_x = np.transpose([x_vals[rand_index]])
    rand_y = np.transpose([y_vals[rand_index]])
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    temp_loss = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
    loss_vec.append(temp_loss)
    if (i+1) % 50 == 0:
        print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)) + ' b = ' + str(sess.run(b)))
        print('Loss = ' + str(temp_loss))[/mw_shl_code]

4. We can plot the output with the following code:
[mw_shl_code=python,true]# A and b have shape [1, 1], so unpack both levels to get scalars
[[slope]] = sess.run(A)
[[y_intercept]] = sess.run(b)
best_fit = []
for i in x_vals:
    best_fit.append(slope*i+y_intercept)
# Plot once, after all fitted values have been collected
plt.plot(x_vals, y_vals, 'o', label='Data Points')
plt.plot(x_vals, best_fit, 'r-', label='Best fit line', linewidth=3)
plt.legend(loc='upper left')
plt.title('Sepal Length vs Petal Width')
plt.xlabel('Petal Width')
plt.ylabel('Sepal Length')
plt.show()[/mw_shl_code]

2018-05-22_102718.jpg
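Figure 9: The solution to Deming regression on the iris dataset.

How it works
The recipe here for Deming regression is almost identical to regular linear regression. The key difference is how we measure the loss between the predictions and the data points. Instead of a vertical loss, we have a perpendicular loss (or total loss) that accounts for both the y values and the x values.

Note that the type of Deming regression implemented here is called total regression. Total regression assumes that the errors in the x and y values are similar. We can also scale the x and y axes in the distance calculation by the difference in the errors, according to our beliefs.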
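
Scaling by the difference in the errors can be sketched with the textbook closed-form Deming fit. This is a different technique from the recipe: it minimizes squared distances in closed form, whereas the recipe minimizes the mean absolute perpendicular distance by gradient descent. The function name `deming_fit`, the parameter `delta` (the assumed ratio of the y-error variance to the x-error variance), and the sample data are assumptions made for illustration:

```python
import math


def deming_fit(xs, ys, delta=1.0):
    """Closed-form Deming fit (squared errors); returns (slope, intercept).

    delta is the assumed ratio of the y-error variance to the x-error
    variance. delta = 1.0 treats both errors alike (total regression).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs) / n
    s_yy = sum((y - y_mean) ** 2 for y in ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    if s_xy == 0:
        raise ValueError('x and y have zero covariance; the slope is undefined')
    diff = s_yy - delta * s_xx
    slope = (diff + math.sqrt(diff * diff + 4.0 * delta * s_xy * s_xy)) / (2.0 * s_xy)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Made-up noisy data
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 1.0, 3.0]
print('equal errors (delta = 1):', deming_fit(xs, ys))
print('errors almost only in y (large delta):', deming_fit(xs, ys, delta=1e6))
```

With a very large `delta` the x values are treated as nearly exact, and the fit approaches ordinary least squares; with `delta = 1.0` the errors in x and y are weighted equally.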

Original text:
Understanding Loss Functions in Linear Regression

It is important to know the effect of loss functions in algorithm convergence. Here we will
illustrate how the L1 and L2 loss functions affect convergence in linear regression.

Getting ready
We will use the same iris dataset as in the prior recipe, but we will change our loss functions
and learning rates to see how convergence changes.

How to do it…
1. The start of the program is unchanged from before until we get to our loss function.
We load the necessary libraries, start a session, load the data, create placeholders,
and define our variables and model. One thing to note is that we are pulling out our
learning rate and model iterations. We are doing this because we want to show the
effect of quickly changing these parameters. Use the following code:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([x[3] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
batch_size = 25
learning_rate = 0.1 # Will not converge with learning rate at 0.4
iterations = 50
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b)

2. Our loss function will change to the L1 loss, as follows:
loss_l1 = tf.reduce_mean(tf.abs(y_target - model_output))
Note that we can change this back to the L2 loss by substituting
in the following formula: tf.reduce_mean(tf.square(y_target - model_output)).

3. Now we resume by declaring our optimizer, initializing the variables, and looping through the training part. Note that we are also saving our loss at every generation to measure the convergence. The plot at the end needs an L2 loss history as well, so the code trains once with each loss and resets A and b in between. Use the following code:
# Name the L2 loss too, so that both curves can be plotted
loss_l2 = tf.reduce_mean(tf.square(y_target - model_output))
init = tf.global_variables_initializer()
my_opt_l1 = tf.train.GradientDescentOptimizer(learning_rate)
train_step_l1 = my_opt_l1.minimize(loss_l1)
my_opt_l2 = tf.train.GradientDescentOptimizer(learning_rate)
train_step_l2 = my_opt_l2.minimize(loss_l2)

def train(train_step, loss):
    # Reset A and b, then record the batch loss at every generation
    sess.run(init)
    loss_vec = []
    for i in range(iterations):
        rand_index = np.random.choice(len(x_vals), size=batch_size)
        rand_x = np.transpose([x_vals[rand_index]])
        rand_y = np.transpose([y_vals[rand_index]])
        feed = {x_data: rand_x, y_target: rand_y}
        sess.run(train_step, feed_dict=feed)
        loss_vec.append(sess.run(loss, feed_dict=feed))
        if (i+1) % 25 == 0:
            print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)) + ' b = ' + str(sess.run(b)))
    return loss_vec

loss_vec_l1 = train(train_step_l1, loss_l1)
loss_vec_l2 = train(train_step_l2, loss_l2)
plt.plot(loss_vec_l1, 'k-', label='L1 Loss')
plt.plot(loss_vec_l2, 'r--', label='L2 Loss')
plt.title('L1 and L2 Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.show()

How it works…
When choosing a loss function, we must also choose a corresponding learning rate that will work with our problem. Here, we will illustrate two situations, one in which L2 is preferred and one in which L1 is preferred.

If our learning rate is small, our convergence will take more time. But if our learning rate is too large, we will have issues with our algorithm never converging. Here is a plot of the loss function of the L1 and L2 loss for the iris linear regression problem when the learning rate is 0.05:

2018-05-22_102421.jpg
Figure 5: Here is the L1 and L2 loss with a learning rate of 0.05 for the iris linear regression problem.
With a learning rate of 0.05, it would appear that L2 loss is preferred, as it converges to a lower loss on the data. Here is a graph of the loss functions when we increase the learning rate to 0.4:
2018-05-22_102458.jpg
Figure 6: Shows the L1 and L2 loss on the iris linear regression problem with a learning rate of 0.4. Note that the L1 loss is not visible because of the high scale of the y-axis.
Here, we can see that the large learning rate can overshoot in the L2 norm, whereas the L1 norm converges.

There's more…
To understand what is happening, we should look at how a large learning rate and small learning rate act on L1 and L2 norms. To visualize this, we look at a one-dimensional representation of learning steps on both norms, as follows:
2018-05-22_102532.jpg
Figure 7: Illustrates what can happen with the L1 and L2 norm with larger and smaller learning rates.
Implementing Deming regression
In this recipe, we will implement Deming regression (total regression), which means we will need a different way to measure the distance between the model line and data points.

Getting ready
If least squares linear regression minimizes the vertical distance to the line, Deming regression minimizes the total distance to the line. This type of regression minimizes the error in the y values and the x values. See the following figure for a comparison:

2018-05-22_102604.jpg
Figure 8: Here we illustrate the difference between regular linear regression and Deming regression. Linear regression on the left minimizes the vertical distance to the line, and Deming regression minimizes the total distance to the line.
To implement Deming regression, we have to modify the loss function. The loss function in regular linear regression minimizes the vertical distance. Here, we want to minimize the total distance. Given a slope and intercept of a line, the perpendicular distance to a point is a known geometric formula. We just have to substitute this formula in and tell TensorFlow to minimize it.
How to do it…
1.Everything stays the same except when we get to the loss function. We begin by loading the libraries, starting a session, loading the data, declaring the batch size, creating the placeholders, variables, and model output, as follows:

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([x[3] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
batch_size = 50
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b)
2. The loss function is a geometric formula that comprises a numerator and a denominator. For clarity we will write these out separately. Given a line y = mx + b and a point (x0, y0), the perpendicular distance between the two is d = |y0 - (m*x0 + b)| / sqrt(m^2 + 1):
2018-05-22_103515.jpg
demming_numerator = tf.abs(tf.subtract(y_target, tf.add(tf.matmul(x_data, A), b)))
demming_denominator = tf.sqrt(tf.add(tf.square(A), 1))
loss = tf.reduce_mean(tf.truediv(demming_numerator, demming_denominator))
3.We now initialize our variables, declare our optimizer, and loop through the training
set to arrive at our parameters, as follows:
init = tf.global_variables_initializer()
sess.run(init)
my_opt = tf.train.GradientDescentOptimizer(0.1)
train_step = my_opt.minimize(loss)
loss_vec = []
for i in range(250):
  rand_index = np.random.choice(len(x_vals), size=batch_size)
  rand_x = np.transpose([x_vals[rand_index]])
  rand_y = np.transpose([y_vals[rand_index]])
  sess.run(train_step, feed_dict={x_data: rand_x, y_target:rand_y})
  temp_loss = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
  loss_vec.append(temp_loss)
  if (i+1)%50==0:
    print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)) + ' b = ' + str(sess.run(b)))
    print('Loss = ' + str(temp_loss))

4.We can plot the output with the following code:
# A and b have shape [1, 1], so unpack both levels to get scalars
[[slope]] = sess.run(A)
[[y_intercept]] = sess.run(b)
best_fit = []
for i in x_vals:
    best_fit.append(slope*i+y_intercept)
# Plot once, after all fitted values have been collected
plt.plot(x_vals, y_vals, 'o', label='Data Points')
plt.plot(x_vals, best_fit, 'r-', label='Best fit line', linewidth=3)
plt.legend(loc='upper left')
plt.title('Sepal Length vs Petal Width')
plt.xlabel('Petal Width')
plt.ylabel('Sepal Length')
plt.show()
2018-05-22_102718.jpg
Figure 9: The graph depicting the solution to Deming regression on the iris dataset.
How it works…
The recipe here for Deming regression is almost identical to regular linear regression. The key difference here is how we measure the loss between the predictions and the data points. Instead of a vertical loss, we have a perpendicular loss (or total loss) with the y values and x values.

Note that the type of Deming regression implemented here is called total regression. Total regression assumes that the errors in the x and y values are similar. We can also scale the x and y axes in the distance calculation by the difference in the errors according to our beliefs.



王成-Chris posted on 2018-5-22 17:43:09
Impressive, I learned a lot.