
Professional English (4: Optimization)


Summary

  • We defined a score function from image pixels to class scores (in this section, a linear function that depends on weights W and biases b).
pixels:像素 biases:偏差
  • Unlike the kNN classifier, the advantage of this parametric approach is that once we learn the parameters we can discard the training data.

    Additionally, the prediction for a new test image is fast since it requires a single matrix multiplication with W, not an exhaustive comparison to every single training example.

parametric approach:参数化方法  discard:丢弃 exhaustive comparison:详尽的比较  matrix multiplication:矩阵乘法
  • We defined a loss function (we introduced two commonly used losses for linear classifiers: the SVM and the Softmax) that measures how compatible a given set of parameters is with respect to the ground truth labels in the training dataset. We also saw that the loss function was defined in such a way that making good predictions on the training data is equivalent to having a small loss.

  • We also introduced the bias trick, which lets us fold the bias vector into the weight matrix so that we only have to keep track of a single parameter matrix (a small sketch follows this summary).
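The bias trick is easy to verify numerically. Below is a minimal numpy sketch (the shapes and variable names are made up for illustration, not taken from the original notes):

```python
import numpy as np

# Fold the bias b into the weight matrix W by appending a constant 1 to x.
np.random.seed(0)
W = np.random.randn(10, 3073)    # weights: one row per class
b = np.random.randn(10)          # biases
x = np.random.randn(3073)        # one flattened image

scores_separate = W.dot(x) + b   # scores with an explicit bias term

W_ext = np.hstack([W, b.reshape(-1, 1)])  # 10 x 3074: bias folded in as an extra column
x_ext = np.append(x, 1.0)                 # 3074: input extended with a constant 1

scores_folded = W_ext.dot(x_ext)
print(np.allclose(scores_separate, scores_folded))  # True: the two formulations agree
```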


Optimization

Optimization is the process of finding the set of parameters W that minimizes the loss function.

Strategy #1 - A first very bad idea solution: Random search

The procedure looks like this:

```python
import numpy as np

# assume X_train is the training data where each column is an example (e.g. 3073 x 50000)
# assume Y_train are the labels (e.g. a 1-D array of 50000)
# assume the function L evaluates the loss function

bestloss = float("inf")  # Python assigns the highest possible float value
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001  # generate random parameters
    loss = L(X_train, Y_train, W)  # get the loss over the entire training set
    if loss < bestloss:  # keep track of the best solution
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))

# prints:
# in attempt 0 the loss was 9.401632, best 9.401632
# in attempt 1 the loss was 8.959668, best 8.959668
# in attempt 2 the loss was 9.044034, best 8.959668
# in attempt 3 the loss was 9.278948, best 8.959668
# in attempt 4 the loss was 8.857370, best 8.857370
# in attempt 5 the loss was 8.943151, best 8.857370
# in attempt 6 the loss was 8.605604, best 8.605604
# ... (truncated: continues for 1000 lines)
```

In the code above, we try out several randomly generated weight matrices W; some of them give a smaller loss and others a larger one. We can take the best weights W found by this random search and run them on the test set:

```python
# assume Xte_cols is the test data of size [3073 x 10000] and Yte the test labels of size [10000 x 1]
scores = bestW.dot(Xte_cols)  # 10 x 10000, the class scores for all test examples
# find the index with the maximal score in each column (the predicted class)
Yte_predict = np.argmax(scores, axis=0)
# and calculate the accuracy (fraction of predictions that are correct)
np.mean(Yte_predict == Yte)
# returns 0.1555
```

Running the best W found by this random search on the test set gives an accuracy of 15.5%. Given that guessing completely at random achieves only 10%, that is not a bad result for such a brain-dead strategy!

Core idea: iterative refinement.

In other words, our approach will be to start with a random W and then iteratively refine it, making the loss a little smaller with every step.


Strategy #2 - Random Local Search

We start out with a random W, generate a random perturbation δW (scaled by a step size), and only if the loss at W + δW is lower do we perform the update W ← W + δW.

The code looks like this:

```python
W = np.random.randn(10, 3073) * 0.001  # generate a random starting W
bestloss = float("inf")
for i in range(1000):
    step_size = 0.0001
    Wtry = W + np.random.randn(10, 3073) * step_size  # random local perturbation
    loss = L(Xtr_cols, Ytr, Wtry)
    if loss < bestloss:
        W = Wtry
        bestloss = loss
    print('iter %d loss is %f' % (i, bestloss))
```

Using the same number of loss evaluations (1000), this approach reaches a test-set classification accuracy of 21.4%. This is better than Strategy #1, but still wasteful and computationally expensive.


Strategy #3 - Follow the slope

In one dimension, the derivative of a function is:

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

- In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.
- The slope in any direction is the dot product of that direction with the gradient.
- The direction of steepest descent is the negative gradient (a small numerical sketch follows this list).
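Here is a minimal numerical sketch of the three statements above (the toy function and step sizes are my own choices for illustration, not the lecture's example):

```python
import numpy as np

# Toy function: f(w) = sum(w**2), whose gradient is known analytically to be 2*w.
def f(w):
    return np.sum(w ** 2)

w = np.array([0.5, -1.0, 2.0])
grad = 2.0 * w                     # the gradient: vector of partial derivatives

# the slope along any unit direction d is the dot product of d with the gradient
d = np.array([1.0, 1.0, 0.0])
d /= np.linalg.norm(d)
h = 1e-5
slope_fd = (f(w + h * d) - f(w - h * d)) / (2.0 * h)   # finite-difference slope
print(slope_fd, d.dot(grad))       # the two values should agree closely

# the direction of steepest descent is the negative gradient: a small step lowers f
print(f(w), f(w - 0.1 * grad))     # the second value is smaller
```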

Example: (the figures from the original post, which stepped through a numerical evaluation of the gradient one dimension at a time, are not preserved)

This is silly. The loss is just a function of W.


  • Use calculus to compute an analytic gradient.

Example:

Analytical versus Numerical Approaches

  • Question: find the root of f(x) = x - 5.
  • Analytical solution:
    f(x) = x - 5 = 0; add 5 to both sides to get the answer x = 5.
  • Numerical solution:
    Guess x = 1: f(1) = 1 - 5 = -4, a negative number.
    Guess x = 6: f(6) = 6 - 5 = 1, a positive number.
    The root must lie between them.
    Try x = (6 + 1)/2 = 7/2: f(7/2) < 0, so the root must be between 7/2 and 6.
    Repeat the steps above until f(x) ≈ 0 (a small bisection sketch follows this list).
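As a rough sketch of the numerical approach described above (a simple bisection search; the function and helper names are illustrative only):

```python
def f(x):
    return x - 5.0

def bisect(f, lo, hi, tol=1e-8):
    # assumes f(lo) and f(hi) have opposite signs, so a root lies in [lo, hi]
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0:
            hi = mid   # the sign change (and therefore the root) is in [lo, mid]
        else:
            lo = mid   # the sign change is in [mid, hi]
    return (lo + hi) / 2.0

print(bisect(f, 1.0, 6.0))  # ~5.0, found without ever solving the equation symbolically
```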

In summary

  • Numerical gradient: approximate, slow, easy to write
  • Analytic gradient: exact, fast, error-prone
  • In practice: always use the analytic gradient, but check your implementation with the numerical gradient. This is called a gradient check.
  • Computing the gradient analytically with calculus.
  • The numerical gradient is very simple to compute using the finite difference approximation, but the downside is that it is approximate (since we have to pick a small value of h, while the true gradient is defined as the limit as h goes to 0), and that it is very computationally expensive to compute.
  • The second way to compute the gradient is analytically using calculus, which allows us to derive a direct formula for the gradient (no approximations) that is also very fast to compute. However, unlike the numerical gradient, it is more error-prone to implement, which is why in practice we always compare it to the numerical gradient as a gradient check (a small sketch follows this list).
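A gradient check compares the analytic gradient against a numerical estimate. Here is a minimal sketch, assuming a toy loss f(w) = sum(w**2) whose analytic gradient (2*w) we know exactly (the helper names are mine, not from any particular library):

```python
import numpy as np

def f(w):
    return np.sum(w ** 2)

def analytic_gradient(w):
    return 2.0 * w

def numerical_gradient(f, w, h=1e-5):
    # centered finite differences, one dimension at a time
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (f(w_plus) - f(w_minus)) / (2.0 * h)
    return grad

w = np.random.randn(5)
ga = analytic_gradient(w)
gn = numerical_gradient(f, w)

# relative error is the usual way to compare the two; tiny values mean they agree
rel_error = np.abs(ga - gn) / np.maximum(1e-8, np.abs(ga) + np.abs(gn))
print(rel_error.max())  # should be very small (e.g. < 1e-7)
```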

As an example, take the SVM loss function for a single datapoint:

$$L_i = \sum_{j \neq y_i} \left[ \max(0,\, w_j^T x_i - w_{y_i}^T x_i + \Delta) \right]$$

We can differentiate this function with respect to the weights. For example, taking the gradient with respect to $w_{y_i}$ we obtain:

$$\nabla_{w_{y_i}} L_i = -\left( \sum_{j \neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i$$

This is the gradient only with respect to the row of W corresponding to the correct class. For the other rows, where $j \neq y_i$, the gradient is:

$$\nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i$$

- Here $\mathbb{1}$ is the indicator function: it is 1 if the condition inside is true and 0 otherwise. While the expression may look intimidating, implementing it in code is simple: you count the number of classes that did not meet the desired margin (and hence contributed to the loss), and the row of W corresponding to the correct class gets a gradient of minus that count times $x_i$.
- The rows of W that do not correspond to the correct class get the gradient given by the second formula (a short code sketch follows this list).
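Following the two formulas, a hedged sketch of the loss and gradient for one datapoint might look like this (illustrative code written for this post, not the official assignment implementation):

```python
import numpy as np

def svm_loss_and_grad_single(W, x_i, y_i, delta=1.0):
    # W: (num_classes, D) weights, x_i: (D,) datapoint, y_i: index of the correct class
    scores = W.dot(x_i)                       # one score per class
    margins = scores - scores[y_i] + delta    # w_j^T x_i - w_{y_i}^T x_i + delta
    margins[y_i] = 0.0                        # the correct class is excluded from the sum
    loss = np.sum(np.maximum(0.0, margins))

    dW = np.zeros_like(W)
    violated = margins > 0                    # the indicator 1(...) for each class j
    dW[violated] += x_i                       # rows j != y_i: gradient is 1(...) * x_i
    dW[y_i] -= np.sum(violated) * x_i         # correct row: -(count of violations) * x_i
    return loss, dW

# example call on random data
W = np.random.randn(10, 3073) * 0.001
x_i = np.random.randn(3073)
loss, dW = svm_loss_and_grad_single(W, x_i, y_i=3)
```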

Gradient Descent:

```python
# Vanilla gradient descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # perform a parameter update
```
Update in the negative gradient direction. In the code above, notice that to compute the new weights we make an update in the negative direction of the gradient, since we wish our loss function to decrease, not increase.

Mini-batch gradient descent:


```python
# Vanilla mini-batch gradient descent
while True:
    data_batch = sample_training_data(data, 256)  # sample a batch of 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad  # perform a parameter update
```

In large-scale applications (such as the ILSVRC challenge), the training data can contain millions of examples. It is wasteful to compute the loss over the entire training set just to perform a single parameter update. A very common approach is instead to compute the gradient over batches of the training data.
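The `sample_training_data` helper above is not defined in the post. As a rough sketch (assuming the column-wise data layout used earlier; note it also takes and returns the labels, unlike the one-argument call shown above):

```python
import numpy as np

def sample_training_data(X, Y, batch_size=256):
    # X: (D, N) data matrix with one example per column, Y: (N,) labels
    idx = np.random.choice(X.shape[1], batch_size, replace=False)
    return X[:, idx], Y[idx]

# example with the shapes used earlier in this post
X_train = np.random.randn(3073, 50000)
Y_train = np.random.randint(0, 10, size=50000)
X_batch, Y_batch = sample_training_data(X_train, Y_train, 256)
print(X_batch.shape, Y_batch.shape)  # (3073, 256) (256,)
```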
