[Paper Notes] [2014] An Empirical Analysis of Dropout in Piecewise Linear Networks

This paper examines several questions about the behavior of dropout and tests each one with a set of experiments.

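The first question is how well dropout approximates the ensemble prediction at inference time. On several small tasks, the authors use very simple models, kept small so that the exact geometric mean over sub-models can be computed at inference time and compared against dropout's weight scaling approximation.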
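
To make this comparison concrete, here is a minimal NumPy sketch, not the paper's actual setup. It assumes a toy one-hidden-layer ReLU network with random weights, dropout on both input and hidden units, and a keep probability of 0.5; all names (`forward`, `exact_geometric_mean`, `weight_scaling`) are hypothetical. The network is small enough that every mask can be enumerated.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, params, in_mask, hid_mask):
    """One-hidden-layer ReLU net; masks multiply the input and hidden units."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, W1 @ (x * in_mask) + b1)
    return softmax(W2 @ (h * hid_mask) + b2)

def exact_geometric_mean(x, params):
    """Renormalized geometric mean of the predictions of all sub-models."""
    n_hid, n_in = params[0].shape
    log_probs = []
    for bits in itertools.product([0.0, 1.0], repeat=n_in + n_hid):
        m = np.array(bits)
        log_probs.append(np.log(forward(x, params, m[:n_in], m[n_in:])))
    # Averaging log-probabilities and renormalizing gives the geometric mean.
    return softmax(np.mean(log_probs, axis=0))

def weight_scaling(x, params, keep_prob=0.5):
    """Single forward pass with every unit scaled by its keep probability."""
    n_hid, n_in = params[0].shape
    return forward(x, params, np.full(n_in, keep_prob), np.full(n_hid, keep_prob))

# Toy sizes (assumed): 3 inputs + 4 hidden units gives 2**7 = 128 masks.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 3
params = (rng.normal(size=(n_hid, n_in)), rng.normal(size=n_hid),
          rng.normal(size=(n_out, n_hid)), rng.normal(size=n_out))
x = rng.normal(size=n_in)

gm = exact_geometric_mean(x, params)
ws = weight_scaling(x, params)
print("exact geometric mean:", gm)
print("weight scaling      :", ws)
print("max abs difference  :", np.abs(gm - ws).max())
```

Dropout is applied to the inputs as well as the hidden layer on purpose: if only the units feeding the softmax were dropped, the averaged logits would be linear in the mask and weight scaling would match the geometric mean exactly, leaving nothing to compare.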

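The second question concerns the importance of the geometric mean itself. A conventional bagging ensemble predicts with the arithmetic mean of its members. Dropout resembles bagging, but the weight scaling approximation used at inference time actually approximates the geometric mean. This raises a natural question: does approximating the geometric mean hurt the ensemble's generalization, given that the arithmetic and geometric means do differ? The figure below shows the relative difference in test error between the arithmetic mean and the geometric mean in the experiments. The two perform similarly, which suggests that replacing the arithmetic mean with the geometric mean at inference time is a reasonable choice for dropout.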

[Figure: relative difference in test error between the arithmetic mean and the geometric mean]
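
As a small illustration of the two ways of averaging discussed above, the sketch below combines the class probabilities of a few ensemble members with both means. The member predictions in `probs` are made-up numbers, and the function names are hypothetical.

```python
import numpy as np

def arithmetic_mean(probs):
    """Average the members' probabilities directly (classic bagging)."""
    return probs.mean(axis=0)

def geometric_mean(probs):
    """Average in log space, then renormalize so the result sums to one."""
    log_mean = np.log(probs).mean(axis=0)
    g = np.exp(log_mean - log_mean.max())  # shift for numerical stability
    return g / g.sum()

# Rows are ensemble members, columns are classes (illustrative values only).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.6, 0.1, 0.3]])

am = arithmetic_mean(probs)
gm = geometric_mean(probs)
print("arithmetic mean:", am)
print("geometric mean :", gm)
```

The geometric mean needs an explicit renormalization step because the product of probabilities does not sum to one in general, whereas the arithmetic mean of distributions is already a distribution.
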
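The first two questions concern dropout's approximate model averaging at inference time, but dropout training also raises questions of its own. As noted above, dropout resembles bagging, but in bagging the sub-models have independent parameters during training, whereas in dropout the sub-models share parameters. How does parameter sharing, compared with independent parameters, affect the final ensemble? The experiment trains 360 models with independent parameters, each using a single fixed dropout mask, and ensembles them. The figure below shows the ensemble results. Notably, the result with dropout is 1.06%, while the ensemble of independently trained models never goes below 1.06%. This suggests that the absence of parameter sharing does matter, and that parameter sharing acts as a regularizer. Confounding factors cannot be ruled out, though: 1) the ensemble may not contain enough models; 2) the hyperparameters are the ones that worked well for the dropout model on the validation set, not ones tuned for the independent models, which probably cost the ensemble some performance.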

[Figure: results of the ensemble of independently parameterized models]
The last question is “Whether dropout can be wholly characterized in terms of learned noise robustness, and whether the model-averaging perspective is necessary or fruitful.” The authors introduce an algorithm called dropout boosting. It uses the same noise as dropout bagging but trains differently. Dropout bagging trains the current sub-model by maximum likelihood on the correct target for the current example, whereas dropout boosting also takes the other sub-models into account: its objective is $\log p_{ensemble}(y | v; \theta)$, where

$$
\begin{aligned}
p_{ensemble}(y|v;\theta) &= \frac{1}{Z} \tilde{p}(y|v;\theta), \\
Z &= \sum_{y'} \tilde{p}(y'|v;\theta), \\
\tilde{p}(y|v;\theta) &= \sqrt[2^{|\mathcal{M}|}]{\prod_{\mu \in \mathcal{M}} p(y|v;\theta_\mu)}.
\end{aligned}
$$

The dropout boosting learning rule picks one sub-model and updates its weights with the ensemble gradient $\nabla_{\theta_\mu} \log p_{ensemble}(y | v; \theta)$. The figure below compares dropout boosting with dropout bagging. Dropout boosting performs even worse than plain SGD, which suggests that the approximate model averaging of dropout bagging is both necessary and fruitful.

[Figure: comparison of dropout boosting and dropout bagging]
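
The difference between the two training objectives can be sketched in a few lines. This is an illustration under strong simplifying assumptions, not the paper's implementation: the model is a toy softmax regression with dropout on the inputs only, all masks are enumerated, and the root index in $\tilde{p}$ is taken to be the number of enumerated masks (an assumption about the notation). The names `bagging_loss` and `boosting_loss` are hypothetical.

```python
import itertools
import numpy as np

def log_softmax(z):
    z = z - z.max()  # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def submodel_log_probs(v, W, b, mask):
    """log p(y' | v; theta_mu) for every class y', with inputs masked by mu."""
    return log_softmax(W @ (v * mask) + b)

def bagging_loss(v, y, W, b, mask):
    """Dropout bagging: negative log-likelihood of one sampled sub-model."""
    return -submodel_log_probs(v, W, b, mask)[y]

def boosting_loss(v, y, W, b):
    """Dropout boosting: negative log-likelihood of the whole ensemble."""
    masks = [np.array(bits)
             for bits in itertools.product([0.0, 1.0], repeat=v.size)]
    # Mean of log-probabilities = log of the unnormalized geometric mean.
    mean_log = np.mean([submodel_log_probs(v, W, b, m) for m in masks], axis=0)
    # log_softmax divides by Z, giving log p_ensemble(y' | v; theta).
    return -log_softmax(mean_log)[y]

# Toy problem (assumed sizes): 4 inputs, 3 classes, random parameters.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
v, y = rng.normal(size=4), 1

mask = rng.integers(0, 2, size=v.size).astype(float)  # one sampled sub-model
print("bagging loss (sampled mask):", bagging_loss(v, y, W, b, mask))
print("boosting loss (ensemble)   :", boosting_loss(v, y, W, b))
```

In dropout bagging the gradient step depends only on the sampled sub-model's own prediction, while in dropout boosting the loss for the same sub-model is coupled to every other sub-model through the ensemble term.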

Summary

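This paper examines several questions about dropout and draws a number of conclusions. The comparison experiments are not free of confounding factors, but the final conclusion, at least, is hard to dispute: dropout is an extremely effective ensemble learning method, paired with a clever approximate inference scheme that is remarkably accurate in the case of rectified linear networks.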