{
"cells": [
{
"cell_type": "markdown",
"id": "d9ba5257",
"metadata": {
"origin_pos": 0
},
"source": [
"# 梯度下降\n",
":label:`sec_gd`\n",
"\n",
"尽管*梯度下降*(gradient descent)很少直接用于深度学习,\n",
"但了解它是理解下一节随机梯度下降算法的关键。\n",
"例如,由于学习率过大,优化问题可能会发散,这种现象早已在梯度下降中出现。\n",
"同样地,*预处理*(preconditioning)是梯度下降中的一种常用技术,\n",
"还被沿用到更高级的算法中。\n",
"让我们从简单的一维梯度下降开始。\n",
"\n",
"## 一维梯度下降\n",
"\n",
"为什么梯度下降算法可以优化目标函数?\n",
"一维中的梯度下降给我们很好的启发。\n",
"考虑一类连续可微实值函数$f: \\mathbb{R} \\rightarrow \\mathbb{R}$,\n",
"利用泰勒展开,我们可以得到\n",
"\n",
"$$f(x + \\epsilon) = f(x) + \\epsilon f'(x) + \\mathcal{O}(\\epsilon^2).$$\n",
":eqlabel:`gd-taylor`\n",
"\n",
"即在一阶近似中,$f(x+\\epsilon)$可通过$x$处的函数值$f(x)$和一阶导数$f'(x)$得出。\n",
"我们可以假设在负梯度方向上移动的$\\epsilon$会减少$f$。\n",
"为了简单起见,我们选择固定步长$\\eta > 0$,然后取$\\epsilon = -\\eta f'(x)$。\n",
"将其代入泰勒展开式我们可以得到\n",
"\n",
"$$f(x - \\eta f'(x)) = f(x) - \\eta f'^2(x) + \\mathcal{O}(\\eta^2 f'^2(x)).$$\n",
":eqlabel:`gd-taylor-2`\n",
"\n",
"如果其导数$f'(x) \\neq 0$没有消失,我们就能继续展开,这是因为$\\eta f'^2(x)>0$。\n",
"此外,我们总是可以令$\\eta$小到足以使高阶项变得不相关。\n",
"因此,\n",
"\n",
"$$f(x - \\eta f'(x)) \\lessapprox f(x).$$\n",
"\n",
"这意味着,如果我们使用\n",
"\n",
"$$x \\leftarrow x - \\eta f'(x)$$\n",
"\n",
"来迭代$x$,函数$f(x)$的值可能会下降。\n",
"因此,在梯度下降中,我们首先选择初始值$x$和常数$\\eta > 0$,\n",
"然后使用它们连续迭代$x$,直到停止条件达成。\n",
"例如,当梯度$|f'(x)|$的幅度足够小或迭代次数达到某个值时。\n",
"\n",
"下面我们来展示如何实现梯度下降。为了简单起见,我们选用目标函数$f(x)=x^2$。\n",
"尽管我们知道$x=0$时$f(x)$能取得最小值,\n",
"但我们仍然使用这个简单的函数来观察$x$的变化。\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7b783f74",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:23.083492Z",
"iopub.status.busy": "2023-08-18T07:01:23.082697Z",
"iopub.status.idle": "2023-08-18T07:01:26.147110Z",
"shell.execute_reply": "2023-08-18T07:01:26.145926Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import torch\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b544b8ce",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:26.152743Z",
"iopub.status.busy": "2023-08-18T07:01:26.151955Z",
"iopub.status.idle": "2023-08-18T07:01:26.158133Z",
"shell.execute_reply": "2023-08-18T07:01:26.157003Z"
},
"origin_pos": 5,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def f(x): # 目标函数\n",
" return x ** 2\n",
"\n",
"def f_grad(x): # 目标函数的梯度(导数)\n",
" return 2 * x"
]
},
{
"cell_type": "markdown",
"id": "784941de",
"metadata": {
"origin_pos": 6
},
"source": [
"接下来,我们使用$x=10$作为初始值,并假设$\\eta=0.2$。\n",
"使用梯度下降法迭代$x$共10次,我们可以看到,$x$的值最终将接近最优解。\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3674fea9",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:26.163008Z",
"iopub.status.busy": "2023-08-18T07:01:26.162385Z",
"iopub.status.idle": "2023-08-18T07:01:26.170126Z",
"shell.execute_reply": "2023-08-18T07:01:26.169003Z"
},
"origin_pos": 7,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 10, x: 0.060466\n"
]
}
],
"source": [
"def gd(eta, f_grad):\n",
" x = 10.0\n",
" results = [x]\n",
" for i in range(10):\n",
" x -= eta * f_grad(x)\n",
" results.append(float(x))\n",
" print(f'epoch 10, x: {x:f}')\n",
" return results\n",
"\n",
"results = gd(0.2, f_grad)"
]
},
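{
"cell_type": "markdown",
"id": "a1b2c3d4",
"metadata": {},
"source": [
"The implementation above always runs for a fixed number of iterations.\n",
"Below is a minimal sketch of the stopping condition mentioned earlier, halting as soon as the magnitude of the gradient drops below a tolerance;\n",
"the name `gd_until` and its default arguments are hypothetical choices of ours.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c3d5",
"metadata": {},
"outputs": [],
"source": [
"def gd_until(eta, f_grad, x0=10.0, tol=1e-3, max_iter=1000):\n",
"    # Iterate x <- x - eta * f'(x) until |f'(x)| < tol or the\n",
"    # iteration budget max_iter is exhausted\n",
"    x, num_updates = x0, 0\n",
"    while abs(f_grad(x)) >= tol and num_updates < max_iter:\n",
"        x -= eta * f_grad(x)\n",
"        num_updates += 1\n",
"    print(f'stopped after {num_updates} updates, x: {x:f}')\n",
"    return x\n",
"\n",
"gd_until(0.2, f_grad)"
]
},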
{
"cell_type": "markdown",
"id": "a005e799",
"metadata": {
"origin_pos": 9
},
"source": [
"对进行$x$优化的过程可以绘制如下。\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "56e7d8fb",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:26.174775Z",
"iopub.status.busy": "2023-08-18T07:01:26.174138Z",
"iopub.status.idle": "2023-08-18T07:01:26.480544Z",
"shell.execute_reply": "2023-08-18T07:01:26.479162Z"
},
"origin_pos": 10,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def show_trace(results, f):\n",
" n = max(abs(min(results)), abs(max(results)))\n",
" f_line = torch.arange(-n, n, 0.01)\n",
" d2l.set_figsize()\n",
" d2l.plot([f_line, results], [[f(x) for x in f_line], [\n",
" f(x) for x in results]], 'x', 'f(x)', fmts=['-', '-o'])\n",
"\n",
"show_trace(results, f)"
]
},
{
"cell_type": "markdown",
"id": "467465ec",
"metadata": {
"origin_pos": 12
},
"source": [
"### 学习率\n",
":label:`subsec_gd-learningrate`\n",
"\n",
"*学习率*(learning rate)决定目标函数能否收敛到局部最小值,以及何时收敛到最小值。\n",
"学习率$\\eta$可由算法设计者设置。\n",
"请注意,如果我们使用的学习率太小,将导致$x$的更新非常缓慢,需要更多的迭代。\n",
"例如,考虑同一优化问题中$\\eta = 0.05$的进度。\n",
"如下所示,尽管经过了10个步骤,我们仍然离最优解很远。\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f785cac5",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:26.486541Z",
"iopub.status.busy": "2023-08-18T07:01:26.485458Z",
"iopub.status.idle": "2023-08-18T07:01:26.770061Z",
"shell.execute_reply": "2023-08-18T07:01:26.768785Z"
},
"origin_pos": 13,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 10, x: 3.486784\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"show_trace(gd(0.05, f_grad), f)"
]
},
{
"cell_type": "markdown",
"id": "ffd34ac9",
"metadata": {
"origin_pos": 14
},
"source": [
"相反,如果我们使用过高的学习率,$\\left|\\eta f'(x)\\right|$对于一阶泰勒展开式可能太大。\n",
"也就是说, :eqref:`gd-taylor`中的$\\mathcal{O}(\\eta^2 f'^2(x))$可能变得显著了。\n",
"在这种情况下,$x$的迭代不能保证降低$f(x)$的值。\n",
"例如,当学习率为$\\eta=1.1$时,$x$超出了最优解$x=0$并逐渐发散。\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "dabdbe10",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:26.775496Z",
"iopub.status.busy": "2023-08-18T07:01:26.774773Z",
"iopub.status.idle": "2023-08-18T07:01:27.351226Z",
"shell.execute_reply": "2023-08-18T07:01:27.349971Z"
},
"origin_pos": 15,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 10, x: 61.917364\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"show_trace(gd(1.1, f_grad), f)"
]
},
{
"cell_type": "markdown",
"id": "f6918593",
"metadata": {
"origin_pos": 16
},
"source": [
"### 局部最小值\n",
"\n",
"为了演示非凸函数的梯度下降,考虑函数$f(x) = x \\cdot \\cos(cx)$,其中$c$为某常数。\n",
"这个函数有无穷多个局部最小值。\n",
"根据我们选择的学习率,我们最终可能只会得到许多解的一个。\n",
"下面的例子说明了(不切实际的)高学习率如何导致较差的局部最小值。\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "51583e82",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:27.356773Z",
"iopub.status.busy": "2023-08-18T07:01:27.355855Z",
"iopub.status.idle": "2023-08-18T07:01:27.649547Z",
"shell.execute_reply": "2023-08-18T07:01:27.648346Z"
},
"origin_pos": 17,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 10, x: -1.528166\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"c = torch.tensor(0.15 * np.pi)\n",
"\n",
"def f(x): # 目标函数\n",
" return x * torch.cos(c * x)\n",
"\n",
"def f_grad(x): # 目标函数的梯度\n",
" return torch.cos(c * x) - c * x * torch.sin(c * x)\n",
"\n",
"show_trace(gd(2, f_grad), f)"
]
},
{
"cell_type": "markdown",
"id": "3a19f7d8",
"metadata": {
"origin_pos": 18
},
"source": [
"## 多元梯度下降\n",
"\n",
"现在我们对单变量的情况有了更好的理解,让我们考虑一下$\\mathbf{x} = [x_1, x_2, \\ldots, x_d]^\\top$的情况。\n",
"即目标函数$f: \\mathbb{R}^d \\to \\mathbb{R}$将向量映射成标量。\n",
"相应地,它的梯度也是多元的,它是一个由$d$个偏导数组成的向量:\n",
"\n",
"$$\\nabla f(\\mathbf{x}) = \\bigg[\\frac{\\partial f(\\mathbf{x})}{\\partial x_1}, \\frac{\\partial f(\\mathbf{x})}{\\partial x_2}, \\ldots, \\frac{\\partial f(\\mathbf{x})}{\\partial x_d}\\bigg]^\\top.$$\n",
"\n",
"梯度中的每个偏导数元素$\\partial f(\\mathbf{x})/\\partial x_i$代表了当输入$x_i$时$f$在$\\mathbf{x}$处的变化率。\n",
"和先前单变量的情况一样,我们可以对多变量函数使用相应的泰勒近似来思考。\n",
"具体来说,\n",
"\n",
"$$f(\\mathbf{x} + \\boldsymbol{\\epsilon}) = f(\\mathbf{x}) + \\mathbf{\\boldsymbol{\\epsilon}}^\\top \\nabla f(\\mathbf{x}) + \\mathcal{O}(\\|\\boldsymbol{\\epsilon}\\|^2).$$\n",
":eqlabel:`gd-multi-taylor`\n",
"\n",
"换句话说,在$\\boldsymbol{\\epsilon}$的二阶项中,\n",
"最陡下降的方向由负梯度$-\\nabla f(\\mathbf{x})$得出。\n",
"选择合适的学习率$\\eta > 0$来生成典型的梯度下降算法:\n",
"\n",
"$$\\mathbf{x} \\leftarrow \\mathbf{x} - \\eta \\nabla f(\\mathbf{x}).$$\n",
"\n",
"这个算法在实践中的表现如何呢?\n",
"我们构造一个目标函数$f(\\mathbf{x})=x_1^2+2x_2^2$,\n",
"并有二维向量$\\mathbf{x} = [x_1, x_2]^\\top$作为输入,\n",
"标量作为输出。\n",
"梯度由$\\nabla f(\\mathbf{x}) = [2x_1, 4x_2]^\\top$给出。\n",
"我们将从初始位置$[-5, -2]$通过梯度下降观察$\\mathbf{x}$的轨迹。\n",
"\n",
"我们还需要两个辅助函数:\n",
"第一个是update函数,并将其应用于初始值20次;\n",
"第二个函数会显示$\\mathbf{x}$的轨迹。\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ef6dbba5",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:27.655203Z",
"iopub.status.busy": "2023-08-18T07:01:27.654471Z",
"iopub.status.idle": "2023-08-18T07:01:27.663479Z",
"shell.execute_reply": "2023-08-18T07:01:27.662393Z"
},
"origin_pos": 19,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def train_2d(trainer, steps=20, f_grad=None): #@save\n",
" \"\"\"用定制的训练机优化2D目标函数\"\"\"\n",
" # s1和s2是稍后将使用的内部状态变量\n",
" x1, x2, s1, s2 = -5, -2, 0, 0\n",
" results = [(x1, x2)]\n",
" for i in range(steps):\n",
" if f_grad:\n",
" x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad)\n",
" else:\n",
" x1, x2, s1, s2 = trainer(x1, x2, s1, s2)\n",
" results.append((x1, x2))\n",
" print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}')\n",
" return results"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "d47839ef",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:27.668397Z",
"iopub.status.busy": "2023-08-18T07:01:27.667552Z",
"iopub.status.idle": "2023-08-18T07:01:27.675730Z",
"shell.execute_reply": "2023-08-18T07:01:27.674619Z"
},
"origin_pos": 21,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def show_trace_2d(f, results): #@save\n",
" \"\"\"显示优化过程中2D变量的轨迹\"\"\"\n",
" d2l.set_figsize()\n",
" d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')\n",
" x1, x2 = torch.meshgrid(torch.arange(-5.5, 1.0, 0.1),\n",
" torch.arange(-3.0, 1.0, 0.1), indexing='ij')\n",
" d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')\n",
" d2l.plt.xlabel('x1')\n",
" d2l.plt.ylabel('x2')"
]
},
{
"cell_type": "markdown",
"id": "bd6785e8",
"metadata": {
"origin_pos": 23
},
"source": [
"接下来,我们观察学习率$\\eta = 0.1$时优化变量$\\mathbf{x}$的轨迹。\n",
"可以看到,经过20步之后,$\\mathbf{x}$的值接近其位于$[0, 0]$的最小值。\n",
"虽然进展相当顺利,但相当缓慢。\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "6d58435d",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:27.679558Z",
"iopub.status.busy": "2023-08-18T07:01:27.679151Z",
"iopub.status.idle": "2023-08-18T07:01:27.882280Z",
"shell.execute_reply": "2023-08-18T07:01:27.881035Z"
},
"origin_pos": 24,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 20, x1: -0.057646, x2: -0.000073\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def f_2d(x1, x2): # 目标函数\n",
" return x1 ** 2 + 2 * x2 ** 2\n",
"\n",
"def f_2d_grad(x1, x2): # 目标函数的梯度\n",
" return (2 * x1, 4 * x2)\n",
"\n",
"def gd_2d(x1, x2, s1, s2, f_grad):\n",
" g1, g2 = f_grad(x1, x2)\n",
" return (x1 - eta * g1, x2 - eta * g2, 0, 0)\n",
"\n",
"eta = 0.1\n",
"show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad))"
]
},
{
"cell_type": "markdown",
"id": "1e90e61f",
"metadata": {
"origin_pos": 25
},
"source": [
"## 自适应方法\n",
"\n",
"正如我们在 :numref:`subsec_gd-learningrate`中所看到的,选择“恰到好处”的学习率$\\eta$是很棘手的。\n",
"如果我们把它选得太小,就没有什么进展;如果太大,得到的解就会振荡,甚至可能发散。\n",
"如果我们可以自动确定$\\eta$,或者完全不必选择学习率,会怎么样?\n",
"除了考虑目标函数的值和梯度、还考虑它的曲率的二阶方法可以帮我们解决这个问题。\n",
"虽然由于计算代价的原因,这些方法不能直接应用于深度学习,但它们为如何设计高级优化算法提供了有用的思维直觉,这些算法可以模拟下面概述的算法的许多理想特性。\n",
"\n",
"### 牛顿法\n",
"\n",
"回顾一些函数$f: \\mathbb{R}^d \\rightarrow \\mathbb{R}$的泰勒展开式,事实上我们可以把它写成\n",
"\n",
"$$f(\\mathbf{x} + \\boldsymbol{\\epsilon}) = f(\\mathbf{x}) + \\boldsymbol{\\epsilon}^\\top \\nabla f(\\mathbf{x}) + \\frac{1}{2} \\boldsymbol{\\epsilon}^\\top \\nabla^2 f(\\mathbf{x}) \\boldsymbol{\\epsilon} + \\mathcal{O}(\\|\\boldsymbol{\\epsilon}\\|^3).$$\n",
":eqlabel:`gd-hot-taylor`\n",
"\n",
"为了避免繁琐的符号,我们将$\\mathbf{H} \\stackrel{\\mathrm{def}}{=} \\nabla^2 f(\\mathbf{x})$定义为$f$的Hessian,是$d \\times d$矩阵。\n",
"当$d$的值很小且问题很简单时,$\\mathbf{H}$很容易计算。\n",
"但是对于深度神经网络而言,考虑到$\\mathbf{H}$可能非常大,\n",
"$\\mathcal{O}(d^2)$个条目的存储代价会很高,\n",
"此外通过反向传播进行计算可能雪上加霜。\n",
"然而,我们姑且先忽略这些考量,看看会得到什么算法。\n",
"\n",
"毕竟,$f$的最小值满足$\\nabla f = 0$。\n",
"遵循 :numref:`sec_calculus`中的微积分规则,\n",
"通过取$\\boldsymbol{\\epsilon}$对 :eqref:`gd-hot-taylor`的导数,\n",
"再忽略不重要的高阶项,我们便得到\n",
"\n",
"$$\\nabla f(\\mathbf{x}) + \\mathbf{H} \\boldsymbol{\\epsilon} = 0 \\text{ and hence }\n",
"\\boldsymbol{\\epsilon} = -\\mathbf{H}^{-1} \\nabla f(\\mathbf{x}).$$\n",
"\n",
"也就是说,作为优化问题的一部分,我们需要将Hessian矩阵$\\mathbf{H}$求逆。\n",
"\n",
"举一个简单的例子,对于$f(x) = \\frac{1}{2} x^2$,我们有$\\nabla f(x) = x$和$\\mathbf{H} = 1$。\n",
"因此,对于任何$x$,我们可以获得$\\epsilon = -x$。\n",
"换言之,单单一步就足以完美地收敛,而无须任何调整。\n",
"我们在这里比较幸运:泰勒展开式是确切的,因为$f(x+\\epsilon)= \\frac{1}{2} x^2 + \\epsilon x + \\frac{1}{2} \\epsilon^2$。\n",
"\n",
"让我们看看其他问题。\n",
"给定一个凸双曲余弦函数$c$,其中$c$为某些常数,\n",
"我们可以看到经过几次迭代后,得到了$x=0$处的全局最小值。\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "de14c854",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:27.887060Z",
"iopub.status.busy": "2023-08-18T07:01:27.886284Z",
"iopub.status.idle": "2023-08-18T07:01:28.178108Z",
"shell.execute_reply": "2023-08-18T07:01:28.176813Z"
},
"origin_pos": 26,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 10, x: tensor(0.)\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"c = torch.tensor(0.5)\n",
"\n",
"def f(x): # O目标函数\n",
" return torch.cosh(c * x)\n",
"\n",
"def f_grad(x): # 目标函数的梯度\n",
" return c * torch.sinh(c * x)\n",
"\n",
"def f_hess(x): # 目标函数的Hessian\n",
" return c**2 * torch.cosh(c * x)\n",
"\n",
"def newton(eta=1):\n",
" x = 10.0\n",
" results = [x]\n",
" for i in range(10):\n",
" x -= eta * f_grad(x) / f_hess(x)\n",
" results.append(float(x))\n",
" print('epoch 10, x:', x)\n",
" return results\n",
"\n",
"show_trace(newton(), f)"
]
},
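{
"cell_type": "markdown",
"id": "a1b2c3d6",
"metadata": {},
"source": [
"The derivation above also holds for vectors $\\mathbf{x}$.\n",
"As a minimal sketch of our own (an illustration, not a library routine), the full Newton step $\\boldsymbol{\\epsilon} = -\\mathbf{H}^{-1} \\nabla f(\\mathbf{x})$ applied to the quadratic $f(\\mathbf{x})=x_1^2+2x_2^2$ from before reaches the minimizer $[0, 0]$ in a single step, since for a quadratic the second-order Taylor expansion is exact.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c3d7",
"metadata": {},
"outputs": [],
"source": [
"x = torch.tensor([-5.0, -2.0])\n",
"H = torch.tensor([[2.0, 0.0], [0.0, 4.0]]) # Hessian of x1^2 + 2 * x2^2\n",
"grad = torch.stack([2 * x[0], 4 * x[1]]) # gradient at x\n",
"x = x - torch.linalg.solve(H, grad) # Newton step: x - H^{-1} grad\n",
"print('after one Newton step, x:', x)"
]
},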
{
"cell_type": "markdown",
"id": "61c98547",
"metadata": {
"origin_pos": 27
},
"source": [
"现在让我们考虑一个非凸函数,比如$f(x) = x \\cos(c x)$,$c$为某些常数。\n",
"请注意在牛顿法中,我们最终将除以Hessian。\n",
"这意味着如果二阶导数是负的,$f$的值可能会趋于增加。\n",
"这是这个算法的致命缺陷!\n",
"让我们看看实践中会发生什么。\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "302d97de",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:28.183563Z",
"iopub.status.busy": "2023-08-18T07:01:28.182810Z",
"iopub.status.idle": "2023-08-18T07:01:28.660824Z",
"shell.execute_reply": "2023-08-18T07:01:28.659664Z"
},
"origin_pos": 28,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 10, x: tensor(26.8341)\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"c = torch.tensor(0.15 * np.pi)\n",
"\n",
"def f(x): # 目标函数\n",
" return x * torch.cos(c * x)\n",
"\n",
"def f_grad(x): # 目标函数的梯度\n",
" return torch.cos(c * x) - c * x * torch.sin(c * x)\n",
"\n",
"def f_hess(x): # 目标函数的Hessian\n",
" return - 2 * c * torch.sin(c * x) - x * c**2 * torch.cos(c * x)\n",
"\n",
"show_trace(newton(), f)"
]
},
{
"cell_type": "markdown",
"id": "28696853",
"metadata": {
"origin_pos": 29
},
"source": [
"这发生了惊人的错误。我们怎样才能修正它?\n",
"一种方法是用取Hessian的绝对值来修正,另一个策略是重新引入学习率。\n",
"这似乎违背了初衷,但不完全是——拥有二阶信息可以使我们在曲率较大时保持谨慎,而在目标函数较平坦时则采用较大的学习率。\n",
"让我们看看在学习率稍小的情况下它是如何生效的,比如$\\eta = 0.5$。\n",
"如我们所见,我们有了一个相当高效的算法。\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "1d0aa6d6",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:01:28.666394Z",
"iopub.status.busy": "2023-08-18T07:01:28.665576Z",
"iopub.status.idle": "2023-08-18T07:01:28.904465Z",
"shell.execute_reply": "2023-08-18T07:01:28.903311Z"
},
"origin_pos": 30,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 10, x: tensor(7.2699)\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"show_trace(newton(0.5), f)"
]
},
{
"cell_type": "markdown",
"id": "1575b769",
"metadata": {
"origin_pos": 31
},
"source": [
"### 收敛性分析\n",
"\n",
"在此,我们以部分目标凸函数$f$为例,分析它们的牛顿法收敛速度。\n",
"这些目标凸函数三次可微,而且二阶导数不为零,即$f'' > 0$。\n",
"由于多变量情况下的证明是对以下一维参数情况证明的直接拓展,对我们理解这个问题不能提供更多帮助,因此我们省略了多变量情况的证明。\n",
"\n",
"用$x^{(k)}$表示$x$在第$k^\\mathrm{th}$次迭代时的值,\n",
"令$e^{(k)} \\stackrel{\\mathrm{def}}{=} x^{(k)} - x^*$表示$k^\\mathrm{th}$迭代时与最优性的距离。\n",
"通过泰勒展开,我们得到条件$f'(x^*) = 0$可以写成\n",
"\n",
"$$0 = f'(x^{(k)} - e^{(k)}) = f'(x^{(k)}) - e^{(k)} f''(x^{(k)}) + \\frac{1}{2} (e^{(k)})^2 f'''(\\xi^{(k)}),$$\n",
"\n",
"这对某些$\\xi^{(k)} \\in [x^{(k)} - e^{(k)}, x^{(k)}]$成立。\n",
"将上述展开除以$f''(x^{(k)})$得到\n",
"\n",
"$$e^{(k)} - \\frac{f'(x^{(k)})}{f''(x^{(k)})} = \\frac{1}{2} (e^{(k)})^2 \\frac{f'''(\\xi^{(k)})}{f''(x^{(k)})}.$$\n",
"\n",
"回想之前的方程$x^{(k+1)} = x^{(k)} - f'(x^{(k)}) / f''(x^{(k)})$。\n",
"代入这个更新方程,取两边的绝对值,我们得到\n",
"\n",
"$$\\left|e^{(k+1)}\\right| = \\frac{1}{2}(e^{(k)})^2 \\frac{\\left|f'''(\\xi^{(k)})\\right|}{f''(x^{(k)})}.$$\n",
"\n",
"因此,每当我们处于有界区域$\\left|f'''(\\xi^{(k)})\\right| / (2f''(x^{(k)})) \\leq c$,\n",
"我们就有一个二次递减误差\n",
"\n",
"$$\\left|e^{(k+1)}\\right| \\leq c (e^{(k)})^2.$$\n",
"\n",
"另一方面,优化研究人员称之为“线性”收敛,而将$\\left|e^{(k+1)}\\right| \\leq \\alpha \\left|e^{(k)}\\right|$这样的条件称为“恒定”收敛速度。\n",
"请注意,我们无法估计整体收敛的速度,但是一旦我们接近极小值,收敛将变得非常快。\n",
"另外,这种分析要求$f$在高阶导数上表现良好,即确保$f$在如何变化它的值方面没有任何“超常”的特性。\n",
"\n",
"### 预处理\n",
"\n",
"计算和存储完整的Hessian非常昂贵,而改善这个问题的一种方法是“预处理”。\n",
"它回避了计算整个Hessian,而只计算“对角线”项,即如下的算法更新:\n",
"\n",
"$$\\mathbf{x} \\leftarrow \\mathbf{x} - \\eta \\mathrm{diag}(\\mathbf{H})^{-1} \\nabla f(\\mathbf{x}).$$\n",
"\n",
"虽然这不如完整的牛顿法精确,但它仍然比不使用要好得多。\n",
"为什么预处理有效呢?\n",
"假设一个变量以毫米表示高度,另一个变量以公里表示高度的情况。\n",
"假设这两种自然尺度都以米为单位,那么我们的参数化就出现了严重的不匹配。\n",
"幸运的是,使用预处理可以消除这种情况。\n",
"梯度下降的有效预处理相当于为每个变量选择不同的学习率(矢量$\\mathbf{x}$的坐标)。\n",
"我们将在后面一节看到,预处理推动了随机梯度下降优化算法的一些创新。\n",
"\n",
"### 梯度下降和线搜索\n",
"\n",
"梯度下降的一个关键问题是我们可能会超过目标或进展不足,\n",
"解决这一问题的简单方法是结合使用线搜索和梯度下降。\n",
"也就是说,我们使用$\\nabla f(\\mathbf{x})$给出的方向,\n",
"然后进行二分搜索,以确定哪个学习率$\\eta$使$f(\\mathbf{x} - \\eta \\nabla f(\\mathbf{x}))$取最小值。\n",
"\n",
"有关分析和证明,此算法收敛迅速(请参见 :cite:`Boyd.Vandenberghe.2004`)。\n",
"然而,对深度学习而言,这不太可行。\n",
"因为线搜索的每一步都需要评估整个数据集上的目标函数,实现它的方式太昂贵了。\n",
"\n",
"## 小结\n",
"\n",
"* 学习率的大小很重要:学习率太大会使模型发散,学习率太小会没有进展。\n",
"* 梯度下降会可能陷入局部极小值,而得不到全局最小值。\n",
"* 在高维模型中,调整学习率是很复杂的。\n",
"* 预处理有助于调节比例。\n",
"* 牛顿法在凸问题中一旦开始正常工作,速度就会快得多。\n",
"* 对于非凸问题,不要不作任何调整就使用牛顿法。\n",
"\n",
"## 练习\n",
"\n",
"1. 用不同的学习率和目标函数进行梯度下降实验。\n",
"1. 在区间$[a, b]$中实现线搜索以最小化凸函数。\n",
" 1. 是否需要导数来进行二分搜索,即决定选择$[a, (a+b)/2]$还是$[(a+b)/2, b]$。\n",
" 1. 算法的收敛速度有多快?\n",
" 1. 实现该算法,并将其应用于求$\\log (\\exp(x) + \\exp(-2x -3))$的最小值。\n",
"1. 设计一个定义在$\\mathbb{R}^2$上的目标函数,它的梯度下降非常缓慢。提示:不同坐标的缩放方式不同。\n",
"1. 使用预处理实现牛顿方法的轻量版本。\n",
" 1. 使用对角Hessian作为预条件子。\n",
" 1. 使用它的绝对值,而不是实际值(可能有符号)。\n",
" 1. 将此应用于上述问题。\n",
"1. 将上述算法应用于多个目标函数(凸或非凸)。如果把坐标旋转$45$度会怎么样?\n"
]
},
{
"cell_type": "markdown",
"id": "5db14585",
"metadata": {
"origin_pos": 33,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/3836)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}