2025-12-16 09:23:53 +08:00
parent 19138d3cc1
commit 9e7efd0626
409 changed files with 272713 additions and 241 deletions
@@ -0,0 +1,344 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "962fc6f3",
"metadata": {
"origin_pos": 0
},
"source": [
"# 通过时间反向传播\n",
":label:`sec_bptt`\n",
"\n",
"到目前为止,我们已经反复提到像*梯度爆炸*或*梯度消失*,\n",
"以及需要对循环神经网络*分离梯度*。\n",
"例如,在 :numref:`sec_rnn_scratch`中,\n",
"我们在序列上调用了`detach`函数。\n",
"为了能够快速构建模型并了解其工作原理,\n",
"上面所说的这些概念都没有得到充分的解释。\n",
"本节将更深入地探讨序列模型反向传播的细节,\n",
"以及相关的数学原理。\n",
"\n",
"当我们首次实现循环神经网络( :numref:`sec_rnn_scratch`)时,\n",
"遇到了梯度爆炸的问题。\n",
"如果做了练习题,就会发现梯度截断对于确保模型收敛至关重要。\n",
"为了更好地理解此问题,本节将回顾序列模型梯度的计算方式,\n",
"它的工作原理没有什么新概念,毕竟我们使用的仍然是链式法则来计算梯度。\n",
"\n",
"我们在 :numref:`sec_backprop`中描述了多层感知机中的\n",
"前向与反向传播及相关的计算图。\n",
"循环神经网络中的前向传播相对简单。\n",
"*通过时间反向传播*backpropagation through timeBPTT\n",
" :cite:`Werbos.1990`实际上是循环神经网络中反向传播技术的一个特定应用。\n",
"它要求我们将循环神经网络的计算图一次展开一个时间步,\n",
"以获得模型变量和参数之间的依赖关系。\n",
"然后,基于链式法则,应用反向传播来计算和存储梯度。\n",
"由于序列可能相当长,因此依赖关系也可能相当长。\n",
"例如,某个1000个字符的序列,\n",
"其第一个词元可能会对最后位置的词元产生重大影响。\n",
"这在计算上是不可行的(它需要的时间和内存都太多了),\n",
"并且还需要超过1000个矩阵的乘积才能得到非常难以捉摸的梯度。\n",
"这个过程充满了计算与统计的不确定性。\n",
"在下文中,我们将阐明会发生什么以及如何在实践中解决它们。\n",
"\n",
"## 循环神经网络的梯度分析\n",
":label:`subsec_bptt_analysis`\n",
"\n",
"我们从一个描述循环神经网络工作原理的简化模型开始,\n",
"此模型忽略了隐状态的特性及其更新方式的细节。\n",
"这里的数学表示没有像过去那样明确地区分标量、向量和矩阵,\n",
"因为这些细节对于分析并不重要,\n",
"反而只会使本小节中的符号变得混乱。\n",
"\n",
"在这个简化模型中,我们将时间步$t$的隐状态表示为$h_t$\n",
"输入表示为$x_t$,输出表示为$o_t$。\n",
"回想一下我们在 :numref:`subsec_rnn_w_hidden_states`中的讨论,\n",
"输入和隐状态可以拼接后与隐藏层中的一个权重变量相乘。\n",
"因此,我们分别使用$w_h$和$w_o$来表示隐藏层和输出层的权重。\n",
"每个时间步的隐状态和输出可以写为:\n",
"\n",
"$$\\begin{aligned}h_t &= f(x_t, h_{t-1}, w_h),\\\\o_t &= g(h_t, w_o),\\end{aligned}$$\n",
":eqlabel:`eq_bptt_ht_ot`\n",
"\n",
"其中$f$和$g$分别是隐藏层和输出层的变换。\n",
"因此,我们有一个链\n",
"$\\{\\ldots, (x_{t-1}, h_{t-1}, o_{t-1}), (x_{t}, h_{t}, o_t), \\ldots\\}$\n",
"它们通过循环计算彼此依赖。\n",
"前向传播相当简单,一次一个时间步的遍历三元组$(x_t, h_t, o_t)$\n",
"然后通过一个目标函数在所有$T$个时间步内\n",
"评估输出$o_t$和对应的标签$y_t$之间的差异:\n",
"\n",
"$$L(x_1, \\ldots, x_T, y_1, \\ldots, y_T, w_h, w_o) = \\frac{1}{T}\\sum_{t=1}^T l(y_t, o_t).$$\n",
"\n",
"对于反向传播,问题则有点棘手,\n",
"特别是当我们计算目标函数$L$关于参数$w_h$的梯度时。\n",
"具体来说,按照链式法则:\n",
"\n",
"$$\\begin{aligned}\\frac{\\partial L}{\\partial w_h} & = \\frac{1}{T}\\sum_{t=1}^T \\frac{\\partial l(y_t, o_t)}{\\partial w_h} \\\\& = \\frac{1}{T}\\sum_{t=1}^T \\frac{\\partial l(y_t, o_t)}{\\partial o_t} \\frac{\\partial g(h_t, w_o)}{\\partial h_t} \\frac{\\partial h_t}{\\partial w_h}.\\end{aligned}$$\n",
":eqlabel:`eq_bptt_partial_L_wh`\n",
"\n",
"在 :eqref:`eq_bptt_partial_L_wh`中乘积的第一项和第二项很容易计算,\n",
"而第三项$\\partial h_t/\\partial w_h$是使事情变得棘手的地方,\n",
"因为我们需要循环地计算参数$w_h$对$h_t$的影响。\n",
"根据 :eqref:`eq_bptt_ht_ot`中的递归计算,\n",
"$h_t$既依赖于$h_{t-1}$又依赖于$w_h$\n",
"其中$h_{t-1}$的计算也依赖于$w_h$。\n",
"因此,使用链式法则产生:\n",
"\n",
"$$\\frac{\\partial h_t}{\\partial w_h}= \\frac{\\partial f(x_{t},h_{t-1},w_h)}{\\partial w_h} +\\frac{\\partial f(x_{t},h_{t-1},w_h)}{\\partial h_{t-1}} \\frac{\\partial h_{t-1}}{\\partial w_h}.$$\n",
":eqlabel:`eq_bptt_partial_ht_wh_recur`\n",
"\n",
"为了导出上述梯度,假设我们有三个序列$\\{a_{t}\\},\\{b_{t}\\},\\{c_{t}\\}$\n",
"当$t=1,2,\\ldots$时,序列满足$a_{0}=0$且$a_{t}=b_{t}+c_{t}a_{t-1}$。\n",
"对于$t\\geq 1$,就很容易得出:\n",
"\n",
"$$a_{t}=b_{t}+\\sum_{i=1}^{t-1}\\left(\\prod_{j=i+1}^{t}c_{j}\\right)b_{i}.$$\n",
":eqlabel:`eq_bptt_at`\n",
"\n",
"基于下列公式替换$a_t$、$b_t$和$c_t$\n",
"\n",
"$$\\begin{aligned}a_t &= \\frac{\\partial h_t}{\\partial w_h},\\\\\n",
"b_t &= \\frac{\\partial f(x_{t},h_{t-1},w_h)}{\\partial w_h}, \\\\\n",
"c_t &= \\frac{\\partial f(x_{t},h_{t-1},w_h)}{\\partial h_{t-1}},\\end{aligned}$$\n",
"\n",
"公式 :eqref:`eq_bptt_partial_ht_wh_recur`中的梯度计算\n",
"满足$a_{t}=b_{t}+c_{t}a_{t-1}$。\n",
"因此,对于每个 :eqref:`eq_bptt_at`\n",
"我们可以使用下面的公式移除 :eqref:`eq_bptt_partial_ht_wh_recur`中的循环计算\n",
"\n",
"$$\\frac{\\partial h_t}{\\partial w_h}=\\frac{\\partial f(x_{t},h_{t-1},w_h)}{\\partial w_h}+\\sum_{i=1}^{t-1}\\left(\\prod_{j=i+1}^{t} \\frac{\\partial f(x_{j},h_{j-1},w_h)}{\\partial h_{j-1}} \\right) \\frac{\\partial f(x_{i},h_{i-1},w_h)}{\\partial w_h}.$$\n",
":eqlabel:`eq_bptt_partial_ht_wh_gen`\n",
"\n",
"虽然我们可以使用链式法则递归地计算$\\partial h_t/\\partial w_h$\n",
"但当$t$很大时这个链就会变得很长。\n",
"我们需要想想办法来处理这一问题.\n",
"\n",
"### 完全计算 ###\n",
"\n",
"显然,我们可以仅仅计算 :eqref:`eq_bptt_partial_ht_wh_gen`中的全部总和,\n",
"然而,这样的计算非常缓慢,并且可能会发生梯度爆炸,\n",
"因为初始条件的微小变化就可能会对结果产生巨大的影响。\n",
"也就是说,我们可以观察到类似于蝴蝶效应的现象,\n",
"即初始条件的很小变化就会导致结果发生不成比例的变化。\n",
"这对于我们想要估计的模型而言是非常不可取的。\n",
"毕竟,我们正在寻找的是能够很好地泛化高稳定性模型的估计器。\n",
"因此,在实践中,这种方法几乎从未使用过。\n",
"\n",
"### 截断时间步 ###\n",
"\n",
"或者,我们可以在$\\tau$步后截断\n",
" :eqref:`eq_bptt_partial_ht_wh_gen`中的求和计算。\n",
"这是我们到目前为止一直在讨论的内容,\n",
"例如在 :numref:`sec_rnn_scratch`中分离梯度时。\n",
"这会带来真实梯度的*近似*\n",
"只需将求和终止为$\\partial h_{t-\\tau}/\\partial w_h$。\n",
"在实践中,这种方式工作得很好。\n",
"它通常被称为截断的通过时间反向传播 :cite:`Jaeger.2002`。\n",
"这样做导致该模型主要侧重于短期影响,而不是长期影响。\n",
"这在现实中是可取的,因为它会将估计值偏向更简单和更稳定的模型。\n",
"\n",
"### 随机截断 ###\n",
"\n",
"最后,我们可以用一个随机变量替换$\\partial h_t/\\partial w_h$\n",
"该随机变量在预期中是正确的,但是会截断序列。\n",
"这个随机变量是通过使用序列$\\xi_t$来实现的,\n",
"序列预定义了$0 \\leq \\pi_t \\leq 1$\n",
"其中$P(\\xi_t = 0) = 1-\\pi_t$且$P(\\xi_t = \\pi_t^{-1}) = \\pi_t$\n",
"因此$E[\\xi_t] = 1$。\n",
"我们使用它来替换 :eqref:`eq_bptt_partial_ht_wh_recur`中的\n",
"梯度$\\partial h_t/\\partial w_h$得到:\n",
"\n",
"$$z_t= \\frac{\\partial f(x_{t},h_{t-1},w_h)}{\\partial w_h} +\\xi_t \\frac{\\partial f(x_{t},h_{t-1},w_h)}{\\partial h_{t-1}} \\frac{\\partial h_{t-1}}{\\partial w_h}.$$\n",
"\n",
"从$\\xi_t$的定义中推导出来$E[z_t] = \\partial h_t/\\partial w_h$。\n",
"每当$\\xi_t = 0$时,递归计算终止在这个$t$时间步。\n",
"这导致了不同长度序列的加权和,其中长序列出现的很少,\n",
"所以将适当地加大权重。\n",
"这个想法是由塔莱克和奥利维尔 :cite:`Tallec.Ollivier.2017`提出的。\n",
"\n",
"### 比较策略\n",
"\n",
"![比较RNN中计算梯度的策略,3行自上而下分别为:随机截断、常规截断、完整计算](../img/truncated-bptt.svg)\n",
":label:`fig_truncated_bptt`\n",
"\n",
" :numref:`fig_truncated_bptt`说明了\n",
"当基于循环神经网络使用通过时间反向传播\n",
"分析《时间机器》书中前几个字符的三种策略:\n",
"\n",
"* 第一行采用随机截断,方法是将文本划分为不同长度的片断;\n",
"* 第二行采用常规截断,方法是将文本分解为相同长度的子序列。\n",
" 这也是我们在循环神经网络实验中一直在做的;\n",
"* 第三行采用通过时间的完全反向传播,结果是产生了在计算上不可行的表达式。\n",
"\n",
"遗憾的是,虽然随机截断在理论上具有吸引力,\n",
"但很可能是由于多种因素在实践中并不比常规截断更好。\n",
"首先,在对过去若干个时间步经过反向传播后,\n",
"观测结果足以捕获实际的依赖关系。\n",
"其次,增加的方差抵消了时间步数越多梯度越精确的事实。\n",
"第三,我们真正想要的是只有短范围交互的模型。\n",
"因此,模型需要的正是截断的通过时间反向传播方法所具备的轻度正则化效果。\n",
"\n",
"## 通过时间反向传播的细节\n",
"\n",
"在讨论一般性原则之后,我们看一下通过时间反向传播问题的细节。\n",
"与 :numref:`subsec_bptt_analysis`中的分析不同,\n",
"下面我们将展示如何计算目标函数相对于所有分解模型参数的梯度。\n",
"为了保持简单,我们考虑一个没有偏置参数的循环神经网络,\n",
"其在隐藏层中的激活函数使用恒等映射($\\phi(x)=x$)。\n",
"对于时间步$t$,设单个样本的输入及其对应的标签分别为\n",
"$\\mathbf{x}_t \\in \\mathbb{R}^d$和$y_t$。\n",
"计算隐状态$\\mathbf{h}_t \\in \\mathbb{R}^h$和\n",
"输出$\\mathbf{o}_t \\in \\mathbb{R}^q$的方式为:\n",
"\n",
"$$\\begin{aligned}\\mathbf{h}_t &= \\mathbf{W}_{hx} \\mathbf{x}_t + \\mathbf{W}_{hh} \\mathbf{h}_{t-1},\\\\\n",
"\\mathbf{o}_t &= \\mathbf{W}_{qh} \\mathbf{h}_{t},\\end{aligned}$$\n",
"\n",
"其中权重参数为$\\mathbf{W}_{hx} \\in \\mathbb{R}^{h \\times d}$、\n",
"$\\mathbf{W}_{hh} \\in \\mathbb{R}^{h \\times h}$和\n",
"$\\mathbf{W}_{qh} \\in \\mathbb{R}^{q \\times h}$。\n",
"用$l(\\mathbf{o}_t, y_t)$表示时间步$t$处\n",
"(即从序列开始起的超过$T$个时间步)的损失函数,\n",
"则我们的目标函数的总体损失是:\n",
"\n",
"$$L = \\frac{1}{T} \\sum_{t=1}^T l(\\mathbf{o}_t, y_t).$$\n",
"\n",
"为了在循环神经网络的计算过程中可视化模型变量和参数之间的依赖关系,\n",
"我们可以为模型绘制一个计算图,\n",
"如 :numref:`fig_rnn_bptt`所示。\n",
"例如,时间步3的隐状态$\\mathbf{h}_3$的计算\n",
"取决于模型参数$\\mathbf{W}_{hx}$和$\\mathbf{W}_{hh}$\n",
"以及最终时间步的隐状态$\\mathbf{h}_2$\n",
"以及当前时间步的输入$\\mathbf{x}_3$。\n",
"\n",
"![上图表示具有三个时间步的循环神经网络模型依赖关系的计算图。未着色的方框表示变量,着色的方框表示参数,圆表示运算符](../img/rnn-bptt.svg)\n",
":label:`fig_rnn_bptt`\n",
"\n",
"正如刚才所说, :numref:`fig_rnn_bptt`中的模型参数是\n",
"$\\mathbf{W}_{hx}$、$\\mathbf{W}_{hh}$和$\\mathbf{W}_{qh}$。\n",
"通常,训练该模型需要对这些参数进行梯度计算:\n",
"$\\partial L/\\partial \\mathbf{W}_{hx}$、\n",
"$\\partial L/\\partial \\mathbf{W}_{hh}$和\n",
"$\\partial L/\\partial \\mathbf{W}_{qh}$。\n",
"根据 :numref:`fig_rnn_bptt`中的依赖关系,\n",
"我们可以沿箭头的相反方向遍历计算图,依次计算和存储梯度。\n",
"为了灵活地表示链式法则中不同形状的矩阵、向量和标量的乘法,\n",
"我们继续使用如 :numref:`sec_backprop`中\n",
"所述的$\\text{prod}$运算符。\n",
"\n",
"首先,在任意时间步$t$\n",
"目标函数关于模型输出的微分计算是相当简单的:\n",
"\n",
"$$\\frac{\\partial L}{\\partial \\mathbf{o}_t} = \\frac{\\partial l (\\mathbf{o}_t, y_t)}{T \\cdot \\partial \\mathbf{o}_t} \\in \\mathbb{R}^q.$$\n",
":eqlabel:`eq_bptt_partial_L_ot`\n",
"\n",
"现在,我们可以计算目标函数关于输出层中参数$\\mathbf{W}_{qh}$的梯度:\n",
"$\\partial L/\\partial \\mathbf{W}_{qh} \\in \\mathbb{R}^{q \\times h}$。\n",
"基于 :numref:`fig_rnn_bptt`\n",
"目标函数$L$通过$\\mathbf{o}_1, \\ldots, \\mathbf{o}_T$\n",
"依赖于$\\mathbf{W}_{qh}$。\n",
"依据链式法则,得到\n",
"\n",
"$$\n",
"\\frac{\\partial L}{\\partial \\mathbf{W}_{qh}}\n",
"= \\sum_{t=1}^T \\text{prod}\\left(\\frac{\\partial L}{\\partial \\mathbf{o}_t}, \\frac{\\partial \\mathbf{o}_t}{\\partial \\mathbf{W}_{qh}}\\right)\n",
"= \\sum_{t=1}^T \\frac{\\partial L}{\\partial \\mathbf{o}_t} \\mathbf{h}_t^\\top,\n",
"$$\n",
"\n",
"其中$\\partial L/\\partial \\mathbf{o}_t$是\n",
"由 :eqref:`eq_bptt_partial_L_ot`给出的。\n",
"\n",
"接下来,如 :numref:`fig_rnn_bptt`所示,\n",
"在最后的时间步$T$,目标函数$L$仅通过$\\mathbf{o}_T$\n",
"依赖于隐状态$\\mathbf{h}_T$。\n",
"因此,我们通过使用链式法可以很容易地得到梯度\n",
"$\\partial L/\\partial \\mathbf{h}_T \\in \\mathbb{R}^h$\n",
"\n",
"$$\\frac{\\partial L}{\\partial \\mathbf{h}_T} = \\text{prod}\\left(\\frac{\\partial L}{\\partial \\mathbf{o}_T}, \\frac{\\partial \\mathbf{o}_T}{\\partial \\mathbf{h}_T} \\right) = \\mathbf{W}_{qh}^\\top \\frac{\\partial L}{\\partial \\mathbf{o}_T}.$$\n",
":eqlabel:`eq_bptt_partial_L_hT_final_step`\n",
"\n",
"当目标函数$L$通过$\\mathbf{h}_{t+1}$和$\\mathbf{o}_t$\n",
"依赖$\\mathbf{h}_t$时,\n",
"对任意时间步$t < T$来说都变得更加棘手。\n",
"根据链式法则,隐状态的梯度\n",
"$\\partial L/\\partial \\mathbf{h}_t \\in \\mathbb{R}^h$\n",
"在任何时间步骤$t < T$时都可以递归地计算为:\n",
"\n",
"$$\\frac{\\partial L}{\\partial \\mathbf{h}_t} = \\text{prod}\\left(\\frac{\\partial L}{\\partial \\mathbf{h}_{t+1}}, \\frac{\\partial \\mathbf{h}_{t+1}}{\\partial \\mathbf{h}_t} \\right) + \\text{prod}\\left(\\frac{\\partial L}{\\partial \\mathbf{o}_t}, \\frac{\\partial \\mathbf{o}_t}{\\partial \\mathbf{h}_t} \\right) = \\mathbf{W}_{hh}^\\top \\frac{\\partial L}{\\partial \\mathbf{h}_{t+1}} + \\mathbf{W}_{qh}^\\top \\frac{\\partial L}{\\partial \\mathbf{o}_t}.$$\n",
":eqlabel:`eq_bptt_partial_L_ht_recur`\n",
"\n",
"为了进行分析,对于任何时间步$1 \\leq t \\leq T$展开递归计算得\n",
"\n",
"$$\\frac{\\partial L}{\\partial \\mathbf{h}_t}= \\sum_{i=t}^T {\\left(\\mathbf{W}_{hh}^\\top\\right)}^{T-i} \\mathbf{W}_{qh}^\\top \\frac{\\partial L}{\\partial \\mathbf{o}_{T+t-i}}.$$\n",
":eqlabel:`eq_bptt_partial_L_ht`\n",
"\n",
"我们可以从 :eqref:`eq_bptt_partial_L_ht`中看到,\n",
"这个简单的线性例子已经展现了长序列模型的一些关键问题:\n",
"它陷入到$\\mathbf{W}_{hh}^\\top$的潜在的非常大的幂。\n",
"在这个幂中,小于1的特征值将会消失,大于1的特征值将会发散。\n",
"这在数值上是不稳定的,表现形式为梯度消失或梯度爆炸。\n",
"解决此问题的一种方法是按照计算方便的需要截断时间步长的尺寸\n",
"如 :numref:`subsec_bptt_analysis`中所述。\n",
"实际上,这种截断是通过在给定数量的时间步之后分离梯度来实现的。\n",
"稍后,我们将学习更复杂的序列模型(如长短期记忆模型)\n",
"是如何进一步缓解这一问题的。\n",
"\n",
"最后, :numref:`fig_rnn_bptt`表明:\n",
"目标函数$L$通过隐状态$\\mathbf{h}_1, \\ldots, \\mathbf{h}_T$\n",
"依赖于隐藏层中的模型参数$\\mathbf{W}_{hx}$和$\\mathbf{W}_{hh}$。\n",
"为了计算有关这些参数的梯度\n",
"$\\partial L / \\partial \\mathbf{W}_{hx} \\in \\mathbb{R}^{h \\times d}$和$\\partial L / \\partial \\mathbf{W}_{hh} \\in \\mathbb{R}^{h \\times h}$\n",
"我们应用链式规则得:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\frac{\\partial L}{\\partial \\mathbf{W}_{hx}}\n",
"&= \\sum_{t=1}^T \\text{prod}\\left(\\frac{\\partial L}{\\partial \\mathbf{h}_t}, \\frac{\\partial \\mathbf{h}_t}{\\partial \\mathbf{W}_{hx}}\\right)\n",
"= \\sum_{t=1}^T \\frac{\\partial L}{\\partial \\mathbf{h}_t} \\mathbf{x}_t^\\top,\\\\\n",
"\\frac{\\partial L}{\\partial \\mathbf{W}_{hh}}\n",
"&= \\sum_{t=1}^T \\text{prod}\\left(\\frac{\\partial L}{\\partial \\mathbf{h}_t}, \\frac{\\partial \\mathbf{h}_t}{\\partial \\mathbf{W}_{hh}}\\right)\n",
"= \\sum_{t=1}^T \\frac{\\partial L}{\\partial \\mathbf{h}_t} \\mathbf{h}_{t-1}^\\top,\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"其中$\\partial L/\\partial \\mathbf{h}_t$\n",
"是由 :eqref:`eq_bptt_partial_L_hT_final_step`和\n",
" :eqref:`eq_bptt_partial_L_ht_recur`递归计算得到的,\n",
"是影响数值稳定性的关键量。\n",
"\n",
"正如我们在 :numref:`sec_backprop`中所解释的那样,\n",
"由于通过时间反向传播是反向传播在循环神经网络中的应用方式,\n",
"所以训练循环神经网络交替使用前向传播和通过时间反向传播。\n",
"通过时间反向传播依次计算并存储上述梯度。\n",
"具体而言,存储的中间值会被重复使用,以避免重复计算,\n",
"例如存储$\\partial L/\\partial \\mathbf{h}_t$\n",
"以便在计算$\\partial L / \\partial \\mathbf{W}_{hx}$和\n",
"$\\partial L / \\partial \\mathbf{W}_{hh}$时使用。\n",
"\n",
"## 小结\n",
"\n",
"* “通过时间反向传播”仅仅适用于反向传播在具有隐状态的序列模型。\n",
"* 截断是计算方便性和数值稳定性的需要。截断包括:规则截断和随机截断。\n",
"* 矩阵的高次幂可能导致神经网络特征值的发散或消失,将以梯度爆炸或梯度消失的形式表现。\n",
"* 为了计算的效率,“通过时间反向传播”在计算期间会缓存中间值。\n",
"\n",
"## 练习\n",
"\n",
"1. 假设我们拥有一个对称矩阵$\\mathbf{M} \\in \\mathbb{R}^{n \\times n}$,其特征值为$\\lambda_i$,对应的特征向量是$\\mathbf{v}_i$$i = 1, \\ldots, n$)。通常情况下,假设特征值的序列顺序为$|\\lambda_i| \\geq |\\lambda_{i+1}|$。\n",
" 1. 证明$\\mathbf{M}^k$拥有特征值$\\lambda_i^k$。\n",
" 1. 证明对于一个随机向量$\\mathbf{x} \\in \\mathbb{R}^n$$\\mathbf{M}^k \\mathbf{x}$将有较高概率与$\\mathbf{M}$的特征向量$\\mathbf{v}_1$在一条直线上。形式化这个证明过程。\n",
" 1. 上述结果对于循环神经网络中的梯度意味着什么?\n",
"1. 除了梯度截断,还有其他方法来应对循环神经网络中的梯度爆炸吗?\n",
"\n",
"[Discussions](https://discuss.d2l.ai/t/2107)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,62 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3fa3d90d",
"metadata": {
"origin_pos": 0
},
"source": [
"# 循环神经网络\n",
":label:`chap_rnn`\n",
"\n",
"到目前为止,我们遇到过两种类型的数据:表格数据和图像数据。\n",
"对于图像数据,我们设计了专门的卷积神经网络架构来为这类特殊的数据结构建模。\n",
"换句话说,如果我们拥有一张图像,我们需要有效地利用其像素位置,\n",
"假若我们对图像中的像素位置进行重排,就会对图像中内容的推断造成极大的困难。\n",
"\n",
"最重要的是,到目前为止我们默认数据都来自于某种分布,\n",
"并且所有样本都是独立同分布的\n",
"independently and identically distributedi.i.d.)。\n",
"然而,大多数的数据并非如此。\n",
"例如,文章中的单词是按顺序写的,如果顺序被随机地重排,就很难理解文章原始的意思。\n",
"同样,视频中的图像帧、对话中的音频信号以及网站上的浏览行为都是有顺序的。\n",
"因此,针对此类数据而设计特定模型,可能效果会更好。\n",
"\n",
"另一个问题来自这样一个事实:\n",
"我们不仅仅可以接收一个序列作为输入,而是还可能期望继续猜测这个序列的后续。\n",
"例如,一个任务可以是继续预测$2, 4, 6, 8, 10, \\ldots$。\n",
"这在时间序列分析中是相当常见的,可以用来预测股市的波动、\n",
"患者的体温曲线或者赛车所需的加速度。\n",
"同理,我们需要能够处理这些数据的特定模型。\n",
"\n",
"简言之,如果说卷积神经网络可以有效地处理空间信息,\n",
"那么本章的*循环神经网络*recurrent neural networkRNN)则可以更好地处理序列信息。\n",
"循环神经网络通过引入状态变量存储过去的信息和当前的输入,从而可以确定当前的输出。\n",
"\n",
"许多使用循环网络的例子都是基于文本数据的,因此我们将在本章中重点介绍语言模型。\n",
"在对序列数据进行更详细的回顾之后,我们将介绍文本预处理的实用技术。\n",
"然后,我们将讨论语言模型的基本概念,并将此讨论作为循环神经网络设计的灵感。\n",
"最后,我们描述了循环神经网络的梯度计算方法,以探讨训练此类网络时可能遇到的问题。\n",
"\n",
":begin_tab:toc\n",
" - [sequence](sequence.ipynb)\n",
" - [text-preprocessing](text-preprocessing.ipynb)\n",
" - [language-models-and-dataset](language-models-and-dataset.ipynb)\n",
" - [rnn](rnn.ipynb)\n",
" - [rnn-scratch](rnn-scratch.ipynb)\n",
" - [rnn-concise](rnn-concise.ipynb)\n",
" - [bptt](bptt.ipynb)\n",
":end_tab:\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -0,0 +1,412 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "90ff0864",
"metadata": {
"origin_pos": 0
},
"source": [
"# 循环神经网络\n",
":label:`sec_rnn`\n",
"\n",
"在 :numref:`sec_language_model`中,\n",
"我们介绍了$n$元语法模型,\n",
"其中单词$x_t$在时间步$t$的条件概率仅取决于前面$n-1$个单词。\n",
"对于时间步$t-(n-1)$之前的单词,\n",
"如果我们想将其可能产生的影响合并到$x_t$上,\n",
"需要增加$n$,然而模型参数的数量也会随之呈指数增长,\n",
"因为词表$\\mathcal{V}$需要存储$|\\mathcal{V}|^n$个数字,\n",
"因此与其将$P(x_t \\mid x_{t-1}, \\ldots, x_{t-n+1})$模型化,\n",
"不如使用隐变量模型:\n",
"\n",
"$$P(x_t \\mid x_{t-1}, \\ldots, x_1) \\approx P(x_t \\mid h_{t-1}),$$\n",
"\n",
"其中$h_{t-1}$是*隐状态*hidden state),\n",
"也称为*隐藏变量*hidden variable),\n",
"它存储了到时间步$t-1$的序列信息。\n",
"通常,我们可以基于当前输入$x_{t}$和先前隐状态$h_{t-1}$\n",
"来计算时间步$t$处的任何时间的隐状态:\n",
"\n",
"$$h_t = f(x_{t}, h_{t-1}).$$\n",
":eqlabel:`eq_ht_xt`\n",
"\n",
"对于 :eqref:`eq_ht_xt`中的函数$f$,隐变量模型不是近似值。\n",
"毕竟$h_t$是可以仅仅存储到目前为止观察到的所有数据,\n",
"然而这样的操作可能会使计算和存储的代价都变得昂贵。\n",
"\n",
"回想一下,我们在 :numref:`chap_perceptrons`中\n",
"讨论过的具有隐藏单元的隐藏层。\n",
"值得注意的是,隐藏层和隐状态指的是两个截然不同的概念。\n",
"如上所述,隐藏层是在从输入到输出的路径上(以观测角度来理解)的隐藏的层,\n",
"而隐状态则是在给定步骤所做的任何事情(以技术角度来定义)的*输入*,\n",
"并且这些状态只能通过先前时间步的数据来计算。\n",
"\n",
"*循环神经网络*recurrent neural networksRNNs\n",
"是具有隐状态的神经网络。\n",
"在介绍循环神经网络模型之前,\n",
"我们首先回顾 :numref:`sec_mlp`中介绍的多层感知机模型。\n",
"\n",
"## 无隐状态的神经网络\n",
"\n",
"让我们来看一看只有单隐藏层的多层感知机。\n",
"设隐藏层的激活函数为$\\phi$\n",
"给定一个小批量样本$\\mathbf{X} \\in \\mathbb{R}^{n \\times d}$\n",
"其中批量大小为$n$,输入维度为$d$\n",
"则隐藏层的输出$\\mathbf{H} \\in \\mathbb{R}^{n \\times h}$通过下式计算:\n",
"\n",
"$$\\mathbf{H} = \\phi(\\mathbf{X} \\mathbf{W}_{xh} + \\mathbf{b}_h).$$\n",
":eqlabel:`rnn_h_without_state`\n",
"\n",
"在 :eqref:`rnn_h_without_state`中,\n",
"我们拥有的隐藏层权重参数为$\\mathbf{W}_{xh} \\in \\mathbb{R}^{d \\times h}$\n",
"偏置参数为$\\mathbf{b}_h \\in \\mathbb{R}^{1 \\times h}$\n",
"以及隐藏单元的数目为$h$。\n",
"因此求和时可以应用广播机制(见 :numref:`subsec_broadcasting`)。\n",
"接下来,将隐藏变量$\\mathbf{H}$用作输出层的输入。\n",
"输出层由下式给出:\n",
"\n",
"$$\\mathbf{O} = \\mathbf{H} \\mathbf{W}_{hq} + \\mathbf{b}_q,$$\n",
"\n",
"其中,$\\mathbf{O} \\in \\mathbb{R}^{n \\times q}$是输出变量,\n",
"$\\mathbf{W}_{hq} \\in \\mathbb{R}^{h \\times q}$是权重参数,\n",
"$\\mathbf{b}_q \\in \\mathbb{R}^{1 \\times q}$是输出层的偏置参数。\n",
"如果是分类问题,我们可以用$\\text{softmax}(\\mathbf{O})$\n",
"来计算输出类别的概率分布。\n",
"\n",
"这完全类似于之前在 :numref:`sec_sequence`中解决的回归问题,\n",
"因此我们省略了细节。\n",
"无须多言,只要可以随机选择“特征-标签”对,\n",
"并且通过自动微分和随机梯度下降能够学习网络参数就可以了。\n",
"\n",
"## 有隐状态的循环神经网络\n",
":label:`subsec_rnn_w_hidden_states`\n",
"\n",
"有了隐状态后,情况就完全不同了。\n",
"假设我们在时间步$t$有小批量输入$\\mathbf{X}_t \\in \\mathbb{R}^{n \\times d}$。\n",
"换言之,对于$n$个序列样本的小批量,\n",
"$\\mathbf{X}_t$的每一行对应于来自该序列的时间步$t$处的一个样本。\n",
"接下来,用$\\mathbf{H}_t \\in \\mathbb{R}^{n \\times h}$\n",
"表示时间步$t$的隐藏变量。\n",
"与多层感知机不同的是,\n",
"我们在这里保存了前一个时间步的隐藏变量$\\mathbf{H}_{t-1}$\n",
"并引入了一个新的权重参数$\\mathbf{W}_{hh} \\in \\mathbb{R}^{h \\times h}$\n",
"来描述如何在当前时间步中使用前一个时间步的隐藏变量。\n",
"具体地说,当前时间步隐藏变量由当前时间步的输入\n",
"与前一个时间步的隐藏变量一起计算得出:\n",
"\n",
"$$\\mathbf{H}_t = \\phi(\\mathbf{X}_t \\mathbf{W}_{xh} + \\mathbf{H}_{t-1} \\mathbf{W}_{hh} + \\mathbf{b}_h).$$\n",
":eqlabel:`rnn_h_with_state`\n",
"\n",
"与 :eqref:`rnn_h_without_state`相比,\n",
" :eqref:`rnn_h_with_state`多添加了一项\n",
"$\\mathbf{H}_{t-1} \\mathbf{W}_{hh}$\n",
"从而实例化了 :eqref:`eq_ht_xt`。\n",
"从相邻时间步的隐藏变量$\\mathbf{H}_t$和\n",
"$\\mathbf{H}_{t-1}$之间的关系可知,\n",
"这些变量捕获并保留了序列直到其当前时间步的历史信息,\n",
"就如当前时间步下神经网络的状态或记忆,\n",
"因此这样的隐藏变量被称为*隐状态*hidden state)。\n",
"由于在当前时间步中,\n",
"隐状态使用的定义与前一个时间步中使用的定义相同,\n",
"因此 :eqref:`rnn_h_with_state`的计算是*循环的*recurrent)。\n",
"于是基于循环计算的隐状态神经网络被命名为\n",
"*循环神经网络*recurrent neural network)。\n",
"在循环神经网络中执行 :eqref:`rnn_h_with_state`计算的层\n",
"称为*循环层*recurrent layer)。\n",
"\n",
"有许多不同的方法可以构建循环神经网络,\n",
"由 :eqref:`rnn_h_with_state`定义的隐状态的循环神经网络是非常常见的一种。\n",
"对于时间步$t$,输出层的输出类似于多层感知机中的计算:\n",
"\n",
"$$\\mathbf{O}_t = \\mathbf{H}_t \\mathbf{W}_{hq} + \\mathbf{b}_q.$$\n",
"\n",
"循环神经网络的参数包括隐藏层的权重\n",
"$\\mathbf{W}_{xh} \\in \\mathbb{R}^{d \\times h}, \\mathbf{W}_{hh} \\in \\mathbb{R}^{h \\times h}$和偏置$\\mathbf{b}_h \\in \\mathbb{R}^{1 \\times h}$\n",
"以及输出层的权重$\\mathbf{W}_{hq} \\in \\mathbb{R}^{h \\times q}$\n",
"和偏置$\\mathbf{b}_q \\in \\mathbb{R}^{1 \\times q}$。\n",
"值得一提的是,即使在不同的时间步,循环神经网络也总是使用这些模型参数。\n",
"因此,循环神经网络的参数开销不会随着时间步的增加而增加。\n",
"\n",
" :numref:`fig_rnn`展示了循环神经网络在三个相邻时间步的计算逻辑。\n",
"在任意时间步$t$,隐状态的计算可以被视为:\n",
"\n",
"1. 拼接当前时间步$t$的输入$\\mathbf{X}_t$和前一时间步$t-1$的隐状态$\\mathbf{H}_{t-1}$\n",
"1. 将拼接的结果送入带有激活函数$\\phi$的全连接层。\n",
" 全连接层的输出是当前时间步$t$的隐状态$\\mathbf{H}_t$。\n",
" \n",
"在本例中,模型参数是$\\mathbf{W}_{xh}$和$\\mathbf{W}_{hh}$的拼接,\n",
"以及$\\mathbf{b}_h$的偏置,所有这些参数都来自 :eqref:`rnn_h_with_state`。\n",
"当前时间步$t$的隐状态$\\mathbf{H}_t$\n",
"将参与计算下一时间步$t+1$的隐状态$\\mathbf{H}_{t+1}$。\n",
"而且$\\mathbf{H}_t$还将送入全连接输出层,\n",
"用于计算当前时间步$t$的输出$\\mathbf{O}_t$。\n",
"\n",
"![具有隐状态的循环神经网络](../img/rnn.svg)\n",
":label:`fig_rnn`\n",
"\n",
"我们刚才提到,隐状态中\n",
"$\\mathbf{X}_t \\mathbf{W}_{xh} + \\mathbf{H}_{t-1} \\mathbf{W}_{hh}$的计算,\n",
"相当于$\\mathbf{X}_t$和$\\mathbf{H}_{t-1}$的拼接\n",
"与$\\mathbf{W}_{xh}$和$\\mathbf{W}_{hh}$的拼接的矩阵乘法。\n",
"虽然这个性质可以通过数学证明,\n",
"但在下面我们使用一个简单的代码来说明一下。\n",
"首先,我们定义矩阵`X`、`W_xh`、`H`和`W_hh`\n",
"它们的形状分别为$(31)$、$(14)$、$(34)$和$(44)$。\n",
"分别将`X`乘以`W_xh`,将`H`乘以`W_hh`\n",
"然后将这两个乘法相加,我们得到一个形状为$(3,4)$的矩阵。\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3b32e0ed",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:58:10.528336Z",
"iopub.status.busy": "2023-08-18T06:58:10.527597Z",
"iopub.status.idle": "2023-08-18T06:58:12.493106Z",
"shell.execute_reply": "2023-08-18T06:58:12.492193Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cc0b1ab9",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:58:12.497284Z",
"iopub.status.busy": "2023-08-18T06:58:12.496796Z",
"iopub.status.idle": "2023-08-18T06:58:12.510001Z",
"shell.execute_reply": "2023-08-18T06:58:12.508927Z"
},
"origin_pos": 5,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[-1.6506, -0.7309, 2.0021, -0.1055],\n",
" [ 1.7334, 2.2035, -3.3148, -2.1629],\n",
" [-2.0071, -1.0902, 0.2376, -1.3144]])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X, W_xh = torch.normal(0, 1, (3, 1)), torch.normal(0, 1, (1, 4))\n",
"H, W_hh = torch.normal(0, 1, (3, 4)), torch.normal(0, 1, (4, 4))\n",
"torch.matmul(X, W_xh) + torch.matmul(H, W_hh)"
]
},
{
"cell_type": "markdown",
"id": "d5d0879c",
"metadata": {
"origin_pos": 7
},
"source": [
"现在,我们沿列(轴1)拼接矩阵`X`和`H`\n",
"沿行(轴0)拼接矩阵`W_xh`和`W_hh`。\n",
"这两个拼接分别产生形状$(3, 5)$和形状$(5, 4)$的矩阵。\n",
"再将这两个拼接的矩阵相乘,\n",
"我们得到与上面相同形状$(3, 4)$的输出矩阵。\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1a310233",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:58:12.513639Z",
"iopub.status.busy": "2023-08-18T06:58:12.513360Z",
"iopub.status.idle": "2023-08-18T06:58:12.520602Z",
"shell.execute_reply": "2023-08-18T06:58:12.519678Z"
},
"origin_pos": 8,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[-1.6506, -0.7309, 2.0021, -0.1055],\n",
" [ 1.7334, 2.2035, -3.3148, -2.1629],\n",
" [-2.0071, -1.0902, 0.2376, -1.3144]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"torch.matmul(torch.cat((X, H), 1), torch.cat((W_xh, W_hh), 0))"
]
},
{
"cell_type": "markdown",
"id": "816b1d00",
"metadata": {
"origin_pos": 9
},
"source": [
"## 基于循环神经网络的字符级语言模型\n",
"\n",
"回想一下 :numref:`sec_language_model`中的语言模型,\n",
"我们的目标是根据过去的和当前的词元预测下一个词元,\n",
"因此我们将原始序列移位一个词元作为标签。\n",
"Bengio等人首先提出使用神经网络进行语言建模\n",
" :cite:`Bengio.Ducharme.Vincent.ea.2003`。\n",
"接下来,我们看一下如何使用循环神经网络来构建语言模型。\n",
"设小批量大小为1,批量中的文本序列为“machine”。\n",
"为了简化后续部分的训练,我们考虑使用\n",
"*字符级语言模型*character-level language model),\n",
"将文本词元化为字符而不是单词。\n",
" :numref:`fig_rnn_train`演示了\n",
"如何通过基于字符级语言建模的循环神经网络,\n",
"使用当前的和先前的字符预测下一个字符。\n",
"\n",
"![基于循环神经网络的字符级语言模型:输入序列和标签序列分别为“machin”和“achine”](../img/rnn-train.svg)\n",
":label:`fig_rnn_train`\n",
"\n",
"在训练过程中,我们对每个时间步的输出层的输出进行softmax操作,\n",
"然后利用交叉熵损失计算模型输出和标签之间的误差。\n",
"由于隐藏层中隐状态的循环计算,\n",
" :numref:`fig_rnn_train`中的第$3$个时间步的输出$\\mathbf{O}_3$\n",
"由文本序列“m”“a”和“c”确定。\n",
"由于训练数据中这个文本序列的下一个字符是“h”,\n",
"因此第$3$个时间步的损失将取决于下一个字符的概率分布,\n",
"而下一个字符是基于特征序列“m”“a”“c”和这个时间步的标签“h”生成的。\n",
"\n",
"在实践中,我们使用的批量大小为$n>1$\n",
"每个词元都由一个$d$维向量表示。\n",
"因此,在时间步$t$输入$\\mathbf X_t$将是一个$n\\times d$矩阵,\n",
"这与我们在 :numref:`subsec_rnn_w_hidden_states`中的讨论相同。\n",
"\n",
"## 困惑度(Perplexity\n",
":label:`subsec_perplexity`\n",
"\n",
"最后,让我们讨论如何度量语言模型的质量,\n",
"这将在后续部分中用于评估基于循环神经网络的模型。\n",
"一个好的语言模型能够用高度准确的词元来预测我们接下来会看到什么。\n",
"考虑一下由不同的语言模型给出的对“It is raining ...”(“...下雨了”)的续写:\n",
"\n",
"1. \"It is raining outside\"(外面下雨了);\n",
"1. \"It is raining banana tree\"(香蕉树下雨了);\n",
"1. \"It is raining piouw;kcj pwepoiut\"piouw;kcj pwepoiut下雨了)。\n",
"\n",
"就质量而言,例$1$显然是最合乎情理、在逻辑上最连贯的。\n",
"虽然这个模型可能没有很准确地反映出后续词的语义,\n",
"比如,“It is raining in San Francisco”(旧金山下雨了)\n",
"和“It is raining in winter”(冬天下雨了)\n",
"可能才是更完美的合理扩展,\n",
"但该模型已经能够捕捉到跟在后面的是哪类单词。\n",
"例$2$则要糟糕得多,因为其产生了一个无意义的续写。\n",
"尽管如此,至少该模型已经学会了如何拼写单词,\n",
"以及单词之间的某种程度的相关性。\n",
"最后,例$3$表明了训练不足的模型是无法正确地拟合数据的。\n",
"\n",
"我们可以通过计算序列的似然概率来度量模型的质量。\n",
"然而这是一个难以理解、难以比较的数字。\n",
"毕竟,较短的序列比较长的序列更有可能出现,\n",
"因此评估模型产生托尔斯泰的巨著《战争与和平》的可能性\n",
"不可避免地会比产生圣埃克苏佩里的中篇小说《小王子》可能性要小得多。\n",
"而缺少的可能性值相当于平均数。\n",
"\n",
"在这里,信息论可以派上用场了。\n",
"我们在引入softmax回归\n",
" :numref:`subsec_info_theory_basics`)时定义了熵、惊异和交叉熵,\n",
"并在[信息论的在线附录](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html)\n",
"中讨论了更多的信息论知识。\n",
"如果想要压缩文本,我们可以根据当前词元集预测的下一个词元。\n",
"一个更好的语言模型应该能让我们更准确地预测下一个词元。\n",
"因此,它应该允许我们在压缩序列时花费更少的比特。\n",
"所以我们可以通过一个序列中所有的$n$个词元的交叉熵损失的平均值来衡量:\n",
"\n",
"$$\\frac{1}{n} \\sum_{t=1}^n -\\log P(x_t \\mid x_{t-1}, \\ldots, x_1),$$\n",
":eqlabel:`eq_avg_ce_for_lm`\n",
"\n",
"其中$P$由语言模型给出,\n",
"$x_t$是在时间步$t$从该序列中观察到的实际词元。\n",
"这使得不同长度的文档的性能具有了可比性。\n",
"由于历史原因,自然语言处理的科学家更喜欢使用一个叫做*困惑度*(perplexity)的量。\n",
"简而言之,它是 :eqref:`eq_avg_ce_for_lm`的指数:\n",
"\n",
"$$\\exp\\left(-\\frac{1}{n} \\sum_{t=1}^n \\log P(x_t \\mid x_{t-1}, \\ldots, x_1)\\right).$$\n",
"\n",
"困惑度的最好的理解是“下一个词元的实际选择数的调和平均数”。\n",
"我们看看一些案例。\n",
"\n",
"* 在最好的情况下,模型总是完美地估计标签词元的概率为1。\n",
" 在这种情况下,模型的困惑度为1。\n",
"* 在最坏的情况下,模型总是预测标签词元的概率为0。\n",
" 在这种情况下,困惑度是正无穷大。\n",
"* 在基线上,该模型的预测是词表的所有可用词元上的均匀分布。\n",
" 在这种情况下,困惑度等于词表中唯一词元的数量。\n",
" 事实上,如果我们在没有任何压缩的情况下存储序列,\n",
" 这将是我们能做的最好的编码方式。\n",
" 因此,这种方式提供了一个重要的上限,\n",
" 而任何实际模型都必须超越这个上限。\n",
"\n",
"在接下来的小节中,我们将基于循环神经网络实现字符级语言模型,\n",
"并使用困惑度来评估这样的模型。\n",
"\n",
"## 小结\n",
"\n",
"* 对隐状态使用循环计算的神经网络称为循环神经网络(RNN)。\n",
"* 循环神经网络的隐状态可以捕获直到当前时间步序列的历史信息。\n",
"* 循环神经网络模型的参数数量不会随着时间步的增加而增加。\n",
"* 我们可以使用循环神经网络创建字符级语言模型。\n",
"* 我们可以使用困惑度来评价语言模型的质量。\n",
"\n",
"## 练习\n",
"\n",
"1. 如果我们使用循环神经网络来预测文本序列中的下一个字符,那么任意输出所需的维度是多少?\n",
"1. 为什么循环神经网络可以基于文本序列中所有先前的词元,在某个时间步表示当前词元的条件概率?\n",
"1. 如果基于一个长序列进行反向传播,梯度会发生什么状况?\n",
"1. 与本节中描述的语言模型相关的问题有哪些?\n"
]
},
{
"cell_type": "markdown",
"id": "d4ae34aa",
"metadata": {
"origin_pos": 11,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/2100)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
File diff suppressed because it is too large
@@ -0,0 +1,454 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fdaad6d5",
"metadata": {
"origin_pos": 0
},
"source": [
"# 文本预处理\n",
":label:`sec_text_preprocessing`\n",
"\n",
"对于序列数据处理问题,我们在 :numref:`sec_sequence`中\n",
"评估了所需的统计工具和预测时面临的挑战。\n",
"这样的数据存在许多种形式,文本是最常见例子之一。\n",
"例如,一篇文章可以被简单地看作一串单词序列,甚至是一串字符序列。\n",
"本节中,我们将解析文本的常见预处理步骤。\n",
"这些步骤通常包括:\n",
"\n",
"1. 将文本作为字符串加载到内存中。\n",
"1. 将字符串拆分为词元(如单词和字符)。\n",
"1. 建立一个词表,将拆分的词元映射到数字索引。\n",
"1. 将文本转换为数字索引序列,方便模型操作。\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "bb8907ca",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:24.243885Z",
"iopub.status.busy": "2023-08-18T07:02:24.243343Z",
"iopub.status.idle": "2023-08-18T07:02:26.213654Z",
"shell.execute_reply": "2023-08-18T07:02:26.212745Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import collections\n",
"import re\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "markdown",
"id": "e987bf4c",
"metadata": {
"origin_pos": 5
},
"source": [
"## 读取数据集\n",
"\n",
"首先,我们从H.G.Well的[时光机器](https://www.gutenberg.org/ebooks/35)中加载文本。\n",
"这是一个相当小的语料库,只有30000多个单词,但足够我们小试牛刀,\n",
"而现实中的文档集合可能会包含数十亿个单词。\n",
"下面的函数(**将数据集读取到由多条文本行组成的列表中**),其中每条文本行都是一个字符串。\n",
"为简单起见,我们在这里忽略了标点符号和字母大写。\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ac0f9f0d",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:26.218338Z",
"iopub.status.busy": "2023-08-18T07:02:26.217685Z",
"iopub.status.idle": "2023-08-18T07:02:26.304928Z",
"shell.execute_reply": "2023-08-18T07:02:26.304151Z"
},
"origin_pos": 6,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...\n",
"# 文本总行数: 3221\n",
"the time machine by h g wells\n",
"twinkled and his usually pale face was flushed and animated the\n"
]
}
],
"source": [
"#@save\n",
"d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',\n",
" '090b5e7e70c295757f55df93cb0a180b9691891a')\n",
"\n",
"def read_time_machine(): #@save\n",
" \"\"\"将时间机器数据集加载到文本行的列表中\"\"\"\n",
" with open(d2l.download('time_machine'), 'r') as f:\n",
" lines = f.readlines()\n",
" return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]\n",
"\n",
"lines = read_time_machine()\n",
"print(f'# 文本总行数: {len(lines)}')\n",
"print(lines[0])\n",
"print(lines[10])"
]
},
{
"cell_type": "markdown",
"id": "c34664d1",
"metadata": {
"origin_pos": 7
},
"source": [
"## 词元化\n",
"\n",
"下面的`tokenize`函数将文本行列表(`lines`)作为输入,\n",
"列表中的每个元素是一个文本序列(如一条文本行)。\n",
"[**每个文本序列又被拆分成一个词元列表**],*词元*(token)是文本的基本单位。\n",
"最后,返回一个由词元列表组成的列表,其中的每个词元都是一个字符串(string)。\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "afd6a9df",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:26.308604Z",
"iopub.status.busy": "2023-08-18T07:02:26.308048Z",
"iopub.status.idle": "2023-08-18T07:02:26.317083Z",
"shell.execute_reply": "2023-08-18T07:02:26.316264Z"
},
"origin_pos": 8,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['the', 'time', 'machine', 'by', 'h', 'g', 'wells']\n",
"[]\n",
"[]\n",
"[]\n",
"[]\n",
"['i']\n",
"[]\n",
"[]\n",
"['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']\n",
"['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']\n",
"['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']\n"
]
}
],
"source": [
"def tokenize(lines, token='word'): #@save\n",
" \"\"\"将文本行拆分为单词或字符词元\"\"\"\n",
" if token == 'word':\n",
" return [line.split() for line in lines]\n",
" elif token == 'char':\n",
" return [list(line) for line in lines]\n",
" else:\n",
" print('错误:未知词元类型:' + token)\n",
"\n",
"tokens = tokenize(lines)\n",
"for i in range(11):\n",
" print(tokens[i])"
]
},
{
"cell_type": "markdown",
"id": "e61c06e8",
"metadata": {
"origin_pos": 9
},
"source": [
"## 词表\n",
"\n",
"词元的类型是字符串,而模型需要的输入是数字,因此这种类型不方便模型使用。\n",
"现在,让我们[**构建一个字典,通常也叫做*词表*vocabulary),\n",
"用来将字符串类型的词元映射到从$0$开始的数字索引中**]。\n",
"我们先将训练集中的所有文档合并在一起,对它们的唯一词元进行统计,\n",
"得到的统计结果称之为*语料*corpus)。\n",
"然后根据每个唯一词元的出现频率,为其分配一个数字索引。\n",
"很少出现的词元通常被移除,这可以降低复杂性。\n",
"另外,语料库中不存在或已删除的任何词元都将映射到一个特定的未知词元“&lt;unk&gt;”。\n",
"我们可以选择增加一个列表,用于保存那些被保留的词元,\n",
"例如:填充词元(“&lt;pad&gt;”);\n",
"序列开始词元(“&lt;bos&gt;”);\n",
"序列结束词元(“&lt;eos&gt;”)。\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "16db7dad",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:26.320587Z",
"iopub.status.busy": "2023-08-18T07:02:26.320050Z",
"iopub.status.idle": "2023-08-18T07:02:26.330519Z",
"shell.execute_reply": "2023-08-18T07:02:26.329736Z"
},
"origin_pos": 10,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"class Vocab: #@save\n",
" \"\"\"文本词表\"\"\"\n",
" def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):\n",
" if tokens is None:\n",
" tokens = []\n",
" if reserved_tokens is None:\n",
" reserved_tokens = []\n",
" # 按出现频率排序\n",
" counter = count_corpus(tokens)\n",
" self._token_freqs = sorted(counter.items(), key=lambda x: x[1],\n",
" reverse=True)\n",
" # 未知词元的索引为0\n",
" self.idx_to_token = ['<unk>'] + reserved_tokens\n",
" self.token_to_idx = {token: idx\n",
" for idx, token in enumerate(self.idx_to_token)}\n",
" for token, freq in self._token_freqs:\n",
" if freq < min_freq:\n",
" break\n",
" if token not in self.token_to_idx:\n",
" self.idx_to_token.append(token)\n",
" self.token_to_idx[token] = len(self.idx_to_token) - 1\n",
"\n",
" def __len__(self):\n",
" return len(self.idx_to_token)\n",
"\n",
" def __getitem__(self, tokens):\n",
" if not isinstance(tokens, (list, tuple)):\n",
" return self.token_to_idx.get(tokens, self.unk)\n",
" return [self.__getitem__(token) for token in tokens]\n",
"\n",
" def to_tokens(self, indices):\n",
" if not isinstance(indices, (list, tuple)):\n",
" return self.idx_to_token[indices]\n",
" return [self.idx_to_token[index] for index in indices]\n",
"\n",
" @property\n",
" def unk(self): # 未知词元的索引为0\n",
" return 0\n",
"\n",
" @property\n",
" def token_freqs(self):\n",
" return self._token_freqs\n",
"\n",
"def count_corpus(tokens): #@save\n",
" \"\"\"统计词元的频率\"\"\"\n",
" # 这里的tokens是1D列表或2D列表\n",
" if len(tokens) == 0 or isinstance(tokens[0], list):\n",
" # 将词元列表展平成一个列表\n",
" tokens = [token for line in tokens for token in line]\n",
" return collections.Counter(tokens)"
]
},
{
"cell_type": "markdown",
"id": "d7fde2e0",
"metadata": {
"origin_pos": 11
},
"source": [
"我们首先使用时光机器数据集作为语料库来[**构建词表**],然后打印前几个高频词元及其索引。\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "1501d478",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:26.333942Z",
"iopub.status.busy": "2023-08-18T07:02:26.333382Z",
"iopub.status.idle": "2023-08-18T07:02:26.346927Z",
"shell.execute_reply": "2023-08-18T07:02:26.346182Z"
},
"origin_pos": 12,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('<unk>', 0), ('the', 1), ('i', 2), ('and', 3), ('of', 4), ('a', 5), ('to', 6), ('was', 7), ('in', 8), ('that', 9)]\n"
]
}
],
"source": [
"vocab = Vocab(tokens)\n",
"print(list(vocab.token_to_idx.items())[:10])"
]
},
{
"cell_type": "markdown",
"id": "ce7b78a3",
"metadata": {
"origin_pos": 13
},
"source": [
"现在,我们可以(**将每一条文本行转换成一个数字索引列表**)。\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f0244f09",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:26.350343Z",
"iopub.status.busy": "2023-08-18T07:02:26.349779Z",
"iopub.status.idle": "2023-08-18T07:02:26.354215Z",
"shell.execute_reply": "2023-08-18T07:02:26.353468Z"
},
"origin_pos": 14,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"文本: ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']\n",
"索引: [1, 19, 50, 40, 2183, 2184, 400]\n",
"文本: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']\n",
"索引: [2186, 3, 25, 1044, 362, 113, 7, 1421, 3, 1045, 1]\n"
]
}
],
"source": [
"for i in [0, 10]:\n",
" print('文本:', tokens[i])\n",
" print('索引:', vocab[tokens[i]])"
]
},
{
"cell_type": "markdown",
"id": "e84c1a2a",
"metadata": {
"origin_pos": 15
},
"source": [
"## 整合所有功能\n",
"\n",
"在使用上述函数时,我们[**将所有功能打包到`load_corpus_time_machine`函数中**]\n",
"该函数返回`corpus`(词元索引列表)和`vocab`(时光机器语料库的词表)。\n",
"我们在这里所做的改变是:\n",
"\n",
"1. 为了简化后面章节中的训练,我们使用字符(而不是单词)实现文本词元化;\n",
"1. 时光机器数据集中的每个文本行不一定是一个句子或一个段落,还可能是一个单词,因此返回的`corpus`仅处理为单个列表,而不是使用多词元列表构成的一个列表。\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "578ed76f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:26.357414Z",
"iopub.status.busy": "2023-08-18T07:02:26.357141Z",
"iopub.status.idle": "2023-08-18T07:02:26.470812Z",
"shell.execute_reply": "2023-08-18T07:02:26.470008Z"
},
"origin_pos": 16,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"(170580, 28)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def load_corpus_time_machine(max_tokens=-1): #@save\n",
" \"\"\"返回时光机器数据集的词元索引列表和词表\"\"\"\n",
" lines = read_time_machine()\n",
" tokens = tokenize(lines, 'char')\n",
" vocab = Vocab(tokens)\n",
" # 因为时光机器数据集中的每个文本行不一定是一个句子或一个段落,\n",
" # 所以将所有文本行展平到一个列表中\n",
" corpus = [vocab[token] for line in tokens for token in line]\n",
" if max_tokens > 0:\n",
" corpus = corpus[:max_tokens]\n",
" return corpus, vocab\n",
"\n",
"corpus, vocab = load_corpus_time_machine()\n",
"len(corpus), len(vocab)"
]
},
{
"cell_type": "markdown",
"id": "28620a4d",
"metadata": {
"origin_pos": 17
},
"source": [
"## 小结\n",
"\n",
"* 文本是序列数据的一种最常见的形式之一。\n",
"* 为了对文本进行预处理,我们通常将文本拆分为词元,构建词表将词元字符串映射为数字索引,并将文本数据转换为词元索引以供模型操作。\n",
"\n",
"## 练习\n",
"\n",
"1. 词元化是一个关键的预处理步骤,它因语言而异。尝试找到另外三种常用的词元化文本的方法。\n",
"1. 在本节的实验中,将文本词元为单词和更改`Vocab`实例的`min_freq`参数。这对词表大小有何影响?\n"
]
},
{
"cell_type": "markdown",
"id": "17f3b26f",
"metadata": {
"origin_pos": 19,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/2094)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}