This commit is contained in:
2025-12-16 09:23:53 +08:00
parent 19138d3cc1
commit 9e7efd0626
409 changed files with 272713 additions and 241 deletions
@@ -0,0 +1,190 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "06969ee4",
"metadata": {
"origin_pos": 0
},
"source": [
"# Beam Search\n",
":label:`sec_beam-search`\n",
"\n",
"In :numref:`sec_seq2seq`, we predicted the output sequence token by token\n",
"until the special end-of-sequence token “<eos>” was predicted.\n",
"This section begins by introducing the *greedy search* strategy\n",
"and exploring its problems, then compares it with alternative strategies:\n",
"*exhaustive search* and *beam search*.\n",
"\n",
"Before a formal introduction to greedy search, let us define the search problem\n",
"using the same mathematical notation as in :numref:`sec_seq2seq`.\n",
"At any time step $t'$, the probability of the decoder output $y_{t'}$ depends on\n",
"the output subsequence $y_1, \\ldots, y_{t'-1}$ before time step $t'$\n",
"and the context variable $\\mathbf{c}$ that encodes the information of the input sequence.\n",
"To quantify the computational cost, denote by $\\mathcal{Y}$ the output vocabulary,\n",
"which contains “<eos>”,\n",
"so the cardinality $\\left|\\mathcal{Y}\\right|$ of this set is the vocabulary size.\n",
"Let us also specify $T'$ as the maximum number of tokens of an output sequence.\n",
"Our goal is thus to search for an ideal output from all\n",
"$\\mathcal{O}(\\left|\\mathcal{Y}\\right|^{T'})$ possible output sequences.\n",
"Of course, for all these output sequences, the portion after “<eos>”\n",
"will be discarded in the actual output.\n",
"\n",
"## Greedy Search\n",
"\n",
"First, let us look at a simple strategy: *greedy search*,\n",
"which was used for sequence prediction in :numref:`sec_seq2seq`.\n",
"At each time step $t'$ of the output sequence,\n",
"greedy search selects the token with the highest conditional probability from $\\mathcal{Y}$, i.e.,\n",
"\n",
"$$y_{t'} = \\operatorname*{argmax}_{y \\in \\mathcal{Y}} P(y \\mid y_1, \\ldots, y_{t'-1}, \\mathbf{c}).$$\n",
"\n",
"Once the output sequence contains “<eos>” or reaches its maximum length $T'$, the output is complete.\n",
"\n",
"![At each time step, greedy search selects the token with the highest conditional probability](../img/s2s-prob1.svg)\n",
":label:`fig_s2s-prob1`\n",
"\n",
"As shown in :numref:`fig_s2s-prob1`,\n",
"suppose that there are four tokens “A”, “B”, “C”, and “<eos>” in the output vocabulary.\n",
"The four numbers under each time step represent the conditional probabilities\n",
"of generating “A”, “B”, “C”, and “<eos>” at that time step.\n",
"At each time step, greedy search selects the token with the highest conditional probability.\n",
"Therefore, the output sequence “A”, “B”, “C”, and “<eos>” will be predicted in :numref:`fig_s2s-prob1`.\n",
"The conditional probability of this output sequence is\n",
"$0.5\\times0.4\\times0.4\\times0.6 = 0.048$.\n",
"\n",
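"Assuming a toy model given as a function that maps an output prefix to a conditional distribution over the vocabulary (a hypothetical interface chosen for illustration, not an API from this book), greedy search can be sketched in a few lines:\n",
"\n",
"```python\n",
"# A minimal sketch of greedy search. `cond_prob` is a hypothetical stand-in\n",
"# for a trained decoder: given the output prefix, it returns a dict mapping\n",
"# each token in the vocabulary to its conditional probability.\n",
"def greedy_search(cond_prob, max_len, eos='<eos>'):\n",
"    prefix = []\n",
"    for _ in range(max_len):\n",
"        probs = cond_prob(tuple(prefix))\n",
"        token = max(probs, key=probs.get)  # argmax over the vocabulary\n",
"        prefix.append(token)\n",
"        if token == eos:  # stop once the end-of-sequence token is emitted\n",
"            break\n",
"    return prefix\n",
"\n",
"# A toy table of conditional distributions. The probability of the greedily\n",
"# chosen token at each step matches :numref:`fig_s2s-prob1` (0.5, 0.4, 0.4,\n",
"# 0.6); the remaining entries are made-up assumptions for illustration.\n",
"table = {\n",
"    (): {'A': 0.5, 'B': 0.2, 'C': 0.2, '<eos>': 0.1},\n",
"    ('A',): {'A': 0.1, 'B': 0.4, 'C': 0.3, '<eos>': 0.2},\n",
"    ('A', 'B'): {'A': 0.2, 'B': 0.2, 'C': 0.4, '<eos>': 0.2},\n",
"    ('A', 'B', 'C'): {'A': 0.1, 'B': 0.2, 'C': 0.1, '<eos>': 0.6},\n",
"}\n",
"greedy_search(table.__getitem__, max_len=4)  # ['A', 'B', 'C', '<eos>']\n",
"```\n",
"\n",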
"So what can go wrong with greedy search?\n",
"In fact, the *optimal sequence* should be the output sequence that maximizes\n",
"$\\prod_{t'=1}^{T'} P(y_{t'} \\mid y_1, \\ldots, y_{t'-1}, \\mathbf{c})$,\n",
"which is the conditional probability of generating an output sequence based on the input sequence.\n",
"Unfortunately, there is no guarantee that the optimal sequence will be obtained by greedy search.\n",
"\n",
"![At time step 2, the token “C” with the second highest conditional probability is selected (rather than the one with the highest)](../img/s2s-prob2.svg)\n",
":label:`fig_s2s-prob2`\n",
"\n",
"Let us illustrate this with the example in :numref:`fig_s2s-prob2`.\n",
"Unlike in :numref:`fig_s2s-prob1`, at time step $2$\n",
"we select the token “C” in :numref:`fig_s2s-prob2`,\n",
"which has the *second* highest conditional probability.\n",
"Since the output subsequence at time steps $1$ and $2$, on which time step $3$ is based,\n",
"has changed from “A” and “B” in :numref:`fig_s2s-prob1`\n",
"to “A” and “C” in :numref:`fig_s2s-prob2`,\n",
"the conditional probability of each token at time step $3$ has also changed in :numref:`fig_s2s-prob2`.\n",
"Suppose that we select the token “B” at time step $3$.\n",
"Now time step $4$ is conditional on the output subsequence of the first three time steps,\n",
"“A”, “C”, and “B”, which is different from “A”, “B”, and “C” in :numref:`fig_s2s-prob1`.\n",
"Therefore, the conditional probability of generating each token at time step $4$\n",
"in :numref:`fig_s2s-prob2` is also different from that in :numref:`fig_s2s-prob1`.\n",
"As a result, the conditional probability of the output sequence\n",
"“A”, “C”, “B”, and “<eos>” in :numref:`fig_s2s-prob2` is\n",
"$0.5\\times0.3 \\times0.6\\times0.6=0.054$,\n",
"which is greater than that of greedy search in :numref:`fig_s2s-prob1`.\n",
"This example illustrates that the output sequence\n",
"“A”, “B”, “C”, and “<eos>” obtained by greedy search\n",
"is not necessarily the optimal sequence.\n",
"\n",
"## Exhaustive Search\n",
"\n",
"If the goal is to obtain the optimal sequence,\n",
"we may consider using *exhaustive search*:\n",
"exhaustively enumerate all the possible output sequences with their conditional probabilities,\n",
"and then output the one with the highest conditional probability.\n",
"\n",
"Although we can use exhaustive search to obtain the optimal sequence,\n",
"its computational cost of $\\mathcal{O}(\\left|\\mathcal{Y}\\right|^{T'})$ is likely to be prohibitively high.\n",
"For example, when $|\\mathcal{Y}|=10000$ and $T'=10$,\n",
"we would need to evaluate $10000^{10} = 10^{40}$ sequences,\n",
"which is an enormous number that is next to impossible for any existing computer.\n",
"On the other hand, the computational cost of greedy search is\n",
"$\\mathcal{O}(\\left|\\mathcal{Y}\\right|T')$:\n",
"it is significantly smaller than that of exhaustive search.\n",
"For example, when $|\\mathcal{Y}|=10000$ and $T'=10$,\n",
"we only need to evaluate $10000\\times10=10^5$ sequences.\n",
"\n",
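"The counts above are easy to check directly (Python integers have arbitrary precision, so the huge number is exact):\n",
"\n",
"```python\n",
"# Candidate sequences evaluated for |Y| = 10000 and T' = 10.\n",
"vocab_size, max_len = 10000, 10\n",
"exhaustive = vocab_size ** max_len  # all |Y|^T' possible sequences\n",
"greedy = vocab_size * max_len       # one argmax over |Y| per time step\n",
"exhaustive == 10 ** 40, greedy == 10 ** 5  # (True, True)\n",
"```\n",
"\n",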
"## Beam Search\n",
"\n",
"So which sequence search strategy should we choose?\n",
"If accuracy matters most, then obviously exhaustive search.\n",
"If computational cost matters most, then obviously greedy search.\n",
"A real-world application of beam search lies between these two extremes.\n",
"\n",
"*Beam search* is an improved version of greedy search.\n",
"It has a hyperparameter named *beam size*, $k$.\n",
"At time step $1$, we select the $k$ tokens with the highest conditional probabilities.\n",
"Each of them will be the first token of $k$ candidate output sequences, respectively.\n",
"At each subsequent time step, based on the $k$ candidate output sequences of the previous time step,\n",
"we continue to select the $k$ candidate output sequences with the highest conditional probabilities\n",
"from the $k\\left|\\mathcal{Y}\\right|$ possible choices.\n",
"\n",
"![The process of beam search (beam size: 2; maximum length of an output sequence: 3). The candidate output sequences are $A$, $C$, $AB$, $CE$, $ABD$, and $CED$](../img/beam-search.svg)\n",
":label:`fig_beam-search`\n",
"\n",
":numref:`fig_beam-search` demonstrates the process of beam search with an example.\n",
"Suppose that the output vocabulary contains only five elements,\n",
"$\\mathcal{Y} = \\{A, B, C, D, E\\}$,\n",
"where one of them is “<eos>”.\n",
"Let the beam size be $2$ and the maximum length of an output sequence be $3$.\n",
"At time step $1$, suppose that the tokens with the highest conditional probabilities\n",
"$P(y_1 \\mid \\mathbf{c})$ are $A$ and $C$.\n",
"At time step $2$, for all $y_2 \\in \\mathcal{Y}$, we compute\n",
"\n",
"$$\\begin{aligned}P(A, y_2 \\mid \\mathbf{c}) = P(A \\mid \\mathbf{c})P(y_2 \\mid A, \\mathbf{c}),\\\\ P(C, y_2 \\mid \\mathbf{c}) = P(C \\mid \\mathbf{c})P(y_2 \\mid C, \\mathbf{c}),\\end{aligned}$$\n",
"\n",
"and pick the largest two of these ten values,\n",
"say $P(A, B \\mid \\mathbf{c})$ and $P(C, E \\mid \\mathbf{c})$.\n",
"Then at time step $3$, for all $y_3 \\in \\mathcal{Y}$, we compute\n",
"\n",
"$$\\begin{aligned}P(A, B, y_3 \\mid \\mathbf{c}) = P(A, B \\mid \\mathbf{c})P(y_3 \\mid A, B, \\mathbf{c}),\\\\P(C, E, y_3 \\mid \\mathbf{c}) = P(C, E \\mid \\mathbf{c})P(y_3 \\mid C, E, \\mathbf{c}),\\end{aligned}$$\n",
"\n",
"and again pick the largest two of these ten values,\n",
"say $P(A, B, D \\mid \\mathbf{c})$ and $P(C, E, D \\mid \\mathbf{c})$.\n",
"As a result, we get six candidate output sequences:\n",
"(1) $A$; (2) $C$; (3) $A$, $B$; (4) $C$, $E$; (5) $A$, $B$, $D$; and (6) $C$, $E$, $D$.\n",
"\n",
"In the end, we obtain the set of final candidate output sequences\n",
"based on these six sequences (e.g., discarding the portion including and after “<eos>”).\n",
"Then we choose the sequence with the highest of the following scores as the output sequence:\n",
"\n",
"$$ \\frac{1}{L^\\alpha} \\log P(y_1, \\ldots, y_{L}\\mid \\mathbf{c}) = \\frac{1}{L^\\alpha} \\sum_{t'=1}^L \\log P(y_{t'} \\mid y_1, \\ldots, y_{t'-1}, \\mathbf{c}),$$\n",
":eqlabel:`eq_beam-search-score`\n",
"\n",
"where $L$ is the length of the final candidate sequence\n",
"and $\\alpha$ is usually set to $0.75$.\n",
"Since a longer sequence has more logarithmic terms in the summation\n",
"of :eqref:`eq_beam-search-score`,\n",
"the term $L^\\alpha$ in the denominator penalizes long sequences.\n",
"\n",
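"Putting these steps together, the following is a compact sketch of beam search over a toy model. As before, `cond_prob` is a hypothetical stand-in for a trained decoder that maps an output prefix to a distribution over the vocabulary; for brevity the sketch keeps every intermediate prefix as a candidate and omits “<eos>” handling:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def beam_search(cond_prob, k, max_len, alpha=0.75):\n",
"    beams = [(0.0, ())]  # (log-probability, prefix) pairs\n",
"    candidates = []      # every prefix kept along the way is a candidate\n",
"    for _ in range(max_len):\n",
"        expansions = []\n",
"        for logp, prefix in beams:\n",
"            for token, p in cond_prob(prefix).items():\n",
"                expansions.append((logp + math.log(p), prefix + (token,)))\n",
"        beams = sorted(expansions, reverse=True)[:k]  # best k of k|Y| choices\n",
"        candidates.extend(beams)\n",
"    # Pick the candidate with the highest length-normalized score from\n",
"    # :eqref:`eq_beam-search-score`.\n",
"    return max(candidates, key=lambda c: c[0] / len(c[1]) ** alpha)[1]\n",
"\n",
"def toy_cond_prob(prefix):\n",
"    # A made-up model (assumed for illustration): a peaked first-step\n",
"    # distribution, after which D is strongly preferred.\n",
"    if not prefix:\n",
"        return {'A': 0.5, 'C': 0.4, 'B': 0.06, 'D': 0.03, 'E': 0.01}\n",
"    return {'D': 0.9, 'A': 0.025, 'B': 0.025, 'C': 0.025, 'E': 0.025}\n",
"\n",
"beam_search(toy_cond_prob, k=2, max_len=3)  # ('A', 'D', 'D')\n",
"```\n",
"\n",
"With $k=2$ and a maximum length of $3$, the candidates collected by this sketch mirror the structure in :numref:`fig_beam-search`: two sequences of length one, two of length two, and two of length three.\n",
"\n",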
"The computational cost of beam search is $\\mathcal{O}(k\\left|\\mathcal{Y}\\right|T')$,\n",
"which is between that of greedy search and that of exhaustive search.\n",
"In fact, greedy search can be treated as a special type of beam search with a beam size of $1$.\n",
"With a flexible choice of the beam size, beam search provides a tradeoff between accuracy and computational cost.\n",
"\n",
"## Summary\n",
"\n",
"* Sequence search strategies include greedy search, exhaustive search, and beam search.\n",
"* Greedy search selects a sequence with the lowest computational cost, but with relatively low accuracy.\n",
"* Exhaustive search selects the sequence with the highest accuracy, but with the highest computational cost.\n",
"* Beam search, via its flexible choice of the beam size, provides a tradeoff between accuracy and computational cost.\n",
"\n",
"## Exercises\n",
"\n",
"1. Can we treat exhaustive search as a special type of beam search? Why or why not?\n",
"1. Apply beam search in the machine translation problem in :numref:`sec_seq2seq`.\n",
"   How does the beam size affect the prediction speed and the results?\n",
"1. In :numref:`sec_rnn_scratch`, we used a language model to generate text following user-provided prefixes.\n",
"   Which kind of search strategy does that example use? Can you improve it?\n",
"\n",
"[Discussions](https://discuss.d2l.ai/t/5768)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,218 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "962e28eb",
"metadata": {
"origin_pos": 0
},
"source": [
"# Encoder-Decoder Architecture\n",
":label:`sec_encoder-decoder`\n",
"\n",
"As we discussed in :numref:`sec_machine_translation`,\n",
"machine translation is a major problem domain for sequence transduction models,\n",
"whose input and output are both variable-length sequences.\n",
"To handle this type of input and output,\n",
"we can design an architecture with two major components.\n",
"The first component is an *encoder*:\n",
"it takes a variable-length sequence as input\n",
"and transforms it into an encoded state with a fixed shape.\n",
"The second component is a *decoder*:\n",
"it maps the encoded state of a fixed shape to a variable-length sequence.\n",
"This is called an *encoder-decoder* architecture,\n",
"which is depicted in :numref:`fig_encoder_decoder`.\n",
"\n",
"![The encoder-decoder architecture](../img/encoder-decoder.svg)\n",
":label:`fig_encoder_decoder`\n",
"\n",
"Let us take machine translation from English to French as an example.\n",
"Given an English input sequence “They”, “are”, “watching”, “.”,\n",
"this encoder-decoder architecture first encodes the variable-length input into a state,\n",
"then decodes the state to generate the translated sequence,\n",
"token by token, as output:\n",
"“Ils”, “regardent”, “.”.\n",
"Since the encoder-decoder architecture forms the basis of the different\n",
"sequence transduction models in subsequent sections,\n",
"this section converts this architecture into an interface that later code will implement.\n",
"\n",
"## (**Encoder**)\n",
"\n",
"In the encoder interface, we just specify that the encoder takes variable-length sequences as input `X`.\n",
"The implementation will be provided by any model that inherits this base `Encoder` class.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "17f77c60",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:05:48.406295Z",
"iopub.status.busy": "2023-08-18T07:05:48.405469Z",
"iopub.status.idle": "2023-08-18T07:05:49.653322Z",
"shell.execute_reply": "2023-08-18T07:05:49.651979Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"from torch import nn\n",
"\n",
"\n",
"#@save\n",
"class Encoder(nn.Module):\n",
"    \"\"\"The base encoder interface for the encoder-decoder architecture\"\"\"\n",
" def __init__(self, **kwargs):\n",
" super(Encoder, self).__init__(**kwargs)\n",
"\n",
" def forward(self, X, *args):\n",
" raise NotImplementedError"
]
},
{
"cell_type": "markdown",
"id": "de7f0caf",
"metadata": {
"origin_pos": 5
},
"source": [
"## [**Decoder**]\n",
"\n",
"In the following decoder interface, we add an `init_state` function\n",
"to convert the encoder output (`enc_outputs`) into the encoded state.\n",
"Note that this step may require extra inputs, such as the valid length of the input,\n",
"which was explained in :numref:`subsec_mt_data_loading`.\n",
"To generate a variable-length sequence token by token,\n",
"at each time step the decoder maps an input\n",
"(e.g., the token generated at the previous time step) and the encoded state\n",
"into an output token at the current time step.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5c7a6471",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:05:49.659889Z",
"iopub.status.busy": "2023-08-18T07:05:49.659020Z",
"iopub.status.idle": "2023-08-18T07:05:49.666360Z",
"shell.execute_reply": "2023-08-18T07:05:49.665230Z"
},
"origin_pos": 7,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"#@save\n",
"class Decoder(nn.Module):\n",
"    \"\"\"The base decoder interface for the encoder-decoder architecture\"\"\"\n",
" def __init__(self, **kwargs):\n",
" super(Decoder, self).__init__(**kwargs)\n",
"\n",
" def init_state(self, enc_outputs, *args):\n",
" raise NotImplementedError\n",
"\n",
" def forward(self, X, state):\n",
" raise NotImplementedError"
]
},
{
"cell_type": "markdown",
"id": "6e0548de",
"metadata": {
"origin_pos": 10
},
"source": [
"## [**Putting the Encoder and Decoder Together**]\n",
"\n",
"In the end, the encoder-decoder architecture contains both an encoder and a decoder,\n",
"with optional extra arguments.\n",
"In the forward propagation, the output of the encoder is used to produce the encoded state,\n",
"and this state will be further used by the decoder as one of its inputs.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "53fb0929",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:05:49.671685Z",
"iopub.status.busy": "2023-08-18T07:05:49.670944Z",
"iopub.status.idle": "2023-08-18T07:05:49.678831Z",
"shell.execute_reply": "2023-08-18T07:05:49.677718Z"
},
"origin_pos": 12,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"#@save\n",
"class EncoderDecoder(nn.Module):\n",
"    \"\"\"The base class for the encoder-decoder architecture\"\"\"\n",
" def __init__(self, encoder, decoder, **kwargs):\n",
" super(EncoderDecoder, self).__init__(**kwargs)\n",
" self.encoder = encoder\n",
" self.decoder = decoder\n",
"\n",
" def forward(self, enc_X, dec_X, *args):\n",
" enc_outputs = self.encoder(enc_X, *args)\n",
" dec_state = self.decoder.init_state(enc_outputs, *args)\n",
" return self.decoder(dec_X, dec_state)"
]
},
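{
"cell_type": "markdown",
"id": "3f9a2b10",
"metadata": {},
"source": [
"To see how concrete models plug into this interface, below is a minimal, self-contained sketch. The `ToyEncoder` and `ToyDecoder` classes, their layer choices, and all sizes are hypothetical illustrations rather than implementations from this book; the base classes from the cells above are repeated so that the snippet runs on its own:\n",
"\n",
"```python\n",
"import torch\n",
"from torch import nn\n",
"\n",
"# The base interfaces, repeated from the cells above.\n",
"class Encoder(nn.Module):\n",
"    def forward(self, X, *args):\n",
"        raise NotImplementedError\n",
"\n",
"class Decoder(nn.Module):\n",
"    def init_state(self, enc_outputs, *args):\n",
"        raise NotImplementedError\n",
"\n",
"    def forward(self, X, state):\n",
"        raise NotImplementedError\n",
"\n",
"class EncoderDecoder(nn.Module):\n",
"    def __init__(self, encoder, decoder):\n",
"        super().__init__()\n",
"        self.encoder = encoder\n",
"        self.decoder = decoder\n",
"\n",
"    def forward(self, enc_X, dec_X, *args):\n",
"        enc_outputs = self.encoder(enc_X, *args)\n",
"        dec_state = self.decoder.init_state(enc_outputs, *args)\n",
"        return self.decoder(dec_X, dec_state)\n",
"\n",
"# Hypothetical concrete subclasses: the encoder averages token embeddings\n",
"# into a fixed-shape state; the decoder conditions every output step on it.\n",
"class ToyEncoder(Encoder):\n",
"    def __init__(self, vocab_size, num_hiddens):\n",
"        super().__init__()\n",
"        self.embedding = nn.Embedding(vocab_size, num_hiddens)\n",
"\n",
"    def forward(self, X, *args):\n",
"        return self.embedding(X).mean(dim=1)  # (batch_size, num_hiddens)\n",
"\n",
"class ToyDecoder(Decoder):\n",
"    def __init__(self, vocab_size, num_hiddens):\n",
"        super().__init__()\n",
"        self.embedding = nn.Embedding(vocab_size, num_hiddens)\n",
"        self.dense = nn.Linear(2 * num_hiddens, vocab_size)\n",
"\n",
"    def init_state(self, enc_outputs, *args):\n",
"        return enc_outputs\n",
"\n",
"    def forward(self, X, state):\n",
"        emb = self.embedding(X)  # (batch_size, num_steps, num_hiddens)\n",
"        ctx = state.unsqueeze(1).expand(-1, emb.shape[1], -1)\n",
"        return self.dense(torch.cat((emb, ctx), dim=-1)), state\n",
"\n",
"net = EncoderDecoder(ToyEncoder(10, 8), ToyDecoder(10, 8))\n",
"enc_X = torch.zeros((4, 7), dtype=torch.long)  # batch of 4, 7 input steps\n",
"dec_X = torch.zeros((4, 5), dtype=torch.long)  # batch of 4, 5 output steps\n",
"Y, state = net(enc_X, dec_X)\n",
"Y.shape  # torch.Size([4, 5, 10])\n",
"```\n",
"\n",
"Returning both the per-step vocabulary scores and the state from the decoder lets decoding continue step by step, with the state threaded through successive calls.\n"
]
},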
{
"cell_type": "markdown",
"id": "dce5eb8e",
"metadata": {
"origin_pos": 15
},
"source": [
"The term *state* in the encoder-decoder architecture\n",
"has probably inspired you to implement this architecture using neural networks with states.\n",
"In the next section, we will see how to apply RNNs to design\n",
"sequence transduction models based on this encoder-decoder architecture.\n",
"\n",
"## Summary\n",
"\n",
"* The encoder-decoder architecture can handle inputs and outputs that are both variable-length sequences, thus it is suitable for sequence transduction problems such as machine translation.\n",
"* The encoder takes a variable-length sequence as input and transforms it into a state with a fixed shape.\n",
"* The decoder maps the encoded state of a fixed shape to a variable-length sequence.\n",
"\n",
"## Exercises\n",
"\n",
"1. Suppose that we use neural networks to implement the encoder-decoder architecture. Do the encoder and the decoder have to be the same type of neural network?\n",
"1. Besides machine translation, can you think of another application where the encoder-decoder architecture can be applied?\n"
]
},
{
"cell_type": "markdown",
"id": "99846b42",
"metadata": {
"origin_pos": 17,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/2779)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,59 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "729b9613",
"metadata": {
"origin_pos": 0
},
"source": [
"# Modern Recurrent Neural Networks\n",
":label:`chap_modern_rnn`\n",
"\n",
"The previous chapter introduced the basics of recurrent neural networks,\n",
"which can better handle sequence data.\n",
"We implemented RNN-based language models on text data,\n",
"but for the wide variety of sequence learning problems today, such techniques may not suffice.\n",
"\n",
"For instance, a notable issue with RNNs in practice is numerical instability.\n",
"Although we have applied tricks such as gradient clipping to alleviate this problem,\n",
"it can be handled further by designing more sophisticated sequence models.\n",
"Specifically, we will introduce two widely used networks,\n",
"namely *gated recurrent units* (GRUs) and\n",
"*long short-term memory* (LSTM).\n",
"Then we will expand the RNN architecture, which so far has had a single unidirectional hidden layer.\n",
"We will describe deep architectures with multiple hidden layers,\n",
"and discuss the bidirectional design with both forward and backward recurrent computations.\n",
"Such expansions are frequently adopted in modern recurrent networks.\n",
"When explaining these RNN variants,\n",
"we continue to consider the language modeling problem introduced in :numref:`chap_rnn`.\n",
"\n",
"In fact, language modeling reveals only the tip of the iceberg of sequence learning.\n",
"In a variety of sequence learning problems, such as automatic speech recognition, text to speech, and machine translation,\n",
"both inputs and outputs are sequences of arbitrary length.\n",
"To explain how to fit this type of data,\n",
"we will take machine translation as an example and introduce the RNN-based\n",
"encoder-decoder architecture and beam search, and use them to generate sequences.\n",
"\n",
":begin_tab:toc\n",
" - [gru](gru.ipynb)\n",
" - [lstm](lstm.ipynb)\n",
" - [deep-rnn](deep-rnn.ipynb)\n",
" - [bi-rnn](bi-rnn.ipynb)\n",
" - [machine-translation-and-dataset](machine-translation-and-dataset.ipynb)\n",
" - [encoder-decoder](encoder-decoder.ipynb)\n",
" - [seq2seq](seq2seq.ipynb)\n",
" - [beam-search](beam-search.ipynb)\n",
":end_tab:\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}