Files
Python/d2l/d2l-zh/pytorch/chapter_natural-language-processing-pretraining/subword-embedding.ipynb
T
2025-12-16 09:23:53 +08:00

447 lines
16 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
{
"cells": [
{
"cell_type": "markdown",
"id": "2eef48dc",
"metadata": {
"origin_pos": 0
},
"source": [
"# 子词嵌入\n",
":label:`sec_fasttext`\n",
"\n",
"在英语中,“helps”“helped”和“helping”等单词都是同一个词“help”的变形形式。“dog”和“dogs”之间的关系与“cat”和“cats”之间的关系相同,“boy”和“boyfriend”之间的关系与“girl”和“girlfriend”之间的关系相同。在法语和西班牙语等其他语言中,许多动词有40多种变形形式,而在芬兰语中,名词最多可能有15种变形。在语言学中,形态学研究单词形成和词汇关系。但是,word2vec和GloVe都没有对词的内部结构进行探讨。\n",
"\n",
"## fastText模型\n",
"\n",
"回想一下词在word2vec中是如何表示的。在跳元模型和连续词袋模型中,同一词的不同变形形式直接由不同的向量表示,不需要共享参数。为了使用形态信息,*fastText模型*提出了一种*子词嵌入*方法,其中子词是一个字符$n$-gram :cite:`Bojanowski.Grave.Joulin.ea.2017`。fastText可以被认为是子词级跳元模型,而非学习词级向量表示,其中每个*中心词*由其子词级向量之和表示。\n",
"\n",
"让我们来说明如何以单词“where”为例获得fastText中每个中心词的子词。首先,在词的开头和末尾添加特殊字符“<”和“>”,以将前缀和后缀与其他子词区分开来。\n",
"然后,从词中提取字符$n$-gram。\n",
"例如,值$n=3$时,我们将获得长度为3的所有子词:\n",
"“<wh”“whe”“her”“ere”“re>”和特殊子词“<where>”。\n",
"\n",
"在fastText中,对于任意词$w$,用$\\mathcal{G}_w$表示其长度在3和6之间的所有子词与其特殊子词的并集。词表是所有词的子词的集合。假设$\\mathbf{z}_g$是词典中的子词$g$的向量,则跳元模型中作为中心词的词$w$的向量$\\mathbf{v}_w$是其子词向量的和:\n",
"\n",
"$$\\mathbf{v}_w = \\sum_{g\\in\\mathcal{G}_w} \\mathbf{z}_g.$$\n",
"\n",
"fastText的其余部分与跳元模型相同。与跳元模型相比,fastText的词量更大,模型参数也更多。此外,为了计算一个词的表示,它的所有子词向量都必须求和,这导致了更高的计算复杂度。然而,由于具有相似结构的词之间共享来自子词的参数,罕见词甚至词表外的词在fastText中可能获得更好的向量表示。\n",
"\n",
"## 字节对编码(Byte Pair Encoding\n",
":label:`subsec_Byte_Pair_Encoding`\n",
"\n",
"在fastText中,所有提取的子词都必须是指定的长度,例如$3$到$6$,因此词表大小不能预定义。为了在固定大小的词表中允许可变长度的子词,我们可以应用一种称为*字节对编码*(Byte Pair EncodingBPE)的压缩算法来提取子词 :cite:`Sennrich.Haddow.Birch.2015`。\n",
"\n",
"字节对编码执行训练数据集的统计分析,以发现单词内的公共符号,诸如任意长度的连续字符。从长度为1的符号开始,字节对编码迭代地合并最频繁的连续符号对以产生新的更长的符号。请注意,为提高效率,不考虑跨越单词边界的对。最后,我们可以使用像子词这样的符号来切分单词。字节对编码及其变体已经用于诸如GPT-2 :cite:`Radford.Wu.Child.ea.2019`和RoBERTa :cite:`Liu.Ott.Goyal.ea.2019`等自然语言处理预训练模型中的输入表示。在下面,我们将说明字节对编码是如何工作的。\n",
"\n",
"首先,我们将符号词表初始化为所有英文小写字符、特殊的词尾符号`'_'`和特殊的未知符号`'[UNK]'`。\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "70df59d8",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:56:35.604170Z",
"iopub.status.busy": "2023-08-18T06:56:35.603510Z",
"iopub.status.idle": "2023-08-18T06:56:35.611979Z",
"shell.execute_reply": "2023-08-18T06:56:35.611231Z"
},
"origin_pos": 1,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import collections\n",
"\n",
"symbols = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',\n",
" 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n",
" '_', '[UNK]']"
]
},
{
"cell_type": "markdown",
"id": "f94dab27",
"metadata": {
"origin_pos": 3
},
"source": [
"因为我们不考虑跨越词边界的符号对,所以我们只需要一个字典`raw_token_freqs`将词映射到数据集中的频率(出现次数)。注意,特殊符号`'_'`被附加到每个词的尾部,以便我们可以容易地从输出符号序列(例如,“a_all er_man”)恢复单词序列(例如,“a_all er_man”)。由于我们仅从单个字符和特殊符号的词开始合并处理,所以在每个词(词典`token_freqs`的键)内的每对连续字符之间插入空格。换句话说,空格是词中符号之间的分隔符。\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6a26ec96",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:56:35.615843Z",
"iopub.status.busy": "2023-08-18T06:56:35.615201Z",
"iopub.status.idle": "2023-08-18T06:56:35.623942Z",
"shell.execute_reply": "2023-08-18T06:56:35.623209Z"
},
"origin_pos": 4,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"{'f a s t _': 4, 'f a s t e r _': 3, 't a l l _': 5, 't a l l e r _': 4}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw_token_freqs = {'fast_': 4, 'faster_': 3, 'tall_': 5, 'taller_': 4}\n",
"token_freqs = {}\n",
"for token, freq in raw_token_freqs.items():\n",
" token_freqs[' '.join(list(token))] = raw_token_freqs[token]\n",
"token_freqs"
]
},
{
"cell_type": "markdown",
"id": "8ee2d216",
"metadata": {
"origin_pos": 5
},
"source": [
"我们定义以下`get_max_freq_pair`函数,其返回词内最频繁的连续符号对,其中词来自输入词典`token_freqs`的键。\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "874de73a",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:56:35.627616Z",
"iopub.status.busy": "2023-08-18T06:56:35.627025Z",
"iopub.status.idle": "2023-08-18T06:56:35.631950Z",
"shell.execute_reply": "2023-08-18T06:56:35.631221Z"
},
"origin_pos": 6,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def get_max_freq_pair(token_freqs):\n",
" pairs = collections.defaultdict(int)\n",
" for token, freq in token_freqs.items():\n",
" symbols = token.split()\n",
" for i in range(len(symbols) - 1):\n",
" # “pairs”的键是两个连续符号的元组\n",
" pairs[symbols[i], symbols[i + 1]] += freq\n",
" return max(pairs, key=pairs.get) # 具有最大值的“pairs”键"
]
},
{
"cell_type": "markdown",
"id": "701a4399",
"metadata": {
"origin_pos": 7
},
"source": [
"作为基于连续符号频率的贪心方法,字节对编码将使用以下`merge_symbols`函数来合并最频繁的连续符号对以产生新符号。\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "877dce88",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:56:35.635554Z",
"iopub.status.busy": "2023-08-18T06:56:35.634913Z",
"iopub.status.idle": "2023-08-18T06:56:35.639631Z",
"shell.execute_reply": "2023-08-18T06:56:35.638892Z"
},
"origin_pos": 8,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def merge_symbols(max_freq_pair, token_freqs, symbols):\n",
" symbols.append(''.join(max_freq_pair))\n",
" new_token_freqs = dict()\n",
" for token, freq in token_freqs.items():\n",
" new_token = token.replace(' '.join(max_freq_pair),\n",
" ''.join(max_freq_pair))\n",
" new_token_freqs[new_token] = token_freqs[token]\n",
" return new_token_freqs"
]
},
{
"cell_type": "markdown",
"id": "63e888f9",
"metadata": {
"origin_pos": 9
},
"source": [
"现在,我们对词典`token_freqs`的键迭代地执行字节对编码算法。在第一次迭代中,最频繁的连续符号对是`'t'`和`'a'`,因此字节对编码将它们合并以产生新符号`'ta'`。在第二次迭代中,字节对编码继续合并`'ta'`和`'l'`以产生另一个新符号`'tal'`。\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ea95bc7c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:56:35.643247Z",
"iopub.status.busy": "2023-08-18T06:56:35.642643Z",
"iopub.status.idle": "2023-08-18T06:56:35.647847Z",
"shell.execute_reply": "2023-08-18T06:56:35.647061Z"
},
"origin_pos": 10,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"合并# 1: ('t', 'a')\n",
"合并# 2: ('ta', 'l')\n",
"合并# 3: ('tal', 'l')\n",
"合并# 4: ('f', 'a')\n",
"合并# 5: ('fa', 's')\n",
"合并# 6: ('fas', 't')\n",
"合并# 7: ('e', 'r')\n",
"合并# 8: ('er', '_')\n",
"合并# 9: ('tall', '_')\n",
"合并# 10: ('fast', '_')\n"
]
}
],
"source": [
"num_merges = 10\n",
"for i in range(num_merges):\n",
" max_freq_pair = get_max_freq_pair(token_freqs)\n",
" token_freqs = merge_symbols(max_freq_pair, token_freqs, symbols)\n",
" print(f'合并# {i+1}:',max_freq_pair)"
]
},
{
"cell_type": "markdown",
"id": "4fe6d30f",
"metadata": {
"origin_pos": 11
},
"source": [
"在字节对编码的10次迭代之后,我们可以看到列表`symbols`现在又包含10个从其他符号迭代合并而来的符号。\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "14d6459f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:56:35.651408Z",
"iopub.status.busy": "2023-08-18T06:56:35.650818Z",
"iopub.status.idle": "2023-08-18T06:56:35.654893Z",
"shell.execute_reply": "2023-08-18T06:56:35.654143Z"
},
"origin_pos": 12,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '_', '[UNK]', 'ta', 'tal', 'tall', 'fa', 'fas', 'fast', 'er', 'er_', 'tall_', 'fast_']\n"
]
}
],
"source": [
"print(symbols)"
]
},
{
"cell_type": "markdown",
"id": "70283228",
"metadata": {
"origin_pos": 13
},
"source": [
"对于在词典`raw_token_freqs`的键中指定的同一数据集,作为字节对编码算法的结果,数据集中的每个词现在被子词“fast_”“fast”“er_”“tall_”和“tall”分割。例如,单词“fast er_”和“tall er_”分别被分割为“fast er_”和“tall er_”。\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "93120bf0",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:56:35.658487Z",
"iopub.status.busy": "2023-08-18T06:56:35.657897Z",
"iopub.status.idle": "2023-08-18T06:56:35.662020Z",
"shell.execute_reply": "2023-08-18T06:56:35.661268Z"
},
"origin_pos": 14,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['fast_', 'fast er_', 'tall_', 'tall er_']\n"
]
}
],
"source": [
"print(list(token_freqs.keys()))"
]
},
{
"cell_type": "markdown",
"id": "83456139",
"metadata": {
"origin_pos": 15
},
"source": [
"请注意,字节对编码的结果取决于正在使用的数据集。我们还可以使用从一个数据集学习的子词来切分另一个数据集的单词。作为一种贪心方法,下面的`segment_BPE`函数尝试将单词从输入参数`symbols`分成可能最长的子词。\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "04e84fc1",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:56:35.665538Z",
"iopub.status.busy": "2023-08-18T06:56:35.664918Z",
"iopub.status.idle": "2023-08-18T06:56:35.670601Z",
"shell.execute_reply": "2023-08-18T06:56:35.669830Z"
},
"origin_pos": 16,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def segment_BPE(tokens, symbols):\n",
" outputs = []\n",
" for token in tokens:\n",
" start, end = 0, len(token)\n",
" cur_output = []\n",
" # 具有符号中可能最长子字的词元段\n",
" while start < len(token) and start < end:\n",
" if token[start: end] in symbols:\n",
" cur_output.append(token[start: end])\n",
" start = end\n",
" end = len(token)\n",
" else:\n",
" end -= 1\n",
" if start < len(token):\n",
" cur_output.append('[UNK]')\n",
" outputs.append(' '.join(cur_output))\n",
" return outputs"
]
},
{
"cell_type": "markdown",
"id": "ce7118c8",
"metadata": {
"origin_pos": 17
},
"source": [
"我们使用列表`symbols`中的子词(从前面提到的数据集学习)来表示另一个数据集的`tokens`。\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "00e7e03a",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T06:56:35.674172Z",
"iopub.status.busy": "2023-08-18T06:56:35.673554Z",
"iopub.status.idle": "2023-08-18T06:56:35.677812Z",
"shell.execute_reply": "2023-08-18T06:56:35.677058Z"
},
"origin_pos": 18,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['tall e s t _', 'fa t t er_']\n"
]
}
],
"source": [
"tokens = ['tallest_', 'fatter_']\n",
"print(segment_BPE(tokens, symbols))"
]
},
{
"cell_type": "markdown",
"id": "b4a5e47c",
"metadata": {
"origin_pos": 19
},
"source": [
"## 小结\n",
"\n",
"* fastText模型提出了一种子词嵌入方法:基于word2vec中的跳元模型,它将中心词表示为其子词向量之和。\n",
"* 字节对编码执行训练数据集的统计分析,以发现词内的公共符号。作为一种贪心方法,字节对编码迭代地合并最频繁的连续符号对。\n",
"* 子词嵌入可以提高稀有词和词典外词的表示质量。\n",
"\n",
"## 练习\n",
"\n",
"1. 例如,英语中大约有$3\\times 10^8$种可能的$6$-元组。子词太多会有什么问题呢?如何解决这个问题?提示:请参阅fastText论文第3.2节末尾 :cite:`Bojanowski.Grave.Joulin.ea.2017`。\n",
"1. 如何在连续词袋模型的基础上设计一个子词嵌入模型?\n",
"1. 要获得大小为$m$的词表,当初始符号词表大小为$n$时,需要多少合并操作?\n",
"1. 如何扩展字节对编码的思想来提取短语?\n"
]
},
{
"cell_type": "markdown",
"id": "09bfff94",
"metadata": {
"origin_pos": 21,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/5748)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}