Python/d2l/d2l-zh/pytorch/chapter_natural-language-processing-pretraining/word-embedding-dataset.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "82f9eacb",
   "metadata": {
    "origin_pos": 0
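   },
   "source": [
    "# The Dataset for Pretraining Word Embeddings\n",
    ":label:`sec_word2vec_data`\n",
    "\n",
    "Now that we know the technical details of the word2vec models and the approximate training methods, let us walk through their implementations. Specifically, we take the skip-gram model in :numref:`sec_word2vec` and negative sampling in :numref:`sec_approx_train` as examples. This section begins with the dataset for pretraining the word embedding model: the original format of the data will be transformed into minibatches that can be iterated over during training.\n"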
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "596ed133",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:38.933299Z",
     "iopub.status.busy": "2023-08-18T07:01:38.932361Z",
     "iopub.status.idle": "2023-08-18T07:01:41.929964Z",
     "shell.execute_reply": "2023-08-18T07:01:41.928691Z"
    },
    "origin_pos": 2,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "import math\n",
    "import os\n",
    "import random\n",
    "import torch\n",
    "from d2l import torch as d2l"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8286adf0",
   "metadata": {
    "origin_pos": 4
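   },
   "source": [
    "## Reading the Dataset\n",
    "\n",
    "The dataset that we use here is [Penn Tree Bank (PTB)](https://catalog.ldc.upenn.edu/LDC99T42). This corpus is sampled from Wall Street Journal articles and is split into training, validation, and test sets. In the original format, each line of the text file represents a sentence of words separated by spaces. Here we treat each word as a token.\n"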
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cc6c9b2e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:41.935897Z",
     "iopub.status.busy": "2023-08-18T07:01:41.934975Z",
     "iopub.status.idle": "2023-08-18T07:01:42.345380Z",
     "shell.execute_reply": "2023-08-18T07:01:42.344041Z"
    },
    "origin_pos": 5,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading ../data/ptb.zip from http://d2l-data.s3-accelerate.amazonaws.com/ptb.zip...\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'# sentences: 42069'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#@save\n",
    "d2l.DATA_HUB['ptb'] = (d2l.DATA_URL + 'ptb.zip',\n",
    "                       '319d85e578af0cdc590547f26231e4e31cdf1e42')\n",
    "\n",
    "#@save\n",
    "def read_ptb():\n",
    "    \"\"\"Load the PTB dataset into a list of text lines.\"\"\"\n",
    "    data_dir = d2l.download_extract('ptb')\n",
    "    # Read the training set\n",
    "    with open(os.path.join(data_dir, 'ptb.train.txt')) as f:\n",
    "        raw_text = f.read()\n",
    "    return [line.split() for line in raw_text.split('\\n')]\n",
    "\n",
    "sentences = read_ptb()\n",
    "f'# sentences: {len(sentences)}'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7290de5",
   "metadata": {
    "origin_pos": 6
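   },
   "source": [
    "After reading the training set, we build a vocabulary for the corpus, where any word that appears fewer than 10 times is replaced by the \"&lt;unk&gt;\" token. Note that the original dataset also contains \"&lt;unk&gt;\" tokens that represent rare (unknown) words.\n"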
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "04285c2d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:42.350103Z",
     "iopub.status.busy": "2023-08-18T07:01:42.349586Z",
     "iopub.status.idle": "2023-08-18T07:01:42.520737Z",
     "shell.execute_reply": "2023-08-18T07:01:42.519523Z"
    },
    "origin_pos": 7,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'vocab size: 6719'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab = d2l.Vocab(sentences, min_freq=10)\n",
    "f'vocab size: {len(vocab)}'"
   ]
  },
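  {
   "cell_type": "markdown",
   "id": "3e9a51c7",
   "metadata": {},
   "source": [
    "The effect of `min_freq` can be illustrated on a toy corpus without `d2l`. This is only a sketch under stated assumptions: the helper `map_rare_to_unk` and the toy sentences are hypothetical, and the code does not reproduce how `d2l.Vocab` works internally; it merely mimics the replacement of rare words by `'<unk>'`.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def map_rare_to_unk(sentences, min_freq):\n",
    "    # Hypothetical helper: count every token, then replace tokens\n",
    "    # seen fewer than min_freq times with '<unk>'\n",
    "    counts = Counter(token for line in sentences for token in line)\n",
    "    return [[token if counts[token] >= min_freq else '<unk>' for token in line]\n",
    "            for line in sentences]\n",
    "\n",
    "# Made-up toy corpus, for illustration only\n",
    "toy = [['the', 'chip', 'maker'], ['the', 'chip'], ['the', 'rare', 'word']]\n",
    "print(map_rare_to_unk(toy, min_freq=2))\n",
    "```\n",
    "\n",
    "With `min_freq=2`, only the words that occur at least twice in the toy corpus survive; every other word collapses into the single `'<unk>'` token.\n"
   ]
  },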
  {
   "cell_type": "markdown",
   "id": "0bba2291",
   "metadata": {
    "origin_pos": 8
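   },
   "source": [
    "## Subsampling\n",
    "\n",
    "Text data typically contain high-frequency words such as \"the\", \"a\", and \"in\": they may even occur billions of times in very large corpora. However, these words often co-occur with many different words in context windows and therefore provide little useful signal. For instance, consider the word \"chip\" in a context window: intuitively, its co-occurrence with the low-frequency word \"intel\" is more useful in training than its co-occurrence with the high-frequency word \"a\". Moreover, training on vast numbers of (high-frequency) words is slow. Thus, when training word embedding models, high-frequency words can be *subsampled* :cite:`Mikolov.Sutskever.Chen.ea.2013`. Specifically, each word $w_i$ in the dataset is discarded with probability\n",
    "\n",
    "$$ P(w_i) = \\max\\left(1 - \\sqrt{\\frac{t}{f(w_i)}}, 0\\right),$$\n",
    "\n",
    "where $f(w_i)$ is the ratio of the number of occurrences of $w_i$ to the total number of words in the dataset, and the constant $t$ is a hyperparameter ($10^{-4}$ in the experiment). We can see that the (high-frequency) word $w_i$ can be discarded only when its relative frequency satisfies $f(w_i) > t$, and that the higher the relative frequency of the word, the greater the probability of it being discarded.\n"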
   ]
  },
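  {
   "cell_type": "markdown",
   "id": "7b20d4f6",
   "metadata": {},
   "source": [
    "As a rough numerical illustration of the formula above, the following sketch evaluates the discard probability for a few relative frequencies. The helper `discard_prob` and the example frequencies are hypothetical and chosen only for illustration; they are not part of `d2l` and are not statistics of the PTB corpus.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def discard_prob(freq, t=1e-4):\n",
    "    # Hypothetical helper: P(w_i) = max(1 - sqrt(t / f(w_i)), 0),\n",
    "    # where freq is the relative frequency f(w_i)\n",
    "    return max(1 - math.sqrt(t / freq), 0.0)\n",
    "\n",
    "# Made-up relative frequencies, for illustration only\n",
    "for freq in (1e-5, 1e-4, 1e-2):\n",
    "    print(f'f = {freq:g}: discard probability = {discard_prob(freq):.3f}')\n",
    "```\n",
    "\n",
    "Under these assumptions, a word whose relative frequency is at or below the threshold $t$ is never discarded, while a word with relative frequency $10^{-2}$ is discarded about 90% of the time.\n"
   ]
  },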
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "88d0f9c2",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:42.524901Z",
     "iopub.status.busy": "2023-08-18T07:01:42.524245Z",
     "iopub.status.idle": "2023-08-18T07:01:44.019122Z",
     "shell.execute_reply": "2023-08-18T07:01:44.017912Z"
    },
    "origin_pos": 9,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "#@save\n",
    "def subsample(sentences, vocab):\n",
    "    \"\"\"Subsample high-frequency words.\"\"\"\n",
    "    # Exclude unknown tokens ('<unk>')\n",
    "    sentences = [[token for token in line if vocab[token] != vocab.unk]\n",
    "                 for line in sentences]\n",
    "    counter = d2l.count_corpus(sentences)\n",
    "    num_tokens = sum(counter.values())\n",
    "\n",
    "    # Return True if `token` is kept during subsampling\n",
    "    def keep(token):\n",
    "        return(random.uniform(0, 1) <\n",
    "               math.sqrt(1e-4 / counter[token] * num_tokens))\n",
    "\n",
    "    return ([[token for token in line if keep(token)] for line in sentences],\n",
    "            counter)\n",
    "\n",
    "subsampled, counter = subsample(sentences, vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c892ade",
   "metadata": {
    "origin_pos": 10
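   },
   "source": [
    "The following code snippet plots the histogram of the number of tokens per sentence before and after subsampling. As expected, subsampling significantly shortens sentences by dropping high-frequency words, which will speed up training.\n"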
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5dd0b4f6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:44.024294Z",
     "iopub.status.busy": "2023-08-18T07:01:44.023765Z",
     "iopub.status.idle": "2023-08-18T07:01:44.272889Z",
     "shell.execute_reply": "2023-08-18T07:01:44.271933Z"
    },
    "origin_pos": 11,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       "  \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<svg xmlns:xlink=\"http://www.w3.org/1999/xlink\" width=\"262.190625pt\" height=\"182.053046pt\" viewBox=\"0 0 262.190625 182.053046\" xmlns=\"http://www.w3.org/2000/svg\" version=\"1.1\">\n",
       " <metadata>\n",
       "  <rdf:RDF xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n",
       "   <cc:Work>\n",
       "    <dc:type rdf:resource=\"http://purl.org/dc/dcmitype/StillImage\"/>\n",
       "    <dc:date>2023-08-18T07:01:44.224820</dc:date>\n",
       "    <dc:format>image/svg+xml</dc:format>\n",
       "    <dc:creator>\n",
       "     <cc:Agent>\n",
       "      <dc:title>Matplotlib v3.5.1, https://matplotlib.org/</dc:title>\n",
       "     </cc:Agent>\n",
       "    </dc:creator>\n",
       "   </cc:Work>\n",
       "  </rdf:RDF>\n",
       " </metadata>\n",
       " <defs>\n",
       "  <style type=\"text/css\">*{stroke-linejoin: round; stroke-linecap: butt}</style>\n",
       " </defs>\n",
       " <g id=\"figure_1\">\n",
       "  <g id=\"patch_1\">\n",
       "   <path d=\"M 0 182.053046 \n",
       "L 262.190625 182.053046 \n",
       "L 262.190625 0 \n",
       "L 0 0 \n",
       "L 0 182.053046 \n",
       "z\n",
       "\" style=\"fill: none\"/>\n",
       "  </g>\n",
       "  <g id=\"axes_1\">\n",
       "   <g id=\"patch_2\">\n",
       "    <path d=\"M 59.690625 144.496796 \n",
       "L 254.990625 144.496796 \n",
       "L 254.990625 8.596796 \n",
       "L 59.690625 8.596796 \n",
       "z\n",
       "\" style=\"fill: #ffffff\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_3\">\n",
       "    <path d=\"M 68.567898 144.496796 \n",
       "L 75.814651 144.496796 \n",
       "L 75.814651 123.324081 \n",
       "L 68.567898 123.324081 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_4\">\n",
       "    <path d=\"M 86.684781 144.496796 \n",
       "L 93.931534 144.496796 \n",
       "L 93.931534 85.795242 \n",
       "L 86.684781 85.795242 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_5\">\n",
       "    <path d=\"M 104.801664 144.496796 \n",
       "L 112.048417 144.496796 \n",
       "L 112.048417 76.680027 \n",
       "L 104.801664 76.680027 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_6\">\n",
       "    <path d=\"M 122.918547 144.496796 \n",
       "L 130.1653 144.496796 \n",
       "L 130.1653 96.491067 \n",
       "L 122.918547 96.491067 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_7\">\n",
       "    <path d=\"M 141.03543 144.496796 \n",
       "L 148.282183 144.496796 \n",
       "L 148.282183 124.48284 \n",
       "L 141.03543 124.48284 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_8\">\n",
       "    <path d=\"M 159.152313 144.496796 \n",
       "L 166.399067 144.496796 \n",
       "L 166.399067 137.6991 \n",
       "L 159.152313 137.6991 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_9\">\n",
       "    <path d=\"M 177.269196 144.496796 \n",
       "L 184.51595 144.496796 \n",
       "L 184.51595 142.958904 \n",
       "L 177.269196 142.958904 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_10\">\n",
       "    <path d=\"M 195.38608 144.496796 \n",
       "L 202.632833 144.496796 \n",
       "L 202.632833 144.106983 \n",
       "L 195.38608 144.106983 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_11\">\n",
       "    <path d=\"M 213.502963 144.496796 \n",
       "L 220.749716 144.496796 \n",
       "L 220.749716 144.363299 \n",
       "L 213.502963 144.363299 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_12\">\n",
       "    <path d=\"M 231.619846 144.496796 \n",
       "L 238.866599 144.496796 \n",
       "L 238.866599 144.422038 \n",
       "L 231.619846 144.422038 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: #1f77b4\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_13\">\n",
       "    <path d=\"M 75.814651 144.496796 \n",
       "L 83.061404 144.496796 \n",
       "L 83.061404 15.068225 \n",
       "L 75.814651 15.068225 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_14\">\n",
       "    <path d=\"M 93.931534 144.496796 \n",
       "L 101.178287 144.496796 \n",
       "L 101.178287 59.469519 \n",
       "L 93.931534 59.469519 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_15\">\n",
       "    <path d=\"M 112.048417 144.496796 \n",
       "L 119.29517 144.496796 \n",
       "L 119.29517 134.91701 \n",
       "L 112.048417 134.91701 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_16\">\n",
       "    <path d=\"M 130.1653 144.496796 \n",
       "L 137.412054 144.496796 \n",
       "L 137.412054 143.941446 \n",
       "L 130.1653 143.941446 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_17\">\n",
       "    <path d=\"M 148.282183 144.496796 \n",
       "L 155.528937 144.496796 \n",
       "L 155.528937 144.448737 \n",
       "L 148.282183 144.448737 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_18\">\n",
       "    <path d=\"M 166.399067 144.496796 \n",
       "L 173.64582 144.496796 \n",
       "L 173.64582 144.491456 \n",
       "L 166.399067 144.491456 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_19\">\n",
       "    <path d=\"M 184.51595 144.496796 \n",
       "L 191.762703 144.496796 \n",
       "L 191.762703 144.496796 \n",
       "L 184.51595 144.496796 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_20\">\n",
       "    <path d=\"M 202.632833 144.496796 \n",
       "L 209.879586 144.496796 \n",
       "L 209.879586 144.496796 \n",
       "L 202.632833 144.496796 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_21\">\n",
       "    <path d=\"M 220.749716 144.496796 \n",
       "L 227.996469 144.496796 \n",
       "L 227.996469 144.496796 \n",
       "L 220.749716 144.496796 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_22\">\n",
       "    <path d=\"M 238.866599 144.496796 \n",
       "L 246.113352 144.496796 \n",
       "L 246.113352 144.496796 \n",
       "L 238.866599 144.496796 \n",
       "z\n",
       "\" clip-path=\"url(#p224370b0b9)\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "   </g>\n",
       "   <g id=\"matplotlib.axis_1\">\n",
       "    <g id=\"xtick_1\">\n",
       "     <g id=\"line2d_1\">\n",
       "      <defs>\n",
       "       <path id=\"mb08744926d\" d=\"M 0 0 \n",
       "L 0 3.5 \n",
       "\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </defs>\n",
       "      <g>\n",
       "       <use xlink:href=\"#mb08744926d\" x=\"66.756209\" y=\"144.496796\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_1\">\n",
       "      <!-- 0 -->\n",
       "      <g transform=\"translate(63.574959 159.095234)scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-30\" d=\"M 2034 4250 \n",
       "Q 1547 4250 1301 3770 \n",
       "Q 1056 3291 1056 2328 \n",
       "Q 1056 1369 1301 889 \n",
       "Q 1547 409 2034 409 \n",
       "Q 2525 409 2770 889 \n",
       "Q 3016 1369 3016 2328 \n",
       "Q 3016 3291 2770 3770 \n",
       "Q 2525 4250 2034 4250 \n",
       "z\n",
       "M 2034 4750 \n",
       "Q 2819 4750 3233 4129 \n",
       "Q 3647 3509 3647 2328 \n",
       "Q 3647 1150 3233 529 \n",
       "Q 2819 -91 2034 -91 \n",
       "Q 1250 -91 836 529 \n",
       "Q 422 1150 422 2328 \n",
       "Q 422 3509 836 4129 \n",
       "Q 1250 4750 2034 4750 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"xtick_2\">\n",
       "     <g id=\"line2d_2\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#mb08744926d\" x=\"110.943729\" y=\"144.496796\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_2\">\n",
       "      <!-- 20 -->\n",
       "      <g transform=\"translate(104.581229 159.095234)scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-32\" d=\"M 1228 531 \n",
       "L 3431 531 \n",
       "L 3431 0 \n",
       "L 469 0 \n",
       "L 469 531 \n",
       "Q 828 903 1448 1529 \n",
       "Q 2069 2156 2228 2338 \n",
       "Q 2531 2678 2651 2914 \n",
       "Q 2772 3150 2772 3378 \n",
       "Q 2772 3750 2511 3984 \n",
       "Q 2250 4219 1831 4219 \n",
       "Q 1534 4219 1204 4116 \n",
       "Q 875 4013 500 3803 \n",
       "L 500 4441 \n",
       "Q 881 4594 1212 4672 \n",
       "Q 1544 4750 1819 4750 \n",
       "Q 2544 4750 2975 4387 \n",
       "Q 3406 4025 3406 3419 \n",
       "Q 3406 3131 3298 2873 \n",
       "Q 3191 2616 2906 2266 \n",
       "Q 2828 2175 2409 1742 \n",
       "Q 1991 1309 1228 531 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-32\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"63.623047\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"xtick_3\">\n",
       "     <g id=\"line2d_3\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#mb08744926d\" x=\"155.131249\" y=\"144.496796\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_3\">\n",
       "      <!-- 40 -->\n",
       "      <g transform=\"translate(148.768749 159.095234)scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-34\" d=\"M 2419 4116 \n",
       "L 825 1625 \n",
       "L 2419 1625 \n",
       "L 2419 4116 \n",
       "z\n",
       "M 2253 4666 \n",
       "L 3047 4666 \n",
       "L 3047 1625 \n",
       "L 3713 1625 \n",
       "L 3713 1100 \n",
       "L 3047 1100 \n",
       "L 3047 0 \n",
       "L 2419 0 \n",
       "L 2419 1100 \n",
       "L 313 1100 \n",
       "L 313 1709 \n",
       "L 2253 4666 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-34\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"63.623047\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"xtick_4\">\n",
       "     <g id=\"line2d_4\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#mb08744926d\" x=\"199.318769\" y=\"144.496796\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_4\">\n",
       "      <!-- 60 -->\n",
       "      <g transform=\"translate(192.956269 159.095234)scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-36\" d=\"M 2113 2584 \n",
       "Q 1688 2584 1439 2293 \n",
       "Q 1191 2003 1191 1497 \n",
       "Q 1191 994 1439 701 \n",
       "Q 1688 409 2113 409 \n",
       "Q 2538 409 2786 701 \n",
       "Q 3034 994 3034 1497 \n",
       "Q 3034 2003 2786 2293 \n",
       "Q 2538 2584 2113 2584 \n",
       "z\n",
       "M 3366 4563 \n",
       "L 3366 3988 \n",
       "Q 3128 4100 2886 4159 \n",
       "Q 2644 4219 2406 4219 \n",
       "Q 1781 4219 1451 3797 \n",
       "Q 1122 3375 1075 2522 \n",
       "Q 1259 2794 1537 2939 \n",
       "Q 1816 3084 2150 3084 \n",
       "Q 2853 3084 3261 2657 \n",
       "Q 3669 2231 3669 1497 \n",
       "Q 3669 778 3244 343 \n",
       "Q 2819 -91 2113 -91 \n",
       "Q 1303 -91 875 529 \n",
       "Q 447 1150 447 2328 \n",
       "Q 447 3434 972 4092 \n",
       "Q 1497 4750 2381 4750 \n",
       "Q 2619 4750 2861 4703 \n",
       "Q 3103 4656 3366 4563 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-36\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"63.623047\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"xtick_5\">\n",
       "     <g id=\"line2d_5\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#mb08744926d\" x=\"243.506289\" y=\"144.496796\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_5\">\n",
       "      <!-- 80 -->\n",
       "      <g transform=\"translate(237.143789 159.095234)scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-38\" d=\"M 2034 2216 \n",
       "Q 1584 2216 1326 1975 \n",
       "Q 1069 1734 1069 1313 \n",
       "Q 1069 891 1326 650 \n",
       "Q 1584 409 2034 409 \n",
       "Q 2484 409 2743 651 \n",
       "Q 3003 894 3003 1313 \n",
       "Q 3003 1734 2745 1975 \n",
       "Q 2488 2216 2034 2216 \n",
       "z\n",
       "M 1403 2484 \n",
       "Q 997 2584 770 2862 \n",
       "Q 544 3141 544 3541 \n",
       "Q 544 4100 942 4425 \n",
       "Q 1341 4750 2034 4750 \n",
       "Q 2731 4750 3128 4425 \n",
       "Q 3525 4100 3525 3541 \n",
       "Q 3525 3141 3298 2862 \n",
       "Q 3072 2584 2669 2484 \n",
       "Q 3125 2378 3379 2068 \n",
       "Q 3634 1759 3634 1313 \n",
       "Q 3634 634 3220 271 \n",
       "Q 2806 -91 2034 -91 \n",
       "Q 1263 -91 848 271 \n",
       "Q 434 634 434 1313 \n",
       "Q 434 1759 690 2068 \n",
       "Q 947 2378 1403 2484 \n",
       "z\n",
       "M 1172 3481 \n",
       "Q 1172 3119 1398 2916 \n",
       "Q 1625 2713 2034 2713 \n",
       "Q 2441 2713 2670 2916 \n",
       "Q 2900 3119 2900 3481 \n",
       "Q 2900 3844 2670 4047 \n",
       "Q 2441 4250 2034 4250 \n",
       "Q 1625 4250 1398 4047 \n",
       "Q 1172 3844 1172 3481 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-38\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"63.623047\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"text_6\">\n",
       "     <!-- # tokens per sentence -->\n",
       "     <g transform=\"translate(100.6125 172.773359)scale(0.1 -0.1)\">\n",
       "      <defs>\n",
       "       <path id=\"DejaVuSans-23\" d=\"M 3272 2816 \n",
       "L 2363 2816 \n",
       "L 2100 1772 \n",
       "L 3016 1772 \n",
       "L 3272 2816 \n",
       "z\n",
       "M 2803 4594 \n",
       "L 2478 3297 \n",
       "L 3391 3297 \n",
       "L 3719 4594 \n",
       "L 4219 4594 \n",
       "L 3897 3297 \n",
       "L 4872 3297 \n",
       "L 4872 2816 \n",
       "L 3775 2816 \n",
       "L 3519 1772 \n",
       "L 4513 1772 \n",
       "L 4513 1294 \n",
       "L 3397 1294 \n",
       "L 3072 0 \n",
       "L 2572 0 \n",
       "L 2894 1294 \n",
       "L 1978 1294 \n",
       "L 1656 0 \n",
       "L 1153 0 \n",
       "L 1478 1294 \n",
       "L 494 1294 \n",
       "L 494 1772 \n",
       "L 1594 1772 \n",
       "L 1856 2816 \n",
       "L 850 2816 \n",
       "L 850 3297 \n",
       "L 1978 3297 \n",
       "L 2297 4594 \n",
       "L 2803 4594 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-20\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-74\" d=\"M 1172 4494 \n",
       "L 1172 3500 \n",
       "L 2356 3500 \n",
       "L 2356 3053 \n",
       "L 1172 3053 \n",
       "L 1172 1153 \n",
       "Q 1172 725 1289 603 \n",
       "Q 1406 481 1766 481 \n",
       "L 2356 481 \n",
       "L 2356 0 \n",
       "L 1766 0 \n",
       "Q 1100 0 847 248 \n",
       "Q 594 497 594 1153 \n",
       "L 594 3053 \n",
       "L 172 3053 \n",
       "L 172 3500 \n",
       "L 594 3500 \n",
       "L 594 4494 \n",
       "L 1172 4494 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-6f\" d=\"M 1959 3097 \n",
       "Q 1497 3097 1228 2736 \n",
       "Q 959 2375 959 1747 \n",
       "Q 959 1119 1226 758 \n",
       "Q 1494 397 1959 397 \n",
       "Q 2419 397 2687 759 \n",
       "Q 2956 1122 2956 1747 \n",
       "Q 2956 2369 2687 2733 \n",
       "Q 2419 3097 1959 3097 \n",
       "z\n",
       "M 1959 3584 \n",
       "Q 2709 3584 3137 3096 \n",
       "Q 3566 2609 3566 1747 \n",
       "Q 3566 888 3137 398 \n",
       "Q 2709 -91 1959 -91 \n",
       "Q 1206 -91 779 398 \n",
       "Q 353 888 353 1747 \n",
       "Q 353 2609 779 3096 \n",
       "Q 1206 3584 1959 3584 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-6b\" d=\"M 581 4863 \n",
       "L 1159 4863 \n",
       "L 1159 1991 \n",
       "L 2875 3500 \n",
       "L 3609 3500 \n",
       "L 1753 1863 \n",
       "L 3688 0 \n",
       "L 2938 0 \n",
       "L 1159 1709 \n",
       "L 1159 0 \n",
       "L 581 0 \n",
       "L 581 4863 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-65\" d=\"M 3597 1894 \n",
       "L 3597 1613 \n",
       "L 953 1613 \n",
       "Q 991 1019 1311 708 \n",
       "Q 1631 397 2203 397 \n",
       "Q 2534 397 2845 478 \n",
       "Q 3156 559 3463 722 \n",
       "L 3463 178 \n",
       "Q 3153 47 2828 -22 \n",
       "Q 2503 -91 2169 -91 \n",
       "Q 1331 -91 842 396 \n",
       "Q 353 884 353 1716 \n",
       "Q 353 2575 817 3079 \n",
       "Q 1281 3584 2069 3584 \n",
       "Q 2775 3584 3186 3129 \n",
       "Q 3597 2675 3597 1894 \n",
       "z\n",
       "M 3022 2063 \n",
       "Q 3016 2534 2758 2815 \n",
       "Q 2500 3097 2075 3097 \n",
       "Q 1594 3097 1305 2825 \n",
       "Q 1016 2553 972 2059 \n",
       "L 3022 2063 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-6e\" d=\"M 3513 2113 \n",
       "L 3513 0 \n",
       "L 2938 0 \n",
       "L 2938 2094 \n",
       "Q 2938 2591 2744 2837 \n",
       "Q 2550 3084 2163 3084 \n",
       "Q 1697 3084 1428 2787 \n",
       "Q 1159 2491 1159 1978 \n",
       "L 1159 0 \n",
       "L 581 0 \n",
       "L 581 3500 \n",
       "L 1159 3500 \n",
       "L 1159 2956 \n",
       "Q 1366 3272 1645 3428 \n",
       "Q 1925 3584 2291 3584 \n",
       "Q 2894 3584 3203 3211 \n",
       "Q 3513 2838 3513 2113 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-73\" d=\"M 2834 3397 \n",
       "L 2834 2853 \n",
       "Q 2591 2978 2328 3040 \n",
       "Q 2066 3103 1784 3103 \n",
       "Q 1356 3103 1142 2972 \n",
       "Q 928 2841 928 2578 \n",
       "Q 928 2378 1081 2264 \n",
       "Q 1234 2150 1697 2047 \n",
       "L 1894 2003 \n",
       "Q 2506 1872 2764 1633 \n",
       "Q 3022 1394 3022 966 \n",
       "Q 3022 478 2636 193 \n",
       "Q 2250 -91 1575 -91 \n",
       "Q 1294 -91 989 -36 \n",
       "Q 684 19 347 128 \n",
       "L 347 722 \n",
       "Q 666 556 975 473 \n",
       "Q 1284 391 1588 391 \n",
       "Q 1994 391 2212 530 \n",
       "Q 2431 669 2431 922 \n",
       "Q 2431 1156 2273 1281 \n",
       "Q 2116 1406 1581 1522 \n",
       "L 1381 1569 \n",
       "Q 847 1681 609 1914 \n",
       "Q 372 2147 372 2553 \n",
       "Q 372 3047 722 3315 \n",
       "Q 1072 3584 1716 3584 \n",
       "Q 2034 3584 2315 3537 \n",
       "Q 2597 3491 2834 3397 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-70\" d=\"M 1159 525 \n",
       "L 1159 -1331 \n",
       "L 581 -1331 \n",
       "L 581 3500 \n",
       "L 1159 3500 \n",
       "L 1159 2969 \n",
       "Q 1341 3281 1617 3432 \n",
       "Q 1894 3584 2278 3584 \n",
       "Q 2916 3584 3314 3078 \n",
       "Q 3713 2572 3713 1747 \n",
       "Q 3713 922 3314 415 \n",
       "Q 2916 -91 2278 -91 \n",
       "Q 1894 -91 1617 61 \n",
       "Q 1341 213 1159 525 \n",
       "z\n",
       "M 3116 1747 \n",
       "Q 3116 2381 2855 2742 \n",
       "Q 2594 3103 2138 3103 \n",
       "Q 1681 3103 1420 2742 \n",
       "Q 1159 2381 1159 1747 \n",
       "Q 1159 1113 1420 752 \n",
       "Q 1681 391 2138 391 \n",
       "Q 2594 391 2855 752 \n",
       "Q 3116 1113 3116 1747 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-72\" d=\"M 2631 2963 \n",
       "Q 2534 3019 2420 3045 \n",
       "Q 2306 3072 2169 3072 \n",
       "Q 1681 3072 1420 2755 \n",
       "Q 1159 2438 1159 1844 \n",
       "L 1159 0 \n",
       "L 581 0 \n",
       "L 581 3500 \n",
       "L 1159 3500 \n",
       "L 1159 2956 \n",
       "Q 1341 3275 1631 3429 \n",
       "Q 1922 3584 2338 3584 \n",
       "Q 2397 3584 2469 3576 \n",
       "Q 2541 3569 2628 3553 \n",
       "L 2631 2963 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-63\" d=\"M 3122 3366 \n",
       "L 3122 2828 \n",
       "Q 2878 2963 2633 3030 \n",
       "Q 2388 3097 2138 3097 \n",
       "Q 1578 3097 1268 2742 \n",
       "Q 959 2388 959 1747 \n",
       "Q 959 1106 1268 751 \n",
       "Q 1578 397 2138 397 \n",
       "Q 2388 397 2633 464 \n",
       "Q 2878 531 3122 666 \n",
       "L 3122 134 \n",
       "Q 2881 22 2623 -34 \n",
       "Q 2366 -91 2075 -91 \n",
       "Q 1284 -91 818 406 \n",
       "Q 353 903 353 1747 \n",
       "Q 353 2603 823 3093 \n",
       "Q 1294 3584 2113 3584 \n",
       "Q 2378 3584 2631 3529 \n",
       "Q 2884 3475 3122 3366 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "      </defs>\n",
       "      <use xlink:href=\"#DejaVuSans-23\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-20\" x=\"83.789062\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-74\" x=\"115.576172\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6f\" x=\"154.785156\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6b\" x=\"215.966797\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-65\" x=\"270.251953\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6e\" x=\"331.775391\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-73\" x=\"395.154297\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-20\" x=\"447.253906\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-70\" x=\"479.041016\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-65\" x=\"542.517578\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-72\" x=\"604.041016\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-20\" x=\"645.154297\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-73\" x=\"676.941406\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-65\" x=\"729.041016\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6e\" x=\"790.564453\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-74\" x=\"853.943359\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-65\" x=\"893.152344\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6e\" x=\"954.675781\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-63\" x=\"1018.054688\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-65\" x=\"1073.035156\"/>\n",
       "     </g>\n",
       "    </g>\n",
       "   </g>\n",
       "   <g id=\"matplotlib.axis_2\">\n",
       "    <g id=\"ytick_1\">\n",
       "     <g id=\"line2d_6\">\n",
       "      <defs>\n",
       "       <path id=\"m7f47d3da76\" d=\"M 0 0 \n",
       "L -3.5 0 \n",
       "\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </defs>\n",
       "      <g>\n",
       "       <use xlink:href=\"#m7f47d3da76\" x=\"59.690625\" y=\"144.496796\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_7\">\n",
       "      <!-- 0 -->\n",
       "      <g transform=\"translate(46.328125 148.296015)scale(0.1 -0.1)\">\n",
       "       <use xlink:href=\"#DejaVuSans-30\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"ytick_2\">\n",
       "     <g id=\"line2d_7\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m7f47d3da76\" x=\"59.690625\" y=\"117.797281\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_8\">\n",
       "      <!-- 5000 -->\n",
       "      <g transform=\"translate(27.240625 121.5965)scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-35\" d=\"M 691 4666 \n",
       "L 3169 4666 \n",
       "L 3169 4134 \n",
       "L 1269 4134 \n",
       "L 1269 2991 \n",
       "Q 1406 3038 1543 3061 \n",
       "Q 1681 3084 1819 3084 \n",
       "Q 2600 3084 3056 2656 \n",
       "Q 3513 2228 3513 1497 \n",
       "Q 3513 744 3044 326 \n",
       "Q 2575 -91 1722 -91 \n",
       "Q 1428 -91 1123 -41 \n",
       "Q 819 9 494 109 \n",
       "L 494 744 \n",
       "Q 775 591 1075 516 \n",
       "Q 1375 441 1709 441 \n",
       "Q 2250 441 2565 725 \n",
       "Q 2881 1009 2881 1497 \n",
       "Q 2881 1984 2565 2268 \n",
       "Q 2250 2553 1709 2553 \n",
       "Q 1456 2553 1204 2497 \n",
       "Q 953 2441 691 2322 \n",
       "L 691 4666 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-35\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"127.246094\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"190.869141\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"ytick_3\">\n",
       "     <g id=\"line2d_8\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m7f47d3da76\" x=\"59.690625\" y=\"91.097765\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_9\">\n",
       "      <!-- 10000 -->\n",
       "      <g transform=\"translate(20.878125 94.896984)scale(0.1 -0.1)\">\n",
       "       <defs>\n",
       "        <path id=\"DejaVuSans-31\" d=\"M 794 531 \n",
       "L 1825 531 \n",
       "L 1825 4091 \n",
       "L 703 3866 \n",
       "L 703 4441 \n",
       "L 1819 4666 \n",
       "L 2450 4666 \n",
       "L 2450 531 \n",
       "L 3481 531 \n",
       "L 3481 0 \n",
       "L 794 0 \n",
       "L 794 531 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       </defs>\n",
       "       <use xlink:href=\"#DejaVuSans-31\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"127.246094\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"190.869141\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"254.492188\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"ytick_4\">\n",
       "     <g id=\"line2d_9\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m7f47d3da76\" x=\"59.690625\" y=\"64.39825\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_10\">\n",
       "      <!-- 15000 -->\n",
       "      <g transform=\"translate(20.878125 68.197469)scale(0.1 -0.1)\">\n",
       "       <use xlink:href=\"#DejaVuSans-31\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-35\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"127.246094\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"190.869141\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"254.492188\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"ytick_5\">\n",
       "     <g id=\"line2d_10\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m7f47d3da76\" x=\"59.690625\" y=\"37.698734\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_11\">\n",
       "      <!-- 20000 -->\n",
       "      <g transform=\"translate(20.878125 41.497953)scale(0.1 -0.1)\">\n",
       "       <use xlink:href=\"#DejaVuSans-32\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"127.246094\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"190.869141\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"254.492188\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"ytick_6\">\n",
       "     <g id=\"line2d_11\">\n",
       "      <g>\n",
       "       <use xlink:href=\"#m7f47d3da76\" x=\"59.690625\" y=\"10.999219\" style=\"stroke: #000000; stroke-width: 0.8\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "     <g id=\"text_12\">\n",
       "      <!-- 25000 -->\n",
       "      <g transform=\"translate(20.878125 14.798437)scale(0.1 -0.1)\">\n",
       "       <use xlink:href=\"#DejaVuSans-32\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-35\" x=\"63.623047\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"127.246094\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"190.869141\"/>\n",
       "       <use xlink:href=\"#DejaVuSans-30\" x=\"254.492188\"/>\n",
       "      </g>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"text_13\">\n",
       "     <!-- count -->\n",
       "     <g transform=\"translate(14.798438 90.653046)rotate(-90)scale(0.1 -0.1)\">\n",
       "      <defs>\n",
       "       <path id=\"DejaVuSans-75\" d=\"M 544 1381 \n",
       "L 544 3500 \n",
       "L 1119 3500 \n",
       "L 1119 1403 \n",
       "Q 1119 906 1312 657 \n",
       "Q 1506 409 1894 409 \n",
       "Q 2359 409 2629 706 \n",
       "Q 2900 1003 2900 1516 \n",
       "L 2900 3500 \n",
       "L 3475 3500 \n",
       "L 3475 0 \n",
       "L 2900 0 \n",
       "L 2900 538 \n",
       "Q 2691 219 2414 64 \n",
       "Q 2138 -91 1772 -91 \n",
       "Q 1169 -91 856 284 \n",
       "Q 544 659 544 1381 \n",
       "z\n",
       "M 1991 3584 \n",
       "L 1991 3584 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "      </defs>\n",
       "      <use xlink:href=\"#DejaVuSans-63\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6f\" x=\"54.980469\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-75\" x=\"116.162109\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6e\" x=\"179.541016\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-74\" x=\"242.919922\"/>\n",
       "     </g>\n",
       "    </g>\n",
       "   </g>\n",
       "   <g id=\"patch_23\">\n",
       "    <path d=\"M 59.690625 144.496796 \n",
       "L 59.690625 8.596796 \n",
       "\" style=\"fill: none; stroke: #000000; stroke-width: 0.8; stroke-linejoin: miter; stroke-linecap: square\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_24\">\n",
       "    <path d=\"M 254.990625 144.496796 \n",
       "L 254.990625 8.596796 \n",
       "\" style=\"fill: none; stroke: #000000; stroke-width: 0.8; stroke-linejoin: miter; stroke-linecap: square\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_25\">\n",
       "    <path d=\"M 59.690625 144.496796 \n",
       "L 254.990625 144.496796 \n",
       "\" style=\"fill: none; stroke: #000000; stroke-width: 0.8; stroke-linejoin: miter; stroke-linecap: square\"/>\n",
       "   </g>\n",
       "   <g id=\"patch_26\">\n",
       "    <path d=\"M 59.690625 8.596796 \n",
       "L 254.990625 8.596796 \n",
       "\" style=\"fill: none; stroke: #000000; stroke-width: 0.8; stroke-linejoin: miter; stroke-linecap: square\"/>\n",
       "   </g>\n",
       "   <g id=\"legend_1\">\n",
       "    <g id=\"patch_27\">\n",
       "     <path d=\"M 155.389063 45.953046 \n",
       "L 247.990625 45.953046 \n",
       "Q 249.990625 45.953046 249.990625 43.953046 \n",
       "L 249.990625 15.596796 \n",
       "Q 249.990625 13.596796 247.990625 13.596796 \n",
       "L 155.389063 13.596796 \n",
       "Q 153.389063 13.596796 153.389063 15.596796 \n",
       "L 153.389063 43.953046 \n",
       "Q 153.389063 45.953046 155.389063 45.953046 \n",
       "z\n",
       "\" style=\"fill: #ffffff; opacity: 0.8; stroke: #cccccc; stroke-linejoin: miter\"/>\n",
       "    </g>\n",
       "    <g id=\"patch_28\">\n",
       "     <path d=\"M 157.389063 25.195234 \n",
       "L 177.389063 25.195234 \n",
       "L 177.389063 18.195234 \n",
       "L 157.389063 18.195234 \n",
       "z\n",
       "\" style=\"fill: #1f77b4\"/>\n",
       "    </g>\n",
       "    <g id=\"text_14\">\n",
       "     <!-- origin -->\n",
       "     <g transform=\"translate(185.389063 25.195234)scale(0.1 -0.1)\">\n",
       "      <defs>\n",
       "       <path id=\"DejaVuSans-69\" d=\"M 603 3500 \n",
       "L 1178 3500 \n",
       "L 1178 0 \n",
       "L 603 0 \n",
       "L 603 3500 \n",
       "z\n",
       "M 603 4863 \n",
       "L 1178 4863 \n",
       "L 1178 4134 \n",
       "L 603 4134 \n",
       "L 603 4863 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-67\" d=\"M 2906 1791 \n",
       "Q 2906 2416 2648 2759 \n",
       "Q 2391 3103 1925 3103 \n",
       "Q 1463 3103 1205 2759 \n",
       "Q 947 2416 947 1791 \n",
       "Q 947 1169 1205 825 \n",
       "Q 1463 481 1925 481 \n",
       "Q 2391 481 2648 825 \n",
       "Q 2906 1169 2906 1791 \n",
       "z\n",
       "M 3481 434 \n",
       "Q 3481 -459 3084 -895 \n",
       "Q 2688 -1331 1869 -1331 \n",
       "Q 1566 -1331 1297 -1286 \n",
       "Q 1028 -1241 775 -1147 \n",
       "L 775 -588 \n",
       "Q 1028 -725 1275 -790 \n",
       "Q 1522 -856 1778 -856 \n",
       "Q 2344 -856 2625 -561 \n",
       "Q 2906 -266 2906 331 \n",
       "L 2906 616 \n",
       "Q 2728 306 2450 153 \n",
       "Q 2172 0 1784 0 \n",
       "Q 1141 0 747 490 \n",
       "Q 353 981 353 1791 \n",
       "Q 353 2603 747 3093 \n",
       "Q 1141 3584 1784 3584 \n",
       "Q 2172 3584 2450 3431 \n",
       "Q 2728 3278 2906 2969 \n",
       "L 2906 3500 \n",
       "L 3481 3500 \n",
       "L 3481 434 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "      </defs>\n",
       "      <use xlink:href=\"#DejaVuSans-6f\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-72\" x=\"61.181641\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-69\" x=\"102.294922\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-67\" x=\"130.078125\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-69\" x=\"193.554688\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6e\" x=\"221.337891\"/>\n",
       "     </g>\n",
       "    </g>\n",
       "    <g id=\"patch_29\">\n",
       "     <path d=\"M 157.389063 39.873359 \n",
       "L 177.389063 39.873359 \n",
       "L 177.389063 32.873359 \n",
       "L 157.389063 32.873359 \n",
       "z\n",
       "\" style=\"fill: url(#hdd73a5dc4c)\"/>\n",
       "    </g>\n",
       "    <g id=\"text_15\">\n",
       "     <!-- subsampled -->\n",
       "     <g transform=\"translate(185.389063 39.873359)scale(0.1 -0.1)\">\n",
       "      <defs>\n",
       "       <path id=\"DejaVuSans-62\" d=\"M 3116 1747 \n",
       "Q 3116 2381 2855 2742 \n",
       "Q 2594 3103 2138 3103 \n",
       "Q 1681 3103 1420 2742 \n",
       "Q 1159 2381 1159 1747 \n",
       "Q 1159 1113 1420 752 \n",
       "Q 1681 391 2138 391 \n",
       "Q 2594 391 2855 752 \n",
       "Q 3116 1113 3116 1747 \n",
       "z\n",
       "M 1159 2969 \n",
       "Q 1341 3281 1617 3432 \n",
       "Q 1894 3584 2278 3584 \n",
       "Q 2916 3584 3314 3078 \n",
       "Q 3713 2572 3713 1747 \n",
       "Q 3713 922 3314 415 \n",
       "Q 2916 -91 2278 -91 \n",
       "Q 1894 -91 1617 61 \n",
       "Q 1341 213 1159 525 \n",
       "L 1159 0 \n",
       "L 581 0 \n",
       "L 581 4863 \n",
       "L 1159 4863 \n",
       "L 1159 2969 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-61\" d=\"M 2194 1759 \n",
       "Q 1497 1759 1228 1600 \n",
       "Q 959 1441 959 1056 \n",
       "Q 959 750 1161 570 \n",
       "Q 1363 391 1709 391 \n",
       "Q 2188 391 2477 730 \n",
       "Q 2766 1069 2766 1631 \n",
       "L 2766 1759 \n",
       "L 2194 1759 \n",
       "z\n",
       "M 3341 1997 \n",
       "L 3341 0 \n",
       "L 2766 0 \n",
       "L 2766 531 \n",
       "Q 2569 213 2275 61 \n",
       "Q 1981 -91 1556 -91 \n",
       "Q 1019 -91 701 211 \n",
       "Q 384 513 384 1019 \n",
       "Q 384 1609 779 1909 \n",
       "Q 1175 2209 1959 2209 \n",
       "L 2766 2209 \n",
       "L 2766 2266 \n",
       "Q 2766 2663 2505 2880 \n",
       "Q 2244 3097 1772 3097 \n",
       "Q 1472 3097 1187 3025 \n",
       "Q 903 2953 641 2809 \n",
       "L 641 3341 \n",
       "Q 956 3463 1253 3523 \n",
       "Q 1550 3584 1831 3584 \n",
       "Q 2591 3584 2966 3190 \n",
       "Q 3341 2797 3341 1997 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-6d\" d=\"M 3328 2828 \n",
       "Q 3544 3216 3844 3400 \n",
       "Q 4144 3584 4550 3584 \n",
       "Q 5097 3584 5394 3201 \n",
       "Q 5691 2819 5691 2113 \n",
       "L 5691 0 \n",
       "L 5113 0 \n",
       "L 5113 2094 \n",
       "Q 5113 2597 4934 2840 \n",
       "Q 4756 3084 4391 3084 \n",
       "Q 3944 3084 3684 2787 \n",
       "Q 3425 2491 3425 1978 \n",
       "L 3425 0 \n",
       "L 2847 0 \n",
       "L 2847 2094 \n",
       "Q 2847 2600 2669 2842 \n",
       "Q 2491 3084 2119 3084 \n",
       "Q 1678 3084 1418 2786 \n",
       "Q 1159 2488 1159 1978 \n",
       "L 1159 0 \n",
       "L 581 0 \n",
       "L 581 3500 \n",
       "L 1159 3500 \n",
       "L 1159 2956 \n",
       "Q 1356 3278 1631 3431 \n",
       "Q 1906 3584 2284 3584 \n",
       "Q 2666 3584 2933 3390 \n",
       "Q 3200 3197 3328 2828 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-6c\" d=\"M 603 4863 \n",
       "L 1178 4863 \n",
       "L 1178 0 \n",
       "L 603 0 \n",
       "L 603 4863 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "       <path id=\"DejaVuSans-64\" d=\"M 2906 2969 \n",
       "L 2906 4863 \n",
       "L 3481 4863 \n",
       "L 3481 0 \n",
       "L 2906 0 \n",
       "L 2906 525 \n",
       "Q 2725 213 2448 61 \n",
       "Q 2172 -91 1784 -91 \n",
       "Q 1150 -91 751 415 \n",
       "Q 353 922 353 1747 \n",
       "Q 353 2572 751 3078 \n",
       "Q 1150 3584 1784 3584 \n",
       "Q 2172 3584 2448 3432 \n",
       "Q 2725 3281 2906 2969 \n",
       "z\n",
       "M 947 1747 \n",
       "Q 947 1113 1208 752 \n",
       "Q 1469 391 1925 391 \n",
       "Q 2381 391 2643 752 \n",
       "Q 2906 1113 2906 1747 \n",
       "Q 2906 2381 2643 2742 \n",
       "Q 2381 3103 1925 3103 \n",
       "Q 1469 3103 1208 2742 \n",
       "Q 947 2381 947 1747 \n",
       "z\n",
       "\" transform=\"scale(0.015625)\"/>\n",
       "      </defs>\n",
       "      <use xlink:href=\"#DejaVuSans-73\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-75\" x=\"52.099609\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-62\" x=\"115.478516\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-73\" x=\"178.955078\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-61\" x=\"231.054688\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6d\" x=\"292.333984\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-70\" x=\"389.746094\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-6c\" x=\"453.222656\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-65\" x=\"481.005859\"/>\n",
       "      <use xlink:href=\"#DejaVuSans-64\" x=\"542.529297\"/>\n",
       "     </g>\n",
       "    </g>\n",
       "   </g>\n",
       "  </g>\n",
       " </g>\n",
       " <defs>\n",
       "  <clipPath id=\"p224370b0b9\">\n",
       "   <rect x=\"59.690625\" y=\"8.596796\" width=\"195.3\" height=\"135.9\"/>\n",
       "  </clipPath>\n",
       " </defs>\n",
       " <defs>\n",
       "  <pattern id=\"hdd73a5dc4c\" patternUnits=\"userSpaceOnUse\" x=\"0\" y=\"0\" width=\"72\" height=\"72\">\n",
       "   <rect x=\"0\" y=\"0\" width=\"73\" height=\"73\" fill=\"#ff7f0e\"/>\n",
       "   <path d=\"M -36 36 \n",
       "L 36 -36 \n",
       "M -24 48 \n",
       "L 48 -24 \n",
       "M -12 60 \n",
       "L 60 -12 \n",
       "M 0 72 \n",
       "L 72 0 \n",
       "M 12 84 \n",
       "L 84 12 \n",
       "M 24 96 \n",
       "L 96 24 \n",
       "M 36 108 \n",
       "L 108 36 \n",
       "\" style=\"fill: #000000; stroke: #000000; stroke-width: 1.0; stroke-linecap: butt; stroke-linejoin: miter\"/>\n",
       "  </pattern>\n",
       " </defs>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<Figure size 252x180 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "d2l.show_list_len_pair_hist(\n",
    "    ['origin', 'subsampled'], '# tokens per sentence',\n",
    "    'count', sentences, subsampled);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80da6e9d",
   "metadata": {
    "origin_pos": 12
   },
   "source": [
    "For individual tokens, the sampling rate of the high-frequency word “the” is less than 1/20.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2ac63b1a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:44.277942Z",
     "iopub.status.busy": "2023-08-18T07:01:44.277661Z",
     "iopub.status.idle": "2023-08-18T07:01:44.319135Z",
     "shell.execute_reply": "2023-08-18T07:01:44.317982Z"
    },
    "origin_pos": 13,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'# of \"the\": before=50770, after=2056'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def compare_counts(token):\n",
    "    return (f'# of \"{token}\": '\n",
    "            f'before={sum([l.count(token) for l in sentences])}, '\n",
    "            f'after={sum([l.count(token) for l in subsampled])}')\n",
    "\n",
    "compare_counts('the')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73ef69ef",
   "metadata": {
    "origin_pos": 14
   },
   "source": [
    "In contrast, the low-frequency word “join” is kept in full.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9307cb04",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:44.324650Z",
     "iopub.status.busy": "2023-08-18T07:01:44.323726Z",
     "iopub.status.idle": "2023-08-18T07:01:44.366586Z",
     "shell.execute_reply": "2023-08-18T07:01:44.365449Z"
    },
    "origin_pos": 15,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'# of \"join\": before=45, after=45'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compare_counts('join')"
   ]
  },
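  {
   "cell_type": "markdown",
   "id": "c7a1e001",
   "metadata": {},
   "source": [
    "As a quick arithmetic check, the retention ratios implied by the two counts printed above (50770 and 2056 for “the”, 45 and 45 for “join”) can be worked out directly. This is only a sanity check on that particular run:\n",
    "\n",
    "$$\\frac{2056}{50770} \\approx 0.0405 < \\frac{1}{20} = 0.05, \\qquad \\frac{45}{45} = 1.$$\n",
    "\n",
    "Because subsampling is random, the exact counts will likely differ slightly from run to run.\n"
   ]
  },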
  {
   "cell_type": "markdown",
   "id": "38762200",
   "metadata": {
    "origin_pos": 16
   },
   "source": [
    "After subsampling, we map the tokens to their indices in the vocabulary to build the corpus.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ed59e4d0",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:44.371681Z",
     "iopub.status.busy": "2023-08-18T07:01:44.370695Z",
     "iopub.status.idle": "2023-08-18T07:01:44.930927Z",
     "shell.execute_reply": "2023-08-18T07:01:44.929824Z"
    },
    "origin_pos": 17,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[], [2115, 274, 406], [140, 3, 5277, 3054, 1580]]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus = [vocab[line] for line in subsampled]\n",
    "corpus[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe5918fb",
   "metadata": {
    "origin_pos": 18
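   },
   "source": [
    "## Extracting Center Words and Context Words\n",
    "\n",
    "The following `get_centers_and_contexts` function extracts all the center words and their context words from `corpus`. For each center word, it uniformly samples an integer between 1 and `max_window_size` (inclusive) as the context window size. The words whose distance from the center word does not exceed this sampled window size are its context words.\n"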
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d4a20ba3",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:44.935833Z",
     "iopub.status.busy": "2023-08-18T07:01:44.935066Z",
     "iopub.status.idle": "2023-08-18T07:01:44.944963Z",
     "shell.execute_reply": "2023-08-18T07:01:44.943901Z"
    },
    "origin_pos": 19,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "#@save\n",
    "def get_centers_and_contexts(corpus, max_window_size):\n",
    "    \"\"\"Return center words and context words in the skip-gram model.\"\"\"\n",
    "    centers, contexts = [], []\n",
    "    for line in corpus:\n",
    "        # To form a \"center word-context word\" pair, each sentence needs\n",
    "        # to have at least 2 words\n",
    "        if len(line) < 2:\n",
    "            continue\n",
    "        centers += line\n",
    "        for i in range(len(line)):  # Context window centered at `i`\n",
    "            window_size = random.randint(1, max_window_size)\n",
    "            indices = list(range(max(0, i - window_size),\n",
    "                                 min(len(line), i + 1 + window_size)))\n",
    "            # Exclude the center word from the context words\n",
    "            indices.remove(i)\n",
    "            contexts.append([line[idx] for idx in indices])\n",
    "    return centers, contexts"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86fba895",
   "metadata": {
    "origin_pos": 20
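   },
   "source": [
    "Next, we create an artificial dataset containing two sentences of 7 and 3 words, respectively. We set the maximum context window size to 2 and print all the center words and their context words.\n"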
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "fae4771b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:44.948910Z",
     "iopub.status.busy": "2023-08-18T07:01:44.948190Z",
     "iopub.status.idle": "2023-08-18T07:01:44.955563Z",
     "shell.execute_reply": "2023-08-18T07:01:44.954488Z"
    },
    "origin_pos": 21,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]\n",
      "center 0 has contexts [1]\n",
      "center 1 has contexts [0, 2]\n",
      "center 2 has contexts [0, 1, 3, 4]\n",
      "center 3 has contexts [2, 4]\n",
      "center 4 has contexts [3, 5]\n",
      "center 5 has contexts [4, 6]\n",
      "center 6 has contexts [5]\n",
      "center 7 has contexts [8, 9]\n",
      "center 8 has contexts [7, 9]\n",
      "center 9 has contexts [7, 8]\n"
     ]
    }
   ],
   "source": [
    "tiny_dataset = [list(range(7)), list(range(7, 10))]\n",
    "print('dataset', tiny_dataset)\n",
    "for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):\n",
    "    print('center', center, 'has contexts', context)"
   ]
  },
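  {
   "cell_type": "markdown",
   "id": "c7a1e002",
   "metadata": {},
   "source": [
    "To make the window logic concrete, the index set computed inside `get_centers_and_contexts` can be written as a formula. The symbols $\\mathcal{C}$, $L$, and $w$ are notation introduced here only for illustration: $L$ is the sentence length, $i$ the position of the center word, and $w$ the sampled window size.\n",
    "\n",
    "$$\\mathcal{C}(i, w) = \\{\\, j : \\max(0, i-w) \\le j \\le \\min(L-1, i+w),\\ j \\ne i \\,\\}$$\n",
    "\n",
    "For the first sentence ($L=7$), $\\mathcal{C}(2, 2) = \\{0, 1, 3, 4\\}$ and $\\mathcal{C}(0, 1) = \\{1\\}$, which is consistent with the context words printed above for center words 2 and 0 in that run. At sentence boundaries the window is simply truncated.\n"
   ]
  },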
  {
   "cell_type": "markdown",
   "id": "e21272fc",
   "metadata": {
    "origin_pos": 22
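   },
   "source": [
    "When training on the PTB dataset, we set the maximum context window size to 5. The following extracts all the center words and their context words in the dataset.\n"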
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "ec92f27e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:44.960145Z",
     "iopub.status.busy": "2023-08-18T07:01:44.959231Z",
     "iopub.status.idle": "2023-08-18T07:01:47.218796Z",
     "shell.execute_reply": "2023-08-18T07:01:47.217626Z"
    },
    "origin_pos": 23,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'# center-context pairs: 1499984'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_centers, all_contexts = get_centers_and_contexts(corpus, 5)\n",
    "f'# center-context pairs: {sum([len(contexts) for contexts in all_contexts])}'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f48c535f",
   "metadata": {
    "origin_pos": 24
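   },
   "source": [
    "## Negative Sampling\n",
    "\n",
    "We use negative sampling for approximate training. To sample noise words according to a predefined distribution, we define the following `RandomGenerator` class, where the (possibly unnormalized) sampling distribution is passed via the argument `sampling_weights`.\n"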
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "365189a2",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:47.223801Z",
     "iopub.status.busy": "2023-08-18T07:01:47.223354Z",
     "iopub.status.idle": "2023-08-18T07:01:47.232254Z",
     "shell.execute_reply": "2023-08-18T07:01:47.231166Z"
    },
    "origin_pos": 25,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "#@save\n",
    "class RandomGenerator:\n",
    "    \"\"\"Randomly draw among {1, ..., n} according to n sampling weights.\"\"\"\n",
    "    def __init__(self, sampling_weights):\n",
    "        # Candidate indices start at 1; index 0 is excluded\n",
    "        self.population = list(range(1, len(sampling_weights) + 1))\n",
    "        self.sampling_weights = sampling_weights\n",
    "        self.candidates = []\n",
    "        self.i = 0\n",
    "\n",
    "    def draw(self):\n",
    "        if self.i == len(self.candidates):\n",
    "            # Cache `k` random sampling results\n",
    "            self.candidates = random.choices(\n",
    "                self.population, self.sampling_weights, k=10000)\n",
    "            self.i = 0\n",
    "        self.i += 1\n",
    "        return self.candidates[self.i - 1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f886ada9",
   "metadata": {
    "origin_pos": 26
   },
   "source": [
    "For example, we can draw 10 random variables $X$ among indices 1, 2, and 3 with sampling probabilities $P(X=1)=2/9, P(X=2)=3/9$, and $P(X=3)=4/9$ as follows.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "f534865c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:47.237153Z",
     "iopub.status.busy": "2023-08-18T07:01:47.236381Z",
     "iopub.status.idle": "2023-08-18T07:01:47.251510Z",
     "shell.execute_reply": "2023-08-18T07:01:47.250435Z"
    },
    "origin_pos": 27,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 2, 2, 3, 3, 3, 3, 2, 1, 2]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#@save\n",
    "generator = RandomGenerator([2, 3, 4])\n",
    "[generator.draw() for _ in range(10)]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe4049d4",
   "metadata": {
    "origin_pos": 28
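   },
   "source": [
    "For each pair of a center word and a context word, we randomly sample `K` (5 in the experiment) noise words. Following the suggestion in the word2vec paper, the sampling probability $P(w)$ of a noise word $w$ is set proportional to its relative frequency in the dictionary raised to the power of 0.75 :cite:`Mikolov.Sutskever.Chen.ea.2013`.\n"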
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "21950025",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:47.256344Z",
     "iopub.status.busy": "2023-08-18T07:01:47.255586Z",
     "iopub.status.idle": "2023-08-18T07:01:59.259799Z",
     "shell.execute_reply": "2023-08-18T07:01:59.258793Z"
    },
    "origin_pos": 29,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "#@save\n",
    "def get_negatives(all_contexts, vocab, counter, K):\n",
    "    \"\"\"Return noise words in negative sampling.\"\"\"\n",
    "    # Sampling weights for words with indices 1, 2, ... in the vocabulary\n",
    "    # (index 0 is the excluded unknown token)\n",
    "    sampling_weights = [counter[vocab.to_tokens(i)]**0.75\n",
    "                        for i in range(1, len(vocab))]\n",
    "    all_negatives, generator = [], RandomGenerator(sampling_weights)\n",
    "    for contexts in all_contexts:\n",
    "        negatives = []\n",
    "        while len(negatives) < len(contexts) * K:\n",
    "            neg = generator.draw()\n",
    "            # Noise words cannot be context words\n",
    "            if neg not in contexts:\n",
    "                negatives.append(neg)\n",
    "        all_negatives.append(negatives)\n",
    "    return all_negatives\n",
    "\n",
    "all_negatives = get_negatives(all_contexts, vocab, counter, 5)"
   ]
  },
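  {
   "cell_type": "markdown",
   "id": "c7a1e003",
   "metadata": {},
   "source": [
    "The noise distribution used in `get_negatives` can be written out explicitly. Here $c(w)$ is notation introduced only for this note, standing for the count of word $w$ stored in `counter`; the sum runs over all vocabulary words except the unknown token. Since the weights are only relative, using counts instead of relative frequencies gives the same distribution:\n",
    "\n",
    "$$P(w) = \\frac{c(w)^{0.75}}{\\sum_{w'} c(w')^{0.75}}$$\n",
    "\n",
    "As a hypothetical illustration, a word that occurs 100 times as often as another is drawn only about $100^{0.75} \\approx 31.6$ times as often, so the exponent flattens the distribution and gives rarer words a somewhat larger share of the noise samples.\n"
   ]
  },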
  {
   "cell_type": "markdown",
   "id": "8aa17e2d",
   "metadata": {
    "origin_pos": 30
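   },
   "source": [
    "## Loading Training Examples in Minibatches\n",
    ":label:`subsec_word2vec-minibatch-loading`\n",
    "\n",
    "After all the center words together with their context words and sampled noise words are extracted, they are transformed into minibatches of examples that can be loaded iteratively during training.\n",
    "\n",
    "In a minibatch, the $i^\\mathrm{th}$ example includes a center word together with its $n_i$ context words and $m_i$ noise words. Because context window sizes vary, $n_i+m_i$ differs across $i$. Thus, for each example we concatenate its context words and noise words in the `contexts_negatives` variable, and pad with zeros until the concatenation length reaches $\\max_i (n_i+m_i)$ (`max_len`). To exclude the padding when computing the loss, we define a mask variable `masks`. There is a one-to-one correspondence between elements in `masks` and elements in `contexts_negatives`, where zeros (otherwise ones) in `masks` correspond to padding in `contexts_negatives`.\n",
    "\n",
    "To distinguish between positive and negative examples, we separate context words from noise words in `contexts_negatives` via a `labels` variable. Similar to `masks`, there is also a one-to-one correspondence between elements in `labels` and elements in `contexts_negatives`, where ones (otherwise zeros) in `labels` correspond to context words (positive examples) in `contexts_negatives`.\n",
    "\n",
    "The above idea is implemented in the following `batchify` function. Its input `data` is a list with length equal to the batch size, where each element is an example consisting of the center word `center`, its context words `context`, and its noise words `negative`. This function returns a minibatch that can be loaded for computation during training, consisting of the center words, `contexts_negatives`, `masks`, and `labels`.\n"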
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8e92a65e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:59.264970Z",
     "iopub.status.busy": "2023-08-18T07:01:59.264337Z",
     "iopub.status.idle": "2023-08-18T07:01:59.271417Z",
     "shell.execute_reply": "2023-08-18T07:01:59.270518Z"
    },
    "origin_pos": 31,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "#@save\n",
    "def batchify(data):\n",
    "    \"\"\"Return a minibatch of examples for skip-gram with negative sampling.\"\"\"\n",
    "    max_len = max(len(c) + len(n) for _, c, n in data)\n",
    "    centers, contexts_negatives, masks, labels = [], [], [], []\n",
    "    for center, context, negative in data:\n",
    "        cur_len = len(context) + len(negative)\n",
    "        centers += [center]\n",
    "        contexts_negatives += \\\n",
    "            [context + negative + [0] * (max_len - cur_len)]\n",
    "        masks += [[1] * cur_len + [0] * (max_len - cur_len)]\n",
    "        labels += [[1] * len(context) + [0] * (max_len - len(context))]\n",
    "    return (torch.tensor(centers).reshape((-1, 1)), torch.tensor(\n",
    "        contexts_negatives), torch.tensor(masks), torch.tensor(labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7aeb5c51",
   "metadata": {
    "origin_pos": 32
   },
   "source": [
    "Let us test this function using a minibatch of two examples.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "e14e34ce",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:59.276193Z",
     "iopub.status.busy": "2023-08-18T07:01:59.275387Z",
     "iopub.status.idle": "2023-08-18T07:01:59.282832Z",
     "shell.execute_reply": "2023-08-18T07:01:59.281912Z"
    },
    "origin_pos": 33,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "centers = tensor([[1],\n",
      "        [1]])\n",
      "contexts_negatives = tensor([[2, 2, 3, 3, 3, 3],\n",
      "        [2, 2, 2, 3, 3, 0]])\n",
      "masks = tensor([[1, 1, 1, 1, 1, 1],\n",
      "        [1, 1, 1, 1, 1, 0]])\n",
      "labels = tensor([[1, 1, 0, 0, 0, 0],\n",
      "        [1, 1, 1, 0, 0, 0]])\n"
     ]
    }
   ],
   "source": [
    "x_1 = (1, [2, 2], [3, 3, 3, 3])\n",
    "x_2 = (1, [2, 2, 2], [3, 3])\n",
    "batch = batchify((x_1, x_2))\n",
    "\n",
    "names = ['centers', 'contexts_negatives', 'masks', 'labels']\n",
    "for name, data in zip(names, batch):\n",
    "    print(name, '=', data)"
   ]
  },
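  {
   "cell_type": "markdown",
   "id": "c7a1e004",
   "metadata": {},
   "source": [
    "The padding in this small test can be checked by hand. Writing $n_i$ and $m_i$ for the numbers of context words and noise words of example $i$, the two examples `x_1` and `x_2` give:\n",
    "\n",
    "$$n_1 + m_1 = 2 + 4 = 6, \\qquad n_2 + m_2 = 3 + 2 = 5, \\qquad \\texttt{max\\_len} = \\max(6, 5) = 6.$$\n",
    "\n",
    "So `x_2` receives $6 - 5 = 1$ padding zero, which is the single 0 at the end of the second row of `masks`, while the first $n_i$ entries of each row of `labels` (2 and 3 ones, respectively) mark the context words.\n"
   ]
  },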
  {
   "cell_type": "markdown",
   "id": "e1eef3d8",
   "metadata": {
    "origin_pos": 34
   },
   "source": [
    "## Putting It All Together\n",
    "\n",
    "Last, we define the `load_data_ptb` function that reads the PTB dataset and returns the data iterator and the vocabulary.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "8ddfb20d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:59.287587Z",
     "iopub.status.busy": "2023-08-18T07:01:59.286823Z",
     "iopub.status.idle": "2023-08-18T07:01:59.296040Z",
     "shell.execute_reply": "2023-08-18T07:01:59.294978Z"
    },
    "origin_pos": 36,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "#@save\n",
    "def load_data_ptb(batch_size, max_window_size, num_noise_words):\n",
    "    \"\"\"Download the PTB dataset and then load it into memory.\"\"\"\n",
    "    num_workers = d2l.get_dataloader_workers()\n",
    "    sentences = read_ptb()\n",
    "    vocab = d2l.Vocab(sentences, min_freq=10)\n",
    "    subsampled, counter = subsample(sentences, vocab)\n",
    "    corpus = [vocab[line] for line in subsampled]\n",
    "    all_centers, all_contexts = get_centers_and_contexts(\n",
    "        corpus, max_window_size)\n",
    "    all_negatives = get_negatives(\n",
    "        all_contexts, vocab, counter, num_noise_words)\n",
    "\n",
    "    class PTBDataset(torch.utils.data.Dataset):\n",
    "        def __init__(self, centers, contexts, negatives):\n",
    "            assert len(centers) == len(contexts) == len(negatives)\n",
    "            self.centers = centers\n",
    "            self.contexts = contexts\n",
    "            self.negatives = negatives\n",
    "\n",
    "        def __getitem__(self, index):\n",
    "            return (self.centers[index], self.contexts[index],\n",
    "                    self.negatives[index])\n",
    "\n",
    "        def __len__(self):\n",
    "            return len(self.centers)\n",
    "\n",
    "    dataset = PTBDataset(all_centers, all_contexts, all_negatives)\n",
    "\n",
    "    data_iter = torch.utils.data.DataLoader(\n",
    "        dataset, batch_size, shuffle=True,\n",
    "        collate_fn=batchify, num_workers=num_workers)\n",
    "    return data_iter, vocab"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97991d10",
   "metadata": {
    "origin_pos": 38
   },
   "source": [
    "让我们打印数据迭代器的第一个小批量。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "5115b257",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:01:59.300574Z",
     "iopub.status.busy": "2023-08-18T07:01:59.299960Z",
     "iopub.status.idle": "2023-08-18T07:02:13.672095Z",
     "shell.execute_reply": "2023-08-18T07:02:13.671142Z"
    },
    "origin_pos": 39,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "centers shape: torch.Size([512, 1])\n",
      "contexts_negatives shape: torch.Size([512, 60])\n",
      "masks shape: torch.Size([512, 60])\n",
      "labels shape: torch.Size([512, 60])\n"
     ]
    }
   ],
   "source": [
    "data_iter, vocab = load_data_ptb(512, 5, 5)\n",
    "for batch in data_iter:\n",
    "    for name, data in zip(names, batch):\n",
    "        print(name, 'shape:', data.shape)\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfc03f54",
   "metadata": {
    "origin_pos": 40
   },
   "source": [
    "## 小结\n",
    "\n",
    "* 高频词在训练中可能不是那么有用。我们可以对他们进行下采样，以便在训练中加快速度。\n",
    "* 为了提高计算效率，我们以小批量方式加载样本。我们可以定义其他变量来区分填充标记和非填充标记，以及正例和负例。\n",
    "\n",
    "## 练习\n",
    "\n",
    "1. 如果不使用下采样，本节中代码的运行时间会发生什么变化？\n",
    "1. `RandomGenerator`类缓存`k`个随机采样结果。将`k`设置为其他值，看看它如何影响数据加载速度。\n",
    "1. 本节代码中的哪些其他超参数可能会影响数据加载速度？\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e415387",
   "metadata": {
    "origin_pos": 42,
    "tab": [
     "pytorch"
    ]
   },
   "source": [
    "[Discussions](https://discuss.d2l.ai/t/5735)\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "required_libs": []
 },
 "nbformat": 4,
 "nbformat_minor": 5
}