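{
"cells": [
{
"cell_type": "markdown",
"id": "bba5a16c",
"metadata": {
"origin_pos": 0
},
"source": [
"# Sentiment Analysis and the Dataset\n",
":label:`sec_sentiment`\n",
"\n",
"With the rapid growth of online social media and review platforms, a vast amount of opinionated data has been recorded, and it holds great potential for supporting decision making.\n",
"*Sentiment analysis* studies the sentiments that people express, often implicitly,\n",
"in the text they produce, such as product reviews, blog comments, and forum discussions.\n",
"It is widely applied in politics (e.g., analyzing public sentiment toward policies),\n",
"finance (e.g., analyzing market sentiment), and marketing (e.g., product research and brand management).\n",
"\n",
"Since sentiment can be categorized into discrete polarities or scales (e.g., positive and negative), we can treat sentiment analysis as a text classification task, which maps a variable-length text sequence to a fixed-length text category. In this chapter, we will use Stanford's [large movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) for sentiment analysis. It consists of a training set and a test set, each containing 25,000 movie reviews downloaded from IMDb. In both datasets, the numbers of \"positive\" and \"negative\" labels are equal, and the labels indicate opposite sentiment polarities.\n"
]
},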
{
"cell_type": "code",
"execution_count": 1,
"id": "7822039c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:04:17.696417Z",
"iopub.status.busy": "2023-08-18T07:04:17.695782Z",
"iopub.status.idle": "2023-08-18T07:04:19.693903Z",
"shell.execute_reply": "2023-08-18T07:04:19.692968Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import os\n",
"import torch\n",
"from torch import nn\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "markdown",
"id": "76c1daa2",
"metadata": {
"origin_pos": 4
},
"source": [
"## Reading the Dataset\n",
"\n",
"First, download and extract the IMDb review dataset into the path `../data/aclImdb`.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "831081fb",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:04:19.698054Z",
"iopub.status.busy": "2023-08-18T07:04:19.697364Z",
"iopub.status.idle": "2023-08-18T07:04:42.609194Z",
"shell.execute_reply": "2023-08-18T07:04:42.607873Z"
},
"origin_pos": 5,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"#@save\n",
"d2l.DATA_HUB['aclImdb'] = (\n",
"    'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',\n",
"    '01ada507287d82875905620988597833ad4e0903')\n",
"\n",
"# Download the archive (checked against the SHA-1 hash above) and extract it\n",
"data_dir = d2l.download_extract('aclImdb', 'aclImdb')"
]
},
{
"cell_type": "markdown",
"id": "a376611c",
"metadata": {
"origin_pos": 6
},
"source": [
"Next, read the training and test datasets. Each example is a review and its label: 1 for \"positive\" and 0 for \"negative\".\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4d08a828",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:04:42.614109Z",
"iopub.status.busy": "2023-08-18T07:04:42.613148Z",
"iopub.status.idle": "2023-08-18T07:04:43.353563Z",
"shell.execute_reply": "2023-08-18T07:04:43.352484Z"
},
"origin_pos": 7,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# training examples: 25000\n",
"label: 1 review: Zentropa has much in common with The Third Man, another noir\n",
"label: 1 review: Zentropa is the most original movie I've seen in years. If y\n",
"label: 1 review: Lars Von Trier is never backward in trying out new technique\n"
]
}
],
"source": [
"#@save\n",
"def read_imdb(data_dir, is_train):\n",
"    \"\"\"Read the IMDb review dataset text sequences and labels.\"\"\"\n",
"    data, labels = [], []\n",
"    for label in ('pos', 'neg'):\n",
"        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',\n",
"                                   label)\n",
"        for file in os.listdir(folder_name):\n",
"            with open(os.path.join(folder_name, file), 'rb') as f:\n",
"                review = f.read().decode('utf-8').replace('\\n', '')\n",
"                data.append(review)\n",
"                labels.append(1 if label == 'pos' else 0)\n",
"    return data, labels\n",
"\n",
"train_data = read_imdb(data_dir, is_train=True)\n",
"print('# training examples:', len(train_data[0]))\n",
"for x, y in zip(train_data[0][:3], train_data[1][:3]):\n",
"    print('label:', y, 'review:', x[0:60])"
]
},
{
"cell_type": "markdown",
"id": "35e114e6",
"metadata": {
"origin_pos": 8
},
"source": [
"## Preprocessing the Dataset\n",
"\n",
"Treating each word as a token and filtering out words that appear fewer than 5 times, we create a vocabulary from the training dataset.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b833b646",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:04:43.358797Z",
"iopub.status.busy": "2023-08-18T07:04:43.358266Z",
"iopub.status.idle": "2023-08-18T07:04:46.339449Z",
"shell.execute_reply": "2023-08-18T07:04:46.338553Z"
},
"origin_pos": 9,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"train_tokens = d2l.tokenize(train_data[0], token='word')\n",
"vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])"
]
},
{
"cell_type": "markdown",
"id": "6592cc46",
"metadata": {
"origin_pos": 10
},
"source": [
"After tokenization, let us plot a histogram of review lengths measured in tokens.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ca2ed7c7",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:04:46.343348Z",
"iopub.status.busy": "2023-08-18T07:04:46.343069Z",
"iopub.status.idle": "2023-08-18T07:04:46.663216Z",
"shell.execute_reply": "2023-08-18T07:04:46.662099Z"
},
"origin_pos": 11,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"255.828125pt\" height=\"180.65625pt\" viewBox=\"0 0 255.828125 180.65625\" version=\"1.1\">\n",
"  <title>Histogram of review lengths</title>\n",
"  <desc>Figure placeholder: x-axis \"# tokens per review\" (ticks 0 to 800), y-axis \"count\" (ticks 0 to 6000).</desc>\n",
"</svg>\n"
|
||
],
|
||
"text/plain": [
|
||
"<Figure size 252x180 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {
|
||
"needs_background": "light"
|
||
},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"d2l.set_figsize()\n",
|
||
"d2l.plt.xlabel('# tokens per review')\n",
|
||
"d2l.plt.ylabel('count')\n",
|
||
"d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4b5faa2c",
|
||
"metadata": {
|
||
"origin_pos": 12
|
||
},
|
||
"source": [
|
||
"正如我们所料,评论的长度各不相同。为了每次处理一小批量这样的评论,我们通过截断和填充将每个评论的长度设置为500。这类似于 :numref:`sec_machine_translation`中对机器翻译数据集的预处理步骤。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"id": "2d5d1601",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2023-08-18T07:04:46.667504Z",
|
||
"iopub.status.busy": "2023-08-18T07:04:46.666759Z",
|
||
"iopub.status.idle": "2023-08-18T07:04:53.619587Z",
|
||
"shell.execute_reply": "2023-08-18T07:04:53.618556Z"
|
||
},
|
||
"origin_pos": 13,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"torch.Size([25000, 500])\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"num_steps = 500 # 序列长度\n",
|
||
"train_features = torch.tensor([d2l.truncate_pad(\n",
|
||
" vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])\n",
|
||
"print(train_features.shape)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "dca33759",
|
||
"metadata": {
|
||
"origin_pos": 14
|
||
},
|
||
"source": [
|
||
"## 创建数据迭代器\n",
|
||
"\n",
|
||
"现在我们可以创建数据迭代器了。在每次迭代中,都会返回一小批量样本。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"id": "454154e6",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2023-08-18T07:04:53.625971Z",
|
||
"iopub.status.busy": "2023-08-18T07:04:53.624962Z",
|
||
"iopub.status.idle": "2023-08-18T07:04:53.662071Z",
|
||
"shell.execute_reply": "2023-08-18T07:04:53.660909Z"
|
||
},
|
||
"origin_pos": 16,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"X: torch.Size([64, 500]) , y: torch.Size([64])\n",
|
||
"小批量数目: 391\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"train_iter = d2l.load_array((train_features,\n",
|
||
" torch.tensor(train_data[1])), 64)\n",
|
||
"\n",
|
||
"for X, y in train_iter:\n",
|
||
" print('X:', X.shape, ', y:', y.shape)\n",
|
||
" break\n",
|
||
"print('小批量数目:', len(train_iter))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "42b492d4",
|
||
"metadata": {
|
||
"origin_pos": 18
|
||
},
|
||
"source": [
|
||
"## 整合代码\n",
|
||
"\n",
|
||
"最后,我们将上述步骤封装到`load_data_imdb`函数中。它返回训练和测试数据迭代器以及IMDb评论数据集的词表。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"id": "8dd551a9",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2023-08-18T07:04:53.666983Z",
|
||
"iopub.status.busy": "2023-08-18T07:04:53.666388Z",
|
||
"iopub.status.idle": "2023-08-18T07:04:53.677743Z",
|
||
"shell.execute_reply": "2023-08-18T07:04:53.676460Z"
|
||
},
|
||
"origin_pos": 20,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#@save\n",
|
||
"def load_data_imdb(batch_size, num_steps=500):\n",
|
||
" \"\"\"返回数据迭代器和IMDb评论数据集的词表\"\"\"\n",
|
||
" data_dir = d2l.download_extract('aclImdb', 'aclImdb')\n",
|
||
" train_data = read_imdb(data_dir, True)\n",
|
||
" test_data = read_imdb(data_dir, False)\n",
|
||
" train_tokens = d2l.tokenize(train_data[0], token='word')\n",
|
||
" test_tokens = d2l.tokenize(test_data[0], token='word')\n",
|
||
" vocab = d2l.Vocab(train_tokens, min_freq=5)\n",
|
||
" train_features = torch.tensor([d2l.truncate_pad(\n",
|
||
" vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])\n",
|
||
" test_features = torch.tensor([d2l.truncate_pad(\n",
|
||
" vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])\n",
|
||
" train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),\n",
|
||
" batch_size)\n",
|
||
" test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),\n",
|
||
" batch_size,\n",
|
||
" is_train=False)\n",
|
||
" return train_iter, test_iter, vocab"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ead6677a",
|
||
"metadata": {
|
||
"origin_pos": 22
|
||
},
|
||
"source": [
|
||
"## 小结\n",
|
||
"\n",
|
||
"* 情感分析研究人们在文本中的情感,这被认为是一个文本分类问题,它将可变长度的文本序列进行转换转换为固定长度的文本类别。\n",
|
||
"* 经过预处理后,我们可以使用词表将IMDb评论数据集加载到数据迭代器中。\n",
|
||
"\n",
|
||
"## 练习\n",
|
||
"\n",
|
||
"1. 我们可以修改本节中的哪些超参数来加速训练情感分析模型?\n",
|
||
"1. 请实现一个函数来将[Amazon reviews](https://snap.stanford.edu/data/web-Amazon.html)的数据集加载到数据迭代器中进行情感分析。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0a0b32b5",
|
||
"metadata": {
|
||
"origin_pos": 24,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"source": [
|
||
"[Discussions](https://discuss.d2l.ai/t/5726)\n"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"language_info": {
|
||
"name": "python"
|
||
},
|
||
"required_libs": []
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
} |