{ "cells": [ { "cell_type": "markdown", "id": "bba5a16c", "metadata": { "origin_pos": 0 }, "source": [ "# Sentiment Analysis and the Dataset\n", ":label:`sec_sentiment`\n", "\n", "With the rapid growth of online social media and review platforms, a plethora of opinionated data has been logged, bearing great potential for supporting decision-making processes.\n", "*Sentiment analysis* studies people's sentiments \"hidden\" in their produced text,\n", "such as product reviews, blog comments, and forum discussions.\n", "It enjoys wide applications in fields as diverse as politics (e.g., analysis of public sentiment toward policies),\n", "finance (e.g., analysis of market sentiment), and marketing (e.g., product research and brand management).\n", "\n", "Since sentiments can be categorized as discrete polarities or scales (e.g., positive and negative), we can consider sentiment analysis as a text classification task, which transforms a varying-length text sequence into a fixed-length text category. In this chapter, we will use Stanford's [large movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) for sentiment analysis. It consists of a training set and a test set, each containing 25000 movie reviews downloaded from IMDb. In both sets, the numbers of \"positive\" and \"negative\" labels are equal, indicating different sentiment polarities.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "7822039c", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:04:17.696417Z", "iopub.status.busy": "2023-08-18T07:04:17.695782Z", "iopub.status.idle": "2023-08-18T07:04:19.693903Z", "shell.execute_reply": "2023-08-18T07:04:19.692968Z" }, "origin_pos": 2, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "import os\n", "import torch\n", "from torch import nn\n", "from d2l import torch as d2l" ] }, { "cell_type": "markdown", "id": "76c1daa2", "metadata": { "origin_pos": 4 }, "source": [ "## Reading the Dataset\n", "\n", "First, download and extract the IMDb review dataset in the path `../data/aclImdb`.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "831081fb", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:04:19.698054Z", "iopub.status.busy": "2023-08-18T07:04:19.697364Z", "iopub.status.idle": "2023-08-18T07:04:42.609194Z", "shell.execute_reply": "2023-08-18T07:04:42.607873Z" }, "origin_pos": 5, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "#@save\n", "d2l.DATA_HUB['aclImdb'] = (\n", "    'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',\n", "    '01ada507287d82875905620988597833ad4e0903')\n", "\n", "data_dir = d2l.download_extract('aclImdb', 'aclImdb')" ] }, { "cell_type": "markdown", "id": "a376611c", "metadata": { "origin_pos": 6 }, "source": [ "Next, read the training and test datasets. Each example is a review together with its label: 1 for \"positive\" and 0 for \"negative\".\n"
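, "\n", "The extracted archive stores one review per text file, grouped into the folders `train/pos`, `train/neg`, `test/pos`, and `test/neg`. As a minimal sketch of this folder-to-label convention (assuming the standard `aclImdb` layout; `imdb_files` is a hypothetical helper, not part of `d2l`), one could first enumerate the file paths and labels without reading any file contents:\n", "\n", "```python\n", "import os\n", "\n", "def imdb_files(data_dir, split='train'):\n", "    \"\"\"List (file path, label) pairs; returns [] if the split is absent.\"\"\"\n", "    pairs = []\n", "    # 'pos' reviews are labeled 1, 'neg' reviews 0\n", "    for folder, label in (('pos', 1), ('neg', 0)):\n", "        path = os.path.join(data_dir, split, folder)\n", "        if os.path.isdir(path):\n", "            pairs.extend((os.path.join(path, f), label)\n", "                         for f in os.listdir(path))\n", "    return pairs\n", "```\n"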
] }, { "cell_type": "code", "execution_count": 3, "id": "4d08a828", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:04:42.614109Z", "iopub.status.busy": "2023-08-18T07:04:42.613148Z", "iopub.status.idle": "2023-08-18T07:04:43.353563Z", "shell.execute_reply": "2023-08-18T07:04:43.352484Z" }, "origin_pos": 7, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# of training examples: 25000\n", "label: 1 review: Zentropa has much in common with The Third Man, another noir\n", "label: 1 review: Zentropa is the most original movie I've seen in years. If y\n", "label: 1 review: Lars Von Trier is never backward in trying out new technique\n" ] } ], "source": [ "#@save\n", "def read_imdb(data_dir, is_train):\n", "    \"\"\"Read the IMDb review dataset text sequences and labels\"\"\"\n", "    data, labels = [], []\n", "    for label in ('pos', 'neg'):\n", "        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',\n", "                                   label)\n", "        for file in os.listdir(folder_name):\n", "            with open(os.path.join(folder_name, file), 'rb') as f:\n", "                review = f.read().decode('utf-8').replace('\\n', '')\n", "                data.append(review)\n", "                labels.append(1 if label == 'pos' else 0)\n", "    return data, labels\n", "\n", "train_data = read_imdb(data_dir, is_train=True)\n", "print('# of training examples:', len(train_data[0]))\n", "for x, y in zip(train_data[0][:3], train_data[1][:3]):\n", "    print('label:', y, 'review:', x[0:60])" ] },
{ "cell_type": "markdown", "id": "35e114e6", "metadata": { "origin_pos": 8 }, "source": [ "## Preprocessing the Dataset\n", "\n", "Treating each word as a token and filtering out words that appear fewer than 5 times, we create a vocabulary from the training dataset.\n" ] },
{ "cell_type": "code", "execution_count": 4, "id": "b833b646", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:04:43.358797Z", "iopub.status.busy": "2023-08-18T07:04:43.358266Z", "iopub.status.idle": "2023-08-18T07:04:46.339449Z", "shell.execute_reply": "2023-08-18T07:04:46.338553Z" }, "origin_pos": 9, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "train_tokens = d2l.tokenize(train_data[0], token='word')\n", "vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])" ] },
{ "cell_type": "markdown", "id": "6592cc46", "metadata": { "origin_pos": 10 }, "source": [ "After tokenization, let's plot the histogram of review lengths in tokens.\n" ] },
{ "cell_type": "code", "execution_count": 5, "id": "ca2ed7c7", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:04:46.343348Z", "iopub.status.busy": "2023-08-18T07:04:46.343069Z", "iopub.status.idle": "2023-08-18T07:04:46.663216Z", "shell.execute_reply": "2023-08-18T07:04:46.662099Z" }, "origin_pos": 11, "tab": [ "pytorch" ] }, "outputs": [ { "data": { "text/plain": [ "<Figure size 350x250 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "d2l.set_figsize()\n", "d2l.plt.xlabel('# tokens per review')\n", "d2l.plt.ylabel('count')\n", "d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));" ] },
{ "cell_type": "markdown", "id": "4b5faa2c", "metadata": { "origin_pos": 12 }, "source": [ "As we expected, the reviews have varying lengths. To process a minibatch of such reviews at a time, we set the length of each review to 500 with truncation and padding, which is similar to the preprocessing step for the machine translation dataset in :numref:`sec_machine_translation`.\n" ] },
{ "cell_type": "code", "execution_count": 6, "id": "2d5d1601", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:04:46.667504Z", "iopub.status.busy": "2023-08-18T07:04:46.666759Z", "iopub.status.idle": "2023-08-18T07:04:53.619587Z", "shell.execute_reply": "2023-08-18T07:04:53.618556Z" }, "origin_pos": 13, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([25000, 500])\n" ] } ], "source": [ "num_steps = 500  # sequence length\n", "train_features = torch.tensor([d2l.truncate_pad(\n", "    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])\n", "print(train_features.shape)" ] },
{ "cell_type": "markdown", "id": "dca33759", "metadata": { "origin_pos": 14 }, "source": [ "## Creating Data Iterators\n", "\n", "Now we can create data iterators. At each iteration, a minibatch of examples is returned.\n" ] },
{ "cell_type": "code", "execution_count": 7, "id": "454154e6", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:04:53.625971Z", "iopub.status.busy": "2023-08-18T07:04:53.624962Z", "iopub.status.idle": "2023-08-18T07:04:53.662071Z", "shell.execute_reply": "2023-08-18T07:04:53.660909Z" }, "origin_pos": 16, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X: torch.Size([64, 500]) , y: torch.Size([64])\n", "# of minibatches: 391\n" ] } ], "source": [ "train_iter = d2l.load_array((train_features,\n", "                             torch.tensor(train_data[1])), 64)\n", "\n", "for X, y in train_iter:\n", "    print('X:', X.shape, ', y:', y.shape)\n", "    break\n", "print('# of minibatches:', len(train_iter))" ] },
{ "cell_type": "markdown", "id": "42b492d4", "metadata": { "origin_pos": 18 }, "source": [ "## Putting It All Together\n", "\n", "Lastly, we wrap up the above steps into the `load_data_imdb` function. It returns training and test data iterators and the vocabulary of the IMDb review dataset.\n" ] },
{ "cell_type": "code", "execution_count": 8, "id": "8dd551a9", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:04:53.666983Z", "iopub.status.busy": "2023-08-18T07:04:53.666388Z", "iopub.status.idle": "2023-08-18T07:04:53.677743Z", "shell.execute_reply": "2023-08-18T07:04:53.676460Z" }, "origin_pos": 20, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "#@save\n", "def load_data_imdb(batch_size, num_steps=500):\n", "    \"\"\"Return data iterators and the vocabulary of the IMDb review dataset\"\"\"\n", "    data_dir = d2l.download_extract('aclImdb', 'aclImdb')\n", "    train_data = read_imdb(data_dir, True)\n", "    test_data = read_imdb(data_dir, False)\n", "    train_tokens = d2l.tokenize(train_data[0], token='word')\n", "    test_tokens = d2l.tokenize(test_data[0], token='word')\n", "    vocab = d2l.Vocab(train_tokens, min_freq=5)\n", "    train_features = torch.tensor([d2l.truncate_pad(\n", "        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])\n", "    test_features = torch.tensor([d2l.truncate_pad(\n", "        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])\n", "    train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),\n", "                                batch_size)\n", "    test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),\n", "                               batch_size,\n", "                               is_train=False)\n", "    return train_iter, test_iter, vocab" ] },
{ "cell_type": "markdown", "id": "ead6677a", "metadata": { "origin_pos": 22 }, "source": [ "## Summary\n", "\n", "* Sentiment analysis studies people's sentiments in their produced text, which can be considered a text classification problem that transforms a varying-length text sequence into a fixed-length text category.\n", "* After preprocessing, we can load the IMDb review dataset into data iterators with a vocabulary.\n", "\n", "## Exercises\n", "\n", "1. What hyperparameters in this section can we modify to accelerate training sentiment analysis models?\n", "1. Implement a function to load the dataset of [Amazon reviews](https://snap.stanford.edu/data/web-Amazon.html) into data iterators for sentiment analysis.\n" ] },
{ "cell_type": "markdown", "id": "0a0b32b5", "metadata": { "origin_pos": 24, "tab": [ "pytorch" ] }, "source": [ "[Discussions](https://discuss.d2l.ai/t/5726)\n" ] } ], "metadata": { "language_info": { "name": "python" }, "required_libs": [] }, "nbformat": 4, "nbformat_minor": 5 }