{ "cells": [ { "cell_type": "markdown", "id": "ab73852c", "metadata": { "origin_pos": 0 }, "source": [ "# Data Preprocessing\n", ":label:`sec_pandas`\n", "\n", "To apply deep learning to real-world problems, we often begin by preprocessing raw data\n", "rather than starting from data already arranged as tensors.\n", "Among the data analysis tools commonly used in Python, the `pandas` package is the usual choice.\n", "Like many other extension packages in the vast Python ecosystem, `pandas` works together with tensors.\n", "This section briefly walks through the steps for preprocessing raw data with `pandas` and converting it into tensor format.\n", "Later chapters cover more data preprocessing techniques.\n", "\n", "## Reading the Dataset\n", "\n", "As an example, we begin by (**creating an artificial dataset and storing it in a CSV (comma-separated values) file**)\n", "`../data/house_tiny.csv`.\n", "Data stored in other formats can be processed in a similar way.\n", "Below we write the dataset to the CSV file row by row.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "ee72fd16", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:03:38.903209Z", "iopub.status.busy": "2023-08-18T07:03:38.902351Z", "iopub.status.idle": "2023-08-18T07:03:38.918117Z", "shell.execute_reply": "2023-08-18T07:03:38.916775Z" }, "origin_pos": 1, "tab": [ "pytorch" ] }, "outputs": [], "source": [ "import os\n", "\n", "os.makedirs(os.path.join('..', 'data'), exist_ok=True)\n", "data_file = os.path.join('..', 'data', 'house_tiny.csv')\n", "with open(data_file, 'w') as f:\n", "    f.write('NumRooms,Alley,Price\\n')  # Column names\n", "    f.write('NA,Pave,127500\\n')  # Each row represents one data example\n", "    f.write('2,NA,106000\\n')\n", "    f.write('4,NA,178100\\n')\n", "    f.write('NA,NA,140000\\n')" ] }, { "cell_type": "markdown", "id": "f5be7568", "metadata": { "origin_pos": 2 }, "source": [ "To [**load the raw dataset from the CSV file we created**], we import the `pandas` package and call the `read_csv` function. The dataset has four rows and three columns, where each row describes the number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "5fb16e52", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:03:38.923957Z", "iopub.status.busy": "2023-08-18T07:03:38.923101Z", "iopub.status.idle": "2023-08-18T07:03:39.372116Z", "shell.execute_reply": "2023-08-18T07:03:39.371151Z" }, "origin_pos": 3, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " NumRooms Alley Price\n", "0 NaN Pave 127500\n", "1 2.0 NaN 106000\n", "2 
4.0 NaN 178100\n", "3 NaN NaN 140000\n" ] } ], "source": [ "# If pandas is not installed, uncomment the following line to install it\n", "# !pip install pandas\n", "import pandas as pd\n", "\n", "data = pd.read_csv(data_file)\n", "print(data)" ] }, { "cell_type": "markdown", "id": "30188bf5", "metadata": { "origin_pos": 4 }, "source": [ "## Handling Missing Data\n", "\n", "Note that “NaN” entries denote missing values.\n", "[**Typical ways to handle missing data include *imputation* and *deletion*,**]\n", "where imputation fills in missing values with substitutes, while deletion simply ignores them.\n", "(**Here we consider imputation.**)\n", "\n", "Using integer-location based indexing (`iloc`), we split `data` into `inputs` and `outputs`,\n", "where the former takes the first two columns of `data` and the latter keeps only the last column.\n", "For numerical values missing from `inputs`, we replace the “NaN” entries with the mean of the same column.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "d460a301", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:03:39.375828Z", "iopub.status.busy": "2023-08-18T07:03:39.375535Z", "iopub.status.idle": "2023-08-18T07:03:39.389220Z", "shell.execute_reply": "2023-08-18T07:03:39.387998Z" }, "origin_pos": 5, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " NumRooms Alley\n", "0 3.0 Pave\n", "1 2.0 NaN\n", "2 4.0 NaN\n", "3 3.0 NaN\n" ] } ], "source": [ "inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]\n", "# numeric_only=True skips the string column 'Alley' when computing column means;\n", "# without it, recent pandas versions raise a TypeError on non-numeric columns\n", "inputs = inputs.fillna(inputs.mean(numeric_only=True))\n", "print(inputs)" ] }, { "cell_type": "markdown", "id": "eae762a4", "metadata": { "origin_pos": 6 }, "source": [ "[**For categorical or discrete values in `inputs`, we treat “NaN” as a category.**]\n", "Since the “Alley” column takes only two categorical values, “Pave” and “NaN”,\n", "`pandas` can automatically convert this column into two columns, “Alley_Pave” and “Alley_nan”.\n", "A row whose alley type is “Pave” sets “Alley_Pave” to 1 and “Alley_nan” to 0.\n", "A row with a missing alley type sets “Alley_Pave” and “Alley_nan” to 0 and 1, respectively.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "09ab8738", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:03:39.394176Z", "iopub.status.busy": "2023-08-18T07:03:39.393444Z", "iopub.status.idle": "2023-08-18T07:03:39.409892Z", "shell.execute_reply": "2023-08-18T07:03:39.408559Z" }, "origin_pos": 7, "tab": [ "pytorch" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " NumRooms 
Alley_Pave Alley_nan\n", "0 3.0 1 0\n", "1 2.0 0 1\n", "2 4.0 0 1\n", "3 3.0 0 1\n" ] } ], "source": [ "inputs = pd.get_dummies(inputs, dummy_na=True)\n", "print(inputs)" ] }, { "cell_type": "markdown", "id": "ea1dd875", "metadata": { "origin_pos": 8 }, "source": [ "## Conversion to Tensor Format\n", "\n", "[**Now that all entries in `inputs` and `outputs` are numerical, they can be converted to tensor format.**]\n", "Once the data is in tensor format, it can be manipulated further with the tensor functions introduced in :numref:`sec_ndarray`.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "4f551c6d", "metadata": { "execution": { "iopub.execute_input": "2023-08-18T07:03:39.414531Z", "iopub.status.busy": "2023-08-18T07:03:39.413831Z", "iopub.status.idle": "2023-08-18T07:03:40.467689Z", "shell.execute_reply": "2023-08-18T07:03:40.466637Z" }, "origin_pos": 10, "tab": [ "pytorch" ] }, "outputs": [ { "data": { "text/plain": [ "(tensor([[3., 1., 0.],\n", " [2., 0., 1.],\n", " [4., 0., 1.],\n", " [3., 0., 1.]], dtype=torch.float64),\n", " tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "\n", "X = torch.tensor(inputs.to_numpy(dtype=float))\n", "y = torch.tensor(outputs.to_numpy(dtype=float))\n", "X, y" ] }, { "cell_type": "markdown", "id": "dbcbca0d", "metadata": { "origin_pos": 13 }, "source": [ "## Summary\n", "\n", "* `pandas` is one of the data analysis tools commonly used in Python, and it works together with tensors.\n", "* When handling missing data with `pandas`, we can choose between imputation and deletion depending on the situation.\n", "\n", "## Exercises\n", "\n", "Create a raw dataset with more rows and columns.\n", "\n", "1. Delete the column with the most missing values.\n", "2. Convert the preprocessed dataset to tensor format.\n" ] }, { "cell_type": "markdown", "id": "7b8c6c96", "metadata": { "origin_pos": 15, "tab": [ "pytorch" ] }, "source": [ "[Discussions](https://discuss.d2l.ai/t/1750)\n" ] } ], "metadata": { "language_info": { "name": "python" }, "required_libs": [] }, "nbformat": 4, "nbformat_minor": 5 }