Python/d2l/d2l-zh/pytorch/chapter_preliminaries/pandas.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ab73852c",
   "metadata": {
    "origin_pos": 0
   },
   "source": [
    "# 数据预处理\n",
    ":label:`sec_pandas`\n",
    "\n",
    "为了能用深度学习来解决现实世界的问题，我们经常从预处理原始数据开始，\n",
    "而不是从那些准备好的张量格式数据开始。\n",
    "在Python中常用的数据分析工具中，我们通常使用`pandas`软件包。\n",
    "像庞大的Python生态系统中的许多其他扩展包一样，`pandas`可以与张量兼容。\n",
    "本节我们将简要介绍使用`pandas`预处理原始数据，并将原始数据转换为张量格式的步骤。\n",
    "后面的章节将介绍更多的数据预处理技术。\n",
    "\n",
    "## 读取数据集\n",
    "\n",
    "举一个例子，我们首先(**创建一个人工数据集，并存储在CSV（逗号分隔值）文件**)\n",
    "`../data/house_tiny.csv`中。\n",
    "以其他格式存储的数据也可以通过类似的方式进行处理。\n",
    "下面我们将数据集按行写入CSV文件中。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ee72fd16",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:03:38.903209Z",
     "iopub.status.busy": "2023-08-18T07:03:38.902351Z",
     "iopub.status.idle": "2023-08-18T07:03:38.918117Z",
     "shell.execute_reply": "2023-08-18T07:03:38.916775Z"
    },
    "origin_pos": 1,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.makedirs(os.path.join('..', 'data'), exist_ok=True)\n",
    "data_file = os.path.join('..', 'data', 'house_tiny.csv')\n",
    "with open(data_file, 'w') as f:\n",
    "    f.write('NumRooms,Alley,Price\\n')  # 列名\n",
    "    f.write('NA,Pave,127500\\n')  # 每行表示一个数据样本\n",
    "    f.write('2,NA,106000\\n')\n",
    "    f.write('4,NA,178100\\n')\n",
    "    f.write('NA,NA,140000\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5be7568",
   "metadata": {
    "origin_pos": 2
   },
   "source": [
    "要[**从创建的CSV文件中加载原始数据集**]，我们导入`pandas`包并调用`read_csv`函数。该数据集有四行三列。其中每行描述了房间数量（“NumRooms”）、巷子类型（“Alley”）和房屋价格（“Price”）。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5fb16e52",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:03:38.923957Z",
     "iopub.status.busy": "2023-08-18T07:03:38.923101Z",
     "iopub.status.idle": "2023-08-18T07:03:39.372116Z",
     "shell.execute_reply": "2023-08-18T07:03:39.371151Z"
    },
    "origin_pos": 3,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   NumRooms Alley   Price\n",
      "0       NaN  Pave  127500\n",
      "1       2.0   NaN  106000\n",
      "2       4.0   NaN  178100\n",
      "3       NaN   NaN  140000\n"
     ]
    }
   ],
   "source": [
    "# 如果没有安装pandas，只需取消对以下行的注释来安装pandas\n",
    "# !pip install pandas\n",
    "import pandas as pd\n",
    "\n",
    "data = pd.read_csv(data_file)\n",
    "print(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30188bf5",
   "metadata": {
    "origin_pos": 4
   },
   "source": [
    "## 处理缺失值\n",
    "\n",
    "注意，“NaN”项代表缺失值。\n",
    "[**为了处理缺失的数据，典型的方法包括*插值法*和*删除法*，**]\n",
    "其中插值法用一个替代值弥补缺失值，而删除法则直接忽略缺失值。\n",
    "在(**这里，我们将考虑插值法**)。\n",
    "\n",
    "通过位置索引`iloc`，我们将`data`分成`inputs`和`outputs`，\n",
    "其中前者为`data`的前两列，而后者为`data`的最后一列。\n",
    "对于`inputs`中缺少的数值，我们用同一列的均值替换“NaN”项。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d460a301",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:03:39.375828Z",
     "iopub.status.busy": "2023-08-18T07:03:39.375535Z",
     "iopub.status.idle": "2023-08-18T07:03:39.389220Z",
     "shell.execute_reply": "2023-08-18T07:03:39.387998Z"
    },
    "origin_pos": 5,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   NumRooms Alley\n",
      "0       3.0  Pave\n",
      "1       2.0   NaN\n",
      "2       4.0   NaN\n",
      "3       3.0   NaN\n"
     ]
    }
   ],
   "source": [
    "inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]\n",
    "inputs = inputs.fillna(inputs.mean())\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eae762a4",
   "metadata": {
    "origin_pos": 6
   },
   "source": [
    "[**对于`inputs`中的类别值或离散值，我们将“NaN”视为一个类别。**]\n",
    "由于“巷子类型”（“Alley”）列只接受两种类型的类别值“Pave”和“NaN”，\n",
    "`pandas`可以自动将此列转换为两列“Alley_Pave”和“Alley_nan”。\n",
    "巷子类型为“Pave”的行会将“Alley_Pave”的值设置为1，“Alley_nan”的值设置为0。\n",
    "缺少巷子类型的行会将“Alley_Pave”和“Alley_nan”分别设置为0和1。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "09ab8738",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:03:39.394176Z",
     "iopub.status.busy": "2023-08-18T07:03:39.393444Z",
     "iopub.status.idle": "2023-08-18T07:03:39.409892Z",
     "shell.execute_reply": "2023-08-18T07:03:39.408559Z"
    },
    "origin_pos": 7,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   NumRooms  Alley_Pave  Alley_nan\n",
      "0       3.0           1          0\n",
      "1       2.0           0          1\n",
      "2       4.0           0          1\n",
      "3       3.0           0          1\n"
     ]
    }
   ],
   "source": [
    "inputs = pd.get_dummies(inputs, dummy_na=True)\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea1dd875",
   "metadata": {
    "origin_pos": 8
   },
   "source": [
    "## 转换为张量格式\n",
    "\n",
    "[**现在`inputs`和`outputs`中的所有条目都是数值类型，它们可以转换为张量格式。**]\n",
    "当数据采用张量格式后，可以通过在 :numref:`sec_ndarray`中引入的那些张量函数来进一步操作。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4f551c6d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T07:03:39.414531Z",
     "iopub.status.busy": "2023-08-18T07:03:39.413831Z",
     "iopub.status.idle": "2023-08-18T07:03:40.467689Z",
     "shell.execute_reply": "2023-08-18T07:03:40.466637Z"
    },
    "origin_pos": 10,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(tensor([[3., 1., 0.],\n",
       "         [2., 0., 1.],\n",
       "         [4., 0., 1.],\n",
       "         [3., 0., 1.]], dtype=torch.float64),\n",
       " tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "X = torch.tensor(inputs.to_numpy(dtype=float))\n",
    "y = torch.tensor(outputs.to_numpy(dtype=float))\n",
    "X, y"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbcbca0d",
   "metadata": {
    "origin_pos": 13
   },
   "source": [
    "## 小结\n",
    "\n",
    "* `pandas`软件包是Python中常用的数据分析工具中，`pandas`可以与张量兼容。\n",
    "* 用`pandas`处理缺失的数据时，我们可根据情况选择用插值法和删除法。\n",
    "\n",
    "## 练习\n",
    "\n",
    "创建包含更多行和列的原始数据集。\n",
    "\n",
    "1. 删除缺失值最多的列。\n",
    "2. 将预处理后的数据集转换为张量格式。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b8c6c96",
   "metadata": {
    "origin_pos": 15,
    "tab": [
     "pytorch"
    ]
   },
   "source": [
    "[Discussions](https://discuss.d2l.ai/t/1750)\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "required_libs": []
 },
 "nbformat": 4,
 "nbformat_minor": 5
}