303 lines
8.5 KiB
Plaintext
303 lines
8.5 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ab73852c",
|
||
"metadata": {
|
||
"origin_pos": 0
|
||
},
|
||
"source": [
|
||
"# 数据预处理\n",
|
||
":label:`sec_pandas`\n",
|
||
"\n",
|
||
"为了能用深度学习来解决现实世界的问题,我们经常从预处理原始数据开始,\n",
|
||
"而不是从那些准备好的张量格式数据开始。\n",
|
||
"在Python中常用的数据分析工具中,我们通常使用`pandas`软件包。\n",
|
||
"像庞大的Python生态系统中的许多其他扩展包一样,`pandas`可以与张量兼容。\n",
|
||
"本节我们将简要介绍使用`pandas`预处理原始数据,并将原始数据转换为张量格式的步骤。\n",
|
||
"后面的章节将介绍更多的数据预处理技术。\n",
|
||
"\n",
|
||
"## 读取数据集\n",
|
||
"\n",
|
||
"举一个例子,我们首先(**创建一个人工数据集,并存储在CSV(逗号分隔值)文件**)\n",
|
||
"`../data/house_tiny.csv`中。\n",
|
||
"以其他格式存储的数据也可以通过类似的方式进行处理。\n",
|
||
"下面我们将数据集按行写入CSV文件中。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"id": "ee72fd16",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2023-08-18T07:03:38.903209Z",
|
||
"iopub.status.busy": "2023-08-18T07:03:38.902351Z",
|
||
"iopub.status.idle": "2023-08-18T07:03:38.918117Z",
|
||
"shell.execute_reply": "2023-08-18T07:03:38.916775Z"
|
||
},
|
||
"origin_pos": 1,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import os\n",
|
||
"\n",
|
||
"os.makedirs(os.path.join('..', 'data'), exist_ok=True)\n",
|
||
"data_file = os.path.join('..', 'data', 'house_tiny.csv')\n",
|
||
"with open(data_file, 'w') as f:\n",
|
||
" f.write('NumRooms,Alley,Price\\n') # 列名\n",
|
||
" f.write('NA,Pave,127500\\n') # 每行表示一个数据样本\n",
|
||
" f.write('2,NA,106000\\n')\n",
|
||
" f.write('4,NA,178100\\n')\n",
|
||
" f.write('NA,NA,140000\\n')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f5be7568",
|
||
"metadata": {
|
||
"origin_pos": 2
|
||
},
|
||
"source": [
|
||
"要[**从创建的CSV文件中加载原始数据集**],我们导入`pandas`包并调用`read_csv`函数。该数据集有四行三列。其中每行描述了房间数量(“NumRooms”)、巷子类型(“Alley”)和房屋价格(“Price”)。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"id": "5fb16e52",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2023-08-18T07:03:38.923957Z",
|
||
"iopub.status.busy": "2023-08-18T07:03:38.923101Z",
|
||
"iopub.status.idle": "2023-08-18T07:03:39.372116Z",
|
||
"shell.execute_reply": "2023-08-18T07:03:39.371151Z"
|
||
},
|
||
"origin_pos": 3,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" NumRooms Alley Price\n",
|
||
"0 NaN Pave 127500\n",
|
||
"1 2.0 NaN 106000\n",
|
||
"2 4.0 NaN 178100\n",
|
||
"3 NaN NaN 140000\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 如果没有安装pandas,只需取消对以下行的注释来安装pandas\n",
|
||
"# !pip install pandas\n",
|
||
"import pandas as pd\n",
|
||
"\n",
|
||
"data = pd.read_csv(data_file)\n",
|
||
"print(data)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "30188bf5",
|
||
"metadata": {
|
||
"origin_pos": 4
|
||
},
|
||
"source": [
|
||
"## 处理缺失值\n",
|
||
"\n",
|
||
"注意,“NaN”项代表缺失值。\n",
|
||
"[**为了处理缺失的数据,典型的方法包括*插值法*和*删除法*,**]\n",
|
||
"其中插值法用一个替代值弥补缺失值,而删除法则直接忽略缺失值。\n",
|
||
"在(**这里,我们将考虑插值法**)。\n",
|
||
"\n",
|
||
"通过位置索引`iloc`,我们将`data`分成`inputs`和`outputs`,\n",
|
||
"其中前者为`data`的前两列,而后者为`data`的最后一列。\n",
|
||
"对于`inputs`中缺少的数值,我们用同一列的均值替换“NaN”项。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"id": "d460a301",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2023-08-18T07:03:39.375828Z",
|
||
"iopub.status.busy": "2023-08-18T07:03:39.375535Z",
|
||
"iopub.status.idle": "2023-08-18T07:03:39.389220Z",
|
||
"shell.execute_reply": "2023-08-18T07:03:39.387998Z"
|
||
},
|
||
"origin_pos": 5,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" NumRooms Alley\n",
|
||
"0 3.0 Pave\n",
|
||
"1 2.0 NaN\n",
|
||
"2 4.0 NaN\n",
|
||
"3 3.0 NaN\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]\n",
|
||
"inputs = inputs.fillna(inputs.mean())\n",
|
||
"print(inputs)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "eae762a4",
|
||
"metadata": {
|
||
"origin_pos": 6
|
||
},
|
||
"source": [
|
||
"[**对于`inputs`中的类别值或离散值,我们将“NaN”视为一个类别。**]\n",
|
||
"由于“巷子类型”(“Alley”)列只接受两种类型的类别值“Pave”和“NaN”,\n",
|
||
"`pandas`可以自动将此列转换为两列“Alley_Pave”和“Alley_nan”。\n",
|
||
"巷子类型为“Pave”的行会将“Alley_Pave”的值设置为1,“Alley_nan”的值设置为0。\n",
|
||
"缺少巷子类型的行会将“Alley_Pave”和“Alley_nan”分别设置为0和1。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "09ab8738",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2023-08-18T07:03:39.394176Z",
|
||
"iopub.status.busy": "2023-08-18T07:03:39.393444Z",
|
||
"iopub.status.idle": "2023-08-18T07:03:39.409892Z",
|
||
"shell.execute_reply": "2023-08-18T07:03:39.408559Z"
|
||
},
|
||
"origin_pos": 7,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" NumRooms Alley_Pave Alley_nan\n",
|
||
"0 3.0 1 0\n",
|
||
"1 2.0 0 1\n",
|
||
"2 4.0 0 1\n",
|
||
"3 3.0 0 1\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"inputs = pd.get_dummies(inputs, dummy_na=True)\n",
|
||
"print(inputs)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ea1dd875",
|
||
"metadata": {
|
||
"origin_pos": 8
|
||
},
|
||
"source": [
|
||
"## 转换为张量格式\n",
|
||
"\n",
|
||
"[**现在`inputs`和`outputs`中的所有条目都是数值类型,它们可以转换为张量格式。**]\n",
|
||
"当数据采用张量格式后,可以通过在 :numref:`sec_ndarray`中引入的那些张量函数来进一步操作。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"id": "4f551c6d",
|
||
"metadata": {
|
||
"execution": {
|
||
"iopub.execute_input": "2023-08-18T07:03:39.414531Z",
|
||
"iopub.status.busy": "2023-08-18T07:03:39.413831Z",
|
||
"iopub.status.idle": "2023-08-18T07:03:40.467689Z",
|
||
"shell.execute_reply": "2023-08-18T07:03:40.466637Z"
|
||
},
|
||
"origin_pos": 10,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(tensor([[3., 1., 0.],\n",
|
||
" [2., 0., 1.],\n",
|
||
" [4., 0., 1.],\n",
|
||
" [3., 0., 1.]], dtype=torch.float64),\n",
|
||
" tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"import torch\n",
|
||
"\n",
|
||
"X = torch.tensor(inputs.to_numpy(dtype=float))\n",
|
||
"y = torch.tensor(outputs.to_numpy(dtype=float))\n",
|
||
"X, y"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "dbcbca0d",
|
||
"metadata": {
|
||
"origin_pos": 13
|
||
},
|
||
"source": [
|
||
"## 小结\n",
|
||
"\n",
|
||
"* `pandas`软件包是Python中常用的数据分析工具中,`pandas`可以与张量兼容。\n",
|
||
"* 用`pandas`处理缺失的数据时,我们可根据情况选择用插值法和删除法。\n",
|
||
"\n",
|
||
"## 练习\n",
|
||
"\n",
|
||
"创建包含更多行和列的原始数据集。\n",
|
||
"\n",
|
||
"1. 删除缺失值最多的列。\n",
|
||
"2. 将预处理后的数据集转换为张量格式。\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7b8c6c96",
|
||
"metadata": {
|
||
"origin_pos": 15,
|
||
"tab": [
|
||
"pytorch"
|
||
]
|
||
},
|
||
"source": [
|
||
"[Discussions](https://discuss.d2l.ai/t/1750)\n"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"language_info": {
|
||
"name": "python"
|
||
},
|
||
"required_libs": []
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
} |