2025-12-16 09:23:53 +08:00

{
"cells": [
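{
"cell_type": "markdown",
"id": "ab73852c",
"metadata": {
"origin_pos": 0
},
"source": [
"# Data Preprocessing\n",
":label:`sec_pandas`\n",
"\n",
"To apply deep learning to real-world problems, we usually start by preprocessing raw data\n",
"rather than with data already prepared in tensor format.\n",
"Among the data analysis tools commonly used in Python, the `pandas` package is a popular choice.\n",
"Like many other extension packages in Python's vast ecosystem, `pandas` works together with tensors.\n",
"This section briefly walks through preprocessing raw data with `pandas` and converting it into tensor format.\n",
"Later chapters introduce more data preprocessing techniques.\n",
"\n",
"## Reading the Dataset\n",
"\n",
"As an example, we begin by (**creating an artificial dataset and storing it in a CSV (comma-separated values) file**)\n",
"`../data/house_tiny.csv`.\n",
"Data stored in other formats can be processed in similar ways.\n",
"Below we write the dataset row by row into the CSV file.\n"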
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ee72fd16",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:03:38.903209Z",
"iopub.status.busy": "2023-08-18T07:03:38.902351Z",
"iopub.status.idle": "2023-08-18T07:03:38.918117Z",
"shell.execute_reply": "2023-08-18T07:03:38.916775Z"
},
"origin_pos": 1,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.makedirs(os.path.join('..', 'data'), exist_ok=True)\n",
"data_file = os.path.join('..', 'data', 'house_tiny.csv')\n",
"with open(data_file, 'w') as f:\n",
"    f.write('NumRooms,Alley,Price\\n')  # Column names\n",
"    f.write('NA,Pave,127500\\n')  # Each row represents a data example\n",
"    f.write('2,NA,106000\\n')\n",
"    f.write('4,NA,178100\\n')\n",
"    f.write('NA,NA,140000\\n')"
]
},
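{
"cell_type": "markdown",
"id": "f5be7568",
"metadata": {
"origin_pos": 2
},
"source": [
"To [**load the raw dataset from the created CSV file**], we import the `pandas` package and call the `read_csv` function. The dataset has four rows and three columns, where each row describes the number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house.\n"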
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5fb16e52",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:03:38.923957Z",
"iopub.status.busy": "2023-08-18T07:03:38.923101Z",
"iopub.status.idle": "2023-08-18T07:03:39.372116Z",
"shell.execute_reply": "2023-08-18T07:03:39.371151Z"
},
"origin_pos": 3,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"   NumRooms Alley   Price\n",
"0       NaN  Pave  127500\n",
"1       2.0   NaN  106000\n",
"2       4.0   NaN  178100\n",
"3       NaN   NaN  140000\n"
]
}
],
"source": [
"# If pandas is not installed, uncomment the following line to install it\n",
"# !pip install pandas\n",
"import pandas as pd\n",
"\n",
"data = pd.read_csv(data_file)\n",
"print(data)"
]
},
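{
"cell_type": "markdown",
"id": "30188bf5",
"metadata": {
"origin_pos": 4
},
"source": [
"## Handling Missing Data\n",
"\n",
"Note that “NaN” entries are missing values.\n",
"[**Typical methods for handling missing data include *imputation* and *deletion*,**]\n",
"where imputation replaces missing values with substitute values, while deletion simply ignores missing values.\n",
"(**Here, we consider imputation**).\n",
"\n",
"Using integer-location based indexing (`iloc`), we split `data` into `inputs` and `outputs`,\n",
"where the former takes the first two columns of `data` and the latter keeps only the last column.\n",
"For missing numerical values in `inputs`, we replace the “NaN” entries with the mean of the same column.\n"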
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d460a301",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:03:39.375828Z",
"iopub.status.busy": "2023-08-18T07:03:39.375535Z",
"iopub.status.idle": "2023-08-18T07:03:39.389220Z",
"shell.execute_reply": "2023-08-18T07:03:39.387998Z"
},
"origin_pos": 5,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"   NumRooms Alley\n",
"0       3.0  Pave\n",
"1       2.0   NaN\n",
"2       4.0   NaN\n",
"3       3.0   NaN\n"
]
}
],
"source": [
"inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]\n",
"inputs = inputs.fillna(inputs.mean(numeric_only=True))\n",
"print(inputs)"
]
},
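{
"cell_type": "markdown",
"id": "eae762a4",
"metadata": {
"origin_pos": 6
},
"source": [
"[**For categorical or discrete values in `inputs`, we treat “NaN” as a category.**]\n",
"Since the “Alley” column only takes two types of categorical values, “Pave” and “NaN”,\n",
"`pandas` can automatically convert this column into two columns, “Alley_Pave” and “Alley_nan”.\n",
"A row whose alley type is “Pave” sets “Alley_Pave” to 1 and “Alley_nan” to 0.\n",
"A row with a missing alley type sets “Alley_Pave” and “Alley_nan” to 0 and 1, respectively.\n"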
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "09ab8738",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:03:39.394176Z",
"iopub.status.busy": "2023-08-18T07:03:39.393444Z",
"iopub.status.idle": "2023-08-18T07:03:39.409892Z",
"shell.execute_reply": "2023-08-18T07:03:39.408559Z"
},
"origin_pos": 7,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"   NumRooms  Alley_Pave  Alley_nan\n",
"0       3.0           1          0\n",
"1       2.0           0          1\n",
"2       4.0           0          1\n",
"3       3.0           0          1\n"
]
}
],
"source": [
"inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)\n",
"print(inputs)"
]
},
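{
"cell_type": "markdown",
"id": "ea1dd875",
"metadata": {
"origin_pos": 8
},
"source": [
"## Conversion to the Tensor Format\n",
"\n",
"[**Now that all the entries in `inputs` and `outputs` are numerical, they can be converted to the tensor format.**]\n",
"Once the data are in tensor format, they can be further manipulated with the tensor functions introduced in :numref:`sec_ndarray`.\n"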
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4f551c6d",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:03:39.414531Z",
"iopub.status.busy": "2023-08-18T07:03:39.413831Z",
"iopub.status.idle": "2023-08-18T07:03:40.467689Z",
"shell.execute_reply": "2023-08-18T07:03:40.466637Z"
},
"origin_pos": 10,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"(tensor([[3., 1., 0.],\n",
"         [2., 0., 1.],\n",
"         [4., 0., 1.],\n",
"         [3., 0., 1.]], dtype=torch.float64),\n",
" tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import torch\n",
"\n",
"X = torch.tensor(inputs.to_numpy(dtype=float))\n",
"y = torch.tensor(outputs.to_numpy(dtype=float))\n",
"X, y"
]
},
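{
"cell_type": "markdown",
"id": "dbcbca0d",
"metadata": {
"origin_pos": 13
},
"source": [
"## Summary\n",
"\n",
"* `pandas` is one of the commonly used data analysis tools in Python, and it works together with tensors.\n",
"* When handling missing data with `pandas`, we can choose imputation or deletion depending on the situation.\n",
"\n",
"## Exercises\n",
"\n",
"Create a raw dataset with more rows and columns.\n",
"\n",
"1. Delete the column with the most missing values.\n",
"2. Convert the preprocessed dataset to the tensor format.\n"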
]
},
{
"cell_type": "markdown",
"id": "7b8c6c96",
"metadata": {
"origin_pos": 15,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/1750)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}