This commit is contained in:
2025-12-16 09:23:53 +08:00
parent 19138d3cc1
commit 9e7efd0626
409 changed files with 272713 additions and 241 deletions
@@ -0,0 +1,424 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "dda65809",
"metadata": {
"origin_pos": 0
},
"source": [
"# 多输入多输出通道\n",
":label:`sec_channels`\n",
"\n",
"虽然我们在 :numref:`subsec_why-conv-channels`中描述了构成每个图像的多个通道和多层卷积层。例如彩色图像具有标准的RGB通道来代表红、绿和蓝。\n",
"但是到目前为止,我们仅展示了单个输入和单个输出通道的简化例子。\n",
"这使得我们可以将输入、卷积核和输出看作二维张量。\n",
"\n",
"当我们添加通道时,我们的输入和隐藏的表示都变成了三维张量。例如,每个RGB输入图像具有$3\\times h\\times w$的形状。我们将这个大小为$3$的轴称为*通道*(channel)维度。本节将更深入地研究具有多输入和多输出通道的卷积核。\n",
"\n",
"## 多输入通道\n",
"\n",
"当输入包含多个通道时,需要构造一个与输入数据具有相同输入通道数的卷积核,以便与输入数据进行互相关运算。假设输入的通道数为$c_i$,那么卷积核的输入通道数也需要为$c_i$。如果卷积核的窗口形状是$k_h\\times k_w$,那么当$c_i=1$时,我们可以把卷积核看作形状为$k_h\\times k_w$的二维张量。\n",
"\n",
"然而,当$c_i>1$时,我们卷积核的每个输入通道将包含形状为$k_h\\times k_w$的张量。将这些张量$c_i$连结在一起可以得到形状为$c_i\\times k_h\\times k_w$的卷积核。由于输入和卷积核都有$c_i$个通道,我们可以对每个通道输入的二维张量和卷积核的二维张量进行互相关运算,再对通道求和(将$c_i$的结果相加)得到二维张量。这是多通道输入和多输入通道卷积核之间进行二维互相关运算的结果。\n",
"\n",
"在 :numref:`fig_conv_multi_in`中,我们演示了一个具有两个输入通道的二维互相关运算的示例。阴影部分是第一个输出元素以及用于计算这个输出的输入和核张量元素:$(1\\times1+2\\times2+4\\times3+5\\times4)+(0\\times0+1\\times1+3\\times2+4\\times3)=56$。\n",
"\n",
"![两个输入通道的互相关计算。](../img/conv-multi-in.svg)\n",
":label:`fig_conv_multi_in`\n",
"\n",
"为了加深理解,我们(**实现一下多输入通道互相关运算**)。\n",
"简而言之,我们所做的就是对每个通道执行互相关操作,然后将结果相加。\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "412ea0b9",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:36.340241Z",
"iopub.status.busy": "2023-08-18T07:02:36.339505Z",
"iopub.status.idle": "2023-08-18T07:02:38.335558Z",
"shell.execute_reply": "2023-08-18T07:02:38.334349Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "0cff24d4",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:38.339612Z",
"iopub.status.busy": "2023-08-18T07:02:38.339031Z",
"iopub.status.idle": "2023-08-18T07:02:38.344485Z",
"shell.execute_reply": "2023-08-18T07:02:38.343326Z"
},
"origin_pos": 4,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def corr2d_multi_in(X, K):\n",
" # 先遍历“X”和“K”的第0个维度(通道维度),再把它们加在一起\n",
" return sum(d2l.corr2d(x, k) for x, k in zip(X, K))"
]
},
{
"cell_type": "markdown",
"id": "54507b8a",
"metadata": {
"origin_pos": 6
},
"source": [
"我们可以构造与 :numref:`fig_conv_multi_in`中的值相对应的输入张量`X`和核张量`K`,以(**验证互相关运算的输出**)。\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5a60b8f9",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:38.347937Z",
"iopub.status.busy": "2023-08-18T07:02:38.347463Z",
"iopub.status.idle": "2023-08-18T07:02:38.380997Z",
"shell.execute_reply": "2023-08-18T07:02:38.379885Z"
},
"origin_pos": 7,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[ 56., 72.],\n",
" [104., 120.]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],\n",
" [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])\n",
"K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])\n",
"\n",
"corr2d_multi_in(X, K)"
]
},
{
"cell_type": "markdown",
"id": "118648d7",
"metadata": {
"origin_pos": 8
},
"source": [
"## 多输出通道\n",
"\n",
"到目前为止,不论有多少输入通道,我们还只有一个输出通道。然而,正如我们在 :numref:`subsec_why-conv-channels`中所讨论的,每一层有多个输出通道是至关重要的。在最流行的神经网络架构中,随着神经网络层数的加深,我们常会增加输出通道的维数,通过减少空间分辨率以获得更大的通道深度。直观地说,我们可以将每个通道看作对不同特征的响应。而现实可能更为复杂一些,因为每个通道不是独立学习的,而是为了共同使用而优化的。因此,多输出通道并不仅是学习多个单通道的检测器。\n",
"\n",
"用$c_i$和$c_o$分别表示输入和输出通道的数目,并让$k_h$和$k_w$为卷积核的高度和宽度。为了获得多个通道的输出,我们可以为每个输出通道创建一个形状为$c_i\\times k_h\\times k_w$的卷积核张量,这样卷积核的形状是$c_o\\times c_i\\times k_h\\times k_w$。在互相关运算中,每个输出通道先获取所有输入通道,再以对应该输出通道的卷积核计算出结果。\n",
"\n",
"如下所示,我们实现一个[**计算多个通道的输出的互相关函数**]。\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "aa2e4e5f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:38.384845Z",
"iopub.status.busy": "2023-08-18T07:02:38.384104Z",
"iopub.status.idle": "2023-08-18T07:02:38.389279Z",
"shell.execute_reply": "2023-08-18T07:02:38.388126Z"
},
"origin_pos": 9,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def corr2d_multi_in_out(X, K):\n",
" # 迭代“K”的第0个维度,每次都对输入“X”执行互相关运算。\n",
" # 最后将所有结果都叠加在一起\n",
" return torch.stack([corr2d_multi_in(X, k) for k in K], 0)"
]
},
{
"cell_type": "markdown",
"id": "f5677efa",
"metadata": {
"origin_pos": 10
},
"source": [
"通过将核张量`K`与`K+1``K`中每个元素加$1$)和`K+2`连接起来,构造了一个具有$3$个输出通道的卷积核。\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6dde7543",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:38.392733Z",
"iopub.status.busy": "2023-08-18T07:02:38.392298Z",
"iopub.status.idle": "2023-08-18T07:02:38.399310Z",
"shell.execute_reply": "2023-08-18T07:02:38.398211Z"
},
"origin_pos": 11,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([3, 2, 2, 2])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"K = torch.stack((K, K + 1, K + 2), 0)\n",
"K.shape"
]
},
{
"cell_type": "markdown",
"id": "c7e08b44",
"metadata": {
"origin_pos": 12
},
"source": [
"下面,我们对输入张量`X`与卷积核张量`K`执行互相关运算。现在的输出包含$3$个通道,第一个通道的结果与先前输入张量`X`和多输入单输出通道的结果一致。\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "86b2b71f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:38.403159Z",
"iopub.status.busy": "2023-08-18T07:02:38.402457Z",
"iopub.status.idle": "2023-08-18T07:02:38.410409Z",
"shell.execute_reply": "2023-08-18T07:02:38.409310Z"
},
"origin_pos": 13,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[[ 56., 72.],\n",
" [104., 120.]],\n",
"\n",
" [[ 76., 100.],\n",
" [148., 172.]],\n",
"\n",
" [[ 96., 128.],\n",
" [192., 224.]]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corr2d_multi_in_out(X, K)"
]
},
{
"cell_type": "markdown",
"id": "285e9413",
"metadata": {
"origin_pos": 14
},
"source": [
"## $1\\times 1$ 卷积层\n",
"\n",
"[~~1x1卷积~~]\n",
"\n",
"$1 \\times 1$卷积,即$k_h = k_w = 1$,看起来似乎没有多大意义。\n",
"毕竟,卷积的本质是有效提取相邻像素间的相关特征,而$1 \\times 1$卷积显然没有此作用。\n",
"尽管如此,$1 \\times 1$仍然十分流行,经常包含在复杂深层网络的设计中。下面,让我们详细地解读一下它的实际作用。\n",
"\n",
"因为使用了最小窗口,$1\\times 1$卷积失去了卷积层的特有能力——在高度和宽度维度上,识别相邻元素间相互作用的能力。\n",
"其实$1\\times 1$卷积的唯一计算发生在通道上。\n",
"\n",
" :numref:`fig_conv_1x1`展示了使用$1\\times 1$卷积核与$3$个输入通道和$2$个输出通道的互相关计算。\n",
"这里输入和输出具有相同的高度和宽度,输出中的每个元素都是从输入图像中同一位置的元素的线性组合。\n",
"我们可以将$1\\times 1$卷积层看作在每个像素位置应用的全连接层,以$c_i$个输入值转换为$c_o$个输出值。\n",
"因为这仍然是一个卷积层,所以跨像素的权重是一致的。\n",
"同时,$1\\times 1$卷积层需要的权重维度为$c_o\\times c_i$,再额外加上一个偏置。\n",
"\n",
"![互相关计算使用了具有3个输入通道和2个输出通道的 $1\\times 1$ 卷积核。其中,输入和输出具有相同的高度和宽度。](../img/conv-1x1.svg)\n",
":label:`fig_conv_1x1`\n",
"\n",
"下面,我们使用全连接层实现$1 \\times 1$卷积。\n",
"请注意,我们需要对输入和输出的数据形状进行调整。\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f5be69b4",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:38.413874Z",
"iopub.status.busy": "2023-08-18T07:02:38.413425Z",
"iopub.status.idle": "2023-08-18T07:02:38.419141Z",
"shell.execute_reply": "2023-08-18T07:02:38.418037Z"
},
"origin_pos": 15,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def corr2d_multi_in_out_1x1(X, K):\n",
" c_i, h, w = X.shape\n",
" c_o = K.shape[0]\n",
" X = X.reshape((c_i, h * w))\n",
" K = K.reshape((c_o, c_i))\n",
" # 全连接层中的矩阵乘法\n",
" Y = torch.matmul(K, X)\n",
" return Y.reshape((c_o, h, w))"
]
},
{
"cell_type": "markdown",
"id": "0685d9f1",
"metadata": {
"origin_pos": 16
},
"source": [
"当执行$1\\times 1$卷积运算时,上述函数相当于先前实现的互相关函数`corr2d_multi_in_out`。让我们用一些样本数据来验证这一点。\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "420f0d54",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:38.422499Z",
"iopub.status.busy": "2023-08-18T07:02:38.422070Z",
"iopub.status.idle": "2023-08-18T07:02:38.427214Z",
"shell.execute_reply": "2023-08-18T07:02:38.426115Z"
},
"origin_pos": 17,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"X = torch.normal(0, 1, (3, 3, 3))\n",
"K = torch.normal(0, 1, (2, 3, 1, 1))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "7250eae2",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:38.430613Z",
"iopub.status.busy": "2023-08-18T07:02:38.430184Z",
"iopub.status.idle": "2023-08-18T07:02:38.438715Z",
"shell.execute_reply": "2023-08-18T07:02:38.437662Z"
},
"origin_pos": 19,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"Y1 = corr2d_multi_in_out_1x1(X, K)\n",
"Y2 = corr2d_multi_in_out(X, K)\n",
"assert float(torch.abs(Y1 - Y2).sum()) < 1e-6"
]
},
{
"cell_type": "markdown",
"id": "8ba378bd",
"metadata": {
"origin_pos": 20
},
"source": [
"## 小结\n",
"\n",
"* 多输入多输出通道可以用来扩展卷积层的模型。\n",
"* 当以每像素为基础应用时,$1\\times 1$卷积层相当于全连接层。\n",
"* $1\\times 1$卷积层通常用于调整网络层的通道数量和控制模型复杂性。\n",
"\n",
"## 练习\n",
"\n",
"1. 假设我们有两个卷积核,大小分别为$k_1$和$k_2$(中间没有非线性激活函数)。\n",
" 1. 证明运算可以用单次卷积来表示。\n",
" 1. 这个等效的单个卷积核的维数是多少呢?\n",
" 1. 反之亦然吗?\n",
"1. 假设输入为$c_i\\times h\\times w$,卷积核大小为$c_o\\times c_i\\times k_h\\times k_w$,填充为$(p_h, p_w)$,步幅为$(s_h, s_w)$。\n",
" 1. 前向传播的计算成本(乘法和加法)是多少?\n",
" 1. 内存占用是多少?\n",
" 1. 反向传播的内存占用是多少?\n",
" 1. 反向传播的计算成本是多少?\n",
"1. 如果我们将输入通道$c_i$和输出通道$c_o$的数量加倍,计算数量会增加多少?如果我们把填充数量翻一番会怎么样?\n",
"1. 如果卷积核的高度和宽度是$k_h=k_w=1$,前向传播的计算复杂度是多少?\n",
"1. 本节最后一个示例中的变量`Y1`和`Y2`是否完全相同?为什么?\n",
"1. 当卷积窗口不是$1\\times 1$时,如何使用矩阵乘法实现卷积?\n"
]
},
{
"cell_type": "markdown",
"id": "0167237f",
"metadata": {
"origin_pos": 22,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/1854)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,557 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7d2e90ba",
"metadata": {
"origin_pos": 0
},
"source": [
"# 图像卷积\n",
":label:`sec_conv_layer`\n",
"\n",
"上节我们解析了卷积层的原理,现在我们看看它的实际应用。由于卷积神经网络的设计是用于探索图像数据,本节我们将以图像为例。\n",
"\n",
"## 互相关运算\n",
"\n",
"严格来说,卷积层是个错误的叫法,因为它所表达的运算其实是*互相关运算*(cross-correlation),而不是卷积运算。\n",
"根据 :numref:`sec_why-conv`中的描述,在卷积层中,输入张量和核张量通过(**互相关运算**)产生输出张量。\n",
"\n",
"首先,我们暂时忽略通道(第三维)这一情况,看看如何处理二维图像数据和隐藏表示。在 :numref:`fig_correlation`中,输入是高度为$3$、宽度为$3$的二维张量(即形状为$3 \\times 3$)。卷积核的高度和宽度都是$2$,而卷积核窗口(或卷积窗口)的形状由内核的高度和宽度决定(即$2 \\times 2$)。\n",
"\n",
"![二维互相关运算。阴影部分是第一个输出元素,以及用于计算输出的输入张量元素和核张量元素:$0\\times0+1\\times1+3\\times2+4\\times3=19$.](../img/correlation.svg)\n",
":label:`fig_correlation`\n",
"\n",
"在二维互相关运算中,卷积窗口从输入张量的左上角开始,从左到右、从上到下滑动。\n",
"当卷积窗口滑动到新一个位置时,包含在该窗口中的部分张量与卷积核张量进行按元素相乘,得到的张量再求和得到一个单一的标量值,由此我们得出了这一位置的输出张量值。\n",
"在如上例子中,输出张量的四个元素由二维互相关运算得到,这个输出高度为$2$、宽度为$2$,如下所示:\n",
"\n",
"$$\n",
"0\\times0+1\\times1+3\\times2+4\\times3=19,\\\\\n",
"1\\times0+2\\times1+4\\times2+5\\times3=25,\\\\\n",
"3\\times0+4\\times1+6\\times2+7\\times3=37,\\\\\n",
"4\\times0+5\\times1+7\\times2+8\\times3=43.\n",
"$$\n",
"\n",
"注意,输出大小略小于输入大小。这是因为卷积核的宽度和高度大于1,\n",
"而卷积核只与图像中每个大小完全适合的位置进行互相关运算。\n",
"所以,输出大小等于输入大小$n_h \\times n_w$减去卷积核大小$k_h \\times k_w$,即:\n",
"\n",
"$$(n_h-k_h+1) \\times (n_w-k_w+1).$$\n",
"\n",
"这是因为我们需要足够的空间在图像上“移动”卷积核。稍后,我们将看到如何通过在图像边界周围填充零来保证有足够的空间移动卷积核,从而保持输出大小不变。\n",
"接下来,我们在`corr2d`函数中实现如上过程,该函数接受输入张量`X`和卷积核张量`K`,并返回输出张量`Y`。\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1bd2b0f5",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:26.587988Z",
"iopub.status.busy": "2023-08-18T07:07:26.587419Z",
"iopub.status.idle": "2023-08-18T07:07:28.559553Z",
"shell.execute_reply": "2023-08-18T07:07:28.558681Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"from torch import nn\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "16abe7ca",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:28.563668Z",
"iopub.status.busy": "2023-08-18T07:07:28.562986Z",
"iopub.status.idle": "2023-08-18T07:07:28.569424Z",
"shell.execute_reply": "2023-08-18T07:07:28.568319Z"
},
"origin_pos": 4,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def corr2d(X, K): #@save\n",
" \"\"\"计算二维互相关运算\"\"\"\n",
" h, w = K.shape\n",
" Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))\n",
" for i in range(Y.shape[0]):\n",
" for j in range(Y.shape[1]):\n",
" Y[i, j] = (X[i:i + h, j:j + w] * K).sum()\n",
" return Y"
]
},
{
"cell_type": "markdown",
"id": "e2adaedd",
"metadata": {
"origin_pos": 6
},
"source": [
"通过 :numref:`fig_correlation`的输入张量`X`和卷积核张量`K`,我们来[**验证上述二维互相关运算的输出**]。\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6f84e512",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:28.572958Z",
"iopub.status.busy": "2023-08-18T07:07:28.572449Z",
"iopub.status.idle": "2023-08-18T07:07:28.604854Z",
"shell.execute_reply": "2023-08-18T07:07:28.603813Z"
},
"origin_pos": 7,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[19., 25.],\n",
" [37., 43.]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])\n",
"K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])\n",
"corr2d(X, K)"
]
},
{
"cell_type": "markdown",
"id": "e93ccf40",
"metadata": {
"origin_pos": 8
},
"source": [
"## 卷积层\n",
"\n",
"卷积层对输入和卷积核权重进行互相关运算,并在添加标量偏置之后产生输出。\n",
"所以,卷积层中的两个被训练的参数是卷积核权重和标量偏置。\n",
"就像我们之前随机初始化全连接层一样,在训练基于卷积层的模型时,我们也随机初始化卷积核权重。\n",
"\n",
"基于上面定义的`corr2d`函数[**实现二维卷积层**]。在`__init__`构造函数中,将`weight`和`bias`声明为两个模型参数。前向传播函数调用`corr2d`函数并添加偏置。\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "450def67",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:28.610672Z",
"iopub.status.busy": "2023-08-18T07:07:28.609819Z",
"iopub.status.idle": "2023-08-18T07:07:28.615602Z",
"shell.execute_reply": "2023-08-18T07:07:28.614632Z"
},
"origin_pos": 10,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"class Conv2D(nn.Module):\n",
" def __init__(self, kernel_size):\n",
" super().__init__()\n",
" self.weight = nn.Parameter(torch.rand(kernel_size))\n",
" self.bias = nn.Parameter(torch.zeros(1))\n",
"\n",
" def forward(self, x):\n",
" return corr2d(x, self.weight) + self.bias"
]
},
{
"cell_type": "markdown",
"id": "d361e4c7",
"metadata": {
"origin_pos": 13
},
"source": [
"高度和宽度分别为$h$和$w$的卷积核可以被称为$h \\times w$卷积或$h \\times w$卷积核。\n",
"我们也将带有$h \\times w$卷积核的卷积层称为$h \\times w$卷积层。\n",
"\n",
"## 图像中目标的边缘检测\n",
"\n",
"如下是[**卷积层的一个简单应用:**]通过找到像素变化的位置,来(**检测图像中不同颜色的边缘**)。\n",
"首先,我们构造一个$6\\times 8$像素的黑白图像。中间四列为黑色($0$),其余像素为白色($1$)。\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "dee1bc79",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:28.620077Z",
"iopub.status.busy": "2023-08-18T07:07:28.619277Z",
"iopub.status.idle": "2023-08-18T07:07:28.626719Z",
"shell.execute_reply": "2023-08-18T07:07:28.625746Z"
},
"origin_pos": 14,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[1., 1., 0., 0., 0., 0., 1., 1.],\n",
" [1., 1., 0., 0., 0., 0., 1., 1.],\n",
" [1., 1., 0., 0., 0., 0., 1., 1.],\n",
" [1., 1., 0., 0., 0., 0., 1., 1.],\n",
" [1., 1., 0., 0., 0., 0., 1., 1.],\n",
" [1., 1., 0., 0., 0., 0., 1., 1.]])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = torch.ones((6, 8))\n",
"X[:, 2:6] = 0\n",
"X"
]
},
{
"cell_type": "markdown",
"id": "ea455932",
"metadata": {
"origin_pos": 16
},
"source": [
"接下来,我们构造一个高度为$1$、宽度为$2$的卷积核`K`。当进行互相关运算时,如果水平相邻的两元素相同,则输出为零,否则输出为非零。\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "d042bda0",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:28.630101Z",
"iopub.status.busy": "2023-08-18T07:07:28.629606Z",
"iopub.status.idle": "2023-08-18T07:07:28.634133Z",
"shell.execute_reply": "2023-08-18T07:07:28.633165Z"
},
"origin_pos": 17,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"K = torch.tensor([[1.0, -1.0]])"
]
},
{
"cell_type": "markdown",
"id": "19635ba4",
"metadata": {
"origin_pos": 18
},
"source": [
"现在,我们对参数`X`(输入)和`K`(卷积核)执行互相关运算。\n",
"如下所示,[**输出`Y`中的1代表从白色到黑色的边缘,-1代表从黑色到白色的边缘**],其他情况的输出为$0$。\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "36de9e2a",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:28.639056Z",
"iopub.status.busy": "2023-08-18T07:07:28.638505Z",
"iopub.status.idle": "2023-08-18T07:07:28.646532Z",
"shell.execute_reply": "2023-08-18T07:07:28.645509Z"
},
"origin_pos": 19,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[ 0., 1., 0., 0., 0., -1., 0.],\n",
" [ 0., 1., 0., 0., 0., -1., 0.],\n",
" [ 0., 1., 0., 0., 0., -1., 0.],\n",
" [ 0., 1., 0., 0., 0., -1., 0.],\n",
" [ 0., 1., 0., 0., 0., -1., 0.],\n",
" [ 0., 1., 0., 0., 0., -1., 0.]])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Y = corr2d(X, K)\n",
"Y"
]
},
{
"cell_type": "markdown",
"id": "9f3991ae",
"metadata": {
"origin_pos": 20
},
"source": [
"现在我们将输入的二维图像转置,再进行如上的互相关运算。\n",
"其输出如下,之前检测到的垂直边缘消失了。\n",
"不出所料,这个[**卷积核`K`只可以检测垂直边缘**],无法检测水平边缘。\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0a754b2d",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:28.651371Z",
"iopub.status.busy": "2023-08-18T07:07:28.650819Z",
"iopub.status.idle": "2023-08-18T07:07:28.658419Z",
"shell.execute_reply": "2023-08-18T07:07:28.657436Z"
},
"origin_pos": 21,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[0., 0., 0., 0., 0.],\n",
" [0., 0., 0., 0., 0.],\n",
" [0., 0., 0., 0., 0.],\n",
" [0., 0., 0., 0., 0.],\n",
" [0., 0., 0., 0., 0.],\n",
" [0., 0., 0., 0., 0.],\n",
" [0., 0., 0., 0., 0.],\n",
" [0., 0., 0., 0., 0.]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corr2d(X.t(), K)"
]
},
{
"cell_type": "markdown",
"id": "18ceafe9",
"metadata": {
"origin_pos": 22
},
"source": [
"## 学习卷积核\n",
"\n",
"如果我们只需寻找黑白边缘,那么以上`[1, -1]`的边缘检测器足以。然而,当有了更复杂数值的卷积核,或者连续的卷积层时,我们不可能手动设计滤波器。那么我们是否可以[**学习由`X`生成`Y`的卷积核**]呢?\n",
"\n",
"现在让我们看看是否可以通过仅查看“输入-输出”对来学习由`X`生成`Y`的卷积核。\n",
"我们先构造一个卷积层,并将其卷积核初始化为随机张量。接下来,在每次迭代中,我们比较`Y`与卷积层输出的平方误差,然后计算梯度来更新卷积核。为了简单起见,我们在此使用内置的二维卷积层,并忽略偏置。\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "2b423578",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:28.662260Z",
"iopub.status.busy": "2023-08-18T07:07:28.661527Z",
"iopub.status.idle": "2023-08-18T07:07:28.681412Z",
"shell.execute_reply": "2023-08-18T07:07:28.680192Z"
},
"origin_pos": 24,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 2, loss 6.422\n",
"epoch 4, loss 1.225\n",
"epoch 6, loss 0.266\n",
"epoch 8, loss 0.070\n",
"epoch 10, loss 0.022\n"
]
}
],
"source": [
"# 构造一个二维卷积层,它具有1个输出通道和形状为(1,2)的卷积核\n",
"conv2d = nn.Conv2d(1,1, kernel_size=(1, 2), bias=False)\n",
"\n",
"# 这个二维卷积层使用四维输入和输出格式(批量大小、通道、高度、宽度),\n",
"# 其中批量大小和通道数都为1\n",
"X = X.reshape((1, 1, 6, 8))\n",
"Y = Y.reshape((1, 1, 6, 7))\n",
"lr = 3e-2 # 学习率\n",
"\n",
"for i in range(10):\n",
" Y_hat = conv2d(X)\n",
" l = (Y_hat - Y) ** 2\n",
" conv2d.zero_grad()\n",
" l.sum().backward()\n",
" # 迭代卷积核\n",
" conv2d.weight.data[:] -= lr * conv2d.weight.grad\n",
" if (i + 1) % 2 == 0:\n",
" print(f'epoch {i+1}, loss {l.sum():.3f}')"
]
},
{
"cell_type": "markdown",
"id": "37744bcf",
"metadata": {
"origin_pos": 27
},
"source": [
"在$10$次迭代之后,误差已经降到足够低。现在我们来看看我们[**所学的卷积核的权重张量**]。\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "b40515e8",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:07:28.684721Z",
"iopub.status.busy": "2023-08-18T07:07:28.684428Z",
"iopub.status.idle": "2023-08-18T07:07:28.691507Z",
"shell.execute_reply": "2023-08-18T07:07:28.690512Z"
},
"origin_pos": 29,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[ 1.0010, -0.9739]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conv2d.weight.data.reshape((1, 2))"
]
},
{
"cell_type": "markdown",
"id": "366d2c4f",
"metadata": {
"origin_pos": 32
},
"source": [
"细心的读者一定会发现,我们学习到的卷积核权重非常接近我们之前定义的卷积核`K`。\n",
"\n",
"## 互相关和卷积\n",
"\n",
"回想一下我们在 :numref:`sec_why-conv`中观察到的互相关和卷积运算之间的对应关系。\n",
"为了得到正式的*卷积*运算输出,我们需要执行 :eqref:`eq_2d-conv-discrete`中定义的严格卷积运算,而不是互相关运算。\n",
"幸运的是,它们差别不大,我们只需水平和垂直翻转二维卷积核张量,然后对输入张量执行*互相关*运算。\n",
"\n",
"值得注意的是,由于卷积核是从数据中学习到的,因此无论这些层执行严格的卷积运算还是互相关运算,卷积层的输出都不会受到影响。\n",
"为了说明这一点,假设卷积层执行*互相关*运算并学习 :numref:`fig_correlation`中的卷积核,该卷积核在这里由矩阵$\\mathbf{K}$表示。\n",
"假设其他条件不变,当这个层执行严格的*卷积*时,学习的卷积核$\\mathbf{K}'$在水平和垂直翻转之后将与$\\mathbf{K}$相同。\n",
"也就是说,当卷积层对 :numref:`fig_correlation`中的输入和$\\mathbf{K}'$执行严格*卷积*运算时,将得到与互相关运算 :numref:`fig_correlation`中相同的输出。\n",
"\n",
"为了与深度学习文献中的标准术语保持一致,我们将继续把“互相关运算”称为卷积运算,尽管严格地说,它们略有不同。\n",
"此外,对于卷积核张量上的权重,我们称其为*元素*。\n",
"\n",
"## 特征映射和感受野\n",
"\n",
"如在 :numref:`subsec_why-conv-channels`中所述, :numref:`fig_correlation`中输出的卷积层有时被称为*特征映射*(feature map),因为它可以被视为一个输入映射到下一层的空间维度的转换器。\n",
"在卷积神经网络中,对于某一层的任意元素$x$,其*感受野*(receptive field)是指在前向传播期间可能影响$x$计算的所有元素(来自所有先前层)。\n",
"\n",
"请注意,感受野可能大于输入的实际大小。让我们用 :numref:`fig_correlation`为例来解释感受野:\n",
"给定$2 \\times 2$卷积核,阴影输出元素值$19$的感受野是输入阴影部分的四个元素。\n",
"假设之前输出为$\\mathbf{Y}$,其大小为$2 \\times 2$,现在我们在其后附加一个卷积层,该卷积层以$\\mathbf{Y}$为输入,输出单个元素$z$。\n",
"在这种情况下,$\\mathbf{Y}$上的$z$的感受野包括$\\mathbf{Y}$的所有四个元素,而输入的感受野包括最初所有九个输入元素。\n",
"因此,当一个特征图中的任意元素需要检测更广区域的输入特征时,我们可以构建一个更深的网络。\n",
"\n",
"## 小结\n",
"\n",
"* 二维卷积层的核心计算是二维互相关运算。最简单的形式是,对二维输入数据和卷积核执行互相关操作,然后添加一个偏置。\n",
"* 我们可以设计一个卷积核来检测图像的边缘。\n",
"* 我们可以从数据中学习卷积核的参数。\n",
"* 学习卷积核时,无论用严格卷积运算或互相关运算,卷积层的输出不会受太大影响。\n",
"* 当需要检测输入特征中更广区域时,我们可以构建一个更深的卷积网络。\n",
"\n",
"## 练习\n",
"\n",
"1. 构建一个具有对角线边缘的图像`X`。\n",
" 1. 如果将本节中举例的卷积核`K`应用于`X`,会发生什么情况?\n",
" 1. 如果转置`X`会发生什么?\n",
" 1. 如果转置`K`会发生什么?\n",
"1. 在我们创建的`Conv2D`自动求导时,有什么错误消息?\n",
"1. 如何通过改变输入张量和卷积核张量,将互相关运算表示为矩阵乘法?\n",
"1. 手工设计一些卷积核。\n",
" 1. 二阶导数的核的形式是什么?\n",
" 1. 积分的核的形式是什么?\n",
" 1. 得到$d$次导数的最小核的大小是多少?\n"
]
},
{
"cell_type": "markdown",
"id": "c9adecf6",
"metadata": {
"origin_pos": 34,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/1848)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,53 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "5231073b",
"metadata": {
"origin_pos": 0
},
"source": [
"# 卷积神经网络\n",
":label:`chap_cnn`\n",
"\n",
"在前面的章节中,我们遇到过图像数据。\n",
"这种数据的每个样本都由一个二维像素网格组成,\n",
"每个像素可能是一个或者多个数值,取决于是黑白还是彩色图像。\n",
"到目前为止,我们处理这类结构丰富的数据的方式还不够有效。\n",
"我们仅仅通过将图像数据展平成一维向量而忽略了每个图像的空间结构信息,再将数据送入一个全连接的多层感知机中。\n",
"因为这些网络特征元素的顺序是不变的,因此最优的结果是利用先验知识,即利用相近像素之间的相互关联性,从图像数据中学习得到有效的模型。\n",
"\n",
"本章介绍的*卷积神经网络*convolutional neural networkCNN)是一类强大的、为处理图像数据而设计的神经网络。\n",
"基于卷积神经网络架构的模型在计算机视觉领域中已经占主导地位,当今几乎所有的图像识别、目标检测或语义分割相关的学术竞赛和商业应用都以这种方法为基础。\n",
"\n",
"现代卷积神经网络的设计得益于生物学、群论和一系列的补充实验。\n",
"卷积神经网络需要的参数少于全连接架构的网络,而且卷积也很容易用GPU并行计算。\n",
"因此卷积神经网络除了能够高效地采样从而获得精确的模型,还能够高效地计算。\n",
"久而久之,从业人员越来越多地使用卷积神经网络。即使在通常使用循环神经网络的一维序列结构任务上(例如音频、文本和时间序列分析),卷积神经网络也越来越受欢迎。\n",
"通过对卷积神经网络一些巧妙的调整,也使它们在图结构数据和推荐系统中发挥作用。\n",
"\n",
"在本章的开始,我们将介绍构成所有卷积网络主干的基本元素。\n",
"这包括卷积层本身、填充(padding)和步幅(stride)的基本细节、用于在相邻区域汇聚信息的汇聚层(pooling)、在每一层中多通道(channel)的使用,以及有关现代卷积网络架构的仔细讨论。\n",
"在本章的最后,我们将介绍一个完整的、可运行的LeNet模型:这是第一个成功应用的卷积神经网络,比现代深度学习兴起时间还要早。\n",
"在下一章中,我们将深入研究一些流行的、相对较新的卷积神经网络架构的完整实现,这些网络架构涵盖了现代从业者通常使用的大多数经典技术。\n",
"\n",
":begin_tab:toc\n",
" - [why-conv](why-conv.ipynb)\n",
" - [conv-layer](conv-layer.ipynb)\n",
" - [padding-and-strides](padding-and-strides.ipynb)\n",
" - [channels](channels.ipynb)\n",
" - [pooling](pooling.ipynb)\n",
" - [lenet](lenet.ipynb)\n",
":end_tab:\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,299 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f68fd76a",
"metadata": {
"origin_pos": 0
},
"source": [
"# 填充和步幅\n",
":label:`sec_padding`\n",
"\n",
"在前面的例子 :numref:`fig_correlation`中,输入的高度和宽度都为$3$,卷积核的高度和宽度都为$2$,生成的输出表征的维数为$2\\times2$。\n",
"正如我们在 :numref:`sec_conv_layer`中所概括的那样,假设输入形状为$n_h\\times n_w$,卷积核形状为$k_h\\times k_w$,那么输出形状将是$(n_h-k_h+1) \\times (n_w-k_w+1)$。\n",
"因此,卷积的输出形状取决于输入形状和卷积核的形状。\n",
"\n",
"还有什么因素会影响输出的大小呢?本节我们将介绍*填充*(padding)和*步幅*stride)。假设以下情景:\n",
"有时,在应用了连续的卷积之后,我们最终得到的输出远小于输入大小。这是由于卷积核的宽度和高度通常大于$1$所导致的。比如,一个$240 \\times 240$像素的图像,经过$10$层$5 \\times 5$的卷积后,将减少到$200 \\times 200$像素。如此一来,原始图像的边界丢失了许多有用信息。而*填充*是解决此问题最有效的方法;\n",
"有时,我们可能希望大幅降低图像的宽度和高度。例如,如果我们发现原始的输入分辨率十分冗余。*步幅*则可以在这类情况下提供帮助。\n",
"\n",
"## 填充\n",
"\n",
"如上所述,在应用多层卷积时,我们常常丢失边缘像素。\n",
"由于我们通常使用小卷积核,因此对于任何单个卷积,我们可能只会丢失几个像素。\n",
"但随着我们应用许多连续卷积层,累积丢失的像素数就多了。\n",
"解决这个问题的简单方法即为*填充*(padding):在输入图像的边界填充元素(通常填充元素是$0$)。\n",
"例如,在 :numref:`img_conv_pad`中,我们将$3 \\times 3$输入填充到$5 \\times 5$,那么它的输出就增加为$4 \\times 4$。阴影部分是第一个输出元素以及用于输出计算的输入和核张量元素:\n",
"$0\\times0+0\\times1+0\\times2+0\\times3=0$。\n",
"\n",
"![带填充的二维互相关。](../img/conv-pad.svg)\n",
":label:`img_conv_pad`\n",
"\n",
"通常,如果我们添加$p_h$行填充(大约一半在顶部,一半在底部)和$p_w$列填充(左侧大约一半,右侧一半),则输出形状将为\n",
"\n",
"$$(n_h-k_h+p_h+1)\\times(n_w-k_w+p_w+1)。$$\n",
"\n",
"这意味着输出的高度和宽度将分别增加$p_h$和$p_w$。\n",
"\n",
"在许多情况下,我们需要设置$p_h=k_h-1$和$p_w=k_w-1$,使输入和输出具有相同的高度和宽度。\n",
"这样可以在构建网络时更容易地预测每个图层的输出形状。假设$k_h$是奇数,我们将在高度的两侧填充$p_h/2$行。\n",
"如果$k_h$是偶数,则一种可能性是在输入顶部填充$\\lceil p_h/2\\rceil$行,在底部填充$\\lfloor p_h/2\\rfloor$行。同理,我们填充宽度的两侧。\n",
"\n",
"卷积神经网络中卷积核的高度和宽度通常为奇数,例如1、3、5或7。\n",
"选择奇数的好处是,保持空间维度的同时,我们可以在顶部和底部填充相同数量的行,在左侧和右侧填充相同数量的列。\n",
"\n",
"此外,使用奇数的核大小和填充大小也提供了书写上的便利。对于任何二维张量`X`,当满足:\n",
"1. 卷积核的大小是奇数;\n",
"2. 所有边的填充行数和列数相同;\n",
"3. 输出与输入具有相同高度和宽度\n",
"则可以得出:输出`Y[i, j]`是通过以输入`X[i, j]`为中心,与卷积核进行互相关计算得到的。\n",
"\n",
"比如,在下面的例子中,我们创建一个高度和宽度为3的二维卷积层,并(**在所有侧边填充1个像素**)。给定高度和宽度为8的输入,则输出的高度和宽度也是8。\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ee25ca28",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:00:27.440657Z",
"iopub.status.busy": "2023-08-18T07:00:27.439788Z",
"iopub.status.idle": "2023-08-18T07:00:28.396461Z",
"shell.execute_reply": "2023-08-18T07:00:28.395508Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([8, 8])"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import torch\n",
"from torch import nn\n",
"\n",
"\n",
"# 为了方便起见,我们定义了一个计算卷积层的函数。\n",
"# 此函数初始化卷积层权重,并对输入和输出提高和缩减相应的维数\n",
"def comp_conv2d(conv2d, X):\n",
" # 这里的(1,1)表示批量大小和通道数都是1\n",
" X = X.reshape((1, 1) + X.shape)\n",
" Y = conv2d(X)\n",
" # 省略前两个维度:批量大小和通道\n",
" return Y.reshape(Y.shape[2:])\n",
"\n",
"# 请注意,这里每边都填充了1行或1列,因此总共添加了2行或2列\n",
"conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1)\n",
"X = torch.rand(size=(8, 8))\n",
"comp_conv2d(conv2d, X).shape"
]
},
{
"cell_type": "markdown",
"id": "f46e5ea5",
"metadata": {
"origin_pos": 5
},
"source": [
"当卷积核的高度和宽度不同时,我们可以[**填充不同的高度和宽度**],使输出和输入具有相同的高度和宽度。在如下示例中,我们使用高度为5,宽度为3的卷积核,高度和宽度两边的填充分别为2和1。\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5dadebb1",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:00:28.400923Z",
"iopub.status.busy": "2023-08-18T07:00:28.400085Z",
"iopub.status.idle": "2023-08-18T07:00:28.406887Z",
"shell.execute_reply": "2023-08-18T07:00:28.406085Z"
},
"origin_pos": 7,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([8, 8])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conv2d = nn.Conv2d(1, 1, kernel_size=(5, 3), padding=(2, 1))\n",
"comp_conv2d(conv2d, X).shape"
]
},
{
"cell_type": "markdown",
"id": "5a303f4b",
"metadata": {
"origin_pos": 10
},
"source": [
"## 步幅\n",
"\n",
"在计算互相关时,卷积窗口从输入张量的左上角开始,向下、向右滑动。\n",
"在前面的例子中,我们默认每次滑动一个元素。\n",
"但是,有时候为了高效计算或是缩减采样次数,卷积窗口可以跳过中间位置,每次滑动多个元素。\n",
"\n",
"我们将每次滑动元素的数量称为*步幅*(stride)。到目前为止,我们只使用过高度或宽度为$1$的步幅,那么如何使用较大的步幅呢?\n",
" :numref:`img_conv_stride`是垂直步幅为$3$,水平步幅为$2$的二维互相关运算。\n",
"着色部分是输出元素以及用于输出计算的输入和内核张量元素:$0\\times0+0\\times1+1\\times2+2\\times3=8$、$0\\times0+6\\times1+0\\times2+0\\times3=6$。\n",
"\n",
"可以看到,为了计算输出中第一列的第二个元素和第一行的第二个元素,卷积窗口分别向下滑动三行和向右滑动两列。但是,当卷积窗口继续向右滑动两列时,没有输出,因为输入元素无法填充窗口(除非我们添加另一列填充)。\n",
"\n",
"![垂直步幅为 $3$,水平步幅为 $2$ 的二维互相关运算。](../img/conv-stride.svg)\n",
":label:`img_conv_stride`\n",
"\n",
"通常,当垂直步幅为$s_h$、水平步幅为$s_w$时,输出形状为\n",
"\n",
"$$\\lfloor(n_h-k_h+p_h+s_h)/s_h\\rfloor \\times \\lfloor(n_w-k_w+p_w+s_w)/s_w\\rfloor.$$\n",
"\n",
"如果我们设置了$p_h=k_h-1$和$p_w=k_w-1$,则输出形状将简化为$\\lfloor(n_h+s_h-1)/s_h\\rfloor \\times \\lfloor(n_w+s_w-1)/s_w\\rfloor$。\n",
"更进一步,如果输入的高度和宽度可以被垂直和水平步幅整除,则输出形状将为$(n_h/s_h) \\times (n_w/s_w)$。\n",
"\n",
"下面,我们[**将高度和宽度的步幅设置为2**],从而将输入的高度和宽度减半。\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7b6ac278",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:00:28.410395Z",
"iopub.status.busy": "2023-08-18T07:00:28.410090Z",
"iopub.status.idle": "2023-08-18T07:00:28.416621Z",
"shell.execute_reply": "2023-08-18T07:00:28.415848Z"
},
"origin_pos": 12,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([4, 4])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)\n",
"comp_conv2d(conv2d, X).shape"
]
},
{
"cell_type": "markdown",
"id": "e9e254ec",
"metadata": {
"origin_pos": 15
},
"source": [
"接下来,看(**一个稍微复杂的例子**)。\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6f1c0e6c",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:00:28.422070Z",
"iopub.status.busy": "2023-08-18T07:00:28.421461Z",
"iopub.status.idle": "2023-08-18T07:00:28.429200Z",
"shell.execute_reply": "2023-08-18T07:00:28.427969Z"
},
"origin_pos": 17,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([2, 2])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conv2d = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))\n",
"comp_conv2d(conv2d, X).shape"
]
},
{
"cell_type": "markdown",
"id": "4674c8d4",
"metadata": {
"origin_pos": 20
},
"source": [
"为了简洁起见,当输入高度和宽度两侧的填充数量分别为$p_h$和$p_w$时,我们称之为填充$(p_h, p_w)$。当$p_h = p_w = p$时,填充是$p$。同理,当高度和宽度上的步幅分别为$s_h$和$s_w$时,我们称之为步幅$(s_h, s_w)$。特别地,当$s_h = s_w = s$时,我们称步幅为$s$。默认情况下,填充为0,步幅为1。在实践中,我们很少使用不一致的步幅或填充,也就是说,我们通常有$p_h = p_w$和$s_h = s_w$。\n",
"\n",
"## 小结\n",
"\n",
"* 填充可以增加输出的高度和宽度。这常用来使输出与输入具有相同的高和宽。\n",
"* 步幅可以减小输出的高和宽,例如输出的高和宽仅为输入的高和宽的$1/n$($n$是一个大于$1$的整数)。\n",
"* 填充和步幅可用于有效地调整数据的维度。\n",
"\n",
"## 练习\n",
"\n",
"1. 对于本节中的最后一个示例,计算其输出形状,以查看它是否与实验结果一致。\n",
"1. 在本节中的实验中,试一试其他填充和步幅组合。\n",
"1. 对于音频信号,步幅$2$说明什么?\n",
"1. 步幅大于$1$的计算优势是什么?\n"
]
},
{
"cell_type": "markdown",
"id": "a93cbfa0",
"metadata": {
"origin_pos": 22,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/1851)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,527 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3406a2db",
"metadata": {
"origin_pos": 0
},
"source": [
"# 汇聚层\n",
":label:`sec_pooling`\n",
"\n",
"通常当我们处理图像时,我们希望逐渐降低隐藏表示的空间分辨率、聚集信息,这样随着我们在神经网络中层叠的上升,每个神经元对其敏感的感受野(输入)就越大。\n",
"\n",
"而我们的机器学习任务通常会跟全局图像的问题有关(例如,“图像是否包含一只猫呢?”),所以我们最后一层的神经元应该对整个输入的全局敏感。通过逐渐聚合信息,生成越来越粗糙的映射,最终实现学习全局表示的目标,同时将卷积图层的所有优势保留在中间层。\n",
"\n",
"此外,当检测较底层的特征时(例如 :numref:`sec_conv_layer`中所讨论的边缘),我们通常希望这些特征保持某种程度上的平移不变性。例如,如果我们拍摄黑白之间轮廓清晰的图像`X`,并将整个图像向右移动一个像素,即`Z[i, j] = X[i, j + 1]`,则新图像`Z`的输出可能大不相同。而在现实中,随着拍摄角度的移动,任何物体几乎不可能发生在同一像素上。即使用三脚架拍摄一个静止的物体,由于快门的移动而引起的相机振动,可能会使所有物体左右移动一个像素(除了高端相机配备了特殊功能来解决这个问题)。\n",
"\n",
"本节将介绍*汇聚*(pooling)层,它具有双重目的:降低卷积层对位置的敏感性,同时降低对空间降采样表示的敏感性。\n",
"\n",
"## 最大汇聚层和平均汇聚层\n",
"\n",
"与卷积层类似,汇聚层运算符由一个固定形状的窗口组成,该窗口根据其步幅大小在输入的所有区域上滑动,为固定形状窗口(有时称为*汇聚窗口*)遍历的每个位置计算一个输出。\n",
"然而,不同于卷积层中的输入与卷积核之间的互相关计算,汇聚层不包含参数。\n",
"相反,池运算是确定性的,我们通常计算汇聚窗口中所有元素的最大值或平均值。这些操作分别称为*最大汇聚层*(maximum pooling)和*平均汇聚层*average pooling)。\n",
"\n",
"在这两种情况下,与互相关运算符一样,汇聚窗口从输入张量的左上角开始,从左往右、从上往下的在输入张量内滑动。在汇聚窗口到达的每个位置,它计算该窗口中输入子张量的最大值或平均值。计算最大值或平均值是取决于使用了最大汇聚层还是平均汇聚层。\n",
"\n",
"![汇聚窗口形状为 $2\\times 2$ 的最大汇聚层。着色部分是第一个输出元素,以及用于计算这个输出的输入元素: $\\max(0, 1, 3, 4)=4$.](../img/pooling.svg)\n",
":label:`fig_pooling`\n",
"\n",
" :numref:`fig_pooling`中的输出张量的高度为$2$,宽度为$2$。这四个元素为每个汇聚窗口中的最大值:\n",
"\n",
"$$\n",
"\\max(0, 1, 3, 4)=4,\\\\\n",
"\\max(1, 2, 4, 5)=5,\\\\\n",
"\\max(3, 4, 6, 7)=7,\\\\\n",
"\\max(4, 5, 7, 8)=8.\\\\\n",
"$$\n",
"\n",
"汇聚窗口形状为$p \\times q$的汇聚层称为$p \\times q$汇聚层,汇聚操作称为$p \\times q$汇聚。\n",
"\n",
"回到本节开头提到的对象边缘检测示例,现在我们将使用卷积层的输出作为$2\\times 2$最大汇聚的输入。\n",
"设置卷积层输入为`X`,汇聚层输出为`Y`。\n",
"无论`X[i, j]`和`X[i, j + 1]`的值相同与否,或`X[i, j + 1]`和`X[i, j + 2]`的值相同与否,汇聚层始终输出`Y[i, j] = 1`。\n",
"也就是说,使用$2\\times 2$最大汇聚层,即使在高度或宽度上移动一个元素,卷积层仍然可以识别到模式。\n",
"\n",
"在下面的代码中的`pool2d`函数,我们(**实现汇聚层的前向传播**)。\n",
"这类似于 :numref:`sec_conv_layer`中的`corr2d`函数。\n",
"然而,这里我们没有卷积核,输出为输入中每个区域的最大值或平均值。\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "292e979e",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:18.192662Z",
"iopub.status.busy": "2023-08-18T07:02:18.191844Z",
"iopub.status.idle": "2023-08-18T07:02:20.224371Z",
"shell.execute_reply": "2023-08-18T07:02:20.223413Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"from torch import nn\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "fe35adac",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:20.228639Z",
"iopub.status.busy": "2023-08-18T07:02:20.227964Z",
"iopub.status.idle": "2023-08-18T07:02:20.234155Z",
"shell.execute_reply": "2023-08-18T07:02:20.233266Z"
},
"origin_pos": 4,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"def pool2d(X, pool_size, mode='max'):\n",
" p_h, p_w = pool_size\n",
" Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))\n",
" for i in range(Y.shape[0]):\n",
" for j in range(Y.shape[1]):\n",
" if mode == 'max':\n",
" Y[i, j] = X[i: i + p_h, j: j + p_w].max()\n",
" elif mode == 'avg':\n",
" Y[i, j] = X[i: i + p_h, j: j + p_w].mean()\n",
" return Y"
]
},
{
"cell_type": "markdown",
"id": "27b51b5e",
"metadata": {
"origin_pos": 6
},
"source": [
"我们可以构建 :numref:`fig_pooling`中的输入张量`X`,[**验证二维最大汇聚层的输出**]。\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3a781c85",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:20.237767Z",
"iopub.status.busy": "2023-08-18T07:02:20.237211Z",
"iopub.status.idle": "2023-08-18T07:02:20.268065Z",
"shell.execute_reply": "2023-08-18T07:02:20.267212Z"
},
"origin_pos": 7,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[4., 5.],\n",
" [7., 8.]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])\n",
"pool2d(X, (2, 2))"
]
},
{
"cell_type": "markdown",
"id": "8cc88d86",
"metadata": {
"origin_pos": 8
},
"source": [
"此外,我们还可以(**验证平均汇聚层**)。\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f9a1ffd",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:20.272001Z",
"iopub.status.busy": "2023-08-18T07:02:20.271411Z",
"iopub.status.idle": "2023-08-18T07:02:20.277849Z",
"shell.execute_reply": "2023-08-18T07:02:20.276928Z"
},
"origin_pos": 9,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[2., 3.],\n",
" [5., 6.]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pool2d(X, (2, 2), 'avg')"
]
},
{
"cell_type": "markdown",
"id": "447c6999",
"metadata": {
"origin_pos": 10
},
"source": [
"## [**填充和步幅**]\n",
"\n",
"与卷积层一样,汇聚层也可以改变输出形状。和以前一样,我们可以通过填充和步幅以获得所需的输出形状。\n",
"下面,我们用深度学习框架中内置的二维最大汇聚层,来演示汇聚层中填充和步幅的使用。\n",
"我们首先构造了一个输入张量`X`,它有四个维度,其中样本数和通道数都是1。\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "140d08f5",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:20.281458Z",
"iopub.status.busy": "2023-08-18T07:02:20.280874Z",
"iopub.status.idle": "2023-08-18T07:02:20.287391Z",
"shell.execute_reply": "2023-08-18T07:02:20.286578Z"
},
"origin_pos": 12,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[[[ 0., 1., 2., 3.],\n",
" [ 4., 5., 6., 7.],\n",
" [ 8., 9., 10., 11.],\n",
" [12., 13., 14., 15.]]]])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))\n",
"X"
]
},
{
"cell_type": "markdown",
"id": "f95f2492",
"metadata": {
"origin_pos": 15
},
"source": [
"默认情况下,(**深度学习框架中的步幅与汇聚窗口的大小相同**)。\n",
"因此,如果我们使用形状为`(3, 3)`的汇聚窗口,那么默认情况下,我们得到的步幅形状为`(3, 3)`。\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a3cc01e3",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:20.291052Z",
"iopub.status.busy": "2023-08-18T07:02:20.290402Z",
"iopub.status.idle": "2023-08-18T07:02:20.296276Z",
"shell.execute_reply": "2023-08-18T07:02:20.295476Z"
},
"origin_pos": 17,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[[[10.]]]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pool2d = nn.MaxPool2d(3)\n",
"pool2d(X)"
]
},
{
"cell_type": "markdown",
"id": "0b19d625",
"metadata": {
"origin_pos": 20
},
"source": [
"[**填充和步幅可以手动设定**]。\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "9c247428",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:20.299965Z",
"iopub.status.busy": "2023-08-18T07:02:20.299310Z",
"iopub.status.idle": "2023-08-18T07:02:20.307455Z",
"shell.execute_reply": "2023-08-18T07:02:20.306477Z"
},
"origin_pos": 22,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[[[ 5., 7.],\n",
" [13., 15.]]]])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pool2d = nn.MaxPool2d(3, padding=1, stride=2)\n",
"pool2d(X)"
]
},
{
"cell_type": "markdown",
"id": "635b4034",
"metadata": {
"origin_pos": 26,
"tab": [
"pytorch"
]
},
"source": [
"当然,我们可以(**设定一个任意大小的矩形汇聚窗口,并分别设定填充和步幅的高度和宽度**)。\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7c169b2f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:20.311794Z",
"iopub.status.busy": "2023-08-18T07:02:20.311492Z",
"iopub.status.idle": "2023-08-18T07:02:20.320399Z",
"shell.execute_reply": "2023-08-18T07:02:20.319108Z"
},
"origin_pos": 30,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[[[ 5., 7.],\n",
" [13., 15.]]]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1))\n",
"pool2d(X)"
]
},
{
"cell_type": "markdown",
"id": "a893596a",
"metadata": {
"origin_pos": 33
},
"source": [
"## 多个通道\n",
"\n",
"在处理多通道输入数据时,[**汇聚层在每个输入通道上单独运算**],而不是像卷积层一样在通道上对输入进行汇总。\n",
"这意味着汇聚层的输出通道数与输入通道数相同。\n",
"下面,我们将在通道维度上连结张量`X`和`X + 1`,以构建具有2个通道的输入。\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c0a30a7f",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:20.325617Z",
"iopub.status.busy": "2023-08-18T07:02:20.324879Z",
"iopub.status.idle": "2023-08-18T07:02:20.335303Z",
"shell.execute_reply": "2023-08-18T07:02:20.334055Z"
},
"origin_pos": 35,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[[[ 0., 1., 2., 3.],\n",
" [ 4., 5., 6., 7.],\n",
" [ 8., 9., 10., 11.],\n",
" [12., 13., 14., 15.]],\n",
"\n",
" [[ 1., 2., 3., 4.],\n",
" [ 5., 6., 7., 8.],\n",
" [ 9., 10., 11., 12.],\n",
" [13., 14., 15., 16.]]]])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = torch.cat((X, X + 1), 1)\n",
"X"
]
},
{
"cell_type": "markdown",
"id": "45add004",
"metadata": {
"origin_pos": 37
},
"source": [
"如下所示,汇聚后输出通道的数量仍然是2。\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "e534c8f3",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:02:20.340529Z",
"iopub.status.busy": "2023-08-18T07:02:20.339767Z",
"iopub.status.idle": "2023-08-18T07:02:20.349365Z",
"shell.execute_reply": "2023-08-18T07:02:20.348159Z"
},
"origin_pos": 39,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[[[ 5., 7.],\n",
" [13., 15.]],\n",
"\n",
" [[ 6., 8.],\n",
" [14., 16.]]]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pool2d = nn.MaxPool2d(3, padding=1, stride=2)\n",
"pool2d(X)"
]
},
{
"cell_type": "markdown",
"id": "0a91fd9f",
"metadata": {
"origin_pos": 43
},
"source": [
"## 小结\n",
"\n",
"* 对于给定输入元素,最大汇聚层会输出该窗口内的最大值,平均汇聚层会输出该窗口内的平均值。\n",
"* 汇聚层的主要优点之一是减轻卷积层对位置的过度敏感。\n",
"* 我们可以指定汇聚层的填充和步幅。\n",
"* 使用最大汇聚层以及大于1的步幅,可减少空间维度(如高度和宽度)。\n",
"* 汇聚层的输出通道数与输入通道数相同。\n",
"\n",
"## 练习\n",
"\n",
"1. 尝试将平均汇聚层作为卷积层的特殊情况实现。\n",
"1. 尝试将最大汇聚层作为卷积层的特殊情况实现。\n",
"1. 假设汇聚层的输入大小为$c\\times h\\times w$,则汇聚窗口的形状为$p_h\\times p_w$,填充为$(p_h, p_w)$,步幅为$(s_h, s_w)$。这个汇聚层的计算成本是多少?\n",
"1. 为什么最大汇聚层和平均汇聚层的工作方式不同?\n",
"1. 我们是否需要最小汇聚层?可以用已知函数替换它吗?\n",
"1. 除了平均汇聚层和最大汇聚层,是否有其它函数可以考虑(提示:回想一下`softmax`)?为什么它不流行?\n"
]
},
{
"cell_type": "markdown",
"id": "f53a8320",
"metadata": {
"origin_pos": 45,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/1857)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,172 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "36224718",
"metadata": {
"origin_pos": 0
},
"source": [
"# 从全连接层到卷积\n",
":label:`sec_why-conv`\n",
"\n",
"我们之前讨论的多层感知机十分适合处理表格数据,其中行对应样本,列对应特征。\n",
"对于表格数据,我们寻找的模式可能涉及特征之间的交互,但是我们不能预先假设任何与特征交互相关的先验结构。\n",
"此时,多层感知机可能是最好的选择,然而对于高维感知数据,这种缺少结构的网络可能会变得不实用。\n",
"\n",
"例如,在之前猫狗分类的例子中:假设我们有一个足够充分的照片数据集,数据集中是拥有标注的照片,每张照片具有百万级像素,这意味着网络的每次输入都有一百万个维度。\n",
"即使将隐藏层维度降低到1000,这个全连接层也将有$10^6 \\times 10^3 = 10^9$个参数。\n",
"想要训练这个模型将不可实现,因为需要有大量的GPU、分布式优化训练的经验和超乎常人的耐心。\n",
"\n",
"有些读者可能会反对这个观点,认为要求百万像素的分辨率可能不是必要的。\n",
"然而,即使分辨率减小为十万像素,使用1000个隐藏单元的隐藏层也可能不足以学习到良好的图像特征,在真实的系统中我们仍然需要数十亿个参数。\n",
"此外,拟合如此多的参数还需要收集大量的数据。\n",
"然而,如今人类和机器都能很好地区分猫和狗:这是因为图像中本就拥有丰富的结构,而这些结构可以被人类和机器学习模型使用。\n",
"*卷积神经网络*convolutional neural networksCNN)是机器学习利用自然图像中一些已知结构的创造性方法。\n",
"\n",
"## 不变性\n",
"\n",
"想象一下,假设我们想从一张图片中找到某个物体。\n",
"合理的假设是:无论哪种方法找到这个物体,都应该和物体的位置无关。\n",
"理想情况下,我们的系统应该能够利用常识:猪通常不在天上飞,飞机通常不在水里游泳。\n",
"但是,如果一只猪出现在图片顶部,我们还是应该认出它。\n",
"我们可以从儿童游戏”沃尔多在哪里”( :numref:`img_waldo`)中得到灵感:\n",
"在这个游戏中包含了许多充斥着活动的混乱场景,而沃尔多通常潜伏在一些不太可能的位置,读者的目标就是找出他。\n",
"尽管沃尔多的装扮很有特点,但是在眼花缭乱的场景中找到他也如大海捞针。\n",
"然而沃尔多的样子并不取决于他潜藏的地方,因此我们可以使用一个“沃尔多检测器”扫描图像。\n",
"该检测器将图像分割成多个区域,并为每个区域包含沃尔多的可能性打分。\n",
"卷积神经网络正是将*空间不变性*spatial invariance)的这一概念系统化,从而基于这个模型使用较少的参数来学习有用的表示。\n",
"\n",
"![沃尔多游戏示例图。](../img/where-wally-walker-books.jpg)\n",
":width:`400px`\n",
":label:`img_waldo`\n",
"\n",
"现在,我们将上述想法总结一下,从而帮助我们设计适合于计算机视觉的神经网络架构。\n",
"\n",
"1. *平移不变性*translation invariance):不管检测对象出现在图像中的哪个位置,神经网络的前面几层应该对相同的图像区域具有相似的反应,即为“平移不变性”。\n",
"1. *局部性*locality):神经网络的前面几层应该只探索输入图像中的局部区域,而不过度在意图像中相隔较远区域的关系,这就是“局部性”原则。最终,可以聚合这些局部特征,以在整个图像级别进行预测。\n",
"\n",
"让我们看看这些原则是如何转化为数学表示的。\n",
"\n",
"## 多层感知机的限制\n",
"\n",
"首先,多层感知机的输入是二维图像$\\mathbf{X}$,其隐藏表示$\\mathbf{H}$在数学上是一个矩阵,在代码中表示为二维张量。\n",
"其中$\\mathbf{X}$和$\\mathbf{H}$具有相同的形状。\n",
"为了方便理解,我们可以认为,无论是输入还是隐藏表示都拥有空间结构。\n",
"\n",
"使用$[\\mathbf{X}]_{i, j}$和$[\\mathbf{H}]_{i, j}$分别表示输入图像和隐藏表示中位置($i$,$j$)处的像素。\n",
"为了使每个隐藏神经元都能接收到每个输入像素的信息,我们将参数从权重矩阵(如同我们先前在多层感知机中所做的那样)替换为四阶权重张量$\\mathsf{W}$。假设$\\mathbf{U}$包含偏置参数,我们可以将全连接层形式化地表示为\n",
"\n",
"$$\\begin{aligned} \\left[\\mathbf{H}\\right]_{i, j} &= [\\mathbf{U}]_{i, j} + \\sum_k \\sum_l[\\mathsf{W}]_{i, j, k, l} [\\mathbf{X}]_{k, l}\\\\ &= [\\mathbf{U}]_{i, j} +\n",
"\\sum_a \\sum_b [\\mathsf{V}]_{i, j, a, b} [\\mathbf{X}]_{i+a, j+b}.\\end{aligned}$$\n",
"\n",
"其中,从$\\mathsf{W}$到$\\mathsf{V}$的转换只是形式上的转换,因为在这两个四阶张量的元素之间存在一一对应的关系。\n",
"我们只需重新索引下标$(k, l)$,使$k = i+a$、$l = j+b$,由此可得$[\\mathsf{V}]_{i, j, a, b} = [\\mathsf{W}]_{i, j, i+a, j+b}$。\n",
"索引$a$和$b$通过在正偏移和负偏移之间移动覆盖了整个图像。\n",
"对于隐藏表示中任意给定位置($i$,$j$)处的像素值$[\\mathbf{H}]_{i, j}$,可以通过在$x$中以$(i, j)$为中心对像素进行加权求和得到,加权使用的权重为$[\\mathsf{V}]_{i, j, a, b}$。\n",
"\n",
"### 平移不变性\n",
"\n",
"现在引用上述的第一个原则:平移不变性。\n",
"这意味着检测对象在输入$\\mathbf{X}$中的平移,应该仅导致隐藏表示$\\mathbf{H}$中的平移。也就是说,$\\mathsf{V}$和$\\mathbf{U}$实际上不依赖于$(i, j)$的值,即$[\\mathsf{V}]_{i, j, a, b} = [\\mathbf{V}]_{a, b}$。并且$\\mathbf{U}$是一个常数,比如$u$。因此,我们可以简化$\\mathbf{H}$定义为:\n",
"\n",
"$$[\\mathbf{H}]_{i, j} = u + \\sum_a\\sum_b [\\mathbf{V}]_{a, b} [\\mathbf{X}]_{i+a, j+b}.$$\n",
"\n",
"这就是*卷积*convolution)。我们是在使用系数$[\\mathbf{V}]_{a, b}$对位置$(i, j)$附近的像素$(i+a, j+b)$进行加权得到$[\\mathbf{H}]_{i, j}$。\n",
"注意,$[\\mathbf{V}]_{a, b}$的系数比$[\\mathsf{V}]_{i, j, a, b}$少很多,因为前者不再依赖于图像中的位置。这就是显著的进步!\n",
"\n",
"### 局部性\n",
"\n",
"现在引用上述的第二个原则:局部性。如上所述,为了收集用来训练参数$[\\mathbf{H}]_{i, j}$的相关信息,我们不应偏离到距$(i, j)$很远的地方。这意味着在$|a|> \\Delta$或$|b| > \\Delta$的范围之外,我们可以设置$[\\mathbf{V}]_{a, b} = 0$。因此,我们可以将$[\\mathbf{H}]_{i, j}$重写为\n",
"\n",
"$$[\\mathbf{H}]_{i, j} = u + \\sum_{a = -\\Delta}^{\\Delta} \\sum_{b = -\\Delta}^{\\Delta} [\\mathbf{V}]_{a, b} [\\mathbf{X}]_{i+a, j+b}.$$\n",
":eqlabel:`eq_conv-layer`\n",
"\n",
"简而言之, :eqref:`eq_conv-layer`是一个*卷积层*convolutional layer),而卷积神经网络是包含卷积层的一类特殊的神经网络。\n",
"在深度学习研究社区中,$\\mathbf{V}$被称为*卷积核*convolution kernel)或者*滤波器*(filter),亦或简单地称之为该卷积层的*权重*,通常该权重是可学习的参数。\n",
"当图像处理的局部区域很小时,卷积神经网络与多层感知机的训练差异可能是巨大的:以前,多层感知机可能需要数十亿个参数来表示网络中的一层,而现在卷积神经网络通常只需要几百个参数,而且不需要改变输入或隐藏表示的维数。\n",
"参数大幅减少的代价是,我们的特征现在是平移不变的,并且当确定每个隐藏活性值时,每一层只包含局部的信息。\n",
"以上所有的权重学习都将依赖于归纳偏置。当这种偏置与现实相符时,我们就能得到样本有效的模型,并且这些模型能很好地泛化到未知数据中。\n",
"但如果这偏置与现实不符时,比如当图像不满足平移不变时,我们的模型可能难以拟合我们的训练数据。\n",
"\n",
"## 卷积\n",
"\n",
"在进一步讨论之前,我们先简要回顾一下为什么上面的操作被称为卷积。在数学中,两个函数(比如$f, g: \\mathbb{R}^d \\to \\mathbb{R}$)之间的“卷积”被定义为\n",
"\n",
"$$(f * g)(\\mathbf{x}) = \\int f(\\mathbf{z}) g(\\mathbf{x}-\\mathbf{z}) d\\mathbf{z}.$$\n",
"\n",
"也就是说,卷积是当把一个函数“翻转”并移位$\\mathbf{x}$时,测量$f$和$g$之间的重叠。\n",
"当为离散对象时,积分就变成求和。例如,对于由索引为$\\mathbb{Z}$的、平方可和的、无限维向量集合中抽取的向量,我们得到以下定义:\n",
"\n",
"$$(f * g)(i) = \\sum_a f(a) g(i-a).$$\n",
"\n",
"对于二维张量,则为$f$的索引$(a, b)$和$g$的索引$(i-a, j-b)$上的对应加和:\n",
"\n",
"$$(f * g)(i, j) = \\sum_a\\sum_b f(a, b) g(i-a, j-b).$$\n",
":eqlabel:`eq_2d-conv-discrete`\n",
"\n",
"这看起来类似于 :eqref:`eq_conv-layer`,但有一个主要区别:这里不是使用$(i+a, j+b)$,而是使用差值。然而,这种区别是表面的,因为我们总是可以匹配 :eqref:`eq_conv-layer`和 :eqref:`eq_2d-conv-discrete`之间的符号。我们在 :eqref:`eq_conv-layer`中的原始定义更正确地描述了*互相关*cross-correlation),这个问题将在下一节中讨论。\n",
"\n",
"## “沃尔多在哪里”回顾\n",
"\n",
"回到上面的“沃尔多在哪里”游戏,让我们看看它到底是什么样子。卷积层根据滤波器$\\mathbf{V}$选取给定大小的窗口,并加权处理图片,如 :numref:`fig_waldo_mask`中所示。我们的目标是学习一个模型,以便探测出在“沃尔多”最可能出现的地方。\n",
"\n",
"![发现沃尔多。](../img/waldo-mask.jpg)\n",
":width:`400px`\n",
":label:`fig_waldo_mask`\n",
"\n",
"### 通道\n",
":label:`subsec_why-conv-channels`\n",
"\n",
"然而这种方法有一个问题:我们忽略了图像一般包含三个通道/三种原色(红色、绿色和蓝色)。\n",
"实际上,图像不是二维张量,而是一个由高度、宽度和颜色组成的三维张量,比如包含$1024 \\times 1024 \\times 3$个像素。\n",
"前两个轴与像素的空间位置有关,而第三个轴可以看作每个像素的多维表示。\n",
"因此,我们将$\\mathsf{X}$索引为$[\\mathsf{X}]_{i, j, k}$。由此卷积相应地调整为$[\\mathsf{V}]_{a,b,c}$,而不是$[\\mathbf{V}]_{a,b}$。\n",
"\n",
"此外,由于输入图像是三维的,我们的隐藏表示$\\mathsf{H}$也最好采用三维张量。\n",
"换句话说,对于每一个空间位置,我们想要采用一组而不是一个隐藏表示。这样一组隐藏表示可以想象成一些互相堆叠的二维网格。\n",
"因此,我们可以把隐藏表示想象为一系列具有二维张量的*通道*(channel)。\n",
"这些通道有时也被称为*特征映射*(feature maps),因为每个通道都向后续层提供一组空间化的学习特征。\n",
"直观上可以想象在靠近输入的底层,一些通道专门识别边缘,而一些通道专门识别纹理。\n",
"\n",
"为了支持输入$\\mathsf{X}$和隐藏表示$\\mathsf{H}$中的多个通道,我们可以在$\\mathsf{V}$中添加第四个坐标,即$[\\mathsf{V}]_{a, b, c, d}$。综上所述,\n",
"\n",
"$$[\\mathsf{H}]_{i,j,d} = \\sum_{a = -\\Delta}^{\\Delta} \\sum_{b = -\\Delta}^{\\Delta} \\sum_c [\\mathsf{V}]_{a, b, c, d} [\\mathsf{X}]_{i+a, j+b, c},$$\n",
":eqlabel:`eq_conv-layer-channels`\n",
"\n",
"其中隐藏表示$\\mathsf{H}$中的索引$d$表示输出通道,而随后的输出将继续以三维张量$\\mathsf{H}$作为输入进入下一个卷积层。\n",
"所以, :eqref:`eq_conv-layer-channels`可以定义具有多个通道的卷积层,而其中$\\mathsf{V}$是该卷积层的权重。\n",
"\n",
"然而,仍有许多问题亟待解决。\n",
"例如,图像中是否到处都有存在沃尔多的可能?如何有效地计算输出层?如何选择适当的激活函数?为了训练有效的网络,如何做出合理的网络设计选择?我们将在本章的其它部分讨论这些问题。\n",
"\n",
"## 小结\n",
"\n",
"- 图像的平移不变性使我们以相同的方式处理局部图像,而不在乎它的位置。\n",
"- 局部性意味着计算相应的隐藏表示只需一小部分局部图像像素。\n",
"- 在图像处理中,卷积层通常比全连接层需要更少的参数,但依旧获得高效用的模型。\n",
"- 卷积神经网络(CNN)是一类特殊的神经网络,它可以包含多个卷积层。\n",
"- 多个输入和输出通道使模型在每个空间位置可以获取图像的多方面特征。\n",
"\n",
"## 练习\n",
"\n",
"1. 假设卷积层 :eqref:`eq_conv-layer`覆盖的局部区域$\\Delta = 0$。在这种情况下,证明卷积内核为每组通道独立地实现一个全连接层。\n",
"1. 为什么平移不变性可能也不是好主意呢?\n",
"1. 当从图像边界像素获取隐藏表示时,我们需要思考哪些问题?\n",
"1. 描述一个类似的音频卷积层的架构。\n",
"1. 卷积层也适合于文本数据吗?为什么?\n",
"1. 证明在 :eqref:`eq_2d-conv-discrete`中,$f * g = g * f$。\n",
"\n",
"[Discussions](https://discuss.d2l.ai/t/5767)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}