{
"cells": [
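{
"cell_type": "markdown",
"id": "7a4cb8fa",
"metadata": {
"origin_pos": 0
},
"source": [
"# Automatic Parallelism\n",
":label:`sec_auto_para`\n",
"\n",
"Deep learning frameworks (e.g., MXNet, PaddlePaddle, and PyTorch) automatically construct computational graphs at the backend. Using a computational graph, the system is aware of all the dependencies and can selectively execute multiple non-interdependent tasks in parallel to improve speed. For instance, :numref:`fig_asyncgraph` in :numref:`sec_async` initializes two variables independently. Consequently, the system can choose to execute them in parallel.\n",
"\n",
"Typically, a single operator will use all the computational resources on all CPUs or on a single GPU. For example, the `dot` operator will use all cores (and threads) on all CPUs, even if there are multiple CPU processors on a single machine. The same applies to a single GPU. Hence parallelization is not very useful on single-device computers, whereas it matters a great deal with multiple devices. While parallelization is typically applied across multiple GPUs, adding the local CPU can improve performance slightly. For example, :cite:`Hadjis.Zhang.Mitliagkas.ea.2016` trains computer vision models by combining a GPU and a CPU. With the convenience of an automatically parallelizing framework, we can accomplish the same goal in a few lines of Python code. Our discussion of automatic parallel computation focuses on parallel computation using both CPUs and GPUs, as well as on the parallelization of computation and communication.\n",
"\n",
"Note that we need at least two GPUs to run the experiments in this section.\n"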
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8c944f1a",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:11:59.505418Z",
"iopub.status.busy": "2023-08-18T07:11:59.504686Z",
"iopub.status.idle": "2023-08-18T07:12:02.958789Z",
"shell.execute_reply": "2023-08-18T07:12:02.957933Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"from d2l import torch as d2l"
]
},
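{
"cell_type": "markdown",
"id": "4c8e7569",
"metadata": {
"origin_pos": 4
},
"source": [
"## Parallel Computation on GPUs\n",
"\n",
"Let's start by defining a reference workload to test: the `run` function below performs $50$ *matrix-matrix* multiplications on the device of its input. The data are allocated into two variables, `x_gpu1` and `x_gpu2`, each placed on a different device of our choice.\n"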
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5e7b039a",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:12:02.987012Z",
"iopub.status.busy": "2023-08-18T07:12:02.986327Z",
"iopub.status.idle": "2023-08-18T07:12:05.221346Z",
"shell.execute_reply": "2023-08-18T07:12:05.220262Z"
},
"origin_pos": 6,
"tab": [
"pytorch"
]
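},
"outputs": [],
"source": [
"devices = d2l.try_all_gpus()\n",
"\n",
"def run(x):\n",
"    # Enqueue 50 independent matrix-matrix multiplications on the device of x\n",
"    return [x.mm(x) for _ in range(50)]\n",
"\n",
"# One input matrix per device (requires at least two GPUs)\n",
"x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])\n",
"x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])"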
]
},
{
"cell_type": "markdown",
"id": "c2f2ffe6",
"metadata": {
"origin_pos": 9,
"tab": [
"pytorch"
]
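},
"source": [
"Now we apply the function to the data. To ensure that caching does not affect the results, we warm up the devices by performing a single pass on each of them before measuring. `torch.cuda.synchronize()` waits for all kernels in all streams on a CUDA device to complete. It takes a `device` argument that specifies which device to synchronize. If the `device` argument is `None` (the default), it uses the current device, as given by `current_device()`.\n"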
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "970d8c24",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:12:05.225646Z",
"iopub.status.busy": "2023-08-18T07:12:05.224864Z",
"iopub.status.idle": "2023-08-18T07:12:07.664593Z",
"shell.execute_reply": "2023-08-18T07:12:07.663740Z"
},
"origin_pos": 12,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPU1 time: 0.4600 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPU2 time: 0.4706 sec\n"
]
}
],
"source": [
"run(x_gpu1)\n",
"run(x_gpu2)  # Warm up the devices\n",
"torch.cuda.synchronize(devices[0])\n",
"torch.cuda.synchronize(devices[1])\n",
"\n",
"with d2l.Benchmark('GPU1 time'):\n",
"    run(x_gpu1)\n",
"    torch.cuda.synchronize(devices[0])\n",
"\n",
"with d2l.Benchmark('GPU2 time'):\n",
"    run(x_gpu2)\n",
"    torch.cuda.synchronize(devices[1])"
]
},
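{
"cell_type": "markdown",
"id": "added-sync-md",
"metadata": {
"tab": [
"pytorch"
]
},
"source": [
"Because CUDA kernels are launched asynchronously, stopping a timer without first calling `torch.cuda.synchronize()` mostly measures the time needed to enqueue the work rather than to finish it. The following is a minimal sketch of this effect, assuming at least one CUDA device is available. The names `dev`, `launched`, and `finished` are illustrative and not part of the original notebook.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-sync-code",
"metadata": {
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import time\n",
"import torch\n",
"\n",
"if torch.cuda.is_available():\n",
"    dev = torch.device('cuda:0')  # assumption: use the first GPU\n",
"    x = torch.rand(size=(4000, 4000), device=dev)\n",
"    x.mm(x)  # warm up the device\n",
"    torch.cuda.synchronize(dev)\n",
"\n",
"    start = time.time()\n",
"    for _ in range(50):\n",
"        x.mm(x)  # returns as soon as the kernel is enqueued\n",
"    launched = time.time() - start  # time spent enqueueing only\n",
"    torch.cuda.synchronize(dev)  # wait until all kernels have finished\n",
"    finished = time.time() - start  # time until the results are ready\n",
"    print(f'launch only: {launched:.4f} sec, after sync: {finished:.4f} sec')"
]
},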
{
"cell_type": "markdown",
"id": "4df4f720",
"metadata": {
"origin_pos": 15,
"tab": [
"pytorch"
]
},
"source": [
"If we remove the `synchronize` statement between the two tasks, the system is free to parallelize computation on both devices automatically.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d6a567e4",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:12:07.668313Z",
"iopub.status.busy": "2023-08-18T07:12:07.667763Z",
"iopub.status.idle": "2023-08-18T07:12:08.130167Z",
"shell.execute_reply": "2023-08-18T07:12:08.129377Z"
},
"origin_pos": 18,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPU1 & GPU2: 0.4580 sec\n"
]
}
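],
"source": [
"with d2l.Benchmark('GPU1 & GPU2'):\n",
"    run(x_gpu1)\n",
"    run(x_gpu2)\n",
"    # Wait for both devices: synchronize() without an argument\n",
"    # only waits for the current device\n",
"    torch.cuda.synchronize(devices[0])\n",
"    torch.cuda.synchronize(devices[1])"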
]
},
{
"cell_type": "markdown",
"id": "a04f1ffe",
"metadata": {
"origin_pos": 20
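},
"source": [
"In the above case, the total execution time is less than the sum of its parts, since the deep learning framework automatically schedules computation on both GPU devices without requiring sophisticated code from the user.\n",
"\n",
"## Parallel Computation and Communication\n",
"\n",
"In many cases we need to move data between different devices, say between the CPU and a GPU, or between different GPUs. For instance, this occurs in distributed optimization, where we need to aggregate the gradients over multiple accelerator cards. Let's simulate this by computing on the GPU and then copying the results back to the CPU.\n"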
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3b71f533",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:12:08.133753Z",
"iopub.status.busy": "2023-08-18T07:12:08.133184Z",
"iopub.status.idle": "2023-08-18T07:12:10.950227Z",
"shell.execute_reply": "2023-08-18T07:12:10.949308Z"
},
"origin_pos": 22,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Run on GPU1: 0.4608 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Copy to CPU: 2.3504 sec\n"
]
}
],
"source": [
"def copy_to_cpu(x, non_blocking=False):\n",
"    return [y.to('cpu', non_blocking=non_blocking) for y in x]\n",
"\n",
"with d2l.Benchmark('Run on GPU1'):\n",
"    y = run(x_gpu1)\n",
"    torch.cuda.synchronize()\n",
"\n",
"with d2l.Benchmark('Copy to CPU'):\n",
"    y_cpu = copy_to_cpu(y)\n",
"    torch.cuda.synchronize()"
]
},
{
"cell_type": "markdown",
"id": "5290ab0c",
"metadata": {
"origin_pos": 25,
"tab": [
"pytorch"
]
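},
"source": [
"This is somewhat inefficient. Note that we could already start copying parts of `y` to the CPU while the remainder of the list is still being computed. This situation occurs, for example, when we compute the (backpropagated) gradient on a minibatch: the gradients of some parameters become available earlier than those of others. Hence it is to our advantage to start using PCI-Express bus bandwidth while the GPU is still running. In PyTorch, several functions such as `to()` and `copy_()` accept an explicit `non_blocking` argument, which lets the caller bypass synchronization when it is unnecessary. Setting `non_blocking=True` allows us to simulate this scenario.\n"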
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b6ecdc54",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:12:10.954084Z",
"iopub.status.busy": "2023-08-18T07:12:10.953336Z",
"iopub.status.idle": "2023-08-18T07:12:12.728692Z",
"shell.execute_reply": "2023-08-18T07:12:12.727837Z"
},
"origin_pos": 28,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Run on GPU1 and copy to CPU: 1.7703 sec\n"
]
}
],
"source": [
"with d2l.Benchmark('Run on GPU1 and copy to CPU'):\n",
"    y = run(x_gpu1)\n",
"    y_cpu = copy_to_cpu(y, True)\n",
"    torch.cuda.synchronize()"
]
},
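{
"cell_type": "markdown",
"id": "added-pinned-md",
"metadata": {
"tab": [
"pytorch"
]
},
"source": [
"Asynchronous device-to-host copies are commonly paired with pinned (page-locked) host memory, so that the host can enqueue each copy without waiting for it to finish. The sketch below, which assumes at least one CUDA device, copies each result into a preallocated pinned buffer with `copy_(..., non_blocking=True)`. The buffer list `buffers` and the count `n` are illustrative names, not part of the original notebook. The buffers must not be read on the CPU before `torch.cuda.synchronize()` returns.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "added-pinned-code",
"metadata": {
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import torch\n",
"\n",
"if torch.cuda.is_available():\n",
"    dev = torch.device('cuda:0')  # assumption: use the first GPU\n",
"    x = torch.rand(size=(4000, 4000), device=dev)\n",
"    n = 5  # hypothetical number of results to copy back\n",
"    # Preallocate staging buffers in pinned host memory\n",
"    buffers = [torch.empty(x.shape, dtype=x.dtype, pin_memory=True)\n",
"               for _ in range(n)]\n",
"    for buf in buffers:\n",
"        y = x.mm(x)  # enqueue the computation on the GPU\n",
"        buf.copy_(y, non_blocking=True)  # enqueue the device-to-host copy\n",
"    torch.cuda.synchronize(dev)  # wait before reading the buffers on the CPU\n",
"    print(buffers[0].device, buffers[0].is_pinned())"
]
},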
{
"cell_type": "markdown",
"id": "58a269e8",
"metadata": {
"origin_pos": 30
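},
"source": [
"The total time required for both operations is less than the sum of their parts. Note that this task differs from parallel computation in that it uses a different resource: the bus between the CPU and the GPUs. In fact, we could compute on both devices and communicate, all at the same time. As noted above, there is a dependency between computation and communication: `y[i]` must be computed before it can be copied to the CPU. Fortunately, the system can copy `y[i-1]` while computing `y[i]` to reduce the total running time.\n",
"\n",
"We conclude with an illustration of the computational graph and its dependencies for a simple two-layer MLP trained on a CPU and two GPUs, as depicted in :numref:`fig_twogpu`. It would be quite painful to schedule the resulting parallel program manually. This is where a graph-based computing backend offers an advantage for optimization.\n",
"\n",
"![Computational graph and its dependencies for a two-layer MLP on a CPU and two GPUs.](../img/twogpu.svg)\n",
":label:`fig_twogpu`\n",
"\n",
"## Summary\n",
"\n",
"* Modern systems have a variety of devices, such as multiple GPUs and CPUs, which can be used in parallel and asynchronously.\n",
"* Modern systems also have a variety of communication resources, such as PCI Express, storage (typically solid-state drives or network storage), and network bandwidth. They can be used in parallel for peak efficiency.\n",
"* The backend can improve performance by automatically parallelizing computation and communication.\n",
"\n",
"## Exercises\n",
"\n",
"1. The `run` function defined in this section executes 50 operations with no dependencies between them. Design an experiment to see whether the deep learning framework executes them in parallel automatically.\n",
"1. When the workload of an individual operator is sufficiently small, parallelization can help even on a single CPU or GPU. Design an experiment to verify this.\n",
"1. Design an experiment that uses parallel computation and communication on both CPUs and GPUs.\n",
"1. Use a profiling or debugging tool such as NVIDIA's [Nsight](https://developer.nvidia.com/nsight-compute-2019_5) to verify that your code is efficient.\n",
"1. Design computation tasks with more complex data dependencies, and run experiments to see whether you can obtain correct results while improving performance.\n"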
]
},
{
"cell_type": "markdown",
"id": "88f15d8c",
"metadata": {
"origin_pos": 32,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/2794)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}