{
"cells": [
{
"cell_type": "markdown",
"id": "77479af3",
"metadata": {
"origin_pos": 0
},
"source": [
"# Asynchronous Computation\n",
":label:`sec_async`\n",
"\n",
"Today's computers are highly parallel systems, consisting of multiple CPU cores (often multiple threads per core), multiple processing elements per GPU, and often multiple GPUs per device. In short, we can process many different things at the same time, often on different devices. Unfortunately, Python is not a great way of writing parallel and asynchronous code, at least not without some extra help. After all, Python is single-threaded and this is unlikely to change in the future. Deep learning frameworks such as MXNet and TensorFlow adopt an *asynchronous programming* model to improve performance, while PyTorch uses Python's own scheduler, leading to a different performance trade-off. For PyTorch, GPU operations are asynchronous by default: when you call a function that uses the GPU, the operations are enqueued to the particular device, but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on the CPU or other GPUs.\n",
"\n",
"Hence, understanding how asynchronous programming works helps us to develop more efficient programs, by proactively reducing computational requirements and mutual dependencies. This allows us to reduce memory overhead and increase processor utilization.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "66ebecda",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:29:23.676819Z",
"iopub.status.busy": "2023-08-18T07:29:23.676275Z",
"iopub.status.idle": "2023-08-18T07:29:26.719058Z",
"shell.execute_reply": "2023-08-18T07:29:26.717749Z"
},
"origin_pos": 2,
"tab": [
"pytorch"
]
},
"outputs": [],
"source": [
"import os\n",
"import subprocess\n",
"import numpy\n",
"import torch\n",
"from torch import nn\n",
"from d2l import torch as d2l"
]
},
{
"cell_type": "markdown",
"id": "86ca52da",
"metadata": {
"origin_pos": 4
},
"source": [
"## Asynchrony via Backend\n"
]
},
{
"cell_type": "markdown",
"id": "0fbdcd2b",
"metadata": {
"origin_pos": 6,
"tab": [
"pytorch"
]
},
"source": [
"For a warmup, consider the following toy problem: generate a random matrix and multiply it by itself. Let's do that both in NumPy and in a PyTorch tensor to see the difference. Note that the PyTorch `tensor` is defined on a GPU.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e4c20b11",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:29:26.723694Z",
"iopub.status.busy": "2023-08-18T07:29:26.723007Z",
"iopub.status.idle": "2023-08-18T07:29:29.882717Z",
"shell.execute_reply": "2023-08-18T07:29:29.881143Z"
},
"origin_pos": 9,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"numpy: 1.0704 sec\n",
"torch: 0.0013 sec\n"
]
}
],
"source": [
"# Warm up the GPU device\n",
"device = d2l.try_gpu()\n",
"a = torch.randn(size=(1000, 1000), device=device)\n",
"b = torch.mm(a, a)\n",
"\n",
"with d2l.Benchmark('numpy'):\n",
" for _ in range(10):\n",
" a = numpy.random.normal(size=(1000, 1000))\n",
" b = numpy.dot(a, a)\n",
"\n",
"with d2l.Benchmark('torch'):\n",
" for _ in range(10):\n",
" a = torch.randn(size=(1000, 1000), device=device)\n",
" b = torch.mm(a, a)"
]
},
{
"cell_type": "markdown",
"id": "c8188fde",
"metadata": {
"origin_pos": 12,
"tab": [
"pytorch"
]
},
"source": [
"The benchmark output via PyTorch is orders of magnitude faster. The NumPy dot product is executed on the CPU, while the PyTorch matrix multiplication is executed on the GPU, so the latter is expected to be much faster. But the huge time difference suggests something else must be going on. By default, GPU operations are asynchronous in PyTorch. Forcing PyTorch to finish all computation before returning shows what happened previously: the computation was being executed by the backend while the frontend returned control to Python.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "78106858",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:29:29.891458Z",
"iopub.status.busy": "2023-08-18T07:29:29.890289Z",
"iopub.status.idle": "2023-08-18T07:29:29.904366Z",
"shell.execute_reply": "2023-08-18T07:29:29.902435Z"
},
"origin_pos": 15,
"tab": [
"pytorch"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done: 0.0049 sec\n"
]
}
],
"source": [
"with d2l.Benchmark():\n",
" for _ in range(10):\n",
" a = torch.randn(size=(1000, 1000), device=device)\n",
" b = torch.mm(a, a)\n",
" torch.cuda.synchronize(device)"
]
},
{
"cell_type": "markdown",
"id": "eb45905d",
"metadata": {
"origin_pos": 18,
"tab": [
"pytorch"
]
},
"source": [
"Broadly speaking, PyTorch has a frontend for direct interaction with users, e.g., via Python, as well as a backend used by the system to perform the computation. As shown in :numref:`fig_frontends`, users can write PyTorch programs in various frontend languages, such as Python and C++. Regardless of the frontend programming language used, the execution of PyTorch programs occurs primarily in the backend of C++ implementations. Operations issued by the frontend language are passed on to the backend for execution. The backend manages its own threads, which continuously collect and execute queued tasks. Note that for this to work, the backend must be able to keep track of the dependencies between the various steps in the computational graph. Hence, it is not possible to parallelize operations that depend on each other.\n"
]
},
{
"cell_type": "markdown",
"id": "751ec224",
"metadata": {
"origin_pos": 20
},
"source": [
"![Programming language frontends and deep learning framework backends](../img/frontends.png)\n",
":width:`300px`\n",
":label:`fig_frontends`\n",
"\n",
"Let's look at another toy example to understand the dependency graph a bit better.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e4b981d5",
"metadata": {
"execution": {
"iopub.execute_input": "2023-08-18T07:29:29.910704Z",
"iopub.status.busy": "2023-08-18T07:29:29.910033Z",
"iopub.status.idle": "2023-08-18T07:29:29.963733Z",
"shell.execute_reply": "2023-08-18T07:29:29.962149Z"
},
"origin_pos": 22,
"tab": [
"pytorch"
]
},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[3., 3.]], device='cuda:0')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = torch.ones((1, 2), device=device)\n",
"y = torch.ones((1, 2), device=device)\n",
"z = x * y + 2\n",
"z"
]
},
{
"cell_type": "markdown",
"id": "47f52925",
"metadata": {
"origin_pos": 24
},
"source": [
"![The backend tracks dependencies between the various steps in the computational graph](../img/asyncgraph.svg)\n",
":label:`fig_asyncgraph`\n",
"\n",
"The code snippet above is illustrated in :numref:`fig_asyncgraph`. Whenever the Python frontend thread executes one of the first three statements, it simply returns the task to the backend queue. When the result of the last statement needs to be printed, the Python frontend thread waits for the C++ backend thread to finish computing the result of the variable `z`. One benefit of this design is that the Python frontend thread does not need to perform actual computations. Thus, there is little impact on the program's overall performance, regardless of Python's performance. :numref:`fig_threading` illustrates how frontend and backend interact.\n",
"\n",
"![Interactions of the frontend and backend](../img/threading.svg)\n",
":label:`fig_threading`\n",
"\n",
"## Barriers and Blockers\n"
]
},
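{
"cell_type": "markdown",
"id": "a3f1c9d2",
"metadata": {},
"source": [
"In PyTorch, besides the explicit `torch.cuda.synchronize()` used above, operations that copy results back to Python, such as `item()` or conversion to NumPy, also act as implicit barriers: the frontend blocks until the backend has finished computing the value. A minimal sketch, assuming the `device` defined earlier in this section:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7e2d4f8",
"metadata": {},
"outputs": [],
"source": [
"# Sketch of explicit and implicit barriers (reuses `device` from above)\n",
"z = torch.ones((1, 2), device=device) * 2\n",
"if torch.cuda.is_available():\n",
"    # Explicit barrier: wait for all queued work on the device\n",
"    torch.cuda.synchronize(device)\n",
"# Implicit barrier: `item` blocks until `z` has been computed\n",
"z.sum().item()"
]
},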
{
"cell_type": "markdown",
"id": "8c107794",
"metadata": {
"origin_pos": 29
},
"source": [
"## Improving Computation\n"
]
},
{
"cell_type": "markdown",
"id": "185ccf3b",
"metadata": {
"origin_pos": 32
},
"source": [
"A simplified interaction between the Python frontend thread and the C++ backend thread can be summarized as follows:\n",
"\n",
"1. The frontend orders the backend to insert the computation task `y = x + 1` into the queue.\n",
"1. The backend then receives the computation tasks from the queue and performs the actual computations.\n",
"1. The backend then returns the computation results to the frontend.\n",
"\n",
"Assume that the durations of these three stages are $t_1, t_2, t_3$, respectively. If we do not use asynchronous programming, the total time taken to perform 10000 computations is approximately $10000 (t_1 + t_2 + t_3)$. If asynchronous programming is used, the total time can be reduced to $t_1 + 10000 t_2 + t_3$ (assuming $10000 t_2 > 9999 t_1$), since the frontend does not have to wait for the backend to return computation results for each loop.\n",
"\n",
"\n",
"## Summary\n",
"\n",
"* Deep learning frameworks may decouple the Python frontend from an execution backend. This allows for fast asynchronous insertion of commands into the backend and associated parallelism.\n",
"* Asynchrony leads to a rather responsive frontend. However, use caution not to overfill the task queue, since it may lead to excessive memory consumption. It is recommended to synchronize for each minibatch to keep frontend and backend approximately synchronized.\n",
"* Chip vendors offer sophisticated performance analysis tools to obtain a much more fine-grained insight into the efficiency of deep learning.\n"
]
},
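{
"cell_type": "markdown",
"id": "c5d8e1a6",
"metadata": {},
"source": [
"The arithmetic above can be checked directly; the stage durations below are hypothetical numbers chosen only for illustration:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9f3b2c7",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical stage durations (seconds), for illustration only\n",
"t1, t2, t3 = 0.1, 1.0, 0.1  # enqueue, compute, return\n",
"n = 10000\n",
"sync_total = n * (t1 + t2 + t3)\n",
"async_total = t1 + n * t2 + t3  # valid when n * t2 > (n - 1) * t1\n",
"print(f'synchronous: {sync_total:.1f}s, asynchronous: {async_total:.1f}s')"
]
},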
{
"cell_type": "markdown",
"id": "8a353540",
"metadata": {
"origin_pos": 34
},
"source": [
"## Exercises\n"
]
},
{
"cell_type": "markdown",
"id": "1989c931",
"metadata": {
"origin_pos": 36,
"tab": [
"pytorch"
]
},
"source": [
"1. On the CPU, benchmark the same matrix multiplication operations from this section. Can you still observe asynchrony via the backend?\n"
]
},
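{
"cell_type": "markdown",
"id": "e2a7c4b9",
"metadata": {},
"source": [
"As one possible starting point for the exercise (a sketch, reusing the `d2l.Benchmark` helper from above), the same workload can be timed on the CPU; `b.sum().item()` forces the result back to Python before the timer stops:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6b1d8e3",
"metadata": {},
"outputs": [],
"source": [
"# Exercise sketch: the same matrix multiplications on the CPU\n",
"with d2l.Benchmark('torch (cpu)'):\n",
"    for _ in range(10):\n",
"        a = torch.randn(size=(1000, 1000))\n",
"        b = torch.mm(a, a)\n",
"\n",
"# Copying the result back to Python adds an explicit blocking point\n",
"with d2l.Benchmark('torch (cpu, blocking on result)'):\n",
"    for _ in range(10):\n",
"        a = torch.randn(size=(1000, 1000))\n",
"        b = torch.mm(a, a)\n",
"    b.sum().item()"
]
},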
{
"cell_type": "markdown",
"id": "85daf71a",
"metadata": {
"origin_pos": 39,
"tab": [
"pytorch"
]
},
"source": [
"[Discussions](https://discuss.d2l.ai/t/2791)\n"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"required_libs": []
},
"nbformat": 4,
"nbformat_minor": 5
}