浅谈Tensorflow2对GPU内存的分配策略

2025-02-03 12:00:20

一、问题源起

从以下的异常堆栈可以看到是BLAS程序集初始化失败，可以看到是执行MatMul的时候发生的异常，基本可以断定可能数据集太大导致memory不够用了。

2021-08-10 16:38:04.917501: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.960048: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.986898: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.992366: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.992389: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/home/mango/PycharmProjects/DeepLearing/minist_conv.py", line 32, in <module>
    model.fit(train_images, train_labels, epochs=5, batch_size=64)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/keras/engine/training.py", line 1183, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 3023, in __call__
    return graph_function._call_flat(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError:  Blas xGEMM launch failed : a.shape=[1,64,576], b.shape=[1,576,64], m=64, n=64, k=576
  [[node sequential/dense/MatMul (defined at home/mango/PycharmProjects/DeepLearing/minist_conv.py:32) ]] [Op:__inference_train_function_993]

Function call stack:
train_function

二、开发环境

mango@mango-ubuntu:~$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda~~ compilation tools, release 11.4, V11.4.100==
Build cuda_11.4.r11.4/compiler.30188945_0

mango@mango-ubuntu:~$ tail -n 10 /usr/include/cudnn_version.h
#ifndef CUDNN_VERSION_H_
#define CUDNN_VERSION_H_

#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 2

#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#endif /* CUDNN_VERSION_H */

mango@mango-ubuntu:~$ python3 --version
Python 3.9.5

mango@mango-ubuntu:~$ nvidia-smi
Tue Aug 10 19:57:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   54C    P0    N/A /  N/A |    329MiB /  2002MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1818      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      2002      G   /usr/bin/gnome-shell               45MiB |
|    0   N/A  N/A      3435      G   ...AAAAAAAAA= --shared-files       75MiB |
|    0   N/A  N/A      6016      G   python3                            13MiB |
+-----------------------------------------------------------------------------+

mango@mango-ubuntu:~$ python3
Python 3.9.5 (default, May 11 2021, 08:20:37)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-08-10 18:33:05.917520: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.__version__
'2.5.0'
>>>

三、Tensorflow针对GPU内存的分配策略

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.

默认情况下，为了通过减少内存碎片更有效地利用设备上相对宝贵的GPU内存资源，TensorFlow进程会使用所有可见的GPU。

In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two methods to control this.

在某些情况下，进程只分配可用内存的一个子集，或者只根据进程的需要增加内存使用量。TensorFlow提供了两种方法来控制这种情况。

The first option is to turn on memory growth by calling tf.config.experimental.set_memory_growth, which attempts to allocate only as much GPU memory as needed for the runtime allocations: it starts out allocating very little memory, and as the program gets run and more GPU memory is needed, the GPU memory region is extended for the TensorFlow process. Memory is not released since it can lead to memory fragmentation. To turn on memory growth for a specific GPU, use the following code prior to allocating any tensors or executing any ops.

第一种选择是通过调用tf.config.experimental.set_memory_growth来打开内存增长，它尝试只分配运行时所需的GPU内存:它开始分配很少的内存，当程序运行时需要更多的GPU内存时，GPU内存区域会进一步扩展增大。内存不会被释放，因为这会导致内存碎片。为了打开特定GPU的内存增长，在分配任何张量或执行任何操作之前，使用以下代码。

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

Another way to enable this option is to set the environmental variable TF_FORCE_GPU_ALLOW_GROWTH to true. This configuration is platform specific.

启用该选项的另一种方法是将环境变量TF_FORCE_GPU_ALLOW_GROWTH设置为true。此配置是特定于平台的。

The second method is to configure a virtual GPU device with tf.config.experimental.set_virtual_device_configuration and set a hard limit on the total memory to allocate on the GPU.

This is useful if you want to truly bound the amount of GPU memory available to the TensorFlow process. This is common practice for local development when the GPU is shared with other applications such as a workstation GUI.

第二种方法是使用tf.config.experimental.set_virtual_device_configuration配置虚拟GPU设备,并设置GPU上可分配的总内存的硬限制。

如果你想真正将GPU内存的数量绑定到TensorFlow进程中，这是非常有用的。当GPU与其他应用程序(如工作站GUI)共享时，这是本地开发的常见做法。

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

四、问题分析验证

通过上边对TensorFlow文档的分析，默认情况下会占用所有的GPU内存，但是TensorFlow提供了两种方式可以灵活的控制内存的分配策略；

我们可以直接设置GPU内存按需动态分配

import tensorflow as tf
physical_gpus = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_gpus[0], True)

通过以下命令可以看到执行过程中GPU内存的占用最高为697M

mango@mango-ubuntu:~$ while true; do nvidia-smi; sleep 0.2; done;
Tue Aug 10 20:30:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   58C    P0    N/A /  N/A |   1026MiB /  2002MiB |     72%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1818      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      2002      G   /usr/bin/gnome-shell               45MiB |
|    0   N/A  N/A      3435      G   ...AAAAAAAAA= --shared-files       73MiB |
|    0   N/A  N/A      6016      G   python3                            13MiB |
|    0   N/A  N/A     13829      C   /usr/bin/python3.9                697MiB |
+-----------------------------------------------------------------------------+

我们也可以限制最多使用1024M的GPU内存

import tensorflow as tf
physical_gpus = tf.config.list_physical_devices('GPU')
tf.config.set_logical_device_configuration(physical_gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])

同样通过命令可以看到执行过程中GPU内存的占用最高为1455M

mango@mango-ubuntu:~$ while true; do nvidia-smi; sleep 0.2; done;
Tue Aug 10 20:31:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   58C    P0    N/A /  N/A |   1784MiB /  2002MiB |     74%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1818      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      2002      G   /usr/bin/gnome-shell               46MiB |
|    0   N/A  N/A      3435      G   ...AAAAAAAAA= --shared-files       72MiB |
|    0   N/A  N/A      6016      G   python3                            13MiB |
|    0   N/A  N/A     13570      C   /usr/bin/python3.9               1455MiB |
+-----------------------------------------------------------------------------+

五、GPU分配策略分析

通过四中的测试结果可得

默认的分配策略会占用所有的内存，并且执行中不会进行释放，如果训练数据量比较打很容易内存不够用；
限制最大使用内存，测试占用内存比设置的大，这个可能跟训练中间使用的模型和操作的复杂程度有关系，需要根据具体的业务场景设置合适的值；但是要注意不能设置大了，否则还是会报错，但是设置小了只是执行的慢一些罢了；
设置内存按需分配可能是一个相对比较中庸的方案，感觉可能是一个更好的方案，不知道TensorFlow为什么没有设置为默认值，留作一个问题，后续有新的认知的话再补充；

六、扩展

单GPU模拟多GPU环境

当我们的本地开发环境只有一个GPU，但却需要编写多GPU的程序在工作站上进行训练任务时，TensorFlow为我们提供了一个方便的功能，可以让我们在本地开发环境中建立多个模拟GPU，从而让多GPU的程序调试变得更加方便。以下代码在实体GPU GPU:0 的基础上建立了两个显存均为2GB的虚拟GPU。

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Create 2 virtual GPUs with 1GB memory each
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
         tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

多GPU的数据并行

使用 tf.distribute.Strategy可以将模型拷贝到每个GPU上，然后将训练数据分批在不同的GPU上执行，达到数据并行。

tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpus)
with strategy.scope():
  inputs = tf.keras.layers.Input(shape=(1,))
  predictions = tf.keras.layers.Dense(1)(inputs)
  model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
  model.compile(loss='mse',
                optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))

到此这篇关于浅谈Tensorflow2对GPU内存的分配策略的文章就介绍到这了,更多相关Tensorflow2 GPU内存分配内容请搜索我们以前的文章或继续浏览下面的相关文章希望大家以后多多支持我们！

Keras设定GPU使用内存大小方式(Tensorflow backend)

通过设置Keras的Tensorflow后端的全局变量达到. import os import tensorflow as tf import keras.backend.tensorflow_backend as KTF def get_session(gpu_fraction=0.3): '''Assume that you have 6GB of GPU memory and want to allocate ~2GB''' num_threads = os.environ.get('OM
浅谈Tensorflow2对GPU内存的分配策略

目录一.问题源起二.开发环境三.Tensorflow针对GPU内存的分配策略四.问题分析验证五.GPU分配策略分析六.扩展一.问题源起从以下的异常堆栈可以看到是BLAS程序集初始化失败,可以看到是执行MatMul的时候发生的异常,基本可以断定可能数据集太大导致memory不够用了. 2021-08-10 16:38:04.917501: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cub
浅谈redis采用不同内存分配器tcmalloc和jemalloc

我们知道Redis并没有自己实现内存池,没有在标准的系统内存分配器上再加上自己的东西.所以系统内存分配器的性能及碎片率会对Redis造成一些性能上的影响. 在Redis的 zmalloc.c 源码中,我们可以看到如下代码: /* Double expansion needed for stringification of macro values. */ #define __xstr(s) __str(s) #define __str(s) #s #if defined(USE_TCMALLOC
浅谈Java堆外内存之突破JVM枷锁

对于有Java开发经验的朋友都知道,Java中不需要手动的申请和释放内存,JVM会自动进行垃圾回收:而使用的内存是由JVM控制的. 那么,什么时机会进行垃圾回收,如何避免过度频繁的垃圾回收?如果JVM给的内存不够用,怎么办? 此时,堆外内存登场!利用堆外内存,不仅可以随意操控内存,还能提高网络交互的速度. 背景1:JVM内存的分配对于JVM的内存规则,应该是老生常谈的东西了,这里我就简单的说下: 新生代:一般来说新创建的对象都分配在这里. 年老代:经过几次垃圾回收,新生代的对象就会放在年老代里
浅谈C++对象的内存分布和虚函数表

c++中一个类中无非有四种成员:静态数据成员和非静态数据成员,静态函数和非静态函数. 1.非静态数据成员被放在每一个对象体内作为对象专有的数据成员. 2.静态数据成员被提取出来放在程序的静态数据区内,为该类所有对象共享,因此只存在一份. 3.静态和非静态成员函数最终都被提取出来放在程序的代码段中并为该类所有对象共享,因此每一个成员函数也只能存在一份代码实体.在c++中类的成员函数都是保存在静态存储区中的 ,那静态函数也是保存在静态存储区中的,他们都是在类中保存同一个惫份. 因此,构成对象本身的只
浅谈Redis中的内存淘汰策略和过期键删除策略

目录 8种淘汰策略过期键的删除策略总结 redis是我们现在最常用的一个工具,帮助我们建设系统的高可用,高性能. 而且我们都知道redis是一个完全基于内存的工具,这也是redis速度快的一个原因,当我们往redis中不断缓存数据的时候,其内存总有满的时候(而且内存是很贵的东西,尽量省着点用),所以尽可能把有用的数据,或者使用频繁的数据缓存在redis中,物尽其用. 那么如果正在使用的redis内存用完了,我们应该怎么取舍redis中已存在的数据和即将要存入的数据呢,我们要怎么处理呢? re
浅谈redis的maxmemory设置以及淘汰策略

redis的maxmemory参数用于控制redis可使用的最大内存容量.如果超过maxmemory的值,就会动用淘汰策略来处理expaire字典中的键. 关于redis的淘汰策略: Redis提供了下面几种淘汰策略供用户选择,其中默认的策略为noeviction策略: · noeviction:当内存使用达到阈值的时候,所有引起申请内存的命令会报错. · allkeys-lru:在主键空间中,优先移除最近未使用的key. · volatile-lru:在设置了过期时间的键空间中,优
浅谈SQL Server 对于内存的管理[图文]

理解SQL Server对于内存的管理是对于SQL Server问题处理和性能调优的基本,本篇文章讲述SQL Server对于内存管理的内存原理. 二级存储(secondary storage) 对于计算机来说,存储体系是分层级的.离CPU越近的地方速度愉快,但容量越小(如图1所示).比如:传统的计算机存储体系结构离CPU由近到远依次是:CPU内的寄存器,一级缓存,二级缓存,内存,硬盘.但同时离CPU越远的存储系统都会比之前的存储系统大一个数量级.比如硬盘通常要比同时代的内存大一个数量级. 图1
浅谈Android应用的内存优化及Handler的内存泄漏问题

一.Android内存基础物理内存与进程内存物理内存即移动设备上的RAM,当启动一个Android程序时,会启动一个Dalvik VM进程,系统会给它分配固定的内存空间(16M,32M不定),这块内存空间会映射到RAM上某个区域.然后这个Android程序就会运行在这块空间上.Java里会将这块空间分成Stack栈内存和Heap堆内存.stack里存放对象的引用,heap里存放实际对象数据. 在程序运行中会创建对象,如果未合理管理内存,比如不及时回收无效空间就会造成内存泄露,严重的话可能导致
浅谈C#互操作的内存溢出问题

c#调用C++DLL代码,发现了一个隐藏很深的问题. 危害很大,而且不易察觉. 大概是申明c++的函数时候,有一个long类型的指针.在C#中我的申明成了这样: public extern void Method(ref uint para); 最初怎么也没有发现这里面有什么问题,知道这个隐藏的问题暴露出来,把前面申明的一个变量改变了, 我才恍然大悟. 复制代码代码如下: uint test = 0;int *p = new IntPtr();Method(ref test); 在调用Meth
浅谈PostgreSQL消耗的内存计算方法

wal_buffers默认值为-1,此时wal_buffers使用的是shared_buffers,wal_buffers大小为shared_buffers的1/32 autovacuum_work_mem默认值为-1,此时使用maintenance_work_mem的值 1 不使用wal_buffers.autovacuum_work_mem 计算公式为: max_connections*work_mem + max_connections*temp_buffers +shared_buffe