解决Pytorch自定义层出现多Variable共享内存错误问题

2025-05-05 19:47:43

错误信息:

RuntimeError: in-place operations can be only used on variables that don't share storage with any other variables, but detected that there are 4 objects sharing it

自动求导是很方便, 但是想想, 如果两个Variable共享内存, 再对这个共享的内存的数据进行修改, 就会引起错误!

一般是由于 inplace操作或是indexing或是转置. 这些都是共享内存的.

 @staticmethod
 def backward(ctx, grad_output):
  ind_lst = ctx.ind_lst
  flag = ctx.flag

  c = grad_output.size(1)
  grad_former_all = grad_output[:, 0:c//3, :, :]
  grad_latter_all = grad_output[:, c//3: c*2//3, :, :]
  grad_swapped_all = grad_output[:, c*2//3:c, :, :]

  spatial_size = ctx.h * ctx.w

  W_mat_all = Variable(ctx.Tensor(ctx.bz, spatial_size, spatial_size).zero_())
  for idx in range(ctx.bz):
   W_mat = W_mat_all.select(0,idx)
   for cnt in range(spatial_size):
    indS = ind_lst[idx][cnt] 

    if flag[cnt] == 1:
     # 这里W_mat是W_mat_all通过select出来的, 他们共享内存.
     W_mat[cnt, indS] = 1

   W_mat_t = W_mat.t()

   grad_swapped_weighted = torch.mm(W_mat_t, grad_swapped_all[idx].view(c//3, -1).t())
   grad_swapped_weighted = grad_swapped_weighted.t().contiguous().view(1, c//3, ctx.h, ctx.w)
   grad_latter_all[idx] = torch.add(grad_latter_all[idx], grad_swapped_weighted.mul(ctx.triple_w))

由于这里W_mat是W_mat_all通过select出来的, 他们共享内存. 所以当对这个共享的内存进行修改W_mat[cnt, indS] = 1, 就会出错. 此时我们可以通过clone()将W_mat和W_mat_all独立出来. 这样的话, 梯度也会通过 clone()操作将W_mat的梯度正确反传到W_mat_all中.

 @staticmethod
 def backward(ctx, grad_output):
  ind_lst = ctx.ind_lst
  flag = ctx.flag

  c = grad_output.size(1)
  grad_former_all = grad_output[:, 0:c//3, :, :]
  grad_latter_all = grad_output[:, c//3: c*2//3, :, :]
  grad_swapped_all = grad_output[:, c*2//3:c, :, :]

  spatial_size = ctx.h * ctx.w

  W_mat_all = Variable(ctx.Tensor(ctx.bz, spatial_size, spatial_size).zero_())
  for idx in range(ctx.bz):
   # 这里使用clone了
   W_mat = W_mat_all.select(0,idx).clone()
   for cnt in range(spatial_size):
    indS = ind_lst[idx][cnt]

    if flag[cnt] == 1:
     W_mat[cnt, indS] = 1

   W_mat_t = W_mat.t()

   grad_swapped_weighted = torch.mm(W_mat_t, grad_swapped_all[idx].view(c//3, -1).t())
   grad_swapped_weighted = grad_swapped_weighted.t().contiguous().view(1, c//3, ctx.h, ctx.w)

   # 这句话删了不会出错, 加上就吹出错
   grad_latter_all[idx] = torch.add(grad_latter_all[idx], grad_swapped_weighted.mul(ctx.triple_w))

但是现在却出现 4个objects共享内存. 如果将最后一句话删掉, 那么则不会出错.

如果没有最后一句话, 我们看到

grad_swapped_weighted = torch.mm(W_mat_t, grad_swapped_all[idx].view(c//3, -1).t())

grad_swapped_weighted = grad_swapped_weighted.t().contiguous().view(1, c//3, ctx.h, ctx.w)

grad_swapped_weighted 一个新的Variable, 因此并没有和其他Variable共享内存, 所以不会出错. 但是最后一句话,

grad_latter_all[idx] = torch.add(grad_latter_all[idx], grad_swapped_weighted.mul(ctx.triple_w))

你可能会说, 不对啊, 修改grad_latter_all[idx]又没有创建新的Variable, 怎么会出错. 这是因为grad_latter_all和grad_output是共享内存的. 因为 grad_latter_all = grad_output[:, c//3: c*2//3, :, :], 所以这里的解决方案是:

 @staticmethod
 def backward(ctx, grad_output):
  ind_lst = ctx.ind_lst
  flag = ctx.flag

  c = grad_output.size(1)
  grad_former_all = grad_output[:, 0:c//3, :, :]
  # 这两个后面修改值了, 所以也要加clone, 防止它们与grad_output共享内存
  grad_latter_all = grad_output[:, c//3: c*2//3, :, :].clone()
  grad_swapped_all = grad_output[:, c*2//3:c, :, :].clone()

  spatial_size = ctx.h * ctx.w

  W_mat_all = Variable(ctx.Tensor(ctx.bz, spatial_size, spatial_size).zero_())
  for idx in range(ctx.bz):
   W_mat = W_mat_all.select(0,idx).clone()
   for cnt in range(spatial_size):
    indS = ind_lst[idx][cnt]

    if flag[cnt] == 1:
     W_mat[cnt, indS] = 1

   W_mat_t = W_mat.t()

   grad_swapped_weighted = torch.mm(W_mat_t, grad_swapped_all[idx].view(c//3, -1).t())

   grad_swapped_weighted = grad_swapped_weighted.t().contiguous().view(1, c//3, ctx.h, ctx.w)
   grad_latter_all[idx] = torch.add(grad_latter_all[idx], grad_swapped_weighted.mul(ctx.triple_w))

  grad_input = torch.cat([grad_former_all, grad_latter_all], 1)

  return grad_input, None, None, None, None, None, None, None, None, None, None

补充知识：Pytorch 中 expand, expand_as是共享内存的，只是原始数据的一个视图 view

如下所示：

mask = mask_miss.expand_as(sxing).clone() # type: torch.Tensor
mask[:, :, -2, :, :] = 1 # except for person mask channel

为了避免对expand后对某个channel操作会影响原始tensor的全部元素，需要使用clone()

如果没有clone()，对mask_miss的某个通道赋值后，所有通道上的tensor都会变成1！

# Notice! expand does not allocate more memory but just make the tensor look as if you expanded it.
# You should call .clone() on the resulting tensor if you plan on modifying it
# https://discuss.pytorch.org/t/very-strange-behavior-change-one-element-of-a-tensor-will-influence-all-elements/41190

以上这篇解决Pytorch自定义层出现多Variable共享内存错误问题就是小编分享给大家的全部内容了，希望能给大家一个参考，也希望大家多多支持我们。

解决pytorch GPU 计算过程中出现内存耗尽的问题

Pytorch GPU运算过程中会出现:"cuda runtime error(2): out of memory"这样的错误.通常,这种错误是由于在循环中使用全局变量当做累加器,且累加梯度信息的缘故,用官方的说法就是:"accumulate history across your training loop".在默认情况下,开启梯度计算的Tensor变量是会在GPU保持他的历史数据的,所以在编程或者调试过程中应该尽力避免在循环中累加梯度信息. 下面举个栗子: 上代
Pytorch 实现自定义参数层的例子

注意,一般官方接口都带有可导功能,如果你实现的层不具有可导功能,就需要自己实现梯度的反向传递. 官方Linear层: class Linear(Module): def __init__(self, in_features, out_features, bias=True): super(Linear, self).__init__() self.in_features = in_features self.out_features = out_features self.weight = Pa
PyTorch中的Variable变量详解

一.了解Variable 顾名思义,Variable就是变量的意思.实质上也就是可以变化的量,区别于int变量,它是一种可以变化的变量,这正好就符合了反向传播,参数更新的属性. 具体来说,在pytorch中的Variable就是一个存放会变化值的地理位置,里面的值会不停发生片花,就像一个装鸡蛋的篮子,鸡蛋数会不断发生变化.那谁是里面的鸡蛋呢,自然就是pytorch中的tensor了.(也就是说,pytorch都是有tensor计算的,而tensor里面的参数都是Variable的形式).如果
pytorch数据预处理错误的解决

出错: Traceback (most recent call last): File "train.py", line 305, in <module> train_model(model_conv, criterion, optimizer_conv, exp_lr_scheduler) File "train.py", line 145, in train_model for inputs, age_labels, gender_labels in
解决Pytorch自定义层出现多Variable共享内存错误问题

错误信息: RuntimeError: in-place operations can be only used on variables that don't share storage with any other variables, but detected that there are 4 objects sharing it 自动求导是很方便, 但是想想, 如果两个Variable共享内存, 再对这个共享的内存的数据进行修改, 就会引起错误! 一般是由于 inplace操作或是ind
解决Keras 自定义层时遇到版本的问题

在2.2.0版本前, from keras import backend as K from keras.engine.topology import Layer class MyLayer(Layer): def __init__(self, output_dim, **kwargs): self.output_dim = output_dim super(MyLayer, self).__init__(**kwargs) def build(self, input_shape): # 为该层
Pytorch: 自定义网络层实例

自定义Autograd函数对于浅层的网络,我们可以手动的书写前向传播和反向传播过程.但是当网络变得很大时,特别是在做深度学习时,网络结构变得复杂.前向传播和反向传播也随之变得复杂,手动书写这两个过程就会存在很大的困难.幸运地是在pytorch中存在了自动微分的包,可以用来解决该问题.在使用自动求导的时候,网络的前向传播会定义一个计算图(computational graph),图中的节点是张量(tensor),两个节点之间的边对应了两个张量之间变换关系的函数.有了计算图的存在,张量的梯度计算也
pytorch自定义初始化权重的方法

在常见的pytorch代码中,我们见到的初始化方式都是调用init类对每层所有参数进行初始化.但是,有时我们有些特殊需求,比如用某一层的权重取优化其它层,或者手动指定某些权重的初始值. 核心思想就是构造和该层权重同一尺寸的矩阵去对该层权重赋值.但是,值得注意的是,pytorch中各层权重的数据类型是nn.Parameter,而不是Tensor或者Variable. import torch import torch.nn as nn import torch.optim as optim imp
pytorch自定义二值化网络层方式

任务要求: 自定义一个层主要是定义该层的实现函数,只需要重载Function的forward和backward函数即可,如下: import torch from torch.autograd import Function from torch.autograd import Variable 定义二值化函数 class BinarizedF(Function): def forward(self, input): self.save_for_backward(input) a = torch
Pytorch自定义CNN网络实现猫狗分类详解过程

目录前言一. 数据预处理二. 定义网络三. 训练模型前言数据集下载地址: 链接: https://pan.baidu.com/s/17aglKyKFvMvcug0xrOqJdQ?pwd=6i7m Dogs vs. Cats(猫狗大战)来源Kaggle上的一个竞赛题,任务为给定一个数据集,设计一种算法中的猫狗图片进行判别. 数据集包括25000张带标签的训练集图片,猫和狗各125000张,标签都是以cat or dog命名的.图像为RGB格式jpg图片,size不一样.截图如下: 一.
Pytorch卷积层手动初始化权值的实例

由于研究关系需要自己手动给卷积层初始化权值,但是好像博客上提到的相关文章比较少(大部分都只提到使用nn.init里的按照一定分布初始化方法),自己参考了下Pytorch的官方文档,发现有两种方法吧. 所以mark下. import torch import torch.nn as nn import torch.optim as optim import numpy as np # 第一一个卷积层,我们可以看到它的权值是随机初始化的 w=torch.nn.Conv2d(2,2,3,padding
解决Pytorch 加载训练好的模型遇到的error问题

这是一个非常愚蠢的错误 debug的时候要好好看error信息提醒自己切记好好对待error!切记!切记! -----------------------分割线---------------- pytorch 已经非常友好了保存模型和加载模型都只需要一条简单的命令 #保存整个网络和参数 torch.save(your_net, 'save_name.pkl') #加载保存的模型 net = torch.load('save_name.pkl') 因为我比较懒我就想直接把整个网络都保存下来,然
解决pytorch多GPU训练保存的模型,在单GPU环境下加载出错问题

背景在公司用多卡训练模型,得到权值文件后保存,然后回到实验室,没有多卡的环境,用单卡训练,加载模型时出错,因为单卡机器上,没有使用DataParallel来加载模型,所以会出现加载错误. 原因 DataParallel包装的模型在保存时,权值参数前面会带有module字符,然而自己在单卡环境下,没有用DataParallel包装的模型权值参数不带module.本质上保存的权值文件是一个有序字典. 解决方法 1.在单卡环境下,用DataParallel包装模型. 2.自己重写Load函数,灵活.
解决pytorch 交叉熵损失输出为负数的问题

网络训练中,loss曲线非常奇怪交叉熵怎么会有负数. 经过排查,交叉熵不是有个负对数吗,当网络输出的概率是0-1时,正数.可当网络输出大于1的数,就有可能变成负数. 所以加上一行就行了 out1 = F.softmax(out1, dim=1) 补充知识:在pytorch框架下,训练model过程中,loss=nan问题时该怎么解决? 当我在UCF-101数据集训练alexnet时,epoch设为100,跑到三十多个epoch时,出现了loss=nan问题,当时是一脸懵逼,在查阅资料后,我通过

解决Pytorch自定义层出现多Variable共享内存错误问题

相关推荐

随机推荐