Fixing degraded results in multi-GPU training with PyTorch DistributedDataParallel

2025-02-18 18:14:55

Configuring data shuffling in DDP

When using DDP, pass a sampler to the DataLoader: torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False). shuffle defaults to True, but note how PyTorch implements DistributedSampler:

    def __iter__(self) -> Iterator[T_co]:
        if self.shuffle:
            # deterministically shuffle based on epoch and seed
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
        else:
            indices = list(range(len(self.dataset)))  # type: ignore

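The seed used to generate the random indices depends on the current epoch, so the order only changes when self.epoch changes. Call set_epoch manually at the start of every epoch during training; otherwise each epoch replays the same permutation and the data is not really reshuffled: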

for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)
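
The effect of set_epoch can be checked without launching multiple processes. The sketch below is illustrative only: it builds a DistributedSampler with num_replicas and rank passed explicitly (so no process group is needed) over a toy dataset of 100 integers. Both choices are assumptions made for the demo, not part of the training script above.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 100 samples (hypothetical, for illustration only).
dataset = TensorDataset(torch.arange(100))

# Passing num_replicas and rank explicitly avoids the need for init_process_group.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=0)

# Without set_epoch, self.epoch stays 0, so every pass yields the same order.
first_pass = list(sampler)
second_pass = list(sampler)

# After set_epoch(1) the generator is seeded with seed + 1 and the order changes.
sampler.set_epoch(1)
third_pass = list(sampler)

print(first_pass == second_pass)  # compares two passes made without set_epoch
print(first_pass == third_pass)   # compares epoch 0 with epoch 1
```

Each rank receives half of the 100 indices here; what changes between epochs is which indices land on which rank and in what order.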

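Why results get worse when DDP increases the batch size

With DDP the effective (global) batch size is the per-GPU batch size multiplied by the number of GPUs, so moving to more GPUs usually means training with a larger batch.

Large batch size:

Theoretical advantage:

The influence of noise in the data may shrink, which may make it easier to approach the optimum.

Drawbacks and problems:

It lowers the variance of the gradient. In theory, for convex optimization problems, lower gradient variance gives better optimization; in practice, however, Keskar et al. showed that increasing the batch size leads to worse generalization.

For non-convex problems the loss function has many local optima. The noise of a small batch can help the optimizer jump out of a local optimum, whereas a large batch may settle in one and fail to escape.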
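
The variance claim can be made concrete under a simplifying assumption that the original does not state: the per-sample gradients g_i in a batch are drawn independently, each with mean \nabla L and covariance \Sigma. The mini-batch gradient averages B of them:

```latex
\hat{g}_B = \frac{1}{B}\sum_{i=1}^{B} g_i, \qquad
\mathbb{E}\left[\hat{g}_B\right] = \nabla L, \qquad
\operatorname{Cov}\left(\hat{g}_B\right) = \frac{1}{B}\,\Sigma
```

Since DDP averages gradients across processes, going from 1 GPU to N GPUs with the same per-GPU batch size replaces B by N·B and, under this assumption, divides the gradient covariance by N. That is the noise the paragraph above credits with helping small batches escape local optima.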

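Solutions:

Increase the learning rate. This can cause problems of its own: using a very large learning rate from the very start of training may keep the model from converging (https://arxiv.org/abs/1609.04836).

Use warm-up (https://arxiv.org/abs/1706.02677).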
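
One common recipe for the first remedy is the linear scaling rule from the warm-up paper cited above (arXiv:1706.02677): when the global batch size grows by a factor k, multiply the learning rate by k. The sketch below is only illustrative; the function name and the reference values (a learning rate of 0.1 for a batch of 256) are assumptions, not settings from this post.

```python
def scaled_lr(ref_lr, ref_batch_size, per_gpu_batch_size, world_size):
    """Linear scaling rule: the learning rate grows in proportion to the global batch size."""
    # Under DDP each process loads its own batch, so the global batch is the product.
    global_batch_size = per_gpu_batch_size * world_size
    return ref_lr * global_batch_size / ref_batch_size

# Hypothetical reference point: lr 0.1 tuned for a global batch of 256.
single_gpu_lr = scaled_lr(0.1, 256, per_gpu_batch_size=256, world_size=1)
eight_gpu_lr = scaled_lr(0.1, 256, per_gpu_batch_size=256, world_size=8)
print(single_gpu_lr, eight_gpu_lr)
```

The scaled value is the target learning rate; jumping to it at step 0 is exactly the situation in which training may fail to converge, which is why the rule is normally paired with warm-up.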

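Warm-up

Using a very large learning rate at the very beginning of training may prevent convergence. The idea of warm-up is to start with a small learning rate, increase it gradually during the early part of training until it reaches the base learning rate, and then continue training with another decay schedule (for example CosineAnnealingLR).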

# copy from https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/master/warmup_scheduler/scheduler.py
from torch.optim.lr_scheduler import _LRScheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau
class GradualWarmupScheduler(_LRScheduler):
    """ Gradually warm-up(increasing) learning rate in optimizer.
    Proposed in 'Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour'.
    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier if multiplier > 1.0. if multiplier = 1.0, lr starts from 0 and ends up with the base_lr.
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after target_epoch, use this scheduler(eg. ReduceLROnPlateau)
    """
    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier < 1.:
            raise ValueError('multiplier should be greater than or equal to 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super(GradualWarmupScheduler, self).__init__(optimizer)
    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            if self.after_scheduler:
                if not self.finished:
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_last_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]
        if self.multiplier == 1.0:
            return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs]
        else:
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
    def step_ReduceLROnPlateau(self, metrics, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        self.last_epoch = epoch if epoch != 0 else 1  # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning
        if self.last_epoch <= self.total_epoch:
            warmup_lr = [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
            for param_group, lr in zip(self.optimizer.param_groups, warmup_lr):
                param_group['lr'] = lr
        else:
            if epoch is None:
                self.after_scheduler.step(metrics, None)
            else:
                self.after_scheduler.step(metrics, epoch - self.total_epoch)
    def step(self, epoch=None, metrics=None):
        if type(self.after_scheduler) != ReduceLROnPlateau:
            if self.finished and self.after_scheduler:
                if epoch is None:
                    self.after_scheduler.step(None)
                else:
                    self.after_scheduler.step(epoch - self.total_epoch)
                self._last_lr = self.after_scheduler.get_last_lr()
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
        else:
            self.step_ReduceLROnPlateau(metrics, epoch)
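
For comparison, recent PyTorch releases (1.10 and later, to my knowledge) ship built-in schedulers that can express the same warm-up-then-decay idea without a custom class. The following is a minimal sketch of that alternative, not the GradualWarmupScheduler above; the model, base learning rate, and epoch counts are placeholder assumptions.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(4, 2)  # placeholder model
base_lr, warmup_epochs, total_epochs = 0.4, 5, 20  # hypothetical values
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)

# Warm up linearly from 10% of base_lr to base_lr, then decay with cosine annealing.
warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

lrs = []
for epoch in range(total_epochs):
    lrs.append(optimizer.param_groups[0]["lr"])  # lr used during this epoch
    # ... one training epoch would run here ...
    optimizer.step()   # called before scheduler.step(), as PyTorch expects
    scheduler.step()

print([round(lr, 4) for lr in lrs])
```

The recorded list rises during the first warmup_epochs entries, reaches base_lr at the milestone, and decays afterwards.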

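Pitfalls of distributed multi-GPU training with DistributedDataParallel

I spent the last few days looking into multi-GPU training. I expected it to be easy, but there were many pitfalls that I had to work through one by one. Distributed training generally comes in two flavors: a single machine with multiple GPUs, and multiple machines with multiple GPUs.

There are two main ways to implement it:

1. DataParallel: works in a parameter-server style, with one GPU acting as the reducer. It is extremely simple to use: one line of code.

Because DataParallel follows the parameter-server approach, load imbalance is a serious problem. With larger models (such as bert-large), the reducer GPU can use 3-4 GB more memory than the others.

2. DistributedDataParallel: the newer DDP, which is the officially recommended option. It uses the all-reduce algorithm. It was designed mainly for multi-machine, multi-GPU training, but it also works on a single machine.

Why use distributed training?

It can use multiple GPUs, so training runs faster overall.

It allows a larger batch size.

Some distributed setups achieve better results.

The rest of this article covers:

Single machine, multiple GPUs with DataParallel (most common, simplest)

Single machine, multiple GPUs with DistributedDataParallel (more advanced), and multiple machines, multiple GPUs with DistributedDataParallel (most advanced)

How to launch training

Saving and loading models

Notes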

I. Single machine, multiple GPUs (DataParallel)

import torch
from torch.nn import DataParallel

device = torch.device("cuda")
# or: device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = MyModel()  # your own model
model = model.to(device)
model = DataParallel(model)
# or: model = DataParallel(model, device_ids=[0, 1, 2, 3])

This is simple: you only need to add one line, model = DataParallel(model).

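II. Multiple machines or a single machine, multiple GPUs (DistributedDataParallel)

Read the notes in section V before changing your code, to avoid puzzling bugs. The training code is modified as follows.

opt.local_rank must be parsed from the command line earlier in the script; see the notes in section V.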

    import torch
    import torch.distributed as dist
    from torch.utils.data.distributed import DistributedSampler

    # Initialize the process group; each process drives one GPU.
    dist_backend = 'nccl'
    print('args.local_rank: ', opt.local_rank)
    torch.cuda.set_device(opt.local_rank)
    dist.init_process_group(backend=dist_backend)

    device = torch.device('cuda', opt.local_rank)
    # Your own model. Move it to this process's GPU before wrapping it.
    model = yourModel().to(device)
    print("Let's use", dist.get_world_size(), "GPUs!")
    # Wrap the model so that gradients are all-reduced across processes.
    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[opt.local_rank],
                                                      output_device=opt.local_rank)

    # Your own dataset code.
    dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training, img_size=opt.img_size, normalized_labels=True)
    # The sampler needs the global rank, which equals local_rank only on a single node.
    datasampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())

    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=opt.batch_size,  # per process; the global batch is batch_size * world size
        shuffle=False,  # must stay False when a sampler is given; the sampler shuffles
        num_workers=opt.n_cpu,
        pin_memory=True,
        collate_fn=dataset.collate_fn,
        sampler=datasampler
    )  # just add the sampler argument to your existing DataLoader

.....

During training, move each batch to the GPU:
      imgs = imgs.to(device)
      targets = targets.to(device)

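III. How to launch training

1. DataParallel

Run the script as usual:

python3 train.py

2. DistributedDataParallel

Training must be started through torch.distributed.launch. The usual case is a single node:

CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 train.py

CUDA_VISIBLE_DEVICES selects which GPUs to use, and --nproc_per_node is the number of processes (one per GPU) to start on each node; normally you use as many as you have GPUs.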

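Multiple nodes:

python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 --master_addr=MASTER_IP --master_port=MASTER_PORT train.py
# Two nodes; this is the command for node 0. MASTER_IP and MASTER_PORT are placeholders for the address of node 0 and a free port on it. Run the same command on the other node with --node_rank=1.

If training starts successfully, each process prints its own startup messages, so you see one set of messages per GPU (the original post shows a screenshot here).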

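IV. Saving and loading models

Methods a and b below correspond to each other: if you save with a, load with a.

1. Saving

a. Save only the parameters:

torch.save(model.module.state_dict(), path)

b. Save the parameters together with the network:

torch.save(model.module, path)

2. Loading

a. Loading pretrained weights for multi-GPU training (load them into the bare model, before wrapping it in DistributedDataParallel):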

model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)

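Loading on a single GPU: you must tell torch.load which device to map the weights to. The checkpoint remembers the GPU it was saved from (0 or 1, for example); if that GPU does not exist on the current machine, loading fails with "RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device". Passing map_location="cuda:0" maps everything to GPU 0; adjust it to your own setup: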

model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        # map_location remaps tensors saved on another GPU to cuda:0
        model.load_state_dict(torch.load(opt.pretrained_weights, map_location="cuda:0"))
    else:
        model.load_darknet_weights(opt.pretrained_weights)

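b. Loading on a single GPU:

Here too you must specify the device to load the model onto.

# On PyTorch 2.6 and later, torch.load defaults to weights_only=True, so unpickling a full model also needs weights_only=False.
model = torch.load(opt.weights_path, map_location="cuda:0")

I have not yet managed to load a pretrained model for multi-GPU training with method b.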
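
Both pitfalls in this section (the "module." key prefix and the device mapping) can be reproduced on a CPU-only machine. The sketch below is illustrative: it uses nn.DataParallel as a stand-in wrapper (DistributedDataParallel stores the network under .module and prefixes the keys in the same way), a tiny placeholder network, and an in-memory buffer instead of a file. The helper name strip_module_prefix is hypothetical.

```python
import io

import torch
import torch.nn as nn

def strip_module_prefix(state_dict):
    """Remove the leading 'module.' that DataParallel/DDP wrappers add to keys."""
    prefix = "module."
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in state_dict.items()}

net = nn.Linear(3, 1)            # placeholder for your own model
wrapped = nn.DataParallel(net)   # stand-in for a DDP-wrapped model

# Saving the wrapper's own state_dict keeps the 'module.' prefix in every key.
buffer = io.BytesIO()
torch.save(wrapped.state_dict(), buffer)
buffer.seek(0)

# map_location remaps tensors saved on any GPU to a device that exists here.
loaded = torch.load(buffer, map_location="cpu")
print(list(loaded.keys()))

fresh = nn.Linear(3, 1)          # unwrapped model, as on a single-GPU machine
fresh.load_state_dict(strip_module_prefix(loaded))
```

Saving model.module.state_dict(), as in method a above, avoids the prefix in the first place; the helper is only needed for checkpoints that were saved from the wrapper itself.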

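V. Notes

1. Add .module after model

Once the network has been wrapped for parallel training and the model and its parameters have been moved to the GPU, any code that modifies a submodule or reads an attribute of the model must go through model.module; otherwise it raises an error. For example:

model.img_size  must become  model.module.img_size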
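
The reason is that the wrapper is itself an nn.Module that stores your network under the attribute module; custom attributes are not forwarded. A minimal sketch, using nn.DataParallel as a stand-in (DistributedDataParallel behaves the same way for attribute access) and a hypothetical TinyNet with an img_size attribute:

```python
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical model with a custom attribute, for illustration only."""
    def __init__(self):
        super().__init__()
        self.img_size = 416
        self.fc = nn.Linear(4, 2)

wrapped = nn.DataParallel(TinyNet())

try:
    wrapped.img_size            # the wrapper does not forward custom attributes
    found_on_wrapper = True
except AttributeError:
    found_on_wrapper = False

print(found_on_wrapper)
print(wrapped.module.img_size)  # reach through .module to the original model
```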

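2. .cuda() versus .to(device)

device is whatever you set yourself. If a .cuda() call fails, replace it with a move to the corresponding device:

model (for example: model.to(device))

input (for example: input = input.to(device); wrapping in Variable is no longer needed since PyTorch 0.4)

target (for example: target = target.to(device))

nn.CrossEntropyLoss() (for example: criterion = nn.CrossEntropyLoss().to(device))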

3. The args.local_rank argument

When training is started through torch.distributed.launch, the launcher passes a --local_rank argument to each process, so the training script must parse it. The global rank of the process is also available from torch.distributed.get_rank().

parser.add_argument("--local_rank", type=int, default=-1, help="local rank of this process, passed in by torch.distributed.launch")
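
How the local rank reaches the script depends on the launcher. As far as I know, torch.distributed.launch passes it as a command-line argument (spelled --local_rank in older releases and --local-rank from PyTorch 2.0), while the newer torchrun launcher sets the LOCAL_RANK environment variable instead. The sketch below accepts all three; the helper name parse_local_rank is hypothetical.

```python
import argparse
import os

def parse_local_rank(argv=None):
    """Read the local rank from the command line, falling back to the LOCAL_RANK env var."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_rank", "--local-rank",  # accept both spellings
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", -1)),
        help="local rank of this process on the current node (set by the launcher)",
    )
    # parse_known_args ignores the script's other options in this sketch.
    args, _ = parser.parse_known_args(argv)
    return args.local_rank

print(parse_local_rank(["--local_rank=1"]))
```

A result of -1 means the script was started without a distributed launcher, which is a convenient signal for falling back to single-GPU training.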

