pytorch中DataLoader()过程中遇到的一些问题

如下所示:

RuntimeError: stack expects each tensor to be equal size, but got [3, 60, 32] at entry 0 and [3, 54, 32] at entry 2

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.Resize((224)) ###

原因是

transforms.Resize() 的参数设置问题,改为如下设置就可以了

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.Resize((224,224)),

同理,val_dataset中也调整为transforms.Resize((224,224))。

补充:pytorch之dataloader深入剖析

- dataloader本质是一个可迭代对象,使用iter()访问,不能使用next()访问;

- 使用iter(dataloader)返回的是一个迭代器,然后可以使用next访问;

- 也可以使用`for inputs, labels in dataloaders`进行可迭代对象的访问;

- 一般我们实现一个datasets对象,传入到dataloader中;然后内部使用yeild返回每一次batch的数据;

① DataLoader本质上就是一个iterable(跟python的内置类型list等一样),并利用多进程来加速batch data的处理,使用yield来使用有限的内存 ​

② Queue的特点

当队列里面没有数据时: queue.get() 会阻塞, 阻塞的时候,其它进程/线程如果有queue.put() 操作,本线程/进程会被通知,然后就可以 get 成功。

当数据满了: queue.put() 会阻塞

③ DataLoader是一个高效,简洁,直观的网络输入数据结构,便于使用和扩展

输入数据PipeLine

pytorch 的数据加载到模型的操作顺序是这样的:

① 创建一个 Dataset 对象

② 创建一个 DataLoader 对象

③ 循环这个 DataLoader 对象,将img, label加载到模型中进行训练

dataset = MyDataset()
dataloader = DataLoader(dataset)
num_epoches = 100
for epoch in range(num_epoches):
for img, label in dataloader:
....

所以,作为直接对数据进入模型中的关键一步, DataLoader非常重要。

首先简单介绍一下DataLoader,它是PyTorch中数据读取的一个重要接口,该接口定义在dataloader.py中,只要是用PyTorch来训练模型基本都会用到该接口(除非用户重写…),该接口的目的:将自定义的Dataset根据batch size大小、是否shuffle等封装成一个Batch Size大小的Tensor,用于后面的训练。

官方对DataLoader的说明是:“数据加载由数据集和采样器组成,基于python的单、多进程的iterators来处理数据。”关于iterator和iterable的区别和概念请自行查阅,在实现中的差别就是iterators有__iter__和__next__方法,而iterable只有__iter__方法。

1.DataLoader

先介绍一下DataLoader(object)的参数:

dataset(Dataset): 传入的数据集

batch_size(int, optional): 每个batch有多少个样本

shuffle(bool, optional): 在每个epoch开始的时候,对数据进行重新排序

sampler(Sampler, optional): 自定义从数据集中取样本的策略,如果指定这个参数,那么shuffle必须为False

batch_sampler(Sampler, optional): 与sampler类似,但是一次只返回一个batch的indices(索引),需要注意的是,一旦指定了这个参数,那么batch_size,shuffle,sampler,drop_last就不能再制定了(互斥——Mutually exclusive)

num_workers (int, optional): 这个参数决定了有几个进程来处理data loading。0意味着所有的数据都会被load进主进程。(默认为0)

collate_fn (callable, optional): 将一个list的sample组成一个mini-batch的函数

pin_memory (bool, optional): 如果设置为True,那么data loader将会在返回它们之前,将tensors拷贝到CUDA中的固定内存(CUDA pinned memory)中.

drop_last (bool, optional): 如果设置为True:这个是对最后的未完成的batch来说的,比如你的batch_size设置为64,而一个epoch只有100个样本,那么训练的时候后面的36个就被扔掉了…

如果为False(默认),那么会继续正常执行,只是最后的batch_size会小一点。

timeout(numeric, optional): 如果是正数,表明等待从worker进程中收集一个batch等待的时间,若超出设定的时间还没有收集到,那就不收集这个内容了。这个numeric应总是大于等于0。默认为0

worker_init_fn (callable, optional): 每个worker初始化函数 If not None, this will be called on each

worker subprocess with the worker id (an int in [0, num_workers - 1]) as
input, after seeding and before data loading. (default: None)

- 首先dataloader初始化时得到datasets的采样list

class DataLoader(object):
    r"""
    Data loader. Combines a dataset and a sampler, and provides
    single- or multi-process iterators over the dataset.
    Arguments:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: 1).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: False).
        sampler (Sampler, optional): defines the strategy to draw samples from
            the dataset. If specified, ``shuffle`` must be False.
        batch_sampler (Sampler, optional): like sampler, but returns a batch of
            indices at a time. Mutually exclusive with batch_size, shuffle,
            sampler, and drop_last.
        num_workers (int, optional): how many subprocesses to use for data
            loading. 0 means that the data will be loaded in the main process.
            (default: 0)
        collate_fn (callable, optional): merges a list of samples to form a mini-batch.
        pin_memory (bool, optional): If ``True``, the data loader will copy tensors
            into CUDA pinned memory before returning them.
        drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
            if the dataset size is not divisible by the batch size. If ``False`` and
            the size of dataset is not divisible by the batch size, then the last batch
            will be smaller. (default: False)
        timeout (numeric, optional): if positive, the timeout value for collecting a batch
            from workers. Should always be non-negative. (default: 0)
        worker_init_fn (callable, optional): If not None, this will be called on each
            worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
            input, after seeding and before data loading. (default: None)
    .. note:: By default, each worker will have its PyTorch seed set to
              ``base_seed + worker_id``, where ``base_seed`` is a long generated
              by main process using its RNG. However, seeds for other libraies
              may be duplicated upon initializing workers (w.g., NumPy), causing
              each worker to return identical random numbers. (See
              :ref:`dataloader-workers-random-seed` section in FAQ.) You may
              use ``torch.initial_seed()`` to access the PyTorch seed for each
              worker in :attr:`worker_init_fn`, and use it to set other seeds
              before data loading.
    .. warning:: If ``spawn`` start method is used, :attr:`worker_init_fn` cannot be an
                 unpicklable object, e.g., a lambda function.
    """
    __initialized = False
    def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None,
                 num_workers=0, collate_fn=default_collate, pin_memory=False, drop_last=False,
                 timeout=0, worker_init_fn=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.collate_fn = collate_fn
        self.pin_memory = pin_memory
        self.drop_last = drop_last
        self.timeout = timeout
        self.worker_init_fn = worker_init_fn
        if timeout < 0:
            raise ValueError('timeout option should be non-negative')
        if batch_sampler is not None:
            if batch_size > 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')
            self.batch_size = None
            self.drop_last = None
        if sampler is not None and shuffle:
            raise ValueError('sampler option is mutually exclusive with '
                             'shuffle')
        if self.num_workers < 0:
            raise ValueError('num_workers option cannot be negative; '
                             'use num_workers=0 to disable multiprocessing.')
        if batch_sampler is None:
            if sampler is None:
                if shuffle:
                    sampler = RandomSampler(dataset)  //将list打乱
                else:
                    sampler = SequentialSampler(dataset)
            batch_sampler = BatchSampler(sampler, batch_size, drop_last)
        self.sampler = sampler
        self.batch_sampler = batch_sampler
        self.__initialized = True
    def __setattr__(self, attr, val):
        if self.__initialized and attr in ('batch_size', 'sampler', 'drop_last'):
            raise ValueError('{} attribute should not be set after {} is '
                             'initialized'.format(attr, self.__class__.__name__))
        super(DataLoader, self).__setattr__(attr, val)
    def __iter__(self):
        return _DataLoaderIter(self)
    def __len__(self):
        return len(self.batch_sampler)

其中:RandomSampler,BatchSampler已经得到了采用batch数据的index索引;yield batch机制已经在!!!

class RandomSampler(Sampler):
    r"""Samples elements randomly, without replacement.
    Arguments:
        data_source (Dataset): dataset to sample from
    """
    def __init__(self, data_source):
        self.data_source = data_source
    def __iter__(self):
        return iter(torch.randperm(len(self.data_source)).tolist())
    def __len__(self):
        return len(self.data_source)
class BatchSampler(Sampler):
    r"""Wraps another sampler to yield a mini-batch of indices.
    Args:
        sampler (Sampler): Base sampler.
        batch_size (int): Size of mini-batch.
        drop_last (bool): If ``True``, the sampler will drop the last batch if
            its size would be less than ``batch_size``
    Example:
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    """
    def __init__(self, sampler, batch_size, drop_last):
        if not isinstance(sampler, Sampler):
            raise ValueError("sampler should be an instance of "
                             "torch.utils.data.Sampler, but got sampler={}"
                             .format(sampler))
        if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
                batch_size <= 0:
            raise ValueError("batch_size should be a positive integeral value, "
                             "but got batch_size={}".format(batch_size))
        if not isinstance(drop_last, bool):
            raise ValueError("drop_last should be a boolean value, but got "
                             "drop_last={}".format(drop_last))
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last
    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch
    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size

- 其中 _DataLoaderIter(self)输入为一个dataloader对象;如果num_workers=0很好理解,num_workers!=0引入多线程机制,加速数据加载过程;

- 没有多线程时:batch = self.collate_fn([self.dataset[i] for i in indices])进行将index转化为data数据,返回(image,label);self.dataset[i]会调用datasets对象的

__getitem__()方法

- 多线程下,会为每个线程创建一个索引队列index_queues;共享一个worker_result_queue数据队列!在_worker_loop方法中加载数据;

class _DataLoaderIter(object):
    r"""Iterates once over the DataLoader's dataset, as specified by the sampler"""
    def __init__(self, loader):
        self.dataset = loader.dataset
        self.collate_fn = loader.collate_fn
        self.batch_sampler = loader.batch_sampler
        self.num_workers = loader.num_workers
        self.pin_memory = loader.pin_memory and torch.cuda.is_available()
        self.timeout = loader.timeout
        self.done_event = threading.Event()
        self.sample_iter = iter(self.batch_sampler)
        base_seed = torch.LongTensor(1).random_().item()
        if self.num_workers > 0:
            self.worker_init_fn = loader.worker_init_fn
            self.index_queues = [multiprocessing.Queue() for _ in range(self.num_workers)]
            self.worker_queue_idx = 0
            self.worker_result_queue = multiprocessing.SimpleQueue()
            self.batches_outstanding = 0
            self.worker_pids_set = False
            self.shutdown = False
            self.send_idx = 0
            self.rcvd_idx = 0
            self.reorder_dict = {}
            self.workers = [
                multiprocessing.Process(
                    target=_worker_loop,
                    args=(self.dataset, self.index_queues[i],
                          self.worker_result_queue, self.collate_fn, base_seed + i,
                          self.worker_init_fn, i))
                for i in range(self.num_workers)]
            if self.pin_memory or self.timeout > 0:
                self.data_queue = queue.Queue()
                if self.pin_memory:
                    maybe_device_id = torch.cuda.current_device()
                else:
                    # do not initialize cuda context if not necessary
                    maybe_device_id = None
                self.worker_manager_thread = threading.Thread(
                    target=_worker_manager_loop,
                    args=(self.worker_result_queue, self.data_queue, self.done_event, self.pin_memory,
                          maybe_device_id))
                self.worker_manager_thread.daemon = True
                self.worker_manager_thread.start()
            else:
                self.data_queue = self.worker_result_queue
            for w in self.workers:
                w.daemon = True  # ensure that the worker exits on process exit
                w.start()
            _update_worker_pids(id(self), tuple(w.pid for w in self.workers))
            _set_SIGCHLD_handler()
            self.worker_pids_set = True
            # prime the prefetch loop
            for _ in range(2 * self.num_workers):
                self._put_indices()
    def __len__(self):
        return len(self.batch_sampler)
    def _get_batch(self):
        if self.timeout > 0:
            try:
                return self.data_queue.get(timeout=self.timeout)
            except queue.Empty:
                raise RuntimeError('DataLoader timed out after {} seconds'.format(self.timeout))
        else:
            return self.data_queue.get()
    def __next__(self):
        if self.num_workers == 0:  # same-process loading
            indices = next(self.sample_iter)  # may raise StopIteration
            batch = self.collate_fn([self.dataset[i] for i in indices])
            if self.pin_memory:
                batch = pin_memory_batch(batch)
            return batch
        # check if the next sample has already been generated
        if self.rcvd_idx in self.reorder_dict:
            batch = self.reorder_dict.pop(self.rcvd_idx)
            return self._process_next_batch(batch)
        if self.batches_outstanding == 0:
            self._shutdown_workers()
            raise StopIteration
        while True:
            assert (not self.shutdown and self.batches_outstanding > 0)
            idx, batch = self._get_batch()
            self.batches_outstanding -= 1
            if idx != self.rcvd_idx:
                # store out-of-order samples
                self.reorder_dict[idx] = batch
                continue
            return self._process_next_batch(batch)
    next = __next__  # Python 2 compatibility
    def __iter__(self):
        return self
    def _put_indices(self):
        assert self.batches_outstanding < 2 * self.num_workers
        indices = next(self.sample_iter, None)
        if indices is None:
            return
        self.index_queues[self.worker_queue_idx].put((self.send_idx, indices))
        self.worker_queue_idx = (self.worker_queue_idx + 1) % self.num_workers
        self.batches_outstanding += 1
        self.send_idx += 1
    def _process_next_batch(self, batch):
        self.rcvd_idx += 1
        self._put_indices()
        if isinstance(batch, ExceptionWrapper):
            raise batch.exc_type(batch.exc_msg)
        return batch
def _worker_loop(dataset, index_queue, data_queue, collate_fn, seed, init_fn, worker_id):
    global _use_shared_memory
    _use_shared_memory = True
    # Intialize C side signal handlers for SIGBUS and SIGSEGV. Python signal
    # module's handlers are executed after Python returns from C low-level
    # handlers, likely when the same fatal signal happened again already.
    # https://docs.python.org/3/library/signal.html Sec. 18.8.1.1
    _set_worker_signal_handlers()
    torch.set_num_threads(1)
    random.seed(seed)
    torch.manual_seed(seed)
    if init_fn is not None:
        init_fn(worker_id)
    watchdog = ManagerWatchdog()
    while True:
        try:
            r = index_queue.get(timeout=MANAGER_STATUS_CHECK_INTERVAL)
        except queue.Empty:
            if watchdog.is_alive():
                continue
            else:
                break
        if r is None:
            break
        idx, batch_indices = r
        try:
            samples = collate_fn([dataset[i] for i in batch_indices])
        except Exception:
            data_queue.put((idx, ExceptionWrapper(sys.exc_info())))
        else:
            data_queue.put((idx, samples))
            del samples

- 需要对队列操作,缓存数据,使得加载提速!

以上为个人经验,希望能给大家一个参考,也希望大家多多支持我们。

(0)

相关推荐

  • Pytorch dataloader在加载最后一个batch时卡死的解决

    问题: 自己写了个dataloader,为了部署方便,用OpenCV的接口进行数据读取,而没有用PIL,代码大致如下: def __getitem__(self, idx): sample = self.samples[idx] img = cv2.imread(sample[0]) img = cv2.resize(img, tuple(self.input_size)) img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # if not self.val

  • 我对PyTorch dataloader里的shuffle=True的理解

    对shuffle=True的理解: 之前不了解shuffle的实际效果,假设有数据a,b,c,d,不知道batch_size=2后打乱,具体是如下哪一种情况: 1.先按顺序取batch,对batch内打乱,即先取a,b,a,b进行打乱: 2.先打乱,再取batch. 证明是第二种 shuffle (bool, optional): set to ``True`` to have the data reshuffled at every epoch (default: ``False``). if

  • pytorch锁死在dataloader(训练时卡死)

    1.问题描述 2.解决方案 (1)Dataloader里面不用cv2.imread进行读取图片,用cv2.imread还会带来一系列的不方便,比如不能结合torchvision进行数据增强,所以最好用PIL 里面的Image.open来读图片.(并不适用本例) (2)将DataLoader 里面的参变量num_workers设置为0,但会导致数据的读取很慢,拖慢整个模型的训练.(并不适用本例) (3)如果用了cv2.imread,不想改代码的,那就加两条语句,来关闭Opencv的多线程:cv2.

  • Pytorch 如何加速Dataloader提升数据读取速度

    在利用DL解决图像问题时,影响训练效率最大的有时候是GPU,有时候也可能是CPU和你的磁盘. 很多设计不当的任务,在训练神经网络的时候,大部分时间都是在从磁盘中读取数据,而不是做 Backpropagation . 这种症状的体现是使用 Nividia-smi 查看 GPU 使用率时,Memory-Usage 占用率很高,但是 GPU-Util 时常为 0% ,如下图所示: 如何解决这种问题呢? 在 Nvidia 提出的分布式框架 Apex 里面,我们在源码里面找到了一个简单的解决方案: htt

  • Pytorch数据读取之Dataset和DataLoader知识总结

    一.前言 确保安装 scikit-image numpy 二.Dataset 一个例子: # 导入需要的包 import torch import torch.utils.data.dataset as Dataset import numpy as np # 编造数据 Data = np.asarray([[1, 2], [3, 4],[5, 6], [7, 8]]) Label = np.asarray([[0], [1], [0], [2]]) # 数据[1,2],对应的标签是[0],数据

  • pytorch DataLoader的num_workers参数与设置大小详解

    Q:在给Dataloader设置worker数量(num_worker)时,到底设置多少合适?这个worker到底怎么工作的? train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4) 参数详解: 1.每次dataloader加载数据时:dataloader一次性创建num_worker个worker,(也可以说dataloader一次

  • pytorch中DataLoader()过程中遇到的一些问题

    如下所示: RuntimeError: stack expects each tensor to be equal size, but got [3, 60, 32] at entry 0 and [3, 54, 32] at entry 2 train_dataset = datasets.ImageFolder( traindir, transforms.Compose([ transforms.Resize((224)) ### 原因是 transforms.Resize() 的参数设置问

  • 解决pytorch GPU 计算过程中出现内存耗尽的问题

    Pytorch GPU运算过程中会出现:"cuda runtime error(2): out of memory"这样的错误.通常,这种错误是由于在循环中使用全局变量当做累加器,且累加梯度信息的缘故,用官方的说法就是:"accumulate history across your training loop".在默认情况下,开启梯度计算的Tensor变量是会在GPU保持他的历史数据的,所以在编程或者调试过程中应该尽力避免在循环中累加梯度信息. 下面举个栗子: 上代

  • Pytorch在dataloader类中设置shuffle的随机数种子方式

    如题:Pytorch在dataloader类中设置shuffle的随机数种子方式 虽然实验结果差别不大,但是有时候也悬殊两个百分点 想要复现实验结果 发现用到随机数的地方就是dataloader类中封装的shuffle属性 查了半天没有关于这个的设置,最后在设置随机数种子里面找到了答案 以下方法即可: def setup_seed(seed): torch.manual_seed(seed) torch.cuda.manual_seed_all(seed) np.random.seed(seed

  • Pytorch训练网络过程中loss突然变为0的解决方案

    问题 // loss 突然变成0 python train.py -b=8 INFO: Using device cpu INFO: Network: 1 input channels 7 output channels (classes) Bilinear upscaling INFO: Creating dataset with 868 examples INFO: Starting training: Epochs: 5 Batch size: 8 Learning rate: 0.001

  • 解决Pytorch训练过程中loss不下降的问题

    在使用Pytorch进行神经网络训练时,有时会遇到训练学习率不下降的问题.出现这种问题的可能原因有很多,包括学习率过小,数据没有进行Normalization等.不过除了这些常规的原因,还有一种难以发现的原因:在计算loss时数据维数不匹配. 下面是我的代码: loss_function = torch.nn.MSE_loss() optimizer.zero_grad() output = model(x_train) loss = loss_function(output, y_train)

  • 解决Pytorch在测试与训练过程中的验证结果不一致问题

    引言 今天在使用Pytorch导入此前保存的模型进行测试,在过程中发现输出的结果与验证结果差距甚大,经过排查后发现是forward与eval()顺序问题. 现象 此前的错误代码是 input_cpu = torch.ones((1, 2, 160, 160)) target_cpu =torch.ones((1, 2, 160, 160)) target_gpu, input_gpu = target_cpu.cuda(), input_cpu.cuda() model.set_input_2(

  • pytorch中dataloader 的sampler 参数详解

    目录 1. dataloader() 初始化函数 2. shuffle 与sample 之间的关系 3. sample 的定义方法 3.1 sampler 参数的使用 4. batch 生成过程 1. dataloader() 初始化函数 def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_mem

  • 浅析mmdetection在windows10系统环境中搭建过程

     mmdetection简介 目标检测工具箱mmdetection是香港中文大学-商汤科技联合实验室基于pytorch开源的检测库,它包含个丰富的目标检测和实例分割方法以及相关的组件和模块.现在mmdetection成了一个涵盖许多主流检测方法和模块的统一平台. mmdetection官方并不提供windows版本的稳定安装教程,网上有相关的教程但是并不支持最新的mmdetection版本,这边根据各位大佬的博客总结了以下这篇文章,帮助大家可以在windows10中使用. 环境版本要求: pyt

  • SpringMail使用过程中的报错解决办法

    SpringMail使用过程中的报错解决办法 1.Unable to locate provider for protocol: smtp –>缺少依赖造成的 <dependency> <groupId>javax.mail</groupId> <artifactId>mail</artifactId> <version>1.4</version> </dependency> <dependency

  • 浅谈PHP接入(第三方登录)QQ登录 OAuth2.0 过程中遇到的坑

    前言 绝大多数网站都集成了第三方登录,降低了注册门槛,增强了用户体验.最近看了看 QQ 互联上 QQ 登录的接口文档.接入 QQ 登录的一般流程呢,是这样的:先申请开发者 -> 然后创建应用(拿到一组 AppId 和 AppKey)-> 获取 access_token -> 获取 openid -> 调用 openApi 访问或修改用户信息. 然而,从申请个人开发者开始,坑就来了. 1. 申请(个人)开发者 QQ 互联中申请开发者信息的页面,一些重点太过简陋,缺失细节,比如身份证正

随机推荐