A clean solution for training Keras from multiple HDF5 files

When training Keras on a large dataset, the training set needs to be prepared ahead of time to speed up training.

Because of the way the HDF5 files are written here, all of the data has to be read into memory at once before it can be saved.

To work around this, I split the data into batches and store it across two or more HDF5 files.
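With the directory names used later in the code (data_path = "data_hdf5_half", one "h5_<N>" sub-directory per chunk), the resulting layout on disk looks roughly like this:

    data_hdf5_half/          # output root (data_path)
        h5_<N>/              # one sub-directory per chunk of labels
            train.hdf5       # datasets x_train, y_train
            val.hdf5         # datasets x_val, y_val
        h5_<N+1>/
            train.hdf5
            val.hdf5
        ...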

1. Read the images under each label directory and assign the label

def load_dataset(path_name, data_path):
    images = []
    labels = []
    train_images = []
    valid_images = []
    train_labels = []
    valid_labels = []
    counter = 0
    allpath = os.listdir(path_name)
    nb_classes = len(allpath)
    print("label_num: ", nb_classes)

    for child_dir in allpath:
        child_path = os.path.join(path_name, child_dir)
        for dir_image in os.listdir(child_path):
            if dir_image.endswith('.jpg'):
                img = cv2.imread(os.path.join(child_path, dir_image))
                image = misc.imresize(img, (IMAGE_SIZE, IMAGE_SIZE), interp='bilinear')
                #resized_img = cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE))
                images.append(image)
                labels.append(counter)

2. Split this label's data into training images (train images), validation images (val images), training labels (train labels) and validation labels (val labels).

def split_dataset(images, labels):

    train_images, valid_images, train_labels, valid_labels = train_test_split(
        images, labels, test_size=0.2, random_state=random.randint(0, 100))

    #print(train_images.shape[0], 'train samples')
    #print(valid_images.shape[0], 'valid samples')
    return train_images, valid_images, train_labels, valid_labels

3. Append the split data to the overall training images, validation images, training labels and validation labels.

Then clear the per-label image and label lists to save memory. If you were to read the images and labels of many labels at once and split them afterwards, it would take more than twice the memory of doing it label by label as above.

    # this block runs once per label directory, still inside the child_dir loop
    images = np.array(images)
    t_images, v_images, t_labels, v_labels = split_dataset(images, labels)
    for i in range(len(t_images)):
        train_images.append(t_images[i])
        train_labels.append(t_labels[i])
    for j in range(len(v_images)):
        valid_images.append(v_images[j])
        valid_labels.append(v_labels[j])
    if counter % 50 == 49:
        print(counter + 1, "is read to the memory!")

    images = []
    labels = []
    counter = counter + 1

    print("train_images num: ", len(train_images), " ", "valid_images num: ", len(valid_images))

4. Keep checking the counter until the label chosen as the split point is reached, then start writing. Before writing, shuffle the images and their labels so the model trains better.

    # still inside the child_dir loop: once the threshold label is reached, write out the accumulated data
    if (counter % 4316 == 4315) or (counter == nb_classes - 1):

        print("start write images and labels data...")
        num = counter // 5000
        dirs = data_path + "/" + "h5_" + str(num - 1)
        if not os.path.exists(dirs):
            os.makedirs(dirs)
        data2h5(dirs, train_images, valid_images, train_labels, valid_labels)

Shuffle the images and labels in the same order and write them to HDF5:

def data2h5(dirs_path, train_images, valid_images, train_labels, valid_labels):

    TRAIN_HDF5 = dirs_path + '/' + "train.hdf5"
    VAL_HDF5 = dirs_path + '/' + "val.hdf5"

    # shuffle images and labels with the same permutation
    state1 = np.random.get_state()
    np.random.shuffle(train_images)
    np.random.set_state(state1)
    np.random.shuffle(train_labels)

    state2 = np.random.get_state()
    np.random.shuffle(valid_images)
    np.random.set_state(state2)
    np.random.shuffle(valid_labels)

    datasets = [
        ("train", train_images, train_labels, TRAIN_HDF5),
        ("val", valid_images, valid_labels, VAL_HDF5)]

    for (dType, images, labels, outputPath) in datasets:
        # create the HDF5 file and write the datasets
        f = h5py.File(outputPath, "w")
        f.create_dataset("x_" + dType, data=images)
        f.create_dataset("y_" + dType, data=labels)
        #f.create_dataset("x_"+dType, data=images, compression="gzip", compression_opts=9)
        #f.create_dataset("y_"+dType, data=labels, compression="gzip", compression_opts=9)
        f.close()
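The get_state/set_state trick works because restoring the RNG state makes the second shuffle use exactly the same permutation as the first. A tiny standalone check (not from the original post) that image/label pairs survive the shuffle:

    import numpy as np

    images = np.arange(10)           # stand-ins for ten images
    labels = np.arange(10)           # their matching labels

    state = np.random.get_state()    # remember the RNG state
    np.random.shuffle(images)        # shuffle the images
    np.random.set_state(state)       # restore the state
    np.random.shuffle(labels)        # shuffle the labels with the same permutation

    assert (images == labels).all()  # every image still lines up with its label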

5. Check that all the files can be read back

def read_dataset(dirs):

    files = os.listdir(dirs)
    print(files)
    for file in files:
        path = dirs + '/' + file
        dataset = h5py.File(path, "r")
        file = file.split('.')
        set_x_orig = dataset["x_" + file[0]].shape[0]
        set_y_orig = dataset["y_" + file[0]].shape[0]

        print(set_x_orig)
        print(set_y_orig)

6. During training, read the data with a generator

    def generator(self, datagen, mode):

        passes = np.inf
        aug = ImageDataGenerator(
            featurewise_center=False,
            samplewise_center=False,
            featurewise_std_normalization=False,
            samplewise_std_normalization=False,
            zca_whitening=False,
            rotation_range=20,
            width_shift_range=0.2,
            height_shift_range=0.2,
            horizontal_flip=True,
            vertical_flip=False)

        epochs = 0
        # loop over the data indefinitely by default
        while epochs < passes:
            # walk through every HDF5 directory
            file_dir = os.listdir(self.data_path)
            for file in file_dir:
                #print(file)
                file_path = os.path.join(self.data_path, file)
                TRAIN_HDF5 = file_path + "/train.hdf5"
                VAL_HDF5 = file_path + "/val.hdf5"
                #TEST_HDF5 = file_path + "/test.hdf5"

                db_t = h5py.File(TRAIN_HDF5, "r")
                numImages_t = db_t['y_train'].shape[0]
                db_v = h5py.File(VAL_HDF5, "r")
                numImages_v = db_v['y_val'].shape[0]

                if mode == "train":
                    for i in np.arange(0, numImages_t, self.BS):

                        images = db_t['x_train'][i: i + self.BS]
                        labels = db_t['y_train'][i: i + self.BS]

                        if K.image_data_format() == 'channels_first':
                            images = images.reshape(images.shape[0], 3, IMAGE_SIZE, IMAGE_SIZE)
                        else:
                            images = images.reshape(images.shape[0], IMAGE_SIZE, IMAGE_SIZE, 3)

                        images = images.astype('float32')
                        images = images / 255

                        if datagen:
                            (images, labels) = next(aug.flow(images, labels, batch_size=self.BS))

                        # one-hot encode the labels
                        if self.binarize:
                            labels = np_utils.to_categorical(labels, self.classes)

                        yield ({'input_1': images}, {'softmax': labels})

                elif mode == "val":
                    for i in np.arange(0, numImages_v, self.BS):
                        images = db_v['x_val'][i: i + self.BS]
                        labels = db_v['y_val'][i: i + self.BS]

                        if K.image_data_format() == 'channels_first':
                            images = images.reshape(images.shape[0], 3, IMAGE_SIZE, IMAGE_SIZE)
                        else:
                            images = images.reshape(images.shape[0], IMAGE_SIZE, IMAGE_SIZE, 3)

                        images = images.astype('float32')
                        images = images / 255

                        if datagen:
                            (images, labels) = next(aug.flow(images, labels, batch_size=self.BS))

                        # one-hot encode the labels
                        if self.binarize:
                            labels = np_utils.to_categorical(labels, self.classes)

                        yield ({'input_1': images}, {'softmax': labels})

            epochs += 1
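The post stops at the generator and never shows the training call itself. Below is only a rough sketch, not from the original article: the count_samples helper, the loader object and the compiled model are assumptions, and the 'input_1'/'softmax' dictionary keys only work if the model's input and output layers really carry those names, as the yield statements above imply.

    import os
    import h5py

    def count_samples(data_path, split):
        # sum the sample counts over every <data_path>/h5_*/<split>.hdf5 file
        total = 0
        for h5_dir in os.listdir(data_path):
            with h5py.File(os.path.join(data_path, h5_dir, split + ".hdf5"), "r") as db:
                total += db["y_" + split].shape[0]
        return total

    # `loader` is assumed to be an instance of the class that owns generator(),
    # with data_path, BS, classes and binarize already set; `model` is a compiled Keras model.
    train_steps = count_samples(loader.data_path, "train") // loader.BS
    val_steps = count_samples(loader.data_path, "val") // loader.BS

    model.fit_generator(
        loader.generator(datagen=True, mode="train"),
        steps_per_epoch=train_steps,
        validation_data=loader.generator(datagen=False, mode="val"),
        validation_steps=val_steps,
        epochs=10)

Note that steps_per_epoch computed this way is only an approximation of one full pass over all the chunks, since each HDF5 file contributes a final partial batch.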

7. And with that, the job is done.

The complete code:

# -*- coding: utf-8 -*-
"""
Created on Mon Feb 12 20:46:12 2018

@author: william_yue
"""
import os
import numpy as np
import cv2
import random
from scipy import misc
import h5py
from sklearn.model_selection import train_test_split
from keras import backend as K
K.clear_session()
from keras.utils import np_utils

IMAGE_SIZE = 128

# split the dataset for cross-validation and do the related preprocessing
def split_dataset(images, labels):
    # train_test_split() from sklearn's model_selection module splits the data:
    # 20% is held out for validation, 80% is used for training
    train_images, valid_images, train_labels, valid_labels = train_test_split(
        images, labels, test_size=0.2, random_state=random.randint(0, 100))
    return train_images, valid_images, train_labels, valid_labels

def data2h5(dirs_path, train_images, valid_images, train_labels, valid_labels):

#def data2h5(dirs_path, train_images, valid_images, test_images, train_labels, valid_labels, test_labels):

    TRAIN_HDF5 = dirs_path + '/' + "train.hdf5"
    VAL_HDF5 = dirs_path + '/' + "val.hdf5"

    # shuffle the training and validation sets, keeping images and labels in the same order
    state1 = np.random.get_state()
    np.random.shuffle(train_images)
    np.random.set_state(state1)
    np.random.shuffle(train_labels)

    state2 = np.random.get_state()
    np.random.shuffle(valid_images)
    np.random.set_state(state2)
    np.random.shuffle(valid_labels)

    datasets = [
        ("train", train_images, train_labels, TRAIN_HDF5),
        ("val", valid_images, valid_labels, VAL_HDF5)]

    for (dType, images, labels, outputPath) in datasets:
        # create the HDF5 file and write the datasets
        f = h5py.File(outputPath, "w")
        f.create_dataset("x_" + dType, data=images)
        f.create_dataset("y_" + dType, data=labels)
        #f.create_dataset("x_"+dType, data=images, compression="gzip", compression_opts=9)
        #f.create_dataset("y_"+dType, data=labels, compression="gzip", compression_opts=9)
        f.close()

def read_dataset(dirs):
    files = os.listdir(dirs)
    print(files)
    for file in files:
        path = dirs + '/' + file
        file_read = os.listdir(path)
        for i in file_read:
            path_read = os.path.join(path, i)
            dataset = h5py.File(path_read, "r")
            i = i.split('.')
            set_x_orig = dataset["x_" + i[0]].shape[0]
            set_y_orig = dataset["y_" + i[0]].shape[0]
            print(set_x_orig)
            print(set_y_orig)

# loop over every label directory and read all of its images
def load_dataset(path_name, data_path):
    images = []
    labels = []
    train_images = []
    valid_images = []
    train_labels = []
    valid_labels = []
    counter = 0
    allpath = os.listdir(path_name)
    nb_classes = len(allpath)
    print("label_num: ", nb_classes)

    for child_dir in allpath:
        child_path = os.path.join(path_name, child_dir)
        for dir_image in os.listdir(child_path):
            if dir_image.endswith('.jpg'):
                img = cv2.imread(os.path.join(child_path, dir_image))
                image = misc.imresize(img, (IMAGE_SIZE, IMAGE_SIZE), interp='bilinear')
                #resized_img = cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE))
                images.append(image)
                labels.append(counter)

        images = np.array(images)
        t_images, v_images, t_labels, v_labels = split_dataset(images, labels)
        for i in range(len(t_images)):
            train_images.append(t_images[i])
            train_labels.append(t_labels[i])
        for j in range(len(v_images)):
            valid_images.append(v_images[j])
            valid_labels.append(v_labels[j])
        if counter % 50 == 49:
            print(counter + 1, "is read to the memory!")

        images = []
        labels = []

        if (counter % 4316 == 4315) or (counter == nb_classes - 1):
            print("train_images num: ", len(train_images), "  ", "valid_images num: ", len(valid_images))
            print("start write images and labels data...")
            num = counter // 5000
            dirs = data_path + "/" + "h5_" + str(num - 1)
            if not os.path.exists(dirs):
                os.makedirs(dirs)
            data2h5(dirs, train_images, valid_images, train_labels, valid_labels)
            #read_dataset(dirs)
            print("File HDF5_%d " % num, " is done!")
            train_images = []
            valid_images = []
            train_labels = []
            valid_labels = []
        counter = counter + 1
    print("All File HDF5 done!")
    read_dataset(data_path)

# read the training-data directory and return its sub-directory names as a list
def read_name_list(path_name):
    name_list = []
    for child_dir in os.listdir(path_name):
        name_list.append(child_dir)
    return name_list

if __name__ == '__main__':
    path = "data"
    data_path = "data_hdf5_half"
    if not os.path.exists(data_path):
        os.makedirs(data_path)
    load_dataset(path, data_path)

That is everything in this post on training Keras from multiple HDF5 files. I hope it serves as a useful reference, and thank you for your support.
