Storing and Loading Large Amounts of Data in a Single HDF5 File

Author: ydlstartx | Published 2018-05-11 12:02

    The original article is titled "Saving and loading a large number of images (data) into a single HDF5 file". It explains how to store a large number of images in a single HDF5 file and how to read the data back in batches to train a network. The original uses both the h5py and tables Python modules; only h5py is used here. In addition, the original handles color images in two layouts, one for TensorFlow and one for Theano; only the TensorFlow case is covered here, i.e. images are represented as [batch, image_height, image_width, image_depth].

    Introduction

    In deep learning, networks are usually trained on huge amounts of data or images; ImageNet, for example, contains millions of images. With a dataset this large, reading each image individually from disk, preprocessing it, and only then feeding it to the network for training, validation, or testing is far too slow. It is more efficient to place all the images in a single file and process that. Several data models and libraries support this, such as HDF5 and TFRecord. This article shows how to store a large number of images in a single HDF5 file and how to read them back batch by batch. The method works regardless of the data size, i.e. whether or not the data fits in memory. HDF5 also provides tools for managing, manipulating, visualizing, compressing, and storing the data.
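
    For illustration, here is a short sketch of those storage options; this snippet, the file name 'example.hdf5', and the chunk size are assumptions for illustration, not part of the original article. h5py can create chunked, gzip-compressed datasets, which helps when the file is larger than memory:

    import numpy as np
    import h5py
    
    # a hypothetical example file; chunks and compression are optional
    with h5py.File('example.hdf5', 'w') as f:
        dset = f.create_dataset(
            "images",
            shape=(1000, 224, 224, 3),
            dtype=np.uint8,
            chunks=(64, 224, 224, 3),   # stored and read in 64-image chunks
            compression="gzip",
        )
        # write a placeholder image into the first slot
        dset[0, ...] = np.zeros((224, 224, 3), dtype=np.uint8)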

    This article uses the training set from the Dogs vs. Cats competition on Kaggle.

    List images and their labels

    First, a word about the dataset. The Dogs vs. Cats training set contains 25,000 images, half cats and half dogs, with file names such as dog.5199.jpg or cat.123.jpg. The code below collects all the file names (without reading the image contents) and assigns each image a label: label=0 for a cat, label=1 for a dog. It then shuffles the dataset and splits it into training (60%), validation (20%), and test (20%) sets.

    """
    List images and label them
    """
    from random import shuffle
    import glob
    # shuffle the addresses before saving
    shuffle_data = True  
    # address to where you want to save the hdf5 file
    hdf5_path = 'Cat vs Dog/dataset.hdf5'  
    # path to the training images
    cat_dog_train_path = 'Cat vs Dog/train/*.jpg'
    
    # read addresses and labels from the 'train' folder
    # glob returns a list of all file names matching the pattern
    # each element of addrs is a string like 'Cat vs Dog/train/dog.5199.jpg'
    addrs = glob.glob(cat_dog_train_path)
    # assign a label to each file name
    labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog
    
    # to shuffle data
    if shuffle_data:
        c = list(zip(addrs, labels))
        shuffle(c)
        addrs, labels = zip(*c)
        
    # Divide the data into 60% train, 20% validation, and 20% test
    train_addrs = addrs[0:int(0.6*len(addrs))]
    train_labels = labels[0:int(0.6*len(labels))]
    
    val_addrs = addrs[int(0.6*len(addrs)):int(0.8*len(addrs))]
    val_labels = labels[int(0.6*len(addrs)):int(0.8*len(addrs))]
    
    test_addrs = addrs[int(0.8*len(addrs)):]
    test_labels = labels[int(0.8*len(labels)):]
    

    Create an HDF5 file

    Two libraries can handle the HDF5 format: h5py and tables. The original article covers both; only h5py is used here. The first step is to create the HDF5 file. To store the images, we define an array of shape (number of data, image_height, image_width, image_depth) for each of the training, validation, and test sets. Likewise, for the labels we define an array of shape (number of data,) for each set. Finally, we compute the mean of every pixel over the training set and store it in an array of shape (image_height, image_width, image_depth). Pay close attention to the dtype when creating these arrays; pixel values lie in [0, 255], so the image arrays below use np.uint8.
    In h5py, arrays are created with create_dataset. The dtype can be specified directly with a numpy type, and the shape must be given when the array is created. The code is as follows:

    """
    Creating a HDF5 file
    """
    import numpy as np
    import h5py
    
    train_shape = (len(train_addrs), 224, 224, 3)
    val_shape = (len(val_addrs), 224, 224, 3)
    test_shape = (len(test_addrs), 224, 224, 3)
    
    # open an hdf5 file and create the datasets
    hdf5_file = h5py.File(hdf5_path, mode='w')
    
    # create the four arrays below; no data is written yet
    # pixel values lie in [0, 255], so the image arrays use np.uint8
    hdf5_file.create_dataset("train_img", train_shape, np.uint8)
    hdf5_file.create_dataset("val_img", val_shape, np.uint8)
    hdf5_file.create_dataset("test_img", test_shape, np.uint8)
    
    hdf5_file.create_dataset("train_mean", train_shape[1:], np.float32)
    
    # create the label arrays and fill them
    hdf5_file.create_dataset("train_labels", (len(train_addrs),), np.int8)
    # the trailing [...] index is required when assigning to a whole dataset
    hdf5_file["train_labels"][...] = train_labels
    hdf5_file.create_dataset("val_labels", (len(val_addrs),), np.int8)
    hdf5_file["val_labels"][...] = val_labels
    hdf5_file.create_dataset("test_labels", (len(test_addrs),), np.int8)
    hdf5_file["test_labels"][...] = test_labels
    

    Next, the images are read one by one, preprocessed, and written into hdf5_file. The code consists of three loops, which handle the training, validation, and test sets respectively. The only preprocessing applied here is a resize with OpenCV.

    """
    Load images and save them
    """
    import cv2
    # a numpy array to save the mean of the images
    # shape := (image_height, image_width, image_depth)
    mean = np.zeros(train_shape[1:], np.float32)
    
    # loop over train addresses
    for i in range(len(train_addrs)):
        # print how many images are saved every 1000 images
        if i % 1000 == 0 and i > 1:
            print('Train data: {}/{}'.format(i, len(train_addrs)))
    
        # read an image and resize to (224, 224)
        # cv2 load images as BGR, convert it to RGB
        addr = train_addrs[i]
        img = cv2.imread(addr)
        img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
        # add any image pre-processing here
    
        # save the image and calculate the mean so far
        # img.shape := (224, 224, 3)
        hdf5_file["train_img"][i, ...] = img
        mean += img / float(len(train_labels))
    
    # loop over validation addresses
    for i in range(len(val_addrs)):
        # print how many images are saved every 1000 images
        if i % 1000 == 0 and i > 1:
            print('Validation data: {}/{}'.format(i, len(val_addrs)))
    
        # read an image and resize to (224, 224)
        # cv2 load images as BGR, convert it to RGB
        addr = val_addrs[i]
        img = cv2.imread(addr)
        img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
        # add any image pre-processing here
    
        # save the image
        hdf5_file["val_img"][i, ...] = img
    
    # loop over test addresses
    for i in range(len(test_addrs)):
        # print how many images are saved every 1000 images
        if i % 1000 == 0 and i > 1:
            print('Test data: {}/{}'.format(i, len(test_addrs)))
    
        # read an image and resize to (224, 224)
        # cv2 load images as BGR, convert it to RGB
        addr = test_addrs[i]
        img = cv2.imread(addr)
        img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
        # add any image pre-processing here
    
        # save the image
        hdf5_file["test_img"][i, ...] = img
    
    # save the mean and close the hdf5 file
    hdf5_file["train_mean"][...] = mean
    hdf5_file.close()
    

    Read the HDF5 file

    Now let's check that the data has been stored correctly in the HDF5 file. We load the images in batches of arbitrary size and display the first image of the first five batches. The code defines a variable subtract_mean that indicates whether the training-set mean should be subtracted before an image is displayed. In h5py, the file's datasets can be accessed by name, like a dictionary (hdf5_file["array_name"]); as with numpy arrays, their size is available through .shape.

    """
    Open the HDF5 for read
    """
    hdf5_path = 'Cat vs Dog/dataset.hdf5'
    subtract_mean = False
    # open the hdf5 file
    hdf5_file = h5py.File(hdf5_path, "r")
    # subtract the training mean
    if subtract_mean:
        # read the full (224, 224, 3) mean and add a batch axis for broadcasting
        mm = hdf5_file["train_mean"][...]
        mm = mm[np.newaxis, ...]
    # Total number of samples
    data_num = hdf5_file["train_img"].shape[0]
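
    As a quick sanity check (this snippet is an addition, not part of the original code), an open h5py file can be iterated like a dictionary to list the stored dataset names, shapes, and dtypes:

    for name in hdf5_file.keys():
        print(name, hdf5_file[name].shape, hdf5_file[name].dtype)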
    

    Next, read the images batch by batch.

    batch_size = 64
    nb_class = 2
    
    from random import shuffle
    from math import ceil
    import matplotlib.pyplot as plt
    # create list of batches to shuffle the data
    batches_list = list(range(int(ceil(float(data_num) / batch_size))))
    # shuffle the batch order
    shuffle(batches_list)
    # loop over batches
    for n, i in enumerate(batches_list):
        i_s = i * batch_size  # index of the first image in this batch
        i_e = min([(i + 1) * batch_size, data_num])  # index of the last image in this batch
        # read batch images and remove training mean
        images = hdf5_file["train_img"][i_s:i_e, ...]
        if subtract_mean:
            # the stored images are np.uint8 while the mean is np.float32,
            # so cast to float before subtracting to avoid a TypeError
            images = images.astype(np.float32) - mm
        # read labels and convert to one hot encoding
        labels = hdf5_file["train_labels"][i_s:i_e]
        # size the one-hot array by the actual batch (the last one may be smaller)
        labels_one_hot = np.zeros((len(labels), nb_class))
        labels_one_hot[np.arange(len(labels)), labels] = 1
        print(n+1, '/', len(batches_list))
        print(labels[0], labels_one_hot[0, :])
        plt.imshow(images[0])
        plt.show()
        if n == 4:  # stop after 5 batches
            break
    hdf5_file.close()
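
    The same batched access pattern extends naturally to feeding a training loop. Below is a minimal generator sketch; it is an addition to the original post, and the function name hdf5_batch_generator as well as the division by 255.0 for normalization are assumptions rather than anything the article prescribes:

    """
    A minimal batch-generator sketch (assumed helper, not from the original)
    """
    import numpy as np
    import h5py
    from random import shuffle
    from math import ceil
    
    def hdf5_batch_generator(hdf5_path, batch_size=64, nb_class=2):
        # yields (images, one-hot labels) batches from the HDF5 file forever
        with h5py.File(hdf5_path, "r") as f:
            data_num = f["train_img"].shape[0]
            batches = list(range(int(ceil(float(data_num) / batch_size))))
            while True:
                shuffle(batches)
                for i in batches:
                    i_s = i * batch_size
                    i_e = min((i + 1) * batch_size, data_num)
                    # scale to [0, 1]; an assumed choice, adjust as needed
                    images = f["train_img"][i_s:i_e, ...].astype(np.float32) / 255.0
                    labels = f["train_labels"][i_s:i_e]
                    one_hot = np.zeros((len(labels), nb_class), np.float32)
                    one_hot[np.arange(len(labels)), labels] = 1
                    yield images, one_hot
    
    # usage: gen = hdf5_batch_generator('Cat vs Dog/dataset.hdf5')
    #        images, labels = next(gen)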
    

    Summary

    """
    写模式打开hdf5文件:
    """
    hdf5_file = h5py.File(hdf5_path, mode='w')
    
    """
    建立矩阵
    """
    hdf5_file.create_dataset("train_img", train_shape, np.int8)
    
    """
    对矩阵赋值
    """
    hdf5_file["train_img"][i, ...] = img
    hdf5_file["train_mean"][...] = mean
    
    """
    读模式打开hdf5文件
    """
    hdf5_file = h5py.File(hdf5_path, "r")
    
    """
    读取矩阵中内容
    """
    images = hdf5_file["train_img"][i, ...]
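
    One further idiom worth noting (an addition, not in the original summary): opening the file with a context manager closes it automatically, even if an exception occurs.

    """
    Open with a context manager (the file is closed automatically)
    """
    with h5py.File(hdf5_path, "r") as hdf5_file:
        images = hdf5_file["train_img"][0:10, ...]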
    
    
