Storing and Loading Large Amounts of Data in a Single HDF5 File

Author: ydlstartx | Published 2018-05-11 12:02

    The original article is titled "Saving and loading a large number of images (data) into a single HDF5 file". It explains how to store a large number of images in a single HDF5 file and how to read the data back in batches to train a network. The original uses both the h5py and tables Python modules; only h5py is used here. In addition, the original handles color images in two layouts, one for TensorFlow and one for Theano; only the TensorFlow case is covered here, i.e. images are represented as [batch, image_height, image_width, image_depth].

    Introduction

    In deep learning, networks are usually trained on huge amounts of data or images; ImageNet, for example, contains millions of images. With a dataset this large, reading each image individually from disk, preprocessing it, and only then feeding it to the network for training, validation, or testing is far too slow. It is more efficient to place all the images in a single file and process that. Several data models and libraries support this, such as HDF5 and TFRecord. This article shows how to store a large number of images in a single HDF5 file and how to read them back batch by batch. The method works regardless of the data size, i.e. whether or not the data fits in memory. HDF5 also provides tools for managing, manipulating, visualizing, compressing, and storing the data.
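
    For illustration, here is a short sketch of those storage options; this snippet, the file name 'example.hdf5', and the chunk size are assumptions for illustration, not part of the original article. h5py can create chunked, gzip-compressed datasets, which helps when the file is larger than memory:

    import numpy as np
    import h5py
    
    # a hypothetical example file; chunks and compression are optional
    with h5py.File('example.hdf5', 'w') as f:
        dset = f.create_dataset(
            "images",
            shape=(1000, 224, 224, 3),
            dtype=np.uint8,
            chunks=(64, 224, 224, 3),   # stored and read in 64-image chunks
            compression="gzip",
        )
        # write a placeholder image into the first slot
        dset[0, ...] = np.zeros((224, 224, 3), dtype=np.uint8)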

    This article uses the training set from the Dogs vs. Cats competition on Kaggle.

    List images and their labels

    First, a word about the dataset. The Dogs vs. Cats training set contains 25,000 images, half cats and half dogs, with file names such as dog.5199.jpg or cat.123.jpg. The code below collects all the file names (without reading the image contents) and assigns each image a label: label=0 for a cat, label=1 for a dog. It then shuffles the dataset and splits it into training (60%), validation (20%), and test (20%) sets.

    """
    List images and label them
    """
    from random import shuffle
    import glob
    # shuffle the addresses before saving
    shuffle_data = True  
    # address to where you want to save the hdf5 file
    hdf5_path = 'Cat vs Dog/dataset.hdf5'  
    # path to the training images
    cat_dog_train_path = 'Cat vs Dog/train/*.jpg'
    
    # read addresses and labels from the 'train' folder
    # glob returns a list of all file names matching the pattern
    # each element of addrs is a string like 'Cat vs Dog/train/dog.5199.jpg'
    addrs = glob.glob(cat_dog_train_path)
    # assign a label to each file name
    labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog
    
    # to shuffle data
    if shuffle_data:
        c = list(zip(addrs, labels))
        shuffle(c)
        addrs, labels = zip(*c)
        
    # Divide the data into 60% train, 20% validation, and 20% test
    train_addrs = addrs[0:int(0.6*len(addrs))]
    train_labels = labels[0:int(0.6*len(labels))]
    
    val_addrs = addrs[int(0.6*len(addrs)):int(0.8*len(addrs))]
    val_labels = labels[int(0.6*len(addrs)):int(0.8*len(addrs))]
    
    test_addrs = addrs[int(0.8*len(addrs)):]
    test_labels = labels[int(0.8*len(labels)):]
    

    Create an HDF5 file

    Two libraries can handle the HDF5 format: h5py and tables. The original article covers both; only h5py is used here. The first step is to create the HDF5 file. To store the images, we define an array of shape (number of data, image_height, image_width, image_depth) for each of the training, validation, and test sets. Likewise, for the labels we define an array of shape (number of data,) for each set. Finally, we compute the mean of every pixel over the training set and store it in an array of shape (image_height, image_width, image_depth). Pay close attention to the dtype when creating these arrays; pixel values lie in [0, 255], so the image arrays below use np.uint8.
    In h5py, arrays are created with create_dataset. The dtype can be specified directly with a numpy type, and the shape must be given when the array is created. The code is as follows:

    """
    Creating a HDF5 file
    """
    import numpy as np
    import h5py
    
    train_shape = (len(train_addrs), 224, 224, 3)
    val_shape = (len(val_addrs), 224, 224, 3)
    test_shape = (len(test_addrs), 224, 224, 3)
    
    # open an hdf5 file and create the datasets
    hdf5_file = h5py.File(hdf5_path, mode='w')
    
    # create the four arrays below; no data is written yet
    # pixel values lie in [0, 255], so the image arrays use np.uint8
    hdf5_file.create_dataset("train_img", train_shape, np.uint8)
    hdf5_file.create_dataset("val_img", val_shape, np.uint8)
    hdf5_file.create_dataset("test_img", test_shape, np.uint8)
    
    hdf5_file.create_dataset("train_mean", train_shape[1:], np.float32)
    
    # create the label arrays and fill them
    hdf5_file.create_dataset("train_labels", (len(train_addrs),), np.int8)
    # the trailing [...] index is required when assigning to a whole dataset
    hdf5_file["train_labels"][...] = train_labels
    hdf5_file.create_dataset("val_labels", (len(val_addrs),), np.int8)
    hdf5_file["val_labels"][...] = val_labels
    hdf5_file.create_dataset("test_labels", (len(test_addrs),), np.int8)
    hdf5_file["test_labels"][...] = test_labels
    

    Next, the images are read one by one, preprocessed, and written into hdf5_file. The code consists of three loops, which handle the training, validation, and test sets respectively. The only preprocessing applied here is a resize with OpenCV.

    """
    Load images and save them
    """
    import cv2
    # a numpy array to save the mean of the images
    # shape := (image_height, image_width, image_depth)
    mean = np.zeros(train_shape[1:], np.float32)
    
    # loop over train addresses
    for i in range(len(train_addrs)):
        # print how many images are saved every 1000 images
        if i % 1000 == 0 and i > 1:
            print('Train data: {}/{}'.format(i, len(train_addrs)))
    
        # read an image and resize to (224, 224)
        # cv2 load images as BGR, convert it to RGB
        addr = train_addrs[i]
        img = cv2.imread(addr)
        img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
        # add any image pre-processing here
    
        # save the image and calculate the mean so far
        # img.shape := (224, 224, 3)
        hdf5_file["train_img"][i, ...] = img
        mean += img / float(len(train_labels))
    
    # loop over validation addresses
    for i in range(len(val_addrs)):
        # print how many images are saved every 1000 images
        if i % 1000 == 0 and i > 1:
            print('Validation data: {}/{}'.format(i, len(val_addrs)))
    
        # read an image and resize to (224, 224)
        # cv2 load images as BGR, convert it to RGB
        addr = val_addrs[i]
        img = cv2.imread(addr)
        img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
        # add any image pre-processing here
    
        # save the image
        hdf5_file["val_img"][i, ...] = img
    
    # loop over test addresses
    for i in range(len(test_addrs)):
        # print how many images are saved every 1000 images
        if i % 1000 == 0 and i > 1:
            print('Test data: {}/{}'.format(i, len(test_addrs)))
    
        # read an image and resize to (224, 224)
        # cv2 load images as BGR, convert it to RGB
        addr = test_addrs[i]
        img = cv2.imread(addr)
        img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
        # add any image pre-processing here
    
        # save the image
        hdf5_file["test_img"][i, ...] = img
    
    # save the mean and close the hdf5 file
    hdf5_file["train_mean"][...] = mean
    hdf5_file.close()
    

    Read the HDF5 file

    Now let's check that the data has been stored correctly in the HDF5 file. We load the images in batches of arbitrary size and display the first image of the first five batches. The code defines a variable subtract_mean that indicates whether the training-set mean should be subtracted before an image is displayed. In h5py, the file's datasets can be accessed by name, like a dictionary (hdf5_file["array_name"]); as with numpy arrays, their size is available through .shape.

    """
    Open the HDF5 for read
    """
    hdf5_path = 'Cat vs Dog/dataset.hdf5'
    subtract_mean = False
    # open the hdf5 file
    hdf5_file = h5py.File(hdf5_path, "r")
    # subtract the training mean
    if subtract_mean:
        # read the full (224, 224, 3) mean and add a batch axis for broadcasting
        mm = hdf5_file["train_mean"][...]
        mm = mm[np.newaxis, ...]
    # Total number of samples
    data_num = hdf5_file["train_img"].shape[0]
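
    As a quick sanity check (this snippet is an addition, not part of the original code), an open h5py file can be iterated like a dictionary to list the stored dataset names, shapes, and dtypes:

    for name in hdf5_file.keys():
        print(name, hdf5_file[name].shape, hdf5_file[name].dtype)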
    

    Next, read the images batch by batch.

    batch_size = 64
    nb_class = 2
    
    from random import shuffle
    from math import ceil
    import matplotlib.pyplot as plt
    # create list of batches to shuffle the data
    batches_list = list(range(int(ceil(float(data_num) / batch_size))))
    # shuffle the batch order
    shuffle(batches_list)
    # loop over batches
    for n, i in enumerate(batches_list):
        i_s = i * batch_size  # index of the first image in this batch
        i_e = min([(i + 1) * batch_size, data_num])  # index of the last image in this batch
        # read batch images and remove training mean
        images = hdf5_file["train_img"][i_s:i_e, ...]
        if subtract_mean:
            # the stored images are np.uint8 while the mean is np.float32,
            # so cast to float before subtracting to avoid a TypeError
            images = images.astype(np.float32) - mm
        # read labels and convert to one hot encoding
        labels = hdf5_file["train_labels"][i_s:i_e]
        # size the one-hot array by the actual batch (the last one may be smaller)
        labels_one_hot = np.zeros((len(labels), nb_class))
        labels_one_hot[np.arange(len(labels)), labels] = 1
        print(n+1, '/', len(batches_list))
        print(labels[0], labels_one_hot[0, :])
        plt.imshow(images[0])
        plt.show()
        if n == 4:  # stop after 5 batches
            break
    hdf5_file.close()
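
    The same batched access pattern extends naturally to feeding a training loop. Below is a minimal generator sketch; it is an addition to the original post, and the function name hdf5_batch_generator as well as the division by 255.0 for normalization are assumptions rather than anything the article prescribes:

    """
    A minimal batch-generator sketch (assumed helper, not from the original)
    """
    import numpy as np
    import h5py
    from random import shuffle
    from math import ceil
    
    def hdf5_batch_generator(hdf5_path, batch_size=64, nb_class=2):
        # yields (images, one-hot labels) batches from the HDF5 file forever
        with h5py.File(hdf5_path, "r") as f:
            data_num = f["train_img"].shape[0]
            batches = list(range(int(ceil(float(data_num) / batch_size))))
            while True:
                shuffle(batches)
                for i in batches:
                    i_s = i * batch_size
                    i_e = min((i + 1) * batch_size, data_num)
                    # scale to [0, 1]; an assumed choice, adjust as needed
                    images = f["train_img"][i_s:i_e, ...].astype(np.float32) / 255.0
                    labels = f["train_labels"][i_s:i_e]
                    one_hot = np.zeros((len(labels), nb_class), np.float32)
                    one_hot[np.arange(len(labels)), labels] = 1
                    yield images, one_hot
    
    # usage: gen = hdf5_batch_generator('Cat vs Dog/dataset.hdf5')
    #        images, labels = next(gen)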
    

    Summary

    """
    写模式打开hdf5文件:
    """
    hdf5_file = h5py.File(hdf5_path, mode='w')
    
    """
    建立矩阵
    """
    hdf5_file.create_dataset("train_img", train_shape, np.int8)
    
    """
    对矩阵赋值
    """
    hdf5_file["train_img"][i, ...] = img
    hdf5_file["train_mean"][...] = mean
    
    """
    读模式打开hdf5文件
    """
    hdf5_file = h5py.File(hdf5_path, "r")
    
    """
    读取矩阵中内容
    """
    images = hdf5_file["train_img"][i, ...]
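
    One further idiom worth noting (an addition, not in the original summary): opening the file with a context manager closes it automatically, even if an exception occurs.

    """
    Open with a context manager (the file is closed automatically)
    """
    with h5py.File(hdf5_path, "r") as hdf5_file:
        images = hdf5_file["train_img"][0:10, ...]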
    
    
