TensorFlow Data Reading and Writing Methods

Author: 奔跑_7182 | Published 2018-01-15

    This article walks through the common ways of reading data in the TensorFlow framework.

    1、Preloaded data
    The data is hard-coded directly in the program, the simplest and most familiar form.

    #coding=utf8
    import tensorflow as tf
    
    # the data is written directly into the graph as constants
    a = tf.constant([[1,2],[3,4]])
    b = tf.constant([[1,2],[3,4]])
    
    c = tf.matmul(a,b)
    with tf.Session() as sess:
        print(sess.run(c))
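
    Constants like these are baked into the graph definition, so they only suit small arrays. For larger preloaded data, a common variant is to store it in a non-trainable Variable so the array is not serialized into the GraphDef; a minimal sketch:

    #coding=utf8
    import tensorflow as tf
    
    data = [[1,2],[3,4]]
    # sketch: collections=[] keeps x out of the global-variables collection,
    # so it is initialized explicitly and never treated as a model parameter
    x = tf.Variable(data, trainable=False, collections=[])
    y = tf.matmul(x, x)
    with tf.Session() as sess:
        sess.run(x.initializer)
        print(sess.run(y))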
    
    

    2、Feeding: Python generates the data, then feeds it to the backend at run time.
    This method is used very often.

    #coding=utf8
    import tensorflow as tf
    
    a = tf.placeholder(tf.int32,shape=[2,2])
    b = tf.placeholder(tf.int32,shape=[2,2])
    ## a and b are placeholders; data is only supplied when the graph runs
    
    c = tf.matmul(a,b)
    
    a1 = [[1,2],[3,4]]
    b1 = [[1,2],[3,4]]
    ## a1 and b1 are the data fed into a and b; they could also be read from a file
    with tf.Session() as sess:
        print(sess.run(c,feed_dict={a:a1,b:b1}))
    

    3、Reading from file: reading directly from files
    (1)、read from CSV or txt
    Sometimes the data files we work with are in CSV or txt format.
    Single reader, single sample (batch_size=1)

    #coding=utf8
    import tensorflow as tf
    
    # create the filename queue
    filenames = ['datas/A.csv','datas/B.csv']
    filename_queue = tf.train.string_input_producer(filenames,shuffle=True)
    # shuffle=True: filenames are dequeued in random order (this is the default)
    
    TFReader = tf.TextLineReader()
    key,value = TFReader.read(filename_queue)
    
    example, label = tf.decode_csv(value, record_defaults=[[], []])
    ## record_defaults=[[], []] sets the default type of each parsed column;
    ## there is one entry per column in the file. The delimiter defaults to a comma and can be changed.
    ## For details on tf.decode_csv() see https://www.tensorflow.org/versions/master/api_docs/python/tf/decode_csv
    
    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess,coord=coord)
        for i in range(100):
            ## keeps reading in a loop; the queue cycles even if the files have fewer lines in total
            ## fetch example and label in a single run() call: calling eval() on each
            ## separately dequeues two different records and mismatches the pair
            print(sess.run([example, label]))
        coord.request_stop()
        coord.join(threads)
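
    A quick sketch of two tf.decode_csv options worth knowing: explicit record_defaults pin down each column's dtype and fill in missing fields, and field_delim switches the separator.

    #coding=utf8
    import tensorflow as tf
    
    # sketch: two int32 columns, tab-separated; the empty second field
    # falls back to its default value of -1
    line = tf.constant('3\t')
    example, label = tf.decode_csv(line,
                                   record_defaults=[[0],[-1]],
                                   field_delim='\t')
    with tf.Session() as sess:
        print(sess.run([example, label]))  # [3, -1]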
    

    Single reader, multiple samples (batch_size > 1)

    #coding=utf8
    import tensorflow as tf
    
    # create the filename queue
    filenames = ['datas/A.csv','datas/B.csv']
    filename_queue = tf.train.string_input_producer(filenames,shuffle=False)
    # shuffle=False here; shuffle=True (random order) is the default
    
    TFReader = tf.TextLineReader()
    key,value = TFReader.read(filename_queue)
    
    example, label = tf.decode_csv(value, record_defaults=[[], []])
    ## record_defaults=[[], []] sets the default type of each parsed column;
    ## one entry per column, comma-delimited by default
    
    example_batch,label_batch = tf.train.batch([example,label],
                                               batch_size=5,
                                               capacity=100,
                                               num_threads=2)
    # ## shuffled batching:
    # example_batch,label_batch = tf.train.shuffle_batch([example,label],
    #                                                    batch_size=5,
    #                                                    capacity=100,
    #                                                    min_after_dequeue=50,
    #                                                    num_threads=2)
    
    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess,coord=coord)
        for i in range(10):
            ## keeps reading in a loop; the queue cycles even if the files have fewer lines in total
            print(example_batch.eval())
        coord.request_stop()
        coord.join(threads)
    

    Multiple readers, multiple samples

    #coding=utf8
    import tensorflow as tf
    
    # create the filename queue
    filenames = ['datas/A.csv','datas/B.csv']
    filename_queue = tf.train.string_input_producer(filenames,shuffle=False)
    # shuffle=False here; shuffle=True (random order) is the default
    
    example_list = []
    for _ in range(2):   # the 2 means two readers are created
        TFReader = tf.TextLineReader()
        key,value = TFReader.read(filename_queue)
        example_list.append(tf.decode_csv(value, record_defaults=[[], []]))
    ## each list entry gets its own reader; reusing a single reader's output
    ## would funnel every batch_join thread through the same reader
    
    example_batch,label_batch = tf.train.batch_join(example_list,batch_size=5)
    # tf.train.batch_join() reads with multiple readers in parallel, one thread per reader.
    
    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess,coord=coord)
        for i in range(10):
            ## keeps reading in a loop; fetch both tensors in one run() so the pairs stay aligned
            print(sess.run([example_batch, label_batch]))
        coord.request_stop()
        coord.join(threads)
    

    tf.train.batch and tf.train.shuffle_batch read with a single Reader, but can use multiple threads. tf.train.batch_join and tf.train.shuffle_batch_join can use multiple Readers, one thread per Reader. As for efficiency: with a single Reader, two threads already saturate the read speed; with multiple Readers, two Readers reach the limit. So more threads are not always faster; past a point, extra threads actually reduce throughput.
    (2)、read from tfrecords
    This method is often used for image data: the images are first written into a tfrecords file and read back when needed, which is very fast. The trade-off is that image files take up more disk space once written into a tfrecords file. When working with images, resize them all to the same size before saving (this matters; in my tests I did not find a way to store images of different sizes).

    #coding=utf8
    import tensorflow as tf
    import os
    import numpy as np
    from PIL import Image
    
    ## write the tfrecords file
    def write_tfrecords():
        np.random.seed(100)
        path = 'E:/cats_dogs/'
        savepath = 'datas/test.tfrecords'
        files = [path+item for item in os.listdir(path) if item.endswith('.jpg')]
        np.random.shuffle(files)
        train_files = files[:23000]
        test_files = files[23000:]
        # only the test split is written below; train_files would be handled the same way
        TFWriter = tf.python_io.TFRecordWriter(savepath)
        for i,file in enumerate(test_files):
            if i%1000==0:
                print(i)
            lab = file.split('/')[-1].split('.')[0].strip()
            if lab=='cat':
                label = 1
            else:
                label = 0
            image = Image.open(file)
            image = image.resize((208,208))
            imagerow = image.tobytes()
            sample = tf.train.Example(features=tf.train.Features(feature={
                'label':tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
                'image':tf.train.Feature(bytes_list=tf.train.BytesList(value=[imagerow]))
            }))
            TFWriter.write(sample.SerializeToString())
        TFWriter.close()
    
    
    # write_tfrecords()
    
    
    ## read from the tfrecords file
    def read_tfrecords():
        filepath = 'datas/test.tfrecords'
        filename_queue = tf.train.string_input_producer([filepath])
        TFReader = tf.TFRecordReader()
        _,serialize_sample = TFReader.read(filename_queue)
        features = tf.parse_single_example(serialize_sample,features={
            'label':tf.FixedLenFeature([],tf.int64),
            'image':tf.FixedLenFeature([],tf.string)
        })
        image = tf.decode_raw(features['image'],tf.uint8)
        image = tf.reshape(image,shape=[208,208,3])
        label = tf.cast(features['label'],tf.int32)
    
        return image,label
    
    ## read a batch
    def next_batch(batch_size):
        # import matplotlib.pyplot as plt
        image,label = read_tfrecords()
        image_batch,label_batch = tf.train.shuffle_batch([image,label],
                                                         batch_size=batch_size,
                                                         capacity=200,
                                                         min_after_dequeue=100,
                                                         num_threads=32)
    
        # with tf.Session() as sess:
        #     sess.run(tf.global_variables_initializer())
        #     coord = tf.train.Coordinator()
        #     threads = tf.train.start_queue_runners(coord=coord)
        #     image,label = sess.run([image_batch,label_batch])
        #     for i in range(2):
        #         print(label[i])
        #         plt.imshow(image[i])
        #         plt.show()
        #     coord.request_stop()
        #     coord.join(threads)
        image_batch = tf.cast(image_batch,tf.float32)
        label_batch = tf.cast(label_batch,tf.int32)
        return image_batch,label_batch
    
    # next_batch(2)
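
    A minimal usage sketch for next_batch() (it assumes datas/test.tfrecords has already been written): start the queue runners, then fetch batches in a loop.

    image_batch, label_batch = next_batch(16)
    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess,coord=coord)
        try:
            for step in range(5):
                # each run() dequeues one shuffled batch of 16 images/labels
                images, labels = sess.run([image_batch, label_batch])
                print(step, images.shape, labels[:5])
        finally:
            coord.request_stop()
            coord.join(threads)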
    

    The method above is very fast and light on memory, but the drawback is that the data is fixed. When working with images, one often augments the data by perturbing the images (blurring, brightness changes, occlusion); preprocessing first and then writing every variant into a tfrecords file wastes an enormous amount of disk space. So I sometimes prefer to read the image files directly, add noise on the fly, and then feed them into training (see the augmentation sketch after the code below). The downside of this approach is that reading and decoding images at high rates is very CPU-intensive.

    import tensorflow as tf
    import numpy as np
    import os
    import math
    
    # you need to change this to your data directory
    # train_dir = '/home/acrobat/DataSets/cats_vs_dogs/train/'
    
    def get_files(file_dir, ratio):
        """
        Args:
            file_dir: file directory
            ratio:ratio of validation datasets
        Returns:
            list of images and labels
        """
        cats = []
        label_cats = []
        dogs = []
        label_dogs = []
        for file in os.listdir(file_dir):
            name = file.split(sep='.')
            if name[0]=='cat':
                cats.append(file_dir + file)
                label_cats.append(0)
            else:
                dogs.append(file_dir + file)
                label_dogs.append(1)
        print('There are %d cats\nThere are %d dogs' %(len(cats), len(dogs)))
    
        image_list = np.hstack((cats, dogs))
        label_list = np.hstack((label_cats, label_dogs))
    
        temp = np.array([image_list, label_list])
        temp = temp.transpose()
        np.random.shuffle(temp)
    
        all_image_list = temp[:, 0]
        all_label_list = temp[:, 1]
    
        n_sample = len(all_label_list)
        n_val = math.ceil(n_sample*ratio) # number of validation samples
        n_train = n_sample - n_val # number of training samples
    
        tra_images = all_image_list[0:n_train]
        tra_labels = all_label_list[0:n_train]
        tra_labels = [int(float(i)) for i in tra_labels]
        val_images = all_image_list[n_train:]   # [n_train:-1] would silently drop the last sample
        val_labels = all_label_list[n_train:]
        val_labels = [int(float(i)) for i in val_labels]
    
        return tra_images,tra_labels,val_images,val_labels
    
    
    def get_batch(image, label, image_W, image_H, batch_size, capacity):
        """
        Args:
            image: list type
            label: list type
            image_W: image width
            image_H: image height
            batch_size: batch size
            capacity: the maximum elements in queue
        Returns:
            image_batch: 4D tensor [batch_size, width, height, 3], dtype=tf.float32
            label_batch: 1D tensor [batch_size], dtype=tf.int32
        """
    
        image = tf.cast(image, tf.string)
        label = tf.cast(label, tf.int32)
    
        # make an input queue
        input_queue = tf.train.slice_input_producer([image, label])
    
        label = input_queue[1]
        image_contents = tf.read_file(input_queue[0])
        image = tf.image.decode_jpeg(image_contents, channels=3)
    
        image = tf.image.resize_image_with_crop_or_pad(image, image_W, image_H)
    
        # if you want to test the generated batches of images, you might want to comment the following line.
        image = tf.image.per_image_standardization(image)
    
        image_batch, label_batch = tf.train.batch([image, label],
                                                    batch_size= batch_size,
                                                    num_threads= 64,
                                                    capacity = capacity)
        #you can also use shuffle_batch
    #    image_batch, label_batch = tf.train.shuffle_batch([image,label],
    #                                                      batch_size=BATCH_SIZE,
    #                                                      num_threads=64,
    #                                                      capacity=CAPACITY,
    #                                                      min_after_dequeue=CAPACITY-1)
    
        label_batch = tf.reshape(label_batch, [batch_size])
        image_batch = tf.cast(image_batch, tf.float32)
    
        return image_batch, label_batch
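
    For the on-the-fly noise mentioned above, a minimal augmentation sketch (the ops and parameters here are illustrative, assuming standard tf.image calls); it would slot into get_batch right before per_image_standardization:

    import tensorflow as tf
    
    def augment(image):
        # sketch: cast to float32 in [0,1] so the deltas below are well-defined
        image = tf.image.convert_image_dtype(image, tf.float32)
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_brightness(image, max_delta=0.2)
        image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
        return image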
    

    (3)、read from bin
    Sometimes the data comes in binary (bin) format, so we need to read it out of binary files.
    The official cifar example reads from bin files. A bin file has to be stored with a fixed record layout: how many bytes each sample's values occupy and how many bytes the label occupies, identical for every sample, with records stored back to back. tf.FixedLengthRecordReader can then read a fixed number of bytes each time, exactly one stored sample (label included), and tf.decode_raw parses the bytes.

    import tensorflow as tf
    import numpy as np
    
    # fixed image record layout
    labelBytes = 1
    widthBytes = 32
    heightBytes = 32
    depthBytes = 3
    imageBytes = widthBytes*heightBytes*depthBytes
    recordBytes = imageBytes+labelBytes
    
    filename_queue = tf.train.string_input_producer(["./data/train.bin"])
    reader = tf.FixedLengthRecordReader(record_bytes=recordBytes) # read fixed-length records from the binary file
    key,value = reader.read(filename_queue)
    
    bytes = tf.decode_raw(value,out_type=tf.uint8) # decode to uint8: 0-255, 8-bit 3-channel image data
    label = tf.cast(tf.strided_slice(bytes,[0],[labelBytes]),tf.int32) # slice off the label and cast to int32
    ## tf.strided_slice() splits the decoded byte tensor
    originalImg  = tf.reshape(tf.strided_slice(bytes,[labelBytes],[labelBytes+imageBytes]),[depthBytes,heightBytes,widthBytes])
    # slice out the image; in this storage layout the depth axis comes first
    img = tf.transpose(originalImg,[1,2,0]) # reorder the axes so depth comes last
    
    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
    
        for i in range(100):
            imgArr = sess.run(img)
            print (imgArr.shape)
    
        coord.request_stop()
        coord.join(threads)
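
    To produce a file in this layout, a writer just concatenates raw bytes record by record; a minimal sketch, with two dummy records standing in for real data:

    import numpy as np
    
    # sketch: each record is 1 label byte followed by 3*32*32 image bytes,
    # stored depth-major to match the reshape used by the reader above
    with open('./data/train.bin','wb') as f:
        for label, image in [(0, np.zeros((3,32,32),np.uint8)),
                             (1, np.ones((3,32,32),np.uint8))]:
            f.write(np.uint8(label).tobytes())
            f.write(image.tobytes())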
    

    For details on tf.strided_slice see: https://www.tensorflow.org/versions/master/api_docs/python/tf/strided_slice
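
    A tiny sketch of what it does to the decoded record: begin and end are element indices into the 1-D byte tensor.

    import tensorflow as tf
    
    record = tf.constant([7,10,20,30], dtype=tf.uint8)  # one label byte plus three "pixel" bytes
    label = tf.strided_slice(record,[0],[1])    # -> [7]
    pixels = tf.strided_slice(record,[1],[4])   # -> [10, 20, 30]
    with tf.Session() as sess:
        print(sess.run([label,pixels]))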

    These are the common TensorFlow data reading and writing approaches I have run into so far; I will keep adding to this list as I encounter more.

    Addendum: here is a fairly detailed tutorial on reading and writing tfrecords files, which I found very helpful:
    http://blog.csdn.net/u010223750/article/details/70482498

    References:
    http://honggang.io/2016/08/19/tensorflow-data-reading/
    http://blog.csdn.net/freedom098/article/details/56008784
