TF2 Keras (6): Preprocessing Layers


Author: 数科每日 | Published 2021-01-05 21:17

    This article is a set of study notes on the official documentation.


    The preprocessing layers introduced here are native Keras components. Everything they do could also be done with other tools (pandas, numpy, sklearn), and plenty of such code is available online. The biggest benefit of using preprocessing layers is that the finished model carries its own preprocessing, which helps you build a truly end-to-end model and minimizes the burden on its callers: users can feed raw strings or raw images directly to the model.

    Preprocessing Layers:

    Core preprocessing layers
    • TextVectorization layer: text vectorization (turns raw strings into an encoded representation)
    • Normalization layer: feature-wise normalization (standardization) of numeric features
    Structured data preprocessing layers
    • CategoryEncoding layer: one-hot, multi-hot, or TF-IDF encoding of categorical features that have already been converted to indices; usually combined with StringLookup or IntegerLookup

    • Hashing layer: hashes categorical feature values (the "hashing trick")

    • Discretization layer: buckets continuous numeric features into discrete ranges, turning them into categorical features

    • StringLookup layer: turns string categorical values into integer indices

    • IntegerLookup layer: turns integer categorical values into integer indices

    • CategoryCrossing layer: crosses multiple categorical columns to generate a new combined feature

    Image preprocessing layers
    • Resizing layer: resizes a batch of images
    • Rescaling layer: rescales pixel values (e.g. from [0, 255] to [0, 1])
    • CenterCrop layer: returns a center crop of a batch of images (a short usage sketch of these image layers follows the lists below)
    Image data augmentation layers
    • RandomCrop layer: random cropping
    • RandomFlip layer: random flipping
    • RandomTranslation layer: random translation (shifting)
    • RandomRotation layer: random rotation
    • RandomZoom layer: random zooming
    • RandomHeight layer: randomly varies the image height
    • RandomWidth layer: randomly varies the image width
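
    As a quick illustration of the image preprocessing layers above, here is a minimal sketch; the batch shape, target sizes, and crop size are arbitrary example values, not anything from the original article:

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras.layers.experimental import preprocessing
    
    # A dummy batch of 4 RGB images, 200x200, with pixel values in [0, 255]
    images = np.random.randint(0, 255, size=(4, 200, 200, 3)).astype("float32")
    
    # Resize to 160x160, take a 128x128 center crop, then rescale values to [0, 1]
    image_preprocessing = keras.Sequential(
        [
            preprocessing.Resizing(160, 160),
            preprocessing.CenterCrop(128, 128),
            preprocessing.Rescaling(1.0 / 255),
        ]
    )
    
    processed = image_preprocessing(images)
    print(processed.shape)  # (4, 128, 128, 3)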

    Stateful layers and adapt()

    The state of the layers below must be computed from the data at hand; before they can process data, you need to call adapt() so they can learn from it.

    • TextVectorization
    • Normalization
    • StringLookup
    • IntegerLookup
    • CategoryEncoding
    • Discretization
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.layers.experimental import preprocessing
    
    data = np.array([[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7],])
    layer = preprocessing.Normalization()
    layer.adapt(data)  # learn the mean and variance of the data
    normalized_data = layer(data)
    
    print("Features mean: %.2f" % (normalized_data.numpy().mean()))
    print("Features std: %.2f" % (normalized_data.numpy().std()))
    

    Should the preprocessing layer be placed inside the model?

    Inside the model
    • Preprocessing can take advantage of GPU acceleration
    • If a GPU is available, putting the layer inside the model is a big win for Normalization and for the image-related preprocessing layers.
    # Schematic example: preprocessing_layer and rest_of_the_model stand in for your own layers
    inputs = keras.Input(shape=input_shape)
    x = preprocessing_layer(inputs)
    outputs = rest_of_the_model(x)
    model = keras.Model(inputs, outputs)
    
    Outside the model (in the tf.data pipeline)
    • Preprocessing runs on the CPU
    • The preprocessed data can be buffered/prefetched, and even cached to a file
    • Preprocessing runs asynchronously with training
    • This is the better option for TextVectorization and for the structured data preprocessing layers (see the sketch below)
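
    A minimal sketch of this approach, assuming dataset is an existing tf.data.Dataset of (x, y) pairs and preprocessing_layer is an already-adapted preprocessing layer (both names are placeholders, not from the original article):

    import tensorflow as tf
    
    # Apply the (already adapted) preprocessing layer on the CPU inside the
    # tf.data pipeline, and prefetch so preprocessing runs asynchronously
    # with training.
    dataset = dataset.map(lambda x, y: (preprocessing_layer(x), y))
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)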

    Examples

    Image data augmentation

    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.layers.experimental import preprocessing
    
    # Create a data augmentation stage with horizontal flipping, rotations, zooms
    data_augmentation = keras.Sequential(
        [
            preprocessing.RandomFlip("horizontal"),
            preprocessing.RandomRotation(0.1),
            preprocessing.RandomZoom(0.1),
        ]
    )
    
    # Create a model that includes the augmentation stage
    input_shape = (32, 32, 3)
    classes = 10
    inputs = keras.Input(shape=input_shape)
    # Augment images
    x = data_augmentation(inputs)
    # Rescale image values to [0, 1]
    x = preprocessing.Rescaling(1.0 / 255)(x)
    # Add the rest of the model
    outputs = keras.applications.ResNet50(
        weights=None, input_shape=input_shape, classes=classes
    )(x)
    model = keras.Model(inputs, outputs)
    

    Normalizing numerical features

    # Load some data
    (x_train, y_train), _ = keras.datasets.cifar10.load_data()
    x_train = x_train.reshape((len(x_train), -1))
    input_shape = x_train.shape[1:]
    classes = 10
    
    # Create a Normalization layer and set its internal state using the training data
    normalizer = preprocessing.Normalization()
    normalizer.adapt(x_train)
    
    # Create a model that includes the normalization layer
    inputs = keras.Input(shape=input_shape)
    x = normalizer(inputs)
    outputs = layers.Dense(classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    
    # Train the model
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(x_train, y_train)
    

    Encoding string categorical features via one-hot encoding

    # Define some toy data
    data = tf.constant(["a", "b", "c", "b", "c", "a"])
    
    # Use StringLookup to build an index of the feature values
    indexer = preprocessing.StringLookup()
    indexer.adapt(data)
    
    # Use CategoryEncoding to encode the integer indices to a one-hot vector
    encoder = preprocessing.CategoryEncoding(output_mode="binary")
    encoder.adapt(indexer(data))
    
    # Convert new test data (which includes unknown feature values)
    test_data = tf.constant(["a", "b", "c", "d", "e", ""])
    encoded_data = encoder(indexer(test_data))
    print(encoded_data)
    
    # Define some toy data
    data = tf.constant([10, 20, 20, 10, 30, 0])
    
    # Use IntegerLookup to build an index of the feature values
    indexer = preprocessing.IntegerLookup()
    indexer.adapt(data)
    
    # Use CategoryEncoding to encode the integer indices to a one-hot vector
    encoder = preprocessing.CategoryEncoding(output_mode="binary")
    encoder.adapt(indexer(data))
    
    # Convert new test data (which includes unknown feature values)
    test_data = tf.constant([10, 10, 20, 50, 60, 0])
    encoded_data = encoder(indexer(test_data))
    print(encoded_data)
    

    Applying the hashing trick to an integer categorical feature

    If you have a categorical feature that can take many different values (on the order of 10e3 or higher), where each value only appears a few times in the data, it becomes impractical and ineffective to index and one-hot encode the feature values. Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector of fixed size. This keeps the size of the feature space manageable, and removes the need for explicit indexing.

    # Sample data: 10,000 random integers with values between 0 and 100,000
    data = np.random.randint(0, 100000, size=(10000, 1))
    
    # Use the Hashing layer to hash the values into 64 bins (the range [0, 64))
    hasher = preprocessing.Hashing(num_bins=64, salt=1337)
    
    # Use the CategoryEncoding layer to one-hot encode the hashed values
    encoder = preprocessing.CategoryEncoding(max_tokens=64, output_mode="binary")
    encoded_data = encoder(hasher(data))
    print(encoded_data.shape)
    

    Encoding text as a sequence of token indices

    # Define some text data to adapt the layer
    data = tf.constant(
        [
            "The Brain is wider than the Sky",
            "For put them side by side",
            "The one the other will contain",
            "With ease and You beside",
        ]
    )
    # Instantiate TextVectorization with "int" output_mode
    text_vectorizer = preprocessing.TextVectorization(output_mode="int")
    # Index the vocabulary via `adapt()`
    text_vectorizer.adapt(data)
    
    # You can retrieve the vocabulary we indexed via get_vocabulary()
    vocab = text_vectorizer.get_vocabulary()
    print("Vocabulary:", vocab)
    
    # Create an Embedding + LSTM model
    inputs = keras.Input(shape=(1,), dtype="string")
    x = text_vectorizer(inputs)
    x = layers.Embedding(input_dim=len(vocab), output_dim=64)(x)
    outputs = layers.LSTM(1)(x)
    model = keras.Model(inputs, outputs)
    
    # Call the model on test data (which includes unknown tokens)
    test_data = tf.constant(["The Brain is deeper than the sea"])
    test_output = model(test_data)
    

    Encoding text as a dense matrix of ngrams with multi-hot encoding

    # Define some text data to adapt the layer
    data = tf.constant(
        [
            "The Brain is wider than the Sky",
            "For put them side by side",
            "The one the other will contain",
            "With ease and You beside",
        ]
    )
    # Instantiate TextVectorization with "binary" output_mode (multi-hot)
    # and ngrams=2 (index all bigrams)
    text_vectorizer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
    # Index the bigrams via `adapt()`
    text_vectorizer.adapt(data)
    
    print(
        "Encoded text:\n",
        text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
        "\n",
    )
    
    # Create a Dense model
    inputs = keras.Input(shape=(1,), dtype="string")
    x = text_vectorizer(inputs)
    outputs = layers.Dense(1)(x)
    model = keras.Model(inputs, outputs)
    
    # Call the model on test data (which includes unknown tokens)
    test_data = tf.constant(["The Brain is deeper than the sea"])
    test_output = model(test_data)
    
    print("Model output:", test_output)
    

    Encoding text as a dense matrix of ngrams with TF-IDF weighting

    # Define some text data to adapt the layer
    data = tf.constant(
        [
            "The Brain is wider than the Sky",
            "For put them side by side",
            "The one the other will contain",
            "With ease and You beside",
        ]
    )
    # Instantiate TextVectorization with "tf-idf" output_mode
    # (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)
    text_vectorizer = preprocessing.TextVectorization(output_mode="tf-idf", ngrams=2)
    # Index the bigrams and learn the TF-IDF weights via `adapt()`
    text_vectorizer.adapt(data)
    
    print(
        "Encoded text:\n",
        text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
        "\n",
    )
    
    # Create a Dense model
    inputs = keras.Input(shape=(1,), dtype="string")
    x = text_vectorizer(inputs)
    outputs = layers.Dense(1)(x)
    model = keras.Model(inputs, outputs)
    
    # Call the model on test data (which includes unknown tokens)
    test_data = tf.constant(["The Brain is deeper than the sea"])
    test_output = model(test_data)
    print("Model output:", test_output)
    
