TF2 Keras (6): Preprocessing Layers


Author: 数科每日 | Published on 2021-01-05 21:17

This article is a set of study notes on the official documentation.


The preprocessing layers (Preprocessing Layers) introduced here are native Keras components. Everything they do could also be done with other tools (pandas, NumPy, scikit-learn), and plenty of example code for that exists online. The biggest advantage of doing preprocessing with Preprocessing Layers is that the resulting model carries its own preprocessing, which makes it possible to build a truly end-to-end model and minimizes the work required from whoever uses it: callers can feed raw strings or raw images directly to the model.

Preprocessing Layers:

Core preprocessing layers
  • TextVectorization: turns raw text into an encoded (vectorized) representation
  • Normalization: feature-wise normalization (standardization) of numeric features
Structured data preprocessing layers
  • CategoryEncoding layer: encodes categorical features that have already been converted to indices as one-hot, multi-hot, or TF-IDF vectors; usually combined with StringLookup or IntegerLookup
  • Hashing layer: hashes categorical feature values into a fixed number of bins (the "hashing trick")
  • Discretization layer: buckets continuous numeric features into ranges, turning them into categorical features
  • StringLookup layer: maps string categorical values to integer indices
  • IntegerLookup layer: maps integer categorical values to indices
  • CategoryCrossing layer: crosses several categorical columns to produce new combined features

Image preprocessing layers
  • Resizing layer: resizes images to a target size
  • Rescaling layer: rescales pixel values, e.g. from [0, 255] to [0, 1]
  • CenterCrop layer: crops the central portion of images to a target size (see the sketch after this list)
Image data augmentation layers
  • RandomCrop layer: random cropping
  • RandomFlip layer: random horizontal and/or vertical flipping
  • RandomTranslation layer: random horizontal/vertical shifts
  • RandomRotation layer: random rotation
  • RandomZoom layer: random zoom
  • RandomHeight layer: randomly varies image height
  • RandomWidth layer: randomly varies image width
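
A quick illustration of the image preprocessing layers above (a minimal sketch on random data; the layer names are real, but the sizes and the batch are arbitrary choices, not from the original article):

import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

# A batch of 4 fake 180x180 RGB images with pixel values in [0, 255)
images = tf.random.uniform((4, 180, 180, 3), maxval=255)

x = preprocessing.Resizing(height=128, width=128)(images)  # resize to 128x128
x = preprocessing.CenterCrop(height=96, width=96)(x)  # keep the central 96x96 patch
x = preprocessing.Rescaling(1.0 / 255)(x)  # scale pixel values from [0, 255] to [0, 1]
print(x.shape)  # (4, 96, 96, 3)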

Stateful layers and adapt()

The state of the following layers must be computed from the data. Before using them for preprocessing, call adapt() on (a sample of) the training data so the layer can learn that state:

  • TextVectorization
  • Normalization
  • StringLookup
  • IntegerLookup
  • CategoryEncoding
  • Discretization
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

data = np.array([[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7]])
layer = preprocessing.Normalization()
# adapt() computes the mean and variance of the data and stores them in the layer
layer.adapt(data)
# Calling the layer then normalizes each feature to zero mean and unit variance
normalized_data = layer(data)

print("Features mean: %.2f" % (normalized_data.numpy().mean()))
print("Features std: %.2f" % (normalized_data.numpy().std()))

Should the preprocessing layers be placed inside the model?

Inside the model
  • Preprocessing runs on device, so it can benefit from GPU acceleration
  • If a GPU is available, this is the best option for the Normalization layer and for all image preprocessing and augmentation layers
inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = rest_of_the_model(x)
model = keras.Model(inputs, outputs)
Not inside the model (apply it in the tf.data pipeline instead)
  • Preprocessing runs on the CPU
  • Preprocessed batches can be buffered, and even cached to files
  • Runs asynchronously, in parallel with training
  • This is the better option for TextVectorization and for the structured-data layers (see the sketch below)
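
A minimal sketch of the second option, reusing the data array and the adapted Normalization layer from the snippet above (the batch size is arbitrary):

# Build a tf.data pipeline; the preprocessing runs on the CPU inside map(),
# asynchronously and ahead of the training loop thanks to prefetch()
dataset = tf.data.Dataset.from_tensor_slices(data).batch(2)
dataset = dataset.map(lambda x: layer(x))
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

for batch in dataset:
    print(batch.numpy())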

Examples

Image data augmentation

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

# Create a data augmentation stage with horizontal flipping, rotations, zooms
data_augmentation = keras.Sequential(
    [
        preprocessing.RandomFlip("horizontal"),
        preprocessing.RandomRotation(0.1),
        preprocessing.RandomZoom(0.1),
    ]
)

# Create a model that includes the augmentation stage
input_shape = (32, 32, 3)
classes = 10
inputs = keras.Input(shape=input_shape)
# Augment images
x = data_augmentation(inputs)
# Rescale image values to [0, 1]
x = preprocessing.Rescaling(1.0 / 255)(x)
# Add the rest of the model
outputs = keras.applications.ResNet50(
    weights=None, input_shape=input_shape, classes=classes
)(x)
model = keras.Model(inputs, outputs)
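
Note that the random augmentation layers only apply their transformations in training mode (which fit() sets automatically); in inference mode they pass images through unchanged. A quick check on a hypothetical random batch (not part of the original example):

images = tf.random.uniform((4, 32, 32, 3))  # random "images" matching the model's input shape
augmented = data_augmentation(images, training=True)  # explicitly request training mode
print(augmented.shape)  # (4, 32, 32, 3)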

Normalizing numerical features

# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
x_train = x_train.reshape((len(x_train), -1))
input_shape = x_train.shape[1:]
classes = 10

# Create a Normalization layer and set its internal state using the training data
normalizer = preprocessing.Normalization()
normalizer.adapt(x_train)

# Create a model that include the normalization layer
inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)
outputs = layers.Dense(classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Train the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train)
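
Because the Normalization layer is part of the model, the trained model can be called directly on raw, unnormalized pixel values; a quick sanity check (not part of the original guide):

print(model.predict(x_train[:1]).shape)  # (1, 10) -- normalization happens inside the model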

Encoding string categorical features via one-hot encoding

# Define some toy data
data = tf.constant(["a", "b", "c", "b", "c", "a"])

# Use StringLookup to build an index of the feature values
indexer = preprocessing.StringLookup()
indexer.adapt(data)

# Use CategoryEncoding to encode the integer indices to a one-hot vector
encoder = preprocessing.CategoryEncoding(output_mode="binary")
encoder.adapt(indexer(data))

# Convert new test data (which includes unknown feature values)
test_data = tf.constant(["a", "b", "c", "d", "e", ""])
encoded_data = encoder(indexer(test_data))
print(encoded_data)

Encoding integer categorical features via one-hot encoding

# Define some toy data
data = tf.constant([10, 20, 20, 10, 30, 0])

# Use IntegerLookup to build an index of the feature values
indexer = preprocessing.IntegerLookup()
indexer.adapt(data)

# Use CategoryEncoding to encode the integer indices to a one-hot vector
encoder = preprocessing.CategoryEncoding(output_mode="binary")
encoder.adapt(indexer(data))

# Convert new test data (which includes unknown feature values)
test_data = tf.constant([10, 10, 20, 50, 60, 0])
encoded_data = encoder(indexer(test_data))
print(encoded_data)

Applying the hashing trick to an integer categorical feature

If you have a categorical feature that can take many different values (on the order of 10e3 or higher), where each value only appears a few times in the data, it becomes impractical and ineffective to index and one-hot encode the feature values. Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector of fixed size. This keeps the size of the feature space manageable, and removes the need for explicit indexing.

# Sample data: 10,000 random integers with values between 0 and 100,000
data = np.random.randint(0, 100000, size=(10000, 1))

# Use the Hashing layer to hash the values into the range [0, 64)
hasher = preprocessing.Hashing(num_bins=64, salt=1337)

# Use the CategoryEncoding layer to one-hot encode the hashed values
encoder = preprocessing.CategoryEncoding(max_tokens=64, output_mode="binary")
encoded_data = encoder(hasher(data))
print(encoded_data.shape)
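
The hashed-and-encoded feature can be wired into a model just like in the other examples; a minimal sketch (the int64 input and the Dense head are illustrative, not from the original guide):

# Build a small model that applies hashing + one-hot encoding to a raw integer feature
inputs = keras.Input(shape=(1,), dtype="int64")
x = encoder(hasher(inputs))
outputs = layers.Dense(1)(x)
hashed_model = keras.Model(inputs, outputs)

# Call the model on a few raw values
print(hashed_model(tf.constant(data[:3], dtype="int64")))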

Encoding text as a sequence of token indices

# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "int" output_mode
text_vectorizer = preprocessing.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(data)

# You can retrieve the vocabulary we indexed via get_vocabulary()
vocab = text_vectorizer.get_vocabulary()
print("Vocabulary:", vocab)

# Create an Embedding + LSTM model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = layers.Embedding(input_dim=len(vocab), output_dim=64)(x)
outputs = layers.LSTM(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)

Encoding text as a dense matrix of ngrams with multi-hot encoding

# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "binary" output_mode (multi-hot)
# and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
# Index the bigrams via `adapt()`
text_vectorizer.adapt(data)

print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)

# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)

print("Model output:", test_output)

Encoding text as a dense matrix of ngrams with TF-IDF weighting

# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf-idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="tf-idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(data)

print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)

# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
print("Model output:", test_output)
