This article is a set of study notes on the official Keras documentation.
The Preprocessing Layers introduced here are native Keras components. Everything they do could also be done with other tools (pandas, numpy, sklearn), and plenty of sample code exists online. The biggest advantage of using Preprocessing Layers is that the finished model carries its own preprocessing, which helps build a truly end-to-end model and minimizes the burden on callers: users can feed raw strings or raw images directly to the model.
Preprocessing Layers:
Core preprocessing layers
- TextVectorization layer: vectorizes raw text into token indices or dense representations
- Normalization layer: feature-wise normalization of numeric features (zero mean, unit variance)
Structured data preprocessing layers
- CategoryEncoding layer: one-hot, multi-hot, or TF-IDF encoding of categories that have already been converted to indices; typically combined with StringLookup or IntegerLookup
- Hashing layer: hashes feature values into a fixed number of bins
- Discretization layer: buckets continuous numeric features into categorical ranges (see the sketch after this list)
- StringLookup layer: maps string categories to integer indices
- IntegerLookup layer: maps integer categories to integer indices
- CategoryCrossing layer: crosses multiple columns to generate a new combined feature
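Discretization is the one structured-data layer these notes do not demonstrate later, so here is a minimal sketch with made-up bin boundaries. Note that in newer TF releases the argument is named bin_boundaries rather than bins.
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

# Boundaries 0, 1, 2 create four buckets: (-inf, 0), [0, 1), [1, 2), [2, +inf)
discretizer = preprocessing.Discretization(bins=[0.0, 1.0, 2.0])
print(discretizer(tf.constant([[-0.5], [0.5], [1.5], [2.5]])))
# Expected bucket indices: 0, 1, 2, 3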
Image preprocessing layers
- Resizing layer: resizes images to a target size
- Rescaling layer: rescales pixel values, e.g. by 1/255 to map them into [0, 1]
- CenterCrop layer: crops the central region of the images to a target size (see the sketch below)
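A minimal sketch (shapes made up for illustration) chaining the three image preprocessing layers on a dummy batch:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

images = tf.random.uniform((4, 100, 120, 3), maxval=255.0)  # dummy image batch
x = preprocessing.Resizing(64, 64)(images)   # -> (4, 64, 64, 3)
x = preprocessing.Rescaling(1.0 / 255)(x)    # pixel values now in [0, 1]
x = preprocessing.CenterCrop(32, 32)(x)      # -> (4, 32, 32, 3)
print(x.shape)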
Image data augmentation layers
- RandomCrop layer: randomly crops images
- RandomFlip layer: randomly flips images
- RandomTranslation layer: randomly shifts images horizontally and/or vertically
- RandomRotation layer: randomly rotates images
- RandomZoom layer: randomly zooms in or out
- RandomHeight layer: randomly varies image height
- RandomWidth layer: randomly varies image width (see the note after this list)
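One point worth adding (from the Keras docs, not in the original list): these Random* layers are only active during training; called with training=False, as at inference time, they pass inputs through unchanged. A quick sketch:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

aug = preprocessing.RandomFlip("horizontal")
batch = tf.random.uniform((2, 8, 8, 3))
out_infer = aug(batch, training=False)  # inference: identity
print(bool(tf.reduce_all(out_infer == batch)))  # True
out_train = aug(batch, training=True)   # training: some images may be flipped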
Stateful layers and adapt()
The following layers have internal state that must be computed from the data. Before they can process anything, call adapt() so the layer can learn from the data:
- TextVectorization
- Normalization
- StringLookup
- IntegerLookup
- CategoryEncoding
- Discretization
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing
data = np.array([[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7]])
# adapt() computes the mean and variance that the layer uses to normalize
layer = preprocessing.Normalization()
layer.adapt(data)
normalized_data = layer(data)
print("Features mean: %.2f" % (normalized_data.numpy().mean()))
print("Features std: %.2f" % (normalized_data.numpy().std()))
Should the preprocessing layer go inside the model?
Inside the model
- The preprocessing runs on the GPU together with the rest of the model
- With a GPU available, this is a big win for Normalization and for the image preprocessing/augmentation layers
# Schematic from the guide: preprocessing_layer and rest_of_the_model are placeholders
inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = rest_of_the_model(x)
model = keras.Model(inputs, outputs)
Outside the model (in the tf.data pipeline)
- The preprocessing runs on the CPU
- Batches can be buffered, or even cached to a file
- Preprocessing runs asynchronously, in parallel with training
- This is the better choice for TextVectorization and for structured-data layers (see the sketch below)
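The snippet above shows the in-model route; for the tf.data route, here is a self-contained sketch (the toy features, labels, and the choice of a Normalization layer are my own illustration):
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

# Toy numeric data, made up for illustration
features = np.random.rand(100, 3).astype("float32")
labels = np.random.randint(0, 2, size=(100,))
normalizer = preprocessing.Normalization()
normalizer.adapt(features)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)
# Apply the layer inside the pipeline: runs on the CPU, and prefetch()
# overlaps preprocessing with training
dataset = dataset.map(lambda x, y: (normalizer(x), y))
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
# model.fit(dataset) would then receive already-normalized batches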
Examples
Image data augmentation
from tensorflow import keras
from tensorflow.keras import layers
# Create a data augmentation stage with horizontal flipping, rotations, zooms
data_augmentation = keras.Sequential(
    [
        preprocessing.RandomFlip("horizontal"),
        preprocessing.RandomRotation(0.1),
        preprocessing.RandomZoom(0.1),
    ]
)
# Create a model that includes the augmentation stage
input_shape = (32, 32, 3)
classes = 10
inputs = keras.Input(shape=input_shape)
# Augment images
x = data_augmentation(inputs)
# Rescale image values to [0, 1]
x = preprocessing.Rescaling(1.0 / 255)(x)
# Add the rest of the model
outputs = keras.applications.ResNet50(
    weights=None, input_shape=input_shape, classes=classes
)(x)
model = keras.Model(inputs, outputs)
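One way to exercise this model (my addition; the dataset choice is arbitrary): CIFAR-10 images already match the (32, 32, 3) input shape, and since Rescaling sits inside the model, raw 0-255 pixel values can be fed in directly:
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=1)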
Normalizing numerical features
# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
x_train = x_train.reshape((len(x_train), -1))
input_shape = x_train.shape[1:]
classes = 10
# Create a Normalization layer and set its internal state using the training data
normalizer = preprocessing.Normalization()
normalizer.adapt(x_train)
# Create a model that includes the normalization layer
inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)
outputs = layers.Dense(classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)
# Train the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train)
Encoding string categorical features via one-hot encoding
# Define some toy data
data = tf.constant(["a", "b", "c", "b", "c", "a"])
# Use StringLookup to build an index of the feature values
indexer = preprocessing.StringLookup()
indexer.adapt(data)
# Use CategoryEncoding to encode the integer indices to a one-hot vector
encoder = preprocessing.CategoryEncoding(output_mode="binary")
encoder.adapt(indexer(data))
# Convert new test data (which includes unknown feature values)
test_data = tf.constant(["a", "b", "c", "d", "e", ""])
encoded_data = encoder(indexer(test_data))
print(encoded_data)
Encoding integer categorical features via one-hot encoding
# Define some toy data
data = tf.constant([10, 20, 20, 10, 30, 0])
# Use IntegerLookup to build an index of the feature values
indexer = preprocessing.IntegerLookup()
indexer.adapt(data)
# Use CategoryEncoding to encode the integer indices to a one-hot vector
encoder = preprocessing.CategoryEncoding(output_mode="binary")
encoder.adapt(indexer(data))
# Convert new test data (which includes unknown feature values)
test_data = tf.constant([10, 10, 20, 50, 60, 0])
encoded_data = encoder(indexer(test_data))
print(encoded_data)
Applying the hashing trick to an integer categorical feature
If you have a categorical feature that can take many different values (on the order of 10e3 or higher), where each value only appears a few times in the data, it becomes impractical and ineffective to index and one-hot encode the feature values. Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector of fixed size. This keeps the size of the feature space manageable, and removes the need for explicit indexing.
# Sample data: 10,000 random integers with values between 0 and 100,000
data = np.random.randint(0, 100000, size=(10000, 1))
# Use the Hashing layer to hash the values into the range [0, 64)
hasher = preprocessing.Hashing(num_bins=64, salt=1337)
# Use the CategoryEncoding layer to one-hot encode the hashed values
encoder = preprocessing.CategoryEncoding(max_tokens=64, output_mode="binary")
encoded_data = encoder(hasher(data))
print(encoded_data.shape)
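Instead of one-hot encoding the hashed bins, another common pattern (my addition, not in the original notes) is to feed them into an Embedding layer, which keeps the representation dense:
from tensorflow.keras import layers

# Hashed bin indices lie in [0, 64), so input_dim=64 covers all of them
hashed = hasher(data)
embedded = layers.Embedding(input_dim=64, output_dim=8)(hashed)
print(embedded.shape)  # (10000, 1, 8)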
Encoding text as a sequence of token indices
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "int" output_mode
text_vectorizer = preprocessing.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(data)
# You can retrieve the vocabulary we indexed via get_vocabulary()
vocab = text_vectorizer.get_vocabulary()
print("Vocabulary:", vocab)
# Create an Embedding + LSTM model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = layers.Embedding(input_dim=len(vocab), output_dim=64)(x)
outputs = layers.LSTM(1)(x)
model = keras.Model(inputs, outputs)
# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
Encoding text as a dense matrix of ngrams with multi-hot encoding
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "binary" output_mode (multi-hot)
# and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
# Index the bigrams via `adapt()`
text_vectorizer.adapt(data)
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)
# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
print("Model output:", test_output)
Encoding text as a dense matrix of ngrams with TF-IDF weighting
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf-idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="tf-idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(data)
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)
# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
print("Model output:", test_output)