Bounding Box Regression for Object Detection
In these notes we will learn how to train a custom deep learning model to perform object detection via bounding box regression with Keras and TensorFlow.
But this raises the following questions:
- What if we want to train an end-to-end object detector?
- Is it possible to construct a CNN architecture that outputs bounding box coordinates, so that we can actually train a model to make better object detection predictions?
- If so, how do we go about training such a model?
The key is bounding box regression.
What is bounding box regression?
[Figure: bounding_box_regression_procedure.jpg]
We are all likely familiar with the concept of image classification via deep neural networks. When performing image classification, we:
- Present an input image to the CNN
- Perform a forward pass through the CNN
- Obtain an output vector with N elements, where N is the total number of class labels
- Select the class label with the largest probability as our final predicted class label
Fundamentally, we can view image classification as predicting a class label.
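As a minimal sketch of that final argmax step (the probabilities here are hypothetical):

import numpy as np

# hypothetical softmax output of a CNN for N = 4 class labels
probs = np.array([0.05, 0.10, 0.70, 0.15])
# the predicted class label is the index with the highest probability
pred_label = int(np.argmax(probs))  # -> 2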
Unfortunately, this type of model does not translate directly to object detection: it is impossible to construct a class label for every possible combination of bounding box (x, y)-coordinates in an input image.
Instead, we need to rely on a different type of machine learning model: regression. Unlike classification, which produces a discrete label, regression enables us to predict continuous values.
Typically, regression models are applied to problems such as:
- Predicting the price of a house
- Predicting the stock market
- Determining the rate at which a disease spreads through a population
The point here is that a regression model's output is not restricted to discrete "bins" the way a classification model's output is (remember, a classification model can only output a class label, nothing more).
Instead, a regression model can output any real value within a specific range.
Typically, we scale the output range of values to [0, 1] during training and then scale the outputs back after prediction (if needed).
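As a quick sketch of that scaling step (the image size and box coordinates here are made up):

# hypothetical 640x480 image with a box at pixel coordinates
# (startX, startY, endX, endY) = (96, 120, 480, 360)
w, h = 640, 480
box = (96, 120, 480, 360)
# normalize to [0, 1] for training
scaled = (box[0] / w, box[1] / h, box[2] / w, box[3] / h)
print(scaled)  # (0.15, 0.25, 0.75, 0.75)
# after prediction, multiply by w and h again to recover pixel coordinates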
To perform bounding box regression for object detection, all we need to do is adjust our network architecture:
- At the head of the network, place a fully-connected layer with four neurons, corresponding to the top-left and bottom-right (x, y)-coordinates.
- Apply a sigmoid activation function to that four-neuron layer so its outputs fall in the range [0, 1].
- Train the model with a loss function such as mean squared error or mean absolute error on training data consisting of (1) the input images and (2) the bounding boxes of the objects in those images.
After training, we can present an input image to the bounding box regression network. The network then performs a forward pass and predicts the output bounding box coordinates of the object.
1. A custom deep learning model
We take the classic VGG16 model as our base and modify it into a model capable of bounding box regression.
# import the necessary packages
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# initial learning rate for the Adam optimizer (value assumed here;
# 1e-4 is a common choice when fine-tuning)
INIT_LR = 1e-4

# load the VGG16 network, ensuring the head FC layers are left off
vgg = VGG16(weights="imagenet", include_top=False,
    input_tensor=Input(shape=(224, 224, 3)))
# freeze all VGG layers so they will *not* be updated during the
# training process
vgg.trainable = False
# flatten the max-pooling output of VGG
flatten = vgg.output
flatten = Flatten()(flatten)
# construct a fully-connected layer header to output the predicted
# bounding box coordinates
bboxHead = Dense(128, activation="relu")(flatten)
bboxHead = Dense(64, activation="relu")(bboxHead)
bboxHead = Dense(32, activation="relu")(bboxHead)
bboxHead = Dense(4, activation="sigmoid")(bboxHead)
# construct the model we will fine-tune for bounding box regression
model = Model(inputs=vgg.input, outputs=bboxHead)
# initialize the optimizer, compile the model, and show the model
# summary
opt = Adam(lr=INIT_LR)
model.compile(loss="mse", optimizer=opt)
print(model.summary())
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) [(None, 224, 224, 3)] 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten_2 (Flatten) (None, 25088) 0
_________________________________________________________________
dense_8 (Dense) (None, 128) 3211392
_________________________________________________________________
dense_9 (Dense) (None, 64) 8256
_________________________________________________________________
dense_10 (Dense) (None, 32) 2080
_________________________________________________________________
dense_11 (Dense) (None, 4) 132
=================================================================
Total params: 17,936,548
Trainable params: 3,221,860
Non-trainable params: 14,714,688
_________________________________________________________________
None
(The trailing None is simply the return value of model.summary() being printed. Note that the 14,714,688 non-trainable parameters are the frozen VGG16 backbone, while only the 3,221,860 parameters of the new fully-connected head are updated during training.)
2. Loading the dataset
We use the CALTECH-101 dataset, restricted this time to the airplanes category.
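Based on the paths used in the code below, the dataset is assumed to be laid out as follows:

dataset/
    images/           <- airplane images from CALTECH-101
        image_0001.jpg
        ...
    airplanes.csv     <- bounding box annotations, one row per image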
[Figure: bound_box_regression_airplane_dataset.jpg]

# import the necessary packages
from tensorflow.keras.preprocessing.image import img_to_array, load_img
from sklearn.model_selection import train_test_split
import numpy as np
import cv2
import os

# define the base path to the input dataset and then use it to derive
# the path to the images directory and annotation CSV file
BASE_PATH = "dataset"
IMAGES_PATH = os.path.sep.join([BASE_PATH, "images"])
ANNOTS_PATH = os.path.sep.join([BASE_PATH, "airplanes.csv"])
# load the contents of the CSV annotations file
rows = open(ANNOTS_PATH).read().strip().split("\n")
# initialize the lists of data (images), targets (bounding box
# coordinates), and filenames
data = []
targets = []
filenames = []
# loop over the rows
for row in rows:
    # break the row into the filename and bounding box coordinates
    row = row.split(",")
    (filename, startX, startY, endX, endY) = row
    # derive the path to the input image, load the image (in OpenCV
    # format), and grab its dimensions
    imagePath = os.path.sep.join([IMAGES_PATH, filename])
    image = cv2.imread(imagePath)
    (h, w) = image.shape[:2]
    # scale the bounding box coordinates relative to the spatial
    # dimensions of the input image
    startX = float(startX) / w
    startY = float(startY) / h
    endX = float(endX) / w
    endY = float(endY) / h
    # load the image again in Keras format and preprocess it
    image = load_img(imagePath, target_size=(224, 224))
    image = img_to_array(image)
    # update our lists of data, targets, and filenames
    data.append(image)
    targets.append((startX, startY, endX, endY))
    filenames.append(filename)
# convert the data and targets to NumPy arrays, scaling the input
# pixel intensities from the range [0, 255] to [0, 1]
data = np.array(data, dtype="float32") / 255.0
targets = np.array(targets, dtype="float32")
# partition the data into training and testing splits using 90% of
# the data for training and the remaining 10% for testing
split = train_test_split(data, targets, filenames, test_size=0.10,
    random_state=42)
# unpack the data split
(trainImages, testImages) = split[:2]
(trainTargets, testTargets) = split[2:4]
(trainFilenames, testFilenames) = split[4:]
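For reference, each row of airplanes.csv is a plain comma-separated line: the image filename followed by the top-left and bottom-right box corners in pixels. A (hypothetical) row looks like:

image_0001.jpg,49,30,349,137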
3. Training the model
# define the path to the base output directory
BASE_OUTPUT = "output"
# define the path to the output serialized model, model training plot,
# and testing image filenames
MODEL_PATH = os.path.sep.join([BASE_OUTPUT, "detector.h5"])
PLOT_PATH = os.path.sep.join([BASE_OUTPUT, "plot.png"])
TEST_FILENAMES = os.path.sep.join([BASE_OUTPUT, "test_images.txt"])
# training hyperparameters
NUM_EPOCHS = 25
BATCH_SIZE = 32
# write the testing image filenames to disk so the prediction step
# can find them later
with open(TEST_FILENAMES, "w") as f:
    f.write("\n".join(testFilenames))
# train the network for bounding box regression
print("[INFO] training bounding box regressor...")
H = model.fit(
    trainImages, trainTargets,
    validation_data=(testImages, testTargets),
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS,
    verbose=1)
# serialize the model to disk
print("[INFO] saving object detector model...")
model.save(MODEL_PATH, save_format="h5")
[INFO] training bounding box regressor...
Train on 720 samples, validate on 80 samples
Epoch 1/25
720/720 [==============================] - 386s 537ms/sample - loss: 0.0178 - val_loss: 0.0055
...
720/720 [==============================] - 292s 406ms/sample - loss: 3.2873e-04 - val_loss: 4.9465e-04
Epoch 11/25
720/720 [==============================] - 289s 401ms/sample - loss: 2.9336e-04 - val_loss: 4.7190e-04
...
Epoch 20/25
720/720 [==============================] - 286s 397ms/sample - loss: 1.3219e-04 - val_loss: 4.4026e-04
Epoch 21/25
720/720 [==============================] - 287s 399ms/sample - loss: 1.2311e-04 - val_loss: 4.3631e-04
...
Epoch 25/25
720/720 [==============================] - 284s 394ms/sample - loss: 9.5651e-05 - val_loss: 4.3679e-04
[INFO] saving object detector model...
Training converges fairly quickly: on my old MacBook Air each epoch takes roughly 5 minutes, so the full run finishes in a little over 2 hours.
Visualizing the training process:
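The plotting code is not shown above; a minimal sketch using matplotlib and the History object H returned by model.fit could look like this (the title and styling are my own choices):

# plot the training and validation loss curves
import matplotlib.pyplot as plt
import numpy as np

plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, NUM_EPOCHS), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, NUM_EPOCHS), H.history["val_loss"], label="val_loss")
plt.title("Bounding Box Regression Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
plt.savefig(PLOT_PATH)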
4. Making predictions with the trained model
Once training is complete, the model is saved to an .h5 file. The prediction code loads that file and can then open an image and predict its bounding box.
Input test image:
[Figure: image_0001.jpg]
Prediction code:
# import the necessary packages
from tensorflow.keras.preprocessing.image import img_to_array, load_img
from tensorflow.keras.models import load_model
import numpy as np
import imutils
import cv2

# load the trained bounding box regressor from disk
model = load_model("output/detector.h5")

# load the input image (in Keras format), scale its pixel intensities
# to [0, 1], and add a batch dimension
imagePath = './dataset/images/image_0001.jpg'
image = load_img(imagePath, target_size=(224, 224))
image = img_to_array(image) / 255.0
image = np.expand_dims(image, axis=0)
# make bounding box predictions on the input image
preds = model.predict(image)[0]
(startX, startY, endX, endY) = preds
print(startX, startY, endX, endY)
# load the input image (in OpenCV format), resize it such that it
# fits on our screen, and grab its dimensions
image = cv2.imread(imagePath)
image = imutils.resize(image, width=600)
(h, w) = image.shape[:2]
# scale the predicted bounding box coordinates based on the image
# dimensions
startX = int(startX * w)
startY = int(startY * h)
endX = int(endX * w)
endY = int(endY * h)
# draw the predicted bounding box on the image
cv2.rectangle(image, (startX, startY), (endX, endY), (0, 255, 0), 2)
# save the output image to disk
cv2.imwrite("Output.jpg", image)
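Here the result is written to disk with cv2.imwrite; if you are running locally with a display available, you could show it in a window instead:

cv2.imshow("Output", image)
cv2.waitKey(0)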
Predicted output image:
[Figure: Output.jpg]