This is the third article in the Your Guide with MLSQL Stack series. We hope the series shows you how MLSQL Stack helps people get AI work done.
As we saw in the previous posts, MLSQL Stack lets you use both the built-in algorithms and Python ML frameworks. Support for Python ML frameworks means you are free to use deep learning tools such as PyTorch and TensorFlow. This time, however, we will show you how to use BigDL, the built-in deep learning framework, to accomplish an image classification task.
Requirements
This guide requires MLSQL Stack 1.3.0-SNAPSHOT. You can set up MLSQL Stack with the following links. We recommend deploying MLSQL Stack locally.
If you run into any problem when deploying, please let me know, and feel free to open an issue at this link.
Project Structure
I have created a project named store1. It has a directory called image_classify that contains all the MLSQL scripts we discuss today. It looks like this:
[Image: project structure]
We will teach you how to build the project step by step.
Upload Image
First, download the CIFAR-10 raw images from https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz, then gunzip the archive and make sure the result is a tar file.
MLSQL Console supports directory uploading, but the huge number of files in this directory will crash the upload component in the web page. We hope to fix this issue in the future. For now, packaging the directory as a tar file works around the crash.
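The decompression step can also be scripted. The following is a minimal Python sketch, not part of MLSQL; the function name `gunzip_to_tar` and the local file names are assumptions for illustration. It strips the gzip layer from the downloaded archive and checks that what remains is a tar file.

```python
import gzip
import shutil
import tarfile


def gunzip_to_tar(tgz_path, tar_path):
    """Strip the gzip layer from a .tgz archive, leaving a plain .tar file."""
    with gzip.open(tgz_path, "rb") as src, open(tar_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Fail early if the output is not a valid tar archive.
    if not tarfile.is_tarfile(tar_path):
        raise ValueError(f"{tar_path} is not a tar archive")
    return tar_path


# Example with hypothetical local paths:
# gunzip_to_tar("cifar.tgz", "cifar.tar")
```

The resulting cifar.tar is the file to upload in the console.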
Then save the uploaded tar file to your home directory:
-- download cifar data from https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz
!fs -mkdir -p /tmp/cifar;
!saveUploadFileToHome /cifar.tar /tmp/cifar;
The console shows a real-time log indicating that the system is extracting the images.
This may take a while because there are almost 60000 pictures.
Set up some paths
We create a script named env.mlsql that holds the path-related variables:
set basePath="/tmp/cifar";
set labelMappingPath = "${basePath}/si";
set trainDataPath = "${basePath}/cifar_train_data";
set testDataPath = "${basePath}/cifar_test_data";
set modelPath = "${basePath}/bigdl";
The other scripts include this script to get all these paths.
Resize the pictures
We want to resize the images to 28*28. You can do this with the ET ImageLoaderExt. Here is how we use it:
include store1.`alg.image_classify.env.mlsql`;
-- {} or {number} is used as parameter holder.
set imageResize='''
run command as ImageLoaderExt.`/tmp/cifar/cifar/{}` where
and code="
def apply(params:Map[String,String]) = {
  Resize(28, 28) ->
    MatToTensor() -> ImageFrameToSample()
}
"
as {}
''';
-- train should be quoted because it's a keyword.
!imageResize "train" data;
!imageResize test testData;
In the code above, we need to resize both the train and test datasets. To avoid duplicating code, we wrap the resize logic in a command and then run that command on the train and test datasets separately.
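To make the placeholder rule concrete, here is a small Python model of the substitution as the script comment describes it: `{}` takes the next argument in order, and `{number}` takes the argument at that index. This is an illustration only. `expand_command` is a hypothetical helper, not MLSQL's implementation, and details such as quoting are ignored.

```python
import re


def expand_command(template, args):
    """Fill `{}` placeholders in order and `{n}` placeholders by index."""
    remaining = iter(args)

    def fill(match):
        index = match.group(1)
        if index == "":
            # Bare `{}`: consume the next positional argument.
            return next(remaining)
        # `{n}`: look the argument up by its position.
        return args[int(index)]

    return re.sub(r"\{(\d*)\}", fill, template)


print(expand_command("run command as ImageLoaderExt.`/tmp/cifar/cifar/{}` as {}",
                     ["train", "data"]))
# run command as ImageLoaderExt.`/tmp/cifar/cifar/train` as data
```

Indexed placeholders are useful when the same argument appears more than once in a command body, as in the numericLabel command later in this article.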
Extract label
The label is part of the file name. For example, when we see the following path, we know that the picture contains a frog, so we should extract frog from the path.
/tmp/cifar/cifar/train/38189_frog.png
Again, we wrap the SQL as a command and process the train and test data separately.
set extractLabel='''
-- convert image path to number label
select split(split(imageName,"_")[1],"\\.")[0] as labelStr,features from {} as {}
''';
!extractLabel data newdata;
!extractLabel testData newTestData;
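The nested split in the SQL can be read step by step. The Python sketch below mirrors it on a single file name; `extract_label` is a hypothetical helper for illustration, and it assumes, like the SQL, that the only underscore in the name separates the image id from the label.

```python
def extract_label(image_name):
    """Take the text between the underscore and the file extension."""
    after_underscore = image_name.split("_")[1]  # e.g. "frog.png"
    return after_underscore.split(".")[0]        # e.g. "frog"


print(extract_label("38189_frog.png"))  # frog
```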
We then convert the label to a number and add 1, because BigDL needs labels to start from 1 instead of 0.
set numericLabel='''
train {0} as StringIndex.`/tmp/cifar/si` where inputCol="labelStr" and outputCol="labelIndex" as newdata1;
predict {0} as StringIndex.`/tmp/cifar/si` as newdata2;
select (cast(labelIndex as int) + 1) as label,features from newdata2 as {1}
''';
!numericLabel newdata trainData;
!numericLabel newTestData testData;
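The effect of the two steps, indexing the strings and then shifting by one, can be pictured with the Python sketch below. It assumes indices are assigned by descending label frequency, which is the default of Spark's StringIndexer and which we assume StringIndex follows; ties are broken alphabetically here to keep the example deterministic. The helper names are hypothetical.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each distinct label to a 0-based index, most frequent first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: position for position, name in enumerate(ordered)}


def to_bigdl_label(label_str, mapping):
    """Shift the 0-based index by one so labels start from 1."""
    return mapping[label_str] + 1


mapping = fit_label_index(["frog", "cat", "frog", "ship"])
print(to_bigdl_label("frog", mapping))  # 1
```

With 10 classes, the shifted labels run from 1 to 10.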
Save what we have so far
We save the processed data so we can reuse it later without running the steps above again:
save overwrite trainData as parquet.`${trainDataPath}`;
save overwrite testData as parquet.`${testDataPath}`;
Train the images with DL
We create a new script file named classify_train.mlsql. First we load the data and convert the label to an array:
include store1.`alg.image_classify.env.mlsql`;
load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;
select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;
Finally, we train the model with BigDLClassifyExt:
train trainData as BigDLClassifyExt.`${modelPath}` where
disableSparkLog = "true"
and fitParam.0.featureSize="[3,28,28]"
and fitParam.0.classNum="10"
and fitParam.0.maxEpoch="300"
-- print evaluate message
and fitParam.0.evaluate.trigger.everyEpoch="true"
and fitParam.0.evaluate.batchSize="1000"
and fitParam.0.evaluate.table="testData"
and fitParam.0.evaluate.methods="Loss,Top1Accuracy"
-- for unbalanced class
-- and fitParam.0.criterion.classWeight="[......]"
and fitParam.0.code='''
def apply(params:Map[String,String])={
  val model = Sequential()
  model.add(Reshape(Array(3, 28, 28), inputShape = Shape(28, 28, 3)))
  model.add(Convolution2D(6, 5, 5, activation = "tanh").setName("conv1_5x5"))
  model.add(MaxPooling2D())
  model.add(Convolution2D(12, 5, 5, activation = "tanh").setName("conv2_5x5"))
  model.add(MaxPooling2D())
  model.add(Flatten())
  model.add(Dense(100, activation = "tanh").setName("fc1"))
  model.add(Dense(params("classNum").toInt, activation = "softmax").setName("fc2"))
}
'''
;
In the code block, we use Keras-style code to build our model, and we give the system some information, e.g. the number of classes and the feature size.
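To see how the 28*28 feature size flows through the layers, the Python sketch below tracks the tensor shape. It assumes the usual Keras-style defaults: convolutions with stride 1 and no padding, and 2x2 max pooling with stride 2. These defaults are an assumption about the layers used above, so treat the numbers as an illustration.

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Output width of non-overlapping max pooling."""
    return size // pool


size, channels = 28, 3                  # input: 3 x 28 x 28
size, channels = conv_out(size, 5), 6   # conv1_5x5: 6 x 24 x 24
size = pool_out(size)                   # pooling:   6 x 12 x 12
size, channels = conv_out(size, 5), 12  # conv2_5x5: 12 x 8 x 8
size = pool_out(size)                   # pooling:   12 x 4 x 4
flattened = channels * size * size      # input width of fc1
print(flattened)  # 192
```

Under these assumptions, fc1 maps 192 values to 100, and fc2 maps 100 values to classNum outputs.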
If the training stage takes too long, you can decrease fitParam.0.maxEpoch to a smaller value.
The console prints messages during training and, finally, the validation result.
Use the model command to check the model's training history:
!model history /tmp/cifar/bigdl;
Here is the result:
[Image: model training history]
Register the model as a function
Now that we have built our model, let us see how to predict the class of an image.
First, we load some data:
include store1.`alg.image_classify.env.mlsql`;
load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;
select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;
Now we can register the model as a function:
register BigDLClassifyExt.`${modelPath}` as cifarPredict;
Finally, we can use the function to predict new pictures:
select
vec_argmax(cifarPredict(vec_dense(to_array_double(features)))) as predict_label,
label from testData limit 10
as output;
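The query takes the position of the largest value in the model's output vector. A minimal Python sketch of that argmax step follows. The probability values are made up for illustration, and how the returned position lines up with the 1-based label column is not stated here, so compare the two columns in your own output.

```python
def argmax(values):
    """Return the position of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


probabilities = [0.05, 0.10, 0.70, 0.15]  # hypothetical model output
print(argmax(probabilities))  # 2
```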
Of course, you can also predict a whole table:
predict testData as BigDLClassifyExt.`${modelPath}` as predictdata;
Why BigDL
GPUs are very expensive, and most companies already have lots of CPUs. If we can make full use of these CPUs, we will save a lot of money.