We will cover the following tasks in 1 hour and 24 minutes:
In this project, we are going to perform a common computer vision task called object detection: detecting one or more objects in a given input image, classifying the detected objects, and localizing them within the image. The algorithm we are going to use is the Single Shot Detector (SSD) and, as the name suggests, it makes both the class prediction and the bounding box prediction representing the localization in a single pass through the network. This is unlike some other object detection techniques, where multiple passes through a network may be required.
Importing Libraries and Data
SageMaker uses Anaconda, so when you launch your uploaded notebook you will have a number of different environments to choose from. Let's go with the Python 3.6 one. We aren't using any deep learning framework directly in this project since we are using Amazon's built-in algorithm.
All we have to do is run the first couple of code cells that you see here in your notebook. The first code cell imports the required libraries and the second one downloads and extracts the required data. We are using the popular PASCAL VOC 2012 dataset for this project.
In order to perform training, we will need to set up and authenticate the use of AWS services. We will need an execution role, a SageMaker session, and an S3 bucket to store both our dataset and the final trained model. The session object manages interactions with the Amazon SageMaker APIs and any other AWS services that the training job uses.
We are going to use the default SageMaker bucket. SageMaker packages its built-in algorithms as Docker images for various tasks. We are interested in object detection, so we will need to access that Docker image as well.
We are using the PASCAL VOC dataset from VOC 2012. This is a very popular dataset used for a variety of computer vision tasks, including object detection. For the sake of simplicity, we will pick one class and train an object detector to recognise and locate that class in a given image. I am going to choose the class dog, but feel free to use some other class from this list. There are three image sets for the dog class, so I will iterate through all three and find all possible images of dogs in this dataset.
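Iterating over the image sets can be sketched as follows. In the VOC devkit, the per-class lists live under ImageSets/Main (e.g. dog_train.txt), and each line holds an image id and a flag, where 1 means the class is present; the ids below are made up for illustration.

```python
# Collect image ids that contain the chosen class from a VOC image-set list.
# Each line is "<image_id> <flag>"; flag 1 means the class is present.

def positive_ids(imageset_text):
    """Return image ids whose flag is 1 (object of the class present)."""
    ids = []
    for line in imageset_text.strip().splitlines():
        image_id, flag = line.split()
        if flag == "1":
            ids.append(image_id)
    return ids

# Tiny inline sample in the VOC image-set format (made-up ids):
sample = """2008_000008 -1
2008_000090 1
2008_000123 0
2008_000200 1"""

print(positive_ids(sample))  # ['2008_000090', '2008_000200']
```

Running this over all three image sets and taking the union gives the full list of dog images.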
Create JSON Annotations Part 1
SageMaker algorithms expect the data to be in a certain format. For bounding box annotations, one recommended way is to create a JSON file for each image example. These JSON files contain information on the example image's size, name and class, and of course on the bounding boxes: specifically the height, width and location of each bounding box within the image. So, we will need to read the annotations in the PASCAL VOC data and create JSON files for all the training and validation examples based on them.
Create JSON Annotations Part 2
We continue to create our annotation files in JSON format by reading the annotations given in XML format. We iterate over the XML objects with the help of xml.etree.ElementTree and grab just the information that we are interested in, storing it in a dictionary. We will make sure that we have the same number of examples and JSON annotation files in our dataset.
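The conversion for a single example can be sketched like this. The XML below is a trimmed, made-up sample in the VOC annotation format, and the output follows the JSON layout the SageMaker object detection algorithm expects (file, image_size, annotations, categories).

```python
# Convert one PASCAL VOC XML annotation into a SageMaker-style JSON dict.
import json
import xml.etree.ElementTree as ET

voc_xml = """
<annotation>
  <filename>2008_000090.jpg</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>111</xmin><ymin>134</ymin><xmax>172</xmax><ymax>179</ymax></bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(voc_xml)
size = root.find("size")
annotation = {
    "file": root.findtext("filename"),
    "image_size": [{
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "depth": int(size.findtext("depth")),
    }],
    "annotations": [],
    "categories": [{"class_id": 0, "name": "dog"}],
}
for obj in root.iter("object"):
    box = obj.find("bndbox")
    xmin, ymin = int(box.findtext("xmin")), int(box.findtext("ymin"))
    xmax, ymax = int(box.findtext("xmax")), int(box.findtext("ymax"))
    annotation["annotations"].append({
        "class_id": 0,          # single class in this project: dog
        "left": xmin,
        "top": ymin,
        "width": xmax - xmin,   # VOC stores corners; SageMaker wants width
        "height": ymax - ymin,  # and height
    })

print(json.dumps(annotation, indent=2))
```

In the project, this runs in a loop that writes one such JSON file per training and validation image.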
Upload to S3
Let us now move our prepared dataset to the S3 bucket that we decided to use earlier in this notebook. Notice the directory structure that is used. We upload the dataset with a key prefix: S3 uses the prefix to create a directory structure for the bucket content that it displays in the S3 console. Once the model is trained, we will need to access it for deployment, so we will dump the model artifact in S3 as well. Let's set a location where it will be dumped; we will use the same S3 bucket.
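The upload step might look like the sketch below. The local paths and prefix are illustrative, and the calls need live AWS credentials, so they are wrapped in a function.

```python
# Sketch of uploading the four data channels and choosing the output
# location for the model artifact (SageMaker Python SDK assumed).

def upload_dataset(session, bucket, prefix="pascal-voc-dog"):
    """Upload the prepared channels; return their S3 URIs and the
    location where SageMaker should dump the trained model artifact."""
    s3_train = session.upload_data("data/train", bucket=bucket,
                                   key_prefix=prefix + "/train")
    s3_train_ann = session.upload_data("data/train_annotation", bucket=bucket,
                                       key_prefix=prefix + "/train_annotation")
    s3_val = session.upload_data("data/validation", bucket=bucket,
                                 key_prefix=prefix + "/validation")
    s3_val_ann = session.upload_data("data/validation_annotation", bucket=bucket,
                                     key_prefix=prefix + "/validation_annotation")
    s3_output = "s3://{}/{}/output".format(bucket, prefix)  # model artifact location
    return s3_train, s3_train_ann, s3_val, s3_val_ann, s3_output
```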
Now that we have uploaded the data to S3, we are ready to train a model with the SSD algorithm. First, we will create an estimator. This estimator will handle the end-to-end Amazon SageMaker training and deployment tasks. We will need a fast GPU instance for a training task like this, so we are going to use an ml.p3.2xlarge instance.
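Creating the estimator could be sketched as follows (SageMaker Python SDK v2 parameter names; the volume size and maximum run time are illustrative, and the call is wrapped in a function because it needs AWS credentials).

```python
# Sketch of the estimator for the built-in object detection algorithm.

def make_estimator(training_image, role, session, s3_output):
    import sagemaker
    return sagemaker.estimator.Estimator(
        training_image,                   # object detection Docker image URI
        role,
        instance_count=1,
        instance_type="ml.p3.2xlarge",    # fast GPU instance for training
        volume_size=50,                   # GB of storage (illustrative)
        max_run=3600,                     # cap the job at an hour (illustrative)
        input_mode="File",
        output_path=s3_output,            # where the model artifact is dumped
        sagemaker_session=session,
    )
```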
The SSD algorithm can use either a VGG network or a ResNet-50 as its base network. We are using the ml.p3.2xlarge instance, which can handle our chosen batch size at 512 by 512 resolution. If you want, you can use a higher batch size if you reduce the resolution of the images.
We set the number of epochs to 30. For 1000 training images, this should take just over 10 minutes. We are using Stochastic Gradient Descent for loss optimization with a momentum of 0.9. Let's use weight decay as well; it is commonly associated with RMSProp, but we can use it with SGD too. We will also set a number of other relevant hyperparameters for this task.
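Collected together, the hyperparameters discussed above might look like this dictionary, as it would be passed to estimator.set_hyperparameters(**hyperparameters). The values not stated in the text (batch size, learning rate, weight decay) are illustrative only.

```python
# Hyperparameters for the built-in SSD object detection algorithm.
hyperparameters = {
    "base_network": "resnet-50",   # or "vgg-16"
    "use_pretrained_model": 1,     # start from a pretrained base network
    "num_classes": 1,              # single class in this project: dog
    "epochs": 30,
    "image_shape": 512,            # 512 x 512 inputs
    "mini_batch_size": 16,         # illustrative; lower resolution allows more
    "learning_rate": 0.001,        # illustrative
    "optimizer": "sgd",
    "momentum": 0.9,
    "weight_decay": 0.0005,        # illustrative
    "num_training_samples": 1000,  # roughly the size of the dog subset
}
print(hyperparameters["optimizer"], hyperparameters["epochs"])
```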
Input Objects and Model Training
Now that the hyperparameters are set up, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create S3 input objects within SageMaker from our data channels. These objects are then put in a dictionary, which the algorithm uses to train. We are going to use just one instance for training, but for distributed training you can decide whether you want the data to be fully replicated across instances or not.
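Wiring the channels together and launching training might be sketched like this (SDK v2 names; the content type settings are an assumption, and the function form is used because the call needs AWS credentials).

```python
# Sketch of building the data channel dictionary and starting training.

def train(estimator, s3_train, s3_train_ann, s3_val, s3_val_ann):
    from sagemaker.inputs import TrainingInput

    common = dict(distribution="FullyReplicated",  # single instance, so replicate
                  content_type="image/jpeg",       # assumed content type
                  s3_data_type="S3Prefix")
    data_channels = {
        "train": TrainingInput(s3_train, **common),
        "validation": TrainingInput(s3_val, **common),
        "train_annotation": TrainingInput(s3_train_ann, **common),
        "validation_annotation": TrainingInput(s3_val_ann, **common),
    }
    estimator.fit(inputs=data_channels, logs=True)
```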
Training the algorithm involves a few steps. First, the instances that we requested while creating the Estimator are provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instance. Once this is done, the training job begins. The provisioning and data download will take some time.
To run inference on our trained model, let's deploy it to an endpoint backed by an EC2 instance. For inference we don't need a GPU, so we will use an ml.c5.xlarge instance in this example. Once the deployment is done, we can start using the deployed model to make inferences.
Now that the model is deployed, let’s download a random image from the internet and see how the dog in the image is localized by our model. We will need to convert the image to bytearray before we supply it to our endpoint.
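Deployment and inference could be sketched as below (SDK v2 names; both calls need a live endpoint and AWS credentials, so they are wrapped in functions). The response format shown in the comment is that of the built-in object detection algorithm.

```python
# Sketch of deploying the trained model and querying the endpoint.

def deploy(estimator):
    # A CPU instance is enough for inference.
    return estimator.deploy(initial_instance_count=1,
                            instance_type="ml.c5.xlarge")

def detect(predictor, image_path):
    import json
    with open(image_path, "rb") as f:
        payload = bytearray(f.read())    # endpoint expects raw image bytes
    response = predictor.predict(
        payload, initial_args={"ContentType": "image/jpeg"})
    # Each detection is [class_id, score, xmin, ymin, xmax, ymax],
    # with the box coordinates normalized to [0, 1].
    return json.loads(response)["prediction"]
```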
Final Results and Cleanup
There are many predictions for the test image. Typically, you'd want to look only at the predictions with high confidence scores. So, let's say we are only interested in predictions with a confidence score of 0.5 or higher. If you don't see any result, you may want to lower the score threshold a little and see if that gives you any bounding boxes. Please note that a running endpoint incurs costs, so as a clean-up step we should delete the endpoint.
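The thresholding step can be sketched with plain Python. The sample detections below are made up; the helper also converts the normalized box coordinates back to pixels for drawing.

```python
# Keep only confident detections and convert normalized boxes to pixels.

def filter_detections(detections, width, height, threshold=0.5):
    boxes = []
    for class_id, score, x0, y0, x1, y1 in detections:
        if score >= threshold:
            boxes.append((int(class_id), score,
                          int(x0 * width), int(y0 * height),
                          int(x1 * width), int(y1 * height)))
    return boxes

sample = [
    [0, 0.92, 0.10, 0.20, 0.60, 0.80],  # confident dog detection, kept
    [0, 0.12, 0.50, 0.50, 0.70, 0.90],  # low-confidence detection, dropped
]
print(filter_detections(sample, width=500, height=375))
# [(0, 0.92, 50, 75, 300, 300)]
```

Deleting the endpoint afterwards is a single call (predictor.delete_endpoint() in the SageMaker SDK).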
About the Host (Amit Yadav)
I am a machine learning engineer with a focus on computer vision and sequence modelling for automated signal processing using deep learning techniques. My previous experience includes leading chatbot development for a large corporation.