These days, machine learning and computer vision are all the rage. We’ve all seen the news about self-driving cars and facial recognition and probably imagined how cool it’d be to build our own computer vision models. However, it’s not always easy to break into the field, especially without a strong math background. Libraries like PyTorch and TensorFlow can be tedious to learn if all you want to do is experiment with something small.

In this tutorial, I present a simple way for anyone to build fully functional object detection models with just a few lines of code. More specifically, we’ll be using Detecto, a Python package built on top of PyTorch that makes the process easy and open to programmers at all levels.

Quick and easy example

To demonstrate how simple it is to use Detecto, let’s load in a pre-trained model and run inference on the following image:

First, download the Detecto package using pip:

pip3 install detecto

Then, save the image above as “fruit.jpg” and create a Python file in the same folder as the image. Inside the Python file, write these 5 lines of code:

from detecto import core, utils, visualize

image = utils.read_image('fruit.jpg')
model = core.Model()

labels, boxes, scores = model.predict_top(image)
visualize.show_labeled_image(image, boxes, labels)

After running this file (it may take a few seconds if you don’t have a CUDA-enabled GPU on your computer; more on that later), you should see something similar to the plot below:

Awesome! We did all that with just 5 lines of code. Here’s what we did in each:

Imported Detecto’s modules
Read in an image
Initialized a pre-trained model
Generated the top predictions on our image
Plotted our predictions

Detecto uses a Faster R-CNN ResNet-50 FPN from PyTorch’s model zoo, which is able to detect about 80 different objects such as animals, vehicles, kitchen appliances, etc. However, what if you wanted to detect custom objects, like Coke vs. Pepsi cans, or zebras vs. giraffes?

You’ll be glad to know that training a Detecto model on a custom dataset is just as easy; again, all you need is 5 lines of code, as well as either an existing dataset or some time spent labeling images.

Building a custom dataset

In this tutorial, we’ll start from scratch by building our own dataset. I recommend that you do the same, but if you want to skip this step, you can download a sample dataset here (modified from Stanford’s Dog Dataset).

For our dataset, we’ll be training our model to detect an underwater alien, bat, and witch from the RoboSub competition, as shown below:

Ideally, you’ll want at least 100 images of each class. The good thing is that you can have multiple objects in each image, so you could theoretically get away with 100 total images if each image contains every class of object you want to detect. Also, if you have video footage, Detecto makes it easy to split that footage into images that you can then use for your dataset:

from detecto.utils import split_video

split_video('video.mp4', 'frames/', step_size=4)

The code above takes every 4th frame in “video.mp4” and saves it as a JPEG file in the “frames” folder.
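
To get a feel for how many images a clip will yield before you run it, here is a rough sketch. It assumes that counting starts with the first frame of the video, and frames_kept is a hypothetical helper of my own, not part of Detecto. The 60-second, 30 fps clip is also just an illustration.

```python
import math


def frames_kept(total_frames: int, step_size: int) -> int:
    """Number of frames saved when keeping every `step_size`-th frame,
    assuming the first frame of the video is always kept."""
    if step_size < 1:
        raise ValueError("step_size must be at least 1")
    return math.ceil(total_frames / step_size)


# Hypothetical clip: 60 seconds recorded at 30 frames per second.
total = 60 * 30
print(frames_kept(total, step_size=4))  # 450
```

If that number is far below the roughly 100 images per class suggested above, a smaller step_size or more footage may help.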

Once you’ve produced your training dataset, you should have a folder that looks something like the following:

images/
|   image0.jpg
|   image1.jpg
|   image2.jpg
|   ...

If you want, you can also have a second folder containing a set of validation images.
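
One possible way to create that second folder is to move a random portion of your images into it. The sketch below uses only the standard library. The helper name move_validation_split, the 20% fraction, and the fixed seed are my own assumptions rather than anything Detecto prescribes. It also moves a same-named XML file if one exists, so it can be run either before or after labeling.

```python
import random
import shutil
from pathlib import Path


def move_validation_split(image_dir, val_dir, fraction=0.2, seed=0):
    """Move a random fraction of the JPEG files in image_dir (plus any
    same-named XML label files) into val_dir. Returns the moved file names."""
    image_dir, val_dir = Path(image_dir), Path(val_dir)
    val_dir.mkdir(parents=True, exist_ok=True)

    images = sorted(image_dir.glob("*.jpg"))
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    chosen = rng.sample(images, k=int(len(images) * fraction))

    for image_path in chosen:
        label_path = image_path.with_suffix(".xml")
        shutil.move(str(image_path), str(val_dir / image_path.name))
        if label_path.exists():
            shutil.move(str(label_path), str(val_dir / label_path.name))
    return [p.name for p in chosen]
```

Calling move_validation_split('images/', 'validation_images/') would leave about 80% of the files in “images” and put the rest in “validation_images”.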

Now comes the time-consuming part: labeling. Detecto supports the PASCAL VOC format, in which you have XML files containing label and position data for each object in your images. To create these XML files, you can use the open-source LabelImg tool as follows:

pip3 install labelImg    # Download LabelImg using pip
labelImg                 # Launch the application

You should now see a window pop up. On the left, click the “Open Dir” button and select the folder of images that you want to label. If things worked correctly, you should see something like this:

To draw a bounding box, click the icon in the left menu bar (or use the keyboard shortcut “w”). You can then drag a box around your objects and write/select a label:

When you’ve finished labeling an image, use CTRL+S or CMD+S to save your XML file (for simplicity and speed, you can just use the default file location and name that they auto-fill). To label the next image, click “Next Image” (or use the keyboard shortcut “d”).

Once you’re done with the entire dataset, your folder should look something like this:

images/
|   image0.jpg
|   image0.xml
|   image1.jpg
|   image1.xml
|   ...

We’re almost ready to start training our object detection model!
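
Before moving on, it can be worth a quick sanity check of the labels. The sketch below reads the PASCAL VOC files with Python’s standard library, counts the labeled objects per class, and lists any JPEG that has no XML file next to it. It assumes the usual VOC layout that LabelImg writes (object elements containing a name and a bndbox with xmin, ymin, xmax, ymax). The helper names read_voc_objects and summarize_labels are hypothetical, not part of Detecto.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def read_voc_objects(xml_path):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) from one VOC file."""
    root = ET.parse(str(xml_path)).getroot()
    objects = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        box = obj.find("bndbox")
        coords = tuple(
            int(float(box.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append((label, coords))
    return objects


def summarize_labels(folder):
    """Count labeled objects per class and list JPEGs without an XML file."""
    folder = Path(folder)
    counts = Counter()
    for xml_path in sorted(folder.glob("*.xml")):
        counts.update(label for label, _ in read_voc_objects(xml_path))
    unlabeled = [
        p.name
        for p in sorted(folder.glob("*.jpg"))
        if not p.with_suffix(".xml").exists()
    ]
    return counts, unlabeled
```

Running summarize_labels('images/') should give you a per-class count to compare against the 100-images-per-class guideline, plus the names of any images you forgot to label.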

Getting access to a GPU

First, check whether your computer has a CUDA-enabled GPU. Since deep learning uses a lot of processing power, training on a typical CPU can be very slow. Thankfully, most modern deep learning frameworks like PyTorch and TensorFlow can run on GPUs, making things much faster. Make sure you have PyTorch installed (you should already have it if you installed Detecto), and then run the following 2 lines of code:

import torch

print(torch.cuda.is_available())

If it prints True, great! You can skip to the next section. If it prints False, don’t fret. Follow the steps below to create a Google Colaboratory notebook, an online coding environment that comes with a free, usable GPU. For this tutorial, you’ll just be working from within a Google Drive folder rather than on your computer.

1. Log in to Google Drive

2. Create a folder called “Detecto Tutorial” and navigate into this folder

3. Upload your training images (and/or validation images) to this folder

4. Right-click, go to “More”, and click “Google Colaboratory”:

You should now see an interface like this:

5. Give your notebook a name if you want, and then go to Edit -> Notebook settings -> Hardware accelerator and select GPU

6. Type the following code to “mount” your Drive, change directory to the current folder, and install Detecto:

import os
from google.colab import drive

drive.mount('/content/drive')

os.chdir('/content/drive/My Drive/Detecto Tutorial')

!pip install detecto

To make sure everything worked, you can create a new code cell and type

!ls

to check that you’re in the right directory.

Train a custom model

Finally, we can train a model on our custom dataset! As promised, this is the easy part. All it takes is 4 lines of code:

from detecto import core, utils, visualize

dataset = core.Dataset('images/')
model = core.Model(['alien', 'bat', 'witch'])

model.fit(dataset)

Let’s again break down what we’ve done with each line of code:

Imported Detecto’s modules
Created a Dataset from the “images” folder (containing our JPEG and XML files)
Initialized a model to detect our custom objects (alien, bat, and witch)
Trained our model on the dataset

This can take anywhere from 10 minutes to over an hour to run depending on the size of your dataset, so make sure your program doesn’t exit immediately after finishing the above statements (for example, by using a Jupyter/Colab notebook that preserves state while active).

Using the trained model

Now that you have a trained model, let’s test it on some images. To read images from a file path, you can use the read_image function from the detecto.utils module (you could also use an image from the Dataset you created above):

# Specify the path to your image
image = utils.read_image('images/image0.jpg')
predictions = model.predict(image)

# predictions format: (labels, boxes, scores)
labels, boxes, scores = predictions

# ['alien', 'bat', 'bat']
print(labels) 

#           xmin       ymin       xmax       ymax
# tensor([[ 569.2125,  203.6702, 1003.4383,  658.1044],
#         [ 276.2478,  144.0074,  579.6044,  508.7444],
#         [ 277.2929,  162.6719,  627.9399,  511.9841]])
print(boxes)

# tensor([0.9952, 0.9837, 0.5153])
print(scores)

As you can see, the model’s predict method returns a tuple of 3 elements: labels, boxes, and scores. In the above example, the model predicted an alien (labels[0]) at the coordinates [569, 204, 1003, 658] (boxes[0]) with a confidence level of 0.995 (scores[0]).
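
Notice that the third prediction in that output, a second bat with a score of only 0.5153, mostly overlaps the first bat. One simple way to drop low-confidence predictions like that is to filter on the score. The sketch below uses plain Python lists filled with the values printed above so it runs on its own. The helper name filter_predictions and the 0.6 cutoff are my own assumptions.

```python
def filter_predictions(labels, boxes, scores, threshold=0.6):
    """Keep only the (label, box, score) triples whose score is at least
    `threshold`. float() lets this accept plain numbers or 0-d tensors."""
    return [
        (label, box, score)
        for label, box, score in zip(labels, boxes, scores)
        if float(score) >= threshold
    ]


# Values copied from the example output above, as plain Python lists.
labels = ['alien', 'bat', 'bat']
boxes = [
    [569.2125, 203.6702, 1003.4383, 658.1044],
    [276.2478, 144.0074, 579.6044, 508.7444],
    [277.2929, 162.6719, 627.9399, 511.9841],
]
scores = [0.9952, 0.9837, 0.5153]

for label, box, score in filter_predictions(labels, boxes, scores):
    print(label, score)
```

With the real tensors returned by predict, the kept boxes come back as separate rows, so they would likely need to be recombined (for example with torch.stack) before being passed to a plotting function.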

From these predictions, we can plot the results using the detecto.visualize module. For example:

visualize.show_labeled_image(image, boxes, labels)

Running the above code with the image and predictions you received should produce something that looks like this:

If you have a video, you can run object detection on it:

visualize.detect_video(model, 'input.mp4', 'output.avi')

This takes in a video file called “input.mp4” and produces an “output.avi” file with the given model’s predictions. If you open this file with VLC or some other video player, you should see some promising results!

Lastly, you can save and load models from files, allowing you to save your progress and come back to it later:

model.save('model_weights.pth')

# ... Later ...

model = core.Model.load('model_weights.pth', ['alien', 'bat', 'witch'])

Advanced usage

You’ll be happy to know that Detecto isn’t just limited to 5 lines of code. Let’s say for example that the model didn’t do as well as you hoped. We can try to increase its performance by augmenting our dataset with torchvision transforms and defining a custom DataLoader:

from torchvision import transforms

augmentations = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomHorizontalFlip(0.5),
    transforms.ColorJitter(saturation=0.5),
    transforms.ToTensor(),
    utils.normalize_transform(),
])

dataset = core.Dataset('images/', transform=augmentations)

loader = core.DataLoader(dataset, batch_size=2, shuffle=True)

This code applies random horizontal flips and saturation effects on images in our dataset, increasing the diversity of our data. We then define a DataLoader object with batch_size=2; we’ll pass this to model.fit instead of the Dataset to tell our model to train on batches of 2 images rather than the default of 1.

If you created a separate validation dataset earlier, now is the time to load it in during training. If you provide a validation dataset, the fit method returns a list of the losses at each epoch, and if verbose=True, it also prints these out during the training process itself. The following code block demonstrates this and customizes several other training parameters:

import matplotlib.pyplot as plt

val_dataset = core.Dataset('validation_images/')

losses = model.fit(loader, val_dataset, epochs=10, learning_rate=0.001, 
                   lr_step_size=5, verbose=True)
                   
plt.plot(losses)
plt.show()

The resulting plot of the losses should be more or less decreasing:
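
If you’d rather check the numbers than eyeball a plot, here is a small sketch that works on the list of per-epoch losses. The helper names best_epoch and count_increases are hypothetical, and the loss values are made up purely for illustration.

```python
def best_epoch(losses):
    """Return (epoch_number, loss) for the lowest loss, counting from 1."""
    if not losses:
        raise ValueError("losses is empty")
    index = min(range(len(losses)), key=lambda i: losses[i])
    return index + 1, losses[index]


def count_increases(losses):
    """Count the epochs that ended with a higher loss than the one before."""
    return sum(1 for prev, cur in zip(losses, losses[1:]) if cur > prev)


# Hypothetical values, not real training output.
example_losses = [0.52, 0.41, 0.36, 0.37, 0.31]
epoch, loss = best_epoch(example_losses)
print(f"lowest loss {loss} at epoch {epoch}")
print(f"{count_increases(example_losses)} epoch(s) where the loss went up")
```

A few small increases are normal. If the loss goes up for many epochs in a row, that may be a sign to lower the learning rate or stop training earlier.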

For even more flexibility and control over your model, you can bypass Detecto altogether; the model.get_internal_model method returns the underlying torchvision model used, which you can mess around with as much as you see fit.

Conclusion

In this tutorial, we showed that computer vision and object detection don’t need to be challenging. All you need is a bit of time and patience to come up with a labeled dataset.

If you’re interested in further exploration, check out Detecto on GitHub or visit the documentation for more tutorials and use cases!

Previously published at https://medium.com/@alankbi/build-a-custom-trained-object-detection-model-with-5-lines-of-code-713ba7f6c0fb

Build a Custom-Trained Object Detection Model With 5 Lines of Code