Object detection

Detect multiple objects within an image, with bounding boxes. Recognize 80 different classes of objects.

Get started

If you are new to TensorFlow Lite and are working with Android or iOS, we recommend exploring the following example applications that can help you get started.

Android example iOS example

If you are using a platform other than Android or iOS, or you are already familiar with the TensorFlow Lite APIs, you can download our starter object detection model and the accompanying labels.

Download starter model and labels

For more information about the starter model, see Starter model.

What is object detection?

Given an image or a video stream, an object detection model can identify which of a known set of objects might be present and provide information about their positions within the image.

For example, this screenshot of our example application shows how two objects have been recognized and their positions annotated:

Screenshot of Android example

An object detection model is trained to detect the presence and location of multiple classes of objects. For example, a model might be trained with images that contain various pieces of fruit, along with a label that specifies the class of fruit they represent (e.g. an apple, a banana, or a strawberry), and data specifying where each object appears in the image.

When we subsequently provide an image to the model, it will output a list of the objects it detects, the location of a bounding box that contains each object, and a score that indicates the confidence that the detection was correct.

Model output

Imagine a model has been trained to detect apples, bananas, and strawberries. When we pass it an image, it will output a set number of detection results - in this example, 5.

Class Score Location
Apple 0.92 [18, 21, 57, 63]
Banana 0.88 [100, 30, 180, 150]
Strawberry 0.87 [7, 82, 89, 163]
Banana 0.23 [42, 66, 57, 83]
Apple 0.11 [6, 42, 31, 58]

Confidence score

To interpret these results, we can look at the score and the location for each detected object. The score is a number between 0 and 1 that indicates confidence that the object was genuinely detected. The closer the number is to 1, the more confident the model is.

Depending on your application, you can decide a cut-off threshold below which you will discard detection results. For our example, we might decide a sensible cut-off is a score of 0.5 (meaning a 50% probability that the detection is valid). In that case, we would ignore the last two objects in the array, because those confidence scores are below 0.5:

Class Score Location
Apple 0.92 [18, 21, 57, 63]
Banana 0.88 [100, 30, 180, 150]
Strawberry 0.87 [7, 82, 89, 163]
Banana 0.23 [42, 66, 57, 83] (below cut-off, discarded)
Apple 0.11 [6, 42, 31, 58] (below cut-off, discarded)
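
As an illustration, here is a minimal Python sketch of applying such a cut-off. The `detections` list and the 0.5 threshold are simply the example values from the tables above, not output from any particular API:

```python
# Example detection results in the form shown above:
# (class name, confidence score, [top, left, bottom, right]).
detections = [
    ("Apple", 0.92, [18, 21, 57, 63]),
    ("Banana", 0.88, [100, 30, 180, 150]),
    ("Strawberry", 0.87, [7, 82, 89, 163]),
    ("Banana", 0.23, [42, 66, 57, 83]),
    ("Apple", 0.11, [6, 42, 31, 58]),
]

SCORE_THRESHOLD = 0.5  # Cut-off chosen for this example.

# Keep only the detections whose confidence meets the cut-off.
kept = [(name, score, box) for name, score, box in detections
        if score >= SCORE_THRESHOLD]

for name, score, box in kept:
    print(f"{name}: {score:.2f} at {box}")
# Prints the Apple, Banana, and Strawberry rows; the two low-score
# detections are discarded.
```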

The cut-off you use should be based on whether you are more comfortable with false positives (objects that are wrongly identified, or areas of the image that are erroneously identified as objects when they are not), or false negatives (genuine objects that are missed because their confidence was low).

For example, in the following image, a pear (which is not an object that the model was trained to detect) was misidentified as a "person". This is an example of a false positive that could be ignored by selecting an appropriate cut-off. In this case, a cut-off of 0.6 (or 60%) would comfortably exclude the false positive.

Screenshot of Android example showing a false positive

Location

For each detected object, the model will return an array of four numbers representing a bounding rectangle that surrounds its position. For the starter model we provide, the numbers are ordered as follows:

[ top, left, bottom, right ]

The top value represents the distance of the rectangle's top edge from the top of the image, in pixels. The left value represents the left edge's distance from the left of the input image. The other values represent the bottom and right edges in a similar manner.

Note: Object detection models accept input images of a specific size. This is likely to be different from the size of the raw image captured by your device's camera, and you will have to write code to crop and scale your raw image to fit the model's input size (there are examples of this in our example applications).

The pixel values output by the model refer to the position in the cropped and scaled image, so you must scale them to fit the raw image in order to interpret them correctly.
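
As a sketch of that last step, and assuming the raw frame was simply resized (not cropped) to the model's 300x300 input, mapping a box back to raw-image coordinates could look like the following Python. The `scale_box` helper and the frame dimensions are illustrative, not part of the TensorFlow Lite API:

```python
MODEL_INPUT_SIZE = 300  # Width and height the raw frame was scaled to.

def scale_box(box, raw_width, raw_height):
    """Map [top, left, bottom, right] from model-input pixels to raw-image pixels."""
    top, left, bottom, right = box
    return [
        top * raw_height / MODEL_INPUT_SIZE,
        left * raw_width / MODEL_INPUT_SIZE,
        bottom * raw_height / MODEL_INPUT_SIZE,
        right * raw_width / MODEL_INPUT_SIZE,
    ]

# Example: a detection at [18, 21, 57, 63] on a 1280x720 camera frame.
print(scale_box([18, 21, 57, 63], raw_width=1280, raw_height=720))
```

If your preprocessing crops the frame as well as scaling it, you will also need to add the crop offsets back when mapping coordinates.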

Performance benchmarks

Performance benchmark numbers are generated with the tool described here.

Model Name Model size Device GPU CPU
COCO SSD MobileNet v1 27 Mb Pixel 3 (Android 10) 22ms 46ms*
COCO SSD MobileNet v1 27 Mb Pixel 4 (Android 10) 20ms 29ms*
COCO SSD MobileNet v1 27 Mb iPhone XS (iOS 12.4.1) 7.6ms 11ms**

* 4 threads used.

** 2 threads used on iPhone for the best performance result.
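
If you want to control the CPU thread count when running the model yourself from Python, newer TensorFlow releases accept a num_threads argument when constructing the interpreter (check that your TensorFlow version supports it); this is only a sketch, not the benchmark tool used for the numbers above:

```python
import tensorflow as tf

# Construct the interpreter with an explicit CPU thread count.
# "detect.tflite" is the starter model downloaded above.
interpreter = tf.lite.Interpreter(model_path="detect.tflite", num_threads=4)
interpreter.allocate_tensors()
```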

Starter model

We recommend starting with this pre-trained quantized COCO SSD MobileNet v1 model.

Download starter model and labels

Uses and limitations

The object detection model we provide can identify and locate up to 10 objects in an image. It is trained to recognize 80 classes of object. For a full list of classes, see the labels file in the model zip.

If you want to train a model to recognize new classes, see Customize model.

For the following use cases, you should use a different type of model:

  • Predicting which single label the image most likely represents (see image classification)
  • Predicting the composition of an image, for example subject versus background (see segmentation)

Input

The model takes an image as input. The expected image is 300x300 pixels, with three channels (red, blue, and green) per pixel. This should be fed to the model as a flattened buffer of 270,000 byte values (300x300x3). Since the model is quantized, each value should be a single byte representing a value between 0 and 255.
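
For example, a minimal sketch of preparing an input image in Python, using Pillow and NumPy (both are choices made for this sketch, not requirements of TensorFlow Lite):

```python
import numpy as np
from PIL import Image

# Load a photo and scale it to the 300x300 size the model expects.
image = Image.open("example.jpg").convert("RGB")
image = image.resize((300, 300))

# Quantized model: one uint8 value (0-255) per channel, so the input is a
# [1, 300, 300, 3] tensor, i.e. a flattened buffer of 270,000 bytes.
input_data = np.expand_dims(np.asarray(image, dtype=np.uint8), axis=0)
```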

Output

The model outputs four arrays, mapped to the indices 0-3. Arrays 0, 1, and 2 describe 10 detected objects, with one element in each array corresponding to each object. There will always be 10 objects detected.

Index Name Description
0 Locations Multidimensional array of [10][4] floating point values between 0 and 1, the inner arrays representing bounding boxes in the form [top, left, bottom, right]
1 Classes Array of 10 integers (output as floating point values) each indicating the index of a class label from the labels file
2 Scores Array of 10 floating point values between 0 and 1 representing the probability that a class was detected
3 Number of detections Array of length 1 containing a floating point value expressing the total number of detection results
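
Putting this together, a sketch of reading the four arrays with the TensorFlow Lite Python interpreter might look like the following, assuming the outputs appear in the order listed above and that the input has been prepared as described under Input:

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="detect.tflite")
interpreter.allocate_tensors()

# Placeholder input; in practice this is a preprocessed camera frame
# (a uint8 [1, 300, 300, 3] array).
input_data = np.zeros((1, 300, 300, 3), dtype=np.uint8)
interpreter.set_tensor(interpreter.get_input_details()[0]["index"], input_data)
interpreter.invoke()

outputs = interpreter.get_output_details()
boxes = interpreter.get_tensor(outputs[0]["index"])[0]    # [10, 4] boxes, values in 0-1
classes = interpreter.get_tensor(outputs[1]["index"])[0]  # 10 class indices (as floats)
scores = interpreter.get_tensor(outputs[2]["index"])[0]   # 10 confidence scores
count = int(interpreter.get_tensor(outputs[3]["index"])[0])  # number of detections

for i in range(count):
    if scores[i] >= 0.5:  # Apply a confidence cut-off, as described above.
        print(int(classes[i]), scores[i], boxes[i])
```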

Customize model

The pre-trained models we provide are trained to detect 80 classes of object. For a full list of classes, see the labels file in the model zip.

You can use a technique known as transfer learning to re-train a model to recognize classes not in the original set. For example, you could re-train the model to detect multiple types of vegetable, despite there only being one vegetable in the original training data. To do this, you will need a set of training images for each of the new labels you wish to train.

Learn how to perform transfer learning in Training and serving a real-time mobile object detector in 30 minutes.