trackRai provides four apps to:

  1. generate a training dataset suitable for training a YOLO model (this tutorial);
  2. train a YOLO model using that dataset;
  3. use the trained model to track objects in a video;
  4. visualize the results of the tracking.

In this tutorial, we will discuss how to use the first app to automatically generate a dataset that you can then use to train a YOLO model. The idea is to use traditional computer vision techniques to isolate single instances of the objects that you would like to track, as well as the background over which these objects are moving. We then use these elements to create composite images containing random arrangements of an arbitrarily large number of the isolated objects. With this approach, we can automatically generate a large number of training images for YOLO without requiring any manual labelling.

Note: this approach only works well if the background of the video is stable, with, ideally, no camera movement and no lighting variations. Good tracking results are not guaranteed otherwise.


2.1 - Launch the preparation app

To launch the dataset preparation app, run the following in the R console:
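
A minimal sketch of that call, assuming the preparation app is exposed through an exported function of the package (the function name dataset() below is an assumption; check the trackRai reference for the exact name):

    # Load trackRai and launch the dataset preparation app.
    # NOTE: dataset() is an assumed function name, shown for illustration only.
    library(trackRai)
    dataset()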

This will open the app either in the viewer panel of RStudio or Positron, or in your default web browser. You can control where the app opens using the launch.browser parameter (see the documentation of shiny::runApp() for more information).


2.2 - Tab 1: video module

Once the app opens, you will be presented with the “Video” tab. Click on the Select video button and navigate to the video that you would like to use for preparing the YOLO dataset. Once the video is selected, it will appear in the left panel of the app.

Below the video is a slider that you can use to control the video stream. The slider has three handles:

  • the green handle allows you to navigate through the video to display a frame of your choice;
  • the two grey handles allow you to restrict the processing of the video to a range of frames in the video. This can be convenient when only a portion of the video is usable, for instance.

Once the video is loaded, the second tab of the app will become available and you can click on it to navigate there.


2.3 - Tab 2: background module

Once the video is loaded in memory, the first step of the process is to load or reconstruct the background over which the objects are moving.

If you already have an image of the background without the objects in it, you can load it using the Select existing background button. If not, you can attempt to reconstruct it by clicking on the Automatically estimate background button.

If you choose to reconstruct the background, you can select one of the reconstruction methods using the Background type dropdown menu:

  • “mean” uses the average value of each pixel as the background;
  • “median” uses the median value of each pixel as the background;
  • “min” uses the minimum value of each pixel as the background;
  • “max” uses the maximum value of each pixel as the background.

These statistics will be computed using several equally-spaced frames of the video. You can choose the number of frames used for this process with the Number of frames for estimating background slider. More frames tend to yield a better result at the expense of a longer computation time.
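
As a conceptual illustration (not the app's internal code), here is how such a per-pixel statistic could be computed in R from a set of frames already loaded as numeric arrays of identical dimensions (height x width x channels):

    # Conceptual sketch: estimate a background image by reducing a stack of
    # frames pixel by pixel. `frames` is assumed to be a list of numeric arrays
    # with identical dimensions (height x width x channels).
    estimate_background <- function(frames, method = c("median", "mean", "min", "max")) {
      method <- match.arg(method)
      stat <- switch(method,
                     median = stats::median,
                     mean   = mean,
                     min    = min,
                     max    = max)
      # Stack the frames along a fourth dimension: height x width x channels x n_frames.
      stack <- array(unlist(frames), dim = c(dim(frames[[1]]), length(frames)))
      # Reduce across frames to get one value per pixel and channel.
      apply(stack, c(1, 2, 3), stat)
    }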

If objects are still present in the reconstructed background image (for instance, because they did not move during the recording), you can attempt to remove them by clicking on the Select ghost for removal button. Once you have activated the ghost removal mode, you can draw a polygon around the object to remove (just click around the object to create the corners of the polygon). Once you are done drawing the polygon, hit the return key on your keyboard and the app will try to replace the object with its best estimate of the local background.

Finally, you can save the background picture for later use by clicking on the Save background file button.

Once a background image is created/loaded, the third tab of the app will become available and you can click on it to navigate there.

Note: the ghost removal mode is very basic and may not yield good results with complex backgrounds. Another option is to save the background file with the ghosts still in it and remove them with more advanced image editing software (for instance, Photoshop’s Remove tool can give much better results).


2.4 - Tab 3: mask module

The second step of the process is to load or create an optional mask. A mask can help restrict the area in which the app should look for objects and, therefore, make the final results more accurate. By default, the whole image is taken into account in the analysis.

If you already have a mask image, you can load it using the Select existing mask button. If not, you can create one as follows.

First, click on the Exclude all button to exclude the entirety of the image from the analysis. Then, click on either the Add polygon ROI or the Add ellipse ROI button, making sure that the Including tick box is selected.

If you clicked the Add polygon ROI button, you will be able to draw a polygon that encloses the region of the image you would like to include in the analysis (just click in the image to create the corners of the polygon). Once you are done drawing the polygon, hit the return key on your keyboard to set the region of interest.

If you clicked the Add ellipse ROI button, you will need to select five points that will define the periphery of the ellipse.

You can also define multiple regions of interest by incrementing the ROI id counter before creating a new region.

You can create more complex masks by selectively excluding parts of the image: tick the Excluding tick box instead when drawing an ROI.

Finally, you can save the mask picture for later use by clicking on the Save mask file button.

Once a mask image is created/loaded, the fourth tab of the app will become available and you can click on it to navigate there.


2.5 - Tab 4: segmentation module

The third step of the process is to determine a threshold that differentiates the objects from the background. In this tab, what you see is the difference between a given frame of the video and the background image loaded/created earlier.

By default, the app considers that the objects to detect appear darker than the background in the video frames. If the objects instead appear lighter than the background, tick the Lighter tick box; if they have parts that are darker than the background and parts that are lighter, tick the A bit of both tick box.

You can then manually select a difference threshold using the slider at the bottom of the tab, or let the app decide for you by clicking the Autothreshold button and selecting one of the automatic threshold detection methods in the dropdown menu to the right of the button.
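
Conceptually, the segmentation amounts to thresholding the frame/background difference. A minimal sketch of that logic in R (not the app's internal code), assuming frame and background are numeric arrays on the same intensity scale:

    # Conceptual sketch: flag pixels whose difference from the background
    # exceeds the chosen threshold, for the three contrast settings above.
    segment <- function(frame, background, threshold,
                        mode = c("darker", "lighter", "both")) {
      mode <- match.arg(mode)
      diff <- switch(mode,
                     darker  = background - frame,       # objects darker than the background
                     lighter = frame - background,       # objects lighter than the background
                     both    = abs(frame - background))  # a bit of both
      diff > threshold  # logical mask of candidate object pixels
    }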

Once a threshold has been set, the fifth tab of the app will become available and you can click on it to navigate there.


2.6 - Tab 5: identification module

The fourth step of the process is to isolate single instances of the objects in a subset of equally spaced frames in the video.

First, you will need to set a buffer around each of the detected objects. The buffer should be large enough so that each object is completely enclosed in its corresponding rectangle. Do not worry if several objects are enclosed in the same rectangle; what matters is that enough single instances are clearly separated from the others. You can navigate through the subset of frames to check whether the selected buffer size works well across all of them.

Once the buffer size is set, click on the Detect objects button and the app will compute the widths and heights of all the detected objects across the subset of frames. You can choose the number of frames used for this step; more frames will provide better statistics at the expense of longer computation time. Once the statistics are computed, their distribution will be plotted in the sidebar.

By default, the app will attempt to determine the ranges of statistics that correspond to single instances of the objects. The single instances detected this way will appear as green rectangles in the video frame; the others will appear as red/orange rectangles. You can navigate through the subset of frames to check whether the selected ranges work across the whole video. In particular, make sure that every selected instance contains one, and only one, object. If that is not the case, you can untick the Automatic object selection tick box and manually adjust the ranges to eliminate these instances. You can also change the status of an instance (selected or unselected) by clicking on it directly with your mouse cursor.
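
To illustrate what the range adjustment does (a sketch, not the app's code), assuming the detections are stored in a data frame with width and height columns in pixels:

    # Conceptual sketch: keep only detections whose size falls within the
    # selected ranges, as a proxy for "single instance" objects.
    select_singles <- function(boxes, width_range, height_range) {
      keep <- boxes$width  >= width_range[1]  & boxes$width  <= width_range[2] &
              boxes$height >= height_range[1] & boxes$height <= height_range[2]
      boxes[keep, , drop = FALSE]  # remaining rows are treated as single objects
    }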

Once the objects have been selected, the last tab of the app will become available and you can click on it to navigate there.

Note: If you modify the buffer size or the number of images used to detect the objects after computing the object statistics, you will need to run the object detection again to take into account the new parameters.


2.7 - Tab 6: composite module

The final step of the process is to generate the training dataset itself. This will create composite images in which a set number of objects are printed over the background image, at random locations and orientations within the boundary of the mask loaded/created earlier. It will also create the bounding boxes of the objects, which YOLO needs in order to learn what the objects look like.
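
For reference, YOLO label files store one bounding box per line in the form "class x_center y_center width height", with all coordinates normalized by the image dimensions. A small sketch of that conversion (the helper below is illustrative, not a trackRai function):

    # Convert a pixel-space bounding box to a YOLO label line.
    # Coordinates and image dimensions are assumed to be in pixels.
    to_yolo_label <- function(class_id, xmin, ymin, xmax, ymax, img_width, img_height) {
      x_center <- (xmin + xmax) / 2 / img_width
      y_center <- (ymin + ymax) / 2 / img_height
      w <- (xmax - xmin) / img_width
      h <- (ymax - ymin) / img_height
      sprintf("%d %.6f %.6f %.6f %.6f", class_id, x_center, y_center, w, h)
    }

    # Example: an object spanning pixels (100, 50) to (180, 130) in a 1280 x 720 frame.
    to_yolo_label(0, 100, 50, 180, 130, 1280, 720)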

First, set the number of objects to include in each composite image. This number should be high enough to produce many random arrangements and overlaps, so that YOLO generalizes better to situations where objects touch or overlap in the real video frames. You can also set a buffer zone to ensure that the printed objects do not overlap too much with the mask boundaries.

Next, you can add random noise to the composite images in order to increase the generalizability of the YOLO training. You can also add random gain and bias to vary the contrast and luminosity of the composite images for even more generalizability.

To check the effect of each parameter on the resulting composite images, you can click on the Generate test composite button to generate one or more sample composite images.

Once you are satisfied with the results, set the number of images that will be used for training, validating, and testing the YOLO model that you will train in a separate app. A recommended breakdown is to reserve 70% of the images for training and 15% each for validation and testing (for example, 700/150/150 images for a dataset of 1,000 composite images).

You can then generate the training dataset by clicking on the Generate YOLO dataset button. This will bring up a file manager and you can select the location where you would like the dataset to be saved (a folder named YOLO will be created at that location). Once the process terminates, you are done and you can close the app. The next step will be training a YOLO model.

Note: When generating the training dataset, you can also ask the app to create a reframed video by ticking the tick box at the bottom of the sidebar. This will prepare a video file in which everything outside the mask will be removed and the dimensions of the image will be set to optimize the video processing done by YOLO during the tracking phase. This is completely optional and will add some processing time, but performing the tracking on the reframed video can be up to twice as fast as on the original video, depending on the number of objects to track.