The UTUAV Urban Traffic Dataset

We present the UTUAV dataset, which consists of three different scenes captured in Medellín, the second largest city in Colombia. Road user classes representative of those in emerging countries such as Colombia have been chosen: motorcycles (MC), light vehicles (LV) and heavy vehicles (HV).

The dataset was initially annotated by means of the ViPER annotation tool. Subsequently, the annotations were converted to the Pascal VOC (XML) format (directories named "Annotations"; bounding boxes in absolute coordinates xmin, xmax, ymin, ymax) and to the Ultralytics YOLOv8 format (directories named "labels"; each line holds the class label (0: motorbike, 1: LV, 2: HV) followed by the bounding box in normalised xywh coordinates relative to image width and height: xcentroid, ycentroid, width, height). The images are stored in directories named "images".

UTUAV-A Dataset (road side)
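For illustration, the following minimal sketch (not the conversion script actually used) shows how a Pascal VOC bounding box maps to a YOLO-format label line; it assumes the standard VOC XML layout and that the XML <name> tags match the class names listed above.

    # Minimal sketch: convert one Pascal VOC annotation file to YOLO label lines.
    # The class-name mapping and XML field names are assumptions about the files.
    import xml.etree.ElementTree as ET

    CLASS_IDS = {"motorbike": 0, "LV": 1, "HV": 2}  # assumed <name> values

    def voc_to_yolo_lines(xml_path):
        root = ET.parse(xml_path).getroot()
        w = float(root.find("size/width").text)
        h = float(root.find("size/height").text)
        lines = []
        for obj in root.findall("object"):
            cls = CLASS_IDS[obj.find("name").text]
            b = obj.find("bndbox")
            xmin, xmax = float(b.find("xmin").text), float(b.find("xmax").text)
            ymin, ymax = float(b.find("ymin").text), float(b.find("ymax").text)
            # YOLO format: class x_centre y_centre width height, normalised to [0, 1]
            xc = (xmin + xmax) / 2 / w
            yc = (ymin + ymax) / 2 / h
            lines.append(f"{cls} {xc:.6f} {yc:.6f} "
                         f"{(xmax - xmin) / w:.6f} {(ymax - ymin) / h:.6f}")
        return lines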

This dataset is an extension of that of Espinosa et al., which originally annotated only motorbikes in 10,000 frames with a resolution of 640x364 pixels. The images were taken from an unmanned aerial vehicle (UAV) elevated 4.5 meters from the ground. The UAV is kept at the same position, although slight camera movement is noticeable. The extension adds the annotation of light and heavy vehicles. The following table presents the main dataset characteristics, including the number of annotated vehicles, the mean area of the vehicle bounding boxes, the total number of occluded vehicles, the mean duration of total occlusions measured in frames, and the mean displacement in pixels of objects while they are occluded. Note that due to the limited elevation and the capture angle of the sequence, occlusions appear frequently and object sizes change significantly.

(NB: the numbers are indicative only as they might differ slightly from the final annotations; in this case there are 10,050 images)

Vehicle                               Motorcycle   Light Vehicle   Heavy Vehicle
Number of annotations                 56,970       44,415          44,415
Annotated objects                     318          159             16
Mean Size (pixels)                    1,763        4,546           4,771
Totally occluded objects              7            6               3
Mean occlusion duration (frames)      8.1          2.9             269.3
Mean occlusion displacement (pixels)  31.2         6.3             369.4


Annotated Image of UTUAV-A Dataset

Fully annotated dataset
The dataset is split 80:10:10 into training, validation and evaluation sub-sets. It also contains JSON files in COCO format, which we have used to train and evaluate DETR models.
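As a quick way to check these COCO-format files before training, the sketch below loads one of them with pycocotools and prints the per-class annotation counts; the file path "instances_train.json" is only an assumed example name.

    # Minimal sketch: inspect a provided COCO-format JSON file with pycocotools.
    # The annotation file path is an assumption, not the exact released name.
    from pycocotools.coco import COCO

    coco = COCO("Annotations_COCO/instances_train.json")  # assumed path
    print("images:", len(coco.getImgIds()))
    for cat in coco.loadCats(coco.getCatIds()):
        ann_ids = coco.getAnnIds(catIds=[cat["id"]])
        print(f'{cat["name"]}: {len(ann_ids)} annotations')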

Bird's-Eye Datasets

Currently, we are concentrating on the following two sets ("B" and "C"), which are bird's-eye views of urban traffic taken from two different (but similar) heights over two different road topologies. These sets are useful to experiment with:

Global Data

When you download these datasets, you will find the following sub-directories:

Data Partitions

This refers to how the data is partitioned into training, validation and evaluation (test) sub-sets. We do not mix B with C, as one of the purposes is to check generalisation (see above). Originally, we "naively" created such partitions by random selection of frames. This, however, tends to "inflate" evaluation metrics because the trained model has been exposed to similar images, which is aggravated by the fact that the images come from temporal sequences (videos). So, we provide the following partitioning schemes (a simplified sketch of the underlying idea is shown below):
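For illustration only, here is a minimal sketch of the idea behind a sequence-aware split: contiguous blocks of frames go to each sub-set and a small "gap" of discarded frames separates them, so near-identical neighbouring frames do not appear in both training and evaluation. The split ratios and gap size are illustrative assumptions, not any of the published partition files.

    # Minimal sketch of a sequence-aware 80:10:10 split with guard gaps.
    # Ratios and gap size are assumptions for illustration only.
    def sequential_split(frame_names, train=0.8, val=0.1, gap=50):
        frame_names = sorted(frame_names)      # keep temporal order
        n = len(frame_names)
        i_train = int(n * train)
        i_val = int(n * (train + val))
        return {
            "train": frame_names[:i_train],
            "val": frame_names[i_train + gap : i_val],
            "test": frame_names[i_val + gap :],
        }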

Download partitions

To generate Ultralytics-compatible "images" and "labels" directories using the above text files and the global datasets, we provide the Python script GenDataset.py (it works on Linux and by default it creates symbolic links to the original global files to reduce storage requirements). The script can be downloaded here.
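For reference, the sketch below illustrates the symbolic-link idea (it is not GenDataset.py itself). It assumes the partition text file lists one image filename per line and that each label shares the image's base name with a .txt extension; both are assumptions about the released files.

    # Minimal sketch: build images/ and labels/ sub-directories for one sub-set
    # by symlinking the original global files (Linux). Not the actual script.
    import os
    from pathlib import Path

    def link_partition(partition_txt, global_dir, out_dir, subset):
        global_dir, out_dir = Path(global_dir), Path(out_dir)
        for sub in ("images", "labels"):
            (out_dir / sub / subset).mkdir(parents=True, exist_ok=True)
        for name in Path(partition_txt).read_text().split():
            stem = Path(name).stem
            os.symlink(global_dir / "images" / name,
                       out_dir / "images" / subset / name)
            os.symlink(global_dir / "labels" / f"{stem}.txt",
                       out_dir / "labels" / subset / f"{stem}.txt")

    # Example (hypothetical paths):
    # link_partition("Sequential_gap/train.txt", "UTUAV-B", "UTUAV-B_seqgap", "train")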

Segment GTs

Finally, we have also generated "segment" GTs compatible with Ultralytics (each object is represented by a polygon corresponding to the segment contour). We have done this using Meta's SAM (version 1; later versions may produce better results). As with the normal (bounding box) annotations, the "labels" directory for the segment GT contains the GT for all the images. You will then need to separate these into training, validation and evaluation sub-sets for a given partition (we recommend the Sequential_gap partition).
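In the Ultralytics segmentation format each label line is the class id followed by the normalised polygon vertices ("class x1 y1 x2 y2 ... xn yn"). The minimal sketch below parses such a line and recovers pixel coordinates; the image size is passed in as an assumed parameter.

    # Minimal sketch: read an Ultralytics segmentation label file and
    # de-normalise the polygon vertices to pixel coordinates.
    def read_segments(label_path, img_w, img_h):
        objects = []
        for line in open(label_path):
            parts = line.split()
            cls = int(parts[0])
            vals = list(map(float, parts[1:]))
            polygon = [(x * img_w, y * img_h)
                       for x, y in zip(vals[0::2], vals[1::2])]
            objects.append((cls, polygon))
        return objects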

UTUAV-B Dataset

Exploiting the top-view angle that a high-elevation UAV can reach (approximately 100 meters), the second dataset is composed of 6,500 labelled images with a resolution of 3840x2160 (4K) pixels. The entire sequence is captured with a top view, and turbulence or camera movement is barely perceptible. The annotation bounding box color codes are red for light vehicles, blue for motorbikes and green for heavy vehicles.

Vehicle                               Motorcycle   Light Vehicle   Heavy Vehicle
Number of annotations                 70,064       331,508         18,864
Annotated objects                     128          282             13
Mean Size (pixels)                    992          3,318           5,882
Totally occluded objects              80           84              4
Mean occlusion duration (frames)      19.6         18.6            34.0
Mean occlusion displacement (pixels)  108.4        130.7           197.5


Annotated image of the UTUAV-B Dataset (4K resolution, resized here)
Fully annotated dataset (24GB)
Segment GT dataset (99MB)

UTUAV-C Dataset

The third dataset is a sequence of 10,560 frames, of which 6,600 are annotated, with a resolution of 3840x2160 (4K) pixels. This video sequence was also captured from a UAV, elevated 120 meters from the ground, and the road configuration is different. These differences have been introduced intentionally to test generalisation ability, e.g. detection on this dataset using a model trained with UTUAV-B. This dataset also uses a top-view angle and the same color code for annotations as UTUAV-B.

(NB: the numbers are indicative only as they differ from the final annotations; in this case there are only 6,600 annotated images)

Vehicle                               Motorcycle   Light Vehicle   Heavy Vehicle
Number of annotations                 463,009      1,477,287       130,142
Annotated objects                     456          997             86
Mean Size (pixels)                    467          1,722           4,275
Totally occluded objects              211          265             31
Mean occlusion duration (frames)      89.9         86.8            110.8
Mean occlusion displacement (pixels)  226.9        210.1           260.3


Annotated image of the UTUAV-C Dataset (4K resolution, resized here)
Fully annotated dataset (49GB)
Segment GT dataset (226MB)
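As an example of the generalisation experiments mentioned above, the minimal sketch below trains a detector on UTUAV-B and evaluates it on UTUAV-C with the Ultralytics API. The dataset YAML names, model size and training settings are assumptions; each YAML would point at the "images"/"labels" directories produced by GenDataset.py.

    # Minimal sketch: train on UTUAV-B, then evaluate on the unseen UTUAV-C scene.
    # YAML file names and hyperparameters are assumptions for illustration.
    from ultralytics import YOLO

    model = YOLO("yolov8m.pt")                       # pretrained starting point
    model.train(data="utuav_b.yaml", epochs=100, imgsz=1280)
    metrics = model.val(data="utuav_c.yaml")         # cross-scene evaluation
    print(metrics.box.map50)                         # mAP@0.5 on UTUAV-C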

To do ...

  • Make available other COCO-compatible data (for UTUAV-B and UTUAV-C)
  • Include annotations of stationary vehicles (the original annotations focused on moving vehicles); a model that detects such non-annotated vehicles would be unfairly penalised (so far, in our experiments the models we have trained learn to ignore stationary vehicles!)
  • Publish OBB annotations (to account for different object angles)
  • Publish baseline results for various partitions and architectures
  • Convert and make available tracking ground truth (where each object has a unique identifier)


For any queries related to these datasets, please contact Jorge Espinosa or Sergio A Velastin.