We present the UTUAV dataset, which consists of three different scenes captured in Medellín, the second largest city of Colombia. Road user classes representative of those found in emerging countries such as Colombia have been chosen: motorcycles (MC), light vehicles (LV) and heavy vehicles (HV).
The dataset was initially annotated by means of the Viper annotation tool. Subsequently, the annotations were converted to the Pascal VOC (XML) format (directories named "Annotations", bounding boxes in absolute coordinates xmin, xmax, ymin, ymax) and to the Ultralytics YOLOv8 format (directories named "labels", class label (0: motorbike, 1: LV, 2: HV) followed by the bounding box in normalised xywh coordinates relative to the image width and height: xcentroid, ycentroid, width, height). The images are stored in directories named "images".
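To make the relationship between the two box encodings explicit, the sketch below (an illustrative snippet, not part of the dataset tooling) converts a single Pascal VOC box in absolute pixel coordinates into a YOLO label line; the example values are made up.

```python
# Illustrative sketch: convert one Pascal VOC box (absolute pixel coordinates)
# into an Ultralytics YOLO label line (class followed by normalised xywh).
def voc_to_yolo(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    xc = (xmin + xmax) / 2.0 / img_w   # normalised x centroid
    yc = (ymin + ymax) / 2.0 / img_h   # normalised y centroid
    w = (xmax - xmin) / img_w          # normalised width
    h = (ymax - ymin) / img_h          # normalised height
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Example with made-up values: a light vehicle (class 1) in a 3840x2160 frame
print(voc_to_yolo(1, 1900, 1050, 1980, 1110, 3840, 2160))
```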
UTUAV-A Dataset (road side)
(NB: the numbers are indicative only as they might differ slightly from the final annotations; in this case there are 10,050 images)
Vehicle | Motorcycle | Light Vehicle | Heavy Vehicle |
---|---|---|---|
Number of annotations | 56,970 | 44,415 | 44,415 |
Annotated objects | 318 | 159 | 16 |
Mean Size (pixels) | 1,763 | 4,546 | 4,771 |
Totally occluded objects | 7 | 6 | 3 |
Mean occlusion duration (frames) | 8.1 | 2.9 | 269.3 |
Mean occlusion displacement (pixels) | 31.2 | 6.3 | 369.4 |
Fully annotated dataset
Dataset split 80:10:10 into training, validation and evaluation sub-sets. This also contains JSON files in COCO format that we have used to train and evaluate DETR models.
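As an indication of how these COCO-format files can be inspected, the hedged snippet below uses pycocotools; the file name is an assumption and may differ in the actual download.

```python
# Hypothetical usage sketch: inspect one of the COCO-format json files.
# The file name "train.json" is an assumption; adjust it to the actual download.
from pycocotools.coco import COCO

coco = COCO("train.json")                   # load and index the annotations
cats = coco.loadCats(coco.getCatIds())      # the three classes (MC, LV, HV)
print([c["name"] for c in cats])
print(len(coco.imgs), "images,", len(coco.anns), "annotations")
```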
Bird's-eye Datasets
Currently, we are concentrating on the following two sets ("B" and "C"), which contain bird's-eye views of urban traffic captured from two different (but similar) heights over two different road topologies. These sets are useful for experimenting with:
- Detection methods able to detect both very small and larger objects.
- Measuring generalisation capabilities, e.g. how a model trained on B performs on images from C and vice versa (see the sketch below).
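For example, the cross-dataset test can be run with the Ultralytics API by validating a model trained on one set against the other set's data definition. In this sketch the weights path is hypothetical and the YAML files are those described under Global Data below.

```python
# Minimal sketch: evaluate a model trained on UTUAV-B against UTUAV-C.
# The weights path is hypothetical; C_Dataset.yaml is the Ultralytics
# dataset definition file provided with the C set.
from ultralytics import YOLO

model = YOLO("runs/detect/train_B/weights/best.pt")      # model trained on B
metrics = model.val(data="C_Dataset.yaml", split="test")
print(metrics.box.map50)                                 # mAP@0.5 on C's test split
```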
Global Data
When you download these datasets, you will find the following sub-directories:
- Visualise: The images with added bounding boxes on the annotated objects (each class shown with a different colour)
- Annotations: The annotated objects in XML-encoded Pascal VOC format (absolute image coordinates)
- images: the jpeg images of that dataset
- labels: the YOLO (Ultralytics)-formatted annotations (ground truth) for those images
- You might also find the following files:
- *.xgtf: original annotations in Viper-GT format (historical)
- default_copy.yaml: hyper-parameter configuration for YOLO (Ultralytics) training
- X_Dataset.yaml (X=B or C): Ultralytics definitions of file locations (train, val, test) and classes; both YAML files are used in the training sketch after this list
- Note that at this point, the downloadable datasets contain ALL images and labels (annotations) without distinguishing between training, validation and testing sub-sets
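To illustrate how these files fit together once the training/validation/test directories have been generated (see Data Partitions below), a hedged training sketch follows; the model checkpoint is only an example, and default_copy.yaml supplies the hyper-parameters.

```python
# Minimal sketch: train an Ultralytics detector on the B set using the
# provided YAML files. The yolov8n.pt checkpoint is only an example.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(data="B_Dataset.yaml",    # file locations and class names
            cfg="default_copy.yaml")  # training hyper-parameters
```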
Data Partitions
This refers to how the data is partitioned into training, validation and evaluation (test) sub-sets. We do not mix B with C, as one of the purposes is to check generalisation (see above). Originally, we "naively" created such partitions by random selection of frames. This, however, tends to "inflate" evaluation metrics because the trained model has been exposed to similar images, an effect aggravated by the fact that the images come from temporal sequences (videos). So, we provide the following partitioning schemes:
- Original: Where train/val/test frames were randomly selected from the global data (separately for B and C). This is mainly of historical interest.
- We then first separate an evaluation (test) sub-set consisting of the last temporal segment of the video sequence. The same evaluation sub-set is used for all tests (for the different partitions, except for "Original"). Training and validation sub-sets are then provided using the following approaches:
- Random: where training and validation images have been randomly selected from the remaining images
- Sequential: where the training sub-set is taken from the start of the video sequence up to the point at which a given proportion of the frames has been reached; the frames that follow (excluding the evaluation sub-set already extracted) form the validation sub-set
- Sequential_gap: Where we drop a proportion of frames around the dividing line between training and validation frames and between the validation and the evaluation frames.
- These data partitions are stored as simple text files (e.g. training.txt) that contain a list of image files
- The partitions for the B dataset are stored in a directory
B_Split and those for C in C_Split
To generate Ultralytics-compatible images and labels
directories using the above text files and the global
datasets, we provide the Python script GenDataset.py (it works on Linux and by default creates symbolic links to the original global files to reduce storage requirements). The script can be downloaded here.
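For reference, a minimal sketch of the kind of operation GenDataset.py performs is given below; the directory and file names used here are assumptions and the actual script may differ.

```python
# Minimal sketch (assumed names): build Ultralytics-style images/ and labels/
# directories for one partition by symlinking files from the global dataset.
import os
from pathlib import Path

def link_partition(list_file, global_dir, out_dir, subset):
    src = Path(global_dir).resolve()
    out_img = Path(out_dir) / "images" / subset
    out_lbl = Path(out_dir) / "labels" / subset
    out_img.mkdir(parents=True, exist_ok=True)
    out_lbl.mkdir(parents=True, exist_ok=True)
    for name in Path(list_file).read_text().split():
        stem = Path(name).stem
        os.symlink(src / "images" / name, out_img / name)                    # image
        os.symlink(src / "labels" / f"{stem}.txt", out_lbl / f"{stem}.txt")  # label

# Hypothetical example (paths are assumptions):
# link_partition("B_Split/training.txt", "UTUAV-B", "B_dataset", "train")
```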
Segment GTs
Finally, we have also generated "segment" GTs compatible with
Ultralytics (each object is represented by a polygon
corresponding to the segment contour). We have done this using
Meta's SAM (version 1; later versions may well produce better results). As with the normal (bounding box) annotations, the "labels" directory for the segment GT contains the GT for all the images. You will then need to separate these into training, validation and evaluation sub-sets for a given partition (we recommend the Sequential_gap partition).
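Each line of these segment label files follows the standard Ultralytics segmentation convention (a class index followed by a normalised polygon); a small parsing sketch, for illustration only:

```python
# Illustrative sketch: parse one line of an Ultralytics segmentation label,
# formatted as "class x1 y1 x2 y2 ... xn yn" with coordinates normalised
# to the image width and height.
def parse_segment_line(line, img_w, img_h):
    fields = line.split()
    class_id = int(fields[0])
    coords = list(map(float, fields[1:]))
    # polygon as absolute pixel (x, y) vertices
    polygon = [(x * img_w, y * img_h) for x, y in zip(coords[0::2], coords[1::2])]
    return class_id, polygon
```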
UTUAV-B Dataset
Exploiting the top-view angle that a high-elevation UAV can reach (approximately 100 metres), the second dataset is composed of 6,500 labelled images with a resolution of 3840x2160 (4K) pixels. The entire sequence is captured from a top view, and turbulence or camera movement is hardly perceptible. The annotation bounding box colour codes are red for light vehicles, blue for motorbikes and green for heavy vehicles.
Vehicle | Motorcycle | Light Vehicle | Heavy Vehicle |
---|---|---|---|
Number of annotations | 70,064 | 331,508 | 18,864 |
Annotated objects | 128 | 282 | 13 |
Mean Size (pixels) | 992 | 3,318 | 5,882 |
Totally occluded objects | 80 | 84 | 4 |
Mean occlusion duration (frames) | 19.6 | 18.6 | 34.0 |
Mean occlusion displacement (pixels) | 108.4 | 130.7 | 197.5 |
Segment GT dataset (99MB)
UTUAV-C Dataset
The third dataset is a sequence of 10,560 frames, of which 6,600 are annotated, with a resolution of 3840x2160 (4K) pixels. This video sequence was also captured from a UAV, elevated 120 metres above the ground. The road configuration is different; these differences have been introduced intentionally to test generalisation ability, e.g. detection on this dataset using a model trained on UTUAV-B. This dataset also uses a top-view angle and the same colour code for annotations as UTUAV-B.
(NB: the numbers are indicative only as they differ from the final annotations; in this case there are only 6,600 annotated images)
Vehicle | Motorcycle | Light Vehicle | Heavy Vehicle |
---|---|---|---|
Number of annotations | 463,009 | 1,477,287 | 130,142 |
Annotated objects | 456 | 997 | 86 |
Mean Size (pixels) | 467 | 1,722 | 4,275 |
Totally occluded objects | 211 | 265 | 31 |
Mean occlusion duration (frames) | 89.9 | 86.8 | 110.8 |
Mean occlusion displacement (pixels) | 226.9 | 210.1 | 260.3 |
Segment GT dataset (226MB)
To do ...
- Make available other COCO-compatible data (for UTUAV-B and UTUAV-C)
- Include annotations of stationary vehicles (the
original annotations focused on moving vehicles),
because a model that can detect such non-annotated
vehicles will be unfairly penalised (so far, in our
experiments the models we have trained learn to
ignore stationary vehicles!)
- Publish OBB annotations (to account for different object angles)
- Publish baseline results for various partitions and
architectures
- Convert and make available tracking ground
truth (where each object has a unique identifier)
For any queries related to these datasets
please contact Jorge
Espinosa or Sergio
A Velastin