We present the UTUAV dataset, which consists of three different scenes captured in Medellín, the second largest city of Colombia. Road user classes representative of those in emerging countries such as Colombia have been chosen: motorcycles (MC), light vehicles (LV) and heavy vehicles (HV).
The dataset was initially annotated with the ViPER annotation tool. Subsequently, the annotations were converted to the Pascal VOC (XML) format (directories named "Annotations"; bounding boxes in absolute coordinates xmin, xmax, ymin, ymax) and to the Ultralytics YOLOv8 format (directories named "labels"; a class label (0: motorbike, 1: LV, 2: HV) followed by the bounding box in normalised xywh coordinates relative to image width and height: xcentroid, ycentroid, width, height). The images are stored in directories named "images". For YOLOv8 we also provide examples of the dataset definition file (e.g. B_Dataset.yaml) and of the configuration file (e.g. default_copy.yaml).
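As an illustration of the relationship between the two formats, the conversion from a Pascal VOC box (absolute corners) to a YOLOv8 label line (class id plus normalised centroid/size) can be sketched as follows; the function name is ours, not part of the dataset tooling:

```python
# Sketch: convert one Pascal VOC bounding box (absolute xmin, ymin, xmax, ymax)
# to YOLOv8 label values (normalised x-centroid, y-centroid, width, height).
# Class ids follow the text: 0 = motorbike, 1 = LV, 2 = HV.
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    xc = (xmin + xmax) / 2 / img_w   # box centre, normalised by image width
    yc = (ymin + ymax) / 2 / img_h   # box centre, normalised by image height
    w = (xmax - xmin) / img_w        # box width, normalised
    h = (ymax - ymin) / img_h        # box height, normalised
    return xc, yc, w, h

# Example: a 100x50-pixel box at (300, 200) in a 640x364 UTUAV-A frame.
xc, yc, w, h = voc_to_yolo(300, 200, 400, 250, 640, 364)
line = f"0 {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"  # one line of a "labels" file
```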
UTUAV-A Dataset
This dataset is an extension of Espinosa et al., which originally annotated only motorbikes in 10,000 frames with a resolution of 640x364 pixels. The images were taken from an unmanned aerial vehicle (UAV) elevated 4.5 meters above the ground. The UAV is kept at the same position and small camera movement is noticeable. The extension adds the annotation of light and heavy vehicles. The following table presents the main dataset characteristics, including the number of annotated vehicles, the mean area of the vehicle bounding boxes, the number of totally occluded vehicles, the mean duration of total occlusions measured in frames, and the mean displacement in pixels of objects while they are occluded. Note that, due to the limited elevation and the capture angle of the sequence, occlusions appear frequently and object sizes change significantly. (NB: the numbers are indicative only as they might be slightly different from the final annotations; in this case there are 10,050 images.)
Vehicle | Motorcycle | Light Vehicle | Heavy Vehicle |
---|---|---|---|
Number of annotations | 56,970 | 44,415 | 44,415 |
Annotated objects | 318 | 159 | 16 |
Mean Size (pixels) | 1,763 | 4,546 | 4,771 |
Totally occluded objects | 7 | 6 | 3 |
Mean occlusion duration (frames) | 8.1 | 2.9 | 269.3 |
Mean occlusion displacement (pixels) | 31.2 | 6.3 | 369.4 |
![]()

Fully annotated dataset
Dataset split 80:10:10 into training, validation and evaluation sub-sets. This also contains JSON files in COCO format that we have used to train and evaluate DETR models.
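The 80:10:10 proportions can be sketched as follows; this is a minimal illustration only (the split files shipped with the dataset define the actual partition):

```python
# Sketch: shuffle a list of image ids and split it 80:10:10 into
# training, validation and evaluation sub-sets.
import random

def split_80_10_10(items, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Example with the 6,500 images of UTUAV-B.
train, val, evaluation = split_80_10_10(range(6500))
```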
UTUAV-B Dataset
Exploiting the top-view angle that a high-elevation UAV can reach (approx. 100 meters), the second dataset is composed of 6,500 labelled images with a resolution of 3840x2160 (4K) pixels. The entire sequence is captured from a top view and turbulence or camera movement is hardly perceptible. The annotation bounding-box color codes are red for light vehicles, blue for motorbikes and green for heavy vehicles.

Vehicle | Motorcycle | Light Vehicle | Heavy Vehicle |
---|---|---|---|
Number of annotations | 70,064 | 331,508 | 18,864 |
Annotated objects | 128 | 282 | 13 |
Mean Size (pixels) | 992 | 3,318 | 5,882 |
Totally occluded objects | 80 | 84 | 4 |
Mean occlusion duration (frames) | 19.6 | 18.6 | 34.0 |
Mean occlusion displacement (pixels) | 108.4 | 130.7 | 197.5 |
![]()
Dataset split 80:10:10 into training, validation and evaluation sub-sets.
UTUAV-C Dataset
The third dataset is a sequence of 10,560 labelled frames with a resolution of 3840x2160 (4K) pixels. This video sequence was also captured from a UAV, elevated 120 meters above the ground. The road configuration is different. These differences have been introduced intentionally to test generalisation ability, e.g. detection on this dataset using a model trained on UTUAV-B. This dataset also uses a top-view angle and the same color code for annotations as UTUAV-B. (NB: the numbers are indicative only as they differ from the final annotations; in this case there are only 6,600 annotated images.)
Vehicle | Motorcycle | Light Vehicle | Heavy Vehicle |
---|---|---|---|
Number of annotations | 463,009 | 1,477,287 | 130,142 |
Annotated objects | 456 | 997 | 86 |
Mean Size (pixels) | 467 | 1,722 | 4,275 |
Totally occluded objects | 211 | 265 | 31 |
Mean occlusion duration (frames) | 89.9 | 86.8 | 110.8 |
Mean occlusion displacement (pixels) | 226.9 | 210.1 | 260.3 |
![]()
Dataset split 80:10:10 into training, validation and evaluation sub-sets.
Dealing with Large Images
UTUAV-B and UTUAV-C contain many objects which are small compared to the image size. Also, 4K images typically need to be resized to smaller sizes for training on standard GPU cards. An alternative approach is to divide each training image into sub-images taken on a regular grid, with sub-image size NWxNH and overlap O, as illustrated below. In our case, somewhat arbitrarily, NW=NH=1000, O=200.
Dividing each training image into overlapping sub-images
UTUAV-C 80:10:10, training sub-images (1000,1000,200)
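The grid computation can be sketched as follows (a minimal sketch under the NW=NH=1000, O=200 setting above; function names are ours and the dataset's own tiling script may differ):

```python
# Sketch: compute origins of overlapping tiles covering a large image.
def tile_origins(size, tile, overlap):
    """Start offsets along one axis for tiles of `tile` pixels with
    `overlap` pixels of overlap, covering the full `size` pixels."""
    step = tile - overlap
    origins = list(range(0, max(size - tile, 0) + 1, step))
    # Add a final tile flush with the far edge if the grid stops short.
    if origins[-1] + tile < size:
        origins.append(size - tile)
    return origins

def tiles(width, height, nw=1000, nh=1000, o=200):
    """Yield (x, y, w, h) boxes for each sub-image of a width x height frame."""
    for y in tile_origins(height, nh, o):
        for x in tile_origins(width, nw, o):
            yield (x, y, nw, nh)

# Example: one 4K UTUAV-B/C frame.
boxes = list(tiles(3840, 2160))
```

Annotations then need to be clipped and re-normalised per sub-image before training.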
To do ...
- Make available other COCO-compatible data (for UTUAV-B and UTUAV-C)
- Convert and make available tracking ground truth
(where each object has a unique identifier)
For any queries related to these datasets please contact Jorge Espinosa or Sergio A Velastin.