We offer a benchmark suite together with an evaluation server, so that authors can upload their results and get a ranking for the different tasks (pixel-level, instance-level, and panoptic semantic labeling, as well as 3D vehicle detection). If you would like to submit your results, please register, log in, and follow the instructions on our submission page.
Pixel-Level Semantic Labeling Task
The first Cityscapes task involves predicting a per-pixel semantic labeling of the image without considering higher-level object instance or boundary information.
Metrics
To assess performance, we rely on the standard Jaccard Index, commonly known as the PASCAL VOC intersection-over-union metric IoU = TP ⁄ (TP+FP+FN) [1], where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set. Owing to the two semantic granularities, i.e. classes and categories, we report two separate mean performance scores: IoUcategory and IoUclass. In either case, pixels labeled as void do not contribute to the score.
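As an illustration, the per-class IoU could be computed from predicted and ground-truth label maps along the following lines. This is a minimal sketch rather than the official evaluation script: the class ids and the void id are assumptions, and the official scores accumulate TP, FP, and FN over the whole test set rather than per image.

import numpy as np

def class_iou(pred, gt, class_id, void_id=255):
    # Jaccard index IoU = TP / (TP + FP + FN) for a single class;
    # pixels labeled as void in the ground truth do not contribute.
    valid = gt != void_id
    p = (pred == class_id) & valid
    g = (gt == class_id) & valid
    tp = np.sum(p & g)
    fp = np.sum(p & ~g)
    fn = np.sum(~p & g)
    denom = tp + fp + fn
    return tp / denom if denom > 0 else float("nan")

# Mean IoU over a label set (class ids 0..18 and void id 255 are assumptions):
# mean_iou = np.nanmean([class_iou(pred, gt, c) for c in range(19)])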
It is well known that the global IoU measure is biased toward object instances that cover a large image area. In street scenes with their strong scale variation this can be problematic. Specifically for traffic participants, which are the key classes in our scenario, we aim to evaluate how well the individual instances in the scene are represented in the labeling. To address this, we additionally evaluate the semantic labeling using an instance-level intersection-over-union metric iIoU = iTP ⁄ (iTP+FP+iFN). Again, iTP, FP, and iFN denote the numbers of true positive, false positive, and false negative pixels, respectively. However, in contrast to the standard IoU measure, iTP and iFN are computed by weighting the contribution of each pixel by the ratio of the class’ average instance size to the size of the respective ground truth instance. It is important to note that, unlike in the instance-level task below, we assume that the methods only yield a standard per-pixel semantic class labeling as output. Therefore, the false positive pixels are not associated with any instance and thus do not require normalization. The final scores, iIoUcategory and iIoUclass, are obtained as the means for the two semantic granularities.
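The instance-size weighting can be sketched as follows, assuming a ground-truth instance-id map and a precomputed average instance size per class are available; again, this is an illustration rather than the official implementation.

import numpy as np

def class_iiou(pred, gt, gt_instances, class_id, avg_instance_size, void_id=255):
    # iIoU = iTP / (iTP + FP + iFN); iTP and iFN weight each pixel by the ratio
    # of the class' average instance size (assumed precomputed over the dataset)
    # to the size of its ground-truth instance, while FP pixels stay unweighted.
    valid = gt != void_id
    p = (pred == class_id) & valid
    g = (gt == class_id) & valid

    itp, ifn = 0.0, 0.0
    for inst_id in np.unique(gt_instances[g]):
        inst = g & (gt_instances == inst_id)
        weight = avg_instance_size / inst.sum()
        itp += weight * np.sum(p & inst)
        ifn += weight * np.sum(~p & inst)

    fp = np.sum(p & ~g)
    denom = itp + fp + ifn
    return itp / denom if denom > 0 else float("nan")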
Results
Detailed results, including the performance for individual classes and categories, can be found here.
Instance-Level Semantic Labeling Task
In the second Cityscapes task we focus on simultaneously detecting objects and segmenting them. This is an extension of both traditional object detection, since per-instance segments must be provided, and pixel-level semantic labeling, since each instance is treated as a separate label. Therefore, algorithms are required to deliver a set of detections of traffic participants in the scene, each associated with a confidence score and a per-instance segmentation mask.
Metrics
To assess instance-level performance, we compute the average precision on the region level (AP [2]) for each class and average it across a range of overlap thresholds to avoid a bias towards a specific value. Specifically, we follow [3] and use 10 different overlaps ranging from 0.5 to 0.95 in steps of 0.05. The overlap is computed at the region level, making it equivalent to the IoU of a single instance. We penalize multiple predictions of the same ground truth instance as false positives. To obtain a single, easy-to-compare compound score, we report the mean average precision AP, obtained by also averaging over the class label set. As minor scores, we add AP50% for an overlap value of 50 %, as well as AP100m and AP50m, where the evaluation is restricted to objects within 100 m and 50 m distance, respectively.
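The averaging over overlap thresholds and classes can be sketched as follows; the matching helpers are hypothetical, and the precision-recall integration is simplified compared to the official script.

import numpy as np

# Mask-overlap thresholds 0.50, 0.55, ..., 0.95 used for the compound AP score.
OVERLAPS = np.linspace(0.50, 0.95, 10)

def average_precision(matched, num_gt):
    # AP for one class at one overlap threshold, from detections sorted by
    # descending confidence: matched[i] is True if the i-th detection was
    # greedily matched to a not-yet-matched ground truth instance with
    # sufficient mask overlap; duplicate matches count as false positives.
    matched = np.asarray(matched, dtype=bool)
    tp = np.cumsum(matched)
    fp = np.cumsum(~matched)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    # area under the precision-recall curve (simple trapezoidal approximation,
    # not necessarily identical to the official interpolation)
    return float(np.trapz(precision, recall))

# Compound score with hypothetical helpers match_at() and num_gt_of():
# ap = np.mean([np.mean([average_precision(match_at(c, o), num_gt_of(c))
#                        for o in OVERLAPS]) for c in CLASSES])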
Results
Detailed results, including the performance for individual classes, can be found here.
Panoptic Semantic Labeling Task
The third Cityscapes task was added in 2019 and combines pixel-level and instance-level semantic labeling in a single task called “panoptic segmentation”. The challenge as well as the evaluation metrics are described in [4].
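As a rough illustration, the panoptic quality (PQ) measure defined in [4] can be sketched as follows; the segment matching (one-to-one at IoU > 0.5) and the averaging over classes are omitted here, and this is not the official evaluation code.

def panoptic_quality(matched_ious, num_fp, num_fn):
    # PQ = (sum of IoUs of matched segment pairs) / (|TP| + 0.5 |FP| + 0.5 |FN|),
    # with predicted and ground-truth segments matched one-to-one at IoU > 0.5.
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom > 0 else 0.0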
Results
Detailed results, including the performance for individual classes, can be found here.
3D Vehicle Detection Task
The fourth Cityscapes task was added in 2020 and focuses on 3D object detection for vehicles, estimating 3D parameters such as orientation and location. Objects of class car, truck, bus, train, motorcycle, and bicycle are evaluated. Each object is described by an amodal 2D bounding box as well as a 9 degree-of-freedom 3D bounding box: center position, dimensions, and orientation.
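As an illustration of the quantities involved, one detection could be represented along the following lines; the field names and parameterizations are purely illustrative and do not correspond to the official submission format.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class VehicleDetection:
    # Illustrative container, not the official Cityscapes submission format.
    label: str                                  # e.g. "car", "bus", "bicycle"
    score: float                                # detection confidence
    box_2d: Tuple[float, float, float, float]   # amodal 2D box (illustrative corner format)
    center: Tuple[float, float, float]          # 3D center position
    dimensions: Tuple[float, float, float]      # 3D dimensions
    rotation: Tuple[float, float, float]        # 3D orientation (illustrative angle format)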
Metrics
In this task, both 2D and 3D parameters are evaluated. 2D Average Precision (AP) is used to assess the ability to detect the objects of interest within the image and to match predictions with ground truth objects. In addition, for each true positive pair of detection and ground truth, the 3D parameters (location, orientation, and size) are evaluated individually using depth-dependent true-positive metrics for location (BEV), orientation (OS Yaw and OS PitchRoll), and dimensions (SizeSim), as described in [5]. Objects up to a distance of 100 m are taken into account, with a step size of 5 m for the depth-dependent evaluation. Detections within an ignore region are not considered in the evaluation.
For each label class, a detection score (DS) is calculated as DS = AP × (BEV + OS_Yaw + OS_PitchRoll + SizeSim) / 4.
The detection scores averaged over all classes are used to rank the submissions.
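A minimal sketch of this scoring scheme, assuming the per-class AP and true-positive metrics have already been computed and lie in [0, 1]:

def detection_score(ap, bev, os_yaw, os_pitch_roll, size_sim):
    # Per-class DS: 2D AP scaled by the mean of the four true-positive
    # 3D metrics, all assumed to lie in [0, 1].
    return ap * (bev + os_yaw + os_pitch_roll + size_sim) / 4.0

def mean_detection_score(per_class_metrics):
    # Final score: the per-class detection scores averaged over all classes.
    scores = [detection_score(**m) for m in per_class_metrics]
    return sum(scores) / len(scores)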
Results
Detailed results, including the performance for individual classes, can be found here.
Meta Information
In addition to the measures introduced above, we report meta information for each method, such as timings or the kind of input each algorithm uses, e.g. depth data or multiple video frames. Please refer to the result tables for further details.