Fine-grained visual understanding

from patch-level to pixel-level details

Figure 1. Example of semantic segmentation from an RGB image in a driving scenario. Image from the Cityscapes dataset.

What is spatial intelligence? It may be defined as “the ability to generate, retain, retrieve, and transform well-structured visual images” (D. F. Lohman, Spatial ability and g, 1996) or, more broadly, “human’s computational capacity that provides the ability or mental skill to solve spatial problems of navigation, visualization of objects from different angles and space, faces or scenes recognition, or to notice fine details” (H. Gardner).

Understanding the fine-grained details in images (or other forms of sensing) is a form of spatial reasoning that is useful for many applications. It is spatial reasoning because the informative content of a fraction of the image (from a patch down to a single pixel) depends on the context given by its surroundings. My team and I have been working on the development of fine-grained visual understanding algorithms for various use cases, e.g., driving scenes, aerial imagery and industrial applications.
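
As a concrete illustration of what pixel-level prediction means, here is a minimal sketch that runs an off-the-shelf torchvision DeepLabV3 model and assigns one class label to every pixel. This is only a generic example of the task, not one of the methods described on this page, and the input filename is hypothetical.

```python
import torch
from torchvision.io import read_image
from torchvision.models.segmentation import (
    DeepLabV3_ResNet50_Weights,
    deeplabv3_resnet50,
)

# Off-the-shelf model pre-trained for semantic segmentation (generic example).
weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

img = read_image("street_scene.jpg")      # hypothetical input image
batch = preprocess(img).unsqueeze(0)      # [1, 3, H, W], resized and normalized

with torch.no_grad():
    logits = model(batch)["out"]          # [1, num_classes, H, W]

# One class label per pixel: the decision at each location is driven by a
# large receptive field, i.e., by the context surrounding that pixel.
labels = logits.argmax(dim=1)             # [1, H, W]
```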

Driving scenes

Understanding the fine details in images collected from cameras onboard vehicles is extremely important for the development of the perception stack of an autonomous/assisted car. It is also a challenging task, due to variations in scenery and weather/illumination conditions, the cost of labeling, etc. At Vandal we have been working on these topics, developing tools that can help support research, like the IDDA dataset, which contains 105 different scenarios that differ in weather conditions, environment and camera point of view. We have also developed several algorithms, with a focus on robustness across domains (Pixel-by-Pixel Cross-Domain Alignment for Few-Shot Semantic Segmentation) and on handling anomalies (Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation).

Figure 2. IDDA offers 105 scenarios of driving scenes, varying in weather conditions, environments and points of view of the camera (on a car, jeep, minivan, ...). Each RGB image is accompanied by a semantic mask and a depth image.
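
As a point of reference for the anomaly setting, the sketch below shows the common maximum-softmax-probability baseline, which scores each pixel by how poorly it fits every known class. It is only a generic baseline for illustration; Mask2Anomaly itself is built on a mask-transformer formulation rather than this per-pixel score.

```python
import torch
import torch.nn.functional as F

def anomaly_map(logits: torch.Tensor) -> torch.Tensor:
    """logits: [N, num_classes, H, W] from any semantic segmentation model.

    Returns an [N, H, W] map where higher values mean the pixel is poorly
    explained by every known (in-distribution) class.
    """
    probs = F.softmax(logits, dim=1)
    confidence, _ = probs.max(dim=1)   # confidence in the most likely known class
    return 1.0 - confidence            # low confidence -> likely anomaly

# Usage: threshold the map to flag unexpected objects on the road,
# e.g. mask = anomaly_map(logits) > 0.5 (the threshold is a free parameter).
```
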
Aerial scenes

Aerial imagery contains a wealth of fine-grained information that can be extremely valuable, e.g., to monitor urban development or support environmental monitoring. Unlike in driving scenes, where the street is always at the bottom and the sky on top, in aerial imagery the model cannot rely on a fixed semantic structure of the scene. Moreover, the relative scale of different visual elements can vary greatly. We have developed solutions that address the domain shift problem in aerial segmentation and that are tailored to the specificities of these scenes.

Figure 3. Some of the solutions that we have developed specifically for aerial segmentation. Left: Augmentation Invariance and Adaptive Sampling in Semantic Segmentation of Agricultural Aerial Images. Right: Hierarchical Instance Mixing Across Domains in Aerial Segmentation.
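
To give a flavor of how the lack of a canonical orientation in aerial scenes can be exploited, the sketch below shows a generic consistency term that asks predictions to be equivariant to 90-degree rotations. It is an illustrative sketch under that assumption, not the exact loss used in the augmentation-invariance paper above.

```python
import torch
import torch.nn.functional as F

def rotation_consistency_loss(model, images: torch.Tensor) -> torch.Tensor:
    """images: [N, 3, H, W]; model returns per-pixel logits [N, C, H, W]."""
    k = int(torch.randint(1, 4, (1,)))             # random multiple of 90 degrees
    rotated = torch.rot90(images, k, dims=(2, 3))  # aerial semantics are unchanged

    logits = model(images)
    logits_rotated = model(rotated)

    # Rotate the prediction on the original view and ask the two views to agree.
    target = torch.rot90(logits, k, dims=(2, 3)).detach()
    return F.mse_loss(logits_rotated, target)
```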

Related Publications

2024

  1. Journal
    Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation
    Shyam Nandan Rai, Fabio Cermelli, Barbara Caputo, and Carlo Masone
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2023

  1. Journal
    Hierarchical Instance Mixing Across Domains in Aerial Segmentation
    Edoardo Arnaudo, Antonio Tavera, Carlo Masone, Fabrizio Dominici, and Barbara Caputo
    IEEE Access, 2023
  2. Conference
    Unmasking Anomalies in Road-Scene Segmentation
    Shyam Nandan Rai, Fabio Cermelli, Dario Fontanel, Carlo Masone, and Barbara Caputo
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
  3. Workshop
    The Robust Semantic Segmentation UNCV2023 Challenge Results
    Xuanlong Yu, Yi Zuo, Zitao Wang, Xiaowen Zhang, Jiaxuan Zhao, Yuting Yang, Licheng Jiao, Rui Peng, Xinyi Wang, Junpei Zhang, Kexin Zhang, Fang Liu, Roberto Alcover-Couso, Juan C. SanMiguel, Marcos Escudero-Viñolo, Hanlin Tian, Kenta Matsui, Tianhao Wang, Fahmy Adan, Zhitong Gao, Xuming He, Quentin Bouniot, Hossein Moghaddam, Shyam Nandan Rai, Fabio Cermelli, Carlo Masone, Andrea Pilzer, Elisa Ricci, Andrei Bursuc, Arno Solin, Martin Trapp, Rui Li, Angela Yao, Wenlong Chen, Ivor Simpson, Neill D. F. Campbell, and Gianni Franchi
    In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023

2022

  1. Workshop
    Augmentation Invariance and Adaptive Sampling in Semantic Segmentation of Agricultural Aerial Images
    A. Tavera, E. Arnaudo, C. Masone, and B. Caputo
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022
  2. Conference
    Learning Semantics for Visual Place Recognition through Multi-Scale Attention
    V. Paolicelli, A. Tavera, G. Berton, C. Masone, and B. Caputo
    In International Conference on Image Analysis and Processing (ICIAP), 2022
  3. Conference
    Pixel-by-Pixel Cross-Domain Alignment for Few-Shot Semantic Segmentation
    A. Tavera, F. Cermelli, C. Masone, and B. Caputo
    In IEEE Winter Conference on Applications of Computer Vision (WACV), 2022

2021

  1. Conference
    Reimagine BiSeNet for Real-Time Domain Adaptation in Semantic Segmentation
    A. Tavera, C. Masone, and B. Caputo
    In Proceedings of the I-RIM 2021 Conference, 2021

2020

  1. Journal
    IDDA: A Large-Scale Multi-Domain Dataset for Autonomous Driving
    E. Alberti, A. Tavera, C. Masone, and B. Caputo
    IEEE Robotics and Automation Letters, 2020