Fine-grained visual understanding

from patch-level to pixel-level details

Figure 1. Example of semantic segmentation from an RGB image in a driving scenario. Image from the Cityscapes dataset.

What is spatial intelligence? It may be defined as “the ability to generate, retain, retrieve, and transform well-structured visual images” (D. F. Lohman, Spatial ability and g, 1996) or, more broadly, “human’s computational capacity that provides the ability or mental skill to solve spatial problems of navigation, visualization of objects from different angles and space, faces or scenes recognition, or to notice fine details” (H. Gardner).

Understanding the fine-grained details in images (or other forms of sensing) is a form of spatial reasoning that is useful for many applications. It is spatial reasoning because the informative content of a fraction of the image (from a patch down to a single pixel) depends on the context given by its surroundings. My team and I have been working on the development of fine-grained visual understanding algorithms for various use cases, e.g., driving scenes, aerial imagery and industrial applications.
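
As a concrete illustration of what pixel-level prediction means, here is a minimal sketch that runs an off-the-shelf torchvision DeepLabV3 model and assigns one class label to every pixel. This is only a generic example of the task, not one of the methods described on this page, and the input filename is hypothetical.

```python
import torch
from torchvision.io import read_image
from torchvision.models.segmentation import (
    DeepLabV3_ResNet50_Weights,
    deeplabv3_resnet50,
)

# Off-the-shelf model pre-trained for semantic segmentation (generic example).
weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

img = read_image("street_scene.jpg")      # hypothetical input image
batch = preprocess(img).unsqueeze(0)      # [1, 3, H, W], resized and normalized

with torch.no_grad():
    logits = model(batch)["out"]          # [1, num_classes, H, W]

# One class label per pixel: the decision at each location is driven by a
# large receptive field, i.e., by the context surrounding that pixel.
labels = logits.argmax(dim=1)             # [1, H, W]
```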

Driving scenes

Understanding the fine details in images collected from cameras onboard vehicles is extremely important for the development of the perception stack of an autonomous/assisted car. It is also a challenging task, due to variations in scenery and weather/illumination conditions, the cost of labeling, etc. At Vandal we have been working on these topics, developing tools that can help support research, like the IDDA dataset, which contains 105 different scenarios that differ in weather conditions, environment and camera point of view. We have also developed several algorithms, with a focus on robustness across domains (Pixel-by-Pixel Cross-Domain Alignment for Few-Shot Semantic Segmentation) and on handling anomalies (Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation).

Figure 2. IDDA offers 105 scenarios of driving scenes, varying in weather conditions, environments and points of view of the camera (on a car, jeep, minivan, ...). Each RGB image is accompanied by a semantic mask and a depth image.
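
As a point of reference for the anomaly setting, the sketch below shows the common maximum-softmax-probability baseline, which scores each pixel by how poorly it fits every known class. It is only a generic baseline for illustration; Mask2Anomaly itself is built on a mask-transformer formulation rather than this per-pixel score.

```python
import torch
import torch.nn.functional as F

def anomaly_map(logits: torch.Tensor) -> torch.Tensor:
    """logits: [N, num_classes, H, W] from any semantic segmentation model.

    Returns an [N, H, W] map where higher values mean the pixel is poorly
    explained by every known (in-distribution) class.
    """
    probs = F.softmax(logits, dim=1)
    confidence, _ = probs.max(dim=1)   # confidence in the most likely known class
    return 1.0 - confidence            # low confidence -> likely anomaly

# Usage: threshold the map to flag unexpected objects on the road,
# e.g. mask = anomaly_map(logits) > 0.5 (the threshold is a free parameter).
```
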
Aerial scenes

Aerial imagery contains a wealth of fine-grained information that can be extremely valuable, e.g., to monitor urban development or support environmental monitoring. Unlike in driving scenes, where the street is always at the bottom and the sky on top, in aerial imagery the model cannot rely on a fixed semantic structure of the scene. Moreover, the relative scale of different visual elements can vary greatly. We have developed solutions that address the domain shift problem in aerial segmentation and that are tailored to the specificities of these scenes.

Figure 3. Some of the solutions that we have developed specifically for aerial segmentation. Left: Augmentation Invariance and Adaptive Sampling in Semantic Segmentation of Agricultural Aerial Images. Right: Hierarchical Instance Mixing Across Domains in Aerial Segmentation.
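
To give a flavor of how the lack of a canonical orientation in aerial scenes can be exploited, the sketch below shows a generic consistency term that asks predictions to be equivariant to 90-degree rotations. It is an illustrative sketch under that assumption, not the exact loss used in the augmentation-invariance paper above.

```python
import torch
import torch.nn.functional as F

def rotation_consistency_loss(model, images: torch.Tensor) -> torch.Tensor:
    """images: [N, 3, H, W]; model returns per-pixel logits [N, C, H, W]."""
    k = int(torch.randint(1, 4, (1,)))             # random multiple of 90 degrees
    rotated = torch.rot90(images, k, dims=(2, 3))  # aerial semantics are unchanged

    logits = model(images)
    logits_rotated = model(rotated)

    # Rotate the prediction on the original view and ask the two views to agree.
    target = torch.rot90(logits, k, dims=(2, 3)).detach()
    return F.mse_loss(logits_rotated, target)
```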

Related Publications

2024

  1. Journal
    Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation
    Shyam Nandan Rai, Fabio Cermelli, Barbara Caputo, and Carlo Masone
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2023

  1. Journal
    Hierarchical Instance Mixing Across Domains in Aerial Segmentation
    Edoardo Arnaudo, Antonio Tavera, Carlo Masone, Fabrizio Dominici, and Barbara Caputo
    IEEE Access, 2023
  2. Conference
    Unmasking Anomalies in Road-Scene Segmentation
    Shyam Nandan Rai, Fabio Cermelli, Dario Fontanel, Carlo Masone, and Barbara Caputo
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
  3. Workshop
    The Robust Semantic Segmentation UNCV2023 Challenge Results
    Xuanlong Yu, Yi Zuo, Zitao Wang, Xiaowen Zhang, Jiaxuan Zhao, Yuting Yang, Licheng Jiao, Rui Peng, Xinyi Wang, Junpei Zhang, Kexin Zhang, Fang Liu, Roberto Alcover-Couso, Juan C. SanMiguel, Marcos Escudero-Viñolo, Hanlin Tian, Kenta Matsui, Tianhao Wang, Fahmy Adan, Zhitong Gao, Xuming He, Quentin Bouniot, Hossein Moghaddam, Shyam Nandan Rai, Fabio Cermelli, Carlo Masone, Andrea Pilzer, Elisa Ricci, Andrei Bursuc, Arno Solin, Martin Trapp, Rui Li, Angela Yao, Wenlong Chen, Ivor Simpson, Neill D. F. Campbell, and Gianni Franchi
    In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023

2022

  1. Workshop
    Augmentation Invariance and Adaptive Sampling in Semantic Segmentation of Agricultural Aerial Images
    A. Tavera, E. Arnaudo, C. Masone, and B. Caputo
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022
  2. Conference
    Learning Semantics for Visual Place Recognition through Multi-Scale Attention
    V. Paolicelli, A. Tavera, G. Berton, C. Masone, and B. Caputo
    In International Conference on Image Analysis and Processing (ICIAP), 2022
  3. Conference
    Pixel-by-Pixel Cross-Domain Alignment for Few-Shot Semantic Segmentation
    A. Tavera, F. Cermelli, C. Masone, and B. Caputo
    In IEEE Winter Conference on Applications of Computer Vision (WACV), 2022

2021

  1. Conference
    Reimagine BiSeNet for Real-Time Domain Adaptation in Semantic Segmentation
    A. Tavera, C. Masone, and B. Caputo
    In Proceedings of the I-RIM 2021 Conference, 2021

2020

  1. Journal
    IDDA: A Large-Scale Multi-Domain Dataset for Autonomous Driving
    E. Alberti, A. Tavera, C. Masone, and B. Caputo
    IEEE Robotics and Automation Letters, 2020