Segmenting unknown or anomalous object instances is a critical task in autonomous driving applications, and it is approached traditionally as a per-pixel classification problem. However, reasoning individually about each pixel without considering their contextual semantics results in high uncertainty around the objects’ boundaries and numerous false positives. We propose a paradigm change by shifting from a per-pixel classification to a mask classification. Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating a mask-classification architecture to jointly address anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation. Mask2Anomaly includes several technical novelties that are designed to improve the detection of anomalies/unknown objects: i) a global masked attention module to focus individually on the foreground and background regions; ii) a mask contrastive learning that maximizes the margin between an anomaly and known classes; iii) a mask refinement solution to reduce false positives; and iv) a novel approach to mine unknown instances based on the mask-architecture properties. Through comprehensive quantitative and qualitative evaluation, we show Mask2Anomaly achieves new state-of-the-art results across the benchmarks of anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation.
Conference
MeshVPR: Citywide Visual Place Recognition Using 3D Meshes
Gabriele Berton, Lorenz Junglas, Riccardo Zaccone, Thomas Pollok, Barbara Caputo, and Carlo Masone
In European Conference on Computer Vision (ECCV), 2024
Mesh-based scene representation offers a promising direction for simplifying large-scale hierarchical visual localization pipelines, combining a visual place recognition step based on global features (retrieval) and a visual localization step based on local features. While existing work demonstrates the viability of meshes for visual localization, the impact of using synthetic databases rendered from them in visual place recognition remains largely unexplored. In this work we investigate using dense 3D textured meshes for large-scale Visual Place Recognition (VPR). We identify a significant performance drop when using synthetic mesh-based image databases compared to real-world images for retrieval. To address this, we propose MeshVPR, a novel VPR pipeline that utilizes a lightweight feature alignment framework to bridge the gap between real-world and synthetic domains. MeshVPR leverages pre-trained VPR models and is efficient and scalable for city-wide deployments. We introduce novel datasets with freely available 3D meshes and manually collected queries from Berlin, Paris, and Melbourne. Extensive evaluations demonstrate that MeshVPR achieves competitive performance with standard VPR pipelines, paving the way for mesh-based localization systems.
Conference
Scale-Free Image Keypoints Using Differentiable Persistent Homology
Giovanni Barbarani, Francesco Vaccarino, Gabriele Trivigno, Marco Guerra, Gabriele Berton, and Carlo Masone
In International Conference on Machine Learning (ICML), 2024
In computer vision, keypoint detection is a fundamental task, with applications spanning from robotics to image retrieval; however, existing learning-based methods suffer from scale dependency and lack flexibility. This paper introduces a novel approach that leverages Morse theory and persistent homology, powerful tools rooted in algebraic topology. We propose a novel loss function based on the recent introduction of a notion of subgradient in persistent homology, paving the way toward topological learning. Our detector, MorseDet, is the first topology-based learning model for feature detection, which achieves competitive performance in keypoint repeatability and introduces a principled and theoretically robust approach to the problem.
Workshop
EarthMatch: Iterative Coregistration for Fine-grained Localization of Astronaut Photography
Gabriele Berton, Gabriele Goletto, Gabriele Trivigno, Alex Stoken, Barbara Caputo, and Carlo Masone
In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun 2024
Precise, pixel-wise geolocalization of astronaut photography is critical to unlocking the potential of this unique type of remotely sensed Earth data, particularly for its use in disaster management and climate change research. Recent works have established the Astronaut Photography Localization task, but have either proved too costly for mass deployment or generated too coarse a localization. Thus, we present EarthMatch, an iterative homography estimation method that produces fine-grained localization of astronaut photographs while maintaining an emphasis on speed. We refocus the astronaut photography benchmark, AIMS, on the geolocalization task itself, and prove our method’s efficacy on this dataset. In addition, we offer a new, fair method for image matcher comparison, and an extensive evaluation of different matching models within our localization pipeline. Our method will enable fast and accurate localization of the 4.5 million and growing collection of astronaut photography of Earth.
Journal
Distributed training of CosPlace for large-scale visual place recognition
Riccardo Zaccone, Gabriele Berton, and Carlo Masone
Visual place recognition (VPR) is a popular computer vision task aimed at recognizing the geographic location of a visual query, usually within a tolerance of a few meters. Modern approaches address VPR from an image retrieval standpoint using a kNN on top of embeddings extracted by a deep neural network from both the query and images in a database. Although most of these approaches rely on contrastive learning, which limits their ability to be trained on large-scale datasets (due to mining), the recently reported CosPlace proposes an alternative training paradigm using a classification task as the proxy. This has been shown to be effective in expanding the potential of VPR models to learn from large-scale and fine-grained datasets. In this work, we experimentally analyze CosPlace from a continual learning perspective and show that its sequential training procedure leads to suboptimal results. As a solution, we propose a different formulation that not only solves the pitfalls of the original training strategy effectively but also enables faster and more efficient distributed training. Finally, we discuss the open challenges in further speeding up large-scale image retrieval for VPR.
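The retrieval formulation mentioned in the abstract above (a kNN search over embeddings of the query and the database images) can be sketched as follows. This is only a minimal illustration, assuming cosine similarity on L2-normalized descriptors; the function names and the toy descriptors are hypothetical and are not taken from CosPlace or any other codebase.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_retrieve(query, database, k=1):
    """Return the indices of the k database descriptors most similar to the query.

    Similarity is the cosine similarity, i.e. the dot product of
    L2-normalized descriptors. Ties are broken by database index.
    """
    q = l2_normalize(query)
    scored = []
    for idx, desc in enumerate(database):
        d = l2_normalize(desc)
        sim = sum(a * b for a, b in zip(q, d))
        scored.append((sim, idx))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional descriptors (hypothetical values, for illustration only).
database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
top2 = knn_retrieve(query, database, k=2)
print(top2)
```

In a real system the descriptors would come from a neural network and the search would typically use an optimized nearest-neighbour library rather than a Python loop.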
Conference
The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement
Gabriele Trivigno, Carlo Masone, Barbara Caputo, and Torsten Sattler
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2024
Pose refinement is an interesting and practically relevant research direction. Pose refinement can be used (1) to obtain a more accurate pose estimate from an initial prior (e.g. from retrieval), (2) as pre-processing, i.e. to provide a better starting point to a more expensive pose estimator, or (3) as post-processing of a more accurate localizer. Existing approaches focus on learning features / scene representations for the pose refinement task. This involves training an implicit scene representation or learning features while optimizing a camera pose-based loss. A natural question is whether training specific features / representations is truly necessary or whether similar results can already be achieved with more generic features. In this work we present a simple approach that combines pre-trained features with a particle filter and a renderable representation of the scene. Despite its simplicity, it achieves state-of-the-art results, demonstrating that one can easily build a pose refiner without the need for specific training.
Workshop
Collaborative Visual Place Recognition through Federated Learning
Mattia Dutto, Gabriele Berton, Debora Caldarola, Eros Fanı̀, Gabriele Trivigno, and Carlo Masone
In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun 2024
Visual Place Recognition (VPR) aims to estimate the location of an image by treating it as a retrieval problem. VPR uses a database of geo-tagged images and leverages deep neural networks to extract a global representation, called a descriptor, from each image. While the training data for VPR models often originates from diverse, geographically scattered sources (geo-tagged images), the training process itself is typically assumed to be centralized. This research revisits the task of VPR through the lens of Federated Learning (FL), addressing several key challenges associated with this adaptation. VPR data inherently lacks well-defined classes, and models are typically trained using contrastive learning, which necessitates a data mining step on a centralized database. Additionally, client devices in federated systems can be highly heterogeneous in terms of their processing capabilities. The proposed FedVPR framework not only presents a novel approach for VPR but also introduces a new, challenging, and realistic task for FL research. This has the potential to spur the application of FL to other image retrieval tasks.
Conference
Earthloc: Astronaut photography localization by indexing earth from space
Gabriele Berton, Alex Stoken, Barbara Caputo, and Carlo Masone
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2024
Astronaut photography, spanning six decades of human spaceflight, presents a unique Earth observations dataset with immense value for both scientific research and disaster response. Despite its significance, accurately localizing the geographical extent of these images, crucial for effective utilization, poses substantial challenges. Current manual localization efforts are time-consuming, motivating the need for automated solutions. We propose a novel approach - leveraging image retrieval - to address this challenge efficiently. We introduce innovative training techniques, including Year-Wise Data Augmentation and a Neutral-Aware Multi-Similarity Loss, which contribute to the development of a high-performance model, EarthLoc. We develop six evaluation datasets and perform a comprehensive benchmark comparing EarthLoc to existing methods, showcasing its superior efficiency and accuracy. Our approach marks a significant advancement in automating the localization of astronaut photography, which will help bridge a critical gap in Earth observations data.
Journal
JIST: Joint Image and Sequence Training for Sequential Visual Place Recognition
Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone
Visual Place Recognition aims at recognizing previously visited places by relying on visual clues, and it is used in robotics applications for SLAM and localization. Since typically a mobile robot has access to a continuous stream of frames, this task is naturally cast as a sequence-to-sequence localization problem. Nevertheless, obtaining sequences of labelled data is much more expensive than collecting isolated images, which can be done in an automated way with little supervision. As a mitigation to this problem, we propose a novel Joint Image and Sequence Training protocol (JIST) that leverages large uncurated sets of images through a multi-task learning framework. With JIST we also introduce SeqGeM, an aggregation layer that revisits the popular GeM pooling to produce a single robust and compact embedding from a sequence of single-frame embeddings. We show that our model is able to outperform previous state of the art while being faster, using 8 times smaller descriptors, having a lighter architecture, and supporting sequences of various lengths.
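To make the idea of pooling a sequence of single-frame embeddings into one descriptor more concrete, the sketch below applies standard generalized-mean (GeM) pooling along the frame axis. It is an illustration under stated assumptions and not the SeqGeM layer itself, whose details are not given in the abstract; the function name, the fixed exponent `p`, and the toy values are hypothetical.

```python
def gem_pool(frame_embeddings, p=3.0, eps=1e-6):
    """Generalized-mean pool a list of equal-length embeddings into one.

    For each dimension j the output is (mean_i x_ij ** p) ** (1 / p).
    Values are clamped to eps first, as GeM assumes non-negative inputs.
    p = 1 gives average pooling; large p approaches max pooling.
    """
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    pooled = []
    for j in range(dim):
        mean_pow = sum(max(e[j], eps) ** p for e in frame_embeddings) / n
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


# A toy "sequence" of two 2-dimensional frame embeddings (hypothetical values).
seq = [[1.0, 2.0], [3.0, 2.0]]
avg_like = gem_pool(seq, p=1.0)  # behaves like a plain average
gem3 = gem_pool(seq, p=3.0)      # weights larger activations more
print(avg_like, gem3)
```

Because the pooling runs over however many frames are supplied, the output size does not depend on the sequence length.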
2023
Preprint
Communication-Efficient Heterogeneous Federated Learning with Generalized Heavy-Ball Momentum
Federated Learning (FL) has emerged as the state-of-the-art approach for learning from decentralized data in privacy-constrained scenarios. However, system and statistical challenges hinder real-world applications, which demand efficient learning from edge devices and robustness to heterogeneity. Despite significant research efforts, existing approaches (i) are not sufficiently robust, (ii) do not perform well in large-scale scenarios, and (iii) are not communication efficient. In this work, we propose a novel Generalized Heavy-Ball Momentum (GHBM), motivating its principled application to counteract the effects of statistical heterogeneity in FL. Then, we present FedHBM as an adaptive, communication-efficient by-design instance of GHBM. Extensive experimentation on vision and language tasks, in both controlled and realistic large-scale scenarios, provides compelling evidence of substantial and consistent performance gains over the state of the art.
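For readers unfamiliar with the term, the classical (non-generalized, non-federated) heavy-ball momentum update can be sketched as below. This is the textbook single-machine rule, shown only as background; it is not GHBM or FedHBM, and the function name, step size, momentum coefficient, and toy objective are all assumptions chosen for illustration.

```python
def heavy_ball(grad, x0, lr=0.1, beta=0.5, steps=100):
    """Minimize a 1-D function with classical heavy-ball momentum.

    Update rule: x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1}).
    The momentum term reuses the previous displacement.
    """
    x_prev = x0
    x = x0
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_star = heavy_ball(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_star)
```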
Journal
Hierarchical Instance Mixing Across Domains in Aerial Segmentation
Edoardo Arnaudo, Antonio Tavera, Carlo Masone, Fabrizio Dominici, and Barbara Caputo
We investigate the task of unsupervised domain adaptation in aerial semantic segmentation, observing that there are some shortcomings in the class mixing strategies used by the recent state-of-the-art methods that tackle this task: 1) they do not account for the large disparity in the extension of the semantic categories that is common in the aerial setting, which causes a domain imbalance in the mixed image; 2) they do not consider that aerial scenes have a weaker structural consistency in comparison to the driving scenes for which the mixing technique was originally proposed, which causes the mixed images to have elements placed out of their natural context; 3) the source model used to generate the pseudo-labels may be susceptible to perturbations across domains, which causes inconsistent predictions on the target images and can jeopardize the mixing strategy. We address these shortcomings with a novel aerial semantic segmentation framework for UDA, named HIUDA, which is composed of two main technical novelties: firstly, a new mixing strategy for aerial segmentation across domains called Hierarchical Instance Mixing (HIMix), which extracts a set of connected components from each semantic mask and mixes them according to a semantic hierarchy, and secondly, a twin-head architecture in which two separate segmentation heads are fed with variations of the same images in a contrastive fashion to produce finer segmentation maps. We conduct extensive experiments on the LoveDA benchmark, where our solution outperforms the current state-of-the-art.
Conference
Unmasking Anomalies in Road-Scene Segmentation
Shyam Nandan Rai, Fabio Cermelli, Dario Fontanel, Carlo Masone, and Barbara Caputo
In IEEE/CVF International Conference on Computer Vision (ICCV), Jun 2023
Within the top 2.3% of submitted papers, and the top 9% of accepted papers.
Anomaly segmentation is a critical task for driving applications, and it is approached traditionally as a per-pixel classification problem. However, reasoning individually about each pixel without considering their contextual semantics results in high uncertainty around the objects’ boundaries and numerous false positives. We propose a paradigm change by shifting from a per-pixel classification to a mask classification. Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating an anomaly detection method in a mask-classification architecture. Mask2Anomaly includes several technical novelties that are designed to improve the detection of anomalies in masks: i) a global masked attention module to focus individually on the foreground and background regions; ii) a mask contrastive learning that maximizes the margin between an anomaly and known classes; and iii) a mask refinement solution to reduce false positives. Mask2Anomaly achieves new state-of-the-art results across a range of benchmarks, both in the per-pixel and component-level evaluations. In particular, Mask2Anomaly reduces the average false positives rate by 60% w.r.t. the previous state-of-the-art.
Conference
Divide&Classify: Fine-Grained Classification for City-Wide Visual Place Recognition
Gabriele Trivigno, Gabriele Berton, Juan Aragon, Barbara Caputo, and Carlo Masone
In IEEE/CVF International Conference on Computer Vision (ICCV), Jun 2023
Visual Place Recognition is commonly addressed as an image retrieval problem. However, retrieval methods are impractical to scale to large datasets, densely sampled from city-wide maps, since their dimension impacts negatively on the inference time. Using approximate nearest neighbour search for retrieval helps to mitigate this issue, at the cost of a performance drop. In this paper we investigate whether we can effectively approach this task as a classification problem, thus bypassing the need for a similarity search. We find that existing classification methods for coarse, planet-wide localization are not suitable for the fine-grained and city-wide setting. This is largely due to how the dataset is split into classes, because these methods are designed to handle a sparse distribution of photos and as such do not consider the visual aliasing problem across neighbouring classes that naturally arises in dense scenarios. Thus, we propose a partitioning scheme that enables a fast and accurate inference, preserving a simple learning procedure, and a novel inference pipeline based on an ensemble of novel classifiers that uses the prototypes learned via an angular margin loss. Our method, Divide&Classify (D&C), enjoys the fast inference of classification solutions and an accuracy competitive with retrieval methods on the fine-grained, city-wide setting. Moreover, we show that D&C can be paired with existing retrieval pipelines to speed up computations by over 20 times while increasing their recall, leading to new state-of-the-art results.
Workshop
The Robust Semantic Segmentation UNCV2023 Challenge Results
Xuanlong Yu, Yi Zuo, Zitao Wang, Xiaowen Zhang, Jiaxuan Zhao, Yuting Yang, Licheng Jiao, Rui Peng, Xinyi Wang, Junpei Zhang, Kexin Zhang, Fang Liu, Roberto Alcover-Couso, Juan C. SanMiguel, Marcos Escudero-Viñolo, Hanlin Tian, Kenta Matsui, Tianhao Wang, Fahmy Adan, Zhitong Gao, Xuming He, Quentin Bouniot, Hossein Moghaddam, Shyam Nandan Rai, Fabio Cermelli, Carlo Masone, Andrea Pilzer, Elisa Ricci, Andrei Bursuc, Arno Solin, Martin Trapp, Rui Li, Angela Yao, Wenlong Chen, Ivor Simpson, Neill D. F. Campbell, and Gianni Franchi
In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Jun 2023
This paper outlines the winning solutions employed in addressing the MUAD uncertainty quantification challenge held at ICCV 2023. The challenge was centered around semantic segmentation in urban environments, with a particular focus on natural adversarial scenarios. The report presents the results of 19 submitted entries, with numerous techniques drawing inspiration from cutting-edge uncertainty quantification methodologies presented at prominent conferences and other venues in the fields of computer vision and machine learning over the past few years. Within this document, the challenge is introduced, shedding light on its purpose and objectives, which primarily revolved around enhancing the robustness of semantic segmentation in urban scenes under varying natural adversarial conditions. The report then delves into the top-performing solutions. Moreover, the document aims to provide a comprehensive overview of the diverse solutions deployed by all participants. By doing so, it seeks to offer readers a deeper insight into the array of strategies that can be leveraged to effectively handle the inherent uncertainties associated with autonomous driving and semantic segmentation, especially within urban environments.
Conference
EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition
Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone
In IEEE/CVF International Conference on Computer Vision (ICCV), Jun 2023
Visual Place Recognition is a task that aims to predict the place of an image (called query) based solely on its visual features. This is typically done through image retrieval, where the query is matched to the most similar images from a large database of geotagged photos, using learned global descriptors. A major challenge in this task is recognizing places seen from different viewpoints. To overcome this limitation, we propose a new method, called EigenPlaces, to train our neural network on images from different points of view, which embeds viewpoint robustness into the learned global descriptors. The underlying idea is to cluster the training data so as to explicitly present the model with different views of the same points of interest. The selection of these points of interest is done without the need for extra supervision. We then present experiments on the most comprehensive set of datasets in literature, finding that EigenPlaces is able to outperform previous state of the art on the majority of datasets, while requiring 60% less GPU memory for training and using 50% smaller descriptors. The code and trained models for EigenPlaces are available at https://github.com/gmberton/EigenPlaces, while results with any other baseline can be computed with the codebase at https://github.com/gmberton/auto_VPR.
Workshop
Are Local Features All You Need for Cross-Domain Visual Place Recognition?
Giovanni Barbarani, Mohamad Mostafa, Hajali Bayramov, Gabriele Trivigno, Gabriele Berton, Carlo Masone, and Barbara Caputo
In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun 2023
Visual Place Recognition is a task that aims to predict the coordinates of an image (called query) based solely on visual clues. Most commonly, a retrieval approach is adopted, where the query is matched to the most similar images from a large database of geotagged photos, using learned global descriptors. Despite recent advances, recognizing the same place when the query comes from a significantly different distribution is still a major hurdle for state of the art retrieval methods. Examples are heavy illumination changes (e.g. night-time images) or substantial occlusions (e.g. transient objects). In this work we explore whether re-ranking methods based on spatial verification can tackle these challenges, following the intuition that local descriptors are inherently more robust than global features to domain shifts. To this end, we provide a new, comprehensive benchmark on current state of the art models. We also introduce two new demanding datasets with night and occluded queries, to be matched against a city-wide database. Code and datasets are available at https://github.com/gbarbarani/re-ranking-for-VPR.
2022
Journal
Learning Sequential Descriptors for Sequence-Based Visual Place Recognition
Riccardo Mereu, Gabriele Trivigno, Gabriele Berton, Carlo Masone, and Barbara Caputo
In robotics, visual place recognition (VPR) is a continuous process that receives as input a video stream to produce a hypothesis of the robot’s current position within a map of known places. This work proposes a taxonomy of the architectures used to learn sequential descriptors for VPR, highlighting different mechanisms to fuse the information from the individual images. This categorization is supported by a complete benchmark of experimental results that provides evidence of the strengths and weaknesses of these different architectural choices. The analysis is not limited to existing sequential descriptors, but we extend it further to investigate the viability of Transformers instead of CNN backbones. We further propose a new ad-hoc sequence-level aggregator called SeqVLAD, which outperforms prior state of the art on different datasets. The code is available at https://github.com/vandal-vpr/vg-transformers.
Journal
Adaptive-Attentive Geolocalization From Few Queries: A Hybrid Approach
Valerio Paolicelli, Gabriele Berton, Francesco Montagna, Carlo Masone, and Barbara Caputo
We tackle the task of cross-domain visual geo-localization, where the goal is to geo-localize a given query image against a database of geo-tagged images, in the case where the query and the database belong to different visual domains. In particular, at training time, we consider having access to only a few unlabeled queries from the target domain. To adapt our deep neural network to the database distribution, we rely on a 2-fold domain adaptation technique, based on a hybrid generative-discriminative approach. To further enhance the architecture, and to ensure robustness across domains, we employ a novel attention layer that can easily be plugged into existing architectures. Through a large number of experiments, we show that this adaptive-attentive approach makes the model robust to large domain shifts, such as unseen cities or weather conditions. Finally, we propose a new large-scale dataset for cross-domain visual geo-localization, called SVOX.
Workshop
Augmentation Invariance and Adaptive Sampling in Semantic Segmentation of Agricultural Aerial Images
A. Tavera, E. Arnaudo, C. Masone, and B. Caputo
In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun 2022
In this paper, we investigate the problem of Semantic Segmentation for agricultural aerial imagery. We observe that the existing methods used for this task are designed without considering two characteristics of the aerial data: (i) the top-down perspective implies that the model cannot rely on a fixed semantic structure of the scene, because the same scene may be experienced with different rotations of the sensor; (ii) there can be a strong imbalance in the distribution of semantic classes because the relevant objects of the scene may appear at extremely different scales (e.g., a field of crops and a small vehicle). We propose a solution to these problems based on two ideas: (i) we combine a set of suitable augmentations with a consistency loss to guide the model to learn semantic representations that are invariant to the photometric and geometric shifts typical of the top-down perspective (Augmentation Invariance); (ii) we use a sampling method (Adaptive Sampling) that selects the training images based on a measure of pixel-wise distribution of classes and actual network confidence. With an extensive set of experiments conducted on the Agriculture-Vision dataset, we demonstrate that our proposed strategies improve the performance of the current state-of-the-art method.
Conference
Learning Semantics for Visual Place Recognition through Multi-Scale Attention
V. Paolicelli, A. Tavera, G. Berton, C. Masone, and B. Caputo
In International Conference on Image Analysis and Processing (ICIAP), Jun 2022
In this paper we address the task of visual place recognition (VPR), where the goal is to retrieve the correct GPS coordinates of a given query image against a huge geotagged gallery. While recent works have shown that building descriptors incorporating semantic and appearance information is beneficial, current state-of-the-art methods opt for a top down definition of the significant semantic content. Here we present the first VPR algorithm that learns robust global embeddings from both visual appearance and semantic content of the data, with the segmentation process being dynamically guided by the recognition of places through a multi-scale attention module. Experiments on various scenarios validate this new approach and demonstrate its performance against state-of-the-art methods. Finally, we propose the first synthetic-world dataset suited for both place recognition and segmentation tasks.
Conference
Rethinking Visual Geo-localization for Large-Scale Applications
G. Berton, C. Masone, and B. Caputo
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022
Visual Geo-localization (VG) is the task of estimating the position where a given photo was taken by comparing it with a large database of images of known locations. To investigate how existing techniques would perform on a real-world city-wide VG application, we build San Francisco eXtra Large, a new dataset covering a whole city and providing a wide range of challenging cases, with a size 30x bigger than the previous largest dataset for visual geo-localization. We find that current methods fail to scale to such large datasets, therefore we design a new highly scalable training technique, called CosPlace, which casts the training as a classification problem avoiding the expensive mining needed by the commonly used contrastive learning. We achieve state-of-the-art performance on a wide range of datasets and find that CosPlace is robust to heavy domain changes. Moreover, we show that, compared to the previous state-of-the-art, CosPlace requires roughly 80% less GPU memory at train time, and it achieves better results with 8x smaller descriptors, paving the way for city-wide real-world visual geo-localization.
Conference
Deep Visual Geo-localization Benchmark
G. Berton, R. Mereu, G. Trivigno, C. Masone, G. Csurka, T. Sattler, and B. Caputo
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022
Within the top 2% of submitted papers, and the top 7.8% of accepted papers.
In this paper, we propose a new open-source benchmarking framework for Visual Geo-localization (VG) that allows one to build, train, and test a wide range of commonly used architectures, with the flexibility to change individual components of a geo-localization pipeline. The purpose of this framework is twofold: i) gaining insights into how different components and design choices in a VG pipeline impact the final results, both in terms of performance (recall@N metric) and system requirements (such as execution time and memory consumption); ii) establishing a systematic evaluation protocol for comparing different methods. Using the proposed framework, we perform a large suite of experiments which provide criteria for choosing backbone, aggregation and negative mining depending on the use-case and requirements. We also assess the impact of engineering techniques like pre/post-processing, data augmentation and image resizing, showing that better performance can be obtained through somewhat simple procedures: for example, downscaling the images’ resolution to 80% can lead to similar results with a 36% savings in extraction time and dataset storage requirement.
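The recall@N metric mentioned in the abstract above is commonly computed as the fraction of queries for which at least one of the top-N retrieved database images is a correct match. The sketch below follows that common definition; it is not taken from the benchmark's code, and the function name, the data layout, and the toy predictions are hypothetical.

```python
def recall_at_n(ranked_predictions, positives, n):
    """Fraction of queries with at least one correct match in the top n.

    ranked_predictions: for each query, database indices sorted from
        most to least similar.
    positives: for each query, the set of database indices that count
        as correct (e.g. images within a distance threshold).
    """
    hits = 0
    for ranked, pos in zip(ranked_predictions, positives):
        if any(idx in pos for idx in ranked[:n]):
            hits += 1
    return hits / len(ranked_predictions)


# Three toy queries with hypothetical retrieval results and ground truth.
preds = [[4, 1, 7], [2, 0, 5], [9, 8, 3]]
gts = [{1}, {2, 6}, {0}]
r1 = recall_at_n(preds, gts, 1)
r3 = recall_at_n(preds, gts, 3)
print(r1, r3)
```

Here only the second query has a correct match at rank 1, while the first query is recovered once the top 3 results are considered.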
Conference
Pixel-by-Pixel Cross-Domain Alignment for Few-Shot Semantic Segmentation
A. Tavera, F. Cermelli, C. Masone, and B. Caputo
In IEEE Winter Conference on Applications of Computer Vision (WACV), Jun 2022
In this paper we consider the task of semantic segmentation in autonomous driving applications. Specifically, we consider the cross-domain few-shot setting where training can use only a few real-world annotated images and many annotated synthetic images. In this context, aligning the domains is made more challenging by the pixel-wise class imbalance that is intrinsic in the segmentation and that leads to ignoring the underrepresented classes and overfitting the well represented ones. We address this problem with a novel framework called Pixel-By-Pixel Cross-Domain Alignment (PixDA). We propose a novel pixel-by-pixel domain adversarial loss following three criteria: (i) align the source and the target domain for each pixel, (ii) avoid negative transfer on the correctly represented pixels, and (iii) regularize the training of infrequent classes to avoid overfitting. The pixel-wise adversarial training is assisted by a novel sample selection procedure that handles the imbalance between source and target data, and a knowledge distillation strategy that avoids overfitting towards the few target images. We demonstrate on standard synthetic-to-real benchmarks that PixDA outperforms previous state-of-the-art methods in (1-5)-shot settings.
2021
Conference
Adaptive-Attentive Geolocalization from few queries: a hybrid approach
G. Moreno Berton, V. Paolicelli, C. Masone, and B. Caputo
In IEEE Winter Conference on Applications of Computer Vision (WACV), Jun 2021
We address the task of cross-domain visual place recognition, where the goal is to geolocalize a given query image against a labeled gallery, in the case where the query and the gallery belong to different visual domains. To achieve this, we focus on building a domain-robust deep network by leveraging an attention mechanism combined with few-shot unsupervised domain adaptation techniques, where we use a small number of unlabeled target domain images to learn about the target distribution. With our method, we are able to outperform the current state of the art while using two orders of magnitude fewer target domain images. Finally, we propose a new large-scale dataset for cross-domain visual place recognition, called SVOX.
Conference
Reimagine BiSeNet for Real-Time Domain Adaptation in Semantic Segmentation
A. Tavera, C. Masone, and B. Caputo
In Proceedings of the I-RIM 2021 Conference, Jun 2021
Semantic segmentation models have reached remarkable performance across various tasks. However, this performance is achieved with extremely large models, using powerful computational resources and without considering training and inference time. Real-world applications, on the other hand, necessitate models with minimal memory demands and efficient inference speed that can run on low-resource embedded devices, such as those on self-driving vehicles. In this paper, we look at the challenge of real-time semantic segmentation across domains, and we train a model to act appropriately on real-world data even though it was trained on a synthetic realm. We employ a new lightweight and shallow discriminator that was specifically created for this purpose. To the best of our knowledge, we are the first to present a real-time adversarial approach for assessing the domain adaptation problem in semantic segmentation. We tested our framework on the two standard protocols: GTA5 to Cityscapes and SYNTHIA to Cityscapes.
Conference
Viewpoint Invariant Dense Matching for Visual Geolocalization
G. Berton, C. Masone, V. Paolicelli, and B. Caputo
In IEEE/CVF International Conference on Computer Vision (ICCV), Jun 2021
In this paper we propose a novel method for image matching based on dense local features and tailored for visual geolocalization. Dense local features matching is robust against changes in illumination and occlusions, but not against viewpoint shifts which are a fundamental aspect of geolocalization. Our method, called GeoWarp, directly embeds invariance to viewpoint shifts in the process of extracting dense features. This is achieved via a trainable module which learns from the data an invariance that is meaningful for the task of recognizing places. We also devise a new self-supervised loss and two new weakly supervised losses to train this module using only unlabeled data and weak labels. GeoWarp is implemented efficiently as a re-ranking method that can be easily embedded into pre-existing visual geolocalization pipelines. Experimental validation on standard geolocalization benchmarks demonstrates that GeoWarp boosts the accuracy of state-of-the-art retrieval architectures.
Journal
Shared Control of an Aerial Cooperative Transportation System with a Cable-suspended Payload
C. Masone and P. Stegagno
Journal of Intelligent & Robotic Systems, Jun 2021
This paper presents a novel bilateral shared framework for a cooperative aerial transportation and manipulation system composed of a team of micro aerial vehicles with a cable-suspended payload. The human operator is in charge of steering the payload and can also change online the desired shape of the formation of robots. At the same time, an obstacle avoidance algorithm is in charge of avoiding collisions with the static environment. The signals from the user and from the obstacle avoidance are blended together in the trajectory generation module, by means of a tracking controller and a filter called dynamic input boundary (DIB). The DIB filters out the directions of motion that would bring the system too close to singularities, according to a suitable metric. The loop with the user is finally closed with a force feedback that is informative of the mismatch between the operator’s commands and the trajectory of the payload. This feedback intuitively increases the user’s awareness of obstacles or configurations of the system that are close to singularities. The proposed framework is validated by means of realistic hardware-in-the-loop simulations with a person operating the system via a force-feedback haptic interface.
In recent years, visual place recognition (VPR), i.e., the problem of recognizing the location of images, has received considerable attention from multiple research communities, spanning from computer vision to robotics and even machine learning. This interest is fueled on one hand by the relevance that visual place recognition holds for many applications and on the other hand by the unsolved challenge of making these methods perform reliably in different conditions and environments. This paper presents a survey of the state of the art of research on visual place recognition, focusing on how it has been shaped by the recent advances in deep learning. We start by discussing the image representations used in this task and how they have evolved from hand-crafted to deep-learned features. We further review how metric learning techniques are used to get more discriminative representations, as well as techniques for dealing with occlusions, distractors, and shifts in the visual domain of the images. The survey also provides an overview of the specific solutions that have been proposed for applications in robotics and with aerial imagery. Finally, the survey provides a summary of datasets that are used in visual place recognition, highlighting their different characteristics.
2020
Journal
IDDA: A Large-Scale Multi-Domain Dataset for Autonomous Driving
Semantic segmentation is key in autonomous driving. Using deep visual learning architectures is not trivial in this context, because of the challenges in creating suitable large-scale annotated datasets. This issue has been traditionally circumvented through the use of synthetic datasets, which have become a popular resource in this field. They have been released with the need to develop semantic segmentation algorithms able to close the visual domain shift between the training and test data. Although exacerbated by the use of artificial data, the problem is extremely relevant in this field even when training on real data. Indeed, weather conditions, viewpoint changes and variations in the city appearances can vary considerably from car to car, and even at test time for a single, specific vehicle. How to deal with domain adaptation in semantic segmentation, and how to leverage effectively several different data distributions (source domains) are important research questions in this field. To support work in this direction, this letter contributes a new large-scale synthetic dataset for semantic segmentation with more than 100 different source visual domains. The dataset has been created to explicitly address the challenges of domain shift between training and test data in various weather and viewpoint conditions, in seven different city types. Extensive benchmark experiments assess the dataset, showcasing open challenges for the current state of the art.
2018
Conference
Application of a Differentiator-Based Adaptive Super-Twisting Controller for a Redundant Cable-Driven Parallel Robot
C. Schenk, C. Masone, A. Pott, and Heinrich H. Bülthoff
In this paper we present preliminary experimental results of an Adaptive Super-Twisting Sliding-Mode Controller with time-varying gains for redundant Cable-Driven Parallel Robots. The sliding-mode controller is paired with a feed-forward action based on dynamics inversion. An exact sliding-mode differentiator is implemented to retrieve the velocity of the end-effector using only encoder measurements with the properties of finite-time convergence, robustness against perturbations and noise filtering. The platform used to validate the controller is a robot with eight cables and six degrees of freedom powered by 940W compact servo drives. The proposed experiment demonstrates the performance of the controller, finite-time convergence and robustness in tracking a trajectory while subject to external disturbances up to approximately 400% of the mass of the end-effector.
Journal
Shared planning and control for mobile robots with integral haptic feedback
C. Masone, M. Mohammadi, P. Robuffo Giordano, and A. Franchi
The International Journal of Robotics Research, Jun 2018
This paper presents a novel bilateral shared framework for online trajectory generation for mobile robots. The robot navigates along a dynamic path, represented as a B-spline, whose parameters are jointly controlled by a human supervisor and an autonomous algorithm. The human steers the reference (ideal) path by acting on the path parameters that are also affected, at the same time, by the autonomous algorithm to ensure: (i) collision avoidance, (ii) path regularity, and (iii) proximity to some points of interest. These goals are achieved by combining a gradient descent-like control action with an automatic algorithm that re-initializes the traveled path (replanning) in cluttered environments to mitigate the effects of local minima. The control actions of both the human and the autonomous algorithm are fused via a filter that preserves a set of local geometrical properties of the path to ease the tracking task of the mobile robot. The bilateral component of the interaction is implemented via a force feedback that accounts for both human and autonomous control actions along the whole path, thus providing information about the mismatch between the reference and traveled path in an integral sense. The proposed framework is validated by means of realistic simulations and actual experiments deploying a quadrotor unmanned aerial vehicle (UAV) supervised by a human operator acting via a force-feedback haptic interface. Finally, a user study is presented to validate the effectiveness of the proposed framework and the usefulness of the provided force cues.
2016
Conference
The CableRobot simulator large scale motion platform based on cable robot technology
P. Miermeister, M. Lächele, R. Boss, C. Masone, C. Schenk, J. Tesch, M. Kerger, H. Teufel, A. Pott, and H. H. Bülthoff
In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Jun 2016
Winner of the IROS JTCF Novel Technology Paper Award for Amusement Culture
This paper introduces the CableRobot simulator, which was developed at the Max Planck Institute for Biological Cybernetics in cooperation with the Fraunhofer Institute for Manufacturing Engineering and Automation IPA. The simulator is a completely novel approach to the design of motion simulation platforms insofar as it uses cables and winches for actuation instead of the rigid links known from hexapod simulators. This approach reduces the actuated mass, scales up the workspace significantly, and provides great flexibility to switch between system configurations in which the robot can be operated. The simulator will be used for studies in the field of human perception research and virtual reality applications. The paper discusses some of the issues arising from the usage of cables and provides a system overview regarding kinematics and system dynamics, as well as a brief introduction to possible application use cases.
Conference
Cooperative transportation of a payload using quadrotors: A reconfigurable cable-driven parallel robot
C. Masone, H. H. Bülthoff, and P. Stegagno
In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Jun 2016
This paper addresses the problem of cooperative aerial transportation of an object using a team of quadrotors. The approach presented to solve this problem accounts for the full dynamics of the system and is inspired by the literature on reconfigurable cable-driven parallel robots (RCDPR). Using the modelling convention of RCDPR, a direct relation is derived between the motion of the quadrotors and the motion of the payload. This relation makes explicit the available internal motion of the system, which can be used to automatically achieve additional tasks. The proposed method does not require specifying a priori the forces in the cables and uses a tension distribution algorithm to optimally distribute them among the robots. The presented framework is also suitable for online teleoperation. Physical simulations with a human-in-the-loop validate the proposed approach.
Conference
Modeling and analysis of cable vibrations for a cable-driven parallel robot
C. Schenk, C. Masone, P. Miermeister, and H. H. Bülthoff
In IEEE International Conference on Information and Automation (ICIA), Jun 2016
Best Paper award Finalist at the 2016 IEEE International Conference on Information and Automation (ICIA)
In this paper we study whether approximated linear models are accurate enough to predict the vibrations of a cable of a Cable-Driven Parallel Robot (CDPR) for different pretension levels. In two experiments we investigated the damping of a thick steel cable from the CableRobot simulator and measured the motion of the cable when a sinusoidal force is applied at one end of the cable. Using this setup and power spectral density analysis we measured the natural frequencies of the cable and compared these results to the frequencies predicted by two linear models: i) the linearization of partial differential equations of motion for a distributed cable, and ii) the discretization of the cable using a finite element model. This comparison provides remarkable insights into the limits of approximated linear models as well as important properties of vibrating cables used in CDPR.
Conference
Adaptive Super Twisting Controller for a quadrotor UAV
S. Rajappa, C. Masone, H. H. Bülthoff, and P. Stegagno
In IEEE International Conference on Robotics and Automation (ICRA), Jun 2016
In this paper we present a robust quadrotor controller for tracking a reference trajectory in presence of uncertainties and disturbances. A Super Twisting controller is implemented using the recently proposed gain adaptation law [1], [2], which has the advantage of not requiring the knowledge of the upper bound of the lumped uncertainties. The controller design is based on the regular form of the quadrotor dynamics, without separation in two nested control loops for position and attitude. The controller is further extended by a feedforward dynamic inversion control that reduces the effort of the sliding mode controller. The higher order quadrotor dynamic model and proposed controller are validated using a SimMechanics physical simulation with initial error, parameter uncertainties, noisy measurements and external perturbations.
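To make the Super Twisting idea in the abstract above more concrete, here is a minimal sketch of the standard fixed-gain super-twisting law applied to a scalar sliding variable s with dynamics ds/dt = u + d(t). It is not the adaptive-gain controller of the paper: the gains `k1` and `k2`, the disturbance `0.5*sin(t)`, the explicit Euler integration, and the function name `simulate_super_twisting` are all assumptions chosen only for illustration.

```python
import math


def sign(x):
    """Sign function returning -1.0, 0.0 or 1.0."""
    return (x > 0) - (x < 0)


def simulate_super_twisting(s0=1.0, k1=2.0, k2=2.0, dt=1e-3, t_end=10.0):
    """Simulate ds/dt = u + d(t) under the fixed-gain super-twisting law.

    u = -k1 * sqrt(|s|) * sign(s) + v,   dv/dt = -k2 * sign(s)

    The gains are fixed (an assumption of this sketch), whereas the paper
    adapts them online. d(t) = 0.5*sin(t) is a made-up smooth disturbance
    whose derivative is bounded by 0.5.
    """
    s, v, t = s0, 0.0, 0.0
    for _ in range(int(t_end / dt)):
        d = 0.5 * math.sin(t)                      # unknown to the controller
        u = -k1 * math.sqrt(abs(s)) * sign(s) + v  # continuous control signal
        s += dt * (u + d)                          # explicit Euler step of the plant
        v += dt * (-k2 * sign(s))                  # integral term absorbs the disturbance
        t += dt
    return s, v


final_s, final_v = simulate_super_twisting()
```

The discontinuous sign term is applied only through the integrator state `v`, which is why the control signal `u` itself stays continuous and chattering is reduced compared with a first-order sliding mode law.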
2015
Conference
Robust adaptive sliding mode control of a redundant cable driven parallel robot
C. Schenk, H. H. Bülthoff, and C. Masone
In 2015 19th International Conference on System Theory, Control and Computing (ICSTCC), Jun 2015
In this paper we consider the application problem of a redundant cable-driven parallel robot, tracking a reference trajectory in presence of uncertainties and disturbances. A Super Twisting controller is implemented using a recently proposed gains adaptation law [1], thus not requiring the knowledge of the upper bound of the lumped uncertainties. The controller is extended by a feedforward dynamic inversion control that reduces the effort of the sliding mode controller. Compared to a recently developed Adaptive Terminal Sliding Mode Controller for cable-driven parallel robots [2], the proposed controller manages to achieve lower tracking errors and less chattering in the actuation forces even in presence of perturbations. The system is implemented and tested in simulation using a model of a large redundant cable-driven robot and assuming noisy measurements. Simulations show the effectiveness of the proposed method.
2014
Conference
Semi-autonomous trajectory generation for mobile robots with integral haptic shared control
C. Masone, P. Robuffo Giordano, H. H. Bülthoff, and A. Franchi
In IEEE International Conference on Robotics and Automation (ICRA), Jun 2014
A new framework for semi-autonomous path planning for mobile robots that extends the classical paradigm of bilateral shared control is presented. The path is represented as a B-spline and the human operator can modify its shape by controlling the motion of a finite number of control points. An autonomous algorithm corrects in real time the human directives in order to facilitate path tracking for the mobile robot and ensures i) collision avoidance, ii) path regularity, and iii) attraction to nearby points of interest. A haptic feedback algorithm processes both the human’s and the autonomous control terms, and their integrals, to provide information about the mismatch between the path specified by the operator and the one corrected by the autonomous algorithm. The framework is validated with extensive experiments using a quadrotor UAV and a human in the loop with two haptic interfaces.
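The B-spline path representation described in the abstract above can be illustrated with a short sketch. It assumes a uniform cubic B-spline in the plane, which the abstract does not specify, and the function name `cubic_bspline_point` and the control points are hypothetical. The example shows the local-support property that makes control points a convenient handle for an operator: moving one control point changes only the nearby portion of the path.

```python
def cubic_bspline_point(ctrl, seg, t):
    """Evaluate a uniform cubic B-spline at local parameter t in [0, 1].

    ctrl: list of (x, y) control points.
    seg:  segment index; segment `seg` uses ctrl[seg] .. ctrl[seg + 3].
    The uniform cubic form is an assumption of this sketch.
    """
    # Uniform cubic B-spline basis functions (they sum to 1 for every t).
    basis = (
        (1 - t) ** 3 / 6,
        (3 * t**3 - 6 * t**2 + 4) / 6,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6,
        t**3 / 6,
    )
    pts = ctrl[seg:seg + 4]
    x = sum(b * p[0] for b, p in zip(basis, pts))
    y = sum(b * p[1] for b, p in zip(basis, pts))
    return x, y


# Seven collinear control points give four spline segments (0..3).
ctrl = [(float(i), 0.0) for i in range(7)]

# "Operator input": displace only the last control point.
moved = list(ctrl)
moved[6] = (6.0, 3.0)

start_before = cubic_bspline_point(ctrl, 0, 0.0)
start_after = cubic_bspline_point(moved, 0, 0.0)   # same as before: segment 0 ignores ctrl[6]
end_after = cubic_bspline_point(moved, 3, 1.0)     # last segment bends towards the moved point
```

Because each segment depends on only four consecutive control points, the start of the path is unaffected by the displacement while the end of the path deforms.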
2012
Journal
Modeling and Control of UAV Bearing Formations with Bilateral High-level Steering
A. Franchi, C. Masone, V. Grabe, M. Ryll, H. H. Bülthoff, and P. Robuffo Giordano
The International Journal of Robotics Research, Jun 2012
In this paper we address the problem of controlling the motion of a group of unmanned aerial vehicles (UAVs) bound to keep a formation defined in terms of only relative angles (i.e. a bearing formation). This problem can naturally arise within the context of several multi-robot applications such as, e.g. exploration, coverage, and surveillance. First, we introduce and thoroughly analyze the concept and properties of bearing formations, and provide a class of minimally linear sets of bearings sufficient to uniquely define such formations. We then propose a bearing-only formation controller requiring only bearing measurements, converging almost globally, and maintaining bounded inter-agent distances despite the lack of direct metric information. The controller still leaves the possibility of imposing group motions tangent to the current bearing formation. These can be either autonomously chosen by the robots because of any additional task (e.g. exploration), or exploited by an assisting human co-operator. For this latter ‘human-in-the-loop’ case, we propose a multi-master/multi-slave bilateral shared control system providing the co-operator with some suitable force cues informative of the UAV performance. The proposed theoretical framework is extensively validated by means of simulations and experiments with quadrotor UAVs equipped with onboard cameras. Practical limitations, e.g. limited field-of-view, are also considered.
Conference
Interactive planning of persistent trajectories for human-assisted navigation of mobile robots
C. Masone, A. Franchi, H. H. Bülthoff, and P. Robuffo Giordano
In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Jun 2012
This work extends the framework of bilateral shared control of mobile robots with the aim of increasing the robot autonomy and decreasing the operator commitment. We consider persistent autonomous behaviors where a cyclic motion must be executed by the robot. The human operator is in charge of modifying online some geometric properties of the desired path. This is then autonomously processed by the robot in order to produce an actual path guaranteeing: i) tracking feasibility, ii) collision avoidance with obstacles, iii) closeness to the desired path set by the human operator, and iv) proximity to some points of interest. A force feedback is implemented to inform the human operator of the global deformation of the path rather than using the classical mismatch between desired and executed motion commands. Physically-based simulations, with human/hardware-in-the-loop and a quadrotor UAV as robotic platform, demonstrate the feasibility of the method.
Conference
Roll rate thresholds and perceived realism in driving simulation
A. Nesti, C. Masone, M. Barnett-Cowan, P. Robuffo Giordano, H. H. Bülthoff, and P. Pretto
Due to limited operational space, in dynamic driving simulators it is common practice to implement motion cueing algorithms that tilt the simulator cabin to reproduce sustained accelerations. In order to avoid conflicting inertial cues, the tilt rate is kept below drivers’ perceptual thresholds, which are typically derived from the results of classical vestibular research where additional sensory cues to self-motion are removed. Here we conduct two experiments in order to assess whether higher tilt limits can be employed to expand the user’s perceptual workspace of dynamic driving simulators. In the first experiment we measure detection thresholds for roll in conditions that closely resemble typical driving. In the second experiment we measure drivers’ perceived realism in slalom driving for sub-, near- and supra-threshold roll rates. Results show that detection threshold for roll in an active driving task is remarkably higher than the limits currently used in motion cueing algorithms to drive simulators. Supra-threshold roll rates in the slalom task are also rated as more realistic. Overall, our findings suggest that higher tilt limits can be successfully implemented in motion cueing algorithms to better optimize simulator operational space.
2011
Conference
Bilateral teleoperation of multiple UAVs with decentralized bearing-only formation control
A. Franchi, C. Masone, H. H. Bülthoff, and P. Robuffo Giordano
In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Jun 2011
We present a decentralized system for the bilateral teleoperation of groups of UAVs which only relies on relative bearing measurements, i.e., without the need for distance information or global localization. The properties of a 3D bearing-formation are analyzed, and a minimal set of bearings needed for its definition is provided. We also design a novel decentralized formation control that is almost globally convergent and able to maintain bounded and non-vanishing inter-distances among the agents despite the absence of direct distance measurements. Furthermore, we develop a multi-master/multi-slave teleoperation setup in order to control the overall behavior of the group and to convey to the human operator suitable force cues, while ensuring stability in presence of delays and packet losses over the master-slave communication channel. The theoretical framework is validated by means of extensive human/hardware-in-the-loop simulations using two force-feedback devices and a group of quadrotors.
Conference
Mechanical design and control of the new 7-DOF CyberMotion simulator
C. Masone, P. Robuffo Giordano, and H. H. Bülthoff
In IEEE International Conference on Robotics and Automation (ICRA), Jun 2011
This paper describes the mechanical and control design of the new 7-DOF CyberMotion Simulator, a redundant industrial manipulator arm consisting of a standard 6-DOF anthropomorphic manipulator plus an actuated cabin attached to the end-effector. Contrary to Stewart platforms, an industrial manipulator offers several advantages when used as motion simulator: larger motion envelope, higher dexterity, and possibility to realize any end-effector posture within the workspace. In addition to this, the new actuated cabin acts as an additional joint and provides the needed kinematic redundancy to cope with the robot actuator and joint range constraints, which in general can significantly deteriorate the desired motion cues the robot is reproducing. In particular, we will show that, by suitably exploiting the redundancy, better results can be obtained in reproducing sustained acceleration cues, a relevant problem when implementing vehicle simulators.
2010
Conference
A novel framework for closed-loop robotic motion simulation - part II: Motion cueing design and experimental validation
P. Robuffo Giordano, C. Masone, J. Tesch, M. Breidt, L. Pollini, and H. H. Bülthoff
In IEEE International Conference on Robotics and Automation (ICRA), Jun 2010
This paper, divided into two parts, considers the problem of realizing a 6-DOF closed-loop motion simulator by exploiting an anthropomorphic serial manipulator as motion platform. After having proposed a suitable inverse kinematics scheme in Part I [1], we address here the other key issue, i.e., devising a motion cueing algorithm tailored to the specific robot motion envelope. An extension of the well-known classical washout filter designed in cylindrical coordinates will provide an effective solution to this problem. The paper will then present a thorough experimental evaluation of the overall architecture (inverse kinematics + motion cueing) on the chosen scenario: closed-loop simulation of a Formula 1 racing car. This will prove the feasibility of our approach in fully exploiting the robot motion capabilities as a motion simulator.
Conference
A novel framework for closed-loop robotic motion simulation - part I: Inverse kinematics design
P. Robuffo Giordano, C. Masone, J. Tesch, M. Breidt, L. Pollini, and H. H. Bülthoff
In IEEE International Conference on Robotics and Automation (ICRA), Jun 2010
This paper considers the problem of realizing a 6-DOF closed-loop motion simulator by exploiting an anthropomorphic serial manipulator as motion platform. Contrary to standard Stewart platforms, an industrial anthropomorphic manipulator offers a considerably larger motion envelope and higher dexterity, which make it a viable and superior alternative. Our work is divided into two papers. In this Part I, we discuss the main challenges in adopting a serial manipulator as motion platform, and thoroughly analyze one key issue: the design of a suitable inverse kinematics scheme for online motion reproduction. Experimental results are presented to analyze the effectiveness of our approach. Part II [1] will address the design of a motion cueing algorithm tailored to the robot kinematics, and will provide an experimental evaluation on the chosen scenario: closed-loop simulation of a Formula 1 racing car.