Figure 1. (Vision-based) localization is an important part of spatial intelligence. Photo credits to Tabea Schimpf.
What is spatial intelligence? It may be defined as “the ability to generate, retain, retrieve, and transform well-structured visual images” (D. F. Lohman, Spatial ability and g, 1996) or, more broadly, as a human’s “computational capacity that provides the ability or mental skill to solve spatial problems of navigation, visualization of objects from different angles and in space, recognition of faces or scenes, or noticing fine details” (H. Gardner).
(Vision-based) localization, i.e., the ability to place a visual (or multimodal) observation in space, or rather in a suitable representation of space, is an important ingredient of spatial intelligence. This kind of reasoning is well developed and well studied in humans. For example, when looking at a picture of a famous landmark, we can easily recognize it and infer where that picture was taken. In our daily routine, as we move around we collect observations of the space around us and organize them into a cognitive map, a unified representation of the spatial environment that we can access both to support memory (e.g., to relate past observations to each other and to new observations) and to guide our future actions (e.g., when we mentally plan a route to a destination). My team, my colleagues in VANDAL, and I are working on creating new and better algorithms that can provide this kind of reasoning, which is critical for developing applications that require advanced interactions with the world: for example, enabling autonomous navigation of robots and vehicles, creating more convincing and immersive augmented/extended reality applications, and making smarter personal assistive devices.
Although it is only one part of spatial reasoning, the research field of (vision-based) localization is itself very broad. The kind of reasoning that can be achieved may depend on how we represent the space (e.g., as an unordered collection of images, as a sparse point cloud, as a dense 3D map, …). We may even lack a prior representation of the world, in which case we may seek to deduce the location of an observation relative to another one, without placing it in a map. The goals may also vary depending on the task we need to solve: we may be interested in coarsely predicting the geographical location of an observation (Visual Place Recognition or Visual Geo-localization); we may want to estimate the precise pose of the sensor that captured that observation (Visual Localization); we may want to recognize a place/object irrespective of the point from which we observe it (Landmark Recognition); we may want to refine a pose estimate (Pose Refinement); we may try to establish correspondences between two different observations (Image Matching); and many more.
Figure 2. "Where is this place?" Example of Visual Place Recognition, where we try to predict the coarse location where the image was taken with respect to a map.
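In practice, Visual Place Recognition is typically cast as image retrieval: a network maps each image to a global descriptor, and the query is assigned the geo-tag of the most similar database images. Below is a minimal, purely illustrative sketch of this pipeline; `extract_descriptor` is a placeholder for a trained model, not our actual code, and the coordinates are made up.

```python
import numpy as np

def extract_descriptor(image: np.ndarray) -> np.ndarray:
    """Placeholder for a trained network that maps an image to a global descriptor.
    Here we just use a fixed random projection of the flattened pixels."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((image.size, 256))
    d = image.reshape(-1) @ proj
    return d / np.linalg.norm(d)

# Database of geo-tagged images: (image, (latitude, longitude)) pairs.
database = [(np.random.rand(32, 32, 3), (45.06 + i * 1e-4, 7.66)) for i in range(100)]
db_descriptors = np.stack([extract_descriptor(img) for img, _ in database])

def localize(query_image: np.ndarray, k: int = 5):
    """Return the GPS tags of the k database images most similar to the query."""
    q = extract_descriptor(query_image)
    similarities = db_descriptors @ q          # cosine similarity (descriptors are L2-normalized)
    top_k = np.argsort(-similarities)[:k]
    return [database[i][1] for i in top_k]

print(localize(np.random.rand(32, 32, 3)))
```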
We have been particularly interested in making these algorithms robust and scalable.
Robustness
One of the biggest challenges in visual geolocalization is the fact that the same place viewed at different times, in different weather conditions, and from slightly different angles may look substantially different. Making a visual geolocalization system robust to these variations, so that it achieves good performance across different conditions and in the presence of distractors or occlusions, is a major topic of research.
Scalability
Until recently, visual geolocalization research has focused on recognizing the location of images within moderately sized geographical areas, such as a neighborhood or a single route in a city. However, to enable the promised real-world applications of this technology, such as the navigation of autonomous agents, it is necessary to scale this task to much wider areas, with databases of spatially densely sampled images. The question of scalability in visual geolocalization not only demands larger datasets, but also raises the problem of how to make the deployed system scalable at test time, on a limited budget of memory and computation time.
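Test-time scalability is largely a matter of how the database descriptors are stored and searched. As a rough illustration (not a description of any specific system of ours), the sketch below contrasts an exact index with a compressed, approximate index built with the faiss library; the dimensions and parameters are illustrative.

```python
import numpy as np
import faiss  # library for efficient (approximate) nearest-neighbor search

d = 256                                               # descriptor dimensionality (illustrative)
db = np.random.rand(100_000, d).astype("float32")     # stand-in for database descriptors
queries = np.random.rand(10, d).astype("float32")

# Exact search: stores all descriptors uncompressed (100k x 256 floats, roughly 100 MB).
exact = faiss.IndexFlatL2(d)
exact.add(db)

# Approximate search: inverted file + product quantization compresses each descriptor
# to a few dozen bytes, trading a small recall drop for a much smaller index.
nlist, m, nbits = 1024, 32, 8        # coarse cells, PQ sub-vectors, bits per code
quantizer = faiss.IndexFlatL2(d)
approx = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
approx.train(db)                     # learn the coarse centroids and PQ codebooks
approx.add(db)
approx.nprobe = 16                   # how many coarse cells to visit at query time

distances, indices = approx.search(queries, 5)
print(indices[0])                    # database ids of the 5 nearest neighbors of the first query
```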
Figure 4. In CosPlace we use a classification task as a proxy to train the model that extracts the global descriptors used to retrieve images from the same place as the query to geo-localize. For this purpose, naively dividing the environment into cells (left image) and using these cells as classes is not effective because i) images from adjacent cells may see the same scene and thus be from the same place, and ii) the number of classes required to cover a large space grows quickly. To solve these issues, CosPlace divides the space into sub-datasets (the slices with different colors in the image on the right), and training iterates through the different sub-datasets, swapping the classification head accordingly. Images from Rethinking Visual Geo-localization for Large-Scale Applications.
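To make the partitioning idea in the caption concrete, here is a simplified sketch: images are binned into square cells from their UTM coordinates, and cells are assigned to groups (the colored slices above) so that classes trained together are spatially well separated. The parameter names and values are illustrative, not the exact CosPlace implementation (which, for instance, also bins images by heading).

```python
from collections import defaultdict

M = 10   # cell side in meters (illustrative)
N = 5    # adjacent cells go to different groups, so same-group classes are at least N*M meters apart

def assign(utm_east: float, utm_north: float):
    """Map an image's UTM coordinates to (group_id, class_id)."""
    cell_e, cell_n = int(utm_east // M), int(utm_north // M)
    group_id = (cell_e % N, cell_n % N)      # which sub-dataset the image belongs to
    class_id = (cell_e, cell_n)              # which class (cell) within that sub-dataset
    return group_id, class_id

# Build the sub-datasets: group -> class -> list of images (coordinates are made up)
groups = defaultdict(lambda: defaultdict(list))
for image_path, east, north in [("a.jpg", 351234.2, 4474321.7), ("b.jpg", 351245.9, 4474330.1)]:
    g, c = assign(east, north)
    groups[g][c].append(image_path)

# Training then iterates over the groups, attaching a fresh classification head
# (one output per class of the current group) on top of a shared backbone.
for group_id, classes in groups.items():
    print(f"group {group_id}: {len(classes)} classes")
```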
Using different 3D representations
One of the key questions in localization, when using a 3D representation of the known world (a map), is what kind of representation to use, and what the advantages and disadvantages of the different representations are. We are exploring different solutions, particularly aiming for scalable approaches.
Figure 5. MeshVPR: Citywide Visual Place Recognition Using 3D Meshes.
Robotics applications
Classical Visual Place Recognition methods have been devised to localize a single query image at a time, but a navigating robot collects a continuous stream of images from its camera. Thus, we seek new solutions that can also exploit the temporal information in this stream to reason about the location. One idea that we have explored is to use sequential descriptors that summarize sequences as a whole, thus enabling a direct sequence-to-sequence similarity search (see Figure 6). This idea is alluring not only for its efficiency, but also because a sequential descriptor naturally incorporates the temporal information from the sequence, which provides more robustness to high-confidence false matches than single-image descriptors.
Figure 6. (Top) Sequence matching individually processes each frame in the sequences to extract single-image descriptors. The frame-to-frame similarity scores build a matrix, and the best matching sequence is determined by aggregating the scores in the matrix. (Bottom) With sequential descriptors, each sequence is mapped to a learned descriptor, and the best matching sequence is directly determined by measuring the sequence-to-sequence similarity. Images from Learning Sequential Descriptors for Sequence-Based Visual Place Recognition.
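The two strategies in Figure 6 can be contrasted with a small sketch: sequence matching aggregates a matrix of frame-to-frame scores, while a sequential descriptor collapses each sequence into a single vector, so the search reduces to one dot product per database sequence. Everything here is illustrative; in particular, mean pooling stands in for learned aggregators such as SeqVLAD or SeqGeM, and the diagonal aggregation assumes temporally aligned sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 5, 128                               # sequence length and frame-descriptor size (illustrative)
query_seq = rng.standard_normal((L, D))     # one descriptor per frame of the query sequence
db_seqs = rng.standard_normal((1000, L, D)) # database of 1000 sequences

def norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

q, db = norm(query_seq), norm(db_seqs)

# --- Sequence matching: score every frame pair, then aggregate the LxL matrix. ---
score_matrices = np.einsum("id,sjd->sij", q, db)                      # (1000, L, L) similarities
seq_matching_scores = score_matrices.diagonal(axis1=1, axis2=2).mean(axis=1)  # simple aligned-frame aggregation
best_by_matching = int(np.argmax(seq_matching_scores))

# --- Sequential descriptors: one vector per sequence, a single similarity per pair. ---
def sequence_descriptor(frames):
    return norm(frames.mean(axis=0))        # mean pooling as a stand-in for SeqVLAD / SeqGeM

q_desc = sequence_descriptor(query_seq)
db_descs = np.stack([sequence_descriptor(s) for s in db_seqs])
best_by_descriptor = int(np.argmax(db_descs @ q_desc))

print(best_by_matching, best_by_descriptor)
```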
Image matching
The ability to match and find correspondences between images is a cornerstone of vision-based localization solutions. We have been working to understand how different image matching techniques behave in different conditions, their advantages and shortcomings, but also to develop new keypoint detection/description strategies that are independent of scale by leveraging Morse theory and persistent homology, powerful tools rooted in algebraic topology.
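As a concrete, deliberately classical example of image matching, the snippet below uses OpenCV's SIFT keypoints, Lowe's ratio test, and a RANSAC homography to find correspondences between two images. It only illustrates the detection/description/matching pipeline that our research analyzes and builds on, not any of our specific methods; the image paths are placeholders.

```python
import cv2
import numpy as np

img1 = cv2.imread("place_view_1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder paths
img2 = cv2.imread("place_view_2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kps1, desc1 = sift.detectAndCompute(img1, None)   # keypoints + local descriptors
kps2, desc2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only matches that pass Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn_matches = matcher.knnMatch(desc1, desc2, k=2)
good = [m for m, n in knn_matches if m.distance < 0.75 * n.distance]

# Estimate a homography with RANSAC to reject geometrically inconsistent matches.
if len(good) >= 4:
    pts1 = np.float32([kps1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kps2[m.trainIdx].pt for m in good])
    H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
    print(f"{int(inlier_mask.sum())} geometrically consistent correspondences")
```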
Figure 7. The evolution of the sub-level sets of a surface filtered by height, i.e., the value on the z axis. As the height crosses z1, a new loop is born in correspondence with a saddle (green point), then the loop changes smoothly until z hits z2, the value of a corresponding maximum (blue point), and the loop disappears. z1 and z2 are, respectively, the topological feature’s birth time and death time. Image from Scale-Free Image Keypoints Using Differentiable Persistent Homology.
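Figure 7 describes 1-dimensional topological features (loops) of a surface filtered by height. To give a feel for what birth and death times are, here is a much simpler, self-contained sketch that computes the 0-dimensional persistence (connected components) of the sub-level sets of a 1-D signal: a component is born at a local minimum and dies when it merges into an older component at a local maximum (the "elder rule"). This is only an illustration of the concepts, not the algorithm used in the paper.

```python
import numpy as np

def sublevel_persistence_0d(values):
    """Birth/death pairs of connected components of {x : f(x) <= t} for a 1-D signal f."""
    n = len(values)
    parent = [-1] * n                      # -1 = not yet in any sub-level set
    birth = {}                             # component root -> birth value

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for i in sorted(range(n), key=lambda i: values[i]):   # grow the sub-level set
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):                           # connect to already-processed neighbors
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Elder rule: the younger component (larger birth value) dies here.
                    young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                    if birth[young] < values[i]:           # skip zero-persistence pairs
                        pairs.append((birth[young], values[i]))
                    parent[young] = old
    # One essential component never dies.
    roots = {find(i) for i in range(n)}
    pairs.extend((birth[r], float("inf")) for r in roots)
    return pairs

signal = np.array([2.0, 1.0, 3.0, 0.0, 4.0, 0.5, 2.5])
print(sublevel_persistence_0d(signal))  # e.g. (1.0, 3.0): born at the minimum 1.0, dies at the maximum 3.0
```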
Lost in Space?
Yes, we have also worked on the problem of localizing observations taken from space. In fact, it is little known that astronauts on the International Space Station take thousands of photos each month, which are used for disaster management, climate change studies, and other Earth science research. However, before a photo can be used, it must be localized: this was historically done manually, a task that NASA describes as a “monumentally important, but monumentally time-consuming job”. We provided tools to automate this process.
We are working on many other things related to the problem of localization and, more generally, to mapping observations to a spatial representation. Please check the related publications below for a more complete overview.
Related Publications
2024
Conference
MeshVPR: Citywide Visual Place Recognition Using 3D Meshes
Gabriele Berton, Lorenz Junglas, Riccardo Zaccone, Thomas Pollok, Barbara Caputo, and Carlo Masone
In European Conference on Computer Vision (ECCV), 2024
Mesh-based scene representation offers a promising direction for simplifying large-scale hierarchical visual localization pipelines, combining a visual place recognition step based on global features (retrieval) and a visual localization step based on local features. While existing work demonstrates the viability of meshes for visual localization, the impact of using synthetic databases rendered from them in visual place recognition remains largely unexplored. In this work we investigate using dense 3D textured meshes for large-scale Visual Place Recognition (VPR). We identify a significant performance drop when using synthetic mesh-based image databases compared to real-world images for retrieval. To address this, we propose MeshVPR, a novel VPR pipeline that utilizes a lightweight features alignment framework to bridge the gap between real-world and synthetic domains. MeshVPR leverages pre-trained VPR models and is efficient and scalable for city-wide deployments. We introduce novel datasets with freely available 3D meshes and manually collected queries from Berlin, Paris, and Melbourne. Extensive evaluations demonstrate that MeshVPR achieves competitive performance with standard VPR pipelines, paving the way for mesh-based localization systems.
Conference
Scale-Free Image Keypoints Using Differentiable Persistent Homology
Giovanni Barbarani, Francesco Vaccarino, Gabriele Trivigno, Marco Guerra, Gabriele Berton, and Carlo Masone
In International Conference on Machine Learning (ICML), 2024
In computer vision, keypoint detection is a fundamental task, with applications spanning from robotics to image retrieval; however, existing learning-based methods suffer from scale dependency and lack flexibility. This paper introduces a novel approach that leverages Morse theory and persistent homology, powerful tools rooted in algebraic topology. We propose a novel loss function based on the recent introduction of a notion of subgradient in persistent homology, paving the way toward topological learning. Our detector, MorseDet, is the first topology-based learning model for feature detection, which achieves competitive performance in keypoint repeatability and introduces a principled and theoretically robust approach to the problem.
Workshop
EarthMatch: Iterative Coregistration for Fine-grained Localization of Astronaut Photography
Gabriele Berton, Gabriele Goletto, Gabriele Trivigno, Alex Stoken, Barbara Caputo, and Carlo Masone
In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun 2024
Precise, pixel-wise geolocalization of astronaut photography is critical to unlocking the potential of this unique type of remotely sensed Earth data, particularly for its use in disaster management and climate change research. Recent works have established the Astronaut Photography Localization task, but have either proved too costly for mass deployment or generated too coarse a localization. Thus, we present EarthMatch, an iterative homography estimation method that produces fine-grained localization of astronaut photographs while maintaining an emphasis on speed. We refocus the astronaut photography benchmark, AIMS, on the geolocalization task itself, and prove our method’s efficacy on this dataset. In addition, we offer a new, fair method for image matcher comparison, and an extensive evaluation of different matching models within our localization pipeline. Our method will enable fast and accurate localization of the 4.5 million and growing collection of astronaut photography of Earth.
Journal
Distributed training of CosPlace for large-scale visual place recognition
Riccardo Zaccone, Gabriele Berton, and Carlo Masone
Visual place recognition (VPR) is a popular computer vision task aimed at recognizing the geographic location of a visual query, usually within a tolerance of a few meters. Modern approaches address VPR from an image retrieval standpoint using a kNN on top of embeddings extracted by a deep neural network from both the query and images in a database. Although most of these approaches rely on contrastive learning, which limits their ability to be trained on large-scale datasets (due to mining), the recently reported CosPlace proposes an alternative training paradigm using a classification task as the proxy. This has been shown to be effective in expanding the potential of VPR models to learn from large-scale and fine-grained datasets. In this work, we experimentally analyze CosPlace from a continual learning perspective and show that its sequential training procedure leads to suboptimal results. As a solution, we propose a different formulation that not only solves the pitfalls of the original training strategy effectively but also enables faster and more efficient distributed training. Finally, we discuss the open challenges in further speeding up large-scale image retrieval for VPR.
Conference
The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement
Gabriele Trivigno, Carlo Masone, Barbara Caputo, and Torsten Sattler
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2024
Pose refinement is an interesting and practically relevant research direction. Pose refinement can be used (1) to obtain a more accurate pose estimate from an initial prior (e.g., from retrieval), (2) as pre-processing, i.e., to provide a better starting point to a more expensive pose estimator, or (3) as post-processing of a more accurate localizer. Existing approaches focus on learning features / scene representations for the pose refinement task. This involves training an implicit scene representation or learning features while optimizing a camera pose-based loss. A natural question is whether training specific features / representations is truly necessary or whether similar results can be already achieved with more generic features. In this work we present a simple approach that combines pre-trained features with a particle filter and a renderable representation of the scene. Despite its simplicity, it achieves state-of-the-art results, demonstrating that one can easily build a pose refiner without the need for specific training.
Workshop
Collaborative Visual Place Recognition through Federated Learning
Mattia Dutto, Gabriele Berton, Debora Caldarola, Eros Fanı̀, Gabriele Trivigno, and Carlo Masone
In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun 2024
Visual Place Recognition (VPR) aims to estimate the location of an image by treating it as a retrieval problem. VPR uses a database of geo-tagged images and leverages deep neural networks to extract a global representation, called a descriptor, from each image. While the training data for VPR models often originates from diverse, geographically scattered sources (geo-tagged images), the training process itself is typically assumed to be centralized. This research revisits the task of VPR through the lens of Federated Learning (FL), addressing several key challenges associated with this adaptation. VPR data inherently lacks well-defined classes, and models are typically trained using contrastive learning, which necessitates a data mining step on a centralized database. Additionally, client devices in federated systems can be highly heterogeneous in terms of their processing capabilities. The proposed FedVPR framework not only presents a novel approach for VPR but also introduces a new challenging and realistic task for FL research. This has the potential to spur the application of FL to other image retrieval tasks.
Conference
EarthLoc: Astronaut Photography Localization by Indexing Earth from Space
Gabriele Berton, Alex Stoken, Barbara Caputo, and Carlo Masone
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2024
Astronaut photography, spanning six decades of human spaceflight, presents a unique Earth observations dataset with immense value for both scientific research and disaster response. Despite its significance, accurately localizing the geographical extent of these images, crucial for effective utilization, poses substantial challenges. Current manual localization efforts are time-consuming, motivating the need for automated solutions. We propose a novel approach - leveraging image retrieval - to address this challenge efficiently. We introduce innovative training techniques, including Year-Wise Data Augmentation and a Neutral-Aware Multi-Similarity Loss, which contribute to the development of a high-performance model, EarthLoc. We develop six evaluation datasets and perform a comprehensive benchmark comparing EarthLoc to existing methods, showcasing its superior efficiency and accuracy. Our approach marks a significant advancement in automating the localization of astronaut photography, which will help bridge a critical gap in Earth observations data.
Journal
JIST: Joint Image and Sequence Training for Sequential Visual Place Recognition
Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone
Visual Place Recognition aims at recognizing previously visited places by relying on visual clues, and it is used in robotics applications for SLAM and localization. Since typically a mobile robot has access to a continuous stream of frames, this task is naturally cast as a sequence-to-sequence localization problem. Nevertheless, obtaining sequences of labelled data is much more expensive than collecting isolated images, which can be done in an automated way with little supervision. As a mitigation to this problem, we propose a novel Joint Image and Sequence Training protocol (JIST) that leverages large uncurated sets of images through a multi-task learning framework. With JIST we also introduce SeqGeM, an aggregation layer that revisits the popular GeM pooling to produce a single robust and compact embedding from a sequence of single-frame embeddings. We show that our model is able to outperform previous state of the art while being faster, using 8 times smaller descriptors, having a lighter architecture and allowing to process sequences of various lengths.
2023
Conference
Divide&Classify: Fine-Grained Classification for City-Wide Visual Place Recognition
Gabriele Trivigno, Gabriele Berton, Juan Aragon, Barbara Caputo, and Carlo Masone
In IEEE/CVF International Conference on Computer Vision (ICCV), Jun 2023
Visual Place Recognition is commonly addressed as an image retrieval problem. However, retrieval methods are impractical to scale to large datasets, densely sampled from city-wide maps, since their dimension impacts negatively on the inference time. Using approximate nearest neighbour search for retrieval helps to mitigate this issue, at the cost of a performance drop. In this paper we investigate whether we can effectively approach this task as a classification problem, thus bypassing the need for a similarity search. We find that existing classification methods for coarse, planet-wide localization are not suitable for the fine-grained and city-wide setting. This is largely due to how the dataset is split into classes, because these methods are designed to handle a sparse distribution of photos and as such do not consider the visual aliasing problem across neighbouring classes that naturally arises in dense scenarios. Thus, we propose a partitioning scheme that enables a fast and accurate inference, preserving a simple learning procedure, and a novel inference pipeline based on an ensemble of novel classifiers that uses the prototypes learned via an angular margin loss. Our method, Divide&Classify (D&C), enjoys the fast inference of classification solutions and an accuracy competitive with retrieval methods on the fine-grained, city-wide setting. Moreover, we show that D&C can be paired with existing retrieval pipelines to speed up computations by over 20 times while increasing their recall, leading to new state-of-the-art results.
Conference
EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition
Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone
In IEEE/CVF International Conference on Computer Vision (ICCV), Jun 2023
Visual Place Recognition is a task that aims to predict the place of an image (called query) based solely on its visual features. This is typically done through image retrieval, where the query is matched to the most similar images from a large database of geotagged photos, using learned global descriptors. A major challenge in this task is recognizing places seen from different viewpoints. To overcome this limitation, we propose a new method, called EigenPlaces, to train our neural network on images from different points of view, which embeds viewpoint robustness into the learned global descriptors. The underlying idea is to cluster the training data so as to explicitly present the model with different views of the same points of interest. The selection of these points of interest is done without the need for extra supervision. We then present experiments on the most comprehensive set of datasets in literature, finding that EigenPlaces is able to outperform previous state of the art on the majority of datasets, while requiring 60% less GPU memory for training and using 50% smaller descriptors. The code and trained models for EigenPlaces are available at https://github.com/gmberton/EigenPlaces, while results with any other baseline can be computed with the codebase at https://github.com/gmberton/auto_VPR.
2022
Journal
Learning Sequential Descriptors for Sequence-Based Visual Place Recognition
Riccardo Mereu, Gabriele Trivigno, Gabriele Berton, Carlo Masone, and Barbara Caputo
In robotics, visual place recognition (VPR) is a continuous process that receives as input a video stream to produce a hypothesis of the robot’s current position within a map of known places. This work proposes a taxonomy of the architectures used to learn sequential descriptors for VPR, highlighting different mechanisms to fuse the information from the individual images. This categorization is supported by a complete benchmark of experimental results that provides evidence of the strengths and weaknesses of these different architectural choices. The analysis is not limited to existing sequential descriptors, but we extend it further to investigate the viability of Transformers instead of CNN backbones. We further propose a new ad-hoc sequence-level aggregator called SeqVLAD, which outperforms prior state of the art on different datasets. The code is available at https://github.com/vandal-vpr/vg-transformers.
Journal
Adaptive-Attentive Geolocalization From Few Queries: A Hybrid Approach
Valerio Paolicelli, Gabriele Berton, Francesco Montagna, Carlo Masone, and Barbara Caputo
We tackle the task of cross-domain visual geo-localization, where the goal is to geo-localize a given query image against a database of geo-tagged images, in the case where the query and the database belong to different visual domains. In particular, at training time, we consider having access to only few unlabeled queries from the target domain. To adapt our deep neural network to the database distribution, we rely on a 2-fold domain adaptation technique, based on a hybrid generative-discriminative approach. To further enhance the architecture, and to ensure robustness across domains, we employ a novel attention layer that can easily be plugged into existing architectures. Through a large number of experiments, we show that this adaptive-attentive approach makes the model robust to large domain shifts, such as unseen cities or weather conditions. Finally, we propose a new large-scale dataset for cross-domain visual geo-localization, called SVOX.
Conference
Learning Semantics for Visual Place Recognition through Multi-Scale Attention
V. Paolicelli, A. Tavera, G. Berton, C. Masone, and B. Caputo
In International Conference on Image Analysis and Processing (ICIAP), Jun 2022
In this paper we address the task of visual place recognition (VPR), where the goal is to retrieve the correct GPS coordinates of a given query image against a huge geotagged gallery. While recent works have shown that building descriptors incorporating semantic and appearance information is beneficial, current state-of-the-art methods opt for a top down definition of the significant semantic content. Here we present the first VPR algorithm that learns robust global embeddings from both visual appearance and semantic content of the data, with the segmentation process being dynamically guided by the recognition of places through a multi-scale attention module. Experiments on various scenarios validate this new approach and demonstrate its performance against state-of-the-art methods. Finally, we propose the first synthetic-world dataset suited for both place recognition and segmentation tasks.
Conference
Rethinking Visual Geo-localization for Large-Scale Applications
G. Berton, C. Masone, and B. Caputo
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022
Visual Geo-localization (VG) is the task of estimating the position where a given photo was taken by comparing it with a large database of images of known locations. To investigate how existing techniques would perform on a real-world city-wide VG application, we build San Francisco eXtra Large, a new dataset covering a whole city and providing a wide range of challenging cases, with a size 30x bigger than the previous largest dataset for visual geo-localization. We find that current methods fail to scale to such large datasets, therefore we design a new highly scalable training technique, called CosPlace, which casts the training as a classification problem avoiding the expensive mining needed by the commonly used contrastive learning. We achieve state-of-the-art performance on a wide range of datasets and find that CosPlace is robust to heavy domain changes. Moreover, we show that, compared to the previous state-of-the-art, CosPlace requires roughly 80% less GPU memory at train time, and it achieves better results with 8x smaller descriptors, paving the way for city-wide real-world visual geo-localization.
Conference
Deep Visual Geo-localization Benchmark
G. Berton, R. Mereu, G. Trivigno, C. Masone, G. Csurka, T. Sattler, and B. Caputo
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022
Within the top 2% of submitted papers, and the top 7.8% of accepted papers.
In this paper, we propose a new open-source benchmarking framework for Visual Geo-localization (VG) that allows building, training, and testing a wide range of commonly used architectures, with the flexibility to change individual components of a geo-localization pipeline. The purpose of this framework is twofold: i) gaining insights into how different components and design choices in a VG pipeline impact the final results, both in terms of performance (recall@N metric) and system requirements (such as execution time and memory consumption); ii) establishing a systematic evaluation protocol for comparing different methods. Using the proposed framework, we perform a large suite of experiments which provide criteria for choosing backbone, aggregation and negative mining depending on the use-case and requirements. We also assess the impact of engineering techniques like pre/post-processing, data augmentation and image resizing, showing that better performance can be obtained through somewhat simple procedures: for example, downscaling the images’ resolution to 80% can lead to similar results with a 36% savings in extraction time and dataset storage requirement.
2021
Conference
Viewpoint Invariant Dense Matching for Visual Geolocalization
G. Berton, C. Masone, V. Paolicelli, and B. Caputo
In IEEE/CVF International Conference on Computer Vision (ICCV), Jun 2021
In this paper we propose a novel method for image matching based on dense local features and tailored for visual geolocalization. Dense local features matching is robust against changes in illumination and occlusions, but not against viewpoint shifts which are a fundamental aspect of geolocalization. Our method, called GeoWarp, directly embeds invariance to viewpoint shifts in the process of extracting dense features. This is achieved via a trainable module which learns from the data an invariance that is meaningful for the task of recognizing places. We also devise a new self-supervised loss and two new weakly supervised losses to train this module using only unlabeled data and weak labels. GeoWarp is implemented efficiently as a re-ranking method that can be easily embedded into pre-existing visual geolocalization pipelines. Experimental validation on standard geolocalization benchmarks demonstrates that GeoWarp boosts the accuracy of state-of-the-art retrieval architectures.
Journal
A Survey on Deep Visual Place Recognition
Carlo Masone and Barbara Caputo
In recent years visual place recognition (VPR), i.e., the problem of recognizing the location of images, has received considerable attention from multiple research communities, spanning from computer vision to robotics and even machine learning. This interest is fueled on one hand by the relevance that visual place recognition holds for many applications and on the other hand by the unsolved challenge of making these methods perform reliably in different conditions and environments. This paper presents a survey of the state-of-the-art of research on visual place recognition, focusing on how it has been shaped by the recent advances in deep learning. We start discussing the image representations used in this task and how they have evolved from using hand-crafted to deep-learned features. We further review how metric learning techniques are used to get more discriminative representations, as well as techniques for dealing with occlusions, distractors, and shifts in the visual domain of the images. The survey also provides an overview of the specific solutions that have been proposed for applications in robotics and with aerial imagery. Finally, the survey provides a summary of datasets that are used in visual place recognition, highlighting their different characteristics.