Localization

Where am I? Where is everything else?

Figure 1. (Vision-based) localization is an important part of spatial intelligence. Photo credits to Tabea Schimpf.

What is spatial intelligence? It may be defined as “the ability to generate, retain, retrieve, and transform well-structured visual images” (D. F. Lohman, Spatial ability and g, 1996) or, more broadly, “human’s computational capacity that provides the ability or mental skill to solve spatial problems of navigation, visualization of objects from different angles and space, faces or scenes recognition, or to notice fine details” (H. Gardner).

(Vision-based) localization, i.e., the ability to place a visual (or multimodal) observation in space, or rather in a suitable representation of space, is an important ingredient of spatial intelligence. This kind of reasoning is well developed and well studied in humans. For example, when looking at the picture of a famous landmark, we can easily recognize it and infer where that picture was taken. In our daily routine, as we move around we collect observations of the space around us and organize them into a cognitive map, a unified representation of the spatial environment that we can access both to support memory (e.g., to relate past observations to each other and to new observations) and to guide our future actions (e.g., when we mentally plan a route to a destination). My team, my colleagues in VANDAL, and I are working on new and better algorithms that can provide this kind of reasoning, which is critical for applications that require advanced interactions with the world: enabling autonomous navigation of robots and vehicles, creating more convincing and immersive augmented/extended reality applications, making smarter personal assistive devices, and so on.

Although it is only one part of spatial reasoning, the research field of (vision-based) localization is itself very broad. The kind of reasoning that can be achieved may depend on how we represent the space (e.g., as an unordered collection of images, as a sparse point cloud, as a dense 3D map, …). We may not even have a prior representation of the world, in which case we may seek to deduce the location of an observation relative to another one, without placing it in a map. The goals also vary depending on the task we need to solve: we may be interested in coarsely predicting the geographical location of an observation (Visual Place Recognition or Visual Geo-localization); we may want to estimate the precise pose of the sensor that captured that observation (Visual Localization); we may want to recognize a place/object irrespective of the viewpoint from which we observe it (Landmark Recognition); we may want to refine a pose estimate (Pose Refinement); we may try to establish correspondences between two different observations (Image Matching); and many more.

Figure 2. "Where is this place?" Example of Visual Place Recognition, where we try to predict the coarse location where the image was taken with respect to a map.

What are we working on?

We have been working largely on Visual Place Recognition, a task that is often used as the first step in hierarchical localization pipelines. You may check A Survey on Deep Visual Place Recognition and Deep Visual Geo-localization Benchmark for an overview of this task.
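At its core, Visual Place Recognition is usually cast as image retrieval: database images with known geo-tags are embedded offline into compact global descriptors, and a query is localized by nearest-neighbor search in descriptor space. Below is a minimal sketch of this formulation in Python; extract_descriptor is only a placeholder for a trained network (e.g., one of the models compared in the benchmark above), and all data here is synthetic.

```python
import numpy as np

def extract_descriptor(image):
    # Placeholder for a trained model that maps an image to an
    # L2-normalized global descriptor (e.g., 512-D).
    rng = np.random.default_rng(int(image.sum() * 1e6) % 2**32)
    d = rng.standard_normal(512)
    return d / np.linalg.norm(d)

# Offline stage: embed a database of geo-tagged images.
db_images = [np.random.rand(480, 640, 3) for _ in range(100)]  # dummy images
db_geotags = [(45.06 + i * 1e-4, 7.66) for i in range(100)]    # (lat, lon)
db_descriptors = np.stack([extract_descriptor(im) for im in db_images])

# Online stage: the query's location is approximated by its nearest neighbors.
def localize(query_image, k=5):
    q = extract_descriptor(query_image)
    scores = db_descriptors @ q            # cosine similarity (unit vectors)
    best = np.argsort(-scores)[:k]         # indices of the top-k matches
    return [db_geotags[i] for i in best]   # their geo-tags

print(localize(np.random.rand(480, 640, 3)))
```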

We have been particularly interested in making these algorithms robust and scalable.

Robustness

One of the biggest challenges in visual geolocalization is that the same place viewed at different times, in different weather conditions, and from slightly different angles may look substantially different. Making a visual geolocalization system robust to these variations, so that it achieves good performance across different conditions and in the presence of distractors or occlusions, is a major topic of research.

Figure 3. Left: The appearance of a place naturally changes in different weather conditions and seasons, and due to day/night cycles. Image from Adaptive-Attentive Geolocalization From Few Queries: A Hybrid Approach. Right: A place viewed from slightly different observation points may be difficult to recognize. Image from Viewpoint Invariant Dense Matching for Visual Geolocalization.

Scalability

Until recently, visual geolocalization research focused on recognizing the location of images within moderately sized geographical areas, such as a neighborhood or a single route in a city. However, to empower the promised real-world applications of this technology, such as the navigation of autonomous agents, it is necessary to scale this task to much wider areas, with databases of spatially densely sampled images. Scalability in visual geolocalization not only demands larger datasets, but also raises the problem of how to make the deployed system scalable at test time, on a limited budget of memory and computation time.
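As a concrete example of the test-time side of this problem, the retrieval back-end can trade exhaustive descriptor comparison for compressed, approximate nearest-neighbor search. The sketch below uses the faiss library as one possible off-the-shelf choice (an assumption for illustration, not necessarily what our systems use), with illustrative sizes and parameters.

```python
import numpy as np
import faiss  # assumption: one common ANN library, chosen here for illustration

d = 512                                            # descriptor dimensionality
db = np.random.rand(100_000, d).astype("float32")  # stand-in database descriptors

# Product quantization compresses each 512-D float descriptor (2048 bytes)
# down to 64 bytes, and an inverted file (IVF) restricts each query to a few
# of the 1024 coarse cells instead of an exhaustive scan over the database.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)  # 1024 cells, 64x8-bit codes
index.train(db)                                      # fit coarse and PQ quantizers
index.add(db)

index.nprobe = 16                              # cells visited: speed/recall trade-off
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)        # approximate top-5 neighbors
print(ids)
```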

Figure 4. In CosPlace we use a classification task as a proxy to train the model that extracts the global descriptors used to retrieve images from the same place as the query to geo-localize. For this purpose, naively dividing the environment into cells (left image) and using these cells as classes is not effective, because i) images from adjacent cells may see the same scene and thus depict the same place, and ii) the number of classes required to cover a large space grows quickly. To solve these issues, CosPlace divides the space into sub-datasets (the slices with different colors in the image on the right), and training iterates through the different sub-datasets, replacing the classification head each time. Images from Rethinking Visual Geo-localization for Large-Scale Applications.
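A minimal sketch of this partitioning idea follows, with illustrative cell size and group count (the actual CosPlace implementation also partitions by camera heading, which is omitted here):

```python
M = 10   # cell side in meters, on UTM coordinates (illustrative)
N = 2    # split factor: N x N non-overlapping groups of cells (illustrative)

def cell_and_group(utm_east, utm_north):
    cell = (int(utm_east // M), int(utm_north // M))  # the class identity
    group = (cell[0] % N, cell[1] % N)                # which sub-dataset it belongs to
    return cell, group

# Adjacent cells always land in different groups, so classes within one
# group are guaranteed to be spatially separated.
print(cell_and_group(591234.7, 4793456.2))  # cell (59123, 479345), group (1, 1)
print(cell_and_group(591244.7, 4793456.2))  # adjacent cell (59124, ...), group (0, 1)

# Training (schematic): iterate over the groups, each with its own
# classification head, while the shared backbone learns the descriptor:
# for group, head in zip(groups, heads):
#     train_one_pass(backbone, head, images_in(group))
```
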
Using different 3D representations

One of the key questions in localization, when using a 3D representation of the known world (a map), is what kind of representation to use. What are the advantages and disadvantages of the different representations? We are exploring different solutions, aiming in particular for scalable approaches.

Figure 5. MeshVPR: Citywide Visual Place Recognition Using 3D Meshes.

Robotics application

Classical Visual Place Recognition methods were devised to localize a single query image at a time, but a navigating robot collects a continuous stream of images from its camera. Thus, we seek new solutions that can also exploit the temporal information in this stream to reason about location. One idea that we have explored is to use sequential descriptors that summarize sequences as a whole, thus enabling a direct sequence-to-sequence similarity search (see Figure 6 and the sketch below). This idea is alluring not only for its efficiency but also because a sequential descriptor naturally incorporates the temporal information from the sequence, which provides more robustness to high-confidence false matches than single-image descriptors.

Figure 6. (Top) Sequence matching individually processes each frame in the sequences to extract single-image descriptors. The frame-to-frame similarity scores build a matrix, and the best matching sequence is determined by aggregating the scores in the matrix. (Bottom) With sequential descriptors, each sequence is mapped to a learned descriptor, and the best matching sequence is directly determined by measuring the sequence-to-sequence similarity. Images from Learning Sequential Descriptors for Sequence-Based Visual Place Recognition.
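The contrast between the two paradigms can be made concrete in a few lines of Python. Here frame_desc and seq_desc are placeholders for learned models, and the diagonal aggregation is just one simple scoring choice, used for illustration only.

```python
import numpy as np

def frame_desc(frame):
    # Placeholder single-image descriptor (a trained network in practice).
    v = np.asarray(frame, dtype=float).ravel()
    return v / (np.linalg.norm(v) + 1e-9)

def seq_desc(seq):
    # Placeholder sequential descriptor summarizing the whole sequence.
    v = np.mean([frame_desc(f) for f in seq], axis=0)
    return v / (np.linalg.norm(v) + 1e-9)

def sequence_matching_score(query_seq, db_seq):
    # Top of Figure 6: frame-to-frame similarity matrix, then aggregation
    # (here, the mean of the diagonal, i.e., temporally aligned frames).
    S = np.array([[frame_desc(q) @ frame_desc(d) for d in db_seq]
                  for q in query_seq])
    return float(np.mean(np.diag(S)))

def sequential_descriptor_score(query_seq, db_seq):
    # Bottom of Figure 6: one descriptor per sequence, one dot product
    # at search time, regardless of the sequence length.
    return float(seq_desc(query_seq) @ seq_desc(db_seq))

query = [np.random.rand(128) for _ in range(5)]
candidate = [np.random.rand(128) for _ in range(5)]
print(sequence_matching_score(query, candidate))
print(sequential_descriptor_score(query, candidate))
```
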
Image matching

The ability to match and find correspondences between images is a cornerstone of vision-based localization solutions. We have been working both to understand how different image matching techniques behave in different conditions, with their respective advantages and shortcomings, and to develop new keypoint detection/description strategies that are independent of scale, leveraging Morse theory and persistent homology, powerful tools rooted in algebraic topology.
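For reference, a classical local-feature matching pipeline (SIFT keypoints, Lowe's ratio test, RANSAC verification) looks as follows. This is a textbook baseline shown for context, not one of our methods; the image paths are placeholders for two views of the same scene.

```python
import cv2
import numpy as np

# Placeholders: two images of the same scene from different viewpoints.
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute 128-D SIFT descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Lowe's ratio test discards ambiguous nearest-neighbor matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.8 * n.distance]

# Geometric verification: keep only matches consistent with a homography.
if len(good) >= 4:
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0)
    print(f"{int(mask.sum())} verified correspondences out of {len(good)}")
```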

Figure 7. The evolution of the sub-level sets of a surface filtered by height, i.e., the value on the z axis. As the height crosses z1, a new loop is born at a saddle (green point); the loop then changes smoothly until z hits z2, the value of the corresponding maximum (blue point), and the loop disappears. z1 and z2 are, respectively, the topological feature's birth time and death time. Image from Scale-Free Image Keypoints Using Differentiable Persistent Homology.
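To see the filtration of Figure 7 in practice, the sketch below computes the persistence diagram of an image's sub-level sets with the gudhi library (an assumed off-the-shelf choice; our work builds a differentiable variant of this computation, which is not shown here).

```python
import numpy as np
import gudhi  # assumption: one library exposing cubical persistence

img = np.random.rand(64, 64)   # stand-in for an image intensity map

# Sub-level set filtration: thresholding the image at a growing value z,
# exactly like the height filtration of the surface in Figure 7.
cc = gudhi.CubicalComplex(top_dimensional_cells=img)
diagram = cc.persistence()     # list of (dimension, (birth, death)) pairs

# A feature's persistence (death - birth) measures how long it survives
# the filtration; it does not depend on the spatial scale of the image
# structure, which is what makes it appealing for scale-free keypoints.
finite = [(dim, d - b) for dim, (b, d) in diagram if d != float("inf")]
finite.sort(key=lambda t: -t[1])
for dim, pers in finite[:5]:
    kind = "component (minimum-saddle pair)" if dim == 0 \
        else "loop (saddle-maximum pair)"
    print(f"{kind}: persistence {pers:.3f}")
```
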
Lost in Space?

Yes, we have also worked on the problem of localizing observations taken from space. It is little known that astronauts on the International Space Station take thousands of photos each month, which are used for disaster management, climate change studies, and other Earth science research. However, before a photo can be used, it must be localized: historically this was done manually, a task that NASA describes as a “monumentally important, but monumentally time-consuming job”. We provided tools to automate this process.

Figure 8. The EarthLoc + EarthMatch pipeline, used to localize photos taken from the ISS.

And many more things

We are working on many other things related to the problem of localization and, more generally, mapping observations to a spatial representation. Please check the related publications below for a more complete overview.


Related Publications

2024

  1. Conference
    MeshVPR: Citywide Visual Place Recognition Using 3D Meshes
    Gabriele Berton, Lorenz Junglas, Riccardo Zaccone, Thomas Pollok, Barbara Caputo, and Carlo Masone
    In European Conference on Computer Vision (ECCV), 2024
  2. Conference
    Scale-Free Image Keypoints Using Differentiable Persistent Homology
    Giovanni Barbarani, Francesco Vaccarino, Gabriele Trivigno, Marco Guerra, Gabriele Berton, and Carlo Masone
    In International Conference on Machine Learning (ICML), 2024
  3. Workshop
    EarthMatch: Iterative Coregistration for Fine-grained Localization of Astronaut Photography
    Gabriele Berton, Gabriele Goletto, Gabriele Trivigno, Alex Stoken, Barbara Caputo, and Carlo Masone
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024
  4. Journal
    Distributed Training of CosPlace for Large-Scale Visual Place Recognition
    Riccardo Zaccone, Gabriele Berton, and Carlo Masone
    Frontiers in Robotics and AI, 2024
  5. Conference
    The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement
    Gabriele Trivigno, Carlo Masone, Barbara Caputo, and Torsten Sattler
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
  6. Workshop
    Collaborative Visual Place Recognition through Federated Learning
    Mattia Dutto, Gabriele Berton, Debora Caldarola, Eros Fanì, Gabriele Trivigno, and Carlo Masone
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024
  7. Conference
    EarthLoc: Astronaut Photography Localization by Indexing Earth from Space
    Gabriele Berton, Alex Stoken, Barbara Caputo, and Carlo Masone
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
  8. Journal
    JIST: Joint Image and Sequence Training for Sequential Visual Place Recognition
    Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone
    IEEE Robotics and Automation Letters, 2024

2023

  1. Conference
    Divide&Classify: Fine-Grained Classification for City-Wide Visual Place Recognition
    Gabriele Trivigno, Gabriele Berton, Juan Aragon, Barbara Caputo, and Carlo Masone
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023
  2. Conference
    EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition
    Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

2022

  1. Journal
    Learning Sequential Descriptors for Sequence-Based Visual Place Recognition
    Riccardo Mereu, Gabriele Trivigno, Gabriele Berton, Carlo Masone, and Barbara Caputo
    IEEE Robotics and Automation Letters, 2022
  2. Journal
    Adaptive-Attentive Geolocalization From Few Queries: A Hybrid Approach
    Valerio Paolicelli, Gabriele Berton, Francesco Montagna, Carlo Masone, and Barbara Caputo
    Frontiers in Computer Science, 2022
  3. Conference
    Learning Semantics for Visual Place Recognition through Multi-Scale Attention
    Valerio Paolicelli, Antonio Tavera, Gabriele Berton, Carlo Masone, and Barbara Caputo
    In International Conference on Image Analysis and Processing (ICIAP), 2022
  4. Conference
    Rethinking Visual Geo-localization for Large-Scale Applications
    Gabriele Berton, Carlo Masone, and Barbara Caputo
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
  5. Conference
    Deep Visual Geo-localization Benchmark
    Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2021

  1. Conference
    Viewpoint Invariant Dense Matching for Visual Geolocalization
    Gabriele Berton, Carlo Masone, Valerio Paolicelli, and Barbara Caputo
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2021
  2. Journal
    A Survey on Deep Visual Place Recognition
    Carlo Masone and Barbara Caputo
    IEEE Access, 2021