Estimating Geospatial Trajectories of Videos Using Cross-View Image Matching
Abstract
Videos from a moving camera are an invaluable source of intelligence information. Video sources such as Unmanned Aircraft Systems (UASs) and hand-held devices may contain key information for explaining important events. It is therefore extremely useful to be able to estimate the geospatial trajectory of a video from a moving camera, so that additional information can be gathered for further analysis. Humans have an uncanny ability to remember places they have seen and to correlate them with novel sightings. In some instances, the human mind serves as a database that matches a new image to one it has already seen; in other instances, it immediately picks up on similarities between the new image and those in memory, and classifies them as belonging to a certain category. For example, architectural similarities may be a signature of particular cities, while the presence of certain plants may indicate the specific regions in which they grow. In general, automatically understanding a scene and inferring geographical information from it is an extremely challenging problem in computer vision. Given a sufficiently large database of images with known GPS locations, it is possible to attempt to match a query scene against those in the database and thereby obtain information about the query scene’s geographic location. In this proposal, we describe several stages of research toward the goal of retrieving location information from videos and estimating their geospatial trajectories. Since no suitable dataset of geo-tagged satellite reference images currently exists, in the first stage we propose to build and efficiently organize a reference database of millions of geo-tagged satellite images drawn from Google Earth satellite imagery.
Next, we propose to investigate deep learning methods that localize a video by finding the best visual matches between individual frames of the video and a dataset of geo-referenced satellite images. To achieve this, we propose two Convolutional Neural Network (ConvNet) models for feature matching between video frames and reference images. Because there is a large domain gap between query video frames, captured from aerial or street views, and satellite reference images, the first model performs adversarial domain adaptation, adapting representations at both the pixel level and the feature level. The second model learns feature embeddings by minimizing the distance between images from the same location (positive pairs) while pushing apart images from different geographical locations (negative pairs). Finally, we use the GPS locations of the matched reference images to reconstruct the geospatial trajectory of the video. Since localization results can be noisy because of mismatches between video frames and reference images, we propose a deep learning approach that employs a Long Short-Term Memory (LSTM) model to refine the noisy localizations. We then use the refined GPS localizations to generate the trajectories of our source videos.
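The embedding and matching steps described above can be illustrated with a minimal sketch. The abstract does not name a specific loss or retrieval scheme, so the triplet-style margin loss, the Euclidean nearest-neighbor lookup, the function names, and the toy two-dimensional embeddings and coordinates below are all assumptions made for illustration, not the authors' method. The adversarial domain adaptation and LSTM refinement stages are not shown.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hypothetical margin loss: pulls the positive pair together and
    pushes the negative pair apart. It is zero once the negative is at
    least `margin` farther from the anchor than the positive."""
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, gap + margin)


def localize_frames(frame_embeddings, reference_embeddings, reference_gps):
    """Assign each video frame the GPS tag of its nearest reference
    embedding, producing a raw (unrefined) trajectory."""
    trajectory = []
    for frame in frame_embeddings:
        best = min(
            range(len(reference_embeddings)),
            key=lambda i: l2_distance(frame, reference_embeddings[i]),
        )
        trajectory.append(reference_gps[best])
    return trajectory


# Toy data (assumed): three reference tiles with made-up coordinates.
reference_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reference_gps = [(28.60, -81.20), (28.61, -81.20), (28.60, -81.19)]

# Toy frame embeddings, each lying close to one reference tile.
frame_embeddings = [[0.1, 0.0], [0.9, 0.1], [0.1, 0.8]]

trajectory = localize_frames(frame_embeddings, reference_embeddings, reference_gps)
print(trajectory)
```

In this sketch, each frame inherits the coordinates of its closest reference image, so a single mismatched frame produces a jump in the raw trajectory. This is the kind of noise that the proposed LSTM refinement stage is intended to smooth.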
Document Details
- Document Type: DoD Grant Award
- Publication Date: Oct 06, 2020
- Source ID: HM04762010001
Entities
People
- Mubarak Shah
Organizations
- National Geospatial-Intelligence Agency
- University of Central Florida Board of Trustees