A semantic SLAM model for autonomous mobile robots using content based image retrieval techniques
thesis posted on 2017-01-31, 04:21 authored by Tan, Choon Ling
Localization and environmental mapping are two fundamental functions for an autonomous mobile robot. This thesis develops a new framework that allows a robot with a vision sensor to achieve both of these functions simultaneously. The novel approach attempts to interpret video images for their meaning, generating a map and localization data from these meanings. The experiments show promising results for this new approach.
If robots are to perform robustly with no a priori knowledge of their environment, then they must be able to perform Simultaneous Localization and Mapping (SLAM), whereby the robot incrementally builds a map of the environment it is navigating while simultaneously keeping track of its location within the built map. SLAM might be considered a solved problem. There is certainly a large body of literature, discussed in this thesis, which provides decent models for building robust solutions. However, most, if not all, of the current state-of-the-art SLAM techniques rely on a tight loop of detecting and tracking low-level features to update the robot's current pose, or location. We argue that these methods are brittle and do not offer general-purpose solutions to the problem. This thesis takes a cognitive approach to the subject and develops a new SLAM model based on extracting semantic information from the robot’s sensor data.
In this thesis, we develop a new SLAM framework which analyses video streams for semantic content. We do this with inspiration from the Content Based Image Retrieval (CBIR) research area. We use the well-established Tamura texture features to decompose the video stream into a grid of lexemes (or recognized categories), which we then use to construct grammatical sentences. These sentences form place descriptions and are used for constructing the environmental map and for localization.
In contrast to engineered methods, our framework does not return precise location information, as we argue it is enough to know roughly where one is located. We have implemented a proof-of-concept model and tested it in both indoor and outdoor environments. The results show that our model can construct useful semantic descriptions from the video stream and use these descriptions to implement SLAM. The derived semantic descriptions are fairly coarse, owing to the limitations of the Tamura texture features. The technique could be refined by adopting a richer set of feature vectors; we leave this as future work.
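To make the idea of a grid of lexemes more concrete, the following is a minimal sketch and not the implementation from the thesis. It assumes a grayscale frame held in a NumPy array and uses only one Tamura feature, contrast (the standard deviation divided by the fourth root of the kurtosis). The grid size, the three-word vocabulary, the thresholds, the plain word-joining used in place of a grammar, and the cell-matching similarity score are all hypothetical choices made for illustration.

```python
import numpy as np


def tamura_contrast(cell):
    """Tamura contrast: sigma / kurtosis**0.25, with kurtosis = mu4 / sigma**4."""
    cell = np.asarray(cell, dtype=np.float64)
    var = cell.var()
    if var == 0.0:
        return 0.0  # a perfectly flat cell has no contrast
    mu4 = np.mean((cell - cell.mean()) ** 4)
    kurtosis = mu4 / var ** 2
    return float(np.sqrt(var) / kurtosis ** 0.25)


def frame_to_lexemes(frame, rows=3, cols=3, thresholds=(10.0, 40.0)):
    """Split a grayscale frame into rows x cols cells and label each cell.

    The vocabulary and thresholds are hypothetical, not taken from the thesis.
    """
    names = ("smooth", "textured", "busy")
    h, w = frame.shape
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            cell = frame[r * h // rows:(r + 1) * h // rows,
                         c * w // cols:(c + 1) * w // cols]
            # Index of the contrast band the cell falls into.
            idx = int(np.searchsorted(thresholds, tamura_contrast(cell)))
            row.append(names[idx])
        grid.append(row)
    return grid


def describe_place(grid):
    """Join the lexemes row by row into a simple textual place description."""
    return " / ".join(" ".join(row) for row in grid)


def similarity(grid_a, grid_b):
    """Fraction of grid cells whose lexemes agree (assumes equal grid shapes)."""
    flat_a = [lex for row in grid_a for lex in row]
    flat_b = [lex for row in grid_b for lex in row]
    matches = sum(a == b for a, b in zip(flat_a, flat_b))
    return matches / len(flat_a)
```

In a sketch like this, a coarse form of localization would amount to comparing the description of the current frame against stored place descriptions and picking the best match, which is consistent with the abstract's point that a rough location can be sufficient.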