Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation

Published: 2024-05-16 Issue: 7 Volume: 20 Page: 1-22
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Hu Xiaobo¹, Lin Youfang², Fan Hehe³, Wang Shuo², Wu Zhihao², Lv Kai²

Affiliation:

1. Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China

2. Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China

3. College of Computer Science and Technology, Zhejiang University, Zhejiang, China

Abstract

Given an object of interest, visual navigation aims to reach the object’s location based on a sequence of partial observations. To this end, an agent needs to (1) acquire specific knowledge about the relations among object categories in the world during training and (2) locate the target object in the current unseen environment based on the pre-learned object category relations and its own trajectory. In this article, we propose a Category Relation Graph (CRG) to learn the layout relations among object categories and a Temporal-Spatial-Region attention (TSR) architecture to perceive the long-term spatial-temporal dependencies of objects, aiding navigation. We establish CRG to learn prior knowledge of object layout and deduce the positions of specific objects. Subsequently, we propose the TSR architecture to capture relationships among objects across the temporal, spatial, and regional dimensions of observation trajectories. Specifically, we implement a Temporal attention module (T) to model the temporal structure of the observation sequence, implicitly encoding historical movement or trajectory information. Then, a Spatial attention module (S) uncovers the spatial context of the currently observed objects based on CRG and past observations. Last, a Region attention module (R) shifts the attention to the target-relevant region. Leveraging the visual representation extracted by our method, the agent accurately perceives the environment and easily learns a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The supplementary material includes the code, which will be made publicly available.

Funder

National Natural Science Foundation of China

Aeronautical Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3653714

Reference: 73 articles.

1. Mobile robot navigation using neural networks and nonmetrical environmental models

2. Vision based MAV navigation in unknown and unstructured environments

3. Probabilistic Appearance Based Navigation and Loop Closing

4. A solution to the simultaneous localization and map building (SLAM) problem

5. Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, and Shankar Sastry. 2003. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems. MIT Press, 799–806. Retrieved from https://proceedings.neurips.cc/paper/2003/hash/b427426b8acd2c2e53827970f2c2f526-Abstract.html