Video Captioning Based on Both Egocentric and Exocentric Views of Robot Vision for Human-Robot Interaction

Soo Han Kang, Ji Hyeong Han

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Robot vision provides the most important information to robots so that they can read the context and interact with human partners successfully. Moreover, to allow humans to recognize the robot’s visual understanding during human-robot interaction (HRI), the best way is for the robot to explain its understanding in natural language. In this paper, we propose a new approach to interpret robot vision from an egocentric standpoint and to generate descriptions that explain egocentric videos, particularly for HRI. Because robot vision corresponds to egocentric video on the robot’s side, it contains both egocentric and exocentric view information. Thus, we propose a new dataset, referred to as the global, action, and interaction (GAI) dataset, which consists of egocentric video clips and GAI descriptions in natural language representing both egocentric and exocentric information. An encoder-decoder-based deep learning model is trained on the GAI dataset, and its performance on description generation is evaluated. We also conduct experiments in real environments to verify whether the GAI dataset and the trained deep learning model can improve a robot vision system.
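
The abstract describes the captioning model only as encoder-decoder based, so the following is a minimal illustrative sketch, not the authors' implementation: a generic encoder-decoder video captioner in PyTorch, in which per-frame features (e.g., from a pretrained CNN applied to each egocentric frame) are summarized by an LSTM encoder and an LSTM decoder conditioned on that summary emits the description token by token. All class names, dimensions, and the vocabulary size are assumptions made for illustration.

    # Illustrative sketch only; architecture details are assumed, not taken from the paper.
    import torch
    import torch.nn as nn

    class VideoCaptioner(nn.Module):
        def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
            super().__init__()
            # Encoder: summarizes the sequence of per-frame visual features.
            self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            # Decoder: generates a GAI-style description one token at a time.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, frame_feats, caption_tokens):
            # frame_feats: (batch, num_frames, feat_dim)
            # caption_tokens: (batch, caption_len) token ids, used with teacher forcing
            _, (h, c) = self.encoder(frame_feats)      # final state summarizes the clip
            emb = self.embed(caption_tokens)           # (batch, caption_len, embed_dim)
            dec_out, _ = self.decoder(emb, (h, c))     # decoder conditioned on video state
            return self.out(dec_out)                   # (batch, caption_len, vocab_size)

    if __name__ == "__main__":
        model = VideoCaptioner()
        feats = torch.randn(2, 16, 2048)               # 2 clips, 16 frames of features each
        tokens = torch.randint(0, 10000, (2, 12))      # dummy caption token ids
        print(model(feats, tokens).shape)              # torch.Size([2, 12, 10000])

In practice such a model is trained with cross-entropy loss over the logits against the ground-truth description tokens, which is the standard setup for encoder-decoder video captioning.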

Original language: English
Pages (from-to): 631-641
Number of pages: 11
Journal: International Journal of Social Robotics
Volume: 15
Issue number: 4
DOIs
State: Published - Apr 2023

Keywords

  • Deep learning
  • Egocentric video
  • Human robot interaction
  • Video captioning
