Improving visual relationship detection using linguistic and spatial cues

Jaewon Jung, Jongyoul Park

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Detecting visual relationships in an image is important for image understanding: it supports higher-level tasks such as predicting the next scene and explaining what occurs in an image. A visual relationship comprises a subject, a predicate, and an object, and draws on visual, linguistic, and spatial cues. The predicate describes the relationship between the subject and the object and falls into categories such as prepositions and verbs. A large visual gap can exist even among relationships that share the same predicate. This study improves on a previous approach, which uses linguistic cues via two losses and a spatial cue that encodes only individual information, by adding relative spatial information about the subject and object. An architectural limitation of the earlier model is demonstrated and overcome so that all zero-shot visual relationships can be detected. A new problem is also identified, along with an explanation of how it degrades performance. Experiments on the VRD and VG datasets show a significant improvement over previous results.
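To make the terminology concrete, the sketch below shows a minimal representation of a visual relationship triple and one common way to encode a relative spatial cue between the subject and object boxes. The class names and the particular normalized-offset encoding are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class VisualRelationship:
    """A <subject, predicate, object> triple with its grounding boxes."""
    subject: str
    predicate: str
    object: str
    subject_box: Box
    object_box: Box

def relative_spatial_feature(s: Box, o: Box) -> list[float]:
    """Offsets of the object box relative to the subject box, normalized
    by the subject's width and height. This captures *relative* spatial
    information, as opposed to per-box (individual) information alone.
    (Assumed encoding, shown for illustration.)"""
    sw, sh = s.x2 - s.x1, s.y2 - s.y1
    return [(o.x1 - s.x1) / sw, (o.y1 - s.y1) / sh,
            (o.x2 - s.x2) / sw, (o.y2 - s.y2) / sh]

# Example triple: "person rides horse"
rel = VisualRelationship("person", "rides", "horse",
                         Box(0, 0, 10, 10), Box(5, 5, 15, 15))
feat = relative_spatial_feature(rel.subject_box, rel.object_box)
```

Encoding the pair jointly in this way lets relationships with the same predicate but different appearances (e.g. "person rides horse" vs. "person rides bike") share a consistent geometric signal.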

Original language: English
Pages (from-to): 399-410
Number of pages: 12
Journal: ETRI Journal
Volume: 42
Issue number: 3
DOIs
State: Published - 1 Jun 2020

Keywords

  • deep learning
  • image retrieval
  • image understanding
  • predicate
  • visual relationship
