Multiscale Vision Transformer with Deep Clustering-Guided Refinement for Weakly Supervised Object Localization

David Minkwan Kim, Sinhae Cha, Byeongkeun Kang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

This work addresses the task of weakly-supervised object localization. The goal is to learn object localization using only image-level class labels, which are much easier to obtain than bounding box annotations. This task is important because it reduces the need for labor-intensive ground-truth annotations. However, object localization methods trained with weak supervision often suffer from limited localization accuracy. To address this challenge, we propose a multiscale object localization transformer (MOLT). It comprises multiple object localization transformers that extract patch embeddings across various scales. Moreover, we introduce a deep clustering-guided refinement method that further enhances localization accuracy by utilizing separately extracted image segments. These segments are obtained by clustering pixels using convolutional neural networks. Finally, we demonstrate the effectiveness of our proposed method by conducting experiments on the publicly available ILSVRC-2012 dataset.
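The abstract does not give implementation details, so the following is only an illustrative sketch of the general idea of fusing localization maps from several scales and refining the result with image segments. It is not the authors' method. The function names (`fuse_scales`, `refine_with_segments`, `bounding_box`), the use of plain averaging for fusion, the per-segment mean, and the 0.5 threshold are all assumptions made for this example; the segment map is taken as given rather than produced by a clustering network.

```python
import numpy as np


def fuse_scales(maps):
    """Average localization maps from several scales.

    Assumption: the maps were already resized to a common shape.
    Plain averaging is a stand-in, not the fusion used in the paper.
    """
    return np.mean(np.stack(maps, axis=0), axis=0)


def refine_with_segments(heatmap, segments, threshold=0.5):
    """Hypothetical segment-guided refinement.

    Each segment receives the mean heatmap value of its pixels, and
    segments whose mean reaches `threshold` are kept as foreground.
    """
    refined = np.zeros_like(heatmap, dtype=float)
    for label in np.unique(segments):
        region = segments == label
        refined[region] = heatmap[region].mean()
    return refined >= threshold


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the True pixels, or None."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


if __name__ == "__main__":
    # Toy 4x4 heatmap that responds strongly to only part of the object.
    heatmap = np.zeros((4, 4))
    heatmap[1, 1], heatmap[1, 2] = 0.9, 0.8
    heatmap[2, 1], heatmap[2, 2] = 0.7, 0.6
    heatmap[1, 3], heatmap[2, 3] = 0.3, 0.3

    # Toy segment map: label 1 covers the whole object, label 0 the rest.
    segments = np.zeros((4, 4), dtype=int)
    segments[1:3, 1:4] = 1

    print("raw box:    ", bounding_box(heatmap >= 0.5))
    print("refined box:", bounding_box(refine_with_segments(heatmap, segments)))
```

In this toy case the raw thresholded map covers only the most strongly activated part of the object, while the segment-level average extends the box to the full segment. This is the kind of effect a segment-guided refinement step is meant to achieve.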

Original language: English
Title of host publication: 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350359855
State: Published - 2023
Event: 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023 - Jeju, Korea, Republic of
Duration: 4 Dec 2023 to 7 Dec 2023

Publication series

Name: 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023

Conference

Conference: 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
Country/Territory: Korea, Republic of
City: Jeju
Period: 4/12/23 to 7/12/23

Keywords

  • neural networks
  • vision transformer
  • weakly-supervised learning
  • weakly-supervised object localization
