TY - GEN
T1 - Multiscale Vision Transformer with Deep Clustering-Guided Refinement for Weakly Supervised Object Localization
AU - Kim, David Minkwan
AU - Cha, Sinhae
AU - Kang, Byeongkeun
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - This work addresses the task of weakly-supervised object localization. The goal is to learn object localization using only image-level class labels, which are much easier to obtain than bounding box annotations. This task is important because it reduces the need for labor-intensive ground-truth annotations. However, methods for object localization trained using weak supervision often suffer from limited localization accuracy. To address this challenge and enhance localization accuracy, we propose a multiscale object localization transformer (MOLT). It comprises multiple object localization transformers that extract patch embeddings across various scales. Moreover, we introduce a deep clustering-guided refinement method that further enhances localization accuracy by utilizing separately extracted image segments. These segments are obtained by clustering pixels using convolutional neural networks. Finally, we demonstrate the effectiveness of our proposed method by conducting experiments on the publicly available ILSVRC-2012 dataset.
AB - This work addresses the task of weakly-supervised object localization. The goal is to learn object localization using only image-level class labels, which are much easier to obtain than bounding box annotations. This task is important because it reduces the need for labor-intensive ground-truth annotations. However, methods for object localization trained using weak supervision often suffer from limited localization accuracy. To address this challenge and enhance localization accuracy, we propose a multiscale object localization transformer (MOLT). It comprises multiple object localization transformers that extract patch embeddings across various scales. Moreover, we introduce a deep clustering-guided refinement method that further enhances localization accuracy by utilizing separately extracted image segments. These segments are obtained by clustering pixels using convolutional neural networks. Finally, we demonstrate the effectiveness of our proposed method by conducting experiments on the publicly available ILSVRC-2012 dataset.
KW - neural networks
KW - vision transformer
KW - weakly-supervised learning
KW - weakly-supervised object localization
UR - http://www.scopus.com/inward/record.url?scp=85184849156&partnerID=8YFLogxK
U2 - 10.1109/VCIP59821.2023.10402750
DO - 10.1109/VCIP59821.2023.10402750
M3 - Conference contribution
AN - SCOPUS:85184849156
T3 - 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
BT - 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023
Y2 - 4 December 2023 through 7 December 2023
ER -