Patch-Level Consistency Regularization in Self-Supervised Transfer Learning for Fine-Grained Image Recognition

Yejin Lee, Suho Lee, Sangheum Hwang

Research output: Contribution to journal › Article › peer-review

Abstract

Fine-grained image recognition aims to classify fine-grained subcategories belonging to the same parent category, such as vehicle model or bird species classification. This is an inherently challenging task because a classifier must capture subtle interclass differences under large intraclass variances. Most previous approaches are based on supervised learning, which requires a large-scale labeled dataset. However, such large-scale annotated datasets for fine-grained image recognition are difficult to collect because they generally require domain expertise during the labeling process. In this study, we propose a self-supervised transfer learning method based on Vision Transformer (ViT) to learn finer representations without human annotations. Interestingly, it is observed that existing self-supervised learning methods using ViT (e.g., DINO) show poor patch-level semantic consistency, which may be detrimental to learning finer representations. Motivated by this observation, we propose a consistency loss function that encourages patch embeddings of the overlapping area between two augmented views to be similar to each other during self-supervised learning on fine-grained datasets. In addition, we explore effective transfer learning strategies to fully leverage existing self-supervised models trained on large-scale labeled datasets. Contrary to the previous literature, our findings indicate that training only the last block of ViT is effective for self-supervised transfer learning. We demonstrate the effectiveness of our proposed approach through extensive experiments using six fine-grained image classification benchmark datasets, including FGVC Aircraft, CUB-200-2011, Food-101, Oxford 102 Flowers, Stanford Cars, and Stanford Dogs. Under the linear evaluation protocol, our method achieves an average accuracy of (Formula presented.), outperforming the existing transfer learning method, which yields (Formula presented.).
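
The abstract describes two concrete ideas: a consistency loss over patch embeddings of the overlapping region between two augmented views, and a transfer strategy that updates only the last ViT block. The sketch below is not the authors' released implementation; it is a minimal, hedged illustration under assumed interfaces (PyTorch tensors of shape (B, N, D) for patch tokens, precomputed index pairs for overlapping patches, and a timm-style ViT with a `.blocks` list).

```python
# Minimal sketch of the patch-level consistency idea and the "train only the
# last block" transfer strategy, as described in the abstract. All names and
# tensor layouts here are illustrative assumptions, not the paper's code.

import torch
import torch.nn.functional as F


def patch_consistency_loss(patches_a, patches_b, overlap_idx_a, overlap_idx_b):
    """Consistency between corresponding patch embeddings of two views.

    patches_a, patches_b: (B, N, D) patch token embeddings from two augmented views.
    overlap_idx_a, overlap_idx_b: (B, M) indices of patches that fall inside the
        overlapping area of the two crops, paired element-wise.
    """
    d = patches_a.size(-1)
    # Gather the embeddings of the overlapping patches from each view.
    a = torch.gather(patches_a, 1, overlap_idx_a.unsqueeze(-1).expand(-1, -1, d))
    b = torch.gather(patches_b, 1, overlap_idx_b.unsqueeze(-1).expand(-1, -1, d))
    # Encourage paired patch embeddings to be similar (1 - cosine similarity).
    return (1.0 - F.cosine_similarity(a, b, dim=-1)).mean()


def freeze_all_but_last_block(vit):
    """Update only the last Transformer block of a pretrained ViT
    (assumes a timm-style model exposing its blocks as `vit.blocks`)."""
    for p in vit.parameters():
        p.requires_grad = False
    for p in vit.blocks[-1].parameters():
        p.requires_grad = True
```

In practice this loss term would be added to the underlying self-supervised objective (e.g., the DINO loss) during transfer to the fine-grained dataset; the weighting between the two terms is a design choice not specified in the abstract.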

Original language: English
Article number: 10493
Journal: Applied Sciences (Switzerland)
Volume: 13
Issue number: 18
DOIs
State: Published - Sep 2023

Keywords

  • Vision Transformer
  • fine-grained image recognition
  • self-supervised learning
  • transfer learning
