Vision transformer models for mobile/edge devices: a survey

Seung Il Lee, Kwanghyun Koo, Jong Ho Lee, Gilha Lee, Sangbeom Jeong, Seongjun O, Hyun Kim

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

With the rapidly growing demand for high-performance deep learning vision models on mobile and edge devices, this paper emphasizes the importance of compact deep learning-based vision models that provide high accuracy while maintaining a small model size. In particular, building on the success of transformer models in natural language processing and computer vision tasks, the paper offers a comprehensive examination of recent research on redesigning the Vision Transformer (ViT) into compact architectures suitable for mobile/edge devices. Compact ViT models are classified into three major categories: (1) architecture and hierarchy restructuring, (2) encoder block enhancements, and (3) integrated approaches, and each category is reviewed in detail. The paper also analyzes how each method contributes to model performance and computational efficiency, providing a deeper understanding of how to implement ViT models efficiently on edge devices. As a result, it offers new insights into the design and implementation of compact ViT models for researchers in this field and provides guidelines for optimizing the performance and efficiency of deep learning vision models on edge devices.
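For context, the sketch below shows a minimal, standard ViT encoder block (pre-norm multi-head self-attention followed by an MLP, both with residual connections). It is a generic illustration of the structure that the "encoder block enhancements" category targets, not a method from the surveyed papers; the dimensions and hyperparameters are illustrative assumptions.

    # Minimal sketch of a standard ViT encoder block (illustrative only; not from the paper).
    # Compact-ViT work in the "encoder block enhancements" category typically modifies
    # the attention and/or MLP sub-layers shown here to reduce parameters and FLOPs.
    import torch
    import torch.nn as nn

    class ViTEncoderBlock(nn.Module):
        def __init__(self, dim=192, num_heads=3, mlp_ratio=4.0):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, int(dim * mlp_ratio)),
                nn.GELU(),
                nn.Linear(int(dim * mlp_ratio), dim),
            )

        def forward(self, x):
            # Pre-norm multi-head self-attention with a residual connection.
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h)
            x = x + attn_out
            # Pre-norm MLP with a residual connection.
            x = x + self.mlp(self.norm2(x))
            return x

    tokens = torch.randn(1, 197, 192)  # e.g., 196 patch tokens + 1 class token
    print(ViTEncoderBlock()(tokens).shape)  # torch.Size([1, 197, 192])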

Original language: English
Article number: 109
Journal: Multimedia Systems
Volume: 30
Issue number: 2
DOIs
State: Published - Apr 2024

Keywords

  • Mobile/edge devices
  • Survey
  • Vision transformer
