Hardware-aware Network Compression for Hybrid Vision Transformer via Low-Rank Approximation

Beom Jin Kang, Nam Joon Kim, Hyun Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In recent years, transformer-based models have achieved excellent performance in various fields such as computer vision and natural language processing. In particular, vision transformer (ViT) models outperform conventional convolutional neural networks (CNNs) in image classification by achieving higher accuracy. However, ViT-based models often require more parameters than CNNs, which makes efficient deployment difficult in memory-constrained environments such as mobile devices. For example, the peak memory required by the output header layer of EfficientViT is 39.32 Mb. Deploying such a layer on the Zynq-7000 XC7Z045 FPGA board requires off-chip memory access, leading to inefficient power consumption. To address these issues, we apply a low-rank approximation method to reduce the memory requirements of the EfficientViT-B1 model. On the ImageNet dataset, the proposed method applied to EfficientViT-B1 incurs only a 0.43% accuracy drop while eliminating the need for DRAM access.
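The paper's exact compression procedure is not reproduced here; the snippet below is only a minimal PyTorch sketch of the general truncated-SVD low-rank factorization idea the abstract refers to, in which a dense weight matrix is replaced by two smaller factors. The layer dimensions, the chosen rank, and the helper name low_rank_factorize are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate a dense linear layer with two smaller layers via truncated SVD.

    The weight W (out_features x in_features) is approximated as
    (U_r * S_r) @ V_r, stored as two nn.Linear layers, so the parameter
    count drops from out*in to rank*(out + in).
    """
    W = linear.weight.data                     # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # Keep only the top-'rank' singular components.
    U_r = U[:, :rank] * S[:rank]               # fold singular values into the left factor
    V_r = Vh[:rank, :]

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = V_r                    # (rank, in_features)
    second.weight.data = U_r                   # (out_features, rank)
    if linear.bias is not None:
        second.bias.data = linear.bias.data.clone()
    return nn.Sequential(first, second)


# Hypothetical classifier head with dimensions chosen only for illustration.
head = nn.Linear(1536, 1000)
compressed = low_rank_factorize(head, rank=128)
```

With these illustrative numbers, the parameter count of the head falls from 1536 x 1000 to 128 x (1536 + 1000), roughly a 4.7x reduction; in practice the rank would be chosen per layer to fit the target on-chip memory budget while bounding the accuracy drop.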

Original language: English
Title of host publication: Proceedings - International SoC Design Conference 2024, ISOCC 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 171-172
Number of pages: 2
ISBN (Electronic): 9798350377088
DOIs
State: Published - 2024
Event: 21st International System-on-Chip Design Conference, ISOCC 2024 - Sapporo, Japan
Duration: 19 Aug 2024 - 22 Aug 2024

Publication series

Name: Proceedings - International SoC Design Conference 2024, ISOCC 2024

Conference

Conference: 21st International System-on-Chip Design Conference, ISOCC 2024
Country/Territory: Japan
City: Sapporo
Period: 19/08/24 - 22/08/24

Keywords

  • Computer Vision
  • Deep Learning
  • FPGA
  • Network Compression
  • Vision Transformer
