Abstract
Advancements in hardware accelerators, such as graphics processing units and neural processing units, have significantly propelled computer vision research. The vision transformer (ViT), leveraging the multi-head self-attention (MHSA) mechanism, has surpassed convolutional neural networks (CNNs) in accuracy, but its large size and computational demands hinder mobile and edge deployment. In addition, as privacy concerns push for on-device training, quantization methods for ViTs, particularly gradient quantization, have gained attention. Unlike CNNs, ViTs are hampered by outliers and a complex loss landscape. To address this, we propose a gradient quantization framework that stabilizes training by adapting quantization points based on interquartile ranges and constructing an outlier-robust loss function. Additionally, we employ a scaling method to align quantized gradients with the original gradients and adaptively assign the learning rate based on a quantization error analysis. When quantizing weights, activations, and gradients to INT8, our method improves performance by 0.52% and 0.21% over DeiT-Base and Swin-Base, respectively, and achieves near parity with MobileViT-S, with only a 0.09% accuracy drop. Furthermore, applying our framework to MobileViT yields a 2.06× speedup in a CUDA 11.8 environment.
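To make the described pipeline concrete, the following is a minimal NumPy sketch of the kind of outlier-robust INT8 gradient quantization the abstract outlines: a clipping bound derived from the interquartile range, a rescaling step that aligns the dequantized gradient with the original gradient, and a learning rate damped by the relative quantization error. All names (`iqr_clip_range`, `quantize_grad_int8`, `adaptive_lr`), the IQR multiplier `k`, and the specific norm-matching and learning-rate rules are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def iqr_clip_range(grad: np.ndarray, k: float = 1.5) -> float:
    """Pick a clipping bound from the interquartile range so that rare
    outliers do not stretch the INT8 quantization grid (assumed rule)."""
    q1, q3 = np.percentile(grad, [25, 75])
    iqr = q3 - q1
    return max(abs(q1 - k * iqr), abs(q3 + k * iqr))

def quantize_grad_int8(grad: np.ndarray, k: float = 1.5):
    """Symmetric per-tensor INT8 quantization with an extra scaling step
    that matches the dequantized gradient's norm to the original's."""
    bound = iqr_clip_range(grad, k)
    scale = bound / 127.0 if bound > 0 else 1.0
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    deq = q.astype(np.float32) * scale
    denom = np.linalg.norm(deq)
    align = np.linalg.norm(grad) / denom if denom > 0 else 1.0
    return q, scale * align  # dequantize as q * (scale * align)

def adaptive_lr(base_lr: float, grad: np.ndarray, deq_grad: np.ndarray) -> float:
    """Hypothetical rule (not from the paper): shrink the step size when
    the relative quantization error of the gradient is large."""
    err = np.linalg.norm(grad - deq_grad) / (np.linalg.norm(grad) + 1e-12)
    return base_lr / (1.0 + err)

# Usage: quantize a gradient tensor with injected outliers, then derive
# the aligned dequantized gradient and an error-aware learning rate.
grad = np.random.randn(1024).astype(np.float32)
grad[::97] *= 50.0                      # a few synthetic outliers
q, scale = quantize_grad_int8(grad)
deq_grad = q.astype(np.float32) * scale
lr = adaptive_lr(1e-3, grad, deq_grad)
```

The IQR-based bound stands in for any robust range estimator; the point of the sketch is that the clipping range tracks the bulk of the gradient distribution rather than its extremes, which is what keeps the INT8 grid fine enough for stable training.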
| Original language | English |
|---|---|
| Pages (from-to) | 16019-16027 |
| Number of pages | 9 |
| Journal | Proceedings of the AAAI Conference on Artificial Intelligence |
| Volume | 39 |
| Issue number | 15 |
| DOIs | |
| State | Published - 11 Apr 2025 |
| Event | 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025), Philadelphia, United States, 25 Feb 2025 – 4 Mar 2025 |