TY - GEN
T1 - HyQ: Hardware-Friendly Post-Training Quantization for CNN-Transformer Hybrid Networks
T2 - 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
AU - Kim, Nam Joon
AU - Lee, Jongho
AU - Kim, Hyun
N1 - Publisher Copyright:
© 2024 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Hybrid models that combine CNNs and ViTs have recently emerged as state-of-the-art computer vision models. Quantization is emerging as a promising solution for efficiently deploying these hybrid models on resource-constrained mobile/edge devices. However, post-training quantization (PTQ), which requires neither retraining nor labeled data, has not been extensively studied for hybrid models. In this study, we propose a novel PTQ technique specialized for CNN-transformer hybrid models, taking into account their hardware design on AI accelerators such as GPUs and FPGAs. First, we introduce quantization-aware distribution scaling to address the large outliers caused by inter-channel variance in convolution layers. Furthermore, in the transformer block, we propose approximating the integer-only softmax with a linear function, which avoids costly FP32/INT32 multiplications and enables more efficient computation. Experimental results show that the proposed quantization method with INT8 precision incurs only a 0.39% accuracy drop compared with the FP32 baseline on MobileViT-s with the ImageNet-1k dataset. Furthermore, when implemented on an FPGA platform, the proposed linear softmax achieves significant resource savings, reducing look-up table and flip-flop usage by 1.8 ∼ 2.1× and 1.3 ∼ 1.9×, respectively, compared with the existing second-order polynomial approximation. The code is available at https://github.com/IDSL-SeoulTech/HyQ.
UR - http://www.scopus.com/inward/record.url?scp=85204280457&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85204280457
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 4291
EP - 4299
BT - Proceedings of the 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
A2 - Larson, Kate
PB - International Joint Conferences on Artificial Intelligence
Y2 - 3 August 2024 through 9 August 2024
ER -