TY - GEN
T1 - Mixed Precision Quantization with Hardware-Friendly Activation Functions for Hybrid ViT Models
AU - Kang, Beom Jin
AU - Choi, Da Hun
AU - Kim, Hyun
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - With recent advances in hardware, deep learning models such as convolutional neural networks (CNNs) have achieved high accuracy on a wide range of artificial intelligence tasks. In computer vision in particular, vision transformer (ViT)-based models have made unprecedented progress, and CNN + ViT hybrid models that combine the strengths of both architectures have also been proposed. However, the large number of parameters in hybrid ViTs makes them unsuitable for resource-constrained mobile/edge environments. In addition, the nonlinear activation functions used in hybrid ViTs (e.g., GeLU and Swish) require more hardware resources and computation than integer-friendly functions (e.g., ReLU) on dedicated hardware accelerators. To address these issues, we propose a technique to efficiently compress MobileViT, a prominent hybrid ViT model, by applying mixed-precision quantization together with the Shift-Swish activation function. Compressing the MobileViT-s, MobileViT-xs, and MobileViT-xxs models with the proposed method on the ImageNet dataset resulted in minimal accuracy drops of 0.41%, 0.18%, and 0.86%, respectively, while achieving effective quantization and activation-function approximation at an average precision of 7.9 bits.
KW - Activation function
KW - Deep learning
KW - Mixed precision
KW - Quantization
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85189240599&partnerID=8YFLogxK
U2 - 10.1109/ICEIC61013.2024.10457283
DO - 10.1109/ICEIC61013.2024.10457283
M3 - Conference contribution
AN - SCOPUS:85189240599
T3 - 2024 International Conference on Electronics, Information, and Communication, ICEIC 2024
BT - 2024 International Conference on Electronics, Information, and Communication, ICEIC 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Conference on Electronics, Information, and Communication, ICEIC 2024
Y2 - 28 January 2024 through 31 January 2024
ER -