TY - JOUR
T1 - FAB
T2 - FPGA-Accelerated Fully-Pipelined Bottleneck Architecture With Batching for High-Performance MobileNetv2 Inference
AU - Kim, Young Chan
AU - Kim, Nam Joon
AU - Kim, Hyun
N1 - Publisher Copyright: © 2004-2012 IEEE.
PY - 2025
Y1 - 2025
AB - Lightweight neural networks (LWNNs) primarily employ the bottleneck block (BB) introduced in MobileNetv2 or similar architectural structures. However, the channel expansion-reduction process in BB imposes substantial activation memory overhead, a challenge that has not been adequately addressed in prior studies on LWNN accelerators incorporating BB. To overcome this limitation, we propose a fully-pipelined bottleneck architecture (FPB) optimized for the efficient hardware deployment of BB. FPB eliminates the need for intermediate off-chip memory access, effectively addressing deployment challenges associated with BB and enabling an end-to-end accelerator architecture. To enhance hardware efficiency, each FPB core utilizes 2-LUT DSP, Fused-ReLU6, and Q-Residual, optimizing computational performance while minimizing resource consumption. Furthermore, we introduce a batching technique that maximizes the benefits of FPB by ensuring high hardware utilization across FPB cores while enabling the concurrent processing of multiple images. To mitigate the off-chip memory access latency inherently incurred by batching, we propose a stem layer latency hiding technique, which effectively prevents performance degradation. We evaluate the performance of our proposed MobileNetv2 accelerator on the VCU118 board, achieving an energy efficiency of 120.7 GOPS/W at a batch size of 4. This represents an improvement of 1.5x to 10.5x over prior work. Depending on the batch size configuration, our FAB accelerator achieves a throughput performance ranging from 204.2 GOPS to 772.7 GOPS, demonstrating its high computational efficiency.
KW - Lightweight convolution neural network
KW - batching
KW - bottleneck block
KW - fully-pipelined bottleneck architecture
KW - high throughput
UR - https://www.scopus.com/pages/publications/105007288392
U2 - 10.1109/TCSI.2025.3573274
DO - 10.1109/TCSI.2025.3573274
M3 - Article
AN - SCOPUS:105007288392
SN - 1549-8328
VL - 72
SP - 6615
EP - 6628
JO - IEEE Transactions on Circuits and Systems I: Regular Papers
JF - IEEE Transactions on Circuits and Systems I: Regular Papers
IS - 11
ER -