Abstract
Transformer-based models have achieved remarkable success across various AI tasks, but their growing complexity has led to significant computational and memory demands. While most optimization efforts have focused on linear operations such as matrix multiplications, non-linear functions like Softmax and layer normalization (LayerNorm) increasingly dominate inference latency, especially for long sequences and high-dimensional inputs. To address this emerging bottleneck, we present a hardware accelerator that jointly approximates these non-linear functions, using piecewise linear approximation for the exponential in Softmax and Newton–Raphson iteration for the square root in LayerNorm. The unified architecture dynamically switches between operation modes while reusing shared hardware resources. We implemented the accelerator on a Xilinx VU37P FPGA and evaluated it with BERT and GPT-2 models. Experimental results demonstrate speedups of up to 7.6× for Softmax and 2.0× for LayerNorm, while maintaining less than 1% accuracy degradation on classification tasks with conservative approximation settings. Generation tasks, however, showed greater sensitivity to approximation, underscoring the need for task-specific tuning.
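The two approximation schemes named in the abstract can be sketched in software as follows. This is a minimal illustrative model, not the paper's implementation: the segment count of the piecewise linear table, the Newton–Raphson iteration count, and the initial-guess strategy are all assumptions chosen for clarity (a hardware design would use fixed-point arithmetic and a seed lookup table).

```python
import math

# --- Piecewise linear (PWL) approximation of exp(x) on [-8, 0] ---
# After max-subtraction, Softmax inputs are <= 0, so a bounded range suffices.
# 32 segments is an illustrative choice, not the paper's configuration.
SEGMENTS = 32
LO, HI = -8.0, 0.0
STEP = (HI - LO) / SEGMENTS
# Precompute (slope, intercept) per segment; in hardware this is a small LUT.
_TABLE = []
for i in range(SEGMENTS):
    x0, x1 = LO + i * STEP, LO + (i + 1) * STEP
    y0, y1 = math.exp(x0), math.exp(x1)
    slope = (y1 - y0) / (x1 - x0)
    _TABLE.append((slope, y0 - slope * x0))

def pwl_exp(x):
    """Piecewise linear exp(x) for x in [LO, HI]; clamps outside that range."""
    x = max(LO, min(HI, x))
    i = min(int((x - LO) / STEP), SEGMENTS - 1)
    slope, intercept = _TABLE[i]
    return slope * x + intercept

def approx_softmax(xs):
    """Softmax using the PWL exponential after max-subtraction."""
    m = max(xs)
    exps = [pwl_exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# --- Newton-Raphson square root for LayerNorm ---
def nr_sqrt(v, iters=4):
    """Newton-Raphson iteration for sqrt(v): y <- 0.5 * (y + v / y)."""
    if v == 0.0:
        return 0.0
    y = v if v >= 1.0 else 1.0  # crude initial guess; hardware would seed from a LUT
    for _ in range(iters):
        y = 0.5 * (y + v / y)
    return y

def approx_layernorm(xs, eps=1e-5):
    """LayerNorm whose square root is replaced by the Newton-Raphson iteration."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = nr_sqrt(var + eps)
    return [(x - mean) / std for x in xs]
```

Because both kernels reduce to a table lookup plus a multiply-accumulate (PWL) or a short sequence of multiply/divide steps (Newton–Raphson), they map naturally onto shared datapath resources, which is the kind of reuse the unified architecture exploits.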
| Original language | English |
|---|---|
| Article number | 2337 |
| Journal | Electronics (Switzerland) |
| Volume | 14 |
| Issue number | 12 |
| DOIs | |
| State | Published - Jun 2025 |
Keywords
- accelerator
- approximation
- FPGA
- non-linear function
- transformer