Accelerating language giants: A survey of optimization strategies for LLM inference on hardware platforms

  • Young Chan Kim
  • Seok Kyu Yoon
  • Sung Soo Han
  • Chae Won Park
  • Jun Oh Park
  • Jun Ha Ko
  • Hyun Kim

Research output: Contribution to journal › Review article › peer-review

Abstract

Large language models (LLMs) built on the transformer architecture and trained on massive datasets have achieved outstanding results in natural language processing tasks such as translation and summarization. Among these, decoder-only LLMs have attracted particular attention for their superior few-shot and zero-shot capabilities compared with other architectures. Motivated by this performance, numerous efforts have been made to deploy decoder-only LLMs on diverse hardware platforms. However, the substantial computational and memory demands of both training and inference pose considerable challenges for resource-constrained hardware. Although efficient architectural designs have been proposed to address these issues, LLM inference still requires excessive computational and memory resources. Consequently, extensive research has focused on compressing model components and improving inference efficiency across hardware platforms. To further accelerate the inherently repetitive computations of LLMs, a variety of approaches have been introduced, combining operator-level optimizations within transformer blocks with system-level optimizations at the granularity of repeated transformer block execution. This paper surveys recent research on decoder-only LLM inference acceleration, categorizing existing approaches by the optimization levels specific to each hardware platform. Building on this classification, we provide a comprehensive analysis of prior decoder-only LLM acceleration techniques from multiple perspectives.
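
The survey itself is not reproduced on this page, but as a rough illustration of the "repeated transformer block execution" the abstract refers to, the minimal sketch below shows a decoder-only generation loop with key-value (KV) caching, a canonical example of the kind of inference optimization such surveys categorize. All names here (block_step, generate, W_qkv, etc.) are hypothetical and indicate only the structure of the computation, not any specific system discussed in the paper.

```python
# Minimal, hypothetical sketch of decoder-only inference with KV caching.
# None of these names come from the surveyed paper; they only illustrate
# the repeated transformer-block execution the abstract describes.
import numpy as np

D, N_BLOCKS = 64, 4  # hidden size and number of transformer blocks (toy values)
rng = np.random.default_rng(0)
W_qkv = [rng.standard_normal((D, 3 * D)) / np.sqrt(D) for _ in range(N_BLOCKS)]

def block_step(x, layer, kv_cache):
    """One toy transformer block applied to a single new token vector x.

    Operator-level optimizations target the math inside this function
    (e.g., quantized matmuls, fused attention kernels); system-level
    optimizations target how it is scheduled and repeated across layers,
    tokens, and requests (e.g., batching, KV caching).
    """
    q, k, v = np.split(x @ W_qkv[layer], 3)
    kv_cache[layer]["k"].append(k)   # reuse past keys/values instead of
    kv_cache[layer]["v"].append(v)   # recomputing them for every token
    K = np.stack(kv_cache[layer]["k"])           # shape (t, D)
    V = np.stack(kv_cache[layer]["v"])           # shape (t, D)
    attn = np.exp(K @ q / np.sqrt(D))
    attn /= attn.sum()                           # softmax over past tokens
    return x + attn @ V                          # residual connection (toy)

def generate(prompt_vecs, n_new):
    kv_cache = [{"k": [], "v": []} for _ in range(N_BLOCKS)]
    x = None
    for x in prompt_vecs:                        # prefill phase
        for layer in range(N_BLOCKS):
            x = block_step(x, layer, kv_cache)
    outputs = []
    for _ in range(n_new):                       # decode phase: the
        for layer in range(N_BLOCKS):            # "inherently repetitive"
            x = block_step(x, layer, kv_cache)   # per-token block loop
        outputs.append(x)
    return outputs

tokens = generate([rng.standard_normal(D) for _ in range(3)], n_new=5)
print(len(tokens), tokens[0].shape)              # -> 5 (64,)
```

The nested loop in generate is the structure both optimization levels act on: operator-level work shrinks the cost of each block_step call, while system-level work reduces or overlaps how often the loop body runs.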

Original language: English
Article number: 103690
Journal: Journal of Systems Architecture
Volume: 172
State: Published - Mar 2026

Keywords

  • Application-specific integrated circuits (ASICs)
  • Decoder-only LLMs
  • Field-programmable gate arrays (FPGAs)
  • Graphics processing units (GPUs)
  • Large language models (LLMs)
  • LLM acceleration
  • Processing-in-memory (PIM)
  • Survey
