TY - JOUR
T1 - DL-Sort
T2 - A Hybrid Approach to Scalable Hardware-Accelerated Fully-Streaming Sorting
AU - Oh, Hyun Woo
AU - Park, Joungmin
AU - Lee, Seung Eun
N1 - Publisher Copyright:
© 2004-2012 IEEE.
PY - 2024/5/1
Y1 - 2024/5/1
N2 - Designing a high-performance hardware sorter for resource-constrained systems is challenging due to physical limitations and the need to balance streaming bandwidth with memory throughput. This brief introduces a novel, scalable hardware sorter architecture with fully-streaming support and an accompanying RTL generator to provide versatile, energy-efficient hardware acceleration. Our solution employs a dual-layer architecture consisting of a parallel one-way linear insertion sorter (OLIS) for bandwidth optimization and a cyclic bitonic merge network (CBMN) for a compact, high-throughput implementation. Furthermore, we developed an RTL generator written in Chisel to enable agile implementation of the scalable architecture. Experimental results targeting the Xilinx XVU37P-FSVH2892-2L-E FPGA show that, compared to a state-of-the-art streaming sorter, our design increases throughput by 126.26% and decreases latency by 68.46%, with an area increase of no more than 132.94% in LUTs and a reduction of 79.84% in flip-flops. The source code is available at https://github.com/hyun-woo-oh/DL-Sort-Generator.
AB - Designing a high-performance hardware sorter for resource-constrained systems is challenging due to physical limitations and the need to balance streaming bandwidth with memory throughput. This brief introduces a novel, scalable hardware sorter architecture with fully-streaming support and an accompanying RTL generator to provide versatile, energy-efficient hardware acceleration. Our solution employs a dual-layer architecture consisting of a parallel one-way linear insertion sorter (OLIS) for bandwidth optimization and a cyclic bitonic merge network (CBMN) for a compact, high-throughput implementation. Furthermore, we developed an RTL generator written in Chisel to enable agile implementation of the scalable architecture. Experimental results targeting the Xilinx XVU37P-FSVH2892-2L-E FPGA show that, compared to a state-of-the-art streaming sorter, our design increases throughput by 126.26% and decreases latency by 68.46%, with an area increase of no more than 132.94% in LUTs and a reduction of 79.84% in flip-flops. The source code is available at https://github.com/hyun-woo-oh/DL-Sort-Generator.
KW - bitonic sort
KW - energy-efficient computing
KW - hardware acceleration
KW - scalable architecture
KW - sorting network
UR - http://www.scopus.com/inward/record.url?scp=85187980268&partnerID=8YFLogxK
U2 - 10.1109/TCSII.2024.3377255
DO - 10.1109/TCSII.2024.3377255
M3 - Article
AN - SCOPUS:85187980268
SN - 1549-7747
VL - 71
SP - 2549
EP - 2553
JO - IEEE Transactions on Circuits and Systems II: Express Briefs
JF - IEEE Transactions on Circuits and Systems II: Express Briefs
IS - 5
ER -