TY - GEN
T1 - Towards HPC I/O Performance Prediction through Large-scale Log Analysis
AU - Kim, Sunggon
AU - Sim, Alex
AU - Wu, Kesheng
AU - Byna, Suren
AU - Son, Yongseok
AU - Eom, Hyeonsang
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/6/23
Y1 - 2020/6/23
N2 - Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units and are used by hundreds to thousands of users at the same time. Applications from these large numbers of users have diverse characteristics, such as varying compute, communication, memory, and I/O intensiveness. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these performance characteristics, the I/O performance is difficult to predict because the I/O system software is complex, the I/O system is shared among all users, and the I/O operations also heavily rely on networking systems. To improve the prediction of the I/O performance on HPC systems, we propose to integrate information from a number of different system logs and develop a regression-based approach that dynamically selects the most relevant features from the most recent log entries and automatically selects the best regression algorithm for the prediction task. Evaluation results show that our proposed scheme can predict the I/O performance with up to 84% prediction accuracy for I/O-intensive applications, using logs from the CORI supercomputer at NERSC.
AB - Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units and are used by hundreds to thousands of users at the same time. Applications from these large numbers of users have diverse characteristics, such as varying compute, communication, memory, and I/O intensiveness. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these performance characteristics, the I/O performance is difficult to predict because the I/O system software is complex, the I/O system is shared among all users, and the I/O operations also heavily rely on networking systems. To improve the prediction of the I/O performance on HPC systems, we propose to integrate information from a number of different system logs and develop a regression-based approach that dynamically selects the most relevant features from the most recent log entries and automatically selects the best regression algorithm for the prediction task. Evaluation results show that our proposed scheme can predict the I/O performance with up to 84% prediction accuracy for I/O-intensive applications, using logs from the CORI supercomputer at NERSC.
KW - I/O performance prediction
KW - distributed file system
KW - high performance computing
KW - log analysis
UR - https://www.scopus.com/pages/publications/85088395233
U2 - 10.1145/3369583.3392678
DO - 10.1145/3369583.3392678
M3 - Conference contribution
AN - SCOPUS:85088395233
T3 - HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
SP - 77
EP - 88
BT - HPDC 2020 - Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
T2 - 29th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2020
Y2 - 23 June 2020 through 26 June 2020
ER -