TY - GEN
T1 - HPC Workload Characterization Using Feature Selection and Clustering
AU - Bang, Jiwoo
AU - Kim, Chungyong
AU - Wu, Kesheng
AU - Sim, Alex
AU - Byna, Suren
AU - Kim, Sunggon
AU - Eom, Hyeonsang
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/6/23
Y1 - 2020/6/23
N2 - Large high-performance computing (HPC) systems are expensive tools responsible for supporting thousands of scientific applications. However, it is not easy to determine the set of configurations that lets workloads best utilize the storage and I/O systems. Users typically rely on the default configurations provided by system administrators, which often results in poor performance. In an effort to identify the application characteristics most important to I/O performance, we applied several machine learning techniques to characterize these applications. To identify the features that are most relevant to I/O performance, we evaluate a number of different feature selection methods, e.g., Mutual information regression and F regression, and develop a novel feature selection method based on Min-max mutual information. These feature selection methods allow us to sift through a large set of real-world workloads collected from NERSC's Cori supercomputer system and identify the most important features. We employ a number of different clustering algorithms, including KMeans, Gaussian Mixture Model (GMM) and Ward linkage, and measure the cluster quality with the Davies-Bouldin Index (DBI), Silhouette and a new Combined Score developed for this work. The cluster evaluation result shows that the test dataset is best divided into three clusters, where cluster 1 contains mostly small jobs with operations on standard I/O units, cluster 2 consists of medium-sized parallel jobs dominated by read operations, and cluster 3 includes large parallel jobs with heavy write operations. The cluster characteristics suggest that using the parallel I/O library MPI-IO and a large number of parallel cores is important for achieving high I/O throughput.
AB - Large high-performance computing (HPC) systems are expensive tools responsible for supporting thousands of scientific applications. However, it is not easy to determine the set of configurations that lets workloads best utilize the storage and I/O systems. Users typically rely on the default configurations provided by system administrators, which often results in poor performance. In an effort to identify the application characteristics most important to I/O performance, we applied several machine learning techniques to characterize these applications. To identify the features that are most relevant to I/O performance, we evaluate a number of different feature selection methods, e.g., Mutual information regression and F regression, and develop a novel feature selection method based on Min-max mutual information. These feature selection methods allow us to sift through a large set of real-world workloads collected from NERSC's Cori supercomputer system and identify the most important features. We employ a number of different clustering algorithms, including KMeans, Gaussian Mixture Model (GMM) and Ward linkage, and measure the cluster quality with the Davies-Bouldin Index (DBI), Silhouette and a new Combined Score developed for this work. The cluster evaluation result shows that the test dataset is best divided into three clusters, where cluster 1 contains mostly small jobs with operations on standard I/O units, cluster 2 consists of medium-sized parallel jobs dominated by read operations, and cluster 3 includes large parallel jobs with heavy write operations. The cluster characteristics suggest that using the parallel I/O library MPI-IO and a large number of parallel cores is important for achieving high I/O throughput.
KW - clustering
KW - feature selection
KW - high performance computing
KW - supercomputer
KW - workload characterization
UR - http://www.scopus.com/inward/record.url?scp=85089107500&partnerID=8YFLogxK
U2 - 10.1145/3391812.3396270
DO - 10.1145/3391812.3396270
M3 - Conference contribution
AN - SCOPUS:85089107500
T3 - SNTA 2020 - Proceedings of the 3rd International Workshop on Systems and Network Telemetry and Analytics
SP - 33
EP - 40
BT - SNTA 2020 - Proceedings of the 3rd International Workshop on Systems and Network Telemetry and Analytics
PB - Association for Computing Machinery, Inc
T2 - 3rd International Workshop on Systems and Network Telemetry and Analytics, SNTA 2020
Y2 - 23 June 2020
ER -