A zoom-in analysis of I/O logs to detect root causes of I/O performance bottlenecks

Teng Wang, Suren Byna, Glenn K. Lockwood, Shane Snyder, Philip Carns, Sunggon Kim, Nicholas J. Wright

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

15 Scopus citations

Abstract

Scientific applications frequently spend a large fraction of their execution time reading and writing data on parallel file systems. Identifying these I/O performance bottlenecks and attributing their root causes are critical steps toward devising optimization strategies. Several existing studies analyze the I/O logs of a set of benchmarks or applications that were run with controlled behaviors. However, there is still a lack of a general approach that systematically identifies I/O performance bottlenecks for applications running 'in the wild' on production systems. In this study, we have developed an analysis approach of 'zooming in' from platform-wide to application-wide to job-level I/O logs to identify I/O bottlenecks in arbitrary scientific applications. We analyze the logs collected on a Cray XC40 system in production over a two-month period. This study yields several insights that application developers can use in optimizing I/O behavior.

Original language: English
Title of host publication: Proceedings - 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 102-111
Number of pages: 10
ISBN (Electronic): 9781728109121
State: Published - May 2019
Event: 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019 - Larnaca, Cyprus
Duration: 14 May 2019 to 17 May 2019

Publication series

Name: Proceedings - 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019

Conference

Conference: 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019
Country/Territory: Cyprus
City: Larnaca
Period: 14/05/19 to 17/05/19

Keywords

  • Darshan
  • IO Analysis
  • IO Trace
  • Lustre Monitoring Tools
  • Slurm
