Large-Scale System Monitoring Experiences and Recommendations Workshop paper: HPCMASPA 2018
Type
Conference Paper
Authors
Ahlgren, Ville
Andersson, Stefan
Brandt, Jim
Cardo, Nicholas P.
Chunduri, Sudheer
Enos, Jeremy
Fields, Parks
Gentile, Ann
Gerber, Richard
Gienger, Michael
Greenseid, Joe
Greiner, Annette
Hadri, Bilel
He, Yun (Helen)
Hoppe, Dennis
Kaila, Urpo
Kelly, Kaki
Klein, Mark
Kristiansen, Alex
Leak, Steve
Mason, Mike
Pedretti, Kevin
Piccinali, Jean-Guillaume
Repik, Jason
Rogers, Jim
Salminen, Susanna
Showerman, Mike
Whitney, Cary
Williams, Jim
KAUST Department
Computational Scientists
Date
2018
Permanent link to this record
http://hdl.handle.net/10754/670699
Abstract
Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective sharable end-to-end monitoring capabilities.
Citation
Ahlgren, V., Andersson, S., Brandt, J., Cardo, N., Chunduri, S., Enos, J., … Williams, J. (2018). Large-Scale System Monitoring Experiences and Recommendations. 2018 IEEE International Conference on Cluster Computing (CLUSTER). doi:10.1109/cluster.2018.00069
Sponsors
This research was supported by and used resources of the Argonne Leadership Computing Facility, which is a U.S. Department of Energy Office of Science User Facility operated under contract DE-AC02-06CH11357. This document is approved for release under LA-UR-18-26485.
Publisher
IEEE
Conference/Event name
2018 IEEE International Conference on Cluster Computing, CLUSTER 2018
ISBN
9781538683194
Additional Links
https://ieeexplore.ieee.org/document/8514913/
DOI
10.1109/CLUSTER.2018.00069