A Fault-Tolerant HPC Scheduler Extension for Large and Operational Ensemble Data Assimilation: Application to the Red Sea
KAUST Department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Applied Mathematics and Computational Science Program
Physical Sciences and Engineering (PSE) Division
Earth Science and Engineering Program
KAUST Supercomputing Laboratory (KSL)
Permanent link to this record: http://hdl.handle.net/10754/627684
Abstract: A fully parallel ensemble data assimilation and forecasting system has been developed for the Red Sea, based on the MIT general circulation model (MITgcm) to simulate the Red Sea circulation and on the Data Assimilation Research Testbed (DART) ensemble assimilation software. An important limitation of operational ensemble assimilation systems is the risk of ensemble members’ collapse. This can happen when the filter update step imposes large corrections on one or more of the forecast ensemble members that are not fully consistent with the model physics. Increasing the ensemble size is expected to improve the performance of the assimilation system, but it also increases the risk of members’ collapse. Hardware failures or slow numerical convergence for some members should also occur more frequently. In this context, manually steering the whole process is a real challenge and makes the implementation of the ensemble assimilation procedure difficult and extremely time consuming.
This paper presents our efforts to build an efficient and fault-tolerant MITgcm-DART ensemble assimilation system capable of operationally running thousands of members. We describe the implementation of the assimilation system, which is built on top of Decimate, a scheduler extension developed to ease the submission, monitoring and dynamic steering of workflows of dependent jobs in a fault-tolerant environment, and discuss its coupling strategies in detail.
Within Decimate, only a few additional lines of Python are needed to define flexible convergence criteria and to apply any necessary actions to the forecast ensemble members, for instance (i) restarting a faulty job in case of job failure, (ii) changing the random seed in case of poor convergence or numerical instability, (iii) adjusting (reducing or increasing) the number of parallel forecasts on the fly, and (iv) replacing members on the fly to enrich the ensemble with new members.
We demonstrate the efficiency of the system with numerical experiments assimilating real satellite sea surface height and temperature observations in the Red Sea.
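As an illustration only, the per-member decision logic listed in the abstract might be sketched in plain Python as follows. This sketch does not use Decimate's actual API, which is not shown in this record; `MemberStatus`, `Action`, `choose_action`, `plan`, the `residual` measure and the thresholds are all hypothetical names and values introduced here. Action (iii), adjusting the number of parallel forecasts, is left out because it concerns the workflow as a whole rather than a single member.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"        # member is usable as is
    RESTART = "restart"  # (i) rerun the same job after a failure
    RESEED = "reseed"    # (ii) rerun with a different random seed
    REPLACE = "replace"  # (iv) swap in a new ensemble member


@dataclass
class MemberStatus:
    """Hypothetical summary of one forecast member after a cycle."""
    member_id: int
    exit_code: int   # 0 means the forecast job completed
    residual: float  # assumed convergence measure, smaller is better
    attempts: int    # number of times this member has been run


def choose_action(status, residual_tol=1e-6, max_attempts=3):
    """Map a member's status to one corrective action (assumed policy)."""
    if status.exit_code == 0 and status.residual <= residual_tol:
        return Action.KEEP
    if status.attempts >= max_attempts:
        # Repeated retries did not help: give up on this member.
        return Action.REPLACE
    if status.exit_code != 0:
        # The job itself failed (for example a hardware fault).
        return Action.RESTART
    # The job finished but did not converge well enough.
    return Action.RESEED


def plan(statuses, **kwargs):
    """Return a {member_id: Action} mapping for the whole ensemble."""
    return {s.member_id: choose_action(s, **kwargs) for s in statuses}
```

Calling `plan` on a list of `MemberStatus` objects yields one action per member, which a workflow layer could then translate into job resubmissions. The ordering of the checks (retry budget before failure type) is a design choice of this sketch, not something stated in the paper.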
Citation: Toye H, Kortas S, Zhan P, Hoteit I (2018) A Fault-Tolerant HPC Scheduler Extension for Large and Operational Ensemble Data Assimilation: Application to the Red Sea. Journal of Computational Science. Available: http://dx.doi.org/10.1016/j.jocs.2018.04.018.
Sponsors: The research reported in this manuscript was supported by King Abdullah University of Science and Technology (KAUST) and Saudi ARAMCO, and made use of the resources of the Supercomputing Core Laboratory of KAUST.
Journal: Journal of Computational Science