Data-centric parallel debugging technique for petascale computers
Thesis posted on 17.02.2017, 01:31 by Dinh, Minh Ngoc
Petascale computers and computing systems have the potential to solve large-scale, data-intensive problems in science and engineering. Petascale scientific applications, such as the Weather Research and Forecasting Model (WRF), involve enormous multi-dimensional data structures and operate with hundreds of thousands of concurrent processing threads. On the one hand, programming languages and environments have evolved significantly to help parallel application developers exploit the available computational power and memory. Co-array Fortran, Split-C, MPI and OpenMP are some successful examples. On the other hand, debugging tools for highly parallel software are still immature, especially in their techniques for controlling multiple processes and monitoring large-scale data structures at debug time. Typically, contemporary parallel debuggers allow users to control more than one processing thread while supporting the same examination and visualisation operations as sequential debuggers. This approach restricts the use of parallel debuggers for large-scale scientific applications that run across hundreds of thousands of compute cores. First, manually inspecting runtime data to detect errors becomes impractical because the data is too large. Second, performing expensive but useful debugging operations, such as distributed expression evaluation, becomes infeasible as the computational codes become more complex and involve larger data structures, and as the machines become larger. This thesis explores the idea of a data-centric debugging approach, which could be used to make parallel debuggers more powerful. It discusses the use of ad-hoc debug-time assertions that allow a user to reason about the state of a parallel computation. These assertions are modeled on programming language systems that support the verification and validation of program state as a whole, rather than focusing on the state of a single process.
The advantage of this approach is the capability to reason about massive data structures at runtime. Furthermore, on parallel machines, the debugger's performance can be improved by exploiting the underlying parallel platform. The available compute cores can execute parallel debugging functions while idling at a program breakpoint.
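As a rough illustration only, the idea of an assertion over the state of the whole program, evaluated in parallel where the data lives, might be sketched as follows. This is not the thesis's assertion syntax or implementation: the names `local_stats`, `merge` and `assert_global` are hypothetical, Python lists stand in for the per-process partitions of a distributed array, and a thread pool stands in for the compute cores idling at a breakpoint.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import math

# Hypothetical sketch: each "process" owns one chunk of a distributed array.
# The neutral summary that any chunk summary can be merged into.
IDENTITY = {"count": 0, "non_finite": 0, "min": math.inf, "max": -math.inf}

def local_stats(chunk):
    """Summarise one partition locally, so only a small record leaves the process."""
    finite = [x for x in chunk if math.isfinite(x)]
    return {
        "count": len(chunk),
        "non_finite": len(chunk) - len(finite),
        "min": min(finite) if finite else math.inf,
        "max": max(finite) if finite else -math.inf,
    }

def merge(a, b):
    """Combine two partial summaries (an associative reduction step)."""
    return {
        "count": a["count"] + b["count"],
        "non_finite": a["non_finite"] + b["non_finite"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }

def assert_global(chunks, predicate):
    """Evaluate a predicate on the global summary of all partitions.

    The per-chunk work runs concurrently (threads here, assumed to model
    idle compute cores); the predicate sees the program state as a whole.
    """
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(local_stats, chunks))
    total = reduce(merge, summaries, IDENTITY)
    return predicate(total), total

# Example: a temperature field split across three partitions, one with a NaN.
chunks = [[280.0, 285.5], [290.1, float("nan")], [275.3, 301.2]]
ok, total = assert_global(
    chunks,
    lambda s: s["non_finite"] == 0 and 200.0 <= s["min"] and s["max"] <= 350.0,
)
```

The user states one condition about the entire data structure instead of inspecting values process by process, and each partition is reduced to a small summary before anything is gathered. A real debugger would presumably perform the reduction with the platform's own collective operations rather than threads.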