Resilient Work Stealing


Introduction

Processors have seen an extreme reduction in scale over the decades, from 10 micrometers about 40 years ago to 11-12 nanometers in the coming years [1]. Due to this extreme reduction in scale, future generations of processors will exhibit a serious increase in performance variability: not only will the cores on the same processor vary in speed compared to each other, but each core will also vary in speed over time, for example due to heat flux and temperature variability [2]. On top of that, susceptibility to both transient and permanent failures will also increase, for example due to physical wear-out, or due to bit flips caused by alpha particles and cosmic rays.

Current generations of processors solve these variability and reliability issues at the hardware level, for example by way of modular redundancy, concurrent error detection, instruction retries, and so on [3]. However, such hardware mechanisms incur serious costs in terms of design complexity and energy consumption. We have therefore examined existing technologies for improving variability and reliability at the software level, and combined them into a promising new approach that deals with both classes of issues at the same time.

Variability

In the field of parallel programming, there already exist established solutions for dealing with the load imbalances caused by irregular parallel algorithms. In such algorithms, it is hard to determine upfront how much work needs to be performed in the different branches of the computation, so it is impossible to specify a static schedule that makes optimal use of the available processor cores. Work stealing schedulers can dynamically schedule parallel computations expressed in terms of sufficiently fine-grained tasks, such that each processor core contributes as much as possible to the ongoing computation, either by working on its own tasks, or by stealing tasks from other cores when it runs out of work, as sketched below.
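
Our experimental scheduler is not itself shown in this article. As an illustration of what a sufficiently fine-grained, irregular task decomposition looks like, the following minimal sketch uses Java's standard ForkJoinPool (itself a work stealing scheduler) to count the nodes of a tree whose branching, and therefore the work per subtree, is unknown upfront; the Node type is a hypothetical stand-in.

    import java.util.List;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // A minimal irregular tree: the branching factor varies per node,
    // so the work per subtree cannot be predicted upfront.
    record Node(List<Node> children) {}

    final class CountNodes extends RecursiveTask<Long> {
        private final Node node;

        CountNodes(Node node) { this.node = node; }

        @Override
        protected Long compute() {
            long count = 1;  // this node itself
            // Fork one fine-grained task per child; idle cores steal them.
            List<CountNodes> subtasks =
                node.children().stream().map(CountNodes::new).toList();
            subtasks.forEach(CountNodes::fork);
            for (CountNodes t : subtasks) count += t.join();
            return count;
        }
    }

    // Usage: long total = new ForkJoinPool().invoke(new CountNodes(root));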

Work Stealing

Traditionally, work stealing has been used successfully to compensate for performance variability in the algorithms themselves. However, our research shows that it can also be used to deal with variability arising from the hardware, and we have therefore used our own implementation of an experimental work stealing scheduler as a starting point for further investigations.
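
Our implementation is not reproduced here, but the core loop that such a scheduler runs on every core can be sketched as follows, assuming one double-ended queue (deque) per worker and random victim selection; the names Worker and steal are illustrative, not taken from our actual code.

    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.ThreadLocalRandom;

    // Each worker owns a deque: it pushes and pops tasks at one end,
    // while thieves take from the opposite end to keep contention low.
    final class Worker implements Runnable {
        final ConcurrentLinkedDeque<Runnable> deque = new ConcurrentLinkedDeque<>();
        private final List<Worker> all;   // all workers, used to pick victims

        Worker(List<Worker> all) { this.all = all; }

        public void run() {
            while (true) {
                Runnable task = deque.pollLast();   // prefer our own newest work
                if (task == null) task = steal();   // otherwise try to steal
                if (task != null) task.run();
            }
        }

        private Runnable steal() {
            // Pick a random victim and take the oldest task from its deque.
            Worker victim = all.get(ThreadLocalRandom.current().nextInt(all.size()));
            return victim == this ? null : victim.deque.pollFirst();
        }
    }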

Reliability

Distributed leasing is a concept originally developed for distributed file systems, and later adopted in general distributed computing approaches, such as Java RMI or JavaSpaces. With distributed leasing, whenever a computing node wants to allocate a remote resource, it needs to specify a lease time – a maximum duration for which it expects to need the requested resource. When the lease time expires, the resource is automatically freed, unless the lease is renewed in time. This mechanism ensures that resources never leak, for example when a node fails before it can explicitly free a resource.
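
As a minimal sketch of the mechanism, with hypothetical names like grant, renew, and reap, the owner of a resource records an expiry time per lease and periodically reclaims every resource whose holder failed to renew in time:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Consumer;

    final class LeaseTable<R> {
        private final Map<R, Long> expiry = new ConcurrentHashMap<>();

        void grant(R resource, long leaseMillis) {
            expiry.put(resource, System.currentTimeMillis() + leaseMillis);
        }

        void renew(R resource, long leaseMillis) {
            // Only extends leases that still exist; an expired and
            // already-reaped lease cannot be revived.
            expiry.replace(resource, System.currentTimeMillis() + leaseMillis);
        }

        void release(R resource) { expiry.remove(resource); }

        // Called periodically by the owner: reclaims every resource
        // whose lease has lapsed.
        void reap(Consumer<R> reclaim) {
            long now = System.currentTimeMillis();
            expiry.entrySet().removeIf(entry -> {
                if (entry.getValue() < now) {
                    reclaim.accept(entry.getKey());
                    return true;
                }
                return false;
            });
        }
    }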

Distributed Leasing

We have integrated a variation of this idea into our own work stealing scheduler. When a processor core steals work from some other core, it takes out a lease on the stolen tasks, which it must renew for as long as it is still working on them. An expired lease thus signals a potential failure, and our scheduler can then reschedule the stolen tasks to be executed on other processor cores.
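
Combining the two sketches above (again with illustrative names rather than our actual implementation), a thief leases each task it steals, and the victim periodically reschedules every task whose lease has lapsed:

    final class ResilientStealing {
        static Runnable stealWithLease(Worker victim,
                                       LeaseTable<Runnable> leases,
                                       long leaseMillis) {
            Runnable task = victim.deque.pollFirst();
            if (task != null) leases.grant(task, leaseMillis);  // thief must renew
            return task;
        }

        // Run periodically on the victim: an expired lease means the
        // thief presumably failed, so put the stolen task back on the
        // victim's own deque for rescheduling.
        static void recoverFailedSteals(Worker victim, LeaseTable<Runnable> leases) {
            leases.reap(victim.deque::addLast);
        }
    }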

Idempotence

When a processor core fails while executing a task, we cannot know in the general case how far the task has progressed. Therefore, our approach requires that algorithms be expressed in an idempotent style, such that a task produces the same result whether it is executed exactly once or n > 1 times. Idempotent computations are a known concept from RESTful web applications, where idempotence is required to ensure that repeated attempts to invoke a remote computation (for example, by pressing the same button in a web interface) produce the same result as a single request [4]. We have shown with a number of example benchmarks that expressing idempotent algorithms is both relatively straightforward and efficient.
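
Our benchmarks are not reproduced here, but the style itself is easy to sketch with a hypothetical computation (squaring): each task overwrites a result slot that it alone owns, instead of accumulating into shared state, so re-executing it after a suspected failure is harmless.

    import java.util.concurrent.atomic.AtomicLongArray;

    final class IdempotentStyle {
        // Idempotent: overwrites a slot that belongs to task i alone,
        // so running the task twice leaves the same final array as
        // running it once.
        static void taskIdempotent(AtomicLongArray results, int i) {
            results.set(i, (long) i * i);
        }

        // NOT idempotent: accumulating into shared state double-counts
        // whenever a task is re-executed after a suspected failure.
        static void taskNotIdempotent(AtomicLongArray total, int i) {
            total.addAndGet(0, (long) i * i);
        }
    }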

 

[1] http://en.wikipedia.org/wiki/45_nanometer

[2] Borkar, Designing Reliable Systems from Unreliable Components, IEEE Micro, November/December 2005.

[3] Rivers, Kudva, Reliability Challenges and System Performance at the Architecture Level, IEEE Design & Test of Computers, November/December 2009.

[4] Fielding, Architectural Styles and the Design of Network-based Software Architectures, University of California, Irvine, PhD thesis, 2000.
