This article is the first part of a series on the heterogeneous map-reduce approach:
Part 1 – [You are here] – Heterogeneous Map-Reduce: meet the Task System (Part 1/3)
Part 2 – Heterogeneous Map-Reduce: Seismic Imaging Application (Part 2/3)
Part 3 – Heterogeneous Map-Reduce: Scientific Visualisation (Part 3/3)

Has it ever happened to you: you have just finished a parallel implementation of a project and are happily enjoying the speedups, until the next project arrives, requires completely different hardware, and you have to start all over again with a new parallel framework? It has happened to me several times! And to Hans as well! Until he came up with a heterogeneous map-reduce approach that, once implemented, can easily be adjusted to different hardware architectures (CPUs, GPUs, Intel’s Knights Hill, you name it).

Heterogeneous map-reduce

The idea behind the heterogeneous map-reduce approach is a universal parallel framework. It assumes commodity hardware with several types of processors of different capabilities and performance (for example, a cluster where some computers have one GPU, others two GPUs or none), hence the name “heterogeneous”. The performance can also be affected by the number of users sharing a node.

The heterogeneous map-reduce approach is essentially a way to distribute the computations across the available hardware to achieve good load balancing. After all, you would like to make full use of the available compute resources.

The “map-reduce” component comes from the setup of the task system, which runs computations in parallel.

There is no need to know beforehand exactly how the work should be distributed.

Task system

The task system makes it possible to split the work amongst compute nodes and to monitor the execution. By a compute node we mean a multi-core CPU that may be connected to one or more GPUs, where a GPU can be used either as a replacement for the CPU or as an accelerator.

Read more about using a GPU as a replacement for the CPU or as an accelerator in this post:
2 Ways of Using a GPU to Speedup Your Computations


The common approach for parallelization across multiple CPU nodes in a cluster is the so-called server-to-client approach: the server knows the number of nodes and the amount of work involved, and distributes the workload equally among the nodes. This approach is efficient on homogeneous clusters, where all CPU nodes have the same characteristics.

Since we are targeting heterogeneous clusters, we propose a client-to-server approach in which clients request tasks from the server.

The philosophy behind a task system is that a compute node asks for a task, processes it, and asks for another task as soon as the previous one has finished. A compute node is therefore “busy” 100% of the time, regardless of its performance. The work is spread dynamically according to the speed of the compute nodes, and load balancing is achieved automatically. Moreover, the task system gives the flexibility to launch new child processes if one or more compute nodes crash or hang.
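As a minimal sketch of this idea (the `server` object and its `request_task`/`submit_result` methods are hypothetical placeholders for whatever transport you use, such as sockets or a shared database; they are not from a specific library), a client could look like this:

```python
import time

def process(task):
    """Placeholder for the real per-task computation."""
    time.sleep(0.1)                    # simulate some work
    return task["id"] * 2              # hypothetical result

def client_loop(server):
    """Keep asking the server for work until nothing is left."""
    while True:
        task = server.request_task()              # ask for the next task
        if task is None:                          # no task left: this node is done
            break
        result = process(task)                    # do the actual computation
        server.submit_result(task["id"], result)  # report back, then ask again
```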

Each task starts in an “Available” queue. When a client (one compute node) requests a task, that task is moved from the “Available” queue to the “Running” queue. When the task is done, it is removed from the “Running” queue.
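A minimal single-process sketch of this bookkeeping, matching the client loop above (the class and method names are my own illustration, not the authors' implementation), could look like this:

```python
import time

class TaskServer:
    """Toy bookkeeping for the task states described above."""

    def __init__(self, tasks):
        self.available = list(tasks)   # the "Available" queue
        self.running = {}              # task id -> (task, starting time)
        self.done = {}                 # task id -> result

    def request_task(self):
        """Hand out the next available task, or None if nothing is left."""
        if not self.available:
            return None
        task = self.available.pop(0)
        self.running[task["id"]] = (task, time.time())
        return task

    def submit_result(self, task_id, result):
        """Accept a result; the first result for a task wins, duplicates are dropped."""
        if task_id in self.done:
            return                     # a faster node already delivered this result
        self.done[task_id] = result
        self.running.pop(task_id, None)
```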

Failure-safe scheduling

It can happen that a node crashes due to a hardware failure. In that case, the tasks in the “Running” queue are ranked by their starting time and, eventually, the task with the earliest starting time is assigned to a new compute node. It may then happen that two compute nodes work on the same task; in that case, the first compute node that manages to compute the output “wins”.

Tasks move from the “Available” queue to the “Running” queue and, once they have produced a result, to the “Done” queue. Towards the end of the computations, the “Available” queue becomes empty.

When the “Available” queue is empty, the tasks in the “Running” queue are ranked by their starting time, and the task with the earliest starting time is assigned to the next compute node that asks for work. Now two compute nodes are running the same task: whichever finishes first delivers the result, and the (identical) result from the other compute node is discarded.
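Continuing the toy `TaskServer` sketch from above (again an illustration, not the authors' actual code), this re-assignment rule could be expressed as a replacement for `request_task`; duplicate results are already discarded by `submit_result`, which keeps only the first one:

```python
# Replacement for request_task() in the toy TaskServer sketch above.
def request_task(self):
    """Hand out new work; once "Available" is empty, re-issue the oldest running task."""
    if self.available:
        task = self.available.pop(0)
        self.running[task["id"]] = (task, time.time())
        return task
    if self.running:
        # rank the running tasks by starting time and re-issue the earliest one
        oldest_id = min(self.running, key=lambda tid: self.running[tid][1])
        return self.running[oldest_id][0]
    return None                        # everything is in "Done": no work left
```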

Why not MPI?

We have chosen the client-to-server approach over MPI for two reasons:

  1. If one MPI child process crashes, the master program crashes as well. We want to avoid that.
  2. It is currently not possible to add compute nodes dynamically to the MPI server once the child processes have been launched. We want the flexibility to add compute nodes in case one of them crashes or hangs.

You can still use MPI to spawn the child processes.
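For instance, with mpi4py's dynamic process management (a hedged sketch: support depends on your MPI installation, and `client.py` is a hypothetical script containing the client loop shown earlier):

```python
import sys
from mpi4py import MPI

# Launch four client processes; they still request their tasks from the
# task server over whatever channel you chose. MPI is only used for spawning.
comm = MPI.COMM_SELF.Spawn(sys.executable, args=["client.py"], maxprocs=4)
```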

Levels of parallelism

Summarising, a task is the amount of work for one compute node in a heterogeneous cluster; it sits at the highest level of parallelism. The next step is to define further levels of parallelism to split this work across multiple threads within the compute node and also across one or more GPUs. The levels of parallelism depend on the problem and on the chosen way to solve it. Interestingly, such levels of parallelism were already described almost two hundred years ago by Charles Babbage:
  1. Job level (the highest)
  2. Program level
  3. Loop level
  4. Instruction level (the lowest)

Depending on your algorithm, you could use GPU(s) at the program or the loop level.
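As an illustration of the loop level (a sketch with Python's standard library, not the authors' code), the placeholder `process` function from the client sketch could spread the inner loop of one task over the cores of the compute node:

```python
from concurrent.futures import ProcessPoolExecutor

def process_element(x):
    """Placeholder for the per-element work inside one task (loop level)."""
    return x * x

def process(task):
    """One task (job level): spread its inner loop over the cores of the node."""
    with ProcessPoolExecutor() as pool:            # loop-level parallelism
        return list(pool.map(process_element, task["data"]))
```

A GPU version would replace this inner loop with a kernel launch at the same level.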

What is next?

In this post we have defined the heterogeneous map-reduce approach and the task system, which allow you to spread work dynamically across your computing system and achieve automatic load balancing.

The next step is to demonstrate it on applications from different fields: one example from seismic imaging and another from scientific visualisation. We will cover these in our next posts.

How many levels of parallelism do you use in your application?

