DIY Data Parallelism

A Little White Lie

To this point, we have been granting ourselves a bit of pedagogic license when discussing agenda parallelism and master/worker codes. We have chosen examples for which it was (relatively) simple to describe a task, by working, for instance, with pure functions of the given inputs.

Real codes are seldom this simple, especially when, as is so often the case, one is working with an existing sequential code rather than starting de novo. What do we do when the function isn't pure (say, some results are stashed away via side effects), or the inputs aren't simple data types but references to complex pointer nests? While conceptually we may readily identify a task and intuitively recognize an opportunity to deploy a master/worker solution (that is, know when it is appropriate to kick off tasks), pragmatically it may be very difficult to create a “task closure,” if you will.

Can we generalize master/worker a bit to apply agenda parallelism in these cases?

A Little History

Early on, parallel computing developed a reputation for being especially arcane and demanding — the stuff of hero programmers. The push was on to create parallel programming systems for mere mortals. The grail: absolve programmers of any need to think about parallelism. Two popular lines of attack: (i) make compilers smart enough to extract parallelism from “normal” code, (ii) develop a language for which “normal” code was naturally parallel. Note: while related, these are different.

Both approaches resulted in systems of some limited utility, but failed to achieve the full goal.

Back to the drawing board. How about giving the compiler just a little help? One approach focused on “data parallelism” — closely related to what we've termed “result parallelism”.

Suppose we are writing a code that requires the manipulation of a large array. We write a sequential code in the normal way, but add some extra information that describes how the array should be broken up into parts to be mapped onto independent processors. This information, while fairly modest, nonetheless: (i) indicates parallelism is desired, (ii) establishes the granularity, and (iii) determines how the computational load is to be divided. Underlying all of this is the owner-computes concept: the owner of a part of the array has responsibility for the computations that involve the values of that portion of the array. The compiler and/or runtime takes care of all the coordination activities implicit in computing over the partitioned array. It is important to note that (i) tasks (in the master/worker sense) are implicitly defined, and that (ii) this is done via a granularity coarsening that results in a parallel structure somewhere between result and agenda parallelism.
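
As a rough illustration of owner computes, here is a hand-written sketch in C of the sort of partitioned update a data-parallel compiler and runtime would arrange from a decomposition annotation. The environment variables RANK and SIZE and the block decomposition are assumptions made for the example, and the coordination a real system would supply is omitted.

    /* Hand-written illustration of owner computes on a regular array,
     * roughly what a data-parallel compiler/runtime would arrange from
     * a decomposition annotation.  RANK, SIZE, and the block
     * decomposition are illustrative assumptions; any communication of
     * boundary values is omitted.                                      */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    static double a[N];

    int main(void)
    {
        const char *r = getenv("RANK");
        const char *s = getenv("SIZE");
        int rank = r ? atoi(r) : 0;
        int size = s ? atoi(s) : 1;

        /* Block decomposition: this process owns elements [lo, hi). */
        int chunk = (N + size - 1) / size;
        int lo = rank * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;

        /* Owner computes: update only the elements this process owns. */
        for (int i = lo; i < hi; i++)
            a[i] = 2.0 * i;

        printf("process %d computed elements %d..%d\n", rank, lo, hi - 1);
        return 0;
    }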

This approach enjoyed some success (e.g., it was an important part of the software associated with Thinking Machines' products), and it lives on in various forms today (e.g., certain constructs in Fortran 90, extensions to MATLAB). But it has a number of drawbacks, chief among them the requirement, more or less by construction, for nice, large, regular data structures.

Suppose, instead, that we are working with a code whose data structures are large, intricate pointer nests (of the sort common to many if not most non-trivial C and C++ programs), or one whose important control flow is not nested for loops over the rows and columns of arrays but recursive function calls and pointer-link walking. In such cases it is hard to describe a decomposition of the data structures, or to recognize operations on them that could profitably be done in parallel, and we would not normally think of such a code as a promising candidate for classic data parallelism.

DIY Data Parallelism

DIY Data Parallelism mixes master/worker task creation with owner-compute style task definition to create a powerful, if somewhat subtle, paradigm for tackling parallelization of “ugly” sequential codes.

Imagine executing N copies of program P on N processors. They all run through (i) initialization, (ii) the computational heavy lifting, and (iii) finalization, and then exit. Not very exciting, not to mention rather wasteful of compute resources.

Now imagine that once the N instances get to (ii), they virtually partition the execution (rather than a data structure per se) via an ownership check. The ownership check grants one and only one instance the right to carry out a subpart of the overall computation. Those instances that are not granted permission skip this subpart and move on to request permission for the next one. Each instance makes a record of the subparts it “owns”. At the end of (ii), the instances, in the most general case, exchange information about their subparts, merging the partial results, and the overall flow continues.

[Trajectory Picture]
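
A minimal sketch of this skeleton in C follows, under a couple of illustrative assumptions: each copy learns its identity from environment variables (COPY_RANK and COPY_COUNT, names invented for the example), and the ownership check is a static round-robin over a running count of checks. A fuller implementation would typically draw tickets from a shared counter in a coordination system and actually exchange and merge the partial results.

    /* Minimal sketch of the DIY DP skeleton.  COPY_RANK, COPY_COUNT,
     * and the static round-robin ownership check are illustrative
     * assumptions; a fuller version would use a shared ticket counter
     * and merge partial results at the end of phase (ii).              */
    #include <stdio.h>
    #include <stdlib.h>

    static int my_rank = 0, n_copies = 1, check_count = 0;

    /* One and only one copy "wins" each ownership check. */
    static int i_own_it(void)
    {
        return (check_count++ % n_copies) == my_rank;
    }

    int main(void)
    {
        const char *r = getenv("COPY_RANK");
        const char *c = getenv("COPY_COUNT");
        if (r) my_rank  = atoi(r);
        if (c) n_copies = atoi(c);

        /* (i) identical initialization in every copy ...               */

        /* (ii) the heavy lifting, virtually partitioned by ownership:  */
        for (int k = 0; k < 16; k++) {
            if (i_own_it()) {
                printf("copy %d: doing subpart %d\n", my_rank, k);
                /* ... the original sequential work for this subpart ... */
            }
            /* copies that do not own subpart k simply skip it          */
        }

        /* ... exchange and merge partial results across copies here ... */

        /* (iii) finalization, identical in every copy                  */
        return 0;
    }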

What does this paradigm buy us? First, because every instance executes (i), each is in principle capable of carrying out any of the computation subparts, so task definition is no longer an issue: it is simply a matter of counting ownership checks. Second, we can place the ownership check anywhere in the control flow: in a loop, in a recursively called function, wherever is appropriate to reflect the way the computation subparts occur in the original code. Third, the merge brings the states of all instances back into agreement (as they were just prior to the execution of the first subpart), so each instance will make the same series of ensuing decisions, setting up the possibility of iterating over the parallel execution phase just completed or moving on to a new one.
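
To illustrate the second point, the sketch below drops the same ownership check into a recursive walk over a toy tree (the tree and the leaf work are invented for the example); because every copy takes the same path to every check, each leaf is handed to exactly one copy.

    /* Sketch: the same ownership check dropped into a recursive walk
     * over a linked structure.  The toy tree and "leaf work" are
     * illustrative; the point is only that the check can sit anywhere
     * in the control flow, since every copy reaches the checks in the
     * same order.                                                      */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { struct node *left, *right; int payload; };

    static int my_rank = 0, n_copies = 1, check_count = 0;

    static int i_own_it(void)             /* as in the previous sketch */
    {
        return (check_count++ % n_copies) == my_rank;
    }

    static void walk(struct node *n)
    {
        if (!n)
            return;
        if (!n->left && !n->right) {
            if (i_own_it())               /* exactly one copy takes it */
                printf("copy %d processes leaf %d\n", my_rank, n->payload);
            return;
        }
        walk(n->left);
        walk(n->right);
    }

    int main(void)
    {
        const char *r = getenv("COPY_RANK");
        const char *c = getenv("COPY_COUNT");
        if (r) my_rank  = atoi(r);
        if (c) n_copies = atoi(c);

        /* A small fixed tree with four leaves, just to exercise walk(). */
        struct node leaves[4] = {
            { NULL, NULL, 0 }, { NULL, NULL, 1 },
            { NULL, NULL, 2 }, { NULL, NULL, 3 } };
        struct node inner[2] = {
            { &leaves[0], &leaves[1], -1 }, { &leaves[2], &leaves[3], -1 } };
        struct node root = { &inner[0], &inner[1], -1 };

        walk(&root);
        return 0;
    }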

Yes, there are lots of details, limitations, and caveats, as well as latent flexibility and opportunities for optimization. We will address many of these next time. The important point for the moment is that this actually works and works well in many practical cases!

Practical Considerations

  1. IO

    Input: Generally not a problem if running in an NFS-like setting. If not, we may have to clone the relevant files as part of setup. If only a small amount of data is involved, we could use the coordination system for this; otherwise, bundle it with remote process execution.

    Output: If the code writes to standard output, or the output file is parameterized in some way, just use /dev/null for the workers. If not, you will need to fiddle with the code a bit; generally not that big a challenge. Sometimes a post-processing filter-and-merge is simple and works well.

  2. Granularity Adjustment: Rather than a single ticket, grab a fistful. If the parallelism is in an iterative context, cache the tickets for use in successive iterations, refreshing them if necessary when the load is out of balance (timing info can be piggy-backed on the merge info).
  3. Affinity: If a process has done a subtask once, and there is further downstream processing involving that subtask, it may make sense for it to always do that subtask. Since the ownership check is under our control, we can be very flexible.
  4. Multi-pass phasing: In some situations, merging can be simplified by making multiple passes over the code block in question.
  5. Look ma, no coordination! The general approach is applicable in a wide variety of settings — including ones in which we may not even want to bother with “real coordination”.

    An example: Dealing with codes that make extensive use of random numbers can be quite tricky. How do we incrementally refine the code and still effectively determine that the answers are correct (or at least consistent)?

    dnaml is such a code: it is used to infer evolutionary relationships of a collection of genetic material via simulation.

    Goals:

    In practice, we are given a small test case. We want to build a parallel execution harness for running much bigger problems, but want byte-by-byte compatibility for validating the test case.

    Solution: apply DIY DP by running multiple copies. Each copy, as usual for DIY DP, does exactly the same setup (in general, and for each subtask). The subtasks are done independently. The ownership check reduces to interrogating an environment variable (which implies an a priori, static decomposition). The outputs from the runs are then merged (each contains output blocks for every subtask, all but one of which is, essentially, empty).

    In all, 18 LOC were changed. The majority were needed to fix (or reproduce) bugs, to ensure that the output from the parallel run (after the post-processing filter and merge) would be identical to the output from the sequential run. A sketch of such a filter-and-merge follows.
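
    As promised above, here is a minimal sketch of such a post-processing filter-and-merge in C. The per-copy output format it assumes is hypothetical (it is not dnaml's actual format): each copy writes, in subtask order, one block per subtask, introduced by a marker line of the form “# subtask K”, with the blocks a copy did not own left empty. It also assumes, for illustration, a round-robin decomposition, so that subtask k was owned by copy k mod N and the per-copy files can be walked in lock-step.

      /* Sketch of a post-processing filter-and-merge.  The per-copy
       * output format assumed here is hypothetical: each copy writes,
       * in subtask order, one block per subtask, each introduced by a
       * marker line "# subtask K"; blocks the copy did not own are
       * empty.  Assuming a round-robin decomposition, subtask k was
       * owned by copy k % N, so the files can be merged in lock-step.
       *
       * Usage: ./merge out.0 out.1 ... > merged_output                 */
      #include <stdio.h>
      #include <string.h>

      #define MAXLINE   4096
      #define MAXCOPIES 64

      static FILE *in[MAXCOPIES];
      static char pending[MAXCOPIES][MAXLINE]; /* look-ahead: next marker */
      static int  have_pending[MAXCOPIES];

      static int is_marker(const char *s)
      {
          return strncmp(s, "# subtask ", 10) == 0;
      }

      /* Consume one block from file i, emitting its content (but not its
       * marker) when emit is non-zero.  Returns 0 once the file is done. */
      static int next_block(int i, int emit)
      {
          char line[MAXLINE];

          if (!have_pending[i]) {                 /* read the marker line */
              if (!fgets(pending[i], MAXLINE, in[i]))
                  return 0;
              have_pending[i] = 1;
          }
          have_pending[i] = 0;
          while (fgets(line, MAXLINE, in[i])) {
              if (is_marker(line)) {              /* start of next block  */
                  strcpy(pending[i], line);
                  have_pending[i] = 1;
                  break;
              }
              if (emit)
                  fputs(line, stdout);
          }
          return 1;
      }

      int main(int argc, char **argv)
      {
          int n = argc - 1;

          if (n < 1 || n > MAXCOPIES) {
              fprintf(stderr, "usage: merge out.0 out.1 ...\n");
              return 1;
          }
          for (int i = 0; i < n; i++)
              if (!(in[i] = fopen(argv[i + 1], "r"))) {
                  perror(argv[i + 1]);
                  return 1;
              }

          /* Walk all files in lock-step, keeping block k from copy k % n. */
          for (int k = 0, live = 1; live; k++) {
              live = 0;
              for (int i = 0; i < n; i++)
                  live |= next_block(i, i == k % n);
          }
          return 0;
      }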