Programming Parallel Computers: Part 6A

May 17, 2020 20:32 · 1252 words · 6 minute read

Welcome to the last week of this course! Today we will take a somewhat more abstract view of parallel programming. Instead of talking about doing parallel computing with specific hardware and specific programming tools, we will talk about general techniques for designing parallel algorithms. Let’s start with the very basics. There are three concepts that are important to keep in mind and not to mix up. First, we have computational problems, which are specifications of what we would like to do. This is something like “sorting”. Second, there are algorithms, which are abstract descriptions of how we can solve the computational problem.

00:54 - This is something like the “quicksort algorithm”. We are working on the level of, for example, pseudocode. And finally, there are concrete implementations of algorithms. Now we are talking about something like real working C++ code. And when we look at parallel computing, we have similar concepts: parallel algorithms, and efficient parallel implementations of those algorithms. When we talk about efficient implementations, we need to worry about the technical details that we have been discussing throughout this course: memory hierarchies, registers, instruction-level parallelism, vector operations, OpenMP, CUDA, and so on. But when we talk about parallel algorithms, what we really care about is the notion of independence. The key goal is to design algorithms in which there are always independent operations that can be computed simultaneously in parallel.

02:03 - This is absolutely necessary; otherwise there is no hope of solving the problem efficiently with the help of parallel computers, no matter which specific technique you use. And it is important to keep in mind that the same computational problem can be solved with many different algorithms. Some algorithms are inherently sequential. So if you pick a sequential algorithm and ask how to develop a parallel implementation of it, you may hit a dead end. What you really need first is a parallel algorithm with enough independent operations.

02:45 - And only after that can you start to think about how to implement it efficiently in practice. When you read the literature on parallel algorithms, you will often encounter these three terms. First, the literature often refers to “processors”. Don’t think that one “processor” means something like the CPU. No. Here a “processor” is just some abstract entity that can perform operations. If you think of it as one thread, you get roughly the right picture. Second, when we analyze parallel algorithms, we usually talk about “work” and “depth”. “Work” is basically just the total number of operations. You usually want work-efficient parallel algorithms: algorithms that do the smallest possible amount of work in total. But sometimes doing a bit more work makes it possible to parallelize better, so there might be a tradeoff here.

03:53 - However, we also want algorithms with a small depth. The depth refers to the length of the longest sequential dependency chain. Basically, this tells you how much time you will need, no matter how many parallel processors you have got. So ideally, we would have parallel algorithms that don’t do any more work than the best sequential algorithms, and that have as small a depth as possible. If the depth is much smaller than the amount of work, then we can benefit from a large number of processors.
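If you want to make this intuition precise, a standard bound from the parallel algorithms literature (often attributed to Brent) is useful. If an algorithm does $W$ operations of work in total and has depth $D$, then a greedy schedule on $p$ processors finishes in time

$$T_p \;\le\; \frac{W}{p} + D.$$

That is, the running time is at most the work divided evenly among the processors, plus the depth; with enough processors, the depth is what remains as the bottleneck.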

04:36 - Let’s look at a concrete example that helps to illustrate these concepts. Let’s have a look at perhaps the simplest possible computational problem: calculating the sum of n numbers. There is of course a trivial sequential solution: just accumulate the sum one number at a time. Here the amount of work is linear, but so is the depth. Every addition depends on the previous addition. No room for parallelism.
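In C++, this sequential algorithm is nothing more than a simple loop; here is a minimal sketch (the function name and types are merely illustrative):

```cpp
// Sequential sum: n-1 additions, and each addition depends on
// the previous one, so both the work and the depth are linear.
double sum_sequential(const double* x, long n) {
    double s = x[0];              // assumes n >= 1
    for (long i = 1; i < n; ++i) {
        s += x[i];                // depends on the previous iteration
    }
    return s;
}
```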

05:04 - But we can also design a parallel algorithm that is work-efficient and yet has only a small depth. Here is a simple example. We can calculate the sum with a recursive algorithm, in which we take the n numbers, split them into two halves, and then recursively calculate the sums of both halves simultaneously in parallel. And we keep splitting until we have at most 2 numbers left. So how much room for parallelism do we have here, and what’s the depth? It’s often very helpful to visualize a parallel algorithm as a circuit in which data flows from top to bottom, nodes represent arithmetic operations, and edges represent the flow of information. Here is the naive sequential algorithm in which we accumulate the sum one by one.

05:59 - But here are two versions of the recursive parallel algorithm, depending on exactly how we split into two halves when we recurse. The key point is that operations drawn on the same level are independent and can be computed in parallel, so the number of levels corresponds to the minimum number of time steps we need, even if we had lots and lots of parallel threads available. So we can read from this illustration both how much work we do and what the depth of the algorithm is. Here both the sequential and the parallel algorithm do the same amount of work: there are exactly n - 1 addition operations. But the depth of the parallel algorithm is much smaller: it is only logarithmic, while the depth of the sequential algorithm is linear.
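To make the recursive algorithm concrete, here is one possible sketch in C++ with OpenMP tasks; for example, with n = 1024 the longest dependency chain has length about 10 instead of 1023. As emphasized below, this is purely an illustration of the algorithm, not an efficient implementation:

```cpp
#include <omp.h>

// Recursive sum: split into two halves and sum both halves in
// parallel. The depth is logarithmic in n. Assumes n >= 1.
double sum_recursive(const double* x, long n) {
    if (n == 1) return x[0];
    if (n == 2) return x[0] + x[1];
    long half = n / 2;
    double a, b;
    #pragma omp task shared(a)     // left half in parallel...
    a = sum_recursive(x, half);
    #pragma omp task shared(b)     // ...with the right half
    b = sum_recursive(x + half, n - half);
    #pragma omp taskwait           // wait for both halves
    return a + b;
}

// Tasks need an enclosing parallel region: one thread starts
// the recursion, and the others pick up the generated tasks.
double sum_parallel(const double* x, long n) {
    double result = 0.0;
    #pragma omp parallel
    #pragma omp single
    result = sum_recursive(x, n);
    return result;
}
```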

06:55 - So we now have an efficient parallel algorithm for computing the sum. Now I’d like to emphasize that you don’t want to jump directly to implementing exactly this algorithm. A direct implementation of this recursive strategy makes absolutely no sense in practice. You should primarily see it as a source of inspiration. It shows that there is tons of room for parallelism in this task if you use the right algorithm.

07:24 - And it suggests that at least one way to solve it is to split the work into smaller parts, let each thread calculate the sum of its own part, and finally combine the partial sums. But of course if we are calculating the sum of 1 billion numbers, we won’t split it into 1 billion parts; we don’t have that many parallel computing units available anyway. Tons of common sense is still needed when you take the idea of a parallel algorithm and try to turn it into efficient C++ code. Oh, and here we were talking about additions, but please note that nothing in this algorithm really makes use of the fact that the arithmetic operation is addition; all we need is that the operation is associative. We could equally well calculate products, or minimums, or maximums by using exactly the same idea.
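In practice, this split-and-combine strategy is exactly what, for example, OpenMP’s reduction clause does for us; here is a minimal sketch (again, the names are just for illustration):

```cpp
// Practical parallel sum: each thread accumulates a partial sum
// over its own chunk of the loop, and the partial sums are
// combined at the end.
double sum_omp(const double* x, long n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (long i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}
```

Replacing `reduction(+ : s)` with `reduction(min : s)` or `reduction(max : s)`, together with a suitable initial value for `s`, gives a parallel minimum or maximum in exactly the same way.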

08:19 - So the calculation of a sum can be parallelized, and this probably doesn’t come as a surprise at this point. But what about other, more interesting computational problems? In general, we don’t know. There are many problems for which efficient parallel algorithms exist, and also some problems for which no efficient parallel algorithm is known. These are big open questions in theoretical computer science. Most likely there are some problems for which efficient parallel algorithms don’t exist.

08:52 - But in the next two parts we will bring plenty of good news. We will see some simple ideas with which we can design efficient parallel algorithms for many problems that at first sight might look inherently sequential! See you soon!