Programming Parallel Computers: Part 6B

May 17, 2020 20:32 · 1496 words · 8 minute read prefix sum calculation know also

Let’s now look at one very useful and important parallel algorithm: computation of prefix sums in parallel. Let me first tell what is the prefix sum, I will later give an example of what you can do with it… Prefix sum is simply this task. You have got a list of n numbers. And you want to compute the sums of all prefixes of the list. Your output is an array with n elements. The first output element is the first input. The second output element is the sum of the first two inputs.

00:39 - The third output element is the sum of the first three inputs. And so on. The last output element is the sum of all inputs. It doesn’t take long to come up with a sequential algorithm that solves this efficiently in linear time. But what may come a bit as a surprise is that even though the task seems somehow “sequential”, it can nevertheless be parallelized efficiently! Let’s look at the sequential solution first to make sure all of us are on the same page. Here is our input, and here are all the prefix sums we’d like to compute.

01:20 - So what we can do is to first calculate the sum of the first two elements. And once we know the sum, we can add the third element to it. And so on. We just accumulate a sum, and remember all intermediate values. And finally, all these intermediate values are exactly the prefix sums we wanted to know. Simple, but very sequential. We never had a chance of doing more than one operation in parallel. There is a long dependency chain.

01:51 - So we did a linear amount of work, but depth is also linear. How do we parallelize this? Let’s see. We have our input and output. We first compute all pairwise sums in parallel. So now we know the sums in these regions of length 2. Then we compute sums for pairs of pairs in parallel. And we know the sums for these fragments of length 4. And again combine the sums for fragments of length 4, so that we know the sums for these fragments of length 8. And continue this way until we eventually also know the sum for the entire input, all 16 values. OK, this is the same as what we can do if we want to know just the sum of all values. But how do we get all prefix sums? Let’s start combining things that we have computed earlier.

02:59 - In the same step, we can also calculate this sum. So we know also the sum of the first 12 values. And note that we already knew the sums of the first 4 and the first 8 values, so taking all of these into account, we now know the prefix sums for all prefixes whose length is a multiple of 4. Good, we are making progress, we know now some prefix sums. Next combine these values, first 4 plus next 2, first 8 plus next 2, first 12 plus next 2.

03:41 - So now we know also sums for the first 6, the first 10, the first 14. And of course we already knew the sum for the first 2. And now keeping in mind what we already know, we have prefix sums for the first 2, 4, 6, 8, etc. Put otherwise, we know the prefix sums for all prefixes whose length is a multiple of 2. More progress. Then let’s calculate these sums, first 2 plus next 1, first 4 plus next 1, etc. So now we know also prefix sums for the first 3, 5, 7, etc. And of course the sum for the first 1 is already known. So we got all odd prefix sums now. And keeping in mind that we already knew all even prefix sums, we have now got prefix sums for prefixes of all lengths, which is exactly what we wanted to know. And we are done. If you generalize this idea, you’ll see that the depth is now only logarithmic! So we got an exponential improvement in the depth. And we are still doing only a linear amount of work.

04:58 - So that’s the theory, if you had a huge number of parallel processing units, you could compute prefix sums very fast, in a massively parallel manner. But how should we compute prefix sums in practice, especially if we have got only a moderate number of threads? Here is one simple idea that is easy to implement and should give a nice performance in practice in many settings. Let’s say you’ve got p threads and an array of n input values, and you need to compute all prefix sums. Here in this example I’ve got 12 numbers and 3 threads. You first split your input in p parts, on part per thread.

05:39 - Then each thread calculates the sum inside its own part. And we store these numbers in some global array. So now everyone can see what is the total in each part. Next comes a sequential part, but if the number of threads is small, this will be very quick. We will simply accumulate prefix sums, using the naive sequential algorithm. So for instance this blue number tells that the total inside the second part is 26. While this orange number now tells that the total in the first two parts is 36. And now after this brief sequential computation, we can put all threads back to work, and they can figure out what are the prefix sums in their own parts. Each thread just starts with the previous orange number, which tells the prefix sum up to this point, and then you can just start to accumulate a sum in a sequential manner inside each thread. And we are done! Fast and easy. And if you got worried about this sequential step, just keep in mind that it is in itself a smaller prefix sum computation, so you can use the same idea recursively there, until you reach a point in which the sequential part is so cheap that it makes no sense to try to parallelize it any further.

07:04 - Good, so now you know how to compute prefix sums. What can you do with them? Let me just conclude with an application that at first seems to have absolutely nothing to do with prefix sums. There is no arithmetic here in this problem! We have got just an array of elements, some of them are “black”, some are “orange”. And the task is to simply store all orange elements in the output array in consecutive positions. Maybe orange things are small and black things are large, and you are doing partitioning in quicksort, and you would like to move the small elements to the beginning of the array.

07:47 - Or maybe orange things are rows in a database that match your query, and you’d like to construct the list of all matching rows. So how do we solve this? A trivial sequential algorithm just scans the input, sees what is orange, and puts it in the right place. But how do we parallelize this? The key challenge is that if some thread would like to copy this element in its right position, we would need to know what is the right position, and this depends on how many orange elements there are before it. So we have got a counting problem here. If we somehow knew how many orange elements are in this part, then one thread could start to put these elements in their final positions, without any further information or coordination. If you know that there are exactly 2 orange elements before x6, then you can check that x6 is orange so it goes to the third slot, x7 is also orange so it goes to the fourth slot, etc.

08:55 - So all that we need to know to get started is to know for each position how many orange elements are there before it. And then we know where each orange element will go, and we could move all of them in the right places simultaneously in parallel. So we would like to know all such counts. And if you start to think about it, this is just a special case of the prefix sum calculation! If you think that orange is 1, and black is 0, then prefix sums will directly tell how many orange elements there are up to each position! So even if you are solving something that has nothing to do with sums, where for example the key challenge is coordinating data movement, you can still benefit from efficient parallel prefix sum algorithms! That’s all for this part, in the next part we will present another simple but efficient idea that you can use in parallel algorithms! See you soon! .