Programming Parallel Computers: Part 1D

Apr 9, 2020 19:09 · 1487 words · 7 minute read

In the previous part, we eliminated one memory bottleneck from our program. Now, as we can see from these benchmarks, the program is equally fast regardless of the input size. Or, should we say, it’s equally slow. Even if the input is small enough so that everything fits in the cache memory, we are only achieving roughly 2 billion operations per second. And the reason for this relatively slow performance is that all of these minimum operations that we have in the innermost loop of the program happen in a strictly sequential order. Now why is this the case? Let’s imagine that we unroll the innermost loop, and look at the sequence of instructions that the CPU is going to perform.
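The exact code is not shown in this transcript, but based on the description, the kernel looks roughly like the following minimal C++ sketch. The names `step`, `r`, `d`, and `t` are mine, and `t` stands for the transposed copy of the input introduced in the previous part:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Rough sketch of the version from the previous part: d is the n-by-n
// input matrix, t is a transposed copy of it (so that both reads in the
// innermost loop scan memory linearly), and r receives the result.
void step(float* r, const float* d, int n) {
    constexpr float inf = std::numeric_limits<float>::infinity();
    std::vector<float> t(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            t[n * j + i] = d[n * i + j];
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float v = inf;
            for (int k = 0; k < n; ++k) {
                float x = d[n * i + k];  // read x
                float y = t[n * j + k];  // read y
                float z = x + y;         // one addition
                v = std::min(v, z);      // one minimum, depends on previous v
            }
            r[n * i + j] = v;
        }
    }
}
```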

00:58 - So what happens in the innermost loop? We read x, we read y, we do one addition, and we calculate one minimum. Then we read again, read again, do another addition, calculate another minimum, and so on. Now if you look at these operations, many of them are actually independent of each other. For instance, the reads of x and y could in principle be performed simultaneously in parallel. Also, all of these additions are independent of each other: the result of one addition is not needed as an operand for another addition. However, if you look at the minimum operations, we can see that every single minimum operation depends on the result of the previous minimum operation. You cannot even start to calculate the minimum of v and z if you don't know the value of v. So even if our program contained only these minimum operations and nothing else, the running time of one iteration of this innermost loop would be at least the latency of one minimum operation. And of course in this innermost loop we are doing all kinds of other things as well.
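In terms of the sketch above, unrolling four iterations of the innermost loop makes this dependency chain explicit:

```cpp
// Four consecutive iterations (k, k+1, k+2, k+3) of the innermost loop.
// The eight reads and four additions are mutually independent, but each
// minimum needs the v produced by the minimum right before it:
float z0 = d[n*i + k + 0] + t[n*j + k + 0];
float z1 = d[n*i + k + 1] + t[n*j + k + 1];
float z2 = d[n*i + k + 2] + t[n*j + k + 2];
float z3 = d[n*i + k + 3] + t[n*j + k + 3];
v = std::min(v, z0);  // needs v from the previous iteration
v = std::min(v, z1);  // needs the v computed on the line above
v = std::min(v, z2);  // and so on: a strictly sequential chain
v = std::min(v, z3);
```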

02:24 - So is this really the bottleneck? Let's benchmark, and let's check the specifications of this CPU. If you benchmark the code, you will see that one iteration of this innermost loop takes roughly 4 clock cycles. What about these minimum operations? Each of them compiles into a machine-language instruction called “vminss”. Let's now have a look at the manual and see what the latency of this instruction is on these CPUs. It turns out that it is exactly 4 clock cycles.

03:03 - So now we know that even if we were able to eliminate all memory reads and all of these additions from the innermost loop, the performance would not improve unless we can do something about all of these minimum operations that are inherently sequential in the current implementation. But in this innermost loop we must calculate the minimum of n values. So how can we avoid this kind of dependency chain? Thankfully, the minimum operation is associative and commutative, so we can freely rearrange these minimum operations and regroup them. For instance, if your task is to calculate the minimum of z0, z1, z2, and z3, you don't need to do this in sequential order. Instead, you can first calculate the minimum of z0 and z2.

04:02 - And then, simultaneously in parallel, you could calculate the minimum of z1 and z3. And if we arrange the calculations this way, all of a sudden we have at least created opportunities for parallelism. The program is no longer inherently sequential: there are operations that could, at least in principle, be executed simultaneously in parallel. So what does this mean in practice? If we need to calculate the minimum of n values, we can for instance accumulate two minimums. Here v0 is the minimum of the values at even positions, and v1 is the minimum of the values at odd positions.

04:56 - And finally, if we just calculate the minimum of v0 and v1, we get the minimum of all values. If we arrange the computations this way, then at least in principle it would be possible to execute two minimum operations simultaneously in parallel, because one of these minimum operations only reads and writes v0, while the other one only reads and writes v1, and these two operations are independent of each other. You can of course generalize this idea so that you are accumulating, for instance, three minimums and then merging the results. This way you can have three minimum operations executing simultaneously in parallel, and so on.
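As a minimal sketch of the two-accumulator idea (the array `z` and its length `n` are placeholders for the values whose minimum we need):

```cpp
// Two accumulators, assuming n is even: v0 tracks the minimum of the
// values at even positions, v1 the values at odd positions. The two
// std::min calls in the body touch different accumulators, so they
// are independent of each other.
float v0 = std::numeric_limits<float>::infinity();
float v1 = std::numeric_limits<float>::infinity();
for (int k = 0; k < n; k += 2) {
    v0 = std::min(v0, z[k + 0]);
    v1 = std::min(v1, z[k + 1]);
}
float v = std::min(v0, v1);  // merge the two partial results
```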

05:49 - So let's try to rewrite our program so that some of these minimum operations can be computed simultaneously in parallel. How do we do this in practice? To keep things simple, let's assume that n is a multiple of 4. In that case we can simply write code like this. Here, instead of accumulating one minimum v, we are accumulating 4 minimums, w[0], w[1], w[2], and w[3], in an interleaved manner. For instance, the operations related to w[0] are independent of all operations related to w[1].
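The transcript refers to code shown on screen; here is a sketch of what that 4-accumulator inner loop might look like, reusing the names from the earlier sketch:

```cpp
// Assuming n is a multiple of 4. The four accumulators w[0]..w[3] are
// updated in an interleaved manner; updates to different accumulators
// never depend on each other.
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        float w[4];
        for (int b = 0; b < 4; ++b)
            w[b] = std::numeric_limits<float>::infinity();
        for (int k = 0; k < n; k += 4) {
            for (int b = 0; b < 4; ++b) {
                float z = d[n*i + k + b] + t[n*j + k + b];
                w[b] = std::min(w[b], z);
            }
        }
        r[n*i + j] = std::min(std::min(w[0], w[1]),
                              std::min(w[2], w[3]));
    }
}
```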

06:31 - We now have a program where several minimum operations could, at least in principle, be executed simultaneously in parallel. Now comes the key question: how do we tell the CPU that it should actually execute all of these minimum operations in parallel whenever possible? Somewhat surprisingly, the answer is that we don't need to do anything more than this. But please don't believe me! Write the code, run benchmarks, and see what happens. It turns out that this very simple change results in a performance improvement by a factor of 3. What we are seeing here is one very important form of parallelism, namely instruction-level parallelism.

07:23 - This form of parallelism is, at least in principle, very simple for us to use: you just write code where there are opportunities for parallelism, and the CPU will try its best to parallelize it. A modern CPU looks further ahead in the instruction stream. Even if the next instruction is not yet ready for execution, it will look at the instruction after that, and after that, and after that, up to some distance, trying to find instructions that are ready to be executed at the moment. For instance, in our example, even while one of these minimum operations is still in execution, the CPU will look further ahead, find another minimum operation, and notice that in this new version of our program this minimum operation is independent of the previous one that is still in the pipeline. Therefore the CPU can push the new minimum operation into the pipeline without waiting for the previous one to finish. If your code is doing nothing but lots of minimum operations, and all of those operations are independent of each other, then in the best case, simply by exploiting instruction-level parallelism, you could achieve a throughput of 2 minimum operations finishing per clock cycle on each CPU core.
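(A back-of-the-envelope calculation, mine rather than the lecture's: by the usual latency × throughput rule, keeping a unit with a 4-cycle latency and a throughput of 2 instructions per cycle fully busy requires about 4 × 2 = 8 independent chains in flight, so the 4 accumulators above can be expected to reach at most half of that peak.)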

09:06 - With our program, we are not quite there yet. And we are only using one CPU core at the moment. But nevertheless, this very simple application of instruction-level parallelism already gave us a significant speedup in comparison with the naive, completely sequential solution. So instruction-level parallelism is something that the CPU is going to do for you automatically whenever possible. But you must make sure that the instruction stream of your program contains independent operations that are ready for execution without waiting for previous operations to finish.

09:50 - For instance, if you have a chain of multiplications where every single multiplication uses the result of the previous multiplication, this is going to be bad, because at each step you are going to wait for the latency of a multiplication instruction. Whereas if you have several multiplications, where none of the multiplications uses the result of another multiplication, then you can automatically benefit from instruction-level parallelism. And this holds not only for arithmetic operations, but also for memory accesses. For instance, if you have a chain of memory accesses where every single memory access depends on the result of the previous memory access (something like following a linked list), then at each step you will need to wait for the previous memory access to finish before you can even initiate the next one. Therefore you are paying the latency of a memory access at each step.
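A hypothetical illustration of the multiplication example (mine, not from the lecture): both functions compute the product of n values, but the second keeps four independent chains in flight.

```cpp
// A single dependency chain: every multiplication waits for the
// previous one, so each iteration pays the full multiply latency.
float slow_product(const float* a, int n) {
    float p = 1.0f;
    for (int k = 0; k < n; ++k)
        p *= a[k];
    return p;
}

// Four independent chains (assuming n is a multiple of 4): the chains
// never read each other's results, so the CPU can overlap them.
// (Note: reassociating floating-point operations can change rounding.)
float fast_product(const float* a, int n) {
    float p0 = 1.0f, p1 = 1.0f, p2 = 1.0f, p3 = 1.0f;
    for (int k = 0; k < n; k += 4) {
        p0 *= a[k + 0];
        p1 *= a[k + 1];
        p2 *= a[k + 2];
        p3 *= a[k + 3];
    }
    return (p0 * p1) * (p2 * p3);
}
```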

11:07 - Whereas if you have several independent memory accesses, then the CPU is happy to send all of these requests to the memory system without waiting for one operation to finish before initiating another one. Sometimes the task that you are solving is inherently sequential. However, if you can figure out a way to rearrange the computations so that you have several independent operations next to each other, then you can immediately benefit from instruction-level parallelism.