Programming Parallel Computers: Part 1C
Apr 9, 2020 19:09 · 1773 words · 9 minute read
Let us now introduce a simple computational problem that we are going to use as a running example throughout this course. In this problem, we are given n points; here, in this example, n = 3. And we are given for each pair of points the cost of traveling from A to B. For instance, here, in this example, the cost of traveling from point 0 to point 1 is 8. However, we can sometimes save money by taking a detour.
00:40 - For instance, in this example, if you travel first from 0 to 2, and then from 2 to your final destination 1, you are going to pay only 7 units of money. And you can easily check that in this example, if you are going from 0 to 1 and you are willing to travel for at most two hops, this is actually the cheapest way to get there. And this is the problem that we want to solve: for every pair of points A and B, we want to know the cost of the cheapest route that gets from A to B by taking at most two hops. We are given as input an n × n matrix “d” that contains, for all pairs of points, the cost of getting from A to B directly.
01:39 - And we want to compute a result matrix “r”, which is also an n × n matrix, and which contains, for all pairs of points A and B, the cost of the cheapest route that takes at most two hops and takes you from A to B. Now, how do we solve this problem? Well, a straightforward way of solving it is to simply check all possible pairs of points i and j, and for each such pair also check all possible intermediate points k. We calculate the cost of getting from i to k and then from k to j. And, finally, we take the minimum over all of these costs: that is the cost of the cheapest route, and we store this result into our result matrix.
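The transcript does not show the code itself, but the straightforward solution described above might look roughly like this in C++. This is only a sketch: the function and parameter names are illustrative, and the matrices are assumed to be stored in row-major order as flat arrays of n·n floats.

```cpp
#include <algorithm>
#include <limits>

// r and d are row-major n-by-n matrices: element (i, j) lives at index n*i + j.
void step(float* r, const float* d, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float v = std::numeric_limits<float>::infinity();
            for (int k = 0; k < n; ++k) {
                float x = d[n*i + k];    // cost of going from i to k
                float y = d[n*k + j];    // cost of going from k to j
                v = std::min(v, x + y);  // keep the cheapest route found so far
            }
            r[n*i + j] = v;
        }
    }
}
```

Note that if the diagonal entries d[i][i] are zero, as is natural for this problem, then choosing k = i or k = j also covers the direct one-hop route, so the minimum over all k really is the cost of the cheapest route with at most two hops.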
02:37 - So how does this code perform? Let’s compile it with the usual optimization flags, and run it on a typical modern desktop computer. Let’s create an instance where n = 4000. So we have got 4000 points, and therefore the input matrix and the result matrix have dimensions 4000 × 4000. And if you check the numbers, you can see that the number of additions and the number of minimum operations that we are going to do for n = 4000 are roughly 64 billion each. So how long does it take to do all of these operations? If you run the code, it turns out that the running time on our test platform is going to be roughly 99 seconds. We are doing 1.3 billion useful operations per second. It’s a large number, but we are nevertheless doing only 0.36 operations per clock cycle.
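As a quick sanity check of these figures, the arithmetic works out roughly as follows. Note that the clock frequency of the test platform is not stated in this excerpt; a rate of about 3.6 GHz is assumed in the last step.

```latex
\begin{align*}
\text{additions} \approx \text{min operations} &\approx n^3 = 4000^3 = 6.4 \times 10^{10} \\
\text{total useful operations} &\approx 2n^3 \approx 1.3 \times 10^{11} \\
\text{throughput} &\approx 1.3 \times 10^{11} / 99\,\text{s} \approx 1.3 \times 10^{9}\ \text{ops/s} \\
\text{operations per clock cycle} &\approx 1.3 \times 10^{9} / (3.6 \times 10^{9}\ \text{Hz}) \approx 0.36
\end{align*}
```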
03:45 - And if you remember what we discussed in the previous part of this lecture, we would expect to be able to do dozens of useful operations per clock cycle using all of the resources that we have in these CPUs. If you check how many additions and minimum operations this specific CPU could in principle do per clock cycle and compare it with the numbers that we get, it turns out that we are currently using less than 1% of the theoretical maximum performance of this CPU. So what is wrong here? Why is this code inefficient? The answer is that it’s not any single thing that is wrong here. There is no magic quick fix. As we are going to see, there are going to be several different bottlenecks that limit the performance of this program. You take care of one of these bottlenecks, and then there is another one to overcome.
04:54 - But as we are going to see, it doesn’t need to be difficult. Without too much effort, we will see that we can actually improve the running time of this piece of code from the ballpark of minutes to the ballpark of seconds. Usually, in these kinds of applications, there are two key high-level challenges. The first challenge is related to getting data from the main memory to the CPU. And the second challenge is: once the data is actually there in the CPU, how do we do many independent operations simultaneously, in parallel?
05:42 - So what do we need to know about computers in order to make some progress here? Let’s have a look at the structure of a typical modern computer from a high-level perspective. First of all, there is of course the CPU, which is connected to the main memory. And the main memory is the place where all of our input data is going to be in the beginning, and where all of our results are going to be stored in the end. As we have already mentioned, inside the CPU there are multiple CPU cores. And these CPU cores are the units that contain all the machinery that actually performs arithmetic operations, like additions, multiplications, calculating minimums, and so on.
06:39 - But these arithmetic units don’t directly talk to the main memory. They can only process information that is already available in the CPU registers. And to get data from the main memory to the CPU registers, the CPU needs to go through several levels of cache memory. We have got the L1 cache, which is the fastest and closest to the registers. So if all of your data magically happened to be in the L1 cache, then you would be able to access it very rapidly.
07:22 - On the other hand, if you need to go through all three levels of caches, all the way to the main memory, to get some data, it’s going to be very slow. And a key challenge here is that these faster caches that are closer to the CPU registers are also very small. While we have got gigabytes of main memory, we have got only some kilobytes of L1 cache, for instance. And the main memory is slow in terms of both latency and throughput. If all of a sudden you need to get some data from the main memory, it’s going to take a lot of time. Let’s do a very simple experiment.
08:17 - We’ll simply run our program with different input sizes and measure how long it takes to solve the problem for each possible value of n. And if you calculate how many useful arithmetic operations you are doing per time unit, this kind of figure emerges. You can see that when the input size exceeds roughly 1000, the performance all of a sudden drops. And, if you check the numbers, with n = 1000 you have got roughly 4 MB of input data (a million matrix elements, 4 bytes each). If you remember the previous discussion, the size of the L3 cache is 6 MB.
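A measurement loop of this kind might look roughly as follows. This is only a sketch: it reuses the hypothetical step() routine from the earlier snippet, the list of input sizes is arbitrary, and a careful benchmark would also repeat each measurement several times.

```cpp
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

void step(float* r, const float* d, int n);  // the baseline routine sketched earlier

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    for (int n : {500, 1000, 1500, 2000, 3000, 4000}) {
        std::vector<float> d(n * n), r(n * n);
        for (float& x : d) x = dist(rng);  // random edge costs
        auto t0 = std::chrono::steady_clock::now();
        step(r.data(), d.data(), n);
        auto t1 = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(t1 - t0).count();
        double ops = 2.0 * n * n * n;      // one addition and one min per innermost iteration
        std::printf("n = %5d: %8.2f s, %6.2f billion useful ops/s\n",
                    n, seconds, ops / seconds / 1e9);
    }
}
```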
09:11 - So it clearly seems that as long as our entire input fits in the L3 cache, the performance is reasonable. But as soon as we run out of space in the L3 cache, the performance drops. So, for large values of n, our program is definitely struggling with getting data fast enough from the main memory to the CPU. So how can we try to improve the situation? Let’s zoom in to the innermost loop of the program. Let’s think about the memory access pattern.
09:55 - What are the memory cells that this program is going to access inside this innermost loop? Let’s imagine we fix some values of i and j; for instance, both are zeros. And let’s look at what happens when k goes through the values 0, 1, 2, and so on. On the first line of the loop we are going to access consecutive elements of array d: first we are going to read d[0], then d[1], then d[2], and so on. However, the second line is much more interesting for us, because it turns out that there we are going to access elements like d[0], d[4000], d[8000], and so on. So here we are clearly not reading consecutive memory elements; we are reading from various different parts of the memory. We are going to discuss cache memory hierarchies in a lot more detail later during this course. But already now it’s very good to learn one very simple rule of thumb: if you are reading memory, linear reading is always good.
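To make this concrete, here is the innermost loop from the earlier sketch again, annotated with the elements it touches when i = j = 0 and n = 4000 (same hypothetical layout as before):

```cpp
for (int k = 0; k < n; ++k) {
    float x = d[n*i + k];   // consecutive elements: d[0], d[1], d[2], ...  (linear reading)
    float y = d[n*k + j];   // jumps of n elements:  d[0], d[4000], d[8000], ...
    v = std::min(v, x + y);
}
```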
11:16 - In many applications, programs are reading consecutive elements of memory. The hardware can detect that, and it is optimized to deal very well with this kind of linear reading of consecutive elements. So here, we would expect that the first line of this innermost loop is performing well, because there we are doing linear reading. But the second line is probably struggling with the issue of getting data from memory to the CPU fast enough. So, how do we fix this? We need to know d[0], d[4000], d[8000], and so on. How can we avoid reading them in this scattered way?
12:03 - There is a very simple solution: we can spend a little bit of time preprocessing our data so that we get a nice memory access pattern inside the most performance-critical part of the code, the innermost loop. What we can do in this case is simply precompute an array “t” that contains the transpose of matrix “d”. Now d[0] is simply t[0], d[4000] is t[1], d[8000] is t[2], and so on. The memory access pattern improves dramatically: we are reading both “d” and “t” in a consecutive manner.
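In code, the preprocessing step and the new innermost loop might look roughly like this (again only a sketch, under the same assumptions as the earlier snippets):

```cpp
#include <algorithm>
#include <limits>
#include <vector>

void step(float* r, const float* d, int n) {
    // Preprocessing: t is the transpose of d, so that column j of d
    // becomes row j of t and can then be read linearly.
    std::vector<float> t(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            t[n*j + i] = d[n*i + j];

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float v = std::numeric_limits<float>::infinity();
            for (int k = 0; k < n; ++k) {
                float x = d[n*i + k];   // linear reading of row i of d
                float y = t[n*j + k];   // linear reading of row j of t (= column j of d)
                v = std::min(v, x + y);
            }
            r[n*i + j] = v;
        }
    }
}
```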
So, does this simple trick help anything? Let’s benchmark! Here the blue and the orange curves are what we already saw for the previous version, while the black curve shows the performance of the new version, where we simply replaced the reads of “d” with reads of the transpose of “d”. And, as we can see, the dramatic drop in performance around the point where we ran out of L3 cache is simply gone. Apparently, memory is no longer a bottleneck in this implementation. So, are we happy now? It actually turns out that we are still very far from the theoretical maximum performance of this CPU.

13:45 - And, in the next part of this lecture, we are going to see what is another bottleneck in the current implementation.