Programming Parallel Computers: Part 2A

Apr 16, 2020 19:50 · 1552 words · 8 minute read

Welcome to the second week of our course! In modern CPUs, there are three main forms of parallelism that we are going to learn to exploit: multicore parallelism, instruction-level parallelism, and vector operations. With multicore parallelism, we make sure that all CPU cores have useful work to do. With instruction-level parallelism, we make sure that each CPU core is executing its instructions as fast as possible. Finally, with vector operations, we make sure that each instruction does as much work as possible.

00:45 - Last week we learned to use instruction-level parallelism. This week we will learn about the other two forms of parallelism. In a sense, instruction-level parallelism was the easiest form of parallelism for us to use. We just arranged our code so that there were independent operations that could be executed simultaneously in parallel. And then magic happened, and the CPU parallelized it for us, on the fly. The other two forms of parallelism will require a little bit of extra effort from us. To use multiple CPU cores, we must somehow create multiple threads of execution. And the CPU does vector operations only if we explicitly tell it to do so, by using special machine language instructions. It’s important to note that these forms of parallelism concern very different scales. For example, in multicore parallelism, each thread executes a large number of operations, perhaps millions or billions of operations.

02:09 - In instruction-level parallelism, on the other hand, we might be concerned about parallelizing just a handful of operations somewhere in the most critical part of the innermost loop. Multicore parallelism in CPUs is tightly connected to multithreading in operating systems. A programmer simply creates multiple threads of execution. Then the operating system tells each CPU core to run its own thread. And you can pretty safely assume that the operating system is doing the right thing.

02:54 - If you are running a program that creates 4 threads on a CPU that has 4 cores, and there is nothing else happening in the computer at the moment, then there will be a 1-to-1 correspondence between threads and CPU cores. If you create more threads, then the operating system has to resort to some form of time-sharing, and the extra threads won’t give you any extra performance, just extra overhead. And if, on the other hand, you have fewer threads, then some CPU cores will simply sit idle, doing nothing. Now, how do we split our computation among multiple threads? The obvious answer is that you can create multiple threads using the low-level primitives in your programming environment, and then write some code in which threads communicate with each other and coordinate work-sharing. Lots of work, and very easy to make mistakes.
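To get a feeling for what that manual approach involves, here is a rough sketch using C++ standard threads. It is not taken from the course material: the function do_work and the way the iterations are split among threads are just made up for illustration.

```c++
#include <thread>
#include <vector>

void do_work(int i);  // hypothetical placeholder for some heavy computation

void run_manually() {
    const int iterations = 10;
    const int num_threads = 4;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        // Each thread takes every 4th iteration, starting at its own offset.
        threads.emplace_back([=] {
            for (int i = t; i < iterations; i += num_threads) {
                do_work(i);
            }
        });
    }
    // Wait for all threads to finish before returning.
    for (auto& thread : threads) {
        thread.join();
    }
}
```

Even this toy version has to decide by hand how to split the work and remember to join every thread.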

04:00 - However, there are much easier tools available. OpenMP makes work-sharing between multiple threads so easy that it almost feels like cheating. Let’s now see how to use OpenMP to distribute long-running computations among multiple CPU cores. Let’s start with a very simple example where we want to do some computation “c” ten times, with different parameter values 0, 1, and so on up to 9. Maybe operation “c” is some heavy-weight simulation that you want to try out with different parameters.

04:42 - The naive sequential solution simply calls “c” in a loop, and then we will have only one thread of execution that runs first c(0), then c(1), and so on, up to c(9). One CPU core is doing work, all others are idle. Let’s just add one OpenMP pragma directive before the loop. No other changes. Compile the code, run it on a computer with 4 cores, and see what happens! All of a sudden we automatically get 4 threads of execution, one per CPU core. And the loop is split automatically among the 4 threads.
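A minimal sketch of what this looks like; here c() stands in for the heavy-weight computation, and the wrapper function names are just for illustration:

```c++
void c(int i);  // the heavy-weight computation we want to run with parameters 0..9

// Naive sequential version: one thread runs c(0), c(1), ..., c(9) in order.
void run_sequential() {
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
}

// Same loop with one OpenMP directive: the iterations are split automatically
// among the threads, typically one thread per CPU core.
// Compile with OpenMP enabled, for example: g++ -O3 -fopenmp
void run_parallel() {
    #pragma omp parallel for
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
}
```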

05:34 - With very little coding effort we got an almost factor-4 speedup! There is no synchronization between the threads while they are doing their own work. The first thread will start c(1) as soon as c(0) is done. The threads have no idea what the other threads are doing. And this is what makes OpenMP fast: the threads coordinate only at the beginning and at the end of the loop. However, the beginning and the end are synchronized, so the function “d” that comes after the loop can safely assume that all work in the loop is done, just like in the sequential version. Usually, OpenMP splits a loop so that each thread does consecutive iterations. But you can use the “schedule” clause to change this. You can even switch to a fully dynamic schedule. This usually gives the best balance of workload between the threads, but it only makes sense if each iteration of the loop does a large amount of work.
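Continuing the sketch above, switching to a dynamic schedule is just a change to the pragma; the surrounding function is again only for illustration:

```c++
void c(int i);  // same placeholder computation as above

// With schedule(dynamic), each thread grabs the next unprocessed iteration
// as soon as it finishes its previous one, instead of being given a fixed
// consecutive chunk up front. Worthwhile only if each c(i) is expensive.
void run_parallel_dynamic() {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
}
```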

06:50 - Otherwise the communication overhead will ruin the performance. Now let’s put OpenMP to use in practice. Let’s return to the sample application that we introduced last week. Here is a naive sequential solution. There are a lot of things going on inside the outermost loop. But for us it’s enough to note that each iteration is independent. It doesn’t matter whether we first do i = 0 and then i = 1, or vice versa.

07:27 - It doesn’t even matter if we do these simultaneously in parallel! So we can safely ask OpenMP to parallelize the outermost loop. And it is really this simple! Compile it, run it, benchmark it, and you get a performance improvement of almost a factor of 4. We used the most inefficient version as a starting point, but we can equally well start with the version that already uses a good memory access pattern and instruction-level parallelism. Again we can just parallelize the outermost loop using OpenMP, and we get large performance improvements. So effective use of OpenMP can be as simple as just adding one line! But a lot of care is needed to make sure you really can add this line.
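The parallelized loop looks roughly like this. The exact code is not reproduced here, so this is a sketch reconstructed from the variables discussed below: a function that, for each pair (i, j), computes some value v from the read-only input array d and stores it in r[n*i + j]. The function name and the body of the innermost loop are assumptions, standing in for whatever work the application really does.

```c++
#include <algorithm>
#include <limits>

// Sketch of the parallelized sample application. The indexing follows the
// discussion below: iteration i of the outermost loop writes only to the
// elements r[n*i + 0] ... r[n*i + n - 1].
void step(float* r, const float* d, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            // v is defined inside the parallel section, so each thread
            // has its own private copy of it.
            float v = std::numeric_limits<float>::infinity();
            for (int k = 0; k < n; ++k) {
                v = std::min(v, d[n*i + k] + d[n*k + j]);
            }
            // Different values of i write to disjoint parts of r.
            r[n*i + j] = v;
        }
    }
}
```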

08:30 - Let’s carefully check all the variables and data items that we are accessing here. First, there is the variable “v”. This is defined inside the parallel section, so each thread is going to have its own private copy of this variable. So here, different threads are reading and writing their own private variables. No problems there. Second, we have the variable “n”. This is a shared variable; it was defined somewhere outside the parallel section. But that is no problem: we are only reading it. Many threads reading the same shared variable is fine.

09:15 - And the same holds for the array “d”: we are only reading it, so it is fine. The really tricky part is the array “r”. It is shared among all threads, and all threads are writing to this array. Now we need to be really careful. But if you think about it, each iteration of the loop writes to a different part of the array. For instance, let n = 10. When i = 0, we are only writing to elements between 0 and 9. When i = 1, we are only writing to elements between 10 and 19. And so on.

10:02 - For each value of i, we write to a different range of elements, so this is safe. Different threads will never try to write to the same array element. So we are allowed to ask OpenMP to parallelize the loop; OpenMP will be happy to do it, and it will work correctly. So what exactly are the rules? Many threads accessing their own private variables is always fine. Many threads accessing some shared read-only data is always fine.

10:39 - You can also have threads writing to different parts of a shared array. But if you ever have a situation in which there is a data element that one thread is reading while another thread is writing it, without explicit synchronization, you have a serious bug in your program. And the same holds if you have some shared data element that is written by multiple threads. These are really serious bugs! Your code will compile just fine; you will just start to notice that your program misbehaves in some situations at runtime. It might work on your computer but not on your customer’s computer.

11:29 - And it might misbehave in really strange, unexpected ways. You don’t want to have these kinds of bugs, so be careful! OpenMP is a powerful tool: you can do a lot of good with just one line of code, but also a lot of damage! Here we have some simple examples that illustrate these rules (sketched below). In the first loop, iteration 0 writes to x[1] while iteration 1 reads from x[1]. This would be a data race if you tried to parallelize the loop with OpenMP. In the second loop, iteration 0 writes to y[0] while iteration 1 also writes to y[0].

12:18 - Again, a data race if you tried to parallelize this with OpenMP. But the last example is just fine. Iteration 0 writes to y[0], but nobody else is writing to it or reading from it. Iteration 0 also reads from x[0], but nobody else is writing to it. Just add the OpenMP pragma and you immediately have a multicore version of this loop! What we have seen is already enough to get started with our basic OpenMP exercises!
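The example loops themselves are not reproduced above, so here is a sketch of three loops that behave as described; the exact expressions are assumptions, chosen only to produce the same read and write patterns.

```c++
// Data race if parallelized: iteration 0 writes x[1] while iteration 1 reads x[1].
void example1(int* x, int n) {
    for (int i = 0; i + 1 < n; ++i) {
        x[i + 1] = x[i] + 1;
    }
}

// Data race if parallelized: iterations 0 and 1 both write y[0].
void example2(int* y, const int* x, int n) {
    for (int i = 0; i < n; ++i) {
        y[i / 2] = x[i];
    }
}

// Safe: iteration i writes only y[i] and reads only x[i],
// so we can simply add the OpenMP pragma.
void example3(int* y, const int* x, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] + 1;
    }
}
```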