Programming Parallel Computers: Part 1A

Apr 9, 2020 19:09 · 1661 words · 8 minute read

Welcome everyone to the “Programming Parallel Computers” course! In this course, our main goal is to learn to write code that runs very fast on modern computers. And, as we will see soon, basically the only way to get there is to write code in which lots of things are happening simultaneously in parallel. In this course, we are going to see how efficient use of all the parallel computing resources on a modern CPU can lead to a performance improvement of a factor as high as 150. During this course, we are going to give you many assignments, and in each of these assignments, your task is to write code that solves a given problem as fast as possible, using the computers that we are using for this course. While we are going to learn a lot about the principles and practice of parallel programming during this course, we are going to focus on the good parts.

01:13 - We are going to focus on tools that are as easy to use as possible, as long as they do not sacrifice performance. We are going to put a lot of emphasis on understanding. We are going to demystify the hardware. And you are going to learn not only how to measure performance, but also how to predict the performance of your code. What we are going to do during this course is very much engineering. To solve the assignments, you are going to need a solid knowledge of the C or C++ programming language.

01:56 - But this is basically everything that you are going to need to get started. Let us now say a couple of words about why we actually need parallel programming in order to get good performance out of modern hardware. If you look at a modern computer, it turns out that it is a massively parallel computer. Not only are we going to have CPUs with multiple CPU cores, but on top of that, in each of these cores there are going to be multiple execution units that you can use simultaneously in parallel. Furthermore, these execution units are vectorized, meaning that in one step, each of these execution units can perform many similar operations simultaneously in parallel.

02:54 - And furthermore, these execution units are pipelined, which means that they can start new operations without waiting for the previous operations to complete. And, on top of that, there is also a massively parallel GPU in addition to the CPU, but a lot more about that later. Around the year 2000, something happened in the performance of modern computers. Basically, sequential performance stopped improving. All new performance is basically coming from parallelism. And to benefit from this parallelism, new code is needed. Old code written for sequential computers might use only something like 1% of the performance of modern computers. So what changed around the year 2000? Let’s have a quick look at the history of CPUs. Let’s go back in time to the year 1965, when Gordon Moore made his famous prediction. He predicted that the number of transistors, the smallest elementary building blocks of CPUs, is going to increase exponentially over time.

04:29 - This was a really bold prediction, because, as you can see here, it was based on just five data points, plus wild extrapolation from this data. But surprisingly, we can see that Moore’s law is still going strong. If you look at the number of transistors over time in various CPUs, you can see that still, in very recent years, the number of transistors is increasing rapidly. But something has changed. If you look at what kind of performance you get out of this ever-increasing number of transistors, something changed around the year 2000. Until 2000, the sequential performance of these CPUs kept increasing, thanks to the increasing number of transistors that engineers were putting in these CPUs.

05:35 - But after the year 2000, basically only parallel performance kept improving. What does this mean? Basically, if you look at how long it takes to do one elementary operation, let’s say one floating-point multiplication, with these CPUs, then until the year 2000 the time it takes to do one multiplication kept decreasing. Around the year 2000, this stopped. The time that it takes to do one multiplication is still roughly the same as what it was 20 years ago. But nowadays we can do more and more of these multiplications simultaneously in parallel. Put otherwise, until 2000 the latency of operations kept improving, while after the year 2000 latency has remained roughly constant and the throughput of these CPUs has kept improving.

06:53 - So let’s have a look at these two terms, latency and throughput, because they are going to be very important for us when we try to understand efficient use of modern CPUs. Latency, roughly speaking, is the time that it takes to complete one operation from start to finish. Throughput, on the other hand, is how many operations we can complete per time unit in the long run. Let’s try to illustrate these concepts with a simple example that is familiar to all of us. Let’s look at the task of doing Master’s studies at Aalto University.

07:37 - On average, from start to finish, one Master’s degree at our university takes roughly two years. On the other hand, our university is actually massively parallel. We don’t just take one student in, wait for them to finish their Master’s degree, and then accept another student. If we were doing that, we would have roughly half a graduate per year in the long run. However, the real throughput of our university is almost 2000 Master’s degrees per year. And we are achieving this thanks to parallelism. At any given point of time, we have got roughly 4000 students here doing their Master’s studies. So the latency of one Master’s degree at our university is roughly 2 years, while the throughput is as high as 2000 degrees per year. Now let’s get back to CPUs. If you look at older CPUs, say, before the year 2000, what was happening was that latency kept improving all the time. And thanks to latency improving, we were also achieving higher throughput.
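(As a side note that is not spelled out in the lecture, but that summarizes both the university example and the CPU discussion that follows: in a steady state, throughput is roughly the amount of work in progress divided by the latency of one item of work.)

$$\text{throughput} \approx \frac{\text{work in progress}}{\text{latency}} = \frac{4000 \text{ students}}{2 \text{ years}} = 2000 \text{ degrees per year}$$

With constant latency, the only way to increase throughput is to have more work in progress at the same time, and that is exactly what modern CPUs do.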

09:01 - But nowadays what is happening is that latency pretty much remains constant. But CPUs are getting more and more parallel. And thanks to parallelism, throughput is still improving. In the old days, progress used to look like this. Imagine that each of these blue boxes is one operation, let’s say a multiplication. Old CPUs used more time for one multiplication. Then you got a new CPU model in which latency had improved, and the time that it took to do one multiplication became shorter. And thanks to this, you had a higher throughput. On the other hand, nowadays progress looks more like this. The time that it takes to complete one operation hasn’t really improved.

09:58 - But nowadays you can do more and more operations simultaneously in parallel. And if you are able to benefit from this kind of parallelism in your program, you can still improve throughput. So far, I have been using words like “high throughput” and “low throughput”. Let’s now have a look at concrete numbers. Let’s fix a specific, modern, relatively cheap CPU, something that you might find in everyday desktop computers. For the sake of example, let’s also fix the operation that we are looking at: single-precision floating-point multiplication. If, in your program, you are doing one floating-point multiplication using this CPU, it’s going to take four clock cycles from start to finish. Therefore, if in your program you do a large number of multiplications so that you do one operation, wait for it to finish, then start another operation, and so on, in the long run you are going to achieve a throughput of only roughly 0.25 multiplications per clock cycle.
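To make this concrete, here is a minimal sketch of such a latency-bound loop in C++ (my own illustration, not code from the course materials). Each multiplication depends on the result of the previous one, so with a 4-cycle multiplication latency this loop cannot run faster than one multiplication every 4 clock cycles, i.e. roughly 0.25 multiplications per clock cycle:

```cpp
// Latency-bound: every multiplication has to wait for the previous result,
// so with a 4-cycle latency the CPU completes at most one multiplication
// every 4 cycles, i.e. 0.25 multiplications per clock cycle.
float multiply_chain(const float* x, int n) {
    float result = 1.0f;
    for (int i = 0; i < n; ++i) {
        result *= x[i];  // depends on the previous value of "result"
    }
    return result;
}
```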

11:24 - However, if you are able to exploit all the parallel computing resources in this CPU, you are able to achieve a performance of 64 operations per clock cycle, which is a factor-256 improvement over the sequential performance. So basically, if you are using this kind of a CPU, you can have up to 256 operations simultaneously in progress at any given point in time. Now, where does this number 256 come from? In this CPU, there are 4 CPU cores. Furthermore, the CPU is superscalar: in every clock cycle, each of these CPU cores can initiate 2 multiplications simultaneously in parallel.

12:26 - Furthermore, these execution units are pipelined. Even though the latency of one operation is 4 clock cycles, in every single clock cycle you can initiate a new operation without waiting for the previous operations to finish. And finally, all of these execution units are vectorized: each of these operations can be a vector multiplication, which performs up to 8 floating-point multiplications at once. So one of the key take-home messages is that parallel computing is a lot more than just multithreading. If you just try to exploit multiple CPU cores, you are only going to get an improvement by a factor of 4.
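To spell out the arithmetic behind these numbers: 4 cores × 2 multiplications initiated per core per clock cycle × 8 multiplications per vector operation gives 64 multiplications per clock cycle, and with a latency of 4 clock cycles that means 64 × 4 = 256 multiplications in flight at any given time. As a small illustration of just the pipelining part (again my own sketch, not course code), splitting the dependent chain from the earlier example into a few independent accumulators already lets the CPU start new multiplications while the previous ones are still in the pipeline:

```cpp
// Four independent accumulators: while the multiplication for r0 is still
// in the 4-cycle pipeline, the CPU can already start the multiplications
// for r1, r2, and r3, so the pipeline stays busy.
float multiply_chains(const float* x, int n) {
    float r0 = 1.0f, r1 = 1.0f, r2 = 1.0f, r3 = 1.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        r0 *= x[i + 0];  // these four multiplications are independent
        r1 *= x[i + 1];  // of each other, so they can all be in the
        r2 *= x[i + 2];  // pipeline at the same time
        r3 *= x[i + 3];
    }
    for (; i < n; ++i) {  // handle any remaining elements
        r0 *= x[i];
    }
    return r0 * r1 * r2 * r3;
}
```

(Note that this reorders the floating-point multiplications, so the result may differ from the sequential version in the last bits of precision.)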

13:16 - By exploiting all of these parallel computing resources, on the other hand, you can get an improvement by a factor of 256. And the second message here is that parallel computing is not only for high-performance supercomputers. You can nowadays find these kinds of parallel computing resources in any desktop computer, any laptop computer, even any modern mobile phone. So there is tons of potential that we are missing unless we pay attention to parallelism when we are writing our programs. And in this course you are going to learn how to actually benefit from all of this potential that we have in modern computers.