Programming Parallel Computers: Part 5B
May 10, 2020 21:17 · 2361 words · 12 minute read
Welcome back! Now we will start to reason about what happens inside the GPU, how we can predict the performance of our code, and what the possible ways to improve performance are. Of course the general principles are what is important, and the technical details of some specific GPU not so much, but I will need some concrete example as an illustration, so whenever I mention specific numbers, they refer to the GPU that we have in our classroom: an NVIDIA GPU with the so-called Maxwell microarchitecture and 5 “streaming multiprocessors”, or SMs for short. The specific numbers in different GPU models differ, but the numbers that we use should give you a good idea of the orders of magnitude. We already know almost all key ingredients that we need. We have kernels.
01:06 - We have blocks of threads and warps of threads; remember, the hardware always organizes threads in warps of size 32, while you as a programmer can organize threads in larger blocks that consist of many warps. And we have the concept of shared memory. Now one final thing we need to discuss is registers. Remember that in our CPU, we had 16 + 16 registers available for each thread. The machine language that we used was only able to refer to these registers. And most importantly, you never gained anything by saving registers.
01:51 - If you were able to solve something by using only 8 registers, good for you, but you didn’t really make anything faster that way. Now GPUs are different. In our GPU you can use up to 255 registers in each thread. As we have discussed, these are now scalar registers, each able to hold only one 32-bit number. But there are plenty of those: the machine language can refer to up to 255 different registers. So you can easily keep, for example, a 10 by 10 array of numbers in the GPU registers! However, using lots of registers isn’t exactly free, as we will see soon.
02:36 - All those registers need to be stored somewhere, and if you use many registers per thread, you won’t have space for as many active threads. So what will happen is that when you compile your kernel, NVCC will see how many registers you needed (at most 255), and also how much shared memory you needed. And this information will be stored together with your kernel. So your kernel will contain information such as “to run this kernel, we will need 96 registers per thread and 2 kilobytes of shared memory per block”. The compiler will do its best to use a reasonable number of registers.
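You can check these numbers yourself: compiling with “nvcc -Xptxas -v” makes the assembler report the register and shared memory usage of each kernel, and the same information can also be queried at run time through the CUDA runtime API. Here is a minimal sketch of the run-time query (my own example, not part of the course material; the kernel is just a placeholder):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; imagine your real kernel here.
__global__ void mykernel(float* data) {
    data[threadIdx.x] += 1.0f;
}

int main() {
    // Ask the CUDA runtime what resources the compiled kernel needs.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, mykernel);
    std::printf("registers per thread:           %d\n", attr.numRegs);
    std::printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
}
```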
03:22 - For example, in Chapter 4 part V2 we have this kind of code, in which we try to reuse data in registers. We read 8 + 8 values into arrays “x” and “y”. And we keep track of 8 × 8 intermediate results in array “v”. So we would expect that if the compiler does a reasonable job here, this code would use maybe 8 + 8 + 8 × 8 registers for these values, which is 80, plus some additional registers for loop counters, pointers, array indexes, and intermediate values. Note that the result of each addition also has to be stored temporarily somewhere before it can be used in the minimum operation, so some storage is needed for those, too.
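The code itself is not reproduced in this transcript, but the register reuse pattern described here looks roughly like the following sketch. It is written in the spirit of the V2 kernel, but the names and the exact indexing are illustrative (and it assumes, for simplicity, that n is a multiple of 64):

```
#include <cmath>

// Each thread computes an 8x8 block of results
// r[i][j] = min over k of (d[i][k] + d[k][j]),
// keeping 8 + 8 input values and 8x8 intermediate results in registers.
// A block of 8x8 threads handles a 64x64 tile of the result.
__global__ void mykernel(float* r, const float* d, int n) {
    int ia = threadIdx.x;   // 0..7
    int ja = threadIdx.y;   // 0..7
    int ic = blockIdx.x;
    int jc = blockIdx.y;

    float v[8][8];          // 8x8 intermediate results, kept in registers
    for (int ib = 0; ib < 8; ++ib)
        for (int jb = 0; jb < 8; ++jb)
            v[ib][jb] = HUGE_VALF;

    for (int k = 0; k < n; ++k) {
        float x[8];         // 8 values from column k, kept in registers
        float y[8];         // 8 values from row k, kept in registers
        for (int ib = 0; ib < 8; ++ib)
            x[ib] = d[(ic * 64 + ib * 8 + ia) * n + k];
        for (int jb = 0; jb < 8; ++jb)
            y[jb] = d[k * n + (jc * 64 + jb * 8 + ja)];
        // Each addition is materialized in a register before the minimum is taken.
        for (int ib = 0; ib < 8; ++ib)
            for (int jb = 0; jb < 8; ++jb)
                v[ib][jb] = fminf(v[ib][jb], x[ib] + y[jb]);
    }

    for (int ib = 0; ib < 8; ++ib)
        for (int jb = 0; jb < 8; ++jb)
            r[(ic * 64 + ib * 8 + ia) * n + (jc * 64 + jb * 8 + ja)] = v[ib][jb];
}
```

With this kind of a kernel each block consists of 8 × 8 = 64 threads, that is, 2 warps, and a launch would look something like “mykernel<<<dim3(n / 64, n / 64), dim3(8, 8)>>>(rGPU, dGPU, n)”, where rGPU and dGPU are the device-side arrays.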
04:11 - And if you then look at the assembly code output produced by NVCC, you will see that it uses registers between R0 and R95, so pretty much what one would expect, 96 registers used in total. OK, so now we know all we need to know about our kernel. We have chosen how many threads there are per block, and this directly determines how many warps there are per block. We have chosen how much shared memory we allocate per block. And the compiler has decided how many registers are used per thread. What happens, then, if we launch the kernel? All blocks are first put in a queue. No resources are allocated for them yet. They just wait in the queue. Inside our GPU, there are 5 “streaming multiprocessors” or SMs. On the highest level, what will happen is this. Whenever an SM has some resources available, it will take one block from the queue. One SM can take multiple blocks if there is room for them. So a block goes to one SM. The SM allocates resources for it.
05:35 - The block is executed by the SM, and once all threads in the block have completed their work, the block is thrown away. So the entire block goes to one SM, and sits there until all threads finish. Blocks don’t need to finish in the same order as they start; their execution inside the SM may be interleaved and parallelized in an arbitrary manner. And most importantly, SMs work completely independently. This is the reason why the entire block has to go to one SM: you can have communication between threads only inside an SM. Now what happens inside an SM when it takes a block from the queue? The block, as well as all warps in the block, become active, and resources are allocated for them. First, you need to reserve one slot for the block itself. There is room for only 32 blocks per SM. Second, you need to reserve one slot for each warp. There are only 64 such slots per SM.
06:49 - Third, you need to allocate shared memory for the block. There is only 64 KB of shared memory in the SM. And finally, you need to allocate space for the registers of each thread. There are 64K physical 32-bit registers in the SM. If there are resources for all that, then the SM can take a block from the queue and activate it; otherwise the blocks will have to wait in the queue.
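These per-SM limits can be printed for whatever GPU you happen to be running on; a small host-side sketch (my addition, not from the lecture):

```
#include <cstdio>
#include <cuda_runtime.h>

// Print the per-SM resource limits discussed above for GPU number 0.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("SMs:                     %d\n", prop.multiProcessorCount);
    std::printf("max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("max warps per SM:        %d\n",
                prop.maxThreadsPerMultiProcessor / prop.warpSize);
    std::printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    std::printf("shared memory per SM:    %zu bytes\n",
                prop.sharedMemPerMultiprocessor);
    // Recent CUDA versions also expose the block-slot limit as
    // prop.maxBlocksPerMultiProcessor.
}
```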
07:20 - And only active warps will take part in execution: at each clock cycle the SM will try to find some active warps that are ready for execution. So usually we would like to have many active warps. And now the number of active blocks and active warps we can have depends on how much of these resources we need. No matter what you do, you can’t have more than 64 active warps. But there are many reasons why you might have fewer. Let’s see what is needed if you want to have 64 active warps per SM. First, you need enough threads in total. If you want to have all 5 SMs busy with 64 active warps, you need at least 320 warps and hence more than 10 000 threads. Second, your block size has to be large enough. If your blocks consist of only 1 warp, there is room for only 32 blocks per SM, and hence only 32 warps per SM.
08:25 - So the block size should be at least 2 warps, or 64 threads. Third, you can’t use too many registers in your kernel. If you used all 255 registers, there would be room for only 256 threads or 8 warps. You can use at most 32 registers if you want to have room for 2048 threads or 64 warps. Fourth, you can’t use too much shared memory in your kernel.
08:54 - For example, if your block size is 2 warps (64 threads), and you would like to have room for 32 blocks (64 warps), each block can only use 2 kilobytes of shared memory. I’d like to emphasize that it isn’t necessarily bad if you have fewer than 64 active warps. But this is important to keep in mind: merely having lots of blocks does not guarantee that there are lots of active warps ready for execution. A common mistake is to try to allocate too much shared memory, and then all of a sudden you will have only a handful of warps active per SM. Good, so now we have reached a point in which each SM has some active blocks.
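Before we continue, here is a rough back-of-the-envelope version of the arithmetic above, with the per-SM limits of this particular GPU hard-coded (this is only my sketch; the CUDA runtime can do a similar calculation for you with cudaOccupancyMaxActiveBlocksPerMultiprocessor):

```
#include <algorithm>

// Rough estimate of active warps per SM, with the limits used in this
// lecture hard-coded: 32 block slots, 64 warp slots, 64K 32-bit registers,
// 64 KB of shared memory. Shared memory is given in bytes per block.
int active_warps_per_sm(int warps_per_block, int regs_per_thread,
                        int shared_mem_per_block) {
    int by_block_slots = 32 * warps_per_block;
    int by_warp_slots  = 64;
    int by_registers   = (64 * 1024) / (regs_per_thread * 32);
    int by_shared_mem  = shared_mem_per_block > 0
        ? (64 * 1024) / shared_mem_per_block * warps_per_block
        : by_block_slots;
    int warps = std::min(std::min(by_block_slots, by_warp_slots),
                         std::min(by_registers, by_shared_mem));
    return warps / warps_per_block * warps_per_block;  // whole blocks only
}

// Examples matching the numbers above:
//   active_warps_per_sm(2, 32, 2048) == 64   full occupancy
//   active_warps_per_sm(1, 32, 0)    == 32   blocks of 1 warp: block slots run out
//   active_warps_per_sm(2, 255, 0)   == 8    255 registers per thread
```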
09:43 - We have allocated space for the registers of all active threads. And we have now some number of active warps ready for execution. To keep things simple, let’s now focus on arithmetic operations. Let’s assume the kernel is just doing some basic floating-point operations, like floating-point additions, using data that can be kept in registers. Say, the next instructions in the kernel are these, and assume a, b, c, and d are some single-precision floating-point numbers that are kept in registers.
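The slide with those instructions is not reproduced in this transcript; the following stand-in is consistent with the narration that follows, but the exact operands are my assumption:

```
// Illustrative stand-in for the instruction sequence discussed below.
__global__ void arithmetic_example(float* out, float a, float b,
                                   float c, float d) {
    a = a + c;   // increment "a"
    a = a + d;   // increment "a" again: needs the result of the previous addition
    b = b + c;   // increment "b": independent of the two additions above
    b = b + d;   // increment "b" again: needs the result of the previous addition
    out[threadIdx.x] = a + b;   // keep the compiler from throwing the work away
}
```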
10:16 - We want to first increment “a”, then increment “a” again, then increment “b”, etc. And let’s also assume there are 8 active warps. We don’t care about blocks here. In each SM there are 4 schedulers that can take instructions from active warps and put them to arithmetic units. And for each scheduler, there is one arithmetic unit that can do 32-wide arithmetic operations. And let’s assume that in the beginning, the first instruction of each warp is ready for execution.
10:54 - All threads know the current values of a and c. So what happens now might look a bit like this. The first clock cycle: each scheduler can find an active warp in which there is an instruction ready to be executed. So the instruction goes to the arithmetic units. Next clock cycle, each scheduler can again find an active warp where the first instruction is ready to be executed. So we can again put 4 instructions to the arithmetic units. Good, the arithmetic units are kept busy. Next clock cycle. But now we are in trouble! In each warp, the next instruction depends on the result of the instruction that is still in the pipeline. We can’t start to compute a + d without knowing what a is. So we wait. Everyone waits. Note that we don’t try to execute anything out of order. We just look at the next instruction, and if it isn’t ready, we wait. Tick, tick, tick.
12:02 - The latency of one addition is something like 6 cycles. Finally something is ready. Four instructions go to the arithmetic units. Next clock cycle, more things are ready, four instructions go to the arithmetic units. Next clock cycle, what happens now? The next instruction does not depend on anything that is currently being executed, so we can push it to the arithmetic units! Note that here we do have some instruction-level parallelism helping us! Two consecutive instructions that do not depend on each other can be in the arithmetic units simultaneously! As I said last week, instruction-level parallelism isn’t as important with GPUs as with CPUs. But occasionally it can help. Instead of having lots of active threads, you can also try to keep the arithmetic units busy by having a smaller number of active threads with plenty of independent consecutive instructions.
13:08 - Next clock cycle, we can do more work. Next clock cycle, again we need to wait. And wait, and wait. Until finally these threads know the value of b and can execute the next instruction. And so on, we continue as long as there is work to do. Finally, when the threads of a warp finish execution, their resources are released. And once there is enough room in the SM, it can take another block from the queue and make it active. So what does all this mean, for example, if your code is doing just a whole bunch of additions and you want to keep the arithmetic units busy? If you have only 3 warps per SM, then it is not going to be possible: in each clock cycle, at most three of the four schedulers can find something to do. But if you have got 4 warps per SM, and your kernel has got enough independent instructions in a row, then you should be able to keep all arithmetic units busy.
14:20 - At each clock cycle, each scheduler can find a warp that is ready for execution. But what if you don’t have any instruction-level parallelism? What if you have got only a long sequence of additions that all depend on previous additions? Then you would need a much larger number of active warps. If you want to keep all arithmetic units busy all the time, and there are 4 arithmetic units, and the latency of one addition is 6 cycles, you would need as many as 24 active warps (4 by 6)! At least this is the theory. When the number of active warps is at most 16, there is also a very good agreement between theory and practice. You check the number of active warps, you have a look at how much instruction-level parallelism there is, you do the math, benchmark, and what you get is usually at least 80% of what you would expect, often more than 95%.
15:19 - Unfortunately, with our GPU, going beyond 16 active warps in this task does not give more performance in practice. So if you have got a long sequence of additions that depend on each other, the arithmetic units will necessarily be a bit idle. You can achieve maybe some 65% of the maximum arithmetic throughput. You will need a little bit of instruction-level parallelism to get close to 100%. But you don’t need much of it: just always having two independent instructions next to each other, and something like 16 active warps, is already enough to get 97% of the maximum arithmetic throughput.
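For example, one common way to get that little bit of instruction-level parallelism is to keep two independent accumulators instead of one, so that two consecutive additions never depend on each other. This is only a sketch of the idea (my example, not from the lecture):

```
// Each thread sums its own slice of 2*m consecutive elements of x.
// With one accumulator every addition would depend on the previous one;
// with two independent accumulators, two consecutive additions can be in
// the pipeline at the same time.
// Rough rule of thumb: with 4 arithmetic units and a 6-cycle latency you
// need about 4 * 6 / (independent instructions in a row) active warps per
// SM, i.e. 24 warps with no ILP but only 12 warps with two-way ILP.
__global__ void sum_pairs(const float* x, float* out, int m) {
    int t = threadIdx.x + blockIdx.x * blockDim.x;
    const float* p = x + 2 * m * t;
    float s0 = 0.0f;
    float s1 = 0.0f;
    for (int i = 0; i < m; ++i) {
        s0 += p[2 * i];       // independent of s1
        s1 += p[2 * i + 1];   // independent of s0
    }
    out[t] = s0 + s1;
}
```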
16:05 - So maybe I lied a bit when I said last week that instruction-level parallelism isn’t that relevant with GPUs. But at least you need a lot less of it, and the lack of it will cost a lot less than with CPUs… So now we have got at least some understanding of how the GPU actually executes our code, and how to get some idea of what arithmetic throughput you might get from the GPU. In the next part we will continue with some aspects of GPU programming that may also influence the performance of our programs. See you soon!