Programming Parallel Computers: Part 5A

May 10, 2020 21:17 · 1463 words · 7 minute read

Hi, and welcome back! Last week we learned the basics of GPU programming, and how to do it in practice using CUDA. The high-level idea that we learned is this: You define a kernel function. You create a very large number of threads. All threads run the kernel function. And in the kernel function, a thread can then ask “what is my thread number”, and accordingly do the right part of the work. But, there is a big but! Threads are not entirely independent.
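Before we continue, here is a quick refresher sketch of that basic recipe in code. Everything here (the kernel name add_one, the array names, the sizes) is made up for illustration, not code from the course, but it shows the idea: define a kernel, launch many threads in blocks, and let each thread compute its own index.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Each thread handles one element: it asks "what is my thread number?"
    // by combining its block index with its thread index within the block.
    __global__ void add_one(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            data[i] += 1.0f;
        }
    }

    int main() {
        const int n = 1000;
        float host[1000];
        for (int i = 0; i < n; ++i) host[i] = float(i);

        // Allocate GPU memory and copy the input over.
        float* device = nullptr;
        cudaMalloc((void**)&device, n * sizeof(float));
        cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

        // Launch enough blocks of 256 threads to cover all n elements.
        int block_size = 256;
        int num_blocks = (n + block_size - 1) / block_size;
        add_one<<<num_blocks, block_size>>>(device, n);

        // Copy the result back and release the GPU memory.
        cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(device);

        printf("host[42] = %f\n", host[42]);
        return 0;
    }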

00:38 - Threads are organized in warps of 32 threads, and warps are then organized in blocks. And the entire warp of threads will always execute in a synchronous manner. So why is this the case? Why do we need these two concepts, warps and blocks? A short answer is this: Warps are there to make the life of the hardware easier. And blocks are there to make the life of the programmer easier. Let me try to explain this. Consider a hypothetical GPU that didn't have anything like warps or blocks: you could just create millions of threads, and they would execute completely independently of each other.

01:27 - Now remember that the GPU we use as a running example here has got 640 arithmetic units for doing scalar operations. And to keep all of them busy, you will need to feed some work to each of them in every clock cycle. So if all threads were completely independent, the hardware would need 640 parallel circuits that find threads that are ready for execution, process their instructions, move the operands of each instruction to the right arithmetic unit, and so on. Maybe this could be done, but we would waste lots of space for all these circuits, and lots of power as well. Most of the GPU would be circuits for control logic, instead of circuits that do useful work.

02:23 - By organizing threads in warps of 32 threads, the GPU only needs to find, in each clock cycle, 20 warps that are ready for execution. Instead of 640 scheduling units that find instructions from individual threads, you only need 20 scheduling units that find complete warps that are ready for execution. One piece of circuitry then dispatches 32 identical arithmetic operations to 32 arithmetic units in one step. We could potentially save power consumption and space usage related to control logic by almost a factor of 32. Most of the GPU can be arithmetic units, instead of control logic.

03:10 - This is very similar to the reason why CPUs have 8-wide vector units instead of just increasing the number of cores by a factor of 8: you can add more processing power without adding more control logic. So warps help with hardware design. What about blocks? Blocks are a nice feature that GPU programmers can use. One block of threads can coordinate its work: threads in one block can communicate with each other efficiently. One really nice feature in GPUs that you can use is the so-called "shared memory".

03:55 - This means memory that is shared between the threads of one block. There isn't much of it; we are talking about maybe a couple of kilobytes per block. But it is nevertheless very useful. For example, let's say you would like to have one block of threads compute the sum of n values. One way to organize this is the following. Say we have got b threads in a block. We allocate b words of shared memory. We split the input into b parts. Each thread calculates a local sum and stores it in shared memory, in its own slot.

04:41 - Then the block synchronizes, waiting for all threads to finish their work and write their own part of the result to shared memory. And finally, for example, thread number 0 can read all b values and calculate the grand total. And of course we can continue this way: maybe store the grand total again in shared memory, synchronize, let all threads read the grand total, and then each thread can use the sum in whatever calculations it needs to do next. So blocks are there to help us, and the hardware has got lots of nice features that make it efficient for the threads of one block to talk to each other. In simple applications you often don't need such features, but in more complicated algorithms they often help a lot.
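A sketch of this pattern in code could look roughly like the following. The kernel name block_sum is made up, I pick b = 256 threads per block, and I split the input between the threads in an interleaved fashion; all of these are just choices made for the sketch.

    #define B 256   // b = number of threads per block, chosen for this sketch

    __global__ void block_sum(const float* data, int n, float* result) {
        __shared__ float partial[B];    // b words of shared memory, one slot per thread

        // Each thread calculates a local sum over its own part of the input.
        float local = 0.0f;
        for (int i = threadIdx.x; i < n; i += B) {
            local += data[i];
        }
        partial[threadIdx.x] = local;   // store it in its own slot

        // Wait until every thread of the block has written its slot.
        __syncthreads();

        // Thread number 0 reads all b values and calculates the grand total.
        if (threadIdx.x == 0) {
            float total = 0.0f;
            for (int i = 0; i < B; ++i) {
                total += partial[i];
            }
            *result = total;
        }
    }

This would be launched as one block of B threads, for example block_sum<<<1, B>>>(data, n, result), where data and result point to GPU memory.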

05:32 - So to summarize, in GPUs threads are organized in warps and blocks. Warps come from the hardware design: you will have warps of size 32 whether you want them or not, and the entire warp will do its work in a synchronized manner, simply because this is how the hardware works. You will need to live with that, and you will need to be aware of it, for example when you reason about memory access patterns. Blocks, on the other hand, are there to help you, and you can fairly freely pick a block size that makes the most sense in your application. If you don't need any coordination between threads, just pick a nice round number like 64 or 256 as your block size, and you are done.

06:25 - But if you need some coordination between threads, you are relatively free to choose the size of the block that will work together to solve one part of your problem. Let's first quickly see how to use shared memory. There are just two things you will need. First, to state that your kernel would like to use shared memory to store, for example, 100 floats, you just add one line at the beginning of the kernel code (see the sketch below). The NVCC compiler will then see that this kernel needs 400 bytes of shared memory per block, write this down in the metadata for this kernel function, and whenever you launch the kernel, CUDA will take care of allocating shared memory as needed. Now keep in mind that each block has got its own array "x".
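The exact line is not visible in this transcript, but a static shared memory declaration for 100 floats presumably looks like this (the kernel name mykernel is just a placeholder):

    __global__ void mykernel(/* ... parameters ... */) {
        // One line at the start of the kernel: 100 floats = 400 bytes
        // of shared memory, allocated once per block.
        __shared__ float x[100];

        // ... rest of the kernel; all threads of the block see the same x ...
    }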

07:21 - The x[0] of block 10 is a different piece of storage from the x[0] of block 11. However, all threads of a block share the same array "x". For example, x[0] in thread 5 of block 10 refers to the same piece of storage as x[0] in thread 6 of block 10. And the usual rules regarding shared data apply, just like with OpenMP: without explicit synchronization, you can have either only one thread accessing an element, or many threads reading it.

08:06 - If one thread writes to x[0], no other thread can read or write it simultaneously. So how do we synchronize things? You can just use __syncthreads(). Whenever one warp of a block reaches __syncthreads(), it will wait for all warps of the same block to reach the same place. So for instance you can first have each thread write to its own part of shared memory, then call __syncthreads(), and after that it is safe for all threads to read anywhere in the shared memory. Then you call __syncthreads() again to make sure everyone has finished reading, and then it is safe to write to shared memory again, and so on.
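As a sketch, one round of this write-synchronize-read-synchronize pattern inside a kernel could look roughly like this. The kernel name iterate is made up, and the sketch assumes blocks of at most 256 threads and one input element per thread.

    __global__ void iterate(float* data, int rounds) {
        __shared__ float buf[256];          // assumes at most 256 threads per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float value = data[i];              // assumes one element per thread

        for (int r = 0; r < rounds; ++r) {
            buf[threadIdx.x] = value;       // write only to my own slot
            __syncthreads();                // wait: is everyone done writing?
            // Now it is safe to read anywhere in shared memory.
            value += buf[(threadIdx.x + 1) % blockDim.x];
            __syncthreads();                // wait: is everyone done reading,
                                            // before the next round overwrites buf?
        }
        data[i] = value;
    }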

08:56 - So a very common pattern is a loop in which you repeatedly write to shared memory, synchronize, read from shared memory, and synchronize. People usually remember to synchronize between writing and reading, but often forget to synchronize between reading and writing. Remember that you can’t start to write before you make sure everyone has finished reading. One key point before we continue: there is very little shared memory. Each SM (that is, streaming multiprocessor) has got only 64 kilobytes of shared memory in total, and this is shared among all blocks that are active, so if you’d like to have, say, 8 blocks active per SM, you can only use 8 kilobytes of shared memory per block.

09:52 - This isn’t much, but it is very fast memory, so you can try to use it a bit like L1 cache in CPUs. So now we have seen all the key primitives that we need to use in everyday CUDA programs. We can allocate some GPU memory, and move data between the CPU memory and GPU memory. We can create blocks of threads and launch a kernel. We allocate shared memory and use it to share data between threads of a block. And we can use “syncthreads” to synchronize our work. In the next part, we will start to reason about what happens inside the GPU, how does the GPU execute our code, how much parallelism we will get, and how to reason about the performance of our code. See you soon! .