Programming Parallel Computers: Part 5C
May 10, 2020 21:17 · 1115 words · 6 minute read
In the previous parts, we have said many times that all threads of a warp work in a synchronous manner. But our kernel is C++ code that is written from the perspective of an individual thread. So what if there is an “if-else” statement in the kernel? And what if the condition is true for some threads of a warp and false for others? Or what if we have a “for”-loop that runs for a different number of iterations in different threads? Or what if we call a recursive function that does completely different things in different threads? The answer is that there is an execution mask that tells which threads are enabled and which are disabled. For example, if you have an “if-else” statement where the condition is true for threads 0 to 15 and false for threads 16 to 31, then the warp will run both branches. During the “if” branch, threads 16 to 31 are disabled.
01:17 - During the “else” branch, threads 0 to 15 are disabled. The same basic idea holds for all other cases in which the threads of a warp would like to do different things. If you have a for-loop in which one thread takes 1 million iterations, then the entire warp will take 1 million iterations; threads that finish earlier are simply marked as disabled in the execution mask. So any C++ code will work correctly and do whatever you want, and you don’t directly see that the threads of a warp are tightly coupled together.
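To make this concrete, here is a minimal sketch of a divergent kernel (the kernel name and the operations are purely illustrative, not something from the lecture, and a one-dimensional block is assumed): the condition splits each 32-thread warp in half, so the hardware runs both branches and uses the execution mask to keep only the relevant half of the warp active in each one.

```cuda
__global__ void divergent(float* data) {
    // Index of this thread within its warp (warps are 32 consecutive threads).
    int lane = threadIdx.x % 32;
    if (lane < 16) {
        data[threadIdx.x] *= 2.0f;   // lanes 16-31 sit disabled while this runs
    } else {
        data[threadIdx.x] += 1.0f;   // lanes 0-15 sit disabled while this runs
    }
}
```

The whole warp spends time on both branches, even though each thread contributes to only one of them.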
02:02 - But this will heavily influence the performance. If the 32 threads try to do 32 completely different things, they will end up working in a sequential manner: one thread does its own part while all the others are disabled, then the next one, and so on. So in general, programs in which all threads always perform the same operations tend to be the most efficient. Note that these are usually also the tasks in which you could make efficient use of vector operations on CPUs. I think by now you should have a pretty good understanding of what kind of devices GPUs are, how to program them, and which key factors influence the performance of GPU code.
02:53 - While we have used a specific NVIDIA GPU and the CUDA programming environment as a running example, the basic principles that we have seen generalize directly to many other GPUs and programming environments. I will now conclude this lecture by mentioning a couple of technical details that are somewhat specific to CUDA. These are things that are good to be aware of, so that you at least know what to search for when you need more information. First, a bit about the compilation process. When you use NVCC, it will compile the CPU-side code as usual, but it will compile the GPU-side code into an intermediate language called PTX, which is platform-independent.
03:45 - Then PTX is compiled into an assembly language called SASS, and SASS is what the GPU actually runs. So SASS is what you want to look at if you really want to understand what happens when the GPU executes your kernel. You can use “cuobjdump” to inspect the SASS code. Second, a bit about warps. We have seen that the threads of a block can communicate through shared memory, but there you need to take care of synchronization explicitly.
04:21 - However, the entire warp always runs in a synchronous manner, and therefore you can exchange information between the threads of a warp more efficiently. You can do it easily by using warp-wide operations such as the warp shuffles; see the sketch below. These are very similar in spirit to CPU-side code that uses special instructions to reorder the elements of a vector. I think this is enough new information for today. We could say a lot more, but I don’t want to overload you with tons of technical details that would take attention away from the key ideas. So let’s recap the key ideas.
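As a concrete illustration (a minimal sketch, not taken from the lecture; the kernel name is made up and it assumes a launch with a single warp of 32 threads, e.g. `warp_sum<<<1, 32>>>(in, out)`), here is a warp-level sum computed with the CUDA shuffle intrinsic __shfl_down_sync, using no shared memory and no explicit synchronization:

```cuda
__global__ void warp_sum(const float* in, float* out) {
    // Each thread of the warp starts with one value.
    float v = in[threadIdx.x];
    // Repeatedly fold the upper half of the warp onto the lower half:
    // after 5 steps, thread 0 holds the sum of all 32 values.
    for (int offset = 16; offset > 0; offset /= 2) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    if (threadIdx.x == 0) {
        *out = v;
    }
}
```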
05:04 - You need to explicitly say what the GPU should run. You write a kernel, you specify how many blocks of threads you want and how many threads there are per block, and then you launch the kernel. All threads will run the same kernel code, but in the kernel you can use the thread index and the block index to decide what to do. GPU-side code accesses only GPU memory, CPU-side code accesses only CPU memory, and you need to explicitly use CUDA functions to move data between them. Threads aren’t independent: they are organized in warps of 32 threads, and all threads of a warp are always synchronized.
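In code, this workflow looks roughly like the following sketch (the kernel name and the sizes are illustrative, not from the lecture): allocate GPU memory, copy the data over, launch a grid of blocks, and copy the result back.

```cuda
#include <cuda_runtime.h>

// Each thread computes its global index from the block and thread indices
// and processes one element of the array.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

int main() {
    const int n = 1 << 20;
    float* host = new float[n]();
    float* device = nullptr;
    cudaMalloc(&device, n * sizeof(float));
    // CPU-side code cannot touch GPU memory directly, so copy explicitly.
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);
    // 256 threads per block, and enough blocks to cover all n elements.
    add_one<<<(n + 255) / 256, 256>>>(device, n);
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device);
    delete[] host;
    return 0;
}
```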
05:49 - If you access memory in one thread, all the other threads of the warp are doing their own memory lookups at the same time, and you’d better make sure these memory accesses don’t span too many cache lines, or it will get expensive. Threads are also organized in blocks, and you can choose the size of a block, within some limits. The threads in a block can easily talk with each other: you can allocate a small amount of shared memory for each block, and the threads can read and write this memory. Just remember to synchronize properly to avoid data races, as in the sketch below. And finally, when reasoning about performance, keep in mind that the GPU executes instructions in a linear order.
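Here is a minimal sketch of that shared-memory pattern (again an illustrative kernel, assuming it is launched with 256 threads per block and that the arrays cover the whole grid): the threads first stage data in shared memory, and __syncthreads() separates the writes from the reads so that there is no data race.

```cuda
__global__ void neighbour_shift(const float* in, float* out) {
    __shared__ float buffer[256];            // one slot per thread of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buffer[threadIdx.x] = in[i];
    __syncthreads();                         // all writes finish before any read
    // Each thread reads the value written by its neighbour in the same block.
    int next = (threadIdx.x + 1) % blockDim.x;
    out[i] = buffer[next];
}
```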
06:31 - Unlike a CPU, which will look far ahead in the instruction stream to find something to do, the GPU only looks at the next instruction of each warp; if that instruction is not ready for execution, the whole warp will wait. So you will usually want to have lots of active warps, so that even if many of them are currently waiting, there is still useful work to do. And using lots of registers or lots of shared memory can be bad: these resources are limited on the GPU, and you can’t have many active blocks if each block uses lots of shared memory or if each thread uses lots of registers. Now it is your turn to put these principles to use.
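If you want to see how register and shared-memory usage limit the number of resident blocks in practice, the CUDA runtime can report it; the following is a small sketch (the kernel is a made-up placeholder, and 256 is just an example block size):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; in practice you would query your own kernel.
__global__ void my_kernel(float* data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main() {
    int blocks_per_sm = 0;
    // How many blocks of 256 threads can be resident on one multiprocessor?
    // The answer depends on how many registers each thread uses and how much
    // shared memory each block uses (here: 0 bytes of dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel, 256, 0);
    printf("resident blocks per multiprocessor: %d\n", blocks_per_sm);
    return 0;
}
```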
07:22 - Please remember that many ideas that we covered earlier in the context of CPU programming still apply here. For example, it is always a good idea to try to minimize memory reads by reusing data in registers. But now you have an extra degree of freedom: do you use as many registers as possible, even if it means that you can have fewer active warps? Have fun experimenting, and see you next week! Then we will talk about something completely different…