New tools can remove inefficiencies in GPU computing, but can we turn that into real performance gains?
Traditional GPU programming follows a pattern in which the CPU acts as the host: it launches short-lived kernels on the GPU and controls the overall flow of computation and communication. Recent additions to the CUDA toolbox change this picture. Persistent kernels can keep control of program execution on the GPU, and libraries such as NVSHMEM provide GPU-initiated communication. Using these tools, we can eliminate the latencies and inefficiencies of frequent CPU-GPU communication.
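To make the contrast concrete, here is a minimal single-GPU sketch, not taken from any existing application. It runs the same toy 1D smoothing iteration twice: once with the CPU relaunching a short kernel every iteration, and once as a single persistent kernel that loops on the GPU and synchronizes with a cooperative-groups grid barrier. The names `smooth_point`, `step_kernel`, `persistent_kernel`, and `run` are hypothetical. The sketch assumes a device that supports cooperative launch, and it omits error checking for brevity. GPU-initiated communication between several GPUs (for example with NVSHMEM) is not shown.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <utility>
#include <vector>

namespace cg = cooperative_groups;

// One smoothing step for element i; the two boundary values stay fixed.
__device__ void smooth_point(const float* in, float* out, int n, int i) {
    out[i] = (i == 0 || i == n - 1) ? in[i] : 0.5f * (in[i - 1] + in[i + 1]);
}

// Traditional pattern: a short-lived kernel that the CPU relaunches every iteration.
__global__ void step_kernel(const float* in, float* out, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        smooth_point(in, out, n, i);
}

// Persistent pattern: one launch; the iteration loop and the barrier live on the GPU.
__global__ void persistent_kernel(float* a, float* b, int n, int iters) {
    cg::grid_group grid = cg::this_grid();
    for (int it = 0; it < iters; ++it) {
        for (int i = grid.thread_rank(); i < n; i += grid.size())
            smooth_point(a, b, n, i);
        grid.sync();                  // device-wide barrier replaces the launch boundary
        float* t = a; a = b; b = t;   // swap input and output buffers
    }
}

// Runs `iters` smoothing steps on n points and returns the result (no error checks).
std::vector<float> run(bool persistent, int n, int iters) {
    std::vector<float> h(n, 0.0f);
    h[0] = 1.0f;                      // fixed left boundary value
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemcpy(a, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // A cooperative grid must be fully co-resident, so size it from occupancy.
    const int threads = 256;
    int blocksPerSm = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, persistent_kernel, threads, 0);
    const int blocks = blocksPerSm * prop.multiProcessorCount;

    if (persistent) {
        void* args[] = {&a, &b, &n, &iters};
        cudaLaunchCooperativeKernel((void*)persistent_kernel, dim3(blocks), dim3(threads),
                                    args, 0, 0);
        if (iters % 2 == 1) std::swap(a, b);  // result sits in b after an odd step count
    } else {
        for (int it = 0; it < iters; ++it) {  // CPU drives every iteration
            step_kernel<<<blocks, threads>>>(a, b, n);
            std::swap(a, b);
        }
    }
    cudaDeviceSynchronize();
    cudaMemcpy(h.data(), a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    return h;
}

int main() {
    const bool same = run(false, 1 << 16, 100) == run(true, 1 << 16, 100);
    std::printf("%s\n", same ? "results match" : "results differ");
    return same ? 0 : 1;
}
```

Both variants perform the same arithmetic. The difference is where control sits: the first issues one kernel launch per iteration from the CPU, while the second issues a single launch and leaves iteration and synchronization to the GPU.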
The goal of this thesis is to implement a CPU-free version of an existing major GPU application and show the benefits of the new technique through rigorous benchmarking.
- Experience with C/C++
- Familiarity with GPU programming is very helpful
- Familiarity with parallel programming in MPI is helpful