CUDA separate compilation model and link troubleshooting

While developing GooStats, I encountered a problem: I needed to link against an external library, GooFit, which contains GPU code. If I linked against the GooFit library directly, the linker gave strange errors. Here I post how I solved it, as well as the principles of the CUDA linking model.

A short summary of this article:

  1. You need an extra step to link all GPU objects into one object file, using nvcc --device-link or the CUDA_LINK_SEPARABLE_COMPILATION_OBJECTS macro from CMake's FindCUDA module.
  2. After that, you need to add this device-linked object to the host link line when creating executables or libraries, minding the link order (see below).

CUDA compilation model / linking model

For normal C++ code, you always compile source files into object files first, then link the object files and/or libraries into executables or libraries. You can compile source code even when the implementations of some functions are missing; however, they must be found during the linking phase.
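
For example, with plain C++ (hypothetical file names, just to illustrate the two phases):

g++ -c -o a.o a.cpp

g++ -c -o b.o b.cpp

g++ -o app app.cpp a.o b.o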

CUDA code contains both device (GPU) code and host (CPU) code and can only be compiled with the nvcc compiler (or you can compile the CPU part alone, with a lot of #ifdef __CUDACC__ guards around the device-only parts and an #include <cuda_runtime.h>, which defines __host__, __device__, etc. to nothing for host compilers).
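
For example, a header shared by host and device code might look like this (a minimal sketch; common.h and square are hypothetical names):

// common.h: usable from both nvcc and a plain host compiler
#include <cuda_runtime.h> // for a host-only compiler, this defines __host__, __device__, etc. away

__host__ __device__ inline float square(float x) { return x * x; }

#ifdef __CUDACC__
// device-only declarations, hidden from the host compiler
__global__ void square_kernel(float *data);
#endif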

nvcc --device-c -o lib1.o lib1.cu

nvcc --device-c -o lib2.o lib2.cu
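
For concreteness, here is a minimal sketch of what lib1.cu and lib2.cu might contain (hypothetical contents; the cross-file device call is exactly what makes --device-c separate compilation necessary):

// lib1.cu
__device__ float scale = 2.0f; // a device global variable

__device__ float times_scale(float x) { return x * scale; } // called from lib2.cu

// lib2.cu
__device__ float times_scale(float x); // defined in lib1.cu

__global__ void scale_kernel(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = times_scale(data[i]); // cross-file device call, resolved at device-link time
}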

Each output file contains two parts, the host objects and the device objects, and now you can bundle them into a static library (let’s skip dynamic libraries for now):

ar qc lib.a lib1.o lib2.o
ranlib lib.a

And you can “almost” use this library by linking against lib.a:

nvcc --device-c -o main.o main.cu

g++ -o main main.o lib.a -L/cuda/installation/path -lcudart -lcudadevrt

However, it only “almost” works: the linker will complain about undefined references to some cudaRegisterAll functions. Why “almost”? Because the GPU objects are not recognizable by the host linker g++, and you need to wrap them into objects g++ can recognize, then link those into main.

To wrap the GPU objects into g++-recognizable objects, run

nvcc --device-link -o lib_all_gpu.o lib1.o lib2.o main.o

Here is the most annoying part: you can only create one single file when wrapping all GPU objects. Or you can create more, but each must be independent; that is, you can do

nvcc --device-link -o lib_all_lib.o lib1.o lib2.o

nvcc --device-link -o lib_all_main.o main.o

only if no device code in main references any device function, global variable, or array defined in lib1 or lib2, and no device code in lib1 or lib2 references any device function, global variable, or array defined in main.
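
For example, if main.cu contained something like this (hypothetical, reusing the times_scale sketch from above), the split into lib_all_lib.o and lib_all_main.o would no longer work, because device-linking main.o alone cannot resolve the call:

// main.cu: a device-side reference into lib1.cu
__device__ float times_scale(float x); // defined in lib1.cu

__global__ void my_kernel(float *data) {
  data[0] = times_scale(data[0]); // forces main.o and lib1.o into the same --device-link step
}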

That is, if you write your package lib containing lib1.cu and lib2.cu and device-link its GPU code ahead of time

nvcc --device-link -o lib_all_gpu_in_lib.o lib1.o lib2.o

ar qc lib_with_gpu.a lib1.o lib2.o lib_all_gpu_in_lib.o

ranlib lib_with_gpu.a

and hope to use it by

nvcc --device-link -o lib_all_gpu_in_main.o main.o

g++ -o main lib_all_gpu_in_main.o lib_with_gpu.a main.o

you won’t be able to link lib_all_gpu_in_main.o: you will receive errors about undefined references to functions defined in lib_with_gpu.a (in lib_all_gpu_in_lib.o).

So if you write your own library, you must also install lib1.o and lib2.o to the installation path, so that downstream projects can device-link against them.
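
For example (a hypothetical layout; any location that downstream builds can find works):

cp lib1.o lib2.o /install/path/lib/objects/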

Now you can link and create your final executable

g++ -o main lib_all_gpu.o lib.a main.o -L/cuda/installation/path -lcudart -lcudadevrt

There is another annoying thing: if you put lib.a in front of lib_all_gpu.o and main.o, you will get undefined references to some “fatbin…” functions from lib_all_gpu.o. If you use the nm command you will see that

> nm lib_all_gpu.o | grep fatbin

             U …fatbin…

here U means the symbol is undefined in this object. It is defined in lib.a:

> nm lib.a | grep fatbin

 00000 R …fatbin…

here R means the symbol is defined in a read-only data section. So it is defined; we just need to switch the order. If you would like to understand this fully, you should read up on the magic of static linking: a traditional linker processes its arguments left to right, and it only pulls an object out of an archive if that object resolves a symbol that is undefined at that point.
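
For example (a sketch, assuming the GNU linker’s left-to-right archive scanning):

g++ -o main lib.a lib_all_gpu.o main.o -L/cuda/installation/path -lcudart -lcudadevrt

fails, because lib.a is scanned before any of its symbols are needed, while

g++ -o main lib_all_gpu.o lib.a main.o -L/cuda/installation/path -lcudart -lcudadevrt

works, because the fatbin symbols are already undefined (and waiting to be resolved) when lib.a is scanned.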

You can find the full example code in my GitHub repository. If you want to compile it with CMake, you need to download the whole sandbox repository, because CUDAsafeAddLibraryAndExe.cmake is in the sandbox/cmake folder.

Reference: the NVIDIA CUDA compilation documentation (nvcc manual).