Notes on using llamafile with CUDA offloading on NixOS 23.11

On this machine, CUDA offloading is not faster than running on the CPU.

2024-01-10

Disclaimer: These are not supposed to be good instructions. There's definitely a proper way to do it, but llamafile is in the category of software that's intended to run everywhere (i.e. packages its own runtime and doesn't play nice with NixOS assumptions). There are few NixOS+CUDA success stories on the Internet, so this is provided as a reference.

tl;dr

  1. set at least the following in the nix config:
    {
        hardware.opengl.enable = true;
        hardware.opengl.driSupport = true;
        services.xserver.videoDrivers = [ "nvidia" ];
        hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;
    }
    
  2. nix shell --impure nixpkgs#gcc11 nixpkgs#cudaPackages.cudatoolkit nixpkgs#cudaPackages.cuda_cudart.static
  3. LD_LIBRARY_PATH=/run/opengl-driver/lib NVCC_APPEND_FLAGS="-L$(nix eval --impure --raw 'nixpkgs#cudaPackages.cuda_cudart.static')/lib" ./llamafile-0.6 -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 10
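
For repeated runs, steps 2 and 3 can be folded into one wrapper script. This is only a sketch of the same commands, not a tested recipe; the script name is made up, and the model path and -ngl value are just what this machine used:

    #!/usr/bin/env bash
    # run-llamafile.sh (hypothetical name): reproduces steps 2 and 3 above.
    set -euo pipefail

    # Store path of the static cudart, same trick as in step 3.
    cudart_static="$(nix eval --impure --raw 'nixpkgs#cudaPackages.cuda_cudart.static')"

    # Enter the same ad-hoc shell as step 2 and run llamafile inside it,
    # pointing it at the NixOS driver libraries and the static cudart.
    exec nix shell --impure \
      nixpkgs#gcc11 \
      nixpkgs#cudaPackages.cudatoolkit \
      nixpkgs#cudaPackages.cuda_cudart.static \
      --command env \
        LD_LIBRARY_PATH=/run/opengl-driver/lib \
        NVCC_APPEND_FLAGS="-L${cudart_static}/lib" \
        ./llamafile-0.6 -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 10 "$@"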

findings

On an aging Dell laptop[1], GPU offloading was actually slower than CPU-based inference, no matter how many layers were offloaded! We suspect the memory bandwidth between the GPU and the CPU is the bottleneck, since the entire model (mistral-7b-instruct-v0.2.Q4_K_M.gguf) fits in system RAM but not in VRAM.
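
One quick way to reproduce the comparison is to sweep -ngl and look at the timing summary llama.cpp prints at the end of each run. This is only a sketch, with an illustrative prompt, token count, layer steps, and grep pattern; run it inside the nix shell (and with the NVCC_APPEND_FLAGS setting) from the tl;dr so the CUDA module can still be built and found:

    # Unscientific sweep over offloaded layer counts. llamafile prints
    # llama.cpp's timing summary on stderr when a run finishes; the grep
    # pattern is an assumption about what that summary looks like.
    for ngl in 0 8 16 24 32; do
      echo "=== -ngl $ngl ==="
      LD_LIBRARY_PATH=/run/opengl-driver/lib \
        ./llamafile-0.6 -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
        -ngl "$ngl" -p "Why is the sky blue?" -n 128 2>&1 \
        | grep -i 'eval time'
    done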

At least for this hardware, it seems that running on the integrated GPU only is the way to go.

how we got here (critical path)

references and other resources


  1. CPU: Xeon E3-1505M; RAM: 32GB ECC; GPU: Quadro M620 Mobile ↩︎

  2. cudaPackages.cuda_nvcc also exists, but it wasn't as simple to use as the nvcc included in cudatoolkit, which seems to know about its own headers for compilation. This way, we don't need to add any -I arguments. ↩︎

  3. The "correct" way to do this could be to repackage the llamafile binary with a built-in ggml-rocm.so or ggml-cuda.so that's built using nixpkgs libraries, but this defeats the purpose of llamafile, whose job is to run one binary on all platforms. ↩︎