Notes on using llamafile with CUDA offloading on NixOS 23.11

On this machine, CUDA offloading is not faster than running on the CPU.

2024-01-10

Disclaimer: These are not supposed to be good instructions. There's definitely a proper way to do it, but llamafile is in the category of software that's intended to run everywhere (i.e. packages its own runtime and doesn't play nice with NixOS assumptions). There are few NixOS+CUDA success stories on the Internet, so this is provided as a reference.

tl;dr

  1. set at least the following in the nix config:
    {
        hardware.opengl.enable = true;
        hardware.opengl.driSupport = true;
        services.xserver.videoDrivers = [ "nvidia" ];
        hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;
    }
    
  2. nix shell --impure nixpkgs#gcc11 nixpkgs#cudaPackages.cudatoolkit nixpkgs#cudaPackages.cuda_cudart.static
  3. LD_LIBRARY_PATH=/run/opengl-driver/lib NVCC_APPEND_FLAGS="-L$(nix eval --impure --raw 'nixpkgs#cudaPackages.cuda_cudart.static')/lib" ./llamafile-0.6 -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 10
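
For repeated runs, steps 2 and 3 can be folded into one wrapper script. This is only a sketch of the same commands, not a tested recipe; the script name is made up, and the model path and -ngl value are just what this machine used:

    #!/usr/bin/env bash
    # run-llamafile.sh (hypothetical name): reproduces steps 2 and 3 above.
    set -euo pipefail

    # Store path of the static cudart, same trick as in step 3.
    cudart_static="$(nix eval --impure --raw 'nixpkgs#cudaPackages.cuda_cudart.static')"

    # Enter the same ad-hoc shell as step 2 and run llamafile inside it,
    # pointing it at the NixOS driver libraries and the static cudart.
    exec nix shell --impure \
      nixpkgs#gcc11 \
      nixpkgs#cudaPackages.cudatoolkit \
      nixpkgs#cudaPackages.cuda_cudart.static \
      --command env \
        LD_LIBRARY_PATH=/run/opengl-driver/lib \
        NVCC_APPEND_FLAGS="-L${cudart_static}/lib" \
        ./llamafile-0.6 -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 10 "$@"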

findings

On an aging Dell laptop[1], GPU offloading was actually slower than CPU-based inference, no matter how many layers were offloaded! We suspect the memory bandwidth between the GPU and the CPU is the bottleneck, since the entire model (mistral-7b-instruct-v0.2.Q4_K_M.gguf) fits in system RAM but not in VRAM.
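
One quick way to reproduce the comparison is to sweep -ngl and look at the timing summary llama.cpp prints at the end of each run. This is only a sketch, with an illustrative prompt, token count, layer steps, and grep pattern; run it inside the nix shell (and with the NVCC_APPEND_FLAGS setting) from the tl;dr so the CUDA module can still be built and found:

    # Unscientific sweep over offloaded layer counts. llamafile prints
    # llama.cpp's timing summary on stderr when a run finishes; the grep
    # pattern is an assumption about what that summary looks like.
    for ngl in 0 8 16 24 32; do
      echo "=== -ngl $ngl ==="
      LD_LIBRARY_PATH=/run/opengl-driver/lib \
        ./llamafile-0.6 -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
        -ngl "$ngl" -p "Why is the sky blue?" -n 128 2>&1 \
        | grep -i 'eval time'
    done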

At least for this hardware, it seems that running on the integrated GPU only is the way to go.

how we got here (critical path)

references and other resources


  1. CPU: Xeon E3-1505M; RAM: 32GB ECC; GPU: Quadro M620 Mobile ↩︎

  2. cudaPackages.cuda_nvcc also exists, but it wasn't as simple to use as the nvcc included in cudatoolkit, which seems to know about its own headers for compilation. This way, we don't need to add any -I arguments. ↩︎

  3. The "correct" way to do this could be to repackage the llamafile binary with a built-in ggml-rocm.so or ggml-cuda.so that's built using nixpkgs libraries, but this defeats the purpose of llamafile, whose job is to run one binary on all platforms. ↩︎