Notes on using llamafile with CUDA offloading on NixOS 23.11
On this machine, it is not faster than running via CPU.
2024-01-10
Disclaimer: These are not supposed to be good instructions. There's definitely a proper way to do it, but llamafile is in the category of software that's intended to run everywhere (i.e. packages its own runtime and doesn't play nice with NixOS assumptions). There are few NixOS+CUDA success stories on the Internet, so this is provided as a reference.
tl;dr
- set at least the following in the nix config:

{
  hardware.opengl.enable = true;
  hardware.opengl.driSupport = true;
  services.xserver.videoDrivers = [ "nvidia" ];
  hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.production;
}

- enter a shell with the CUDA toolchain and a compatible GCC:

nix shell --impure nixpkgs#gcc11 nixpkgs#cudaPackages.cudatoolkit nixpkgs#cudaPackages.cuda_cudart.static

- run llamafile with the driver libraries and the static CUDA runtime on its search paths (a shell.nix sketch wrapping the same environment follows after these commands):

LD_LIBRARY_PATH=/run/opengl-driver/lib NVCC_APPEND_FLAGS="-L$(nix eval --impure --raw 'nixpkgs#cudaPackages.cuda_cudart.static')/lib" ./llamafile-0.6 -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 10
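For convenience, the same environment can be captured in a file instead of an ad-hoc nix shell. This is an untested sketch rather than part of the original setup: it just wraps the packages and environment variables from the commands above, and assumes nixpkgs is allowed to evaluate the unfree CUDA packages.

# shell.nix (sketch): same packages and exports as the ad-hoc `nix shell` above
{ pkgs ? import <nixpkgs> { config.allowUnfree = true; } }:

pkgs.mkShell {
  packages = [
    pkgs.gcc11
    pkgs.cudaPackages.cudatoolkit
    pkgs.cudaPackages.cuda_cudart.static
  ];

  # llamafile itself is still started manually, e.g.:
  #   ./llamafile-0.6 -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 10
  shellHook = ''
    export LD_LIBRARY_PATH=/run/opengl-driver/lib
    export NVCC_APPEND_FLAGS="-L${pkgs.cudaPackages.cuda_cudart.static}/lib"
  '';
}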
findings
On an aging Dell laptop[1], GPU offloading was actually slower than using CPU-based inference, no matter how many layers were offloaded! We suspect that the memory bandwidth between the GPU and CPU is the bottleneck, since the entire model (mistral-7b-instruct-v0.2.Q4_K_M.gguf) can be held in RAM rather than VRAM.
- CPU only: 187 ms/tok
- CPU + GPU (5 layers offloaded): 243 ms/tok
- CPU + GPU (10 layers offloaded): 293 ms/tok
- CPU + GPU (13 layers offloaded): 319 ms/tok
- CPU + GPU (15 layers offloaded): out of VRAM while trying to load the model
At least for this hardware, it seems that running on the CPU only is the way to go; a rough loop for re-measuring these numbers is sketched below.
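The ms/tok figures above presumably come from llamafile's llama.cpp-style timing summary. A loop like the following can produce comparable numbers; it is a sketch, not the exact command used for this note, and it assumes llamafile 0.6 passes llama.cpp's -p/-n flags through in CLI mode and prints the usual "… ms per token" timing lines on stderr.

# sketch: sweep over -ngl values and keep only the timing summary lines
CUDART_STATIC="$(nix eval --impure --raw 'nixpkgs#cudaPackages.cuda_cudart.static')/lib"
for ngl in 0 5 10 13; do
  echo "== -ngl $ngl =="
  LD_LIBRARY_PATH=/run/opengl-driver/lib \
  NVCC_APPEND_FLAGS="-L$CUDART_STATIC" \
  ./llamafile-0.6 -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
    -ngl "$ngl" -n 64 -p 'Write a haiku about laptops.' 2>&1 | grep -i 'per token'
done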
how we got here (critical path)
- running llamafile with the -ngl flag ("store layers in VRAM") displays various nvcc errors, so add cudaPackages.cudatoolkit (ref: https://nixos.wiki/wiki/CUDA, known to be outdated as of 2024-01-10).[2]
- nvcc will spit out error messages about GCC 11 (or earlier) not being in PATH, so add gcc11.
- nvcc will then complain about -lcudart_static not being present, so add cudaPackages.cuda_cudart.static and configure it (ref: https://stackoverflow.com/a/51137934 for the flags, and https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html to figure out what they actually do).
  - searching for "cudart_static" leads to https://stackoverflow.com/questions/51137730/cudart-static-when-is-it-necessary, which suggests that it's being injected by nvcc directly.
  - searching for 'nixos "cudart"' leads to https://discourse.nixos.org/t/on-nixpkgs-and-the-ai-follow-up-to-2023-nix-developer-dialogues/37087, which refers to cudaPackages.cuda_cudart.
  - searching for "nvcc environment variables" leads to https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html, which allows configuring nvcc without having to patch llamafile.
- libcuda.so is required at runtime, and can be found in /run/opengl-driver/lib (a quick check for this is sketched after this list).
  - searching for "nixos cudart" leads to https://discourse.nixos.org/t/cuda-in-nixos-on-gcp-for-a-tesla-k80/20145, which explains why it's in a location called "opengl-driver".
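A few quick checks tie these pieces together. The paths and attribute names are the ones used elsewhere in these notes; the exact output will vary with the driver and CUDA versions.

# the driver's libcuda.so, present once hardware.opengl and the nvidia driver are enabled
ls /run/opengl-driver/lib/libcuda.so*
# nvcc from cudatoolkit (the unfree CUDA packages need allowUnfree, as with the tl;dr shell)
nix shell --impure nixpkgs#cudaPackages.cudatoolkit --command nvcc --version
# the store path that NVCC_APPEND_FLAGS points -L at
nix eval --impure --raw 'nixpkgs#cudaPackages.cuda_cudart.static'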
references and other resources
- https://nixos.wiki/wiki/Nvidia was confusing, but following the configuration for a standard Nvidia+X11 setup was enough to get the GPU working.
- https://nixos.wiki/wiki/CUDA was already known to be out of date, but it illustrated that the CUDA toolchain is just a set of packages, and doesn't include any drivers.
- https://search.nixos.org/packages can be used to find which package provides a particular binary.
- https://discourse.nixos.org/t/on-nixpkgs-and-the-ai-follow-up-to-2023-nix-developer-dialogues/37087 has a lot of information on setting up Nvidia drivers, and how to use CUDA the "proper" nixpkgs way. However, llamafile is not the nixpkgs way[3].
[1] CPU: Xeon E3-1505M; RAM: 32GB ECC; GPU: Quadro M620 Mobile
[2] cudaPackages.cuda_nvcc also exists, but it wasn't as simple as using the nvcc included in cudatoolkit, which seems to know about its own headers for compilation. This way, we don't need to add any -I arguments.
[3] The "correct" way to do this could be to repackage the llamafile binary with a built-in ggml-rocm.so or ggml-cuda.so that's built using nixpkgs libraries, but this defeats the purpose of llamafile, whose job is to run one binary on all platforms.