After setting up CUDA on my other laptop, I moved to a different(older) machine that doesn’t have an NVIDIA GPU. This one is an everyday laptop with integrated Intel graphics, but that doesn’t mean we have to settle for slow CPU-only performance.

On this machine, I switched to the Vulkan backend for llama.cpp and the results were even more dramatic than I expected.

Machine Hardware Info

This laptop is running Debian 13 (Trixie/Sid) with the following specs:

  • CPU: Intel(R) Core(TM) i5-8250U @ 1.60GHz (4 Cores, 8 Threads)
  • GPU: Intel(R) UHD Graphics 620 (Integrated)
  • RAM: 8 GB
  • OS: Debian GNU/Linux 13 (trixie)
  • Kernel: 6.12.74+deb13+1-amd64

The Performance Gap: CPU vs. Vulkan

I tested both the Qwen 3.5 2B and the more capable Qwen 2.5 3B models (GGUF format) to see how the integrated Intel GPU handles different LLM sizes.

ModelSetupResponse Time (Eval)Total TimeTokens/secNotes
Qwen 3.5 2BCPU Only~7 minutes (428s)431s2.32Purely on i5-8250U
Qwen 3.5 2BVulkan14 seconds21s6.0730x improvement!
Qwen 2.5 3BVulkan47 seconds52s4.54More capable reasoning

The 14-second response vs. 7 minutes on the 2B model is a game-changer, but the 3B model (answering “write me hello world in rust” in 47 seconds) is the “sweet spot” for this machine. While the 2B model can be fully offloaded, the 3B model is too large to fit entirely in the GPU’s shared memory, but it still performs admirably.

Compiling llama.cpp with Vulkan

Compiling for Vulkan on Debian is straightforward but requires the right development headers. It took me about 10 minutes to finish the compilation.

First, ensure you have the Vulkan development packages:

sudo apt update
sudo apt install libvulkan-dev vulkan-tools

Then, compile llama.cpp using CMake with the Vulkan option enabled:

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

Running with Vulkan Acceleration

Once compiled, you can run llama-server (or llama-cli). The server will automatically detect your Vulkan-compatible devices.

./build/bin/llama-server -hf unsloth/Qwen3.5-2B-GGUF --jinja -c 4096 --host 127.0.0.1 --port 8033

In the logs, you’ll see it picking up the Intel UHD Graphics. For the 2B model, I was able to offload all 25 layers:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) UHD Graphics 620 (KBL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | ...
...
load_tensors: offloading 23 repeating layers to GPU
load_tensors: offloaded 25/25 layers to GPU

Pushing the Limits: Qwen 2.5 3B

When I moved to the 3.4B parameter model (Qwen2.5-3B-Instruct-Q4_K_M), the memory management became more complex. The system had to balance between the GPU’s shared memory and the CPU:

llama_params_fit_impl: projected to use 2283 MiB of device memory vs. 2200 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 1106 MiB
llama_params_fit_impl: filling dense layers back-to-front:
llama_params_fit_impl:   - Vulkan0 (Intel(R) UHD Graphics 620 (KBL GT2)): 13 layers,   1137 MiB used,   1062 MiB free
...
load_tensors: offloaded 13/37 layers to GPU

Even though it only offloaded 13/37 layers to the GPU, it still maintained a respectable 4.54 tokens/sec. This shows that even partial offloading on integrated graphics provides a significant boost over pure CPU execution for larger models.

Summary & Next Steps

If you don’t have an NVIDIA card, don’t ignore your integrated GPU. Vulkan provides a fantastic alternative that works out-of-the-box on Debian with Intel and AMD hardware.

My next target is to use Qwen on OpenClaw to further explore local LLM capabilities. Stay tuned!