How to set up llama.cpp as a systemd service
I finally found some time and motivation to host my own LLMs on my server - an Intel i7-14700F with 80 GiB of RAM and NVIDIA's RTX 4060 Ti 16 GB. Here's a quick post on how you can do it too.
The steps are the following:
- Install llama.cpp
- Install llama-swap
- Create a llama-swap config file
- Create and enable llamaswap.service
Install llama.cpp
llama.cpp is known for being a bit cumbersome to set up, especially if you want to run it on an NVIDIA GPU. Since there are no prebuilt CUDA binaries (as there are for CPU-only builds), you'll need to mess around with the NVIDIA CUDA Toolkit a bit to get everything up and running. I found the following guide more than enough to set everything up: https://blog.steelph0enix.dev/posts/llama-cpp-guide/.
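For reference, the CUDA build itself boils down to roughly the following (a sketch that assumes the CUDA Toolkit is already installed; the guide above covers that part and the various build options):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# the resulting binaries (llama-server, llama-cli, ...) end up in build/bin/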
Install llama-swap
As opposed to llama.cpp, setting up llama-swap is easy. Download the prebuilt binaries and add them to a folder that is in your PATH; for me this is .local/bin. Then, you should be able to run llama-swap -h.
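For reference, the install looks something like this, assuming a Linux x86_64 machine (the archive name below is just illustrative; grab the right asset from the llama-swap releases page):

# unpack the release archive downloaded from GitHub
tar -xzf llama-swap_linux_amd64.tar.gz
# move the binary somewhere on your PATH
mv llama-swap ~/.local/bin/
# verify it works
llama-swap -h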
Create llama-swap config
Next, you need to create the llama-swap config file, which determines how llama-swap should behave. Here's the current version of my (relatively messy) config file:
healthCheckTimeout: 60
logLevel: info
metricsMaxInMemory: 200
startPort: 8080

models:
  "qwen3":
    # cmd: the command to run to start the inference server.
    cmd: |
      llama-server -hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --ctx-size 32768 --jinja -ub 2048 -b 4096 --host 0.0.0.0 --port 8081 --temp 0.7 --top-p 0.8 --min-p 0.0 --top-k 20 -ngl 32
    # name: a display name for the model
    name: "Qwen3 30B A3B Instruct"
    # proxy: the URL where llama-swap routes API requests
    proxy: http://127.0.0.1:8081
  "qwen3-coder":
    # cmd: the command to run to start the inference server.
    cmd: |
      llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF --ctx-size 32768 --jinja -ub 2048 -b 4096 --host 0.0.0.0 --port 8082 --temp 0.7 --top-p 0.8 --min-p 0.0 --top-k 20 -ngl 32
    # name: a display name for the model
    name: "Qwen3 Coder 30B A3B Instruct"
    # proxy: the URL where llama-swap routes API requests
    proxy: http://127.0.0.1:8082

groups:
  # "standard" works the same as the default behaviour of llama-swap, where only one model
  # is allowed to run at a time across the whole llama-swap instance
  "standard":
    # swap: controls the model swapping behaviour within the group
    swap: true
    # exclusive: controls how the group affects other groups
    exclusive: true
    # members: references the models defined above
    members:
      - "qwen3"
      - "qwen3-coder"
I host the llama-swap proxy server on port 8080 and host each model on a separate port. This is not strictly necessary, since I cannot have two models loaded on the GPU at the same time anyway, but it makes the setup a bit cleaner in my opinion.
Also, you'll need to play around with llama-server a bit to see which models you can run.
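A simple way to check is to run the same command you'd put in the config by hand and see whether the model actually loads, for example with the first model above:

llama-server -hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --ctx-size 32768 --jinja -ngl 32 --host 0.0.0.0 --port 8081
# in another terminal, check that the server came up
curl http://localhost:8081/health
# if it doesn't fit in VRAM, lower -ngl or --ctx-size and try again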
Create and enable llamaswap.service
And finally, you need to create the llamaswap.service file. Run sudo nano /etc/systemd/system/llamaswap.service and enter the following:
[Unit]
Description=Start llama-swap service
After=network.target
[Service]
Type=simple
User=<your-user>
WorkingDirectory=<path/to/llama-swap-config>
ExecStart=</path/to/>llama-swap -config </path/to/>llamaswap.yaml
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
You might need to add additional environment variables if your llama.cpp binaries are not in .local/bin, or if your llama.cpp cache is somewhere else:
Environment=PATH=/path/to/llama.cpp/build/bin
Environment=LLAMA_CACHE=/path/to/llamacpp_cache
Add these just after the WorkingDirectory entry.
And that is it. Now run the following:
sudo systemctl daemon-reload
sudo systemctl enable llamaswap.service
sudo systemctl start llamaswap.service
sudo systemctl status llamaswap.service
You should see something like:
● llamaswap.service - Start llama-swap service
Loaded: loaded (/etc/systemd/system/llamaswap.service; enabled; preset: en>
Active: active (running) since Sun 2025-09-28 07:20:51 UTC; 42min ago
Main PID: 1053 (llama-swap)
...
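If the service fails to start, or a model refuses to load, the logs from llama-swap and the llama-server processes it spawns end up in the journal:

journalctl -u llamaswap.service -f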
You should now be able to access http://localhost:8080 and see the llama-swap UI.
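You can also talk to the OpenAI-compatible API directly; llama-swap starts whichever model the model field in the request refers to (a quick sketch using the config above):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Hello!"}]}'

The first request to a model takes a while, since llama-swap has to start llama-server and load the weights before it can answer.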