Locally running language models
Using llama.cpp
Observed speeds on an M5 Pro (brew install llama.cpp):
- HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive:Q6_K: 35 tok/s
- unsloth/gemma-4-26B-A4B-it-GGUF: 57 tok/s
Observed speeds on an RTX 4080 (winget install ggml.llamacpp):
- HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive:Q6_K: 80 tok/s
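Numbers like the above can be reproduced with llama-bench, which ships with llama.cpp and reports prompt-processing and text-generation speeds in tok/s. A minimal sketch; the GGUF path is a placeholder for wherever the model file was downloaded:

    # generation speed with full GPU offload (placeholder model path)
    llama-bench -m ~/models/Qwen3.5-9B-Q6_K.gguf -ngl 999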
Pointers:
- --reasoning-budget 0 disables thinking.
- -ngl 999 offloads all layers to the GPU.
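Putting the two together, a minimal sketch of an interactive run, assuming a recent llama.cpp build that supports both flags (-hf fetches the GGUF from Hugging Face on first use, reusing the Qwen entry above):

    # offload all layers to the GPU and disable thinking
    llama-cli -hf HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive:Q6_K -ngl 999 --reasoning-budget 0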