Locally running language models
Using llama.cpp
Observed speeds on an M5 Pro (brew install llama.cpp):
- HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive:Q6_K: 35 tok/s
- unsloth/gemma-4-26B-A4B-it-GGUF: 57 tok/s
Observed speeds on an RTX 4080 (winget install ggml.llamacpp):
- HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive:Q6_K: 80 tok/s
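Numbers like the above can be reproduced with llama-bench, which ships with llama.cpp and reports prompt-processing and text-generation speeds in tok/s. A minimal sketch; the GGUF path is a placeholder for wherever the model file was downloaded:

    # generation speed with full GPU offload (placeholder model path)
    llama-bench -m ~/models/Qwen3.5-9B-Q6_K.gguf -ngl 999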
Pointers:
- --reasoning-budget 0 disables thinking.
- -ngl 999 offloads all layers to the GPU.
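Putting the two together, a minimal sketch of an interactive run, assuming a recent llama.cpp build that supports both flags (-hf fetches the GGUF from Hugging Face on first use, reusing the Qwen entry above):

    # offload all layers to the GPU and disable thinking
    llama-cli -hf HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive:Q6_K -ngl 999 --reasoning-budget 0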