The Two-Format Problem in Local Inference

Sun, 08 Mar 2026 00:00:00 +0000

A user who wants to run large language models locally on a machine with both a CPU and an NPU faces an awkward reality: the two processors require different model formats, and no practical conversion exists between them. This means downloading the same model twice, in two representations, to use both processors.

Two paths to the same destination

Ollama, the dominant local inference tool, uses the GGUF format. GGUF was built for CPU inference: it stores quantized weights in block patterns optimized for loading from system memory and processing on general-purpose cores. The entire llama.cpp ecosystem, and by extension most of the local LLM community, speaks GGUF. When someone says they “downloaded a 7B model,” they almost certainly mean a GGUF file.
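One way to see which format a downloaded file is in is to read its first few bytes. The sketch below assumes the GGUF header layout used from version 2 onward: the four ASCII bytes "GGUF", then a little-endian 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count. The function name `read_gguf_header` is made up for this illustration and is not part of Ollama or llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count), or None if not GGUF.

    Assumes the little-endian header layout of GGUF version 2 and later;
    this is an illustrative helper, not an official API.
    """
    with open(path, "rb") as f:
        head = f.read(24)
    # Too short, or the magic bytes do not match: not a GGUF file.
    if len(head) < 24 or head[:4] != GGUF_MAGIC:
        return None
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:24])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header` on a model file returns the three header fields for a GGUF file and `None` for anything else, which is enough to tell the two representations apart on disk before trying to load either one.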

Essay on emsenn.net
