What is GGUF?
GGUF (GPT-Generated Unified Format) is a file format designed for storing and running large language models (LLMs) efficiently on consumer hardware. Created by the llama.cpp project, it is now the de facto standard for local AI inference. Key advantages:
- Run AI models on CPU without expensive GPUs
- Quantization shrinks models by roughly 50-85%, depending on the level
- Single file contains everything needed
- Works on Windows, Mac, and Linux
- Supported by all major local AI tools
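To make this concrete, here is a minimal sketch of loading a GGUF model and generating a reply with the llama-cpp-python bindings; the model path is a placeholder for any chat-tuned .gguf file you have downloaded.

```python
# pip install llama-cpp-python  (builds a CPU-only wheel by default)
from llama_cpp import Llama

# The model path is a placeholder; any chat-tuned .gguf file works the same way.
llm = Llama(model_path="model-Q4_K_M.gguf", n_ctx=2048)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```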
What Does GGUF Stand For?
GGUF stands for GPT-Generated Unified Format. The name reflects its purpose: a unified, standardized way to store AI models that were originally in various formats (PyTorch, SafeTensors, etc.).
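As an illustration of that conversion step, the sketch below shells out to llama.cpp's converter script. The script name (convert_hf_to_gguf.py in recent llama.cpp checkouts) and all paths are assumptions to adapt to your own setup.

```python
import subprocess

# Convert a Hugging Face checkpoint (PyTorch/SafeTensors) into a single GGUF
# file using llama.cpp's converter. Paths below are placeholders: adjust them
# to your local llama.cpp clone and model directory.
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "path/to/hf-model",             # directory with config.json + weights
        "--outfile", "model-f16.gguf",  # unquantized GGUF, ready for quantizing
        "--outtype", "f16",
    ],
    check=True,
)
```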
GGUF Quantization Types Explained
Quantization reduces model precision to decrease file size and memory usage. Here are the common GGUF quantization types:
| Quantization | Bits | Size Reduction | Quality | Best For |
|---|---|---|---|---|
| Q4_K_M | 4-bit | ~75% | Good | Recommended - Best balance |
| Q4_K_S | 4-bit | ~75% | Good | Low RAM systems |
| Q5_K_M | 5-bit | ~65% | Very Good | Quality-focused users |
| Q5_K_S | 5-bit | ~65% | Good | Balance with smaller size |
| Q6_K | 6-bit | ~55% | Excellent | Near-original quality |
| Q8_0 | 8-bit | ~50% | Best | Maximum quality |
| Q2_K | 2-bit | ~85% | Lower | Extreme compression |
Q4_K_M offers the best balance of quality, speed, and memory usage. Use Q5_K_M or Q6_K if you have extra RAM and want better quality.
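For intuition about what quantization actually does, here is a simplified 4-bit block quantizer in Python. It captures the core idea (one scale per block of 32 weights, each weight stored as a small integer) but it is a toy sketch, not the exact kernel llama.cpp uses, and the K-quants in the table above are more sophisticated.

```python
import numpy as np

def quantize_q4(weights: np.ndarray, block_size: int = 32):
    """Toy 4-bit block quantizer: one float scale per block of 32 weights,
    each weight stored as an integer in -8..7 (the idea behind Q4_0)."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid dividing by zero in all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).ravel()

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q4(w)
print("max abs error:", np.abs(w - dequantize_q4(q, s)).max())
```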
GGUF Memory Requirements
How much RAM do you need for different GGUF models? Here's a quick reference:
| Model Size | Q4_K_M RAM | Q5_K_M RAM | Q8_0 RAM |
|---|---|---|---|
| 1B parameters | ~1 GB | ~1.2 GB | ~1.5 GB |
| 3B parameters | ~2.5 GB | ~3 GB | ~4 GB |
| 7B parameters | ~5-6 GB | ~6-7 GB | ~8-9 GB |
| 13B parameters | ~9-10 GB | ~11-12 GB | ~15 GB |
| 70B parameters | ~40 GB | ~50 GB | ~70 GB |
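The rows above follow a simple rule of thumb: bytes for the weights (parameters × bits ÷ 8) plus headroom for the context and runtime buffers. Here is a rough calculator, where the ~4.85 effective bits per weight for Q4_K_M and the 1 GB overhead default are approximations:

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float,
                    overhead_gb: float = 1.0) -> float:
    """Rule of thumb: weight bytes (params * bits / 8) plus a flat allowance
    for the KV cache and runtime buffers (the 1 GB default is an assumption)."""
    weight_gib = params_billion * 1e9 * bits_per_weight / 8 / 1024**3
    return weight_gib + overhead_gb

# A 7B model at Q4_K_M (~4.85 effective bits per weight, an approximation):
print(f"{estimate_ram_gb(7, 4.85):.1f} GB")  # ~5.0 GB, in line with the table
```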
GGUF vs GGML: What's the Difference?
GGUF replaced the older GGML format in August 2023. Here's why GGUF is better:
| Feature | GGML (Old) | GGUF (New) |
|---|---|---|
| File Structure | Weights only; extra settings often supplied separately | Single self-contained file |
| Metadata | Limited | Extensible key-value |
| Compatibility | Breaking changes | Forward compatible |
| Loading Speed | Slower | Faster |
| Tool Support | Deprecated | All modern tools |
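That "extensible key-value" metadata is easy to inspect yourself. Here is a short sketch using the gguf Python package published from the llama.cpp repo; the file name is a placeholder.

```python
# pip install gguf
from gguf import GGUFReader

reader = GGUFReader("model-Q4_K_M.gguf")  # file name is a placeholder

# Every GGUF file is self-describing: architecture, tokenizer, context length,
# and more live in the header as key-value fields.
for field in reader.fields.values():
    print(field.name)

# Tensor names and shapes are listed in the same header.
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape)
```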
Where to Download GGUF Models
GGUF models are available on Hugging Face. Popular sources include:
- TheBloke - Thousands of quantized models
- bartowski - High-quality quantizations
- Qwen - Official Qwen GGUF models
- Meta - Official Llama models
Look for files ending in .gguf with a quantization suffix in the name, such as Q4_K_M.gguf.
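For scripted downloads, the huggingface_hub library can fetch a single quantization file instead of cloning the whole repo. The repo and file names below are illustrative, so copy the exact names from the model page:

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Repo and file names are illustrative; check the model page for exact names.
path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
)
print(path)  # local cache path, ready to pass to any GGUF tool
```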
Tools That Support GGUF
- GGUF Loader - Simple GUI for running GGUF models
- llama.cpp - The original GGUF runtime
- Ollama - Easy model management
- LM Studio - Desktop app for local AI
- GPT4All - Cross-platform local AI
- KoboldCpp - For creative writing
Ready to Run GGUF Models?
GGUF Loader makes it easy to run AI models locally - no Python or command line needed.
Get Started with GGUF Loader →