What is GGUF? Complete Guide to GGUF Format

Everything you need to know about the standard format for running AI models locally

What is GGUF?

GGUF (GPT-Generated Unified Format) is a file format designed for storing and running large language models (LLMs) efficiently on consumer hardware. Created by the llama.cpp project, GGUF is now the standard format for local AI inference.

Key Benefits of GGUF:
  • Run AI models on CPU without expensive GPUs
  • Quantization reduces model size by 50-75%
  • Single file contains everything needed
  • Works on Windows, Mac, and Linux
  • Supported by all major local AI tools

What Does GGUF Stand For?

GGUF stands for GPT-Generated Unified Format. The name reflects its purpose: a unified, standardized way to store AI models that were originally in various formats (PyTorch, SafeTensors, etc.).

GGUF Quantization Types Explained

Quantization reduces model precision to decrease file size and memory usage. Here are the common GGUF quantization types:

Quantization | Bits  | Size Reduction | Quality   | Best For
Q4_K_M       | 4-bit | ~75%           | Good      | Recommended - best balance
Q4_K_S       | 4-bit | ~75%           | Good      | Low-RAM systems
Q5_K_M       | 5-bit | ~65%           | Very good | Quality-focused users
Q5_K_S       | 5-bit | ~65%           | Good      | Balance with smaller size
Q6_K         | 6-bit | ~55%           | Excellent | Near-original quality
Q8_0         | 8-bit | ~50%           | Best      | Maximum quality
Q2_K         | 2-bit | ~85%           | Low       | Extreme compression
💡 Recommendation: For most users, Q4_K_M offers the best balance of quality, speed, and memory usage. Use Q5_K_M or Q6_K if you have extra RAM and want better quality.
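
A quick way to sanity-check these size figures is to multiply parameter count by bits per weight. The Python sketch below does exactly that; the bits-per-weight values are rough effective numbers (K-quants also store per-block scale factors, so real files come out slightly larger), not exact constants from the llama.cpp source.

    # Back-of-envelope GGUF file size: params * effective bits per weight / 8.
    # Bit widths are approximate; real files are slightly larger because
    # K-quants also store per-block scale factors.
    BITS_PER_WEIGHT = {
        "Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7,
        "Q6_K": 6.6, "Q8_0": 8.5,
    }

    def gguf_size_gb(n_params: float, quant: str) -> float:
        """Rough quantized file size in GB for a model with n_params weights."""
        return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

    for quant in ("Q4_K_M", "Q5_K_M", "Q8_0"):
        print(f"7B at {quant}: ~{gguf_size_gb(7e9, quant):.1f} GB")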

GGUF Memory Requirements

How much RAM do you need for different GGUF models? Here's a quick reference:

Model Size     | Q4_K_M RAM | Q5_K_M RAM | Q8_0 RAM
1B parameters  | ~1 GB      | ~1.2 GB    | ~1.5 GB
3B parameters  | ~2.5 GB    | ~3 GB      | ~4 GB
7B parameters  | ~5-6 GB    | ~6-7 GB    | ~8-9 GB
13B parameters | ~9-10 GB   | ~11-12 GB  | ~15 GB
70B parameters | ~40 GB     | ~50 GB     | ~70 GB
⚠️ Important: Add 1-2 GB for context window and system overhead. On an 8 GB system, stay at or below 7B parameters with Q4_K_M quantization; a 16 GB system can comfortably run models up to 13B.
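
To turn that rule into a quick fit check before downloading, here is a minimal sketch. It assumes you already know the model's file size (from the table above or the download page) and uses the 2 GB upper bound for overhead:

    # Will a GGUF model fit in RAM? File size plus context/system
    # overhead must stay under available memory. 2 GB is the upper
    # bound from the note above; budget more for long context windows.
    OVERHEAD_GB = 2.0

    def fits_in_ram(model_file_gb: float, ram_gb: float) -> bool:
        return model_file_gb + OVERHEAD_GB <= ram_gb

    print(fits_in_ram(5.5, 16))   # 7B Q4_K_M on a 16 GB machine -> True
    print(fits_in_ram(40.0, 32))  # 70B Q4_K_M on a 32 GB machine -> False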

GGUF vs GGML: What's the Difference?

GGUF replaced the older GGML format in August 2023. Here's why GGUF is better:

Feature        | GGML (Old)            | GGUF (New)
File Structure | Multiple files needed | Single file
Metadata       | Limited               | Extensible key-value
Compatibility  | Breaking changes      | Forward compatible
Loading Speed  | Slower                | Faster
Tool Support   | Deprecated            | All modern tools
✅ Bottom Line: Always use GGUF format. GGML is deprecated and no longer supported by modern tools like llama.cpp, Ollama, and LM Studio.
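
The single-file, key-value design is visible right at the byte level. The sketch below reads only the fixed header of a GGUF file, following the published layout for format versions 2 and 3: a 4-byte magic, a uint32 version, then uint64 tensor and metadata-entry counts, all little-endian. The filename is a placeholder.

    # Peek at a GGUF header (v2/v3 layout): 4-byte magic "GGUF",
    # uint32 version, uint64 tensor count, uint64 metadata key-value
    # count -- all little-endian.
    import struct

    with open("model.Q4_K_M.gguf", "rb") as f:  # placeholder filename
        assert f.read(4) == b"GGUF", "not a GGUF file"
        version, = struct.unpack("<I", f.read(4))
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))

    print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata entries")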

Where to Download GGUF Models

GGUF models are available on Hugging Face. Popular sources include:

  • TheBloke - Thousands of quantized models
  • bartowski - High-quality quantizations
  • Qwen - Official Qwen GGUF models
  • Meta - Official Llama models

Look for files ending in .gguf with a quantization suffix in the filename, such as model.Q4_K_M.gguf.
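
If you prefer scripting downloads, the huggingface_hub library can fetch a single GGUF file from a repository. The repo and file names below are placeholders, not a specific recommendation:

    # Download one GGUF file from a Hugging Face repo
    # (pip install huggingface_hub). Names are placeholders --
    # substitute a real repo and one of its .gguf files.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="some-user/some-model-GGUF",  # placeholder repo
        filename="some-model.Q4_K_M.gguf",    # placeholder file
    )
    print(f"Saved to {path}")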

Tools That Support GGUF

  • GGUF Loader - Simple GUI for running GGUF models
  • llama.cpp - The original GGUF runtime (see the Python example after this list)
  • Ollama - Easy model management
  • LM Studio - Desktop app for local AI
  • GPT4All - Cross-platform local AI
  • KoboldCpp - For creative writing
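
Most of these tools embed llama.cpp under the hood. For a minimal programmatic example, the llama-cpp-python bindings load a GGUF file in a few lines; the model path is a placeholder.

    # Minimal local inference via llama-cpp-python
    # (pip install llama-cpp-python). Model path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(model_path="model.Q4_K_M.gguf", n_ctx=2048)
    out = llm("Q: What is GGUF? A:", max_tokens=64)
    print(out["choices"][0]["text"])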

Ready to Run GGUF Models?

GGUF Loader makes it easy to run AI models locally - no Python or command line needed.

Get Started with GGUF Loader →