Open source · MIT licensed

Fuse your local models.
Cut your AI bill to zero.

LLMForge is an OpenAI-compatible proxy that combines 5 free local models into one answer that matches paid APIs — at $0 cloud cost.

★ Star on GitHub See how it works →

Cursor Claude Code OpenCode Continue Any OpenAI client

92.1%

HumanEval pass rate

Cloud cost per 1K calls

$80K

Saved per year (200 eng team)

4.2s

Avg response time

How it works

One URL swap. Zero code changes.

Point any OpenAI-compatible tool at LLMForge. It auto-routes, fuses multiple local models, caches, and falls back to cloud only when needed.

# 1. Pull 5 free models with Ollama
ollama pull qwen2.5-coder:3b qwen2.5-coder:7b phi4-mini llama3.2:3b

# 2. Build & run LLMForge
git clone https://github.com/Ozperium/llmforge.git
cd llmforge && cargo build --release
./target/release/llmforge

# 3. Point your tools at LLMForge instead of OpenAI
OPENAI_BASE_URL=http://localhost:8787/v1
    

The difference

Stop paying for what your GPU can do for free.

LiteLLM routes to one model. Portkey observes. OpenRouter charges per token. LLMForge fuses 5 free models into one answer — validated, tested, and voted on before you see it.

Your current setup

$6,750/mo for 200 engineers on GPT-5.2

One model, one answer — no fusion
Every prompt leaves your network
Vendor lock-in to OpenAI/Anthropic
Unpredictable monthly bills
81.7% HumanEval from your best local model alone

With LLMForge

$12/mo electricity + ~5% cloud fallback

5 local models fused — 92.1% HumanEval
Validated voting eliminates wrong answers
Data stays on your machine
Any OpenAI-compatible client, zero changes
Per-key budget caps prevent runaway spend
Auto-routing picks the best pattern per prompt

Benchmarks

92.1% HumanEval. At $0.

5 free local models, fused with validated consensus, match GPT-5.2 within 3 percentage points — at zero cloud cost. Full methodology in BENCHMARKS.md.

Configuration	Pass Rate	Avg Time	Cost / 1K calls
LLMForge — Validated Fallback (5 local)	151/164 (92.1%)	4.2s	$0.00
LLMForge — Validated Consensus (5 local)	151/164 (92.1%)	22.8s	$0.00
LLMForge — Auto-Routing	135/164 (82.3%)	31.4s	$0.00
qwen2.5-coder:3b (best single local)	134/164 (81.7%)	2.6s	$0.00
gpt-oss:20b (single local)	91/164 (55.5%)	15.6s	$0.00
GPT-5.2 (cloud, for comparison)	~95%	<1s	$1.62

3pp accuracy gap. 100% cost gap. Run the benchmarks yourself: benchmarks/

Cost projection

One GPU server serves your whole team.

LLMForge runs on a single GPU server you already own. Auto-route, cache, and fuse for every engineer. Cloud fallback only kicks in for the ~5% of hard problems.

Team size	Requests / month	Current cost / mo	LLMForge / mo	Annual savings
5 engineers	2,000	$20	$0.10	$239
50 engineers	50,000	$500	$2.10	$5,975
200 engineers	500,000	$6,750	$11.70	$80,860

Based on GPT-5.2 pricing ($1.62/1K calls) vs LLMForge on existing hardware. LLMForge cost = electricity + ~5% cloud fallback. Does not include GPU purchase — uses hardware you already have.

Features

Everything you need in one proxy.

Not just routing. LLMForge fuses models, auto-routes by prompt type, caches responses, and enforces budget caps — all in a single Rust binary.

🔀

Fusion Patterns

Validated consensus, cascade fallback, stream race, self-consistency, and auto-routing. Only patterns that beat the best single model — no martingales, no filler.

🎯

Auto-Routing

Classifies prompts as code, reasoning, or chat — then picks the optimal pattern and model. Zero-config smart routing out of the box.

⚡

Streaming + Race

SSE passthrough for all patterns. Stream race: fire all free models in parallel, first response wins. Losers are aborted instantly.

💾

Caching

Exact-match cache delivers 54x faster responses on repeat queries. Semantic cache planned for v2.

🔐

Budget Caps + Auth

Per-API-key spend limits, rate limiting, and Bearer token auth. Built for teams sharing one GPU server.

📊

Cost Tracking

Every request's cost recorded against its API key. Query spend via /v1/budget endpoint.

Is it for you?

If you're in one of these two buckets, yes.

You're a solo dev with a local GPU

Get better answers from your existing models.

You run Ollama for side projects but local models alone produce garbage on hard problems — 71% HumanEval from your best model. You're tired of $20/mo subscriptions for every AI coding tool, and you don't want every prompt sent to OpenAI.

With LLMForge: Same GPU, same models, 92.1% HumanEval. No subscriptions. No data leaving your machine. Change one env var and Cursor, Claude Code, and OpenCode just work — zero code changes.

You manage an engineering team

Cut your AI API bill by 99.8%.

You're paying $6,750/mo to OpenAI for 200 engineers. The GPU server you bought sits at 30% utilization. You need cost controls, per-team budgets, and compliance — sensitive code can't leave the company.

With LLMForge: One GPU server serves all 200 engineers. Auto-route, cache, fuse. $12/mo instead of $6,750. Per-key budget caps prevent runaway spend. Data stays internal. $80,860/year saved — enough to hire an engineer.

Fuse your local models.Cut your AI bill to zero.