Open source · MIT licensed

Fuse your local models.
Cut your AI bill to zero.

LLMForge is an OpenAI-compatible proxy that combines 5 free local models into one answer that matches paid APIs — at $0 cloud cost.

Cursor Claude Code OpenCode Continue Any OpenAI client
92.1%
HumanEval pass rate
$0
Cloud cost per 1K calls
$80K
Saved per year (200 eng team)
4.2s
Avg response time

One URL swap. Zero code changes.

Point any OpenAI-compatible tool at LLMForge. It auto-routes, fuses multiple local models, caches, and falls back to cloud only when needed.

# 1. Pull 5 free models with Ollama ollama pull qwen2.5-coder:3b qwen2.5-coder:7b phi4-mini llama3.2:3b # 2. Build & run LLMForge git clone https://github.com/Ozperium/llmforge.git cd llmforge && cargo build --release ./target/release/llmforge # 3. Point your tools at LLMForge instead of OpenAI OPENAI_BASE_URL=http://localhost:8787/v1

Stop paying for what your GPU can do for free.

LiteLLM routes to one model. Portkey observes. OpenRouter charges per token. LLMForge fuses 5 free models into one answer — validated, tested, and voted on before you see it.

Your current setup

$6,750/mo for 200 engineers on GPT-5.2
  • One model, one answer — no fusion
  • Every prompt leaves your network
  • Vendor lock-in to OpenAI/Anthropic
  • Unpredictable monthly bills
  • 81.7% HumanEval from your best local model alone

With LLMForge

$12/mo electricity + ~5% cloud fallback
  • 5 local models fused — 92.1% HumanEval
  • Validated voting eliminates wrong answers
  • Data stays on your machine
  • Any OpenAI-compatible client, zero changes
  • Per-key budget caps prevent runaway spend
  • Auto-routing picks the best pattern per prompt

92.1% HumanEval. At $0.

5 free local models, fused with validated consensus, match GPT-5.2 within 3 percentage points — at zero cloud cost. Full methodology in BENCHMARKS.md.

Configuration Pass Rate Avg Time Cost / 1K calls
LLMForge — Validated Fallback (5 local) 151/164 (92.1%) 4.2s $0.00
LLMForge — Validated Consensus (5 local) 151/164 (92.1%) 22.8s $0.00
LLMForge — Auto-Routing 135/164 (82.3%) 31.4s $0.00
qwen2.5-coder:3b (best single local) 134/164 (81.7%) 2.6s $0.00
gpt-oss:20b (single local) 91/164 (55.5%) 15.6s $0.00
GPT-5.2 (cloud, for comparison) ~95% <1s $1.62

3pp accuracy gap. 100% cost gap. Run the benchmarks yourself: benchmarks/

One GPU server serves your whole team.

LLMForge runs on a single GPU server you already own. Auto-route, cache, and fuse for every engineer. Cloud fallback only kicks in for the ~5% of hard problems.

Team size Requests / month Current cost / mo LLMForge / mo Annual savings
5 engineers 2,000 $20 $0.10 $239
50 engineers 50,000 $500 $2.10 $5,975
200 engineers 500,000 $6,750 $11.70 $80,860

Based on GPT-5.2 pricing ($1.62/1K calls) vs LLMForge on existing hardware. LLMForge cost = electricity + ~5% cloud fallback. Does not include GPU purchase — uses hardware you already have.

Everything you need in one proxy.

Not just routing. LLMForge fuses models, auto-routes by prompt type, caches responses, and enforces budget caps — all in a single Rust binary.

🔀

Fusion Patterns

Validated consensus, cascade fallback, stream race, self-consistency, and auto-routing. Only patterns that beat the best single model — no martingales, no filler.

🎯

Auto-Routing

Classifies prompts as code, reasoning, or chat — then picks the optimal pattern and model. Zero-config smart routing out of the box.

Streaming + Race

SSE passthrough for all patterns. Stream race: fire all free models in parallel, first response wins. Losers are aborted instantly.

💾

Caching

Exact-match cache delivers 54x faster responses on repeat queries. Semantic cache planned for v2.

🔐

Budget Caps + Auth

Per-API-key spend limits, rate limiting, and Bearer token auth. Built for teams sharing one GPU server.

📊

Cost Tracking

Every request's cost recorded against its API key. Query spend via /v1/budget endpoint.

If you're in one of these two buckets, yes.

You're a solo dev with a local GPU

Get better answers from your existing models.

You run Ollama for side projects but local models alone produce garbage on hard problems — 71% HumanEval from your best model. You're tired of $20/mo subscriptions for every AI coding tool, and you don't want every prompt sent to OpenAI.

With LLMForge: Same GPU, same models, 92.1% HumanEval. No subscriptions. No data leaving your machine. Change one env var and Cursor, Claude Code, and OpenCode just work — zero code changes.
You manage an engineering team

Cut your AI API bill by 99.8%.

You're paying $6,750/mo to OpenAI for 200 engineers. The GPU server you bought sits at 30% utilization. You need cost controls, per-team budgets, and compliance — sensitive code can't leave the company.

With LLMForge: One GPU server serves all 200 engineers. Auto-route, cache, fuse. $12/mo instead of $6,750. Per-key budget caps prevent runaway spend. Data stays internal. $80,860/year saved — enough to hire an engineer.

Deploy in 5 minutes.

Free, open source, MIT licensed. No signup, no cloud, no lock-in.