LLMForge is an OpenAI-compatible proxy that combines 5 free local models into one answer that matches paid APIs — at $0 cloud cost.
Point any OpenAI-compatible tool at LLMForge. It auto-routes, fuses multiple local models, caches, and falls back to cloud only when needed.
LiteLLM routes to one model. Portkey observes. OpenRouter charges per token. LLMForge fuses 5 free models into one answer — validated, tested, and voted on before you see it.
5 free local models, fused with validated consensus, match GPT-5.2 within 3 percentage points — at zero cloud cost. Full methodology in BENCHMARKS.md.
| Configuration | Pass Rate | Avg Time | Cost / 1K calls |
|---|---|---|---|
| LLMForge — Validated Fallback (5 local) | 151/164 (92.1%) | 4.2s | $0.00 |
| LLMForge — Validated Consensus (5 local) | 151/164 (92.1%) | 22.8s | $0.00 |
| LLMForge — Auto-Routing | 135/164 (82.3%) | 31.4s | $0.00 |
| qwen2.5-coder:3b (best single local) | 134/164 (81.7%) | 2.6s | $0.00 |
| gpt-oss:20b (single local) | 91/164 (55.5%) | 15.6s | $0.00 |
| GPT-5.2 (cloud, for comparison) | ~95% | <1s | $1.62 |
3pp accuracy gap. 100% cost gap. Run the benchmarks yourself: benchmarks/
LLMForge runs on a single GPU server you already own. Auto-route, cache, and fuse for every engineer. Cloud fallback only kicks in for the ~5% of hard problems.
| Team size | Requests / month | Current cost / mo | LLMForge / mo | Annual savings |
|---|---|---|---|---|
| 5 engineers | 2,000 | $20 | $0.10 | $239 |
| 50 engineers | 50,000 | $500 | $2.10 | $5,975 |
| 200 engineers | 500,000 | $6,750 | $11.70 | $80,860 |
Based on GPT-5.2 pricing ($1.62/1K calls) vs LLMForge on existing hardware. LLMForge cost = electricity + ~5% cloud fallback. Does not include GPU purchase — uses hardware you already have.
Not just routing. LLMForge fuses models, auto-routes by prompt type, caches responses, and enforces budget caps — all in a single Rust binary.
Validated consensus, cascade fallback, stream race, self-consistency, and auto-routing. Only patterns that beat the best single model — no martingales, no filler.
Classifies prompts as code, reasoning, or chat — then picks the optimal pattern and model. Zero-config smart routing out of the box.
SSE passthrough for all patterns. Stream race: fire all free models in parallel, first response wins. Losers are aborted instantly.
Exact-match cache delivers 54x faster responses on repeat queries. Semantic cache planned for v2.
Per-API-key spend limits, rate limiting, and Bearer token auth. Built for teams sharing one GPU server.
Every request's cost recorded against its API key. Query spend via /v1/budget endpoint.
You run Ollama for side projects but local models alone produce garbage on hard problems — 71% HumanEval from your best model. You're tired of $20/mo subscriptions for every AI coding tool, and you don't want every prompt sent to OpenAI.
You're paying $6,750/mo to OpenAI for 200 engineers. The GPU server you bought sits at 30% utilization. You need cost controls, per-team budgets, and compliance — sensitive code can't leave the company.
Free, open source, MIT licensed. No signup, no cloud, no lock-in.