LLMForge is a local-first LLM proxy that keeps every prompt on your GPU server. No data sent to OpenAI. No vendor lock-in. Cloud is fallback only — you control when and if. OpenAI-compatible, one env var.
Every request hits your local models first. Cache catches repeats. Budget caps prevent runaway spend. Cloud is the fallback, not the default — you control exactly when and how much you pay.
Every prompt hits your local models. Source code, internal docs, customer data — nothing leaves your network. Cache catches repeats instantly (54x faster, $0).
Classifies the prompt (code, reasoning, chat) and picks the optimal local model. No wasted compute on wrong models.
Your GPU server handles it. Optional fusion votes across models for harder problems. Cost: electricity.
If local can't answer, LLMForge falls back to your configured cloud provider — only if you enable it. You set the budget cap. Sensitive prompts can be pinned to local-only.
Every prompt your team sends to GPT-5.2 leaves your network — source code, internal docs, customer data. LiteLLM and Portkey route to the same cloud. LLMForge keeps it on your GPU server. Cloud is opt-in, not the default.
LLMForge runs on a single GPU server you already own. Route, cache, and cap spend for every engineer. Cloud fallback only kicks in for the ~5% of hard problems.
| Team size | Requests / month | Current cost / mo | LLMForge / mo | Annual savings |
|---|---|---|---|---|
| 5 engineers | 2,000 | $20 | $0.10 | $239 |
| 50 engineers | 50,000 | $500 | $2.10 | $5,975 |
| 200 engineers | 500,000 | $6,750 | $11.70 | $80,860 |
Based on GPT-5.2 pricing ($1.62/1K calls) vs LLMForge on existing hardware. LLMForge cost = electricity + ~5% cloud fallback. Does not include GPU purchase — uses hardware you already have.
Not just routing. LLMForge caches, caps spend, authenticates per-key, auto-routes by prompt type, and falls back to cloud — all in a single Rust binary.
100% of prompts stay on your GPU server by default. No source code, internal docs, or customer data sent to OpenAI. Cloud fallback is opt-in per-request. Sensitive prompts can be pinned to local-only.
Local models first, cloud only as opt-in fallback. Every request starts at $0. You set the budget — spend never exceeds it.
Exact-match cache delivers 54x faster responses on repeat queries. $0 on cached calls. Semantic cache planned for v2.
Per-API-key spend limits, rate limiting, and Bearer token auth. Built for teams sharing one GPU server.
Classifies prompts as code, reasoning, or chat — then picks the optimal model. Zero-config smart routing out of the box.
Every request's cost recorded against its API key. Query spend via /v1/budget endpoint.
Votes across multiple local models for harder problems. +10pp on HumanEval when you need it. Off by default for speed.
HTTP webhooks at 3 extension points: custom routing rules, prompt redaction, response auditing. Any language, any service. Build your own security controls.
The question isn't "can local match GPT-5.2?" — it's "does it matter for 95% of your queries?" Local handles the bulk. Cloud fills the gap. You pay for the gap, not the bulk.
| Configuration | HumanEval Pass Rate | Cost / 1K calls |
|---|---|---|
| LLMForge — Local (single best model) | 81.7% | $0.00 |
| LLMForge — Local + Fusion (5 models) | 92.1% | $0.00 |
| LLMForge — Auto-Routing | 82.3% | $0.00 |
| GPT-5.2 (cloud, for comparison) | ~95% | $1.62 |
3pp accuracy gap. 100% cost gap. Local handles the 95%, cloud fills the 5%. Run the benchmarks yourself: benchmarks/
You run Ollama for side projects but still pay $20/mo to OpenAI for the hard stuff. Every prompt you send includes your source code — going to a third party's servers. Your GPU sits idle 90% of the time.
Every prompt your engineers send to GPT-5.2 leaves your network — source code, internal docs, customer data. You're paying $6,750/mo for the privilege. You need compliance, per-team budgets, and data residency — but you can't rip out OpenAI without breaking every tool.
Free, open source, MIT licensed. No signup, no cloud, no lock-in.