Open source · MIT licensed

Your code never leaves.
Your AI bill drops to zero.

LLMForge is a local-first LLM proxy that keeps every prompt on your GPU server. No data sent to OpenAI. No vendor lock-in. Cloud is fallback only — you control when and if. OpenAI-compatible, one env var.

Cursor Claude Code OpenCode Continue Any OpenAI client
100%
Data stays on your server
99.8%
Cloud cost reduction
$80K
Saved per year (200 eng team)
54x
Faster on cache hits

Local first. Cloud only when you need it.

Every request hits your local models first. Cache catches repeats. Budget caps prevent runaway spend. Cloud is the fallback, not the default — you control exactly when and how much you pay.

1

Request comes in → stays on your server

Every prompt hits your local models. Source code, internal docs, customer data — nothing leaves your network. Cache catches repeats instantly (54x faster, $0).

2

Auto-route to the right model

Classifies the prompt (code, reasoning, chat) and picks the optimal local model. No wasted compute on wrong models.

3

Local models answer → $0

Your GPU server handles it. Optional fusion votes across models for harder problems. Cost: electricity.

4

Cloud fallback → opt-in, per-request

If local can't answer, LLMForge falls back to your configured cloud provider — only if you enable it. You set the budget cap. Sensitive prompts can be pinned to local-only.

# 1. Pull models with Ollama ollama pull qwen2.5-coder:3b qwen2.5-coder:7b phi4-mini llama3.2:3b # 2. Build & run LLMForge git clone https://github.com/Ozperium/llmforge.git cd llmforge && cargo build --release ./target/release/llmforge # 3. Point your tools at LLMForge instead of OpenAI OPENAI_BASE_URL=http://localhost:8787/v1

Your code is talking to OpenAI right now.

Every prompt your team sends to GPT-5.2 leaves your network — source code, internal docs, customer data. LiteLLM and Portkey route to the same cloud. LLMForge keeps it on your GPU server. Cloud is opt-in, not the default.

Your current setup

$6,750/mo for 200 engineers on GPT-5.2
  • Every prompt leaves your network — source code, internal docs
  • OpenAI trains on your data (opt-out is per-request, easy to forget)
  • No control over what models see your code
  • Repeat queries cost full price every time
  • Vendor lock-in — your whole stack depends on one API

With LLMForge

$12/mo electricity + ~5% optional cloud fallback
  • 100% of prompts stay on your GPU server by default
  • Cloud fallback is opt-in and per-request — you control it
  • Cache kills repeat costs — 54x faster on cache hits
  • Per-API-key budget caps prevent runaway spend
  • Any OpenAI-compatible client, zero code changes

One GPU server serves your whole team.

LLMForge runs on a single GPU server you already own. Route, cache, and cap spend for every engineer. Cloud fallback only kicks in for the ~5% of hard problems.

Team size Requests / month Current cost / mo LLMForge / mo Annual savings
5 engineers 2,000 $20 $0.10 $239
50 engineers 50,000 $500 $2.10 $5,975
200 engineers 500,000 $6,750 $11.70 $80,860

Based on GPT-5.2 pricing ($1.62/1K calls) vs LLMForge on existing hardware. LLMForge cost = electricity + ~5% cloud fallback. Does not include GPU purchase — uses hardware you already have.

Everything a team needs in one proxy.

Not just routing. LLMForge caches, caps spend, authenticates per-key, auto-routes by prompt type, and falls back to cloud — all in a single Rust binary.

🔒

Data Privacy

100% of prompts stay on your GPU server by default. No source code, internal docs, or customer data sent to OpenAI. Cloud fallback is opt-in per-request. Sensitive prompts can be pinned to local-only.

💸

Cost Routing

Local models first, cloud only as opt-in fallback. Every request starts at $0. You set the budget — spend never exceeds it.

Caching

Exact-match cache delivers 54x faster responses on repeat queries. $0 on cached calls. Semantic cache planned for v2.

🔐

Budget Caps + Auth

Per-API-key spend limits, rate limiting, and Bearer token auth. Built for teams sharing one GPU server.

🎯

Auto-Routing

Classifies prompts as code, reasoning, or chat — then picks the optimal model. Zero-config smart routing out of the box.

📊

Cost Tracking

Every request's cost recorded against its API key. Query spend via /v1/budget endpoint.

🔀

Optional Fusion

Votes across multiple local models for harder problems. +10pp on HumanEval when you need it. Off by default for speed.

🧩

Plugin System

HTTP webhooks at 3 extension points: custom routing rules, prompt redaction, response auditing. Any language, any service. Build your own security controls.

Local models are good enough for 95% of queries.

The question isn't "can local match GPT-5.2?" — it's "does it matter for 95% of your queries?" Local handles the bulk. Cloud fills the gap. You pay for the gap, not the bulk.

Configuration HumanEval Pass Rate Cost / 1K calls
LLMForge — Local (single best model) 81.7% $0.00
LLMForge — Local + Fusion (5 models) 92.1% $0.00
LLMForge — Auto-Routing 82.3% $0.00
GPT-5.2 (cloud, for comparison) ~95% $1.62

3pp accuracy gap. 100% cost gap. Local handles the 95%, cloud fills the 5%. Run the benchmarks yourself: benchmarks/

If you're in one of these two buckets, yes.

You're a solo dev with a local GPU

Stop sending your code to OpenAI.

You run Ollama for side projects but still pay $20/mo to OpenAI for the hard stuff. Every prompt you send includes your source code — going to a third party's servers. Your GPU sits idle 90% of the time.

With LLMForge: Local handles 95% of queries at $0 — your code never leaves your machine. Cloud fallback kicks in only for the hard 5% if you enable it. Cache kills repeat costs. Cursor, Claude Code, and OpenCode just work — one env var.
You manage an engineering team

Your source code is on OpenAI's servers right now.

Every prompt your engineers send to GPT-5.2 leaves your network — source code, internal docs, customer data. You're paying $6,750/mo for the privilege. You need compliance, per-team budgets, and data residency — but you can't rip out OpenAI without breaking every tool.

With LLMForge: 100% of prompts stay on your GPU server by default. Cloud fallback is opt-in per-request. One server serves all 200 engineers at $12/mo. Per-key budget caps. Data stays internal. $80,860/year saved — and your IP stays yours.

Deploy in 5 minutes.

Free, open source, MIT licensed. No signup, no cloud, no lock-in.