Open source · MIT licensed

A kill switch for runaway
AI agents — at zero cloud cost.

LLMForge is an AI API gateway with hard budget caps, semantic circuit breakers, and local-first routing. Route between local GPUs, network machines, and cloud. Cap spend per API key. Strip secrets. Audit every call. Extend with built-in, JS/TS, or webhook plugins. OpenAI-compatible, one env var.


Cursor · Claude Code · OpenCode · Continue · Any OpenAI client

$0

Cost on local models — no surprise bills

3

Infrastructure tiers (local → network → cloud)

100%

Prompts stay on your server by default

8ms

Rust startup — vs LiteLLM's Python overhead

How it works

Four pillars. One gateway. No surprise bills.

Hard budget caps kill runaway agents before they cost $30K. Route between local and cloud. Secure every prompt. Extend with plugins. All in a single Rust binary.

Budget Control — cap spend, kill loops

Per-API-key budget caps. Requests rejected when limit hit. Circuit breaker detects agent loops — exact hash + embedding similarity — and kills runaway sessions before they drain budget. 5 similar requests in 60s → 120s block. Cost tracking per key via /v1/budget.
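A minimal sketch of the exact-hash half of such a loop detector, using the thresholds quoted above (5 matching requests within 60s trigger a 120s block). The `LoopBreaker` class, its method names, and the FNV-1a hash are illustrative assumptions, not LLMForge's actual implementation, and the embedding-similarity check is left out.

```javascript
// Hypothetical sketch: sliding-window loop detection keyed on an exact prompt hash.
// FNV-1a over UTF-16 code units; any stable hash would do here.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class LoopBreaker {
  constructor({ maxRepeats = 5, windowMs = 60_000, blockMs = 120_000 } = {}) {
    this.maxRepeats = maxRepeats;
    this.windowMs = windowMs;
    this.blockMs = blockMs;
    this.seen = new Map();         // "session:hash" -> recent timestamps (ms)
    this.blockedUntil = new Map(); // session -> timestamp (ms) when the block lifts
  }

  // Returns true if the request may proceed, false if the session is blocked.
  allow(sessionId, prompt, now = Date.now()) {
    if ((this.blockedUntil.get(sessionId) ?? 0) > now) return false;

    const key = `${sessionId}:${fnv1a(prompt.trim())}`;
    const recent = (this.seen.get(key) ?? []).filter(t => now - t < this.windowMs);
    recent.push(now);
    this.seen.set(key, recent);

    if (recent.length >= this.maxRepeats) {
      // The request that reaches the threshold is rejected and starts the block.
      this.blockedUntil.set(sessionId, now + this.blockMs);
      this.seen.delete(key);
      return false;
    }
    return true;
  }
}
```

Passing `now` explicitly keeps the sketch deterministic; a semantic variant would compare embeddings against recent requests instead of comparing hashes.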

Flow Control — route where you want

Multi-tier failover: local GPU → machines on your network → cloud. Route profiles (/v1/code, /v1/fast, /v1/reasoning) with fixed model, pattern, plugin chain. Cloud is opt-in, not the default.
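The failover order described above could be sketched roughly as follows. The endpoint shape (`name`, `tier`, `healthy`, `send`) and the `allowCloud` option are hypothetical names chosen for illustration; LLMForge's real configuration may look quite different.

```javascript
// Hypothetical sketch: try local first, then network machines, then cloud (opt-in).
const TIER_ORDER = ["local", "network", "cloud"];

// Returns the healthy endpoints in the order they should be tried.
function planRoute(endpoints, { allowCloud = false } = {}) {
  return endpoints
    .filter(e => e.healthy && (e.tier !== "cloud" || allowCloud))
    .sort((a, b) => TIER_ORDER.indexOf(a.tier) - TIER_ORDER.indexOf(b.tier));
}

// Walks the plan and returns the first successful response.
async function sendWithFailover(endpoints, request, options) {
  const errors = [];
  for (const endpoint of planRoute(endpoints, options)) {
    try {
      return await endpoint.send(request);
    } catch (err) {
      errors.push(`${endpoint.name}: ${err.message}`);
    }
  }
  throw new Error(`all tiers failed: ${errors.join("; ")}`);
}
```

Keeping `allowCloud` false by default mirrors the statement that cloud is opt-in: a request never leaves the network unless the caller asks for it.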

Security — every prompt protected

PII redaction strips API keys, tokens, emails before they reach any model. Audit logging records every call. 100% on your server by default. Bearer token auth per key.

Plugins — three ways to extend

Built-in plugins (Rust, zero overhead): PII redact, prompt harness, code formatter, audit log. JS/TS plugins via embedded V8 (registerPlugin() — no server, no network). HTTP webhooks for any language. Fusion is a plugin pattern — route to multiple models, vote, cascade with tests.

# 1. One-click install (builds from source, installs to ~/.local/bin)
curl -sSf https://llmforge.pages.dev/install.sh | bash

# 2. Generate config & start the proxy
llmforge init
llmforge serve

# 3. Point your tools at LLMForge instead of OpenAI
export OPENAI_BASE_URL=http://localhost:8787/v1
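For scripts that call the gateway directly rather than through an editor, a request could be built along these lines. This is a sketch that assumes the standard OpenAI `/chat/completions` path and bearer-token auth; the helper name and the `local-default` model name are placeholders, not values LLMForge defines.

```javascript
// Hypothetical helper: build an OpenAI-style chat request aimed at the gateway.
function buildChatRequest(baseUrl, apiKey, prompt, model = "local-default") {
  return {
    // Strip trailing slashes so the path joins cleanly.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}
```

The result can be passed straight to `fetch(url, init)`, with `baseUrl` taken from `process.env.OPENAI_BASE_URL`.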
    

Write plugins in JS/TS — runs in-process via embedded V8, no server needed:

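// plugins/pii-redact.js — strips secrets before they reach any model
registerPlugin({
  prompt_filter(req) {
    // Redact OpenAI-style keys and email addresses (mixed case, digits, dots, subdomains).
    const redact = text => text
      .replace(/sk-[a-zA-Z0-9_-]{20,}/g, '[REDACTED_API_KEY]')
      .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, '[REDACTED_EMAIL]');
    const filtered = req.messages.map(m => ({
      ...m,
      // Only string content is redacted; non-string content passes through unchanged.
      content: typeof m.content === 'string' ? redact(m.content) : m.content
    }));
    return { messages: filtered };
  }
});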
    

The difference

Your AI bill is a postmortem, not a control.

Agent loops burn $30K before billing alerts fire. Rate limits don't account for token density. Dashboards show the damage after it's done. LLMForge enforces hard caps and kills runaway agents — before the bill, not after.

Your current setup

$6,750/mo + $30K surprise from agent loops

Billing alerts fire after the damage is done
No circuit breaker — agents loop until budget is gone
No control over where requests go — all to cloud
No PII redaction — API keys and secrets sent to cloud
No audit trail of what was sent where
One endpoint, one model, no extensibility

With LLMForge

$12/mo electricity + optional cloud fallback

Hard budget caps per API key — requests rejected at the limit
Circuit breaker kills agent loops at iteration 3, not $30K later
Route local → network → cloud, per request
PII redaction strips secrets before they reach any model
Audit logging records every call — model, cost, latency, key
Plugin system — built-in (Rust), JS/TS (V8), or HTTP webhooks

Cost projection

One GPU server serves your whole team.

LLMForge runs on a single GPU server you already own. Route, cache, and cap spend for every engineer. Cloud fallback only kicks in for the ~5% of hard problems.

Team size	Requests / month	Current cost / mo	LLMForge / mo	Annual savings
5 engineers	2,000	$20	$0.10	$239
50 engineers	50,000	$500	$2.10	$5,975
200 engineers	500,000	$6,750	$11.70	$80,860

Based on GPT-5.2 pricing ($1.62/1K calls) vs LLMForge on existing hardware. LLMForge cost = electricity + ~5% cloud fallback. Does not include GPU purchase — uses hardware you already have.
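The annual savings column can be reproduced from the two monthly cost columns. A small sketch of that arithmetic, using only the figures in the table above (the function name is illustrative):

```javascript
// Annual savings = (current monthly cost - LLMForge monthly cost) x 12, rounded to the dollar.
function annualSavings(currentPerMonth, llmforgePerMonth) {
  return Math.round((currentPerMonth - llmforgePerMonth) * 12);
}

const rows = [
  { team: "5 engineers", current: 20, llmforge: 0.10 },
  { team: "50 engineers", current: 500, llmforge: 2.10 },
  { team: "200 engineers", current: 6750, llmforge: 11.70 },
];

for (const row of rows) {
  console.log(`${row.team}: $${annualSavings(row.current, row.llmforge)} saved per year`);
}
```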

vs LiteLLM

Same category. Different league.

LiteLLM is the OSS distribution leader — 50K+ GitHub stars, Python, 100+ providers. But it's heavy: 500MB+ RAM, Postgres + Redis, 3-5s startup, 1.7-4x throughput drop in production. LLMForge is a 7.5MB Rust binary that starts in 8ms.

Metric	LLMForge (Rust)	LiteLLM (Python)
Binary size	7.5MB	~200MB (pip + deps)
Startup time	~8ms	~3-5s
Memory (idle)	~10MB	~500MB+ (needs Postgres + Redis)
External deps	None	Postgres + Redis
Throughput overhead	Minimal	1.7-4x drop reported in production
Deployment	Single binary	Docker Compose (3+ containers)

LiteLLM has wider provider coverage (100+) and a larger community. LLMForge is purpose-built for teams that want a lightweight, local-first gateway with circuit breakers and plugins — not a full microservices stack.

Features

Enforce, don't just observe.

Dashboards show you the damage after it's done. LLMForge stops it before it happens. Hard caps, circuit breakers, routing control, security, and plugins — all in one Rust binary.

💸

Budget Caps

Per-API-key spend limits. Requests rejected at the limit. No surprise bills, no postmortems.
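As a rough illustration of what a hard cap means in practice, the sketch below rejects any request that would push a key past its limit. `BudgetLedger` and its fields are hypothetical names, and amounts are tracked in integer micro-dollars (an assumption made here to avoid floating-point drift).

```javascript
// Hypothetical sketch: per-key hard caps, tracked in integer micro-dollars.
class BudgetLedger {
  constructor(capsMicroUsd) {
    this.caps = new Map(Object.entries(capsMicroUsd)); // apiKey -> cap
    this.spent = new Map();                            // apiKey -> total spent so far
  }

  // Records the cost if it fits under the cap; otherwise rejects without charging.
  charge(apiKey, costMicroUsd) {
    const cap = this.caps.get(apiKey);
    if (cap === undefined) return { allowed: false, reason: "unknown key" };

    const spent = this.spent.get(apiKey) ?? 0;
    if (spent + costMicroUsd > cap) {
      return { allowed: false, reason: "budget exceeded", spent, cap };
    }
    this.spent.set(apiKey, spent + costMicroUsd);
    return { allowed: true, spent: spent + costMicroUsd, cap };
  }
}
```

The check happens before the request is forwarded, which is the difference between a cap and an alert: the over-budget call never reaches a model.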

🛡️

Circuit Breaker

Semantic loop detection — exact hash + embedding similarity. Kills runaway agents at iteration 3, not $30K later.

🔀

Flow Control

Multi-tier: local GPU → network machines → cloud. Per-request routing. Multi-endpoint profiles with fixed model and budget.

🔒

Security & Privacy

100% on your server by default. PII redaction strips secrets. Audit logging records every call. Cloud is opt-in.

🧩

Built-in Plugins

PII redact, prompt harness, code formatter, audit log. Rust code — zero network overhead. Enable per route profile.

📜

JS/TS Plugins

Write plugins in JavaScript or TypeScript via embedded V8. registerPlugin() with 3 hooks. No server, no network — runs in-process. Feature-gated.

🔌

Webhook Plugins

HTTP hooks at 3 extension points: pre_request, prompt_filter, post_response. Any language, any service. Zero-config.

⚡

Fusion Patterns

Plugin patterns: Validated Fallback, Validated Consensus, Self-Consistency, Stream Race. Route to multiple models, vote, cascade.

📦

Two-Layer Cache

Exact match (instant) + semantic match (embedding similarity). $0 on cached calls. 54x faster on hits.
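A two-layer lookup of this kind might be sketched as below: an exact-match map first, then a cosine-similarity scan over stored embeddings. The `TwoLayerCache` class, the caller-supplied `embed` function, and the 0.95 threshold are all assumptions for illustration rather than LLMForge's actual design.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical sketch: exact match first, semantic match as a fallback.
class TwoLayerCache {
  constructor(embed, threshold = 0.95) {
    this.embed = embed;         // caller-supplied: prompt -> number[]
    this.threshold = threshold; // minimum similarity for a semantic hit
    this.exact = new Map();     // prompt -> response
    this.entries = [];          // { vector, response }
  }

  set(prompt, response) {
    this.exact.set(prompt, response);
    this.entries.push({ vector: this.embed(prompt), response });
  }

  // Returns { layer, response } on a hit, or null on a miss.
  get(prompt) {
    if (this.exact.has(prompt)) {
      return { layer: "exact", response: this.exact.get(prompt) };
    }
    const query = this.embed(prompt);
    let best = null;
    for (const entry of this.entries) {
      const score = cosine(query, entry.vector);
      if (score >= this.threshold && (!best || score > best.score)) {
        best = { score, response: entry.response };
      }
    }
    return best ? { layer: "semantic", response: best.response } : null;
  }
}
```

The linear scan is fine for a sketch; a production cache would more likely use an approximate nearest-neighbour index.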

📡

Streaming

SSE passthrough + stream race. Race models, first response wins. Works with Cursor, Claude Code, OpenCode.

Benchmarks

Fusion plugin patterns work. Local models handle the rest.

Fusion is a plugin pattern — route to multiple local models, vote on results. +10pp on HumanEval when you need it. Off by default for speed. The core product is flow control and security. Fusion is one thing the plugin system enables.

Configuration	HumanEval Pass Rate	Cost / 1K calls
LLMForge — Local (single best model)	81.7%	$0.00
LLMForge — Local + Fusion (5 models)	92.1%	$0.00
LLMForge — Fusion (3 mixed models)	84.1%	$0.00
GPT-5.2 (cloud, for comparison)	~95%	$1.62

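Local Fusion trails GPT-5.2 by about 3pp (92.1% vs ~95%) at $0.00 instead of $1.62 per 1K calls. Local handles roughly 95% of traffic; cloud fills the remaining 5%. Results are pass@1 on the full 164-problem HumanEval. Run the benchmarks yourself: benchmarks/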
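The "vote" step in a fusion pattern can be as simple as a majority vote over the answers returned by several models. A minimal sketch, assuming answers are plain strings and that trimming whitespace is enough to treat two answers as the same; the function name and tie-breaking rule (first answer seen wins) are illustrative choices, not LLMForge's documented behavior.

```javascript
// Hypothetical sketch: pick the most common answer among several model outputs.
function majorityVote(answers, normalize = s => s.trim()) {
  const tally = new Map(); // normalized answer -> vote count (insertion-ordered)
  for (const answer of answers) {
    const key = normalize(answer);
    tally.set(key, (tally.get(key) ?? 0) + 1);
  }

  let winner = null;
  for (const [answer, votes] of tally) {
    // Strict ">" means the earliest-seen answer wins a tie.
    if (!winner || votes > winner.votes) winner = { answer, votes };
  }
  return winner; // null when no answers were supplied
}
```

For code generation, the normalizer would usually be replaced by something stricter, such as comparing test results rather than raw text.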

Is it for you?

If you want control over your AI infrastructure, yes.

You have local GPUs and cloud accounts

Stop choosing between local and cloud.

You run Ollama on your Mac Studio for side projects but pay OpenAI for the hard stuff. There's no way to route easy prompts to local and hard prompts to cloud. Your local GPUs sit idle 90% of the time. You have no control over the flow.

With LLMForge: Route by profile — /v1/fast hits local, /v1/reasoning cascades to cloud if local can't answer. Your local GPUs handle 95% of traffic. Cloud fills the gap. You control exactly where each request goes.

You manage an engineering team

Your source code is on OpenAI's servers right now.

Every prompt your engineers send leaves your network — source code, internal docs, customer data. No PII redaction. No audit trail. No way to extend the pipeline. You're paying $6,750/mo for the privilege of losing control over your data.

With LLMForge: Control the flow — local, network, or cloud per request. PII redaction strips secrets. Audit logging records every call. Plugin system lets you build custom harnesses. $80,860/year saved — and your IP stays yours.
