Developers building on open-weight LLMs, particularly teams running agentic and coding workloads through Cursor, Cline, Claude Code, Aider, and Continue.

How does Keln compare to Together, Fireworks, DeepInfra, and SiliconFlow?

Keln runs on NVIDIA B200 and B300 GPUs, hosts large models most providers skip, publishes per-model tokens/sec and TTFT benchmarks, and undercuts official API pricing without tiers by running on optimized infrastructure.

What hardware does Keln run on?

NVIDIA B200 and B300 GPUs — the Blackwell generation — with FP8 and NVFP4 precision.

How does Keln pricing work?

Per-token, priced below the official API thanks to our optimized infrastructure. No subscriptions, usage tiers, or minimums.

How do I use Keln in my app?

Any OpenAI-compatible client works — no Keln-specific SDK required. Contact kris@keln.ai for early access details.

Does Keln support long context and prompt caching?

Yes. Full model context windows are supported, and prompt caching is available on models that support it, billed at the reduced cached-token rate.

Does Keln log or train on prompts?

No. Keln does not retain request or response content beyond what is required to serve the request, and does not train on user data.

FAQ | Keln

Q: What is Keln?

Keln is an LLM inference provider serving large open-weight models — starting with OpenAI's GPT OSS 120B, with GLM 5.1, Kimi K2.5, MiniMax M2.7 and others coming soon — on NVIDIA Blackwell hardware, at prices below the official API, thanks to our optimized infrastructure.

General

Keln is an LLM inference provider. We serve large open-weight models — starting with OpenAI's GPT OSS 120B, with GLM 5.1, Kimi K2.5, MiniMax M2.7 and others coming soon — on NVIDIA Blackwell hardware, at prices below the official API. Our optimized infrastructure lets us serve them for less.

Teams building on open-weight LLMs — especially ones running agent and coding workloads through tools like Cursor, Cline, Claude Code, Aider, and Continue, where long context and caching matter most.

Same category — inference for open-weight LLMs. What sets Keln apart: we run on the latest NVIDIA B200 and B300 GPUs, host larger models that most providers skip, publish live speed and latency benchmarks for every model, and come in below official API pricing with no tiers, thanks to our optimized infrastructure.

No. Keln is a managed service. The models we serve are open-weight, so you can always run them yourself, but Keln itself is hosted infrastructure.

Models

GPT OSS 120B (OpenAI) is live today. GLM 5.1 (Zhipu AI), Kimi K2.5 (Moonshot), MiniMax M2.7, Qwen 3.5 397B, and Step 3.5 Flash are coming soon, alongside a rotating catalog of popular community models. The full list with live benchmarks is on the Models page.

Possibly. If it is an open-weight model and there is real demand, we will evaluate it. Email kris@keln.ai with the model and your expected volume.

No. Keln serves open-weight models as released — we don't alter the weights. What we optimize is how they run (batching, precision, caching), and we publish quality benchmarks against the original so you can see how closely we match it.

Infrastructure

NVIDIA B200 and B300 GPUs — the Blackwell generation. These are the fastest GPUs available for inference today, which is what lets us host the largest open-weight models.

Across multiple regions and cloud providers, with automatic failover between them. A single region outage doesn't take your traffic offline.

99.5% or better, backed by redundant capacity across multiple regions and clouds.

Yes. Tokens per second, time-to-first-token, and pricing are published for every model, updated continuously. See the Models page.

Pricing

Per-token, priced below the official API for each model. No subscriptions, no tiers, no minimum spend — our optimized infrastructure lets us offer lower rates than the model's own provider.

No. During early access, Keln is offered through partners on per-token pricing — no Keln invoice or separate signup for standard usage. Reach out to kris@keln.ai for details.

Developers

Any OpenAI-compatible client works — no Keln-specific SDK required. Reach out to kris@keln.ai for early access details.

Yes. Point any of these tools at a Keln-served model and it works out of the box. We tune our setup specifically for the long-context, cache-heavy usage these tools produce.

Yes — the full context window of each model is available. Our setup is built for sustained long-context use, not just short chat. Agent tools sending hundreds of thousands of input tokens per request are a first-class use case.

Yes. Token streaming is available on every model we serve.

Yes, for models that support them. Function calling, JSON mode, and tool use pass through unchanged — we don't restrict features the underlying model already supports.

Keln doesn't add rate limits of its own during early access. For sustained high-volume usage, reach out to kris@keln.ai about reserved capacity.

Privacy & data

No. Keln doesn't store your prompts or completions beyond what's needed to serve each request, and we never train on user data. The only thing we keep is basic operational metadata (latency, token counts, status) for billing and reliability.

Frequentlyasked questions

Frequently
asked questions