Managed LLM Inference
Inside Your AWS

Managed LLM Inference Inside Your AWS

Run vLLM instances at scale. Skip the operational burden.
Define SLO and cost targets. Rivvr handles everything below the line,
inside your AWS account.

Run vLLM instances at scale. Skip the operational burden. Define SLO and cost targets. Rivvr handles everything below the line, inside your AWS account.

Reduce GPU costs by 70%

Meet latency SLOs automatically

Run any vLLM model or LoRA

VPC

Model + LoRAs

SLO + cost targets

Rivvr Orchestrator

Exact-target SLOs

Fleet optimization

Cost control

Observability

Mixed GPU Fleet

H100

Spot

vLLM · GPT-OSS 20B

L40S

Spot

vLLM · GPT-OSS 20B

A100

On-demand

vLLM · GPT-OSS 20B

L4

Spot

vLLM · GPT-OSS 20B

Live metrics

Built for teams shipping real-time agents

Shared inference APIs

No guaranteed SLOs in a shared API

No control over prompts, weights, or logs leaving your environment

Limited model selection, fixed pre-configs

Overprovisioned for SLOs, or cost-effective without them

With Rivvr

Custom SLOs, held automatically, even under traffic bursts

Runs inside your own AWS account

Any vLLM base model, fine-tune, LoRA, or quantization

Continuously optimized for your SLO and cost targets

You set the targets.
Rivvr runs the cluster.

Define your SLOs, not your infrastructure

Bring your own weights, or import from Hugging Face. Define model configs — data type, quantization, KV cache quantization, context length — or leave defaults. Set TTFT, TPS, p50/p90, cost guardrails, and speed mode.

Define your SLOs, not your infrastructure

Deployment Policy

llama-3-70b

Quantization

LoRa

TTFT p50

Output Token/s

TTFT p90

Cost guardrail

Speed mode

Loading...

Rivvr runs the cluster in real time

Rivvr places every model on the right GPU type and VM size at a right time to meet exact SLO you need, mixing spot and on-demand capacity. Support for more GPU and VM configurations gives Rivvr access to more spot capacity pools—and more opportunities to save. If spot is reclaimed, your workload shifts to on-demand before any SLO is at risk. Your topology follows live traffic, not your worst-case estimate.

Rivvr runs the cluster in real time

req / s

llama-3-70b

req / s

mistral-7b

GPU Cluster

A100

80GB

on-demandspot

llama-3-70b

65%

H100

80GB

spot

llama-3-70b

70%

L40S

48GB

spoton-demand

llama-3-70b

58%

L4

24GB

spot

mistral-7bllama-3-70b

62%

A100

40GB

on-demandspot

mistral-7bllama-3-70b

55%

L4

24GB

spoton-demand

mistral-7b

60%

Monitor performance. Change objectives.

Track live TTFT, TPS, SLO compliance, cost per model, cost per token. Change your targets anytime — no redeploy, no downtime. Tighten an SLO, raise a cost guardrail, see the cluster adapt.

Monitor performance. Change objectives.

Track live TTFT, TPS, SLO compliance, cost per model, cost per token. Change your targets anytime — no redeploy, no downtime. Tighten an SLO, raise a cost guardrail, see the cluster adapt.

Live Monitoring

llama-3-70b

TTFT

p50 0msp90 0ms

Tokens / s

p50 0p90 0

Prefill / M tokens

$0.00

Decode / M tokens

$0.00

SLO Compliance0.0%

Built for production
AI agents

Voice AI

Support chat agents

Coding assistants & agents

OCR & document intake

LATENCY-CRITICAL

The Challenge

•

Any TTFT variance is an audible pause; there’s no graceful degradation

•

Static warm pools protect latency, but bill you for capacity you don’t use

With Rivvr

•

Your pipeline stays under 400ms without a dedicated warm pool

•

GPU spend tracks actual call volume, not your peak estimate

Voice AI

Support chat agents

Coding assistants & agents

OCR & document intake

LATENCY-CRITICAL

The Challenge

•

Any TTFT variance is an audible pause; there’s no graceful degradation

•

Static warm pools protect latency, but bill you for capacity you don’t use

With Rivvr

•

Your pipeline stays under 400ms without a dedicated warm pool

•

GPU spend tracks actual call volume, not your peak estimate

Runs in your cloud.
Stays in your cloud.

VPC deployment

Rivvr runs inside your AWS account, in a network-isolated environment you control.

VPC deployment

Rivvr runs inside your AWS account, in a network-isolated environment you control.

Data privacy

Prompts, responses, logs, and model weights stay in your cloud.

Data privacy

Prompts, responses, logs, and model weights stay in your cloud.

RBAC and access control

Control who can deploy models, modify policies, and view cost and performance data.

RBAC and access control

Control who can deploy models, modify policies, and view cost and performance data.

Flexible tenancy isolation

Choose per model: dedicated cluster for strict isolation, or shared fleet for efficiency. Switch without redeployment.

Flexible tenancy isolation

Choose per model: dedicated cluster for strict isolation, or shared fleet for efficiency. Switch without redeployment.

Everything to run inference in production

Bring your own weights

Deploy any vLLM-compatible model from Hugging Face, Amazon S3, or your own model registry.

Managed endpoints

Expose every model through a production-ready, OpenAI-compatible endpoint.

Multi-model & Multi-LoRA

Serve independent models with their own configurations and SLOs, alongside LoRA adapters sharing their base-model capacity.

Batch processing

Scale capacity for asynchronous jobs only when work is ready to run.

Bring your own weights

Deploy any vLLM-compatible model from Hugging Face, Amazon S3, or your own model registry.

Managed endpoints

Expose every model through a production-ready, OpenAI-compatible endpoint.

Multi-model & Multi-LoRA

Serve independent models with their own configurations and SLOs, alongside LoRA adapters sharing their base-model capacity.

Batch processing

Scale capacity for asynchronous jobs only when work is ready to run.

Frequently Asked Questions

Frequently
Asked Questions

How is this different from managed platforms?

Rivvr gives you the managed service experience — no infra to run, no scaling rules to tune — but inside your own AWS account. You keep full control over security, SLOs, and cost, without the operational overhead.

How is this different from AIBrix or NVIDIA Dynamo?

Two things. First: you don't operate anything — Rivvr runs as a managed platform inside your AWS account, unlike the software your team deploys and maintains. Second: a different operational model. AIBrix and Dynamo ask you to configure autoscaling, profiling, and dozens of other parameters. Rivvr asks for two — your SLO and your cost guardrail. Set those, and you're running.

Isn't spot capacity risky for something with an SLO?

Alone, yes — spot can be reclaimed with little warning. Rivvr's cluster spans multiple GPU types, so there's rarely just one spot pool to lose — if one type is reclaimed, another picks up the load, with on-demand as the fallback. More pools, more savings, no interruption risk.

Where does Rivvr run?

Inside your AWS account, behind your VPC. Inference traffic, data, and model weights never leave your environment.

How is Rivvr priced?

Rivvr is priced as a management layer based on GPUs under orchestration. You pay AWS directly for infrastructure; Rivvr charges a separate management fee.

Do I need to change my integration code?

No. Rivvr uses an OpenAI-compatible API. Most teams point their existing client at a new endpoint, and nothing else changes.

Does Rivvr modify or compress my models?

No. Rivvr doesn't touch model weights. Optimization happens at the orchestration, placement, and infrastructure layers.

What models does Rivvr support?

Any Large Language Model served by vLLM, up to 400B parameters. If you're running something larger or unusual, let's talk.

What inference engines are supported?

Rivvr uses a vLLM-compatible runtime.

What GPUs are supported?

NVIDIA GPUs with CUDA compute capability 7.0 or higher — including L4, L40S, A10G, T4, A100, H100, H200, B200, B300, V100.

See it running on your setup

Bring your current models and traffic. We'll show you what it looks like running on Rivvr — SLOs held automatically, your own AWS account, no operational burden.

Bring your current models and traffic. We'll show you what it looks like running on Rivvr — SLOs held automatically, your own AWS account,
no operational burden.

Bring your current models and traffic. We'll show you what it looks like running on Rivvr —
SLOs held automatically,
your own AWS account,
no operational burden.

Managed LLM InferenceInside Your AWS