
The Polymorphic Agent: How to Cut AI Costs by 90% Using an API Gateway

January 20, 2026

In the early days of AI development, we treated Large Language Models like databases: you picked one (usually GPT-4), hardcoded the API key, and married it.

That marriage is now expensive and obsolete.

At Abba Baba, we shifted to a Gateway Architecture. Instead of building agents that rely on a specific "brain," we built agents that rely on a Router. This allows a single agent to swap between "genius mode" (expensive) and "speed mode" (cheap) depending on the task at handβ€”without changing a single line of application code.

Here's how we use API gateways to build "Polymorphic Agents" that are 10x cheaper and infinitely more resilient.

The Problem: Model Monogamy

If you hardcode model="gpt-4" into your chatbot, you're paying premium prices for "Hello, how are you?" interactions.

We discovered this with our own support system. We were using high-intelligence models for everything:

| Query Type | Example                         | Actual Requirement | What We Were Paying |
|------------|---------------------------------|--------------------|---------------------|
| Complex    | "Debug my API rate limit issue" | Deep reasoning     | High (appropriate)  |
| Simple     | "What is your pricing?"         | Basic retrieval    | High (wasteful)     |

Using a genius-tier model for simple queries is like hiring a PhD to answer the phone. It works, but it burns money.

The math is brutal. If 80% of your queries are simple and you're using a premium model for everything, you're overpaying by 10-20x on the majority of your traffic.
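To make that concrete, here's a back-of-envelope calculation with illustrative prices (your actual per-conversation costs will differ):

// Illustrative per-conversation prices; swap in your own numbers
const PREMIUM = 0.01;    // high-end model
const CHEAP = 0.001;     // fast model, 10x cheaper

const simpleShare = 0.8; // 80% of queries are simple

const allPremium = PREMIUM;                                       // $0.0100 avg
const routed = simpleShare * CHEAP + (1 - simpleShare) * PREMIUM; // $0.0028 avg

console.log(`savings: ${((1 - routed / allPremium) * 100).toFixed(0)}%`); // 72%

Even at a conservative 10x price gap, routing claws back 72% of your spend. Widen the gap between tiers or raise the share of simple traffic, and you approach the 90% figure we'll get to below.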

The Solution: The AI Gateway Pattern

An AI Gateway (or Router) sits between your application code and the AI providers. You send your prompt to the Gateway, and the Gateway decides where to route it based on your rules.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Your App   β”‚ ──▢ β”‚   Gateway   β”‚ ──▢ β”‚  Claude Haiku   β”‚ (fast/cheap)
β”‚             β”‚     β”‚   (Router)  β”‚ ──▢ β”‚  Claude Sonnet  β”‚ (balanced)
β”‚             β”‚     β”‚             β”‚ ──▢ β”‚  Claude Opus    β”‚ (powerful)
β”‚             β”‚     β”‚             β”‚ ──▢ β”‚  GPT-4          β”‚ (fallback)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The gateway abstraction gives you three superpowers:

  1. Dynamic Routing: Send simple queries to cheap models, complex queries to powerful ones
  2. Automatic Fallbacks: If one provider goes down, fail over to another instantly
  3. Zero-Code Updates: Swap models via environment variables, no deployment required

Gateway Options

We evaluated several approaches:

Vercel AI Gateway

What we currently use. Tight integration with our Next.js deployment, automatic fallbacks, streaming support out of the box. The trade-off is vendor lock-in to the Vercel ecosystem.

OpenRouter

A unified API that provides access to almost any modelβ€”Llama, Mistral, Claude, Gemini, and dozens moreβ€”through a single API key. Excellent for experimentation and accessing models you might not have direct API access to. Pay-as-you-go pricing across all providers.

LiteLLM

Open-source proxy server you host yourself. Normalizes inputs across providers so you can swap OpenAI for Bedrock or Azure without rewriting code. Maximum control, but you're responsible for infrastructure.

The strategy works with any of these. Pick based on your constraints: convenience (Vercel), selection (OpenRouter), or control (LiteLLM).
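One nice property: OpenRouter and a LiteLLM proxy both speak the OpenAI-compatible API, so you can point the standard openai client at either one. A minimal sketch (the env vars and model id here are illustrative):

import OpenAI from 'openai';

// Point the standard OpenAI client at your gateway of choice.
// OpenRouter: https://openrouter.ai/api/v1
// Self-hosted LiteLLM proxy: e.g. http://localhost:4000
const client = new OpenAI({
  baseURL: process.env.GATEWAY_BASE_URL,
  apiKey: process.env.GATEWAY_API_KEY,
});

const completion = await client.chat.completions.create({
  model: 'anthropic/claude-3-haiku', // OpenRouter-style model id
  messages: [{ role: 'user', content: 'What is your pricing?' }],
});

Swapping gateways becomes a base-URL change, which is exactly the decoupling this post is about.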

Real-World Implementation: The Dual-Brain Chat

On January 21, we deployed a dual-model system for our customer-facing chat that demonstrates this pattern perfectly.

The Marketing Agent (The Fast Brain)

Role: Lead capture, basic Q&A, "What does your company do?"

Model: Claude Haiku (via Gateway)

Cost: ~$0.001 per conversation

Why: Speed matters more than depth here. The agent needs to be conversational and responsive. It doesn't need to reason through complex problemsβ€”it needs to answer common questions quickly and capture lead information.

Haiku handles this beautifully. Sub-second responses. Costs almost nothing. Users get a snappy experience.

The Support Agent (The Smart Brain)

Role: Debugging, ticket resolution, knowledge base search, escalation decisions

Model: Claude Sonnet (via Gateway)

Cost: ~$0.01 per conversation

Why: These interactions require understanding complex context, reading documentation, correlating error messages with known issues, and making judgment calls about whether to auto-resolve or escalate.

Sonnet has the reasoning capability to handle this. The 10x cost increase is justified by the 10x complexity of the task.

The Results

By routing simple traffic to Haiku and complex traffic to Sonnet, we reduced our blended AI costs by approximately 90% compared to routing everything through a high-end model.

Not 10%. Not 50%. 90%.

And user satisfaction actually improved, because the simple queries now get faster responses. The PhD isn't stuck answering the phone anymoreβ€”they're working on the hard problems.

How to Build a Router Agent

You don't need complex logic in your application. You move the complexity to configuration.

Step 1: Abstract the Client

Instead of instantiating provider-specific clients, create a unified interface:

// The application doesn't know or care which model it's talking to
const createClient = (taskType: 'marketing' | 'support') => {
  const modelId = taskType === 'marketing'
    ? process.env.MARKETING_MODEL  // e.g., anthropic/claude-haiku
    : process.env.SUPPORT_MODEL;   // e.g., anthropic/claude-sonnet

  return new GatewayClient({ model: modelId });
};

Your application code calls createClient('marketing') or createClient('support'). It never references a specific model. The routing decision is externalized.
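Call sites stay model-agnostic too. Here's a hypothetical usage sketch (the chat method is a stand-in for whatever your gateway client actually exposes):

// Lead capture rides the cheap, fast model...
const marketing = createClient('marketing');
await marketing.chat([{ role: 'user', content: 'What does your company do?' }]);

// ...while debugging gets the smart one, via the same call shape
const support = createClient('support');
await support.chat([{ role: 'user', content: 'Debug my API rate limit issue' }]);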

Step 2: Define Fallbacks

Gateways let you define fallback chains. If Anthropic is experiencing issues, the gateway automatically routes to OpenAI or Gemini without your user noticing.

We implemented createMessageStreamWithFallback() to ensure our agents never go silent:

const response = await createMessageStreamWithFallback({
  primary: 'anthropic/claude-sonnet',
  fallbacks: [
    'openai/gpt-4-turbo',
    'google/gemini-pro'
  ],
  messages: conversation
});

If the primary fails, the gateway tries fallbacks in order. Your user experiences a slightly different "personality" at worstβ€”never a failure.

This is the circuit breaker pattern applied to intelligence itself.
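We won't reproduce our full implementation here, but the heart of a wrapper like this can be a simple loop. A minimal sketch, where callModel is a hypothetical single-model call against the gateway:

// Hypothetical single-model call; your gateway SDK provides the real one
declare function callModel(model: string, messages: unknown[]): Promise<unknown>;

const createMessageStreamWithFallback = async (opts: {
  primary: string;
  fallbacks: string[];
  messages: unknown[];
}) => {
  let lastError: unknown;
  for (const model of [opts.primary, ...opts.fallbacks]) {
    try {
      return await callModel(model, opts.messages); // first success wins
    } catch (err) {
      lastError = err; // provider outage or rate limit: try the next brain
    }
  }
  throw lastError; // every provider in the chain failed
};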

Step 3: Configuration Over Code

Models are managed via environment variables:

# Production Config
MARKETING_MODEL="anthropic/claude-haiku"
SUPPORT_MODEL="anthropic/claude-sonnet"
 
# Fallback Config
MARKETING_MODEL_FALLBACK="openai/gpt-4o-mini"
SUPPORT_MODEL_FALLBACK="openai/gpt-4-turbo"

When a newer, cheaper model releases, we update the environment variable. The next API call uses the new model. No deployment. No code change. No risk.

When Claude Haiku 4.5 dropped, we switched in production within minutes. Instant cost savings, zero engineering effort.
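The lookup code on the other side of those env vars is tiny. A sketch (the names match the config above):

// Resolve the model chain for an agent from the environment.
// New model? Change the env var; the next request picks it up.
const modelChain = (agent: 'marketing' | 'support'): string[] => {
  const prefix = agent.toUpperCase(); // MARKETING_* or SUPPORT_*
  return [
    process.env[`${prefix}_MODEL`],
    process.env[`${prefix}_MODEL_FALLBACK`],
  ].filter((m): m is string => Boolean(m));
};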

Advanced Pattern: Dynamic Complexity Routing

The next evolution is letting the gateway itself determine query complexity.

Instead of hardcoding "marketing = fast, support = smart," you can implement a classifier that examines each query and routes dynamically:

const routeQuery = async (query: string) => {
  // Quick classification (use the cheapest model)
  const complexity = await classify(query, {
    model: 'haiku',
    prompt: `Rate complexity 1-5: ${query}`
  });

  // Route on the score: cheap for trivial, powerful for genuinely hard
  if (complexity <= 2) return 'haiku';
  if (complexity <= 4) return 'sonnet';
  return 'opus';
};

The classifier itself uses the cheapest model. The cost of classification is negligible compared to the savings from proper routing.
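Wired into the Step 1 abstraction, the per-query decision might look like this (a sketch, with GatewayClient as before):

// Dynamic variant of createClient: choose the model per query, not per agent
const createDynamicClient = async (query: string) => {
  const model = await routeQuery(query); // 'haiku' | 'sonnet' | 'opus'
  return new GatewayClient({ model });
};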

We're experimenting with this approach. Early results suggest another 20-30% cost reduction on top of the static routing gains.

The Bigger Picture: Liquid Intelligence

The polymorphic agent pattern reflects a broader truth: intelligence is becoming a utility.

You don't hardcode which power plant generates your electricity. You plug into the grid, and the grid routes power from wherever it's available and cheapest. You pay for what you use.

AI is heading the same direction.

The agents we build shouldn't be married to specific models. They should be model-agnosticβ€”capable of drawing intelligence from whatever source makes sense for the task, the moment, the budget.

Today, that might mean Claude for reasoning and Haiku for chitchat. Tomorrow, it might mean a local model for privacy-sensitive queries and a cloud model for general tasks. Next year, it might mean something we haven't imagined yet.

The gateway architecture makes all of this possible. Decouple your agent from its brain, and you can upgrade the brain whenever a better one comes along.

Conclusion

Stop paying genius prices for phone-answering work.

Build your agents against a gateway, not a specific model. Route simple queries to fast models and complex queries to powerful ones. Define fallbacks so you're never dependent on a single provider. Manage models through configuration, not code.

The result: 90% cost reduction, better resilience, and the flexibility to adopt new models the day they release.

Whether you use Vercel AI Gateway for convenience, OpenRouter for selection, or LiteLLM for controlβ€”the strategy is the same:

Decouple your agent from its brain.


Next up: Building agents that route themselvesβ€”adaptive intelligence that learns which brain to use.