● Official Azure-Samples 🔑 Needs your key

Azure AI Gateway

by Azure-Samples · Azure-Samples/AI-Gateway

Microsoft's APIM-based AI Gateway patterns — route, meter, and govern LLM traffic (including MCP) from Azure API Management.

Azure AI Gateway is a reference-implementation repo from Microsoft showing how to put Azure API Management (APIM) in front of LLM/MCP endpoints for auth, quotas, caching, routing, logging, and circuit-breaking. The MCP server exposes these gateway operations so an agent can configure and inspect them.

Why use it

Key features

Live Demo

What it looks like in practice


Install

Pick your client

~/Library/Application Support/Claude/claude_desktop_config.json  · Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Open Claude Desktop → Settings → Developer → Edit Config. Restart after saving.

~/.cursor/mcp.json · .cursor/mcp.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Cursor uses the same mcpServers schema as Claude Desktop. Project config wins over global.

VS Code → Cline → MCP Servers → Edit
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Click the MCP Servers icon in the Cline sidebar, then "Edit Configuration".

~/.codeium/windsurf/mcp_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Same shape as Claude Desktop. Restart Windsurf to pick up changes.

~/.continue/config.json
{
  "mcpServers": [
    {
      "name": "azure-ai-gateway",
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  ]
}

Continue uses an array of server objects rather than a map.

~/.config/zed/settings.json
{
  "context_servers": {
    "azure-ai-gateway": {
      "command": {
        "path": "uvx",
        "args": [
          "azure-ai-gateway-mcp"
        ]
      }
    }
  }
}

Add to context_servers. Zed hot-reloads on save.

claude mcp add azure-ai-gateway -- uvx azure-ai-gateway-mcp

One-liner. Verify with claude mcp list. Remove with claude mcp remove.

Use Cases

Real-world ways to use Azure AI Gateway

Enforce per-team token quotas across Azure OpenAI deployments

👤 Central platform teams governing LLM spend · ⏱ ~30 min · advanced

When to use: Multiple product teams share AOAI; one team's runaway loop shouldn't burn the shared TPM budget.

Prerequisites
  • APIM instance with the AI-Gateway patterns applied — Deploy the reference architecture from the Azure-Samples/AI-Gateway repo
  • APIM subscription key per team — Each team gets a distinct APIM subscription (key) they include in the Ocp-Apim-Subscription-Key header
Flow
  1. Review current quotas
    List APIM subscriptions with their current TPM and RPM quotas for the AOAI product.
    → Per-team quota table
  2. Adjust a noisy team down
    Team 'growth' is at 90% TPM burn daily. Reduce their quota from 200k → 100k TPM. Keep others unchanged.
    → Quota updated; confirmation
  3. Monitor after the change
    Over the next hour, pull 429 (rate-limited) counts per subscription. Confirm growth is being shaped but prod-critical teams aren't affected.
    → Enforcement visible in metrics

Outcome: Controlled shared AOAI spend without throttling legitimate high-priority traffic.

Pitfalls
  • Setting quotas too low starves legitimate workloads — Roll out in shadow mode first (log-only), then enforce once you understand real patterns
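Under the hood, the quota flow above maps to an APIM token-limit policy on the AOAI API. A minimal sketch, assuming the documented `azure-openai-token-limit` policy; the 100k TPM figure mirrors the example above, and the counter key and header name are illustrative choices, not taken from the repo:

```xml
<!-- Sketch: per-subscription token quota on the AOAI API (inbound section).
     tokens-per-minute mirrors the 100k example above; counter-key and the
     response header name are illustrative. -->
<inbound>
    <base />
    <!-- Meter tokens per APIM subscription; APIM answers 429 once the
         per-minute budget is exhausted. -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="100000"
        estimate-prompt-tokens="true"
        remaining-tokens-header-name="x-remaining-tokens" />
</inbound>
```

Keying the counter on `context.Subscription.Id` is what makes the quota per-team: each team's distinct APIM subscription key gets its own bucket.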

Configure multi-region failover for an Azure OpenAI deployment

👤 SREs running production AI workloads · ⏱ ~45 min · advanced

When to use: A regional AOAI outage (uncommon but real) should fail over transparently to another region.

Prerequisites
  • AOAI deployments in ≥2 regions (e.g. East US, West Europe) — Provision via Azure portal; match model + version
Flow
  1. Inspect current backend pool
    Show the APIM backend pool for our AOAI API. How many backends, priority, circuit-breaker config?
    → Current pool config
  2. Add a secondary region
    Add the West Europe AOAI endpoint as priority=2 with circuit-breaker: 5 failures in 1 min → open for 5 min. Keep East US as primary.
    → Pool updated, 2 backends configured
  3. Test failover
    Simulate primary outage by disabling the East US backend for 2 min. Confirm traffic shifts to West Europe, then rollback.
    → Traffic shift observed; rollback verified

Outcome: Transparent failover with evidence it works before you need it.

Pitfalls
  • Different regions have different deployed model versions — Pin to a model version that exists in both regions; mismatched versions silently return different-quality output
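In APIM, the priorities and circuit-breaker rules for a flow like this live on the backend pool resource itself; the API policy only needs to target the pool and retry on failure so the next backend can take over. A hedged sketch (the `aoai-backend-pool` ID is illustrative):

```xml
<!-- Sketch: route to a prioritized backend pool and retry on throttling
     or server errors so the pool can fail over. Pool ID is illustrative;
     backend priorities and circuit breakers are configured on the backend
     resources, not in this policy. -->
<inbound>
    <base />
    <set-backend-service backend-id="aoai-backend-pool" />
</inbound>
<backend>
    <retry condition="@(context.Response != null && (context.Response.StatusCode == 429 || context.Response.StatusCode >= 500))"
           count="2" interval="1" first-fast-retry="true">
        <!-- Buffer the request body so it can be re-sent to the
             secondary backend on retry. -->
        <forward-request buffer-request-body="true" />
    </retry>
</backend>
```

When the circuit on the primary opens (5 failures in 1 min in the example above), the pool routes new requests to the priority=2 backend until the circuit closes again.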

Deploy semantic caching to reduce repeat prompt costs

👤 Cost-conscious platform teams · ⏱ ~30 min · advanced

When to use: Your users ask similar questions over and over; 30–60% of calls are effectively cache hits.

Flow
  1. Turn on semantic cache policy
    Enable the APIM semantic-cache-lookup policy on the AOAI completions API with similarity threshold 0.95, TTL 1h.
    → Policy applied
  2. Observe hit rate
    After 24h, pull cache hit rate and token savings from App Insights.
    → Hit rate % + tokens saved
  3. Tune threshold
    If hit rate is <20%, lower the threshold to 0.92 and observe again. If quality complaints arise, raise it back to 0.97.
    → Iterative tuning with measurements

Outcome: Measured cost savings on repeat queries without degrading output quality.

Pitfalls
  • Over-aggressive caching serves wrong answers for similar-but-different questions — Start high (0.97) and only lower based on observed quality
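The cache policy in step 1 corresponds to APIM's semantic-cache lookup/store policy pair. A minimal sketch, assuming an embeddings backend has already been registered in APIM; the `embeddings-backend` ID is illustrative, while the 0.95 threshold and 1-hour TTL mirror the example above:

```xml
<!-- Sketch: semantic cache for AOAI completions. score-threshold and the
     1h TTL mirror the example above; the embeddings backend ID is
     illustrative. -->
<inbound>
    <base />
    <azure-openai-semantic-cache-lookup
        score-threshold="0.95"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned">
        <!-- Partition the cache per APIM subscription so one team's
             cached answers are never served to another team. -->
        <vary-by>@(context.Subscription.Id)</vary-by>
    </azure-openai-semantic-cache-lookup>
</inbound>
<outbound>
    <base />
    <!-- Store responses for 1 hour (duration is in seconds). -->
    <azure-openai-semantic-cache-store duration="3600" />
</outbound>
```

The `vary-by` partitioning is also the lever for the pitfall above: scoping cache entries narrowly reduces the chance a similar-but-different question from another context gets a stale hit.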

Combinations

Pair with other MCPs for 10x leverage

azure-ai-gateway + sentry

Correlate APIM 5xx spikes with application-side errors

If Sentry shows a 5xx spike in app X at 10:02, pull APIM metrics for the same window and identify whether the gateway caused it.
azure-ai-gateway + notion

Weekly AI-traffic governance report to Notion

Compile per-team TPM usage for the week, 429 counts, and cache hit rate; post the summary as a Notion page.

Tools

What this MCP exposes

| Tool | Inputs | When to call | Cost |
| --- | --- | --- | --- |
| list_subscriptions | product_id? | Inventory teams consuming the gateway | free (ARM API call) |
| update_quota | subscription_id, tpm?, rpm? | Adjust a team's token/request limits | free |
| get_backend_pool | api_id | Inspect routing and failover config | free |
| update_backend_pool | api_id, backends, policies | Change priorities, circuit breakers, load balancing | free |
| apply_policy | api_id, policy_xml | Deploy an APIM policy (cache, auth, logging) | free |
| get_metrics | api_id, since, until | Observe traffic shape per API | free |

Cost & Limits

What this costs to run

API quota
Azure Resource Manager rate limits (generous per tenant)
Tokens per call
Policy/backend-pool reads: 500–2000 tokens
Monetary
APIM pricing starts at roughly $50/mo (Developer tier, not for production); Standard tier recommended for prod
Tip
Semantic caching usually pays for APIM's cost many times over if your traffic repeats. Measure the hit rate to justify the spend.

Security

Permissions, secrets, blast radius

Minimum scopes: APIM Contributor on the target APIM instance
Credential storage: Azure service principal credentials (client id/secret/tenant) in env
Data egress: ARM API calls to management.azure.com; prompt/response bodies traverse APIM itself
Never grant: Owner on the subscription; Global Administrator

Troubleshooting

Common errors and fixes

401 on ARM API calls

Service principal lacks the APIM Contributor role on the resource group. Grant it via the portal or the az CLI.

Verify: az role assignment list --assignee <sp>
Policy apply fails — XML validation error

APIM policy XML is strict; use the portal's policy editor to validate, then copy-paste.

429s persist after raising TPM quota

Underlying AOAI deployment itself may be the bottleneck. Check deployment TPM, not just APIM subscription TPM.

Semantic cache hit rate is 0%

Embedding backend for cache-lookup not configured; check the cache policy's embeddings reference.

Alternatives

Azure AI Gateway vs others

| Alternative | When to use it instead | Tradeoff |
| --- | --- | --- |
| Cloudflare AI Gateway | You're on Cloudflare and want multi-provider LLM routing out of the box | Less deep integration with Azure-hosted models |
| Portkey / LiteLLM | You want a provider-agnostic gateway with a dashboard | Third-party SaaS; data leaves your cloud |

More

Resources

📖 Read the official README on GitHub

🐙 Browse open issues

🔍 Browse all 400+ MCP servers and Skills