● Official Azure-Samples 🔑 Needs your key

Azure AI Gateway

by Azure-Samples · Azure-Samples/AI-Gateway

Microsoft's APIM-based AI Gateway patterns — route, meter, and govern LLM traffic (including MCP) from Azure API Management.

Azure AI Gateway is a reference-implementation repo from Microsoft showing how to put Azure API Management (APIM) in front of LLM/MCP endpoints for auth, quotas, caching, routing, logging, and circuit-breaking. The MCP server exposes these gateway operations so an agent can configure and inspect them.

Why use it

Key features

Live Demo

What it looks like in practice


Install

Pick your client

~/Library/Application Support/Claude/claude_desktop_config.json  · Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Open Claude Desktop → Settings → Developer → Edit Config. Restart after saving.

~/.cursor/mcp.json · .cursor/mcp.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Cursor uses the same mcpServers schema as Claude Desktop. Project config wins over global.

VS Code → Cline → MCP Servers → Edit
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Click the MCP Servers icon in the Cline sidebar, then "Edit Configuration".

~/.codeium/windsurf/mcp_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Same shape as Claude Desktop. Restart Windsurf to pick up changes.

~/.continue/config.json
{
  "mcpServers": [
    {
      "name": "azure-ai-gateway",
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  ]
}

Continue uses an array of server objects rather than a map.

~/.config/zed/settings.json
{
  "context_servers": {
    "azure-ai-gateway": {
      "command": {
        "path": "uvx",
        "args": [
          "azure-ai-gateway-mcp"
        ]
      }
    }
  }
}

Add to context_servers. Zed hot-reloads on save.

claude mcp add azure-ai-gateway -- uvx azure-ai-gateway-mcp

One-liner. Verify with claude mcp list. Remove with claude mcp remove.

Use Cases

Real-world ways to use Azure AI Gateway

Enforce per-team token quotas across Azure OpenAI deployments

👤 Central platform teams governing LLM spend · ⏱ ~30 min · advanced

When to use: Multiple product teams share AOAI; one team's runaway loop shouldn't burn the shared TPM budget.

Prerequisites
  • APIM instance with the AI-Gateway patterns applied — Deploy the reference architecture from the Azure-Samples/AI-Gateway repo
  • APIM subscription key per team — Each team gets a distinct APIM subscription (key) they include in the Ocp-Apim-Subscription-Key header
Flow
  1. Review current quotas
    List APIM subscriptions with their current TPM and RPM quotas for the AOAI product.
    → Per-team quota table
  2. Adjust a noisy team down
    Team 'growth' is at 90% TPM burn daily. Reduce their quota from 200k → 100k TPM. Keep others unchanged.
    → Quota updated; confirmation
  3. Monitor after the change
    Over the next hour, pull 429 (rate-limited) counts per subscription. Confirm growth is being shaped but prod-critical teams aren't affected.
    → Enforcement visible in metrics

Outcome: Controlled shared AOAI spend without throttling legitimate high-priority traffic.

Pitfalls
  • Setting quotas too low starves legitimate workloads — Roll out in shadow mode first (log-only), then enforce once you understand real patterns
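Under the hood, the quota flow above maps to an APIM token-limit policy on the AOAI API. A minimal sketch, assuming the documented `azure-openai-token-limit` policy; the 100k TPM figure mirrors the example above, and the counter key and header name are illustrative choices, not taken from the repo:

```xml
<!-- Sketch: per-subscription token quota on the AOAI API (inbound section).
     tokens-per-minute mirrors the 100k example above; counter-key and the
     response header name are illustrative. -->
<inbound>
    <base />
    <!-- Meter tokens per APIM subscription; APIM answers 429 once the
         per-minute budget is exhausted. -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="100000"
        estimate-prompt-tokens="true"
        remaining-tokens-header-name="x-remaining-tokens" />
</inbound>
```

Keying the counter on `context.Subscription.Id` is what makes the quota per-team: each team's distinct APIM subscription key gets its own bucket.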

Configure multi-region failover for an Azure OpenAI deployment

👤 SREs running production AI workloads · ⏱ ~45 min · advanced

When to use: A regional AOAI outage (uncommon but real) should fail over transparently to another region.

Prerequisites
  • AOAI deployments in ≥2 regions (e.g. East US, West Europe) — Provision via Azure portal; match model + version
Flow
  1. Inspect current backend pool
    Show the APIM backend pool for our AOAI API. How many backends, priority, circuit-breaker config?
    → Current pool config
  2. Add a secondary region
    Add the West Europe AOAI endpoint as priority=2 with circuit-breaker: 5 failures in 1 min → open for 5 min. Keep East US as primary.
    → Pool updated, 2 backends configured
  3. Test failover
    Simulate primary outage by disabling the East US backend for 2 min. Confirm traffic shifts to West Europe, then rollback.
    → Traffic shift observed; rollback verified

Outcome: Transparent failover with evidence it works before you need it.

Pitfalls
  • Different regions have different deployed model versions — Pin to a model version that exists in both regions; mismatched versions silently return different-quality output
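In APIM, the priorities and circuit-breaker rules for a flow like this live on the backend pool resource itself; the API policy only needs to target the pool and retry on failure so the next backend can take over. A hedged sketch (the `aoai-backend-pool` ID is illustrative):

```xml
<!-- Sketch: route to a prioritized backend pool and retry on throttling
     or server errors so the pool can fail over. Pool ID is illustrative;
     backend priorities and circuit breakers are configured on the backend
     resources, not in this policy. -->
<inbound>
    <base />
    <set-backend-service backend-id="aoai-backend-pool" />
</inbound>
<backend>
    <retry condition="@(context.Response != null && (context.Response.StatusCode == 429 || context.Response.StatusCode >= 500))"
           count="2" interval="1" first-fast-retry="true">
        <!-- Buffer the request body so it can be re-sent to the
             secondary backend on retry. -->
        <forward-request buffer-request-body="true" />
    </retry>
</backend>
```

When the circuit on the primary opens (5 failures in 1 min in the example above), the pool routes new requests to the priority=2 backend until the circuit closes again.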

Deploy semantic caching to reduce repeat prompt costs

👤 Cost-conscious platform teams · ⏱ ~30 min · advanced

When to use: Your users ask similar questions over and over; 30–60% of calls are effectively cache hits.

Flow
  1. Turn on semantic cache policy
    Enable the APIM semantic-cache-lookup policy on the AOAI completions API with similarity threshold 0.95, TTL 1h.
    → Policy applied
  2. Observe hit rate
    After 24h, pull cache hit rate and token savings from App Insights.
    → Hit rate % + tokens saved
  3. Tune threshold
    If hit rate is <20%, lower the threshold to 0.92 and observe again. If quality complaints arise, raise it back to 0.97.
    → Iterative tuning with measurements

Outcome: Measured cost savings on repeat queries without degrading output quality.

Pitfalls
  • Over-aggressive caching serves wrong answers for similar-but-different questions — Start high (0.97) and only lower based on observed quality
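The cache policy in step 1 corresponds to APIM's semantic-cache lookup/store policy pair. A minimal sketch, assuming an embeddings backend has already been registered in APIM; the `embeddings-backend` ID is illustrative, while the 0.95 threshold and 1-hour TTL mirror the example above:

```xml
<!-- Sketch: semantic cache for AOAI completions. score-threshold and the
     1h TTL mirror the example above; the embeddings backend ID is
     illustrative. -->
<inbound>
    <base />
    <azure-openai-semantic-cache-lookup
        score-threshold="0.95"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned">
        <!-- Partition the cache per APIM subscription so one team's
             cached answers are never served to another team. -->
        <vary-by>@(context.Subscription.Id)</vary-by>
    </azure-openai-semantic-cache-lookup>
</inbound>
<outbound>
    <base />
    <!-- Store responses for 1 hour (duration is in seconds). -->
    <azure-openai-semantic-cache-store duration="3600" />
</outbound>
```

The `vary-by` partitioning is also the lever for the pitfall above: scoping cache entries narrowly reduces the chance a similar-but-different question from another context gets a stale hit.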

Combinations

Pair with other MCPs for 10x leverage

azure-ai-gateway + sentry

Correlate APIM 5xx spikes with application-side errors

If Sentry shows a 5xx spike in app X at 10:02, pull APIM metrics for the same window and identify whether the gateway caused it.
azure-ai-gateway + notion

Weekly AI-traffic governance report to Notion

Compile per-team TPM usage for the week, 429 counts, and cache hit rate; post the summary as a Notion page.

Tools

What this MCP exposes

| Tool | Inputs | When to call | Cost |
| --- | --- | --- | --- |
| list_subscriptions | product_id? | Inventory teams consuming the gateway | free (ARM API call) |
| update_quota | subscription_id, tpm?, rpm? | Adjust a team's token/request limits | free |
| get_backend_pool | api_id | Inspect routing and failover config | free |
| update_backend_pool | api_id, backends, policies | Change priorities, circuit breakers, load balancing | free |
| apply_policy | api_id, policy_xml | Deploy an APIM policy (cache, auth, logging) | free |
| get_metrics | api_id, since, until | Observe traffic shape per API | free |

Cost & Limits

What this costs to run

API quota
Azure Resource Manager rate limits (generous per tenant)
Tokens per call
Policy/backend-pool reads: 500–2000 tokens
Monetary
APIM pricing starts at roughly $50/mo (Developer tier, not for production); Standard tier recommended for prod
Tip
Semantic caching usually pays for APIM's cost many times over if your traffic repeats. Measure the hit rate to justify the spend.

Security

Permissions, secrets, blast radius

Minimum scopes: APIM Contributor on the target APIM instance
Credential storage: Azure service principal credentials (client id/secret/tenant) in env
Data egress: ARM API calls to management.azure.com; prompt/response bodies traverse APIM itself
Never grant: Owner on the subscription; Global Administrator

Troubleshooting

Common errors and fixes

401 on ARM API calls

Service principal lacks the APIM Contributor role on the resource group. Grant it via the portal or the az CLI.

Verify: az role assignment list --assignee <sp>
Policy apply fails — XML validation error

APIM policy XML is strict; use the portal's policy editor to validate, then copy-paste.

429s persist after raising TPM quota

Underlying AOAI deployment itself may be the bottleneck. Check deployment TPM, not just APIM subscription TPM.

Semantic cache hit rate is 0%

Embedding backend for cache-lookup not configured; check the cache policy's embeddings reference.

Alternatives

Azure AI Gateway vs others

| Alternative | When to use it instead | Tradeoff |
| --- | --- | --- |
| Cloudflare AI Gateway | You're on Cloudflare and want multi-provider LLM routing out of the box | Less deep integration with Azure-hosted models |
| Portkey / LiteLLM | You want a provider-agnostic gateway with a dashboard | Third-party SaaS; data leaves your cloud |

More

Resources

📖 Read the official README on GitHub

🐙 Browse open issues

🔍 Browse all 400+ MCP servers and Skills