● Official · Azure-Samples 🔑 Requires your own key

Azure AI Gateway

by Azure-Samples · Azure-Samples/AI-Gateway

Microsoft's APIM-based AI Gateway patterns — route, meter, and govern LLM traffic (including MCP) from Azure API Management.

Azure AI Gateway is a reference-implementation repo from Microsoft showing how to put Azure API Management (APIM) in front of LLM/MCP endpoints for auth, quota, caching, routing, logging, and circuit-breaking. The MCP server exposes these gateway operations so an agent can configure and inspect them.


Install

Choose your client

~/Library/Application Support/Claude/claude_desktop_config.json  · Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Open Claude Desktop → Settings → Developer → Edit Config. Restart after saving.

~/.cursor/mcp.json · .cursor/mcp.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Cursor uses the same mcpServers schema as Claude Desktop. The project-level config overrides the global one.

VS Code → Cline → MCP Servers → Edit
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Click the MCP Servers icon in the Cline sidebar, then "Edit Configuration".

~/.codeium/windsurf/mcp_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Same structure as Claude Desktop. Restart Windsurf to apply.

~/.continue/config.json
{
  "mcpServers": [
    {
      "name": "azure-ai-gateway",
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  ]
}

Continue uses an array of server objects instead of a map.

~/.config/zed/settings.json
{
  "context_servers": {
    "azure-ai-gateway": {
      "command": {
        "path": "uvx",
        "args": [
          "azure-ai-gateway-mcp"
        ]
      }
    }
  }
}

Add it under context_servers. Zed reloads on save.

claude mcp add azure-ai-gateway -- uvx azure-ai-gateway-mcp

One-liner. Verify with claude mcp list. Remove with claude mcp remove azure-ai-gateway.

Use cases

Hands-on usage: Azure AI Gateway

Enforce per-team token quotas across Azure OpenAI deployments

👤 Central platform teams governing LLM spend · ⏱ ~30 min · advanced

When to use: Multiple product teams share AOAI; one team's runaway loop shouldn't burn the shared TPM budget.

Prerequisites
  • APIM instance with the AI-Gateway patterns applied — Deploy the reference architecture from the Azure-Samples/AI-Gateway repo
  • APIM subscription key per team — Each team gets a distinct APIM subscription (key) they include in the Ocp-Apim-Subscription-Key header
Steps
  1. Review current quotas
    List APIM subscriptions with their current TPM and RPM quotas for the AOAI product.
    → Per-team quota table
  2. Adjust a noisy team down
    Team 'growth' is at 90% TPM burn daily. Reduce their quota from 200k → 100k TPM. Keep others unchanged.
    → Quota updated; confirmation
  3. Monitor after the change
    Over the next hour, pull 429 (rate-limited) counts per subscription. Confirm growth is being shaped but prod-critical teams aren't affected.
    → Enforcement visible in metrics

Result: Shared AOAI spend stays under control without throttling legitimate high-priority traffic.

Pitfalls
  • Setting quotas too low starves legitimate workloads — Roll out in shadow mode first (log-only), then enforce once you understand real patterns
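The per-team limits above rest on APIM's token-limit policy. A minimal sketch of what such a policy could look like, assuming the `azure-openai-token-limit` policy from Microsoft's APIM policy reference (verify attribute names against your APIM API version; the 100k value mirrors step 2):

```xml
<policies>
  <inbound>
    <base />
    <!-- Count tokens per APIM subscription; reject with 429 once the
         per-minute budget is spent. estimate-prompt-tokens lets APIM
         estimate usage instead of waiting for the backend to report it. -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="100000"
        estimate-prompt-tokens="true" />
  </inbound>
</policies>
```

Because the counter key is the subscription id, each team's key gets its own budget without separate policy copies.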

Configure multi-region failover for an Azure OpenAI deployment

👤 SREs running production AI workloads · ⏱ ~45 min · advanced

When to use: A regional AOAI outage (uncommon but real) should fail over transparently to another region.

Prerequisites
  • AOAI deployments in ≥2 regions (e.g. East US, West Europe) — Provision via Azure portal; match model + version
Steps
  1. Inspect the current backend pool
    Show the APIM backend pool for our AOAI API. How many backends, priority, circuit-breaker config?
    → Current pool config
  2. Add a secondary region
    Add the West Europe AOAI endpoint as priority=2 with circuit-breaker: 5 failures in 1 min → open for 5 min. Keep East US as primary.
    → Pool updated, 2 backends configured
  3. Test failover
    Simulate a primary outage by disabling the East US backend for 2 min. Confirm traffic shifts to West Europe, then roll back.
    → Traffic shift observed; rollback verified

Result: Transparent failover, with evidence that it works before you need it.

Pitfalls
  • Different regions have different deployed model versions — Pin to a model version that exists in both regions; mismatched versions silently return different quality
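The priority and circuit-breaker settings from step 2 correspond to an ARM backend resource. A hedged sketch of the West Europe backend body, assuming the `Microsoft.ApiManagement/service/backends` circuit-breaker schema (field names vary by ARM API version; the endpoint URL is illustrative):

```json
{
  "properties": {
    "url": "https://westeu-aoai.openai.azure.com",
    "protocol": "http",
    "circuitBreaker": {
      "rules": [
        {
          "name": "aoai-5xx",
          "failureCondition": {
            "count": 5,
            "interval": "PT1M",
            "statusCodeRanges": [ { "min": 500, "max": 599 } ]
          },
          "tripDuration": "PT5M"
        }
      ]
    }
  }
}
```

The rule encodes "5 failures in 1 min → open for 5 min" from step 2; intervals are ISO 8601 durations.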

Deploy semantic caching to reduce repeat prompt costs

👤 Cost-conscious platform teams · ⏱ ~30 min · advanced

When to use: Your users ask similar questions over and over; 30–60% of calls are effectively cache hits.

Steps
  1. Turn on the semantic cache policy
    Enable the APIM semantic-cache-lookup policy on the AOAI completions API with similarity threshold 0.95, TTL 1h.
    → Policy applied
  2. Observe the hit rate
    After 24h, pull the cache hit rate and token savings from App Insights.
    → Hit rate % + tokens saved
  3. Tune the threshold
    If the hit rate is <20%, lower the threshold to 0.92 and observe again. If quality complaints appear, raise it back to 0.97.
    → Iterative tuning with measurements

Result: Measured cost savings on repeat queries without degrading output quality.

Pitfalls
  • Over-aggressive caching serves wrong answers for similar-but-different questions — Start high (0.97) and only lower based on observed quality
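Steps 1 and 3 translate into a pair of APIM policies. A sketch assuming the `azure-openai-semantic-cache-lookup` and `azure-openai-semantic-cache-store` policies from the APIM policy reference (the `embeddings-backend-id` value is a placeholder you must point at a real embeddings deployment):

```xml
<policies>
  <inbound>
    <base />
    <!-- Return a cached completion when a semantically similar prompt
         was answered before. score-threshold 0.95 is the similarity
         threshold from step 1; tune per step 3. -->
    <azure-openai-semantic-cache-lookup
        score-threshold="0.95"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned" />
  </inbound>
  <outbound>
    <base />
    <!-- Store fresh completions with a 1h TTL (duration in seconds). -->
    <azure-openai-semantic-cache-store duration="3600" />
  </outbound>
</policies>
```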

Combinations

With other MCPs for 10x impact

azure-ai-gateway + sentry

Correlate APIM 5xx spikes with application-side errors

If Sentry shows a 5xx spike in app X at 10:02, pull APIM metrics for the same window and identify whether the gateway caused it.

azure-ai-gateway + notion

Weekly AI-traffic governance report to Notion

Compile per-team TPM usage for the week, 429 counts, cache hit rate; post as a Notion page.

Tools

What this MCP provides

Tool | Inputs | When to call | Cost
list_subscriptions | product_id? | Inventory the teams consuming the gateway | free (ARM API call)
update_quota | subscription_id, tpm?, rpm? | Adjust a team's token/request limits | free
get_backend_pool | api_id | Inspect routing and failover config | free
update_backend_pool | api_id, backends, policies | Change priorities, circuit breakers, load balancing | free
apply_policy | api_id, policy_xml | Deploy APIM policy (cache, auth, logging) | free
get_metrics | api_id, since, until | Observe traffic shape per API | free
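These tools presumably wrap plain ARM calls against the APIM resource. A hypothetical sketch of the resource path a tool like update_quota would operate on (the helper name and wrapping are illustrative, not part of the MCP):

```python
def apim_subscription_url(azure_sub: str, resource_group: str,
                          service_name: str, apim_sub: str,
                          api_version: str = "2022-08-01") -> str:
    """Build the ARM resource URL for one APIM subscription.

    Illustrative only: shows the Microsoft.ApiManagement resource path
    that list_subscriptions / update_quota would read and PATCH.
    """
    return (
        "https://management.azure.com"
        f"/subscriptions/{azure_sub}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.ApiManagement"
        f"/service/{service_name}"
        f"/subscriptions/{apim_sub}"
        f"?api-version={api_version}"
    )

print(apim_subscription_url("azure-sub-id", "rg-ai", "apim-prod", "team-growth"))
```

Note that "subscriptions" appears twice with different meanings: the Azure subscription that owns the APIM instance, and the APIM subscription (the per-team key) being governed.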

Costs & Limits

What it costs to run

API quota
Azure Resource Manager rate limits (generous per tenant)
Tokens per call
Policy/backend-pool reads: 500–2000 tokens
Cost
APIM pricing starts at ~$40/mo (Basic tier); Standard tier recommended for prod
Tip
Semantic caching usually pays for APIM's cost many times over if your traffic repeats. Measure hit rate to justify.
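That tip can be checked with simple arithmetic. A sketch with entirely made-up numbers (traffic volume, hit rate, and token price are assumptions; substitute your own measurements):

```python
def monthly_cache_savings(calls_per_month: int, avg_tokens_per_call: int,
                          hit_rate: float, price_per_1k_tokens: float) -> float:
    """Dollar value of tokens the semantic cache kept away from AOAI."""
    saved_tokens = calls_per_month * avg_tokens_per_call * hit_rate
    return saved_tokens / 1000 * price_per_1k_tokens

# Assumed: 1M calls/mo, 1500 tokens/call, 30% hit rate, $0.002 per 1k
# tokens. All illustrative; measure your real hit rate first.
savings = monthly_cache_savings(1_000_000, 1500, 0.30, 0.002)
apim_basic = 40.0  # ~$40/mo Basic tier, per the pricing note above
print(f"${savings:.0f} saved vs ${apim_basic:.0f} APIM cost")
```

Even at a modest hit rate, savings of this shape dwarf the Basic-tier fee; at low traffic the math can flip, which is why the tip says to measure.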

Security

Permissions, secrets, scope

Minimal scopes: APIM Contributor on the target APIM instance
Credential storage: Azure service principal credentials (client id/secret/tenant) in env
Data egress: ARM API calls to management.azure.com; prompt/response bodies traverse APIM itself
Never grant: Owner on the subscription; Global Admin
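If the server picks up a service principal from the environment (the variable names below follow the azure-identity convention; whether this particular server reads them is an assumption), the client config can carry the credentials in an env block:

```json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": ["azure-ai-gateway-mcp"],
      "env": {
        "AZURE_TENANT_ID": "<tenant-id>",
        "AZURE_CLIENT_ID": "<sp-client-id>",
        "AZURE_CLIENT_SECRET": "<sp-client-secret>"
      }
    }
  }
}
```

Prefer a secret manager or OS keychain over committing this file; anything in the config is readable by every tool the client runs.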

Troubleshooting

Common errors and fixes

401 on ARM API calls

The service principal lacks the APIM Contributor role on the resource group. Grant it via the portal or the az CLI.

Check: az role assignment list --assignee <sp>

Policy apply fails — XML validation error

APIM policy XML is strict; use the portal's policy editor to validate, then copy-paste.

429s persist after raising TPM quota

The underlying AOAI deployment itself may be the bottleneck. Check the deployment's TPM, not just the APIM subscription TPM.

Semantic cache hit rate is 0%

The embeddings backend for cache lookup is not configured; check the cache policy's embeddings reference.

Alternatives

Azure AI Gateway vs. others

Alternative | When instead | Trade-off
Cloudflare AI Gateway | You're on Cloudflare and want multi-provider LLM routing out of the box | Less deep integration with Azure-hosted models
Portkey / LiteLLM | You want a provider-agnostic gateway with a dashboard | Third-party SaaS; data leaves your cloud

More

Resources

📖 Read the official README on GitHub

🐙 View open issues

🔍 Browse all 400+ MCP servers and skills