● Official · Azure-Samples 🔑 Requires your own key

Azure AI Gateway

by Azure-Samples · Azure-Samples/AI-Gateway

Microsoft's APIM-based AI Gateway patterns — route, meter, and govern LLM traffic (including MCP) from Azure API Management.

Azure AI Gateway is a reference-implementation repo from Microsoft showing how to put Azure API Management (APIM) in front of LLM/MCP endpoints for auth, quota, caching, routing, logging, and circuit-breaking. The MCP server exposes these gateway operations so an agent can configure and inspect them.


Install

Choose your client

~/Library/Application Support/Claude/claude_desktop_config.json  · Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Open Claude Desktop → Settings → Developer → Edit Config. Restart after saving.

~/.cursor/mcp.json · .cursor/mcp.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Cursor uses the same mcpServers schema as Claude Desktop. The project-level config overrides the global one.

VS Code → Cline → MCP Servers → Edit
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Click the MCP Servers icon in the Cline sidebar, then "Edit Configuration".

~/.codeium/windsurf/mcp_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Same structure as Claude Desktop. Restart Windsurf to apply.

~/.continue/config.json
{
  "mcpServers": [
    {
      "name": "azure-ai-gateway",
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  ]
}

Continue uses an array of server objects instead of a map.

~/.config/zed/settings.json
{
  "context_servers": {
    "azure-ai-gateway": {
      "command": {
        "path": "uvx",
        "args": [
          "azure-ai-gateway-mcp"
        ]
      }
    }
  }
}

Add it under context_servers. Zed reloads on save.

claude mcp add azure-ai-gateway -- uvx azure-ai-gateway-mcp

One-liner. Verify with claude mcp list. Remove with claude mcp remove azure-ai-gateway.

Use cases

Hands-on usage: Azure AI Gateway

Enforce per-team token quotas across Azure OpenAI deployments

👤 Central platform teams governing LLM spend · ⏱ ~30 min · advanced

When to use: Multiple product teams share AOAI; one team's runaway loop shouldn't burn the shared TPM budget.

Prerequisites
  • APIM instance with the AI-Gateway patterns applied — Deploy the reference architecture from the Azure-Samples/AI-Gateway repo
  • APIM subscription key per team — Each team gets a distinct APIM subscription (key) they include in the Ocp-Apim-Subscription-Key header
Steps
  1. Review current quotas
    List APIM subscriptions with their current TPM and RPM quotas for the AOAI product.
    → Per-team quota table
  2. Adjust a noisy team down
    Team 'growth' is at 90% TPM burn daily. Reduce their quota from 200k → 100k TPM. Keep others unchanged.
    → Quota updated; confirmation
  3. Monitor after the change
    Over the next hour, pull 429 (rate-limited) counts per subscription. Confirm growth is being shaped but prod-critical teams aren't affected.
    → Enforcement visible in metrics

Result: Shared AOAI spend stays under control without throttling legitimate high-priority traffic.

Pitfalls
  • Setting quotas too low starves legitimate workloads — Roll out in shadow mode first (log-only), then enforce once you understand real patterns
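The per-team limits above rest on APIM's token-limit policy. A minimal sketch of what such a policy could look like, assuming the `azure-openai-token-limit` policy from Microsoft's APIM policy reference (verify attribute names against your APIM API version; the 100k value mirrors step 2):

```xml
<policies>
  <inbound>
    <base />
    <!-- Count tokens per APIM subscription; reject with 429 once the
         per-minute budget is spent. estimate-prompt-tokens lets APIM
         estimate usage instead of waiting for the backend to report it. -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="100000"
        estimate-prompt-tokens="true" />
  </inbound>
</policies>
```

Because the counter key is the subscription id, each team's key gets its own budget without separate policy copies.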

Configure multi-region failover for an Azure OpenAI deployment

👤 SREs running production AI workloads · ⏱ ~45 min · advanced

When to use: A regional AOAI outage (uncommon but real) should fail over transparently to another region.

Prerequisites
  • AOAI deployments in ≥2 regions (e.g. East US, West Europe) — Provision via Azure portal; match model + version
Steps
  1. Inspect the current backend pool
    Show the APIM backend pool for our AOAI API. How many backends, priority, circuit-breaker config?
    → Current pool config
  2. Add a secondary region
    Add the West Europe AOAI endpoint as priority=2 with circuit-breaker: 5 failures in 1 min → open for 5 min. Keep East US as primary.
    → Pool updated, 2 backends configured
  3. Test failover
    Simulate a primary outage by disabling the East US backend for 2 min. Confirm traffic shifts to West Europe, then roll back.
    → Traffic shift observed; rollback verified

Result: Transparent failover, with evidence that it works before you need it.

Pitfalls
  • Different regions have different deployed model versions — Pin to a model version that exists in both regions; mismatched versions silently return different quality
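The priority and circuit-breaker settings from step 2 correspond to an ARM backend resource. A hedged sketch of the West Europe backend body, assuming the `Microsoft.ApiManagement/service/backends` circuit-breaker schema (field names vary by ARM API version; the endpoint URL is illustrative):

```json
{
  "properties": {
    "url": "https://westeu-aoai.openai.azure.com",
    "protocol": "http",
    "circuitBreaker": {
      "rules": [
        {
          "name": "aoai-5xx",
          "failureCondition": {
            "count": 5,
            "interval": "PT1M",
            "statusCodeRanges": [ { "min": 500, "max": 599 } ]
          },
          "tripDuration": "PT5M"
        }
      ]
    }
  }
}
```

The rule encodes "5 failures in 1 min → open for 5 min" from step 2; intervals are ISO 8601 durations.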

Deploy semantic caching to reduce repeat prompt costs

👤 Cost-conscious platform teams · ⏱ ~30 min · advanced

When to use: Your users ask similar questions over and over; 30–60% of calls are effectively cache hits.

Steps
  1. Turn on the semantic cache policy
    Enable the APIM semantic-cache-lookup policy on the AOAI completions API with similarity threshold 0.95, TTL 1h.
    → Policy applied
  2. Observe the hit rate
    After 24h, pull the cache hit rate and token savings from App Insights.
    → Hit rate % + tokens saved
  3. Tune the threshold
    If the hit rate is <20%, lower the threshold to 0.92 and observe again. If quality complaints appear, raise it back to 0.97.
    → Iterative tuning with measurements

Result: Measured cost savings on repeat queries without degrading output quality.

Pitfalls
  • Over-aggressive caching serves wrong answers for similar-but-different questions — Start high (0.97) and only lower based on observed quality
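Steps 1 and 3 translate into a pair of APIM policies. A sketch assuming the `azure-openai-semantic-cache-lookup` and `azure-openai-semantic-cache-store` policies from the APIM policy reference (the `embeddings-backend-id` value is a placeholder you must point at a real embeddings deployment):

```xml
<policies>
  <inbound>
    <base />
    <!-- Return a cached completion when a semantically similar prompt
         was answered before. score-threshold 0.95 is the similarity
         threshold from step 1; tune per step 3. -->
    <azure-openai-semantic-cache-lookup
        score-threshold="0.95"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned" />
  </inbound>
  <outbound>
    <base />
    <!-- Store fresh completions with a 1h TTL (duration in seconds). -->
    <azure-openai-semantic-cache-store duration="3600" />
  </outbound>
</policies>
```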

Combinations

With other MCPs for 10x impact

azure-ai-gateway + sentry

Correlate APIM 5xx spikes with application-side errors

If Sentry shows a 5xx spike in app X at 10:02, pull APIM metrics for the same window and identify whether the gateway caused it.

azure-ai-gateway + notion

Weekly AI-traffic governance report to Notion

Compile per-team TPM usage for the week, 429 counts, cache hit rate; post as a Notion page.

Tools

What this MCP provides

Tool | Inputs | When to call | Cost
list_subscriptions | product_id? | Inventory the teams consuming the gateway | free (ARM API call)
update_quota | subscription_id, tpm?, rpm? | Adjust a team's token/request limits | free
get_backend_pool | api_id | Inspect routing and failover config | free
update_backend_pool | api_id, backends, policies | Change priorities, circuit breakers, load balancing | free
apply_policy | api_id, policy_xml | Deploy APIM policy (cache, auth, logging) | free
get_metrics | api_id, since, until | Observe traffic shape per API | free
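These tools presumably wrap plain ARM calls against the APIM resource. A hypothetical sketch of the resource path a tool like update_quota would operate on (the helper name and wrapping are illustrative, not part of the MCP):

```python
def apim_subscription_url(azure_sub: str, resource_group: str,
                          service_name: str, apim_sub: str,
                          api_version: str = "2022-08-01") -> str:
    """Build the ARM resource URL for one APIM subscription.

    Illustrative only: shows the Microsoft.ApiManagement resource path
    that list_subscriptions / update_quota would read and PATCH.
    """
    return (
        "https://management.azure.com"
        f"/subscriptions/{azure_sub}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.ApiManagement"
        f"/service/{service_name}"
        f"/subscriptions/{apim_sub}"
        f"?api-version={api_version}"
    )

print(apim_subscription_url("azure-sub-id", "rg-ai", "apim-prod", "team-growth"))
```

Note that "subscriptions" appears twice with different meanings: the Azure subscription that owns the APIM instance, and the APIM subscription (the per-team key) being governed.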

Costs & Limits

What it costs to run

API quota
Azure Resource Manager rate limits (generous per tenant)
Tokens per call
Policy/backend-pool reads: 500–2000 tokens
Cost
APIM pricing starts at ~$40/mo (Basic tier); Standard tier recommended for prod
Tip
Semantic caching usually pays for APIM's cost many times over if your traffic repeats. Measure hit rate to justify.
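That tip can be checked with simple arithmetic. A sketch with entirely made-up numbers (traffic volume, hit rate, and token price are assumptions; substitute your own measurements):

```python
def monthly_cache_savings(calls_per_month: int, avg_tokens_per_call: int,
                          hit_rate: float, price_per_1k_tokens: float) -> float:
    """Dollar value of tokens the semantic cache kept away from AOAI."""
    saved_tokens = calls_per_month * avg_tokens_per_call * hit_rate
    return saved_tokens / 1000 * price_per_1k_tokens

# Assumed: 1M calls/mo, 1500 tokens/call, 30% hit rate, $0.002 per 1k
# tokens. All illustrative; measure your real hit rate first.
savings = monthly_cache_savings(1_000_000, 1500, 0.30, 0.002)
apim_basic = 40.0  # ~$40/mo Basic tier, per the pricing note above
print(f"${savings:.0f} saved vs ${apim_basic:.0f} APIM cost")
```

Even at a modest hit rate, savings of this shape dwarf the Basic-tier fee; at low traffic the math can flip, which is why the tip says to measure.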

Security

Permissions, secrets, scope

Minimal scopes: APIM Contributor on the target APIM instance
Credential storage: Azure service principal credentials (client id/secret/tenant) in env
Data egress: ARM API calls to management.azure.com; prompt/response bodies traverse APIM itself
Never grant: Owner on the subscription; Global Admin
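If the server picks up a service principal from the environment (the variable names below follow the azure-identity convention; whether this particular server reads them is an assumption), the client config can carry the credentials in an env block:

```json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": ["azure-ai-gateway-mcp"],
      "env": {
        "AZURE_TENANT_ID": "<tenant-id>",
        "AZURE_CLIENT_ID": "<sp-client-id>",
        "AZURE_CLIENT_SECRET": "<sp-client-secret>"
      }
    }
  }
}
```

Prefer a secret manager or OS keychain over committing this file; anything in the config is readable by every tool the client runs.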

Troubleshooting

Common errors and fixes

401 on ARM API calls

The service principal lacks the APIM Contributor role on the resource group. Grant it via the portal or the az CLI.

Check: az role assignment list --assignee <sp>

Policy apply fails — XML validation error

APIM policy XML is strict; use the portal's policy editor to validate, then copy-paste.

429s persist after raising TPM quota

The underlying AOAI deployment itself may be the bottleneck. Check the deployment's TPM, not just the APIM subscription TPM.

Semantic cache hit rate is 0%

The embeddings backend for cache lookup is not configured; check the cache policy's embeddings reference.

Alternatives

Azure AI Gateway vs. others

Alternative | When instead | Trade-off
Cloudflare AI Gateway | You're on Cloudflare and want multi-provider LLM routing out of the box | Less deep integration with Azure-hosted models
Portkey / LiteLLM | You want a provider-agnostic gateway with a dashboard | Third-party SaaS; data leaves your cloud

More

Resources

📖 Read the official README on GitHub

🐙 View open issues

🔍 Browse all 400+ MCP servers and skills