● Official Azure-Samples 🔑 Bring your own key

Azure AI Gateway

by Azure-Samples · Azure-Samples/AI-Gateway

Microsoft's APIM-based AI Gateway patterns — route, meter, and govern LLM traffic (including MCP) from Azure API Management.

Azure AI Gateway is a reference-implementation repo from Microsoft showing how to put Azure API Management (APIM) in front of LLM/MCP endpoints for auth, quota, caching, routing, logging, and circuit-breaking. The MCP server exposes these gateway operations so an agent can configure and inspect them.
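For orientation, an app behind the gateway sends completion calls to the APIM front door with its team's subscription key, not to the AOAI resource directly. A minimal sketch of assembling such a request; the hostname, API path, and key below are placeholders, not values from this repo:

```python
# Sketch of a gateway-routed chat-completions request; all values are placeholders.
GATEWAY = "https://contoso-apim.azure-api.net/openai"  # the APIM front door, not the AOAI endpoint

def build_gateway_request(deployment: str, team_key: str, payload: dict) -> dict:
    """Assemble the request an app would send through the gateway."""
    return {
        "url": f"{GATEWAY}/deployments/{deployment}/chat/completions?api-version=2024-02-01",
        "headers": {
            # APIM attributes usage to the team via this header, enabling quota and metering:
            "Ocp-Apim-Subscription-Key": team_key,
            "Content-Type": "application/json",
        },
        "json": payload,
    }

req = build_gateway_request("gpt-4o", "team-growth-key", {"messages": [{"role": "user", "content": "hi"}]})
assert "deployments/gpt-4o/chat/completions" in req["url"]
```

From the app's perspective only the base URL and the key change; the request and response bodies pass through the gateway unmodified.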

Why use it

Key features

Live demo

What it looks like in practice


Installation

Choose your client

~/Library/Application Support/Claude/claude_desktop_config.json  · Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Open Claude Desktop → Settings → Developer → Edit Config. Restart after saving.

~/.cursor/mcp.json · .cursor/mcp.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Cursor uses the same mcpServers schema as Claude Desktop. The project config takes precedence over the global one.

VS Code → Cline → MCP Servers → Edit
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Click the MCP Servers icon in the Cline sidebar, then "Edit Configuration".

~/.codeium/windsurf/mcp_config.json
{
  "mcpServers": {
    "azure-ai-gateway": {
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  }
}

Same format as Claude Desktop. Restart Windsurf to apply.

~/.continue/config.json
{
  "mcpServers": [
    {
      "name": "azure-ai-gateway",
      "command": "uvx",
      "args": [
        "azure-ai-gateway-mcp"
      ]
    }
  ]
}

Continue uses an array of server objects, not a map.

~/.config/zed/settings.json
{
  "context_servers": {
    "azure-ai-gateway": {
      "command": {
        "path": "uvx",
        "args": [
          "azure-ai-gateway-mcp"
        ]
      }
    }
  }
}

Add to context_servers. Zed reloads automatically.

claude mcp add azure-ai-gateway -- uvx azure-ai-gateway-mcp

A one-line install. Verify with claude mcp list; remove with claude mcp remove azure-ai-gateway.

Use cases

Real-world scenarios: Azure AI Gateway

Enforce per-team token quotas across Azure OpenAI deployments

👤 Central platform teams governing LLM spend ⏱ ~30 min advanced

When to use: Multiple product teams share AOAI; one team's runaway loop shouldn't burn the shared TPM budget.

Prerequisites
  • APIM instance with the AI-Gateway patterns applied — Deploy the reference architecture from the Azure-Samples/AI-Gateway repo
  • APIM subscription key per team — Each team gets a distinct APIM subscription (key) they include in the Ocp-Apim-Subscription-Key header
Flow
  1. Review current quotas
    List APIM subscriptions with their current TPM and RPM quotas for the AOAI product.
    → Per-team quota table
  2. Adjust a noisy team down
    Team 'growth' is at 90% TPM burn daily. Reduce their quota from 200k → 100k TPM. Keep others unchanged.
    → Quota updated; confirmation
  3. Monitor after the change
    Over the next hour, pull 429 (rate-limited) counts per subscription. Confirm growth is being shaped but prod-critical teams aren't affected.
    → Enforcement visible in metrics

Outcome: Controlled shared AOAI spend without starving legitimate high-priority traffic.

Pitfalls
  • Setting quotas too low starves legitimate workloads — Roll out in shadow mode first (log-only), then enforce once you understand real patterns
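The per-team shaping described above can be modeled as a rolling one-minute token counter keyed by subscription. A simplified illustration of the idea (not APIM's actual implementation), using the quota numbers from the scenario:

```python
import time

class TpmQuota:
    """Rolling one-minute tokens-per-minute quota per APIM subscription key (simplified)."""
    def __init__(self, quotas):
        self.quotas = quotas    # subscription key -> TPM limit
        self.windows = {}       # subscription key -> (window_start, tokens_used)

    def allow(self, key, tokens, now=None):
        now = time.monotonic() if now is None else now
        start, used = self.windows.get(key, (now, 0))
        if now - start >= 60:   # a new one-minute window begins
            start, used = now, 0
        if used + tokens > self.quotas[key]:
            return False        # the gateway would answer 429 here
        self.windows[key] = (start, used + tokens)
        return True

q = TpmQuota({"growth": 100_000, "prod-critical": 200_000})
assert q.allow("growth", 90_000, now=0.0)
assert not q.allow("growth", 20_000, now=1.0)      # growth is shaped at its new 100k TPM
assert q.allow("prod-critical", 150_000, now=1.0)  # other teams are unaffected
```

This is also why shadow mode first is sound advice: run the counter, log the would-be 429s, and only flip to enforcement once the logged rejections match your intent.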

Configure multi-region failover for an Azure OpenAI deployment

👤 SREs running production AI workloads ⏱ ~45 min advanced

When to use: A regional AOAI outage (uncommon but real) should trigger transparent failover to another region.

Prerequisites
  • AOAI deployments in ≥2 regions (e.g. East US, West Europe) — Provision via Azure portal; match model + version
Flow
  1. Inspect current backend pool
    Show the APIM backend pool for our AOAI API. How many backends, priority, circuit-breaker config?
    → Current pool config
  2. Add a secondary region
    Add the West Europe AOAI endpoint as priority=2 with circuit-breaker: 5 failures in 1 min → open for 5 min. Keep East US as primary.
    → Pool updated, 2 backends configured
  3. Test failover
    Simulate primary outage by disabling the East US backend for 2 min. Confirm traffic shifts to West Europe, then rollback.
    → Traffic shift observed; rollback verified

Outcome: Transparent failover with evidence it works before you need it.

Pitfalls
  • Different regions have different deployed model versions — Pin to a model version that exists in both regions; mismatched versions silently return different quality
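The pool behaviour in step 2 (priority routing plus a 5-failures-in-1-minute breaker that opens for 5 minutes) can be sketched as a small state machine. This illustrates the pattern, not APIM's internal logic:

```python
class Backend:
    """One pool member with a count-based circuit breaker (simplified)."""
    def __init__(self, name, priority, threshold=5, window=60.0, open_for=300.0):
        self.name, self.priority = name, priority
        self.threshold, self.window, self.open_for = threshold, window, open_for
        self.failures = []       # timestamps of recent failures
        self.open_until = 0.0    # breaker is open (backend skipped) until this time

    def record_failure(self, now):
        self.failures = [t for t in self.failures if now - t < self.window] + [now]
        if len(self.failures) >= self.threshold:  # e.g. 5 failures in 1 min -> open 5 min
            self.open_until = now + self.open_for

    def available(self, now):
        return now >= self.open_until

def pick_backend(pool, now):
    """Route to the lowest priority number among backends whose breaker is closed."""
    live = [b for b in pool if b.available(now)]
    return min(live, key=lambda b: b.priority) if live else None

east = Backend("eastus", priority=1)
west = Backend("westeurope", priority=2)
pool = [east, west]
assert pick_backend(pool, 0.0).name == "eastus"       # healthy: primary wins
for t in range(5):
    east.record_failure(float(t))                     # 5 failures inside one minute
assert pick_backend(pool, 5.0).name == "westeurope"   # breaker open -> failover
assert pick_backend(pool, 310.0).name == "eastus"     # breaker times out -> back to primary
```

The last assertion is the part worth testing in step 3: traffic should return to the primary once the breaker window expires, not stay pinned to the secondary.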

Deploy semantic caching to reduce repeat prompt costs

👤 Cost-conscious platform teams ⏱ ~30 min advanced

When to use: Your users ask similar questions over and over; 30–60% of calls are effectively cache hits.

Flow
  1. Turn on semantic cache policy
    Enable the APIM semantic-cache-lookup policy on the AOAI completions API with similarity threshold 0.95, TTL 1h.
    → Policy applied
  2. Observe hit rate
    After 24h, pull cache hit rate and token savings from App Insights.
    → Hit rate % + tokens saved
  3. Tune threshold
    If hit rate <20%, lower threshold to 0.92 and observe again. If quality complaints, raise back to 0.97.
    → Iterative tuning with measurements

Outcome: Measured cost savings on repeat queries without degrading output quality.

Pitfalls
  • Over-aggressive caching serves wrong answers for similar-but-different questions — Start high (0.97) and only lower based on observed quality
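The lookup the semantic-cache policy performs can be illustrated with a toy in-memory version: embed the prompt, find the nearest cached entry, and serve it only if cosine similarity clears the threshold. The embedder here is a hard-coded stub; a real deployment uses an embeddings model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """Toy semantic cache: serve a stored response only above the similarity threshold."""
    def __init__(self, embed, threshold=0.95):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (embedding, cached_response)

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt):
        v = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(v, e[0]), default=None)
        if best and cosine(v, best[0]) >= self.threshold:
            return best[1]  # cache hit: no AOAI call, no tokens spent
        return None

# Stub embedder: a real cache calls an embeddings deployment instead.
fake_embed = {"reset password": [1.0, 0.0],
              "reset my password": [0.99, 0.14],
              "delete account": [0.0, 1.0]}.get
cache = SemanticCache(fake_embed, threshold=0.95)
cache.store("reset password", "Use the self-service portal.")
assert cache.lookup("reset my password") is not None  # similar enough -> hit
assert cache.lookup("delete account") is None         # different question -> miss
```

The threshold is exactly the knob from step 3: raising it toward 0.97 turns near-miss pairs like the first two prompts into misses, trading savings for safety.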

Combinations

Combine with other MCPs for a 10x effect

azure-ai-gateway + sentry

Correlate APIM 5xx spikes with application-side errors

If Sentry shows 5xx spike in app X at 10:02, pull APIM metrics for the same window and identify if the gateway caused it.
azure-ai-gateway + notion

Weekly AI-traffic governance report to Notion

Compile per-team TPM usage for the week, 429 counts, cache hit rate; post as a Notion page.

Tools

What this MCP provides

Tool | Inputs | When to call | Cost
list_subscriptions | product_id? | Inventory teams consuming the gateway | free (ARM API call)
update_quota | subscription_id, tpm?, rpm? | Adjust a team's token/request limits | free
get_backend_pool | api_id | Inspect routing and failover config | free
update_backend_pool | api_id, backends, policies | Change priorities, circuit breakers, load balancing | free
apply_policy | api_id, policy_xml | Deploy APIM policy (cache, auth, logging) | free
get_metrics | api_id, since, until | Observe traffic shape per API | free

Cost and limits

What it costs

API quota
Azure Resource Manager rate limits (generous per tenant)
Tokens per call
Policy/backend-pool reads: 500–2000 tokens
Money
APIM pricing starts at ~$40/mo (Basic tier); Standard tier recommended for prod
Tip
Semantic caching usually pays for APIM's cost many times over if your traffic repeats. Measure hit rate to justify.
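The tip above is easy to sanity-check with a back-of-the-envelope break-even calculation. All inputs below are hypothetical; substitute your own traffic volume and token pricing:

```python
def monthly_savings(calls_per_month, avg_tokens, price_per_1k, hit_rate, gateway_cost):
    """Tokens avoided by cache hits, priced, minus the gateway's monthly cost."""
    saved_tokens = calls_per_month * hit_rate * avg_tokens
    return saved_tokens / 1000 * price_per_1k - gateway_cost

# Hypothetical inputs: 1M calls/mo, 1500 tokens/call, $0.01 per 1k tokens,
# and the ~$40/mo gateway figure quoted above.
assert monthly_savings(1_000_000, 1500, 0.01, 0.30, 40) > 0    # a 30% hit rate pays off easily
assert monthly_savings(1_000_000, 1500, 0.01, 0.002, 40) < 0   # a near-zero hit rate does not
```

The measured hit rate from App Insights is the only input you can't guess, which is why the doc says to measure it before claiming savings.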

Security

Permissions, secrets, blast radius

Minimum scopes: APIM Contributor on the target APIM instance
Credential storage: Azure service principal credentials (client id/secret/tenant) in env vars
Outbound traffic: ARM API calls to management.azure.com; prompt/response bodies traverse APIM itself
Never grant: Owner on the subscription; Global Administrator
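Since the credentials live in environment variables, a startup sanity check avoids confusing mid-session 401s. The first three names below follow the common AZURE_* convention used by Azure environment credentials; APIM_RESOURCE_ID is a hypothetical stand-in for however this server is pointed at the APIM instance, so check the repo's README for the real names:

```python
# Variable names are assumptions (see lead-in); APIM_RESOURCE_ID is hypothetical.
REQUIRED = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET", "APIM_RESOURCE_ID")

def missing_credentials(env):
    """Return which required settings are absent, so startup can fail fast."""
    return [k for k in REQUIRED if not env.get(k)]

assert missing_credentials({}) == list(REQUIRED)
assert missing_credentials({k: "set" for k in REQUIRED}) == []
```

Run this against os.environ before launching the server; failing fast here is cheaper than diagnosing a 401 from ARM later.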

Troubleshooting

Common errors and fixes

401 on ARM API calls

Service principal lacks APIM Contributor role on the resource group. Grant via portal or az cli.

Verify: az role assignment list --assignee <sp>
Policy apply fails — XML validation error

APIM policy XML is strict; use the portal's policy editor to validate, then copy-paste.

429s persist after raising TPM quota

Underlying AOAI deployment itself may be the bottleneck. Check deployment TPM, not just APIM subscription TPM.

Semantic cache hit rate is 0%

Embedding backend for cache-lookup not configured; check the cache policy's embeddings reference.

Alternatives

Azure AI Gateway compared

Alternative | When to use | Trade-off
Cloudflare AI Gateway | You're on Cloudflare and want multi-provider LLM routing out of the box | Less deep integration with Azure-hosted models
Portkey / LiteLLM | You want a provider-agnostic gateway with a dashboard | Third-party SaaS; data leaves your cloud

More

Resources

📖 Read the official README on GitHub

🐙 Open issues

🔍 All 400+ MCP servers and Skills