Enforce per-team token quotas across Azure OpenAI deployments
When to use: Multiple product teams share AOAI; one team's runaway loop shouldn't burn the shared TPM budget.
Prerequisites
- APIM instance with the AI-Gateway patterns applied — Deploy the reference architecture from the Azure-Samples/AI-Gateway repo
- APIM subscription key per team — Each team gets a distinct APIM subscription (key) they include in the Ocp-Apim-Subscription-Key header
Flow
-
Review current quotasList APIM subscriptions with their current TPM and RPM quotas for the AOAI product.✓ Copied→ Per-team quota table
-
Adjust a noisy team downTeam 'growth' is at 90% TPM burn daily. Reduce their quota from 200k → 100k TPM. Keep others unchanged.✓ Copied→ Quota updated; confirmation
-
Monitor after the changeOver the next hour, pull 429 (rate-limited) counts per subscription. Confirm growth is being shaped but prod-critical teams aren't affected.✓ Copied→ Enforcement visible in metrics
Outcome: Controlled shared AOAI spend without nuking legit high-priority traffic.
Pitfalls
- Setting quotas too low starves legitimate workloads — Roll out in shadow mode first (log-only), then enforce once you understand real patterns