Day 6: Rate Limiting & Quotas
Learning Objectives
- Understand why rate limiting is essential (cost control, fair access, server protection)
- Learn token bucket vs sliding window algorithms
- Build a rate limiter for your inference API
Theory (15 min)
Why Rate Limit?
Three reasons, in priority order:
- Cost control ā inference costs money. A runaway loop or abusive user can burn through your budget.
- Fairness ā one user shouldn't consume all serving capacity.
- Server protection ā too many concurrent requests OOM the GPU.
Algorithm 1: Token Bucket
Bucket holds N tokens.
Tokens refill at R/second.
Each request consumes 1 token.
If bucket empty ā request rejected (429).
Intuition: Like a bucket with a small leak. It can handle bursts (up to N tokens) but averages out to the refill rate.
Good for: Absorbing bursty traffic while enforcing average rate.
Algorithm 2: Sliding Window Log
Maintain a queue of timestamps per user.
Window = W seconds (e.g. 60s).
Each request: add timestamp, remove timestamps older than W.
If queue length > limit ā reject.
Intuition: "No more than L requests in the last W seconds."
Good for: Precise enforcement. More memory than token bucket (need to store timestamps).
Algorithm 3: Sliding Window Counter (Redis-based)
Divide window into buckets (e.g. 1 second for a 60s window).
Count requests in current bucket + weight of previous bucket.
weight = time_into_current_bucket / bucket_duration
total = current_count + prev_count * (1 - weight)
Intuition: Approximates sliding window precisely enough, using only 2 counters.
Tradeoffs
| Algorithm | Memory | Precision | Burst Handling | Complexity |
|---|---|---|---|---|
| Token bucket | 2 ints | Medium | Excellent | Low |
| Sliding window log | O(N) per window | Exact | Poor | Medium |
| Sliding window counter | 2 ints per key | Good | Good | Medium |
| Fixed window (naive) | 1 int per key | Poor (edge spike) | Poor | Very low |
Multi-Tier Rate Limiting
In production, you layer limits:
Global: 1000 req/min total across all users
User tier: 100 req/min (free users)
1000 req/min (pro users)
Endpoint: 50 req/min for /v1/chat/completions (expensive)
500 req/min for /v1/embeddings (cheap)
Hands-on (15 min)
Build a Token-Bucket Rate Limiter
#!/usr/bin/env python3
"""token-bucket-limiter.py ā rate-limit inference requests."""
import time
import threading
from collections import defaultdict
class TokenBucket:
def __init__(self, rate: float, capacity: int):
"""
rate: tokens added per second
capacity: max tokens (burst size)
"""
self.rate = rate
self.capacity = capacity
self.tokens = capacity
self.last_refill = time.time()
self.lock = threading.Lock()
def _refill(self):
now = time.time()
elapsed = now - self.last_refill
self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
self.last_refill = now
def consume(self, tokens: int = 1) -> bool:
with self.lock:
self._refill()
if self.tokens >= tokens:
self.tokens -= tokens
return True
return False
class MultiTierRateLimiter:
"""User-level + global rate limiting."""
def __init__(self):
self.global_bucket = TokenBucket(rate=10, capacity=20) # 10 req/s, burst 20
self.user_buckets = defaultdict(lambda: TokenBucket(rate=2, capacity=5)) # 2 req/s per user
self.expensive_bucket = TokenBucket(rate=1, capacity=3) # 1/s for expensive ops
def allow_request(self, user_id: str, endpoint: str = "default") -> tuple[bool, str]:
if not self.global_bucket.consume():
return False, "global rate limit exceeded"
if not self.user_buckets[user_id].consume():
return False, f"user {user_id} rate limit exceeded"
if endpoint == "chat" and not self.expensive_bucket.consume():
return False, "chat endpoint rate limit exceeded"
return True, "ok"
# Demo
limiter = MultiTierRateLimiter()
test_requests = [
("user-a", "chat"),
("user-a", "chat"),
("user-a", "chat"),
("user-a", "chat"), # should fail: user limit
("user-b", "chat"),
("user-b", "embed"),
] * 3
print("Testing rate limiter...")
allowed = denied = 0
for user_id, endpoint in test_requests:
ok, reason = limiter.allow_request(user_id, endpoint)
if ok:
allowed += 1
status = "ā
ALLOW"
else:
denied += 1
status = f"ā DENY ({reason})"
print(f" {status:45} {user_id}/{endpoint}")
time.sleep(0.05) # simulate request spacing
print(f"\nš Result: {allowed} allowed, {denied} denied")
Run it:
cd /tmp
python3 token-bucket-limiter.py
Now integrate with your inference server:
# Quick test: send rapid requests, watch rate limiter kick in
for i in $(seq 1 10); do
time curl -s http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt":"hello","max_tokens":10}' & \
done
wait
Extension: Store the bucket state in Redis for persistence across server restarts.
Key Takeaways
- Rate limiting protects costs, ensures fairness, and prevents server overload
- Token bucket is the most practical algorithm ā handles bursts, simple to implement
- Multi-tier limiting (global ā user ā endpoint) gives fine-grained control
- Always return proper HTTP 429 with Retry-After header so clients can back off