🧠 AI System Design

Day 6: Rate Limiting & Quotas

šŸ“‚ Foundations šŸ“– 15 min read Ready

Learning Objectives

  • Understand why rate limiting is essential (cost control, fair access, server protection)
  • Learn token bucket vs sliding window algorithms
  • Build a rate limiter for your inference API

Theory (15 min)

Why Rate Limit?

Three reasons, in priority order:

  1. Cost control — inference costs money. A runaway loop or abusive user can burn through your budget.
  2. Fairness — one user shouldn't consume all serving capacity.
  3. Server protection — too many concurrent requests OOM the GPU.

Algorithm 1: Token Bucket

Bucket holds N tokens.
Tokens refill at R/second.
Each request consumes 1 token.
If bucket empty → request rejected (429).

Intuition: Like a bucket with a small leak. It can handle bursts (up to N tokens) but averages out to the refill rate.

Good for: Absorbing bursty traffic while enforcing average rate.

Algorithm 2: Sliding Window Log

Maintain a queue of timestamps per user.
Window = W seconds (e.g. 60s).
Each request: add timestamp, remove timestamps older than W.
If queue length > limit → reject.

Intuition: "No more than L requests in the last W seconds."

Good for: Precise enforcement. More memory than token bucket (need to store timestamps).

Algorithm 3: Sliding Window Counter (Redis-based)

Divide window into buckets (e.g. 1 second for a 60s window).
Count requests in current bucket + weight of previous bucket.

weight = time_into_current_bucket / bucket_duration
total = current_count + prev_count * (1 - weight)

Intuition: Approximates sliding window precisely enough, using only 2 counters.

Tradeoffs

Algorithm Memory Precision Burst Handling Complexity
Token bucket 2 ints Medium Excellent Low
Sliding window log O(N) per window Exact Poor Medium
Sliding window counter 2 ints per key Good Good Medium
Fixed window (naive) 1 int per key Poor (edge spike) Poor Very low

Multi-Tier Rate Limiting

In production, you layer limits:

Global:    1000 req/min total across all users
User tier:   100 req/min (free users)
             1000 req/min (pro users)
Endpoint:    50 req/min for /v1/chat/completions (expensive)
             500 req/min for /v1/embeddings (cheap)

Hands-on (15 min)

Build a Token-Bucket Rate Limiter

#!/usr/bin/env python3
"""token-bucket-limiter.py — rate-limit inference requests."""
import time
import threading
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        """
        rate: tokens added per second
        capacity: max tokens (burst size)
        """
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.time()
        self.lock = threading.Lock()

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def consume(self, tokens: int = 1) -> bool:
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False


class MultiTierRateLimiter:
    """User-level + global rate limiting."""

    def __init__(self):
        self.global_bucket = TokenBucket(rate=10, capacity=20)  # 10 req/s, burst 20
        self.user_buckets = defaultdict(lambda: TokenBucket(rate=2, capacity=5))  # 2 req/s per user
        self.expensive_bucket = TokenBucket(rate=1, capacity=3)  # 1/s for expensive ops

    def allow_request(self, user_id: str, endpoint: str = "default") -> tuple[bool, str]:
        if not self.global_bucket.consume():
            return False, "global rate limit exceeded"

        if not self.user_buckets[user_id].consume():
            return False, f"user {user_id} rate limit exceeded"

        if endpoint == "chat" and not self.expensive_bucket.consume():
            return False, "chat endpoint rate limit exceeded"

        return True, "ok"


# Demo
limiter = MultiTierRateLimiter()
test_requests = [
    ("user-a", "chat"),
    ("user-a", "chat"),
    ("user-a", "chat"),
    ("user-a", "chat"),   # should fail: user limit
    ("user-b", "chat"),
    ("user-b", "embed"),
] * 3

print("Testing rate limiter...")
allowed = denied = 0
for user_id, endpoint in test_requests:
    ok, reason = limiter.allow_request(user_id, endpoint)
    if ok:
        allowed += 1
        status = "āœ… ALLOW"
    else:
        denied += 1
        status = f"āŒ DENY ({reason})"
    print(f"  {status:45} {user_id}/{endpoint}")
    time.sleep(0.05)  # simulate request spacing

print(f"\nšŸ“Š Result: {allowed} allowed, {denied} denied")

Run it:

cd /tmp
python3 token-bucket-limiter.py

Now integrate with your inference server:

# Quick test: send rapid requests, watch rate limiter kick in
for i in $(seq 1 10); do
  time curl -s http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt":"hello","max_tokens":10}' & \
done
wait

Extension: Store the bucket state in Redis for persistence across server restarts.


Key Takeaways

  • Rate limiting protects costs, ensures fairness, and prevents server overload
  • Token bucket is the most practical algorithm — handles bursts, simple to implement
  • Multi-tier limiting (global → user → endpoint) gives fine-grained control
  • Always return proper HTTP 429 with Retry-After header so clients can back off

References