🧠 AI System Design

Day 27: Case Study: GitHub Copilot

📂 Production & Case Studies · 📖 15 min read · Needs expansion

Learning Objectives

  • Understand the unique challenges of code generation (vs text generation)
  • Learn fill-in-the-middle (FIM) and context window strategies
  • Think through how you'd build code completion for your own Hermes agent

Theory (15 min)

Copilot's Unique Challenges

Code is not human language:

  • Structured: syntax matters; invalid code is worse than no code
  • Context-dependent: what's in the file matters more than the conversation
  • Latency-sensitive: developers won't wait more than 500ms for a suggestion
  • Multi-language: Python, JS, TS, Rust, Go each have different patterns

Fill-in-the-Middle (FIM)

A standard LLM predicts the next token, so generation only ever moves forward (left to right).

Code completion needs: "I have code before cursor and after cursor — what's in the middle?"

def hello(name: str) -> str:
    return f"Hello, {name}!"


def goodbye(name: str) -> str:
    <CURSOR HERE>    ← What goes here?


# Test the functions
assert hello("World") == "Hello, World!"

FIM format:

<|fim_prefix|>def goodbye(name: str) -> str:\n    <|fim_suffix|>\n\n# Test the functions<|fim_middle|>   ← Model fills this

Copilot sends the text before the cursor as the prefix and the text after the cursor as the suffix; the model then generates the middle.
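The prefix/suffix split above can be sketched in a few lines of Python. This is a minimal illustration, not Copilot's actual implementation: the function name `build_fim_prompt` and the character limits are hypothetical, and the sentinel strings are simply the ones shown in the FIM format above (other models may use different tokens or a different ordering).

```python
# Sentinel strings copied from the FIM format shown above.
# Assumption: your model uses these exact tokens; check its model card.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(source: str, cursor: int,
                     max_prefix_chars: int = 2000,
                     max_suffix_chars: int = 500) -> str:
    """Split `source` at `cursor` and wrap both halves in FIM sentinels.

    Only the text closest to the cursor is kept on each side, so the
    prompt stays within a (hypothetical) character budget.
    """
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor is outside the buffer")
    prefix = source[max(0, cursor - max_prefix_chars):cursor]
    suffix = source[cursor:cursor + max_suffix_chars]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Usage: the cursor sits right after the indentation of the empty body.
source = "def goodbye(name: str) -> str:\n    \n\n# Test the functions\n"
cursor = source.index("    ") + 4
prompt = build_fim_prompt(source, cursor)
print(prompt)
```

Counting in characters keeps the sketch dependency-free; a real system would budget in tokens using the model's tokenizer.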

Context Window Strategy

Copilot's most important architectural decision: what to put in the context window.

Priority ranking:

  1. Current file (highest priority)
  2. Recently opened files (tab history)
  3. Imported/dependency files (type definitions, function signatures)
  4. Similar files in the same project (class/file with similar names)
  5. Language server diagnostics (errors in current file)

Total context: ~10K tokens. Distributed by relevance score.
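One way to read "distributed by relevance score" is a greedy fill: take the most relevant source first and truncate whichever source hits the limit. The sketch below assumes that interpretation; the `Candidate` class, the `allocate_budget` function, and the example scores and sizes are all hypothetical, and a proportional split would be an equally plausible design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    source: str       # e.g. "current_file", "recent_tabs" (illustrative names)
    tokens: int       # size of the snippet, in tokens
    relevance: float  # higher means more relevant (made-up scores below)


def allocate_budget(candidates: List[Candidate],
                    budget: int = 10_000) -> Dict[str, int]:
    """Greedily hand out tokens to the most relevant sources first.

    The last source that fits only partially is truncated; anything
    after the budget is exhausted is dropped.
    """
    allocation: Dict[str, int] = {}
    remaining = budget
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        if remaining <= 0:
            break
        take = min(cand.tokens, remaining)
        allocation[cand.source] = take
        remaining -= take
    return allocation


candidates = [
    Candidate("current_file", tokens=6000, relevance=1.0),
    Candidate("recent_tabs", tokens=3000, relevance=0.6),
    Candidate("imports", tokens=2000, relevance=0.5),
    Candidate("similar_files", tokens=4000, relevance=0.3),
]
allocation = allocate_budget(candidates)
print(allocation)
```

With these made-up numbers the current file and recent tabs fit whole, the imports are cut to the remaining 1,000 tokens, and the similar files are dropped.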

Latency Budget

Developer types ──▶ 50ms debounce ──▶ Embed context ──▶ FIM inference ──▶ Display
                    (don't fire on      (200ms)          (200-500ms)       (50ms)
                     every keystroke)

Total: ~500-800ms when the stages above are summed. Any suggestion that takes longer than 2s is discarded.
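The debounce step and the discard rule can be sketched without any editor integration by passing timestamps in explicitly. This is an illustrative model only: the `Debouncer` class and `within_budget` function are hypothetical names, and the stage timings are the ones from the diagram above.

```python
from typing import Dict, Optional


class Debouncer:
    """Fire once after `delay_ms` of keyboard silence.

    Timestamps are passed in explicitly (milliseconds) so the logic is
    deterministic; a real editor plugin would use timers or an event loop.
    """

    def __init__(self, delay_ms: int = 50) -> None:
        self.delay_ms = delay_ms
        self.last_keystroke_ms: Optional[int] = None
        self.pending = False

    def keystroke(self, now_ms: int) -> None:
        # Every keystroke restarts the quiet period.
        self.last_keystroke_ms = now_ms
        self.pending = True

    def should_fire(self, now_ms: int) -> bool:
        if not self.pending or self.last_keystroke_ms is None:
            return False
        if now_ms - self.last_keystroke_ms >= self.delay_ms:
            self.pending = False  # fire at most once per pause
            return True
        return False


def within_budget(stage_ms: Dict[str, int], discard_after_ms: int = 2000) -> bool:
    """Return False when the suggestion arrived too late to be shown."""
    return sum(stage_ms.values()) <= discard_after_ms


d = Debouncer(delay_ms=50)
d.keystroke(0)
d.keystroke(30)
print(d.should_fire(60))   # only 30ms since the last keystroke
print(d.should_fire(80))   # 50ms of silence
print(within_budget({"debounce": 50, "embed": 200, "inference": 500, "display": 50}))
```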

Hands-on (15 min)

Design a Copilot-Like System

#!/usr/bin/env python3
"""copilot-like.py — design document for a local code completion system."""

# Stub — Ayva will expand with:
# - Real FIM implementation with llama.cpp (e.g. the server's /infill endpoint)
# - Context extraction from the current file + surrounding files
# - Debounce mechanism for keystroke handling
# - Multi-language support (Python, JS, TS, Go)
# - Snippet ranking (reject low-confidence suggestions)
# - Integration with Neovim/VSCode via LSP
# - Performance benchmark (latency p50, p95)

copilot_design = {
    "trigger": {
        "description": "Suggest on pause, newline after trigger chars, or manual shortcut",
        "debounce_ms": 75,
        "implement": "Wait 75ms after last keystroke before generating",
    },
    "context_builder": {
        "priority": [
            "Current file content (before cursor)",
            "Current file content (after cursor) — for FIM",
            "Imports / dependencies (signatures, not bodies)",
            "Recently opened files (tab MRU list)",
            "Similar files (by filename or directory pattern)",
        ],
        "max_tokens": 4096,
        "implement": "Read buffer, extract prefix + suffix, collect auxiliary files",
    },
    "inference": {
        "model": "Qwen2.5-Coder-3B (q4_K_M)",
        "format": "FIM (prefix, suffix, middle)",
        "max_suggestion_tokens": 64,
        "temperature": 0.2,  # low for code
        "top_p": 0.95,
        "stop_tokens": ["\n\n", "\\n```"],
        "implement": "llama.cpp server /infill endpoint with the model's FIM tokens",
    },
    "post_processing": {
        "validation": [
            "Check syntax (AST parse if possible)",
            "Check indentation consistency",
            "Check line length",
            "Remove trailing whitespace",
        ],
        "implement": "AST parser or regex validation per language",
    },
    "ranking": {
        "strategy": "Show up to 3 suggestions, ranked by confidence score",
        "confidence_factors": [
            "Token probability (model's confidence)",
            "Syntax validity (passed parse)",
            "Context relevance (embedding cosine with current line)",
        ],
    },
}

print("🧑‍💻 GitHub Copilot - Architecture Design\n")
for component, details in copilot_design.items():
    print(f"\n{'='*40}")
    print(f"📐 {component.replace('_', ' ').title()}")
    print(f"{'='*40}")
    # Only the trigger component has a description; skip the line otherwise.
    if "description" in details:
        print(f"  {details['description']}")
    for k, v in details.items():
        if k not in ("description", "implement"):
            print(f"  {k}: {v}")
    if "implement" in details:
        print(f"  ▶  {details['implement']}")

print("\n\n📊 Key Metrics:")
metrics = {
    "Target latency": "<800ms from keystroke to display",
    "Suggestion length": "16-64 tokens (1-5 lines)",
    "Acceptance rate target": ">25% (industry average is ~30%)",
    "Model": "Qwen2.5-Coder-3B (3GB VRAM, runs fast on CPU)",
}
for k, v in metrics.items():
    print(f"  {k}: {v}")
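The post_processing and ranking entries in the design above can be sketched for Python completions as follows. This is a rough illustration under stated assumptions: the helper names are hypothetical, confidence is taken as the geometric mean of token probabilities (one possible reading of "token probability"), and the embedding-based relevance factor, indentation check, and line-length check are left out.

```python
import ast
import math
from typing import List, Sequence, Tuple


def is_valid_python(prefix: str, completion: str, suffix: str = "") -> bool:
    """Check that the buffer still parses once the completion is inserted."""
    try:
        ast.parse(prefix + completion + suffix)
    except (SyntaxError, ValueError):
        return False
    return True


def confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities, in the range 0..1."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def rank_suggestions(prefix: str, suffix: str,
                     candidates: List[Tuple[str, Sequence[float]]],
                     top_k: int = 3) -> List[str]:
    """Drop completions that break the syntax, then sort by confidence."""
    scored = []
    for text, logprobs in candidates:
        if not is_valid_python(prefix, text, suffix):
            continue
        # Remove trailing whitespace on every line of the completion.
        cleaned = "\n".join(line.rstrip() for line in text.splitlines())
        scored.append((confidence(logprobs), cleaned))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


prefix = "def goodbye(name: str) -> str:\n    "
suffix = "\n\n# Test the functions\n"
candidates = [
    ('return f"Goodbye, {name}!"', [-0.1, -0.2, -0.1]),
    ('return f"Goodbye, {name}!', [-0.05, -0.05]),   # unterminated string
    ('return "Bye " + name  ', [-0.5, -0.7]),         # trailing spaces
]
ranked = rank_suggestions(prefix, suffix, candidates)
print(ranked)
```

Note that the second candidate has the highest raw confidence but is rejected because it no longer parses, which is the point of validating before ranking.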

Questions for Ayva:

  • What's the best local code model for CPU inference (Qwen2.5-Coder vs DeepSeek Coder vs StarCoder)?
  • How does FIM performance compare to standard left-to-right completion for code?
  • What context selection heuristic works best for multi-file projects?


Key Takeaways

  • Code completion needs FIM (fill-in-the-middle), not just next-token prediction
  • Context window management is the most important architectural decision
  • Latency budget is tight (<800ms) — every ms counts
  • Post-processing (syntax validation, indentation checks) filters out broken suggestions before they are shown, which should raise the acceptance rate
