
Elite track

AI Infrastructure & Inference

Serve frontier models fast, cheap, and at scale.

Own the serving layer: the transformer internals that decide cost, quantization, KV-cache management, continuous batching, paged attention, speculative decoding, and GPU-aware deployment. You leave able to stand up an inference platform that holds an SLO under load. Bridges to Operating Systems, Computer Architecture, and Computer Networks.


Week by week

Mapped week by week.

Every week unlocks the next. Concepts route you to free, world-class material; projects turn that knowledge into something deployed.

Week 1

Transformer Internals for Serving

You cannot optimize what you do not understand. Tokens, attention, the prefill and decode phases, and exactly where compute and memory go during a forward pass.

Bridges to Computer Architecture — instruction-level parallelism and the memory wall

Builds on: nothing, start here

Read the study notes

Week 2

GPU Architecture & the Memory Wall

Why inference is memory-bandwidth bound, not compute bound. SMs, warps, HBM versus SRAM, arithmetic intensity, and the roofline model that decides what 'fast' even means.
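To make the roofline claim concrete, here is a minimal sketch of the arithmetic. The accelerator numbers (`PEAK_FLOPS`, `PEAK_BW`) are hypothetical, and `decode_intensity` is an illustrative helper that counts only weight reads for a dense model; it ignores KV-cache and activation traffic, which makes real decode steps even more memory-bound.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline: attainable FLOP/s = min(peak compute, intensity * bandwidth)."""
    return min(peak_flops, intensity * peak_bandwidth)


def decode_intensity(batch_size, bytes_per_weight=2):
    """Approximate FLOPs per byte for one decode step of a dense model.

    Each weight is read once per step and does one multiply-add
    (2 FLOPs) for every sequence in the batch.
    """
    return 2 * batch_size / bytes_per_weight


# Hypothetical accelerator, not a real part.
PEAK_FLOPS = 1000e12  # 1000 TFLOP/s of 16-bit compute
PEAK_BW = 2e12        # 2 TB/s of HBM bandwidth
ridge = PEAK_FLOPS / PEAK_BW  # intensity where the two roofs meet

for batch in (1, 8, 64, 512):
    ai = decode_intensity(batch)
    perf = attainable_flops(ai, PEAK_FLOPS, PEAK_BW)
    bound = "memory" if ai < ridge else "compute"
    print(f"batch={batch:4d} intensity={ai:6.1f} FLOPs/byte "
          f"attainable={perf / 1e12:7.1f} TFLOP/s ({bound}-bound)")
```

Under these assumptions a batch of 1 reaches a fraction of a percent of peak compute, and the workload stays memory-bound until the batch reaches the ridge point. That is the case for batching in one loop.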

Bridges to Computer Architecture — parallelism, memory hierarchy, and the roofline model

Builds on: Transformer Internals for Serving

Read the study notes

Week 3

FlashAttention & Subquadratic Sparse Attention

A deep dive into kernel-level attention optimization, starting with how FlashAttention tiles the computation so the full attention matrix never has to be written to HBM. Then move beyond quadratic computational complexity with Subquadratic Sparse Attention (SSA) routing mechanisms that selectively process key-value tokens, pushing viable context windows up to 12 million tokens.
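The core trick behind tiled attention kernels can be sketched in plain Python: process keys and values block by block while carrying a running max and running sum, so the softmax never needs the full score row at once. This is an illustrative single-query sketch of the online-softmax idea, not a GPU kernel and not a sparse-attention method; the function names are hypothetical and the usual 1/sqrt(d) scaling is omitted from both versions for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Standard softmax attention for one query: needs every score at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, computed one key/value block at a time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values))
```

The two outputs agree up to floating-point rounding. On a GPU the same bookkeeping lets each block live in on-chip SRAM, so the full score matrix never travels to HBM.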

Bridges to Operating Systems — I/O scheduling and memory-bound versus compute-bound work

Builds on: GPU Architecture & the Memory Wall

Read the study notes

Week 4

KV-Cache & Paged Attention

The KV cache is the dominant memory cost of serving. Fragmentation, paging, prefix sharing, and how PagedAttention applied virtual-memory ideas to GPU memory.
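A minimal sketch of the paging idea, tracking only the bookkeeping and no tensors: fixed-size blocks, a per-sequence block table, and reference counts so a forked sequence shares its prefix until it writes. The class and method names are hypothetical and are not taken from any serving engine.

```python
class PagedKVCache:
    """Block-table bookkeeping for a paged KV cache (no actual tensors)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = {}    # physical block id -> number of owners
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> tokens stored

    def _alloc(self):
        if not self.free_blocks:
            raise MemoryError("out of KV blocks")
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # No block yet, or the last block is full: take a new one.
            table.append(self._alloc())
        elif self.ref_counts[table[-1]] > 1:
            # Copy-on-write: the partly filled last block is shared.
            self.ref_counts[table[-1]] -= 1
            table[-1] = self._alloc()
        self.seq_lens[seq_id] = length + 1

    def fork(self, parent_id, child_id):
        """Share every block of the parent with the child (prefix sharing)."""
        table = list(self.block_tables[parent_id])
        for block in table:
            self.ref_counts[block] += 1
        self.block_tables[child_id] = table
        self.seq_lens[child_id] = self.seq_lens[parent_id]

    def free(self, seq_id):
        for block in self.block_tables.pop(seq_id):
            self.ref_counts[block] -= 1
            if self.ref_counts[block] == 0:
                del self.ref_counts[block]
                self.free_blocks.append(block)
        del self.seq_lens[seq_id]


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("a")   # 6 tokens -> 2 blocks
cache.fork("a", "b")          # "b" shares both blocks, no new memory
cache.append_token("b")       # write into a shared block -> copy-on-write
print(cache.block_tables, len(cache.free_blocks))
```

Because every block has the same size, any free block fits any request. A sequence wastes at most one partly filled block, which is the fragmentation bound paging buys.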

Bridges to Operating Systems — virtual memory, paging, and fragmentation

Builds on: FlashAttention & Subquadratic Sparse Attention

Read the study notes

Week 5

Continuous Batching & Throughput Scheduling

Static batching wastes the GPU. Continuous (iteration-level) batching, request scheduling, prefill-decode tradeoffs, and how to push throughput without wrecking tail latency.
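A toy simulation shows where static batching loses. It assumes every decode step costs the same regardless of how many slots are occupied (roughly true while decode is memory-bound) and ignores prefill entirely; the function names and the workload are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, max_batch):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), max_batch):
        steps += max(output_lens[i:i + max_batch])
    return steps


def continuous_batching_steps(output_lens, max_batch):
    """Finished requests free their slot at the next iteration."""
    waiting = deque(output_lens)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit new requests into any free slot before the next step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration emits a token per active request
        running = [r - 1 for r in running if r > 1]
    return steps


lens = [2, 10, 2, 10, 2, 2, 2, 2]  # output lengths in tokens
print("static:    ", static_batching_steps(lens, max_batch=2))
print("continuous:", continuous_batching_steps(lens, max_batch=2))
```

On this workload static batching needs 24 steps and continuous batching needs 16, which is the lower bound of 32 tokens over 2 slots. The gap is the time short requests spent waiting on long ones.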

Bridges to Operating Systems — CPU scheduling, throughput versus latency tradeoffs

Builds on: KV-Cache & Paged Attention

Read the study notes

Week 6

Production LLM Inference Server with Continuous Batching

Week 6 milestone

An enterprise mandate: the platform team needs a launched inference product that holds a strict latency SLO while maximizing GPU throughput. Build (or extend a serving engine into) an inference service that implements a KV cache with paged memory management, iteration-level continuous batching, and a request scheduler that balances time-to-first-token against tokens-per-second. The deliverable is not a benchmark notebook — it is a directly deployable, hyperscalable product: a real public API and a clean playground UI, CI/CD, autoscaling across replicas, production observability for tokens-per-second and latency percentiles, security on the endpoint, and full marketing (landing page, pitch, demo) so it is presentable as a real product. Measure it honestly under load. The GPU is the most expensive thing in the building; idle cycles are a defect. We are not here to babysit it; ship it as a real product.

Why it matters: Inference serving is where AI cost is won or lost; a 2x throughput gain is a direct margin gain for any company running models. Shipping a measured, SLO-holding inference server makes a builder a credible AI Infrastructure or Inference Engineer, an in-demand frontier role because it converts directly into saved spend.

The deliverable

A publicly hosted inference service with a stable URL and a clean playground UI, plus a public repo: the batching scheduler and KV-cache manager, an autoscaling deployment, CI/CD on every commit, production observability dashboards, a load-test harness, a benchmark report comparing static versus continuous batching across concurrency levels, a marketing landing page, a 10-slide pitch, a recorded demo, and a README explaining the memory, scheduling, and scaling design.

What it ships

An OpenAI-compatible HTTP API (chat/completions, streaming) so the service is a drop-in for existing clients.
A paged KV-cache manager that eliminates memory fragmentation and supports prefix sharing across requests.
Iteration-level continuous batching so new requests join the running batch without waiting for it to drain.
A request scheduler with configurable priority and a tunable time-to-first-token versus throughput policy.
Token-level response streaming over server-sent events or WebSocket.
A clean playground UI to send prompts, watch streaming output, and see live latency and throughput.
A live metrics dashboard: tokens-per-second, time-to-first-token, queue depth, GPU memory, and KV-cache utilization.
Autoscaling across replicas driven by queue depth, with health and readiness probes.
A built-in load-test harness that sweeps concurrency and emits a static-vs-continuous-batching benchmark report.
API-key authentication and per-key rate limiting on the endpoint.
Graceful degradation and request shedding when the GPU is saturated, instead of timeouts.
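The last two items above, per-key rate limiting and load shedding, can be sketched as one admission check that runs before a request enters the queue. This is a minimal sketch under stated assumptions: the class names, limits, and the choice of HTTP 429 for rate limits and 503 for shedding are illustrative, and time is passed in explicitly so the logic stays deterministic.

```python
class TokenBucket:
    """Classic token bucket: refills at `rate_per_s`, holds at most `burst`."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class AdmissionController:
    """Decide whether to accept a request before it reaches the scheduler."""

    def __init__(self, max_queue_depth, rate_per_s, burst):
        self.max_queue_depth = max_queue_depth
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.buckets = {}  # api key -> TokenBucket

    def admit(self, api_key, queue_depth, now):
        if queue_depth >= self.max_queue_depth:
            return 503  # shed load now instead of letting the client time out
        if api_key not in self.buckets:
            self.buckets[api_key] = TokenBucket(self.rate_per_s, self.burst)
        if not self.buckets[api_key].allow(now):
            return 429  # this key is over its rate limit
        return 200


ctl = AdmissionController(max_queue_depth=4, rate_per_s=1.0, burst=2)
print([ctl.admit("key-1", queue_depth=0, now=0.0) for _ in range(3)])
print(ctl.admit("key-1", queue_depth=4, now=1.0))
```

Rejecting early keeps the queue short enough that admitted requests can still meet the latency SLO.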

Stack you orchestrate

vLLM or a from-scratch serving loop
PyTorch
CUDA
Python
A load-testing tool (Locust or k6)
Prometheus
Docker

Market signal, who wants this

Inference is now a FinOps problem: at production scale it accounts for over 80% of AI GPU spend, and software optimization alone has driven cost-per-million-tokens down 5x on new hardware within months. A funded infrastructure category has formed around exactly this product (vLLM, Runpod with FlashBoot sub-250ms cold starts, BentoML, and Yotta Labs) because self-hosting beats managed APIs on unit economics above ~100M tokens/month. Investors fund inference platforms because every company running open-weight models needs to cut serving cost without losing quality.

How it is graded

The server implements paged KV-cache management and iteration-level continuous batching.
A request scheduler is present and its time-to-first-token versus throughput tradeoff is documented.
The service is deployed publicly with a clean playground UI, CI/CD on every commit, and autoscaling across replicas.
Production observability tracks tokens-per-second and latency percentiles, and the endpoint is secured.
A load-test report shows throughput and latency percentiles across concurrency levels, with continuous batching measurably compared against a static-batching baseline.
GPU memory usage and KV-cache fragmentation are reported with the design that controls them.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The service is publicly reachable and reproducible, with a clear benchmark methodology.
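The latency percentiles the rubric asks for are easy to compute wrong. A minimal sketch using the nearest-rank definition follows; the helper names and the sample latencies are hypothetical, and `p` is assumed to be an integer percentile between 1 and 100.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Report the percentiles a load-test report would typically show."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# Hypothetical time-to-first-token samples from one concurrency level.
ttft_ms = [120, 95, 110, 480, 105, 101, 99, 130, 2200, 115]
print(summarize(ttft_ms))
```

Report percentiles per concurrency level rather than one average: a single slow outlier barely moves the median but defines p99, and p99 is usually what the SLO is written against.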

Bridges to Operating Systems — scheduling, virtual memory, and throughput optimization

Week 7

Quantization & Model Compression

Shrink a model to fit and run faster: INT8/INT4 weight quantization, GPTQ and AWQ, FP8, and the accuracy-versus-cost tradeoff measured honestly.
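The simplest baseline to measure against is round-to-nearest symmetric INT8 quantization with one scale per row. The sketch below shows that baseline only; it is not GPTQ or AWQ, which choose the quantized values more carefully, and the function names are hypothetical.

```python
def quantize_int8(row):
    """Symmetric absmax quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(x) for x in row)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers and the row scale."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

The worst-case error per weight is half a quantization step, so one large outlier in a row inflates the scale and costs precision for every other weight in it. That is the problem the more careful methods attack.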

Bridges to Computer Architecture — number representation and fixed-point arithmetic

Builds on: Continuous Batching & Throughput Scheduling

Read the study notes

Week 8

Speculative Decoding & Latency Optimization

Cut decode latency by guessing ahead: a small draft model proposes tokens a large model verifies in parallel. Acceptance rates, draft selection, and when speculation pays off.
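A minimal sketch of the greedy variant: the draft proposes `k` tokens, the target checks them, and the longest agreeing prefix is kept plus one token from the target. Both "models" here are hypothetical toy functions over integer tokens, and the verification loop stands in for what a real engine does in a single batched forward pass.

```python
def target_next(context):
    """Toy stand-in for the large model's greedy next token."""
    return (context[-1] * 3 + 1) % 7


def draft_next(context):
    """Toy stand-in for the small model: agrees except after token 5."""
    return 0 if context[-1] == 5 else (context[-1] * 3 + 1) % 7


def greedy_decode(prompt, num_new):
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, num_new, k):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new:
        # 1. Draft proposes k tokens autoregressively (cheap).
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(ctx))
            ctx.append(proposal[-1])
        # 2. Target scores every proposed position; counted as one pass.
        target_passes += 1
        ctx = list(tokens)
        accepted = []
        for tok in proposal:
            expected = target_next(ctx)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))  # all matched: bonus token
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_new], target_passes


out, passes = speculative_decode([1], num_new=12, k=4)
print(out == greedy_decode([1], 12), passes)
```

The output is identical to plain greedy decoding with the target, but here it takes 4 target passes instead of 12. When the draft is often wrong the savings shrink and the draft's own cost dominates, which is why acceptance rate decides whether speculation pays off.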

Bridges to Computer Architecture — speculative and out-of-order execution

Builds on: Quantization & Model Compression

Read the study notes

Week 10

Production Serving & Autoscaling

Wrap a model in a service that holds an SLO: load balancing across replicas, GPU autoscaling, cold-start mitigation, and observability for tokens-per-second and time-to-first-token.
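One common way to turn a load signal into a replica count is the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: scale the current count by the ratio of observed to target metric, with a tolerance band to prevent flapping. The sketch below applies that rule to queue depth per replica; the function name, bounds, and target values are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, queue_depth_per_replica, target_depth,
                     min_replicas=1, max_replicas=8, tolerance=0.1):
    """desired = ceil(current * observed / target), clamped to [min, max]."""
    ratio = queue_depth_per_replica / target_depth
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, queue_depth_per_replica=30, target_depth=10))
print(desired_replicas(4, queue_depth_per_replica=10.5, target_depth=10))
print(desired_replicas(4, queue_depth_per_replica=2, target_depth=10))
```

GPU replicas are slow to start, so in practice the scale-up decision has to fire before the queue is deep enough to break the SLO. Cold-start mitigation and the autoscaling target are one design problem, not two.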

Bridges to Distributed Systems — load balancing, replication, and capacity planning

Builds on: Speculative Decoding & Latency Optimization

Read the study notes

Week 11

Sovereign AI, MLX & Local GPU Clustering

Deploy trillion-parameter open-weight models locally on non-NVIDIA hardware. Master Apple's MLX framework for 4-bit quantization, speculative decoding, and continuous batching on Apple Silicon. Build high-bandwidth local clusters using Thunderbolt 5 RDMA.
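Before buying hardware, do the memory arithmetic. The sketch below estimates weight memory for a group-wise 4-bit scheme and the minimum node count to hold it. Every number is an assumption for illustration: the group size of 64, the 16-bit scale and offset per group, the 512 GB node, and the 75% usable fraction. KV cache and activations are not counted.

```python
import math


def quantized_weight_gb(num_params, bits=4, group_size=64,
                        scale_bits=16, offset_bits=16):
    """Weight memory in GB, including per-group scale and offset overhead."""
    bits_per_weight = bits + (scale_bits + offset_bits) / group_size
    return num_params * bits_per_weight / 8 / 1e9


def min_nodes(total_gb, node_memory_gb, usable_fraction=0.75):
    """Nodes needed if only part of each node's memory can hold weights."""
    return math.ceil(total_gb / (node_memory_gb * usable_fraction))


weights_gb = quantized_weight_gb(1e12)  # one trillion parameters
print(weights_gb, "GB of weights")
print(min_nodes(weights_gb, node_memory_gb=512), "nodes of 512 GB")
```

Under these assumptions the effective cost is 4.5 bits per weight, or 562.5 GB for a trillion parameters, so the model cannot fit on one such node. That is what forces the cluster and makes the interconnect part of the serving path.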

Bridges to Computer Architecture — specialized processors and hardware-agnostic compilation

Builds on: Production Serving & Autoscaling

Read the study notes

Week 12

Local Trillion-Parameter Cluster & MLX Deployment

Week 12 milestone

Quantize and deploy an open-weight mixture-of-experts (MoE) model locally. Architect a multi-node local cluster utilizing high-speed interconnects (Thunderbolt 5 RDMA) and the Apple MLX framework, implementing speculative decoding and continuous batching on native hardware.

The deliverable

An optimized mlx_lm or local inference server cluster achieving high tokens-per-second, utilizing 4-bit quantized layers, speculative drafting, and local node orchestration.

What it ships

4-bit quantization
Speculative decoding
Local RDMA clustering

Stack you orchestrate

Apple MLX
C++
Python
Thunderbolt 5 RDMA
llama.cpp

How it is graded

Model quantized to 4-bit representation with minimal loss in perplexity
Multi-node local cluster handles speculative decoding correctly across Thunderbolt 5 RDMA/local interfaces
Inference server delivers at least 10x the token throughput of a CPU-only baseline

Bridges to Computer Architecture — number representation, speculative execution, and the memory hierarchy

What's next

Finished here? Keep climbing.

Each track stands alone, so there's no wrong order. If you want a suggestion, this one pairs well next.

Suggested next: Applied ML & Model Engineering

Take a base model and make it yours.

See the full roadmap