12-week elite track

AI Infrastructure & Inference

Serve frontier models fast, cheap, and at scale.

Own the serving layer: the transformer internals that decide cost, quantization, KV-cache management, continuous batching, paged attention, speculative decoding, and GPU-aware deployment. You leave able to stand up an inference platform that holds an SLO under load. Bridges to Operating Systems, Computer Architecture, and Computer Networks.

Week by week

Twelve weeks, fully mapped.

Every week unlocks the next. Concepts route you to free, world-class material; projects turn that knowledge into something deployed.

Week 1

Transformer Internals for Serving

You cannot optimize what you do not understand. Tokens, attention, the prefill and decode phases, and exactly where compute and memory go during a forward pass.

Bridges to Computer Architecture — instruction-level parallelism and the memory wall

Builds on: nothing — start here
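
The prefill and decode split can be made concrete with a back-of-envelope estimate. This is a minimal sketch, assuming the common approximation of about 2 FLOPs per parameter per token and ignoring attention FLOPs and KV-cache traffic. The function name and the 7B-parameter, 2-bytes-per-weight, 512-token figures are illustrative assumptions.

```python
def phase_cost(n_params, tokens_in_pass, bytes_per_param=2):
    """Rough cost of one forward pass over `tokens_in_pass` tokens.

    Assumes ~2 FLOPs per parameter per token and that every weight
    is read from memory once per pass (attention and KV cache ignored).
    """
    flops = 2 * n_params * tokens_in_pass
    weight_bytes = n_params * bytes_per_param
    intensity = flops / weight_bytes  # FLOPs per byte of weights moved
    return flops, weight_bytes, intensity


# Hypothetical 7B-parameter model with 16-bit weights.
prefill = phase_cost(7e9, tokens_in_pass=512)  # whole prompt in one pass
decode = phase_cost(7e9, tokens_in_pass=1)     # one new token per pass

print(f"prefill intensity: {prefill[2]:.0f} FLOPs/byte")
print(f"decode intensity:  {decode[2]:.0f} FLOPs/byte")
```

Under these assumptions both phases move the same weight bytes, but prefill does 512 times more arithmetic per byte, which is why the two phases stress the hardware so differently.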

Week 2

GPU Architecture & the Memory Wall

Why the decode phase of inference is usually memory-bandwidth bound rather than compute bound. SMs, warps, HBM versus SRAM, arithmetic intensity, and the roofline model that decides what 'fast' even means.

Bridges to Computer Architecture — parallelism, memory hierarchy, and the roofline model

Builds on: Transformer Internals for Serving
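
The roofline model reduces to one line of arithmetic: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below assumes a hypothetical accelerator with 300 TFLOP/s of peak compute and 2 TB/s of memory bandwidth; these are round illustrative numbers, not the specification of any real GPU.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline: performance is capped by compute or by memory traffic.

    intensity     -- FLOPs performed per byte moved from memory
    peak_flops    -- peak compute rate in FLOP/s
    mem_bandwidth -- memory bandwidth in bytes/s
    """
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical hardware numbers, chosen only for illustration.
PEAK = 300e12       # FLOP/s
BANDWIDTH = 2e12    # bytes/s

# Intensity at which a kernel stops being memory bound.
ridge_point = PEAK / BANDWIDTH

for intensity in (1, 16, 150, 512):
    rate = attainable_flops(intensity, PEAK, BANDWIDTH)
    bound = "memory" if intensity < ridge_point else "compute"
    print(f"intensity {intensity:>4}: {rate / 1e12:6.1f} TFLOP/s ({bound} bound)")
```

A kernel to the left of the ridge point gets faster only by moving fewer bytes, which is the motivation for batching, fusion, and quantization in later weeks.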

Week 3

FlashAttention & Kernel-Level Optimization

The attention kernel that made long context affordable: tiling, kernel fusion, and being IO-aware about HBM traffic. The general principle of fusing memory-bound operations.

Bridges to Operating Systems — I/O scheduling and memory-bound versus compute-bound work

Builds on: GPU Architecture & the Memory Wall
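
The tiling idea rests on the online softmax: attention can be computed one block of keys at a time while keeping only a running maximum, a running normalizer, and a running output. The sketch below shows that recurrence for a single query in plain Python and compares it with a naive reference. It illustrates the arithmetic only, not a GPU kernel; the function names are hypothetical and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weight values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Process keys/values tile by tile, never holding all scores at once."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [dot(q, k) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The two functions return the same result up to floating-point rounding; the tiled version simply never needs the full score vector in memory, which is what lets a fused kernel keep its working set in fast on-chip memory.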

Week 4

KV-Cache & Paged Attention

The KV cache is the dominant variable memory cost of serving: it grows with every token of every active request. Fragmentation, paging, prefix sharing, and how PagedAttention applied virtual-memory ideas to GPU memory.

Bridges to Operating Systems — virtual memory, paging, and fragmentation

Builds on: FlashAttention & Kernel-Level Optimization
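
The virtual-memory analogy can be sketched as a block allocator: each sequence owns a block table that maps logical token positions to fixed-size physical blocks, and blocks are handed out only when the previous one fills. This is a toy model under stated assumptions, not the allocator of any particular engine; the class and method names are hypothetical and prefix sharing is not modeled.

```python
class PagedKVAllocator:
    """Toy paged allocator: fixed-size blocks plus a per-sequence block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (block_id, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when every allocated block is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=2)
slots = [allocator.append_token("req-a") for _ in range(3)]
print(slots, "free blocks left:", len(allocator.free_blocks))
```

Because blocks need not be contiguous, a finished request's blocks are immediately reusable by any other request, and the only wasted space is the unused tail of each sequence's last block.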

Week 5

Continuous Batching & Throughput Scheduling

Static batching wastes the GPU. Continuous (iteration-level) batching, request scheduling, prefill-decode tradeoffs, and how to push throughput without wrecking tail latency.

Bridges to Operating Systems — CPU scheduling, throughput versus latency tradeoffs

Builds on: KV-Cache & Paged Attention
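
The gap between static and continuous batching shows up even in a tiny simulation. The sketch below counts decode steps for a queue of requests with different output lengths, assuming every step costs the same and ignoring prefill entirely. The function names and the example workload are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Finished requests leave after every step and waiting ones take their slot."""
    queue = deque(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before the step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every running request emits one token; drop the ones that are done.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: short and long generations interleaved.
lens = [2, 8, 2, 8, 2, 2]
print("static:    ", static_batching_steps(lens, batch_size=2))
print("continuous:", continuous_batching_steps(lens, batch_size=2))
```

On this workload the static scheduler needs 18 steps and the continuous one needs 12, because short requests no longer hold a slot idle while a long neighbor finishes.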

Week 6

Production LLM Inference Server with Continuous Batching

Week 6 milestone

An enterprise mandate: the platform team needs a launched inference product that holds a strict latency SLO while maximizing GPU throughput. Build (or extend a serving engine into) an inference service that implements a KV cache with paged memory management, iteration-level continuous batching, and a request scheduler that balances time-to-first-token against tokens-per-second. The deliverable is not a benchmark notebook — it is a directly deployable, hyperscalable product: a real public API and a clean playground UI, CI/CD, autoscaling across replicas, production observability for tokens-per-second and latency percentiles, security on the endpoint, and full marketing (landing page, pitch, demo) so it is presentable as a real product. Measure it honestly under load. The GPU is the most expensive thing in the building; idle cycles are a defect. We are not here to babysit it; ship it as a real product.

Why it matters: Inference serving is where AI cost is won or lost; a 2x throughput gain is a direct margin gain for any company running models. Shipping a measured, SLO-holding inference server makes a builder a credible AI Infrastructure or Inference Engineer, one of the highest-paid frontier roles because it converts directly into saved spend.

The deliverable

A publicly hosted inference service with a stable URL and a clean playground UI, plus a public repo: the batching scheduler and KV-cache manager, an autoscaling deployment, CI/CD on every commit, production observability dashboards, a load-test harness, a benchmark report comparing static versus continuous batching across concurrency levels, a marketing landing page, a 10-slide pitch, a recorded demo, and a README explaining the memory, scheduling, and scaling design.

What it ships
  • An OpenAI-compatible HTTP API (chat/completions, streaming) so the service is a drop-in for existing clients.
  • A paged KV-cache manager that eliminates external memory fragmentation, confines wasted space to the last block of each sequence, and supports prefix sharing across requests.
  • Iteration-level continuous batching so new requests join the running batch without waiting for it to drain.
  • A request scheduler with configurable priority and a tunable time-to-first-token versus throughput policy.
  • Token-level response streaming over server-sent events or WebSocket.
  • A clean playground UI to send prompts, watch streaming output, and see live latency and throughput.
  • A live metrics dashboard: tokens-per-second, time-to-first-token, queue depth, GPU memory, and KV-cache utilization.
  • Autoscaling across replicas driven by queue depth, with health and readiness probes.
  • A built-in load-test harness that sweeps concurrency and emits a static-vs-continuous-batching benchmark report.
  • API-key authentication and per-key rate limiting on the endpoint.
  • Graceful degradation and request shedding when the GPU is saturated, instead of timeouts.
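
Per-key rate limiting from the list above is often built on a token bucket. The following is a minimal sketch of one bucket, assuming the caller passes in the current time so the behavior is deterministic; the class name and parameters are hypothetical, and a real service would keep one bucket per API key.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.last = now

    def allow(self, now, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should reject, for example with HTTP 429


bucket = TokenBucket(rate_per_sec=1.0, capacity=2)
decisions = [bucket.allow(now=0.0) for _ in range(3)] + [bucket.allow(now=1.0)]
print(decisions)
```

Rejecting early with a clear status is one way to shed load when the GPU is saturated, rather than letting requests queue until they time out.
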
Stack you orchestrate
  • vLLM or a from-scratch serving loop
  • PyTorch
  • CUDA
  • Python
  • A load-testing tool (Locust or k6)
  • Prometheus
  • Docker

Market signal: who wants this

Inference is now a FinOps problem: at production scale it accounts for over 80% of AI GPU spend, and software optimization alone has driven cost-per-million-tokens down 5x on new hardware within months. A funded infrastructure category has formed around exactly this product, including vLLM, Runpod (FlashBoot sub-250ms cold starts), BentoML, and Yotta Labs, because self-hosting beats managed APIs on unit economics above ~100M tokens/month. Investors fund inference platforms because every company running open-weight models needs to cut serving cost without losing quality.

How it is graded
  • The server implements paged KV-cache management and iteration-level continuous batching.
  • A request scheduler is present and its time-to-first-token versus throughput tradeoff is documented.
  • The service is deployed publicly with a clean playground UI, CI/CD on every commit, and autoscaling across replicas.
  • Production observability tracks tokens-per-second and latency percentiles, and the endpoint is secured.
  • A load-test report shows throughput and latency percentiles across concurrency levels, with continuous batching measurably compared against a static-batching baseline.
  • GPU memory usage and KV-cache fragmentation are reported with the design that controls them.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
  • The service is publicly reachable and reproducible, with a clear benchmark methodology.
Bridges to Operating Systems — scheduling, virtual memory, and throughput optimization

Week 7

Quantization & Model Compression

Shrink a model to fit and run faster: INT8/INT4 weight quantization, GPTQ and AWQ, FP8, and the accuracy-versus-cost tradeoff measured honestly.

Bridges to Computer Architecture — number representation and fixed-point arithmetic

Builds on: Continuous Batching & Throughput Scheduling
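
The simplest form of weight quantization is symmetric absmax INT8: scale the tensor so its largest magnitude maps to 127, round, and keep the scale for dequantization. The sketch below shows only that baseline on a plain list of floats. It is not GPTQ or AWQ, which choose quantized values more carefully, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integers back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0, 0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", q)
print(f"scale: {scale:.5f}, max abs error: {max_error:.5f}")
```

The rounding error per weight is bounded by half the scale, so a single outlier that inflates the scale costs precision everywhere else, which is one reason per-channel or per-group scales are common in practice.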

Week 8

Speculative Decoding & Latency Optimization

Cut decode latency by guessing ahead: a small draft model proposes tokens that a large model verifies in parallel. Acceptance rates, draft selection, and when speculation pays off.

Bridges to Computer Architecture — speculative and out-of-order execution

Builds on: Quantization & Model Compression
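
Whether speculation pays off can be estimated before building anything. The sketch below uses the standard simplification that each drafted token is accepted independently with the same probability `alpha`, so a draft of `k` tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens per verification pass on average. The function names and the draft cost ratio are illustrative assumptions.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per target-model pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus one token from the target
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, draft_cost_ratio):
    """Speedup over plain decoding.

    draft_cost_ratio -- cost of one draft-model step relative to one
                        target-model step (hypothetical input).
    """
    cost_per_step = k * draft_cost_ratio + 1  # k draft steps + 1 verification
    return expected_tokens_per_step(alpha, k) / cost_per_step


for alpha in (0.0, 0.5, 0.8):
    print(f"alpha={alpha}: {estimated_speedup(alpha, k=4, draft_cost_ratio=0.05):.2f}x")
```

With a zero acceptance rate the estimate falls below 1x, which captures the case where speculation makes latency worse: the draft work is paid for and nothing is gained.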

Week 10

Production Serving & Autoscaling

Wrap a model in a service that holds an SLO: load balancing across replicas, GPU autoscaling, cold-start mitigation, and observability for tokens-per-second and time-to-first-token.

Bridges to Distributed Systems — load balancing, replication, and capacity planning

Builds on: Speculative Decoding & Latency Optimization
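
A queue-driven autoscaling rule can be as small as one division. The sketch below picks a replica count so that each replica holds roughly a target number of queued requests, clamped to a minimum and maximum. The function name, the target, and the bounds are hypothetical, and a production policy would also smooth the signal and account for cold-start time.

```python
import math


def desired_replicas(queue_depth, target_per_replica, min_replicas=1, max_replicas=8):
    """Replica count that keeps about `target_per_replica` requests queued per replica."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 10, 100):
    print(f"queue depth {depth:>3} -> {desired_replicas(depth, target_per_replica=4)} replicas")
```

The lower bound keeps at least one warm replica so the first request after an idle period does not pay a full cold start; setting it to zero trades that latency for lower idle spend.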

Week 12

Quantized, Speculatively-Decoded Model Deployment

Week 12 milestone

An enterprise mandate: a capable open-weight model must run within a fixed GPU budget at half the current latency, with no unacceptable accuracy loss, and it must ship as a launched product. Quantize the model (INT4/INT8 or FP8), pair it with a draft model for speculative decoding, deploy it behind an autoscaling service, and prove the result. The deliverable is directly deployable and hyperscalable: a real public API and a hyper-usable demo UI, CI/CD, autoscaling, production observability, endpoint security, a finance-grade cost-per-token report, and full marketing (landing page, pitch, demo). Deliver a deployment a finance team would sign off on: the cost-per-million-tokens must drop and you must show it did. We are not here to babysit it; ship it as a real product.

Why it matters: Every company running open-weight models in production needs someone who can cut inference cost without breaking quality. A builder who can quantize, speculate, and deploy with a defensible cost report is directly deployable as a Senior Inference Engineer, a role compensated at the ₹1-crore tier because the savings are measured in real money.

The deliverable

A publicly hosted deployment with a stable URL and a hyper-usable demo UI, plus a public repo: the quantization and speculative-decoding pipeline, an autoscaling serving setup, CI/CD on every commit, production observability, an accuracy report on a representative benchmark before and after compression, a marketing landing page, a 10-slide pitch, a recorded demo, and a README with the cost-per-token analysis and scaling design.

What it ships
  • A quantization pipeline supporting INT4/INT8 (GPTQ or AWQ) and FP8, with a one-command recompress workflow.
  • An automatic accuracy-regression check that benchmarks the model before and after compression on a representative task.
  • Speculative decoding with a paired draft model, exposing the acceptance rate as a tunable, observable metric.
  • An OpenAI-compatible serving API behind the optimized model so it is a drop-in replacement.
  • A hyper-usable demo UI showing side-by-side latency of the baseline versus the optimized deployment.
  • A live cost dashboard computing cost-per-million-tokens from real throughput and GPU pricing.
  • Autoscaling with fast cold starts and scale-to-zero so idle GPU spend is eliminated.
  • A finance-grade report export: before/after cost, latency, and accuracy in one shareable document.
  • Configurable quality gates that block a deployment if accuracy loss exceeds a set threshold.
  • API-key auth, rate limiting, and request quotas on the public endpoint.
  • Production observability for time-to-first-token, tokens-per-second, and GPU utilization.
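
The cost dashboard in the list above comes down to one formula: GPU price per hour divided by tokens served per hour, scaled to a million tokens. The sketch below applies it to a before and after measurement. The price and throughput figures are made-up placeholders, not benchmark results, and the function name is hypothetical.

```python
def cost_per_million_tokens(gpu_price_per_hour, tokens_per_second, num_gpus=1):
    """Serving cost per one million generated tokens, in the price's currency."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour * num_gpus / tokens_per_hour * 1_000_000


# Placeholder inputs: replace with measured throughput and real GPU pricing.
baseline = cost_per_million_tokens(gpu_price_per_hour=2.00, tokens_per_second=1000)
optimized = cost_per_million_tokens(gpu_price_per_hour=2.00, tokens_per_second=2000)

print(f"baseline:  {baseline:.4f} per 1M tokens")
print(f"optimized: {optimized:.4f} per 1M tokens")
print(f"reduction: {(1 - optimized / baseline) * 100:.0f}%")
```

Throughput here should be the sustained rate measured under realistic load, since a peak figure would understate the real cost.
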
Stack you orchestrate
  • vLLM or TGI
  • GPTQ/AWQ or bitsandbytes
  • PyTorch
  • An eval harness
  • Kubernetes or Cloud Run
  • Prometheus
  • Docker

Market signal: who wants this

GPU FinOps is a defined 2026 budget line: inference is over 80% of AI GPU spend, and quantization plus speculative decoding are the highest-leverage cost cuts (FP8 alone gives 1.3-2x throughput at under 2% quality loss). Hardware-plus-software optimization has delivered 5x cost-per-token reductions, and a market of inference-cost tooling (Spheron, regolo.ai, BentoML, Yotta Labs) has formed around it. Investors fund cost-optimization products because the savings convert directly into gross margin for anyone serving open-weight models at volume.

How it is graded
  • The model is quantized with a named method and the accuracy delta on a representative benchmark is reported.
  • Speculative decoding is implemented and its acceptance rate and latency gain are measured.
  • The deployment is publicly hosted with a hyper-usable demo UI, CI/CD on every commit, autoscaling, and production observability tracking time-to-first-token and tokens-per-second.
  • The endpoint is secured and the architecture holds under concurrent load.
  • A finance-grade cost-per-million-tokens analysis shows the before-and-after improvement, and accuracy loss is reported honestly with the tradeoff justified.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
  • The deployment is publicly reachable and fully reproducible from the repo.
Bridges to Computer Architecture — number representation, speculative execution, and the memory hierarchy