16 production-grade briefs

Projects an employer can open and judge.

No toy exercises. Each brief is written as a corporate mandate, produces a publicly hosted product, ships with the exact rubric it is graded against, and bridges to a classic CS subject, so it doubles as your coursework. You orchestrate the AI; you own the result.


Agentic Systems Engineering

Orchestrate fleets of autonomous agents that ship real work.

Engineer reliable multi-agent systems: tool use, planning, memory, sandboxed code execution, and orchestration patterns that move from a single augmented LLM to a coordinated fleet. You build agents an enterprise can put in production, not demos. Bridges directly to classic Operating Systems, Distributed Systems, and Compilers.

Autonomous Multi-Agent Research-and-Ship System

Week 7 milestone

You are handed an enterprise mandate: the research division needs a launched product — a system that takes an open-ended technical question, autonomously researches it across many sources, synthesizes a defensible report, and ships the report as a published artifact, with zero human steps in the middle. Build an orchestrator-worker multi-agent system: a lead agent that decomposes the question and spawns specialized worker agents (search, read, synthesize, fact-check), coordinates their results through shared state, and produces a cited deliverable. This is not a notebook demo. The result must be a directly deployable, hyperscalable product: real public hosting, CI/CD on every commit, observability, security hardening, a polished and accessible web UI a non-technical analyst will happily use, and complete go-to-market material — a landing page, a pitch, and a recorded demo. The architecture must absorb concurrent research runs without falling over, and recover from a failed worker. We are not here to babysit the run; ship it as a real product.

Why it matters: Multi-agent research and synthesis systems are being deployed across consulting, finance, and R&D to compress weeks of analyst work into hours. Shipping a coordinated, fault-tolerant agent fleet makes a builder ready for an Agentic Systems Engineer or Applied AI Engineer role, where the bar is production reliability, not a demo.

The deliverable

A publicly hosted product with its own domain or stable URL, plus a public repo: the orchestrator and worker agents, an MCP-based tool layer, a fast accessible web UI for submitting questions and reading results, CI/CD running lint/tests/build on every commit, persisted and inspectable run traces, a marketing landing page, a 10-slide pitch, a recorded demo video, and a README documenting the coordination design, the failure-recovery and scaling strategy, and three example end-to-end runs with their published reports.

What it ships

Submit-a-question interface accepting an open-ended technical or market question with a depth setting (quick scan vs deep dive).
A lead orchestrator agent that decomposes the question into a research plan and spawns specialized worker agents.
Specialized workers — web search, source reading, synthesis, and an independent fact-checker that verifies every claim.
An MCP tool layer exposing search, fetch, and document tools so the same tools are reusable across agents and projects.
Live run view: a real-time graph of agent activity, sub-questions in flight, and sources being consumed.
Inline-cited report output where every claim links to the exact retrieved passage that supports it.
Export to PDF, Markdown, and a shareable public report URL.
Persisted, replayable run traces with token spend and latency per agent for cost auditing.
Automatic worker-failure detection and re-dispatch so a crashed worker never aborts a run.
A workspace history of past research runs with search and one-click re-run.
Concurrency controls and per-run budget caps so many users can run research in parallel safely.
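
The orchestrator-worker pattern with re-dispatch and a concurrency cap can be sketched roughly as follows. This is a minimal illustration, not a reference design: `RunState`, `dispatch`, `orchestrate`, and `flaky_worker` are hypothetical names, the decomposition is hard-coded where a real lead agent would ask an LLM for a plan, and the worker is a stand-in that crashes once on purpose.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class RunState:
    """Shared state the lead agent and workers read and write."""
    results: dict = field(default_factory=dict)   # sub-question -> finding
    attempts: dict = field(default_factory=dict)  # sub-question -> tries used


async def dispatch(sub_question, worker, state, max_attempts=3):
    """Run one worker; on a crash, re-dispatch instead of aborting the run."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        state.attempts[sub_question] = attempt
        try:
            state.results[sub_question] = await worker(sub_question, attempt)
            return
        except Exception as exc:  # a crashed worker is expected, not fatal
            last_error = exc
    state.results[sub_question] = f"UNRESOLVED: {last_error}"


async def orchestrate(question, worker, max_parallel=2):
    """Decompose, fan out under a concurrency cap, and collect results."""
    # Hypothetical fixed plan; a real lead agent would generate this.
    plan = [f"{question} :: {aspect}" for aspect in ("background", "evidence", "risks")]
    state = RunState()
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(sub_question):
        async with gate:
            await dispatch(sub_question, worker, state)

    await asyncio.gather(*(guarded(sq) for sq in plan))
    return state


async def flaky_worker(sub_question, attempt):
    """Stand-in worker that crashes once on the 'evidence' sub-question."""
    await asyncio.sleep(0)
    if "evidence" in sub_question and attempt == 1:
        raise RuntimeError("worker crashed")
    return f"finding for {sub_question}"


state = asyncio.run(orchestrate("Is paged attention worth it?", flaky_worker))
```

The semaphore stands in for the per-run concurrency control; a per-run budget cap could be enforced at the same point by checking accumulated token spend before each dispatch.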

Stack you orchestrate

Claude API or open-weight LLM
Model Context Protocol
LangGraph
Node.js or Python
Docker
Google Cloud Run
a tracing backend (LangSmith or OpenTelemetry)

Market signal, who wants this

Agentic deep-research is one of the hottest 2026 categories: the AI agent market is projected to grow from $7.84B in 2025 to $52.62B by 2030 (roughly 46% CAGR), and a16z reports a portfolio pivot from copilots to autonomous systems, with Sierra, Glean, and Decagon as comparables and YC W26 funding multi-agent orchestration startups such as Tensol and Korso. Consulting, finance, and corporate R&D teams are actively buying systems that compress weeks of analyst work into hours; investors fund this because it sells time back to high-cost knowledge workers.

How it is graded

The orchestrator decomposes a question and coordinates at least three specialized worker agents through explicit shared state.
Tools are exposed through a standard protocol (MCP), not bespoke per-agent glue.
The system is deployed to real public hosting with CI/CD on every commit and production observability (logs, traces, metrics).
The architecture handles concurrent research runs under load, and a worker failure mid-run still yields a complete, correct deliverable.
The web UI is fast, WCAG 2.2 AA accessible, and usable by a non-technical analyst without instruction.
Every claim in the output report is traceable to a retrieved source, and run traces are persisted and inspectable.
The project ships complete marketing: a landing page, a 10-slide pitch, and a recorded demo, presentable as a real product.
The product is publicly reachable and fully reproducible from the repo by a stranger.
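
One way to make the traceability criterion checkable is to require every claim to carry a verbatim quote from the passage it cites, then verify the quote mechanically. The sketch below assumes that convention; `Claim` and `untraceable_claims` are hypothetical names, and exact substring matching is a deliberately strict stand-in for whatever matching rule a real fact-checker would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str        # sentence as it appears in the report
    source_id: str   # id of the retrieved passage it cites
    quote: str       # supporting span copied from that passage


def untraceable_claims(claims, passages):
    """Return claims whose quote is not found verbatim in the cited passage."""
    failures = []
    for claim in claims:
        passage = passages.get(claim.source_id, "")
        if not claim.quote or claim.quote not in passage:
            failures.append(claim)
    return failures


passages = {"s1": "Continuous batching admits new requests at every decoding step."}
claims = [
    Claim("Batching is iteration-level.", "s1", "admits new requests at every decoding step"),
    Claim("It doubles throughput.", "s1", "doubles throughput"),   # quote not in source
    Claim("It is widely used.", "s9", "widely used"),              # source never retrieved
]
failed = untraceable_claims(claims, passages)
```

Running a check like this in CI against the three example reports would turn "every claim is traceable" from a promise into a failing build.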

Bridges to Distributed Systems — coordination, message passing, and fault tolerance

Autonomous Software 3.0 Coding Agent with Sandboxed Verification

Week 12 milestone

Build an autonomous coding agent that operates in verifiable domains (e.g., Python/Rust code editing). Implement active test-time scaling (selectable thinking effort), a correctness classifier for intermediate steps, and sandboxed compilation/test loops that serve as the ground-truth reward signal.

The deliverable

A fully functioning coding agent CLI and backend that takes a GitHub issue, spawns a planning tree with selectable thinking effort, executes code in a gVisor/Wasm sandbox, runs a compiler-driven verification loop, and submits a verified PR without human-in-the-loop 'vibe coding'.

What it ships

Compiler-directed feedback loops
Selectable thinking effort
Intermediate reasoning verifier
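
The compiler-directed feedback loop can be illustrated with a small verifier that runs a test file against each candidate edit and feeds the failure log forward. This is only a sketch: a plain child process is not a sandbox, so in the brief's setting the `subprocess.run` call would be replaced by execution inside gVisor or a Wasm runtime. `verify` and `repair_loop` are hypothetical names, and the candidate list stands in for edits a model would generate.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verify(candidate_source, test_source, timeout_s=10):
    """Run tests against a candidate in a child process; return (passed, log)."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "candidate.py").write_text(candidate_source)
        Path(workdir, "check.py").write_text(test_source)
        try:
            proc = subprocess.run(
                [sys.executable, "check.py"],
                cwd=workdir, capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, proc.stderr


def repair_loop(candidates, test_source):
    """Try candidates in order, collecting each failure log as feedback."""
    feedback = []
    for attempt, source in enumerate(candidates, start=1):
        passed, log = verify(source, test_source)
        if passed:
            return attempt, feedback
        feedback.append(log)  # a real agent would condition its next edit on this
    return None, feedback


TESTS = "from candidate import add\nassert add(2, 3) == 5\n"
attempt, feedback = repair_loop(
    ["def add(a, b):\n    return a - b\n", "def add(a, b):\n    return a + b\n"],
    TESTS,
)
```

The pass/fail bit is the ground-truth reward signal; the captured stderr is what lets the agent self-correct rather than guess.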

Stack you orchestrate

Python
gVisor
Wasm
Docker
Model Context Protocol

How it is graded

Agent successfully explores multi-turn code changes and uses compiler feedback to self-correct
Execution environment is properly sandboxed via gVisor or WebAssembly with strict resource boundaries
Thinking-effort scaling tier is programmatically selectable, balancing token budget against search depth

Bridges to Operating Systems — virtualization, process isolation, and resource management

AI Infrastructure & Inference

Serve frontier models fast, cheap, and at scale.

Own the serving layer: the transformer internals that decide cost, quantization, KV-cache management, continuous batching, paged attention, speculative decoding, and GPU-aware deployment. You leave able to stand up an inference platform that holds an SLO under load. Bridges to Operating Systems, Computer Architecture, and Computer Networks.

Production LLM Inference Server with Continuous Batching

Week 6 milestone

An enterprise mandate: the platform team needs a launched inference product that holds a strict latency SLO while maximizing GPU throughput. Build (or extend a serving engine into) an inference service that implements a KV cache with paged memory management, iteration-level continuous batching, and a request scheduler that balances time-to-first-token against tokens-per-second. The deliverable is not a benchmark notebook — it is a directly deployable, hyperscalable product: a real public API and a clean playground UI, CI/CD, autoscaling across replicas, production observability for tokens-per-second and latency percentiles, security on the endpoint, and full marketing (landing page, pitch, demo) so it is presentable as a real product. Measure it honestly under load. The GPU is the most expensive thing in the building; idle cycles are a defect. We are not here to babysit it; ship it as a real product.

Why it matters: Inference serving is where AI cost is won or lost; a 2x throughput gain is a direct margin gain for any company running models. Shipping a measured, SLO-holding inference server makes a builder a credible AI Infrastructure or Inference Engineer, an in-demand frontier role because it converts directly into saved spend.

The deliverable

A publicly hosted inference service with a stable URL and a clean playground UI, plus a public repo: the batching scheduler and KV-cache manager, an autoscaling deployment, CI/CD on every commit, production observability dashboards, a load-test harness, a benchmark report comparing static versus continuous batching across concurrency levels, a marketing landing page, a 10-slide pitch, a recorded demo, and a README explaining the memory, scheduling, and scaling design.

What it ships

An OpenAI-compatible HTTP API (chat/completions, streaming) so the service is a drop-in for existing clients.
A paged KV-cache manager that eliminates memory fragmentation and supports prefix sharing across requests.
Iteration-level continuous batching so new requests join the running batch without waiting for it to drain.
A request scheduler with configurable priority and a tunable time-to-first-token versus throughput policy.
Token-level response streaming over server-sent events or WebSocket.
A clean playground UI to send prompts, watch streaming output, and see live latency and throughput.
A live metrics dashboard: tokens-per-second, time-to-first-token, queue depth, GPU memory, and KV-cache utilization.
Autoscaling across replicas driven by queue depth, with health and readiness probes.
A built-in load-test harness that sweeps concurrency and emits a static-vs-continuous-batching benchmark report.
API-key authentication and per-key rate limiting on the endpoint.
Graceful degradation and request shedding when the GPU is saturated, instead of timeouts.
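
The core idea behind a paged KV-cache manager is a block table per sequence: memory is handed out in fixed-size blocks only when a sequence actually needs them. The toy allocator below shows that idea under simplifying assumptions; `PagedKVCache` is a hypothetical class, it tracks block ids rather than real tensors, and it omits prefix sharing, which would need reference counts on blocks.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token; allocate a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; scheduler should preempt or shed")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def utilization(self):
        """Fraction of all blocks currently owned by some sequence."""
        used = sum(len(t) for t in self.block_tables.values())
        return used / (used + len(self.free_blocks))


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("req-a")   # 17 tokens need 2 blocks
for _ in range(5):
    cache.append_token("req-b")   # 5 tokens need 1 block
```

Because waste is bounded by one partially filled block per sequence, the allocator's `utilization()` figure is the kind of number the KV-cache dashboard panel would report.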

Stack you orchestrate

vLLM or a from-scratch serving loop
PyTorch
CUDA
Python
a load-testing tool (Locust or k6)
Prometheus
Docker

Market signal, who wants this

Inference is now a FinOps problem: at production scale it accounts for over 80% of AI GPU spend, and software optimization alone has driven cost-per-million-tokens down 5x on new hardware within months. A funded infrastructure category has formed around exactly this product, including vLLM, Runpod (FlashBoot sub-250ms cold starts), BentoML, and Yotta Labs, because self-hosting beats managed APIs on unit economics above ~100M tokens/month. Investors fund inference platforms because every company running open-weight models needs to cut serving cost without losing quality.

How it is graded

The server implements paged KV-cache management and iteration-level continuous batching.
A request scheduler is present and its time-to-first-token versus throughput tradeoff is documented.
The service is deployed publicly with a clean playground UI, CI/CD on every commit, and autoscaling across replicas.
Production observability tracks tokens-per-second and latency percentiles, and the endpoint is secured.
A load-test report shows throughput and latency percentiles across concurrency levels, with continuous batching measurably compared against a static-batching baseline.
GPU memory usage and KV-cache fragmentation are reported with the design that controls them.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The service is publicly reachable and reproducible, with a clear benchmark methodology.
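
Why continuous batching beats a static baseline can be seen in a step-count simulation before any GPU is involved. The sketch below is an idealized model, not a benchmark: it assumes one token per running request per decoding step, ignores prefill cost and memory limits, and uses made-up output lengths. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue at every decoding iteration."""
    queue = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # Every running request emits one token; finished ones leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


output_lengths = [8, 1, 1, 1, 8, 1, 1, 1]  # illustrative output lengths in tokens
static_steps = static_batching_steps(output_lengths, batch_size=4)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=4)
```

With these lengths the static schedule idles three slots while each long request drains, which is exactly the waste the load-test report should quantify on real hardware.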

Bridges to Operating Systems — scheduling, virtual memory, and throughput optimization

Local Trillion-Parameter Cluster & MLX Deployment

Week 12 milestone

Quantize and deploy an open-weight mixture-of-experts (MoE) model locally. Architect a multi-node local cluster utilizing high-speed interconnects (Thunderbolt 5 RDMA) and the Apple MLX framework, implementing speculative decoding and continuous batching on native hardware.

The deliverable

An optimized mlx_lm or local inference server cluster achieving high tokens-per-second, utilizing 4-bit quantized layers, speculative drafting, and local node orchestration.

What it ships

4-bit quantization
Speculative decoding
Local RDMA clustering
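
For intuition on what 4-bit quantization does to a weight tensor, here is a generic symmetric, group-wise absmax scheme in plain Python. It is an assumption-laden illustration and not the scheme MLX or llama.cpp actually implement: `quantize_4bit` and `dequantize` are hypothetical helpers, the group size is tiny for readability, and codes are kept in [-7, 7].

```python
def quantize_4bit(weights, group_size=4):
    """Symmetric absmax quantization: one float scale per group, ints in [-7, 7]."""
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # all-zero group: any scale works
        codes = [round(w / scale) for w in group]
        groups.append((scale, codes))
    return groups


def dequantize(groups):
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in groups for code in codes]


weights = [0.7, -0.35, 0.1, 0.0, 2.8, -1.4, 0.4, 0.0]
groups = quantize_4bit(weights)
restored = dequantize(groups)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Per-group scales are what keep the rounding error proportional to local weight magnitude; the perplexity criterion in the rubric measures whether that error matters end to end.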

Stack you orchestrate

Apple MLX
C++
Python
Thunderbolt 5 RDMA
llama.cpp

How it is graded

Model quantized to 4-bit representation with minimal loss in perplexity
Multi-node local cluster handles speculative decoding correctly across Thunderbolt 5 RDMA/local interfaces
Inference server delivers at least 10x the token throughput of a CPU-only baseline
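
"Handles speculative decoding correctly" has a concrete meaning: the output must match what the target model alone would produce. The sketch below shows the greedy-verification variant with toy next-token functions. It is an illustration under assumptions: `speculative_step`, `draft_next`, and `target_next` are hypothetical, the target is called once per token where a real system verifies all drafted positions in a single batched forward pass, and sampling-based decoding would need rejection sampling instead of exact matching.

```python
def speculative_step(context, draft_next, target_next, k=4):
    """One round: draft k tokens, keep the prefix the target agrees with."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))
    accepted = []
    for token in proposal:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(token)
    accepted.append(target_next(context + accepted))  # bonus token when all match
    return accepted


def target_next(ctx):
    """Toy target model: always counts up by one."""
    return ctx[-1] + 1


def draft_next(ctx):
    """Toy draft model: agrees with the target except at multiples of 4."""
    guess = ctx[-1] + 1
    return guess if guess % 4 else 0


round_with_miss = speculative_step([0], draft_next, target_next, k=4)
round_all_accepted = speculative_step([0], draft_next, target_next, k=2)
```

Each round yields between one and k + 1 tokens, and every emitted token is one the target would have chosen, so quality is preserved while accepted drafts save target-model steps.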

Bridges to Computer Architecture — number representation, speculative execution, and the memory hierarchy

Applied ML & Model Engineering

Take a base model and make it yours.

Go from neural-net first principles to shipping adapted models: transformer pretraining intuition, supervised fine-tuning, parameter-efficient methods (LoRA/QLoRA), preference optimization (RLHF/DPO), distillation, and rigorous evaluation. You leave able to own a model-customization pipeline end to end. Bridges to Machine Learning, Linear Algebra, and Statistics.

End-to-End Fine-Tuning Pipeline for a Domain Model

Week 8 milestone

An enterprise mandate: take a base open-weight model and adapt it into a specialist for a real domain you choose (legal, medical, code, support), then ship it as a launched product. Own the whole pipeline: build and clean the training dataset, run parameter-efficient fine-tuning (LoRA or QLoRA), and prove on a held-out, contamination-controlled evaluation that the adapted model beats the base model on the target task. The deliverable is not a notebook — it is a directly deployable, hyperscalable product: the fine-tuned model served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD, observability, security, and full marketing (landing page, pitch, demo) so a domain user can try it and a buyer can evaluate it. We do not accept 'it seems better' — bring the numbers, and ship it as a real product.

Why it matters: Domain-adapted models are how companies turn a generic LLM into a defensible product, and most fine-tuning projects fail on data discipline. A builder who can run a clean, evaluated, reproducible fine-tuning pipeline is directly deployable as an ML Engineer or Model Engineer, a sought-after frontier role.

The deliverable

A publicly hosted product with a stable URL and a hyper-usable demo UI, plus a public repo and a published model card: the data pipeline, the QLoRA training configuration and run logs, an autoscaling serving deployment, CI/CD on every commit, production observability, an evaluation comparing base versus fine-tuned on a held-out set, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting data sourcing, the contamination check, the scaling design, and the cost of the run.

What it ships

A dataset builder that ingests raw domain documents and turns them into cleaned, deduplicated instruction data.
An automatic train/eval contamination check that flags and removes overlap before training.
A LoRA/QLoRA training workflow with a reproducible config file and full run logging.
An experiment view comparing runs across hyperparameters, with loss curves and eval scores.
A held-out evaluation harness reporting target-task accuracy for the base model versus the fine-tuned model.
A catastrophic-forgetting check that scores the fine-tuned model on general tasks, not just the target task.
Adapter management: deploy one base model and hot-swap LoRA adapters per request.
An OpenAI-compatible serving API for the fine-tuned model, with autoscaling.
A hyper-usable demo UI where a domain user can try the specialist model on real prompts.
An auto-generated model card documenting data sourcing, intended use, limitations, and run cost.
Production observability and a secured, rate-limited endpoint.
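
A common way to implement the train/eval contamination check is n-gram overlap: any training row that shares a long enough token sequence with the evaluation set is flagged and removed. The sketch below assumes whitespace tokenization and a small n for readability; `ngrams` and `decontaminate` are hypothetical helpers, and a production check would typically normalize punctuation and use a longer window.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_rows, eval_rows, n=8):
    """Split training rows into clean rows and rows overlapping the eval set."""
    eval_grams = set()
    for row in eval_rows:
        eval_grams |= ngrams(row, n)
    clean, flagged = [], []
    for row in train_rows:
        (flagged if ngrams(row, n) & eval_grams else clean).append(row)
    return clean, flagged


eval_rows = ["The tenant must give thirty days written notice"]
train_rows = [
    "Under the lease the tenant must give thirty days notice first",
    "A landlord may enter with reasonable notice",
]
clean, flagged = decontaminate(train_rows, eval_rows, n=4)
```

Logging the flagged rows, not just dropping them, gives the README its documented contamination check.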

Stack you orchestrate

Hugging Face Transformers
TRL
PEFT
bitsandbytes
PyTorch
Hugging Face Datasets
a GPU runtime (Colab, Kaggle, or a cloud instance)

Market signal, who wants this

Domain fine-tuning is a funded 2026 platform category: Together AI, Predibase (acquired by Rubrik in June 2025 for enterprise security depth), and Prem Studio compete on managed LoRA/QLoRA, and adapter-routing (one base model, many adapters per request) is now standard. The economics are compelling: a 7B model can be specialized on a single consumer GPU in an afternoon. Enterprises buy custom models that speak their technical language; investors fund fine-tuning platforms because every vertical AI product needs a model adapted to its own data.

How it is graded

A training dataset is built, cleaned, and deduplicated, with sourcing documented.
Parameter-efficient fine-tuning (LoRA or QLoRA) is run with a reproducible configuration.
A held-out evaluation shows a measured improvement of the fine-tuned model over the base, with train/eval contamination explicitly checked and catastrophic forgetting measured.
The fine-tuned model is served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD on every commit, observability, and a secured endpoint.
The serving architecture holds under concurrent load and the scaling design is documented.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The repo is reproducible, a model card documents intended use and limitations, and the product is publicly reachable.
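
The "parameter-efficient" part of LoRA is plain arithmetic: instead of updating a full d_out x d_in weight matrix, it trains two low-rank factors of shapes rank x d_in and d_out x rank. The numbers below (a 4096 x 4096 projection and rank 16) are illustrative assumptions, not values from any particular model, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA trains A (rank x d_in) and B (d_out x rank) and freezes the base weight."""
    return rank * (d_in + d_out)


full_params = 4096 * 4096                                    # one frozen projection matrix
lora_params = lora_trainable_params(4096, 4096, rank=16)     # trainable adapter weights
trainable_fraction = lora_params / full_params
```

Under these assumptions the adapter holds under 1% of the layer's parameters, which is why hot-swapping adapters per request is cheap compared with swapping base models.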

Bridges to Machine Learning — transfer learning, supervised training, and evaluation

Distill a Frontier Model into a Deployable Specialist

Week 12 milestone

An enterprise mandate: a large model solves a task well but is too expensive to serve at volume. Distill its capability on that task into a small student model that can run cheaply, then ship the student as a launched product. Use the teacher to generate or label training data, train and align the student, and prove the student keeps most of the capability at a fraction of the cost. The deliverable is not a benchmark table — it is a directly deployable, hyperscalable product: the student served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD, observability, security, and full marketing (landing page, pitch, demo). Whatever time it takes — the deliverable is a model the business can actually afford to run and a product a buyer can try. Ship it as a real product.

Why it matters: Distillation is the standard route from an expensive frontier model to an economically viable product feature. A builder who can distill, align, and deploy a specialist student is directly deployable as a senior Model Engineer or Applied Scientist, a high-leverage role because distillation work converts directly into serving-cost reduction at scale.

The deliverable

A publicly hosted product with a stable URL and a hyper-usable demo UI, plus a public repo and a published student model: the distillation data pipeline, the student training and preference-optimization configuration, an autoscaling serving deployment, CI/CD on every commit, production observability, a benchmark comparing teacher, student, and base on the target task, a cost-and-latency comparison, a marketing landing page, a 10-slide pitch, a recorded demo, and a README on the distillation method and scaling design.

What it ships

A teacher-labeling pipeline that uses a frontier model to generate or label distillation data for a chosen task.
Synthetic-data generation with quality filtering so the student trains on clean, diverse examples.
A student-training workflow producing a small (0.6B-8B) model, with reproducible configs and run logs.
Optional preference optimization (DPO) to align the student where the task needs it.
A three-way benchmark — teacher, student, and base — reporting capability retained on the target task.
A cost-and-latency comparison computing the serving-cost reduction versus the teacher.
An accuracy-floor gate that blocks shipping a student that drops below a configured retention threshold.
An OpenAI-compatible serving API for the student model with autoscaling and scale-to-zero.
A hyper-usable demo UI letting a buyer try teacher and student side by side on real prompts.
An auto-generated model card with the distillation method, retention numbers, and intended use.
Production observability and a secured, rate-limited endpoint.
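
The accuracy-floor gate and the cost comparison reduce to two ratios. The sketch below shows one plausible shape for that gate; `ship_decision` is a hypothetical function, the 0.90 floor is an assumed configuration value, and the scores and per-million-token costs are made-up inputs.

```python
def ship_decision(student_score, teacher_score, student_cost, teacher_cost, floor=0.90):
    """Block shipping when the student retains less than `floor` of teacher capability."""
    retained = student_score / teacher_score
    return {
        "retention": retained,
        "cost_reduction": teacher_cost / student_cost,  # e.g. 30.0 means 30x cheaper
        "ship": retained >= floor,
    }


# Made-up benchmark scores and serving costs in USD per million tokens.
good = ship_decision(student_score=0.84, teacher_score=0.90, student_cost=0.5, teacher_cost=15.0)
weak = ship_decision(student_score=0.72, teacher_score=0.90, student_cost=0.5, teacher_cost=15.0)
```

Wiring this into CI means a student that is cheap but below the floor cannot reach the public endpoint by accident.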

Stack you orchestrate

Hugging Face Transformers
TRL
PEFT
PyTorch
vLLM for serving
an eval harness
Docker

Market signal, who wants this

Distillation drives the most-cited 2026 enterprise-AI economics: task-specific small models (0.6B-8B) match or beat frontier models at 10-100x lower inference cost, retaining 85-95% of capability. A $35K-$120K distillation project pays back in three weeks to three months against frontier inference bills, and startups like distil labs are funded purely to 'replace LLMs with custom small language models.' Investors back distillation because it is the clearest path from an expensive frontier model to a margin-positive product feature.

How it is graded

A teacher model is used to generate or label distillation data with a documented method.
A smaller student is trained and the capability retained on the target task is measured against the teacher.
Preference optimization (DPO or RLHF) or alignment of the student is applied where appropriate and justified.
The student is served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD on every commit, observability, and a secured endpoint that holds under concurrent load.
A cost and latency comparison shows the student is materially cheaper to serve, with the accuracy-versus-cost tradeoff reported honestly.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The repo is reproducible, the student model is published with a model card, and the product is publicly reachable.

Bridges to Machine Learning — model compression and the teacher-student paradigm

Production AI Products

Ship AI products that survive real users and real attackers.

Build the full product around a model: retrieval and context engineering at scale, LLM evaluation and observability, AI red-teaming and security, cost governance, and LLMOps. You leave able to take an AI feature from prototype to a hardened, monitored, publicly hosted product. Bridges to Databases, Software Engineering, and Information Security.

EDDOps-Hardened RAG Platform & Agent-First Lakehouse

Week 6 milestone

Build a production-ready RAG platform backed by an agent-first embedded vector lakehouse (LanceDB/pgvector). Implement an automated EDDOps (evaluation-driven development and operations) validation suite with custom golden datasets, trace-based latency checks, and LLM-as-a-judge regression tests.

The deliverable

A dockerized, query-active vector service and structured context database with live tracing and a production CI/CD evaluation gate.

What it ships

Contextual retrieval
Automated EDDOps trace evaluations
Continuous integration regression gates

Stack you orchestrate

Python
LanceDB
pgvector
MLflow
Docker

How it is graded

Vector retrieval latency is under 50ms at scale with structured indices
EDD regression test suite executes automatically on mock context drift
LLM-as-a-judge correctly rates context relevance and faithfulness with >85% alignment to human golden labels
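
The judge-alignment criterion can be computed as simple agreement between the judge's labels and the human golden labels, then used as a CI gate. This is a minimal sketch: `judge_alignment` and `regression_gate` are hypothetical names, the labels are fabricated for illustration, and plain agreement is only one possible metric (a chance-corrected score may be preferable on imbalanced data).

```python
def judge_alignment(judge_labels, human_labels):
    """Fraction of golden examples where the judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def regression_gate(judge_labels, human_labels, threshold=0.85):
    """Return True when alignment exceeds the rubric's 85% bar."""
    return judge_alignment(judge_labels, human_labels) > threshold


human = ["faithful"] * 12 + ["unfaithful"] * 8   # illustrative golden labels
judge = list(human)
judge[0] = "unfaithful"                          # two deliberate disagreements
judge[19] = "faithful"
alignment = judge_alignment(judge, human)
gate_open = regression_gate(judge, human)
```

Re-running this whenever the judge prompt or model changes keeps the evaluator itself under regression testing.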

Bridges to Databases — indexing, information retrieval, and query optimization

AI Observability & Red-Team Pipeline

Week 12 milestone

An enterprise mandate: the company's AI features are live and the security and reliability teams are flying blind. Build and launch a product with two interlocking systems: an observability pipeline that traces every LLM call with token, cost, and latency telemetry and surfaces silent quality drift; and an automated red-team harness that continuously attacks the AI product with prompt injection, jailbreaks, and data-exfiltration probes, and reports which guardrails held. The deliverable is directly deployable and hyperscalable: real public hosting, CI/CD, a hyper-usable dashboard a security lead reads at a glance, the platform itself secured, and the ingestion path able to absorb high call volume. It ships complete with marketing — a landing page, a pitch, and a demo. Deliver something an enterprise can buy and run on a real product before an incident, not after. Ship it as a real product.

Why it matters: AI security and observability is a board-level concern as AI features ship into regulated industries, and almost no one combines both. A builder who delivers a tracing-plus-red-team pipeline is directly deployable as an AI Security Engineer or LLMOps Lead, a scarce role because it sits at the intersection of security, reliability, and AI.

The deliverable

A publicly hosted product with a stable URL and a hyper-usable security dashboard, plus a public repo: the tracing and observability pipeline, the automated red-team attack suite with a results report, the guardrails it validates, CI/CD on every commit, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the threat model, the drift-detection method, and the scaling design.

What it ships

An SDK/proxy that traces every LLM call with token counts, cost, latency, model, and a correlation ID.
A real-time dashboard of spend, latency percentiles, error rate, and call volume, sliceable by feature and model.
Silent-quality-drift detection that scores live traffic and alerts when output quality degrades.
An automated red-team suite running prompt-injection, jailbreak, indirect-injection, and data-exfiltration attack batteries.
A continuously updated attack library so new jailbreak techniques are tested as they emerge.
Input and output guardrails (PII redaction, injection filtering, policy checks) with a report of which held under attack.
A red-team scorecard mapping every finding to the OWASP LLM Top 10, exportable for audit.
Alerting integrations (email, webhook, Slack) for cost spikes, drift, and failed guardrails.
A high-throughput ingestion path that absorbs production call volume without sampling loss.
Scheduled red-team runs in CI so a regression in defenses fails the build.
Multi-project workspaces with role-based access so security leads and engineers see scoped views.
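
At its smallest, the tracing SDK is a wrapper that times each call, prices it from token counts, and emits a span tagged with a correlation ID. The sketch below assumes a hypothetical price table, a fake model function that returns text plus token counts, and an in-memory list as the sink; a real pipeline would export spans through OpenTelemetry or a similar backend.

```python
import time
import uuid
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens, for illustration only.
PRICE_PER_MILLION = {"demo-model": {"input": 3.00, "output": 15.00}}


@dataclass
class Span:
    correlation_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def traced_call(llm, model, prompt, sink, correlation_id=None):
    """Call the model, then record tokens, cost, and latency as one span."""
    started = time.perf_counter()
    text, input_tokens, output_tokens = llm(model, prompt)
    latency_ms = (time.perf_counter() - started) * 1000
    price = PRICE_PER_MILLION[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    sink.append(Span(correlation_id or uuid.uuid4().hex, model,
                     input_tokens, output_tokens, latency_ms, cost))
    return text


def fake_llm(model, prompt):
    """Stand-in for a real client: returns (text, input_tokens, output_tokens)."""
    return "ok", 1000, 200


spans = []
reply = traced_call(fake_llm, "demo-model", "ping", spans, correlation_id="req-42")
```

Passing the same correlation ID through every call in a request is what lets the dashboard slice spend and latency by feature.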

Stack you orchestrate

Claude API or open-weight LLM
OpenTelemetry
a tracing backend
a guardrails library
Node.js or Python
GitHub Actions
Google Cloud Run

Market signal, who wants this

AI security is a proven, acquisition-grade 2026 market: Lakera, which built exactly this guardrails-plus-red-teaming product (Lakera Guard at 98%+ detection, sub-50ms; Lakera Red for automated attack simulation), was acquired by Cisco in May 2025 and folded into Cisco AI Defense. Evaluation leaders like Galileo now ship guardrails that intercept outputs before tool execution. Investors fund AI observability and red-teaming because shipping AI into regulated industries makes pre-incident security a board-level requirement, and almost no product combines tracing and red-teaming in one.

How it is graded

Every LLM call is traced with token, cost, and latency telemetry and correlation IDs.
Silent quality drift is detected and surfaced, not just raw metrics displayed.
An automated red-team suite runs prompt-injection, jailbreak, and exfiltration attacks, and the report shows which input/output guardrails held and which failed.
The platform is deployed to real public hosting with CI/CD on every commit and is itself secured.
The ingestion path is hyperscalable and absorbs high call volume; the scaling design is documented.
The dashboard is fast, WCAG 2.2 AA accessible, and readable at a glance by a security lead.
The threat model is documented and mapped to the OWASP LLM Top 10.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo — and is publicly reachable and reproducible.
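
Surfacing silent drift, as opposed to displaying raw metrics, means comparing recent quality scores against a baseline and alerting on a sustained drop. The sketch below uses a rolling mean, which is one simple option among many; `DriftDetector`, the window size, and the allowed drop are all assumptions, and the scores are presumed to come from some upstream quality scorer.

```python
from collections import deque
from statistics import mean


class DriftDetector:
    """Alert when the rolling mean quality falls too far below a baseline."""

    def __init__(self, baseline, window=50, max_drop=0.05):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)

    def observe(self, score):
        """Record one scored response; return True when the window has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough traffic yet to judge
        return self.baseline - mean(self.scores) > self.max_drop


detector = DriftDetector(baseline=0.9, window=5, max_drop=0.05)
alerts = [detector.observe(score) for score in [0.9] * 5 + [0.6]]
```

A single bad response does not fire the alert on its own in larger windows; tuning window and threshold trades detection speed against false alarms.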

Bridges to Information Security — threat modeling, penetration testing, and monitoring

Frontier Systems

Build the distributed, real-time substrate AI runs on.

Engineer the systems frontier AI depends on: large-scale distributed training, real-time and streaming AI, GPU cluster scheduling, vector databases at scale, and the consistency and fault-tolerance tradeoffs underneath it all. You leave able to reason about and operate planet-scale AI infrastructure. Bridges to Distributed Systems, Databases, and Operating Systems.

Distributed Training System for a Multi-GPU Model

Week 8 milestone

An enterprise mandate: train a model that does not fit on one GPU, and turn the result into a launched product. Build a distributed training setup that uses data parallelism and at least one model-parallel strategy (tensor, pipeline, or fully-sharded), with correct collective communication, checkpointing that survives a node failure, and a throughput report. The run will be long; it must resume cleanly from a checkpoint after a simulated crash. The deliverable is not just training logs; the resulting model must be served as a directly deployable, hyperscalable product: a real public API with a hyper-usable demo UI, autoscaling, CI/CD, observability, security, and a live training-metrics dashboard. It ships complete with marketing: a landing page, a pitch, and a demo. We are not here to babysit a job that loses days of compute on one failure; ship the model as a real product.

Why it matters: Distributed training is the backbone skill behind every frontier model, and few engineers can debug a stalled all-reduce or a corrupt checkpoint. A builder who ships a fault-tolerant multi-GPU training system is directly deployable as a Distributed Systems or Training Infrastructure Engineer, one of the scarcest frontier roles.

The deliverable

A public repo, a benchmark report, and a publicly hosted product with a stable URL and a hyper-usable demo UI serving the trained model: the distributed training configuration, the parallelism strategy, the checkpoint-and-resume logic, a live training-metrics dashboard, an autoscaling serving deployment, CI/CD on every commit, production observability, a scaling report across GPU counts, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the communication pattern, the failure-recovery design, and the serving-scale design.

What it ships

A job launcher that takes a model and dataset and spreads training across multiple GPUs from a simple config.
Data parallelism plus at least one model-parallel strategy (tensor, pipeline, or fully-sharded).
Gang scheduling so every worker in a job starts together, preventing partial scheduling that idles GPUs.
Periodic distributed checkpointing with clean resume after a simulated node failure, losing no completed steps.
A live training dashboard: loss curves, throughput, GPU utilization, and inter-node communication overhead.
A scaling report that sweeps GPU counts and reports throughput and parallel efficiency.
Automatic detection and recovery from a stalled or crashed worker mid-run.
Spot/preemptible-instance support with checkpoint-driven recovery to cut training cost.
Serving of the trained model behind an OpenAI-compatible API with autoscaling.
A hyper-usable demo UI where a user can try the trained model on real prompts.
Production observability and a secured, rate-limited serving endpoint.
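
Checkpoint-and-resume logic can be exercised on a single process before it is made distributed. The sketch below writes checkpoints atomically and resumes after a simulated crash. It is a toy under stated assumptions: `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical, the "model state" is a small dict, and steps taken since the last checkpoint are recomputed on resume, so losing no completed steps at all would require checkpointing every step or a sharded, framework-level mechanism.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a torn file."""
    tmp = Path(f"{path}.tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if Path(path).exists():
        return json.loads(Path(path).read_text())
    return {"step": 0, "loss": None}


def train(path, total_steps, every=10, crash_at=None):
    """Resume from the last checkpoint and run until total_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if step == crash_at:
            raise RuntimeError("simulated node failure")
        state = {"step": step, "loss": 1.0 / step}  # stand-in for a real update
        if step % every == 0:
            save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    try:
        train(ckpt, total_steps=100, crash_at=37)
    except RuntimeError:
        pass
    resumed_from = load_checkpoint(ckpt)["step"]
    final = train(ckpt, total_steps=100)
```

The write-then-rename pattern matters more as checkpoints grow: a node that dies mid-write must never corrupt the only copy the job can resume from.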

Stack you orchestrate

PyTorch
PyTorch FSDP or DeepSpeed
NCCL
Python
a multi-GPU runtime
a cluster scheduler (Slurm or Kubernetes)
Weights & Biases or TensorBoard

Market signal, who wants this

Distributed training infrastructure is a heavily funded 2026 category: Gartner projects $37.5B of end-user spending on AI-optimized infrastructure in 2026, and a market of training-orchestration products has formed: CoreWeave's Kubernetes-native GPU cloud, dstack's distributed-training orchestration, NVIDIA's open-source KAI Scheduler with gang scheduling, and NVIDIA Run:ai. Investors fund training infrastructure that keeps expensive clusters above 70% utilization; the scarce, decisive skill is making multi-GPU jobs fault-tolerant and efficient, not merely launching them.

How it is graded

Training runs across multiple GPUs using data plus at least one model-parallel strategy, with correct collective communication and the parallelism decomposition documented.
Checkpointing is implemented and the run resumes correctly after a simulated node failure.
A scaling report shows throughput and efficiency across GPU counts, and communication-versus-computation overhead is measured and discussed.
The trained model is served as a directly deployable product behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD on every commit, observability, and a secured endpoint.
A live training-metrics dashboard is provided, and the serving layer holds under concurrent load.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The repo is reproducible with a clear benchmark methodology and the product is publicly reachable.
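
For the scaling report named in the rubric above, parallel efficiency is commonly computed as measured speedup divided by ideal speedup. The sketch below assumes throughput is reported in samples per second and uses the smallest GPU count as the baseline; `scaling_report` and the numbers in `measured` are hypothetical placeholders, not benchmark results.

```python
def scaling_report(throughput_by_gpus):
    """Compute speedup and parallel efficiency relative to the smallest run.

    `throughput_by_gpus` maps GPU count -> measured samples per second.
    """
    base_gpus = min(throughput_by_gpus)
    base_tput = throughput_by_gpus[base_gpus]
    rows = []
    for gpus in sorted(throughput_by_gpus):
        speedup = throughput_by_gpus[gpus] / base_tput
        ideal = gpus / base_gpus  # perfect linear scaling
        rows.append({"gpus": gpus, "speedup": speedup, "efficiency": speedup / ideal})
    return rows


# Hypothetical measurements, for illustration only.
measured = {1: 100.0, 2: 190.0, 4: 360.0, 8: 640.0}
report = scaling_report(measured)
for row in report:
    print(f"{row['gpus']} GPUs: speedup {row['speedup']:.2f}, "
          f"efficiency {row['efficiency']:.0%}")
```

Efficiency falling as GPU count rises is the usual signature of communication overhead, which is why the rubric asks for communication and computation time to be measured separately.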

Bridges to Distributed Systems — parallelism, collective communication, and fault tolerance

Real-Time Streaming AI Inference Platform

Week 12 milestone

An enterprise mandate: build and launch a platform that consumes a high-rate live event stream, runs inference on each event with low latency, indexes results into a vector store, and serves real-time similarity queries, all while staying healthy under bursty load and a node failure. This is a distributed-systems problem with AI inside it: exactly-once or well-reasoned delivery semantics, backpressure, autoscaling, GPU-aware scheduling, and observability across the fleet. The deliverable is a directly deployable, hyperscalable product: real public hosting, CI/CD, security, a hyper-usable real-time dashboard and query UI, and full marketing (a landing page, a pitch, and a demo). Deliver a platform that does not fall over when traffic spikes and that a buyer can evaluate. Ship it as a real product.

Why it matters: Real-time AI on live data powers fraud detection, recommendations, and observability products across every major platform. A builder who ships a streaming inference platform that holds up under load and failure is directly deployable as a senior Distributed Systems or Real-Time AI Engineer, a role in demand because it requires both systems depth and AI fluency.

The deliverable

A publicly hosted platform with a stable URL and a hyper-usable real-time dashboard plus query UI, plus a public repo: the streaming ingestion and inference pipeline, the at-scale vector index, the autoscaling and backpressure design, CI/CD on every commit, fleet observability, a load-and-failure test report, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting delivery semantics, fault tolerance, security, and capacity planning.

What it ships

Ingestion of a high-rate live event stream (transactions, clicks, logs) from a streaming log such as Kafka.
Per-event low-latency inference with a documented end-to-end latency budget (target sub-100ms).
Stated delivery semantics — exactly-once, or at-least-once with idempotent processing — enforced in the pipeline.
A feature layer that assembles per-event features within the latency window for the model.
Inference results indexed into a vector store serving real-time similarity and nearest-neighbor queries.
Backpressure and autoscaling that keep the platform healthy through a sudden traffic burst.
Node-failure recovery with graceful degradation, demonstrated via injected failure.
A real-time dashboard of event throughput, inference latency percentiles, and fleet health.
A query UI for similarity search and recent-event lookup, usable without instruction.
Alerting on latency-SLO breaches, lag buildup, and anomalous event rates.
Multi-tenant isolation and a secured, rate-limited query API.
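
The "at-least-once with idempotent processing" option in the list above can be illustrated with a small consumer that ignores redelivered events. This is a minimal sketch: `IdempotentProcessor`, the `event_id` field, and the in-memory set are assumptions made for illustration, and a real pipeline would keep the deduplication record in a durable store and commit it atomically with the state update.

```python
class IdempotentProcessor:
    """Apply each event at most once, even if the stream redelivers it."""

    def __init__(self):
        self.seen_ids = set()  # in production: durable storage, not process memory
        self.totals = {}       # per-account running total (the "side effect")

    def handle(self, event):
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self.seen_ids:
            return False
        account = event["account"]
        self.totals[account] = self.totals.get(account, 0) + event["amount"]
        self.seen_ids.add(event["event_id"])
        return True


# An at-least-once stream may deliver e1 twice, for example after a consumer restart.
deliveries = [
    {"event_id": "e1", "account": "a", "amount": 10},
    {"event_id": "e2", "account": "a", "amount": 5},
    {"event_id": "e1", "account": "a", "amount": 10},  # redelivery
    {"event_id": "e3", "account": "b", "amount": 7},
]
processor = IdempotentProcessor()
applied = [processor.handle(event) for event in deliveries]
```

The redelivered `e1` is skipped, so the totals match what exactly-once delivery would have produced.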

Stack you orchestrate

Apache Kafka or a streaming log
Apache Flink or a stream processor
vLLM or a serving engine
FAISS or a vector database
Kubernetes
Prometheus and Grafana
Docker

Market signal, who wants this: Real-time streaming AI is a funded 2026 category anchored in fraud detection and live personalization: Artie raised a $12M Series A to make real-time data the default for AI systems, Experian launched real-time AI fraud detection with Resistant AI's 80+ models, and global fintech venture funding hit $12B across 751 deals by April 2026. Production fraud models need sub-millisecond feature retrieval and 20-100+ features within a 100ms window, served by vector databases like Pinecone, Milvus, and Redis. Investors fund streaming-AI platforms because regulated finance and large consumer platforms must score live events instantly or lose money.

How it is graded

A live event stream is consumed and inference runs per event with measured low latency.
Delivery semantics (exactly-once or at-least-once with idempotency) are stated and justified.
Inference results are indexed into a vector store that serves real-time similarity queries.
Backpressure and autoscaling keep the platform healthy under a simulated traffic burst, and an injected node failure is recovered with graceful degradation.
The platform is deployed to real public hosting with CI/CD on every commit, fleet observability, and security hardening.
A fast, WCAG 2.2 AA accessible real-time dashboard and query UI lets a stranger use the platform without instruction.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The platform is publicly reachable and fully reproducible from the repo.
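
"Measured low latency" in the rubric above usually means reporting percentiles, not an average. The sketch below uses the nearest-rank percentile definition and checks p99 against the 100 ms budget mentioned in the brief. Gating on p99 is an assumption here, and `percentile`, `slo_report`, and the sample latencies are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def slo_report(latencies_ms, budget_ms=100.0):
    """Summarize latency percentiles and flag a breach of the p99 budget."""
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    return {"p50": p50, "p95": p95, "p99": p99, "breach": p99 > budget_ms}


# Hypothetical end-to-end latencies in milliseconds; one slow outlier.
latencies = [12, 15, 18, 20, 22, 25, 30, 41, 60, 140]
summary = slo_report(latencies)
```

With these samples the median looks healthy while the tail breaches the budget, which is the case an average would hide.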

Bridges to Distributed Systems — stream processing, fault tolerance, and capacity planning

Multimodal AI & Embodied World Modeling

Build models that see, hear, and physically model the world.

Engineer the systems behind vision-language, any-to-any models, and physics-simulating World Action Models (WAMs): how a vision encoder, projection layer, and language decoder fuse; how diffusion and flow-matching generate images and video; and how embodied models process unified physical trajectories for robotic interventions. Bridges to Computer Vision, Signal Processing, and Machine Learning.

Multimodal Visual Assistant with a Fine-Tuned VLM

Week 6 milestone

An enterprise mandate: ship a launched product — a visual assistant that answers grounded questions about images a user uploads (documents, charts, product photos, screenshots), built on a fine-tuned open-weight vision-language model. Adapt a base VLM with parameter-efficient fine-tuning on a domain image-text dataset you build, prove on a held-out set that it beats the base model, and serve it. The deliverable is not a notebook — it is a directly deployable, hyperscalable product: the VLM behind a real public API with a fast, accessible image-and-chat UI, autoscaling, CI/CD, observability, security, and full marketing (landing page, pitch, demo). Hallucination on images is the failure mode that loses trust; measure it and report it honestly. Ship it as a real product.

Why it matters: Vision-language assistants are moving into document processing, support, retail, and accessibility tooling, and the hard part is grounded, low-hallucination answers on real images, not a demo on a benchmark photo. A builder who can fine-tune, evaluate, and ship a VLM product is directly deployable as a Multimodal AI Engineer or Applied AI Engineer, a growing frontier role.

The deliverable

A publicly hosted visual-assistant product with a stable URL and a fast, accessible image-upload-and-chat UI, plus a public repo and a published model card: the image-text dataset pipeline, the LoRA/QLoRA fine-tuning configuration and run logs, an autoscaling serving deployment, CI/CD on every commit, production observability, an evaluation comparing the base versus fine-tuned VLM on a held-out set including a hallucination measurement, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the architecture, data sourcing, and scaling design.

What it ships

Image upload supporting documents, charts, screenshots, and photos, with a chat thread per image.
A fine-tuned open-weight vision-language model adapted to a chosen domain with LoRA or QLoRA.
A domain dataset builder that turns raw images and annotations into cleaned image-text training pairs.
Grounded answers that reference regions or content of the uploaded image, with an explicit decline when the image cannot support the question.
A held-out evaluation harness reporting target-task accuracy for the base versus the fine-tuned model.
A visual-hallucination check that scores how often the model asserts content not present in the image.
An OpenAI-compatible multimodal serving API so the assistant is a drop-in for existing clients.
A fast, accessible chat UI with streaming answers, image thumbnails, and conversation history.
A cost-and-latency dashboard tracking image-token usage and per-request timing.
Autoscaling with health and readiness probes, and a secured, rate-limited endpoint.
An auto-generated model card documenting data sourcing, intended use, and measured limitations.
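
One simple way to approach the visual-hallucination check in the list above is an object-level rate: of everything the model asserted is in the image, what fraction is absent from the annotations. The sketch assumes asserted objects have already been extracted from the model's answers upstream, which is the hard part and is not shown. `hallucination_rate` and the example records are hypothetical.

```python
def hallucination_rate(examples):
    """Fraction of asserted objects that are absent from the image annotations.

    Each example has `present` (annotated ground truth) and `asserted`
    (objects the model claimed to see). An empty `asserted` list models
    an explicit decline and contributes nothing to the rate.
    """
    asserted = 0
    hallucinated = 0
    for example in examples:
        present = set(example["present"])
        for obj in example["asserted"]:
            asserted += 1
            if obj not in present:
                hallucinated += 1
    return hallucinated / asserted if asserted else 0.0


# Hypothetical held-out records, for illustration only.
eval_set = [
    {"present": ["bar chart", "legend", "title"], "asserted": ["bar chart", "legend"]},
    {"present": ["invoice", "total"], "asserted": ["invoice", "signature"]},
    {"present": ["laptop"], "asserted": []},  # model declined to answer
]
rate = hallucination_rate(eval_set)
```

Running the same function on base-model and fine-tuned outputs over the same held-out set gives the side-by-side number the evaluation asks for.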

Stack you orchestrate

Hugging Face Transformers
TRL
PEFT
an open-weight VLM (Qwen-VL or similar)
PyTorch
Node.js or Python
Google Cloud Run

Market signal, who wants this: Open-weight vision-language models are a fast-moving 2026 category: models such as Qwen2.5-VL now match closed frontier VLMs on many tasks and can be fine-tuned on 5,000-50,000 examples with LoRA for modest compute, and unified omni-models (Qwen3.5-Omni) extend this to audio and video. Document intelligence, retail visual search, and accessibility tooling are active buyers, and Stanford VHELM has standardized how VLM quality is compared. Investors fund multimodal-AI products because customer data is overwhelmingly visual, not just text.

How it is graded

A domain image-text dataset is built, cleaned, and documented, with a train/eval split that is contamination-checked.
A base open-weight VLM is fine-tuned with a parameter-efficient method (LoRA or QLoRA) using a reproducible configuration.
A held-out evaluation shows a measured improvement of the fine-tuned VLM over the base model, and visual hallucination is measured and reported honestly.
The VLM is served behind a real public API with a fast, WCAG 2.2 AA accessible image-upload-and-chat UI, autoscaling, CI/CD on every commit, observability, and a secured endpoint.
Answers are grounded in the uploaded image, and the assistant declines or flags questions the image cannot support instead of fabricating.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The repo is reproducible, a model card documents intended use and limitations, and the product is publicly reachable.
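
The contamination check on the train/eval split can start with exact-match fingerprints. The sketch below hashes the image bytes together with whitespace- and case-normalized text and reports held-out items that also appear in training. It is an assumption-laden baseline: `fingerprint` and `find_contamination` are hypothetical helpers, and near-duplicate images (resized or re-encoded) would need perceptual hashing on top of this.

```python
import hashlib


def fingerprint(image_bytes, text):
    """Stable hash of an image-text pair, ignoring case and extra whitespace in the text."""
    normalized = " ".join(text.lower().split())
    digest = hashlib.sha256()
    digest.update(image_bytes)
    digest.update(b"\x00")  # separator so image and text bytes cannot run together
    digest.update(normalized.encode("utf-8"))
    return digest.hexdigest()


def find_contamination(train_pairs, heldout_pairs):
    """Return indices of held-out pairs whose fingerprint also occurs in training."""
    train_prints = {fingerprint(img, txt) for img, txt in train_pairs}
    return [
        index
        for index, (img, txt) in enumerate(heldout_pairs)
        if fingerprint(img, txt) in train_prints
    ]


# Placeholder byte strings stand in for real image files.
train = [(b"img-1", "What is the total?"), (b"img-2", "Read the title.")]
heldout = [(b"img-3", "What is the total?"), (b"img-1", "what is  the TOTAL?")]
leaks = find_contamination(train, heldout)
```

The second held-out pair is flagged because it reuses a training image with the same question in different casing; the first is not, because only the question text is shared.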

Bridges to Computer Vision — image understanding, representation learning, and evaluation

Generative Image Studio with a Diffusion Pipeline

Week 12 milestone

An enterprise mandate: build and launch a generative image studio — a product where a user describes or sketches what they want and the system generates, edits, and refines images with a diffusion pipeline you assemble and control. Use a latent diffusion model with text and image conditioning and classifier-free guidance, add controllable editing (inpainting, image-to-image), and prove generation quality and safety. The deliverable is directly deployable and hyperscalable: real public hosting, an autoscaling generation queue, CI/CD, observability, a hyper-usable creative UI, content-safety filtering, and full marketing (landing page, pitch, demo). A diffusion demo is easy; a launched, safe, queue-backed studio is the job. Ship it as a real product.

Why it matters: Generative image tooling is a mainstream creative-software category, and the engineering differentiator is a controllable, safe, queue-backed pipeline, not a single text-to-image call. A builder who ships a generation studio with editing, safety filtering, and autoscaling is directly deployable as a Generative AI Engineer or Applied AI Engineer, a sought-after role in creative-tools and media teams.

The deliverable

A publicly hosted generative image studio with a stable URL and a hyper-usable creative UI, plus a public repo: the diffusion generation and editing pipeline, an autoscaling generation queue, content-safety filtering, CI/CD on every commit, production observability, a quality-and-latency report across guidance and step settings, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the conditioning design, the safety design, and the scaling design.

What it ships

Text-to-image generation with a latent diffusion model and tunable classifier-free guidance.
Image-to-image and inpainting so a user can edit and refine an uploaded or generated image.
A prompt workspace with generation history, versioning, and one-click re-run of a past generation.
An autoscaling generation queue that accepts bursts of jobs and reports position and estimated wait.
Content-safety filtering on prompts and outputs, with a documented policy and a clear refusal path.
A quality-and-latency panel comparing step count and guidance settings on real prompts.
A creative UI with live previews, thumbnails, and a gallery of past generations.
Negative prompts and seed control so a creator can reproduce or steer a result deterministically.
Export to common image formats and a shareable public gallery URL.
A cost-and-throughput dashboard tracking generations, queue depth, and per-image compute.
Autoscaling with health and readiness probes and a secured, rate-limited generation API.
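
Classifier-free guidance, named in the first item above, combines two noise predictions at each denoising step: one conditioned on the prompt and one unconditioned. A minimal sketch of that combination on plain Python lists follows. `cfg_combine` is a hypothetical helper, and real pipelines apply the same arithmetic to tensors inside the sampling loop.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    guidance_scale = 0 ignores the prompt, 1 reproduces the conditional
    prediction, and larger values push further in the prompt's direction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions", for illustration only.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

no_prompt = cfg_combine(eps_uncond, eps_cond, 0.0)    # equals eps_uncond
plain_cond = cfg_combine(eps_uncond, eps_cond, 1.0)   # equals eps_cond
amplified = cfg_combine(eps_uncond, eps_cond, 2.0)    # extrapolates past eps_cond
```

Higher scales tend to follow the prompt more closely at some cost to diversity and naturalness, which is the tradeoff the quality-and-latency panel is meant to expose.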

Stack you orchestrate

Hugging Face Diffusers
PyTorch
an open-weight latent diffusion model
a job queue (Redis or Cloud Tasks)
Node.js or Python
Prometheus
Google Cloud Run

Market signal, who wants this: Generative media is a large, funded 2026 market with diffusion and flow-matching as the dominant techniques across image and now video, taught in fresh university courses such as MIT 6.S184. Creative-tools companies and marketing, design, and media teams are active buyers, and the production bar has moved from raw generation to controllable editing, safety filtering, and reliable throughput. Investors fund generative-media products because controllable, safe creation at scale is what turns a model into a usable creative tool.

How it is graded

A latent diffusion pipeline generates images from text prompts with classifier-free guidance, and the guidance-versus-quality tradeoff is documented.
Controllable editing is implemented — at least inpainting and image-to-image — and works on user-supplied images.
The studio runs on an autoscaling generation queue that absorbs bursts without dropping jobs, and the scaling design is documented.
Content-safety filtering screens prompts and outputs, and the policy and its enforcement are documented.
The product is deployed to real public hosting with CI/CD on every commit, production observability of generation latency and queue depth, and a secured endpoint.
The creative UI is fast, WCAG 2.2 AA accessible, and usable by a non-technical creator without instruction.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo — and is publicly reachable and reproducible.
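
The generation queue is expected to report position and estimated wait. A rough sketch of that calculation is below, assuming a fixed number of workers, a known average time per job, and first-in-first-out order. `GenerationQueue` and its parameters are hypothetical, and a deployed queue (for example on Redis or Cloud Tasks) would also track in-flight jobs and measured durations.

```python
from collections import deque


class GenerationQueue:
    """FIFO queue that reports a job's position and a coarse wait estimate."""

    def __init__(self, workers, seconds_per_job):
        self.workers = workers
        self.seconds_per_job = seconds_per_job
        self.pending = deque()

    def submit(self, job_id):
        """Enqueue a job and return its initial status."""
        self.pending.append(job_id)
        return self.status(job_id)

    def status(self, job_id):
        """Return 1-based position and estimated wait; raises ValueError if unknown."""
        position = self.pending.index(job_id) + 1
        # Jobs ahead are drained in waves of `workers` at a time.
        waves_ahead = (position - 1) // self.workers
        return {"position": position,
                "estimated_wait_s": waves_ahead * self.seconds_per_job}


queue = GenerationQueue(workers=2, seconds_per_job=8)
for job in ["a", "b", "c", "d", "e"]:
    queue.submit(job)

first = queue.status("a")
third = queue.status("c")
last = queue.status("e")
```

The estimate assumes all workers are idle when it is computed, so it is a lower bound; autoscaling changes `workers` over time and the estimate should be recomputed when it does.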

Bridges to Computer Vision — image synthesis, conditional generation, and sampling

AI Safety, Alignment & Interpretability

Make powerful models honest, transparent, and governable.

Work on what frontier labs hire most heavily for: alignment techniques (RLHF, DPO, Constitutional AI), mechanistic interpretability that reverse-engineers what a network computes, safety evaluations and scalable oversight, adversarial robustness and red-teaming depth, and AI governance grounded in real frameworks (model cards, the EU AI Act, the NIST AI RMF). You leave able to ship an interpretability tool and a safety-and-governance pipeline. Bridges to Information Security, Theory of Computation, and Software Engineering.

Mechanistic Interpretability Tool for a Transformer

Week 7 milestone

An enterprise mandate: build and launch an interpretability tool, a product that lets a user load a small open-weight transformer, run a prompt, and inspect what the model is actually computing: attention patterns, per-layer activations, the contribution of individual components, and features surfaced by a sparse autoencoder. This is reverse engineering as a product: make the internals of a black-box model legible. The deliverable is directly deployable and hyperscalable: real public hosting, a fast accessible inspector UI, CI/CD, observability, security, and full marketing (landing page, pitch, demo). Interpretability is among the skills frontier labs hire for most heavily, so build a tool a researcher would actually use. Ship it as a real product.

Why it matters: Mechanistic interpretability is among the most heavily hired-for research directions at frontier labs, because regulators and leadership increasingly want to know what a deployed model is doing internally, not just how it scores. A builder who ships an interpretability tool that runs real circuit and feature analysis is directly deployable as an Interpretability Researcher or AI Safety Engineer, a scarce and rapidly growing role.

The deliverable

A publicly hosted interpretability tool with a stable URL and a fast, accessible model-inspector UI, plus a public repo: the activation-capture and analysis backend, attention and activation visualizations, an activation-patching workflow, a sparse-autoencoder feature view, CI/CD on every commit, production observability, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the interpretability methods, their limitations, and the scaling design.

What it ships

Load a small open-weight transformer and run an arbitrary prompt through an instrumented forward pass.
Attention-pattern visualization per head and per layer, with token-to-token attribution.
Per-layer residual-stream and activation inspection, with the ability to compare two prompts side by side.
An activation-patching workflow that swaps activations between runs to make causal claims about components.
A sparse-autoencoder feature view that surfaces interpretable features and shows where they activate.
Logit-lens style projection so a user can see how a prediction forms across layers.
A saved-investigation workspace so a researcher can revisit and share a prior analysis.
A fast, accessible inspector UI with clear, labelled visualizations and keyboard navigation.
Concurrent-session support so multiple users can run analyses without contention.
Production observability for analysis latency and session load, and a secured, rate-limited backend.
A documentation panel stating, for each method, what it can and cannot tell you.
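
The activation-patching workflow in the list above has a simple core: run the model on a clean and a corrupted input, copy one component's clean activation into the corrupted run, and see how much of the output gap it restores. The toy below shows only that logic on a two-component stand-in for a model. `components`, `run`, and `patching_effect` are hypothetical, and a real tool would do the same thing with forward hooks on a transformer.

```python
def components(x):
    """Toy 'model internals': two named components, each reading one input feature."""
    return {
        "head_a": 2.0 * x[0],
        "head_b": 0.5 * x[1],
    }


def run(x, patch=None):
    """Forward pass; `patch` overwrites selected activations with cached values."""
    acts = components(x)
    if patch:
        acts.update(patch)
    return acts["head_a"] + acts["head_b"]


def patching_effect(clean_x, corrupt_x):
    """Fraction of the clean-vs-corrupt output gap restored by patching each component."""
    clean_acts = components(clean_x)
    clean_out = run(clean_x)
    corrupt_out = run(corrupt_x)
    gap = clean_out - corrupt_out
    if gap == 0:
        raise ValueError("clean and corrupted outputs are identical")
    effects = {}
    for name, value in clean_acts.items():
        patched_out = run(corrupt_x, patch={name: value})
        effects[name] = (patched_out - corrupt_out) / gap
    return effects


# The corrupted input differs from the clean one only in the first feature.
effects = patching_effect(clean_x=[1.0, 4.0], corrupt_x=[0.0, 4.0])
```

Patching `head_a` restores the whole gap and patching `head_b` restores none of it, which supports the causal claim that `head_a` carries the difference. In a real network effects are rarely this clean, so the documentation panel should say what a partial recovery does and does not show.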

Stack you orchestrate

TransformerLens
PyTorch
a small open-weight transformer
a sparse-autoencoder library
Node.js or Python
OpenTelemetry
Google Cloud Run

Market signal, who wants this: Interpretability is a defined and growing 2026 research field with a dedicated ICML 2026 workshop and standard open tooling (TransformerLens, sparse autoencoders, Anthropic’s Transformer Circuits sequence). Frontier labs invest in it because understanding a model internally is increasingly a release and governance requirement, and the open-source ARENA curriculum exists specifically to train people into these roles. Investors and labs back interpretability because a model you cannot inspect is a model you cannot safely scale.

How it is graded

A small open-weight transformer is loaded and instrumented, and per-layer activations and attention patterns are captured for an arbitrary prompt.
An activation-patching workflow lets a user make a causal claim about which component drives a behavior, with the method documented.
A sparse-autoencoder or probe-based feature view surfaces interpretable features from the residual stream, with honest limitations stated.
The tool is deployed to real public hosting with a fast, WCAG 2.2 AA accessible inspector UI, CI/CD on every commit, production observability, and a secured endpoint.
The analysis backend handles concurrent inspection sessions and the scaling design is documented.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The tool is publicly reachable and fully reproducible from the repo, and a researcher can use it without instruction.

Bridges to Compilers — program analysis, intermediate representations, and reverse engineering

AI Safety Evaluation & Governance Pipeline

Week 12 milestone

An enterprise mandate: a company is about to ship an AI feature and has no defensible answer to "is this safe and is it compliant?". Build and launch a product with two interlocking systems: a safety-evaluation pipeline that runs capability, propensity, honesty, and adversarial red-team batteries against a model and scores how it behaves; and a governance layer that turns those results into an audit-ready model card and a risk classification mapped to the NIST AI RMF and the EU AI Act. The deliverable is directly deployable and hyperscalable: real public hosting, CI/CD, a hyper-usable dashboard a safety or compliance lead reads at a glance, security, and full marketing (landing page, pitch, demo). Deliver something an organization can run before release, not after an incident. Ship it as a real product.

Why it matters: AI safety evaluation and governance is becoming a release gate, not an afterthought: the EU AI Act transparency rules apply from August 2026 and the NIST AI RMF is the de facto governance structure organizations adopt. A builder who can ship an evaluation-plus-governance pipeline is directly deployable as an AI Safety Engineer, an Evaluations Engineer, or an AI Governance specialist, scarce roles because they combine ML, security, and policy.

The deliverable

A publicly hosted product with a stable URL and a hyper-usable safety-and-governance dashboard, plus a public repo: the safety-evaluation pipeline with capability, propensity, honesty, and red-team batteries, the auto-generated model card and risk-classification layer, CI/CD on every commit, production observability, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the evaluation methodology, the governance mapping, and the scaling design.

What it ships

A safety-evaluation runner with capability, propensity, honesty, and sycophancy batteries against a target model.
An adversarial red-team suite covering jailbreaks, prompt injection, and indirect injection, with a continuously updated attack library.
Structured, reproducible scoring so the same evaluation can be re-run and compared across model versions.
An auto-generated, audit-ready model card capturing intended use, evaluation results, and limitations.
A risk-classification layer that maps results to the NIST AI RMF Govern-Map-Measure-Manage functions and the EU AI Act risk tiers.
A benchmark-overfitting check that runs novel held-out prompts alongside public benchmarks to expose inflated scores.
A safety dashboard showing per-category results, risk status, and trend across evaluation runs.
CI integration so a safety regression in a new model version fails the release gate.
Exportable compliance reports suitable for an internal audit or external review.
Role-based access so safety engineers, compliance leads, and reviewers see scoped views.
Production observability and a secured, rate-limited evaluation API.
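
The CI release gate in the list above reduces to comparing a candidate model's safety scores with a baseline and failing the build on a regression. A minimal sketch follows, assuming every score is "higher is safer" and that a small tolerance absorbs run-to-run noise. `release_gate`, the category names, and the numbers are hypothetical.

```python
def release_gate(baseline, candidate, tolerance=0.02):
    """Return a list of regressions; an empty list means the gate passes."""
    failures = []
    for category, base_score in sorted(baseline.items()):
        new_score = candidate.get(category)
        if new_score is None:
            failures.append(f"{category}: missing from candidate run")
        elif new_score < base_score - tolerance:
            failures.append(f"{category}: {base_score:.2f} -> {new_score:.2f}")
    return failures


# Hypothetical per-category scores from two evaluation runs.
baseline = {"honesty": 0.91, "jailbreak_resistance": 0.88, "sycophancy": 0.80}
candidate = {"honesty": 0.92, "jailbreak_resistance": 0.79, "sycophancy": 0.79}

failures = release_gate(baseline, candidate)
exit_code = 1 if failures else 0  # a CI step would exit with this code
```

Treating a missing category as a failure is deliberate: a battery that silently stopped running should block the release just as a lower score would.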

Stack you orchestrate

Inspect or an LLM eval framework
an open-weight or API model
a red-teaming library
Node.js or Python
GitHub Actions
OpenTelemetry
Google Cloud Run

Market signal, who wants this: AI governance and safety evaluation is a defined 2026 enterprise requirement: the EU AI Act transparency obligations take effect in August 2026, NIST released an AI RMF profile for critical infrastructure in April 2026, and enterprises are explicitly investing in audit-ready model cards, risk classification, and incident response. Research shows public benchmark scores can hide real-world failure, making independent evaluation valuable. Investors and enterprises fund safety-and-governance tooling because shipping AI into regulated industries without it is now a legal and reputational liability.

How it is graded

A safety-evaluation pipeline runs capability, propensity, honesty, and adversarial red-team batteries against a target model and produces structured, reproducible scores.
The red-team battery includes jailbreaks, prompt injection, and indirect injection, and the report shows which behaviors held and which failed.
An auto-generated model card documents intended use, evaluation results, and limitations in an audit-ready format.
A risk-classification layer maps the model and its results to the NIST AI RMF functions and the EU AI Act risk tiers, with the mapping justified.
The platform is deployed to real public hosting with CI/CD on every commit, production observability, and a secured endpoint, and the evaluation path is hyperscalable under load.
A fast, WCAG 2.2 AA accessible dashboard lets a safety or compliance lead read results and risk status at a glance.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo — and is publicly reachable and reproducible.

Bridges to Information Security — threat modeling, evaluation, audit, and compliance

Land the Elite AI Role

Turn frontier skill into a frontier offer.

Convert everything you have built into an elite AI role: AI/ML system-design interviews, a public portfolio that gets noticed, the real hiring pipeline at frontier labs, top product companies, well-funded startups, and global remote teams, technical writing and communication, open-source contribution and visibility, and compensation and negotiation. You leave with a shipped portfolio product and a rigorous public technical writeup. Bridges to professional practice and software-engineering communication.

Ship Your Portfolio as a Real Product

Week 7 milestone

A career mandate: your portfolio is itself a product, and a hiring manager judges it the way a user judges any product — in seconds. Build and launch a portfolio site that presents a small number of your deep, shipped AI projects with live demos, clean repos, honest writeups, and real metrics. This is not a resume page: it is a directly deployable, hyperscalable, hyper-usable product with real public hosting, CI/CD, observability, accessibility, security headers, and full marketing polish. Treat curation as ruthlessly as engineering — three projects shown well beat ten shown badly. The portfolio must load fast, be reachable by anyone, and make a stranger want to talk to you. Ship it as a real product.

Why it matters: For AI roles, a public portfolio of deep, shipped projects is one of the strongest hiring signals, because it lets a hiring manager verify real work instead of trusting a resume. A builder who can present curated, hosted, well-documented projects stands out in the elite hiring pipeline, where take-home quality and demonstrable shipped work weigh heavily.

The deliverable

A publicly hosted portfolio product with a stable URL and a polished, fast, accessible UI, plus a public repo: the portfolio site, deep writeups of a curated set of AI projects each with a live demo link and real metrics, CI/CD on every commit, observability, security headers, a README documenting the build and design decisions, and a short pitch of the portfolio itself as a product.

What it ships

A landing view that communicates who the builder is and the strongest project within seconds.
A curated set of deep project pages, each with a live demo link, a public repo, and an honest writeup.
Real metrics on each project — usage, performance, or evaluation numbers — presented without inflation.
A fast, accessible, responsive UI with semantic HTML, keyboard navigation, and security headers.
CI/CD that rebuilds and redeploys the portfolio on every commit.
An honest limitations section per project, so the portfolio reads as credible rather than marketed.
A contact and links section that makes it frictionless for a recruiter to reach out.
Lightweight observability or analytics so the builder sees how the portfolio performs.
A documented build so the portfolio itself doubles as a reproducible engineering artifact.

Stack you orchestrate

Astro or a static site framework
a CI/CD provider (GitHub Actions)
a static or edge host (Cloudflare Pages or Vercel)
analytics or observability tooling
HTML, CSS, and JavaScript

Market signal, who wants this: In the 2026 AI hiring market, demonstrable shipped work is a primary screening signal: field-guide research into AI-engineering hiring shows take-home projects and write-up quality are weighted heavily, and recruiters respond to verifiable public work over cold resumes. A portfolio of deep, hosted projects is what moves a candidate from the application pile into a real conversation.

How it is graded

The portfolio presents a curated, small set of deep AI projects, each with a live demo, a public repo, an honest writeup, and real metrics.
The site is deployed to real public hosting with CI/CD on every commit, observability, and security headers on every response.
The UI is fast, WCAG 2.2 AA accessible, and communicates the value of each project to a non-expert within seconds.
Each project writeup states what was built, the result, and the limitations honestly, without inflation.
The portfolio is hyperscalable and reachable globally, with the build and design decisions documented in the repo.
The portfolio ships with a clear narrative and presentation polish, defensible as a real product.
The product is publicly reachable and fully reproducible from the repo.
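
"Security headers on every response" is easy to verify in CI. The sketch below checks a response's headers against a required set. The brief does not name specific headers, so the four in `REQUIRED_HEADERS` are an assumption (they are widely recommended ones), and `missing_security_headers` is a hypothetical helper that expects headers already fetched as a dict.

```python
# Assumed baseline set; the brief does not specify which headers are required.
REQUIRED_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "Referrer-Policy",
)


def missing_security_headers(response_headers):
    """Return required headers absent from the response (names are case-insensitive)."""
    present = {name.lower() for name in response_headers}
    return [header for header in REQUIRED_HEADERS if header.lower() not in present]


# Hypothetical response headers from a deployed page.
sample = {
    "content-security-policy": "default-src 'self'",
    "X-Content-Type-Options": "nosniff",
}
missing = missing_security_headers(sample)
```

Running this against every route in the build output, and failing the pipeline when the list is non-empty, turns the rubric line into an automated check.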

Bridges to Professional Practice — portfolio building and engineering communication

Publish a Rigorous Public Technical Writeup

Week 12 milestone

A career mandate: take one frontier project you have built — a fine-tuned model, an interpretability tool, an inference system, a safety evaluation — and publish a rigorous, public technical writeup of it. This is the artifact that demonstrates research-grade communication: a clear claim, a reproducible method, honest results with ablations, and stated limitations, written so a frontier-lab engineer takes it seriously. Publish it as a launched product: a real public page with its own stable URL, fast and accessible, with the code and data to reproduce it linked, and presentation polish to the standard set by Distill. A writeup nobody can find or reproduce does not count. Ship it as a real product.

Why it matters: Research-grade technical communication is a differentiator in elite AI hiring: frontier and research-adjacent roles expect a candidate to read, reproduce, and clearly write up results. A builder who can publish a rigorous, reproducible writeup demonstrates exactly the communication and methodology those roles assume, and creates a durable, verifiable public signal of frontier-level work.

The deliverable

A publicly hosted technical writeup with a stable URL and a fast, accessible reading experience, plus a public repo: the writeup itself with a clear claim, method, results, ablations, and limitations, the code and data needed to reproduce the central result, CI/CD on every commit, and a README pointing to the reproduction steps.

What it ships

A clear claim stated upfront, with the method and results structured so a reader can follow the argument.
A reproducible experiment — code and data public — that a reader can run to confirm the central result.
At least one ablation that isolates the cause of the result, with the experiment design explained.
Honest, explicit limitations so the writeup reads as credible research rather than marketing.
Clear figures and, where it helps, interactive or visual explanation to the standard set by Distill.
Experiment tracking linked so the runs behind the numbers are inspectable.
A fast, accessible reading experience with semantic structure and keyboard navigation.
CI/CD that rebuilds and republishes the writeup on every commit.
A stable public URL suitable for sharing in an application or with a referral.
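
The ablation and reproducibility items above can be sketched in a few lines. This is a minimal illustration, not a prescribed harness: `toy_run` is a hypothetical stand-in for your real training or evaluation run, its 0.05 effect is an assumed number chosen only to make the example runnable, and `ablate` is an illustrative name. The idea is to run the same seeds with and without the component under study, so the only thing that differs between paired runs is that component.

```python
import random
import statistics
from typing import Callable


def toy_run(seed: int, use_component: bool) -> float:
    """Hypothetical stand-in for a real experiment; returns a score.

    The seeded generator makes every call with the same arguments
    return the same value, which is what makes the run reproducible.
    """
    rng = random.Random(seed)
    baseline = 0.70 + rng.uniform(-0.01, 0.01)
    # Assumed effect size, for illustration only; a real run measures this.
    return baseline + (0.05 if use_component else 0.0)


def ablate(run: Callable[[int, bool], float], seeds: list[int]) -> dict[str, float]:
    """Run each seed with and without the component and compare the pairs."""
    full = [run(seed, True) for seed in seeds]
    ablated = [run(seed, False) for seed in seeds]
    # Paired differences: seed noise cancels, leaving the component's effect.
    deltas = [f - a for f, a in zip(full, ablated)]
    return {
        "full_mean": statistics.mean(full),
        "ablated_mean": statistics.mean(ablated),
        "mean_delta": statistics.mean(deltas),
    }


if __name__ == "__main__":
    print(ablate(toy_run, seeds=[0, 1, 2]))
```

In a real writeup, `toy_run` would be replaced by the actual experiment, and the seeds and the resulting table would be published alongside the code so a reader can rerun the same comparison.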

Stack you orchestrate

a static site or publishing framework
a notebook or script for the reproducible experiment
a CI/CD provider (GitHub Actions)
a static or edge host
an experiment-tracking tool (Weights & Biases or equivalent)

Market signal, who wants this

Clear public technical writing is a recognized signal in AI hiring and research culture: venues like Distill established that rigorous, well-communicated research artifacts are real contributions, and frontier-lab interview guides emphasize the ability to read, reproduce, and communicate results. A reproducible public writeup gives a candidate a durable, verifiable demonstration of frontier-level methodology and communication.

How it is graded

The writeup makes a clear, specific claim and supports it with a reproducible method and honest results.
At least one ablation is reported that isolates what actually caused the result, with the experiment design described.
Limitations are stated explicitly, and no claim is overstated beyond the evidence.
The code and data needed to reproduce the central result are public and the reproduction steps are documented.
The writeup is published on real public hosting with a stable URL, CI/CD on every commit, and a fast, WCAG 2.2 AA accessible reading experience.
The writeup is written and presented to a standard a frontier-lab engineer would take seriously, with clear structure and figures.
The product is publicly reachable and the result is independently reproducible from the linked repo.
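
One way to make "independently reproducible" checkable is to compare the numbers stated in the writeup against a fresh rerun. The sketch below assumes both sets of metrics are available as plain name-to-value mappings; the function name `check_reproduction` and the 1% relative tolerance are illustrative choices, not part of the rubric.

```python
import math


def check_reproduction(
    reported: dict[str, float],
    rerun: dict[str, float],
    rel_tol: float = 0.01,
) -> list[str]:
    """Return a list of human-readable mismatches; empty means reproduced."""
    failures = []
    for name, expected in reported.items():
        if name not in rerun:
            failures.append(f"{name}: missing from rerun")
        elif not math.isclose(rerun[name], expected, rel_tol=rel_tol):
            failures.append(f"{name}: reported {expected}, rerun gave {rerun[name]}")
    return failures


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    reported = {"accuracy": 0.90}
    rerun = {"accuracy": 0.9005}
    problems = check_reproduction(reported, rerun)
    print("reproduced" if not problems else "\n".join(problems))
```

A script like this could run in the same CI pipeline that republishes the writeup, so a commit that changes the result without updating the reported numbers fails visibly.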

Bridges to Professional Practice — technical writing, research communication, and reproducibility

Why the subject bridge matters. Every project is mapped to a real classic CS subject, such as Operating Systems, Distributed Systems, Databases, or Compilers. You are not choosing between your degree and shipping real software. You can tell a faculty member, truthfully, "I built this for my Distributed Systems mini-project", and tell a hiring panel it is production-grade.