ParallelCS

10 production-grade briefs

Projects an employer can open and judge.

No toy exercises. Each brief is written as a corporate mandate, produces a publicly hosted product, ships with the exact rubric it is graded against, and bridges to a classic CS subject — so it doubles as your coursework. You orchestrate the AI; you own the result.

Agentic Systems Engineering

Orchestrate fleets of autonomous agents that ship real work.

Engineer reliable multi-agent systems: tool use, planning, memory, sandboxed code execution, and orchestration patterns that move from a single augmented LLM to a coordinated fleet. You build agents an enterprise can put in production, not demos. Bridges directly to classic Operating Systems, Distributed Systems, and Compilers.

Autonomous Multi-Agent Research-and-Ship System

Week 7 milestone

You are handed an enterprise mandate: the research division needs a launched product — a system that takes an open-ended technical question, autonomously researches it across many sources, synthesizes a defensible report, and ships the report as a published artifact, with zero human steps in the middle. Build an orchestrator-worker multi-agent system: a lead agent that decomposes the question and spawns specialized worker agents (search, read, synthesize, fact-check), coordinates their results through shared state, and produces a cited deliverable. This is not a notebook demo. The result must be a directly deployable, hyperscalable product: real public hosting, CI/CD on every commit, observability, security hardening, a polished and accessible web UI a non-technical analyst will happily use, and complete go-to-market material — a landing page, a pitch, and a recorded demo. The architecture must absorb concurrent research runs without falling over, and recover from a failed worker. We are not here to babysit the run; ship it as a real product.

Why it matters: Multi-agent research and synthesis systems are being deployed across consulting, finance, and R&D to compress weeks of analyst work into hours. Shipping a coordinated, fault-tolerant agent fleet makes a builder ready for an Agentic Systems Engineer or Applied AI Engineer role at the ₹1-crore tier, where the bar is production reliability, not a demo.

The deliverable

A publicly hosted product with its own domain or stable URL, plus a public repo: the orchestrator and worker agents, an MCP-based tool layer, a fast accessible web UI for submitting questions and reading results, CI/CD running lint/tests/build on every commit, persisted and inspectable run traces, a marketing landing page, a 10-slide pitch, a recorded demo video, and a README documenting the coordination design, the failure-recovery and scaling strategy, and three example end-to-end runs with their published reports.

What it ships
  • Submit-a-question interface accepting an open-ended technical or market question with a depth setting (quick scan vs deep dive).
  • A lead orchestrator agent that decomposes the question into a research plan and spawns specialized worker agents.
  • Specialized workers — web search, source reading, synthesis, and an independent fact-checker that verifies every claim.
  • An MCP tool layer exposing search, fetch, and document tools so the same tools are reusable across agents and projects.
  • Live run view: a real-time graph of agent activity, sub-questions in flight, and sources being consumed.
  • Inline-cited report output where every claim links to the exact retrieved passage that supports it.
  • Export to PDF, Markdown, and a shareable public report URL.
  • Persisted, replayable run traces with token spend and latency per agent for cost auditing.
  • Automatic worker-failure detection and re-dispatch so a crashed worker never aborts a run.
  • A workspace history of past research runs with search and one-click re-run.
  • Concurrency controls and per-run budget caps so many users can run research in parallel safely.
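
The worker-failure detection and re-dispatch item above can be sketched roughly as follows. This is a minimal illustration, not the required design: `RunState`, `dispatch_with_retry`, `run_plan`, and `flaky_worker` are hypothetical names, the shared state is an in-memory object rather than a durable store, and a real worker would call an LLM and tools instead of returning a string.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class RunState:
    """Shared state the lead agent and workers write to (hypothetical shape)."""
    results: dict = field(default_factory=dict)   # sub-question -> findings
    attempts: dict = field(default_factory=dict)  # sub-question -> dispatch count


async def dispatch_with_retry(task_id, worker, state, max_attempts=3, timeout_s=5.0):
    """Run one worker task; re-dispatch on crash or hang until attempts run out."""
    for attempt in range(1, max_attempts + 1):
        state.attempts[task_id] = attempt
        try:
            # wait_for cancels a hung worker, so a stall is treated as a failure.
            state.results[task_id] = await asyncio.wait_for(worker(task_id), timeout_s)
            return True
        except (asyncio.TimeoutError, RuntimeError):
            continue  # failure detected: the loop re-dispatches the same task
    return False


async def run_plan(sub_questions, worker, state):
    """Lead-agent fan-out: sub-questions run concurrently, each with its own retries."""
    outcomes = await asyncio.gather(
        *(dispatch_with_retry(q, worker, state) for q in sub_questions)
    )
    return all(outcomes)


_calls = {}  # per-task call counter used only by the stand-in worker below


async def flaky_worker(task_id):
    """Stand-in worker that crashes on its first call for sub-question 'q2'."""
    _calls[task_id] = _calls.get(task_id, 0) + 1
    if task_id == "q2" and _calls[task_id] == 1:
        raise RuntimeError("worker crashed")
    return f"findings for {task_id}"


def demo():
    _calls.clear()
    state = RunState()
    ok = asyncio.run(run_plan(["q1", "q2", "q3"], flaky_worker, state))
    return ok, state
```

Calling `demo()` completes the run even though one worker crashes once; `state.attempts` then records which sub-questions needed a second dispatch. A production version would persist `RunState` so a replayed trace shows every retry.
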
Stack you orchestrate
  • Claude API or open-weight LLM
  • Model Context Protocol
  • LangGraph
  • Node.js or Python
  • Docker
  • Google Cloud Run
  • a tracing backend (LangSmith or OpenTelemetry)

Market signal: who wants this

Agentic deep-research is one of the hottest 2026 categories: the AI agent market is projected to grow from $7.84B in 2025 to $52.62B by 2030 (41% CAGR), and a16z reports a portfolio pivot from copilots to autonomous systems, with Sierra, Glean, and Decagon as comparables and YC W26 funding multi-agent orchestration startups such as Tensol and Korso. Consulting, finance, and corporate R&D teams are actively buying systems that compress weeks of analyst work into hours; investors fund this because it sells time back to high-cost knowledge workers.

How it is graded
  • The orchestrator decomposes a question and coordinates at least three specialized worker agents through explicit shared state.
  • Tools are exposed through a standard protocol (MCP), not bespoke per-agent glue.
  • The system is deployed to real public hosting with CI/CD on every commit and production observability (logs, traces, metrics).
  • The architecture handles concurrent research runs under load, and a worker failure mid-run still yields a complete, correct deliverable.
  • The web UI is fast, WCAG 2.2 AA accessible, and usable by a non-technical analyst without instruction.
  • Every claim in the output report is traceable to a retrieved source, and run traces are persisted and inspectable.
  • The project ships complete marketing: a landing page, a 10-slide pitch, and a recorded demo, presentable as a real product.
  • The product is publicly reachable and fully reproducible from the repo by a stranger.
Bridges to Distributed Systems — coordination, message passing, and fault tolerance

Autonomous Coding Agent with Sandboxed Execution

Week 12 milestone

An enterprise mandate: ship a launched product — an autonomous coding agent that, given a real GitHub issue, plans a fix, writes and runs code inside a hardened sandbox, iterates against test feedback, and opens a pull request, with untrusted code never touching the host. This is an operating-systems problem as much as an AI problem: the agent's generated code is hostile by assumption. Confine it with containers or microVMs, enforce filesystem and network policy, cap CPU and memory, and survive an agent that tries to escape or hang. The deliverable is not a script — it is a directly deployable, hyperscalable product: real public hosting, CI/CD, observability, a clean dashboard where a developer queues issues and watches trajectories, security hardening, and full marketing (landing page, pitch, demo). The sandbox fleet must scale horizontally to many concurrent jobs. We are not here to babysit a run; ship it as a real product.

Why it matters: Autonomous coding agents are now a core part of engineering org tooling, and the hard part is safe execution at scale, not code generation. A builder who can ship a sandboxed, evaluated coding agent is directly deployable as an Agent Infrastructure Engineer or AI Platform Engineer, roles commanding ₹1-crore-plus compensation because they sit between security and AI.

The deliverable

A publicly hosted product with a stable URL, plus a public repo: the planning-and-execution loop, the sandbox isolation layer, a fast accessible dashboard for queuing issues and inspecting trajectories, CI/CD on every commit, production observability, an eval suite over a set of real issues with pass/fail trajectories, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the isolation threat model, the horizontal-scaling design, and the workflow-versus-agent decision.

What it ships
  • GitHub integration: connect a repo, and the agent picks up issues labeled for automation.
  • A planning stage that reads the issue, explores the codebase, and produces a fix plan before writing code.
  • Generated code runs only inside a hardened microVM or container sandbox with enforced CPU, memory, filesystem, and network limits.
  • An iterative loop: run the test suite, read failures, revise, and retry until tests pass or a budget is hit.
  • Automatic pull-request creation with a written summary of the change and the test evidence.
  • A live trajectory dashboard showing the agent's plan, edits, command output, and retries in real time.
  • An eval mode that runs the agent over a fixed set of real issues and reports pass rate and trajectory quality.
  • Prompt-injection defenses: untrusted issue and repo text is treated as hostile, with least-privilege tool scopes.
  • Horizontal sandbox-fleet scaling so many issues are worked concurrently without contention.
  • Per-job cost and time tracking, with configurable budget caps and a hard kill on runaway jobs.
  • An audit log of every command the agent ran inside the sandbox.
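
As a rough illustration of the enforced CPU, memory, filesystem, and network limits listed above, the sketch below builds a `docker run` command line for one untrusted job. It is only a starting point under stated assumptions: `SandboxLimits` and `docker_argv` are hypothetical names, the default values are placeholders, and the `runsc` runtime is available only where gVisor has been installed. Flags alone are not a complete threat model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SandboxLimits:
    cpus: float = 1.0      # CPU cores available to the job
    memory_mb: int = 512   # hard memory cap in megabytes
    pids: int = 128        # process-count cap, guards against fork bombs


def docker_argv(image: str, command: List[str], limits: SandboxLimits,
                runtime: Optional[str] = None) -> List[str]:
    """Build a `docker run` argv that confines one untrusted job."""
    argv = [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access at all
        "--read-only",                          # root filesystem is immutable
        "--tmpfs", "/tmp:rw,size=64m",          # small writable scratch space
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--security-opt", "no-new-privileges",  # block setuid-style escalation
        "--cpus", str(limits.cpus),
        "--memory", f"{limits.memory_mb}m",
        "--pids-limit", str(limits.pids),
    ]
    if runtime:  # e.g. "runsc" on a host where gVisor is configured
        argv += ["--runtime", runtime]
    return argv + [image, *command]
```

The function only constructs the argument list, so it can be unit-tested and written to the audit log before anything executes. A wall-clock kill for runaway jobs still has to be enforced separately by whatever supervises the container.
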
Stack you orchestrate
  • Claude API or open-weight LLM
  • gVisor or Firecracker microVMs
  • Docker
  • GitHub API
  • Node.js or Python
  • an eval framework
  • Google Cloud Run

Market signal: who wants this

Autonomous coding agents are the breakout 2026 developer-tools category: Cursor reached $2B ARR in February 2026 and is raising at a $50B+ valuation, and Replit raised a $400M Series D at a $9B valuation in March 2026. SWE-bench Verified is now the industry yardstick, with top agents exceeding 80%. A whole sub-market of sandbox-execution infrastructure (E2B, Northflank, Modal, Blaxel) is being funded specifically to run agent-written code safely; investors back this because safe execution at scale, not code generation, is the unsolved bottleneck.

How it is graded
  • Generated code executes only inside an isolated sandbox with enforced CPU, memory, filesystem, and network limits.
  • The agent iterates on test feedback and recovers from its own failed attempts.
  • The sandbox fleet scales horizontally to many concurrent jobs and the scaling design is documented.
  • The product is deployed to real public hosting with CI/CD on every commit and production observability.
  • A fast, WCAG 2.2 AA accessible dashboard lets a developer queue issues and inspect agent trajectories.
  • An eval suite reports pass rate and trajectory quality over a fixed set of real issues.
  • The isolation threat model is documented, including sandbox-escape containment, and prompt-injection via issue or repo content is mitigated with least-privilege tool access.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo — and is publicly hosted and reproducible from the repo.
Bridges to Operating Systems — virtualization, process isolation, and resource management

AI Infrastructure & Inference

Serve frontier models fast, cheap, and at scale.

Own the serving layer: the transformer internals that decide cost, quantization, KV-cache management, continuous batching, paged attention, speculative decoding, and GPU-aware deployment. You leave able to stand up an inference platform that holds an SLO under load. Bridges to Operating Systems, Computer Architecture, and Computer Networks.

Production LLM Inference Server with Continuous Batching

Week 6 milestone

An enterprise mandate: the platform team needs a launched inference product that holds a strict latency SLO while maximizing GPU throughput. Build (or extend a serving engine into) an inference service that implements a KV cache with paged memory management, iteration-level continuous batching, and a request scheduler that balances time-to-first-token against tokens-per-second. The deliverable is not a benchmark notebook — it is a directly deployable, hyperscalable product: a real public API and a clean playground UI, CI/CD, autoscaling across replicas, production observability for tokens-per-second and latency percentiles, security on the endpoint, and full marketing (landing page, pitch, demo) so it is presentable as a real product. Measure it honestly under load. The GPU is the most expensive thing in the building; idle cycles are a defect. We are not here to babysit it; ship it as a real product.

Why it matters: Inference serving is where AI cost is won or lost; a 2x throughput gain is a direct margin gain for any company running models. Shipping a measured, SLO-holding inference server makes a builder a credible AI Infrastructure or Inference Engineer, one of the highest-paid frontier roles because it converts directly into saved spend.

The deliverable

A publicly hosted inference service with a stable URL and a clean playground UI, plus a public repo: the batching scheduler and KV-cache manager, an autoscaling deployment, CI/CD on every commit, production observability dashboards, a load-test harness, a benchmark report comparing static versus continuous batching across concurrency levels, a marketing landing page, a 10-slide pitch, a recorded demo, and a README explaining the memory, scheduling, and scaling design.

What it ships
  • An OpenAI-compatible HTTP API (chat/completions, streaming) so the service is a drop-in for existing clients.
  • A paged KV-cache manager that eliminates memory fragmentation and supports prefix sharing across requests.
  • Iteration-level continuous batching so new requests join the running batch without waiting for it to drain.
  • A request scheduler with configurable priority and a tunable time-to-first-token versus throughput policy.
  • Token-level response streaming over server-sent events or WebSocket.
  • A clean playground UI to send prompts, watch streaming output, and see live latency and throughput.
  • A live metrics dashboard: tokens-per-second, time-to-first-token, queue depth, GPU memory, and KV-cache utilization.
  • Autoscaling across replicas driven by queue depth, with health and readiness probes.
  • A built-in load-test harness that sweeps concurrency and emits a static-vs-continuous-batching benchmark report.
  • API-key authentication and per-key rate limiting on the endpoint.
  • Graceful degradation and request shedding when the GPU is saturated, instead of timeouts.
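
To see why iteration-level continuous batching beats static batching, a toy step-count simulation helps. This is a deliberately simplified model, not a serving engine: it assumes every request is already queued at time zero, that each decode iteration costs the same regardless of batch occupancy, and that prefill is free. The function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; freed slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Admit waiting requests at every iteration as soon as a slot frees up."""
    waiting = deque(lengths)
    running: List[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())    # new request joins the batch
        running = [r - 1 for r in running]       # one decode step for all slots
        running = [r for r in running if r > 0]  # finished requests free slots
        steps += 1
    return steps


# Output lengths in tokens: two long requests mixed with short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=4)          # 16
continuous_steps = continuous_batching_steps(lengths, batch_size=4)  # 10
```

With this workload the static scheduler needs 16 iterations because short requests wait for the long one in their batch, while the continuous scheduler finishes in 10. A real benchmark report should replace the step count with measured tokens-per-second and latency percentiles.
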
Stack you orchestrate
  • vLLM or a from-scratch serving loop
  • PyTorch
  • CUDA
  • Python
  • a load-testing tool (Locust or k6)
  • Prometheus
  • Docker

Market signal: who wants this

Inference is now a FinOps problem: at production scale it accounts for over 80% of AI GPU spend, and software optimization alone has driven cost-per-million-tokens down 5x on new hardware within months. A funded infrastructure category has formed around exactly this product (vLLM, Runpod with FlashBoot sub-250ms cold starts, BentoML, and Yotta Labs) because self-hosting beats managed APIs on unit economics above ~100M tokens/month. Investors fund inference platforms because every company running open-weight models needs to cut serving cost without losing quality.

How it is graded
  • The server implements paged KV-cache management and iteration-level continuous batching.
  • A request scheduler is present and its time-to-first-token versus throughput tradeoff is documented.
  • The service is deployed publicly with a clean playground UI, CI/CD on every commit, and autoscaling across replicas.
  • Production observability tracks tokens-per-second and latency percentiles, and the endpoint is secured.
  • A load-test report shows throughput and latency percentiles across concurrency levels, with continuous batching measurably compared against a static-batching baseline.
  • GPU memory usage and KV-cache fragmentation are reported with the design that controls them.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
  • The service is publicly reachable and reproducible, with a clear benchmark methodology.
Bridges to Operating Systems — scheduling, virtual memory, and throughput optimization

Quantized, Speculatively-Decoded Model Deployment

Week 12 milestone

An enterprise mandate: a capable open-weight model must run within a fixed GPU budget at half the current latency, with no unacceptable accuracy loss, and it must ship as a launched product. Quantize the model (INT4/INT8 or FP8), pair it with a draft model for speculative decoding, deploy it behind an autoscaling service, and prove the result. The deliverable is directly deployable and hyperscalable: a real public API and a hyper-usable demo UI, CI/CD, autoscaling, production observability, endpoint security, a finance-grade cost-per-token report, and full marketing (landing page, pitch, demo). Deliver a deployment a finance team would sign off on: the cost-per-million-tokens must drop and you must show it did. We are not here to babysit it; ship it as a real product.

Why it matters: Every company running open-weight models in production needs someone who can cut inference cost without breaking quality. A builder who can quantize, speculate, and deploy with a defensible cost report is directly deployable as a Senior Inference Engineer, a role compensated at the ₹1-crore tier because the savings are measured in real money.

The deliverable

A publicly hosted deployment with a stable URL and a hyper-usable demo UI, plus a public repo: the quantization and speculative-decoding pipeline, an autoscaling serving setup, CI/CD on every commit, production observability, an accuracy report on a representative benchmark before and after compression, a marketing landing page, a 10-slide pitch, a recorded demo, and a README with the cost-per-token analysis and scaling design.

What it ships
  • A quantization pipeline supporting INT4/INT8 (GPTQ or AWQ) and FP8, with a one-command recompress workflow.
  • An automatic accuracy-regression check that benchmarks the model before and after compression on a representative task.
  • Speculative decoding with a paired draft model, exposing the acceptance rate as a tunable, observable metric.
  • An OpenAI-compatible serving API behind the optimized model so it is a drop-in replacement.
  • A hyper-usable demo UI showing side-by-side latency of the baseline versus the optimized deployment.
  • A live cost dashboard computing cost-per-million-tokens from real throughput and GPU pricing.
  • Autoscaling with fast cold starts and scale-to-zero so idle GPU spend is eliminated.
  • A finance-grade report export: before/after cost, latency, and accuracy in one shareable document.
  • Configurable quality gates that block a deployment if accuracy loss exceeds a set threshold.
  • API-key auth, rate limiting, and request quotas on the public endpoint.
  • Production observability for time-to-first-token, tokens-per-second, and GPU utilization.
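
The cost dashboard and the speculative-decoding acceptance metric above both reduce to short formulas. The sketch below is illustrative only: the function names are hypothetical, and the expected-tokens formula assumes each drafted token is accepted independently with the same probability, which real workloads only approximate.

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass with `draft_len` drafted tokens.

    Assumes an independent, constant per-token acceptance probability.
    """
    a = acceptance_rate
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if a == 1.0:
        return float(draft_len + 1)  # every draft accepted plus one bonus token
    return (1 - a ** (draft_len + 1)) / (1 - a)


def passes_quality_gate(baseline_acc: float, optimized_acc: float, max_drop: float) -> bool:
    """Block a deployment when accuracy loss exceeds the configured threshold."""
    return (baseline_acc - optimized_acc) <= max_drop
```

For example, at a placeholder price of 2.00 USD per GPU-hour, `cost_per_million_tokens(2.0, 1000.0)` and `cost_per_million_tokens(2.0, 2500.0)` give the before and after figures for a deployment whose throughput rose from 1,000 to 2,500 tokens per second. These inputs are made-up illustrations; a finance-grade report must use measured throughput and the actual GPU price.
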
Stack you orchestrate
  • vLLM or TGI
  • GPTQ/AWQ or bitsandbytes
  • PyTorch
  • an eval harness
  • Kubernetes or Cloud Run
  • Prometheus
  • Docker

Market signal: who wants this

GPU FinOps is a defined 2026 budget line: inference is over 80% of AI GPU spend, and quantization plus speculative decoding are the highest-leverage cost cuts (FP8 alone gives 1.3-2x throughput at under 2% quality loss). Hardware-plus-software optimization has delivered 5x cost-per-token reductions, and a market of inference-cost tooling (Spheron, regolo.ai, BentoML, Yotta Labs) has formed around it. Investors fund cost-optimization products because the savings convert directly into gross margin for anyone serving open-weight models at volume.

How it is graded
  • The model is quantized with a named method and the accuracy delta on a representative benchmark is reported.
  • Speculative decoding is implemented and its acceptance rate and latency gain are measured.
  • The deployment is publicly hosted with a hyper-usable demo UI, CI/CD on every commit, autoscaling, and production observability tracking time-to-first-token and tokens-per-second.
  • The endpoint is secured and the architecture holds under concurrent load.
  • A finance-grade cost-per-million-tokens analysis shows the before-and-after improvement, and accuracy loss is reported honestly with the tradeoff justified.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
  • The deployment is publicly reachable and fully reproducible from the repo.
Bridges to Computer Architecture — number representation, speculative execution, and the memory hierarchy

Applied ML & Model Engineering

Take a base model and make it yours.

Go from neural-net first principles to shipping adapted models: transformer pretraining intuition, supervised fine-tuning, parameter-efficient methods (LoRA/QLoRA), preference optimization (RLHF/DPO), distillation, and rigorous evaluation. You leave able to own a model-customization pipeline end to end. Bridges to Machine Learning, Linear Algebra, and Statistics.

End-to-End Fine-Tuning Pipeline for a Domain Model

Week 8 milestone

An enterprise mandate: take a base open-weight model and adapt it into a specialist for a real domain you choose (legal, medical, code, support), then ship it as a launched product. Own the whole pipeline: build and clean the training dataset, run parameter-efficient fine-tuning (LoRA or QLoRA), and prove on a held-out, contamination-controlled evaluation that the adapted model beats the base model on the target task. The deliverable is not a notebook — it is a directly deployable, hyperscalable product: the fine-tuned model served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD, observability, security, and full marketing (landing page, pitch, demo) so a domain user can try it and a buyer can evaluate it. We do not accept 'it seems better' — bring the numbers, and ship it as a real product.

Why it matters: Domain-adapted models are how companies turn a generic LLM into a defensible product, and most fine-tuning projects fail on data discipline. A builder who can run a clean, evaluated, reproducible fine-tuning pipeline is directly deployable as an ML Engineer or Model Engineer, a frontier role at the ₹1-crore tier.

The deliverable

A publicly hosted product with a stable URL and a hyper-usable demo UI, plus a public repo and a published model card: the data pipeline, the LoRA or QLoRA training configuration and run logs, an autoscaling serving deployment, CI/CD on every commit, production observability, an evaluation comparing base versus fine-tuned on a held-out set, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting data sourcing, the contamination check, the scaling design, and the cost of the run.

What it ships
  • A dataset builder that ingests raw domain documents and turns them into cleaned, deduplicated instruction data.
  • An automatic train/eval contamination check that flags and removes overlap before training.
  • A LoRA/QLoRA training workflow with a reproducible config file and full run logging.
  • An experiment view comparing runs across hyperparameters, with loss curves and eval scores.
  • A held-out evaluation harness reporting target-task accuracy for the base model versus the fine-tuned model.
  • A catastrophic-forgetting check that scores the fine-tuned model on general tasks, not just the target task.
  • Adapter management: deploy one base model and hot-swap LoRA adapters per request.
  • An OpenAI-compatible serving API for the fine-tuned model, with autoscaling.
  • A hyper-usable demo UI where a domain user can try the specialist model on real prompts.
  • An auto-generated model card documenting data sourcing, intended use, limitations, and run cost.
  • Production observability and a secured, rate-limited endpoint.
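
One simple way to approach the train/eval contamination check above is word n-gram overlap. The sketch below is a baseline under stated assumptions, not a complete method: `ngrams` and `split_contaminated` are hypothetical names, the default window of 8 words is an arbitrary choice, and exact n-gram matching will miss paraphrased leakage.

```python
import re
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams; texts shorter than n words yield an empty set."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def split_contaminated(train: List[str], eval_set: List[str], n: int = 8):
    """Return (clean, flagged): flagged training examples share an n-gram with eval."""
    eval_grams: Set[Tuple[str, ...]] = set()
    for example in eval_set:
        eval_grams |= ngrams(example, n)
    clean, flagged = [], []
    for example in train:
        if ngrams(example, n) & eval_grams:
            flagged.append(example)  # overlap found: remove before training
        else:
            clean.append(example)
    return clean, flagged
```

Run it before every training job and record the flagged count in the README, so the reported improvement over the base model can be defended.
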
Stack you orchestrate
  • Hugging Face Transformers
  • TRL
  • PEFT
  • bitsandbytes
  • PyTorch
  • Hugging Face Datasets
  • a GPU runtime (Colab, Kaggle, or a cloud instance)

Market signal: who wants this

Domain fine-tuning is a funded 2026 platform category: Together AI, Predibase (acquired by Rubrik in June 2025 for enterprise security depth), and Prem Studio compete on managed LoRA/QLoRA, and adapter-routing (one base model, many adapters per request) is now standard. The economics are compelling: a 7B model can be specialized on a single consumer GPU in an afternoon. Enterprises buy custom models that speak their technical language; investors fund fine-tuning platforms because every vertical AI product needs a model adapted to its own data.

How it is graded
  • A training dataset is built, cleaned, and deduplicated, with sourcing documented.
  • Parameter-efficient fine-tuning (LoRA or QLoRA) is run with a reproducible configuration.
  • A held-out evaluation shows a measured improvement of the fine-tuned model over the base, with train/eval contamination explicitly checked and catastrophic forgetting measured.
  • The fine-tuned model is served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD on every commit, observability, and a secured endpoint.
  • The serving architecture holds under concurrent load and the scaling design is documented.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
  • The repo is reproducible, a model card documents intended use and limitations, and the product is publicly reachable.
Bridges to Machine Learning — transfer learning, supervised training, and evaluation

Distill a Frontier Model into a Deployable Specialist

Week 12 milestone

An enterprise mandate: a large model solves a task well but is too expensive to serve at volume. Distill its capability on that task into a small student model that can run cheaply, then ship the student as a launched product. Use the teacher to generate or label training data, train and align the student, and prove the student keeps most of the capability at a fraction of the cost. The deliverable is not a benchmark table; it is a directly deployable, hyperscalable product: the student served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD, observability, security, and full marketing (landing page, pitch, demo). However long it takes, the deliverable is a model the business can actually afford to run and a product a buyer can try. Ship it as a real product.

Why it matters: Distillation is the standard route from an expensive frontier model to an economically viable product feature. A builder who can distill, align, and deploy a specialist student is directly deployable as a senior Model Engineer or Applied Scientist, a ₹1-crore-tier role because distillation work converts directly into serving-cost reduction at scale.

The deliverable

A publicly hosted product with a stable URL and a hyper-usable demo UI, plus a public repo and a published student model: the distillation data pipeline, the student training and preference-optimization configuration, an autoscaling serving deployment, CI/CD on every commit, production observability, a benchmark comparing teacher, student, and base on the target task, a cost-and-latency comparison, a marketing landing page, a 10-slide pitch, a recorded demo, and a README on the distillation method and scaling design.

What it ships
  • A teacher-labeling pipeline that uses a frontier model to generate or label distillation data for a chosen task.
  • Synthetic-data generation with quality filtering so the student trains on clean, diverse examples.
  • A student-training workflow producing a small (0.6B-8B) model, with reproducible configs and run logs.
  • Optional preference optimization (DPO) to align the student where the task needs it.
  • A three-way benchmark — teacher, student, and base — reporting capability retained on the target task.
  • A cost-and-latency comparison computing the serving-cost reduction versus the teacher.
  • An accuracy-floor gate that blocks shipping a student that drops below a configured retention threshold.
  • An OpenAI-compatible serving API for the student model with autoscaling and scale-to-zero.
  • A hyper-usable demo UI letting a buyer try teacher and student side by side on real prompts.
  • An auto-generated model card with the distillation method, retention numbers, and intended use.
  • Production observability and a secured, rate-limited endpoint.
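
The accuracy-floor gate and the cost comparison above can be expressed in a few lines. This is a minimal sketch with hypothetical names (`GateResult`, `accuracy_floor_gate`) and an assumed default retention threshold of 0.9; scores are whatever task metric the benchmark reports, and costs are in any consistent unit such as USD per million tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    retention: float       # student score as a fraction of the teacher score
    cost_reduction: float  # how many times cheaper the student is to serve
    ship: bool             # True only when retention meets the floor


def accuracy_floor_gate(teacher_score: float, student_score: float,
                        teacher_cost: float, student_cost: float,
                        min_retention: float = 0.9) -> GateResult:
    """Compare student to teacher and decide whether the student may ship."""
    if teacher_score <= 0 or student_cost <= 0:
        raise ValueError("teacher_score and student_cost must be positive")
    retention = student_score / teacher_score
    cost_reduction = teacher_cost / student_cost
    return GateResult(retention, cost_reduction, retention >= min_retention)
```

Wiring this check into CI means a student that falls below the configured floor never reaches the public endpoint, and the same `GateResult` numbers can feed the model card.
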
Stack you orchestrate
  • Hugging Face Transformers
  • TRL
  • PEFT
  • PyTorch
  • vLLM for serving
  • an eval harness
  • Docker

Market signal: who wants this

Distillation drives the most-cited 2026 enterprise-AI economics: task-specific small models (0.6B-8B) match or beat frontier models at 10-100x lower inference cost, retaining 85-95% of capability. A $35K-$120K distillation project pays back in three weeks to three months against frontier inference bills, and startups like distil labs are funded purely to 'replace LLMs with custom small language models.' Investors back distillation because it is the clearest path from an expensive frontier model to a margin-positive product feature.

How it is graded
  • A teacher model is used to generate or label distillation data with a documented method.
  • A smaller student is trained and the capability retained on the target task is measured against the teacher.
  • Preference optimization (DPO or RLHF) or another alignment step is applied to the student where appropriate, and the choice is justified.
  • The student is served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD on every commit, observability, and a secured endpoint that holds under concurrent load.
  • A cost and latency comparison shows the student is materially cheaper to serve, with the accuracy-versus-cost tradeoff reported honestly.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
  • The repo is reproducible, the student model is published with a model card, and the product is publicly reachable.
Bridges to Machine Learning — model compression and the teacher-student paradigm

Production AI Products

Ship AI products that survive real users and real attackers.

Build the full product around a model: retrieval and context engineering at scale, LLM evaluation and observability, AI red-teaming and security, cost governance, and LLMOps. You leave able to take an AI feature from prototype to a hardened, monitored, publicly hosted product. Bridges to Databases, Software Engineering, and Information Security.

Production RAG Platform with a Real Eval Harness

Week 6 milestone

An enterprise mandate: deliver a launched retrieval-augmented product over a large, real corpus that a non-technical team can trust for answers. Build the full system: ingestion and chunking, hybrid plus contextual retrieval with reranking, a grounded-and-cited generation layer, and — non-negotiable — an automated evaluation harness that scores retrieval and answer quality and runs in CI so quality regressions are caught before users see them. The deliverable must be production-grade and directly deployable: real public hosting, CI/CD, observability, security hardening against the OWASP LLM Top 10, and a hyper-usable, fast, accessible chat UI. It must be hyperscalable — the retrieval and serving layers hold as the corpus and traffic grow. And it ships complete with marketing: a landing page, a pitch, and a demo. A RAG demo is easy; a launched RAG product you can defend is the job. Ship it as a real product.

Why it matters: RAG is the default architecture for enterprise AI products, and the differentiator between teams is rigorous evaluation, not retrieval cleverness. A builder who ships a RAG platform with a CI-integrated eval harness is directly deployable as an AI Product Engineer or Applied AI Engineer, a ₹1-crore-tier role at companies betting their roadmap on grounded AI.

The deliverable

A publicly hosted RAG product with a stable URL and a fast, accessible chat UI, plus a public repo: the ingestion and retrieval pipeline, the citation-grounded answer layer, an automated eval harness with a golden dataset wired into CI, CI/CD on every commit, observability of cost and latency, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the retrieval, evaluation, security, and scaling design.

What it ships
  • Document ingestion for PDFs, web pages, and office files, with incremental re-indexing as the corpus changes.
  • Configurable chunking plus contextual retrieval that prepends document context to each chunk before embedding.
  • Hybrid retrieval combining dense vector search and keyword search, followed by a reranking model.
  • A grounded answer layer where every response cites the exact passages it relied on, with click-through to source.
  • A fast, accessible chat UI with streaming answers, source citations, and conversation history.
  • An automated eval harness scoring faithfulness, context precision, context recall, and answer relevancy against a golden dataset.
  • CI integration where an eval-score regression fails the build before a change ships.
  • A cost-and-latency dashboard with per-request token usage and retrieval timing.
  • Out-of-scope detection so the assistant declines questions the corpus cannot answer instead of hallucinating.
  • Input/output guardrails mapped to the OWASP LLM Top 10, including prompt-injection filtering on ingested content.
  • Multi-tenant workspaces with access controls so different teams query different corpora.
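
One way to combine the dense and keyword result lists from the hybrid retrieval item above is reciprocal rank fusion (RRF). The brief does not prescribe a fusion method, so this is a minimal sketch under that assumption; the function name `reciprocal_rank_fusion`, the chunk IDs, and the constant `k=60` are illustrative choices, not part of the brief.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from a vector search and a keyword search.
dense_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists rise to the top; a reranking model would then reorder the head of the fused list before it reaches the answer layer.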
Stack you orchestrate
  • Claude API or open-weight LLM
  • pgvector or a vector database
  • an embedding and reranking model
  • Node.js or Python
  • GitHub Actions
  • OpenTelemetry
  • Google Cloud Run

Market signal — who wants this

Production RAG is a mature, funded 2026 market, and its defining requirement is rigorous evaluation: systematic eval frameworks (context precision, context recall, faithfulness, answer relevancy) are now mandatory for enterprise deployments, served by RAGAS, Galileo, Braintrust, and Maxim AI. Enterprise knowledge systems are forecast to keep evolving rapidly through 2026-2030. Investors fund RAG platforms with built-in eval because retrieval cleverness is commoditized; trustworthy, regression-proof answer quality is what enterprises actually pay for.

How it is graded
  • Ingestion, chunking, and a hybrid or contextual retrieval pipeline are implemented and justified.
  • Answers are grounded in and cite the retrieved passages they rely on.
  • An automated eval harness scores retrieval and answer quality against a golden dataset and runs in CI, where a quality regression fails the build.
  • The platform is deployed to real public hosting with CI/CD on every commit, observability of cost and latency, and security hardening mapped to the OWASP LLM Top 10.
  • The retrieval and serving layers are hyperscalable and hold as corpus and traffic grow; the scaling design is documented.
  • The chat UI is fast, WCAG 2.2 AA accessible, and usable by a non-technical stranger without instruction.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
  • The product is publicly reachable and fully reproducible from the repo.
Bridges to Databases — indexing, information retrieval, and query optimization
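
The CI gate in the rubric above (an eval-score regression fails the build) can be sketched as a comparison between stored baseline scores and the scores from the current commit. This is an illustrative sketch only: the metric names follow the brief, while the `tolerance` value, the score dictionaries, and the function names are assumptions.

```python
METRICS = ("faithfulness", "context_precision", "context_recall", "answer_relevancy")


def find_regressions(baseline, current, tolerance=0.02):
    """Return {metric: (baseline, current)} for metrics that dropped by more than tolerance."""
    regressions = {}
    for metric in METRICS:
        if current[metric] < baseline[metric] - tolerance:
            regressions[metric] = (baseline[metric], current[metric])
    return regressions


def gate(baseline, current, tolerance=0.02):
    """Return a process exit code: 0 lets the build pass, 1 fails it."""
    regressions = find_regressions(baseline, current, tolerance)
    for metric, (old, new) in regressions.items():
        print(f"REGRESSION {metric}: {old:.2f} -> {new:.2f}")
    return 1 if regressions else 0


# Hypothetical scores: baseline from the main branch, current from this commit.
baseline_scores = {"faithfulness": 0.90, "context_precision": 0.80,
                   "context_recall": 0.75, "answer_relevancy": 0.85}
current_scores = {"faithfulness": 0.91, "context_precision": 0.70,
                  "context_recall": 0.74, "answer_relevancy": 0.85}
exit_code = gate(baseline_scores, current_scores)
```

In a CI job the script would end with `sys.exit(exit_code)`; a non-zero exit code fails the step, which is what blocks the change from shipping.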

AI Observability & Red-Team Pipeline

Week 12 milestone

An enterprise mandate: the company's AI features are live and the security and reliability teams are flying blind. Build and launch a product with two interlocking systems: an observability pipeline that traces every LLM call with token, cost, and latency telemetry and surfaces silent quality drift; and an automated red-team harness that continuously attacks the AI product with prompt injection, jailbreaks, and data-exfiltration probes, and reports which guardrails held. The deliverable is directly deployable and hyperscalable: real public hosting, CI/CD, a hyper-usable dashboard a security lead reads at a glance, a platform that is itself secured, and an ingestion path that absorbs high call volume. It ships complete with marketing: a landing page, a pitch, and a demo. Deliver something an enterprise can buy and run against a real product before an incident, not after. Ship it as a real product.

Why it matters: AI security and observability is a board-level concern as AI features ship into regulated industries, and almost no one combines both. A builder who delivers a tracing-plus-red-team pipeline is directly deployable as an AI Security Engineer or LLMOps Lead, a scarce ₹1-crore-tier role because it sits at the intersection of security, reliability, and AI.

The deliverable

A publicly hosted product with a stable URL and a hyper-usable security dashboard, plus a public repo: the tracing and observability pipeline, the automated red-team attack suite with a results report, the guardrails it validates, CI/CD on every commit, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the threat model, the drift-detection method, and the scaling design.

What it ships
  • An SDK/proxy that traces every LLM call with token counts, cost, latency, model, and a correlation ID.
  • A real-time dashboard of spend, latency percentiles, error rate, and call volume, sliceable by feature and model.
  • Silent-quality-drift detection that scores live traffic and alerts when output quality degrades.
  • An automated red-team suite running prompt-injection, jailbreak, indirect-injection, and data-exfiltration attack batteries.
  • A continuously updated attack library so new jailbreak techniques are tested as they emerge.
  • Input and output guardrails (PII redaction, injection filtering, policy checks) with a report of which held under attack.
  • A red-team scorecard mapping every finding to the OWASP LLM Top 10, exportable for audit.
  • Alerting integrations (email, webhook, Slack) for cost spikes, drift, and failed guardrails.
  • A high-throughput ingestion path that absorbs production call volume without sampling loss.
  • Scheduled red-team runs in CI so a regression in defenses fails the build.
  • Multi-project workspaces with role-based access so security leads and engineers see scoped views.
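
The tracing item above (token counts, cost, latency, model, and a correlation ID per call) can be illustrated with a thin wrapper around the model call. This is a sketch under stated assumptions: the price table is hypothetical, `fake_llm` stands in for a real provider client, and the wrapper assumes the client returns the token counts alongside the text.

```python
import time
import uuid
from dataclasses import dataclass

# Hypothetical USD prices per million tokens; real prices vary by provider and model.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


@dataclass
class TraceRecord:
    correlation_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def traced_call(model, call_fn, prompt, correlation_id=None):
    """Run call_fn(prompt) and return (text, TraceRecord).

    call_fn is assumed to return (text, input_tokens, output_tokens).
    """
    correlation_id = correlation_id or str(uuid.uuid4())
    start = time.perf_counter()
    text, input_tokens, output_tokens = call_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    price = PRICE_PER_MTOK[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    record = TraceRecord(correlation_id, model, input_tokens, output_tokens, latency_ms, cost)
    return text, record


def fake_llm(prompt):
    # Deterministic stub so the sketch runs without network access.
    return "ok", 1000, 200


text, record = traced_call("example-model", fake_llm, "hello", correlation_id="req-1")
print(record)
```

A real pipeline would export each record to the tracing backend instead of printing it, and would pass the same correlation ID through every downstream call made for the request.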
Stack you orchestrate
  • Claude API or open-weight LLM
  • OpenTelemetry
  • a tracing backend
  • a guardrails library
  • Node.js or Python
  • GitHub Actions
  • Google Cloud Run

Market signal — who wants this

AI security is a proven, acquisition-grade 2026 market: Lakera, which built exactly this guardrails-plus-red-teaming product (Lakera Guard at 98%+ detection, sub-50ms; Lakera Red for automated attack simulation), agreed to be acquired by Check Point in a deal announced in September 2025. Evaluation leaders like Galileo now ship guardrails that intercept outputs before tool execution. Investors fund AI observability and red-teaming because shipping AI into regulated industries makes pre-incident security a board-level requirement, and almost no product combines tracing and red-teaming in one.

How it is graded
  • Every LLM call is traced with token, cost, and latency telemetry and correlation IDs.
  • Silent quality drift is detected and surfaced, not just raw metrics displayed.
  • An automated red-team suite runs prompt-injection, jailbreak, and exfiltration attacks, and the report shows which input/output guardrails held and which failed.
  • The platform is deployed to real public hosting with CI/CD on every commit and is itself secured.
  • The ingestion path is hyperscalable and absorbs high call volume; the scaling design is documented.
  • The dashboard is fast, WCAG 2.2 AA accessible, and readable at a glance by a security lead.
  • The threat model is documented and mapped to the OWASP LLM Top 10.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo — and is publicly reachable and reproducible.
Bridges to Information Security — threat modeling, penetration testing, and monitoring
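
The rubric above asks for silent quality drift to be detected, not just for raw metrics to be displayed. One simple approach, offered here as an assumption and not as the brief's required method, is to compare a rolling mean of per-response quality scores against a baseline. The scores are assumed to come from some evaluator that returns a value between 0 and 1; the class name and thresholds are illustrative.

```python
from collections import deque
from statistics import mean


class DriftDetector:
    """Flag drift when the rolling mean falls more than max_drop below the baseline."""

    def __init__(self, baseline_mean, window=50, max_drop=0.05):
        self.baseline_mean = baseline_mean
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)

    def observe(self, score):
        """Record one quality score; return True if drift is detected."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a stable window yet
        return mean(self.scores) < self.baseline_mean - self.max_drop


# Hypothetical traffic: five healthy scores followed by a quality drop.
detector = DriftDetector(baseline_mean=0.90, window=5, max_drop=0.05)
alerts = [detector.observe(s) for s in [0.9] * 5 + [0.7] * 3]
print(alerts)
```

The window smooths out single bad responses, so the alert fires only once the degradation is sustained; the window size trades detection delay against false alarms.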

Frontier Systems

Build the distributed, real-time substrate AI runs on.

Engineer the systems frontier AI depends on: large-scale distributed training, real-time and streaming AI, GPU cluster scheduling, vector databases at scale, and the consistency and fault-tolerance tradeoffs underneath it all. You leave able to reason about and operate planet-scale AI infrastructure. Bridges to Distributed Systems, Databases, and Operating Systems.

Distributed Training System for a Multi-GPU Model

Week 8 milestone

An enterprise mandate: train a model that does not fit on one GPU, and turn the result into a launched product. Build a distributed training setup that uses data parallelism and at least one model-parallel strategy (tensor, pipeline, or fully-sharded), with correct collective communication, checkpointing that survives a node failure, and a throughput report. The run will be long; it must resume cleanly from a checkpoint after a simulated crash. The deliverable is not just training logs: the resulting model must be served as a directly deployable, hyperscalable product with a real public API, a hyper-usable demo UI, autoscaling, CI/CD, observability, security, and a live training-metrics dashboard. It ships complete with marketing: a landing page, a pitch, and a demo. We are not here to babysit a job that loses days of compute on one failure; ship the model as a real product.

Why it matters: Distributed training is the backbone skill behind every frontier model, and few engineers can debug a stalled all-reduce or a corrupt checkpoint. A builder who ships a fault-tolerant multi-GPU training system is directly deployable as a Distributed Systems or Training Infrastructure Engineer, one of the scarcest and highest-paid frontier roles, well into ₹1-crore-tier compensation.

The deliverable

A public repo, a benchmark report, and a publicly hosted product with a stable URL and a hyper-usable demo UI serving the trained model: the distributed training configuration, the parallelism strategy, the checkpoint-and-resume logic, a live training-metrics dashboard, an autoscaling serving deployment, CI/CD on every commit, production observability, a scaling report across GPU counts, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the communication pattern, the failure-recovery design, and the serving-scale design.

What it ships
  • A job launcher that takes a model and dataset and spreads training across multiple GPUs from a simple config.
  • Data parallelism plus at least one model-parallel strategy (tensor, pipeline, or fully-sharded).
  • Gang scheduling so every worker in a job starts together, preventing partial scheduling that idles GPUs.
  • Periodic distributed checkpointing with clean resume after a simulated node failure, losing at most the steps since the last checkpoint.
  • A live training dashboard: loss curves, throughput, GPU utilization, and inter-node communication overhead.
  • A scaling report that sweeps GPU counts and reports throughput and parallel efficiency.
  • Automatic detection and recovery from a stalled or crashed worker mid-run.
  • Spot/preemptible-instance support with checkpoint-driven recovery to cut training cost.
  • Serving of the trained model behind an OpenAI-compatible API with autoscaling.
  • A hyper-usable demo UI where a user can try the trained model on real prompts.
  • Production observability and a secured, rate-limited serving endpoint.
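
The checkpoint-and-resume behavior in the list above can be sketched with a toy training loop. This is a single-process illustration using only the standard library, not a distributed implementation: the JSON state, the function names, and the `crash_at` parameter are assumptions made for the example. A real run would save model and optimizer state through the training framework's own checkpoint facilities.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write never corrupts the last good file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, crash_at=None):
    """Toy loop: resumes from the last checkpoint and can simulate a node failure."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if crash_at is not None and step == crash_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        state["step"] = step
        state["loss_sum"] += 1.0 / step  # stand-in for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `train(path, 10, 3, crash_at=8)` fails with the last checkpoint at step 6; calling `train(path, 10, 3)` afterwards resumes at step 7 and finishes at step 10. Step 7 is redone because it was never checkpointed, which is the cost that the checkpoint interval controls.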
Stack you orchestrate
  • PyTorch
  • PyTorch FSDP or DeepSpeed
  • NCCL
  • Python
  • a multi-GPU runtime
  • a cluster scheduler (Slurm or Kubernetes)
  • Weights & Biases or TensorBoard

Market signal — who wants this

Distributed training infrastructure is a heavily funded 2026 category: Gartner projects $37.5B of end-user spending on AI-optimized infrastructure as a service (IaaS) in 2026, and a market of training-orchestration products has formed, including CoreWeave's Kubernetes-native GPU cloud, dstack's distributed-training orchestration, NVIDIA's open-source KAI Scheduler with gang scheduling, and NVIDIA Run:ai. Investors fund training infrastructure that keeps expensive clusters above 70% utilization; the scarce, decisive skill is making multi-GPU jobs fault-tolerant and efficient, not merely launching them.

How it is graded
  • Training runs across multiple GPUs using data parallelism plus at least one model-parallel strategy, with correct collective communication and the parallelism decomposition documented.
  • Checkpointing is implemented and the run resumes correctly after a simulated node failure.
  • A scaling report shows throughput and efficiency across GPU counts, and communication-versus-computation overhead is measured and discussed.
  • The trained model is served as a directly deployable product behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD on every commit, observability, and a secured endpoint.
  • A live training-metrics dashboard is provided, and the serving layer holds under concurrent load.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
  • The repo is reproducible with a clear benchmark methodology and the product is publicly reachable.
Bridges to Distributed Systems — parallelism, collective communication, and fault tolerance
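
The scaling report in the rubric above reduces to two derived numbers per GPU count: speedup relative to the smallest configuration, and parallel efficiency, which is speedup divided by the increase in GPU count. A minimal sketch follows; the throughput figures are made-up illustrative values, not measurements, and the function name is an assumption.

```python
def scaling_report(throughput_by_gpus):
    """Given {gpu_count: samples_per_second}, return rows of
    (gpus, throughput, speedup, efficiency).

    Speedup is relative to the smallest GPU count measured;
    efficiency = speedup / (gpus / base_gpus), so 1.0 means perfect linear scaling.
    """
    base_gpus = min(throughput_by_gpus)
    base_throughput = throughput_by_gpus[base_gpus]
    rows = []
    for gpus in sorted(throughput_by_gpus):
        throughput = throughput_by_gpus[gpus]
        speedup = throughput / base_throughput
        efficiency = speedup / (gpus / base_gpus)
        rows.append((gpus, throughput, round(speedup, 3), round(efficiency, 3)))
    return rows


# Illustrative numbers only.
rows = scaling_report({1: 100.0, 2: 190.0, 4: 360.0, 8: 640.0})
for gpus, throughput, speedup, efficiency in rows:
    print(f"{gpus} GPUs: {throughput:.0f} samples/s, speedup {speedup}, efficiency {efficiency}")
```

Efficiency that falls as GPU count rises is the usual signature of growing communication overhead, which is why the rubric asks for communication-versus-computation time to be measured alongside it.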

Real-Time Streaming AI Inference Platform

Week 12 milestone

An enterprise mandate: build and launch a platform that consumes a high-rate live event stream, runs inference on each event with low latency, indexes results into a vector store, and serves real-time similarity queries, all while staying healthy under bursty load and a node failure. This is a distributed-systems problem with AI inside it: exactly-once or otherwise well-justified delivery semantics, backpressure, autoscaling, GPU-aware scheduling, and observability across the fleet. The deliverable is a directly deployable, hyperscalable product: real public hosting, CI/CD, security, a hyper-usable real-time dashboard and query UI, and full marketing (a landing page, a pitch, and a demo). Deliver a platform that does not fall over when traffic spikes and that a buyer can evaluate. Ship it as a real product.

Why it matters: Real-time AI on live data powers fraud detection, recommendations, and observability products across every major platform. A builder who ships a streaming inference platform that holds up under load and failure is directly deployable as a senior Distributed Systems or Real-Time AI Engineer, a ₹1-crore-tier role because it demands both systems depth and AI fluency.

The deliverable

A publicly hosted platform with a stable URL and a hyper-usable real-time dashboard plus query UI, plus a public repo: the streaming ingestion and inference pipeline, the at-scale vector index, the autoscaling and backpressure design, CI/CD on every commit, fleet observability, a load-and-failure test report, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting delivery semantics, fault tolerance, security, and capacity planning.

What it ships
  • Ingestion of a high-rate live event stream (transactions, clicks, logs) from a streaming log such as Kafka.
  • Per-event low-latency inference with a documented end-to-end latency budget (target sub-100ms).
  • Stated delivery semantics — exactly-once, or at-least-once with idempotent processing — enforced in the pipeline.
  • A feature layer that assembles per-event features within the latency window for the model.
  • Inference results indexed into a vector store serving real-time similarity and nearest-neighbor queries.
  • Backpressure and autoscaling that keep the platform healthy through a sudden traffic burst.
  • Node-failure recovery with graceful degradation, demonstrated via injected failure.
  • A real-time dashboard of event throughput, inference latency percentiles, and fleet health.
  • A query UI for similarity search and recent-event lookup, usable without instruction.
  • Alerting on latency-SLO breaches, lag buildup, and anomalous event rates.
  • Multi-tenant isolation and a secured, rate-limited query API.
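
The "at-least-once with idempotent processing" option in the list above means a redelivered event must not be applied twice. A minimal sketch, assuming every event carries a unique `id` field and using an in-memory set where a production system would need a durable store; the class and handler names are illustrative.

```python
class IdempotentProcessor:
    """Apply a handler at most once per event ID, even if events are redelivered."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_ids = set()  # in-memory for the sketch; production needs durable storage

    def process(self, event):
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self.seen_ids:
            return False
        self.handler(event)
        self.seen_ids.add(event_id)  # mark as done only after the handler succeeds
        return True


# Hypothetical stream where the broker redelivers event e1.
totals = {"amount": 0}


def add_amount(event):
    totals["amount"] += event["amount"]


processor = IdempotentProcessor(add_amount)
deliveries = [
    {"id": "e1", "amount": 10},
    {"id": "e2", "amount": 5},
    {"id": "e1", "amount": 10},  # duplicate delivery
]
results = [processor.process(e) for e in deliveries]
print(results, totals)
```

The sketch still has a gap: a crash after the handler runs but before the ID is recorded would replay the effect. Closing it requires committing the effect and the ID together, which is one of the design points the README's delivery-semantics section would need to justify.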
Stack you orchestrate
  • Apache Kafka or a streaming log
  • Apache Flink or a stream processor
  • vLLM or a serving engine
  • FAISS or a vector database
  • Kubernetes
  • Prometheus and Grafana
  • Docker

Market signal — who wants this

Real-time streaming AI is a funded 2026 category anchored in fraud detection and live personalization: Artie raised a $12M Series A to make real-time data the default for AI systems, Experian launched real-time AI fraud detection with Resistant AI's 80+ models, and global fintech venture funding hit $12B across 751 deals by April 2026. Production fraud models need sub-millisecond feature retrieval and 20-100+ features within a 100ms window, served by vector databases like Pinecone, Milvus, and Redis. Investors fund streaming-AI platforms because regulated finance and large consumer platforms must score live events instantly or lose money.

How it is graded
  • A live event stream is consumed and inference runs per event with measured low latency.
  • Delivery semantics (exactly-once or at-least-once with idempotency) are stated and justified.
  • Inference results are indexed into a vector store that serves real-time similarity queries.
  • Backpressure and autoscaling keep the platform healthy under a simulated traffic burst, and an injected node failure is recovered with graceful degradation.
  • The platform is deployed to real public hosting with CI/CD on every commit, fleet observability, and security hardening.
  • A fast, WCAG 2.2 AA accessible real-time dashboard and query UI lets a stranger use the platform without instruction.
  • The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
  • The platform is publicly reachable and fully reproducible from the repo.
Bridges to Distributed Systems — stream processing, fault tolerance, and capacity planning
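
Backpressure, as required by the rubric above, means a fast producer is slowed down instead of letting unprocessed events pile up without limit. A bounded queue is one simple way to get that behavior inside a single process; this sketch uses `asyncio.Queue` with a `maxsize`, and the function name, the `max_in_flight` value, and the stand-in for inference work are assumptions for illustration.

```python
import asyncio


async def run_pipeline(events, max_in_flight=4):
    """Push events through a bounded queue; return (processed, peak_queue_depth)."""
    queue = asyncio.Queue(maxsize=max_in_flight)
    processed = []
    peak_depth = 0

    async def producer():
        nonlocal peak_depth
        for event in events:
            await queue.put(event)  # suspends while the queue is full
            peak_depth = max(peak_depth, queue.qsize())
        await queue.put(None)  # sentinel: no more events

    async def consumer():
        while True:
            event = await queue.get()
            if event is None:
                break
            await asyncio.sleep(0)  # stand-in for per-event inference work
            processed.append(event)

    await asyncio.gather(producer(), consumer())
    return processed, peak_depth


processed, peak_depth = asyncio.run(run_pipeline(list(range(20)), max_in_flight=4))
print(len(processed), peak_depth)
```

The queue depth never exceeds `max_in_flight`, so memory stays bounded during a burst. Across a fleet the same idea shows up as consumer lag on the streaming log, which is the signal autoscaling would act on.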

Why the subject bridge matters. Every project is mapped to a real classic CS subject — Operating Systems, Distributed Systems, Databases, Compilers. You are not choosing between your degree and shipping real software. You can tell a faculty member, truthfully, "I built this for my Distributed Systems mini-project" — and tell a hiring panel it is production-grade.