Autonomous Multi-Agent Research-and-Ship System
Week 7 milestone
You are handed an enterprise mandate: the research division needs a launched product — a system that takes an open-ended technical question, autonomously researches it across many sources, synthesizes a defensible report, and ships the report as a published artifact, with zero human steps in the middle. Build an orchestrator-worker multi-agent system: a lead agent that decomposes the question and spawns specialized worker agents (search, read, synthesize, fact-check), coordinates their results through shared state, and produces a cited deliverable. This is not a notebook demo. The result must be a directly deployable, hyperscalable product: real public hosting, CI/CD on every commit, observability, security hardening, a polished and accessible web UI a non-technical analyst will happily use, and complete go-to-market material — a landing page, a pitch, and a recorded demo. The architecture must absorb concurrent research runs without falling over, and recover from a failed worker. We are not here to babysit the run; ship it as a real product.
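The orchestrator-worker shape described above can be sketched in a few lines. This is a minimal illustration, not a prescribed design: `RunState`, `decompose`, `run_with_retry`, and `orchestrate` are hypothetical names, the planner is a stub standing in for the lead agent's LLM call, and a real build would likely use LangGraph or a task queue rather than bare `asyncio`.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class RunState:
    """Shared state that the lead agent and every worker read and write."""
    question: str
    findings: dict = field(default_factory=dict)   # sub-question -> result
    attempts: dict = field(default_factory=dict)   # sub-question -> tries used


def decompose(question: str) -> list:
    # Stub for the lead agent's planning call (hypothetical; no LLM involved).
    return [f"{question} :: background", f"{question} :: evidence"]


async def run_with_retry(worker, sub_question, state, max_attempts=3):
    """Run one worker; re-dispatch it if it raises, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        state.attempts[sub_question] = attempt
        try:
            state.findings[sub_question] = await worker(sub_question)
            return
        except Exception:
            if attempt == max_attempts:
                # Record the gap instead of aborting the whole run.
                state.findings[sub_question] = "UNRESOLVED"


async def orchestrate(question, worker):
    state = RunState(question)
    tasks = [run_with_retry(worker, sq, state) for sq in decompose(question)]
    await asyncio.gather(*tasks)
    return state


# Demo: a worker that crashes on its first call and succeeds afterwards.
_calls = {"n": 0}


async def flaky_worker(sub_question):
    _calls["n"] += 1
    if _calls["n"] == 1:
        raise RuntimeError("simulated worker crash")
    return f"notes on {sub_question}"


state = asyncio.run(orchestrate("What limits battery energy density?", flaky_worker))
print(state.attempts)
```

The point of the sketch is that a crash is absorbed at the dispatch layer: the failed sub-question is retried and the run still completes with every sub-question accounted for in shared state.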
Why it matters: Multi-agent research and synthesis systems are being deployed across consulting, finance, and R&D to compress weeks of analyst work into hours. Shipping a coordinated, fault-tolerant agent fleet makes a builder ready for an Agentic Systems Engineer or Applied AI Engineer role at the ₹1-crore tier, where the bar is production reliability, not a demo.
The deliverable
A publicly hosted product with its own domain or stable URL, plus a public repo: the orchestrator and worker agents, an MCP-based tool layer, a fast accessible web UI for submitting questions and reading results, CI/CD running lint/tests/build on every commit, persisted and inspectable run traces, a marketing landing page, a 10-slide pitch, a recorded demo video, and a README documenting the coordination design, the failure-recovery and scaling strategy, and three example end-to-end runs with their published reports.
What it ships
- Submit-a-question interface accepting an open-ended technical or market question with a depth setting (quick scan vs deep dive).
- A lead orchestrator agent that decomposes the question into a research plan and spawns specialized worker agents.
- Specialized workers — web search, source reading, synthesis, and an independent fact-checker that verifies every claim.
- An MCP tool layer exposing search, fetch, and document tools so the same tools are reusable across agents and projects.
- Live run view: a real-time graph of agent activity, sub-questions in flight, and sources being consumed.
- Inline-cited report output where every claim links to the exact retrieved passage that supports it.
- Export to PDF, Markdown, and a shareable public report URL.
- Persisted, replayable run traces with token spend and latency per agent for cost auditing.
- Automatic worker-failure detection and re-dispatch so a crashed worker never aborts a run.
- A workspace history of past research runs with search and one-click re-run.
- Concurrency controls and per-run budget caps so many users can run research in parallel safely.
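The last item in the list, concurrency controls with per-run budget caps, might look roughly like the following. It is a sketch under stated assumptions: `RunBudget`, `BudgetExceeded`, and `run_step` are hypothetical names, token counts are pre-call estimates, and `asyncio.sleep(0)` stands in for the model or tool call.

```python
import asyncio
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a step would push a run past its token cap."""


@dataclass
class RunBudget:
    max_tokens: int
    spent: int = 0

    def charge(self, tokens: int) -> None:
        # Refuse the step before spending, so the cap is never overshot.
        if self.spent + tokens > self.max_tokens:
            raise BudgetExceeded(f"{self.spent + tokens} > {self.max_tokens}")
        self.spent += tokens


async def run_step(budget, limiter, estimated_tokens):
    async with limiter:                 # bounds how many steps run at once
        budget.charge(estimated_tokens)
        await asyncio.sleep(0)          # placeholder for the real agent call
        return estimated_tokens


async def demo():
    budget = RunBudget(max_tokens=1000)
    limiter = asyncio.Semaphore(2)      # at most two steps in flight
    steps = [run_step(budget, limiter, t) for t in (400, 400, 400)]
    results = await asyncio.gather(*steps, return_exceptions=True)
    return budget, results


budget, results = asyncio.run(demo())
print(budget.spent, [type(r).__name__ for r in results])
```

In a deployed system the limiter would more plausibly be shared across runs (or enforced by the platform's concurrency settings), while each run keeps its own budget object.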
Stack you orchestrate
- Claude API or open-weight LLM
- Model Context Protocol
- LangGraph
- Node.js or Python
- Docker
- Google Cloud Run
- A tracing backend (LangSmith or OpenTelemetry)
Market signal: who wants this
Agentic deep-research is one of the hottest 2026 categories: the AI agent market is projected to grow from $7.84B in 2025 to $52.62B by 2030 (roughly 46% CAGR), and a16z reports a portfolio pivot from copilots to autonomous systems, with Sierra, Glean, and Decagon as comparables and YC W26 funding multi-agent orchestration startups such as Tensol and Korso. Consulting, finance, and corporate R&D teams are actively buying systems that compress weeks of analyst work into hours; investors fund this because it sells time back to high-cost knowledge workers.
How it is graded
- The orchestrator decomposes a question and coordinates at least three specialized worker agents through explicit shared state.
- Tools are exposed through a standard protocol (MCP), not bespoke per-agent glue.
- The system is deployed to real public hosting with CI/CD on every commit and production observability (logs, traces, metrics).
- The architecture handles concurrent research runs under load, and a worker failure mid-run still yields a complete, correct deliverable.
- The web UI is fast, WCAG 2.2 AA accessible, and usable by a non-technical analyst without instruction.
- Every claim in the output report is traceable to a retrieved source, and run traces are persisted and inspectable.
- The project ships complete marketing: a landing page, a 10-slide pitch, and a recorded demo, presentable as a real product.
- The product is publicly reachable and fully reproducible from the repo by a stranger.
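The traceability criterion above (every claim tied to a retrieved source) can be partly checked mechanically. The sketch below assumes a hypothetical report format in which each claim carries a `source_id` and a verbatim `quote`; `Claim` and `unsupported_claims` are illustrative names. It only confirms that the quoted passage exists in the cited source, not that the passage actually supports the claim, which is still the fact-checker agent's job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str        # the sentence as it appears in the report
    source_id: str   # key of the retrieved document it cites
    quote: str       # passage expected to appear verbatim in that document


def _normalize(s: str) -> str:
    # Collapse whitespace and case so line wrapping does not break matching.
    return " ".join(s.lower().split())


def unsupported_claims(claims, sources):
    """Return claims whose quoted passage is missing from the cited source."""
    bad = []
    for claim in claims:
        body = sources.get(claim.source_id)
        quote = _normalize(claim.quote)
        if body is None or not quote or quote not in _normalize(body):
            bad.append(claim)
    return bad


sources = {"doc-1": "Lithium-ion cells are limited\nby cathode capacity."}
claims = [
    Claim("Cathodes are the bottleneck.", "doc-1", "limited by cathode capacity"),
    Claim("Anodes are the bottleneck.", "doc-1", "limited by anode capacity"),
    Claim("Uncited assertion.", "doc-9", "anything"),
]
flagged = unsupported_claims(claims, sources)
print([c.text for c in flagged])
```

Running a check like this in CI over the three example reports is one way to keep the citation guarantee from regressing.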
Bridges to Distributed Systems — coordination, message passing, and fault tolerance
Autonomous Coding Agent with Sandboxed Execution
Week 12 milestone
An enterprise mandate: ship a launched product — an autonomous coding agent that, given a real GitHub issue, plans a fix, writes and runs code inside a hardened sandbox, iterates against test feedback, and opens a pull request, with untrusted code never touching the host. This is an operating-systems problem as much as an AI problem: the agent's generated code is hostile by assumption. Confine it with containers or microVMs, enforce filesystem and network policy, cap CPU and memory, and survive an agent that tries to escape or hang. The deliverable is not a script — it is a directly deployable, hyperscalable product: real public hosting, CI/CD, observability, a clean dashboard where a developer queues issues and watches trajectories, security hardening, and full marketing (landing page, pitch, demo). The sandbox fleet must scale horizontally to many concurrent jobs. We are not here to babysit a run; ship it as a real product.
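As one concrete reading of the confinement requirements above, the sketch below builds a `docker run` command line that applies CPU, memory, process-count, filesystem, and network limits. The flags are standard Docker CLI options; the function name `sandbox_argv`, the default limits, and the image name are assumptions chosen for illustration. Passing `runtime="runsc"` assumes gVisor is already installed and registered with the Docker daemon. Flags like these reduce the attack surface but are not by themselves a complete threat model.

```python
import shlex
from typing import List, Optional


def sandbox_argv(
    image: str,
    command: List[str],
    *,
    cpus: float = 1.0,
    memory_mb: int = 512,
    pids: int = 128,
    runtime: Optional[str] = None,
) -> List[str]:
    """Build (but do not execute) a locked-down `docker run` invocation."""
    argv = [
        "docker", "run", "--rm",
        "--network", "none",              # no network access from the job
        "--read-only",                    # root filesystem is immutable
        "--tmpfs", "/tmp:rw,size=64m",    # small writable scratch space
        "--cpus", str(cpus),              # CPU cap
        "--memory", f"{memory_mb}m",      # memory cap
        "--pids-limit", str(pids),        # blunt fork bombs
        "--cap-drop", "ALL",              # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
    ]
    if runtime:
        argv += ["--runtime", runtime]    # e.g. "runsc" for gVisor, if installed
    return argv + [image, *command]


argv = sandbox_argv("python:3.12-slim", ["pytest", "-q"], runtime="runsc")
print(shlex.join(argv))
```

A hang is handled outside this command: the job runner would enforce a wall-clock deadline and kill the container by name when it expires.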
Why it matters: Autonomous coding agents are now a core part of engineering org tooling, and the hard part is safe execution at scale, not code generation. A builder who can ship a sandboxed, evaluated coding agent is a direct fit for an Agent Infrastructure Engineer or AI Platform Engineer role; these roles command ₹1-crore-plus compensation because they sit between security and AI.
The deliverable
A publicly hosted product with a stable URL, plus a public repo: the planning-and-execution loop, the sandbox isolation layer, a fast accessible dashboard for queuing issues and inspecting trajectories, CI/CD on every commit, production observability, an eval suite over a set of real issues with pass/fail trajectories, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting the isolation threat model, the horizontal-scaling design, and the workflow-versus-agent decision.
What it ships
- GitHub integration: connect a repo, and the agent picks up issues labeled for automation.
- A planning stage that reads the issue, explores the codebase, and produces a fix plan before writing code.
- Generated code runs only inside a hardened microVM or container sandbox with enforced CPU, memory, filesystem, and network limits.
- An iterative loop: run the test suite, read failures, revise, and retry until tests pass or a budget is hit.
- Automatic pull-request creation with a written summary of the change and the test evidence.
- A live trajectory dashboard showing the agent's plan, edits, command output, and retries in real time.
- An eval mode that runs the agent over a fixed set of real issues and reports pass rate and trajectory quality.
- Prompt-injection defenses: untrusted issue and repo text is treated as hostile, with least-privilege tool scopes.
- Horizontal sandbox-fleet scaling so many issues are worked concurrently without contention.
- Per-job cost and time tracking, with configurable budget caps and a hard kill on runaway jobs.
- An audit log of every command the agent ran inside the sandbox.
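The iterate-until-green loop and the attempt budget from the list above can be sketched as follows. All names (`SuiteResult`, `Trajectory`, `fix_loop`) are hypothetical, and the two callables are stubs: in a real agent `propose_patch` would call the model and `run_tests` would execute the suite inside the sandbox.

```python
from dataclasses import dataclass, field


@dataclass
class SuiteResult:
    passed: bool
    output: str   # test-runner output fed back to the agent


@dataclass
class Trajectory:
    attempts: list = field(default_factory=list)
    status: str = "running"


def fix_loop(propose_patch, run_tests, max_attempts=5):
    """Propose, test, and revise until tests pass or the budget is spent."""
    trajectory = Trajectory()
    feedback = ""
    for _ in range(max_attempts):
        patch = propose_patch(feedback)
        result = run_tests(patch)
        trajectory.attempts.append(patch)
        if result.passed:
            trajectory.status = "passed"
            return trajectory
        feedback = result.output          # failures drive the next revision
    trajectory.status = "budget_exhausted"
    return trajectory


# Stubs: the first patch fails, the revision informed by feedback passes.
def propose_patch(feedback):
    return "patch-v2" if feedback else "patch-v1"


def run_tests(patch):
    ok = patch == "patch-v2"
    return SuiteResult(ok, "" if ok else "AssertionError in test_parse")


good = fix_loop(propose_patch, run_tests)
stuck = fix_loop(propose_patch, lambda p: SuiteResult(False, "still failing"), max_attempts=3)
print(good.status, len(good.attempts), stuck.status, len(stuck.attempts))
```

Only a `passed` trajectory should go on to open a pull request; a `budget_exhausted` one is still worth persisting, since it is exactly what the eval mode scores.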
Stack you orchestrate
- Claude API or open-weight LLM
- gVisor or Firecracker microVMs
- Docker
- GitHub API
- Node.js or Python
- An eval framework
- Google Cloud Run
Market signal: who wants this
Autonomous coding agents are the breakout 2026 developer-tools category: Cursor reached $2B ARR in February 2026 and is raising at a $50B+ valuation, and Replit raised a $400M Series D at a $9B valuation in March 2026. SWE-bench Verified is now the industry yardstick, with top agents exceeding 80%. A whole sub-market of sandbox-execution infrastructure (E2B, Northflank, Modal, Blaxel) is being funded specifically to run agent-written code safely; investors back this because safe execution at scale, not code generation, is the unsolved bottleneck.
How it is graded
- Generated code executes only inside an isolated sandbox with enforced CPU, memory, filesystem, and network limits.
- The agent iterates on test feedback and recovers from its own failed attempts.
- The sandbox fleet scales horizontally to many concurrent jobs and the scaling design is documented.
- The product is deployed to real public hosting with CI/CD on every commit and production observability.
- A fast, WCAG 2.2 AA accessible dashboard lets a developer queue issues and inspect agent trajectories.
- An eval suite reports pass rate and trajectory quality over a fixed set of real issues.
- The isolation threat model is documented, including sandbox-escape containment, and prompt-injection via issue or repo content is mitigated with least-privilege tool access.
- The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo — and is publicly hosted and reproducible from the repo.
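Least-privilege tool access and the audit log from the criteria above can share one choke point. The sketch below assumes a hypothetical `ToolGate` through which every tool call passes: each job gets an explicit allowlist, every attempt is recorded whether or not it is permitted, and anything off the list is refused. The tool names are illustrative only.

```python
import time
from dataclasses import dataclass, field


class ToolDenied(Exception):
    """Raised when the agent requests a tool outside its scope."""


@dataclass
class ToolGate:
    allowed: frozenset                       # tool names this job may use
    audit_log: list = field(default_factory=list)

    def call(self, tool, handler, **args):
        permitted = tool in self.allowed
        # Log before acting, so denied attempts are also on record.
        self.audit_log.append(
            {"ts": time.time(), "tool": tool, "args": args, "allowed": permitted}
        )
        if not permitted:
            raise ToolDenied(tool)
        return handler(**args)


gate = ToolGate(allowed=frozenset({"read_file", "run_tests"}))
content = gate.call("read_file", lambda path: f"contents of {path}", path="README.md")

try:
    # A prompt-injected instruction asking for a tool that was never granted.
    gate.call("push_to_main", lambda: None)
    denied = False
except ToolDenied:
    denied = True

print(content, denied, [entry["allowed"] for entry in gate.audit_log])
```

Because issue and repo text is treated as hostile, the allowlist is set by the job configuration, never by anything the model reads.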
Bridges to Operating Systems — virtualization, process isolation, and resource management