Applied ML & Model Engineering, ParallelCS

Week 1

Neural Networks & Backpropagation from Scratch

Build a neural net and autograd by hand. Gradients, the chain rule, and what 'training' actually computes — no framework magic until you have earned the abstraction.
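
As a rough illustration of what "autograd by hand" means, here is a minimal sketch of a scalar value that records how it was computed and applies the chain rule backward. The `Value` class and its attribute names are illustrative assumptions in the spirit of micrograd, not the course's reference implementation, and only addition and multiplication are covered.

```python
class Value:
    """A scalar that remembers its inputs so gradients can flow backward."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a topological order so each node is processed after its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            # Chain rule: parent gradient += local derivative * downstream gradient.
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


x, w, b = Value(3.0), Value(2.0), Value(1.0)
y = x * w + b
y.backward()
print(y.data, x.grad, w.grad, b.grad)
```

Gradients accumulate with `+=` because a value used in several places receives one contribution per use.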

Bridges to Calculus & Linear Algebra — gradients, the chain rule, and vector spaces

Builds on: nothing, start here

Video: Neural Networks: Zero to Hero, building micrograd and makemore (Andrej Karpathy, free)
Video: Neural Networks series (3Blue1Brown, free)
Course: Machine Learning Crash Course (Google, free)

Week 2

Deep Learning Foundations

Optimization that actually converges: SGD and Adam, regularization, normalization, initialization, and the failure modes (vanishing gradients, overfitting) every practitioner must recognize.
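
To make the Adam update concrete, here is a minimal sketch for a single scalar parameter. The function names, the toy objective f(x) = (x - 3)^2, and the learning rate are assumptions chosen for illustration; the defaults for `beta1`, `beta2`, and `eps` follow the commonly used values.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def minimize_quadratic(steps=2000, lr=0.05):
    """Minimize the toy objective f(x) = (x - 3)^2 starting from x = 5."""
    x, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * (x - 3)
        x, m, v = adam_step(x, grad, m, v, t, lr=lr)
    return x


print(minimize_quadratic())
```

Note that the very first step moves the parameter by roughly `lr` regardless of the gradient's magnitude, because the bias-corrected ratio `m_hat / sqrt(v_hat)` has absolute value close to 1 at t = 1.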

Bridges to Machine Learning — optimization, generalization, and the bias-variance tradeoff

Builds on: Neural Networks & Backpropagation from Scratch

Week 3

Transformers, Attention & Pretraining

Why attention replaced recurrence, and what pretraining a language model on a corpus actually optimizes. Tokenization, the training objective, and scaling laws.
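
As a sketch of the mechanism this week covers, the following is single-head scaled dot-product attention with an optional causal mask, written with plain lists so every step is visible. The function names and the tiny inputs are illustrative assumptions; real implementations batch this with tensor operations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=True):
    """Each output row is a weighted average of value rows.

    With causal=True, position i may only attend to positions 0..i,
    which is what next-token pretraining requires.
    """
    dim = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys[:limit]
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values[:limit]))
            for j in range(len(values[0]))
        ])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0]]
vals = [[1.0, 2.0], [3.0, 4.0]]
print(attention(tokens, tokens, vals))
```

Every position compares against every earlier position, so the cost grows quadratically with sequence length, unlike a recurrence that carries one state forward.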

Bridges to Machine Learning — sequence modeling and representation learning

Builds on: Deep Learning Foundations

Week 4

Pre-Training Data & Recursive Self-Improvement

Master pre-training data engineering and curation pipelines. Implement recursive self-improvement (RSI) techniques where advanced frontier models curate high-density synthetic data, generate training taxonomies, and manage model self-succession.

Bridges to Databases — data cleaning, deduplication, and ETL pipelines
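
The deduplication step in a curation pipeline can be sketched as exact-match removal after light normalization. This is a minimal illustration under assumed choices (lowercasing, whitespace collapsing, SHA-256 keys); production pipelines typically add near-duplicate detection, which this sketch does not attempt.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedupe(documents):
    """Keep the first occurrence of each normalized document, preserving order."""
    seen = set()
    kept = []
    for doc in documents:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


print(dedupe(["Hello  World", "hello world ", "Other"]))
```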

Builds on: Transformers, Attention & Pretraining

Week 5

Hybrid Architectures: SSM-Transformer Hybrids & Attention Alternatives

Explore alternatives and hybrids to standard self-attention. Analyze how State Space Models (SSMs) like Mamba are merged with traditional attention mechanisms to form high-throughput, linear-complexity hybrid layers in modern frontier models.
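
To see where the linear complexity comes from, here is a minimal sketch of a one-dimensional linear state space recurrence, h_t = a * h_{t-1} + b * x_t with output y_t = c * h_t. The scalar parameters `a`, `b`, and `c` are fixed assumptions for illustration; selective SSMs such as Mamba make these parameters depend on the input and use a vector state, which this sketch omits.

```python
def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear recurrence over the sequence in a single pass.

    One fixed-size state is carried forward, so the cost is linear in the
    sequence length.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x   # update the hidden state
        outputs.append(c * state)   # read out from the state
    return outputs


print(ssm_scan([1.0, 0.0, 0.0], a=0.5))
```

An impulse at the first position decays geometrically through the state, which shows how earlier tokens influence later outputs without any pairwise comparison.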

Bridges to Computer Architecture — specialized processors and hardware-agnostic compilation

Builds on: Pre-Training Data & Recursive Self-Improvement

Week 6

Supervised Fine-Tuning

Adapt a base model to a task or style with SFT: instruction tuning, hyperparameters that matter, overfitting and catastrophic forgetting, and measuring whether it worked.

Bridges to Machine Learning — transfer learning and supervised training

Builds on: Pre-Training Data & Recursive Self-Improvement

Week 7

Parameter-Efficient Fine-Tuning: LoRA & QLoRA

Adapt billion-parameter models on a single GPU. Low-rank adaptation, QLoRA's 4-bit base plus adapters, and the cost-versus-quality math that makes customization affordable.
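
The cost math behind low-rank adaptation can be sketched in a few lines. Assuming a frozen weight W of shape d_out x d_in and a trainable update (alpha / r) * B A with A of shape r x d_in and B of shape d_out x r, the adapter trains r * (d_in + d_out) parameters instead of d_out * d_in. The helper names and the 4096 x 4096 example are illustrative assumptions.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, adapter, ratio) trainable-parameter counts for one matrix."""
    full = d_out * d_in
    adapter = rank * (d_in + d_out)
    return full, adapter, adapter / full


def matvec(matrix, vector):
    """Plain-list matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x) with W frozen and A, B trainable."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


print(lora_param_counts(4096, 4096, 8))
```

With B initialized to zeros, the adapted layer starts out identical to the base layer, so training begins from the base model's behavior.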

Bridges to Linear Algebra — matrix rank, decomposition, and low-rank approximation

Builds on: Supervised Fine-Tuning

Week 8

End-to-End Fine-Tuning Pipeline for a Domain Model

Week 8 milestone

An enterprise mandate: take a base open-weight model and adapt it into a specialist for a real domain you choose (legal, medical, code, support), then ship it as a launched product. Own the whole pipeline: build and clean the training dataset, run parameter-efficient fine-tuning (LoRA or QLoRA), and prove on a held-out, contamination-controlled evaluation that the adapted model beats the base model on the target task. The deliverable is not a notebook — it is a directly deployable, hyperscalable product: the fine-tuned model served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD, observability, security, and full marketing (landing page, pitch, demo) so a domain user can try it and a buyer can evaluate it. We do not accept 'it seems better' — bring the numbers, and ship it as a real product.

Why it matters: Domain-adapted models are how companies turn a generic LLM into a defensible product, and most fine-tuning projects fail on data discipline. A builder who can run a clean, evaluated, reproducible fine-tuning pipeline is directly deployable as an ML Engineer or Model Engineer, a sought-after frontier role.

The deliverable

A publicly hosted product with a stable URL and a hyper-usable demo UI, plus a public repo and a published model card: the data pipeline, the QLoRA training configuration and run logs, an autoscaling serving deployment, CI/CD on every commit, production observability, an evaluation comparing base versus fine-tuned on a held-out set, a marketing landing page, a 10-slide pitch, a recorded demo, and a README documenting data sourcing, the contamination check, the scaling design, and the cost of the run.

What it ships

A dataset builder that ingests raw domain documents and turns them into cleaned, deduplicated instruction data.
An automatic train/eval contamination check that flags and removes overlap before training.
A LoRA/QLoRA training workflow with a reproducible config file and full run logging.
An experiment view comparing runs across hyperparameters, with loss curves and eval scores.
A held-out evaluation harness reporting target-task accuracy for the base model versus the fine-tuned model.
A catastrophic-forgetting check that scores the fine-tuned model on general tasks, not just the target task.
Adapter management: deploy one base model and hot-swap LoRA adapters per request.
An OpenAI-compatible serving API for the fine-tuned model, with autoscaling.
A hyper-usable demo UI where a domain user can try the specialist model on real prompts.
An auto-generated model card documenting data sourcing, intended use, limitations, and run cost.
Production observability and a secured, rate-limited endpoint.
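
One possible shape for the train/eval contamination check listed above is a word n-gram overlap filter. This is a minimal sketch under assumed choices (whitespace tokenization, lowercasing, and an `n` of 8 by default); the function names are hypothetical, and a real check would likely add normalization and fuzzy matching.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def split_contaminated(train_texts, eval_texts, n=8):
    """Separate training texts that share any n-gram with the eval set.

    Returns (clean, flagged) so the flagged examples can be reviewed
    or dropped before training.
    """
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)

    clean, flagged = [], []
    for text in train_texts:
        if ngrams(text, n) & eval_grams:
            flagged.append(text)
        else:
            clean.append(text)
    return clean, flagged


train = [
    "Q: what is the statute of limitations in Ohio",
    "Summarize this contract clause",
]
held_out = ["what is the statute of limitations for fraud"]
print(split_contaminated(train, held_out, n=3))
```

A smaller `n` flags more aggressively; the right threshold depends on how much boilerplate phrasing the domain shares across documents.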

Stack you orchestrate

Hugging Face Transformers, TRL, PEFT, bitsandbytes, PyTorch, Hugging Face Datasets, a GPU runtime (Colab, Kaggle, or a cloud instance)

Market signal, who wants this

Domain fine-tuning is a funded 2026 platform category: Together AI, Predibase (acquired by Rubrik in June 2025 for enterprise security depth), and Prem Studio compete on managed LoRA/QLoRA, and adapter-routing (one base model, many adapters per request) is now standard. The economics are compelling: a 7B model can be specialized on a single consumer GPU in an afternoon. Enterprises buy custom models that speak their technical language; investors fund fine-tuning platforms because every vertical AI product needs a model adapted to its own data.

How it is graded

A training dataset is built, cleaned, and deduplicated, with sourcing documented.
Parameter-efficient fine-tuning (LoRA or QLoRA) is run with a reproducible configuration.
A held-out evaluation shows a measured improvement of the fine-tuned model over the base, with train/eval contamination explicitly checked and catastrophic forgetting measured.
The fine-tuned model is served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD on every commit, observability, and a secured endpoint.
The serving architecture holds under concurrent load and the scaling design is documented.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The repo is reproducible, a model card documents intended use and limitations, and the product is publicly reachable.

Bridges to Machine Learning — transfer learning, supervised training, and evaluation

Week 9

Preference Optimization & Reasoning Verification

Align models using RLHF, DPO, and semi-supervised reasoning verification. Train lightweight correctness classifiers to verify intermediate reasoning steps rather than final answers, enabling sample-efficient fine-tuning on mid-weight architectures.
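
For orientation, the DPO objective for a single preference pair can be sketched as follows. The inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; the function name and the `beta` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy prefers the chosen
    response over the rejected one, relative to the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Stable softplus(-z): equals -log(sigmoid(z)) without overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


print(dpo_loss(-5.0, -7.0, -5.0, -7.0))   # policy equals reference
print(dpo_loss(-4.0, -9.0, -5.0, -7.0))   # policy favors the chosen response
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the chosen response.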

Bridges to Machine Learning — reinforcement learning and policy optimization

Builds on: Parameter-Efficient Fine-Tuning: LoRA & QLoRA

Week 10

Knowledge Distillation & Model Compression

Compress a large teacher into a small, deployable student that keeps most of the capability. Distillation objectives, synthetic-data distillation, and honest accuracy accounting.
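
One common distillation objective is the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The sketch below computes it for a single set of logits; the function names and the temperature default are assumptions for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; higher temperature gives a softer distribution."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    teacher = softmax_with_temperature(teacher_logits, temperature)
    student = softmax_with_temperature(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher, student)
        if p > 0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.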

Bridges to Machine Learning — model compression and the teacher-student paradigm

Builds on: Preference Optimization & Reasoning Verification

Week 11

Rigorous Model Evaluation

Benchmarks lie when misused. Build evaluation harnesses, control for contamination, measure on task-representative data, and report uncertainty instead of a single number.
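
Reporting uncertainty instead of a single number can be as simple as a percentile bootstrap over per-example results. The sketch below assumes a list of 0/1 correctness flags from an evaluation run; the function name, resample count, and seed are illustrative choices.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, confidence=0.95, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` is a list of 0/1 flags, one per evaluation example.
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the eval set with replacement and recompute accuracy each time.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return sum(correct) / n, means[lower_index], means[upper_index]


results = [1] * 80 + [0] * 20
accuracy, lower, upper = bootstrap_accuracy_ci(results)
print(accuracy, lower, upper)
```

On 100 examples the interval spans several points of accuracy, which is why small differences between two models on a small eval set are often not meaningful.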

Bridges to Statistics — sampling, confidence intervals, and experimental design

Builds on: Knowledge Distillation & Model Compression

Week 12

Distill a Frontier Model into a Deployable Specialist

Week 12 milestone

An enterprise mandate: a large model solves a task well but is too expensive to serve at volume. Distill its capability on that task into a small student model that can run cheaply, then ship the student as a launched product. Use the teacher to generate or label training data, train and align the student, and prove the student keeps most of the capability at a fraction of the cost. The deliverable is not a benchmark table — it is a directly deployable, hyperscalable product: the student served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD, observability, security, and full marketing (landing page, pitch, demo). Whatever time it takes — the deliverable is a model the business can actually afford to run and a product a buyer can try. Ship it as a real product.

Why it matters: Distillation is the standard route from an expensive frontier model to an economically viable product feature. A builder who can distill, align, and deploy a specialist student is directly deployable as a senior Model Engineer or Applied Scientist, a high-leverage role because distillation work converts directly into serving-cost reduction at scale.

The deliverable

A publicly hosted product with a stable URL and a hyper-usable demo UI, plus a public repo and a published student model: the distillation data pipeline, the student training and preference-optimization configuration, an autoscaling serving deployment, CI/CD on every commit, production observability, a benchmark comparing teacher, student, and base on the target task, a cost-and-latency comparison, a marketing landing page, a 10-slide pitch, a recorded demo, and a README on the distillation method and scaling design.

What it ships

A teacher-labeling pipeline that uses a frontier model to generate or label distillation data for a chosen task.
Synthetic-data generation with quality filtering so the student trains on clean, diverse examples.
A student-training workflow producing a small (0.6B-8B) model, with reproducible configs and run logs.
Optional preference optimization (DPO) to align the student where the task needs it.
A three-way benchmark — teacher, student, and base — reporting capability retained on the target task.
A cost-and-latency comparison computing the serving-cost reduction versus the teacher.
An accuracy-floor gate that blocks shipping a student that drops below a configured retention threshold.
An OpenAI-compatible serving API for the student model with autoscaling and scale-to-zero.
A hyper-usable demo UI letting a buyer try teacher and student side by side on real prompts.
An auto-generated model card with the distillation method, retention numbers, and intended use.
Production observability and a secured, rate-limited endpoint.
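
The accuracy-floor gate and the cost comparison listed above could be combined into one release check. This is a minimal sketch: the function name, the dictionary keys, and the 0.9 retention default are hypothetical, and the scores and per-request costs are assumed to come from the benchmark and serving measurements.

```python
def ship_decision(student_score, teacher_score, student_cost, teacher_cost,
                  min_retention=0.9):
    """Decide whether a distilled student clears the configured accuracy floor.

    Retention is the student's score as a fraction of the teacher's score;
    cost reduction is how many times cheaper the student is to serve.
    """
    if teacher_score <= 0 or student_cost <= 0:
        raise ValueError("teacher_score and student_cost must be positive")
    retention = student_score / teacher_score
    return {
        "retention": retention,
        "cost_reduction": teacher_cost / student_cost,
        "ship": retention >= min_retention,
    }


print(ship_decision(0.84, 0.90, student_cost=1.0, teacher_cost=20.0))
print(ship_decision(0.70, 0.90, student_cost=1.0, teacher_cost=20.0))
```

Wiring this into CI means a student that falls below the floor fails the build instead of reaching the public endpoint.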

Stack you orchestrate

Hugging Face Transformers, TRL, PEFT, PyTorch, vLLM for serving, an eval harness, Docker

Market signal, who wants this

Distillation drives the most-cited 2026 enterprise-AI economics: task-specific small models (0.6B-8B) match or beat frontier models at 10-100x lower inference cost, retaining 85-95% of capability. A $35K-$120K distillation project pays back in three weeks to three months against frontier inference bills, and startups like distil labs are funded purely to 'replace LLMs with custom small language models.' Investors back distillation because it is the clearest path from an expensive frontier model to a margin-positive product feature.

How it is graded

A teacher model is used to generate or label distillation data with a documented method.
A smaller student is trained and the capability retained on the target task is measured against the teacher.
Preference optimization (DPO or RLHF) or another alignment step is applied to the student where the task needs it, and the choice is justified.
The student is served behind a real public API with a hyper-usable demo UI, autoscaling, CI/CD on every commit, observability, and a secured endpoint that holds under concurrent load.
A cost and latency comparison shows the student is materially cheaper to serve, with the accuracy-versus-cost tradeoff reported honestly.
The project ships complete marketing — a landing page, a 10-slide pitch, and a recorded demo.
The repo is reproducible, the student model is published with a model card, and the product is publicly reachable.

Bridges to Machine Learning — model compression and the teacher-student paradigm
