Course content
- AI Evals (01:00)
- 1. HW 1&2 walkthrough with Braintrust (pre-recorded), part 1 (10:50)
- 1. HW 1&2 walkthrough with Braintrust (pre-recorded), part 2 (05:13)
- 2. HW 1&2 walkthrough with Phoenix (pre-recorded) (15:04)
- 3. HW 1&2 walkthrough with LangSmith (pre-recorded) (22:41)
- 1. HW 3 walkthrough with Braintrust (pre-recorded) (21:41)
- 2. HW 3 walkthrough with Phoenix (pre-recorded) (16:39)
- 1. HW 4 walkthrough with Braintrust (pre-recorded) (23:11)
- 2. HW 4 walkthrough with Phoenix (pre-recorded) (16:38)
- 1. HW 5 walkthrough with Braintrust (pre-recorded) (22:03)
- 2. HW 5 walkthrough with Phoenix (pre-recorded) (14:57)
- 1. Lesson 1: Fundamentals & Lifecycle of LLM Application Evaluation (01:00)
- 2. Lesson 2: Systematic Error Analysis (01:00)
- 3. Braintrust Tutorial w/ Wayde Gilliam (43:02)
- 4. Optional: Office Hours (01:00)
- AIE Braintrust Intro (05:00)
- Lesson 1 (01:00)
- Lesson 2 (01:00)
- 5. Lesson 3: More Error Analysis & Collaborative Evaluation (01:00)
- 6. Lesson 4: Automated Evaluators (01:00)
- 7. Taming diffusion QR codes with evals and inference-time scaling w/ Charles Frye (44:43)
- 8. 10x Your RAG Evaluation by Avoiding These Pitfalls w/ Skylar Payne (28:25)
- 9. Optional: Office Hours (01:00)
- 10. Optional: Office Hours (01:00)
- Lesson 3 (01:00)
- Lesson 4 (01:00)
- 11. Lesson 5: More Automated Evaluators (01:00)
- 12. Lesson 6: RAG & Complex Architectures (01:00)
- 13. Scaling Inference-Time Compute for Better LLM Judges w/ Leonard Tang (31:08)
- 14. Building custom eval tools with coding agents w/ Isaac Flath (46:39)
- 15. From Vibe Checks to Evals to Feedback Loops: Case Studies in AI System Maturities w/ David Karam (30:02)
- 16. A Playbook For Building AI Agents You Can Trust w/ Udi Menkes (38:25)
- 17. AI Evals in Vertical Industries (such as healthcare, finance and law) w/ Dr Chris Lovejoy (34:15)
- 18. Arize Phoenix tutorial w/ Mikyo King (49:02)
- 19. Optional: Office Hours (01:00)
- 20. Optional: Office Hours (01:00)
- 21. Optional: Office Hours (01:00)
- Building Custom Eval Tools with coding agents (01:00)
- Lesson 5 (01:00)
- Lesson 6 (01:00)
- 22. Lesson 7: Efficient Continuous Human Review Systems (01:00)
- 23. Lesson 8: Cost Optimization (01:00)
- 24. Techniques for evaluating agents w/ Sally Ann De Lucia (Arize) (33:37)
- 25. LangSmith Tutorial w/ Harrison Chase (48:24)
- 26. From Noob to 5 Automated Evals in 4 Weeks (as a PM) w/ Teresa Torres (1:10:21)
- 27. SolveIt: The Thinking Developer's Environment w/ Jeremy Howard & Johno Whitaker (01:00)
- 28. Testing Real AI Products LIVE w/ Robert Ta (1:00:49)
- 29. Fireside Chat with DSPy Creator w/ Omar Khattab (44:59)
- 30. Optional: Office Hours (01:00)
- 31. Optional: Office Hours (Bonus) (01:00)
- Lesson 7 (01:00)
- Lesson 8 (01:00)
Requirements
- Basic understanding of AI, machine learning, or large language models (LLMs)
- Familiarity with software development or product management concepts
- A computer with internet access and ability to work with AI tools
- Interest in AI quality assurance, testing, and evaluation methodologies
Description
AI Evals For Engineers, PMs provides a comprehensive foundation in evaluating artificial intelligence systems, with particular emphasis on large language models and generative AI applications. This course addresses one of the most critical challenges facing AI practitioners today: how to systematically assess whether AI systems are performing as intended, meeting quality standards, and delivering value in production environments.
The learning journey begins with foundational concepts that establish why evaluation is essential in the AI development lifecycle. Students gain understanding of the unique challenges posed by non-deterministic AI systems, where traditional software testing approaches fall short. The course explores how evaluation frameworks serve as the backbone for responsible AI deployment, enabling teams to build confidence in their systems before releasing them to users.
As students progress, they learn to design evaluation frameworks from the ground up. This involves understanding different types of evaluations, from unit-level tests that assess specific model behaviors to system-level evaluations that examine end-to-end performance. The course teaches how to identify what matters most for a given application, translating business requirements and user needs into measurable evaluation criteria. Students work through the process of defining success metrics that align with product goals, whether those involve accuracy, relevance, safety, consistency, or other dimensions of performance.
The curriculum then moves into practical implementation of evaluation systems. Students learn to construct datasets specifically designed for testing AI behavior, including strategies for sampling representative examples, creating edge case collections, and building adversarial test sets that probe model weaknesses. The course covers both automated evaluation techniques, such as model-based scoring and rule-based checks, and approaches that incorporate human judgment for subjective quality dimensions.
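To make the two automated styles concrete, here is a minimal sketch, not taken from the course materials: a deterministic rule-based check alongside a stand-in for model-based scoring (a crude token-overlap score substitutes for an LLM-judge call), run over a tiny hand-built test set that mixes a typical case with an edge case. All function and field names are illustrative assumptions.

```python
def rule_based_check(output: str) -> bool:
    """Deterministic check: answer must be non-empty and cite a source."""
    return bool(output.strip()) and "[source]" in output

def model_based_score(output: str, reference: str) -> float:
    """Placeholder for an LLM-judge call; token overlap stands in here."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    return len(out_tokens & ref_tokens) / max(len(ref_tokens), 1)

# Tiny test set: one representative example, one edge case (empty answer).
dataset = [
    {"output": "Paris is the capital of France. [source]",
     "reference": "Paris is the capital of France."},
    {"output": "", "reference": "Berlin is the capital of Germany."},
]

for ex in dataset:
    print(rule_based_check(ex["output"]),
          round(model_based_score(ex["output"], ex["reference"]), 2))
```

In practice the rule-based check catches objective failures cheaply on every run, while the model-based scorer (or a human reviewer) handles the subjective quality dimensions the paragraph above mentions.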
A significant portion of the course focuses on interpreting evaluation results and turning insights into action. Students develop skills in analyzing patterns across evaluation metrics, identifying systematic failures, and diagnosing root causes of poor performance. The course teaches how to establish baselines, track improvements over time, and make informed decisions about when a model is ready for production or requires further development.
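The baseline-and-tracking workflow described above can be sketched in a few lines; the metric (pass rate) and the baseline value are illustrative assumptions, not prescribed by the course.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of per-example evaluations that passed."""
    return sum(results) / len(results) if results else 0.0

baseline = 0.80  # pass rate of the last accepted model version
candidate_results = [True, True, False, True, True]  # outcomes on a fixed eval set

candidate_rate = pass_rate(candidate_results)
regressed = candidate_rate < baseline
print(f"candidate pass rate {candidate_rate:.0%}, regression: {regressed}")
```

Keeping the evaluation set fixed between runs is what makes the comparison meaningful: a change in the pass rate then reflects the model, not the data.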
The course explores industry-standard benchmarking practices, examining how public benchmarks can inform development while understanding their limitations. Students learn to adapt existing benchmarks to their specific contexts and create custom benchmarks that reflect real-world use cases. This includes understanding the relationship between benchmark performance and actual user experience.
Integration of evaluation into development workflows receives dedicated attention. The course covers how to build continuous evaluation pipelines that run automatically as models and systems evolve, enabling rapid iteration while maintaining quality gates. Students learn to set up monitoring systems that track evaluation metrics in production, providing early warning signals when model performance degrades.
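A quality gate of the kind described above might look like the following sketch, suitable for wiring into a CI pipeline. The metric names, thresholds, and the stubbed suite are hypothetical; a real pipeline would run the model over a fixed test set instead.

```python
# Per-metric minimum scores a candidate must clear before release.
THRESHOLDS = {"accuracy": 0.85, "safety": 0.99}

def run_eval_suite() -> dict[str, float]:
    """Stub: in practice, evaluate the model on a fixed test set."""
    return {"accuracy": 0.91, "safety": 0.995}

def gate(scores: dict[str, float]) -> bool:
    """Pass only if every tracked metric clears its threshold."""
    return all(scores[m] >= t for m, t in THRESHOLDS.items())

scores = run_eval_suite()
print("eval gate:", "pass" if gate(scores) else "fail", scores)
```

In a CI job the gate result would set the exit status, so a regression on any metric blocks the deployment automatically.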
Cross-functional collaboration forms another key component of the curriculum. The course addresses how engineers and product managers can work together effectively using evaluation frameworks as a common language. Students learn to communicate evaluation results to diverse stakeholders, from technical teams who need diagnostic details to business leaders who care about impact metrics.
Advanced topics include evaluating for safety and alignment, assessing robustness across different user populations and use cases, and handling the challenges of evaluating creative or open-ended AI outputs where correct answers are not clearly defined. The course also covers cost-benefit considerations in evaluation design, helping students balance thoroughness with resource constraints.
Throughout the course, students engage with real-world scenarios that mirror the challenges faced by AI teams in production environments. The curriculum emphasizes practical decision-making, equipping students with frameworks they can immediately apply to their own projects. By the end of the course, students possess a systematic approach to AI evaluation that enables them to build more reliable, trustworthy, and effective AI systems.
Who this course is for:
AI Evals For Engineers, PMs is designed for software engineers building AI-powered applications, product managers overseeing AI products, technical leads responsible for AI system quality, ML engineers looking to strengthen evaluation skills, and anyone involved in deploying and maintaining production AI systems who needs to ensure reliability and performance.
Instructor
Parlance Labs
About Me
We are a specialized research and education organization focused on the practical challenges of building and deploying artificial intelligence systems in production environments. Our work centers on evaluation methodologies for large language models and generative AI, an area where we have developed deep expertise through direct engagement with the technical challenges facing engineering and product teams.
Our organization emerged from recognizing a critical gap in the AI ecosystem. While tremendous progress has been made in model capabilities, the methods for systematically evaluating these systems have not kept pace. We saw engineering teams struggling with questions that traditional software testing approaches could not answer, and product teams unable to confidently assess whether AI systems were ready for users. This motivated us to focus specifically on evaluation frameworks that bridge the gap between AI research and practical deployment.
Our approach is grounded in real-world application rather than purely academic research. We work closely with teams building AI products across industries, understanding firsthand the constraints and tradeoffs they face. This practical orientation shapes everything we teach, ensuring our educational content addresses actual challenges rather than theoretical ideals. We prioritize frameworks and techniques that work within resource constraints and integrate smoothly into existing development workflows.
We believe that robust evaluation is not just a technical necessity but a foundation for responsible AI development. Our work emphasizes transparency in understanding model behavior, systematic approaches to identifying failures, and clear communication of capabilities and limitations. We advocate for evaluation practices that help teams build AI systems that are not only performant but trustworthy and aligned with user needs.
Our educational philosophy centers on practical implementation. We translate complex evaluation concepts into actionable frameworks that engineers and product managers can apply immediately. Our curriculum design reflects the multidisciplinary nature of modern AI development, bridging technical depth with product thinking and emphasizing collaboration between different roles.
Through our courses and research, we aim to establish evaluation as a core competency for AI practitioners, elevating the standards for how AI systems are tested and validated before reaching production.