Course content
- AI Evals (01:00)
- 1. HW 1&2 walkthrough with Braintrust (pre-recorded), part 1 (10:50)
- 1. HW 1&2 walkthrough with Braintrust (pre-recorded), part 2 (05:13)
- 2. HW 1&2 walkthrough with Phoenix (pre-recorded) (15:04)
- 3. HW 1&2 walkthrough with LangSmith (pre-recorded) (22:41)
- 1. HW 3 walkthrough with Braintrust (pre-recorded) (21:41)
- 2. HW 3 walkthrough with Phoenix (pre-recorded) (16:39)
- 1. HW 4 walkthrough with Braintrust (pre-recorded) (23:11)
- 2. HW 4 walkthrough with Phoenix (pre-recorded) (16:38)
- 1. HW 5 walkthrough with Braintrust (pre-recorded) (22:03)
- 2. HW 5 walkthrough with Phoenix (pre-recorded) (14:57)
- 1. Lesson 1: Fundamentals & Lifecycle of LLM Application Evaluation (01:00)
- 2. Lesson 2: Systematic Error Analysis (01:00)
- 3. Braintrust Tutorial w/ Wayde Gilliam (43:02)
- 4. Optional: Office Hours (01:00)
- AIE Braintrust Intro (05:00)
- Lesson 1 (01:00)
- Lesson 2 (01:00)
- 5. Lesson 3: More Error Analysis & Collaborative Evaluation (01:00)
- 6. Lesson 4: Automated Evaluators (01:00)
- 7. Taming diffusion QR codes with evals and inference-time scaling w/ Charles Frye (44:43)
- 8. 10x Your RAG Evaluation by Avoiding These Pitfalls w/ Skylar Payne (28:25)
- 9. Optional: Office Hours (01:00)
- 10. Optional: Office Hours (01:00)
- Lesson 3 (01:00)
- Lesson 4 (01:00)
- 11. Lesson 5: More Automated Evaluators (01:00)
- 12. Lesson 6: RAG & Complex Architectures (01:00)
- 13. Scaling Inference-Time Compute for Better LLM Judges w/ Leonard Tang (31:08)
- 14. Building custom eval tools with coding agents w/ Isaac Flath (46:39)
- 15. From Vibe Checks to Evals to Feedback Loops: Case Studies in AI System Maturities w/ David Karam (30:02)
- 16. A Playbook For Building AI Agents You Can Trust w/ Udi Menkes (38:25)
- 17. AI Evals in Vertical Industries (such as healthcare, finance and law) w/ Dr Chris Lovejoy (34:15)
- 18. Arize Phoenix tutorial w/ Mikyo King (49:02)
- 19. Optional: Office Hours (01:00)
- 20. Optional: Office Hours (01:00)
- 21. Optional: Office Hours (01:00)
- Building Custom Eval Tools with coding agents (01:00)
- Lesson 5 (01:00)
- Lesson 6 (01:00)
- 22. Lesson 7: Efficient Continuous Human Review Systems (01:00)
- 23. Lesson 8: Cost Optimization (01:00)
- 24. Techniques for evaluating agents w/ Sally Ann De Lucia (Arize) (33:37)
- 25. LangSmith Tutorial w/ Harrison Chase (48:24)
- 26. From Noob to 5 Automated Evals in 4 Weeks (as a PM) w/ Teresa Torres (1:10:21)
- 27. SolveIt: The Thinking Developer's Environment w/ Jeremy Howard & Johno Whitaker (01:00)
- 28. Testing Real AI Products LIVE w/ Robert Ta (1:00:49)
- 29. Fireside Chat with DSPy Creator w/ Omar Khattab (44:59)
- 30. Optional: Office Hours (01:00)
- 31. Optional: Office Hours (Bonus) (01:00)
- Lesson 7 (01:00)
- Lesson 8 (01:00)
Requirements
- Basic understanding of AI, machine learning, or large language models (LLMs)
- Familiarity with software development or product management concepts
- A computer with internet access and ability to work with AI tools
- Interest in AI quality assurance, testing, and evaluation methodologies
Description
AI Evals For Engineers, PMs provides a comprehensive foundation in evaluating artificial intelligence systems, with particular emphasis on large language models and generative AI applications. This course addresses one of the most critical challenges facing AI practitioners today: how to systematically assess whether AI systems are performing as intended, meeting quality standards, and delivering value in production environments.
The learning journey begins with foundational concepts that establish why evaluation is essential in the AI development lifecycle. Students gain understanding of the unique challenges posed by non-deterministic AI systems, where traditional software testing approaches fall short. The course explores how evaluation frameworks serve as the backbone for responsible AI deployment, enabling teams to build confidence in their systems before releasing them to users.
As students progress, they learn to design evaluation frameworks from the ground up. This involves understanding different types of evaluations, from unit-level tests that assess specific model behaviors to system-level evaluations that examine end-to-end performance. The course teaches how to identify what matters most for a given application, translating business requirements and user needs into measurable evaluation criteria. Students work through the process of defining success metrics that align with product goals, whether those involve accuracy, relevance, safety, consistency, or other dimensions of performance.
The curriculum then moves into practical implementation of evaluation systems. Students learn to construct datasets specifically designed for testing AI behavior, including strategies for sampling representative examples, creating edge case collections, and building adversarial test sets that probe model weaknesses. The course covers both automated evaluation techniques, such as model-based scoring and rule-based checks, and approaches that incorporate human judgment for subjective quality dimensions.
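To make the two automated styles concrete, here is a minimal sketch, not taken from the course materials: a deterministic rule-based check alongside a stand-in for model-based scoring (a crude token-overlap score substitutes for an LLM-judge call), run over a tiny hand-built test set that mixes a typical case with an edge case. All function and field names are illustrative assumptions.

```python
def rule_based_check(output: str) -> bool:
    """Deterministic check: answer must be non-empty and cite a source."""
    return bool(output.strip()) and "[source]" in output

def model_based_score(output: str, reference: str) -> float:
    """Placeholder for an LLM-judge call; token overlap stands in here."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    return len(out_tokens & ref_tokens) / max(len(ref_tokens), 1)

# Tiny test set: one representative example, one edge case (empty answer).
dataset = [
    {"output": "Paris is the capital of France. [source]",
     "reference": "Paris is the capital of France."},
    {"output": "", "reference": "Berlin is the capital of Germany."},
]

for ex in dataset:
    print(rule_based_check(ex["output"]),
          round(model_based_score(ex["output"], ex["reference"]), 2))
```

In practice the rule-based check catches objective failures cheaply on every run, while the model-based scorer (or a human reviewer) handles the subjective quality dimensions the paragraph above mentions.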
A significant portion of the course focuses on interpreting evaluation results and turning insights into action. Students develop skills in analyzing patterns across evaluation metrics, identifying systematic failures, and diagnosing root causes of poor performance. The course teaches how to establish baselines, track improvements over time, and make informed decisions about when a model is ready for production or requires further development.
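The baseline-and-tracking workflow described above can be sketched in a few lines; the metric (pass rate) and the baseline value are illustrative assumptions, not prescribed by the course.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of per-example evaluations that passed."""
    return sum(results) / len(results) if results else 0.0

baseline = 0.80  # pass rate of the last accepted model version
candidate_results = [True, True, False, True, True]  # outcomes on a fixed eval set

candidate_rate = pass_rate(candidate_results)
regressed = candidate_rate < baseline
print(f"candidate pass rate {candidate_rate:.0%}, regression: {regressed}")
```

Keeping the evaluation set fixed between runs is what makes the comparison meaningful: a change in the pass rate then reflects the model, not the data.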
The course explores industry-standard benchmarking practices, examining how public benchmarks can inform development while understanding their limitations. Students learn to adapt existing benchmarks to their specific contexts and create custom benchmarks that reflect real-world use cases. This includes understanding the relationship between benchmark performance and actual user experience.
Integration of evaluation into development workflows receives dedicated attention. The course covers how to build continuous evaluation pipelines that run automatically as models and systems evolve, enabling rapid iteration while maintaining quality gates. Students learn to set up monitoring systems that track evaluation metrics in production, providing early warning signals when model performance degrades.
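A quality gate of the kind described above might look like the following sketch, suitable for wiring into a CI pipeline. The metric names, thresholds, and the stubbed suite are hypothetical; a real pipeline would run the model over a fixed test set instead.

```python
# Per-metric minimum scores a candidate must clear before release.
THRESHOLDS = {"accuracy": 0.85, "safety": 0.99}

def run_eval_suite() -> dict[str, float]:
    """Stub: in practice, evaluate the model on a fixed test set."""
    return {"accuracy": 0.91, "safety": 0.995}

def gate(scores: dict[str, float]) -> bool:
    """Pass only if every tracked metric clears its threshold."""
    return all(scores[m] >= t for m, t in THRESHOLDS.items())

scores = run_eval_suite()
print("eval gate:", "pass" if gate(scores) else "fail", scores)
```

In a CI job the gate result would set the exit status, so a regression on any metric blocks the deployment automatically.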
Cross-functional collaboration forms another key component of the curriculum. The course addresses how engineers and product managers can work together effectively using evaluation frameworks as a common language. Students learn to communicate evaluation results to diverse stakeholders, from technical teams who need diagnostic details to business leaders who care about impact metrics.
Advanced topics include evaluating for safety and alignment, assessing robustness across different user populations and use cases, and handling the challenges of evaluating creative or open-ended AI outputs where correct answers are not clearly defined. The course also covers cost-benefit considerations in evaluation design, helping students balance thoroughness with resource constraints.
Throughout the course, students engage with real-world scenarios that mirror the challenges faced by AI teams in production environments. The curriculum emphasizes practical decision-making, equipping students with frameworks they can immediately apply to their own projects. By the end of the course, students possess a systematic approach to AI evaluation that enables them to build more reliable, trustworthy, and effective AI systems.
Who this course is for:
AI Evals For Engineers, PMs is designed for software engineers building AI-powered applications, product managers overseeing AI products, technical leads responsible for AI system quality, ML engineers looking to strengthen evaluation skills, and anyone involved in deploying and maintaining production AI systems who needs to ensure reliability and performance.
Instructor
Parlance Labs
About Me
We are a specialized research and education organization focused on the practical challenges of building and deploying artificial intelligence systems in production environments. Our work centers on evaluation methodologies for large language models and generative AI, an area where we have developed deep expertise through direct engagement with the technical challenges facing engineering and product teams.
Our organization emerged from recognizing a critical gap in the AI ecosystem. While tremendous progress has been made in model capabilities, the methods for systematically evaluating these systems have not kept pace. We saw engineering teams struggling with questions that traditional software testing approaches could not answer, and product teams unable to confidently assess whether AI systems were ready for users. This motivated us to focus specifically on evaluation frameworks that bridge the gap between AI research and practical deployment.
Our approach is grounded in real-world application rather than purely academic research. We work closely with teams building AI products across industries, understanding firsthand the constraints and tradeoffs they face. This practical orientation shapes everything we teach, ensuring our educational content addresses actual challenges rather than theoretical ideals. We prioritize frameworks and techniques that work within resource constraints and integrate smoothly into existing development workflows.
We believe that robust evaluation is not just a technical necessity but a foundation for responsible AI development. Our work emphasizes transparency in understanding model behavior, systematic approaches to identifying failures, and clear communication of capabilities and limitations. We advocate for evaluation practices that help teams build AI systems that are not only performant but trustworthy and aligned with user needs.
Our educational philosophy centers on practical implementation. We translate complex evaluation concepts into actionable frameworks that engineers and product managers can apply immediately. Our curriculum design reflects the multidisciplinary nature of modern AI development, bridging technical depth with product thinking and emphasizing collaboration between different roles.
Through our courses and research, we aim to establish evaluation as a core competency for AI practitioners, elevating the standards for how AI systems are tested and validated before reaching production.