Design custom evaluation frameworks for AI features
Compare how different AI judges rate the same outputs
Rate LLM outputs yourself and compare your ratings with an AI judge
Watch an agent use tools to solve multi-step problems
Configure approval workflows and see the tradeoffs
Explore common agent failure modes and how to prevent them