Evaluations help you understand how well your automation performs, which models work best for your use cases, and how to optimize for cost and reliability. This guide covers both monitoring your own workflows and running comprehensive evaluations.

Why Evaluations Matter

  • Performance Optimization: Identify which models and settings work best for your specific automation tasks
  • Cost Control: Track token usage and inference time to optimize spending
  • Reliability: Measure success rates and identify failure patterns
  • Model Selection: Compare different LLMs on real-world tasks to make informed decisions

Comprehensive Evaluations

Evaluations help you systematically test and improve your automation workflows. Stagehand provides both built-in evaluations and tools to create your own. There are two types of evals:
  1. Deterministic Evals - These include unit tests, integration tests, and E2E tests that can be run without any LLM inference.
  2. LLM-based Evals - These are evals that test the underlying functionality of Stagehand’s AI primitives.
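Concretely, each type is run with a different command; both are covered in detail later in this guide:

# Deterministic evals (no LLM inference)
npm run e2e

# LLM-based evals via the evals CLI
evals run all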

Evals CLI

To run evals, you’ll need to clone the Stagehand repo and set up the CLI. We recommend using Braintrust to visualize eval results and metrics.
The Stagehand CLI provides a powerful interface for running evaluations. You can run specific evals, categories, or external benchmarks with customizable settings. Evals are grouped into:
  1. Act Evals - These are evals that test the functionality of the act method.
  2. Extract Evals - These are evals that test the functionality of the extract method.
  3. Observe Evals - These are evals that test the functionality of the observe method.
  4. Combination Evals - These are evals that test the functionality of the act, extract, and observe methods together.
  5. Experimental Evals - These are experimental custom evals that test the functionality of Stagehand’s primitives.
  6. Agent Evals - These are evals that test the functionality of the agent method.
  7. (NEW) External Benchmarks - Run external benchmarks like WebBench, GAIA, WebVoyager, OnlineMind2Web, and OSWorld.
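For orientation, here is a minimal sketch of the core primitives the first three categories exercise, mirroring the style of the custom eval example later in this guide. The URL and instructions are illustrative, and exact method signatures vary by Stagehand version:

import { Stagehand } from '@browserbasehq/stagehand';

const stagehand = new Stagehand({ env: 'LOCAL' });
await stagehand.init();

// act: perform a natural-language action on the page
await stagehand.page.goto('https://example.com');
await stagehand.act({ action: 'click the login button' });

// extract: pull structured data out of the page
const profile = await stagehand.extract({
  instruction: 'Get the user name',
  schema: { username: 'string' },
});

// observe: ask which actions are available on the current page
const suggestions = await stagehand.observe({ instruction: 'find the navigation links' });

await stagehand.close();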

Installation

1. Install Dependencies

# From the stagehand root directory
pnpm install

2. Build the CLI

pnpm run build:cli

3. Verify Installation

evals help

CLI Commands and Options

Basic Commands
# Run all evals
evals run all

# Run specific category
evals run act
evals run extract
evals run observe
evals run agent

# Run specific eval
evals run extract/extract_text

# List available evals
evals list
evals list --detailed

# Configure defaults
evals config
evals config set env browserbase
evals config set trials 5
Command Options
  • -e, --env: Environment (local or browserbase)
  • -t, --trials: Number of trials per eval (default: 3)
  • -c, --concurrency: Max parallel sessions (default: 10)
  • -m, --model: Model override
  • -p, --provider: Provider override
  • --api: Use Stagehand API instead of SDK
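These flags can be combined. For example, the following runs the extract category on Browserbase with five trials, four parallel sessions, and a model override (the model name is illustrative):

evals run extract -e browserbase -t 5 -c 4 -m gpt-4o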
Running External Benchmarks
The CLI supports several industry-standard benchmarks:
# WebBench with filters
evals run benchmark:webbench -l 10 -f difficulty=easy -f category=READ

# GAIA benchmark
evals run b:gaia -s 100 -l 25 -f level=1

# WebVoyager
evals run b:webvoyager -l 50

# OnlineMind2Web
evals run b:onlineMind2Web

# OSWorld
evals run b:osworld -f source=Mind2Web

Configuration Files

You can view the specific evals in evals/tasks. Each eval is grouped into categories based on evals/evals.config.json.

Viewing Eval Results

Eval results are viewable on Braintrust. You can view the results of a specific eval by going to the Braintrust URL printed in the terminal when the evals run. By default, each eval runs multiple times per model (see the trials setting above). The “Exact Match” column shows the percentage of runs in which the eval was correct, and the “Error Rate” column shows the percentage of runs that errored out. You can use the Braintrust UI to filter by model or eval and to aggregate results across all evals.

Deterministic Evals

To run deterministic evals, you can run npm run e2e from within the Stagehand repo. This will test the functionality of Playwright within Stagehand to make sure it’s working as expected. These tests are in evals/deterministic and test on both Browserbase browsers and local headless Chromium browsers.
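For context, a deterministic test in this style looks roughly like the sketch below. It is a minimal example assuming Playwright's test runner; the file name and assertions are illustrative rather than copied from the repo:

// evals/deterministic/example.spec.ts (illustrative path)
import { test, expect } from '@playwright/test';

test('page loads and the title is present', async ({ page }) => {
  // No LLM inference involved: plain Playwright driving a headless browser
  await page.goto('https://example.com');
  await expect(page).toHaveTitle(/Example Domain/);
});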

Creating Custom Evaluations

Step-by-Step Guide

1. Create Evaluation File

Create a new file in evals/tasks/your-eval.ts:
import { EvalTask } from '../types';

export const customEvalTask: EvalTask = {
  name: 'custom_task_name',
  description: 'Test specific automation workflow',
  
  // Test setup
  setup: async ({ page }) => {
    await page.goto('https://example.com');
  },
  
  // The actual test
  task: async ({ stagehand, page }) => {
    // Your automation logic
    await stagehand.act({ action: 'click the login button' });
    const result = await stagehand.extract({ 
      instruction: 'Get the user name',
      schema: { username: 'string' }
    });
    return result;
  },
  
  // Validation
  validate: (result, expected) => {
    return result.username === expected.username;
  },
  
  // Test cases
  testCases: [
    {
      input: { /* test input */ },
      expected: { username: 'john_doe' }
    }
  ],
  
  // Evaluation criteria
  scoring: {
    exactMatch: true,
    timeout: 30000,
    retries: 2
  }
};

2. Add to Configuration

Update evals/evals.config.json:
{
  "categories": {
    "custom": ["custom_task_name"],
    "existing_category": ["custom_task_name"]
  }
}

3. Run Your Evaluation

# Test your custom evaluation
evals run custom_task_name

# Run the entire custom category
evals run custom

# Run with specific settings
evals run custom_task_name -e browserbase -t 5 -m gpt-4o

Best Practices for Custom Evals

  • Atomic: Each test should validate one specific capability
  • Deterministic: Tests should produce consistent results
  • Realistic: Use real-world scenarios and websites
  • Measurable: Define clear success/failure criteria
  • Parallel Execution: Design tests to run independently
  • Resource Management: Clean up after each test (see the sketch after this list)
  • Timeout Handling: Set appropriate timeouts for operations
  • Error Recovery: Handle failures gracefully
  • Ground Truth: Establish reliable expected outcomes
  • Edge Cases: Test boundary conditions and error scenarios
  • Statistical Significance: Run multiple iterations for reliability
  • Version Control: Track changes to test cases over time
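To illustrate two of these practices, timeout handling and resource cleanup, here is a plain TypeScript wrapper; the helper is hypothetical and not part of the Stagehand evals harness:

// Hypothetical helper: run an eval task with a hard timeout and guaranteed cleanup
async function runWithTimeout<T>(
  task: () => Promise<T>,
  cleanup: () => Promise<void>,
  timeoutMs = 30_000,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    // Whichever settles first wins: the task result or the timeout rejection
    return await Promise.race([task(), timeout]);
  } finally {
    if (timer) clearTimeout(timer);
    // Resource management: always release browser sessions, temp files, etc.
    await cleanup();
  }
}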

Troubleshooting Evaluations

Symptoms: Tests fail with timeout errors
Solutions:
  • Increase timeout in taskConfig.ts
  • Use faster models (Gemini 2.5 Flash, GPT-4o Mini)
  • Optimize test scenarios to be less complex
  • Check network connectivity to LLM providers

Symptoms: Same test passes/fails randomly
Solutions:
  • Set temperature to 0 for deterministic outputs
  • Increase repetitions for statistical significance
  • Use more capable models for complex tasks
  • Check for dynamic website content affecting tests

Symptoms: Token usage exceeding budget
Solutions:
  • Use cost-effective models (Gemini 2.0 Flash, GPT-4o Mini)
  • Reduce repetitions for initial testing
  • Focus on specific evaluation categories
  • Use local browser environment to reduce Browserbase costs
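For instance, an initial low-cost pass might run a single category locally with one trial (the model identifier is illustrative):

# Cheap first pass: local browser, one trial, one category, cost-effective model
evals run extract -e local -t 1 -m gemini-2.0-flash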

Symptoms: Results not uploading to dashboard
Solutions:
  • Check Braintrust API key configuration
  • Verify internet connectivity
  • Update Braintrust SDK to latest version
  • Check project permissions in Braintrust dashboard