Evaluations help you understand how well your automation performs, which models work best for your use cases, and how to optimize for cost and reliability. This guide covers both monitoring your own workflows and running comprehensive evaluations.

Why Evaluations Matter

  • Performance Optimization: Identify which models and settings work best for your specific automation tasks
  • Cost Control: Track token usage and inference time to optimize spending
  • Reliability: Measure success rates and identify failure patterns
  • Model Selection: Compare different LLMs on real-world tasks to make informed decisions

Comprehensive Evaluations

Evaluations help you systematically test and improve your automation workflows. Stagehand provides both built-in evaluations and tools to create your own.
To run evals, you’ll need to clone the Stagehand repo and run npm install to install the dependencies.
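For example, a minimal setup from a fresh checkout (assuming the public browserbase/stagehand repository on GitHub):

# Clone the Stagehand repo and install dependencies
git clone https://github.com/browserbase/stagehand.git
cd stagehand
npm install
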
We have two types of evals:
  1. Deterministic Evals - Evals with fixed expected outcomes that run without any LLM inference.
  2. LLM-based Evals - Evals that test the underlying functionality of Stagehand’s AI primitives.

LLM-based Evals

To run LLM-based evals, you’ll need a Braintrust account.
Run npm run evals from within the Stagehand repo to test the functionality of the LLM primitives within Stagehand and make sure they’re working as expected. Evals are grouped into four categories (see the example after this list):
  1. Act Evals - Test the functionality of the act method.
  2. Extract Evals - Test the functionality of the extract method.
  3. Observe Evals - Test the functionality of the observe method.
  4. Combination Evals - Test the act, extract, and observe methods working together.
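The sketch below shows one way to run the suite; the exact environment variables depend on the models configured in evals/taskConfig.ts, and the key names here (Braintrust’s standard API key plus an OpenAI key) are assumptions for illustration:

# Export credentials before running (key names are illustrative)
export BRAINTRUST_API_KEY="your-braintrust-key"
export OPENAI_API_KEY="your-openai-key"

# Run the full LLM-based eval suite
npm run evals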

Configuring and Running Evals

You can view the specific evals in evals/tasks. Each eval is grouped into eval categories based on evals/evals.config.json. You can specify models to run and other general task config in evals/taskConfig.ts. To run a specific eval, you can run npm run evals <eval>, or run all evals in a category with npm run evals category <category>.
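For instance (the eval name below is a placeholder; substitute any task from evals/tasks, and any category from evals/evals.config.json):

# Run a single eval by name
npm run evals <eval-name>

# Run every eval in a category, e.g. the extract evals
npm run evals category extract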

Viewing Eval Results

Eval results are viewable on Braintrust. You can view the results of a specific eval by going to the Braintrust URL printed in the terminal when you run npm run evals. By default, each eval runs five times per model. The “Exact Match” column shows the percentage of runs in which the eval was correct, and the “Error Rate” column shows the percentage of runs that errored out. You can use the Braintrust UI to filter by model/eval and aggregate results across all evals.

Deterministic Evals

To run deterministic evals, run npm run e2e from within the Stagehand repo. This tests the Playwright functionality within Stagehand to make sure it’s working as expected. These tests live in evals/deterministic and run against both Browserbase browsers and local headless Chromium browsers.
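A minimal run might look like the following; the Browserbase credential variables are assumptions based on Browserbase’s standard configuration and are only needed for the Browserbase-backed tests:

# Browserbase credentials (only needed for the Browserbase browser tests)
export BROWSERBASE_API_KEY="your-browserbase-key"
export BROWSERBASE_PROJECT_ID="your-project-id"

# Run the deterministic (Playwright) test suite
npm run e2e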

Creating Custom Evaluations

Step-by-Step Guide

Step 1: Create Evaluation File

Create a new file in evals/tasks/your-eval.ts:
import { z } from 'zod';
import { EvalTask } from '../types';

export const customEvalTask: EvalTask = {
  name: 'custom_task_name',
  description: 'Test specific automation workflow',
  
  // Test setup
  setup: async ({ page }) => {
    await page.goto('https://example.com');
  },
  
  // The actual test
  task: async ({ stagehand, page }) => {
    // Your automation logic
    await stagehand.act({ action: 'click the login button' });
    const result = await stagehand.extract({
      instruction: 'Get the user name',
      // extract expects a zod schema describing the data to return
      schema: z.object({ username: z.string() })
    });
    return result;
  },
  
  // Validation
  validate: (result, expected) => {
    return result.username === expected.username;
  },
  
  // Test cases
  testCases: [
    {
      input: { /* test input */ },
      expected: { username: 'john_doe' }
    }
  ],
  
  // Evaluation criteria
  scoring: {
    exactMatch: true,
    timeout: 30000,
    retries: 2
  }
};

Step 2: Add to Configuration

Update evals/evals.config.json:
{
  "categories": {
    "custom": ["custom_task_name"],
    "existing_category": ["custom_task_name"]
  }
}

Step 3: Run Your Evaluation

# Test your custom evaluation
npm run evals custom_task_name

# Run the entire custom category
npm run evals category custom

Best Practices for Custom Evals

Troubleshooting Evaluations