How to view LLM usage and run evals on your Stagehand workflows.
You can view LLM token usage at any point with `stagehand.metrics`.
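As a minimal sketch (the `env` option and the exact metric field names are assumptions; log the object to see what your version reports), you could read the accumulated metrics after a few calls:

```typescript
import { Stagehand } from "@browserbasehq/stagehand";

async function main() {
  // env: "LOCAL" runs against a local browser; swap in your own config
  const stagehand = new Stagehand({ env: "LOCAL" });
  await stagehand.init();

  const page = stagehand.page;
  await page.goto("https://www.example.com");
  await page.act("click the 'More information' link");

  // Accumulated token usage across act/extract/observe calls so far
  console.log(stagehand.metrics);

  await stagehand.close();
}

main();
```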
To log inference calls to disk, set `logInferenceToFile: true` in the Stagehand constructor. This will dump all `act`, `extract`, and `observe` calls to a directory called `inference_summary`, which groups the logged calls by method.
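A hedged sketch of enabling this in the constructor (the other option shown is an assumption; keep whatever configuration you already use):

```typescript
import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({
  env: "LOCAL", // assumed; use your existing environment config
  logInferenceToFile: true, // writes act/extract/observe calls to an inference_summary directory
});
await stagehand.init();
```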
To run the deterministic end-to-end tests, run `npm install` to install the dependencies, then run `npm run e2e` from within the Stagehand repo. This will test the functionality of Playwright within Stagehand to make sure it’s working as expected. These tests are in `evals/deterministic` and run against both Browserbase browsers and local headless Chromium browsers.
To run the LLM-based evals, run `npm run evals` from within the Stagehand repo. This will test the functionality of the LLM primitives within Stagehand to make sure they’re working as expected.
Evals are grouped into four categories:

- `act`: evals that test the `act` method.
- `extract`: evals that test the `extract` method.
- `observe`: evals that test the `observe` method.
- `combination`: evals that test the `act`, `extract`, and `observe` methods together.

Evals are defined in `evals/tasks`. Each eval is grouped into eval categories based on `evals/evals.config.json`. You can specify models to run and other general task config in `evals/taskConfig.ts`.
To run a specific eval, you can run `npm run evals <eval>`, or run all evals in a category with `npm run evals category <category>`.
By default, `npm run evals` runs each eval five times per model. The “Exact Match” column shows the percentage of times the eval was correct, and the “Error Rate” column shows the percentage of times the eval errored out.
You can use the Braintrust UI to filter by model/eval and aggregate results across all evals.
To add a new eval, create it in `evals/tasks` and add it to the appropriate category in `evals/evals.config.json`.
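For orientation, here is a hedged sketch of what a new task file might look like; the file name, the `EvalFunction` import path, its argument shape, and the return fields are assumptions, so mirror an existing file in `evals/tasks` for the exact types:

```typescript
// evals/tasks/my_new_eval.ts — hypothetical task name
import { EvalFunction } from "../../types/evals"; // import path is an assumption

export const my_new_eval: EvalFunction = async ({ stagehand, logger }) => {
  await stagehand.page.goto("https://www.example.com");

  // Exercise the primitive under test and derive a pass/fail result
  const [suggestion] = await stagehand.page.observe("find the 'More information' link");
  const success = suggestion !== undefined;

  await stagehand.close();

  // _success marks the run as passed; the exact return shape is an assumption
  return { _success: success, logs: logger.getLogs() };
};
```

The task also needs an entry under its category in `evals/evals.config.json` so the eval runner picks it up.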