Skip to main content

What is extract()?

await stagehand.extract("extract the name of the repository");
extract() grabs structured data from a webpage. You can define your schema with Zod (TypeScript) or JSON. If you don’t want to define a schema, you can also call extract with just a natural language prompt, or call extract with no parameters.

Why use extract()?

Structured

Turn messy webpage data into clean objects that follow a schema.

Resilient

Build resilient extractions that don’t break when the website changes

Return value

When you use extract(), Stagehand will return a Promise<ExtractResult> with the following structure:
When extracting with a schema, the return type is inferred from your Zod schema:
const result = await stagehand.extract(
  "extract product details",
  z.object({
    name: z.string(),
    price: z.number(),
    inStock: z.boolean()
  })
);
Example result:
{
  name: "Wireless Mouse",
  price: 29.99,
  inStock: true
}

Advanced Configuration

You can pass additional options to configure the model, timeout, selector scope, and whether to include a screenshot:
const result = await stagehand.extract("extract the repository name", {
  model: "anthropic/claude-sonnet-4-5",
  timeout: 30000,
  selector: "//header" // Focus on specific area
});

Server-side Caching

serverCache only works when running with env: "BROWSERBASE". It has no effect in local environments.
When running on Browserbase, Stagehand automatically caches extract() results server-side. Repeated calls with the same inputs return instantly without consuming LLM tokens. Caching is enabled by default and can be controlled globally on the constructor or overridden per call:
// Disable server-side caching for the entire instance
const stagehand = new Stagehand({
  env: "BROWSERBASE",
  serverCache: false,
});

// Or disable it for a single call
const data = await stagehand.extract("extract the repository name", { serverCache: false });

// Check whether a result was served from cache
const result = await stagehand.extract("extract the page title");
console.log(result.cacheStatus); // "HIT" or "MISS"

Targeted Extract

Pass a selector to extract to target a specific element on the page.
This helps reduce the context passed to the LLM, optimizing token usage/speed and improving accuracy.
const tableData = await stagehand.extract(
  "Extract the values of the third row",
  z.object({
    values: z.array(z.string())
  }),
  {
    // xPath or CSS selector
    selector: "xpath=/html/body/div/table/" 
  }
);
You can also exclude specific nodes (including their descendant nodes) with ignoreSelectors.
const article = await stagehand.extract(
  "extract the article title and body",
  z.object({
    title: z.string(),
    body: z.string()
  }),
  {
    ignoreSelectors: [".ad", ".newsletter-modal", "nav.related-posts"]
  }
);
ignoreSelectors removes all matches for each selector, along with each matched node’s descendants. selector still scopes extraction to a single resolved subtree.

Visual Extract

Set screenshot: true when the extraction needs visual information from the current viewport in addition to the page accessibility tree.
const saleBadge = await stagehand.extract(
  "extract the text shown in the visible sale badge",
  z.object({
    text: z.string()
  }),
  {
    screenshot: true
  }
);
screenshot: true captures the current viewport, not the full page. It is only supported with AI SDK clients.

Best practices

Extract with Context

You can provide additional context to your schema to help the model extract the data more accurately.
const apartments = await stagehand.extract(
  "Extract ALL the apartment listings and their details, including address, price, and square feet.",
  z.array(
    z.object({
      address: z.string().describe("the address of the apartment"),
      price: z.string().describe("the price of the apartment"),
      square_feet: z.string().describe("the square footage of the apartment"),
    })
  )
);
To extract links or URLs, define the relevant field as z.string().url().
Here is how an extract call might look for extracting a link or URL. This also works for image links.
const contactLink = await stagehand.extract(
  "extract the link to the 'contact us' page",
  z.string().url() // note the usage of z.string().url() for URL validation
);

console.log("the link to the contact us page is: ", contactLink);
Inside Stagehand, extracting links works by asking the LLM to select an ID. Stagehand looks up that ID in a mapping of IDs -> URLs. When logging the LLM trace, you should expect to see IDs. The actual URLs will be included in the final ExtractResult.

Troubleshooting

Problem: extract() returns empty or incomplete dataSolutions:
  • Check your instruction clarity: Make sure your instruction is specific and describes exactly what data you want to extract
  • Verify the data exists: Use stagehand.observe() first to confirm the data is present on the page
  • Wait for dynamic content: If the page loads content dynamically, use stagehand.act("wait for the content to load") before extracting
Solution: Wait for content before extracting
// Wait for content before extracting
await stagehand.act("wait for the product listings to load");
const products = await stagehand.extract(
  "extract all product names and prices",
  z.array(z.object({
    name: z.string(),
    price: z.string()
  }))
);
Problem: Getting schema validation errors or type mismatchesSolutions:
  • Use optional fields: Make fields optional with z.optional() if the data might not always be present
  • Use flexible types: Consider using z.string() instead of z.number() for prices that might include currency symbols
  • Add descriptions: Use .describe() to help the model understand field requirements
Solution: More flexible schema
const schema = z.object({
  price: z.string().describe("price including currency symbol, e.g., '$19.99'"),
  availability: z.string().optional().describe("stock status if available"),
  rating: z.number().optional()
});
Problem: Extraction results vary between runsSolutions:
  • Be more specific in instructions: Instead of “extract prices”, use “extract the numerical price value for each item”
  • Use context in schema descriptions: Add field descriptions to guide the model
  • Combine with observe: Use stagehand.observe() to understand the page structure first
Solution: Validate with observe first
// First observe to understand the page structure
const elements = await stagehand.observe("find all product listings");
console.log("Found elements:", elements.map(e => e.description));

// Then extract with specific targeting
const products = await stagehand.extract(
  "extract name and price from each product listing shown on the page",
  z.array(z.object({
    name: z.string().describe("the product title or name"),
    price: z.string().describe("the price as displayed, including currency")
  }))
);
Problem: Extraction is slow or timing outSolutions:
  • Reduce scope: Extract smaller chunks of data in multiple calls rather than everything at once
  • Use targeted instructions: Be specific about which part of the page to focus on
  • Consider pagination: For large datasets, extract one page at a time
  • Increase timeout: Use timeoutMs parameter for complex extractions
Solution: Break down large extractions
// Instead of extracting everything at once
const allData = [];
const pageNumbers = [1, 2, 3, 4, 5];

for (const pageNum of pageNumbers) {
  await stagehand.act(`navigate to page ${pageNum}`);

  const pageData = await stagehand.extract(
    "extract product data from the current page only",
    z.array(z.object({
      name: z.string(),
      price: z.number()
    })),
    { timeout: 60000 } // 60 second timeout
  );

  allData.push(...pageData);
}

Next steps

Act

Execute actions efficiently

Observe

Analyze pages and preview actions