Skip to main content

What is extract()?

await stagehand.extract("extract the name of the repository");
extract grabs structured data from a webpage. You can define your schema with zod (TypeScript) or JSON. If you do not want to define a schema, you can also call extract with just a natural language prompt, or call extract with no parameters.

Why use extract()?

Using extract()

You can use extract() to extract structured data from a webpage. You can define your schema with zod (TypeScript) or JSON. If you do not want to define a schema, you can also call extract with just a natural language prompt, or call extract with no parameters.
const result = await stagehand.extract("extract the product details");

Return value of extract()?

When you use extract(), Stagehand will return a Promise<ExtractResult> with the following structure:
  • Basic Schema
  • Array
  • Primitive
  • Instruction Only
  • No Parameters
When extracting with a schema, the return type is inferred from your Zod schema:
const result = await stagehand.extract(
  "extract product details",
  z.object({
    name: z.string(),
    price: z.number(),
    inStock: z.boolean()
  })
);
Example result:
{
  name: "Wireless Mouse",
  price: 29.99,
  inStock: true
}

Advanced Configuration

You can pass additional options to configure the model, timeout, and selector scope:
const result = await stagehand.extract("extract the repository name", {
  model: "anthropic/claude-sonnet-4-5",
  timeout: 30000,
  selector: "//header" // Focus on specific area
});

Targeted Extract

Pass a selector to extract to target a specific element on the page.
This helps reduce the context passed to the LLM, optimizing token usage/speed and improving accuracy.
const tableData = await stagehand.extract(
  "Extract the values of the third row",
  z.object({
    values: z.array(z.string())
  }),
  {
    // xPath or CSS selector
    selector: "xpath=/html/body/div/table/" 
  }
);

Best practices

Extract with Context

You can provide additional context to your schema to help the model extract the data more accurately.
const apartments = await stagehand.extract(
  "Extract ALL the apartment listings and their details, including address, price, and square feet.",
  z.array(
    z.object({
      address: z.string().describe("the address of the apartment"),
      price: z.string().describe("the price of the apartment"),
      square_feet: z.string().describe("the square footage of the apartment"),
    })
  )
);
To extract links or URLs, define the relevant field as z.string().url().
Here is how an extract call might look for extracting a link or URL. This also works for image links.
const contactLink = await stagehand.extract(
  "extract the link to the 'contact us' page",
  z.string().url() // note the usage of z.string().url() for URL validation
);

console.log("the link to the contact us page is: ", contactLink);
Inside Stagehand, extracting links works by asking the LLM to select an ID. Stagehand looks up that ID in a mapping of IDs -> URLs. When logging the LLM trace, you should expect to see IDs. The actual URLs will be included in the final ExtractResult.

Troubleshooting

Problem: extract() returns empty or incomplete dataSolutions:
  • Check your instruction clarity: Make sure your instruction is specific and describes exactly what data you want to extract
  • Verify the data exists: Use stagehand.observe() first to confirm the data is present on the page
  • Wait for dynamic content: If the page loads content dynamically, use stagehand.act("wait for the content to load") before extracting
Solution: Wait for content before extracting
// Wait for content before extracting
await stagehand.act("wait for the product listings to load");
const products = await stagehand.extract(
  "extract all product names and prices",
  z.array(z.object({
    name: z.string(),
    price: z.string()
  }))
);
Problem: Getting schema validation errors or type mismatchesSolutions:
  • Use optional fields: Make fields optional with z.optional() if the data might not always be present
  • Use flexible types: Consider using z.string() instead of z.number() for prices that might include currency symbols
  • Add descriptions: Use .describe() to help the model understand field requirements
Solution: More flexible schema
const schema = z.object({
  price: z.string().describe("price including currency symbol, e.g., '$19.99'"),
  availability: z.string().optional().describe("stock status if available"),
  rating: z.number().optional()
});
Problem: Extraction results vary between runsSolutions:
  • Be more specific in instructions: Instead of “extract prices”, use “extract the numerical price value for each item”
  • Use context in schema descriptions: Add field descriptions to guide the model
  • Combine with observe: Use stagehand.observe() to understand the page structure first
Solution: Validate with observe first
// First observe to understand the page structure
const elements = await stagehand.observe("find all product listings");
console.log("Found elements:", elements.map(e => e.description));

// Then extract with specific targeting
const products = await stagehand.extract(
  "extract name and price from each product listing shown on the page",
  z.array(z.object({
    name: z.string().describe("the product title or name"),
    price: z.string().describe("the price as displayed, including currency")
  }))
);
Problem: Extraction is slow or timing outSolutions:
  • Reduce scope: Extract smaller chunks of data in multiple calls rather than everything at once
  • Use targeted instructions: Be specific about which part of the page to focus on
  • Consider pagination: For large datasets, extract one page at a time
  • Increase timeout: Use timeoutMs parameter for complex extractions
Solution: Break down large extractions
// Instead of extracting everything at once
const allData = [];
const pageNumbers = [1, 2, 3, 4, 5];

for (const pageNum of pageNumbers) {
  await stagehand.act(`navigate to page ${pageNum}`);

  const pageData = await stagehand.extract(
    "extract product data from the current page only",
    z.array(z.object({
      name: z.string(),
      price: z.number()
    })),
    { timeout: 60000 } // 60 second timeout
  );

  allData.push(...pageData);
}

Next steps