Extract structured data from a webpage
What is extract()?
extract grabs structured data from a webpage. You can define your schema with zod (TypeScript) or pydantic (Python). If you do not want to define a schema, you can also call extract with just a natural language prompt, or call extract with no parameters.
Using extract()
An extract call might look for a single object, or it might look for a list of objects.
You can also call extract with just a natural language prompt. In that case, your output follows a predefined schema rather than one you define.
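A minimal sketch of the prompt-only style, assuming the result is an object with a single string field named extraction (worth confirming against your installed Stagehand version). PromptOnlyPage is a hypothetical stand-in for the page object so the snippet type-checks without the library, and getRepositoryName and its instruction text are illustrative only.

```typescript
// Assumed result shape for a prompt-only extract call.
interface PromptOnlyResult {
  extraction: string;
}

// Hypothetical stand-in for the Stagehand page object (prompt-only call style only).
interface PromptOnlyPage {
  extract(instruction: string): Promise<PromptOnlyResult>;
}

// Ask a free-form question and return the model's text answer, trimmed.
async function getRepositoryName(page: PromptOnlyPage): Promise<string> {
  const result = await page.extract("extract the name of the repository");
  return result.extraction.trim();
}
```

Because the answer comes back as free text, any further structure (numbers, lists, URLs) has to be parsed by the caller, which is the main reason to prefer a schema when the shape of the data is known.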
Finally, you can call extract with no parameters. This returns a hierarchical tree representation of the root DOM, and the result is not passed through an LLM.
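The parameterless style can be sketched the same way. The field name page_text is an assumption about the result shape, not something stated here, and NoParamsPage and previewPageTree are hypothetical names used only for illustration.

```typescript
// Assumed result shape for a parameterless extract call; the field name is an assumption.
interface PageTextResult {
  page_text: string;
}

// Hypothetical stand-in for the Stagehand page object (no-parameter call style only).
interface NoParamsPage {
  extract(): Promise<PageTextResult>;
}

// Fetch the tree representation and return its first `maxLines` lines for a quick look.
async function previewPageTree(page: NoParamsPage, maxLines = 5): Promise<string[]> {
  const { page_text } = await page.extract();
  return page_text.split("\n").slice(0, maxLines);
}
```

Since no LLM is involved, this style can be a cheap way to inspect what is on the page before writing an instruction or schema.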
To extract a link or URL, define the field as z.string().url() in TypeScript. In Python, you’ll need to define it as HttpUrl. This also works for image links.
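Because a URL-typed field only accepts well-formed URLs, it can help to sanity-check extracted links the same way before using them. The sketch below uses the built-in URL constructor rather than zod, so its notion of a valid URL may differ slightly from zod's. isAbsoluteHttpUrl is a hypothetical helper, not part of Stagehand.

```typescript
// Hypothetical helper: true when `value` parses as an absolute http(s) URL.
// Uses the built-in WHATWG URL parser, not zod's own validation.
function isAbsoluteHttpUrl(value: string): boolean {
  try {
    const url = new URL(value);
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    // Relative paths and malformed strings make the URL constructor throw.
    return false;
  }
}
```

A relative path such as /images/logo.png fails this check, which is a useful signal that the page exposes relative links and the result needs resolving against the page URL.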
ExtractResult.

Empty or partial results
extract() returns empty or incomplete data.
Solutions:
Use page.observe() first to confirm the data is present on the page.
Use page.act("wait for the content to load") before extracting.

Schema validation errors
Use z.optional() (TypeScript) or Optional[type] (Python) if the data might not always be present.
Use z.string() instead of z.number() for prices that might include currency symbols.
Use .describe() (TypeScript) or Field(description="...") (Python) to help the model understand field requirements.

Inconsistent results
Use page.observe() to understand the page structure first.

Performance issues
Use the timeoutMs parameter for complex extractions.
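Following the tip above about using z.string() instead of z.number() for prices, a price extracted as a string can be converted to a number afterwards. This is a minimal sketch, assuming a dot is the decimal separator and commas are only used for thousands grouping. parsePrice is a hypothetical helper, not part of Stagehand.

```typescript
// Hypothetical helper: convert an extracted price string such as "$1,299.99" to a number.
// Assumes "." is the decimal separator and "," is only a thousands separator.
function parsePrice(raw: string): number | null {
  // Drop grouping commas, then take the first run of digits with an optional fraction.
  const match = raw.replace(/,/g, "").match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}
```

Returning null instead of throwing lets the caller decide whether a missing price should trigger a retry or simply be skipped.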