feat(utils): add Schema.org microdata extraction utility #3246

Yash-Hirgude · 2025-11-15T17:34:15Z

Introduces #276

This PR adds a reusable function extractSchemaOrgMicrodata to the @crawlee/utils package.

It enables extracting Schema.org microdata from:

a browser DOM (e.g., Puppeteer / Playwright crawlers), or
raw HTML (e.g., HTTP crawler using JSDOM or Cheerio)

The extractor uses only native DOM APIs, no jQuery dependency.
The extractor is fully serializable, allowing it to run both in a browser context (via page.evaluate in Puppeteer/Playwright) and in Node.js environments (JSDOM/Cheerio), with no external dependencies.
Comprehensive test cases are included to ensure correct extraction across different input types.

barjin

Hello and thank you for your contribution.

We'd prefer to follow the API of the existing functions, e.g. the parseOpenGraph util.

With the current (native DOM-based) API, usage from e.g. CheerioCrawler would be complicated - it's our most popular non-browser crawler, yet Cheerio is not compatible with the DOM API. Reading the global document object is also bad practice - for testing, for example, you have to set the global variable, whereas passing the context to the function as a parameter is much easier.
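
To illustrate the point about passing the context as a parameter, here is a minimal sketch. It is not the API of this PR, of #3233, or of parseOpenGraph: the names ItemNode, QueryRoot, extractItemProps, and stubNode are hypothetical, and the extraction is deliberately flat (it ignores itemscope nesting). It only shows that a function taking its root explicitly can be exercised with a hand-built stub, without setting a global document.

```typescript
// Hypothetical minimal shapes for illustration; not Crawlee's or Cheerio's real types.
interface ItemNode {
    getAttribute(name: string): string | null;
    textContent: string | null;
}

interface QueryRoot {
    querySelectorAll(selector: string): ArrayLike<ItemNode>;
}

// The root is an explicit parameter, so nothing here reads a global `document`.
function extractItemProps(root: QueryRoot): Record<string, string> {
    const result: Record<string, string> = {};
    const nodes = root.querySelectorAll('[itemprop]');
    for (let i = 0; i < nodes.length; i++) {
        const node = nodes[i];
        const name = node.getAttribute('itemprop');
        if (!name) continue;
        // Prefer an explicit `content` attribute, fall back to the element text.
        const value = node.getAttribute('content') ?? node.textContent ?? '';
        result[name] = value.trim();
    }
    return result;
}

// A test can pass a hand-built stub instead of mutating global state.
function stubNode(attrs: Record<string, string>, text: string): ItemNode {
    return {
        getAttribute: (name) => (name in attrs ? attrs[name] : null),
        textContent: text,
    };
}

const stubRoot: QueryRoot = {
    querySelectorAll: () => [
        stubNode({ itemprop: 'name' }, ' Widget '),
        stubNode({ itemprop: 'price', content: '9.99' }, '$9.99'),
    ],
};

const extracted = extractItemProps(stubRoot);
```

Anything that can answer a selector query could be wrapped to fit such a parameter, which is the kind of flexibility a global-reading function gives up.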

Please see #3233 - this PR proposes a better API, but is lacking tests. I'll close your PR to track the initiative in one PR only - feel free to expand on that one.

Thank you again!

Yash-Hirgude added 2 commits November 15, 2025 22:25

extract microdata function and tests added

b140bcc

removed unnecessary space

ec34f7a

barjin reviewed Dec 18, 2025

barjin closed this Dec 18, 2025