@Yash-Hirgude
Contributor

Introduces #276

This PR adds a reusable function extractSchemaOrgMicrodata to the @crawlee/utils package.

It enables extracting Schema.org microdata from:

  • a browser DOM (e.g., Puppeteer / Playwright crawlers), or
  • raw HTML (e.g., HTTP crawler using JSDOM or Cheerio)

The extractor uses only native DOM APIs and has no jQuery dependency.
It is fully serializable, so it can run both in a browser context (via page.evaluate in Puppeteer/Playwright) and in Node.js environments (JSDOM/Cheerio) without external dependencies.
Comprehensive test cases are included to ensure correct extraction across different input types.

Member

@barjin barjin left a comment

Hello and thank you for your contribution.

We'd prefer to follow the API of the existing functions, e.g. the parseOpenGraph util.

With the current (native DOM-based) API, usage from e.g. CheerioCrawler would be complicated: it's our most popular non-browser crawler, yet Cheerio is not compatible with the DOM API. Reading the global document object is also bad practice. You can see this in the tests, where you have to set the global variable; passing the context to the function as a parameter is much easier.
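
To illustrate the parameter-passing style, here is a minimal sketch. It is not the API of parseOpenGraph or of #3233. ElementLike, MicrodataItem, and extractMicrodataSketch are hypothetical names, and the plain-object tree only stands in for a parsed document (a real utility would more likely accept a Cheerio root). It handles only itemscope, itemtype, itemprop, and content, which is a simplification of the microdata rules.

```typescript
// Hypothetical minimal element shape standing in for a parsed document.
// Assumption: a real utility would accept a Cheerio root or similar instead.
interface ElementLike {
    attrs: Record<string, string | undefined>;
    text?: string;
    children?: ElementLike[];
}

type MicrodataValue = string | MicrodataItem;

interface MicrodataItem {
    type?: string;
    properties: Record<string, MicrodataValue[]>;
}

// Collects the properties of one itemscope. It descends into children but
// stops at nested itemscope boundaries, which become nested items.
function collectProperties(el: ElementLike, item: MicrodataItem): void {
    for (const child of el.children ?? []) {
        const prop = child.attrs.itemprop;
        const isScope = 'itemscope' in child.attrs;
        if (prop !== undefined) {
            const value: MicrodataValue = isScope
                ? buildItem(child)
                : (child.attrs.content ?? child.text ?? '').trim();
            const values = item.properties[prop] ?? [];
            values.push(value);
            item.properties[prop] = values;
        }
        if (!isScope) collectProperties(child, item);
    }
}

function buildItem(scope: ElementLike): MicrodataItem {
    const item: MicrodataItem = { type: scope.attrs.itemtype, properties: {} };
    collectProperties(scope, item);
    return item;
}

// The document root is an explicit parameter; nothing reads a global `document`.
function extractMicrodataSketch(root: ElementLike): MicrodataItem[] {
    const items: MicrodataItem[] = [];
    const visit = (el: ElementLike): void => {
        // A scope without itemprop is a top-level item.
        if ('itemscope' in el.attrs && !('itemprop' in el.attrs)) {
            items.push(buildItem(el));
        }
        (el.children ?? []).forEach(visit);
    };
    visit(root);
    return items;
}

// Usage: a test can pass any tree directly, with no global setup.
const samplePage: ElementLike = {
    attrs: {},
    children: [
        {
            attrs: { itemscope: '', itemtype: 'https://schema.org/Person' },
            children: [{ attrs: { itemprop: 'name' }, text: 'Jane Doe' }],
        },
    ],
};
const extracted = extractMicrodataSketch(samplePage);
```

Because the root is passed in, the same function can be called from a test with a hand-built tree or from a crawler with whatever parsed document it already holds.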

Please see #3233 - this PR proposes a better API, but is lacking tests. I'll close your PR to track the initiative in one PR only - feel free to expand on that one.

Thank you again!

@barjin barjin closed this Dec 18, 2025
2 participants