Add support for AI/LLM-based HTML parsing (selectors) #1593

@vdusek

At recent events I attended, I was asked about AI/LLM-based HTML parsing. I also found a few dedicated AI-based scraping frameworks, such as ScrapeGraphAI and Parsera, that appear to be gaining traction.

Right now, the only AI-selector workflow we provide is the guide on using PlaywrightCrawler with Stagehand.

This means:

  • AI-based selectors are supported only for Playwright, not for HTTP-based crawlers.
  • Even for PlaywrightCrawler, the integration is not very smooth compared to the tools mentioned above.

Example from ScrapeGraphAI:

from scrapegraphai.graphs import SmartScraperGraph

# graph_config is the LLM/runtime configuration dict, defined elsewhere (not shown here)

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract useful information from the webpage, including a description of what the company does, founders and social media links",
    source="https://scrapegraphai.com/",
    config=graph_config,
)

# Run the pipeline
result = smart_scraper_graph.run()

It might be worth exploring a more native solution:

  • Better Stagehand integration so that AI-based selectors in Playwright crawlers are as straightforward as in the dedicated AI-scraping libraries.
  • Introduce an AI/LLM-powered crawler built on top of AbstractHttpCrawler, enabling AI/LLM selectors for HTTP-based scraping as well.

This could make Crawlee more useful for AI/LLM-based extraction and for quickly prototyping scrapers without hand-written CSS/XPath selectors.
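
To make the second idea a bit more concrete, here is a rough, library-independent sketch of what an "LLM selector" for HTTP-fetched HTML could boil down to: strip the page to visible text, send it to a model together with a prompt, and parse a JSON answer. None of this is existing Crawlee API. The names `html_to_text`, `llm_extract`, and `fake_llm` are hypothetical, and the model is passed in as a plain callable so that any provider could be plugged in; `fake_llm` is only a stand-in that returns a canned answer.

```python
import json
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce an HTML document to its visible text, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def llm_extract(html: str, prompt: str, llm: Callable[[str], str]) -> dict:
    """Hypothetical helper: ask `llm` to extract data from `html` as JSON."""
    message = (
        f"{prompt}\n\n"
        "Answer with a single JSON object and nothing else.\n\n"
        f"Page text:\n{html_to_text(html)}"
    )
    # Assumes the model returns valid JSON; real code would need error handling.
    return json.loads(llm(message))


def fake_llm(message: str) -> str:
    """Stand-in for a real model call; always returns a canned JSON answer."""
    return json.dumps({"company": "Example Co"})


SAMPLE_HTML = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Example Co</h1><script>x()</script><p>We build things.</p>"
    "</body></html>"
)

result = llm_extract(SAMPLE_HTML, "Extract the company name.", fake_llm)
```

In a crawler built on AbstractHttpCrawler, something like `llm_extract` would presumably be exposed through the crawling context, so that a request handler could call it on the fetched response instead of writing CSS/XPath selectors. How exactly that would be wired in is left open here.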

Labels: t-tooling
