docs: Add Scrapling guide #938
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #938 +/- ##
=======================================
Coverage 86.94% 86.95%
=======================================
Files 48 48
Lines 2942 2943 +1
=======================================
+ Hits 2558 2559 +1
Misses 384 384
Mantisus left a comment
A couple of questions that might change the guide
    ) -> tuple[dict[str, Any], list[str]]:
        """Fetch a page in a real browser with Scrapling and return data and links."""
        # `network_idle` waits until the page stops making network requests.
        response = await DynamicFetcher.async_fetch(
How does this work internally? Does the browser open, send a request, and then close? If so, that looks like a lot of overhead for a guide example.
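If the one-shot call does launch a browser per request, one way to avoid that cost would be to keep a single browser open for the whole crawl. The sketch below assumes Scrapling's `AsyncDynamicSession` async context manager and that `headless` and `network_idle` can be passed to its constructor; the function name `fetch_with_shared_browser` is hypothetical, and none of this is taken from the PR itself.

```python
async def fetch_with_shared_browser(urls: list[str]) -> list[int]:
    """Fetch several pages through one browser session and return their status codes."""
    # Imported inside the function so this sketch loads even where Scrapling
    # is not installed; a real Actor would import at module level.
    from scrapling.fetchers import AsyncDynamicSession

    statuses: list[int] = []
    # Assumption: the session launches the browser once on entry and closes
    # it on exit, so the launch cost is paid once rather than per URL.
    async with AsyncDynamicSession(headless=True, network_idle=True) as session:
        for url in urls:
            # Pages are fetched one after another to stay within the
            # session's default page limit.
            page = await session.fetch(url)
            statuses.append(page.status)
    return statuses
```

In an Actor this would be awaited once around the request loop, for example `await fetch_with_shared_browser(["https://example.com"])`, instead of calling the one-shot fetcher for each URL.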
    ) -> tuple[dict[str, Any], list[str]]:
        """Fetch a page with Scrapling's HTTP fetcher and return data and links."""
        # `impersonate` and `stealthy_headers` make the request look like Chrome.
        response = await AsyncFetcher.get(
The guide is titled 'Adaptive Scraping with Scrapling'. Should we use the 'adaptive=True' mode in the example? 🙂
https://scrapling.readthedocs.io/en/latest/parsing/adaptive.html
    {ScraplingBrowserScraper}
</CodeBlock>

To run this on the Apify platform, build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser together with all of its system-level dependencies, and run `scrapling install` during the Docker build to download the browser binaries that Scrapling expects.
We can add a simple Dockerfile example here.
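A minimal sketch of such a Dockerfile, following the steps the paragraph describes. The image tag, the `requirements.txt` file (assumed to list Scrapling with its fetcher extras), and the `src` entry module are assumptions for illustration, not taken from the PR.

```dockerfile
# Assumption: the tag is illustrative; pick the Python version the Actor targets.
FROM apify/actor-python-playwright:3.13

# Install Python dependencies first so this layer is cached between builds.
# Assumption: requirements.txt includes Scrapling with its fetcher extras.
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

# Download the browser binaries that Scrapling expects.
RUN scrapling install

# Copy the Actor source code.
COPY . ./

# Assumption: the Actor's entry point is the `src` package.
CMD ["python3", "-m", "src"]
```

Running `scrapling install` before copying the source keeps the browser download in a cached layer, so code changes do not trigger it again.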
Adds a guide for the Scrapling adaptive web scraping library in Apify Actors, following the structure of the existing scraping-library guides.
- `docs/03_guides/07_scrapling.mdx`: the guide, covering introduction & features, choosing a fetcher (HTTP vs. browser-based), a runnable example Actor, Apify Proxy integration, and running browser fetchers (`DynamicFetcher`/`StealthyFetcher`) with the required `scrapling install` step in the Dockerfile.
- `code/07_scrapling.py`: runnable single-file example, a recursive title scraper using Scrapling's async HTTP `AsyncFetcher` through Apify Proxy.
- `code/07_scrapling_browser.py` shows the browser-based variant.

Verified locally (`apify run`) and on the Apify platform (build + run SUCCEEDED, correct dataset output via Apify Proxy), including the browser path. Lint + type-check pass.

Closes: #836
TODO before merging
- Copy the guide files (`docs/03_guides/07_scrapling.mdx` + `docs/03_guides/code/07_scrapling.py` + `docs/03_guides/code/07_scrapling_browser.py`) into `website/versioned_docs/version-3.4/` so the guide also shows in the current docs version, not only under "next".