
docs: Add guide security-of-web-scraping #1908

Open
Mantisus wants to merge 5 commits into apify:master from Mantisus:secure-scraping

Conversation

Collaborator

Mantisus commented May 22, 2026

Description

  • Add a guide covering the security threats a crawler can encounter while scraping, how to handle each of them, and the Crawlee defaults that already mitigate some of them.

Issues

Mantisus marked this pull request as ready for review May 24, 2026 17:02
Mantisus assigned janbuchar, vdusek and Mantisus and unassigned janbuchar and vdusek May 24, 2026
Mantisus requested review from janbuchar and vdusek May 24, 2026 17:03
Collaborator

vdusek left a comment

a few thoughts

6 comment threads on docs/guides/secure_scraping.mdx (5 outdated)
Mantisus changed the title from "docs: Add guide secure-scraping" to "docs: Add guide security-of-web-scraping" May 27, 2026
Collaborator

vdusek left a comment

LGTM, @szaganek, would you like to take a look as well?

vdusek requested review from szaganek and removed request for janbuchar June 1, 2026 09:37

szaganek left a comment

This is very, very LLM-heavy 😅 I'd suggest to:

  • get rid of all en dashes, break these long sentences into shorter ones
  • get rid of bold for emphasis
  • use contractions
  • trim a lil fat in general. What I mean is that people won't read this doc "from cover to cover", so sections need to be a little more self-contained and easier to scan. Starting a section with "Everything so far has been about the requests your crawler makes." doesn't make sense, because a doc isn't meant to be read that way.
  • adding to that, I'm not sure about the information architecture. Maybe a little extra structure to these headings would help (like threats: requests vs data).

@Mantisus Mantisus requested a review from szaganek June 1, 2026 23:24
Collaborator Author

Mantisus commented Jun 2, 2026

@szaganek, I've updated the text based on some of your suggestions.

> adding to that, I'm not sure about the information architecture. Maybe a little extra structure to these headings would help (like threats: requests vs data).

I'm not sure about restructuring (though I agree the current architecture isn't perfect). I worry about adding another heading layer for grouping. It could actually make this harder to read because some threats are tightly related but might naturally belong to different meta-sections. For example, queueing URLs through enqueue_links could just as well live under untrusted-content handling and SSRF attacks.

szaganek left a comment

Thank you, @Mantisus!

Grouping information in a clear way is always a challenge :) Not blocking at this point.

Labels: None yet

Projects: None yet

Development

Successfully merging this pull request may close these issues.

Secure scraping guide

6 participants