Scrapling Ingestion
Status: implemented behind the existing Pilot ingestion boundary.
Linear: MIN-254, MIN-245
Source Check
- Scrapling 0.4.x adds the MCP server and anti-bot/Turnstile-oriented fetch surface used by the Pilot bridge.
- Scrapling 0.4.5 changes redirect handling so safe redirects reject loopback, private, and link-local targets by default.
- 0.4.5 also adds spider development mode. Pilot exposes it only through explicit development flags and disables the env override in production.
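The redirect rule above can be illustrated with the standard library. This is a minimal sketch of the classification idea only, not Scrapling's implementation: the function name is hypothetical, and it handles IP literals only (a real check would also need to resolve hostnames before deciding).

```python
import ipaddress


def is_blocked_redirect_ip(address: str) -> bool:
    """Return True when an IP literal is loopback, private, or link-local.

    Hypothetical helper for illustration; Scrapling's own check may differ.
    """
    ip = ipaddress.ip_address(address)
    return ip.is_loopback or ip.is_private or ip.is_link_local


if __name__ == "__main__":
    # Mix of internal targets and one public address.
    for candidate in ("127.0.0.1", "10.0.0.8", "169.254.169.254", "::1", "8.8.8.8"):
        verdict = "blocked" if is_blocked_redirect_ip(candidate) else "allowed"
        print(f"{candidate}: {verdict}")
```

The link-local case matters because cloud metadata endpoints commonly live at 169.254.169.254, which is a typical target for redirect-based request forgery.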
Primary sources:
- https://github.com/D4Vinci/Scrapling/releases/tag/v0.4.0
- https://github.com/D4Vinci/Scrapling/releases/tag/v0.4.5
- https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html
Repository Changes
- pipelines/requirements.txt pins scrapling[ai]==0.4.5.
- pipelines/scraper/lib/scrapling_adapter.py centralizes follow_redirects="safe" for fetcher, dynamic, and stealthy paths.
- pipelines/scraper/run_fetch.py reports safe redirect mode and accepts --development-mode.
- pipelines/yc-scraper/scrape_startup_school.py can use development mode for spider iteration outside production.
- services/orchestrator/src/tools.ts validates scrapling_fetch with a Zod schema before the Python bridge runs.
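One way to picture the adapter's role is a single helper that every fetch path passes its options through. The sketch below is an assumption about shape, not the contents of scrapling_adapter.py: the function names and the PILOT_SCRAPLING_DEV_MODE variable are hypothetical, and only the follow_redirects="safe" value comes from this page.

```python
from typing import Any, Dict, Mapping, Optional

SAFE_REDIRECTS = "safe"  # value named in this document
DEV_MODE_ENV = "PILOT_SCRAPLING_DEV_MODE"  # hypothetical variable name


def with_safe_redirects(options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a copy of the options with follow_redirects forced to "safe"."""
    merged = dict(options or {})
    merged["follow_redirects"] = SAFE_REDIRECTS
    return merged


def resolve_development_mode(flag: bool, env: Mapping[str, str], production: bool) -> bool:
    """Honor the explicit flag; honor the env override only outside production."""
    env_override = env.get(DEV_MODE_ENV, "").lower() in {"1", "true", "yes"}
    return flag or (env_override and not production)


if __name__ == "__main__":
    print(with_safe_redirects({"follow_redirects": True, "timeout": 10}))
    print(resolve_development_mode(False, {DEV_MODE_ENV: "1"}, production=True))
```

Forcing the value in one place means a caller cannot weaken redirect handling by passing its own follow_redirects option.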
MCP Boundary
Scrapling MCP exposure must use the existing MCP registry path:
- Configure Scrapling in packs/mcp/servers.json or MCP_SERVERS_CONFIG_PATH.
- Let McpServerRegistry instantiate the server.
- Let ToolRegistry.registerMcpTools("scrapling", client) create namespaced tools such as mcp.scrapling.fetch.
- Let AgentLoop.evaluateToolGovernance() evaluate every mcp.scrapling.* call through packages/helm-client before execution.
Do not call the Scrapling MCP server directly from services or Telegram handlers.
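The reason for routing through the registry is that namespacing and governance happen in one place. The following Python model is a conceptual sketch only: the real registry lives in the TypeScript orchestrator, and the class, method, and policy names here are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict

ToolFn = Callable[[Dict[str, Any]], Any]
GovernanceFn = Callable[[str, Dict[str, Any]], bool]


class GovernedToolRegistry:
    """Toy model: tools are namespaced at registration and gated at call time."""

    def __init__(self, evaluate: GovernanceFn) -> None:
        self._evaluate = evaluate
        self._tools: Dict[str, ToolFn] = {}

    def register_mcp_tools(self, server: str, tools: Dict[str, ToolFn]) -> None:
        # "fetch" on server "scrapling" becomes "mcp.scrapling.fetch".
        for name, fn in tools.items():
            self._tools[f"mcp.{server}.{name}"] = fn

    def call(self, qualified_name: str, args: Dict[str, Any]) -> Any:
        # Governance runs before the tool body, never after.
        if not self._evaluate(qualified_name, args):
            raise PermissionError(f"governance denied {qualified_name}")
        return self._tools[qualified_name](args)


def allow_yc_only(name: str, args: Dict[str, Any]) -> bool:
    """Hypothetical policy: only ycombinator.com URLs pass."""
    return str(args.get("url", "")).startswith("https://www.ycombinator.com/")


if __name__ == "__main__":
    registry = GovernedToolRegistry(allow_yc_only)
    registry.register_mcp_tools("scrapling", {"fetch": lambda a: f"fetched {a['url']}"})
    print(registry.call("mcp.scrapling.fetch", {"url": "https://www.ycombinator.com/companies"}))
```

A service that called the MCP server directly would skip the call method entirely, which is exactly the bypass the rule above forbids.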
Validation
Dry-run examples:
PYTHONPATH=pipelines python pipelines/scraper/run_fetch.py \
  --url https://www.ycombinator.com/companies \
  --strategy fetcher \
  --selector title \
  --limit 1

PYTHONPATH=pipelines python pipelines/yc-scraper/scrape_startup_school.py \
  --limit 2 \
  --dry-run \
  --development-mode
Run live YC validation with a short --limit first, then schedule a crawl once the Python runtime has been rebuilt with scripts/install-python-runtime.sh.
Public Operator Checklist
A public Scrapling ingestion claim is complete only when it names the source type, capture boundary, replay behavior, redaction rule, and validation command. Keep private session-backed captures out of anonymous exports; public docs should describe deterministic parsing, operator-triggered replay, error capture, and evidence metadata without exposing cookies, private YC session state, or raw browser storage. When the parser changes, update the fixture or replay example first, then update this page and the public manifest.
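The completeness rule above lends itself to a mechanical check. This is a minimal sketch under stated assumptions: the field keys are hypothetical names for the five items listed, and the sample values are illustrative rather than taken from a real manifest.

```python
from typing import Any, List, Mapping

# Hypothetical keys for the five items a public claim must name.
REQUIRED_CLAIM_FIELDS = (
    "source_type",
    "capture_boundary",
    "replay_behavior",
    "redaction_rule",
    "validation_command",
)


def missing_claim_fields(claim: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent or blank in a claim."""
    missing = []
    for field in REQUIRED_CLAIM_FIELDS:
        value = claim.get(field)
        if not isinstance(value, str) or not value.strip():
            missing.append(field)
    return missing


if __name__ == "__main__":
    draft = {
        "source_type": "public listing page",
        "capture_boundary": "anonymous fetch only",
        "validation_command": "run_fetch.py --limit 1",
    }
    print(missing_claim_fields(draft))
```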
Expected Output
For a successful public ingestion run, the operator should see a queued or completed ingestion record, a source label, count metadata, replay counters when the replay migration is present, and a redacted error field when parsing fails. For a failed run, collect the source, replay reference, parser version, sanitized payload shape, and worker logs. Public examples should include a local command, a queued job or ingestion row, a replay reference, and a sanitized parser result. The docs must also say how to recover from partial capture, parser drift, duplicate replay, rate limiting, and unavailable upstream pages. Operators should validate that failed captures remain inspectable without exposing protected session state.
Boundary
Do not include raw session cookies, connector tokens, private page bodies, or founder-specific application material in public examples. If upstream layout drift changes extraction, update the parser test and explain the operator action: rerun capture, replay stored input, or mark the record stale.
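A redaction pass before anything reaches a public example might look like the sketch below. The key names and the placeholder string are assumptions for illustration; the actual record shape is not specified on this page, so the sensitive-key set would need to match the real capture format.

```python
from typing import Any

# Hypothetical key names; adjust to the real capture record shape.
SENSITIVE_KEYS = {"cookies", "session", "token", "authorization", "local_storage", "page_body"}
PLACEHOLDER = "[REDACTED]"


def redact(value: Any) -> Any:
    """Recursively replace values under sensitive keys with a placeholder."""
    if isinstance(value, dict):
        return {
            key: PLACEHOLDER if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


if __name__ == "__main__":
    record = {
        "source": "yc-startup-school",
        "error": {"message": "selector not found", "cookies": {"sid": "abc123"}},
        "attempts": [{"token": "secret", "status": 429}],
    }
    print(redact(record))
```

Redacting by key keeps the payload shape intact, so a failed capture stays inspectable without carrying session state.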
Troubleshooting
If ingestion is queued but never completes, check worker availability, rate limits, source reachability, and database migrations before changing docs. If replay produces different output, preserve the original capture metadata, compare parser version, and document the drift as a deterministic replay finding rather than a new public claim.
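The replay comparison described above can be sketched as a digest check followed by a parser-version check. The record fields and result labels here are hypothetical; the point is only that the original capture metadata is read, never overwritten.

```python
import hashlib
import json
from typing import Any, Mapping


def output_digest(parsed: Any) -> str:
    """Hash a parsed result using a canonical JSON form so key order is irrelevant."""
    canonical = json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify_replay(original: Mapping[str, str], replayed: Mapping[str, str]) -> str:
    """Compare two runs over the same stored input (field names are assumptions)."""
    if original["output_digest"] == replayed["output_digest"]:
        return "match"
    if original["parser_version"] != replayed["parser_version"]:
        return "parser_drift"
    # Same parser, same input, different output: the parser is not deterministic.
    return "nondeterministic"


if __name__ == "__main__":
    first = {"parser_version": "1.2.0", "output_digest": output_digest({"title": "YC"})}
    second = {"parser_version": "1.3.0", "output_digest": output_digest({"title": "Y Combinator"})}
    print(classify_replay(first, second))
```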