Scrapling Ingestion

Status: implemented behind the existing Pilot ingestion boundary.

Linear: MIN-254, MIN-245

Source Check

Scrapling 0.4.x adds the MCP server and anti-bot/Turnstile-oriented fetch surface used by the Pilot bridge.
Scrapling 0.4.5 changes redirect handling so safe redirects reject loopback, private, and link-local targets by default.
Scrapling 0.4.5 also adds a spider development mode. Pilot exposes it only through explicit development flags and disables the environment-variable override in production.
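
To make the 0.4.5 redirect rule concrete, here is a minimal sketch of the kind of check that rejects loopback, private, and link-local targets. It is not Scrapling's implementation: `is_safe_redirect_target` is a hypothetical helper, it uses only the standard library, and it inspects literal IP hosts without resolving hostnames.

```python
import ipaddress
from urllib.parse import urlsplit


def is_safe_redirect_target(url: str) -> bool:
    """Return False for literal loopback, private, or link-local hosts.

    Hypothetical helper for illustration only. Hostnames are not resolved,
    so a real check would also need to inspect the resolved addresses.
    """
    host = urlsplit(url).hostname
    if host is None or host == "localhost":
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; this sketch lets named hosts through.
        return True
    return not (address.is_loopback or address.is_private or address.is_link_local)
```

A redirect to `http://169.254.169.254/` or `http://[::1]/` would be rejected by this sketch, while a redirect to `https://www.ycombinator.com/companies` would pass.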

Primary sources:

Repository Changes

pipelines/requirements.txt pins scrapling[ai]==0.4.5.
pipelines/scraper/lib/scrapling_adapter.py centralizes follow_redirects="safe" for fetcher, dynamic, and stealthy paths.
pipelines/scraper/run_fetch.py reports safe redirect mode and accepts --development-mode.
pipelines/yc-scraper/scrape_startup_school.py can use development mode for spider iteration outside production.
services/orchestrator/src/tools.ts validates scrapling_fetch with a Zod schema before the Python bridge runs.
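
The development-mode gating described above can be sketched as follows. This is an illustration under assumptions, not the code in run_fetch.py: the environment variable names `PILOT_ENV` and `SCRAPLING_DEVELOPMENT_MODE` are hypothetical, and the sketch assumes the explicit `--development-mode` flag is honored while only the environment override is ignored in production.

```python
import argparse
from typing import Mapping, Sequence


def resolve_development_mode(argv: Sequence[str], env: Mapping[str, str]) -> bool:
    """Decide whether spider development mode is on.

    Hypothetical helper; the env var names below are assumptions.
    """
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--development-mode", action="store_true")
    # Ignore flags this sketch does not know about (e.g. --limit, --dry-run).
    args, _unknown = parser.parse_known_args(list(argv))

    if args.development_mode:
        # The explicit flag is the supported way to opt in.
        return True
    if env.get("PILOT_ENV") == "production":
        # The environment override is disabled in production.
        return False
    return env.get("SCRAPLING_DEVELOPMENT_MODE") == "1"
```

Passing the environment as a mapping rather than reading `os.environ` directly keeps the decision easy to test.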

MCP Boundary

Scrapling MCP exposure must use the existing MCP registry path:

Configure Scrapling in packs/mcp/servers.json or MCP_SERVERS_CONFIG_PATH.
Let McpServerRegistry instantiate the server.
Let ToolRegistry.registerMcpTools("scrapling", client) create namespaced tools such as mcp.scrapling.fetch.
Let AgentLoop.evaluateToolGovernance() evaluate every mcp.scrapling.* call through packages/helm-client before execution.

Do not call the Scrapling MCP server directly from services or Telegram handlers.
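
The registry path above amounts to two rules: tools get a namespaced name, and every call passes a governance check before it executes. The following Python sketch illustrates that shape only. The real registry and agent loop live in the TypeScript orchestrator, and `namespaced_tool_name`, `governed_call`, and `GovernanceDenied` are hypothetical names.

```python
from typing import Any, Callable, Dict


class GovernanceDenied(Exception):
    """Raised when the policy check rejects a tool call."""


def namespaced_tool_name(server: str, tool: str) -> str:
    # ("scrapling", "fetch") -> "mcp.scrapling.fetch"
    return f"mcp.{server}.{tool}"


def governed_call(
    name: str,
    arguments: Dict[str, Any],
    evaluate: Callable[[str, Dict[str, Any]], bool],
    execute: Callable[[str, Dict[str, Any]], Any],
) -> Any:
    """Run `execute` only after `evaluate` approves the call."""
    if not evaluate(name, arguments):
        raise GovernanceDenied(name)
    return execute(name, arguments)
```

Calling the MCP server directly from a service or handler would skip the `evaluate` step, which is why the boundary forbids it.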

Validation

Dry-run examples:

PYTHONPATH=pipelines python pipelines/scraper/run_fetch.py \
  --url https://www.ycombinator.com/companies \
  --strategy fetcher \
  --selector title \
  --limit 1

PYTHONPATH=pipelines python pipelines/yc-scraper/scrape_startup_school.py \
  --limit 2 \
  --dry-run \
  --development-mode

Run live YC validation with a small --limit first. Schedule a full crawl only after the Python runtime has been rebuilt with scripts/install-python-runtime.sh.

Public Operator Checklist

A public Scrapling ingestion claim is complete only when it names the source type, capture boundary, replay behavior, redaction rule, and validation command. Keep private session-backed captures out of anonymous exports; public docs should describe deterministic parsing, operator-triggered replay, error capture, and evidence metadata without exposing cookies, private YC session state, or raw browser storage. When the parser changes, update the fixture or replay example first, then update this page and the public manifest.
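
The completeness rule above can be checked mechanically. This is a minimal sketch, assuming a claim is recorded as a dictionary; the field keys and the `missing_claim_fields` helper are hypothetical names for the five items the checklist lists.

```python
from typing import Any, List, Mapping

# Hypothetical keys for the five items a public ingestion claim must name.
REQUIRED_CLAIM_FIELDS = (
    "source_type",
    "capture_boundary",
    "replay_behavior",
    "redaction_rule",
    "validation_command",
)


def missing_claim_fields(claim: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent or blank in `claim`."""
    missing = []
    for field in REQUIRED_CLAIM_FIELDS:
        value = claim.get(field)
        if not isinstance(value, str) or not value.strip():
            missing.append(field)
    return missing
```

A claim is complete under this sketch only when `missing_claim_fields` returns an empty list.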

Expected Output

For a successful public ingestion run, the operator should see a queued or completed ingestion record, a source label, count metadata, replay counters when the replay migration is present, and a redacted error field when parsing fails. For a failed run, collect the source, replay reference, parser version, sanitized payload shape, and worker logs. Public examples should include a local command, a queued job or ingestion row, a replay reference, and a sanitized parser result. The docs must also say how to recover from partial capture, parser drift, duplicate replay, rate limiting, and unavailable upstream pages. Operators should validate that failed captures remain inspectable without exposing protected session state.
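
One way to collect a "sanitized payload shape" for a failed run is to keep the structure of the parser input and drop every value. The sketch below is an assumption about what such a shape could look like, not the project's format, and `payload_shape` is a hypothetical helper.

```python
from typing import Any


def payload_shape(value: Any) -> Any:
    """Replace every leaf value with its type name, keeping keys and nesting."""
    if isinstance(value, dict):
        return {str(key): payload_shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # Describe a list by its first element; an empty list stays empty.
        return [payload_shape(value[0])] if value else []
    return type(value).__name__
```

Key names survive in the shape, so this approach only suits payloads whose keys are not themselves sensitive.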

Boundary

Do not include raw session cookies, connector tokens, private page bodies, or founder-specific application material in public examples. If upstream layout drift changes extraction, update the parser test and explain the operator action: rerun capture, replay stored input, or mark the record stale.
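
A small redaction pass can help keep cookies and tokens out of public examples. This is a sketch under assumptions: the key fragments in `SENSITIVE_KEY_PARTS` and the `redact` helper are hypothetical, and key-based redaction does not catch private page bodies or founder-specific material, which still need manual review.

```python
from typing import Any

# Hypothetical list of key fragments treated as sensitive.
SENSITIVE_KEY_PARTS = ("cookie", "token", "session", "authorization", "storage")
REDACTED = "[redacted]"


def _is_sensitive(key: str) -> bool:
    lowered = key.lower()
    return any(part in lowered for part in SENSITIVE_KEY_PARTS)


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive dictionary entries masked."""
    if isinstance(value, dict):
        return {
            key: REDACTED if _is_sensitive(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

The function returns a new structure and leaves the original capture untouched, so the unredacted record stays available for private replay.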

Troubleshooting

If ingestion is queued but never completes, check worker availability, rate limits, source reachability, and database migrations before changing the docs. If replay produces different output, preserve the original capture metadata, compare parser versions, and document the drift as a deterministic replay finding rather than a new public claim.
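
The replay comparison above can be sketched as a digest check followed by a parser-version check. The `ParseRun` type, the helper names, and the result labels are all hypothetical; this only illustrates one way to tell parser drift apart from unexplained differences.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ParseRun:
    parser_version: str
    output: Dict[str, Any]


def output_digest(output: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order does not affect the digest."""
    canonical = json.dumps(output, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify_replay(original: ParseRun, replay: ParseRun) -> str:
    """Label a replay as a match, parser drift, or an unexplained difference."""
    if output_digest(original.output) == output_digest(replay.output):
        return "match"
    if original.parser_version != replay.parser_version:
        return "parser-drift"
    # Same parser version, different output: the parse was not deterministic.
    return "nondeterministic"
```

The original `ParseRun` is never modified, which matches the requirement to preserve the original capture metadata.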