Web Scraper Service

Scope

The web scraper service is used by knowledge-base URL document creation, web document refresh, and backend flows that need to convert web pages into Markdown.

Callers should keep using WebScraperService.scrape_url() and the ScrapedContent result structure. Internal scraping strategies, PDF extraction, and Playwright fallback can evolve, but the public contract must remain stable:

ScrapedContent
WebScraperService.scrape_url(url: str)
get_web_scraper_service()

Scraping Flow

The service selects an extraction path based on content type and quality:

Validate the initial URL and reject non-HTTP(S), local, private-network, and unsafe redirect targets.
Use the PDF extractor for PDF URLs, including final response URL validation.
Use the Crawl4AI strategy first for regular web pages and convert the result to Markdown.
Enable Playwright frame extraction only when the primary result is empty or low quality and policy allows fallback. A failed primary extraction that still returns a 2xx HTTP status (the page is reachable but the primary strategy extracted nothing, e.g. content lives in nested iframes) is treated as empty content and is also fallback-eligible.
Run fallback output through the Markdown cleaner, classifier, and quality evaluator.

Playwright fallback prefers frame innerHTML and converts it to Markdown. It only falls back to innerText when structured HTML is unavailable. Multi-frame output keeps frame titles and source URLs for indexing and troubleshooting.

Module Relationships

WebScraperService is the only public entry point. Internally, it resolves policy, selects the extraction path, and normalizes strategy results into ScrapedContent.

Markdown Output

Page content should be emitted as structured Markdown suitable for knowledge-base indexing and reading:

Preserve headings, paragraphs, lists, links, and basic tables.
Remove navigation, footers, buttons, forms, login pages, CAPTCHA pages, Access Denied pages, Too Many Requests pages, and repeated boilerplate.
Keep the cleaner conservative. Do not remove body content only because class names or localized keywords look action-oriented.
When only plain text is available, mark the result as degraded instead of pretending it is structured Markdown.

Proxy Configuration

Proxy behavior is controlled by WEBSCRAPER_PROXY and WEBSCRAPER_PROXY_MODE.

WEBSCRAPER_PROXY is the proxy URL and supports HTTP, HTTPS, and SOCKS5:

WEBSCRAPER_PROXY=http://proxy.example.com:8080
WEBSCRAPER_PROXY_MODE=fallback

WEBSCRAPER_PROXY_MODE supports:

Mode	Behavior
`none`	Never use a proxy
`fallback`	Try direct connection first, then retry with proxy after network errors, timeout, 403, 429, or 5xx
`proxy`	Always use the proxy, and `WEBSCRAPER_PROXY` must be configured
`direct`	Legacy alias with the same behavior as `proxy`

New deployments should use proxy for forced proxy behavior. direct exists only as a legacy alias for older environment variable semantics.

Site Configuration

Site-specific behavior is configured through WEB_SCRAPER_SITE_CONFIG, a JSON object keyed by domain or URL pattern.

Example:

WEB_SCRAPER_SITE_CONFIG={"example.com":{"wait_until":"networkidle","page_timeout_ms":30000,"delay_before_return_html":3.0}}

Site configuration is parsed through an allowlist. Unknown fields are logged as warnings and ignored. They are not passed through to Crawl4AI or Playwright.

Common supported fields include:

wait_until
wait_for
delay_before_return_html
page_timeout_ms
process_iframes
locale
timezone_id
user_agent
user_agent_mode
fallback_enabled
fallback_on_empty
fallback_on_blocked
deep_iframe_extraction

Fallback Policy

fallback_enabled is the top-level switch. When it is disabled, Playwright fallback does not run even if content is empty, quality is low, or deep iframe extraction is enabled.

When fallback_enabled=true:

fallback_on_empty=true allows empty content to trigger fallback.
deep_iframe_extraction=true allows empty or low-quality content to trigger fallback.
A failed primary extraction with a 2xx HTTP status is treated as reachable-but-empty (classified as empty content) and triggers fallback per the empty-content rule, i.e. when fallback_on_empty=true or deep_iframe_extraction=true. This covers content in nested iframes and SPA shells that the primary strategy misclassifies as anti-bot.
Genuine transport failures (no HTTP response, e.g. connection refused or DNS failure) are classified as network failures and do not trigger fallback, since re-rendering cannot help.
Auth-required, rate-limited, SSRF-blocked, and other states that rendering cannot fix do not automatically trigger fallback through deep iframe extraction.
Whether blocked pages trigger fallback is controlled by fallback_on_blocked.

Security Boundary

All scraping paths must follow the SSRF guard:

Validate both initial URLs and final response URLs.
PDF downloads must validate response.url after redirects, not only the request URL.
Playwright fallback must use request interception and reject unsafe schemes, local addresses, and private-network targets.
WebSocket URLs must pass the same class of SSRF validation.
Browser, context, and page resources must be released on error paths.

Add focused tests before wiring a new scraping strategy or extractor into the orchestrator.

Scope​

Scraping Flow​

Module Relationships​

Markdown Output​

Proxy Configuration​

Site Configuration​

Fallback Policy​

Security Boundary​