Headless Web Scraping
Skill Verified Active
Extract data from web pages using the scrapling Python library: select the appropriate fetcher tier (HTTP, stealth Chromium, or full browser automation) based on target site defenses, configure headless browsing, and extract structured data with CSS selectors. Use when WebFetch is insufficient for JS-rendered pages, anti-bot-protected sites, or structured multi-element extraction requiring DOM traversal.
Extract structured data from complex or protected web pages that cannot be retrieved with plain HTTP requests.
Features
- Select appropriate fetcher tier (HTTP, Stealthy, Dynamic)
- Configure headless browsing and network idle states
- Extract structured data using CSS selectors
- Handle anti-bot defenses and JS-rendered content
- Implement rate limiting and ethical scraping practices
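As a rough illustration of tier selection and fetcher configuration, the sketch below maps two site traits to a tier name and then calls the matching scrapling fetcher. The helper names `choose_tier` and `fetch_page` and the mapping heuristic are hypothetical, not part of this skill. The fetcher classes and the `headless` / `network_idle` arguments are assumed to match recent scrapling releases; older releases used different class names, so check the installed version.

```python
TIERS = ("http", "stealthy", "dynamic")


def choose_tier(js_rendered: bool = False, anti_bot: bool = False) -> str:
    """Hypothetical heuristic: pick the lightest tier likely to succeed."""
    if anti_bot:
        return "stealthy"  # stealth Chromium for protected sites
    if js_rendered:
        return "dynamic"   # full browser automation for JS-rendered pages
    return "http"          # plain HTTP request


def fetch_page(url: str, tier: str):
    """Fetch `url` with the scrapling fetcher that matches `tier`."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    # Imported lazily so choose_tier() works without scrapling installed.
    # Assumption: these class names exist in the installed scrapling version.
    from scrapling.fetchers import DynamicFetcher, Fetcher, StealthyFetcher

    if tier == "http":
        return Fetcher.get(url)
    if tier == "stealthy":
        return StealthyFetcher.fetch(url, headless=True, network_idle=True)
    return DynamicFetcher.fetch(url, headless=True, network_idle=True)
```

A call such as `fetch_page("https://example.com", choose_tier(js_rendered=True))` would return a page object; elements can then be selected with CSS, for example `page.css("h1::text")`.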
Use Cases
- Scraping JS-rendered single-page applications
- Bypassing anti-bot protections like Cloudflare Turnstile
- Extracting data from dynamically loaded content or complex DOM structures
- Gathering data from sites that block basic HTTP requests
Non-Goals
- Solving CAPTCHA challenges (e.g., altcha)
- Scraping sites that require manual login without additional setup
- Replacing dedicated API clients when available
Workflow
- Select Fetcher Tier
- Configure Fetcher
- Fetch and Extract Data
- Handle Failures and Edge Cases
- Implement Rate Limiting and Ethical Scraping
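One way to picture the failure-handling step is an escalation loop that tries each tier in order and moves on when a fetch raises or returns an unusable page. This is a minimal sketch under stated assumptions: `fetch_with_escalation` is a hypothetical helper, the fetchers are passed in as plain callables, and a page is treated as usable when it exposes `status == 200`.

```python
from typing import Callable, Iterable, Tuple


def _status_ok(page) -> bool:
    """Assumed success check: the page object exposes an HTTP status of 200."""
    return getattr(page, "status", None) == 200


def fetch_with_escalation(
    url: str,
    fetchers: Iterable[Tuple[str, Callable[[str], object]]],
    ok: Callable[[object], bool] = _status_ok,
):
    """Try each (tier_name, fetch) pair in order; return the first usable page."""
    errors = []
    for name, fetch in fetchers:
        try:
            page = fetch(url)
        except Exception as exc:  # network errors, timeouts, blocked requests
            errors.append((name, repr(exc)))
            continue
        if ok(page):
            return name, page
        errors.append((name, "page rejected by ok()"))
    raise RuntimeError(f"all tiers failed for {url}: {errors}")
```

Ordering the pairs from cheapest to most expensive (HTTP, then stealthy, then dynamic) keeps browser launches for the cases that need them.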
Practices
- Web scraping
- Data extraction
- Ethical automation
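The rate-limiting step of the workflow could be sketched as a small helper that enforces a minimum interval between requests. `RateLimiter` is a hypothetical name and the interval is something you would choose per target site. The clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

Calling `limiter.wait()` immediately before each fetch spaces requests at least `min_interval` seconds apart.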
Prerequisites
- Python 3
- scrapling library
- Playwright Chromium binary
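A quick pre-flight check for the Python-side prerequisites might look like the following sketch. `missing_prerequisites` is a hypothetical helper; it only checks the interpreter version and whether the listed modules can be imported. It does not verify that the Playwright Chromium binary is installed.

```python
import importlib.util
import sys


def missing_prerequisites(modules=("scrapling",)):
    """Return a list of human-readable problems; empty when all checks pass."""
    problems = []
    if sys.version_info < (3,):
        problems.append("Python 3 is required")
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not installed: {name}")
    return problems
```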
Installation
/plugin install agent-almanac@pjt222-agent-almanac
Quality Score
Verified
Trust Signals
Similar Extensions
Agent Browser
100
Browser automation CLI for AI agents. Use when the user needs to interact with websites, including navigating pages, filling forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "login to a site", "automate browser actions", or any task requiring programmatic web interaction.
Chatgpt Search
100
Search ChatGPT and extract the full response plus the hydration JSON that powers the UI. Attaches to a running Chrome instance (port 9222 by default), opens ChatGPT, submits a query, waits for the streamed response, and returns structured data: messages, product cards, hydration JSON, and API calls. Use when asked to "search chatgpt", "ask chatgpt", "chatgpt search", "get chatgpt response", or "scrape chatgpt".
Website Extraction Api
100
Extract typed JSON from public website pages using a schema.
Extract Supplier Catalog From Website
100
Extract SKUs, product names, unit prices, availability, and minimum order quantities from a supplier catalog page.
Extract Public Registry Page
100
Extract organization name, registration number, status, registration date, and officers from a public registry page.
Browser Extract
99
Extract structured data via stored browser-templates or one-shot DOM queries, with mandatory AIDefence PII and prompt-injection gates before content reaches the model.