Headless Web Scraping

Extract data from web pages using the scrapling Python library. Select the appropriate fetcher tier (HTTP, stealth Chromium, or full browser automation) based on the target site's defenses, configure headless browsing, and extract structured data with CSS selectors. Use this skill when WebFetch is insufficient: for JS-rendered pages, anti-bot-protected sites, or structured multi-element extraction that requires DOM traversal.

Purpose

Extract structured data from complex or protected web pages that cannot be accessed by simpler HTTP requests.

Features

Select appropriate fetcher tier (HTTP, Stealthy, Dynamic)
Configure headless browsing and network idle states
Extract structured data using CSS selectors
Handle anti-bot defenses and JS-rendered content
Implement rate limiting and ethical scraping practices
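
The tier choice listed above can be sketched as a small decision function. This is only an illustration: the `Tier` enum, the `choose_tier` name, and the rule itself (anti-bot defenses take priority over JS rendering) are assumptions, not part of scrapling or of this skill's source.

```python
from enum import Enum


class Tier(Enum):
    """Hypothetical labels for the three fetcher tiers named in this listing."""
    HTTP = "http"          # plain HTTP request, no browser
    STEALTHY = "stealthy"  # stealth Chromium for anti-bot-protected sites
    DYNAMIC = "dynamic"    # full browser automation for JS-rendered pages


def choose_tier(js_rendered: bool, anti_bot: bool) -> Tier:
    """Pick the lightest tier expected to work (assumed rule, not the skill's)."""
    if anti_bot:
        return Tier.STEALTHY
    if js_rendered:
        return Tier.DYNAMIC
    return Tier.HTTP
```

Starting from the lightest tier keeps requests cheap; a browser is launched only when the page actually needs one.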

Use Cases

Scraping JS-rendered single-page applications
Bypassing anti-bot protections like Cloudflare Turnstile
Extracting data from dynamically loaded content or complex DOM structures
Gathering data from sites that block basic HTTP requests

Non-Goals

Solving CAPTCHA challenges (e.g., altcha)
Scraping sites that require manual login without additional setup
Replacing dedicated API clients when available

Workflow

Select Fetcher Tier
Configure Fetcher
Fetch and Extract Data
Handle Failures and Edge Cases
Implement Rate Limiting and Ethical Scraping
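
A minimal sketch of how the workflow steps above might fit together: try each tier in order, escalate on failure, and pause between requests. The names `RateLimiter` and `fetch_with_escalation` are hypothetical. The fetch callables are injected rather than imported, because the exact scrapling class and method names depend on the installed version and are not assumed here.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical type: a fetcher takes a URL and returns a page object, or raises.
FetchFn = Callable[[str], object]


class RateLimiter:
    """Enforce a minimum delay (in seconds) between consecutive requests."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def fetch_with_escalation(
    url: str,
    tiers: Sequence[Tuple[str, FetchFn]],
    limiter: RateLimiter,
) -> Tuple[str, object]:
    """Try each tier in order; return (tier_name, page) from the first success."""
    errors: List[str] = []
    for name, fetch in tiers:
        limiter.wait()  # rate-limit every attempt, including retries
        try:
            return name, fetch(url)
        except Exception as exc:  # any failure escalates to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))
```

In practice each callable would presumably be a thin wrapper around the scrapling fetcher for the HTTP, Stealthy, or Dynamic tier, and the returned page would then be queried with CSS selectors.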

Practices

Web scraping
Data extraction
Ethical automation

Prerequisites

Python 3
scrapling library
Playwright Chromium binary
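
The Python-side prerequisites can be checked before a run with a short script such as the one below. It is a sketch: `check_prerequisites` is a hypothetical helper, the assumption that the browser tiers rely on an importable `playwright` package is mine, and the script does not verify that the Chromium binary itself is installed.

```python
import importlib.util
import sys


def check_prerequisites() -> dict:
    """Report whether the Python-side prerequisites appear to be present."""
    return {
        "python3": sys.version_info.major >= 3,
        # find_spec returns None when a top-level package is not installed.
        "scrapling": importlib.util.find_spec("scrapling") is not None,
        # Assumption: the browser tiers need the `playwright` package.
        "playwright": importlib.util.find_spec("playwright") is not None,
    }


if __name__ == "__main__":
    for name, present in check_prerequisites().items():
        print(f"{name}: {'ok' if present else 'missing'}")
```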

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality Score

Verified: 98/100

Trust Signals

Last commit: 1 day ago
GitHub owner: pjt222
Stars: 14
Downloads: 308
License: MIT
Website: pjt222.github.io
