
Headless Web Scraping

Skill, verified, active

Extract data from web pages using the scrapling Python library. Select the appropriate fetcher tier (HTTP, stealth Chromium, or full browser automation) based on the target site's defenses, configure headless browsing, and extract structured data with CSS selectors. Use this skill when WebFetch is insufficient: for JS-rendered pages, anti-bot-protected sites, or structured multi-element extraction that requires DOM traversal.

Purpose

Extract structured data from complex or protected web pages that plain HTTP requests cannot retrieve.

Features

Select appropriate fetcher tier (HTTP, Stealthy, Dynamic)
Configure headless browsing and network idle states
Extract structured data using CSS selectors
Handle anti-bot defenses and JS-rendered content
Implement rate limiting and ethical scraping practices
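The tier choice described above can be sketched as a small decision function. This is a minimal illustration, not the skill's actual logic: the names `FetchPlan` and `plan_fetch` are hypothetical, and the mapping (anti-bot defenses to the stealthy tier, JS rendering alone to the dynamic tier, everything else to plain HTTP) is an assumption drawn from the tier descriptions on this page. The `headless` and `network_idle` option names are assumed to match the keyword arguments that scrapling's browser-based fetchers accept.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FetchPlan:
    """Hypothetical container for a chosen tier and its fetch options."""
    tier: str  # "http", "stealthy", or "dynamic"
    options: dict = field(default_factory=dict)


def plan_fetch(js_rendered: bool, anti_bot: bool) -> FetchPlan:
    """Pick the cheapest tier that should cope with the target site.

    Assumed mapping (not confirmed by the skill's source):
      - anti-bot protection  -> stealthy browser tier
      - JS rendering only    -> dynamic (full browser) tier
      - neither              -> plain HTTP tier
    """
    browser_options = {"headless": True, "network_idle": True}
    if anti_bot:
        return FetchPlan("stealthy", dict(browser_options))
    if js_rendered:
        return FetchPlan("dynamic", dict(browser_options))
    return FetchPlan("http", {})


if __name__ == "__main__":
    print(plan_fetch(js_rendered=True, anti_bot=False))
```

Starting from the HTTP tier and moving up only when needed keeps requests fast and avoids launching a browser for pages that do not require one.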

Use Cases

Scraping JS-rendered single-page applications
Bypassing anti-bot protections like Cloudflare Turnstile
Extracting data from dynamically loaded content or complex DOM structures
Gathering data from sites that block basic HTTP requests

Non-Goals

Solving CAPTCHA challenges (e.g., altcha)
Scraping sites that require manual login without additional setup
Replacing dedicated API clients when available

Workflow

Select Fetcher Tier
Configure Fetcher
Fetch and Extract Data
Handle Failures and Edge Cases
Implement Rate Limiting and Ethical Scraping
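The fetch, extract, and failure-handling steps above might look roughly like the following sketch. The helper names `fetch_with_escalation` and `extract_rows` are hypothetical. The code assumes only that a fetched page exposes a `status` attribute and a `css(selector)` method returning a list of elements that each have a `.text` attribute, which is how scrapling's response objects are understood to behave. Fetchers are passed in as plain callables so the escalation logic stays independent of any particular library version.

```python
def fetch_with_escalation(url, fetchers, ok_statuses=(200,)):
    """Try each (tier_name, fetch) pair in order.

    Returns (tier_name, page) for the first response whose status is
    acceptable; raises RuntimeError listing every failure otherwise.
    """
    errors = []
    for name, fetch in fetchers:
        try:
            page = fetch(url)
        except Exception as exc:  # network error, timeout, browser crash
            errors.append(f"{name}: {exc!r}")
            continue
        status = getattr(page, "status", None)
        if status in ok_statuses:
            return name, page
        errors.append(f"{name}: status {status}")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))


def extract_rows(page, item_selector, field_selectors):
    """Build one dict per matched item, using per-field CSS selectors.

    A field whose selector matches nothing is recorded as None so that
    a missing element does not abort the whole extraction.
    """
    rows = []
    for item in page.css(item_selector):
        row = {}
        for field_name, selector in field_selectors.items():
            matches = item.css(selector)
            row[field_name] = str(matches[0].text).strip() if matches else None
        rows.append(row)
    return rows
```

With scrapling installed, the `fetchers` list could plausibly be built from `Fetcher.get`, `StealthyFetcher.fetch`, and `DynamicFetcher.fetch` (imported from `scrapling.fetchers`), ordered from cheapest to most capable; check the installed version's documentation before relying on those names.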

Practices

Web scraping
Data extraction
Ethical automation
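For the rate limiting and ethical scraping step, one possible approach is a minimum delay between requests combined with a robots.txt check. The sketch below uses only the Python standard library; `RateLimiter` and `allowed_by_robots` are hypothetical names, and the page does not state which policy the skill itself applies. The clock and sleep functions are injectable so the limiter can be exercised without real waiting.

```python
import time
from urllib.robotparser import RobotFileParser


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against already-downloaded robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "example-bot", "https://example.com/private/x"))
```

Calling `limiter.wait()` immediately before every fetch keeps the request rate bounded regardless of which fetcher tier is in use.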

Prerequisites

Python 3
scrapling library
Playwright Chromium binary
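A quick way to confirm the Python-side prerequisites is to check the interpreter version and whether the packages can be imported. This is a rough sketch with a hypothetical helper name, `check_prerequisites`. It assumes the browser tiers rely on an importable `playwright` package, and it cannot tell whether the Chromium binary itself has been downloaded; that is usually handled by Playwright's own install command.

```python
import importlib.util
import sys


def check_prerequisites(modules=("scrapling", "playwright")):
    """Report whether Python 3 is running and each module is importable.

    Only looks up module specs; nothing is imported or executed, and the
    presence of a browser binary is NOT verified.
    """
    report = {"python3": sys.version_info.major >= 3}
    for name in modules:
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```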

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality Score

Verified

98/100

Analyzed about 23 hours ago

Trust Signals

Latest commit: 2 days ago

GitHub owner: pjt222

Stars: 14

Downloads: 308

License: MIT

Website: pjt222.github.io


Similar Extensions

Agent Browser

100

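Browser automation CLI for AI agents. Use when the user needs to interact with websites, including navigating pages, filling in forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "log in to a site", "automate browser actions", or any task that requires programmatic web interaction.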

Skill

shanraisshan

Chatgpt Search

100

Search ChatGPT and extract the full response + hydration JSON that powers the UI. Attaches to a running Chrome instance (port 9222 by default), opens ChatGPT, submits a query, waits for the streamed response, and returns structured data: messages, product cards, hydration JSON, and API calls. Use when asked to "search chatgpt", "ask chatgpt", "chatgpt search", "get chatgpt response", or "scrape chatgpt".

Skill

SeifBenayed

Website Extraction Api

100

Extract typed JSON from public website pages using a schema.

Skill

iterationlayer

Extract Supplier Catalog From Website

100

Extract SKUs, product names, unit prices, availability, and minimum order quantities from a supplier catalog page.

Skill

iterationlayer

Extract Public Registry Page

100

Extract organization name, registration number, status, registration date, and officers from a public registry page.

Skill

iterationlayer

Browser Extract

Extract structured data via stored browser templates or one-shot DOM queries, with mandatory AIDefence PII and prompt-injection gates applied before content reaches the model.

Skill

ruvnet