
Headless Web Scraping

Skill Verified Active

Extract data from web pages using the scrapling Python library — select the appropriate fetcher tier (HTTP, stealth Chromium, or full browser automation) based on target site defenses, configure headless browsing, and extract structured data with CSS selectors. Use when WebFetch is insufficient for JS-rendered pages, anti-bot-protected sites, or structured multi-element extraction requiring DOM traversal.

Purpose

Extract structured data from complex or protected web pages that cannot be accessed by simpler HTTP requests.

Features

  • Select appropriate fetcher tier (HTTP, Stealthy, Dynamic)
  • Configure headless browsing and network idle states
  • Extract structured data using CSS selectors
  • Handle anti-bot defenses and JS-rendered content
  • Implement rate limiting and ethical scraping practices
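The tier choice above can be read as an escalation: start with the cheapest fetcher and move up only when the response looks blocked or empty. The following is a minimal sketch of that idea, not scrapling's own logic. The names `TIERS`, `looks_blocked`, and `fetch_with_escalation` are hypothetical, the status-code heuristic is an assumption, and each fetcher is passed in as a plain callable so the sketch runs without any browser installed.

```python
from typing import Callable, Dict, Tuple

# Tier names in escalation order, cheapest first (labels are illustrative).
TIERS = ("http", "stealthy", "dynamic")

# A fetcher here is any callable that takes a URL and returns (status, body).
FetchFn = Callable[[str], Tuple[int, str]]


def looks_blocked(status: int, body: str) -> bool:
    """Heuristic (an assumption, not part of scrapling): treat common
    anti-bot status codes or an empty body as a blocked or unrendered page."""
    return status in (403, 429, 503) or not body.strip()


def fetch_with_escalation(url: str, fetchers: Dict[str, FetchFn]) -> Tuple[str, str]:
    """Try each available tier in order and return (tier_name, body) for the
    first response that does not look blocked."""
    for tier in TIERS:
        fetch = fetchers.get(tier)
        if fetch is None:
            continue  # this tier was not configured; skip it
        status, body = fetch(url)
        if not looks_blocked(status, body):
            return tier, body
    raise RuntimeError(f"all tiers failed for {url}")
```

In practice each callable would wrap one of the library's fetchers. Keeping them injectable makes the escalation rule easy to test with stubs.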

Use cases

  • Scraping JS-rendered single-page applications
  • Bypassing anti-bot protections like Cloudflare Turnstile
  • Extracting data from dynamically loaded content or complex DOM structures
  • Gathering data from sites that block basic HTTP requests

Non-goals

  • Solving CAPTCHA challenges (e.g., altcha)
  • Scraping sites that require manual login without additional setup
  • Replacing dedicated API clients when available

Workflow

  1. Select Fetcher Tier
  2. Configure Fetcher
  3. Fetch and Extract Data
  4. Handle Failures and Edge Cases
  5. Implement Rate Limiting and Ethical Scraping
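The five steps above might be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's actual implementation: `MinIntervalLimiter` and `scrape_text` are hypothetical names, the 2-second default interval is arbitrary, and the scrapling calls assume a release that exposes `StealthyFetcher` in `scrapling.fetchers` with `headless` and `network_idle` keyword arguments. Check the installed version before relying on them.

```python
import time
from typing import Callable, Dict, List, Optional


class MinIntervalLimiter:
    """Enforce a minimum gap between consecutive requests (step 5)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


def scrape_text(urls: List[str], selector: str,
                min_interval: float = 2.0) -> Dict[str, List[str]]:
    """Fetch each URL with a stealth browser tier and collect the text of
    every element matching a CSS selector (steps 1 to 4)."""
    # Imported lazily so this module loads even without scrapling installed.
    # Assumption: this import path and these keyword arguments match the
    # installed scrapling release.
    from scrapling.fetchers import StealthyFetcher

    limiter = MinIntervalLimiter(min_interval)
    results: Dict[str, List[str]] = {}
    for url in urls:
        limiter.wait()  # be polite: space out requests to the same site
        page = StealthyFetcher.fetch(url, headless=True, network_idle=True)
        if page.status != 200:
            results[url] = []  # record the failure instead of raising
            continue
        results[url] = [el.text for el in page.css(selector)]
    return results
```

The limiter takes its clock and sleep function as parameters, so the pacing logic can be checked without real delays. A call such as `scrape_text(["https://example.com"], "h1")` would need scrapling and its browser binary installed.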

Practices

  • Web scraping
  • Data extraction
  • Ethical automation

Prerequisites

  • Python 3
  • scrapling library
  • Playwright Chromium binary
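A quick preflight check for the Python-side prerequisites could look like the sketch below. The helper name `missing_prerequisites` is hypothetical, and the assumption that the browser tooling is importable as a module named `playwright` may not hold for every scrapling release. The check does not verify that the Chromium binary itself has been downloaded (with Playwright this is typically done via `playwright install chromium`).

```python
import importlib.util
from typing import List, Sequence


def missing_prerequisites(
    modules: Sequence[str] = ("scrapling", "playwright"),
) -> List[str]:
    """Return a human-readable problem for each module that cannot be found.

    An empty list means every listed module is importable. The default
    module names are assumptions about how the prerequisites are packaged.
    """
    problems: List[str] = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not importable: {name}")
    return problems
```

Running the check before the first fetch gives a clearer error than a failed import deep inside a scraping run.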

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality score

Verified
98/100
Analyzed about 23 hours ago

Trust signals

Last commit: 2 days ago
Stars: 14
License: MIT

Similar extensions

Agent Browser

100

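A browser automation CLI for AI agents. Use when the user needs to interact with websites, including browsing pages, filling in forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "log in to a site", "automate browser actions", or any task that requires programmatic web interaction.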

Skill
shanraisshan

Chatgpt Search

100

Search ChatGPT and extract the full response + hydration JSON that powers the UI. Attaches to a running Chrome instance (port 9222 by default), opens ChatGPT, submits a query, waits for the streamed response, and returns structured data: messages, product cards, hydration JSON, and API calls. Use when asked to "search chatgpt", "ask chatgpt", "chatgpt search", "get chatgpt response", or "scrape chatgpt".

Skill
SeifBenayed

Website Extraction Api

100

Extract typed JSON from public website pages using a schema.

Skill
iterationlayer

Extract Supplier Catalog From Website

100

Extract SKUs, product names, unit prices, availability, and minimum order quantities from a supplier catalog page.

Skill
iterationlayer

Extract Public Registry Page

100

Extract organization name, registration number, status, registration date, and officers from a public registry page.

Skill
iterationlayer

Browser Extract

99

Extract structured data via stored browser-templates or one-shot DOM queries, with mandatory AIDefence PII + prompt-injection gates before content reaches the model

Skill
ruvnet