
web-scraper

by yfe404 · yfe404/web-scraper

Phased reconnaissance scraper that prefers APIs > sitemaps > HTML, picks the right Apify template (Cheerio vs Playwright), and only escalates anti-detection when signals warrant.

Intelligent web scraping via Claude Code. Runs a six-phase reconnaissance (Phases 0–5) that tries the cheapest extraction path first: public APIs, sitemap feeds, structured data. Only when those fail does it consider browser automation, and only when protection signals appear does it layer on stealth. Built around TypeScript-first Apify Actor development, with two templates: Cheerio for static sites and Playwright for JS-heavy ones.
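The escalation order described above can be sketched as a small decision function. This is a hypothetical illustration — the signal names below are assumptions, and the skill's real recon output is richer:

```typescript
// Hypothetical recon signals; the skill's actual phase output is richer.
interface ReconSignals {
  hasPublicApi: boolean; // Phase 1: clean JSON endpoints discovered
  hasSitemap: boolean;   // sitemap.xml or feeds available
  jsRendered: boolean;   // content only appears after JS execution
  botBlocked: boolean;   // 403s, CAPTCHAs, bot-block pages observed
}

// Cheapest viable path first; stealth is never the default answer.
function pickExtractionPath(s: ReconSignals): string {
  if (s.botBlocked) return "stop-and-review"; // re-check ToS before anything else
  if (s.hasPublicApi) return "json-api";      // most stable, no HTML parsing
  if (!s.jsRendered) return "cheerio";        // static HTML via sitemap or crawl
  return "playwright";                        // browser automation as last resort
}
```

For example, a static site with a sitemap and no API resolves to "cheerio" — not Playwright.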


Installation

Choose your client

~/Library/Application Support/Claude/claude_desktop_config.json  · Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "web-scraper-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/yfe404/web-scraper",
        "~/.claude/skills/web-scraper"
      ],
      "_inferred": true
    }
  }
}

Open Claude Desktop → Settings → Developer → Edit Config. Restart after saving.

~/.cursor/mcp.json · .cursor/mcp.json
{
  "mcpServers": {
    "web-scraper-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/yfe404/web-scraper",
        "~/.claude/skills/web-scraper"
      ],
      "_inferred": true
    }
  }
}

Cursor uses the same mcpServers schema as Claude Desktop. The project config takes precedence over the global one.

VS Code → Cline → MCP Servers → Edit
{
  "mcpServers": {
    "web-scraper-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/yfe404/web-scraper",
        "~/.claude/skills/web-scraper"
      ],
      "_inferred": true
    }
  }
}

Click the MCP Servers icon in the Cline sidebar, then "Edit Configuration".

~/.codeium/windsurf/mcp_config.json
{
  "mcpServers": {
    "web-scraper-skill": {
      "command": "git",
      "args": [
        "clone",
        "https://github.com/yfe404/web-scraper",
        "~/.claude/skills/web-scraper"
      ],
      "_inferred": true
    }
  }
}

Same format as Claude Desktop. Restart Windsurf to apply.

~/.continue/config.json
{
  "mcpServers": [
    {
      "name": "web-scraper-skill",
      "command": "git",
      "args": [
        "clone",
        "https://github.com/yfe404/web-scraper",
        "~/.claude/skills/web-scraper"
      ]
    }
  ]
}

Continue uses an array of server objects rather than a map.

~/.config/zed/settings.json
{
  "context_servers": {
    "web-scraper-skill": {
      "command": {
        "path": "git",
        "args": [
          "clone",
          "https://github.com/yfe404/web-scraper",
          "~/.claude/skills/web-scraper"
        ]
      }
    }
  }
}

Add it under context_servers. Zed reloads automatically.

claude mcp add web-scraper-skill -- git clone https://github.com/yfe404/web-scraper ~/.claude/skills/web-scraper

A one-liner. Verify: claude mcp list. Remove: claude mcp remove.

Use cases

Real-world scenarios: web-scraper

Scrape a static listing site into a structured dataset

👤 Data engineers pulling public data (directories, price lists, public records) ⏱ ~45 min · intermediate

When to use: You need a dataset from a public site that doesn't have an API.

Prerequisites
  • Skill installed — git clone https://github.com/yfe404/web-scraper ~/.claude/skills/web-scraper
  • Node 20 for Apify Actors — nvm install 20
Flow
  1. Let the skill do recon
    Use web-scraper. Target: https://example.com/listings. I want name + URL + category. Recon first — tell me the cheapest extraction path.
    → Skill reports: 'sitemap.xml available, use Cheerio'
  2. Scaffold the Apify Actor
    Scaffold a TypeScript Apify Cheerio actor for that extraction.
    → Actor tree + main.ts ready to run
  3. Run and iterate
    Run locally on 10 pages; tighten selectors if needed.
    → Clean JSON output

Outcome: An Apify Actor you can deploy for scheduled scrapes.

Pitfalls
  • Jumping to Playwright when Cheerio would do — trust the recon; headful browsers 10x the cost unnecessarily
Combine with: apify · filesystem

Discover and use a site's undocumented JSON API instead of HTML

👤 Scraper devs who want reliability ⏱ ~30 min · intermediate

When to use: The page is a SPA and the HTML is messy, but the XHR calls return clean JSON.

Flow
  1. Run API-discovery phase
    Use web-scraper phase 1 — API discovery on https://example.com/app. Enumerate XHR/fetch endpoints.
    → List of endpoints with observed payloads
  2. Build the JSON-based actor
    Generate an actor that hits those endpoints directly with auth as needed.
    → Lightweight fetch-based actor

Outcome: A far more stable scrape than parsing HTML.

Pitfalls
  • Private/session-auth APIs that break when the token rotates — plan token-refresh logic or fall back to the browser flow
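The fetch-based core of such an actor might look like the sketch below. The endpoint path, paging parameter, and field names are assumptions for illustration — substitute whatever Phase 1 actually discovered. The fetcher is injected so the logic can be tested without network access:

```typescript
// Minimal fetch-based extraction core; inject fetchImpl (e.g. global fetch) at the call site.
type JsonFetcher = (url: string) => Promise<{ ok: boolean; status: number; json(): Promise<any> }>;

export async function fetchListingsPage(
  baseUrl: string,
  page: number,
  fetchImpl: JsonFetcher,
): Promise<Array<{ id: string; name: string }>> {
  // Assumed endpoint shape: GET /api/listings?page=N -> { items: [{ id, name }] }
  const res = await fetchImpl(`${baseUrl}/api/listings?page=${page}`);
  if (!res.ok) throw new Error(`listings page ${page} failed: HTTP ${res.status}`);
  const body = await res.json();
  return (body.items ?? []).map((it: any) => ({ id: String(it.id), name: String(it.name) }));
}
```

Because the JSON shape is a contract the site's own frontend depends on, it tends to break far less often than CSS selectors do.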

Combinations

Combine with other MCPs for a 10x effect

web-scraper-skill + apify

Deploy the scaffolded actor to Apify for scheduled runs

Deploy this actor to my Apify account and schedule it daily.
web-scraper-skill + filesystem

Keep the actor code in-repo alongside the consuming app

Scaffold into scrapers/ and commit with the main project.

Tools

What this MCP provides

Tool             Inputs                                      When to call                  Cost
recon            url                                         Always first                  0
scaffold_actor   template (cheerio|playwright), target       After recon picks a template  0
record_session   url                                         Debugging dynamic sites       0
run_local        actor path, limit                           Iteration phase               0

Cost and limits

What it costs

API quota: Apify has its own compute + proxy quotas
Tokens per call: Moderate — scaffold and iteration loops
Money: Free skill; Apify costs are separate
Tip: Prefer Cheerio — one of the cheapest run profiles on Apify

Security

Permissions, secrets, blast radius

Minimal scopes: Apify API token with actor:read + actor:write
Credential storage: APIFY_TOKEN in env
Outbound traffic: Whatever sites you target, plus the Apify platform

Troubleshooting

Common errors and fixes

Cheerio returns empty selectors

Content is JS-rendered — rerun recon and expect the Playwright template

Playwright times out

Bump the navigation timeout; consider waiting for a specific selector instead of networkidle
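A configuration sketch of that fix, assuming crawlee's PlaywrightCrawler; the selector is a placeholder:

```typescript
import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  navigationTimeoutSecs: 120, // bump from the 60 s default
  async requestHandler({ page, request, pushData }) {
    // Wait for the content you actually need instead of 'networkidle',
    // which may never fire on pages with long-polling or analytics beacons.
    await page.waitForSelector(".listing-card", { timeout: 30_000 }); // placeholder selector
    await pushData({ url: request.url, title: await page.title() });
  },
});
```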

403 / bot-block page

Stop and reconsider. This is the legit signal to re-check ToS, not a cue to escalate stealth.

Alternatives

web-scraper compared

Alternative            When to use                                                   Trade-off
Direct Apify console   You already know which template you need                      No recon phase
firecrawl              You just need markdown of a page, not structured extraction   No actor scaffolding

More

Resources

📖 Read the official README on GitHub

🐙 Open issues

🔍 All 400+ MCP servers and Skills