
CLIP

Skill Verified Active

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Purpose

To provide a powerful, zero-shot capability for understanding and relating images and text, useful for a wide range of AI-driven tasks without requiring custom model training.

Features

  • Zero-shot image classification
  • Image-text matching and similarity
  • Semantic image search
  • Content moderation
  • Visual question answering
  • Cross-modal retrieval

Use Cases

  • Use for image search based on natural language queries.
  • Use for content moderation to detect inappropriate or sensitive content.
  • Use for classifying images into categories without prior training data.
  • Use for visual question answering tasks on images.
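The classification use case above can be sketched with the Hugging Face transformers API. This is a minimal, illustrative example: the `openai/clip-vit-base-patch32` checkpoint is one common choice, and the image path and candidate labels are placeholders you would replace with your own.

```python
# Zero-shot image classification with CLIP: score an image against a set of
# natural-language labels without any task-specific training.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # placeholder labels
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each text label;
# softmax turns the scores into a probability distribution over labels.
probs = outputs.logits_per_image.softmax(dim=1)
print(labels[probs.argmax(dim=1).item()])
```

The same pattern covers content moderation: swap the labels for prompts such as "safe content" versus categories you want to flag, and threshold the resulting probabilities.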

Non-Goals

  • Image segmentation tasks.
  • Advanced image captioning (use BLIP-2 instead).
  • Vision-language chat applications (use LLaVA instead).

Workflow

  1. Load CLIP model and preprocessor.
  2. Prepare image and text inputs.
  3. Encode image and/or text features.
  4. Compute similarity scores or probabilities.
  5. Interpret results for classification, search, or moderation.
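The workflow above can be sketched for the semantic image search case: encode a gallery of images once, encode a text query, then rank images by cosine similarity. This is a sketch assuming the Hugging Face transformers API; the checkpoint name, file names, and query are illustrative placeholders.

```python
# Text-to-image retrieval with CLIP: embed images and a text query into the
# same space, then rank images by cosine similarity to the query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["beach.jpg", "city.jpg", "forest.jpg"]  # placeholder gallery
images = [Image.open(p) for p in paths]

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=["a sunny beach"], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)

# Normalize so the dot product equals cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)

# Print the gallery ranked from most to least similar to the query.
for path, score in sorted(zip(paths, scores.tolist()), key=lambda t: -t[1]):
    print(f"{score:.3f}  {path}")
```

In a real search service you would compute and cache the image embeddings ahead of time, so each query only requires one text encoding and a vector similarity lookup.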

Prerequisites

  • Python 3.7+
  • PyTorch and torchvision
  • transformers library
  • Pillow library
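Assuming a standard pip environment, the prerequisites above can be installed in one step (package names follow the usual PyPI conventions for these libraries):

```shell
# Install PyTorch, torchvision, the Hugging Face transformers library, and Pillow
pip install torch torchvision transformers pillow
```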

Trust

  • Issues (attention): There are 17 open issues and 4 closed issues in the last 90 days. The closure rate is low, suggesting maintainers may respond slowly.

Code Execution

  • Validation: The Python code includes basic image and text processing, but parameter validation via a schema library is not explicitly demonstrated or used.
  • Error handling: The provided Python code includes basic error handling for file operations but does not implement structured error reporting with retryable flags or hints for the agent.

Errors

  • Actionable error messages: The Python code includes basic error handling for file loading, but error messages are standard Python exceptions and do not provide specific remediation steps or doc links for the agent.

Practical Utility

  • Edge cases: The 'Limitations' section in SKILL.md names several edge cases, such as dataset biases and limited spatial understanding, but does not provide specific recovery steps for each.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality Score

Verified
95/100
Analyzed about 23 hours ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT

Similar Extensions

Clip

98

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill
Orchestra-Research

Blip 2 Vision Language

98

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
Orchestra-Research

Baoyu Imagine

99

AI image generation with OpenAI GPT Image 2, Azure OpenAI, Google, OpenRouter, DashScope, Z.AI GLM-Image, MiniMax, Jimeng, Seedream and Replicate APIs. Supports text-to-image, reference images, aspect ratios, and batch generation from saved prompt files. Sequential by default; use batch parallel generation when the user already has multiple prompts or wants stable multi-image throughput. Use when user asks to generate, create, or draw images.

Skill
jimliu

Whisper

97

OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.

Skill
Orchestra-Research

Llava

96

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Skill
Orchestra-Research

Segment Anything Model

95

Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.

Skill
davila7