CLIP
Skill · Verified · Active
OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
Provides a powerful zero-shot capability for understanding and relating images and text, useful for a wide range of AI-driven tasks without requiring custom model training.
Features
- Zero-shot image classification
- Image-text matching and similarity
- Semantic image search
- Content moderation
- Visual question answering
- Cross-modal retrieval
Use Cases
- Use for image search based on natural language queries (see the retrieval sketch after this list).
- Use for content moderation to detect inappropriate or sensitive content.
- Use for classifying images into categories without prior training data.
- Use for visual question answering tasks on images.
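The retrieval use case can be sketched with the Hugging Face transformers CLIP API. The checkpoint name, file paths, and query text below are illustrative assumptions, not values shipped with the skill:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP checkpoint compatible with transformers works similarly.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["beach.jpg", "office.jpg", "forest.jpg"]  # placeholder local files
images = [Image.open(p).convert("RGB") for p in image_paths]
query = "a sunny beach with palm trees"

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)

# Normalize embeddings and rank images by cosine similarity to the query.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (image_embeds @ text_embeds.T).squeeze(-1)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```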
Non-Goals
- Not for image segmentation tasks.
- Not for advanced image captioning (use BLIP-2 instead).
- Not for vision-language chat applications (use LLaVA instead).
Workflow
- Load CLIP model and preprocessor.
- Prepare image and text inputs.
- Encode image and/or text features.
- Compute similarity scores or probabilities.
- Interpret results for classification, search, or moderation (see the sketch below).
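A minimal sketch of this workflow for zero-shot classification, assuming the Hugging Face transformers implementation of CLIP and the openai/clip-vit-base-patch32 checkpoint (the image file and candidate labels are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load CLIP model and preprocessor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare image and text inputs (placeholder file and labels).
image = Image.open("photo.jpg").convert("RGB")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# Encode image and text features and compute similarity scores.
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity; softmax turns it into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```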
Prerequisites
- Python 3.7+
- PyTorch and torchvision
- transformers library
- Pillow library
Trust
- Issues (attention): There are 17 open issues and 4 closed issues in the last 90 days. The closure rate is low, suggesting maintainers may respond slowly.
Code Execution
- Validation: The Python code includes basic image and text processing, but parameter validation via a schema library is not explicitly demonstrated or used.
- Error handling: The provided Python code includes basic error handling for file operations but does not implement structured error reporting with retryable flags or hints for the agent.
Errors
- Actionable error messages: The Python code includes basic error handling for file loading, but error messages are standard Python exceptions and do not provide specific remediation steps or documentation links for the agent (see the sketch below for what such messages could look like).
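For illustration only, a hypothetical sketch of more actionable error reporting around image loading; the wrapper function, hint text, and retryable flag are assumptions, not part of the skill's actual code:

```python
from PIL import Image, UnidentifiedImageError

def load_image(path: str) -> Image.Image:
    """Hypothetical loader that raises errors with remediation hints for the agent."""
    try:
        return Image.open(path).convert("RGB")
    except FileNotFoundError:
        raise RuntimeError(
            f"Image not found at '{path}'. "
            "Hint: check the path or pass an absolute path. (retryable: no)"
        )
    except UnidentifiedImageError:
        raise RuntimeError(
            f"'{path}' is not a readable image. "
            "Hint: supported formats include JPEG and PNG. (retryable: no)"
        )
```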
Practical Utility
- Edge cases: The 'Limitations' section in SKILL.md names several edge cases, such as dataset biases and limited spatial understanding, but does not provide specific recovery steps for each.
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.
Quality Score
Verified · Trust Signals
Similar Extensions
Clip (Score: 98)
OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
Blip 2 Vision Language (Score: 98)
Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
Baoyu Imagine (Score: 99)
AI image generation with OpenAI GPT Image 2, Azure OpenAI, Google, OpenRouter, DashScope, Z.AI GLM-Image, MiniMax, Jimeng, Seedream and Replicate APIs. Supports text-to-image, reference images, aspect ratios, and batch generation from saved prompt files. Sequential by default; use batch parallel generation when the user already has multiple prompts or wants stable multi-image throughput. Use when user asks to generate, create, or draw images.
Whisper (Score: 97)
OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.
Llava (Score: 96)
Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
Segment Anything Model (Score: 95)
Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.