
CLIP

Skill · Verified · Active

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Purpose

To provide a powerful, zero-shot capability for understanding and relating images and text, useful for a wide range of AI-driven tasks without requiring custom model training.

Features

  • Zero-shot image classification
  • Image-text matching and similarity
  • Semantic image search
  • Content moderation
  • Visual question answering
  • Cross-modal retrieval

Use Cases

  • Use for image search based on natural language queries.
  • Use for content moderation to detect inappropriate or sensitive content.
  • Use for classifying images into categories without prior training data.
  • Use for visual question answering tasks on images.
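The search and matching use cases above all reduce to cosine similarity between CLIP's image and text embeddings. A minimal sketch of the retrieval step, using synthetic vectors in place of real CLIP embeddings (real ViT-B/32 embeddings are 512-dimensional):

```python
import numpy as np

def cosine_search(query_vec, index_vecs, top_k=3):
    """Rank indexed embeddings by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    idx = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    scores = idx @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Synthetic stand-ins for CLIP embeddings.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 512))             # e.g. encoded images
query = gallery[42] + 0.1 * rng.normal(size=512)  # a query close to image 42
ranks, scores = cosine_search(query, gallery)
```

In a real pipeline the `gallery` rows would come from `get_image_features` and the query from `get_text_features`; because both encoders project into the same space, the same ranking code serves image search, deduplication, and moderation-by-similarity.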

Non-goals

  • Image segmentation tasks.
  • Advanced image captioning (BLIP-2 is suggested instead).
  • Vision-language chat applications (LLaVA is suggested instead).

Workflow

  1. Load CLIP model and preprocessor.
  2. Prepare image and text inputs.
  3. Encode image and/or text features.
  4. Compute similarity scores or probabilities.
  5. Interpret results for classification, search, or moderation.
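The workflow above can be sketched with the Hugging Face transformers API (the checkpoint name and labels are illustrative; the model weights are downloaded on first run):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# 1. Load the model and preprocessor (ViT-B/32 is the smallest common checkpoint).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 2. Prepare inputs: an image plus candidate labels phrased as captions.
#    A solid-color image stands in here; use Image.open("your.jpg") in practice.
image = Image.new("RGB", (224, 224), color=(200, 30, 30))
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# 3-4. Encode both modalities and compute image-text similarity logits.
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # shape: (1, len(labels))

# 5. The highest-probability label is the zero-shot prediction.
print(labels[probs.argmax().item()])
```

Phrasing labels as full captions ("a photo of a ...") rather than bare class names matches CLIP's training distribution and typically improves zero-shot accuracy.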

Prerequisites

  • Python 3.7+
  • PyTorch and torchvision
  • transformers library
  • Pillow library

Trust

  • Issues Attention: There are 17 open issues and 4 closed issues in the last 90 days. The closure rate is low, suggesting maintainers may respond slowly.

Code Execution

  • Validation: The Python code includes basic image and text processing, but parameter validation via a schema library is not explicitly demonstrated or used.
  • Error Handling: The provided Python code includes basic error handling for file operations but does not implement structured error reporting with retryable flags or hints for the agent.

Errors

  • Actionable error messages: The Python code includes basic error handling for file loading, but error messages are standard Python exceptions and do not provide specific remediation steps or doc links for the agent.

Practical Utility

  • Edge cases: The 'Limitations' section in SKILL.md names several edge cases, such as dataset biases and limited spatial understanding, but does not provide specific recovery steps for each.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.), and assumes the repository follows the agentskills.io format.

Quality Score

Verified
95/100
Analyzed 1 day ago

Trust Signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT
Status
View source code

Similar Extensions

Clip

98

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill
Orchestra-Research

Blip 2 Vision Language

98

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
Orchestra-Research

Baoyu Imagine

99

AI image generation with OpenAI GPT Image 2, Azure OpenAI, Google, OpenRouter, DashScope, Z.AI GLM-Image, MiniMax, Jimeng, Seedream and Replicate APIs. Supports text-to-image, reference images, aspect ratios, and batch generation from saved prompt files. Sequential by default; use batch parallel generation when the user already has multiple prompts or wants stable multi-image throughput. Use when user asks to generate, create, or draw images.

Skill
jimliu

Whisper

97

OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.

Skill
Orchestra-Research

Llava

96

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Skill
Orchestra-Research

Segment Anything Model

95

Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.

Skill
davila7