CLIP
OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
This skill enables AI agents to understand and process images in conjunction with natural language, facilitating tasks like image search and classification without fine-tuning.
Features
- Zero-shot image classification (see the sketch after this list)
- Image-text similarity and matching
- Semantic image search
- Cross-modal retrieval (image-to-text, text-to-image)
- Content moderation capabilities
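To make the zero-shot classification flow concrete, here is a minimal sketch using the Hugging Face CLIP classes; the checkpoint name, image path, and label prompts are illustrative placeholders, not taken from the skill's own code.

```python
# Minimal zero-shot classification sketch with CLIP via Hugging Face transformers.
# Checkpoint, image path, and labels below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each candidate prompt
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The prompt template ("a photo of a ...") noticeably affects zero-shot accuracy; the skill may use its own templates.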
Use Cases
- Use when performing zero-shot image classification on custom datasets.
- Use when needing to find images semantically related to a text description (see the retrieval sketch after this list).
- Use for content moderation tasks like detecting NSFW or violent imagery.
- Use for building vision-language applications requiring understanding of image content.
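As a rough illustration of the semantic search use case, the sketch below embeds a text query and a small gallery of images with CLIP and ranks the images by cosine similarity; the file names, query text, and checkpoint are assumptions for illustration only.

```python
# Minimal text-to-image retrieval sketch: embed a query, rank image embeddings
# by cosine similarity. Paths, query, and checkpoint are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["beach.jpg", "city.jpg", "forest.jpg"]  # placeholder gallery
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    text_inputs = processor(text=["a sunny beach with palm trees"],
                            return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

# Normalize embeddings and rank gallery images by cosine similarity to the query
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```

For large galleries, the image embeddings would typically be precomputed once and stored in a vector index rather than recomputed per query.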
Non-Goals
- Do not use for fine-grained object detection or segmentation.
- Do not use for tasks requiring extensive fine-tuning on domain-specific data.
- Do not use when only text-based analysis is required.
- Do not use for tasks where spatial understanding (position, counting) is critical.
Code Execution
- Info (Validation): While the code uses standard libraries and parameter passing, explicit schema validation for inputs such as file paths is not evident in the provided snippets.
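For illustration only, the kind of input check this note refers to might look like the following; the helper name, allowed suffixes, and error handling are hypothetical and not part of the skill's code.

```python
# Hypothetical input check: confirm a user-supplied path exists and points to a
# supported image file before it is passed to the model.
from pathlib import Path

ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # illustrative whitelist

def validate_image_path(raw_path: str) -> Path:
    path = Path(raw_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Image file not found: {path}")
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unsupported image type: {path.suffix}")
    return path
```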
Execution
- Warning (Pinned dependencies): While dependencies are listed, they are not explicitly pinned to versions or accompanied by a lockfile in the SKILL.md, which could lead to compatibility issues.
Installation
First, add the marketplace:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the skill:
/plugin install AI-Research-SKILLs@ai-research-skills
Similar Extensions
CLIP (score: 95)
OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
Blip 2 Vision Language (score: 98)
Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
Baoyu Imagine (score: 99)
AI image generation with OpenAI GPT Image 2, Azure OpenAI, Google, OpenRouter, DashScope, Z.AI GLM-Image, MiniMax, Jimeng, Seedream and Replicate APIs. Supports text-to-image, reference images, aspect ratios, and batch generation from saved prompt files. Sequential by default; use batch parallel generation when the user already has multiple prompts or wants stable multi-image throughput. Use when user asks to generate, create, or draw images.
Whisper (score: 97)
OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.
Llava (score: 96)
Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
Segment Anything Model (score: 95)
Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.