CLIP
OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
This skill enables AI agents to understand and process images in conjunction with natural language, facilitating tasks like image search and classification without fine-tuning.
Features
- Zero-shot image classification (see the sketch after this list)
- Image-text similarity and matching
- Semantic image search
- Cross-modal retrieval (image-to-text, text-to-image)
- Content moderation capabilities
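To make the zero-shot classification flow concrete, here is a minimal sketch using the Hugging Face CLIP classes; the checkpoint name, image path, and label prompts are illustrative placeholders, not taken from the skill's own code.

```python
# Minimal zero-shot classification sketch with CLIP via Hugging Face transformers.
# Checkpoint, image path, and labels below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each candidate prompt
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The prompt template ("a photo of a ...") noticeably affects zero-shot accuracy; the skill may use its own templates.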
Use Cases
- Use when performing zero-shot image classification on custom datasets.
- Use when needing to find images semantically related to a text description (see the retrieval sketch after this list).
- Use for content moderation tasks like detecting NSFW or violent imagery.
- Use for building vision-language applications requiring understanding of image content.
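As a rough illustration of the semantic search use case, the sketch below embeds a text query and a small gallery of images with CLIP and ranks the images by cosine similarity; the file names, query text, and checkpoint are assumptions for illustration only.

```python
# Minimal text-to-image retrieval sketch: embed a query, rank image embeddings
# by cosine similarity. Paths, query, and checkpoint are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["beach.jpg", "city.jpg", "forest.jpg"]  # placeholder gallery
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    text_inputs = processor(text=["a sunny beach with palm trees"],
                            return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

# Normalize embeddings and rank gallery images by cosine similarity to the query
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```

For large galleries, the image embeddings would typically be precomputed once and stored in a vector index rather than recomputed per query.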
Non-Goals
- Do not use for fine-grained object detection or segmentation.
- Do not use for tasks requiring extensive fine-tuning on domain-specific data.
- Do not use when only text-based analysis is required.
- Do not use for tasks where spatial understanding (position, counting) is critical.
Code Execution
- Info (Validation): While the code uses standard libraries and parameter passing, explicit schema validation for inputs such as file paths is not evident in the provided snippets.
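For illustration only, the kind of input check this note refers to might look like the following; the helper name, allowed suffixes, and error handling are hypothetical and not part of the skill's code.

```python
# Hypothetical input check: confirm a user-supplied path exists and points to a
# supported image file before it is passed to the model.
from pathlib import Path

ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # illustrative whitelist

def validate_image_path(raw_path: str) -> Path:
    path = Path(raw_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Image file not found: {path}")
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unsupported image type: {path.suffix}")
    return path
```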
Execution
- Warning (Pinned dependencies): While dependencies are listed, they are not explicitly pinned to versions or accompanied by a lockfile in the SKILL.md, which could lead to compatibility issues.
Installation
First, add the marketplace:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the skill:
/plugin install AI-Research-SKILLs@ai-research-skills
Similar Extensions
CLIP (score: 95)
OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
Blip 2 Vision Language (score: 98)
Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
Baoyu Imagine (score: 99)
AI image generation with OpenAI GPT Image 2, Azure OpenAI, Google, OpenRouter, DashScope, Z.AI GLM-Image, MiniMax, Jimeng, Seedream and Replicate APIs. Supports text-to-image, reference images, aspect ratios, and batch generation from saved prompt files. Sequential by default; use batch parallel generation when the user already has multiple prompts or wants stable multi-image throughput. Use when user asks to generate, create, or draw images.
Whisper (score: 97)
OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.
Llava (score: 96)
Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
Segment Anything Model (score: 95)
Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.