
LLaVA

Skill · Verified · Active

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
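
A minimal single-turn example helps ground the description. The sketch below assumes the llava-hf/llava-1.5-7b-hf checkpoint on Hugging Face and the transformers LlavaForConditionalGeneration / AutoProcessor APIs; neither is mandated by this skill, and any LLaVA-1.5 checkpoint should follow the same pattern.

    import requests
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # Any accessible image works; this URL is only an example.
    url = "https://llava-vl.github.io/static/images/view.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # LLaVA-1.5 prompt convention: the <image> placeholder marks where the
    # CLIP-encoded image features enter the language model's context.
    prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output[0], skip_special_tokens=True))

The decoded output echoes the prompt followed by the assistant's answer; slice off the prompt tokens if you only want the reply.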

Purpose

To enable AI agents to conduct visual instruction tuning and engage in image-based conversations, facilitating applications like vision-language chatbots and sophisticated image analysis.

Features

  • Visual instruction tuning
  • Image-based conversations
  • Multi-turn image chat
  • Visual question answering (VQA)
  • Supports multiple model sizes (7B-34B) with quantization (see the loading sketch below)
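
Quantized loading is what makes the larger checkpoints practical on a single GPU. A sketch of 4-bit loading, assuming the bitsandbytes-backed BitsAndBytesConfig in transformers and the llava-hf/llava-1.5-13b-hf checkpoint (both assumptions, not requirements of this skill):

    import torch
    from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

    # 4-bit NF4 quantization cuts weight memory to roughly a quarter of fp16.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model_id = "llava-hf/llava-1.5-13b-hf"  # assumed; the 7B checkpoint loads the same way
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, quantization_config=quant_config, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

The 34B-class variants are distributed as LLaVA-NeXT checkpoints and typically load through the separate LlavaNextForConditionalGeneration class rather than the one shown here.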

Use Cases

  • Building vision-language chatbots
  • Performing visual question answering
  • Generating detailed image captions
  • Engaging in multi-turn image dialogues (see the multi-turn sketch below)
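
Multi-turn chat works by replaying the earlier turns in the prompt so the model keeps its context. Continuing the single-turn sketch in the description above (reusing its model, processor, and image), a follow-up exchange might look like this; the prompt format is the LLaVA-1.5 convention, and recent transformers releases can also build it via the processor's chat template:

    # Turn 1: ask a question and keep only the newly generated answer tokens.
    turn1 = "USER: <image>\nWhat is in this photo? ASSISTANT:"
    inputs = processor(images=image, text=turn1, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    answer1 = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Turn 2: replay the whole conversation, including the <image> placeholder,
    # then append the new user question.
    turn2 = f"{turn1} {answer1} USER: What time of day does it look like? ASSISTANT:"
    inputs = processor(images=image, text=turn2, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))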

Non-Goals

  • Being a simple zero-shot classifier like CLIP
  • Performing only image captioning like BLIP-2
  • Being purely API-based without local model options

Practical Utility

  • Edge cases: The 'Limitations' section in SKILL.md addresses potential issues like hallucinations, spatial reasoning struggles, and VRAM requirements, but doesn't detail specific recovery steps for each.

Installation

First, add the marketplace, then install the skill:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
96/100
Analyzed about 21 hours ago

Trust Signals

Last commit: 16 days ago
Stars: 8.3k
License: MIT

Similar Extensions

BLIP-2 Vision Language

98

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
Orchestra-Research

LLaVA Large Language and Vision Assistant

75

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Skill
davila7

CLIP

98

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill
Orchestra-Research

BLIP-2 Vision Language

90

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
davila7

Segment Anything Model

99

Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.

Skill
Orchestra-Research

Azure AI Content Understanding (Python)

99

Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".

Skill
microsoft
