
LLaVA

Skill · Verified · Active

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
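
A minimal single-turn example helps ground the description. The sketch below assumes the llava-hf/llava-1.5-7b-hf checkpoint on Hugging Face and the transformers LlavaForConditionalGeneration / AutoProcessor APIs; neither is mandated by this skill, and any LLaVA-1.5 checkpoint should follow the same pattern.

    import requests
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # Any accessible image works; this URL is only an example.
    url = "https://llava-vl.github.io/static/images/view.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # LLaVA-1.5 prompt convention: the <image> placeholder marks where the
    # CLIP-encoded image features enter the language model's context.
    prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output[0], skip_special_tokens=True))

The decoded output echoes the prompt followed by the assistant's answer; slice off the prompt tokens if you only want the reply.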

Purpose

To enable AI agents to conduct visual instruction tuning and engage in image-based conversations, facilitating applications like vision-language chatbots and sophisticated image analysis.

Features

  • Visual instruction tuning
  • Image-based conversations
  • Multi-turn image chat
  • Visual question answering (VQA)
  • Supports multiple model sizes (7B-34B) with quantization (see the loading sketch below)
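
Quantized loading is what makes the larger checkpoints practical on a single GPU. A sketch of 4-bit loading, assuming the bitsandbytes-backed BitsAndBytesConfig in transformers and the llava-hf/llava-1.5-13b-hf checkpoint (both assumptions, not requirements of this skill):

    import torch
    from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

    # 4-bit NF4 quantization cuts weight memory to roughly a quarter of fp16.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model_id = "llava-hf/llava-1.5-13b-hf"  # assumed; the 7B checkpoint loads the same way
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, quantization_config=quant_config, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

The 34B-class variants are distributed as LLaVA-NeXT checkpoints and typically load through the separate LlavaNextForConditionalGeneration class rather than the one shown here.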

Use Cases

  • Building vision-language chatbots
  • Performing visual question answering
  • Generating detailed image captions
  • Engaging in multi-turn image dialogues (see the multi-turn sketch below)
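
Multi-turn chat works by replaying the earlier turns in the prompt so the model keeps its context. Continuing the single-turn sketch in the description above (reusing its model, processor, and image), a follow-up exchange might look like this; the prompt format is the LLaVA-1.5 convention, and recent transformers releases can also build it via the processor's chat template:

    # Turn 1: ask a question and keep only the newly generated answer tokens.
    turn1 = "USER: <image>\nWhat is in this photo? ASSISTANT:"
    inputs = processor(images=image, text=turn1, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    answer1 = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Turn 2: replay the whole conversation, including the <image> placeholder,
    # then append the new user question.
    turn2 = f"{turn1} {answer1} USER: What time of day does it look like? ASSISTANT:"
    inputs = processor(images=image, text=turn2, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))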

Non-Goals

  • Being a simple zero-shot classifier like CLIP
  • Performing only image captioning like BLIP-2
  • Being purely API-based without local model options

Practical Utility

  • Edge cases: The 'Limitations' section in SKILL.md addresses potential issues like hallucinations, spatial reasoning struggles, and VRAM requirements, but doesn't detail specific recovery steps for each.

Installation

First, add the marketplace, then install the skill:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
96/100
Analyzed about 21 hours ago

Trust Signals

Last commit: 16 days ago
Stars: 8.3k
License: MIT

Similar Extensions

BLIP-2 Vision Language

98

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
Orchestra-Research

LLaVA Large Language and Vision Assistant

75

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Skill
davila7

CLIP

98

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill
Orchestra-Research

BLIP-2 Vision Language

90

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
davila7

Segment Anything Model

99

Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.

Skill
Orchestra-Research

Azure AI Content Understanding (Python)

99

Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".

Skill
microsoft
