LLaVA

Skill · Verified · Active

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Purpose

To enable AI agents to conduct visual instruction tuning and engage in image-based conversations, facilitating applications like vision-language chatbots and sophisticated image analysis.
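
As a quick illustration, here is a minimal sketch of single-image chat through the Hugging Face transformers integration of LLaVA. The llava-hf/llava-1.5-7b-hf checkpoint, the example file name, and the USER/ASSISTANT prompt format are assumptions based on the common llava-hf convention, not values taken from this skill's SKILL.md.

# Minimal single-image chat sketch (assumed checkpoint and prompt format;
# requires transformers with LLaVA support plus accelerate for device_map).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # any local image
# LLaVA-1.5 convention: one <image> placeholder plus USER/ASSISTANT turns.
prompt = "USER: <image>\nWhat is happening in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))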

Features

  • Visual instruction tuning
  • Image-based conversations
  • Multi-turn image chat
  • Visual question answering (VQA)
  • Supports multiple model sizes (7B-34B) with quantization (see the quantization sketch after this list)
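
As a rough illustration of the quantization bullet, here is a sketch of 4-bit loading via bitsandbytes; the 13B checkpoint name and the exact BitsAndBytesConfig settings are assumptions rather than values from SKILL.md.

# 4-bit quantized loading sketch (assumes bitsandbytes and a CUDA GPU;
# checkpoint name is an assumption).
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-13b-hf",           # assumed larger checkpoint
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-13b-hf")

4-bit loading roughly quarters the weight memory, which is what makes the larger variants practical on a single consumer GPU.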

Use Cases

  • Building vision-language chatbots
  • Performing visual question answering
  • Generating detailed image captions
  • Engaging in multi-turn image dialogues (see the multi-turn sketch after this list)
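
Multi-turn chat with LLaVA-1.5-style checkpoints can be sketched by replaying earlier turns in the prompt. The helper below reuses the model, processor, and image objects from the first sketch; chat_turn is a hypothetical helper, and the replay format is an assumption based on the llava-hf convention.

# Multi-turn image chat sketch: previous turns are replayed verbatim so the
# model sees the whole conversation (chat_turn is a hypothetical helper).
def chat_turn(history: str, question: str) -> tuple[str, str]:
    prompt = f"{history}USER: {question} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=150)
    full = processor.decode(out[0], skip_special_tokens=True)
    answer = full.split("ASSISTANT:")[-1].strip()  # keep only the newest reply
    return answer, f"{prompt} {answer}\n"          # reply plus updated history

# The first turn carries the <image> placeholder; later turns reuse the image.
answer, history = chat_turn("", "<image>\nDescribe this image.")
followup, history = chat_turn(history, "What colors dominate the scene?")
print(followup)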

Non-Goals

  • Being a simple zero-shot classifier like CLIP
  • Performing only image captioning like BLIP-2
  • Being purely API-based without local model options

Practical Utility

  • Edge cases: The 'Limitations' section in SKILL.md addresses potential issues like hallucinations, spatial reasoning struggles, and VRAM requirements, but doesn't detail specific recovery steps for each.

Installation

Add the marketplace first

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
96/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status: Active

Similar Extensions

BLIP-2 Vision Language

98

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
Orchestra-Research

LLaVA Large Language and Vision Assistant

75

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Skill
davila7

CLIP

98

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill
Orchestra-Research

BLIP-2 Vision Language

90

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
davila7

Segment Anything Model

99

Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.

Skill
Orchestra-Research

Azure AI Contentunderstanding Py

99

Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".

Skill
microsoft