
LLaVA: Large Language and Vision Assistant

Skill Active

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Purpose

To enable conversational image understanding and visual instruction following through a powerful multimodal large language model.

Features

  • Conversational image analysis
  • Visual question answering
  • Multi-turn image chat
  • Visual instruction tuning
  • Support for multiple LLaVA models

Use cases

  • Building vision-language chatbots
  • Performing visual question answering on images
  • Generating detailed image descriptions
  • Following visual instructions

Non-goals

  • Highest-quality, API-based vision models (use GPT-4V)
  • Simple zero-shot classification (use CLIP)
  • Image captioning only (use BLIP-2)
  • Research-only models (use Flamingo)

Workflow

  1. Load pretrained LLaVA model
  2. Process input image
  3. Format conversation prompt with image token
  4. Generate response using the model
  5. Decode and return response
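The five workflow steps can be sketched with the Hugging Face Transformers LLaVA integration. This is a minimal sketch under assumptions, not the skill's own code: the checkpoint name `llava-hf/llava-1.5-7b-hf`, the Vicuna-style `USER: ... ASSISTANT:` template, and the helper names `build_prompt` and `answer_about_image` are illustrative choices that do not appear in the listing. Other LLaVA checkpoints may expect a different prompt template.

```python
from typing import List, Tuple

IMAGE_TOKEN = "<image>"  # placeholder the processor expands into image features


def build_prompt(history: List[Tuple[str, str]], question: str) -> str:
    """Step 3: format a (multi-turn) prompt; the image token appears once, in the first user turn.

    Assumes the Vicuna-style template used by llava-1.5 checkpoints.
    """
    parts = []
    for i, (user_msg, assistant_msg) in enumerate(history):
        prefix = f"{IMAGE_TOKEN}\n" if i == 0 else ""
        parts.append(f"USER: {prefix}{user_msg} ASSISTANT: {assistant_msg}")
    prefix = f"{IMAGE_TOKEN}\n" if not history else ""
    parts.append(f"USER: {prefix}{question} ASSISTANT:")
    return " ".join(parts)


def answer_about_image(
    image_path: str,
    question: str,
    model_id: str = "llava-hf/llava-1.5-7b-hf",  # assumed checkpoint
    max_new_tokens: int = 200,
) -> str:
    # Heavy dependencies are imported lazily so the prompt helper works without them.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # Step 1: load the pretrained model and its processor.
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype).to(device)

    # Step 2: load the input image and normalise it to RGB.
    image = Image.open(image_path).convert("RGB")

    # Step 3: build the prompt, then tokenize text and preprocess the image together.
    prompt = build_prompt([], question)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)

    # Step 4: generate greedily.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Step 5: decode only the newly generated tokens, skipping the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True).strip()
```

For a follow-up question, pass the earlier turns as `history`, for example `build_prompt([("What is shown?", "A cat on a sofa.")], "What colour is the sofa?")`, so the image token is not repeated.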

Prerequisites

  • Python 3.8+
  • PyTorch
  • Transformers
  • Pillow
  • Sufficient GPU VRAM (e.g., ~4 GB for 7B 4-bit, ~14 GB for 7B FP16)
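The VRAM figures follow roughly from parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate for the language model weights only; the helper name `weight_memory_gb` is hypothetical, and the vision encoder, activations, KV cache, and quantization overhead all add to the result, which is one plausible reason the 4-bit figure is quoted as ~4 GB rather than 3.5 GB.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


fp16_gb = weight_memory_gb(7e9, 16)  # 7B parameters at 2 bytes each
int4_gb = weight_memory_gb(7e9, 4)   # 7B parameters at half a byte each

print(f"7B FP16:  {fp16_gb:.1f} GB")
print(f"7B 4-bit: {int4_gb:.1f} GB")
```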

Trust

  • Warning (issues need attention): In the last 90 days, 17 issues were opened and 4 were closed, indicating a slow issue closure rate.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

Quality score

75/100
Analyzed about 23 hours ago

Trust signals

Last commit: 1 day ago
Stars: 27.2k
License: MIT

Similar extensions

Llava

96

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Skill
Orchestra-Research

Blip 2 Vision Language

98

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
Orchestra-Research

Clip

98

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill
Orchestra-Research

Blip 2 Vision Language

90

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
davila7

Segment Anything Model

99

Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.

Skill
Orchestra-Research

Azure Ai Contentunderstanding Py

99

Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".

Skill
microsoft