LLaVA: Large Language and Vision Assistant
Skill: Active
Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines a CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
To enable conversational image understanding and visual instruction following through a powerful multimodal large language model.
Features
- Conversational image analysis
- Visual question answering
- Multi-turn image chat
- Visual instruction tuning
- Support for multiple LLaVA models
Use Cases
- Building vision-language chatbots
- Performing visual question answering on images
- Generating detailed image descriptions
- Following visual instructions
Non-Goals
- Providing the highest quality API-based vision models (e.g., GPT-4V)
- Simple zero-shot classification (use CLIP)
- Image captioning only (use BLIP-2)
- Research-only models (use Flamingo)
Workflow
- Load pretrained LLaVA model
- Process input image
- Format conversation prompt with image token
- Generate response using the model
- Decode and return the response (see the sketch below)
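This workflow maps fairly directly onto the Hugging Face transformers API. A minimal sketch, assuming the llava-hf/llava-1.5-7b-hf checkpoint, a recent transformers release, and a local file named example.jpg (the checkpoint and file name are illustrative, not prescribed by this skill):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint; any LLaVA variant works

# 1. Load the pretrained LLaVA model and its processor
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 2. Process the input image
image = Image.open("example.jpg")  # hypothetical local file

# 3. Format the conversation prompt with the image token
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

# 4. Generate a response with the model
output_ids = model.generate(**inputs, max_new_tokens=200)

# 5. Decode and return the response
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

For multi-turn chat, the decoded ASSISTANT turn is appended to the prompt and the loop repeats with the same image inputs.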
Prerequisites
- Python 3.8+
- PyTorch
- Transformers
- Pillow
- Sufficient GPU VRAM (e.g., ~4GB for 7B 4-bit, ~14GB for 7B FP16)
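To stay near the ~4GB figure for the 7B model, the weights can be loaded in 4-bit. A sketch, assuming bitsandbytes is installed in addition to the requirements above and using the same illustrative checkpoint:

```python
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit at load time
    bnb_4bit_compute_dtype=torch.float16,  # run compute in FP16
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # illustrative checkpoint
    quantization_config=quant_config,
    device_map="auto",
)
```

Loading without quantization_config in torch.float16 corresponds to the ~14GB FP16 figure.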
Trust
- Warning (issue attention): in the last 90 days, 17 issues were opened and 4 were closed, indicating a slow issue closure rate.
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, ...). Assumes the repo follows the agentskills.io format.
Quality Score
Trust Signals
Similar Extensions
Llava (score: 96)
Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
Blip 2 Vision Language (score: 98)
Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
Clip (score: 98)
OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
Blip 2 Vision Language (score: 90)
Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
Segment Anything Model (score: 99)
Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.
Azure Ai Contentunderstanding Py (score: 99)
Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".