LLaVA: Large Language and Vision Assistant
Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines a CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
To enable conversational image understanding and visual instruction following through a powerful multimodal large language model.
Features
- Conversational image analysis
- Visual question answering
- Multi-turn image chat
- Visual instruction tuning
- Support for multiple LLaVA models
Use cases
- Building vision-language chatbots
- Performing visual question answering on images
- Generating detailed image descriptions
- Following visual instructions
Non-goals
- Providing the highest quality API-based vision models (e.g., GPT-4V)
- Simple zero-shot classification (use CLIP)
- Image captioning only (use BLIP-2)
- Research-only models (use Flamingo)
Workflow
- Load pretrained LLaVA model
- Process input image
- Format conversation prompt with image token
- Generate response using the model
- Decode and return response
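The workflow above can be sketched with the Hugging Face Transformers LLaVA integration. This is a minimal sketch, not the skill's own implementation: the checkpoint name `llava-hf/llava-1.5-7b-hf` and the `USER: <image>\n… ASSISTANT:` prompt template are assumptions based on the LLaVA-1.5 conversation format, so adjust both for the model you actually load.

```python
def format_llava_prompt(question: str) -> str:
    # Assumed LLaVA-1.5 conversation template; the "<image>" placeholder
    # marks where the processor splices in image features.
    return f"USER: <image>\n{question} ASSISTANT:"


def chat_with_image(image_path: str, question: str) -> str:
    # Heavy imports are kept local so the prompt helper above stays
    # importable on machines without a GPU or the model weights.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed HF checkpoint name
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=format_llava_prompt(question), images=image, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True).strip()
```

A call like `chat_with_image("photo.jpg", "What is in this image?")` covers all five workflow steps: load, process, format, generate, decode.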
Prerequisites
- Python 3.8+
- PyTorch
- Transformers
- Pillow
- Sufficient GPU VRAM (e.g., ~4GB for 7B 4-bit, ~14GB for 7B FP16)
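The VRAM figures above follow directly from parameter count times bits per parameter (weights only, before activation and KV-cache overhead). Below is a sketch of that arithmetic plus 4-bit loading via Transformers' `BitsAndBytesConfig`; the config class is real Transformers API, but the checkpoint name is an assumption and the loader requires a CUDA GPU with the `bitsandbytes` package installed.

```python
def approx_weights_vram_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Rough VRAM for model weights alone: params * bits / 8 bits-per-byte."""
    return n_params_billion * bits_per_param / 8

# 7B model: 7 * 16 / 8 = 14 GB at FP16, 7 * 4 / 8 = 3.5 GB at 4-bit
# (real usage is somewhat higher once activations and cache are counted).


def load_llava_4bit():
    # Heavy imports kept local; needs a CUDA GPU and bitsandbytes.
    import torch
    from transformers import (
        AutoProcessor,
        BitsAndBytesConfig,
        LlavaForConditionalGeneration,
    )

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
    quant = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    )
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, quantization_config=quant, device_map="auto"
    )
    return AutoProcessor.from_pretrained(model_id), model
```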
Trust
- Warning (issue activity): in the last 90 days, 17 issues were opened and only 4 were closed, indicating a slow issue closure rate.
Installation
npx skills add davila7/claude-code-templates
Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js installed locally and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.). Assumes the repository follows the agentskills.io format.
Quality score
Similar extensions
Blip 2 Vision Language (98)
Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
Clip (98)
OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
Segment Anything Model (99)
Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.
Azure Ai Contentunderstanding Py (99)
Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".