
LLaVA Large Language and Vision Assistant

Skill · Active

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Purpose

To enable conversational image understanding and visual instruction following through a powerful multimodal large language model.

Features

  • Conversational image analysis
  • Visual question answering
  • Multi-turn image chat
  • Visual instruction tuning
  • Support for multiple LLaVA models

Use Cases

  • Building vision-language chatbots
  • Performing visual question answering on images
  • Generating detailed image descriptions
  • Following visual instructions

Non-goals

  • Providing the highest quality API-based vision models (e.g., GPT-4V)
  • Simple zero-shot classification (use CLIP)
  • Image captioning only (use BLIP-2)
  • Research-only models (use Flamingo)

Workflow

  1. Load pretrained LLaVA model
  2. Process input image
  3. Format conversation prompt with image token
  4. Generate response using the model
  5. Decode and return response
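The workflow above can be sketched with the Hugging Face Transformers LLaVA integration. The checkpoint name `llava-hf/llava-1.5-7b-hf` and the single-turn prompt template are assumptions based on the common LLaVA-1.5 setup, not something this skill prescribes; any LLaVA checkpoint on the Hub follows the same pattern.

```python
def build_prompt(question: str) -> str:
    """Step 3: format a single-turn LLaVA-1.5 conversation prompt.

    The <image> placeholder marks where the processor splices in the
    vision-encoder features.
    """
    return f"USER: <image>\n{question} ASSISTANT:"


def main() -> None:
    # Heavy imports kept inside main() so the prompt helper is usable
    # without torch/transformers installed.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint

    # Step 1: load the pretrained model and its processor.
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # Step 2: process the input image (path is illustrative).
    image = Image.open("example.jpg")
    inputs = processor(
        images=image,
        text=build_prompt("What is in this image?"),
        return_tensors="pt",
    ).to(model.device)

    # Steps 4-5: generate, then decode and return the response.
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(output_ids[0], skip_special_tokens=True))


# Call main() to run inference; it downloads the ~14 GB FP16 weights
# and needs a GPU with enough VRAM (see Prerequisites).
```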

Prerequisites

  • Python 3.8+
  • PyTorch
  • Transformers
  • Pillow
  • Sufficient GPU VRAM (e.g., ~4GB for 7B 4-bit, ~14GB for 7B FP16)
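The prerequisites above can be installed with pip; this is a minimal setup sketch, and the optional packages are assumptions for the quantized path, not requirements stated by the skill.

```shell
# Core dependencies (the right CUDA-enabled torch wheel depends on
# your driver; see pytorch.org for the matching index URL).
pip install torch transformers pillow

# Optional, assumed for the 4-bit loading path that brings 7B VRAM
# use down to roughly 4 GB.
pip install accelerate bitsandbytes
```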

Trust

  • Warning — issues attention: in the last 90 days, 17 issues were opened and 4 were closed, indicating a slow issue-closure rate.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.), and assumes the repository follows the agentskills.io format.

Quality Score

75 / 100
Analyzed 1 day ago

Trust Signals

  • Last commit: 1 day ago
  • Stars: 27.2k
  • License: MIT
  • Status

Similar Extensions

Llava

96

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Skill
Orchestra-Research

Blip 2 Vision Language

98

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
Orchestra-Research

Clip

98

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill
Orchestra-Research

Blip 2 Vision Language

90

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
davila7

Segment Anything Model

99

Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.

Skill
Orchestra-Research

Azure Ai Contentunderstanding Py

99

Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".

Skill
microsoft