
LLaVA Large Language and Vision Assistant

Skill · Active

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Purpose

To enable conversational image understanding and visual instruction following through a powerful multimodal large language model.

Features

  • Conversational image analysis
  • Visual question answering
  • Multi-turn image chat
  • Visual instruction tuning
  • Support for multiple LLaVA models

Use Cases

  • Building vision-language chatbots
  • Performing visual question answering on images
  • Generating detailed image descriptions
  • Following visual instructions

Non-goals

  • Providing the highest quality API-based vision models (e.g., GPT-4V)
  • Simple zero-shot classification (use CLIP)
  • Image captioning only (use BLIP-2)
  • Research-only models (use Flamingo)

Workflow

  1. Load pretrained LLaVA model
  2. Process input image
  3. Format conversation prompt with image token
  4. Generate response using the model
  5. Decode and return response
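The workflow above can be sketched with the Hugging Face Transformers LLaVA integration. The checkpoint name `llava-hf/llava-1.5-7b-hf` and the single-turn prompt template are assumptions based on the common LLaVA-1.5 setup, not something this skill prescribes; any LLaVA checkpoint on the Hub follows the same pattern.

```python
def build_prompt(question: str) -> str:
    """Step 3: format a single-turn LLaVA-1.5 conversation prompt.

    The <image> placeholder marks where the processor splices in the
    vision-encoder features.
    """
    return f"USER: <image>\n{question} ASSISTANT:"


def main() -> None:
    # Heavy imports kept inside main() so the prompt helper is usable
    # without torch/transformers installed.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint

    # Step 1: load the pretrained model and its processor.
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # Step 2: process the input image (path is illustrative).
    image = Image.open("example.jpg")
    inputs = processor(
        images=image,
        text=build_prompt("What is in this image?"),
        return_tensors="pt",
    ).to(model.device)

    # Steps 4-5: generate, then decode and return the response.
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(output_ids[0], skip_special_tokens=True))


# Call main() to run inference; it downloads the ~14 GB FP16 weights
# and needs a GPU with enough VRAM (see Prerequisites).
```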

Prerequisites

  • Python 3.8+
  • PyTorch
  • Transformers
  • Pillow
  • Sufficient GPU VRAM (e.g., ~4GB for 7B 4-bit, ~14GB for 7B FP16)
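The prerequisites above can be installed with pip; this is a minimal setup sketch, and the optional packages are assumptions for the quantized path, not requirements stated by the skill.

```shell
# Core dependencies (the right CUDA-enabled torch wheel depends on
# your driver; see pytorch.org for the matching index URL).
pip install torch transformers pillow

# Optional, assumed for the 4-bit loading path that brings 7B VRAM
# use down to roughly 4 GB.
pip install accelerate bitsandbytes
```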

Trust

  • Warning — issues attention: in the last 90 days, 17 issues were opened and 4 were closed, indicating a slow issue-closure rate.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. Requires a local Node.js installation and at least one skills-compatible agent (Claude Code, Cursor, Codex, etc.), and assumes the repository follows the agentskills.io format.

Quality Score

75 / 100
Analyzed 1 day ago

Trust Signals

  • Last commit: 1 day ago
  • Stars: 27.2k
  • License: MIT
  • Status

Similar Extensions

Llava

96

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Skill
Orchestra-Research

Blip 2 Vision Language

98

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
Orchestra-Research

Clip

98

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill
Orchestra-Research

Blip 2 Vision Language

90

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
davila7

Segment Anything Model

99

Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.

Skill
Orchestra-Research

Azure Ai Contentunderstanding Py

99

Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".

Skill
microsoft