
Llava

Skill · Verified · Active

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Purpose

To enable AI agents to conduct visual instruction tuning and engage in image-based conversations, facilitating applications like vision-language chatbots and sophisticated image analysis.

Features

  • Visual instruction tuning
  • Image-based conversations
  • Multi-turn image chat
  • Visual question answering (VQA)
  • Supports multiple model sizes (7B-34B) with quantization
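At the larger end of the 7B-34B range, quantization is what makes the models fit on consumer GPUs. A minimal sketch of a 4-bit load via Hugging Face `transformers` and `bitsandbytes` (the checkpoint id `llava-hf/llava-1.5-7b-hf` and the config values are assumptions; check the skill's SKILL.md for its recommended setup):

```python
# Sketch only: 4-bit quantized load of a LLaVA checkpoint.
# Requires a CUDA GPU plus the `transformers`, `accelerate`, and
# `bitsandbytes` packages; the model id here is an assumption.
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    LlavaForConditionalGeneration,
)

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # pack weights into 4-bit format
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=quant_config,
    device_map="auto",                     # spread layers across available devices
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
```

The same pattern applies to the 13B and 34B checkpoints; only the model id and VRAM budget change.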

Use Cases

  • Building vision-language chatbots
  • Performing visual question answering
  • Generating detailed image captions
  • Engaging in multi-turn image dialogues
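The multi-turn dialogue use case comes down to how the conversation is serialized into a prompt. A hedged sketch of the USER/ASSISTANT template used by the llava-hf checkpoints, with an `<image>` token marking where vision features are spliced in (the exact template is an assumption; newer `transformers` versions expose it through the processor's chat template instead):

```python
# Sketch: serialize a multi-turn image chat into a LLaVA-style prompt.
# The USER/ASSISTANT template and <image> placement are assumptions
# based on the llava-hf model cards, not taken from this skill.
def build_llava_prompt(turns, image_first=True):
    """turns: list of (user_text, assistant_text_or_None) pairs.
    Leave the final assistant text as None so the model generates it."""
    parts = []
    for i, (user, assistant) in enumerate(turns):
        # The image token appears once, in the first user turn.
        image_tag = "<image>\n" if (image_first and i == 0) else ""
        part = f"USER: {image_tag}{user} ASSISTANT:"
        if assistant is not None:
            part += f" {assistant}"
        parts.append(part)
    return " ".join(parts)

prompt = build_llava_prompt([
    ("What is in this photo?", "A dog on a beach."),
    ("What breed might it be?", None),
])
print(prompt)
```

The resulting string (plus the pixel values from the processor) is what gets fed to `model.generate`; each follow-up question reuses the earlier turns so the model keeps the image context.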

Non-Goals

  • Being a simple zero-shot classifier like CLIP
  • Performing only image captioning like BLIP-2
  • Being purely API-based without local model options

Practical Utility

  • Edge cases: The 'Limitations' section in SKILL.md addresses potential issues like hallucinations, spatial reasoning struggles, and VRAM requirements, but doesn't detail specific recovery steps for each.

Installation

Add the Marketplace first:

/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills

Quality Score

Verified
96/100
Analyzed 1 day ago

Trust Signals

Last commit: 17 days ago
Stars: 8.3k
License: MIT
Status

Similar Extensions

Blip 2 Vision Language

98

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
Orchestra-Research

LLaVA Large Language and Vision Assistant

75

Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.

Skill
davila7

Clip

98

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill
Orchestra-Research

Blip 2 Vision Language

90

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.

Skill
davila7

Segment Anything Model

99

Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.

Skill
Orchestra-Research

Azure Ai Contentunderstanding Py

99

Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".

Skill
microsoft