LLaVA
Skill · Verified · Active
Large Language and Vision Assistant. Enables visual instruction tuning and image-based conversations. Combines a CLIP vision encoder with Vicuna/LLaMA language models. Supports multi-turn image chat, visual question answering, and instruction following. Use for vision-language chatbots or image understanding tasks. Best for conversational image analysis.
To enable AI agents to conduct visual instruction tuning and engage in image-based conversations, facilitating applications like vision-language chatbots and sophisticated image analysis.
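For orientation, here is a minimal sketch of single-turn visual question answering with a LLaVA checkpoint through the Hugging Face transformers integration. The model id, prompt template, and image path are illustrative assumptions, not part of this skill's documented interface.

```python
# Minimal single-turn VQA sketch with a LLaVA checkpoint via transformers.
# The checkpoint name and USER/ASSISTANT prompt format are assumptions
# based on the public llava-hf releases; adjust for your checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # hypothetical local image
prompt = "USER: <image>\nWhat is happening in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```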
Features
- Visual instruction tuning
- Image-based conversations
- Multi-turn image chat
- Visual question answering (VQA)
- Supports multiple model sizes (7B-34B) with quantization (see the quantized-loading sketch after this list)
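As referenced in the last feature above, here is a hedged sketch of loading a larger LLaVA checkpoint with 4-bit quantization to reduce VRAM use. BitsAndBytesConfig is real transformers API; the 13B checkpoint id is an assumption.

```python
# Sketch: load a LLaVA checkpoint in 4-bit via bitsandbytes to cut VRAM use.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4/FP4
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16 for speed
)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-13b-hf",            # assumed checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-13b-hf")
```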
Use cases
- Building vision-language chatbots
- Performing visual question answering
- Generating detailed image captions
- Engaging in multi-turn image dialogues (see the multi-turn sketch after this list)
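And the multi-turn sketch referenced above: LLaVA-1.5 checkpoints follow a Vicuna-style transcript, so a dialogue is maintained by appending each exchange to the prompt. This continues from the loading sketch under the purpose statement (it reuses `model`, `processor`, and `image`); the prompt format is an assumption for the llava-hf checkpoints, and newer releases expose apply_chat_template instead.

```python
# Multi-turn image chat by transcript accumulation (assumed prompt format).
def ask(model, processor, image, transcript, question, max_new_tokens=150):
    """Append a user turn, generate a reply, and return (transcript, answer)."""
    prompt = transcript + f"USER: {question} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    full = processor.decode(out[0], skip_special_tokens=True)
    answer = full.split("ASSISTANT:")[-1].strip()  # keep only the newest reply
    return prompt + f" {answer} ", answer

# The <image> token appears once, in the first turn; later turns reuse it.
transcript, first = ask(model, processor, image, "", "<image>\nDescribe this scene.")
transcript, follow_up = ask(model, processor, image, transcript,
                            "What objects are in the foreground?")
print(follow_up)
```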
Non-goals
- Being a simple zero-shot classifier like CLIP
- Performing only image captioning like BLIP-2
- Being purely API-based without local model options
Practical Utility
- Edge cases: the 'Limitations' section in SKILL.md addresses potential issues like hallucinations, spatial-reasoning struggles, and VRAM requirements, but does not detail specific recovery steps for each.
Installation
First, add the marketplace:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
Then install the plugin:
/plugin install AI-Research-SKILLs@ai-research-skills
Similar extensions
- Blip 2 Vision Language (score 98): Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
- CLIP (score 98): OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
- Segment Anything Model (score 99): Foundation model for image segmentation with zero-shot transfer. Use when you need to segment any object in images using points, boxes, or masks as prompts, or automatically generate all object masks in an image.
- Azure Ai Contentunderstanding Py (score 99): Azure AI Content Understanding SDK for Python. Use for multimodal content extraction from documents, images, audio, and video. Triggers: "azure-ai-contentunderstanding", "ContentUnderstandingClient", "multimodal analysis", "document extraction", "video analysis", "audio transcription".