Quantizing Models Bitsandbytes
Skill · Verified · Active
Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, you need to fit larger models, or you want faster inference. Supports INT8, NF4, and FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.
Reduce LLM memory consumption by 50-75% through quantization, enabling larger models on limited hardware or faster inference.
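The 50-75% figure follows directly from bytes per weight. A rough back-of-the-envelope calculation (weights only, ignoring activations and quantization constants; the 7B size is just an example):

```python
# Rough memory math behind the 50-75% reduction claim.
params = 7_000_000_000           # a 7B-parameter model (example size)

fp16_gb = params * 2 / 1e9       # 2 bytes per weight in FP16 -> 14.0 GB
int8_gb = params * 1 / 1e9       # 1 byte per weight in INT8  ->  7.0 GB
nf4_gb = params * 0.5 / 1e9      # 0.5 bytes per weight in NF4/FP4 -> 3.5 GB

print(1 - int8_gb / fp16_gb)     # 0.5  -> 50% reduction for 8-bit
print(1 - nf4_gb / fp16_gb)      # 0.75 -> 75% reduction for 4-bit
```

Real-world savings are slightly lower because quantization constants, embeddings kept in higher precision, and activations add overhead.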
Features
- Quantize LLMs to 8-bit or 4-bit
- Support for INT8, NF4, FP4 formats
- Enable QLoRA fine-tuning
- Reduce memory usage by 50-75%
- Compatible with HuggingFace Transformers
Use cases
- Fit larger models onto GPUs with limited VRAM
- Accelerate LLM inference speed
- Fine-tune large models (e.g., 70B) on consumer hardware using QLoRA
- Optimize memory usage during LLM training
Non-goals
- Providing a runtime quantization service
- Replacing the underlying bitsandbytes library
- Quantizing models not compatible with HuggingFace Transformers
Installation
First, add the Marketplace:
/plugin marketplace add Orchestra-Research/AI-Research-SKILLs
/plugin install AI-Research-SKILLs@ai-research-skills
Quality score: Verified
Similar extensions
Quantizing Models Bitsandbytes
Score 95 · Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, need to fit larger models, or want faster inference. Supports INT8, NF4, FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.
Arize Prompt Optimization
Score 100 · Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.
Unsloth
Score 100 · Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization
Prompt Optimization
Score 100 · Applies prompt repetition to improve the accuracy of non-reasoning LLMs
Vector Index Tuning
Score 99 · Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.
Transformers
Score 98 · This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.