Quantizing Models Bitsandbytes

Skill Verified Active

Quantizes LLMs to 8-bit or 4-bit for a 50-75% memory reduction with minimal accuracy loss. Use it when GPU memory is limited, when you need to fit larger models, or when you want faster inference. Supports the INT8, NF4, and FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.

Purpose

Quantize LLMs to reduce memory usage by 50-75% with minimal accuracy loss, enabling larger models on limited hardware and faster inference.
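The 50-75% figure follows from simple arithmetic on bytes per weight, assuming the baseline is 16-bit weights. The sketch below is a rough estimate only: the function names are illustrative, and it ignores the small overhead of quantization constants as well as activations and the KV cache.

```python
# Approximate bytes needed to store one weight in each format.
# 4-bit formats pack two weights per byte; the overhead of quantization
# constants (per-block scales) is small and ignored in this estimate.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "nf4": 0.5, "fp4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


def reduction_vs_fp16(fmt: str) -> float:
    """Fractional memory saving relative to 16-bit weights."""
    return 1.0 - BYTES_PER_PARAM[fmt] / BYTES_PER_PARAM["fp16"]


if __name__ == "__main__":
    n = 7e9  # hypothetical 7B-parameter model
    for fmt in BYTES_PER_PARAM:
        print(f"{fmt}: {weight_memory_gb(n, fmt):.1f} GB, "
              f"saves {reduction_vs_fp16(fmt):.0%} vs fp16")
```

For a hypothetical 7B-parameter model, this works out to roughly 14 GB of weights at 16-bit, 7 GB at INT8, and 3.5 GB at 4-bit.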

Features

Quantizes LLMs to 8-bit or 4-bit
Supports INT8, NF4, FP4 formats
Enables QLoRA training
Integrates with HuggingFace Transformers
Reduces memory by 50-75%
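As a minimal sketch of how these formats are typically selected through HuggingFace Transformers, the helper below maps a format name to keyword arguments for BitsAndBytesConfig. The helper names, the bfloat16 compute dtype, and the double-quantization setting are assumptions chosen for illustration, not settings taken from this skill. Loading also assumes a CUDA GPU with the transformers, accelerate, and bitsandbytes packages installed.

```python
def quantization_kwargs(mode: str) -> dict:
    """Return keyword arguments for transformers.BitsAndBytesConfig."""
    if mode == "int8":
        return {"load_in_8bit": True}
    if mode in ("nf4", "fp4"):
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": mode,
            # Assumption: bfloat16 compute; float16 is the usual choice
            # on GPUs without bfloat16 support.
            "bnb_4bit_compute_dtype": "bfloat16",
            # Assumption: also quantize the quantization constants.
            "bnb_4bit_use_double_quant": True,
        }
    raise ValueError(f"unsupported mode: {mode!r}")


def load_quantized(model_id: str, mode: str = "nf4"):
    """Load a causal LM with quantized weights (needs a GPU and bitsandbytes)."""
    # Imported lazily so quantization_kwargs works without the heavy libraries.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(**quantization_kwargs(mode))
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # lets accelerate place layers on available devices
    )
```

Calling load_quantized with a model identifier of your choice and mode set to "int8", "nf4", or "fp4" would then return the quantized model.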

Use Cases

Fitting larger models into limited GPU memory
Achieving faster LLM inference speeds
Fine-tuning large models on consumer GPUs with QLoRA
Reducing optimizer memory during training with 8-bit optimizers
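The last two use cases are often combined. The sketch below shows one way to do that, assuming the peft and bitsandbytes packages and a model already loaded in 4-bit. The function names, LoRA rank, dropout, learning rate, and target module names are illustrative assumptions; target modules in particular depend on the model architecture.

```python
def lora_settings(rank: int = 16) -> dict:
    """Return keyword arguments for peft.LoraConfig (illustrative values)."""
    return {
        "r": rank,
        "lora_alpha": 2 * rank,
        "lora_dropout": 0.05,
        # Assumption: attention projection names used by Llama-style models.
        "target_modules": ["q_proj", "v_proj"],
        "task_type": "CAUSAL_LM",
    }


def prepare_qlora(model, learning_rate: float = 2e-4):
    """Attach LoRA adapters to a 4-bit model and build an 8-bit optimizer."""
    # Imported lazily so lora_settings works without the heavy libraries.
    import bitsandbytes as bnb
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Prepares the quantized base model for training (its weights stay frozen).
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, LoraConfig(**lora_settings()))

    # Only the adapter weights are trainable; optimizer state is kept in 8-bit.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = bnb.optim.AdamW8bit(trainable, lr=learning_rate)
    return model, optimizer
```

Because only the small adapter matrices receive gradients and the optimizer state is stored in 8-bit, most of the remaining training memory goes to the 4-bit base weights and the activations.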

Non-Goals

Replacing inference-focused quantization methods such as GPTQ or AWQ
Providing CPU-only inference solutions like GGUF
Supporting hardware without tensor core acceleration

Trust

Issues (attention): 17 issues opened and 4 closed in the last 90 days indicate a closure rate below 50%, with a moderate number of open issues.

Installation

npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx. This requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). It assumes the repo follows the agentskills.io format.

Quality Score

Verified

95/100

Analyzed 1 day ago

Trust Signals

Last commit: 1 day ago

GitHub owner: davila7

Stars: 27.2k

Downloads: 23k

License: MIT

Website: aitmpl.com


Similar Extensions

Quantizing Models Bitsandbytes

Skill

Orchestra-Research

Arize Prompt Optimization

100

Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.

Skill

github

Unsloth

100

Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less memory, LoRA/QLoRA optimization

Skill

davila7

Prompt Optimization

100

Applies prompt repetition to improve accuracy for non-reasoning LLMs

Skill

asklokesh

Vector Index Tuning

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.

Skill

wshobson

Transformers

This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.

Skill

K-Dense-AI