NanoGPT
Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
Purpose
To provide a clear, concise, and hackable implementation of the GPT-2 architecture for educational purposes, enabling users to understand transformer models from scratch.
Features
- Minimalist GPT-2 (124M) implementation
- Reproduces GPT-2 on OpenWebText
- Clean, hackable code for learning transformers
- Supports training on CPU (Shakespeare) or multi-GPU (OpenWebText)
- Includes example configurations and data preparation scripts
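The core component such a minimal implementation has to get right is causal self-attention. As a rough illustration of the idea (this is a plain-NumPy sketch, not nanoGPT's actual PyTorch code; the function and weight names here are invented for the example):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, C) sequence.

    Each position may only attend to itself and earlier positions,
    which is what lets a GPT generate text left to right.
    """
    T, C = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # project to queries/keys/values
    att = (q @ k.T) / np.sqrt(k.shape[-1])    # scaled dot-product scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    att[mask] = -np.inf                       # hide future positions
    return softmax(att) @ v                   # weighted sum of values

# toy usage: 4 tokens, 8 channels
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = [rng.standard_normal((8, 8)) for _ in range(3)]
out = causal_self_attention(x, *W)
```

A useful sanity check when hacking on code like this: perturbing a later token must leave the outputs at earlier positions unchanged, otherwise the causal mask is wrong.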
Use Cases
- Learning transformer architecture from scratch
- Experimenting with GPT model components
- Teaching or understanding deep learning models
- Prototyping new transformer ideas
Non-Goals
- Production-ready deployment of LLMs
- State-of-the-art performance benchmarks
- Large-scale distributed training beyond 8 GPUs
- Complex model tuning for specific applications
Workflow
- Prepare data (e.g., Shakespeare or OpenWebText)
- Configure training parameters
- Train the model
- Generate text from the trained model
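For the character-level Shakespeare run, the data-preparation step boils down to building a character vocabulary and encoding the raw text as integer arrays split into train and validation sets. A hedged sketch of that idea (nanoGPT's own `prepare.py` scripts differ in detail, e.g. they write `.bin` files; the function name here is invented):

```python
import numpy as np

def prepare_char_dataset(text, val_fraction=0.1):
    """Encode raw text as integer token ids with a character-level vocabulary."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
    itos = {i: ch for ch, i in stoi.items()}       # id -> char
    ids = np.array([stoi[ch] for ch in text], dtype=np.uint16)
    n = int(len(ids) * (1 - val_fraction))         # leading portion for training
    return ids[:n], ids[n:], stoi, itos

train_ids, val_ids, stoi, itos = prepare_char_dataset("to be or not to be")
decoded = "".join(itos[int(i)] for i in train_ids)
```

Decoding the ids back through `itos` should reproduce the original text exactly, which is an easy check that the encoding is lossless.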
Practices
- Model Architecture
- Transformer Implementation
- Educational Code
Prerequisites
- Python 3.8+
- PyTorch (torch)
- Python packages: numpy, transformers, datasets, tiktoken, wandb, tqdm
Practical Utility
- Info — Production readiness: Although the code is clean, it is presented as an educational tool rather than a production-ready system. Training models like GPT-2 also requires significant computational resources, which are typically unavailable for immediate production use.
Trust
- Warning — Issues attention: In the last 90 days, 17 issues were opened and 4 were closed, indicating a low closure rate and potentially slow maintainer response.
Installation
npx skills add davila7/claude-code-templates

Runs the Vercel skills CLI (skills.sh) via npx — needs Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.
Similar Extensions
PyTorch Lightning (score: 100)
Deep learning framework (PyTorch Lightning). Organize PyTorch code into LightningModules, configure Trainers for multi-GPU/TPU, implement data pipelines, callbacks, logging (W&B, TensorBoard), distributed training (DDP, FSDP, DeepSpeed), for scalable neural network training.
Pytorch Lightning (score: 99)
High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.
Nnsight Remote Interpretability (score: 99)
Provides guidance for interpreting and manipulating neural network internals using nnsight with optional NDIF remote execution. Use when you need to run interpretability experiments on massive models (70B+) without local GPU resources, or when working with any PyTorch architecture.
Huggingface Accelerate (score: 99)
Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.
TorchTitan Distributed LLM Pretraining (score: 99)
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.