Analyze Kernel Bottleneck
Skill Verified ActiveSystematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound using roofline analysis, occupancy calculations, compute/load ratio per tile, and SASS instruction inspection. Produces a decision matrix for optimization strategy selection (cp.async, warp interleaving, tiling, double-buffering, or CuAssembler hand-tuning).
To systematically identify GPU kernel performance bottlenecks and provide actionable insights for optimization strategies, enabling developers to improve kernel efficiency.
Features
- GPU kernel bottleneck classification (compute-bound, memory-bound, latency-bound)
- Roofline analysis using arithmetic intensity and machine balance points
- Occupancy calculation to determine active warps per SM
- Compute/load ratio analysis from SASS instructions
- SASS instruction mix and stall code inspection
- Shared memory cliff analysis
- Decision matrix for optimization strategy selection (cp.async, warp interleaving, etc.)
- Structured bottleneck report generation
Use Cases
- Before optimizing any CUDA kernel to establish a baseline and identify bottlenecks
- After initial kernel implementation to pinpoint optimization paths
- When a kernel's performance does not meet expectations
- To decide between various optimization techniques like cp.async, tiling, or algorithmic changes
Non-Goals
- Directly modifying CUDA source code
- Automated kernel recompilation without user input
- Real-time performance monitoring beyond discrete analysis runs
- Analysis of CPU-bound aspects of host-device workflows
Installation
/plugin install agent-almanac@pjt222-agent-almanacQuality Score
VerifiedTrust Signals
Similar Extensions
Optimize for GPU
97GPU-accelerate Python code using CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, KvikIO, cuCIM, cuxfilter, cuVS, cuSpatial, and RAFT. Use whenever the user mentions GPU/CUDA/NVIDIA acceleration, or wants to speed up NumPy, pandas, scikit-learn, scikit-image, NetworkX, GeoPandas, or Faiss workloads. Covers physics simulation, differentiable rendering, mesh ray casting, particle systems (DEM/SPH/fluids), vector/similarity search, GPUDirect Storage file IO, interactive dashboards, geospatial analysis, medical imaging, and sparse eigensolvers. Also use when you see CPU-bound Python code (loops, large arrays, ML pipelines, graph analytics, image processing) that would benefit from GPU acceleration, even if not explicitly requested.
Pipeline Gpu Kernel
95Apply software pipelining (double-buffering) to a tiled GPU kernel to overlap global memory loads with Tensor Core computation. Covers prologue/loop/epilogue restructuring, LDG-register vs cp.async (LDGSTS) variant selection based on compute/load ratio, shared memory budget verification against architecture-specific occupancy cliffs, and SASS-level verification of load/compute overlap.
Performance Analysis
100Comprehensive performance analysis, bottleneck detection, and optimization recommendations for Claude Flow swarms
Oraclaw Solver
100Industrial-grade scheduling and resource optimization for AI agents. Solve task scheduling with energy matching, budget allocation, and any LP/MIP constraint problem in milliseconds.
Oraclaw Decide
100Decision intelligence for AI agents. Analyze options, map decision dependencies with PageRank, detect when information sources conflict, and find the choices that matter most.
MongoDB Connection Optimizer
100Optimize MongoDB client connection configuration (pools, timeouts, patterns) for any supported driver language. Use this skill when working/updating/reviewing on functions that instantiate or configure a MongoDB client (eg, when calling `connect()`), configuring connection pools, troubleshooting connection errors (ECONNREFUSED, timeouts, pool exhaustion), optimizing performance issues related to connections. This includes scenarios like building serverless functions with MongoDB, creating API endpoints that use MongoDB, optimizing high-traffic MongoDB applications, creating long-running tasks and concurrency, or debugging connection-related failures.