Analyze Kernel Bottleneck

Skill Verified Active

Systematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound using roofline analysis, occupancy calculations, compute/load ratio per tile, and SASS instruction inspection. Produces a decision matrix for optimization strategy selection (cp.async, warp interleaving, tiling, double-buffering, or CuAssembler hand-tuning).
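As a rough illustration of the roofline step described above, the sketch below compares a kernel's arithmetic intensity with the machine balance point. It is not the skill's implementation. The `Roofline` class, the `classify` helper, the device numbers in the usage note, and the 50% `latency_threshold` cutoff are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Roofline:
    """Two-roof model of a device (values supplied by the caller)."""
    peak_flops: float      # peak compute throughput in FLOP/s
    peak_bandwidth: float  # peak DRAM bandwidth in bytes/s

    @property
    def machine_balance(self) -> float:
        # Arithmetic intensity (FLOP/byte) where the two roofs intersect.
        return self.peak_flops / self.peak_bandwidth

    def attainable_flops(self, intensity: float) -> float:
        # Roofline ceiling for a kernel with the given arithmetic intensity.
        return min(self.peak_flops, intensity * self.peak_bandwidth)


def classify(roof: Roofline, flops: float, bytes_moved: float,
             achieved_flops: float, latency_threshold: float = 0.5) -> str:
    """Classify a kernel as compute-, memory-, or latency-bound.

    latency_threshold is an assumed heuristic: a kernel running far below
    its roofline ceiling is treated as limited by latency, not by a roof.
    """
    intensity = flops / bytes_moved
    ceiling = roof.attainable_flops(intensity)
    if achieved_flops < latency_threshold * ceiling:
        return "latency-bound"
    if intensity >= roof.machine_balance:
        return "compute-bound"
    return "memory-bound"
```

For example, with a hypothetical device of 10 TFLOP/s and 1 TB/s (machine balance 10 FLOP/byte), a kernel that performs 2 FLOP per byte and reaches most of its bandwidth ceiling would be reported as memory-bound.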

Purpose

Identify where a GPU kernel loses performance and recommend an optimization strategy that matches the bottleneck, so developers can improve kernel efficiency.

Features

GPU kernel bottleneck classification (compute-bound, memory-bound, latency-bound)
Roofline analysis using arithmetic intensity and machine balance points
Occupancy calculation to determine active warps per SM
Compute/load ratio analysis from SASS instructions
SASS instruction mix and stall code inspection
Shared memory cliff analysis
Decision matrix for optimization strategy selection (cp.async, warp interleaving, tiling, double-buffering, or CuAssembler hand-tuning)
Structured bottleneck report generation
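The occupancy and shared memory cliff features listed above can be approximated with a few integer divisions. The sketch below is a simplified model, not the skill's code. The function name `active_warps_per_sm` is hypothetical, and the default per-SM limits are placeholder values that must be replaced with the figures for the target architecture. It also ignores register and shared memory allocation granularity.

```python
def active_warps_per_sm(threads_per_block: int, regs_per_thread: int,
                        smem_per_block: int, *,
                        max_warps_per_sm: int = 64,
                        max_blocks_per_sm: int = 32,
                        regs_per_sm: int = 65536,
                        smem_per_sm: int = 102400,
                        warp_size: int = 32) -> tuple[int, int]:
    """Return (active warps per SM, resident blocks per SM).

    The keyword defaults are assumed limits for illustration only.
    """
    # Round the block size up to a whole number of warps.
    warps_per_block = -(-threads_per_block // warp_size)

    # Resident blocks allowed by each resource, taken independently.
    limit_warps = max_warps_per_sm // warps_per_block
    regs_per_block = regs_per_thread * warp_size * warps_per_block
    limit_regs = (regs_per_sm // regs_per_block
                  if regs_per_block else max_blocks_per_sm)
    limit_smem = (smem_per_sm // smem_per_block
                  if smem_per_block else max_blocks_per_sm)

    # The tightest limit decides how many blocks are resident.
    blocks = min(max_blocks_per_sm, limit_warps, limit_regs, limit_smem)
    return blocks * warps_per_block, blocks
```

Under these assumed limits, a 256-thread block using 32 registers per thread keeps two blocks resident at 51200 bytes of shared memory per block, but only one at 51201 bytes. That one-byte step halves the active warps, which is the kind of cliff the analysis looks for.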

Use Cases

Before optimizing any CUDA kernel to establish a baseline and identify bottlenecks
After initial kernel implementation to pinpoint optimization paths
When a kernel's performance does not meet expectations
To decide between various optimization techniques like cp.async, tiling, or algorithmic changes
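When choosing between these techniques, the compute/load ratio taken from the SASS listing is one input. The sketch below counts opcodes in a disassembly text and divides compute instructions by global load instructions. It assumes the listing puts an address comment such as `/*0010*/` before each instruction. The helper names, the `SAMPLE` listing, and the opcode groupings in `COMPUTE_OPS` and `GLOBAL_LOAD_OPS` are illustrative choices, not the skill's own definitions.

```python
import re
from collections import Counter

# Address comment, optional predicate such as @P0 or @!P0, then the opcode.
_OPCODE = re.compile(r"/\*[0-9a-fA-F]{4,}\*/\s+(?:@!?\w+\s+)?([A-Z][A-Z0-9]*)")

# Assumed groupings for this example; extend them for the kernel at hand.
COMPUTE_OPS = {"FFMA", "DFMA", "HMMA", "IMMA"}
GLOBAL_LOAD_OPS = {"LDG", "LDGSTS"}


def instruction_mix(sass_text: str) -> Counter:
    """Count base opcodes (modifiers after the first '.' are dropped)."""
    counts = Counter()
    for line in sass_text.splitlines():
        match = _OPCODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def compute_load_ratio(counts: Counter) -> float:
    """Compute instructions per global load; inf if there are no loads."""
    compute = sum(counts[op] for op in COMPUTE_OPS)
    loads = sum(counts[op] for op in GLOBAL_LOAD_OPS)
    return compute / loads if loads else float("inf")


# Hypothetical four-instruction listing, used only to exercise the parser.
SAMPLE = """
        /*0000*/                   LDG.E.128 R4, [R2.64] ;
        /*0010*/                   HMMA.16816.F32 R8, R4, R6, R8 ;
        /*0020*/                   HMMA.16816.F32 R12, R4, R10, R12 ;
        /*0030*/              @!P0 BRA 0x0 ;
"""
```

Calling `compute_load_ratio(instruction_mix(SAMPLE))` gives the ratio for the sample listing. In practice the text would come from disassembling only the kernel's inner loop, so that setup and epilogue code do not distort the count.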

Non-Goals

Directly modifying CUDA source code
Automated kernel recompilation without user input
Real-time performance monitoring beyond discrete analysis runs
Analysis of CPU-bound aspects of host-device workflows

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality Score

Verified: 99/100
Analyzed about 18 hours ago

Trust Signals

Last commit: 1 day ago
GitHub owner: pjt222
Stars: 14
Downloads: 308
License: MIT
Website: pjt222.github.io
