Analyze Kernel Bottleneck

Skill Verified Active

Systematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound using roofline analysis, occupancy calculations, compute/load ratio per tile, and SASS instruction inspection. Produces a decision matrix for optimization strategy selection (cp.async, warp interleaving, tiling, double-buffering, or CuAssembler hand-tuning).
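The roofline part of this classification can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation: `KernelProfile`, `classify`, and the `latency_threshold` cutoff of 0.3 are hypothetical names and values chosen for the example, and the peak throughput and bandwidth figures must come from the target GPU's specifications.

```python
from dataclasses import dataclass


@dataclass
class KernelProfile:
    """Measured totals for one kernel launch (hypothetical container)."""
    flops: float        # floating-point operations executed
    bytes_moved: float  # bytes transferred to and from DRAM
    runtime_s: float    # measured runtime in seconds


def classify(profile, peak_flops, peak_bw, latency_threshold=0.3):
    """Classify a kernel with a simple roofline model.

    peak_flops is in FLOP/s and peak_bw in bytes/s. latency_threshold is an
    assumed cutoff: a kernel achieving less than this fraction of its
    roofline ceiling is treated as latency-bound.
    """
    if profile.bytes_moved <= 0 or profile.runtime_s <= 0:
        raise ValueError("bytes_moved and runtime_s must be positive")

    intensity = profile.flops / profile.bytes_moved  # FLOP per byte
    balance = peak_flops / peak_bw                   # machine balance point
    ceiling = min(peak_flops, intensity * peak_bw)   # attainable FLOP/s
    achieved = profile.flops / profile.runtime_s

    if achieved < latency_threshold * ceiling:
        return "latency-bound"
    return "memory-bound" if intensity < balance else "compute-bound"
```

With a machine balance of 10 FLOP/byte, a kernel at 1 FLOP/byte that reaches most of its bandwidth ceiling is reported as memory-bound, while the same kernel running ten times slower falls under the assumed threshold and is reported as latency-bound.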

Purpose

Identify where a GPU kernel's performance is limited and turn that diagnosis into a concrete choice of optimization strategy, so developers can improve kernel efficiency.

Features

  • GPU kernel bottleneck classification (compute-bound, memory-bound, latency-bound)
  • Roofline analysis using arithmetic intensity and machine balance points
  • Occupancy calculation to determine active warps per SM
  • Compute/load ratio analysis from SASS instructions
  • SASS instruction mix and stall code inspection
  • Shared memory cliff analysis
  • Decision matrix for optimization strategy selection (cp.async, warp interleaving, etc.)
  • Structured bottleneck report generation
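The occupancy calculation and the shared memory cliff listed above can be illustrated with a simplified model. This is a sketch, not the skill's own code: `active_warps_per_sm` is a hypothetical helper, the default per-SM limits are illustrative assumptions that should be replaced with the values for the target architecture, and register and shared memory allocation granularity is ignored.

```python
def active_warps_per_sm(
    threads_per_block,
    regs_per_thread,
    smem_per_block,
    *,
    max_warps_per_sm=64,    # assumed limit, check the target architecture
    max_blocks_per_sm=32,   # assumed limit
    regs_per_sm=65536,      # assumed register file size (32-bit registers)
    smem_per_sm=102400,     # assumed shared memory per SM in bytes
    warp_size=32,
):
    """Estimate resident warps per SM from the tightest resource limit."""
    if threads_per_block <= 0 or regs_per_thread <= 0 or smem_per_block < 0:
        raise ValueError("invalid launch configuration")

    # Round the block up to whole warps.
    warps_per_block = (threads_per_block + warp_size - 1) // warp_size

    # Number of blocks each resource allows to be resident at once.
    by_warps = max_warps_per_sm // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * warps_per_block * warp_size)
    if smem_per_block > 0:
        by_smem = smem_per_sm // smem_per_block
    else:
        by_smem = max_blocks_per_sm

    blocks = min(max_blocks_per_sm, by_warps, by_regs, by_smem)
    return blocks * warps_per_block
```

Under these assumed limits, a 256-thread block using 51200 bytes of shared memory fits twice per SM (16 warps), but one extra byte per block drops it to a single resident block (8 warps). That step change is the kind of cliff the analysis looks for.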

Use Cases

  • Before optimizing any CUDA kernel to establish a baseline and identify bottlenecks
  • After initial kernel implementation to pinpoint optimization paths
  • When a kernel's performance does not meet expectations
  • To decide between various optimization techniques like cp.async, tiling, or algorithmic changes
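A decision matrix of this kind can be represented as a plain lookup from bottleneck class to candidate techniques. The mapping below is a hypothetical example built from the techniques named in the skill description; the page does not state which technique the skill pairs with which bottleneck, so the assignments are assumptions for illustration only.

```python
# Hypothetical mapping for illustration; the skill's actual matrix may differ.
STRATEGIES = {
    "memory-bound": ["tiling", "cp.async", "double-buffering"],
    "latency-bound": ["warp interleaving", "double-buffering"],
    "compute-bound": ["CuAssembler hand-tuning"],
}


def suggest(bottleneck):
    """Return candidate optimization techniques for a bottleneck class."""
    try:
        return list(STRATEGIES[bottleneck])
    except KeyError:
        raise ValueError(f"unknown bottleneck class: {bottleneck!r}") from None
```

Keeping the matrix as data rather than branching logic makes it easy to adjust per architecture once measurements show which techniques actually help.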

Non-Goals

  • Directly modifying CUDA source code
  • Automated kernel recompilation without user input
  • Real-time performance monitoring beyond discrete analysis runs
  • Analysis of CPU-bound aspects of host-device workflows

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality Score

Verified
99/100
Analyzed about 18 hours ago

Trust Signals

Last commit: 1 day ago
Stars: 14
License: MIT

Similar Extensions

Optimize for GPU

97

GPU-accelerate Python code using CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, KvikIO, cuCIM, cuxfilter, cuVS, cuSpatial, and RAFT. Use whenever the user mentions GPU/CUDA/NVIDIA acceleration, or wants to speed up NumPy, pandas, scikit-learn, scikit-image, NetworkX, GeoPandas, or Faiss workloads. Covers physics simulation, differentiable rendering, mesh ray casting, particle systems (DEM/SPH/fluids), vector/similarity search, GPUDirect Storage file IO, interactive dashboards, geospatial analysis, medical imaging, and sparse eigensolvers. Also use when you see CPU-bound Python code (loops, large arrays, ML pipelines, graph analytics, image processing) that would benefit from GPU acceleration, even if not explicitly requested.

Skill
K-Dense-AI

Pipeline Gpu Kernel

95

Apply software pipelining (double-buffering) to a tiled GPU kernel to overlap global memory loads with Tensor Core computation. Covers prologue/loop/epilogue restructuring, LDG-register vs cp.async (LDGSTS) variant selection based on compute/load ratio, shared memory budget verification against architecture-specific occupancy cliffs, and SASS-level verification of load/compute overlap.

Skill
pjt222

Performance Analysis

100

Comprehensive performance analysis, bottleneck detection, and optimization recommendations for Claude Flow swarms

Skill
ruvnet

Oraclaw Solver

100

Industrial-grade scheduling and resource optimization for AI agents. Solve task scheduling with energy matching, budget allocation, and any LP/MIP constraint problem in milliseconds.

Skill
Whatsonyourmind

Oraclaw Decide

100

Decision intelligence for AI agents. Analyze options, map decision dependencies with PageRank, detect when information sources conflict, and find the choices that matter most.

Skill
Whatsonyourmind

MongoDB Connection Optimizer

100

Optimize MongoDB client connection configuration (pools, timeouts, patterns) for any supported driver language. Use this skill when writing, updating, or reviewing functions that instantiate or configure a MongoDB client (e.g., when calling `connect()`), configuring connection pools, troubleshooting connection errors (ECONNREFUSED, timeouts, pool exhaustion), or resolving connection-related performance issues. This includes scenarios like building serverless functions with MongoDB, creating API endpoints that use MongoDB, optimizing high-traffic MongoDB applications, creating long-running or concurrent tasks, and debugging connection-related failures.

Skill
mongodb

© 2025 SkillRepo · Find the right skill, skip the noise.