Analyze Kernel Bottleneck

Skill · Verified · Active

Systematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound using roofline analysis, occupancy calculations, compute/load ratio per tile, and SASS instruction inspection. Produces a decision matrix for optimization strategy selection (cp.async, warp interleaving, tiling, double-buffering, or CuAssembler hand-tuning).
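The roofline step of this classification can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the `Device` class, the `classify` function, the peak numbers, and the 0.5 efficiency cutoff used to flag a kernel as latency-bound are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Peak figures for one GPU. The values used below are hypothetical."""
    peak_flops: float      # FLOP/s
    peak_bandwidth: float  # bytes/s of global memory traffic

    @property
    def balance_point(self) -> float:
        # Machine balance: the FLOP/byte ratio where the two roofs meet.
        return self.peak_flops / self.peak_bandwidth


def classify(flops, bytes_moved, runtime_s, device, latency_threshold=0.5):
    """Return (label, arithmetic intensity, fraction of the roof achieved)."""
    intensity = flops / bytes_moved
    # Roofline: attainable throughput is capped by compute or by bandwidth.
    attainable = min(device.peak_flops, intensity * device.peak_bandwidth)
    efficiency = (flops / runtime_s) / attainable
    # Assumed heuristic: far below the roof means neither limit is saturated,
    # so stalls (latency) are the likely culprit.
    if efficiency < latency_threshold:
        return "latency-bound", intensity, efficiency
    if intensity < device.balance_point:
        return "memory-bound", intensity, efficiency
    return "compute-bound", intensity, efficiency


# Hypothetical device: 100 TFLOP/s and 1 TB/s, so the balance point is 100.
gpu = Device(peak_flops=100e12, peak_bandwidth=1e12)
print(classify(flops=2e9, bytes_moved=1e9, runtime_s=1.1e-3, device=gpu))
```

In practice the FLOP count, byte count, and runtime would come from a profiler run, and the peak figures from the vendor's specification for the target GPU.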

Purpose

To systematically identify GPU kernel performance bottlenecks and provide actionable insights for optimization strategies, enabling developers to improve kernel efficiency.

Features

GPU kernel bottleneck classification (compute-bound, memory-bound, latency-bound)
Roofline analysis using arithmetic intensity and machine balance points
Occupancy calculation to determine active warps per SM
Compute/load ratio analysis from SASS instructions
SASS instruction mix and stall code inspection
Shared memory cliff analysis
Decision matrix for optimization strategy selection (cp.async, warp interleaving, tiling, double-buffering, or CuAssembler hand-tuning)
Structured bottleneck report generation
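The occupancy calculation and the shared memory cliff listed above can be illustrated with a simplified model. This is only a sketch under stated assumptions: the `occupancy` function is hypothetical, the per-SM limits are placeholder defaults that must be replaced with the figures for the actual architecture, and hardware allocation granularity for registers and shared memory is ignored.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block, *,
              max_warps_per_sm=64, max_blocks_per_sm=32,
              regs_per_sm=65536, smem_per_sm=102400, warp_size=32):
    """Return (active warps per SM, occupancy fraction, limiting resource).

    All per-SM limits are placeholder assumptions, not values for a
    specific GPU. Allocation granularity is deliberately ignored.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    # Resident blocks per SM allowed by each resource taken on its own.
    limits = {
        "warps": max_warps_per_sm // warps_per_block,
        "blocks": max_blocks_per_sm,
        "registers": regs_per_sm // (regs_per_thread * warps_per_block * warp_size),
        "shared_memory": (smem_per_sm // smem_per_block
                          if smem_per_block else max_blocks_per_sm),
    }
    limiter = min(limits, key=limits.get)  # the tightest resource wins
    active_warps = limits[limiter] * warps_per_block
    return active_warps, active_warps / max_warps_per_sm, limiter


# One extra byte of shared memory per block halves the resident blocks:
print(occupancy(256, 32, 51200))
print(occupancy(256, 32, 51201))
```

The two calls show the cliff: with these assumed limits, growing the per-block shared memory from 51200 to 51201 bytes drops the kernel from two resident blocks per SM to one.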

Use Cases

Before optimizing any CUDA kernel to establish a baseline and identify bottlenecks
After initial kernel implementation to pinpoint optimization paths
When a kernel's performance does not meet expectations
To decide between various optimization techniques like cp.async, tiling, or algorithmic changes

Non-Goals

Directly modifying CUDA source code
Automated kernel recompilation without user input
Real-time performance monitoring beyond discrete analysis runs
Analysis of CPU-bound aspects of host-device workflows

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality Score

Verified

99/100

Analyzed 1 day ago

Trust Signals

Latest commit: 2 days ago

GitHub owner: pjt222

Stars: 14

Downloads: 308

License: MIT

Website: pjt222.github.io

Similar Extensions

Optimize for GPU

GPU-accelerate Python code using CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, KvikIO, cuCIM, cuxfilter, cuVS, cuSpatial, and RAFT. Use whenever the user mentions GPU/CUDA/NVIDIA acceleration, or wants to speed up NumPy, pandas, scikit-learn, scikit-image, NetworkX, GeoPandas, or Faiss workloads. Covers physics simulation, differentiable rendering, mesh ray casting, particle systems (DEM/SPH/fluids), vector/similarity search, GPUDirect Storage file IO, interactive dashboards, geospatial analysis, medical imaging, and sparse eigensolvers. Also use when you see CPU-bound Python code (loops, large arrays, ML pipelines, graph analytics, image processing) that would benefit from GPU acceleration, even if not explicitly requested.

Skill

K-Dense-AI

Pipeline GPU Kernel

Apply software pipelining (double-buffering) to a tiled GPU kernel to overlap global memory loads with Tensor Core computation. Covers prologue/loop/epilogue restructuring, LDG-register vs cp.async (LDGSTS) variant selection based on compute/load ratio, shared memory budget verification against architecture-specific occupancy cliffs, and SASS-level verification of load/compute overlap.

Skill

pjt222

Performance Analysis

100

Comprehensive performance analysis, bottleneck detection, and optimization recommendations for Claude Flow swarms

Skill

ruvnet

Oraclaw Solver

100

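Industrial-grade scheduling and resource optimization for AI agents. Solves task scheduling in milliseconds through energy matching, budget allocation, and any LP/MIP constraint problem.

Skill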

Whatsonyourmind

Oraclaw Decide

100

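Decision intelligence for AI agents. Analyzes options, maps decision dependencies using PageRank, detects conflicts between information sources, and identifies the choices that matter most.

Skill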

Whatsonyourmind

MongoDB Connection Optimizer

100

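Optimizes MongoDB client connection configuration (pools, timeouts, patterns) for any supported driver language. Use this skill when working on, updating, or reviewing code that instantiates or configures a MongoDB client (for example, when calling `connect()`), configuring connection pools, troubleshooting connection errors (ECONNREFUSED, timeouts, pool exhaustion), or optimizing connection-related performance issues. This includes scenarios such as building serverless functions with MongoDB, creating API endpoints that use MongoDB, optimizing high-traffic MongoDB applications, creating long-running tasks and concurrency, or debugging connection-related failures.

Skill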

mongodb