Analyze Kernel Bottleneck

Skill · Verified · Active

Systematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound using roofline analysis, occupancy calculations, compute/load ratio per tile, and SASS instruction inspection. Produces a decision matrix for optimization strategy selection (cp.async, warp interleaving, tiling, double-buffering, or CuAssembler hand-tuning).

Purpose

Systematically identifies GPU kernel performance bottlenecks and reports which optimization strategies address them, so developers can improve kernel efficiency.

Features

  • GPU kernel bottleneck classification (compute-bound, memory-bound, latency-bound)
  • Roofline analysis using arithmetic intensity and machine balance points
  • Occupancy calculation to determine active warps per SM
  • Compute/load ratio analysis from SASS instructions
  • SASS instruction mix and stall code inspection
  • Shared memory cliff analysis
  • Decision matrix for optimization strategy selection (cp.async, warp interleaving, etc.)
  • Structured bottleneck report generation
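The roofline classification and the shared memory cliff listed above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the `Device` class, the `classify` and `resident_blocks` functions, and the 50% efficiency cutoff used to flag a latency-bound kernel are all assumptions made for the example, and the device numbers in the usage note are made up rather than taken from a datasheet.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device description; fill in values for your own GPU."""
    peak_flops: float      # peak compute throughput, FLOP/s
    peak_bandwidth: float  # peak DRAM bandwidth, bytes/s

    @property
    def balance_point(self) -> float:
        """Machine balance in FLOP/byte (the ridge point of the roofline)."""
        return self.peak_flops / self.peak_bandwidth


def classify(device, flops, bytes_moved, seconds, latency_threshold=0.5):
    """Classify a kernel run as compute-, memory-, or latency-bound.

    Returns (label, arithmetic_intensity, efficiency), where efficiency is
    the achieved FLOP/s divided by the roofline limit at that intensity.
    The latency_threshold cutoff is an assumed heuristic: a kernel far
    below its roof is treated as latency-bound.
    """
    if min(flops, bytes_moved, seconds) <= 0:
        raise ValueError("flops, bytes_moved, and seconds must be positive")
    intensity = flops / bytes_moved
    roof = min(device.peak_flops, intensity * device.peak_bandwidth)
    efficiency = (flops / seconds) / roof
    if efficiency < latency_threshold:
        label = "latency-bound"
    elif intensity < device.balance_point:
        label = "memory-bound"
    else:
        label = "compute-bound"
    return label, intensity, efficiency


def resident_blocks(smem_per_sm, smem_per_block, max_blocks_per_sm):
    """Blocks that fit on one SM when shared memory is the only limit.

    Simplified: ignores register and thread limits and any per-block
    shared memory reserved by the hardware.
    """
    if smem_per_block == 0:
        return max_blocks_per_sm
    return min(max_blocks_per_sm, smem_per_sm // smem_per_block)
```

With an illustrative device of 100 TFLOP/s and 1 TB/s, the balance point is 100 FLOP/byte, so a kernel at 10 FLOP/byte that reaches 80% of its roof is classified as memory-bound. The cliff shows up in `resident_blocks`: with 100 KiB of shared memory per SM, a 48 KiB block leaves room for two resident blocks, while a 52 KiB block leaves room for only one.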

Use cases

  • Before optimizing any CUDA kernel to establish a baseline and identify bottlenecks
  • After initial kernel implementation to pinpoint optimization paths
  • When a kernel's performance does not meet expectations
  • To decide between various optimization techniques like cp.async, tiling, or algorithmic changes
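One way to picture the choice between techniques is a lookup from bottleneck class to candidate strategies. The mapping below is a hypothetical sketch based on general GPU optimization practice, not the decision matrix this skill actually produces; the `STRATEGIES` table and the `recommend` function are names invented for the example.

```python
# Hypothetical mapping from bottleneck class to candidate techniques.
# This is an illustrative assumption, not the skill's real decision matrix.
STRATEGIES = {
    # Raise arithmetic intensity and hide global-memory traffic.
    "memory-bound": ["tiling", "cp.async"],
    # Overlap loads with computation so warps stall less.
    "latency-bound": ["double-buffering", "warp interleaving"],
    # Remaining gains come from the instruction stream itself.
    "compute-bound": ["CuAssembler hand-tuning", "algorithmic changes"],
}


def recommend(bottleneck):
    """Return candidate techniques for a bottleneck label."""
    try:
        return list(STRATEGIES[bottleneck])
    except KeyError:
        raise ValueError(f"unknown bottleneck class: {bottleneck!r}") from None
```

Calling `recommend("latency-bound")` returns the two overlap-oriented techniques; an unrecognized label raises `ValueError` rather than silently returning an empty list.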

Non-goals

  • Directly modifying CUDA source code
  • Automated kernel recompilation without user input
  • Real-time performance monitoring beyond discrete analysis runs
  • Analysis of CPU-bound aspects of host-device workflows

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality score

Verified
99/100
Analyzed 1 day ago

Trust signals

Last commit: 2 days ago
Stars: 14
License: MIT

Similar extensions

Optimize for GPU

97

GPU-accelerate Python code using CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, KvikIO, cuCIM, cuxfilter, cuVS, cuSpatial, and RAFT. Use whenever the user mentions GPU/CUDA/NVIDIA acceleration, or wants to speed up NumPy, pandas, scikit-learn, scikit-image, NetworkX, GeoPandas, or Faiss workloads. Covers physics simulation, differentiable rendering, mesh ray casting, particle systems (DEM/SPH/fluids), vector/similarity search, GPUDirect Storage file IO, interactive dashboards, geospatial analysis, medical imaging, and sparse eigensolvers. Also use when you see CPU-bound Python code (loops, large arrays, ML pipelines, graph analytics, image processing) that would benefit from GPU acceleration, even if not explicitly requested.

Skill
K-Dense-AI

Pipeline Gpu Kernel

95

Apply software pipelining (double-buffering) to a tiled GPU kernel to overlap global memory loads with Tensor Core computation. Covers prologue/loop/epilogue restructuring, LDG-register vs cp.async (LDGSTS) variant selection based on compute/load ratio, shared memory budget verification against architecture-specific occupancy cliffs, and SASS-level verification of load/compute overlap.

Skill
pjt222

Performance Analysis

100

Comprehensive performance analysis, bottleneck detection, and optimization recommendations for Claude Flow swarms

Skill
ruvnet

Oraclaw Solver

100

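Industrial-grade scheduling and resource optimization for AI agents. Solves task scheduling with energy matching, budget allocation, and any LP/MIP constraint problem in milliseconds.

Skill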
Whatsonyourmind

Oraclaw Decide

100

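Decision intelligence for AI agents. Analyzes options, maps decision dependencies with PageRank, detects conflicts between information sources, and identifies the choices that matter most.

Skill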
Whatsonyourmind

MongoDB Connection Optimizer

100

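Optimizes MongoDB client connection configuration (pools, timeouts, patterns) for any supported driver language. Use this skill when working on, updating, or reviewing code that instantiates or configures a MongoDB client (for example, when calling `connect()`), configuring connection pools, troubleshooting connection errors (ECONNREFUSED, timeouts, pool exhaustion), or optimizing connection-related performance issues. This includes scenarios such as building serverless functions with MongoDB, creating API endpoints that use MongoDB, optimizing high-traffic MongoDB applications, creating long-running tasks and concurrency, or debugging connection-related failures.

Skill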
mongodb