Analyze Kernel Bottleneck
Skill | Verified | Active
Systematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound using roofline analysis, occupancy calculations, compute/load ratio per tile, and SASS instruction inspection. Produces a decision matrix for optimization strategy selection (cp.async, warp interleaving, tiling, double-buffering, or CuAssembler hand-tuning).
Purpose: identify GPU kernel performance bottlenecks systematically and turn them into actionable optimization guidance, so developers can improve kernel efficiency.
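The roofline step mentioned above can be illustrated with a minimal Python sketch. It is not the skill's own implementation: the `Device` class, the function names, and the peak numbers in the usage note are hypothetical and chosen only for illustration. The sketch compares a kernel's arithmetic intensity (FLOPs per byte moved) against the machine balance point (peak FLOP/s divided by peak bandwidth).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device description; fill in real peaks for your GPU."""
    peak_flops: float      # peak compute throughput, FLOP/s
    peak_bandwidth: float  # peak DRAM bandwidth, bytes/s

    @property
    def machine_balance(self) -> float:
        # FLOPs per byte at which the compute and memory roofs meet.
        return self.peak_flops / self.peak_bandwidth


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of global memory traffic."""
    return flops / bytes_moved


def attainable_flops(device: Device, intensity: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the memory roof."""
    return min(device.peak_flops, intensity * device.peak_bandwidth)


def classify(device: Device, flops: float, bytes_moved: float) -> str:
    """Classify a kernel by which roof limits it."""
    ai = arithmetic_intensity(flops, bytes_moved)
    return "compute-bound" if ai >= device.machine_balance else "memory-bound"
```

For example, with an assumed device of 10 TFLOP/s and 500 GB/s (balance point 20 FLOP/byte), a float32 SAXPY performs 2 FLOPs per 12 bytes moved, so `classify(Device(10e12, 500e9), flops=2, bytes_moved=12)` returns `"memory-bound"`. A roofline position alone does not reveal a latency-bound kernel; that case shows up when measured throughput sits well below the attainable ceiling, which is why the skill also uses occupancy and SASS stall inspection.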
Features
- GPU kernel bottleneck classification (compute-bound, memory-bound, latency-bound)
- Roofline analysis using arithmetic intensity and machine balance points
- Occupancy calculation to determine active warps per SM
- Compute/load ratio analysis from SASS instructions
- SASS instruction mix and stall code inspection
- Shared memory cliff analysis
- Decision matrix for optimization strategy selection (cp.async, warp interleaving, etc.)
- Structured bottleneck report generation
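The occupancy and shared memory cliff items in the list above can be sketched as follows. This is a simplified model under stated assumptions, not the skill's actual calculation: the `SMLimits` defaults are illustrative values (check the limits for your compute capability), and the model ignores register and shared memory allocation granularity, so treat the results as estimates.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass(frozen=True)
class SMLimits:
    """Illustrative per-SM limits; these are assumptions, not a specific GPU."""
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 100 * 1024


def active_warps_per_sm(threads_per_block: int,
                        regs_per_thread: int,
                        smem_per_block: int,
                        sm: SMLimits = SMLimits()) -> tuple[int, str]:
    """Return (active warps per SM, name of the limiting resource)."""
    # Round the block size up to whole warps.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE

    # Resident blocks per SM allowed by each resource.
    limits = {
        "warps": sm.max_warps // warps_per_block,
        "blocks": sm.max_blocks,
        "registers": sm.registers // regs_per_block,
    }
    if smem_per_block > 0:
        limits["shared_mem"] = sm.shared_mem_bytes // smem_per_block

    # The scarcest resource decides how many blocks fit.
    limiter = min(limits, key=limits.get)
    return limits[limiter] * warps_per_block, limiter
```

With these assumed limits, `active_warps_per_sm(256, 64, 48 * 1024)` returns `(16, "shared_mem")`, which is 25% of the 64-warp maximum. The cliff appears when a small increase in shared memory removes a whole resident block: at 51200 bytes per block two blocks fit (16 warps), while at 51201 bytes only one fits (8 warps).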
Use cases
- Before optimizing any CUDA kernel to establish a baseline and identify bottlenecks
- After initial kernel implementation to pinpoint optimization paths
- When a kernel's performance does not meet expectations
- To decide between various optimization techniques like cp.async, tiling, or algorithmic changes
Non-goals
- Directly modifying CUDA source code
- Automated kernel recompilation without user input
- Real-time performance monitoring beyond discrete analysis runs
- Analysis of CPU-bound aspects of host-device workflows
Installation
/plugin install agent-almanac@pjt222-agent-almanac