
Pipeline GPU Kernel

Skill Verified Active

Apply software pipelining (double-buffering) to a tiled GPU kernel to overlap global memory loads with Tensor Core computation. Covers prologue/loop/epilogue restructuring, LDG-register vs cp.async (LDGSTS) variant selection based on compute/load ratio, shared memory budget verification against architecture-specific occupancy cliffs, and SASS-level verification of load/compute overlap.

Purpose

Improve the performance of a memory-bound, tiled GPU kernel by double-buffering its tiles, so that global memory loads for the next tile overlap with Tensor Core computation on the current one.
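The prologue/loop/epilogue restructuring can be sketched on the host as a small scheduling model. This is a minimal illustration, not the skill's actual kernel: the function name `pipelined_sum_of_squares`, the two-slot buffer list, and the sum-of-squares "compute" are all hypothetical stand-ins. Python runs the steps sequentially; on a GPU the load would be asynchronous, so issuing it before the compute step is what lets the two overlap, and a real kernel would also need synchronization before a buffer is read or reused.

```python
def pipelined_sum_of_squares(tiles):
    """Double-buffered reduction: prefetch tile k+1, then compute on tile k."""
    trace = []               # order in which operations are issued
    buffers = [None, None]   # two staging slots (stand-ins for shared memory)
    acc = 0

    def load(k, slot):
        buffers[slot] = list(tiles[k])   # stand-in for a global -> shared copy
        trace.append(f"load {k}")

    def compute(k, slot):
        nonlocal acc
        acc += sum(x * x for x in buffers[slot])   # stand-in for the MMA work
        trace.append(f"compute {k}")

    n = len(tiles)
    if n == 0:
        return acc, trace

    # Prologue: fill the first slot before the loop starts.
    load(0, 0)

    # Steady state: issue the load for tile k+1, then compute on tile k.
    for k in range(n - 1):
        load(k + 1, (k + 1) % 2)
        compute(k, k % 2)

    # Epilogue: the last tile has no successor to prefetch.
    compute(n - 1, (n - 1) % 2)
    return acc, trace


acc, trace = pipelined_sum_of_squares([[1, 2], [3, 4], [5, 6]])
print(acc)    # 91
print(trace)  # ['load 0', 'load 1', 'compute 0', 'load 2', 'compute 1', 'compute 2']
```

The trace shows the key property: every load after the first is issued ahead of the compute step for the previous tile, and each slot is only overwritten after its tile has been consumed.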

Features

  • Software pipelining for GPU kernels
  • Double-buffering of shared memory
  • Variant selection based on compute/load ratio
  • Analysis of load/compute overlap in SASS
  • Shared memory budget verification against occupancy cliffs
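The shared memory budget check in the list above comes down to simple arithmetic, sketched below. The helper names are hypothetical, the tile shapes are examples, and the 100 KiB per-SM capacity is an assumed illustrative figure; the real limit (and any per-block reservation) depends on the target architecture and should be taken from the CUDA documentation for that compute capability.

```python
def smem_bytes_per_block(bm, bn, bk, elem_bytes, stages):
    """Bytes for the A (bm x bk) and B (bk x bn) tiles, times the stage count."""
    return stages * (bm * bk + bk * bn) * elem_bytes


def blocks_per_sm(smem_per_block, smem_per_sm, reserved_per_block=0):
    """Upper bound on resident blocks per SM from shared memory alone."""
    return smem_per_sm // (smem_per_block + reserved_per_block)


SMEM_PER_SM = 100 * 1024  # assumption: illustrative capacity, not a quoted spec

single = smem_bytes_per_block(128, 128, 32, elem_bytes=2, stages=1)
double = smem_bytes_per_block(128, 128, 32, elem_bytes=2, stages=2)
deep = smem_bytes_per_block(128, 128, 64, elem_bytes=2, stages=2)

print(single, blocks_per_sm(single, SMEM_PER_SM))  # 16384 6
print(double, blocks_per_sm(double, SMEM_PER_SM))  # 32768 3
print(deep, blocks_per_sm(deep, SMEM_PER_SM))      # 65536 1
```

Under these assumed numbers, double-buffering halves the number of blocks that fit on an SM, and doubling the K-tile depth on top of that drops it to one. That kind of step change is what an occupancy cliff looks like. Registers and thread counts impose their own limits, which this sketch ignores.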

Use cases

  • When a GPU kernel is identified as memory-bound.
  • When warp interleaving alone is insufficient to hide DRAM latency.
  • When restructuring a sequential load-sync-compute-sync kernel loop.
  • When Tensor Core computation needs to be overlapped with global memory loads.
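One way to reason about whether a tiled kernel is likely to be memory-bound is the compute/load ratio per tile. The sketch below assumes a GEMM-style tile (an A tile of bm x bk and a B tile of bk x bn, one multiply-add per element pair); the function name is hypothetical. It only computes the ratio. The threshold the skill uses to choose between the LDG-register and cp.async variants is not stated on this page, so none is encoded here.

```python
def tile_compute_load_ratio(bm, bn, bk, elem_bytes):
    """FLOPs per byte loaded from global memory for one K-step of a GEMM tile."""
    flops = 2 * bm * bn * bk                       # one multiply + one add per MAC
    bytes_loaded = (bm * bk + bk * bn) * elem_bytes
    return flops / bytes_loaded


print(tile_compute_load_ratio(128, 128, 32, elem_bytes=2))  # 64.0
print(tile_compute_load_ratio(64, 64, 32, elem_bytes=2))    # 32.0
```

The ratio depends on bm and bn but not on bk, since bk cancels out. Smaller output tiles therefore do less arithmetic per byte loaded, which leaves less compute available to hide each load.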

Non-goals

  • Optimizing kernels that are not memory-bound.
  • Addressing bottlenecks unrelated to memory loads or Tensor Core computation.
  • Applying pipelining to kernels without a distinct load and compute phase.
  • Basic CUDA compilation; assumes familiarity with `nvcc` and GPU architectures.

Practical Utility

  • Info: Usage examples. While the SKILL.md provides detailed procedural steps, it lacks concrete end-to-end invocation examples with specific inputs and expected outputs for the CUDA kernel optimization.

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality score

Verified
95/100
Analyzed 2 days ago

Trust signals

Last commit: 3 days ago
Stars: 14
License: MIT

Similar extensions

Performance Analysis

100

Comprehensive performance analysis, bottleneck detection, and optimization recommendations for Claude Flow swarms

Skill
ruvnet

MongoDB Connection Optimizer

100

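Optimizes MongoDB client connection configuration (pools, timeouts, patterns) for any supported driver language. Use this skill when working on, updating, or reviewing code that instantiates or configures a MongoDB client (e.g., when calling `connect()`), configuring connection pools, troubleshooting connection errors (ECONNREFUSED, timeouts, pool exhaustion), or optimizing connection-related performance issues. This includes scenarios such as building serverless functions with MongoDB, creating API endpoints that use MongoDB, optimizing high-traffic MongoDB applications, creating long-running tasks and concurrency, or debugging connection-related failures.

Skill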
mongodb

SQL Optimization

100

Universal SQL performance optimization assistant for comprehensive query tuning, indexing strategies, and database performance analysis across all SQL databases (MySQL, PostgreSQL, SQL Server, Oracle). Provides execution plan analysis, pagination optimization, batch operations, and performance monitoring guidance.

Skill
github

Core Web Vitals

100

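Optimize Core Web Vitals (LCP, INP, CLS) for better page experience and search ranking. Use when asked to "improve Core Web Vitals", "fix LCP", "reduce CLS", "optimize INP", "page experience optimization", or "fix layout shifts".

Skill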
addyosmani

Analyze Kernel Bottleneck

99

Systematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound using roofline analysis, occupancy calculations, compute/load ratio per tile, and SASS instruction inspection. Produces a decision matrix for optimization strategy selection (cp.async, warp interleaving, tiling, double-buffering, or CuAssembler hand-tuning).

Skill
pjt222

Vector Index Tuning

99

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.

Skill
wshobson