Pipeline GPU Kernel

Apply software pipelining (double-buffering) to a tiled GPU kernel to overlap global memory loads with Tensor Core computation. Covers prologue/loop/epilogue restructuring, LDG-register vs cp.async (LDGSTS) variant selection based on compute/load ratio, shared memory budget verification against architecture-specific occupancy cliffs, and SASS-level verification of load/compute overlap.

Purpose

Improve the performance of memory-bound, tiled GPU kernels by using software pipelining (double-buffering) to overlap global memory loads with computation.

Features

Software pipelining for GPU kernels
Double-buffering of shared memory
Variant selection based on compute/load ratio
Analysis of load/compute overlap in SASS
Shared memory budget verification against occupancy cliffs
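The shared memory budget check can be sketched on the host as simple arithmetic: double-buffering doubles the tile storage per block, which can reduce how many blocks fit on one SM. The function names, the tile shape, and the 64 KiB per-SM capacity used in the usage note are hypothetical placeholders, not values taken from the skill; real limits depend on the target architecture and should be looked up for it. The sketch also ignores any per-block system reservation and the register and thread limits that bound occupancy as well.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of shared memory one block needs for its A (tile_m x tile_k) and
// B (tile_k x tile_n) operand tiles. `stages` is 1 for a single-buffered
// kernel and 2 for double-buffering. (Hypothetical helper for illustration.)
inline std::size_t smem_bytes_per_block(std::size_t tile_m, std::size_t tile_n,
                                        std::size_t tile_k, std::size_t elem_bytes,
                                        std::size_t stages) {
    assert(stages >= 1);
    return stages * (tile_m * tile_k + tile_k * tile_n) * elem_bytes;
}

// Upper bound on resident blocks per SM from shared memory alone.
// `smem_per_sm` is an assumption the caller must supply for the target GPU.
inline std::size_t blocks_per_sm_by_smem(std::size_t smem_per_sm,
                                         std::size_t smem_per_block) {
    assert(smem_per_block > 0);
    return smem_per_sm / smem_per_block;
}
```

For a 128 x 128 x 32 tile of 2-byte elements, this gives 16384 bytes single-buffered and 32768 bytes double-buffered; with an assumed 64 KiB of shared memory per SM, the bound drops from 4 resident blocks to 2, which is the kind of occupancy cliff the verification step is meant to catch.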

Use Cases

When a GPU kernel is identified as memory-bound.
When warp interleaving alone is insufficient to hide DRAM latency.
When restructuring a sequential load-sync-compute-sync kernel loop.
When Tensor Core computation should be overlapped with global memory loads.
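The restructuring from a sequential load-sync-compute-sync loop into prologue, steady-state loop, and epilogue can be illustrated with a host-side C++ sketch of the control flow only. Nothing here runs on a GPU or overlaps anything: `load_tile` and `compute_tile` are hypothetical stand-ins for the global-to-shared copy and the Tensor Core work, and `kTile` is an arbitrary tile width chosen for the example. Comments mark where a real kernel would likely place its asynchronous load and barrier.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile width; a real kernel would derive this from its block shape.
constexpr std::size_t kTile = 4;
using Tile = std::array<float, kTile>;

// Stand-in for the global -> shared memory copy (LDG + STS, or cp.async on GPU).
inline void load_tile(const std::vector<float>& src, std::size_t tile_idx, Tile& dst) {
    for (std::size_t i = 0; i < kTile; ++i) dst[i] = src[tile_idx * kTile + i];
}

// Stand-in for the compute phase on one resident tile.
inline float compute_tile(const Tile& t) {
    float acc = 0.0f;
    for (float v : t) acc += v * v;
    return acc;
}

// Sum of squares over `src`, processed tile by tile with two buffers.
inline float pipelined_sum_of_squares(const std::vector<float>& src) {
    assert(!src.empty() && src.size() % kTile == 0);
    const std::size_t num_tiles = src.size() / kTile;
    Tile buf[2];
    float acc = 0.0f;

    // Prologue: fetch tile 0 before the loop starts.
    load_tile(src, 0, buf[0]);

    // Steady state: issue the load for tile k+1, then compute on tile k.
    for (std::size_t k = 0; k + 1 < num_tiles; ++k) {
        load_tile(src, k + 1, buf[(k + 1) % 2]);  // on GPU: issued asynchronously
        acc += compute_tile(buf[k % 2]);          // on GPU: runs while the load is in flight
        // on GPU: a barrier (and a wait on the async copy) would typically go here
    }

    // Epilogue: the last tile has no successor to prefetch.
    acc += compute_tile(buf[(num_tiles - 1) % 2]);
    return acc;
}
```

The two buffers alternate roles each iteration, so the tile being computed is never the one being written. On a GPU that same alternation is what lets the next load proceed while the current tile is consumed.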

Non-Goals

Optimizing kernels that are not memory-bound.
Addressing bottlenecks unrelated to memory loads or Tensor Core computation.
Applying pipelining to kernels without a distinct load and compute phase.
Basic CUDA compilation; assumes familiarity with `nvcc` and GPU architectures.

Practical Utility

Usage examples: While the SKILL.md provides detailed procedural steps, it lacks concrete end-to-end invocation examples with specific inputs and expected outputs for the CUDA kernel optimization.

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality Score

Verified

95/100

Analyzed about 20 hours ago

Trust Signals

Last commit: 1 day ago

GitHub owner: pjt222

Stars: 14

Downloads: 308

License: MIT

Website: pjt222.github.io

