Run Ab Test Models

Skill · Verified · Active
Part of: Agent Almanac

Design and execute A/B tests for ML models in production using traffic splitting, statistical significance testing, and canary/shadow deployment strategies. Measure performance differences and make data-driven decisions about model rollout. Use when validating a new model version before full rollout, comparing candidate models trained with different algorithms, measuring business metric impact of model changes, or when regulatory requirements mandate gradual rollout.

Purpose

Enable data-driven decisions about ML model rollouts by designing and executing controlled A/B tests in production environments.

Features

  • A/B test design with statistical significance
  • Traffic splitting and user assignment
  • Canary and shadow deployment strategies
  • Performance metric collection and analysis
  • Guardrail monitoring for safety thresholds
  • Automated rollout decision support
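As an illustration of the first two features, the sketch below shows one common way to split traffic deterministically and to test a difference in conversion rates for significance. It is not taken from the skill itself: the function names `assign_variant` and `two_proportion_z_test`, the hash-based bucketing, and the choice of a pooled two-proportion z-test are all assumptions made for this example.

```python
import hashlib
import math
from typing import Tuple


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.1) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hypothetical helper: hashing experiment name + user id keeps a user in the
    same arm across requests without storing any assignment state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both arms have the same rate.

    Uses the pooled standard error; assumes samples large enough for the
    normal approximation to hold.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one observation")
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (success_b / n_b - success_a / n_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


if __name__ == "__main__":
    print(assign_variant("user-42", "ranker-v2"))
    z, p = two_proportion_z_test(success_a=100, n_a=1000, success_b=150, n_b=1000)
    print(f"z={z:.2f}, p={p:.4f}")
```

Hashing on the experiment name as well as the user id means the same user can land in different arms of different experiments, which avoids correlated assignments across tests.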

Use cases

  • Validating new model versions before full rollout
  • Comparing candidate models trained with different algorithms
  • Measuring business metric impact of model changes
  • Meeting regulatory requirements for gradual rollout

Non-goals

  • Deploying ML models to production infrastructure
  • Training ML models
  • Managing real-time model serving infrastructure

Documentation

  • info: Configuration & parameter reference. The SKILL.md outlines the required and optional inputs for the A/B test experiment, but the main document does not give a detailed reference for every parameter or its default. The referenced examples.md likely contains this detail.

Execution

  • info: Validation. The Python examples demonstrate basic data handling and analysis, but the main SKILL.md does not show explicit schema validation (for example with Zod or Pydantic) for all inputs and outputs.
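To make the validation note concrete, here is a minimal sketch of input validation for an experiment configuration using only the standard library. The class `ExperimentConfig` and every field name and default in it are hypothetical and are not the skill's actual parameters; a Pydantic model would be a natural alternative where that dependency is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical experiment inputs; names and defaults are illustrative only."""

    name: str
    treatment_share: float           # fraction of traffic sent to the candidate model
    alpha: float = 0.05              # significance level for the final test
    min_samples_per_arm: int = 1000  # do not evaluate before this many samples

    def __post_init__(self) -> None:
        # Reject invalid values at construction time instead of mid-experiment.
        if not self.name.strip():
            raise ValueError("name must be a non-empty string")
        if not 0.0 < self.treatment_share < 1.0:
            raise ValueError("treatment_share must be strictly between 0 and 1")
        if not 0.0 < self.alpha < 1.0:
            raise ValueError("alpha must be strictly between 0 and 1")
        if self.min_samples_per_arm <= 0:
            raise ValueError("min_samples_per_arm must be positive")


if __name__ == "__main__":
    config = ExperimentConfig(name="ranker-v2", treatment_share=0.1)
    print(config)
```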

Installation

/plugin install agent-almanac@pjt222-agent-almanac

Quality score

Verified
95/100
Analyzed about 16 hours ago

Trust signals

Last commit: 1 day ago
Stars: 14
License: MIT

Similar extensions

Measure Experiment Design

100

Designs an A/B test or experiment with clear hypothesis, variants, success metrics, sample size, and duration. Use when planning experiments to validate product changes or test hypotheses.

Skill
product-on-purpose

Arize Experiment

100

Creates, runs, and analyzes Arize experiments for evaluating and comparing model performance. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI. Use when the user mentions create experiment, run experiment, compare models, model performance, evaluate AI, experiment results, benchmark, A/B test models, or measure accuracy.

Skill
github

CE Optimize

100

Run metric-driven iterative optimization loops -- define a measurable goal, run parallel experiments, measure each against hard gates or LLM-as-judge scores, keep improvements, and converge on the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation.

Skill
EveryInc

OraClaw Bandit

99

A/B testing and feature optimization for AI agents. Automatically selects the best option using multi-armed bandits and contextual bandits (LinUCB). No data warehouse required; works from the first request.

Skill
Whatsonyourmind

Experiment Designer

99

Use when planning product experiments, writing testable hypotheses, estimating sample size, prioritizing tests, or interpreting A/B outcomes with practical statistical rigor.

Skill
alirezarezvani

Ab Test Setup

98

When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "conversion experiment," "statistical significance," or "test this." For tracking implementation, see analytics-tracking.

Skill
alirezarezvani