Arize Experiment
Skill Verified ActiveCreates, runs, and analyzes Arize experiments for evaluating and comparing model performance. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI. Use when the user mentions create experiment, run experiment, compare models, model performance, evaluate AI, experiment results, benchmark, A/B test models, or measure accuracy.
To streamline the process of evaluating and comparing model performance by managing Arize experiments, from creation to detailed analysis.
Features
- Create, run, and analyze Arize experiments
- Export experiment runs to various formats (JSON, CSV, Parquet)
- Compare results between different experiments
- Manage datasets and experiment configurations
- Utilize the `ax` CLI for all operations
Use Cases
- Use when you need to create a new experiment to track model performance against a dataset.
- Use when you need to compare the results of two or more experiments to identify regressions or improvements.
- Use when you need to export experiment data for further analysis or visualization.
- Use when you need to benchmark different models or prompt strategies using A/B testing.
Non-Goals
- Setting up or managing the Arize platform itself (focus is on experiments).
- Directly calling the Arize REST API (relies on `ax` CLI abstraction).
- Fabricating or simulating model outputs or evaluation scores.
Installation
npx skills add github/awesome-copilotRuns the Vercel skills CLI (skills.sh) via npx — needs Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.
Quality Score
VerifiedSimilar Extensions
Arize Evaluator
100Handles LLM-as-judge evaluation workflows on Arize including creating/updating evaluators, running evaluations on spans or experiments, managing tasks, trigger-run operations, column mapping, and continuous monitoring. Use when the user mentions create evaluator, LLM judge, hallucination, faithfulness, correctness, relevance, run eval, score spans, score experiment, trigger-run, column mapping, continuous monitoring, or improve evaluator prompt.
Arize Dataset
100Creates, manages, and queries Arize datasets and examples. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI. Use when the user needs test data, evaluation examples, or mentions create dataset, list datasets, export dataset, append examples, dataset version, golden dataset, or test set.
CE Optimize
100Run metric-driven iterative optimization loops -- define a measurable goal, run parallel experiments, measure each against hard gates or LLM-as-judge scores, keep improvements, and converge on the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation.
Arize Prompt Optimization
100Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.
Arize Ai Provider Integration
100Creates, reads, updates, and deletes Arize AI integrations that store LLM provider credentials used by evaluators and other Arize features. Supports any LLM provider (e.g. OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Vertex AI, Gemini, NVIDIA NIM). Use when the user mentions AI integration, LLM provider credentials, create integration, list integrations, update credentials, delete integration, or connecting an LLM provider to Arize.
Trader Regime
100Detect current market regime using npx neural-trader — bull/bear/ranging/volatile classification with recommended strategy