ASR (Speech to Text) Skill

Skill verified
95

Implement speech-to-text (ASR/automatic speech recognition) capabilities using the z-ai-web-dev-sdk. Use this skill when the user needs to transcribe audio files, convert speech to text, build voice input features, or process audio recordings. Supports base64 encoded audio files and returns accurate text transcriptions.

AI summary

This skill provides Automatic Speech Recognition (ASR) functionality, allowing users to transcribe audio files and convert speech to text. It supports both command-line interface (CLI) for quick tasks and an SDK for programmatic integration, handling base64 encoded audio and offering detailed examples for various use cases.
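
As a minimal sketch of the base64 step mentioned above, the snippet below reads an audio file from disk and encodes it as a base64 string using only Node.js built-ins. The helper name `audioFileToBase64` and the demo file are hypothetical; the call that would pass the string to z-ai-web-dev-sdk is deliberately left out, because its signature is not shown on this page.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper (not part of the SDK): read a file and return its bytes as base64.
function audioFileToBase64(filePath: string): string {
  return readFileSync(filePath).toString("base64");
}

// Demo with a throwaway file holding the 4-byte "RIFF" magic that WAV files start with.
const demoDir = mkdtempSync(join(tmpdir(), "asr-base64-"));
const samplePath = join(demoDir, "sample.wav");
writeFileSync(samplePath, Buffer.from("RIFF", "ascii"));

const encoded = audioFileToBase64(samplePath);
console.log(encoded); // prints "UklGRg=="
```

Base64 output is roughly a third larger than the raw bytes, which is worth keeping in mind when an upload size limit applies.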

Versioning

  • warning: Release Management. There is no explicit versioning information (e.g., a version field in SKILL.md, a CHANGELOG, or GitHub releases) provided for this skill.

Code Execution

  • info: Validation. While the CLI parameters are documented, there is no explicit mention or demonstration of input validation libraries (like Zod or pydantic) for script arguments or SDK parameters. The `safeTranscribe` function includes basic file existence and size checks.
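
The kind of file existence and size check described above could look roughly like the sketch below. This is not the skill's `safeTranscribe` implementation, which is not shown on this page; `checkAudioFile`, `CheckResult`, and the 10 MiB limit are assumptions made for illustration.

```typescript
import { existsSync, mkdtempSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumed limit for illustration only; the skill's real limit is not stated on this page.
const MAX_AUDIO_BYTES = 10 * 1024 * 1024;

type CheckResult =
  | { ok: true; sizeBytes: number }
  | { ok: false; reason: string };

// Hypothetical pre-flight check: the file must exist, be a regular file,
// be non-empty, and not exceed maxBytes.
function checkAudioFile(filePath: string, maxBytes: number = MAX_AUDIO_BYTES): CheckResult {
  if (!existsSync(filePath)) return { ok: false, reason: "file not found" };
  const stats = statSync(filePath);
  if (!stats.isFile()) return { ok: false, reason: "not a regular file" };
  if (stats.size === 0) return { ok: false, reason: "file is empty" };
  if (stats.size > maxBytes) return { ok: false, reason: "file too large" };
  return { ok: true, sizeBytes: stats.size };
}

// Demo against a throwaway 4-byte file.
const checkDir = mkdtempSync(join(tmpdir(), "asr-check-"));
const smallPath = join(checkDir, "small.wav");
writeFileSync(smallPath, Buffer.from("RIFF", "ascii"));

const present = checkAudioFile(smallPath);
const missing = checkAudioFile(join(checkDir, "missing.wav"));
const tooBig = checkAudioFile(smallPath, 2); // artificially low limit of 2 bytes

console.log(present, missing, tooBig);
```

Returning a tagged result instead of throwing lets a caller report the reason to the user before any audio is sent for transcription. A schema library such as Zod could replace the manual checks on arguments, but that is a design choice the page leaves open.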

Compliance

  • info: GDPR. The skill processes audio data, which may contain personal data. While it doesn't submit this data to third parties, the data is not explicitly sanitized before being sent to the z-ai-web-dev-sdk's ASR service.

Installation

npx skills add answerzhao/agent-skills

Runs the Vercel skills CLI (skills.sh) via npx. This requires Node.js installed locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). It assumes the repository follows the agentskills.io format.

4 months ago
26 stars
MIT
Updated 6 days ago

Similar extensions

ElevenLabs Speech-to-Text

95

Transcribe audio to text using ElevenLabs Scribe v2. Use when converting audio/video to text, generating subtitles, transcribing meetings, or processing spoken content.

Skill
elevenlabs

Qwen ASR

92

Transcribe audio files using Qwen ASR. Use when the user sends voice messages and wants them converted to text.

Skill
aahl

ElevenLabs Speech-to-Text

98

Transcribe audio to text using ElevenLabs Scribe. Supports batch transcription, realtime streaming from URLs, microphone input, and local files.

Skill
elevenlabs

ElevenLabs Agents

98

Build voice AI agents with ElevenLabs. Use when creating voice assistants, customer service bots, interactive voice characters, or any real-time voice conversation experience.

Skill
elevenlabs

AI Multimodal Processing Skill

95

Multimodal AI processing via Google Gemini API (2M tokens context). Capabilities: audio (transcription, 9.5hr max, summarization, music analysis), images (captioning, OCR, object detection, segmentation, visual Q&A), video (scene detection, 6hr max, YouTube URLs, temporal analysis), documents (PDF extraction, tables, forms, charts), image generation (text-to-image, editing). Actions: transcribe, analyze, extract, caption, detect, segment, generate from media. Keywords: Gemini API, audio transcription, image captioning, OCR, object detection, video analysis, PDF extraction, text-to-image, multimodal, speech recognition, visual Q&A, scene detection, YouTube transcription, table extraction, form processing, image generation, Imagen. Use when: transcribing audio/video, analyzing images/screenshots, extracting data from PDFs, processing YouTube videos, generating images from text, implementing multimodal AI features.

Skill
samhvw8

Text-to-Speech (TTS)

95

Implement text-to-speech (TTS) capabilities using the z-ai-web-dev-sdk. Use this skill when the user needs to convert text into natural-sounding speech, create audio content, build voice-enabled applications, or generate spoken audio files. Supports multiple voices, adjustable speed, and various audio formats.

Skill
answerzhao