
AI Multimodal Processing Skill

Verified Skill

Multimodal AI processing via Google Gemini API (2M tokens context). Capabilities: audio (transcription, 9.5hr max, summarization, music analysis), images (captioning, OCR, object detection, segmentation, visual Q&A), video (scene detection, 6hr max, YouTube URLs, temporal analysis), documents (PDF extraction, tables, forms, charts), image generation (text-to-image, editing). Actions: transcribe, analyze, extract, caption, detect, segment, generate from media. Keywords: Gemini API, audio transcription, image captioning, OCR, object detection, video analysis, PDF extraction, text-to-image, multimodal, speech recognition, visual Q&A, scene detection, YouTube transcription, table extraction, form processing, image generation, Imagen. Use when: transcribing audio/video, analyzing images/screenshots, extracting data from PDFs, processing YouTube videos, generating images from text, implementing multimodal AI features.
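The duration limits quoted above (9.5 hours for audio, 6 hours for video) can be checked locally before a file is uploaded. The following is a minimal sketch, assuming the media duration is already known in seconds; `MAX_DURATION_HOURS` and `fits_duration_limit` are hypothetical names for illustration, not part of the skill's scripts.

```python
# Limits copied from the listing's description; they are not verified here
# against current Gemini API documentation.
MAX_DURATION_HOURS = {"audio": 9.5, "video": 6.0}


def fits_duration_limit(kind: str, duration_seconds: float) -> bool:
    """Return True if a media file of this kind is within the stated limit."""
    if kind not in MAX_DURATION_HOURS:
        raise ValueError(f"no duration limit known for kind: {kind!r}")
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Convert the hour limit to seconds before comparing.
    return duration_seconds <= MAX_DURATION_HOURS[kind] * 3600
```

A caller could use this to decide whether a long recording needs to be split before processing.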

AI Summary

This skill provides a unified command-line interface for interacting with the Google Gemini API, enabling processing of audio, images, videos, and documents, as well as image generation. It includes Python scripts for batch processing, media optimization, and document conversion, with clear instructions for API key setup and usage.

Versioning

Warning: Release Management. The SKILL.md frontmatter has 'Manifest Version: n/a', and no other versioning signal (such as a CHANGELOG or GitHub releases) is apparent. The install instructions do not reference a specific version, so they potentially default to 'main'.

Code Execution

Info: Validation. The scripts handle command-line arguments and file paths with basic argument parsing, but explicit schema validation libraries (such as Zod or Pydantic) are not visibly used for all inputs.
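As an illustration of what stricter input validation could look like without adding a dependency, here is a minimal sketch using only the Python standard library. The task names are taken from the listing's "Actions" list; the allowed file extensions, the `--task` and `--file` flags, and the function names are assumptions for this example and do not describe the skill's actual scripts.

```python
import argparse
from pathlib import Path

# Task names come from the listing's "Actions" list; the suffix set is an
# assumption chosen for illustration only.
ALLOWED_TASKS = {"transcribe", "analyze", "extract", "caption",
                 "detect", "segment", "generate"}
ALLOWED_SUFFIXES = {".mp3", ".wav", ".png", ".jpg", ".jpeg", ".mp4", ".pdf"}


def validate_request(task: str, path: str) -> Path:
    """Return the input as a Path, or raise ValueError with a clear reason."""
    if task not in ALLOWED_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    file_path = Path(path)
    # Compare suffixes case-insensitively so "scan.PDF" is accepted.
    if file_path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {file_path.suffix!r}")
    if not file_path.is_file():
        raise ValueError(f"file not found: {file_path}")
    return file_path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the hypothetical --task and --file flags."""
    parser = argparse.ArgumentParser(
        description="Validate a media request before any API call")
    parser.add_argument("--task", required=True)
    parser.add_argument("--file", required=True)
    return parser.parse_args(argv)
```

Validating before the API call keeps malformed requests from consuming quota and gives the user an error message that names the exact problem.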

Compliance

Info: GDPR. The extension processes user-provided documents and media and sends this data to the Gemini API. There is no explicit mention of personal data being sanitized before it is sent, though the Gemini API likely has its own privacy measures.
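For text inputs such as prompts or already-extracted document text, one possible form of pre-send sanitization is pattern-based redaction. The sketch below is an assumption-laden illustration, not something the skill is known to do: the regular expressions are deliberately simple, cover only email addresses and phone-like digit runs, and do not apply to audio, image, or video content.

```python
import re

# Simple illustrative patterns; real personal data takes many more forms.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

For example, `redact("Mail ana@example.com or call +1 (555) 010-9999.")` yields `"Mail [EMAIL] or call [PHONE]."`.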

Installation

npx skills add samhvw8/dot-claude

Runs Vercel's skills CLI (skills.sh) via npx. Requires Node.js locally and at least one skills-compatible agent installed (Claude Code, Cursor, Codex, …). Assumes the repository follows the agentskills.io format.

5 months ago

samhvw8

10 stars

MIT

Updated 2 days ago

View source code

Similar extensions

Transcription Automation

Automate audio/video transcription, meeting notes, subtitle generation, and content processing

Skill

claude-office-skills

Contendeo

Extension from 0xKaroshi/contendeo-mcp

Skill

0xKaroshi

ElevenLabs Speech-to-Text

Transcribe audio to text using ElevenLabs Scribe. Supports batch transcription, realtime streaming from URLs, microphone input, and local files.

Skill

elevenlabs

ASR (Speech to Text) Skill

Implement speech-to-text (ASR/automatic speech recognition) capabilities using the z-ai-web-dev-sdk. Use this skill when the user needs to transcribe audio files, convert speech to text, build voice input features, or process audio recordings. Supports base64 encoded audio files and returns accurate text transcriptions.

Skill

answerzhao

FFmpeg for Video Production

Video and audio processing with FFmpeg. Use for format conversion, resizing, compression, audio extraction, and preparing assets for Remotion. Triggers include converting GIF to MP4, resizing video, extracting audio, compressing files, or any media transformation task.

Skill

digitalsamba

ElevenLabs Speech-to-Text

Transcribe audio to text using ElevenLabs Scribe v2. Use when converting audio/video to text, generating subtitles, transcribing meetings, or processing spoken content.

Skill

elevenlabs