
AI Multimodal Processing Skill

Skill Verified

Multimodal AI processing via the Google Gemini API (2M-token context).

Capabilities: audio (transcription, 9.5hr max, summarization, music analysis), images (captioning, OCR, object detection, segmentation, visual Q&A), video (scene detection, 6hr max, YouTube URLs, temporal analysis), documents (PDF extraction, tables, forms, charts), image generation (text-to-image, editing).

Actions: transcribe, analyze, extract, caption, detect, segment, generate from media.

Keywords: Gemini API, audio transcription, image captioning, OCR, object detection, video analysis, PDF extraction, text-to-image, multimodal, speech recognition, visual Q&A, scene detection, YouTube transcription, table extraction, form processing, image generation, Imagen.

Use when: transcribing audio/video, analyzing images/screenshots, extracting data from PDFs, processing YouTube videos, generating images from text, implementing multimodal AI features.
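The duration limits quoted above (9.5 hours for audio, 6 hours for video) can be checked before a file is uploaded. The following is a minimal sketch, not code from the skill: the function names `within_limit` and `chunks_needed` are hypothetical, and the limits are copied from the description rather than verified against the Gemini API documentation.

```python
import math

# Limits as stated in the skill description (hours); treated here as assumptions.
MAX_HOURS = {"audio": 9.5, "video": 6.0}


def within_limit(modality: str, duration_seconds: float) -> bool:
    """Return True if a clip of this length fits the stated limit for its modality."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    limit_hours = MAX_HOURS.get(modality)
    if limit_hours is None:
        return True  # the description states no duration limit for this modality
    return duration_seconds <= limit_hours * 3600


def chunks_needed(modality: str, duration_seconds: float) -> int:
    """Return how many pieces a clip must be split into to respect the stated limit."""
    limit_hours = MAX_HOURS.get(modality)
    if limit_hours is None or duration_seconds <= 0:
        return 1
    return max(1, math.ceil(duration_seconds / (limit_hours * 3600)))
```

For example, a 13-hour video would need to be split into three parts under the stated 6-hour limit, while a 9-hour audio file could be sent as one piece.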

AI Summary

This skill provides a unified command-line interface for interacting with the Google Gemini API, enabling processing of audio, images, videos, and documents, as well as image generation. It includes Python scripts for batch processing, media optimization, and document conversion, with clear instructions for API key setup and usage.
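As a rough illustration of the setup step and the "unified" handling of media types, a script could resolve the API key from the environment and route each input file to a modality by its MIME type. This is a sketch under assumptions: the environment variable name `GEMINI_API_KEY` and the helpers `resolve_api_key` and `modality_for` are illustrative and are not taken from the skill's own scripts.

```python
import mimetypes
import os


def resolve_api_key(env=None) -> str:
    """Read the API key from the environment (variable name is an assumption)."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return key


def modality_for(path: str) -> str:
    """Map a file name to one of: audio, image, video, document, unknown."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    if mime == "application/pdf":
        return "document"
    top_level = mime.split("/", 1)[0]
    return top_level if top_level in {"audio", "image", "video"} else "unknown"
```

Failing early on a missing key keeps the error close to its cause, instead of surfacing later as an authentication failure from the API.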

Versioning

Warning: Release Management. The SKILL.md frontmatter has 'Manifest Version: n/a', and no other versioning signal (such as a CHANGELOG or GitHub releases) is apparent. The install instructions do not reference a specific version, so installs may default to the 'main' branch.

Code Execution

Info: Validation. While the scripts handle command-line arguments and file paths, explicit schema validation libraries (like Zod or Pydantic) are not visibly used for all inputs, though basic argument parsing is present.
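To make the validation point concrete, here is one way explicit input checks could look using only the standard library. It is a sketch, not the skill's code: `Request` and `validate_request` are hypothetical names, and the allowed task list is taken from the actions named in the skill description.

```python
from dataclasses import dataclass
from pathlib import Path

# Actions listed in the skill description.
ALLOWED_TASKS = {
    "transcribe", "analyze", "extract", "caption", "detect", "segment", "generate",
}


@dataclass(frozen=True)
class Request:
    task: str
    files: tuple


def validate_request(task: str, files, must_exist: bool = True) -> Request:
    """Collect all problems with the inputs and raise once, or return a Request."""
    errors = []
    if task not in ALLOWED_TASKS:
        errors.append(f"unknown task: {task!r}")
    if not files and task != "generate":
        errors.append("at least one input file is required")
    for name in files:
        if must_exist and not Path(name).is_file():
            errors.append(f"not a file: {name}")
    if errors:
        raise ValueError("; ".join(errors))
    return Request(task=task, files=tuple(Path(name) for name in files))
```

Reporting every problem in one exception, rather than stopping at the first, tends to be friendlier for batch runs where several arguments may be wrong at once.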

Compliance

Info: GDPR. The extension processes user-provided documents and media. While it sends this data to the Gemini API for processing, there's no explicit mention of personal data sanitization before sending, though the Gemini API likely has its own privacy measures.
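For text inputs, a sanitization pass before upload might look like the sketch below. It is illustrative only: the `redact` helper and its two patterns are assumptions, they cover only simple email addresses and phone-like digit runs, and they do nothing for audio, image, or video content. It should not be read as sufficient for GDPR compliance.

```python
import re

# Deliberately simple patterns; real personal data takes many more forms.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

A call such as `redact(extracted_text)` would run on text extracted locally, before any request is sent to the API.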

Installation

npx skills add samhvw8/dot-claude

Runs the Vercel skills CLI (skills.sh) via npx; requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes that the repo follows the agentskills.io format.

5 months ago

samhvw8

10 stars

MIT

Updated 6 days ago


Similar Extensions

Transcription Automation

Contendeo

ElevenLabs Speech-to-Text

ASR (Speech to Text) Skill

FFmpeg for Video Production