
AI Multimodal Processing Skill

Skill Verified
95

Multimodal AI processing via the Google Gemini API (2M-token context).

  • Capabilities: audio (transcription, summarization, music analysis; 9.5 hr max), images (captioning, OCR, object detection, segmentation, visual Q&A), video (scene detection, YouTube URLs, temporal analysis; 6 hr max), documents (PDF extraction, tables, forms, charts), image generation (text-to-image, editing).
  • Actions: transcribe, analyze, extract, caption, detect, segment, generate from media.
  • Keywords: Gemini API, audio transcription, image captioning, OCR, object detection, video analysis, PDF extraction, text-to-image, multimodal, speech recognition, visual Q&A, scene detection, YouTube transcription, table extraction, form processing, image generation, Imagen.
  • Use when: transcribing audio/video, analyzing images/screenshots, extracting data from PDFs, processing YouTube videos, generating images from text, implementing multimodal AI features.

AI Summary

This skill provides a unified command-line interface for interacting with the Google Gemini API, enabling processing of audio, images, videos, and documents, as well as image generation. It includes Python scripts for batch processing, media optimization, and document conversion, with clear instructions for API key setup and usage.

Versioning

  • warning: Release Management. The SKILL.md frontmatter has 'Manifest Version: n/a', and no other versioning signal (such as a CHANGELOG or GitHub releases) is apparent. The install instructions do not reference a specific version, so installs potentially default to 'main'.

Code Execution

  • info: Validation. The scripts handle command-line arguments and file paths with basic argument parsing, but explicit schema validation libraries (such as Zod or Pydantic) are not visibly used for all inputs.
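
As a rough illustration of what explicit input validation could look like using only the Python standard library, here is a minimal sketch. It is not taken from the skill's scripts: the names `MediaRequest` and `validate_request` are hypothetical, and only the action names and the 9.5 hr / 6 hr duration ceilings come from the skill description above.

```python
from dataclasses import dataclass
from pathlib import Path

# Action names as listed in the skill description.
ALLOWED_ACTIONS = {
    "transcribe", "analyze", "extract", "caption",
    "detect", "segment", "generate",
}

# Modality names are an assumption made for this sketch.
ALLOWED_MODALITIES = {"audio", "video", "image", "document"}

# Duration ceilings quoted in the skill description, in hours.
MAX_HOURS = {"audio": 9.5, "video": 6.0}


@dataclass(frozen=True)
class MediaRequest:
    """Hypothetical container for one validated request."""
    action: str
    path: Path
    modality: str
    duration_hours: float = 0.0


def validate_request(action, path, modality, duration_hours=0.0):
    """Return a MediaRequest, or raise ValueError on the first bad field."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if modality not in ALLOWED_MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    if duration_hours < 0:
        raise ValueError("duration_hours must be non-negative")
    limit = MAX_HOURS.get(modality)
    if limit is not None and duration_hours > limit:
        raise ValueError(
            f"{modality} longer than {limit} hours: {duration_hours}"
        )
    return MediaRequest(action, Path(path), modality, float(duration_hours))
```

A call such as `validate_request("transcribe", "talk.mp3", "audio", 1.5)` returns a frozen `MediaRequest`, while an unlisted action or a 7-hour video raises `ValueError` before any data leaves the machine. A schema library such as Pydantic would express the same checks more declaratively.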

Compliance

  • info: GDPR. The extension processes user-provided documents and media and sends this data to the Gemini API for processing. There is no explicit mention of personal data sanitization before sending, though the Gemini API likely has its own privacy measures.
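
To make the sanitization point concrete, the sketch below shows one narrow form of pre-send scrubbing: masking email addresses in text (for example a prompt or already-extracted document text) before it is passed on. The function name `redact_emails` and the placeholder are hypothetical, the pattern is a simple approximation rather than a full address grammar, and nothing here is part of the skill itself. It does not address personal data inside images, audio, or video.

```python
import re

# Approximate email pattern; deliberately simple, not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace every substring that looks like an email address."""
    return EMAIL_RE.sub(placeholder, text)
```

Text without a matching address passes through unchanged, so the function could be applied to every outgoing text field.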

Installation

npx skills add samhvw8/dot-claude

Runs the Vercel skills CLI (skills.sh) via npx. Requires Node.js locally and at least one installed skills-compatible agent (Claude Code, Cursor, Codex, …). Assumes the repo follows the agentskills.io format.

5 months ago
10 stars
MIT
Updated 6 days ago

© 2025 SkillRepo · Find the right skill, skip the noise.