
Conversation

@konard
Contributor

@konard konard commented Oct 30, 2025

Summary

This PR implements issue #13 by providing a comprehensive OpenAI-compatible API specification for media generation services including images, audio (speech and transcription), music, and video.

What's Included

📄 Files Added

  1. media-api-spec.yaml (37 KB) - Complete OpenAPI 3.1 specification

    • 9 API endpoints
    • 28 schema definitions
    • Full request/response examples
    • Authentication and security schemes
  2. media-api-spec.json (48 KB) - JSON version of the specification

    • For tools that prefer JSON format
    • Identical content to YAML version
  3. MEDIA_API.md (22 KB) - Comprehensive documentation

    • Complete API reference
    • Implementation guide for API Gateway
    • Provider integration patterns
    • Database schemas
    • Error handling strategies
    • Code examples in Python, Node.js, cURL
  4. README_MEDIA_API.md (13 KB) - Quick start guide

    • Overview and quick examples
    • Provider comparison tables
    • Integration instructions
    • Testing and validation guide

Features Implemented

🖼️ Image Generation

  • Endpoints: /v1/images/generations, /v1/images/edits, /v1/images/variations
  • Supported Models: DALL-E 2, DALL-E 3, GPT-Image-1, Midjourney, Flux, Stable Diffusion
  • Features:
    • Multiple images per request (1-10)
    • Progressive streaming support
    • Transparent backgrounds (PNG/WebP)
    • Quality control (low/medium/high/auto)
    • Multiple resolutions (up to 4096x4096)
    • Style control (vivid vs natural)

🎵 Audio Services

  • Endpoints: /v1/audio/speech, /v1/audio/transcriptions, /v1/audio/translations
  • Text-to-Speech:
    • 11 voices (alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse)
    • Multiple formats (MP3, Opus, AAC, FLAC, WAV, PCM)
    • Speed control (0.25x to 4.0x)
    • Streaming support
  • Speech-to-Text:
    • High-accuracy transcription
    • Speaker diarization (identify different speakers)
    • Multiple response formats (JSON, verbose JSON, SRT, VTT)
    • Word and segment timestamps
    • Streaming transcription
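
As an illustration, a minimal text-to-speech call through the OpenAI Python SDK might look like the sketch below. The base URL and API key are placeholders, and the output path is arbitrary; model, voice, format, and speed values follow the lists above.

from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.example.com/v1"  # placeholder gateway URL
)

# Request MP3 speech at normal speed using one of the 11 supported voices
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello from the media generation API.",
    response_format="mp3",
    speed=1.0,
) as response:
    response.stream_to_file("speech.mp3")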

🎼 Music Generation

  • Endpoints: /v1/music/generations, /v1/music/generations/{task_id}
  • Features:
    • Text-to-music generation
    • Suno AI v3 and v3.5 models
    • Style and genre control
    • Instrumental or vocal options
    • Async task-based processing with status polling
    • Duration control (10-300 seconds)
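
A hedged sketch of submitting an async music generation task with the requests library follows; the exact response fields (task_id, status) are assumptions based on the task-based flow described above. Polling the task is shown under Async Task Management below.

import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example.com/v1"  # placeholder gateway URL

# Submit an async music generation task; the API is expected to return a task to poll
resp = requests.post(
    f"{BASE_URL}/music/generations",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "Upbeat electronic dance music", "duration": 120},
)
resp.raise_for_status()
task = resp.json()
print(task.get("task_id"), task.get("status"))  # field names assumed from the spec summary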

🎬 Video Generation (Planned)

  • Endpoint: /v1/video/generations
  • Features:
    • Text-to-video generation
    • Image-to-video animation
    • Multiple resolution options (up to 1920x1080)
    • FPS control (24/30/60)
    • Async task-based processing

Technical Highlights

OpenAI Compatibility

  • 100% compatible with the official OpenAI SDKs (Python, Node.js, etc.)
  • Simply change the base_url to point at this API
  • Follows OpenAI's request/response format exactly

Provider Abstraction

  • Unified interface for multiple AI providers
  • Easy provider switching without code changes
  • Automatic response normalization

Streaming Support

  • Server-Sent Events (SSE) for real-time updates
  • Progressive image generation
  • Incremental audio transcription
  • Status updates for async tasks
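
For clients that do not use an SDK, SSE events can be consumed directly. The sketch below parses the standard "data:" lines of an event stream; the [DONE] sentinel and the JSON payload shape are assumptions modeled on OpenAI's streaming convention.

import json
import requests

def stream_sse(url, api_key, payload):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Accept": "text/event-stream",
    }
    with requests.post(url, json=payload, headers=headers, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # SSE data lines start with "data: "; skip keep-alive blanks and comments
            if line and line.startswith("data: "):
                data = line[len("data: "):]
                if data == "[DONE]":  # end-of-stream sentinel, as used by OpenAI
                    break
                yield json.loads(data)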

Async Task Management

  • Task-based processing for long-running operations
  • Status polling endpoints
  • Progress tracking (0-100%)
  • Error handling with detailed messages
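
A simple polling loop against the task status endpoint could look like this sketch. The endpoint path comes from the music section above; the status and progress field names are assumptions.

import time
import requests

def wait_for_task(task_id, api_key, base_url="https://api.example.com/v1", interval=5):
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        resp = requests.get(f"{base_url}/music/generations/{task_id}", headers=headers)
        resp.raise_for_status()
        task = resp.json()
        # "status" and "progress" are assumed field names for the async task object
        if task.get("status") in ("succeeded", "failed"):
            return task
        print(f"progress: {task.get('progress', 0)}%")
        time.sleep(interval)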

Auto Failover

  • Automatic fallback to alternative providers
  • Configurable provider priority
  • Graceful error handling
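
Conceptually, failover in the gateway can be an ordered walk over configured providers. The sketch below is illustrative only; the provider objects and their generate method are hypothetical.

# Illustrative failover loop: try providers in priority order, fall through on errors
def generate_with_failover(providers, params):
    last_error = None
    for provider in providers:  # providers are ordered by configured priority
        try:
            return provider.generate(params)  # hypothetical provider interface
        except Exception as exc:
            last_error = exc  # record the failure and try the next provider
    raise RuntimeError(f"All providers failed: {last_error}")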

Implementation Guide

The documentation includes a complete implementation guide for the API Gateway:

  1. Provider Abstraction Layer - Abstract provider-specific APIs
  2. Request Routing - Route based on model parameter
  3. Streaming Implementation - Handle SSE streaming
  4. Task Management - Manage async operations
  5. Failover Strategy - Implement provider failover
  6. Database Schemas - Track tasks and usage

Example Provider Integration

class ImageProvider {
  // Dispatch to the backend provider that serves the requested model
  async generateImage(params) {
    switch (params.model) {
      case 'gpt-image-1':
        return this.openaiProvider.generate(params);
      case 'midjourney':
        return this.midjourneyProvider.generate(params);
      case 'flux':
        return this.fluxProvider.generate(params);
      default:
        throw new Error(`Unsupported model: ${params.model}`);
    }
  }
}

Usage Examples

Generate an Image (Python)

from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.example.com/v1"
)

response = client.images.generate(
    prompt="A serene mountain landscape at sunset",
    model="gpt-image-1",
    size="1536x1024",
    quality="high"
)

Transcribe Audio with Speaker Identification

curl https://api.example.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file="@meeting.mp3" \
  -F model="gpt-4o-transcribe-diarize" \
  -F response_format="diarized_json"

Generate Music

curl https://api.example.com/v1/music/generations \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Upbeat electronic dance music",
    "duration": 120
  }'

Validation

The OpenAPI specification has been validated:

  • ✅ Valid OpenAPI 3.1 format
  • ✅ All required fields present
  • ✅ 9 endpoints defined
  • ✅ 28 schemas with full validation
  • ✅ Examples for all endpoints
  • ✅ Error responses documented

Integration with Deep Assistant

This specification is designed to integrate with:

  • api-gateway - provider routing, streaming, and async task management
  • telegram-bot - media generation commands for end users
  • Other Deep Assistant services that consume the OpenAI-compatible API

Provider Comparison

Provider | Type | Features | Status
OpenAI | Images, Audio | DALL-E, GPT-Image-1, Whisper, TTS | ✅ Specified
Midjourney | Images | Artistic generation | ✅ Specified
Flux | Images | Fast generation | ✅ Specified
Stable Diffusion | Images | Open-source | ✅ Specified
Suno AI | Music | Music generation | ✅ Specified
RunwayML | Video | Text-to-video | 📋 Planned
Stability AI | Video | Video generation | 📋 Planned

Testing

You can test the specification with:

# Validate specification
npx @openapitools/openapi-generator-cli validate -i media-api-spec.yaml

# Generate mock server
npx @stoplight/prism-cli mock media-api-spec.yaml

# Generate documentation
npx @redocly/cli build-docs media-api-spec.yaml

Next Steps

  1. Implement in API Gateway - Add provider integrations and routing
  2. Create Client SDKs - Generate SDKs for Python, Node.js, Go, Ruby
  3. Set up Testing - Integration tests with mock providers
  4. Deploy - Roll out to production
  5. Integrate with Telegram Bot - Add media generation commands
  6. Monitor - Set up observability and tracking

Documentation Links

Closes

Fixes #13


🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Adding CLAUDE.md with task information for AI processing.
This file will be removed when the task is complete.

Issue: undefined
@konard konard self-assigned this Oct 30, 2025
This commit implements issue #13 by providing a complete OpenAI-compatible
API specification for media generation including images, audio, music, and video.

Changes:
- Add media-api-spec.yaml: Complete OpenAPI 3.1 specification with 9 endpoints
- Add media-api-spec.json: JSON version of the specification
- Add MEDIA_API.md: Comprehensive documentation with examples and implementation guide
- Add README_MEDIA_API.md: Quick start guide and overview

Features:
- Image generation, editing, and variations (DALL-E, GPT-Image-1, Midjourney, Flux, Stable Diffusion)
- Text-to-speech with 11 voices (TTS-1, GPT-4o-TTS)
- Speech-to-text with diarization (Whisper, GPT-4o-transcribe)
- Audio translation to English
- Music generation with Suno AI (async task-based)
- Video generation specification (planned)

Implementation details:
- Provider abstraction layer for seamless switching
- Streaming support for progressive generation
- Async task management for long-running operations
- Auto failover between providers
- OpenAI SDK compatibility
- Comprehensive error handling
- Usage tracking and monitoring

Documentation includes:
- Complete API reference with request/response examples
- Provider comparison tables
- Implementation guide for API Gateway integration
- Database schemas for task and usage tracking
- Error handling best practices
- Rate limiting strategies
- Code examples in Python, Node.js, and cURL

This specification is ready for implementation in the api-gateway service
and can be integrated with telegram-bot and other Deep Assistant services.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@konard konard changed the title [WIP] Make media generation API for images, music, video and more Add media generation API specification for images, music, video and more Oct 30, 2025
@konard konard marked this pull request as ready for review October 30, 2025 04:27
@konard
Contributor Author

konard commented Oct 30, 2025

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

📎 Log file uploaded as GitHub Gist (396KB)
🔗 View complete solution draft log


The working session has now ended; feel free to review and add any feedback on the solution draft.
