
[FEATURE] Add image input support for Vision-Language models (e.g. Qwen3-VL) #37

@AbdulShahzeb

Description

Summary

Add support for processing image inputs in text/transcript mode, enabling CAAL to answer questions about images, identify objects, read text from photos, and more through VLMs such as Qwen3-VL and LLaVA.

Proposed Solution

1. Image Input Pipeline

  • Add image upload button to chat interface
  • Display uploaded images in conversation history

2. Provider Support

  • Detect VLM-capable models
  • Format image messages for the Ollama API (a sketch of both steps follows the example below):
{
  "role": "user",
  "content": "What's in this image?",
  "images": ["base64_encoded_image_data"]
}
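
A minimal sketch of the provider-side plumbing, assuming the default Ollama endpoint at http://localhost:11434. The capability check relies on newer Ollama builds returning a "capabilities" list from /api/show; the name-based fallback and the model tag in the usage comment are illustrative assumptions, not existing CAAL code:

import base64
import requests

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama endpoint


def model_supports_vision(model: str) -> bool:
    """Best-effort VLM detection via Ollama's /api/show endpoint.

    Newer Ollama builds report a "capabilities" list (e.g. ["completion",
    "vision"]); as a fallback for older servers, guess from the model name.
    """
    resp = requests.post(f"{OLLAMA_URL}/api/show", json={"model": model})
    resp.raise_for_status()
    info = resp.json()
    if "vision" in info.get("capabilities", []):
        return True
    # Name heuristic is an assumption, not an exhaustive list
    return any(tag in model.lower() for tag in ("vl", "llava", "vision"))


def ask_about_image(model: str, prompt: str, image_b64: str) -> str:
    """Send a single-turn chat request with one base64-encoded image."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt, "images": [image_b64]}
        ],
        "stream": False,
    }
    resp = requests.post(f"{OLLAMA_URL}/api/chat", json=payload)
    resp.raise_for_status()
    return resp.json()["message"]["content"]


# Usage (model tag is illustrative; use whatever `ollama list` shows):
#   with open("fridge.jpg", "rb") as f:
#       b64 = base64.b64encode(f.read()).decode("ascii")
#   if model_supports_vision("qwen3-vl"):
#       print(ask_about_image("qwen3-vl", "What's in this image?", b64))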

Use Cases

1. Vision Assistance

  • "What's in this image?" - General image description
  • "Read the text from this receipt" - OCR functionality

2. Smart Home Integration

  • "What's in my fridge?" - Using camera input
  • "Is there a package on my doorstep?" - Using doorbell camera

Additional Context

Performance:

  • Longer inference time for vision models
  • Image preprocessing (resize to 512x512 or 768x768 for faster inference; see the sketch below)
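
A minimal preprocessing sketch using Pillow, assuming images are downscaled and re-encoded as JPEG before base64 encoding; the 768-pixel default and JPEG quality are illustrative values, not settled choices:

import base64
import io

from PIL import Image  # Pillow


def preprocess_image(path: str, max_side: int = 768, quality: int = 85) -> str:
    """Downscale so the longest side is at most max_side and return a
    base64-encoded JPEG string ready for the "images" field above."""
    img = Image.open(path).convert("RGB")  # JPEG has no alpha channel
    img.thumbnail((max_side, max_side), Image.LANCZOS)  # keeps aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("ascii")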

User experience:

  • Display image thumbnails in chat history
  • Visual feedback when image is being processed
  • Error handling for unsupported formats

Compatibility

  • Text-only models continue to work unchanged
  • VLM support is opt-in (ideally auto-detected)
  • If manually enabled, image messages should be ignored by non-VLM providers (a sketch follows this list)
  • Should work with any VL model (similar to LM Studio)
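
One way the "ignored by non-VLM providers" point could be handled is to strip image attachments from outgoing messages when the active model lacks vision support; the helper below is hypothetical, not existing CAAL code:

def strip_images(messages: list[dict], supports_vision: bool) -> list[dict]:
    """Return messages unchanged for VLMs; otherwise drop any "images" field
    so text-only models see exactly the same payload as before."""
    if supports_vision:
        return messages
    return [{k: v for k, v in m.items() if k != "images"} for m in messages]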
