Summary
Add support for processing image inputs in text/transcript mode, enabling CAAL to answer questions about images, identify objects, read text from photos, etc., through vision-language models (VLMs) such as Qwen3-VL and LLaVA.
Proposed Solution
1. Image Input Pipeline
- Add an image upload button to the chat interface
- Display uploaded images in the conversation history
2. Provider Support
- Detect vision-capable models (a detection heuristic is sketched under Compatibility below)
- Format image messages for the Ollama API:

```json
{
  "role": "user",
  "content": "What's in this image?",
  "images": ["base64_encoded_image_data"]
}
```
Use Cases
1. Vision Assistance
- "What's in this image?" - General image description
- "Read the text from this receipt" - OCR functionality
2. Smart Home Integration
- "What's in my fridge?" - Using camera input
- "Is there a package on my doorstep?" - Using doorbell camera
Additional Context
Performance:
- Vision models have longer inference times than text-only models
- Image preprocessing (resize to 512x512 or 768x768 for faster inference; a sketch follows this list)
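One way to do that preprocessing, sketched with Pillow; the size cap and JPEG quality are illustrative defaults, not settled choices:

```python
import base64
import io

from PIL import Image

def prepare_image(path: str, max_side: int = 768) -> str:
    """Downscale so the longest side is at most max_side, then return a
    base64-encoded JPEG suitable for the Ollama `images` field."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # in place; keeps aspect ratio, only shrinks
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```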
User experience:
- Display image thumbnails in chat history
- Visual feedback while an image is being processed
- Error handling for unsupported formats
Compatibility
- Text-only models continue to work unchanged
- VLM support is opt-in (ideally auto-detected)
- If manually enabled, image messages should be ignored by non-VLM providers (see the sketch below)
- Should work with any VL model (similar to LM Studio)
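A possible shape for both the auto-detection and the ignore-images behavior. The "clip" family check is a heuristic assumption about Ollama's /api/show response (multimodal models such as LLaVA typically list "clip" among their families), not a guaranteed contract:

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def is_vision_model(model: str) -> bool:
    # Heuristic: Ollama's /api/show returns model details including "families";
    # multimodal models typically include "clip". Treat any miss as text-only.
    resp = requests.post(f"{OLLAMA_URL}/api/show", json={"model": model}, timeout=10)
    resp.raise_for_status()
    families = resp.json().get("details", {}).get("families") or []
    return "clip" in families

def strip_images(messages: list[dict]) -> list[dict]:
    # Drop the "images" field so non-VLM providers never see image payloads.
    return [{k: v for k, v in m.items() if k != "images"} for m in messages]
```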