Support Vision Language Models

Multimodal models have been gaining traction recently. It would be great if this project supported them.

The underlying llama.cpp already supports [vision language models](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#multimodal), so this shouldn't be too difficult to implement.