2,281 changes: 2,281 additions & 0 deletions Multimodal/Vision/2D_Grounding.ipynb

614 changes: 614 additions & 0 deletions Multimodal/Vision/3D_Grounding.ipynb

567 changes: 567 additions & 0 deletions Multimodal/Vision/Document_Parsing.ipynb

541 changes: 541 additions & 0 deletions Multimodal/Vision/Image_to_Code.ipynb

417 changes: 417 additions & 0 deletions Multimodal/Vision/Long_Document_Understanding.ipynb

1,121 changes: 1,121 additions & 0 deletions Multimodal/Vision/OCR.ipynb

878 changes: 878 additions & 0 deletions Multimodal/Vision/Omni_Recognition.ipynb

66 changes: 66 additions & 0 deletions Multimodal/Vision/README.md
@@ -0,0 +1,66 @@
# Vision with Qwen3-VL

Explore Qwen3-VL's vision-language capabilities using Together AI's API. From OCR to 3D grounding, these notebooks cover a wide range of visual understanding tasks.

## 📚 Notebooks

| Notebook | Description | Open |
| -------- | ----------- | ---- |
| [OCR](OCR.ipynb) | Text extraction, multilingual OCR, and text spotting with bounding boxes | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/OCR.ipynb) |
| [2D Grounding](2D_Grounding.ipynb) | Object detection with 2D bounding boxes, multi-target detection, point grounding | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/2D_Grounding.ipynb) |
| [3D Grounding](3D_Grounding.ipynb) | 3D bounding boxes, camera parameters, depth perception | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/3D_Grounding.ipynb) |
| [Spatial Understanding](Spatial_Understanding.ipynb) | Object relationships, affordances, embodied reasoning | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Spatial_Understanding.ipynb) |
| [Video Understanding](Video_Understanding.ipynb) | Video description, temporal localization, video Q&A | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Video_Understanding.ipynb) |
| [Omni Recognition](Omni_Recognition.ipynb) | Universal recognition for celebrities, anime, food, landmarks | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Omni_Recognition.ipynb) |
| [Document Parsing](Document_Parsing.ipynb) | Convert documents to HTML/Markdown with coordinates | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Document_Parsing.ipynb) |
| [Image to Code](Image_to_Code.ipynb) | Screenshot to HTML, chart to matplotlib code | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Image_to_Code.ipynb) |
| [Long Document Understanding](Long_Document_Understanding.ipynb) | Multi-page PDF analysis and Q&A | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Long_Document_Understanding.ipynb) |

## 🎯 Key Concepts

### Coordinate System
Qwen3-VL uses a **relative coordinate system from 0 to 1000**:
- `bbox_2d`: `[x1, y1, x2, y2]` - top-left and bottom-right corners
- `point_2d`: `[x, y]` - point coordinates
- `bbox_3d`: `[x, y, z, x_size, y_size, z_size, roll, pitch, yaw]` - 3D box
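Because coordinates are on a 0–1000 scale regardless of image size, model outputs must be rescaled to pixel space before drawing boxes. A minimal sketch (the helper name `rel_to_pixels` is ours, not something the notebooks define):

```python
def rel_to_pixels(bbox_2d, width, height):
    """Convert a Qwen3-VL relative bbox (0-1000 scale) to pixel coordinates."""
    x1, y1, x2, y2 = bbox_2d
    return [
        int(x1 / 1000 * width),
        int(y1 / 1000 * height),
        int(x2 / 1000 * width),
        int(y2 / 1000 * height),
    ]

# A model output of [250, 100, 750, 900] on a 1920x1080 image:
print(rel_to_pixels([250, 100, 750, 900], 1920, 1080))  # -> [480, 108, 1440, 972]
```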

### Token Calculation
Together AI calculates image tokens as:
```
T = min(2, max(H // 560, 1)) * min(2, max(W // 560, 1)) * 1601
```
- Each image costs roughly 1,601 to 6,404 tokens, depending on resolution
- Budget accordingly for multi-image and multi-page inputs
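The same formula in Python, useful for estimating cost before sending a request (a sketch that simply mirrors the expression above):

```python
def image_tokens(height, width):
    # Each dimension contributes 1 or 2 tiles of 560px; each tile costs 1601 tokens.
    return min(2, max(height // 560, 1)) * min(2, max(width // 560, 1)) * 1601

print(image_tokens(512, 512))    # small image  -> 1601 (1x1 tiles)
print(image_tokens(1120, 1120))  # large image  -> 6404 (2x2 tiles)
```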

## 🚀 Quick Start

```python
import together

client = together.Together()

response = client.chat.completions.create(
model="Qwen/Qwen3-VL-32B-Instruct",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
{"type": "text", "text": "What's in this image?"},
],
}],
)
print(response.choices[0].message.content)
```
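For local files you can inline the image bytes as a base64 data URL, which is the same pattern the bundled `utils/together_client.py` uses. A minimal sketch (the `to_data_url` helper is hypothetical; the message is then passed to `client.chat.completions.create` exactly as above):

```python
import base64

def to_data_url(image_bytes, mime="image/jpeg"):
    # Hypothetical helper: wrap raw image bytes as an OpenAI-style data URL.
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# With a real file: to_data_url(open("photo.jpg", "rb").read())
content = [
    {"type": "image_url", "image_url": {"url": to_data_url(b"\x89PNG...", "image/png")}},
    {"type": "text", "text": "What's in this image?"},
]
```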

## 📋 Prerequisites

- Together AI API key ([get one here](https://api.together.xyz/settings/api-keys))
- Python 3.8+
- Additional dependencies per notebook (see each notebook for details)

## 📖 Resources

- [Together AI Documentation](https://docs.together.ai)
- [Together AI Vision Guide](https://docs.together.ai/docs/vision)
- [Qwen3-VL Model Card](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)
466 changes: 466 additions & 0 deletions Multimodal/Vision/Spatial_Understanding.ipynb

446 changes: 446 additions & 0 deletions Multimodal/Vision/Video_Understanding.ipynb

26 changes: 26 additions & 0 deletions Multimodal/Vision/utils/__init__.py
@@ -0,0 +1,26 @@
"""Together AI utilities for Qwen3-VL cookbooks."""

from .together_client import (
get_client,
encode_image,
pil_to_base64,
inference_with_image,
inference_with_images,
inference_with_video,
inference_with_system_prompt,
TOGETHER_MODEL,
TOGETHER_BASE_URL,
)

__all__ = [
"get_client",
"encode_image",
"pil_to_base64",
"inference_with_image",
"inference_with_images",
"inference_with_video",
"inference_with_system_prompt",
"TOGETHER_MODEL",
"TOGETHER_BASE_URL",
]

273 changes: 273 additions & 0 deletions Multimodal/Vision/utils/together_client.py
@@ -0,0 +1,273 @@
"""
Together AI Client Utilities for Qwen3-VL

This module provides shared utilities for calling Together AI's API
with the Qwen3-VL vision-language model.

Usage:
export TOGETHER_API_KEY=your_key_here

from utils.together_client import inference_with_image, inference_with_images
"""

import os
import base64
import openai
from PIL import Image
from io import BytesIO

# Together AI configuration
TOGETHER_API_KEY = os.environ.get("TOGETHER_API_KEY")
TOGETHER_BASE_URL = "https://api.together.xyz/v1"
TOGETHER_MODEL = "Qwen/Qwen3-VL-32B-Instruct"


def get_client():
"""
Get OpenAI client configured for Together AI.

Returns:
openai.OpenAI: Configured client instance
"""
api_key = os.environ.get("TOGETHER_API_KEY")
if not api_key:
raise ValueError(
"TOGETHER_API_KEY environment variable not set. "
"Get your API key from https://api.together.xyz/settings/api-keys"
)
return openai.OpenAI(
api_key=api_key,
base_url=TOGETHER_BASE_URL,
)


def encode_image(image_path):
"""
Encode a local image file to base64.

Args:
image_path: Path to the image file

Returns:
str: Base64 encoded image string
"""
with open(image_path, "rb") as f:
return base64.b64encode(f.read()).decode("utf-8")


def pil_to_base64(pil_image, format="PNG"):
"""
Convert PIL Image to base64 string.

Args:
pil_image: PIL Image object
format: Image format (PNG, JPEG, etc.)

Returns:
str: Base64 encoded image string
"""
buffer = BytesIO()
pil_image.save(buffer, format=format)
return base64.b64encode(buffer.getvalue()).decode("utf-8")


def get_mime_type(image_path):
"""Get MIME type from file extension."""
ext = image_path.split(".")[-1].lower()
mime_map = {
"jpg": "jpeg",
"jpeg": "jpeg",
"png": "png",
"gif": "gif",
"webp": "webp"
}
return mime_map.get(ext, "jpeg")


def inference_with_image(image_path_or_url, prompt, max_tokens=4096, temperature=None):
"""
Run inference with a single image.

Args:
image_path_or_url: Local path or URL to image
prompt: Text prompt
max_tokens: Maximum tokens to generate
temperature: Sampling temperature (optional)

Returns:
str: Model response text
"""
client = get_client()

# Handle local file vs URL
if image_path_or_url.startswith(("http://", "https://")):
image_content = {
"type": "image_url",
"image_url": {"url": image_path_or_url}
}
else:
base64_img = encode_image(image_path_or_url)
mime_type = get_mime_type(image_path_or_url)
image_content = {
"type": "image_url",
"image_url": {"url": f"data:image/{mime_type};base64,{base64_img}"}
}

kwargs = {
"model": TOGETHER_MODEL,
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": prompt},
image_content
]
}],
"max_tokens": max_tokens
}

if temperature is not None:
kwargs["temperature"] = temperature

response = client.chat.completions.create(**kwargs)
return response.choices[0].message.content


def inference_with_images(images, prompt, max_tokens=250_000, temperature=None):
"""
Run inference with multiple images (e.g., for PDF pages).

Args:
images: List of PIL Images, file paths, or URLs
prompt: Text prompt
max_tokens: Maximum tokens to generate
temperature: Sampling temperature (optional)

Returns:
str: Model response text
"""
client = get_client()

content = []
for img in images:
if isinstance(img, str):
if img.startswith(("http://", "https://")):
content.append({
"type": "image_url",
"image_url": {"url": img}
})
else:
base64_img = encode_image(img)
mime_type = get_mime_type(img)
content.append({
"type": "image_url",
"image_url": {"url": f"data:image/{mime_type};base64,{base64_img}"}
})
else: # PIL Image
base64_img = pil_to_base64(img, format="PNG")
content.append({
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{base64_img}"}
})

content.append({"type": "text", "text": prompt})

kwargs = {
"model": TOGETHER_MODEL,
"messages": [{"role": "user", "content": content}],
"max_tokens": max_tokens
}

if temperature is not None:
kwargs["temperature"] = temperature

response = client.chat.completions.create(**kwargs)
return response.choices[0].message.content


def inference_with_video(video_url, prompt, max_tokens=4096, temperature=None):
"""
Run inference with a video URL.

Note: Together AI only supports video URLs, not local files or frame lists.

Args:
video_url: URL to video file (must be publicly accessible)
prompt: Text prompt
max_tokens: Maximum tokens to generate
temperature: Sampling temperature (optional)

Returns:
str: Model response text
"""
client = get_client()

kwargs = {
"model": TOGETHER_MODEL,
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{"type": "video_url", "video_url": {"url": video_url}}
]
}],
"max_tokens": max_tokens
}

if temperature is not None:
kwargs["temperature"] = temperature

response = client.chat.completions.create(**kwargs)
return response.choices[0].message.content


def inference_with_system_prompt(image_path_or_url, prompt, system_prompt, max_tokens=4096, temperature=None):
"""
Run inference with a system prompt.

Args:
image_path_or_url: Local path or URL to image
prompt: User text prompt
system_prompt: System prompt
max_tokens: Maximum tokens to generate
temperature: Sampling temperature (optional)

Returns:
str: Model response text
"""
client = get_client()

# Handle local file vs URL
if image_path_or_url.startswith(("http://", "https://")):
image_content = {
"type": "image_url",
"image_url": {"url": image_path_or_url}
}
else:
base64_img = encode_image(image_path_or_url)
mime_type = get_mime_type(image_path_or_url)
image_content = {
"type": "image_url",
"image_url": {"url": f"data:image/{mime_type};base64,{base64_img}"}
}

kwargs = {
"model": TOGETHER_MODEL,
"messages": [
{"role": "system", "content": system_prompt},
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
image_content
]
}
],
"max_tokens": max_tokens
}

if temperature is not None:
kwargs["temperature"] = temperature

response = client.chat.completions.create(**kwargs)
return response.choices[0].message.content

Binary file added Multimodal/assets/gettyimages-476996143.jpg
Binary file added Multimodal/assets/ocr/ocr_example1.jpg
Binary file added Multimodal/assets/ocr/ocr_example2.jpg
Binary file added Multimodal/assets/ocr/ocr_example3.jpg
Binary file added Multimodal/assets/ocr/ocr_example4.jpg
Binary file added Multimodal/assets/ocr/ocr_example5.jpg
Binary file added Multimodal/assets/ocr/ocr_example6.jpg
Binary file added Multimodal/assets/screen_to_code.png
26 changes: 26 additions & 0 deletions Multimodal/assets/spatial_understanding/cam_infos.json
@@ -0,0 +1,26 @@
{
"autonomous_driving.jpg": {
"fx": 1266.417203046554,
"fy": 1266.417203046554,
"cx": 816.2670197447984,
"cy": 491.50706579294757
},
"office.jpg": {
"fx": 1470.4238891601562,
"fy": 1470.4237747192383,
"cx": 715.1304817199707,
"cy": 937.8031539916992
},
"lounge.jpg": {
"fx": 529.5,
"fy": 529.5,
"cx": 365.0,
"cy": 265.0
},
"manipulation.jpg": {
"fx": 866.27,
"fy": 866.27,
"cx": 512.0,
"cy": 384.0
}
}
Binary file added Multimodal/assets/sphx_glr_barchart_001.webp
Binary file added Multimodal/assets/swedish-smorgasbord-how-to.jpg