2,281 changes: 2,281 additions & 0 deletions Multimodal/Vision/2D_Grounding.ipynb

614 changes: 614 additions & 0 deletions Multimodal/Vision/3D_Grounding.ipynb

567 changes: 567 additions & 0 deletions Multimodal/Vision/Document_Parsing.ipynb

541 changes: 541 additions & 0 deletions Multimodal/Vision/Image_to_Code.ipynb

417 changes: 417 additions & 0 deletions Multimodal/Vision/Long_Document_Understanding.ipynb

1,121 changes: 1,121 additions & 0 deletions Multimodal/Vision/OCR.ipynb

878 changes: 878 additions & 0 deletions Multimodal/Vision/Omni_Recognition.ipynb

66 changes: 66 additions & 0 deletions Multimodal/Vision/README.md
@@ -0,0 +1,66 @@
# Vision with Qwen3-VL

Explore Qwen3-VL's vision-language capabilities using Together AI's API. From OCR to 3D grounding, these notebooks cover a wide range of visual understanding tasks.

## 📚 Notebooks

| Notebook | Description | Open |
| -------- | ----------- | ---- |
| [OCR](OCR.ipynb) | Text extraction, multilingual OCR, and text spotting with bounding boxes | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/OCR.ipynb) |
| [2D Grounding](2D_Grounding.ipynb) | Object detection with 2D bounding boxes, multi-target detection, point grounding | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/2D_Grounding.ipynb) |
| [3D Grounding](3D_Grounding.ipynb) | 3D bounding boxes, camera parameters, depth perception | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/3D_Grounding.ipynb) |
| [Spatial Understanding](Spatial_Understanding.ipynb) | Object relationships, affordances, embodied reasoning | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Spatial_Understanding.ipynb) |
| [Video Understanding](Video_Understanding.ipynb) | Video description, temporal localization, video Q&A | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Video_Understanding.ipynb) |
| [Omni Recognition](Omni_Recognition.ipynb) | Universal recognition for celebrities, anime, food, landmarks | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Omni_Recognition.ipynb) |
| [Document Parsing](Document_Parsing.ipynb) | Convert documents to HTML/Markdown with coordinates | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Document_Parsing.ipynb) |
| [Image to Code](Image_to_Code.ipynb) | Screenshot to HTML, chart to matplotlib code | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Image_to_Code.ipynb) |
| [Long Document Understanding](Long_Document_Understanding.ipynb) | Multi-page PDF analysis and Q&A | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Multimodal/Vision/Long_Document_Understanding.ipynb) |

## 🎯 Key Concepts

### Coordinate System
Qwen3-VL uses a **relative coordinate system from 0 to 1000**:
- `bbox_2d`: `[x1, y1, x2, y2]` - top-left and bottom-right corners
- `point_2d`: `[x, y]` - point coordinates
- `bbox_3d`: `[x, y, z, x_size, y_size, z_size, roll, pitch, yaw]` - 3D box
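Because coordinates are on a 0–1000 scale regardless of image size, model outputs must be rescaled to pixel space before drawing boxes. A minimal sketch (the helper name `rel_to_pixels` is ours, not something the notebooks define):

```python
def rel_to_pixels(bbox_2d, width, height):
    """Convert a Qwen3-VL relative bbox (0-1000 scale) to pixel coordinates."""
    x1, y1, x2, y2 = bbox_2d
    return [
        int(x1 / 1000 * width),
        int(y1 / 1000 * height),
        int(x2 / 1000 * width),
        int(y2 / 1000 * height),
    ]

# A model output of [250, 100, 750, 900] on a 1920x1080 image:
print(rel_to_pixels([250, 100, 750, 900], 1920, 1080))  # -> [480, 108, 1440, 972]
```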

### Token Calculation
Together AI calculates image tokens as:
```
T = min(2, max(H // 560, 1)) * min(2, max(W // 560, 1)) * 1601
```
- Each image costs roughly 1,601 to 6,404 tokens, depending on resolution
- Budget accordingly for multi-image and multi-page inputs
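The same formula in Python, useful for estimating cost before sending a request (a sketch that simply mirrors the expression above):

```python
def image_tokens(height, width):
    # Each dimension contributes 1 or 2 tiles of 560px; each tile costs 1601 tokens.
    return min(2, max(height // 560, 1)) * min(2, max(width // 560, 1)) * 1601

print(image_tokens(512, 512))    # small image  -> 1601 (1x1 tiles)
print(image_tokens(1120, 1120))  # large image  -> 6404 (2x2 tiles)
```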

## 🚀 Quick Start

```python
import together

client = together.Together()

response = client.chat.completions.create(
model="Qwen/Qwen3-VL-32B-Instruct",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
{"type": "text", "text": "What's in this image?"},
],
}],
)
print(response.choices[0].message.content)
```
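For local files you can inline the image bytes as a base64 data URL, which is the same pattern the bundled `utils/together_client.py` uses. A minimal sketch (the `to_data_url` helper is hypothetical; the message is then passed to `client.chat.completions.create` exactly as above):

```python
import base64

def to_data_url(image_bytes, mime="image/jpeg"):
    # Hypothetical helper: wrap raw image bytes as an OpenAI-style data URL.
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# With a real file: to_data_url(open("photo.jpg", "rb").read())
content = [
    {"type": "image_url", "image_url": {"url": to_data_url(b"\x89PNG...", "image/png")}},
    {"type": "text", "text": "What's in this image?"},
]
```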

## 📋 Prerequisites

- Together AI API key ([get one here](https://api.together.xyz/settings/api-keys))
- Python 3.8+
- Additional dependencies per notebook (see each notebook for details)

## 📖 Resources

- [Together AI Documentation](https://docs.together.ai)
- [Together AI Vision Guide](https://docs.together.ai/docs/vision)
- [Qwen3-VL Model Card](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)
466 changes: 466 additions & 0 deletions Multimodal/Vision/Spatial_Understanding.ipynb

446 changes: 446 additions & 0 deletions Multimodal/Vision/Video_Understanding.ipynb

26 changes: 26 additions & 0 deletions Multimodal/Vision/utils/__init__.py
@@ -0,0 +1,26 @@
"""Together AI utilities for Qwen3-VL cookbooks."""

from .together_client import (
get_client,
encode_image,
pil_to_base64,
inference_with_image,
inference_with_images,
inference_with_video,
inference_with_system_prompt,
TOGETHER_MODEL,
TOGETHER_BASE_URL,
)

__all__ = [
"get_client",
"encode_image",
"pil_to_base64",
"inference_with_image",
"inference_with_images",
"inference_with_video",
"inference_with_system_prompt",
"TOGETHER_MODEL",
"TOGETHER_BASE_URL",
]

273 changes: 273 additions & 0 deletions Multimodal/Vision/utils/together_client.py
@@ -0,0 +1,273 @@
"""
Together AI Client Utilities for Qwen3-VL

This module provides shared utilities for calling Together AI's API
with the Qwen3-VL vision-language model.

Usage:
export TOGETHER_API_KEY=your_key_here

from utils.together_client import inference_with_image, inference_with_images
"""

import os
import base64
import openai
from PIL import Image
from io import BytesIO

# Together AI configuration
TOGETHER_API_KEY = os.environ.get("TOGETHER_API_KEY")
TOGETHER_BASE_URL = "https://api.together.xyz/v1"
TOGETHER_MODEL = "Qwen/Qwen3-VL-32B-Instruct"


def get_client():
"""
Get OpenAI client configured for Together AI.

Returns:
openai.OpenAI: Configured client instance
"""
api_key = os.environ.get("TOGETHER_API_KEY")
if not api_key:
raise ValueError(
"TOGETHER_API_KEY environment variable not set. "
"Get your API key from https://api.together.xyz/settings/api-keys"
)
return openai.OpenAI(
api_key=api_key,
base_url=TOGETHER_BASE_URL,
)


def encode_image(image_path):
"""
Encode a local image file to base64.

Args:
image_path: Path to the image file

Returns:
str: Base64 encoded image string
"""
with open(image_path, "rb") as f:
return base64.b64encode(f.read()).decode("utf-8")


def pil_to_base64(pil_image, format="PNG"):
"""
Convert PIL Image to base64 string.

Args:
pil_image: PIL Image object
format: Image format (PNG, JPEG, etc.)

Returns:
str: Base64 encoded image string
"""
buffer = BytesIO()
pil_image.save(buffer, format=format)
return base64.b64encode(buffer.getvalue()).decode("utf-8")


def get_mime_type(image_path):
"""Get MIME type from file extension."""
ext = image_path.split(".")[-1].lower()
mime_map = {
"jpg": "jpeg",
"jpeg": "jpeg",
"png": "png",
"gif": "gif",
"webp": "webp"
}
return mime_map.get(ext, "jpeg")


def inference_with_image(image_path_or_url, prompt, max_tokens=4096, temperature=None):
"""
Run inference with a single image.

Args:
image_path_or_url: Local path or URL to image
prompt: Text prompt
max_tokens: Maximum tokens to generate
temperature: Sampling temperature (optional)

Returns:
str: Model response text
"""
client = get_client()

# Handle local file vs URL
if image_path_or_url.startswith(("http://", "https://")):
image_content = {
"type": "image_url",
"image_url": {"url": image_path_or_url}
}
else:
base64_img = encode_image(image_path_or_url)
mime_type = get_mime_type(image_path_or_url)
image_content = {
"type": "image_url",
"image_url": {"url": f"data:image/{mime_type};base64,{base64_img}"}
}

kwargs = {
"model": TOGETHER_MODEL,
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": prompt},
image_content
]
}],
"max_tokens": max_tokens
}

if temperature is not None:
kwargs["temperature"] = temperature

response = client.chat.completions.create(**kwargs)
return response.choices[0].message.content


def inference_with_images(images, prompt, max_tokens=250_000, temperature=None):
"""
Run inference with multiple images (e.g., for PDF pages).

Args:
images: List of PIL Images, file paths, or URLs
prompt: Text prompt
max_tokens: Maximum tokens to generate
temperature: Sampling temperature (optional)

Returns:
str: Model response text
"""
client = get_client()

content = []
for img in images:
if isinstance(img, str):
if img.startswith(("http://", "https://")):
content.append({
"type": "image_url",
"image_url": {"url": img}
})
else:
base64_img = encode_image(img)
mime_type = get_mime_type(img)
content.append({
"type": "image_url",
"image_url": {"url": f"data:image/{mime_type};base64,{base64_img}"}
})
else: # PIL Image
base64_img = pil_to_base64(img, format="PNG")
content.append({
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{base64_img}"}
})

content.append({"type": "text", "text": prompt})

kwargs = {
"model": TOGETHER_MODEL,
"messages": [{"role": "user", "content": content}],
"max_tokens": max_tokens
}

if temperature is not None:
kwargs["temperature"] = temperature

response = client.chat.completions.create(**kwargs)
return response.choices[0].message.content


def inference_with_video(video_url, prompt, max_tokens=4096, temperature=None):
"""
Run inference with a video URL.

Note: Together AI only supports video URLs, not local files or frame lists.

Args:
video_url: URL to video file (must be publicly accessible)
prompt: Text prompt
max_tokens: Maximum tokens to generate
temperature: Sampling temperature (optional)

Returns:
str: Model response text
"""
client = get_client()

kwargs = {
"model": TOGETHER_MODEL,
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{"type": "video_url", "video_url": {"url": video_url}}
]
}],
"max_tokens": max_tokens
}

if temperature is not None:
kwargs["temperature"] = temperature

response = client.chat.completions.create(**kwargs)
return response.choices[0].message.content


def inference_with_system_prompt(image_path_or_url, prompt, system_prompt, max_tokens=4096, temperature=None):
"""
Run inference with a system prompt.

Args:
image_path_or_url: Local path or URL to image
prompt: User text prompt
system_prompt: System prompt
max_tokens: Maximum tokens to generate
temperature: Sampling temperature (optional)

Returns:
str: Model response text
"""
client = get_client()

# Handle local file vs URL
if image_path_or_url.startswith(("http://", "https://")):
image_content = {
"type": "image_url",
"image_url": {"url": image_path_or_url}
}
else:
base64_img = encode_image(image_path_or_url)
mime_type = get_mime_type(image_path_or_url)
image_content = {
"type": "image_url",
"image_url": {"url": f"data:image/{mime_type};base64,{base64_img}"}
}

kwargs = {
"model": TOGETHER_MODEL,
"messages": [
{"role": "system", "content": system_prompt},
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
image_content
]
}
],
"max_tokens": max_tokens
}

if temperature is not None:
kwargs["temperature"] = temperature

response = client.chat.completions.create(**kwargs)
return response.choices[0].message.content

Binary file added Multimodal/assets/gettyimages-476996143.jpg
Binary file added Multimodal/assets/ocr/ocr_example1.jpg
Binary file added Multimodal/assets/ocr/ocr_example2.jpg
Binary file added Multimodal/assets/ocr/ocr_example3.jpg
Binary file added Multimodal/assets/ocr/ocr_example4.jpg
Binary file added Multimodal/assets/ocr/ocr_example5.jpg
Binary file added Multimodal/assets/ocr/ocr_example6.jpg
Binary file added Multimodal/assets/screen_to_code.png
26 changes: 26 additions & 0 deletions Multimodal/assets/spatial_understanding/cam_infos.json
@@ -0,0 +1,26 @@
{
"autonomous_driving.jpg": {
"fx": 1266.417203046554,
"fy": 1266.417203046554,
"cx": 816.2670197447984,
"cy": 491.50706579294757
},
"office.jpg": {
"fx": 1470.4238891601562,
"fy": 1470.4237747192383,
"cx": 715.1304817199707,
"cy": 937.8031539916992
},
"lounge.jpg": {
"fx": 529.5,
"fy": 529.5,
"cx": 365.0,
"cy": 265.0
},
"manipulation.jpg": {
"fx": 866.27,
"fy": 866.27,
"cx": 512.0,
"cy": 384.0
}
}
Binary file added Multimodal/assets/sphx_glr_barchart_001.webp
Binary file added Multimodal/assets/swedish-smorgasbord-how-to.jpg