1 change: 1 addition & 0 deletions challenge/README.md
@@ -10,6 +10,7 @@ The system should be capable of handling challenges such as camera shake, height

---
## 🆕 Updates
- [2025/10/09] The real-world challenge phase is released! Check the onsite_competition section for details.
- We have fixed a possible memory leak inside InternUtopia. Please pull the latest image (v1.2).
- For submission, please make sure the image contains `screen`. Quick check: `$ screen --version`.

60 changes: 60 additions & 0 deletions challenge/onsite_competition/README.md
@@ -0,0 +1,60 @@
# 🧭 IROS On-site Challenge

Welcome to the **IROS Vision-Language Navigation On-site Challenge**!
In this phase, participants’ models will be deployed on **a real robot** to evaluate performance in real-world conditions.

---

## ⚙️ Installation

First, install the `InternNav` package:

```bash
cd /InternNav
pip install -e .
```

## 🚀 Running Your Agent
### 1. Start the Agent Server
Launch your agent server with the desired configuration file:

```bash
python -m internnav.agent.utils.server --config path/to/cfg.py
```

### 2. Test the Agent with Robot Captures
You can test your model locally using previously recorded observations from the robot (stored under `./captures`):

```bash
python sdk/test_agent.py --config path/to/cfg.py
```

### 3. Actual Competition Execution
During the on-site evaluation, the organizers will run:

```bash
python sdk/main.py
```

for each episode, paired with its corresponding natural language instruction.

## 🧩 Data Format
**Action**
```python
action = [{'action': [int], 'ideal_flag': bool}]
```
**Observation**
```python
obs = {
    "rgb": rgb,           # RGB image from the robot
    "depth": depth,       # Depth image (aligned with RGB)
    "instruction": str    # Natural language navigation instruction
}
```
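
For illustration only, here is a minimal sketch of an agent step that consumes an observation in this format and returns an action list in the expected schema. The `predict_action` function and the discrete action ids are assumptions made for this example, not part of the official interface:

```python
from typing import Dict, List

# Hypothetical discrete action ids, used only for this illustration.
STOP, MOVE_FORWARD, TURN_LEFT, TURN_RIGHT = 0, 1, 2, 3

def predict_action(obs: Dict) -> List[Dict]:
    """Toy policy: map one observation to one action in the expected schema."""
    rgb = obs["rgb"]                   # RGB image from the robot
    depth = obs["depth"]               # depth image aligned with RGB
    instruction = obs["instruction"]   # natural language instruction

    # Placeholder decision; a real model would fuse rgb, depth, and the instruction.
    action_id = MOVE_FORWARD if depth is not None else STOP
    return [{"action": [action_id], "ideal_flag": True}]
```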

## 📋 Rules
Please check out the [onsite competition rules](./onsite_competition_rules_en-US.md).


## 🚀 Code Submission
Submit a Docker image with your agent server preconfigured and ready to run. During the competition, the robot will connect to a local server over the network. We’ll share additional details soon.
10 changes: 10 additions & 0 deletions challenge/onsite_competition/captures/rs_meta.json
@@ -0,0 +1,10 @@
{
"timestamp_s": 1759218399.0439963,
"paths": {
"rgb": "./captures/rs_rgb.jpg",
"depth_mm": "./captures/rs_depth_mm.png",
"depth_vis": "./captures/rs_depth_vis.png"
},
"intrinsics": {},
"notes": "depth_mm.png 是以毫米存储的 16-bit PNG;depth_vis.png 仅用于可视化。"
}
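
For reference, a saved capture like the one above can be loaded back into the observation format used by the agent. This is a minimal sketch assuming only what the metadata states: `rs_depth_mm.png` is a 16-bit PNG in millimeters and `rs_rgb.jpg` is the matching RGB frame.

```python
import json

import cv2
import numpy as np

def load_capture(meta_path: str = "./captures/rs_meta.json") -> dict:
    """Rebuild an observation dict from a saved capture, converting depth from mm to meters."""
    with open(meta_path) as f:
        meta = json.load(f)
    rgb = cv2.imread(meta["paths"]["rgb"], cv2.IMREAD_COLOR)                # HxWx3 uint8 (BGR)
    depth_mm = cv2.imread(meta["paths"]["depth_mm"], cv2.IMREAD_UNCHANGED)  # HxW uint16, millimeters
    depth_m = depth_mm.astype(np.float32) / 1000.0                          # meters, matching cam.py
    return {"rgb": rgb, "depth": depth_m, "timestamp_s": meta["timestamp_s"]}
```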
Binary file added challenge/onsite_competition/captures/rs_rgb.jpg
105 changes: 105 additions & 0 deletions challenge/onsite_competition/onsite_competition_rules_en-US.md
@@ -0,0 +1,105 @@

# Nav Track Onsite Competition Rules
English Version | [中文版](./onsite_competition_rules_zh-CN.md)

## 1. Task Description
This track focuses on building a multimodal mobile robot navigation system with language understanding capabilities. Participants must design a perception–decision pipeline that performs the full process of:
- Egocentric visual perception,
- Natural language instruction understanding,
- Historical trajectory modeling,
- Navigation action prediction.

The submitted algorithms will be deployed on a real robot, which must navigate indoors under natural language guidance. The robot should robustly handle camera shake, height changes, and local obstacle avoidance, ensuring safe and reliable vision-language navigation.

Key Challenges:
- Effectively fusing language and visual information to support an integrated perception–decision–control process.
- Operating robustly on a real robotic platform, handling viewpoint shake, height changes, and local obstacle avoidance during navigation.
- Generalizing to unseen indoor environments and novel language instructions for improved robustness and adaptability.

## 2. Competition Environment & Equipment
### 2.1 Competition Venue
A realistic apartment-like environment will be built, consisting of connected rooms (living room, bedroom, kitchen, corridor, bathroom, etc.) with typical household furniture and decorations.

### 2.2 Robot
The competition will use a robot provided by the organizers, equipped with a standardized RGB-D camera and sensor setup; the sensor mounting position, viewing angle, and frame rate will be officially specified before the competition. Detailed specifications and open-source navigation resources will be provided.
- Teams will have an on-site debugging session on October 18.
- Final code must be submitted 1 day before the competition.

## 3. Task Setup
The challenge centers on vision-language fusion for cross-room end-to-end navigation.
The organizers will predefine ~10 natural language navigation instructions, each with a corresponding start and goal position.
- Each instruction must cross at least one room.
- The ground-truth path length will be between 5 and 20 meters.
- The goal location must be precise, unambiguous, and clearly defined.

## 4. Competition Rules
### 4.1 Pre-competition Preparation
- Teams must package the competition image in advance according to the GitHub documentation.
- A standardized debugging time and environment will be provided onsite. Teams may use model weights that differ from those used in the online stage and make environment-specific configuration adjustments, but modifications to the core algorithm logic are strictly prohibited.

### 4.2 Procedure
Each team will receive 10 instructions in a fixed sequence, starting from the first one.
For each instruction:
- Move the robot to the given starting position.
- Provide the instruction to the robot and raise your hand to signal the start.
- If execution fails, teams may retry (up to 3 attempts) or skip the instruction.
- After 3 failed attempts (timeout, collision, human intervention, etc.), the instruction must be skipped.
- Skipped instructions cannot be retried later.
- Before each attempt, the algorithm state must be reset and previous observations cleared.

### 4.3 Time Limit
Each instruction has a maximum runtime of 6 minutes.
The total maximum time per team is 55 minutes, including:
- Moving to start points,
- Discussion about retry/skip decisions,
- Executing instructions.
If time runs out mid-instruction, the robot’s position at timeout will be used for scoring, and remaining instructions will be considered skipped.

### 4.4 Fair Play
Participants must not seek unfair advantages. Restrictions include:
- No pre-mapping of the environment before the competition.
- No code changes during the run except for:
- Modifying input instructions,
- Fixing fatal runtime errors (crashes).
- The robot must be teleoperated only to reach the starting position (confirmed by referees).
- No human intervention is allowed during navigation.
- The submitted runtime container/image must be frozen and submitted before the event; no internet updates are allowed during competition.

### 4.5 Refereeing
- Each match is monitored by two referees remotely via the venue surveillance system.
- Referees remain outside the arena and observe in real time via cameras.
- All matches are recorded and live-streamed.
- Robot execution is controlled from a centralized control console by the organizers.
- Referees have remote emergency stop (E-Stop) authority and will intervene if unsafe or unfair behavior is detected.

## 5. Scoring System
### 5.1 Onsite Scoring
Each team starts with 0 points. Multiple attempts per instruction are allowed, but only the highest score per instruction counts.

**Scoring Rules**:
Successfully completing one instruction will add 10 points to the total score, and the completion time for that instruction will be recorded.

| Action | Score Impact |
|:--|:--|
| Successfully reach goal | +10 points |
| Minor scrape with obstacle | –2 points per occurrence (the instruction's score cannot go below 0) |
| Collision with obstacle | 0 points, navigation terminated for this instruction |

If the robot shows a pattern of repeated collisions, the referee has the right to terminate its current action.
A single contact at the end of a forward movement while approaching an obstacle may be judged a minor scrape; the severity of the impact is determined by the on-site referee.

**Success Condition**:
The goal is defined as a circular area with a 2 m radius that does not extend through walls. The run is considered successful if the robot stops inside this area.
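
As a concrete illustration of these rules (not an official scoring script), the score of one instruction can be computed roughly as follows, where each attempt records whether the robot stopped in the goal area, the number of minor scrapes, and whether a hard collision ended the attempt:

```python
def attempt_score(reached_goal: bool, scrapes: int, collision: bool) -> int:
    """Score of a single attempt under the rules above (illustrative)."""
    if collision:              # hard collision: attempt terminated, no points
        return 0
    if not reached_goal:       # did not stop inside the 2 m goal area
        return 0
    return max(0, 10 - 2 * scrapes)  # +10 for success, -2 per minor scrape, floored at 0

def instruction_score(attempts) -> int:
    """Only the best attempt counts for each instruction."""
    return max((attempt_score(*a) for a in attempts), default=0)

# Example: first attempt collides, second succeeds with one scrape -> 8 points
print(instruction_score([(False, 0, True), (True, 1, False)]))
```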

**Ranking Rules (Onsite Competition)**:
- Higher total score ranks higher.
- Tie-breaker: lower total completion time ranks higher.

### 5.2 Final Results
Final results combine online phase and onsite phase scores using a rank-based point system:
- Points per Rank: Points = 100 − 5 × (Rank − 1)
- Final Score Calculation: Final Score = (Online Points × 40%) + (Onsite Points × 60%)

If the final scores are the same, onsite points break the tie.
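
For example, a team ranked 3rd online and 1st onsite would receive 90 and 100 rank points, for a final score of 90 × 0.4 + 100 × 0.6 = 96. A minimal sketch of this computation:

```python
def rank_points(rank: int) -> int:
    """Points awarded for a given rank: 100 - 5 * (rank - 1)."""
    return 100 - 5 * (rank - 1)

def final_score(online_rank: int, onsite_rank: int) -> float:
    """Weighted combination of online (40%) and onsite (60%) rank points."""
    return 0.4 * rank_points(online_rank) + 0.6 * rank_points(onsite_rank)

print(final_score(online_rank=3, onsite_rank=1))  # 96.0
```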
83 changes: 83 additions & 0 deletions challenge/onsite_competition/onsite_competition_rules_zh-CN.md
@@ -0,0 +1,83 @@
# Nav 赛道现场比赛规则
中文版 | [English Version](./onsite_competition_rules_en-US.md)

## 1. 任务描述
本赛道任务旨在构建具备语言理解能力的多模态移动机器人导航系统。参赛者需设计感知-决策模型,实现从自我中心视觉感知、语言指令理解、历史轨迹建模到导航动作预测的完整流程。参赛算法将在真实环境中,驱动机器人在语言引导下完成室内导航任务,并需具备应对视角抖动、高度变化及局部避障等挑战的能力,实现稳健、安全的视觉语言导航行为。

主要挑战包括:
- 有效融合语言与视觉信息,支撑感知—决策—控制的一体化流程;
- 在真实机器人平台上运行,稳健应对行走过程中的视角抖动、高度变化和局部避障问题;
- 面对复杂的室内环境与多样化的导航指令,提升模型在未知场景和新指令下的泛化能力。

## 2. 比赛环境与设备
### 2.1 挑战场地布置
本次挑战将会搭建一个真实装修的住宅空间,由几个相连接的房间组成,其中将包含客厅、卧室、厨房、走廊、洗手间等常见房间。比赛场地包含普通家庭中常见的家具与装饰。

### 2.2 机器人
本次挑战将会使用由主办团队统一提供的机器人,配置统一RGB-D摄像头,传感器安装位置、视角、及帧率等将在赛前通过官方统一规定。

参赛队伍将于10月18日到现场调试代码,代码需在比赛前一天提交。

## 3. 任务
### 3.1 任务设定
本次挑战的任务聚焦于机器人视觉-语言信息融合与跨房间端到端导航。主办方将预先设定约 10 条可实际执行的自然语言导航指令,并为每条指令提供对应的起始点和终止点真值坐标。

- 每条指令需至少跨越 1 个房间;
- 真值导航的路径长度在 5–20 米之间;
- 指令的终点需准确、明确且无歧义。

## 4. 比赛
### 4.1 赛前准备
- 参赛队伍需在赛前参考github文档打包好比赛用镜像。
- 比赛现场将提供统一的调试时间和环境,允许使用与线上赛不同的模型权重,允许进行环境适配性配置修改,但禁止修改核心算法逻辑。

### 4.2 比赛过程
每支队伍开始挑战时将会提供十条固定顺序的指令,从第一条指令开始。
- 移动机器人到达该指令的起始位置后,将指令输入给机器人后举手示意作为开始挑战的信号。
- 指令执行过程中若中途失败,可选择重试,也可以选择跳过该指令,顺序执行下一条未执行指令。
- 每条指令可重试3次,若3次失败,则必须跳过该指令。(失败指选手选择重新开始,或其他条件,如超时,冲撞障碍物,人为干预等)
- 已经跳过的指令不能重复执行。
- 每次测试前需重置算法状态并清除所有前序执行过程中获取的观测缓存。

### 4.3 时间限制
每条指令的最长运行时间为5分钟。每支队伍完整的最长比赛时间为55分钟,其中包括指令抽取、前往起始点、关于是否重复尝试还是跳过指令等比赛相关内容的讨论。若到达最长比赛时间且仍在测试中,将以结束时间所在位置为终点对本次执行计分,剩余未测试指令均视为跳过。

### 4.4 公平性
参赛选手不得通过任何不正当手段获取优势,相关限制包括:
- 禁止在比赛前为机器人提前构建地图;
- 比赛过程中,除修改输入指令和修复程序运行卡死的Bug外,不得更改与算法相关的代码;
- 前往起始点的过程由参赛者使用遥控器操控,并由裁判确认机器人已抵达起始位置和朝向方可开始,在执行导航指令期间严禁人工干预机器人;
- 参赛运行镜像需在赛前提交,比赛过程中禁止联网更新。

### 4.5 裁判
每场比赛由主办方两名裁判在场外通过场地监控系统执裁。裁判不进入赛场,全程依靠监控进行实时观测与取证,并保留录像。机器人由主办方在统一控制台进行启动/暂停/终止,裁判拥有远程紧急停止(E-Stop)权限。裁判应全程保持公正,并对不合理行为通过远程方式及时制止与记录处理。赛场画面通过监控信号全程直播。

## 5. 评分细则
### 5.1 线下赛
比赛采用加分与扣分相结合的制度。每支队伍初始分为 0 分。每条指令的多次尝试将独立计分,并仅取该指令的最高得分计入总分。

**加分规则**:
- 成功完成1条指令,总分增加 10 分,并记录该指令的完成时间。(成功条件:在比赛中,每条指令均对应一个目标位置。目标位置定义为半径 2 米的圆形区域(不穿墙),机器人最终停止在该范围即可判定为该条指令导航成功。)

**扣分规则**:
- 运行过程中,若机器人轻微擦碰障碍物,该条指令每次扣除 2 分,最低不低于 0 分;
- 若机器人直接撞击障碍物,本次导航将被强制终止,该条指令不计分。

| Action | Score |
|:--|:--:|
| 抵达目标 | +10 |
| 刮蹭 | -2 |
| 连续碰撞 | 本次不计分 |

有连续撞击趋势裁判有权终止当前机器人行为,如果是单次前进行为末尾抵近障碍物,可算一次剐蹭,由现场裁判判断撞击剧烈程度。

**线下排名规则**:
- 总分高者排名靠前;
- 若总分相同,则以成功完成指令所耗总时间更少的队伍排名更高。

### 5.2 最终成绩
最终结果由线上阶段与线下阶段成绩加权计算,采用基于排名的积分制:
- **积分计算方式**:100 - 5×(排名-1)分
- **最终成绩计算**:最终积分 = 线上积分 × 40% + 线下积分 × 60%

若最终成绩相同,则以线下积分优先。
121 changes: 121 additions & 0 deletions challenge/onsite_competition/sdk/cam.py
@@ -0,0 +1,121 @@
# cam.py: aligned RealSense capture returning RGB frames with color-aligned depth (meters)
import time
from typing import Dict, Optional, Tuple

import cv2
import numpy as np
import pyrealsense2 as rs
from save_obs import save_obs


class AlignedRealSense:
def __init__(
self,
serial_no: Optional[str] = None,
color_res: Tuple[int, int, int] = (640, 480, 30), # (w,h,fps)
depth_res: Tuple[int, int, int] = (640, 480, 30),
warmup_frames: int = 15,
):
self.serial_no = serial_no
self.color_res = color_res
self.depth_res = depth_res
self.warmup_frames = warmup_frames

self.pipeline: Optional[rs.pipeline] = None
self.align: Optional[rs.align] = None
self.depth_scale: Optional[float] = None
self.started = False

def start(self):
if self.started:
return
self.pipeline = rs.pipeline()
cfg = rs.config()
if self.serial_no:
cfg.enable_device(self.serial_no)

cw, ch, cfps = self.color_res
dw, dh, dfps = self.depth_res

# open stream for color and depth
cfg.enable_stream(rs.stream.color, cw, ch, rs.format.bgr8, cfps)
cfg.enable_stream(rs.stream.depth, dw, dh, rs.format.z16, dfps)

profile = self.pipeline.start(cfg)

        # Depth scale (converts raw z16 depth units to meters)
depth_sensor = profile.get_device().first_depth_sensor()
self.depth_scale = float(depth_sensor.get_depth_scale())

# align to color
self.align = rs.align(rs.stream.color)

# warm up
for _ in range(self.warmup_frames):
self.pipeline.wait_for_frames()

        # Sanity check: grab one aligned frame pair and verify the shapes match
frames = self.pipeline.wait_for_frames()
frames = self.align.process(frames)
color = frames.get_color_frame()
depth = frames.get_depth_frame()
        assert color and depth, "warm-up alignment failed"
rgb = np.asanyarray(color.get_data())
depth_raw = np.asanyarray(depth.get_data())
if depth_raw.shape != rgb.shape[:2]:
depth_raw = cv2.resize(depth_raw, (rgb.shape[1], rgb.shape[0]), interpolation=cv2.INTER_NEAREST)
self.started = True

def stop(self):
if self.pipeline:
self.pipeline.stop()
self.pipeline = None
self.started = False

def __enter__(self):
self.start()
return self

def __exit__(self, et, ev, tb):
self.stop()

def get_observation(self, timeout_ms: int = 1000) -> Dict:
"""
Returns:
{
"rgb": uint8[H,W,3] (BGR),
"depth": float32[H,W] (meters),
"timestamp_s": float
}
"""
if not self.started:
self.start()

frames = self.pipeline.wait_for_frames(timeout_ms)
frames = self.align.process(frames)

color = frames.get_color_frame()
depth = frames.get_depth_frame()
if not color or not depth:
            raise RuntimeError("failed to retrieve aligned color/depth frames")

rgb = np.asanyarray(color.get_data()) # HxWx3, uint8 (BGR)
depth_raw = np.asanyarray(depth.get_data()) # HxW, uint16
if depth_raw.shape != rgb.shape[:2]:
            # Fallback: after alignment the shapes should already match; resize defensively if they do not.
depth_raw = cv2.resize(depth_raw, (rgb.shape[1], rgb.shape[0]), interpolation=cv2.INTER_NEAREST)

depth_m = depth_raw.astype(np.float32) * float(self.depth_scale)
ts_ms = color.get_timestamp() or frames.get_timestamp()
ts_s = float(ts_ms) / 1000.0 if ts_ms is not None else time.time()

return {"rgb": rgb, "depth": depth_m, "timestamp_s": ts_s}


if __name__ == "__main__":
with AlignedRealSense(serial_no=None) as cam:
obs = cam.get_observation()
print("RGB:", obs["rgb"].shape, obs["rgb"].dtype)
print("Depth:", obs["depth"].shape, obs["depth"].dtype, "(meters)")
meta = save_obs(obs, outdir="./captures", prefix="rs")
print("Saved:", meta)