diff --git a/challenge/README.md b/challenge/README.md
index e2e0cd8b..8c30c5f9 100644
--- a/challenge/README.md
+++ b/challenge/README.md
@@ -10,6 +10,7 @@ The system should be capable of handling challenges such as camera shake, height
 ---
 ## 🆕 Updates
+- [2025/10/09] The real-world challenge phase is released! Check the onsite_competition directory for details.
 - We have fixed possible memory leak inside InternUtopia. Please pull the latest image v1.2 to use.
 - For submission, please make sure the image contain `screen`. Quick check: `$ screen --version`.
diff --git a/challenge/onsite_competition/README.md b/challenge/onsite_competition/README.md
new file mode 100644
index 00000000..76b91ee7
--- /dev/null
+++ b/challenge/onsite_competition/README.md
@@ -0,0 +1,60 @@
+# 🧭 IROS On-site Challenge
+
+Welcome to the **IROS Vision-Language Navigation On-site Challenge**!
+In this phase, participants’ models will be deployed on **a real robot** to evaluate their performance in real-world conditions.
+
+---
+
+## ⚙️ Installation
+
+First, install the `InternNav` package:
+
+```bash
+cd /InternNav
+pip install -e .
+```
+
+## 🚀 Running Your Agent
+### 1. Start the Agent Server
+Launch your agent server with the desired configuration file:
+
+```bash
+python -m internnav.agent.utils.server --config path/to/cfg.py
+```
+
+### 2. Test the Agent with Robot Captures
+You can test your model locally using previously recorded observations from the robot (stored under `./captures`):
+
+```bash
+python sdk/test_agent.py --config path/to/cfg.py
+```
+
+### 3. Actual Competition Execution
+During the on-site evaluation, the organizers will run:
+
+```bash
+python sdk/main.py
+```
+
+for each episode, paired with its corresponding natural language instruction.
+
+## 🧩 Data Format
+**Action**
+```python
+action = [{'action': [int], 'ideal_flag': bool}]
+```
+**Observation**
+```python
+obs = {
+    "rgb": rgb,         # RGB image from the robot
+    "depth": depth,     # Depth image (aligned with RGB)
+    "instruction": str  # Natural language navigation instruction
+}
+```
+
+## 📋 Rules
+Please check out the [onsite competition rules](./onsite_competition_rules_en-US.md).
+
+## 🚀 Code Submission
+Submit a Docker image with your agent server preconfigured and ready to run. During the competition, the robot will connect to a local server over the network. We’ll share additional details soon.
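As a quick end-to-end sanity check of the data format above, the snippet below mirrors what `sdk/test_agent.py` and `sdk/main.py` (both included in this release) do, trimmed to the essentials. It is only a sketch: the config path, the zero-filled images, and the instruction text are placeholders to replace with your own, and it assumes your agent server is already running.

```python
import importlib.util
import sys

import numpy as np

from internnav.agent.utils.client import AgentClient
from internnav.configs.evaluator.default_config import get_config


def load_eval_cfg(config_path, attr_name='eval_cfg'):
    # Same config loader used in sdk/main.py and sdk/test_agent.py.
    spec = importlib.util.spec_from_file_location("eval_config_module", config_path)
    config_module = importlib.util.module_from_spec(spec)
    sys.modules["eval_config_module"] = config_module
    spec.loader.exec_module(config_module)
    return getattr(config_module, attr_name)


cfg = get_config(load_eval_cfg('path/to/cfg.py'))  # placeholder path
agent = AgentClient(cfg.agent)  # the agent server must already be running

# Dummy observation with the same keys the robot provides at competition time.
obs = {
    'rgb': np.zeros((480, 640, 3), dtype=np.uint8),    # uint8 color image
    'depth': np.zeros((480, 640), dtype=np.float32),   # depth in meters
    'instruction': 'go to the kitchen and stop next to the fridge',  # example text
}

# The agent must reply with [{'action': [int], 'ideal_flag': bool}];
# only the first action integer (0-3) is executed on the robot.
action = agent.step([obs])[0]['action'][0]
assert action in [0, 1, 2, 3]
```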
diff --git a/challenge/onsite_competition/captures/rs_meta.json b/challenge/onsite_competition/captures/rs_meta.json
new file mode 100644
index 00000000..56af2237
--- /dev/null
+++ b/challenge/onsite_competition/captures/rs_meta.json
@@ -0,0 +1,10 @@
+{
+  "timestamp_s": 1759218399.0439963,
+  "paths": {
+    "rgb": "./captures/rs_rgb.jpg",
+    "depth_mm": "./captures/rs_depth_mm.png",
+    "depth_vis": "./captures/rs_depth_vis.png"
+  },
+  "intrinsics": {},
+  "notes": "depth_mm.png stores depth in millimeters as a 16-bit PNG; depth_vis.png is for visualization only."
+}
diff --git a/challenge/onsite_competition/captures/rs_rgb.jpg b/challenge/onsite_competition/captures/rs_rgb.jpg
new file mode 100644
index 00000000..14c2cabb
Binary files /dev/null and b/challenge/onsite_competition/captures/rs_rgb.jpg differ
diff --git a/challenge/onsite_competition/captures/sim_depth.npy b/challenge/onsite_competition/captures/sim_depth.npy
new file mode 100644
index 00000000..fa7547fa
Binary files /dev/null and b/challenge/onsite_competition/captures/sim_depth.npy differ
diff --git a/challenge/onsite_competition/captures/sim_rgb.npy b/challenge/onsite_competition/captures/sim_rgb.npy
new file mode 100644
index 00000000..eadc24ac
Binary files /dev/null and b/challenge/onsite_competition/captures/sim_rgb.npy differ
diff --git a/challenge/onsite_competition/onsite_competition_rules_en-US.md b/challenge/onsite_competition/onsite_competition_rules_en-US.md
new file mode 100644
index 00000000..13068ce4
--- /dev/null
+++ b/challenge/onsite_competition/onsite_competition_rules_en-US.md
@@ -0,0 +1,105 @@
+
+# Nav Track Onsite Competition Rules
+English Version | [中文版](./onsite_competition_rules_zh-CN.md)
+
+## 1. Task Description
+This track focuses on building a multimodal mobile robot navigation system with language understanding capabilities. Participants must design a perception–decision pipeline that performs the full process of:
+- Egocentric visual perception,
+- Natural language instruction understanding,
+- Historical trajectory modeling,
+- Navigation action prediction.
+
+The submitted algorithms will be deployed on a real robot, which must navigate indoors under natural language guidance. The robot should robustly handle camera shake, height changes, and local obstacle avoidance, ensuring safe and reliable vision-language navigation.
+
+Key Challenges:
+- Effectively fusing language and visual information to support an integrated perception–decision–control process.
+- Operating robustly on a real robotic platform, handling viewpoint shake, height changes, and local obstacle avoidance during navigation.
+- Generalizing to unseen indoor environments and novel language instructions for improved robustness and adaptability.
+
+## 2. Competition Environment & Equipment
+### 2.1 Competition Venue
+A realistic apartment-like environment will be built, consisting of connected rooms (living room, bedroom, kitchen, corridor, bathroom, etc.) with typical household furniture and decorations.
+
+### 2.2 Robot
+The competition will use a robot provided by the organizers; all teams share the same RGB-D camera and sensor configuration. Detailed specifications and open-source navigation resources will be provided.
+- Teams will have an on-site debugging session on October 18.
+- Final code must be submitted 1 day before the competition.
+
+## 3. Task Setup
+The challenge centers on vision-language fusion for cross-room end-to-end navigation. The organizers will predefine ~10 natural language navigation instructions, each with a corresponding start and goal position.
+- Each instruction must cross at least one room.
+- Ground-truth path length will be between 5–20 meters.
+- The goal location must be precise, unambiguous, and clearly defined.
+
+## 4. Competition Rules
+### 4.1 Pre-competition Preparation
+- Teams must package the competition image in advance according to the GitHub documentation.
+- A standardized debugging time and environment will be provided onsite. Teams may use model weights different from the online stage and make environment-specific configuration adjustments, but modifications to the core algorithm logic are strictly prohibited.
+
+### 4.2 Procedure
+Each team will receive 10 fixed-sequence instructions, starting from the first one.
+For each instruction:
+- Move the robot to the given starting position.
+- Provide the instruction to the robot and raise your hand to signal the start.
+- If execution fails, teams may retry (up to 3 attempts) or skip the instruction.
+- After 3 failed attempts (timeout, collision, human intervention, etc.), the instruction must be skipped.
+- Skipped instructions cannot be retried later.
+- Before each attempt, the algorithm state must be reset and previous observations cleared.
+
+### 4.3 Time Limit
+Each instruction has a maximum runtime of 6 minutes.
+The total maximum time per team is 55 minutes, including:
+- Moving to start points,
+- Discussing retry/skip decisions,
+- Executing instructions.
+If time runs out mid-instruction, the robot’s position at timeout will be used for scoring, and the remaining instructions will be considered skipped.
+
+### 4.4 Fair Play
+Participants must not seek unfair advantages. Restrictions include:
+- No pre-mapping of the environment before the competition.
+- No code changes during the run except for:
+  - Modifying input instructions,
+  - Fixing fatal runtime errors (crashes).
+- The robot may be teleoperated only to reach the starting position (confirmed by referees).
+- No human intervention is allowed during navigation.
+- The runtime container/image must be frozen and submitted before the event; no internet updates are allowed during the competition.
+
+### 4.5 Refereeing
+- Each match is monitored remotely by two referees via the venue surveillance system.
+- Referees remain outside the arena and observe in real time via cameras.
+- All matches are recorded and live-streamed.
+- Robot execution is controlled from a centralized control console by the organizers.
+- Referees have remote emergency stop (E-Stop) authority and will intervene if unsafe or unfair behavior is detected.
+
+## 5. Scoring System
+### 5.1 Onsite Scoring
+Each team starts with 0 points. Multiple attempts per instruction are allowed, but only the highest score per instruction counts.
+
+**Scoring Rules**:
+Successfully completing an instruction adds 10 points to the total score, and the completion time for that instruction is recorded.
+
+| Action | Score Impact |
+|:--|:--|
+| Successfully reach goal | +10 points |
+| Minor scrape with obstacle | –2 points (cannot go below 0 for that instruction) |
+| Collision with obstacle | 0 points, navigation terminated for this instruction |
+
+If the robot shows a pattern of repeated collisions, the referee has the right to terminate its current run.
+If a collision occurs only once, at the end of a forward movement while approaching an obstacle, it may be counted as a minor scrape; the severity of the impact is judged by the on-site referee.
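*Worked example (hypothetical numbers, for illustration only): a team that completes six instructions cleanly and a seventh with two minor scrapes scores 6 × 10 + (10 − 2 × 2) = 66 points; an attempt that ends in a collision scores 0 for that instruction, and only the best attempt per instruction counts.*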
+ +**Success Condition**: + The goal is defined as a 2m-radius circular area (no wall crossing). The run is considered successful if the robot stops inside this area. + +**Ranking Rules (Onsite Competition)**: +- Higher total score ranks higher. +- Tie-breaker: lower total completion time ranks higher. + +## 5.2 Final Results +Final results combine online phase and onsite phase scores using a rank-based point system: +- Points per Rank: +Points = 100 – 5 × (Rank – 1) +- Final Score Calculation: +Final Score = (Online Points × 40%) + (Onsite Points × 60%) + +If the final score the same, onsite points break the tie. diff --git a/challenge/onsite_competition/onsite_competition_rules_zh-CN.md b/challenge/onsite_competition/onsite_competition_rules_zh-CN.md new file mode 100644 index 00000000..2006999a --- /dev/null +++ b/challenge/onsite_competition/onsite_competition_rules_zh-CN.md @@ -0,0 +1,83 @@ +# Nav 赛道现场比赛规则 +中文版 | [English Version](./onsite_competition_rules_en-US.md) + +## 1. 任务描述 +本赛道任务旨在构建具备语言理解能力的多模态移动机器人导航系统。参赛者需设计感知-决策模型,实现从自我中心视觉感知、语言指令理解、历史轨迹建模到导航动作预测的完整流程。参赛算法将在真实环境中,驱动机器人在语言引导下完成室内导航任务,并需具备应对视角抖动、高度变化及局部避障等挑战的能力,实现稳健、安全的视觉语言导航行为。 + +主要挑战包括: +- 有效融合语言与视觉信息,支撑感知—决策—控制的一体化流程; +- 在真实机器人平台上运行,稳健应对行走过程中的视角抖动、高度变化和局部避障问题; +- 面对复杂的室内环境与多样化的导航指令,提升模型在未知场景和新指令下的泛化能力。 + +## 2. 比赛环境与设备 +### 2.1 挑战场地布置 +本次挑战将会搭建一个真实装修的住宅空间,由几个相连接的房间组成,其中将包含客厅、卧室、厨房、走廊、洗手间等常见房间。比赛场地包含普通家庭中常见的家具与装饰。 + +### 2.2 机器人 +本次挑战将会使用由主办团队统一提供的机器人,配置统一RGB-D摄像头,传感器安装位置、视角、及帧率等将在赛前通过官方统一规定。 + +参赛队伍将于10月18日到现场调试代码,代码需在比赛前一天提交。 + +## 3. 任务 +### 3.1 任务设定 +本次挑战的任务聚焦于机器人视觉-语言信息融合与跨房间端到端导航。主办方将预先设定约 10 条可实际执行的自然语言导航指令,并为每条指令提供对应的起始点和终止点真值坐标。 + + - 每条指令需至少跨越 1 个房间; + - 真值导航的路径长度在 5–20 米之间; + - 指令的终点需准确、明确且无歧义。 + +## 4. 比赛 +### 4.1 赛前准备 +- 参赛队伍需在赛前参考github文档打包好比赛用镜像。 +- 比赛现场将提供统一的调试时间和环境,允许使用与线上赛不同的模型权重,允许进行环境适配性配置修改,但禁止修改核心算法逻辑。 + +### 4.2 比赛过程 +每支队伍开始挑战时将会提供十条固定顺序的指令,从第一条指令开始。 + - 移动机器人到达该指令的起始位置后,将指令输入给机器人后举手示意作为开始挑战的信号。 + - 指令执行过程中若中途失败,可选择重试,也可以选择跳过该指令,顺序执行下一条未执行指令。 + - 每条指令可重试3次,若3次失败,则必须跳过该指令。(失败指选手选择重新开始,或其他条件,如超时,冲撞障碍物,人为干预等) + - 已经跳过的指令不能重复执行。 + - 每次测试前需重置算法状态并清除所有前序执行过程中获取的观测缓存。 + +### 4.3 时间限制 +每条指令的最长运行时间为5分钟。每支队伍完整的最长比赛时间为55分钟,其中包括指令抽取、前往起始点、关于是否重复尝试还是跳过指令等比赛相关内容的讨论。若到达最长比赛时间且仍在测试中,将以结束时间所在位置为终点对本次执行计分,剩余未测试指令均视为跳过。 + +### 4.4 公平性 +参赛选手不得通过任何不正当手段获取优势,相关限制包括: + - 禁止在比赛前为机器人提前构建地图; + - 比赛过程中,除修改输入指令和修复程序运行卡死的Bug外,不得更改与算法相关的代码; + - 前往起始点的过程由参赛者使用遥控器操控,并由裁判确认机器人已抵达起始位置和朝向方可开始,在执行导航指令期间严禁人工干预机器人; + - 参赛运行镜像需在赛前提交,比赛过程中禁止联网更新。 + +### 4.5 裁判 +每场比赛由主办方两名裁判在场外通过场地监控系统执裁。裁判不进入赛场,全程依靠监控进行实时观测与取证,并保留录像。机器人由主办方在统一控制台进行启动/暂停/终止,裁判拥有远程紧急停止(E-Stop)权限。裁判应全程保持公正,并对不合理行为通过远程方式及时制止与记录处理。赛场画面通过监控信号全程直播。 + +## 5. 
评分细则 +### 5.1 线下赛 +比赛采用加分与扣分相结合的制度。每支队伍初始分为 0 分。每条指令的多次尝试将独立计分,并仅取该指令的最高得分计入总分。 + +**加分规则**: +- 成功完成1条指令,总分增加 10 分,并记录该指令的完成时间。(成功条件:在比赛中,每条指令均对应一个目标位置。目标位置定义为半径 2 米的圆形区域(不穿墙),机器人最终停止在该范围即可判定为该条指令导航成功。) + +**扣分规则**: +- 运行过程中,若机器人轻微擦碰障碍物,该条指令每次扣除 2 分,最低不低于 0 分; +- 若机器人直接撞击障碍物,本次导航将被强制终止,该条指令不计分。 + +| Action | Score | +|:--|:--:| +| 抵达目标 | +10 | +| 刮蹭 | -2 | +| 连续碰撞 | 本次不计分 | + +有连续撞击趋势裁判有权终止当前机器人行为,如果是单次前进行为末尾抵近障碍物,可算一次剐蹭,由现场裁判判断撞击剧烈程度。 + +**线下排名规则**: +- 总分高者排名靠前; +- 若总分相同,则以成功完成指令所耗总时间更少的队伍排名更高。 + +### 5.2 最终成绩 +最终结果由线上阶段与线下阶段成绩加权计算,采用基于排名的积分制: +- **积分计算方式**:100 - 5×(排名-1)分 +- **最终成绩计算**:最终积分 = 线上积分 × 40% + 线下积分 × 60% + +若最终成绩相同,则以线下积分优先。 diff --git a/challenge/onsite_competition/sdk/cam.py b/challenge/onsite_competition/sdk/cam.py new file mode 100644 index 00000000..9526c23d --- /dev/null +++ b/challenge/onsite_competition/sdk/cam.py @@ -0,0 +1,121 @@ +# aligned_realsense.py +import time +from typing import Dict, Optional, Tuple + +import cv2 +import numpy as np +import pyrealsense2 as rs +from save_obs import save_obs + + +class AlignedRealSense: + def __init__( + self, + serial_no: Optional[str] = None, + color_res: Tuple[int, int, int] = (640, 480, 30), # (w,h,fps) + depth_res: Tuple[int, int, int] = (640, 480, 30), + warmup_frames: int = 15, + ): + self.serial_no = serial_no + self.color_res = color_res + self.depth_res = depth_res + self.warmup_frames = warmup_frames + + self.pipeline: Optional[rs.pipeline] = None + self.align: Optional[rs.align] = None + self.depth_scale: Optional[float] = None + self.started = False + + def start(self): + if self.started: + return + self.pipeline = rs.pipeline() + cfg = rs.config() + if self.serial_no: + cfg.enable_device(self.serial_no) + + cw, ch, cfps = self.color_res + dw, dh, dfps = self.depth_res + + # open stream for color and depth + cfg.enable_stream(rs.stream.color, cw, ch, rs.format.bgr8, cfps) + cfg.enable_stream(rs.stream.depth, dw, dh, rs.format.z16, dfps) + + profile = self.pipeline.start(cfg) + + # 深度缩放(将 z16 转米) + depth_sensor = profile.get_device().first_depth_sensor() + self.depth_scale = float(depth_sensor.get_depth_scale()) + + # align to color + self.align = rs.align(rs.stream.color) + + # warm up + for _ in range(self.warmup_frames): + self.pipeline.wait_for_frames() + + # align check + frames = self.pipeline.wait_for_frames() + frames = self.align.process(frames) + color = frames.get_color_frame() + depth = frames.get_depth_frame() + assert color and depth, "warm up align failed" + rgb = np.asanyarray(color.get_data()) + depth_raw = np.asanyarray(depth.get_data()) + if depth_raw.shape != rgb.shape[:2]: + depth_raw = cv2.resize(depth_raw, (rgb.shape[1], rgb.shape[0]), interpolation=cv2.INTER_NEAREST) + self.started = True + + def stop(self): + if self.pipeline: + self.pipeline.stop() + self.pipeline = None + self.started = False + + def __enter__(self): + self.start() + return self + + def __exit__(self, et, ev, tb): + self.stop() + + def get_observation(self, timeout_ms: int = 1000) -> Dict: + """ + Returns: + { + "rgb": uint8[H,W,3] (BGR), + "depth": float32[H,W] (meters), + "timestamp_s": float + } + """ + if not self.started: + self.start() + + frames = self.pipeline.wait_for_frames(timeout_ms) + frames = self.align.process(frames) + + color = frames.get_color_frame() + depth = frames.get_depth_frame() + if not color or not depth: + raise RuntimeError("can not align color/depth frame") + + rgb = np.asanyarray(color.get_data()) # HxWx3, uint8 (BGR) + depth_raw = np.asanyarray(depth.get_data()) # HxW, uint16 + if 
depth_raw.shape != rgb.shape[:2]: + # Extreme fallback (theoretically should be consistent after alignment). + depth_raw = cv2.resize(depth_raw, (rgb.shape[1], rgb.shape[0]), interpolation=cv2.INTER_NEAREST) + + depth_m = depth_raw.astype(np.float32) * float(self.depth_scale) + ts_ms = color.get_timestamp() or frames.get_timestamp() + ts_s = float(ts_ms) / 1000.0 if ts_ms is not None else time.time() + + return {"rgb": rgb, "depth": depth_m, "timestamp_s": ts_s} + + +if __name__ == "__main__": + with AlignedRealSense(serial_no=None) as cam: + obs = cam.get_observation() + print("RGB:", obs["rgb"].shape, obs["rgb"].dtype) + print("Depth:", obs["depth"].shape, obs["depth"].dtype, "(meters)") + meta = save_obs(obs, outdir="./captures", prefix="rs") + print("Saved:", meta) diff --git a/challenge/onsite_competition/sdk/control.py b/challenge/onsite_competition/sdk/control.py new file mode 100644 index 00000000..5040946d --- /dev/null +++ b/challenge/onsite_competition/sdk/control.py @@ -0,0 +1,141 @@ +#!/usr/bin/env python3 +import math + +import rospy +from geometry_msgs.msg import Twist +from nav_msgs.msg import Odometry +from tf.transformations import euler_from_quaternion + + +class Turn90Degrees: + def __init__(self): + rospy.init_node('turn_90_degrees_node', anonymous=True) + + self.cmd_vel_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=10) + + self.odom_sub = rospy.Subscriber('/ranger_base_node/odom', Odometry, self.odom_callback) + + self.current_yaw = 0.0 + self.start_yaw = None + self.turning = False + self.turn_angle = math.radians(90) # angle + self.angular_speed = -0.2 # dirention and speed + self.rate = rospy.Rate(10) # 10Hz + + def odom_callback(self, msg): + # Get the yaw angle from a quaternion. + orientation = msg.pose.pose.orientation + quaternion = [orientation.x, orientation.y, orientation.z, orientation.w] + _, _, yaw = euler_from_quaternion(quaternion) + + self.current_yaw = yaw + + # initialize the start yaw + if self.start_yaw is None and not self.turning: + self.start_yaw = yaw + rospy.loginfo(f"start yaw: {math.degrees(self.start_yaw):.2f}") + + def execute_turn(self): + if self.start_yaw is None: + rospy.loginfo("wait for init yaw state...") + return False + + if not self.turning: + self.turning = True + rospy.loginfo("start to turn") + + # compute turned angle + current_angle = self.current_yaw - self.start_yaw + + # normalize the angle + if current_angle > math.pi: + current_angle -= 2 * math.pi + elif current_angle < -math.pi: + current_angle += 2 * math.pi + + # compute remaining angle + remaining_angle = self.turn_angle - abs(current_angle) + + # create Twist msg + twist = Twist() + + # if not reach the goal angle, keep turning + if remaining_angle > 0.05: # allow some diff (about 2.86 degree) + twist.angular.z = self.angular_speed * min(1.0, remaining_angle * 6) + print(f"twist.angular.z {twist.angular.z} remaining_angle {remaining_angle}") + self.cmd_vel_pub.publish(twist) + return False + else: + twist.angular.z = 0.0 + self.cmd_vel_pub.publish(twist) + rospy.loginfo(f"finish turn, final yaw: {math.degrees(self.current_yaw):.2f}") + return True + + def run(self): + while not rospy.is_shutdown(): + if self.execute_turn(): + rospy.loginfo("Task finished") + break + self.rate.sleep() + + +class DiscreteRobotController(Turn90Degrees): + """ + Extends Turn90Degree to allow discrete step-based control. 
+ """ + + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) # initialize parent class + + def stand_still(self, duration=0.5): + twist = Twist() + twist.linear.x = 0.0 + twist.angular.z = 0.0 + self.cmd_vel_pub.publish(twist) + rospy.sleep(duration) # Maintain stand still for a short duration + rospy.loginfo("Stand still command executed.") + + def move_forward(self, distance=0.25): + twist = Twist() + twist.linear.x = 0.2 # Forward speed + twist.angular.z = 0.0 + duration = distance / twist.linear.x # Time to move forward the specified distance + + rospy.loginfo(f"Moving forward for {duration:.2f} seconds.") + end_time = rospy.Time.now() + rospy.Duration(duration) + + while rospy.Time.now() < end_time and not rospy.is_shutdown(): + self.cmd_vel_pub.publish(twist) + self.rate.sleep() + + self.stand_still() # Stop after moving forward + rospy.loginfo("Move forward command executed.") + + def turn_left(self, angle=15, speed=0.2): + self.turn_angle = math.radians(angle) # update angle + self.angular_speed = speed # Set positive angular speed for left turn + self.start_yaw = None # Reset start yaw to current position + self.turning = False # Reset turning flag + self.run() + self.stand_still() # Stop after moving forward + rospy.loginfo("Turn left command executed.") + + def turn_right(self, angle=15, speed=-0.2): + self.turn_angle = math.radians(angle) # update angle + self.angular_speed = speed # Set positive angular speed for left turn + self.start_yaw = None # Reset start yaw to current position + self.turning = False # Reset turning flag + self.run() + self.stand_still() # Stop after moving forward + rospy.loginfo("Turn right command executed.") + + +if __name__ == '__main__': + try: + control = DiscreteRobotController() + control.turn_left(10) + control.move_forward(0.1) + control.turn_right(10) + + except rospy.ROSInterruptException: + pass diff --git a/challenge/onsite_competition/sdk/main.py b/challenge/onsite_competition/sdk/main.py new file mode 100644 index 00000000..2e9468ef --- /dev/null +++ b/challenge/onsite_competition/sdk/main.py @@ -0,0 +1,58 @@ +import argparse +import importlib.util +import sys + +from real_world_env import RealWorldEnv + +from internnav.agent.utils.client import AgentClient +from internnav.configs.evaluator.default_config import get_config + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", + type=str, + default='scripts/eval/configs/h1_cma_cfg.py', + help='eval config file path, e.g. 
scripts/eval/configs/h1_cma_cfg.py',
+    )
+    parser.add_argument(
+        "--instruction",
+        type=str,
+        help='current instruction to follow',
+    )
+    return parser.parse_args()
+
+
+def load_eval_cfg(config_path, attr_name='eval_cfg'):
+    spec = importlib.util.spec_from_file_location("eval_config_module", config_path)
+    config_module = importlib.util.module_from_spec(spec)
+    sys.modules["eval_config_module"] = config_module
+    spec.loader.exec_module(config_module)
+    return getattr(config_module, attr_name)
+
+
+def main():
+    args = parse_args()
+    print("--- Loading config from:", args.config, "---")
+    evaluator_cfg = load_eval_cfg(args.config, attr_name='eval_cfg')
+    cfg = get_config(evaluator_cfg)
+    print(cfg)
+
+    # initialize user agent
+    agent = AgentClient(cfg.agent)
+
+    # initialize real world env with the current natural language instruction
+    env = RealWorldEnv(args.instruction)
+
+    while True:
+        # print("get observation...")
+        # obs contains {rgb, depth, instruction}
+        obs = env.get_observation()
+
+        # print("agent step...")
+        # action is an integer in [0, 3]; the agent returns [{'action': [int], 'ideal_flag': bool}] (same as the internvla_n1 agent)
+        action = agent.step(obs)[0]['action'][0]  # only take the first env's action integer
+
+        # print("env step...")
+        env.step(action)
diff --git a/challenge/onsite_competition/sdk/real_world_env.py b/challenge/onsite_competition/sdk/real_world_env.py
new file mode 100644
index 00000000..8d13c99c
--- /dev/null
+++ b/challenge/onsite_competition/sdk/real_world_env.py
@@ -0,0 +1,32 @@
+from cam import AlignedRealSense
+from control import DiscreteRobotController
+
+from internnav.env import Env
+
+
+class RealWorldEnv(Env):
+    def __init__(self, instruction=None):
+        # keep the episode's natural language instruction so it can be
+        # attached to every observation (main.py passes it in)
+        self.instruction = instruction
+        self.node = DiscreteRobotController()
+        self.cam = AlignedRealSense()
+
+    def get_observation(self):
+        frame = self.cam.get_observation()
+        frame["instruction"] = self.instruction
+        return frame
+
+    def step(self, action):
+        '''
+        action (int): Discrete action to apply:
+            - 0: no movement (stand still)
+            - 1: move forward
+            - 2: rotate left
+            - 3: rotate right
+        '''
+        if action == 0:
+            self.node.stand_still()
+        elif action == 1:
+            self.node.move_forward()
+        elif action == 2:
+            self.node.turn_left()
+        elif action == 3:
+            self.node.turn_right()
diff --git a/challenge/onsite_competition/sdk/save_obs.py b/challenge/onsite_competition/sdk/save_obs.py
new file mode 100644
index 00000000..adb1d8bd
--- /dev/null
+++ b/challenge/onsite_competition/sdk/save_obs.py
@@ -0,0 +1,152 @@
+# save_obs.py
+import json
+import os
+import time
+from datetime import datetime
+from pathlib import Path
+
+import cv2
+import numpy as np
+
+
+def save_obs(
+    obs: dict,
+    outdir: str = "./captures",
+    prefix: str = None,
+    max_depth_m: float = 3.0,
+    save_rgb: bool = True,
+    save_depth_16bit: bool = True,
+    save_depth_vis: bool = True,
+):
+    """
+    save obs = {"rgb": HxWx3 uint8 (BGR), "depth": HxW float32 (m), "timestamp_s": float, "intrinsics": {...}}
+    """
+    Path(outdir).mkdir(parents=True, exist_ok=True)
+
+    ts = float(obs.get("timestamp_s", time.time()))
+    stamp = datetime.fromtimestamp(ts).strftime("%Y%m%d_%H%M%S_%f")
+    prefix = prefix or f"{stamp}"
+
+    rgb = obs.get("rgb", None)
+    depth_m = obs.get("depth", None)
+
+    # 1) save RGB (BGR, via cv2)
+    rgb_path = None
+    if save_rgb and rgb is not None:
+        rgb_path = os.path.join(outdir, f"{prefix}_rgb.jpg")
+        cv2.imwrite(rgb_path, rgb)
+
+    # 2) save 16-bit depth (unit: mm)
+    depth16_path = None
+    vis_path = None
+    if depth_m is not None and (save_depth_16bit or save_depth_vis):
+        d = np.nan_to_num(depth_m, nan=0.0, posinf=0.0, neginf=0.0)  # clear NaN/Inf
+        if save_depth_16bit:
+            depth_mm = np.clip(np.round(d * 1000.0), 0, 65535).astype(np.uint16)
+            depth16_path = os.path.join(outdir, f"{prefix}_depth_mm.png")
+            cv2.imwrite(depth16_path, depth_mm)
+
+        # 3) save depth vis
+        if save_depth_vis:
+            d_clip = np.clip(d, 0.0, max_depth_m)
+            # Brighten the near field: first normalize to 0–255, then invert and apply a colormap.
+            d_norm = (d_clip / max_depth_m * 255.0).astype(np.uint8)
+            depth_color = cv2.applyColorMap(255 - d_norm, cv2.COLORMAP_JET)
+            vis_path = os.path.join(outdir, f"{prefix}_depth_vis.png")
+            cv2.imwrite(vis_path, depth_color)
+
+    # 4) meta info
+    meta = {
+        "timestamp_s": ts,
+        "paths": {
+            "rgb": rgb_path,
+            "depth_mm": depth16_path,
+            "depth_vis": vis_path,
+        },
+        "intrinsics": obs.get("intrinsics", {}),
+        "notes": "depth_mm.png stores depth in millimeters as a 16-bit PNG; depth_vis.png is for visualization only.",
+    }
+    with open(os.path.join(outdir, f"{prefix}_meta.json"), "w") as f:
+        json.dump(meta, f, indent=2, ensure_ascii=False)
+
+    return meta
+
+
+# load_obs.py
+from glob import glob
+from typing import Dict, List, Optional, Tuple
+
+
+def _resolve(base: str, p: Optional[str]) -> Optional[str]:
+    if not p:
+        return None
+    return p if os.path.isabs(p) else os.path.abspath(os.path.join(base, p))
+
+
+def load_obs_from_meta(meta_path: str, nan_for_zeros: bool = False) -> Dict:
+    """
+    Read a meta.json produced by save_obs() and restore the obs dict:
+    {
+        "rgb": uint8[H,W,3] (BGR),
+        "depth": float32[H,W] (meters) or None,
+        "timestamp_s": float,
+        "intrinsics": dict
+    }
+
+    Args:
+        meta_path: path to a *_meta.json file
+        nan_for_zeros: if True, convert zero-depth pixels to NaN (useful for downstream occlusion handling)
+    """
+    meta_path = os.path.abspath(meta_path)
+    base = os.path.dirname(os.path.dirname(meta_path))
+    with open(meta_path, "r") as f:
+        meta = json.load(f)
+
+    paths = meta.get("paths", {})
+    rgb_path = _resolve(base, paths.get("rgb"))
+    depth_mm_path = _resolve(base, paths.get("depth_mm"))
+
+    # read RGB (saved as BGR; OpenCV reads it back as BGR)
+    rgb = None
+    if rgb_path and os.path.exists(rgb_path):
+        rgb = cv2.imread(rgb_path, cv2.IMREAD_COLOR)  # uint8, HxWx3 (BGR)
+
+    # read depth (16-bit millimeter PNG -> meters)
+    depth = None
+    if depth_mm_path and os.path.exists(depth_mm_path):
+        depth_mm = cv2.imread(depth_mm_path, cv2.IMREAD_UNCHANGED)  # uint16
+        if depth_mm is None:
+            raise RuntimeError(f"Failed to read depth image: {depth_mm_path}")
+        if depth_mm.dtype != np.uint16:
+            raise ValueError(f"Depth image must be uint16 (mm). Got {depth_mm.dtype}")
+        depth = depth_mm.astype(np.float32) / 1000.0  # meters
+        if nan_for_zeros:
+            depth[depth_mm == 0] = np.nan
+
+    # shape consistency check (only if both are present)
+    if rgb is not None and depth is not None and depth.shape != rgb.shape[:2]:
+        raise ValueError(
+            f"Shape mismatch: rgb {rgb.shape[:2]} vs depth {depth.shape}. "
+            "Make sure depth was aligned to color before saving."
+        )
+
+    obs = {
+        "rgb": rgb,
+        "depth": depth,
+        "timestamp_s": float(meta.get("timestamp_s", 0.0)),
+        "intrinsics": meta.get("intrinsics", {}),
+    }
+    return obs
+
+
+def load_all_obs_in_dir(captures_dir: str, pattern: str = "*_meta.json", sort: bool = True) -> List[Tuple[str, Dict]]:
+    """
+    Load every meta file in a directory and return [(meta_path, obs), ...]
+ """ + metas = glob(os.path.join(captures_dir, pattern)) + if sort: + metas.sort() + result = [] + for m in metas: + try: + result.append((m, load_obs_from_meta(m))) + except Exception as e: + print(f"[warn] skip {m}: {e}") + return result diff --git a/challenge/onsite_competition/sdk/test_agent.py b/challenge/onsite_competition/sdk/test_agent.py new file mode 100644 index 00000000..1d6ac59f --- /dev/null +++ b/challenge/onsite_competition/sdk/test_agent.py @@ -0,0 +1,56 @@ +import importlib.util +import sys + +import numpy as np +from save_obs import load_obs_from_meta + +from internnav.agent.utils.client import AgentClient +from internnav.configs.evaluator.default_config import get_config + + +def load_eval_cfg(config_path, attr_name='eval_cfg'): + spec = importlib.util.spec_from_file_location("eval_config_module", config_path) + config_module = importlib.util.module_from_spec(spec) + sys.modules["eval_config_module"] = config_module + spec.loader.exec_module(config_module) + return getattr(config_module, attr_name) + + +# test if agent behave normally with fake observation +def test_agent(cfg_path=None, obs=None): + cfg = load_eval_cfg(cfg_path, attr_name='eval_cfg') + cfg = get_config(cfg) + + agent = AgentClient(cfg.agent) + for _ in range(10): + action = agent.step([obs])[0]['action'][0] # modify your agent to match the output format + print(f"Action taken: {action}") + assert action in [0, 1, 2, 3] + + +if __name__ == "__main__": + # use your own path + # cfg_path = '/root/InternNav/scripts/eval/configs/h1_rdp_cfg.py' + cfg_path = '/root/InternNav/scripts/eval/configs/h1_internvla_n1_cfg.py' + rs_meta_path = 'challenge/onsite_competition/captures/rs_meta.json' + + fake_obs_256 = { + 'rgb': np.zeros((256, 256, 3), dtype=np.uint8), + 'depth': np.zeros((256, 256), dtype=np.float32), + 'instruction': 'go to the red car', + } + fake_obs_640 = load_obs_from_meta(rs_meta_path) + fake_obs_640['instruction'] = 'go to the red car' + print(fake_obs_640['rgb'].shape, fake_obs_640['depth'].shape) + + sim_obs = { + 'rgb': np.load('challenge/onsite_competition/captures/sim_rgb.npy'), + 'depth': np.load('challenge/onsite_competition/captures/sim_depth.npy'), + } + print(sim_obs['rgb'].shape, sim_obs['depth'].shape) # dtype (uint8 and float32) and value range + # TODO: crop to 256,256, test with fake_obs_256 + + # work in progress, baseline model will be updated soon + # test_agent(cfg_path=cfg_path, obs=fake_obs_256) + test_agent(cfg_path=cfg_path, obs=fake_obs_640) + print("All tests passed!") diff --git a/challenge/onsite_competition/sdk/test_robot.py b/challenge/onsite_competition/sdk/test_robot.py new file mode 100644 index 00000000..9069ebee --- /dev/null +++ b/challenge/onsite_competition/sdk/test_robot.py @@ -0,0 +1,13 @@ +from real_world_env import RealWorldEnv +from save_obs import save_obs + +env = RealWorldEnv() + +env.step(3) # 3: rotate right +env.step(2) # 2: rotate left +env.step(1) # 1: move forward +env.step(0) # 0: no movement (stand still) + +obs = env.get_observation() # {rgb: array, depth: array, instruction: str} +meta = save_obs(obs, outdir="./captures", prefix="test") +print("Saved observation metadata:", meta)