3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -14,7 +14,7 @@ repos:
"--exclude=__init__.py",
]
- repo: https://github.com/PyCQA/flake8
rev: 4.0.1
rev: 6.0.0
hooks:
- id: flake8
- repo: https://github.com/PyCQA/isort
@@ -30,6 +30,7 @@ repos:
hooks:
- id: codespell
args: ['--ignore-words-list=ro']
exclude: '\.ipynb$'
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.1.0
hooks:
41 changes: 21 additions & 20 deletions README.md
@@ -33,9 +33,10 @@ The toolbox supports the most comprehensive 6 datasets \& benchmarks and 10+ pop

The toolbox supports InternData-N1, the most advanced high-quality navigation dataset, which includes 3k+ scenes and 830k VLN samples covering diverse embodiments and scenes, as well as InternVLA-N1, the first dual-system navigation foundation model, which delivers leading performance on all benchmarks and zero-shot generalization in the real world.

## 🔥 News
- [2025/10] Add a simple [inference-only demo](scripts/eval/inference_only_demo.ipynb) of InternVLA-N1.
- [2025/10] InternVLA-N1 [technique report](https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLA_N1.pdf) is released. Please check our [homepage](https://internrobotics.github.io/internvla-n1.github.io/).
- [2025/09] Real-world deployment code of InternVLA-N1 is released.
- [2025/07] We are hosting 🏆IROS 2025 Grand Challenge, stay tuned at [official website](https://internrobotics.shlab.org.cn/challenge/2025/).
- [2025/07] InternNav v0.1.1 released.

@@ -144,38 +145,38 @@ Please refer to the [documentation](https://internrobotics.github.io/user_guide/
#### VLN-CE Task
| Model | Dataset/Benchmark | NE | OS | SR | SPL | Download |
| ------ | ----------------- | -- | -- | --------- | -- | --------- |
| `InternVLA-N1 (S2)` | R2R | 4.89 | 60.6 | 55.4 | 52.1 | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-S2) |
| `InternVLA-N1` | R2R | **4.83** | **63.3** | **58.2** | **54.0** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1) |
| `InternVLA-N1 (S2)` | RxR | 6.67 | 56.5 | 48.6 | 42.6 | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-S2) |
| `InternVLA-N1` | RxR | **5.91** | **60.8** | **53.5** | **46.1** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1) |
| `InternVLA-N1-Preview (S2)` | R2R | 5.09 | 60.9 | 53.7 | 49.7 | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview-S2) |
| `InternVLA-N1-Preview` | R2R | **4.76** | **63.4** | **56.7** | **52.6** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview) |
| `InternVLA-N1-Preview (S2)` | RxR | 6.39 | 60.1 | 50.5 | 43.3 | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview-S2) |
| `InternVLA-N1-Preview` | RxR | **5.65** | **63.2** | **53.5** | **45.7** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview) |

#### VLN-PE Task
| Model | Dataset/Benchmark | NE | OS | SR | SPL | Download |
| ------ | ----------------- | -- | -- | -- | --- | --- |
| `Seq2Seq` | Flash | 8.27 | 43.0 | 15.7 | 9.7 | [Model](https://huggingface.co/InternRobotics/VLN-PE) |
| `CMA` | Flash | 7.52 | 45.0 | 24.4 | 18.2 | [Model](https://huggingface.co/InternRobotics/VLN-PE) |
| `RDP` | Flash | 6.98 | 42.5 | 24.9 | 17.5 | [Model](https://huggingface.co/InternRobotics/VLN-PE) |
| `InternVLA-N1-Preview` | Flash | **4.21** | **68.0** | **59.8** | **54.0** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview) |
| `InternVLA-N1` | Flash | **4.13** | **67.6** | **60.4** | **54.9** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1) |
| `Seq2Seq` | Physical | 7.88 | 28.1 | 15.1 | 10.7 | [Model](https://huggingface.co/InternRobotics/VLN-PE) |
| `CMA` | Physical | 7.26 | 31.4 | 22.1 | 18.6 | [Model](https://huggingface.co/InternRobotics/VLN-PE) |
| `RDP` | Physical | 6.72 | 36.9 | 25.2 | 17.7 | [Model](https://huggingface.co/InternRobotics/VLN-PE) |
| `InternVLA-N1-Preview` | Physical | **5.31** | **49.0** | **42.6** | **35.8** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1-Preview) |
| `InternVLA-N1` | Physical | **4.73** | **56.7** | **50.6** | **43.3** | [Model](https://huggingface.co/InternRobotics/InternVLA-N1) |

#### Visual Navigation Task - PointGoal Navigation
| Model | Dataset/Benchmark | SR | SPL | Download |
| ------ | ----------------- | -- | -- | --------- |
| `iPlanner` | ClutteredEnv | 84.8 | 83.6 | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) |
| `ViPlanner` | ClutteredEnv | 72.4 | 72.3 | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) |
| `InternVLA-N1 (S1)` | ClutteredEnv | **89.8** | **87.7** | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) |
| `iPlanner` | InternScenes | 48.8 | 46.7 | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) |
| `ViPlanner` | InternScenes | 54.3 | 52.5 | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) |
| `InternVLA-N1 (S1)` | InternScenes | **65.7** | **60.7** | [Model](https://github.com/InternRobotics/NavDP?tab=readme-ov-file#%EF%B8%8F-installation-of-baseline-library) |



@@ -243,7 +244,7 @@ If you use the specific pretrained models and benchmarks, please kindly cite the

## 📄 License

InternNav's codes are [MIT licensed](LICENSE).
The open-sourced InternData-N1 data are under the <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License </a><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>.
Other datasets, such as VLN-CE, retain their own distribution licenses.

Binary file added assets/realworld_sample_data.tar.gz
Binary file not shown.
13 changes: 10 additions & 3 deletions internnav/agent/internvla_n1_agent_realworld.py
@@ -27,6 +27,7 @@ class InternVLAN1AsyncAgent:
def __init__(self, args):
self.device = torch.device(args.device)
self.save_dir = "test_data/" + datetime.now().strftime("%Y%m%d_%H%M%S")
print(f"args.model_path{args.model_path}")
self.model = InternVLAN1ForCausalLM.from_pretrained(
args.model_path,
torch_dtype=torch.bfloat16,
@@ -42,6 +43,7 @@ def __init__(self, args):
self.resize_w = args.resize_w
self.resize_h = args.resize_h
self.num_history = args.num_history
self.PLAN_STEP_GAP = args.plan_step_gap

prompt = "You are an autonomous navigation assistant. Your task is to <instruction>. Where should you go next to stay on track? Please output the next waypoint's coordinates in the image. Please output STOP when you have successfully completed the task."
answer = ""
Expand Down Expand Up @@ -91,6 +93,12 @@ def reset(self):
self.llm_output = ""
self.past_key_values = None

self.output_action = None
self.output_latent = None
self.output_pixel = None
self.pixel_goal_rgb = None
self.pixel_goal_depth = None

self.save_dir = "test_data/" + datetime.now().strftime("%Y%m%d_%H%M%S")
os.makedirs(self.save_dir, exist_ok=True)

@@ -118,9 +126,8 @@ def trajectory_tovw(self, trajectory, kp=1.0):

def step(self, rgb, depth, pose, instruction, intrinsic, look_down=False):
dual_sys_output = S2Output()
PLAN_STEP_GAP = 8
no_output_flag = self.output_action is None and self.output_latent is None
if (self.episode_idx - self.last_s2_idx > PLAN_STEP_GAP) or look_down or no_output_flag:
if (self.episode_idx - self.last_s2_idx > self.PLAN_STEP_GAP) or look_down or no_output_flag:
self.output_action, self.output_latent, self.output_pixel = self.step_s2(
rgb, depth, pose, instruction, intrinsic, look_down
)
@@ -152,7 +159,7 @@ def step(self, rgb, depth, pose, instruction, intrinsic, look_down=False):
)
trajectories = self.step_s1(self.output_latent, rgbs, depths)

dual_sys_output.output_action = traj_to_actions(trajectories)
dual_sys_output.output_trajectory = traj_to_actions(trajectories, use_discrate_action=False)

return dual_sys_output

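The diff above replaces the hard-coded `PLAN_STEP_GAP = 8` in `step()` with a `plan_step_gap` argument and caches S2 outputs across steps (now also cleared in `reset()`). A minimal sketch of that gating logic, assuming a hypothetical `Args` container in place of the real argument parser:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Args:
    plan_step_gap: int = 8  # hypothetical default, mirroring the old hard-coded value

class CadenceSketch:
    """Runs the slow S2 planner only every `plan_step_gap` steps."""

    def __init__(self, args: Args):
        self.PLAN_STEP_GAP = args.plan_step_gap
        self.episode_idx = 0   # current step index
        self.last_s2_idx = 0   # step index of the last S2 call
        self.output_action: Optional[object] = None
        self.output_latent: Optional[object] = None

    def should_run_s2(self, look_down: bool = False) -> bool:
        # Same three triggers as the diff: the gap has elapsed, a forced
        # look-down replan, or no cached S2 output yet.
        no_output_flag = self.output_action is None and self.output_latent is None
        return (
            self.episode_idx - self.last_s2_idx > self.PLAN_STEP_GAP
            or look_down
            or no_output_flag
        )
```

Between S2 calls the fast S1 policy keeps consuming the cached `output_latent`, which is why `reset()` now clears those fields explicitly.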
42 changes: 27 additions & 15 deletions internnav/model/utils/vln_utils.py
@@ -1,10 +1,10 @@
from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np
import torch
from PIL import Image
from torch import Tensor


def open_image(image_or_image_path):
@@ -15,9 +15,11 @@ def open_image(image_or_image_path):
else:
raise ValueError("Unsupported input type!")


def split_and_clean(text):
# Split by <image> while preserving the delimiter
import re

parts = re.split(r'(<image>)', text)
results = []
for part in parts:
@@ -30,10 +32,11 @@ def split_and_clean(text):
results.append(clean_part)
return results


def chunk_token(dp_actions):
out_list = []
out_list_read = []

for i in range(len(dp_actions)):
xyyaw = dp_actions[i]
x = xyyaw[0]
@@ -56,7 +59,8 @@

return out_list

def traj_to_actions(dp_actions):

def traj_to_actions(dp_actions, use_discrate_action=True):
def reconstruct_xy_from_delta(delta_xyt):
"""
Input:
@@ -84,7 +88,7 @@ def trajectory_to_discrete_actions_close_to_goal(trajectory, step_size=0.25, tur
turn_angle_rad = np.deg2rad(turn_angle_deg)
traj = trajectory
goal = trajectory[-1]

def normalize_angle(angle):
return (angle + np.pi) % (2 * np.pi) - np.pi

@@ -120,13 +124,17 @@ def normalize_angle(angle):
pos = next_pos

return actions

# unnormalize
dp_actions[:, :, :2] /= 4.0
all_trajectory = reconstruct_xy_from_delta(dp_actions.float().cpu().numpy())
trajectory = np.mean(all_trajectory, axis=0)
actions = trajectory_to_discrete_actions_close_to_goal(trajectory)
return actions
if use_discrate_action:
actions = trajectory_to_discrete_actions_close_to_goal(trajectory)
return actions
else:
return trajectory
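
With the new `use_discrate_action` flag, the same helper can return either discrete actions (the previous behavior) or the raw averaged trajectory, which the real-world agent now stores in `output_trajectory`. A hedged usage sketch; the `(batch, horizon, 3)` shape is an assumption inferred from the `[:, :, :2]` indexing above, and the random inputs are illustrative only:

```python
import torch

from internnav.model.utils.vln_utils import traj_to_actions

# Hypothetical diffusion-policy output: (batch, horizon, 3) deltas (dx, dy, dyaw).
dp_actions = torch.randn(4, 8, 3)

# traj_to_actions unnormalizes in place, so clone before calling twice.
discrete = traj_to_actions(dp_actions.clone())  # default: discrete action sequence
trajectory = traj_to_actions(dp_actions.clone(), use_discrate_action=False)  # continuous waypoints
```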


@dataclass
class S2Input:
@@ -138,29 +146,33 @@ class S2Input:
look_down: Optional[bool] = False
should_infer: Optional[bool] = False


@dataclass
class S2Output:
idx: Optional[int] = -1
is_infering: Optional[bool] = False
output_action: Optional[np.ndarray] = None
output_trajectory: Optional[np.ndarray] = None
output_pixel: Optional[np.ndarray] = None
output_latent: Optional[torch.Tensor] = None
rgb_memory: Optional[np.ndarray] = None  # records the RGB of the pixel-goal frame
depth_memory: Optional[np.ndarray] = None  # records the depth of the pixel-goal frame

def validate(self):
"""确保output_action、output_pixel和output_latent中只有一个为非None"""
outputs = [self.output_action, self.output_pixel, self.output_latent]
non_none_count = sum(1 for x in outputs if x is not None)
return non_none_count > 0 and self.idx >= 0



@dataclass
class S1Input:
pixel_goal: Optional[np.ndarray] = None
latent: Optional[np.ndarray] = None
rgb: Optional[np.ndarray] = None
depth: Optional[np.ndarray] = None


@dataclass
class S1Output:
# idx: Optional[int] = None
@@ -171,7 +183,6 @@ class S1Output:
vis_image: Optional[np.ndarray] = None



def image_resize(
img: Tensor,
size: Tuple[int, int],
@@ -241,6 +252,7 @@

return float(rho), float(theta)
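
The body of the polar-goal helper above is collapsed in this diff; only its `return float(rho), float(theta)` tail is visible. A self-contained sketch of the standard rho/theta computation it appears to implement, under an assumed counter-clockwise heading convention:

```python
import numpy as np

# Robot at the origin facing +x; goal one meter to the side (hypothetical values).
curr_pos = np.array([0.0, 0.0])
curr_heading = 0.0
curr_goal = np.array([0.0, 1.0])

delta = curr_goal - curr_pos
rho = float(np.linalg.norm(delta))  # distance to goal
# Relative heading, wrapped to [-pi, pi).
theta = float((np.arctan2(delta[1], delta[0]) - curr_heading + np.pi) % (2 * np.pi) - np.pi)
print(rho, theta)  # 1.0, pi/2 under this assumed convention
```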


def get_rotation_matrix(angle: float, ndims: int = 2) -> np.ndarray:
"""Returns a 2x2 or 3x3 rotation matrix for a given angle; if 3x3, the z-axis is
rotated."""
@@ -260,4 +272,4 @@ def get_rotation_matrix(angle: float, ndims: int = 2) -> np.ndarray:
]
)
else:
raise ValueError("ndims must be 2 or 3")
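
The 2x2 branch of `get_rotation_matrix` is collapsed above; per the docstring it returns a rotation matrix for the given angle. As a reference, a standard counter-clockwise 2D rotation (an assumption about that branch's convention) behaves like this minimal sketch:

```python
import numpy as np

def rotation_matrix_2d(angle: float) -> np.ndarray:
    # Standard counter-clockwise 2D rotation matrix.
    return np.array([
        [np.cos(angle), -np.sin(angle)],
        [np.sin(angle), np.cos(angle)],
    ])

# Rotating the x-axis by +90 degrees yields the y-axis.
v = rotation_matrix_2d(np.pi / 2) @ np.array([1.0, 0.0])
assert np.allclose(v, [0.0, 1.0])
```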