Wenjia Xu*,
Zijian Yu*,
Boyang Mu,
Zhiwei Wei,
Yuanben Zhang,
Guangzuo Li and
Mugen Peng
* Equal Contribution
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
School of Geographic Sciences, Hunan Normal University
Aerospace Information Research Institute, Chinese Academy of Sciences
Recent advancements in Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) have led to impressive performance in remote sensing tasks. However, these models are limited to basic vision and language tasks and lack the specialized expertise needed for complex remote sensing applications. To address these limitations, we propose RS-Agent, an intelligent agent for remote sensing. RS-Agent uses an LLM as its "Central Controller," which lets it understand and respond to a wide range of problems. It integrates high-performance remote sensing image processing tools and supports multi-tool, multi-turn conversations for complex tasks. Additionally, RS-Agent uses a knowledge graph-enhanced Retrieval-Augmented Generation (RAG) framework to access domain-specific knowledge and answer expert-level queries accurately. Experimental results show that RS-Agent achieves over 95% task planning accuracy and demonstrates strong domain-specific knowledge retrieval across various tasks.
- Central Controller: Serves as the decision-making core of the agent. It interprets user queries, plans task execution, manages dialogue history, and synthesizes final responses.
- Toolkit: A collection of state-of-the-art remote sensing tools for various applications. These tools are invoked based on the Central Controller’s planning.
- Solution Space: Stores predefined expert-level task solutions. It guides the Controller in selecting appropriate tools and execution strategies by retrieving relevant task-specific instructions.
- Knowledge Space: Provides domain-specific information via a curated knowledge database. It supports expert-level reasoning by retrieving relevant content.
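The interaction between the Central Controller, the Toolkit, and the Solution Space can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `CentralController` class, the `retrieve_solution` helper, the stub tool functions, and the word-overlap matching are hypothetical stand-ins, not RS-Agent's actual code. In particular, the real Controller is an LLM, while this sketch plans by simple word overlap against stored task descriptions.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


# Hypothetical stub tools; a real toolkit would run remote sensing models.
def cloud_removal(image: str) -> str:
    return f"cloud-free version of {image}"


def sar_detection(image: str) -> str:
    return f"detections for SAR image {image}"


# Toolkit: tool name -> callable (names follow the tool table in this README).
TOOLKIT: Dict[str, Callable[[str], str]] = {
    "cloud_removal": cloud_removal,
    "sar_detection": sar_detection,
}

# Solution Space: predefined task descriptions paired with the tool to use.
# These entries are made up for the example.
SOLUTION_SPACE = [
    {"task": "remove clouds from a satellite image", "tool": "cloud_removal"},
    {"task": "find objects targets in a sar image", "tool": "sar_detection"},
]


def tokens(text: str) -> set:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_solution(query: str) -> dict:
    """Return the stored solution sharing the most words with the query."""
    q = tokens(query)
    return max(SOLUTION_SPACE, key=lambda s: len(q & tokens(s["task"])))


@dataclass
class CentralController:
    # Dialogue history as (query, tool name, result) tuples.
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def handle(self, query: str, image: str) -> str:
        solution = retrieve_solution(query)   # consult the Solution Space
        tool_name = solution["tool"]          # plan: pick the tool
        result = TOOLKIT[tool_name](image)    # invoke the tool
        self.history.append((query, tool_name, result))
        return f"[{tool_name}] {result}"      # synthesize the reply


agent = CentralController()
print(agent.handle("Remove the clouds in this image.", "scene.tif"))
print(agent.handle("Find the objects in this SAR image.", "sar.tif"))
```

Keeping the planner (`retrieve_solution`) separate from the tool registry means that, in this sketch, adding a tool only requires a new `TOOLKIT` entry and a matching Solution Space record.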
| Tool | Function | Example Input |
|---|---|---|
| cloud_removal | Cloud removal from satellite images | Remove the clouds in this image. |
| image_dehazing | Haze removal from images | Dehaze this foggy image. |
| super_resolution | Image super-resolution (2×) | Enhance the resolution of this image. |
| denoising | Image denoising | Remove noise from this image. |
| caption | Geo-specific VQA and captioning | What is in this remote sensing image? |
| optical_detection | Optical image target detection | Detect objects in this optical image. |
| optical_plane_type | Aircraft type recognition in optical images | What type of aircraft is in this image? |
| scene | Scene classification | What is the scene category of this image? |
| sar_detection | Target detection in SAR images | Find the objects in this SAR image. |
| sar_plane_type | Aircraft type recognition in SAR images | Identify the aircraft in this SAR image. |
| knowledge_search | Aircraft info retrieval via Knowledge Database | Who manufactures Boeing 747? |
| building_damage_detection | Building damage assessment | Which buildings are damaged? |
| building_extraction | Building extraction from images | Extract all buildings from the image. |
| road_extraction | Road extraction from images | Extract roads from the scene. |
| horizontal_object_detection | Horizontal bounding box detection | Detect objects using horizontal boxes. |
| rotated_object_detection | Rotated object detection | Detect objects using rotated boxes. |
| semantic_segmentation | Pixel-wise semantic segmentation | Segment the different regions in this image. |
| land_use_classification | Land use categorization | What are the land use types in this image? |
To evaluate RS-Agent’s adaptability, we measure its task planning accuracy when paired with different closed-source (GPT series) and open-source LLMs. This analysis shows whether RS-Agent maintains or improves its performance as the underlying model changes.
| Task | ChatGPT (3.5-turbo-1106) | ChatGPT (3.5-turbo) | ChatGPT (4o-mini) | LLaMa 3.1 (8B) | LLaMa 3.1 (70B) | Qwen2.5 (14B) | Qwen2.5 (32B) | Qwen2.5 (72B) | DeepSeek-r1 (70B) |
|---|---|---|---|---|---|---|---|---|---|
| Speed (t/s) | 87.71 t/s | 65.03 t/s | 58.87 t/s | 100.78 t/s | 17.71 t/s | 69.61 t/s | 36.77 t/s | 16.24 t/s | 18.25 t/s |
| Cloud Removal | 95.00% | 95.00% | 100% | 100% | 100% | 100% | 95.00% | 100% | 100% |
| Image Dehazing | 30.00% | 95.00% | 100% | 100% | 100% | 100% | 100% | 100% | 75.00% |
| Super Resolution | 100% | 100% | 100% | 0.00% | 100% | 100% | 100% | 100% | 95.00% |
| Denoising | 90.00% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 90.00% |
| Image Captioning | 55.00% | 45.00% | 90.00% | 15.00% | 60.00% | 70.00% | 80.00% | 80.00% | 10.00% |
| Object Detection | 75.00% | 60.00% | 95.00% | 30.00% | 90.00% | 90.00% | 85.00% | 100% | 85.00% |
| Optical Plane Classification | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 95.00% |
| Scene Classification | 20.00% | 90.00% | 100% | 80.00% | 90.00% | 90.00% | 100% | 100% | 50.00% |
| SAR Detection | 30.00% | 100% | 100% | 75.00% | 95.00% | 100% | 100% | 100% | 100% |
| SAR Plane Classification | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 90.00% |
| Knowledge Search | 100% | 100% | 100% | 100% | 80.00% | 100% | 100% | 100% | 10.00% |
| Building Damage Detection | 100% | 100% | 100% | 100% | 100% | 95.00% | 100% | 100% | 100% |
| Building Extraction | 10.00% | 70.00% | 100% | 55.00% | 100% | 100% | 100% | 100% | 100% |
| Road Extraction | 15.00% | 55.00% | 100% | 65.00% | 100% | 100% | 100% | 100% | 100% |
| Horizontal Detection | 20.00% | 55.00% | 100% | 95.00% | 100% | 100% | 100% | 100% | 100% |
| Rotated Detection | 15.00% | 35.00% | 100% | 85.00% | 90.00% | 100% | 100% | 100% | 100% |
| Semantic Segmentation | 60.00% | 100% | 100% | 80.00% | 100% | 100% | 100% | 100% | 80.00% |
| Land Use Classification | 15.00% | 100% | 100% | 75.00% | 100% | 100% | 100% | 95.00% | 95.00% |
| Average Accuracy | 57.22% | 82.50% | 99.17% | 75.28% | 94.72% | 96.94% | 97.78% | 98.61% | 81.94% |
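The "Average Accuracy" row can be reproduced as the unweighted mean of the 18 per-task accuracies in each column. The short Python check below does this for three of the columns, with the values copied from the table above; the names `ACCURACY` and `average_accuracy` are just labels chosen for this example.

```python
# Per-task planning accuracy (%) copied from the table above, in row order
# (Cloud Removal first, Land Use Classification last).
ACCURACY = {
    "ChatGPT (3.5-turbo-1106)": [95, 30, 100, 90, 55, 75, 100, 20, 30,
                                 100, 100, 100, 10, 15, 20, 15, 60, 15],
    "ChatGPT (4o-mini)": [100, 100, 100, 100, 90, 95, 100, 100, 100,
                          100, 100, 100, 100, 100, 100, 100, 100, 100],
    "LLaMa 3.1 (8B)": [100, 100, 0, 100, 15, 30, 100, 80, 75,
                       100, 100, 100, 55, 65, 95, 85, 80, 75],
}


def average_accuracy(scores):
    """Unweighted mean over tasks, rounded to two decimals as in the table."""
    return round(sum(scores) / len(scores), 2)


for model, scores in ACCURACY.items():
    print(f"{model}: {average_accuracy(scores)}%")
```

Because the mean is unweighted, a single failing task (for example 0.00% on Super Resolution for LLaMa 3.1 8B) lowers the average by roughly 5.6 percentage points, which is worth keeping in mind when comparing columns.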
This figure shows several key tools that highlight the core capabilities of RS-Agent. The input image is the image supplied by the user. The blue box shows the user's request, and the red box shows RS-Agent's reply.
- We present RS-Agent, a novel architecture designed to interpret user queries and orchestrate diverse tools for accurate and efficient remote sensing task execution. Its four core components (Central Controller, Toolkit, Solution Space, and Knowledge Space) work in concert and complement one another to enable robust, adaptive performance across a wide range of applications.
- To enhance the agent’s task planning accuracy, we propose a Task-Aware Retrieval method. By retrieving and understanding expert-level task solutions, RS-Agent can emulate the decision-making and tool selection processes of professional remote sensing analysts.
- To strengthen RS-Agent’s domain-specific knowledge, we propose DualRAG, a retrieval-augmented generation method that assigns weights to extracted keywords and performs dual-path retrieval, thereby improving the accuracy and relevance of knowledge retrieval.
- Extensive experiments demonstrate that RS-Agent consistently surpasses previous SOTA Multimodal Large Language Models across a range of remote sensing applications and significantly improves task planning accuracy. These results establish RS-Agent as a major step forward in adapting AI agents to the remote sensing field and, for the first time, present a comprehensive and modular architecture tailored for remote sensing applications.
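To make the idea of weighted keywords with dual-path retrieval more concrete, here is a small, heavily simplified Python sketch. It is not the DualRAG implementation: this README does not specify the weighting scheme or what the two paths are, so the sketch assumes one path that looks up entities in a toy knowledge graph and a second path that scores plain-text passages, with entity-like keywords weighted 2.0 and other keywords 1.0. The data, the weights, and all function names (`weight_keywords`, `score`, `dual_retrieve`) are hypothetical.

```python
import re
from typing import Dict, List

# Assumed stopword list and toy data, for illustration only.
STOPWORDS = {"who", "what", "the", "is", "a", "of", "in", "by", "does"}

GRAPH_FACTS: Dict[str, str] = {
    "boeing 747": "The Boeing 747 is manufactured by Boeing.",
    "airbus a380": "The Airbus A380 is manufactured by Airbus.",
}
PASSAGES: List[str] = [
    "The Boeing 747 is a wide-body airliner with four engines.",
    "The Airbus A380 is a double-deck wide-body airliner.",
]


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def weight_keywords(query: str) -> Dict[str, float]:
    """Assumed heuristic: keywords that appear in entity names weigh more."""
    entity_tokens = {t for name in GRAPH_FACTS for t in tokenize(name)}
    return {
        t: (2.0 if t in entity_tokens else 1.0)
        for t in tokenize(query)
        if t not in STOPWORDS
    }


def score(text: str, weights: Dict[str, float]) -> float:
    """Sum of the weights of the keywords that occur in the text."""
    present = set(tokenize(text))
    return sum(w for kw, w in weights.items() if kw in present)


def dual_retrieve(query: str, top_k: int = 2) -> List[str]:
    weights = weight_keywords(query)
    # Path 1 (assumed): match keywords against knowledge-graph entity names.
    graph_hits = [(score(name, weights), fact) for name, fact in GRAPH_FACTS.items()]
    # Path 2 (assumed): match keywords against free-text passages.
    text_hits = [(score(p, weights), p) for p in PASSAGES]
    # Merge both paths and keep the highest-scoring non-zero results.
    merged = sorted(graph_hits + text_hits, key=lambda hit: hit[0], reverse=True)
    return [text for s, text in merged if s > 0][:top_k]


print(dual_retrieve("Who manufactures Boeing 747?"))
```

In this toy setup the graph path returns the manufacturer fact and the text path adds supporting context about the same aircraft, while entries about other aircraft score zero and are dropped.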
We thank the developers of the open-source LLMs and tools used in RS-Agent for releasing their models and code.



