A novel AI framework designed specifically for remote sensing image analysis, integrating large language models (LLMs) with specialized vision-language models to overcome the challenges of Earth observation data.
GeoSpatial-RAG employs a retrieval-augmented generation (RAG) approach that creates a multi-modal knowledge vector database from remote sensing imagery and textual descriptions. The framework addresses the significant domain gap between natural images and remote sensing imagery by developing a specialized pipeline using CLIP (Contrastive Language-Image Pretraining).
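To illustrate the cross-modal encoding step, the sketch below embeds an image and a caption with the Hugging Face `transformers` CLIP classes and the `openai/clip-vit-base-patch32` checkpoint named in the configuration. It is a minimal example only; the packaged `embeddings.py` module may organize this differently, and the file name is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("satellite_image.jpg")  # placeholder remote sensing image
caption = "an aerial view of storage tanks"

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=[caption], return_tensors="pt", padding=True))

# Both modalities land in the same 512-dimensional space, so cosine
# similarity between them can drive cross-modal retrieval.
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
print(f"image-text similarity: {similarity:.4f}")
```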
- Domain-Specific RAG: First RAG system specifically designed for remote sensing imagery
- Multi-Modal Intelligence: Seamlessly combines text and image understanding
- High Accuracy: Achieves 88%+ similarity matching for relevant queries
- Production Ready: Complete web interface with ChatGPT-like experience
- Multi-modal Knowledge Vector Database: Unified encoding of remote sensing images and text descriptions
- Cross-modal Retrieval: Semantic search using natural language queries or image inputs
- CLIP-based Embeddings: Leverages CLIP for both visual and textual information encoding
- LangChain Integration: Advanced text generation with vision-language model support (a retriever sketch follows this list)
- SQLite Database: Efficient storage and retrieval of embeddings
- Web Interface: Modern ChatGPT-like interface for easy interaction
- Real-time Processing: GPU-accelerated processing for fast responses
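To make the LangChain integration concrete, the retrieval step can be exposed through LangChain's standard `BaseRetriever` interface. The sketch below is illustrative only: the class name, fields, and stubbed search are assumptions, not the actual `retriever.py` shipped with this package.

```python
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever


class GeoRetrieverSketch(BaseRetriever):
    """Hypothetical stand-in for the package's custom retriever."""

    top_k: int = 5

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        # A real implementation would embed `query` with CLIP and run a
        # similarity search over the SQLite vector store; this stub just
        # returns a canned document so the interface is runnable.
        return [
            Document(
                page_content="Several white storage tanks sit near a harbor.",
                metadata={"similarity": 0.88, "image_id": "demo_0001"},
            )
        ][: self.top_k]


retriever = GeoRetrieverSketch(top_k=3)
docs = retriever.invoke("aerial view of storage tanks")
print(docs[0].metadata["similarity"])
```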
```
Input Query (Text/Image)      Images      Text Captions
            │                   │               │
            └───────────────────┴───────────────┘
                                │
                                ▼
                 CLIP Encoder (Text + Vision)
                                │
                                ▼
                     SQLite Vector Database
                    (Text & Image Embeddings)
                                │
                                ▼
                  Similarity Search & Retrieval
                                │
                                ▼
             LangChain RAG Pipeline + VLM Generation
                                │
                                ▼
                       Generated Response
```
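SQLite has no native vector type, so one common pattern for the vector database stage shown above is to store each embedding as a binary blob and score candidates in Python at query time. The sketch below shows that pattern under assumed table and column names; the actual schema in `database.py` may differ.

```python
import sqlite3
import numpy as np

conn = sqlite3.connect("demo_embeddings.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS text_embeddings "
    "(id INTEGER PRIMARY KEY, description TEXT, embedding BLOB)"
)

def insert(description: str, embedding: np.ndarray) -> None:
    # Store the vector as raw float32 bytes
    conn.execute(
        "INSERT INTO text_embeddings (description, embedding) VALUES (?, ?)",
        (description, embedding.astype(np.float32).tobytes()),
    )
    conn.commit()

def search(query_emb: np.ndarray, top_k: int = 5):
    # Brute-force cosine similarity over all stored vectors
    scored = []
    for description, blob in conn.execute("SELECT description, embedding FROM text_embeddings"):
        emb = np.frombuffer(blob, dtype=np.float32)
        sim = float(np.dot(query_emb, emb) / (np.linalg.norm(query_emb) * np.linalg.norm(emb) + 1e-8))
        scored.append((sim, description))
    return sorted(scored, reverse=True)[:top_k]

# Tiny demonstration with a random 512-d vector (the CLIP ViT-B/32 dimension)
rng = np.random.default_rng(0)
insert("aerial view of storage tanks", rng.standard_normal(512))
print(search(rng.standard_normal(512), top_k=1))
```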
- Python 3.8+
- CUDA-compatible GPU (recommended)
- 8GB+ RAM
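A CUDA-capable GPU speeds things up but is not required; the `DEVICE=auto` setting shown later in the configuration is expected to fall back to CPU. A quick check with plain PyTorch (not part of this package):

```python
import torch

# Prefer the GPU when one is available, otherwise run on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```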
- Clone the repository

```bash
git clone https://github.com/debanjan06/geospatial-rag.git
cd geospatial-rag
```

- Create virtual environment

```bash
python -m venv geospatial_env
# Windows
geospatial_env\Scripts\activate
# Linux/MacOS
source geospatial_env/bin/activate
```

- Install dependencies

```bash
pip install -r requirements.txt
pip install -e .
```

- Set up environment variables

```bash
cp .env.example .env
# Edit .env with your configuration
```

```bash
# Download our pre-built database (10,975 documents)
# Place in: database/rsicd_embeddings.db
# Contact: bl.sc.p2dsc24032@bl.students.amrita.edu for access
```

```bash
# 1. Download RSICD dataset
wget [RSICD_DATASET_URL]

# 2. Generate embeddings
python scripts/generate_embeddings.py --dataset_path /path/to/RSICD --output_dir ./database

# 3. Create SQLite database
python scripts/create_database.py --embeddings_dir ./database --db_path ./database/rsicd_embeddings.db
```

```bash
# Create a small demo database for testing
python scripts/create_demo_database.py
```

```bash
# Test database and imports
python test_database.py
```

Expected output:
```
GeoSpatial-RAG System Test
========================================
Database connected successfully!
Descriptions: 10,975
Text embeddings: 10,975
Image embeddings: 10,975
All tests passed!
```
```bash
# Interactive demo
python demo/interactive_demo.py --db_path ./database/rsicd_embeddings.db
```

```bash
# Start the web interface
streamlit run streamlit_app.py
```

Then open: http://localhost:8501
```python
from geospatial_rag import GeoSpatialRAG
from PIL import Image

# Initialize the RAG system
rag = GeoSpatialRAG(db_path="./database/rsicd_embeddings.db")

# Text-only query
results = rag.query("Show me aerial views of storage tanks")

# Text + Image query
image = Image.open("satellite_image.jpg")
results = rag.query("What does this image show?", image=image)

# Display results
for doc in results['documents']:
    print(f"Description: {doc.page_content}")
    print(f"Similarity: {doc.metadata['similarity']:.4f}")
    print("---")

print(f"AI Response: {results['response']}")

# Close when done
rag.close()
```

The system has been tested and validated with the following results:
- Database Size: 10,975 remote sensing images with embeddings
- Accuracy: 88%+ similarity scores for relevant queries
- Speed: <2 seconds average query time on GPU
- Precision: High relevance in top-5 results for domain-specific queries
| Query | Top Similarity Score | Retrieved Documents | Response Quality |
|---|---|---|---|
| "industrial complex with buildings" | 0.8818 | 5/5 relevant | Excellent |
| "aerial view of storage tanks" | 0.7631 | 5/5 relevant | Excellent |
| "satellite image of urban area" | 0.8203 | 4/5 relevant | Very Good |
| "remote sensing of forest" | 0.7892 | 5/5 relevant | Excellent |
```
geospatial-rag/
├── src/
│   ├── __init__.py
│   └── geospatial_rag/          # Main package
│       ├── __init__.py
│       ├── embeddings.py        # CLIP embedding generation
│       ├── database.py          # SQLite database operations
│       ├── retriever.py         # Custom retriever class
│       ├── pipeline.py          # Main RAG pipeline
│       └── utils.py             # Utility functions
├── demo/
│   └── interactive_demo.py      # Command-line interface
├── tests/
│   └── test_*.py                # Test modules
├── streamlit_app.py             # Web interface
├── setup_web_interface.py       # Web interface setup
├── quick_start.py               # Quick start script
├── test_database.py             # Database testing
├── requirements.txt             # Dependencies
├── setup.py                     # Package setup
├── LICENSE                      # MIT License
├── .gitignore                   # Git ignore rules
└── README.md                    # This file
```
The Streamlit web interface provides:
- ChatGPT-like Interface: Natural conversation flow (a minimal sketch follows this list)
- Image Upload: Drag-and-drop satellite/aerial image analysis
- Advanced Settings: Configurable similarity thresholds and result counts
- Real-time Stats: Database statistics and system status
- Live Search: Instant results with similarity scores
- Responsive Design: Works on desktop and mobile
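For a rough idea of how such an interface is wired together, here is a minimal Streamlit sketch with a placeholder answer in place of the RAG call; `streamlit_app.py` in this repository is the full implementation.

```python
import streamlit as st
from PIL import Image

st.title("GeoSpatial-RAG Chat (sketch)")

# Optional image input for text + image queries
uploaded = st.file_uploader("Upload a satellite/aerial image", type=["png", "jpg", "jpeg"])
image = Image.open(uploaded) if uploaded else None
if image is not None:
    st.image(image, caption="Query image")

# Chat-style input and response
if prompt := st.chat_input("Ask about the imagery..."):
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        # Placeholder: the real app calls the RAG pipeline here
        st.write(f"(demo) Query received: {prompt!r}; image attached: {image is not None}")
```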
```
# Model Configuration
CLIP_MODEL_NAME=openai/clip-vit-base-patch32
VLM_MODEL_NAME=Salesforce/blip-image-captioning-large
DEVICE=auto

# Database Configuration
DB_PATH=./database/rsicd_embeddings.db

# Processing Configuration
BATCH_SIZE=16
TEXT_WEIGHT=0.7
IMAGE_WEIGHT=0.3
TOP_K=5

# API Keys (optional)
HUGGINGFACE_API_KEY=your_hf_api_key_here
```

The system supports extensive configuration through:
- Environment variables
- Configuration files (JSON/YAML)
- Command-line arguments
- Python API parameters
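As an illustration of how `TEXT_WEIGHT` and `IMAGE_WEIGHT` from the configuration above could combine per-modality similarities into a single ranking score (a sketch under that assumption; the actual fusion logic in the pipeline may differ):

```python
import os

# Defaults mirror the example configuration above
text_weight = float(os.getenv("TEXT_WEIGHT", 0.7))
image_weight = float(os.getenv("IMAGE_WEIGHT", 0.3))

def combined_score(text_similarity: float, image_similarity: float) -> float:
    """Weighted fusion of text and image cosine similarities."""
    return text_weight * text_similarity + image_weight * image_similarity

# A document that matches the text query strongly and the image loosely
print(combined_score(0.88, 0.42))  # 0.7 * 0.88 + 0.3 * 0.42 = 0.742
```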
- Size: 10,921 remote sensing images
- Resolution: 224×224 pixels
- Sources: Google Earth, Baidu Map, MapABC, Tianditu
- Descriptions: 5 sentences per image
- Splits: Train (8,734) / Valid (1,094) / Test (1,093)
- Features: High intra-class diversity and low inter-class dissimilarity
- Satellite imagery
- Aerial photography
- Remote sensing data
- Multispectral images
- Urban planning imagery
- Agricultural monitoring
- Environmental surveillance
```bash
# Run all tests
pytest tests/ -v

# Run specific test modules
pytest tests/test_embeddings.py
pytest tests/test_database.py
pytest tests/test_pipeline.py

# Run with coverage
pytest tests/ --cov=geospatial_rag --cov-report=html
```

We welcome contributions! Please see our Contributing Guidelines.
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Install development dependencies: `pip install -e ".[dev]"`
- Make your changes and add tests
- Run tests: `pytest tests/`
- Submit a pull request
- New Models: Integration of additional vision-language models
- Datasets: Support for new remote sensing datasets
- Interfaces: Mobile apps, desktop applications
- Performance: Optimization and scaling improvements
- Documentation: Tutorials, examples, and guides
This project is licensed under the MIT License - see the LICENSE file for details.
This work builds upon and is inspired by the following research:
[1] L. Fang et al., "Open-world recognition in remote sensing: Concepts, challenges, and opportunities," IEEE Geosci. Remote Sens. Mag., vol. 12, no. 2, pp. 8–31, 2024.
[2] R. M. Haralick, K. Shanmugam, and I. Dinstein, "Textural features for image classification," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 6, pp. 610–621, 1973.
[3] K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan, "GeoChat: Grounded large vision-language model for remote sensing," arXiv preprint arXiv:2311.15826, 2023.
[4] R. Xu, C. Wang, J. Zhang, S. Xu, W. Meng, and X. Zhang, "RSSFormer: Foreground saliency enhancement for remote sensing land-cover segmentation," IEEE Trans. Image Process., vol. 32, pp. 1052–1064, 2023.
[5] J. Lin, Z. Yang, Q. Liu, Y. Yan, P. Ghamisi, W. Xie, and L. Fang, "HSLabeling: Toward efficient labeling for large-scale remote sensing image segmentation with hybrid sparse labeling," IEEE Trans. Image Process., vol. 34, pp. 1864–1878, 2025.
[6] W. Zhang, M. Cai, T. Zhang, Y. Zhuang, and X. Mao, "EarthGPT: A universal multimodal large language model for multisensor image comprehension in remote sensing domain," IEEE Trans. Geosci. Remote Sens., vol. 62, Art. no. 5917820, 2024.
[7] Y. Hu, J. Yuan, C. Wen, X. Lu, and X. Li, "RSGPT: A remote sensing vision language model and benchmark," arXiv preprint arXiv:2307.15266, 2023.
[8] L. Zhu, F. Wei, and Y. Lu, "Beyond text: Frozen large language models in visual signal comprehension," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 27047–27057, 2024.
[9] Z. Yuan, Z. Xiong, L. Mou, and X. X. Zhu, "ChatEarthNet: A global-scale image–text dataset empowering vision–language geo-foundation models," Earth Syst. Sci. Data, vol. 17, pp. 1245–1263, 2025.
[10] X. Lu, B. Wang, X. Zheng, and X. Li, "Exploring models and data for remote sensing image caption generation," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2183–2195, 2018.
- Amrita Viswa Vidyapeetham for research support and computational resources
- OpenAI for the CLIP model and vision-language research
- Salesforce for the BLIP model
- RSICD dataset creators for providing the remote sensing image captioning dataset
- LangChain community for the RAG framework
- Streamlit team for the excellent web app framework
- Lead Researcher: Debanjan Shil
- Email: bl.sc.p2dsc24032@bl.students.amrita.edu
- Institution: School of Computing, Amrita Viswa Vidyapeetham, Bengaluru
- Project Repository: https://github.com/debanjan06/geospatial-rag
- Bug Reports: GitHub Issues
- Feature Requests: GitHub Discussions
- Direct Contact: For database access or collaboration inquiries
- Documentation: Check our docs/ directory for detailed guides
Star this repository if you find it helpful!
Ready to revolutionize remote sensing analysis with AI?
Get Started • Documentation • Examples • Web Demo