An intelligent conversational interface for querying and visualizing ARGO oceanographic float data using RAG (Retrieval-Augmented Generation) and MCP (Model Context Protocol).
Data Flow: Ingest Argo NetCDF → normalize & store (Postgres + Parquet) → index metadata & embeddings (FAISS/Chroma) → RAG + MCP translator (LLM) → Backend APIs → Interactive dashboard + Chat UI (Streamlit) + visualizations (Plotly/Leaflet)
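As a concrete illustration of the first two stages, here is a minimal sketch of normalizing one single-profile Argo NetCDF file into a Parquet analytics copy with xarray and pandas. The variable names (TEMP, PSAL, PRES, LATITUDE, LONGITUDE, JULD) follow the Argo core-profile convention, the file paths are illustrative, and the real logic lives in ingestion/pipeline.py.

```python
# Sketch: normalize one single-profile Argo NetCDF file into Parquet.
# Variable names follow the Argo core-profile convention; file paths are
# illustrative. The real pipeline lives in ingestion/pipeline.py.
import pandas as pd
import xarray as xr

def netcdf_to_parquet(nc_path: str, out_path: str) -> None:
    ds = xr.open_dataset(nc_path)
    df = pd.DataFrame({
        "pressure": ds["PRES"].values.ravel(),      # dbar
        "temperature": ds["TEMP"].values.ravel(),   # degrees C
        "salinity": ds["PSAL"].values.ravel(),      # PSU
    })
    # Broadcast the per-profile coordinates across every measurement level
    df["latitude"] = float(ds["LATITUDE"].values[0])
    df["longitude"] = float(ds["LONGITUDE"].values[0])
    df["time"] = pd.Timestamp(ds["JULD"].values[0])
    df.dropna().to_parquet(out_path, index=False)   # analytics copy via PyArrow

netcdf_to_parquet("data/raw/R2902123_001.nc",
                  "data/processed/2902123_001.parquet")
```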

- ARGO Data Ingestion: Parse NetCDF files and normalize to structured formats
- Dual Storage: PostgreSQL with PostGIS for spatial queries + Parquet for analytics
- Vector Search: FAISS/Chroma for semantic retrieval of profiles and metadata (see the sketch after this list)
- RAG + MCP: LLM-powered natural language to SQL translation with structured outputs
- FastAPI Backend: RESTful APIs for chat, queries, and data access
- Interactive Frontend: Streamlit dashboard with chat, maps, and visualizations
- Geospatial Viz: Leaflet maps for float trajectories, Plotly for profiles
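A minimal sketch of the vector-search step with ChromaDB, assuming profile summaries are indexed as documents; the collection name and metadata fields here are illustrative, not the project's actual schema.

```python
# Sketch: index profile summaries in ChromaDB and retrieve candidates for a
# user question before SQL generation. Collection name and metadata fields
# are illustrative, not the project's actual schema.
import chromadb

client = chromadb.PersistentClient(path="./data/embeddings/chroma")
profiles = client.get_or_create_collection("argo_profiles")

# Each profile is stored as a short natural-language summary plus metadata
profiles.add(
    ids=["2902123_001"],
    documents=["Float 2902123, Indian Ocean, 2023-03-15: T/S profile to 2000 dbar"],
    metadatas=[{"float_id": "2902123", "ocean": "indian", "date": "2023-03-15"}],
)

# Semantic retrieval of the most relevant profiles for a question
hits = profiles.query(
    query_texts=["salinity near the equator in March 2023"],
    n_results=5,
)
print(hits["ids"], hits["distances"])
```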
Project structure:

```
bluepy/
├── backend/
│   ├── api/              # FastAPI endpoints
│   ├── core/             # Core business logic
│   ├── db/               # Database models and connections
│   ├── rag/              # RAG + MCP implementation
│   └── main.py           # FastAPI app entry point
├── frontend/
│   ├── app.py            # Streamlit main app
│   ├── components/       # UI components
│   └── utils/            # Frontend utilities
├── ingestion/
│   ├── parsers/          # NetCDF parsers
│   ├── normalizers/      # Data normalization
│   └── pipeline.py       # Ingestion pipeline
├── data/
│   ├── raw/              # Raw NetCDF files
│   ├── processed/        # Parquet files
│   └── embeddings/       # Vector DB storage
├── tests/                # Unit and integration tests
├── docker/               # Docker configurations
├── scripts/              # Utility scripts
├── requirements.txt      # Python dependencies
├── .env.example          # Environment variables template
└── README.md
```
Prerequisites:

- Python 3.10+
- PostgreSQL 14+ with PostGIS extension
- Docker (optional, for containerized deployment)
Setup:

- Clone and navigate to the project:

```bash
cd bluepy
```

- Create virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables:

```bash
cp .env.example .env
# Edit .env with your configuration
```

- Initialize database:

```bash
python scripts/init_db.py
```

- Run data ingestion (example):

```bash
python ingestion/pipeline.py --input data/raw --output data/processed
```

Backend API:

```bash
uvicorn backend.main:app --reload --port 8000
```

Frontend Dashboard:

```bash
streamlit run frontend/app.py --server.port 8501
```

Access the application at http://localhost:8501.
- "Show me salinity profiles near the equator in March 2023"
- "What's the average temperature at 500m depth in the Indian Ocean?"
- "Find floats with anomalous oxygen levels in the last 6 months"
- "Plot temperature vs depth for float 2902123"
API Endpoints:

- `POST /chat` - Conversational interface
- `POST /sql/execute` - Execute validated SQL queries
- `GET /profile/{id}` - Get specific profile details
- `GET /floats` - List all floats with filters
- `GET /map/geojson` - Get trajectory data for mapping
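A quick way to exercise the chat endpoint from Python; the request and response fields shown are assumptions about the payload, not a documented contract.

```python
# Sketch: exercise the chat endpoint from Python. The request and response
# fields are assumptions, not a documented contract.
import requests

resp = requests.post(
    "http://localhost:8000/chat",
    json={"message": "Show me salinity profiles near the equator in March 2023"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # e.g. an answer plus any table/plot payloads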
Key environment variables in .env:
```env
# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/argo_db
POSTGRES_USER=argo_user
POSTGRES_PASSWORD=secure_password
POSTGRES_DB=argo_db

# LLM Configuration
OPENAI_API_KEY=your_api_key_here
LLM_MODEL=gpt-4
EMBEDDING_MODEL=text-embedding-3-small

# Vector DB
VECTOR_DB_TYPE=chroma  # or faiss
CHROMA_PERSIST_DIR=./data/embeddings/chroma

# API
API_HOST=0.0.0.0
API_PORT=8000
CORS_ORIGINS=http://localhost:8501

# Frontend
STREAMLIT_SERVER_PORT=8501
MAP_PROVIDER=leaflet
```
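At startup the services read these variables from the environment; a minimal sketch, assuming python-dotenv (any loader works) and the variable names from .env.example:

```python
# Sketch: read the configuration at startup. python-dotenv is an assumption
# (any loader works); the variable names match .env.example.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

DATABASE_URL = os.environ["DATABASE_URL"]               # required
VECTOR_DB_TYPE = os.getenv("VECTOR_DB_TYPE", "chroma")  # optional, defaulted
API_PORT = int(os.getenv("API_PORT", "8000"))
```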
Testing:

```bash
pytest tests/ -v --cov=backend --cov=ingestion
```

Code quality:

```bash
# Linting
flake8 backend/ ingestion/ frontend/

# Type checking
mypy backend/ ingestion/

# Formatting
black backend/ ingestion/ frontend/
```

Deployment:

```bash
# Docker Compose
docker-compose up -d

# Kubernetes
kubectl apply -f k8s/
```

Database schema:

- `argo_profile` - Main profile data table with spatial indexing
- `argo_profile_meta` - Profile metadata and summaries
- `argo_float` - Float information and trajectories
See backend/db/schema.sql for complete schema definitions.
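For orientation only, a hedged SQLAlchemy sketch of what the profile table could look like; the column names and types are guesses, GeoAlchemy2 (not listed in the stack) is assumed for the PostGIS geometry column, and backend/db/schema.sql remains authoritative.

```python
# Illustrative SQLAlchemy model of the profile table; column names and types
# are guesses for orientation only, and backend/db/schema.sql is authoritative.
# GeoAlchemy2 (not listed in the stack) is assumed for the PostGIS column.
from geoalchemy2 import Geometry
from sqlalchemy import BigInteger, Column, DateTime, Float, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ArgoProfile(Base):
    __tablename__ = "argo_profile"

    profile_id = Column(BigInteger, primary_key=True)
    float_id = Column(Text, index=True)
    juld = Column(DateTime(timezone=True), index=True)  # profile timestamp
    geom = Column(Geometry("POINT", srid=4326))         # GiST index by default
    pres = Column(Float)  # dbar
    temp = Column(Float)  # degrees C
    psal = Column(Float)  # PSU
```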
Tech stack:

- Backend: FastAPI, SQLAlchemy, psycopg2
- Database: PostgreSQL + PostGIS, Parquet (PyArrow)
- Vector DB: ChromaDB / FAISS
- LLM: OpenAI GPT-4 / Anthropic Claude
- Frontend: Streamlit, Plotly, Folium/Leaflet
- Data Processing: xarray, netCDF4, pandas, numpy
- Deployment: Docker, Docker Compose, Kubernetes
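To show how these pieces fit together on the frontend, a minimal Folium sketch that draws float trajectories from the /map/geojson endpoint; that the endpoint returns a GeoJSON FeatureCollection of LineStrings is my assumption, not the API's documented shape.

```python
# Sketch: draw float trajectories on a Leaflet map via Folium. Assumes
# /map/geojson returns a GeoJSON FeatureCollection of LineStrings (an
# assumption about the payload, not its documented shape).
import folium
import requests

geojson = requests.get("http://localhost:8000/map/geojson", timeout=30).json()

m = folium.Map(location=[0.0, 80.0], zoom_start=3, tiles="CartoDB positron")
folium.GeoJson(geojson, name="float trajectories").add_to(m)
m.save("trajectories.html")  # or embed in Streamlit, e.g. via streamlit-folium
```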
License: MIT - see the LICENSE file for details.
Acknowledgments:

- ARGO Program for oceanographic data
- OpenAI for LLM capabilities
- Streamlit community for an excellent framework