Bluepy - AI Conversational Interface for ARGO Data

An intelligent conversational interface for querying and visualizing ARGO oceanographic float data using RAG (Retrieval-Augmented Generation) and MCP (Model Context Protocol).

Architecture

Data Flow: Ingest ARGO NetCDF → normalize & store (Postgres + Parquet) → index metadata & embeddings (FAISS/Chroma) → RAG + MCP translator (LLM) → Backend APIs → Interactive dashboard + Chat UI (Streamlit) + visualizations (Plotly/Leaflet)

Features

  • ARGO Data Ingestion: Parse NetCDF files and normalize to structured formats
  • Dual Storage: PostgreSQL with PostGIS for spatial queries + Parquet for analytics
  • Vector Search: FAISS/Chroma for semantic retrieval of profiles and metadata (see the retrieval sketch after this list)
  • RAG + MCP: LLM-powered natural language to SQL translation with structured outputs
  • FastAPI Backend: RESTful APIs for chat, queries, and data access
  • Interactive Frontend: Streamlit dashboard with chat, maps, and visualizations
  • Geospatial Viz: Leaflet maps for float trajectories, Plotly for profiles
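
To make the retrieval half of the RAG step concrete, here is a minimal sketch of indexing and querying profile summaries with ChromaDB. The collection name, document text, and metadata fields are illustrative assumptions, not the project's actual implementation:

import chromadb

# Minimal sketch: semantic retrieval over profile summaries with ChromaDB.
# Collection name, documents, and metadata fields are assumptions.
client = chromadb.PersistentClient(path="./data/embeddings/chroma")
collection = client.get_or_create_collection(name="argo_profiles")

# Index short natural-language summaries of profiles alongside metadata.
collection.add(
    ids=["2902123_042"],
    documents=["Float 2902123, cycle 42: T/S profile near 0.5N 67.2E, 2023-03-14"],
    metadatas=[{"float_id": "2902123", "cycle": 42, "lat": 0.5, "lon": 67.2}],
)

# Retrieve the profiles most relevant to a user question.
results = collection.query(
    query_texts=["salinity profiles near the equator in March 2023"],
    n_results=5,
)
print(results["ids"][0], results["metadatas"][0])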

Project Structure

bluepy/
├── backend/
│   ├── api/              # FastAPI endpoints
│   ├── core/             # Core business logic
│   ├── db/               # Database models and connections
│   ├── rag/              # RAG + MCP implementation
│   └── main.py           # FastAPI app entry point
├── frontend/
│   ├── app.py            # Streamlit main app
│   ├── components/       # UI components
│   └── utils/            # Frontend utilities
├── ingestion/
│   ├── parsers/          # NetCDF parsers
│   ├── normalizers/      # Data normalization
│   └── pipeline.py       # Ingestion pipeline
├── data/
│   ├── raw/              # Raw NetCDF files
│   ├── processed/        # Parquet files
│   └── embeddings/       # Vector DB storage
├── tests/                # Unit and integration tests
├── docker/               # Docker configurations
├── scripts/              # Utility scripts
├── requirements.txt      # Python dependencies
├── .env.example          # Environment variables template
└── README.md

Quick Start

Prerequisites

  • Python 3.10+
  • PostgreSQL 14+ with PostGIS extension
  • Docker (optional, for containerized deployment)

Installation

  1. Clone the repository and navigate into it:
git clone https://github.com/janvis11/bluepy.git
cd bluepy
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up environment variables:
cp .env.example .env
# Edit .env with your configuration
  5. Initialize the database:
python scripts/init_db.py
  6. Run data ingestion (example):
python ingestion/pipeline.py --input data/raw --output data/processed
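
A minimal sketch of the NetCDF → Parquet step the pipeline performs, using xarray and pandas (the file name is a placeholder; TEMP/PSAL/PRES are the standard Argo variable names; the real parsers live in ingestion/parsers/):

import xarray as xr

# Open one Argo profile file (placeholder name) and flatten it to a table.
ds = xr.open_dataset("data/raw/R2902123_042.nc")
df = ds[["TEMP", "PSAL", "PRES"]].to_dataframe().reset_index()

# Write the normalized rows to Parquet for analytics (requires pyarrow).
df.to_parquet("data/processed/2902123_042.parquet", index=False)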

Running the Application

Backend API:

uvicorn backend.main:app --reload --port 8000
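
For orientation, a minimal sketch of the entry point this command loads (the real app is backend/main.py, which wires in the routers under backend/api/; the CORS origin matches the .env example below):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="Bluepy API")
app.add_middleware(CORSMiddleware, allow_origins=["http://localhost:8501"])

@app.get("/floats")
def list_floats():
    # Placeholder: the real handler filters floats in PostgreSQL/PostGIS.
    return []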

Frontend Dashboard:

streamlit run frontend/app.py --server.port 8501

Access the application at http://localhost:8501
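
frontend/app.py drives the chat UI; below is a minimal sketch of a Streamlit chat loop that forwards messages to the backend (the "message"/"answer" JSON fields are assumptions, not the project's actual contract):

import requests
import streamlit as st

st.title("Bluepy")

if prompt := st.chat_input("Ask about ARGO floats..."):
    st.chat_message("user").write(prompt)
    # Forward the question to the backend chat endpoint (fields assumed).
    resp = requests.post("http://localhost:8000/chat", json={"message": prompt})
    st.chat_message("assistant").write(resp.json().get("answer", ""))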

Usage Examples

Natural Language Queries

  • "Show me salinity profiles near the equator in March 2023"
  • "What's the average temperature at 500m depth in the Indian Ocean?"
  • "Find floats with anomalous oxygen levels in the last 6 months"
  • "Plot temperature vs depth for float 2902123"

API Endpoints

  • POST /chat - Conversational interface
  • POST /sql/execute - Execute validated SQL queries
  • GET /profile/{id} - Get specific profile details
  • GET /floats - List all floats with filters
  • GET /map/geojson - Get trajectory data for mapping
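
A quick smoke test of the read endpoints from Python (query parameter names and response shapes are assumptions; see backend/api/ for the actual schemas):

import requests

# List a few floats; the "limit" parameter is an assumed filter.
floats = requests.get("http://localhost:8000/floats", params={"limit": 10})
print(floats.json())

# Trajectory data for the map; a GeoJSON FeatureCollection is expected.
geo = requests.get("http://localhost:8000/map/geojson")
print(geo.json().get("type"))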

Configuration

Key environment variables in .env:

# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/argo_db
POSTGRES_USER=argo_user
POSTGRES_PASSWORD=secure_password
POSTGRES_DB=argo_db

# LLM Configuration
OPENAI_API_KEY=your_api_key_here
LLM_MODEL=gpt-4
EMBEDDING_MODEL=text-embedding-3-small

# Vector DB
VECTOR_DB_TYPE=chroma  # or faiss
CHROMA_PERSIST_DIR=./data/embeddings/chroma

# API
API_HOST=0.0.0.0
API_PORT=8000
CORS_ORIGINS=http://localhost:8501

# Frontend
STREAMLIT_SERVER_PORT=8501
MAP_PROVIDER=leaflet
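
One way the backend can consume these values is a typed settings class via pydantic-settings; this sketch mirrors a subset of the variables and is not necessarily how backend/core/ actually loads them:

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Matching is case-insensitive, so DATABASE_URL fills database_url.
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str
    openai_api_key: str
    llm_model: str = "gpt-4"
    vector_db_type: str = "chroma"
    api_port: int = 8000

settings = Settings()
print(settings.database_url)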

Development

Running Tests

pytest tests/ -v --cov=backend --cov=ingestion

Code Quality

# Linting
flake8 backend/ ingestion/ frontend/

# Type checking
mypy backend/ ingestion/

# Formatting
black backend/ ingestion/ frontend/

Deployment

Docker Compose

docker-compose up -d

Kubernetes

kubectl apply -f k8s/

Data Schema

PostgreSQL Tables

  • argo_profile - Main profile data table with spatial indexing
  • argo_profile_meta - Profile metadata and summaries
  • argo_float - Float information and trajectories

See backend/db/schema.sql for complete schema definitions.
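
For orientation only, an ORM mapping of the profile table with a PostGIS point column might look like the sketch below (column names are illustrative assumptions; backend/db/schema.sql is authoritative):

from geoalchemy2 import Geometry
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ArgoProfile(Base):
    __tablename__ = "argo_profile"

    id = Column(Integer, primary_key=True)
    float_id = Column(String, index=True)
    cycle_number = Column(Integer)
    measured_at = Column(DateTime)
    # PostGIS point (lon/lat) backing the spatial index mentioned above.
    location = Column(Geometry(geometry_type="POINT", srid=4326))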

Technology Stack

  • Backend: FastAPI, SQLAlchemy, psycopg2
  • Database: PostgreSQL + PostGIS, Parquet (PyArrow)
  • Vector DB: ChromaDB / FAISS
  • LLM: OpenAI GPT-4 / Anthropic Claude
  • Frontend: Streamlit, Plotly, Folium/Leaflet
  • Data Processing: xarray, netCDF4, pandas, numpy
  • Deployment: Docker, Docker Compose, Kubernetes

License

MIT License - see the LICENSE file for details.

Acknowledgments

  • The ARGO Program for oceanographic data
  • OpenAI for LLM capabilities
  • The Streamlit community for an excellent framework
