Closed
29 changes: 29 additions & 0 deletions .env.example
@@ -35,6 +35,35 @@ BLOBSTORE_URL=https://appdev.kbase.us/services/shock-api
# Enable debug mode (true/false)
DEBUG=false

# =============================================================================
# AI PROVIDER CONFIGURATION (for automatic config generation)
# =============================================================================
# Preferred AI provider: auto, openai, argo, ollama, claude-code, rules-only
AI_PROVIDER=auto

# Fallback chain (comma-separated, tried in order)
AI_FALLBACK_CHAIN=openai,argo,ollama,rules-only

# OpenAI Configuration
# OPENAI_API_KEY=sk-your-api-key-here
OPENAI_MODEL=gpt-4o-mini
OPENAI_TEMPERATURE=0.1

# Argo Configuration (ANL internal)
# ARGO_USER=your-anl-username
ARGO_MODEL=gpt4o
ARGO_PROXY_PORT=1080

# Ollama Configuration (local LLM)
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3

# Claude Code Configuration
CLAUDE_CODE_EXECUTABLE=claude

# Generated Config Storage
GENERATED_CONFIG_DIR=/tmp/tablescanner_configs

# =============================================================================
# TEST DATA (AppDev)
# =============================================================================
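The `AI_PROVIDER` and `AI_FALLBACK_CHAIN` settings above describe a try-in-order lookup. A minimal sketch of how such a chain could be resolved follows. The function names, the meaning given to `auto`, and the credential-only availability check are assumptions for illustration; none of them are taken from the TableScanner code.

```python
from typing import Callable, List, Mapping


def parse_chain(raw: str) -> List[str]:
    """Split a comma-separated chain, dropping blanks and surrounding whitespace."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def resolve_provider(env: Mapping[str, str], is_available: Callable[[str], bool]) -> str:
    """Return the first usable provider name.

    Assumption: AI_PROVIDER=auto means "walk AI_FALLBACK_CHAIN in order";
    any other value is tried first, with the chain as a fallback.
    """
    preferred = env.get("AI_PROVIDER", "auto").strip()
    chain = parse_chain(env.get("AI_FALLBACK_CHAIN", "rules-only"))
    candidates = chain if preferred == "auto" else [preferred, *chain]
    for name in candidates:
        if is_available(name):
            return name
    # Assumption: rules-only needs no external service, so it is the last resort.
    return "rules-only"


def env_based_availability(env: Mapping[str, str]) -> Callable[[str], bool]:
    """Hypothetical availability check that only inspects credentials."""
    def check(name: str) -> bool:
        if name == "openai":
            return bool(env.get("OPENAI_API_KEY"))
        if name == "argo":
            return bool(env.get("ARGO_USER"))
        # ollama and claude-code would need a network or executable probe;
        # this sketch treats them as unavailable.
        return name == "rules-only"
    return check


example_env = {
    "AI_PROVIDER": "auto",
    "AI_FALLBACK_CHAIN": "openai,argo,ollama,rules-only",
    "ARGO_USER": "your-anl-username",
}
chosen = resolve_provider(example_env, env_based_availability(example_env))
```

In a real deployment `os.environ` could be passed in place of `example_env`.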
9 changes: 9 additions & 0 deletions .gitignore
@@ -2,6 +2,10 @@ trash/
docs/DEMO_SCRIPT.md
docs/QUICKSTART.md
docs/internal/
DATABASE_SCHEMA.md
docs/personal/
archive/
docs/archive

.DS_Store
.idea
@@ -31,3 +35,8 @@ lib/

# Cache directory
cache/

# Project-specific artifacts
DATABASE_SCHEMA.md
*.webp
*.png
166 changes: 129 additions & 37 deletions README.md
@@ -1,64 +1,156 @@
# TableScanner

TableScanner is a microservice for providing filtered and paginated access to tabular data stored in KBase. It uses local SQLite caching and indexing to provide fast access to large datasets without loading them entirely into memory.
TableScanner is a production-grade microservice for querying tabular data from KBase SQLite databases. It provides a comprehensive DataTables Viewer-compatible API with advanced query capabilities, type-aware filtering, and performance optimizations.

## Functionality
## Features

The service provides two methods for data access:
1. **Hierarchical REST**: Path-based endpoints for navigating objects and tables using GET requests.
2. **Flat POST**: A single endpoint (`/table-data`) that accepts a JSON payload for all query parameters.
- **Data Access**: Query SQLite databases from KBase objects and handles
- **Type-Aware Filtering**: Automatic numeric conversion for proper filtering
- **Advanced Operators**: Support for eq, ne, gt, gte, lt, lte, like, ilike, in, not_in, between, is_null, is_not_null
- **Aggregations**: GROUP BY support with count, sum, avg, min, max, stddev, variance, distinct_count
- **Full-Text Search**: FTS5 support with automatic virtual table creation
- **Performance**: Connection pooling, query caching, automatic indexing
- **Statistics**: Pre-computed column statistics (min, max, mean, median, stddev)
- **Schema Information**: Detailed table and column schema with indexes
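
To illustrate how operators like those listed above can be turned into parameterized SQL, here is a minimal sketch covering a subset of them (`ilike` and the aggregation functions are left out). The helper names `quote_ident`, `coerce`, and `build_condition` are hypothetical, and the sketch makes no claim about how TableScanner implements its own numeric conversion.

```python
import sqlite3
from typing import Any, List, Tuple

# Hypothetical operator table; the names mirror the feature list above.
_COMPARISONS = {"eq": "=", "ne": "!=", "gt": ">", "gte": ">=", "lt": "<", "lte": "<="}


def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def coerce(value: Any) -> Any:
    """Convert numeric-looking strings to int or float before binding."""
    if isinstance(value, str):
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                continue
    return value


def build_condition(column: str, operator: str, value: Any = None) -> Tuple[str, List[Any]]:
    """Return a SQL fragment with ? placeholders and its bound parameters."""
    col = quote_ident(column)
    if operator in _COMPARISONS:
        return f"{col} {_COMPARISONS[operator]} ?", [coerce(value)]
    if operator == "like":
        return f"{col} LIKE ?", [value]
    if operator in ("in", "not_in"):
        values = [coerce(v) for v in value]
        marks = ", ".join("?" for _ in values)
        keyword = "IN" if operator == "in" else "NOT IN"
        return f"{col} {keyword} ({marks})", values
    if operator == "between":
        low, high = value
        return f"{col} BETWEEN ? AND ?", [coerce(low), coerce(high)]
    if operator in ("is_null", "is_not_null"):
        return f"{col} IS {'NULL' if operator == 'is_null' else 'NOT NULL'}", []
    raise ValueError(f"Unsupported operator: {operator}")


# Usage against a throwaway in-memory table (table and data are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Genes (name TEXT, contigs INTEGER)")
conn.executemany("INSERT INTO Genes VALUES (?, ?)", [("a", 9), ("b", 60), ("c", 100)])
where, params = build_condition("contigs", "gt", "50")
rows = conn.execute(f"SELECT name FROM Genes WHERE {where} ORDER BY name", params).fetchall()
```

Values always travel as bound parameters, and only the quoted column name is interpolated into the SQL text.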

## Architecture

TableScanner operates as a bridge between KBase storage and client applications:
1. **Data Fetching**: Retrieves SQLite databases from the KBase Blobstore.
2. **Local Caching**: Stores databases locally to avoid repeated downloads.
3. **Indexing**: Creates indices on-the-fly for all table columns to optimize query performance.
4. **API Layer**: A FastAPI application that handles requests and executes SQL queries against the local cache.

Technical details on race conditions and concurrency handling are available in [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).

## Setup
## Quick Start

### Production

```bash
docker compose up --build -d
```

The service will be available at `http://localhost:8000`. API documentation is at `/docs`.

### Development

```bash
cp .env.example .env
bash scripts/dev.sh
# Edit .env and set KB_SERVICE_AUTH_TOKEN
./scripts/dev.sh
```

The helper script `scripts/dev.sh` automates the environment setup:
1. Activates the virtual environment (`.venv` or `venv`)
2. Loads environment variables from `.env`
3. Sets `PYTHONPATH`
4. Starts the FastAPI development server with hot-reload via `fastapi dev`

## API Usage

### Path-based REST
List tables:
`GET /object/{upa}/tables`
### List Tables

Query table data:
`GET /object/{upa}/tables/{table_name}/data?limit=100`
```bash
curl -H "Authorization: Bearer $KB_TOKEN" \
"http://localhost:8000/object/76990/7/2/tables"
```

### Flat POST
Query table data:
`POST /table-data`
### Query Table Data

Payload example:
```json
{
"berdl_table_id": "76990/7/2",
"table_name": "Genes",
"limit": 100
}
```bash
curl -H "Authorization: Bearer $KB_TOKEN" \
"http://localhost:8000/object/76990/7/2/tables/Genes/data?limit=10"
```

### Enhanced Query with Filters

```bash
curl -X POST -H "Authorization: Bearer $KB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "berdl_table_id": "local/76990_7_2",
    "table_name": "Genes",
    "limit": 100,
    "filters": [
      {"column": "contigs", "operator": "gt", "value": "50"}
    ]
  }' \
  "http://localhost:8000/table-data"
```
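
The same filtered query can be issued from Python. The sketch below only builds the request with the standard library and leaves sending it commented out, since the response shape is not shown here. The helper name `build_table_data_request` and the placeholder token are assumptions.

```python
import json
import urllib.request


def build_table_data_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to the /table-data endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/table-data",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "berdl_table_id": "local/76990_7_2",
    "table_name": "Genes",
    "limit": 100,
    "filters": [{"column": "contigs", "operator": "gt", "value": "50"}],
}
request = build_table_data_request("http://localhost:8000", "my-token", payload)

# To send it against a running service:
# with urllib.request.urlopen(request, timeout=10) as response:
#     result = json.load(response)
```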

### Aggregation Query

```bash
curl -X POST -H "Authorization: Bearer $KB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "group_by": ["category"],
    "aggregations": [
      {"column": "value", "function": "sum", "alias": "total"}
    ]
  }' \
  "http://localhost:8000/api/aggregate/local/76990_7_2/tables/Data"
```
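
For readers who want to see what an aggregation payload like the one above might look like as SQL, here is a rough sketch. The function `build_aggregate_sql` is hypothetical and supports only the aggregates SQLite provides natively; `stddev`, `variance`, and `distinct_count` are omitted because this sketch does not know how the service computes them.

```python
import sqlite3
from typing import Dict, List

# Aggregate functions that map directly onto SQLite built-ins.
_SQL_FUNCS = {"count": "COUNT", "sum": "SUM", "avg": "AVG", "min": "MIN", "max": "MAX"}


def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_aggregate_sql(table: str, group_by: List[str], aggregations: List[Dict[str, str]]) -> str:
    """Translate a group_by/aggregations payload into a SELECT statement."""
    group_cols = [quote_ident(col) for col in group_by]
    select_parts = list(group_cols)
    for agg in aggregations:
        func = _SQL_FUNCS[agg["function"]]  # KeyError for unsupported functions
        alias = agg.get("alias") or f'{agg["function"]}_{agg["column"]}'
        select_parts.append(f'{func}({quote_ident(agg["column"])}) AS {quote_ident(alias)}')
    sql = f"SELECT {', '.join(select_parts)} FROM {quote_ident(table)}"
    if group_cols:
        sql += f" GROUP BY {', '.join(group_cols)}"
    return sql


# Usage with made-up data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Data" (category TEXT, value REAL)')
conn.executemany('INSERT INTO "Data" VALUES (?, ?)', [("a", 1.0), ("a", 2.5), ("b", 4.0)])
sql = build_aggregate_sql(
    "Data",
    ["category"],
    [{"column": "value", "function": "sum", "alias": "total"}],
)
totals = sorted(conn.execute(sql).fetchall())
```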

## Documentation

- **[API Reference](docs/API.md)** - Complete API documentation with examples
- **[Architecture Dictionary](docs/ARCHITECTURE.md)** - System design and technical overview
- **[Contributing Guide](docs/CONTRIBUTING.md)** - Setup, testing, and contribution guidelines

## Architecture

TableScanner operates as a bridge between KBase storage and client applications:

1. **Data Fetching**: Retrieves SQLite databases from KBase Blobstore
2. **Local Caching**: Stores databases locally to avoid repeated downloads
3. **Connection Pooling**: Manages database connections with automatic lifecycle
4. **Query Execution**: Type-aware filtering with automatic numeric conversion
5. **Performance**: Query caching, automatic indexing, SQLite optimizations
6. **API Layer**: FastAPI application with comprehensive endpoints

## Project Structure
- `app/`: Application logic and routes.
- `app/utils/`: Utilities for caching, SQLite operations, and Workspace integration.
- `docs/`: Technical documentation.
- `scripts/`: Client examples and utility scripts.

```
TableScanner/
├── app/
│   ├── main.py              # FastAPI application
│   ├── routes.py            # API endpoints
│   ├── models.py            # Pydantic models
│   ├── config.py            # Configuration settings
│   ├── services/
│   │   ├── data/
│   │   │   ├── connection_pool.py  # Connection pooling
│   │   │   ├── query_service.py    # Query execution
│   │   │   └── ...
│   │   └── db_helper.py     # Database resolution
│   └── utils/               # Utilities (SQLite, KBase Client)
├── docs/                    # Documentation (API, Architecture, Contributing)
├── tests/                   # Test suite (Unit & Integration)
├── scripts/                 # Helper scripts (dev.sh)
└── static/                  # Static files
```

## Configuration

Create a `.env` file with:

```env
KB_SERVICE_AUTH_TOKEN=your_token_here
CACHE_DIR=/tmp/tablescanner_cache
CACHE_MAX_AGE_HOURS=24
DEBUG=false
```
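
As an illustration of what `CACHE_MAX_AGE_HOURS` could control, the sketch below treats a cached database file as fresh while its modification time is within the configured window. Judging freshness by file mtime and the helper name `is_cache_fresh` are both assumptions, not a description of the service's caching code.

```python
import time
from pathlib import Path
from typing import Optional


def is_cache_fresh(db_path: Path, max_age_hours: float, now: Optional[float] = None) -> bool:
    """Return True if the cached file exists and is younger than max_age_hours."""
    if not db_path.is_file():
        return False  # nothing cached yet, so a download is needed
    current = time.time() if now is None else now
    age_seconds = current - db_path.stat().st_mtime
    return age_seconds < max_age_hours * 3600
```

A caller would download the database again whenever this returns `False` for the expected cache path.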

## Performance

- Query execution: < 100ms for typical queries
- Cache hit rate: > 80% for repeated queries
- Database connection: Reused for 30 minutes
- Query cache: 5-minute TTL, max 1000 entries
- Automatic indexing: One-time cost, cached thereafter
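
The query cache figures above (5-minute TTL, at most 1000 entries) can be pictured with a small in-memory structure. This is a sketch under assumptions: the class name `TTLCache`, the oldest-insertion eviction policy, and the injectable clock are illustrative choices, not the service's actual cache.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLCache:
    """Small TTL cache with a size cap; evicts the oldest insertion when full."""

    def __init__(
        self,
        ttl_seconds: float = 300,
        max_entries: int = 1000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._items[key]  # expired entries are dropped lazily on read
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        """Store a value, evicting the oldest entries beyond the size cap."""
        self._items.pop(key, None)  # re-inserting moves the key to the newest slot
        self._items[key] = (self._clock(), value)
        while len(self._items) > self._max:
            self._items.popitem(last=False)
```

A cache key could be built from the database path, the SQL text, and the bound parameters, so that identical queries within the TTL skip execution.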

## Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=html
```

## License
MIT License.

MIT License
14 changes: 14 additions & 0 deletions app/config.py
@@ -59,9 +59,23 @@ class Settings(BaseSettings):
        default=False,
        description="Enable debug mode with verbose logging"
    )
    KB_ENV: str = Field(
        default="appdev",
        description="KBase environment (appdev, ci, prod)"
    )

    # Root path for proxy deployment (e.g., "/services/berdl_table_scanner")
    ROOT_PATH: str = ""

    # Timeout settings
    DOWNLOAD_TIMEOUT_SECONDS: float = Field(
        default=30.0,
        description="Timeout in seconds for downloading databases"
    )
    KBASE_API_TIMEOUT_SECONDS: float = Field(
        default=10.0,
        description="Timeout in seconds for KBase API calls"
    )

    class Config:
        env_file = ".env"
20 changes: 20 additions & 0 deletions app/config_constants.py
@@ -0,0 +1,20 @@
"""
Configuration constants for TableScanner.
"""

# Default values
DEFAULT_LIMIT = 100
MAX_LIMIT = 500000
DEFAULT_OFFSET = 0
DEFAULT_SORT_ORDER = "ASC"

# Cache settings
CACHE_TTL_SECONDS = 300 # 5 minutes
CACHE_MAX_ENTRIES = 1000
INDEX_CACHE_TTL = 3600 # 1 hour

# Timeout settings
KBASE_API_TIMEOUT_SECONDS = 30

# API Version
API_VERSION = "2.0"
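
One way constants like these might be applied to incoming paging parameters is sketched below. The function `normalize_paging` is hypothetical, and clamping out-of-range values (rather than rejecting them) is an assumption made for the example; the constants are repeated so the snippet runs on its own.

```python
from typing import Optional, Tuple

# Values copied from the constants above so the sketch is self-contained.
DEFAULT_LIMIT = 100
MAX_LIMIT = 500000
DEFAULT_OFFSET = 0
DEFAULT_SORT_ORDER = "ASC"


def normalize_paging(
    limit: Optional[int] = None,
    offset: Optional[int] = None,
    sort_order: Optional[str] = None,
) -> Tuple[int, int, str]:
    """Apply defaults and bounds to paging parameters."""
    if limit is None:
        limit = DEFAULT_LIMIT
    limit = max(1, min(int(limit), MAX_LIMIT))  # clamp into [1, MAX_LIMIT]
    offset = DEFAULT_OFFSET if offset is None else max(0, int(offset))
    order = (sort_order or DEFAULT_SORT_ORDER).upper()
    if order not in ("ASC", "DESC"):
        raise ValueError(f"Invalid sort order: {sort_order!r}")
    return limit, offset, order
```

Restricting the sort order to a fixed set matters because it is typically interpolated into the SQL text rather than bound as a parameter.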
5 changes: 5 additions & 0 deletions app/db/__init__.py
@@ -0,0 +1,5 @@
"""
Database module for Config Control Plane.

Provides SQLite-based persistent storage for configuration records.
"""