diff --git a/README.md b/README.md
index d34d81e..1e0e726 100644
--- a/README.md
+++ b/README.md
@@ -1,443 +1,101 @@
-# NEON Multi-Modal Tree Species Classification Dataset
+# NEON Multi-Modal Tree Species Classification
 
-A comprehensive dataset of **167 tree species** with **47,971 individual tree crowns** from **30 NEON sites** across North America. Each sample includes RGB imagery, 369-band hyperspectral data, and LiDAR canopy height models.
+A comprehensive toolkit for multi-modal tree species classification using NEON ecological data. This project combines RGB imagery, hyperspectral data, and LiDAR to enable accurate tree species identification across diverse North American ecosystems.
 
-## 🚀 Quick Start
+## Project Vision
 
-### 1. Installation
-```bash
-# Clone the repository
-git clone https://github.com/Ritesh313/NeonTreeClassification.git
-cd NeonTreeClassification
+This repository aims to provide an end-to-end solution for tree species classification:
 
-# Install with uv (recommended - faster dependency resolution)
-uv sync
+- [x] **Dataset**: Ready-to-use multi-modal tree crown dataset with 167 species
+- [ ] **Data Processing**: Tools for downloading and processing raw NEON data products
+- [ ] **Classification Models**: Pre-trained models and training pipelines
+- [ ] **DeepForest Integration**: Automated crown detection and classification workflow
 
-# Or install with pip
-pip install -e .
-```
+## What's Available Now
 
-### 2. Quick Start with Ready-to-Run Example
-
-**🚀 Try the quickstart script first:**
-```bash
-# Option A: Using uv run (recommended)
-uv run python quickstart.py
+### Multi-Modal Dataset
 
-# Option B: After activating environment
-source .venv/bin/activate
-python quickstart.py
-```
+A curated dataset of 47,971 individual tree crowns from 30 NEON sites, ready for immediate use:
 
-This script demonstrates dataloader usage and **automatically downloads the dataset (590 MB)** on first use.
+- **167 tree species** from diverse North American ecosystems
+- **3 modalities**: RGB (3 bands), Hyperspectral (369 bands), LiDAR CHM (1 band)
+- **10 years of data** (2014-2023) with ecological metadata
+- **3 configurations**: `combined` (47,971 samples), `large` (~42,000 samples), `high_quality` (~5,500 samples)
+- **HDF5 format**: Efficient storage with automatic download (590 MB)
 
-### 3. Using Dataloaders in Your Own Projects
+## Quick Start
 
-Copy the code from `quickstart.py` or use it directly in your projects:
+### Installation
 
-#### Option A: Using `uv run` (Recommended)
 ```bash
-# Run your Python script (create in repo root directory)
-uv run python your_analysis_script.py
-
-# Or start Jupyter notebook
-uv run jupyter notebook
+git clone https://github.com/Ritesh313/NeonTreeClassification.git
+cd NeonTreeClassification
+uv sync  # or: pip install -e .
 ```
 
-#### Option B: Activate the environment (familiar to conda users)
-```bash
-# Activate the environment (Linux/Mac)
-source .venv/bin/activate
-
-# Now use regular Python commands (create scripts in repo root)
-python your_analysis_script.py
-jupyter notebook
-
-# Deactivate when done
-deactivate
-```
+### Get the Dataset
 
-#### Example: Getting the Dataloaders
 ```python
 from scripts.get_dataloaders import get_dataloaders
 
-# 📥 Dataset downloads automatically on first use (590 MB)
+# Dataset downloads automatically on first use (590 MB)
 train_loader, test_loader = get_dataloaders(
-    config='large',  # Choose: 'combined', 'large', or 'high_quality'
-    modalities=['rgb', 'hsi', 'lidar'],  # Choose modalities
+    config='large',
+    modalities=['rgb', 'hsi', 'lidar'],
     batch_size=32
 )
 
-# Each batch contains:
+# Use in your training loop
 for batch in train_loader:
-    rgb_data = batch['rgb']        # torch.Tensor [batch_size, 3, 128, 128]
-    hsi_data = batch['hsi']        # torch.Tensor [batch_size, 369, 12, 12]
-    lidar_data = batch['lidar']    # torch.Tensor [batch_size, 1, 12, 12]
-    labels = batch['species_idx']  # torch.Tensor [batch_size] (class indices)
-    # Optional metadata (if include_metadata=True):
-    # batch['crown_id'], batch['species'], batch['site']
-```
-
-### 4. Running the Complete Training Pipeline (Optional)
-```bash
-# Use our complete training script
-uv run python examples/train.py
-```
-
-> **💡 Optional**: For manual control, you can pre-download the dataset using `python scripts/download_dataset.py`
-
-## 📊 Dataset Configurations
-
-The dataset comes with **3 pre-configured subsets** for different use cases:
-
-| Configuration | Samples | Species | Description |
-|---------------|---------|---------|-------------|
-| **`combined`** | 47,971 | 167 | Complete dataset with all available samples |
-| **`large`** | ~42,000 | ~162 | Main training set |
-| **`high_quality`** | ~5,500 | ~96 | Curated subset with highest data quality |
-
-### Key Dataset Features
-- **🎯 Flexible Filtering:** Filter by species, sites, years, or modalities
-- **🌍 Geographic Coverage:** 30 NEON sites across diverse North American ecosystems
-- **📅 Temporal Range:** 10 years of data (2014-2023) for temporal analysis
-- **🔬 Multi-modal:** RGB, 369-band hyperspectral, and LiDAR for each sample
-- **📊 Rich Metadata:** Tree height, diameter, canopy position, and plot information
-
-### 4 Training Scenarios
-```python
-# Scenario 1: Standard training on filtered dataset
-train_loader, test_loader = get_dataloaders(config='large', test_ratio=0.2)
-
-# Scenario 2: Maximum data training
-train_loader, test_loader = get_dataloaders(config='combined', test_ratio=0.15)
-
-# Scenario 3: High-quality only (smaller but cleaner)
-train_loader, test_loader = get_dataloaders(config='high_quality', test_ratio=0.2)
-
-# Scenario 4: Domain transfer (train on large, test on high-quality)
-train_loader, test_loader = get_dataloaders(
-    train_config='large',
-    test_config='high_quality'
-)
-```
-
-## 🌐 Dataset Overview
-
-- **47,971** individual tree crowns
-- **167** unique species
-- **30** NEON sites across North America
-- **2014-2023** (10 years of data)
-- **3 modalities:** RGB (3 bands), Hyperspectral (369 bands), LiDAR CHM (1 band)
-- **Ecological metadata:** Height (95.4% available), stem diameter (99.4% available), canopy position (81.4% available)
-- **HDF5 storage:** Efficient compressed format for fast loading
-
-## Visualization Examples
-
-The package includes comprehensive visualization tools for all three modalities:
-
-| RGB Image | HSI Pseudo RGB | HSI PCA Decomposition |
-|-----------|----------------|----------------------|
-| ![RGB](sample_plots/sample_rgb.png) | ![HSI](sample_plots/sample_hsi.png) | ![HSI PCA](sample_plots/sample_hsi_pca.png) |
-
-| HSI Spectral Signatures | LiDAR Canopy Height Model |
-|-------------------------|---------------------------|
-| ![Spectra](sample_plots/sample_spectra.png) | ![LiDAR](sample_plots/sample_lidar.png) |
-
-```python
-# Visualization functions for tree crown data
-from neon_tree_classification.core.visualization import (
-    plot_rgb, plot_hsi, plot_hsi_pca, plot_hsi_spectra, plot_lidar
-)
-from neon_tree_classification.core.dataset import NeonCrownDataset
-
-# Load dataset and get a sample
-dataset = NeonCrownDataset.load()
-sample_data = dataset[0]  # Get first sample (returns dict with rgb, hsi, lidar arrays)
-
-# Visualize each modality using the numpy arrays directly
-plot_rgb(sample_data['rgb'])             # True color RGB image from array
-plot_hsi(sample_data['hsi'])             # Pseudo RGB (bands ~660nm, ~550nm, ~450nm)
-plot_hsi_pca(sample_data['hsi'])         # PCA decomposition to 3 components
-plot_hsi_spectra(sample_data['hsi'])     # Spectral signatures of pixels
-plot_lidar(sample_data['lidar'])         # Canopy height model with colorbar
+    rgb = batch['rgb']          # [batch_size, 3, 128, 128]
+    hsi = batch['hsi']          # [batch_size, 369, 12, 12]
+    lidar = batch['lidar']      # [batch_size, 1, 12, 12]
+    labels = batch['species_idx']  # [batch_size]
 ```
 
-### Interactive Visualization Notebook
-
-For comprehensive data exploration and visualization examples, see:
-```bash
-# Open the visualization notebook
-jupyter notebook notebooks/visualization.ipynb
-```
-
-## Top Species
-
-The dataset includes 167 tree species. Here are the most common:
-
-| Rank | Species | Count | Percentage |
-|------|---------|-------|------------|
-| 1 | Acer rubrum L. | 5,684 | 11.8% |
-| 2 | Tsuga canadensis (L.) Carrière | 3,303 | 6.9% |
-| 3 | Pseudotsuga menziesii (Mirb.) Franco var. menziesii | 2,978 | 6.2% |
-| 4 | Pinus palustris Mill. | 2,207 | 4.6% |
-| 5 | Quercus rubra L. | 2,086 | 4.3% |
-| 6 | Pinus contorta Douglas ex Loudon var. latifolia Engelm. ex S. Watson | 2,011 | 4.2% |
-| 7 | Tsuga heterophylla (Raf.) Sarg. | 1,425 | 3.0% |
-| 8 | Populus tremuloides Michx. | 1,362 | 2.8% |
-| 9 | Liriodendron tulipifera L. | 1,152 | 2.4% |
-| 10 | Quercus alba L. | 1,139 | 2.4% |
-
-## Geographic Distribution
-
-Data collected from **30 NEON sites** across North America:
-
-**1.** HARV: 7,162 samples (14.9%)
-**2.** MLBS: 5,424 samples (11.3%)
-**3.** GRSM: 4,822 samples (10.1%)
-**4.** DELA: 4,539 samples (9.5%)
-**5.** RMNP: 3,931 samples (8.2%)
-**6.** WREF: 3,616 samples (7.5%)
-**7.** BART: 2,360 samples (4.9%)
-**8.** OSBS: 2,192 samples (4.6%)
-**9.** UNDE: 1,940 samples (4.0%)
-**10.** CLBJ: 1,754 samples (3.7%)
-
-## Installation
-
-### Prerequisites
-- Python 3.9+ (recommended: Python 3.11)
-- CUDA-capable GPU (optional, but recommended for training)
-
-### Recommended: Using uv (Fast Python Package Manager)
+Or run the quickstart example:
 ```bash
-# Clone the repository
-git clone https://github.com/Ritesh313/NeonTreeClassification.git
-cd NeonTreeClassification
-
-# Install uv if not already installed
-curl -LsSf https://astral.sh/uv/install.sh | sh
-# or: pip install uv
-
-# Install with uv (automatically handles dependencies)
-uv sync
-
-# Install with specific dependency groups
-uv sync --extra dev          # Add development tools (testing, formatting)
-uv sync --extra processing   # Add geospatial processing tools (for advanced users)
-uv sync --extra logging      # Add experiment logging tools (Comet ML, WandB)
-uv sync --all-extras         # Install everything
-```
-
-### Alternative: Using pip
-```bash
-# Install core dependencies
-pip install -e .
-
-# Install optional dependencies
-pip install -e .[dev,processing,logging]
-```
-
-## Baseline Classification Results
-
-Preliminary single-modality baseline results for **167-species classification** using the `combined` dataset configuration (seed=42, no hyperparameter optimization):
-
-| Modality | Test Accuracy | Model | Notes |
-|----------|---------------|-------|-------|
-| **RGB** | 53.5% | ResNet | Standard computer vision approach |
-| **HSI** | 27.3% | Spectral CNN | 369-band hyperspectral data |
-| **LiDAR** | 11.5% | Structural CNN | Canopy height model |
-
-*167-species classification is inherently challenging. These are basic preliminary results with default parameters - significant improvements possible with hyperparameter tuning, data augmentation, and architectural improvements.*
-
-### Reproduce Results
-
-**Prerequisites:** First run dataloaders to download dataset:
-```python
-from scripts.get_dataloaders import get_dataloaders
-# This downloads the dataset to _neon_tree_classification_dataset_files/
-train_loader, test_loader = get_dataloaders(config='combined')
-```
-
-**With Comet ML (exact reproduction):**
-```bash
-# RGB baseline (requires Comet ML setup)
-uv run python examples/train.py \
-    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
-    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
-    --modality rgb --model_type resnet --batch_size 1024 --seed 42 \
-    --logger comet --early_stop_patience 15
-
-# HSI baseline (requires Comet ML setup)  
-uv run python examples/train.py \
-    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
-    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
-    --modality hsi --model_type spectral_cnn --batch_size 128 --seed 42 \
-    --logger comet --early_stop_patience 15
-
-# LiDAR baseline (requires Comet ML setup)
-uv run python examples/train.py \
-    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
-    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
-    --modality lidar --model_type structural --batch_size 1024 --seed 42 \
-    --logger comet --early_stop_patience 15
-```
-
-**Without Comet ML (approximate reproduction):**
-```bash
-# RGB baseline (fixed epochs)
-uv run python examples/train.py \
-    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
-    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
-    --modality rgb --model_type resnet --batch_size 1024 --seed 42 --epochs 100
-
-# HSI baseline (fixed epochs)
-uv run python examples/train.py \
-    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
-    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
-    --modality hsi --model_type spectral_cnn --batch_size 128 --seed 42 --epochs 100
-
-# LiDAR baseline (fixed epochs)
-uv run python examples/train.py \
-    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
-    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
-    --modality lidar --model_type structural --batch_size 1024 --seed 42 --epochs 100
-```
-*Note: Without early stopping, results may vary. Original experiments used Comet ML logging with early stopping (patience=15).*
-
-## Repository Structure
-
-```
-NeonTreeClassification/
-├── 📁 neon_tree_classification/      # Main Python package
-│   ├── core/                        # Core functionality
-│   │   ├── dataset.py              # HDF5-based dataset class
-│   │   ├── datamodule.py           # PyTorch Lightning data module
-│   │   └── visualization.py        # Multi-modal visualization tools
-│   └── models/                      # ML architectures & training modules
-├── 📁 scripts/                       # Easy-to-use scripts
-│   ├── download_dataset.py         # Download & extract dataset
-│   ├── get_dataloaders.py          # Simple DataLoader factory
-├── 📁 examples/                      # Training examples
-│   └── train.py                    # Flexible training script
-├── 📁 notebooks/                     # Interactive exploration
-│   └── visualization.ipynb         # Dataset exploration & visualization
-├── 📁 processing/                    # NEON raw data processing tools
-│   └── neon_data_processing/       # Scripts for processing raw NEON data products
-├── 📁 sample_plots/                  # Generated example visualizations
-├── pyproject.toml                   # Package configuration & dependencies
-└── uv.lock                         # Dependency lock file
-```
-
-*Note: Dataset automatically downloads to `_neon_tree_classification_dataset_files/` on first use.*
-
-## Training Models
-
-Use the flexible training script with different modalities:
-
-```bash
-# Train RGB classifier
-python examples/train.py --modality rgb --csv_path _neon_tree_classification_dataset_files/metadata/large_dataset.csv --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5
-
-# Train hyperspectral classifier
-python examples/train.py --modality hsi --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 --batch_size 16
-
-# Train LiDAR classifier
-python examples/train.py --modality lidar --csv_path _neon_tree_classification_dataset_files/metadata/high_quality_dataset.csv --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5
-
-# Advanced: External test set (train on large, test on high_quality)
-python examples/train.py --modality rgb --csv_path _neon_tree_classification_dataset_files/metadata/large_dataset.csv --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 --external_test_csv _neon_tree_classification_dataset_files/metadata/high_quality_dataset.csv
+uv run python quickstart.py
 ```
 
-### Add Custom Models
-Create new model architectures in `neon_tree_classification/models/` and reference them with the `--model_type` flag.
+## Coming Soon
 
-## Advanced Usage
+**Data Processing Pipeline**: Tools for processing raw NEON data products are being finalized and will be released for public use. This will enable users to:
+- Download NEON tiles for all three modalities
+- Crop individual tree crowns from shapefiles
+- Create custom datasets with their own crown annotations
 
-### Custom Data Filtering & Lightning DataModule
+**Classification Models**: Pre-trained models and training scripts for tree species classification will be added to the repository.
 
-```python
-# For advanced users: Use Lightning DataModule with flexible splitting
-from neon_tree_classification.core.datamodule import NeonCrownDataModule
-
-# Basic configuration with species/site filtering
-datamodule = NeonCrownDataModule(
-    csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv",
-    hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5",
-    modalities=["rgb"],  # Single modality training
-    batch_size=32,
-    # Filtering options
-    species_filter=["PSMEM", "TSHE"],  # Train on specific species
-    site_filter=["HARV", "OSBS"],      # Train on specific sites
-    year_filter=[2018, 2019],          # Train on specific years
-    # Split method options
-    split_method="site",  # Options: "random", "site", "year"
-    val_ratio=0.15,
-    test_ratio=0.15
-)
-
-# For external test sets (advanced)
-datamodule = NeonCrownDataModule(
-    csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv",
-    hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5",
-    external_test_csv_path="path/to/external_test.csv",  # Optional external test
-    external_test_hdf5_path="path/to/external_test.h5",  # Optional external HDF5
-    modalities=["rgb"]
-)
-
-datamodule.setup("fit")  # Auto-filters species for compatibility
-```
-
-### Data Processing Pipeline
-
-The `processing/` folder contains a comprehensive NEON data processing pipeline:
-
-**Core Processing Steps:**
-1. **Download NEON tiles** (`neon_downloader.py`) - Downloads RGB, HSI, LiDAR data from NEON API
-2. **Curate tiles** (`curate_tiles.py`) - Quality control and tile selection
-3. **Process shapefiles** (`shapefile_processing/`) - Extract crown metadata and validate annotations
-4. **Crop crowns** (`crop_crowns_multimodal.py`) - Extract individual tree crowns from tiles
-5. **Convert formats** (`convert_tif_to_npy.py`, `hsi_convert_h5_to_tif.py`) - Optimize data storage
-6. **Generate datasets** (`create_training_csv.py`) - Create final training/test CSVs
-7. **Filter & combine** (`misc/filter_rare_species.py`, `dataset_combiner.py`) - Dataset refinement
+**DeepForest Integration**: Planned integration with [DeepForest](https://github.com/weecology/DeepForest) to enable:
+- Automatic crown detection from aerial imagery
+- Seamless multi-modal data extraction for detected crowns
+- Direct classification using pre-trained models from this repository
 
 ## Dataset Details
 
-### NEON Data Products
-- **RGB**: `DP3.30010.001` - High-resolution orthorectified imagery
-- **Hyperspectral**: `DP3.30006.002` - 426-band spectrometer reflectance
-- **LiDAR**: `DP3.30015.001` - Canopy Height Model
-
-### Data Structure
-```
-_neon_tree_classification_dataset_files/metadata/combined_dataset.csv - Main dataset file
-├── crown_id          - Unique identifier for each tree crown
-├── individual        - NEON individual tree ID
-├── individual_id     - Full individual identifier
-├── site              - NEON site code
-├── year              - Data collection year
-├── species           - Species code
-├── species_name      - Full species name
-├── label_site        - Site where tree was labeled
-├── height            - Tree height in meters (95.4% available)
-├── stemDiameter      - Stem diameter in cm (99.4% available)
-├── canopyPosition    - Light exposure: Full sun, Partially shaded, etc. (81.4% available)
-├── plantStatus       - Tree health status: Live, Dead, etc. (99.99% available)
-├── plot              - Plot identifier (currently "unknown" for all samples)
-├── hand_annotated    - Whether crown was manually annotated (boolean)
-├── rgb_path          - Path to RGB image in HDF5 dataset
-├── hsi_path          - Path to hyperspectral image in HDF5 dataset
-└── lidar_path        - Path to LiDAR CHM in HDF5 dataset
-```
+**Top 5 Species:**
+1. Acer rubrum L. (5,684 samples, 11.8%)
+2. Tsuga canadensis (L.) Carrière (3,303 samples, 6.9%)
+3. Pseudotsuga menziesii (Mirb.) Franco var. menziesii (2,978 samples, 6.2%)
+4. Pinus palustris Mill. (2,207 samples, 4.6%)
+5. Quercus rubra L. (2,086 samples, 4.3%)
 
-**Ecological Metadata**: Most samples include tree measurements that provide valuable ecological context for classification, though availability varies by field.
+**Top 5 Sites:**
+- HARV: 7,162 samples (14.9%)
+- MLBS: 5,424 samples (11.3%)
+- GRSM: 4,822 samples (10.1%)
+- DELA: 4,539 samples (9.5%)
+- RMNP: 3,931 samples (8.2%)
 
-## Contributing
+**NEON Data Products:**
+- RGB: DP3.30010.001 (High-resolution orthorectified imagery)
+- Hyperspectral: DP3.30006.002 (426-band spectrometer reflectance)
+- LiDAR: DP3.30015.001 (Canopy Height Model)
 
-1. Fork the repository
-2. Create a feature branch
-3. Submit a pull request
+For complete dataset documentation, training guides, and advanced usage, see the [docs/](docs/) directory.
 
 ## Acknowledgments
 
-- National Ecological Observatory Network (NEON)
-- Dataset statistics generated on 2025-08-28
+National Ecological Observatory Network (NEON)
+
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..7113ef6
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,84 @@
+# NEON Tree Classification Dataset - Documentation
+
+Welcome to the comprehensive documentation for the NEON Multi-Modal Tree Species Classification Dataset.
+
+## Quick Links
+
+- [README](../README.md) - Main README with dataset overview and quick start
+- [Advanced Usage](advanced_usage.md) - Custom filtering, Lightning DataModule, and advanced features
+- [Training Guide](training.md) - Model training examples and baseline results
+- [Visualization Guide](visualization.md) - Data visualization tools and examples
+- [Processing Pipeline](processing.md) - NEON data processing workflow
+
+## Getting Started
+
+1. **New Users**: Start with the [main README](../README.md) for installation and basic usage
+2. **Training Models**: See the [Training Guide](training.md) for model training and baseline results
+3. **Data Exploration**: Check out the [Visualization Guide](visualization.md) for exploring the dataset
+4. **Advanced Features**: Read [Advanced Usage](advanced_usage.md) for custom configurations
+5. **Data Processing**: For processing raw NEON data, see the [Processing Pipeline](processing.md)
+
+## Documentation Structure
+
+### [Advanced Usage](advanced_usage.md)
+- Custom data filtering with Lightning DataModule
+- Split methods (random, site-based, year-based)
+- External test sets
+- Advanced dataloader configuration
+- Direct dataset usage
+- Multi-GPU training
+- Custom training loops
+
+### [Training Guide](training.md)
+- Quick training with the example script
+- Baseline results and reproduction steps
+- Custom model architectures
+- Training best practices
+- Multi-modal training
+- Experiment tracking (Comet ML, W&B)
+- Common issues and solutions
+
+### [Visualization Guide](visualization.md)
+- Overview of visualization tools
+- RGB, HSI, and LiDAR visualization
+- Interactive Jupyter notebook
+- Custom visualizations
+- Multi-modal comparisons
+- Advanced spectral analysis
+
+### [Processing Pipeline](processing.md)
+- Complete data processing workflow
+- NEON data product details
+- Quality control procedures
+- HDF5 dataset creation
+- Configuration subset creation
+- Processing best practices
+
+## Support
+
+For issues, questions, or contributions:
+- GitHub Issues: [Report a bug or request a feature](https://github.com/Ritesh313/NeonTreeClassification/issues)
+- Contributing: See [CONTRIBUTING.md](../CONTRIBUTING.md)
+
+## Citation
+
+If you use this dataset in your research, please cite:
+
+```bibtex
+@dataset{neon_tree_classification_2024,
+  title={NEON Multi-Modal Tree Species Classification Dataset},
+  author={[Author Names]},
+  year={2024},
+  publisher={GitHub},
+  url={https://github.com/Ritesh313/NeonTreeClassification}
+}
+```
+
+## License
+
+See [LICENSE](../LICENSE) file for details.
+
+## Acknowledgments
+
+- National Ecological Observatory Network (NEON)
+- Dataset statistics generated on 2025-08-28
diff --git a/docs/advanced_usage.md b/docs/advanced_usage.md
new file mode 100644
index 0000000..d020ebb
--- /dev/null
+++ b/docs/advanced_usage.md
@@ -0,0 +1,290 @@
+# Advanced Usage
+
+This guide covers advanced features for experienced users who need custom data filtering or specialized training configurations, or who want to use the PyTorch Lightning DataModule directly.
+
+## Custom Data Filtering with Lightning DataModule
+
+The `NeonCrownDataModule` provides flexible filtering and splitting options for advanced use cases.
+
+### Basic Configuration
+
+```python
+from neon_tree_classification.core.datamodule import NeonCrownDataModule
+
+# Basic configuration with species/site filtering
+datamodule = NeonCrownDataModule(
+    csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv",
+    hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5",
+    modalities=["rgb"],  # Single modality training
+    batch_size=32,
+    # Filtering options
+    species_filter=["PSMEM", "TSHE"],  # Train on specific species
+    site_filter=["HARV", "OSBS"],      # Train on specific sites
+    year_filter=[2018, 2019],          # Train on specific years
+    # Split method options
+    split_method="random",  # Options: "random", "site", "year"
+    val_ratio=0.15,
+    test_ratio=0.15
+)
+
+datamodule.setup("fit")
+```
+
+### Split Methods
+
+The DataModule supports three splitting strategies:
+
+**1. Random Split** (default)
+```python
+datamodule = NeonCrownDataModule(
+    csv_path="path/to/dataset.csv",
+    hdf5_path="path/to/dataset.h5",
+    split_method="random",
+    val_ratio=0.15,
+    test_ratio=0.15
+)
+```
+
+**2. Site-Based Split**
+
+Useful for testing generalization across geographic locations:
+```python
+datamodule = NeonCrownDataModule(
+    csv_path="path/to/dataset.csv",
+    hdf5_path="path/to/dataset.h5",
+    split_method="site",
+    val_ratio=0.15,
+    test_ratio=0.15
+)
+```
+
+**3. Year-Based Split**
+
+Useful for testing temporal generalization:
+```python
+datamodule = NeonCrownDataModule(
+    csv_path="path/to/dataset.csv",
+    hdf5_path="path/to/dataset.h5",
+    split_method="year",
+    val_ratio=0.15,
+    test_ratio=0.15
+)
+```
+
+### External Test Sets
+
+For domain adaptation or cross-site validation:
+
+```python
+datamodule = NeonCrownDataModule(
+    csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv",
+    hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5",
+    external_test_csv_path="path/to/external_test.csv",
+    external_test_hdf5_path="path/to/external_test.h5",  # Optional, uses main HDF5 if not provided
+    modalities=["rgb"]
+)
+
+datamodule.setup("fit")  # Auto-filters species for compatibility
+```
+
+## Advanced DataLoader Configuration
+
+### Custom Normalization
+
+Each modality supports different normalization methods:
+
+**RGB Normalization:**
+- `"0_1"`: Scale to [0, 1] range (default)
+- `"imagenet"`: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
+- `"per_sample"`: Normalize each sample independently
+
+**HSI Normalization:**
+- `"per_sample"`: Normalize each sample independently (default)
+- `"global"`: Use global dataset statistics
+- `"none"`: No normalization
+
+**LiDAR Normalization:**
+- `"height"`: Normalize by maximum canopy height (default)
+- `"per_sample"`: Normalize each sample independently
+- `"none"`: No normalization
+
+Example (with `get_dataloaders` imported from `scripts.get_dataloaders`):
+```python
+train_loader, test_loader = get_dataloaders(
+    config='large',
+    modalities=['rgb', 'hsi', 'lidar'],
+    batch_size=32,
+    rgb_norm_method='imagenet',
+    hsi_norm_method='global',
+    lidar_norm_method='height'
+)
+```
+
+### Custom Image Sizes
+
+Adjust the spatial resolution for each modality (`get_dataloaders` is imported from `scripts.get_dataloaders`):
+
+```python
+train_loader, test_loader = get_dataloaders(
+    config='large',
+    modalities=['rgb', 'hsi', 'lidar'],
+    batch_size=32,
+    rgb_size=(224, 224),    # Larger RGB for fine-grained features
+    hsi_size=(16, 16),      # Higher HSI resolution
+    lidar_size=(16, 16)     # Higher LiDAR resolution
+)
+```
+
+## Direct Dataset Usage
+
+For maximum control, use the `NeonCrownDataset` class directly:
+
+```python
+from neon_tree_classification.core.dataset import NeonCrownDataset
+from torch.utils.data import DataLoader
+
+# Create dataset with custom parameters
+dataset = NeonCrownDataset(
+    csv_path="_neon_tree_classification_dataset_files/metadata/large_dataset.csv",
+    hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5",
+    modalities=['rgb', 'hsi'],
+    species_filter=['ACRU', 'TSCA'],  # Limit to specific species
+    site_filter=['HARV', 'MLBS'],     # Limit to specific sites
+    year_filter=[2018, 2019, 2020],   # Limit to specific years
+    include_metadata=True,             # Include crown_id, species names, etc.
+    rgb_size=(128, 128),
+    hsi_size=(12, 12),
+    rgb_norm_method='imagenet',
+    hsi_norm_method='per_sample'
+)
+
+# Create custom DataLoader
+train_loader = DataLoader(
+    dataset,
+    batch_size=64,
+    shuffle=True,
+    num_workers=8,
+    pin_memory=True
+)
+```
+
+## Accessing Metadata
+
+Enable metadata in batches to access crown IDs, species names, and site information:
+
+```python
+from torch.utils.data import DataLoader
+
+# Note: get_dataloaders doesn't support include_metadata yet
+# Use NeonCrownDataset directly:
+from neon_tree_classification.core.dataset import NeonCrownDataset
+
+dataset = NeonCrownDataset(
+    csv_path="path/to/dataset.csv",
+    hdf5_path="path/to/dataset.h5",
+    modalities=['rgb'],
+    include_metadata=True
+)
+
+# Access metadata in batches
+for batch in DataLoader(dataset, batch_size=32):
+    rgb = batch['rgb']
+    labels = batch['species_idx']
+    crown_ids = batch['crown_id']
+    species_names = batch['species']
+    sites = batch['site']
+```
+
+## Multi-GPU Training
+
+For distributed training with PyTorch Lightning:
+
+```python
+import pytorch_lightning as pl
+from neon_tree_classification.core.datamodule import NeonCrownDataModule
+
+# Configure DataModule
+datamodule = NeonCrownDataModule(
+    csv_path="path/to/dataset.csv",
+    hdf5_path="path/to/dataset.h5",
+    modalities=["rgb"],
+    batch_size=32  # Per-GPU batch size
+)
+
+# Create trainer with multi-GPU support
+trainer = pl.Trainer(
+    devices=4,           # Number of GPUs
+    strategy='ddp',      # Distributed Data Parallel
+    precision=16,        # Mixed precision training
+    max_epochs=100
+)
+
+# `model` is your own LightningModule instance (not defined in this snippet)
+trainer.fit(model, datamodule=datamodule)
+```
+
+## Custom Training Loop
+
+Example of a custom training loop without PyTorch Lightning:
+
+```python
+import torch
+from scripts.get_dataloaders import get_dataloaders
+
+# Get dataloaders
+train_loader, test_loader = get_dataloaders(
+    config='large',
+    modalities=['rgb'],
+    batch_size=64
+)
+
+# Your model
+model = YourModel().cuda()
+optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
+criterion = torch.nn.CrossEntropyLoss()
+
+# Training loop
+for epoch in range(100):
+    model.train()
+    for batch in train_loader:
+        rgb = batch['rgb'].cuda()
+        labels = batch['species_idx'].cuda()
+        
+        optimizer.zero_grad()
+        outputs = model(rgb)
+        loss = criterion(outputs, labels)
+        loss.backward()
+        optimizer.step()
+    
+    # Evaluate on the test split after each epoch
+    model.eval()
+    correct = 0
+    total = 0
+    with torch.no_grad():
+        for batch in test_loader:
+            rgb = batch['rgb'].cuda()
+            labels = batch['species_idx'].cuda()
+            outputs = model(rgb)
+            _, predicted = outputs.max(1)
+            total += labels.size(0)
+            correct += predicted.eq(labels).sum().item()
+    
+    accuracy = 100. * correct / total
+    print(f'Epoch {epoch}: Accuracy = {accuracy:.2f}%')
+```
+
+## Performance Tips
+
+1. **Use larger batch sizes**: The dataset is small (the compressed HDF5 file is ~600MB), so large batches are practical up to GPU memory limits
+2. **Increase num_workers**: More workers can significantly speed up data loading
+3. **Enable pin_memory**: Speeds up CPU-to-GPU transfer
+4. **Use persistent_workers**: Reduces worker initialization overhead
+
+```python
+train_loader, test_loader = get_dataloaders(
+    config='large',
+    modalities=['rgb'],
+    batch_size=256,      # Larger batch size
+    num_workers=16,      # More workers (adjust based on CPU cores)
+)
+```
diff --git a/docs/processing.md b/docs/processing.md
new file mode 100644
index 0000000..4abb6b7
--- /dev/null
+++ b/docs/processing.md
@@ -0,0 +1,399 @@
+# Processing Pipeline
+
+This guide documents the NEON data processing pipeline for advanced users who want to process raw NEON data products or understand how the dataset was created.
+
+## Overview
+
+The `processing/` folder contains a comprehensive pipeline for converting raw NEON data products into the final training-ready dataset. This is useful for:
+- Processing new NEON data
+- Customizing the dataset creation process
+- Understanding data quality control steps
+- Creating similar datasets from other sources
+
+## Pipeline Steps
+
+### 1. Download NEON Tiles
+
+Download RGB, HSI, and LiDAR data from the NEON API:
+
+```bash
+python processing/neon_data_processing/neon_downloader.py \
+    --site HARV \
+    --year 2018 \
+    --output_dir /path/to/output
+```
+
+**What it does:**
+- Downloads NEON data products from the API
+- Organizes files by site, year, and modality
+- Validates downloads and checks file integrity
+
+**Key parameters:**
+- `--site`: NEON site code (e.g., HARV, MLBS, GRSM)
+- `--year`: Data collection year
+- `--products`: Which products to download (rgb, hsi, lidar)
+
+### 2. Curate Tiles
+
+Quality control and tile selection:
+
+```bash
+python processing/neon_data_processing/curate_tiles.py \
+    --input_dir /path/to/downloaded/tiles \
+    --output_dir /path/to/curated/tiles \
+    --quality_threshold 0.8
+```
+
+**What it does:**
+- Checks for data completeness (all three modalities present)
+- Validates spatial alignment between modalities
+- Filters out low-quality or corrupted tiles
+- Creates metadata about tile quality
+
+**Quality checks:**
+- Spatial overlap between RGB, HSI, and LiDAR
+- Missing data percentage
+- Coordinate system consistency
+- File format validation
+
+### 3. Process Shapefiles
+
+Extract crown metadata and validate annotations:
+
+```bash
+# Run from the repository root so the relative paths in later steps still resolve
+python processing/shapefile_processing/process_shapefiles.py \
+    --shapefile_dir /path/to/shapefiles \
+    --output_csv crowns_metadata.csv
+```
+
+**What it does:**
+- Extracts tree crown polygons from shapefiles
+- Links crowns to individual tree measurements
+- Validates crown annotations
+- Merges ecological metadata (height, diameter, etc.)
+
+**Output format:**
+- CSV with crown ID, species, location, measurements
+- Validated polygon geometries
+- Quality flags for each annotation
+
+### 4. Crop Tree Crowns
+
+Extract individual tree crowns from tiles:
+
+```bash
+python processing/neon_data_processing/crop_crowns_multimodal.py \
+    --tiles_dir /path/to/curated/tiles \
+    --crowns_csv crowns_metadata.csv \
+    --output_dir /path/to/cropped/crowns \
+    --rgb_size 128 \
+    --hsi_size 12 \
+    --lidar_size 12
+```
+
+**What it does:**
+- Extracts bounding boxes around each crown
+- Crops corresponding regions from RGB, HSI, and LiDAR tiles
+- Resamples to target resolutions
+- Handles coordinate transformations between modalities
+
+**Parameters:**
+- `--rgb_size`: Target RGB resolution (default: 128x128)
+- `--hsi_size`: Target HSI resolution (default: 12x12)
+- `--lidar_size`: Target LiDAR resolution (default: 12x12)
+- `--padding`: Additional padding around crowns (default: 0)
+
+### 5. Convert Formats
+
+Optimize data storage:
+
+```bash
+# Convert TIF to NumPy
+python processing/neon_data_processing/convert_tif_to_npy.py \
+    --input_dir /path/to/cropped/crowns \
+    --output_dir /path/to/numpy/arrays
+
+# Convert HSI H5 to TIF
+python processing/neon_data_processing/hsi_convert_h5_to_tif.py \
+    --input_dir /path/to/hsi/h5 \
+    --output_dir /path/to/hsi/tif
+```
+
+**What it does:**
+- Converts various formats to efficient storage
+- Applies compression where appropriate
+- Validates converted data
+
+### 6. Generate Training CSV
+
+Create final training/test CSVs:
+
+```bash
+python processing/neon_data_processing/create_training_csv.py \
+    --crowns_dir /path/to/cropped/crowns \
+    --metadata_csv crowns_metadata.csv \
+    --output_csv training_dataset.csv
+```
+
+**What it does:**
+- Combines all metadata
+- Validates data availability for each sample
+- Adds file paths to HDF5 dataset
+- Creates train/val/test splits
+
+### 7. Filter and Combine
+
+Dataset refinement:
+
+```bash
+# Filter rare species
+python processing/misc/filter_rare_species.py \
+    --input_csv training_dataset.csv \
+    --output_csv filtered_dataset.csv \
+    --min_samples 50
+
+# Combine multiple datasets
+python processing/misc/dataset_combiner.py \
+    --input_csvs dataset1.csv dataset2.csv dataset3.csv \
+    --output_csv combined_dataset.csv
+```
+
+**What it does:**
+- Removes species with insufficient samples
+- Combines datasets from different sites/years
+- Ensures species compatibility across datasets
+- Creates configuration subsets (large, high_quality, combined)
+
+## Repository Structure
+
+```
+processing/
+├── neon_data_processing/       # Main processing scripts
+│   ├── neon_downloader.py      # Download NEON data
+│   ├── curate_tiles.py         # Quality control
+│   ├── crop_crowns_multimodal.py  # Extract crowns
+│   ├── convert_tif_to_npy.py   # Format conversion
+│   ├── hsi_convert_h5_to_tif.py   # HSI format conversion
+│   └── create_training_csv.py  # Generate training CSVs
+├── shapefile_processing/       # Shapefile tools
+│   ├── process_shapefiles.py   # Extract crown metadata
+│   └── README.md              # Shapefile processing guide
+└── misc/                      # Utility scripts
+    ├── filter_rare_species.py  # Species filtering
+    └── dataset_combiner.py     # Combine datasets
+```
+
+## NEON Data Products
+
+### RGB (DP3.30010.001)
+**High-Resolution Orthorectified Camera Imagery**
+- Resolution: 10cm
+- Format: GeoTIFF
+- Bands: RGB (3 channels)
+- Coverage: Full site mosaics
+
+### Hyperspectral (DP3.30006.002)
+**Surface Directional Reflectance**
+- Resolution: 1m
+- Format: HDF5
+- Bands: 426 spectral bands (380-2510 nm)
+- Processing: Atmospheric correction applied
+- Note: Reduced to 369 bands in dataset (removed noisy bands)
+
+### LiDAR (DP3.30015.001)
+**Ecosystem Structure**
+- Resolution: 1m
+- Format: GeoTIFF
+- Data: Canopy Height Model (CHM)
+- Derived from: Classified LiDAR point cloud
+
+## Data Quality Control
+
+### Spatial Alignment
+
+Ensure all modalities are properly aligned:
+
+```python
+from processing.utils import check_spatial_alignment
+
+# Verify alignment
+aligned = check_spatial_alignment(
+    rgb_path='path/to/rgb.tif',
+    hsi_path='path/to/hsi.h5',
+    lidar_path='path/to/lidar.tif',
+    tolerance=0.5  # meters
+)
+
+if not aligned:
+    print("Warning: Modalities not aligned!")
+```
+
+### Missing Data
+
+Handle missing or corrupted data:
+
+```python
+from processing.utils import validate_data
+
+# Check data quality
+quality = validate_data(
+    crown_id='HARV_123',
+    rgb_path='path/to/rgb.npy',
+    hsi_path='path/to/hsi.npy',
+    lidar_path='path/to/lidar.npy'
+)
+
+print(f"Quality score: {quality['score']:.2f}")
+print(f"Issues: {quality['issues']}")
+```
+
+## HDF5 Dataset Creation
+
+Convert processed crowns to HDF5:
+
+```python
+import h5py
+import numpy as np
+
+# Create HDF5 dataset
+with h5py.File('neon_dataset.h5', 'w') as f:
+    # Create groups
+    rgb_group = f.create_group('rgb')
+    hsi_group = f.create_group('hsi')
+    lidar_group = f.create_group('lidar')
+    
+    # Add crown data; processed_crowns is assumed to map crown_id -> {'rgb', 'hsi', 'lidar'} arrays
+    for crown_id, data in processed_crowns.items():
+        rgb_group.create_dataset(
+            crown_id, 
+            data=data['rgb'], 
+            compression='gzip',
+            compression_opts=9
+        )
+        hsi_group.create_dataset(
+            crown_id, 
+            data=data['hsi'], 
+            compression='gzip',
+            compression_opts=9
+        )
+        lidar_group.create_dataset(
+            crown_id, 
+            data=data['lidar'], 
+            compression='gzip',
+            compression_opts=9
+        )
+```
+
+## Configuration Subsets
+
+Create different dataset configurations:
+
+### Combined Dataset
+All available data (47,971 samples, 167 species)
+
+```bash
+python processing/misc/create_config.py \
+    --input_csv all_crowns.csv \
+    --output_csv combined_dataset.csv \
+    --config combined
+```
+
+### Large Dataset
+Main training set (~42,000 samples, ~162 species)
+
+```bash
+python processing/misc/create_config.py \
+    --input_csv all_crowns.csv \
+    --output_csv large_dataset.csv \
+    --config large \
+    --min_samples_per_species 50
+```
+
+### High Quality Dataset
+Curated subset (~5,500 samples, ~96 species)
+
+```bash
+python processing/misc/create_config.py \
+    --input_csv all_crowns.csv \
+    --output_csv high_quality_dataset.csv \
+    --config high_quality \
+    --quality_threshold 0.9 \
+    --min_samples_per_species 100
+```
+
+## Processing Best Practices
+
+### 1. Start Small
+Process one site first to validate the pipeline:
+
+```bash
+# Test with single site
+bash process_site.sh HARV 2018
+```
+
+### 2. Parallel Processing
+Use parallel processing for large-scale operations:
+
+```bash
+# Process multiple sites in parallel
+parallel -j 4 bash process_site.sh {} 2018 ::: HARV MLBS GRSM DELA
+```
+
+### 3. Disk Space
+Monitor disk usage - raw and processed data can be large:
+- Raw tiles: ~100GB per site
+- Processed crowns: ~50GB per site
+- Final HDF5 dataset: ~600MB (compressed)
+
+### 4. Validation
+Always validate processed data:
+
+```bash
+python processing/utils/validate_dataset.py \
+    --csv_path training_dataset.csv \
+    --hdf5_path neon_dataset.h5
+```
+
+## Troubleshooting
+
+### Issue: Spatial misalignment
+**Solution:** Check coordinate reference systems (CRS) and reproject if needed
+
+### Issue: Missing HSI bands
+**Solution:** Verify HSI data download and band extraction
+
+### Issue: Corrupted crowns
+**Solution:** Increase quality thresholds in curation step
+
+### Issue: Memory errors
+**Solution:** Process in batches or use more efficient data types
+
+## Custom Processing
+
+For custom processing workflows:
+
+```python
+from processing.pipeline import ProcessingPipeline
+
+# Create custom pipeline
+pipeline = ProcessingPipeline(
+    sites=['HARV', 'MLBS'],
+    years=[2018, 2019, 2020],
+    output_dir='custom_dataset'
+)
+
+# Configure processing
+pipeline.set_quality_threshold(0.85)
+pipeline.set_crown_sizes(rgb=224, hsi=16, lidar=16)
+
+# Run pipeline
+pipeline.run()
+```
+
+## Additional Resources
+
+- [NEON Data Portal](https://data.neonscience.org/)
+- [NEON Data Products Catalog](https://data.neonscience.org/data-products/explore)
+- [NEON API Documentation](https://data.neonscience.org/data-api)
+- [Processing README](../processing/neon_data_processing/README.md)
diff --git a/docs/training.md b/docs/training.md
new file mode 100644
index 0000000..6b1abca
--- /dev/null
+++ b/docs/training.md
@@ -0,0 +1,289 @@
+# Training Guide
+
+This guide covers model training, baseline results, and tips for achieving good performance on the NEON tree classification dataset.
+
+## Quick Training with Examples Script
+
+The repository includes a flexible training script that supports all modalities:
+
+```bash
+# Train RGB classifier
+uv run python examples/train.py \
+    --modality rgb \
+    --csv_path _neon_tree_classification_dataset_files/metadata/large_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5
+
+# Train hyperspectral classifier
+uv run python examples/train.py \
+    --modality hsi \
+    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
+    --batch_size 16
+
+# Train LiDAR classifier
+uv run python examples/train.py \
+    --modality lidar \
+    --csv_path _neon_tree_classification_dataset_files/metadata/high_quality_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5
+
+# External test set (train on large, test on high_quality)
+uv run python examples/train.py \
+    --modality rgb \
+    --csv_path _neon_tree_classification_dataset_files/metadata/large_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
+    --external_test_csv _neon_tree_classification_dataset_files/metadata/high_quality_dataset.csv
+```
+
+## Baseline Results
+
+Preliminary single-modality baseline results for 167-species classification using the `combined` dataset configuration (seed=42, no hyperparameter optimization):
+
+| Modality | Test Accuracy | Model | Notes |
+|----------|---------------|-------|-------|
+| RGB | 53.5% | ResNet | Standard computer vision approach |
+| HSI | 27.3% | Spectral CNN | 369-band hyperspectral data |
+| LiDAR | 11.5% | Structural CNN | Canopy height model |
+
+**Important Notes:**
+- 167-species classification is inherently challenging
+- These are basic preliminary results with default parameters
+- Significant improvements possible with hyperparameter tuning, data augmentation, and architectural improvements
+- Multi-modal fusion is expected to significantly improve performance
+
+## Reproducing Baseline Results
+
+### Prerequisites
+
+First, download the dataset:
+```python
+from scripts.get_dataloaders import get_dataloaders
+# This downloads the dataset to _neon_tree_classification_dataset_files/
+train_loader, test_loader = get_dataloaders(config='combined')
+```
+
+### With Comet ML (Exact Reproduction)
+
+The original experiments used Comet ML for logging and early stopping:
+
+```bash
+# RGB baseline
+uv run python examples/train.py \
+    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
+    --modality rgb --model_type resnet --batch_size 1024 --seed 42 \
+    --logger comet --early_stop_patience 15
+
+# HSI baseline
+uv run python examples/train.py \
+    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
+    --modality hsi --model_type spectral_cnn --batch_size 128 --seed 42 \
+    --logger comet --early_stop_patience 15
+
+# LiDAR baseline
+uv run python examples/train.py \
+    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
+    --modality lidar --model_type structural --batch_size 1024 --seed 42 \
+    --logger comet --early_stop_patience 15
+```
+
+### Without Comet ML (Approximate Reproduction)
+
+Without early stopping, results may vary:
+
+```bash
+# RGB baseline (fixed epochs)
+uv run python examples/train.py \
+    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
+    --modality rgb --model_type resnet --batch_size 1024 --seed 42 --epochs 100
+
+# HSI baseline (fixed epochs)
+uv run python examples/train.py \
+    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
+    --modality hsi --model_type spectral_cnn --batch_size 128 --seed 42 --epochs 100
+
+# LiDAR baseline (fixed epochs)
+uv run python examples/train.py \
+    --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \
+    --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \
+    --modality lidar --model_type structural --batch_size 1024 --seed 42 --epochs 100
+```
+
+## Custom Model Architectures
+
+### Creating Custom Models
+
+Add new model architectures in `neon_tree_classification/models/` and reference them with the `--model_type` flag.
+
+Example custom model:
+
+```python
+# neon_tree_classification/models/my_custom_model.py
+import torch.nn as nn
+
+class MyCustomModel(nn.Module):
+    def __init__(self, num_classes, input_channels=3):
+        super().__init__()
+        self.features = nn.Sequential(
+            nn.Conv2d(input_channels, 64, kernel_size=3, padding=1),
+            nn.ReLU(),
+            nn.MaxPool2d(2),
+            # Add more layers...
+        )
+        self.classifier = nn.Linear(64, num_classes)
+    
+    def forward(self, x):
+        x = self.features(x)
+        x = x.mean([2, 3])  # Global average pooling
+        x = self.classifier(x)
+        return x
+```
+
+## Training Best Practices
+
+### 1. Start with RGB
+
+RGB data is the easiest to work with and gives the strongest single-modality baseline (53.5% test accuracy above):
+- Standard computer vision techniques apply
+- Pre-trained ImageNet models can be fine-tuned
+- Faster training times
+
+### 2. Dataset Configuration Selection
+
+Choose based on your goals:
+- `combined`: Maximum data, all species
+- `large`: Good balance of data quantity and quality
+- `high_quality`: Cleanest data, fewer species
+
+### 3. Hyperparameter Tuning
+
+Key hyperparameters to tune:
+- Learning rate (start with 1e-3 to 1e-4)
+- Batch size (larger is usually better, up to memory limits)
+- Weight decay (0 to 1e-4)
+- Augmentation parameters
+
+### 4. Data Augmentation
+
+For RGB:
+```python
+import torchvision.transforms as transforms
+
+train_transforms = transforms.Compose([
+    transforms.RandomHorizontalFlip(),
+    transforms.RandomVerticalFlip(),
+    transforms.RandomRotation(90),
+    transforms.ColorJitter(brightness=0.2, contrast=0.2),
+])
+```
+
+### 5. Learning Rate Scheduling
+
+Use learning rate scheduling for better convergence:
+```python
+scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
+    optimizer, 
+    T_max=100
+)
+```
+
+## Multi-Modal Training
+
+Combining multiple modalities is expected to improve performance over the single-modality baselines:
+
+### Early Fusion
+```python
+# Concatenate features from different modalities
+rgb_features = rgb_encoder(rgb_data)
+hsi_features = hsi_encoder(hsi_data)
+lidar_features = lidar_encoder(lidar_data)
+
+combined = torch.cat([rgb_features, hsi_features, lidar_features], dim=1)
+output = classifier(combined)
+```
+
+### Late Fusion
+```python
+# Average predictions from different modalities
+rgb_pred = rgb_model(rgb_data)
+hsi_pred = hsi_model(hsi_data)
+lidar_pred = lidar_model(lidar_data)
+
+final_pred = (rgb_pred + hsi_pred + lidar_pred) / 3
+```
+
+## Experiment Tracking
+
+### Using Comet ML
+
+```bash
+# Set up Comet ML
+export COMET_API_KEY="your_api_key"
+
+# Train with Comet logging
+uv run python examples/train.py \
+    --modality rgb \
+    --logger comet \
+    --csv_path path/to/dataset.csv \
+    --hdf5_path path/to/dataset.h5
+```
+
+### Using Weights & Biases
+
+```bash
+# Set up W&B
+wandb login
+
+# Train with W&B logging
+uv run python examples/train.py \
+    --modality rgb \
+    --logger wandb \
+    --csv_path path/to/dataset.csv \
+    --hdf5_path path/to/dataset.h5
+```
+
+## Common Issues and Solutions
+
+### Issue: Out of Memory
+**Solution:** Reduce batch size or image resolution
+```bash
+python examples/train.py --batch_size 16 --modality rgb
+```
+
+### Issue: Slow Training
+**Solution:** Increase num_workers and use larger batches
+```bash
+python examples/train.py --batch_size 256 --num_workers 16
+```
+
+### Issue: Poor Convergence
+**Solution:** 
+1. Check learning rate (try 1e-4 or 1e-5)
+2. Use learning rate warmup
+3. Add data augmentation
+4. Try different normalization methods
+
+### Issue: Overfitting
+**Solution:**
+1. Add dropout
+2. Use weight decay
+3. Add more data augmentation
+4. Use early stopping
+
+## Performance Benchmarks
+
+Training times on NVIDIA A100 (40GB):
+
+| Modality | Batch Size | Epochs | Time per Epoch | Total Time |
+|----------|------------|--------|----------------|------------|
+| RGB | 1024 | 100 | ~2 min | ~3.5 hours |
+| HSI | 128 | 100 | ~5 min | ~8 hours |
+| LiDAR | 1024 | 100 | ~1 min | ~2 hours |
+
+Memory requirements:
+- RGB: ~8GB GPU memory (batch_size=1024)
+- HSI: ~12GB GPU memory (batch_size=128)
+- LiDAR: ~4GB GPU memory (batch_size=1024)
diff --git a/docs/visualization.md b/docs/visualization.md
new file mode 100644
index 0000000..eef327c
--- /dev/null
+++ b/docs/visualization.md
@@ -0,0 +1,353 @@
+# Visualization Guide
+
+This guide covers data visualization tools for exploring the NEON tree classification dataset.
+
+## Overview
+
+The package includes comprehensive visualization tools for all three modalities:
+- RGB imagery
+- Hyperspectral (HSI) data
+- LiDAR canopy height models
+
+## Example Visualizations
+
+| RGB Image | HSI Pseudo RGB | HSI PCA Decomposition |
+|-----------|----------------|----------------------|
+| ![RGB](../sample_plots/sample_rgb.png) | ![HSI](../sample_plots/sample_hsi.png) | ![HSI PCA](../sample_plots/sample_hsi_pca.png) |
+
+| HSI Spectral Signatures | LiDAR Canopy Height Model |
+|-------------------------|---------------------------|
+| ![Spectra](../sample_plots/sample_spectra.png) | ![LiDAR](../sample_plots/sample_lidar.png) |
+
+## Visualization Functions
+
+### Basic Usage
+
+```python
+from neon_tree_classification.core.visualization import (
+    plot_rgb, plot_hsi, plot_hsi_pca, plot_hsi_spectra, plot_lidar
+)
+from neon_tree_classification.core.dataset import NeonCrownDataset
+
+# Load dataset and get a sample
+dataset = NeonCrownDataset.load()
+sample_data = dataset[0]  # Returns dict with rgb, hsi, lidar arrays
+
+# Visualize each modality
+plot_rgb(sample_data['rgb'])             # True color RGB image
+plot_hsi(sample_data['hsi'])             # Pseudo RGB from HSI bands
+plot_hsi_pca(sample_data['hsi'])         # PCA decomposition to 3 components
+plot_hsi_spectra(sample_data['hsi'])     # Spectral signatures
+plot_lidar(sample_data['lidar'])         # Canopy height model with colorbar
+```
+
+## RGB Visualization
+
+Display true-color RGB images:
+
+```python
+import matplotlib.pyplot as plt
+from neon_tree_classification.core.visualization import plot_rgb
+
+# Single sample
+plot_rgb(sample_data['rgb'])
+plt.title('Red Maple (Acer rubrum)')
+plt.show()
+
+# Multiple samples
+fig, axes = plt.subplots(2, 3, figsize=(12, 8))
+for i, ax in enumerate(axes.flat):
+    sample = dataset[i]
+    plot_rgb(sample['rgb'], ax=ax)
+    ax.set_title(f"Sample {i}")
+plt.tight_layout()
+plt.show()
+```
+
+## Hyperspectral Visualization
+
+### Pseudo RGB from HSI
+
+Display hyperspectral data as RGB by selecting specific wavelength bands:
+
+```python
+from neon_tree_classification.core.visualization import plot_hsi
+
+# Default uses bands at ~660nm (red), ~550nm (green), ~450nm (blue)
+plot_hsi(sample_data['hsi'])
+plt.show()
+```
+
+### PCA Decomposition
+
+Visualize hyperspectral data using Principal Component Analysis:
+
+```python
+from neon_tree_classification.core.visualization import plot_hsi_pca
+
+# Decompose 369 bands into 3 principal components
+plot_hsi_pca(sample_data['hsi'])
+plt.title('HSI PCA (RGB = PC1, PC2, PC3)')
+plt.show()
+```
+
+### Spectral Signatures
+
+Plot spectral reflectance curves:
+
+```python
+from neon_tree_classification.core.visualization import plot_hsi_spectra
+
+# Plot mean spectrum and sample pixels
+plot_hsi_spectra(sample_data['hsi'])
+plt.title('Hyperspectral Signature')
+plt.xlabel('Wavelength (nm)')
+plt.ylabel('Reflectance')
+plt.show()
+```
+
+## LiDAR Visualization
+
+Display canopy height models:
+
+```python
+from neon_tree_classification.core.visualization import plot_lidar
+
+# Display height map with colorbar
+plot_lidar(sample_data['lidar'])
+plt.title('Canopy Height Model')
+plt.show()
+```
+
+## Interactive Visualization Notebook
+
+For comprehensive data exploration, use the included Jupyter notebook:
+
+```bash
+# Start Jupyter
+uv run jupyter notebook
+
+# Open the visualization notebook
+# Navigate to: notebooks/visualization.ipynb
+```
+
+The notebook includes:
+- Dataset overview and statistics
+- Interactive visualizations for all modalities
+- Species distribution plots
+- Site-wise data exploration
+- Temporal analysis
+
+## Custom Visualizations
+
+### Comparing Species
+
+```python
+import matplotlib.pyplot as plt
+from scripts.get_dataloaders import get_dataloaders
+from neon_tree_classification.core.visualization import plot_rgb
+
+# Load data
+train_loader, _ = get_dataloaders(config='large', batch_size=100)
+
+# Get samples from different species
+species_samples = {}
+for batch in train_loader:
+    for i in range(len(batch['species_idx'])):
+        species_idx = batch['species_idx'][i].item()
+        if species_idx not in species_samples:
+            species_samples[species_idx] = batch['rgb'][i]
+        if len(species_samples) >= 6:
+            break
+    if len(species_samples) >= 6:
+        break
+
+# Plot comparison
+fig, axes = plt.subplots(2, 3, figsize=(12, 8))
+for (species_idx, rgb), ax in zip(species_samples.items(), axes.flat):
+    plot_rgb(rgb.numpy(), ax=ax)
+    ax.set_title(f"Species {species_idx}")
+plt.tight_layout()
+plt.show()
+```
+
+### Multi-Modal Comparison
+
+```python
+import matplotlib.pyplot as plt
+from neon_tree_classification.core.visualization import plot_rgb, plot_hsi, plot_lidar
+# Get a sample with all modalities
+sample = dataset[0]
+
+# Create multi-modal visualization
+fig, axes = plt.subplots(1, 3, figsize=(15, 5))
+
+# RGB
+plot_rgb(sample['rgb'], ax=axes[0])
+axes[0].set_title('RGB')
+
+# HSI Pseudo RGB
+plot_hsi(sample['hsi'], ax=axes[1])
+axes[1].set_title('Hyperspectral (Pseudo RGB)')
+
+# LiDAR
+plot_lidar(sample['lidar'], ax=axes[2])
+axes[2].set_title('LiDAR CHM')
+
+plt.tight_layout()
+plt.show()
+```
+
+### Batch Visualization
+
+```python
+import matplotlib.pyplot as plt
+from scripts.get_dataloaders import get_dataloaders
+from neon_tree_classification.core.visualization import plot_rgb
+
+train_loader, _ = get_dataloaders(config='large', batch_size=16)  # Load one batch of 16
+batch = next(iter(train_loader))
+
+# Visualize batch
+fig, axes = plt.subplots(4, 4, figsize=(12, 12))
+for i, ax in enumerate(axes.flat):
+    if i < len(batch['rgb']):
+        plot_rgb(batch['rgb'][i].numpy(), ax=ax)
+        label = batch['species_idx'][i].item()
+        ax.set_title(f"Label: {label}", fontsize=8)
+    ax.axis('off')
+plt.tight_layout()
+plt.show()
+```
+
+## Visualization Tips
+
+### 1. Color Normalization
+
+For better visualization, normalize data appropriately:
+
+```python
+import numpy as np
+
+# RGB: Clip to [0, 1] (divide by 255 first if values are stored as 0-255)
+rgb_normalized = np.clip(rgb_data, 0, 1)
+
+# HSI: Normalize for visualization
+hsi_normalized = (hsi_data - hsi_data.min()) / (hsi_data.max() - hsi_data.min())
+
+# LiDAR: Height values are already meaningful
+lidar_viz = lidar_data
+```
+
+### 2. Handling Different Resolutions
+
+```python
+from scipy.ndimage import zoom
+
+# Resize HSI to match RGB resolution for side-by-side comparison
+hsi_upsampled = zoom(sample['hsi'], (1, 128/12, 128/12), order=1)
+```
+
+### 3. Saving Figures
+
+```python
+import matplotlib.pyplot as plt
+
+# High-quality figure
+fig, ax = plt.subplots(figsize=(10, 10), dpi=300)
+plot_rgb(sample['rgb'], ax=ax)
+plt.title('High Resolution RGB')
+plt.savefig('output.png', dpi=300, bbox_inches='tight')
+plt.close()
+```
+
+### 4. Creating Animations
+
+```python
+import matplotlib.animation as animation
+import matplotlib.pyplot as plt
+# Animate through spectral bands
+fig, ax = plt.subplots()
+ims = []
+for band in range(sample['hsi'].shape[0]):
+    im = ax.imshow(sample['hsi'][band], cmap='gray', animated=True)
+    ims.append([im])
+
+ani = animation.ArtistAnimation(fig, ims, interval=50, blit=True)
+ani.save('spectral_animation.gif', writer='pillow')  # Pillow writer handles GIF output
+```
+
+## Exporting Visualizations
+
+### Generate Sample Images
+
+The repository includes a script to generate sample visualization images:
+
+```bash
+python scripts/generate_sample_images.py
+```
+
+This creates example images in the `sample_plots/` directory.
+
+## Advanced Visualizations
+
+### Spectral Analysis
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+
+# Extract mean spectrum per species
+species_spectra = {}
+for batch in train_loader:
+    for i in range(len(batch['species_idx'])):
+        species = batch['species_idx'][i].item()
+        hsi = batch['hsi'][i].numpy()
+        mean_spectrum = hsi.mean(axis=(1, 2))  # Average over spatial dimensions
+        
+        if species not in species_spectra:
+            species_spectra[species] = []
+        species_spectra[species].append(mean_spectrum)
+
+# Plot comparison
+plt.figure(figsize=(12, 6))
+for species, spectra in list(species_spectra.items())[:5]:  # First 5 species encountered
+    mean = np.mean(spectra, axis=0)
+    plt.plot(mean, label=f'Species {species}')
+plt.xlabel('Band Number')
+plt.ylabel('Mean Reflectance')
+plt.legend()
+plt.title('Spectral Signatures by Species')
+plt.show()
+```
+
+### Height Distribution
+
+```python
+# Analyze LiDAR heights
+heights = []
+for batch in train_loader:
+    lidar = batch['lidar'].numpy()
+    heights.extend(lidar.flatten())
+
+plt.figure(figsize=(10, 6))
+plt.hist(heights, bins=100, edgecolor='black')
+plt.xlabel('Height (m)')
+plt.ylabel('Frequency')
+plt.title('Canopy Height Distribution')
+plt.show()
+```
+
+## Troubleshooting
+
+### Issue: Images appear too dark/bright
+**Solution:** Adjust normalization or use `plt.clim()`
+
+### Issue: HSI visualization looks strange
+**Solution:** Check band selection or use PCA visualization instead
+
+### Issue: Can't see LiDAR features
+**Solution:** Adjust colormap range or use logarithmic scale
+
+### Issue: Out of memory with large batches
+**Solution:** Reduce batch size or visualize fewer samples at once
diff --git a/scripts/update_readme.py b/scripts/update_readme.py
index 4feba99..bb2f1cc 100644
--- a/scripts/update_readme.py
+++ b/scripts/update_readme.py
@@ -1,8 +1,21 @@
 #!/usr/bin/env python3
 """
-Simple script to update README.md with current dataset statistics using only pandas.
+⚠️ DEPRECATED: This script generates the old verbose README format.
 
-Usage:
+The current README has been simplified (README.md) and focuses on essentials only.
+Detailed documentation is now in the docs/ directory.
+
+This script is kept for reference but should not be used to regenerate the README.
+To update dataset statistics, manually edit:
+- README.md (lines 8-13: main statistics in Dataset Overview section)
+- README.md (lines 113-117: top 5 species in Dataset Details section)
+- README.md (lines 122-126: top 5 sites in Geographic Distribution section)
+
+For maintaining consistency, consider creating a simple stats updater that only
+updates specific statistics in the simplified README rather than regenerating
+the entire file.
+
+Original usage:
     python scripts/update_readme.py --csv /path/to/data.csv --readme /path/to/README.md
 """