diff --git a/README.md b/README.md index d34d81e..1e0e726 100644 --- a/README.md +++ b/README.md @@ -1,443 +1,101 @@ -# NEON Multi-Modal Tree Species Classification Dataset +# NEON Multi-Modal Tree Species Classification -A comprehensive dataset of **167 tree species** with **47,971 individual tree crowns** from **30 NEON sites** across North America. Each sample includes RGB imagery, 369-band hyperspectral data, and LiDAR canopy height models. +A comprehensive toolkit for multi-modal tree species classification using NEON ecological data. This project combines RGB imagery, hyperspectral data, and LiDAR to enable accurate tree species identification across diverse North American ecosystems. -## 🚀 Quick Start +## Project Vision -### 1. Installation -```bash -# Clone the repository -git clone https://github.com/Ritesh313/NeonTreeClassification.git -cd NeonTreeClassification +This repository aims to provide an end-to-end solution for tree species classification: -# Install with uv (recommended - faster dependency resolution) -uv sync +- [x] **Dataset**: Ready-to-use multi-modal tree crown dataset with 167 species +- [ ] **Data Processing**: Tools for downloading and processing raw NEON data products +- [ ] **Classification Models**: Pre-trained models and training pipelines +- [ ] **DeepForest Integration**: Automated crown detection and classification workflow -# Or install with pip -pip install -e . -``` +## What's Available Now -### 2. Quick Start with Ready-to-Run Example - -**🚀 Try the quickstart script first:** -```bash -# Option A: Using uv run (recommended) -uv run python quickstart.py +### Multi-Modal Dataset -# Option B: After activating environment -source .venv/bin/activate -python quickstart.py -``` +A curated dataset of 47,971 individual tree crowns from 30 NEON sites, ready for immediate use: -This script demonstrates dataloader usage and **automatically downloads the dataset (590 MB)** on first use. 
+- **167 tree species** from diverse North American ecosystems +- **3 modalities**: RGB (3 bands), Hyperspectral (369 bands), LiDAR CHM (1 band) +- **10 years of data** (2014-2023) with ecological metadata +- **3 configurations**: `combined` (47,971 samples), `large` (~42,000 samples), `high_quality` (~5,500 samples) +- **HDF5 format**: Efficient storage with automatic download (590 MB) -### 3. Using Dataloaders in Your Own Projects +## Quick Start -Copy the code from `quickstart.py` or use it directly in your projects: +### Installation -#### Option A: Using `uv run` (Recommended) ```bash -# Run your Python script (create in repo root directory) -uv run python your_analysis_script.py - -# Or start Jupyter notebook -uv run jupyter notebook +git clone https://github.com/Ritesh313/NeonTreeClassification.git +cd NeonTreeClassification +uv sync # or: pip install -e . ``` -#### Option B: Activate the environment (familiar to conda users) -```bash -# Activate the environment (Linux/Mac) -source .venv/bin/activate - -# Now use regular Python commands (create scripts in repo root) -python your_analysis_script.py -jupyter notebook - -# Deactivate when done -deactivate -``` +### Get the Dataset -#### Example: Getting the Dataloaders ```python from scripts.get_dataloaders import get_dataloaders -# 📥 Dataset downloads automatically on first use (590 MB) +# Dataset downloads automatically (590 MB) train_loader, test_loader = get_dataloaders( - config='large', # Choose: 'combined', 'large', or 'high_quality' - modalities=['rgb', 'hsi', 'lidar'], # Choose modalities + config='large', + modalities=['rgb', 'hsi', 'lidar'], batch_size=32 ) -# Each batch contains: +# Use in your training loop for batch in train_loader: - rgb_data = batch['rgb'] # torch.Tensor [batch_size, 3, 128, 128] - hsi_data = batch['hsi'] # torch.Tensor [batch_size, 369, 12, 12] - lidar_data = batch['lidar'] # torch.Tensor [batch_size, 1, 12, 12] - labels = batch['species_idx'] # torch.Tensor [batch_size] 
(class indices) - # Optional metadata (if include_metadata=True): - # batch['crown_id'], batch['species'], batch['site'] -``` - -### 4. Running the Complete Training Pipeline (Optional) -```bash -# Use our complete training script -uv run python examples/train.py -``` - -> **💡 Optional**: For manual control, you can pre-download the dataset using `python scripts/download_dataset.py` - -## 📊 Dataset Configurations - -The dataset comes with **3 pre-configured subsets** for different use cases: - -| Configuration | Samples | Species | Description | -|---------------|---------|---------|-------------| -| **`combined`** | 47,971 | 167 | Complete dataset with all available samples | -| **`large`** | ~42,000 | ~162 | Main training set | -| **`high_quality`** | ~5,500 | ~96 | Curated subset with highest data quality | - -### Key Dataset Features -- **🎯 Flexible Filtering:** Filter by species, sites, years, or modalities -- **🌍 Geographic Coverage:** 30 NEON sites across diverse North American ecosystems -- **📅 Temporal Range:** 10 years of data (2014-2023) for temporal analysis -- **🔬 Multi-modal:** RGB, 369-band hyperspectral, and LiDAR for each sample -- **📊 Rich Metadata:** Tree height, diameter, canopy position, and plot information - -### 4 Training Scenarios -```python -# Scenario 1: Standard training on filtered dataset -train_loader, test_loader = get_dataloaders(config='large', test_ratio=0.2) - -# Scenario 2: Maximum data training -train_loader, test_loader = get_dataloaders(config='combined', test_ratio=0.15) - -# Scenario 3: High-quality only (smaller but cleaner) -train_loader, test_loader = get_dataloaders(config='high_quality', test_ratio=0.2) - -# Scenario 4: Domain transfer (train on large, test on high-quality) -train_loader, test_loader = get_dataloaders( - train_config='large', - test_config='high_quality' -) -``` - -## 🌐 Dataset Overview - -- **47,971** individual tree crowns -- **167** unique species -- **30** NEON sites across North 
America -- **2014-2023** (10 years of data) -- **3 modalities:** RGB (3 bands), Hyperspectral (369 bands), LiDAR CHM (1 band) -- **Ecological metadata:** Height (95.4% available), stem diameter (99.4% available), canopy position (81.4% available) -- **HDF5 storage:** Efficient compressed format for fast loading - -## Visualization Examples - -The package includes comprehensive visualization tools for all three modalities: - -| RGB Image | HSI Pseudo RGB | HSI PCA Decomposition | -|-----------|----------------|----------------------| -| ![RGB](sample_plots/sample_rgb.png) | ![HSI](sample_plots/sample_hsi.png) | ![HSI PCA](sample_plots/sample_hsi_pca.png) | - -| HSI Spectral Signatures | LiDAR Canopy Height Model | -|-------------------------|---------------------------| -| ![Spectra](sample_plots/sample_spectra.png) | ![LiDAR](sample_plots/sample_lidar.png) | - -```python -# Visualization functions for tree crown data -from neon_tree_classification.core.visualization import ( - plot_rgb, plot_hsi, plot_hsi_pca, plot_hsi_spectra, plot_lidar -) -from neon_tree_classification.core.dataset import NeonCrownDataset - -# Load dataset and get a sample -dataset = NeonCrownDataset.load() -sample_data = dataset[0] # Get first sample (returns dict with rgb, hsi, lidar arrays) - -# Visualize each modality using the numpy arrays directly -plot_rgb(sample_data['rgb']) # True color RGB image from array -plot_hsi(sample_data['hsi']) # Pseudo RGB (bands ~660nm, ~550nm, ~450nm) -plot_hsi_pca(sample_data['hsi']) # PCA decomposition to 3 components -plot_hsi_spectra(sample_data['hsi']) # Spectral signatures of pixels -plot_lidar(sample_data['lidar']) # Canopy height model with colorbar + rgb = batch['rgb'] # [batch_size, 3, 128, 128] + hsi = batch['hsi'] # [batch_size, 369, 12, 12] + lidar = batch['lidar'] # [batch_size, 1, 12, 12] + labels = batch['species_idx'] # [batch_size] ``` -### Interactive Visualization Notebook - -For comprehensive data exploration and visualization examples, 
see: -```bash -# Open the visualization notebook -jupyter notebook notebooks/visualization.ipynb -``` - -## Top Species - -The dataset includes 167 tree species. Here are the most common: - -| Rank | Species | Count | Percentage | -|------|---------|-------|------------| -| 1 | Acer rubrum L. | 5,684 | 11.8% | -| 2 | Tsuga canadensis (L.) Carrière | 3,303 | 6.9% | -| 3 | Pseudotsuga menziesii (Mirb.) Franco var. menziesii | 2,978 | 6.2% | -| 4 | Pinus palustris Mill. | 2,207 | 4.6% | -| 5 | Quercus rubra L. | 2,086 | 4.3% | -| 6 | Pinus contorta Douglas ex Loudon var. latifolia Engelm. ex S. Watson | 2,011 | 4.2% | -| 7 | Tsuga heterophylla (Raf.) Sarg. | 1,425 | 3.0% | -| 8 | Populus tremuloides Michx. | 1,362 | 2.8% | -| 9 | Liriodendron tulipifera L. | 1,152 | 2.4% | -| 10 | Quercus alba L. | 1,139 | 2.4% | - -## Geographic Distribution - -Data collected from **30 NEON sites** across North America: - -**1.** HARV: 7,162 samples (14.9%) -**2.** MLBS: 5,424 samples (11.3%) -**3.** GRSM: 4,822 samples (10.1%) -**4.** DELA: 4,539 samples (9.5%) -**5.** RMNP: 3,931 samples (8.2%) -**6.** WREF: 3,616 samples (7.5%) -**7.** BART: 2,360 samples (4.9%) -**8.** OSBS: 2,192 samples (4.6%) -**9.** UNDE: 1,940 samples (4.0%) -**10.** CLBJ: 1,754 samples (3.7%) - -## Installation - -### Prerequisites -- Python 3.9+ (recommended: Python 3.11) -- CUDA-capable GPU (optional, but recommended for training) - -### Recommended: Using uv (Fast Python Package Manager) +Or run the quickstart example: ```bash -# Clone the repository -git clone https://github.com/Ritesh313/NeonTreeClassification.git -cd NeonTreeClassification - -# Install uv if not already installed -curl -LsSf https://astral.sh/uv/install.sh | sh -# or: pip install uv - -# Install with uv (automatically handles dependencies) -uv sync - -# Install with specific dependency groups -uv sync --extra dev # Add development tools (testing, formatting) -uv sync --extra processing # Add geospatial processing tools (for advanced 
users) -uv sync --extra logging # Add experiment logging tools (Comet ML, WandB) -uv sync --all-extras # Install everything -``` - -### Alternative: Using pip -```bash -# Install core dependencies -pip install -e . - -# Install optional dependencies -pip install -e .[dev,processing,logging] -``` - -## Baseline Classification Results - -Preliminary single-modality baseline results for **167-species classification** using the `combined` dataset configuration (seed=42, no hyperparameter optimization): - -| Modality | Test Accuracy | Model | Notes | -|----------|---------------|-------|-------| -| **RGB** | 53.5% | ResNet | Standard computer vision approach | -| **HSI** | 27.3% | Spectral CNN | 369-band hyperspectral data | -| **LiDAR** | 11.5% | Structural CNN | Canopy height model | - -*167-species classification is inherently challenging. These are basic preliminary results with default parameters - significant improvements possible with hyperparameter tuning, data augmentation, and architectural improvements.* - -### Reproduce Results - -**Prerequisites:** First run dataloaders to download dataset: -```python -from scripts.get_dataloaders import get_dataloaders -# This downloads the dataset to _neon_tree_classification_dataset_files/ -train_loader, test_loader = get_dataloaders(config='combined') -``` - -**With Comet ML (exact reproduction):** -```bash -# RGB baseline (requires Comet ML setup) -uv run python examples/train.py \ - --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ - --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ - --modality rgb --model_type resnet --batch_size 1024 --seed 42 \ - --logger comet --early_stop_patience 15 - -# HSI baseline (requires Comet ML setup) -uv run python examples/train.py \ - --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ - --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ - --modality hsi --model_type spectral_cnn 
--batch_size 128 --seed 42 \ - --logger comet --early_stop_patience 15 - -# LiDAR baseline (requires Comet ML setup) -uv run python examples/train.py \ - --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ - --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ - --modality lidar --model_type structural --batch_size 1024 --seed 42 \ - --logger comet --early_stop_patience 15 -``` - -**Without Comet ML (approximate reproduction):** -```bash -# RGB baseline (fixed epochs) -uv run python examples/train.py \ - --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ - --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ - --modality rgb --model_type resnet --batch_size 1024 --seed 42 --epochs 100 - -# HSI baseline (fixed epochs) -uv run python examples/train.py \ - --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ - --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ - --modality hsi --model_type spectral_cnn --batch_size 128 --seed 42 --epochs 100 - -# LiDAR baseline (fixed epochs) -uv run python examples/train.py \ - --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ - --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ - --modality lidar --model_type structural --batch_size 1024 --seed 42 --epochs 100 -``` -*Note: Without early stopping, results may vary. 
Original experiments used Comet ML logging with early stopping (patience=15).* - -## Repository Structure - -``` -NeonTreeClassification/ -├── 📁 neon_tree_classification/ # Main Python package -│ ├── core/ # Core functionality -│ │ ├── dataset.py # HDF5-based dataset class -│ │ ├── datamodule.py # PyTorch Lightning data module -│ │ └── visualization.py # Multi-modal visualization tools -│ └── models/ # ML architectures & training modules -├── 📁 scripts/ # Easy-to-use scripts -│ ├── download_dataset.py # Download & extract dataset -│ ├── get_dataloaders.py # Simple DataLoader factory -├── 📁 examples/ # Training examples -│ └── train.py # Flexible training script -├── 📁 notebooks/ # Interactive exploration -│ └── visualization.ipynb # Dataset exploration & visualization -├── 📁 processing/ # NEON raw data processing tools -│ └── neon_data_processing/ # Scripts for processing raw NEON data products -├── 📁 sample_plots/ # Generated example visualizations -├── pyproject.toml # Package configuration & dependencies -└── uv.lock # Dependency lock file -``` - -*Note: Dataset automatically downloads to `_neon_tree_classification_dataset_files/` on first use.* - -## Training Models - -Use the flexible training script with different modalities: - -```bash -# Train RGB classifier -python examples/train.py --modality rgb --csv_path _neon_tree_classification_dataset_files/metadata/large_dataset.csv --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 - -# Train hyperspectral classifier -python examples/train.py --modality hsi --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 --batch_size 16 - -# Train LiDAR classifier -python examples/train.py --modality lidar --csv_path _neon_tree_classification_dataset_files/metadata/high_quality_dataset.csv --hdf5_path 
_neon_tree_classification_dataset_files/neon_dataset.h5 - -# Advanced: External test set (train on large, test on high_quality) -python examples/train.py --modality rgb --csv_path _neon_tree_classification_dataset_files/metadata/large_dataset.csv --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 --external_test_csv _neon_tree_classification_dataset_files/metadata/high_quality_dataset.csv +uv run python quickstart.py ``` -### Add Custom Models -Create new model architectures in `neon_tree_classification/models/` and reference them with the `--model_type` flag. +## Coming Soon -## Advanced Usage +**Data Processing Pipeline**: Tools for processing raw NEON data products are being finalized and will be released for public use. This will enable users to: +- Download NEON tiles for all three modalities +- Crop individual tree crowns from shapefiles +- Create custom datasets with their own crown annotations -### Custom Data Filtering & Lightning DataModule +**Classification Models**: Pre-trained models and training scripts for tree species classification will be added to the repository. 
-```python -# For advanced users: Use Lightning DataModule with flexible splitting -from neon_tree_classification.core.datamodule import NeonCrownDataModule - -# Basic configuration with species/site filtering -datamodule = NeonCrownDataModule( - csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv", - hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5", - modalities=["rgb"], # Single modality training - batch_size=32, - # Filtering options - species_filter=["PSMEM", "TSHE"], # Train on specific species - site_filter=["HARV", "OSBS"], # Train on specific sites - year_filter=[2018, 2019], # Train on specific years - # Split method options - split_method="site", # Options: "random", "site", "year" - val_ratio=0.15, - test_ratio=0.15 -) - -# For external test sets (advanced) -datamodule = NeonCrownDataModule( - csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv", - hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5", - external_test_csv_path="path/to/external_test.csv", # Optional external test - external_test_hdf5_path="path/to/external_test.h5", # Optional external HDF5 - modalities=["rgb"] -) - -datamodule.setup("fit") # Auto-filters species for compatibility -``` - -### Data Processing Pipeline - -The `processing/` folder contains a comprehensive NEON data processing pipeline: - -**Core Processing Steps:** -1. **Download NEON tiles** (`neon_downloader.py`) - Downloads RGB, HSI, LiDAR data from NEON API -2. **Curate tiles** (`curate_tiles.py`) - Quality control and tile selection -3. **Process shapefiles** (`shapefile_processing/`) - Extract crown metadata and validate annotations -4. **Crop crowns** (`crop_crowns_multimodal.py`) - Extract individual tree crowns from tiles -5. **Convert formats** (`convert_tif_to_npy.py`, `hsi_convert_h5_to_tif.py`) - Optimize data storage -6. **Generate datasets** (`create_training_csv.py`) - Create final training/test CSVs -7. 
**Filter & combine** (`misc/filter_rare_species.py`, `dataset_combiner.py`) - Dataset refinement +**DeepForest Integration**: Planned integration with [DeepForest](https://github.com/weecology/DeepForest) to enable: +- Automatic crown detection from aerial imagery +- Seamless multi-modal data extraction for detected crowns +- Direct classification using pre-trained models from this repository ## Dataset Details -### NEON Data Products -- **RGB**: `DP3.30010.001` - High-resolution orthorectified imagery -- **Hyperspectral**: `DP3.30006.002` - 426-band spectrometer reflectance -- **LiDAR**: `DP3.30015.001` - Canopy Height Model - -### Data Structure -``` -_neon_tree_classification_dataset_files/metadata/combined_dataset.csv - Main dataset file -├── crown_id - Unique identifier for each tree crown -├── individual - NEON individual tree ID -├── individual_id - Full individual identifier -├── site - NEON site code -├── year - Data collection year -├── species - Species code -├── species_name - Full species name -├── label_site - Site where tree was labeled -├── height - Tree height in meters (95.4% available) -├── stemDiameter - Stem diameter in cm (99.4% available) -├── canopyPosition - Light exposure: Full sun, Partially shaded, etc. (81.4% available) -├── plantStatus - Tree health status: Live, Dead, etc. (99.99% available) -├── plot - Plot identifier (currently "unknown" for all samples) -├── hand_annotated - Whether crown was manually annotated (boolean) -├── rgb_path - Path to RGB image in HDF5 dataset -├── hsi_path - Path to hyperspectral image in HDF5 dataset -└── lidar_path - Path to LiDAR CHM in HDF5 dataset -``` +**Top 5 Species:** +1. Acer rubrum L. (5,684 samples, 11.8%) +2. Tsuga canadensis (L.) Carrière (3,303 samples, 6.9%) +3. Pseudotsuga menziesii (Mirb.) Franco var. menziesii (2,978 samples, 6.2%) +4. Pinus palustris Mill. (2,207 samples, 4.6%) +5. 
Quercus rubra L. (2,086 samples, 4.3%) -**Ecological Metadata**: Most samples include tree measurements that provide valuable ecological context for classification, though availability varies by field. +**Top 5 Sites:** +- HARV: 7,162 samples (14.9%) +- MLBS: 5,424 samples (11.3%) +- GRSM: 4,822 samples (10.1%) +- DELA: 4,539 samples (9.5%) +- RMNP: 3,931 samples (8.2%) -## Contributing +**NEON Data Products:** +- RGB: DP3.30010.001 (High-resolution orthorectified imagery) +- Hyperspectral: DP3.30006.002 (426-band spectrometer reflectance) +- LiDAR: DP3.30015.001 (Canopy Height Model) -1. Fork the repository -2. Create a feature branch -3. Submit a pull request +For complete dataset documentation, training guides, and advanced usage, see the [docs/](docs/) directory. ## Acknowledgments -- National Ecological Observatory Network (NEON) -- Dataset statistics generated on 2025-08-28 +National Ecological Observatory Network (NEON) + diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..7113ef6 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,84 @@ +# NEON Tree Classification Dataset - Documentation + +Welcome to the comprehensive documentation for the NEON Multi-Modal Tree Species Classification Dataset. + +## Quick Links + +- [README](../README.md) - Main README with dataset overview and quick start +- [Advanced Usage](advanced_usage.md) - Custom filtering, Lightning DataModule, and advanced features +- [Training Guide](training.md) - Model training examples and baseline results +- [Visualization Guide](visualization.md) - Data visualization tools and examples +- [Processing Pipeline](processing.md) - NEON data processing workflow + +## Getting Started + +1. **New Users**: Start with the [main README](../README.md) for installation and basic usage +2. **Training Models**: See the [Training Guide](training.md) for model training and baseline results +3. 
**Data Exploration**: Check out the [Visualization Guide](visualization.md) for exploring the dataset +4. **Advanced Features**: Read [Advanced Usage](advanced_usage.md) for custom configurations +5. **Data Processing**: For processing raw NEON data, see the [Processing Pipeline](processing.md) + +## Documentation Structure + +### [Advanced Usage](advanced_usage.md) +- Custom data filtering with Lightning DataModule +- Split methods (random, site-based, year-based) +- External test sets +- Advanced dataloader configuration +- Direct dataset usage +- Multi-GPU training +- Custom training loops + +### [Training Guide](training.md) +- Quick training with examples script +- Baseline results and reproduction steps +- Custom model architectures +- Training best practices +- Multi-modal training +- Experiment tracking (Comet ML, W&B) +- Common issues and solutions + +### [Visualization Guide](visualization.md) +- Overview of visualization tools +- RGB, HSI, and LiDAR visualization +- Interactive Jupyter notebook +- Custom visualizations +- Multi-modal comparisons +- Advanced spectral analysis + +### [Processing Pipeline](processing.md) +- Complete data processing workflow +- NEON data product details +- Quality control procedures +- HDF5 dataset creation +- Configuration subset creation +- Processing best practices + +## Support + +For issues, questions, or contributions: +- GitHub Issues: [Report a bug or request a feature](https://github.com/Ritesh313/NeonTreeClassification/issues) +- Contributing: See [CONTRIBUTING.md](../CONTRIBUTING.md) + +## Citation + +If you use this dataset in your research, please cite: + +```bibtex +@dataset{neon_tree_classification_2024, + title={NEON Multi-Modal Tree Species Classification Dataset}, + author={[Author Names]}, + year={2024}, + publisher={GitHub}, + url={https://github.com/Ritesh313/NeonTreeClassification} +} +``` + +## License + +See [LICENSE](../LICENSE) file for details. 
+ +## Acknowledgments + +- National Ecological Observatory Network (NEON) +- Dataset statistics generated on 2025-08-28 diff --git a/docs/advanced_usage.md b/docs/advanced_usage.md new file mode 100644 index 0000000..d020ebb --- /dev/null +++ b/docs/advanced_usage.md @@ -0,0 +1,290 @@ +# Advanced Usage + +This guide covers advanced features for experienced users who need custom data filtering, specialized training configurations, or want to use the PyTorch Lightning DataModule directly. + +## Custom Data Filtering with Lightning DataModule + +The `NeonCrownDataModule` provides flexible filtering and splitting options for advanced use cases. + +### Basic Configuration + +```python +from neon_tree_classification.core.datamodule import NeonCrownDataModule + +# Basic configuration with species/site filtering +datamodule = NeonCrownDataModule( + csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv", + hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5", + modalities=["rgb"], # Single modality training + batch_size=32, + # Filtering options + species_filter=["PSMEM", "TSHE"], # Train on specific species + site_filter=["HARV", "OSBS"], # Train on specific sites + year_filter=[2018, 2019], # Train on specific years + # Split method options + split_method="random", # Options: "random", "site", "year" + val_ratio=0.15, + test_ratio=0.15 +) + +datamodule.setup("fit") +``` + +### Split Methods + +The DataModule supports three splitting strategies: + +**1. Random Split** (default) +```python +datamodule = NeonCrownDataModule( + csv_path="path/to/dataset.csv", + hdf5_path="path/to/dataset.h5", + split_method="random", + val_ratio=0.15, + test_ratio=0.15 +) +``` + +**2. 
Site-Based Split** + +Useful for testing generalization across geographic locations: +```python +datamodule = NeonCrownDataModule( + csv_path="path/to/dataset.csv", + hdf5_path="path/to/dataset.h5", + split_method="site", + val_ratio=0.15, + test_ratio=0.15 +) +``` + +**3. Year-Based Split** + +Useful for testing temporal generalization: +```python +datamodule = NeonCrownDataModule( + csv_path="path/to/dataset.csv", + hdf5_path="path/to/dataset.h5", + split_method="year", + val_ratio=0.15, + test_ratio=0.15 +) +``` + +### External Test Sets + +For domain adaptation or cross-site validation: + +```python +datamodule = NeonCrownDataModule( + csv_path="_neon_tree_classification_dataset_files/metadata/combined_dataset.csv", + hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5", + external_test_csv_path="path/to/external_test.csv", + external_test_hdf5_path="path/to/external_test.h5", # Optional, uses main HDF5 if not provided + modalities=["rgb"] +) + +datamodule.setup("fit") # Auto-filters species for compatibility +``` + +## Advanced DataLoader Configuration + +### Custom Normalization + +Each modality supports different normalization methods: + +**RGB Normalization:** +- `"0_1"`: Scale to [0, 1] range (default) +- `"imagenet"`: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) +- `"per_sample"`: Normalize each sample independently + +**HSI Normalization:** +- `"per_sample"`: Normalize each sample independently (default) +- `"global"`: Use global dataset statistics +- `"none"`: No normalization + +**LiDAR Normalization:** +- `"height"`: Normalize by maximum canopy height (default) +- `"per_sample"`: Normalize each sample independently +- `"none"`: No normalization + +Example: +```python +train_loader, test_loader = get_dataloaders( + config='large', + modalities=['rgb', 'hsi', 'lidar'], + batch_size=32, + rgb_norm_method='imagenet', + hsi_norm_method='global', + lidar_norm_method='height' +) +``` + +### Custom Image Sizes + 
+Adjust the spatial resolution for each modality: + +```python +train_loader, test_loader = get_dataloaders( + config='large', + modalities=['rgb', 'hsi', 'lidar'], + batch_size=32, + rgb_size=(224, 224), # Larger RGB for fine-grained features + hsi_size=(16, 16), # Higher HSI resolution + lidar_size=(16, 16) # Higher LiDAR resolution +) +``` + +## Direct Dataset Usage + +For maximum control, use the `NeonCrownDataset` class directly: + +```python +from neon_tree_classification.core.dataset import NeonCrownDataset +from torch.utils.data import DataLoader + +# Create dataset with custom parameters +dataset = NeonCrownDataset( + csv_path="_neon_tree_classification_dataset_files/metadata/large_dataset.csv", + hdf5_path="_neon_tree_classification_dataset_files/neon_dataset.h5", + modalities=['rgb', 'hsi'], + species_filter=['ACRU', 'TSCA'], # Limit to specific species + site_filter=['HARV', 'MLBS'], # Limit to specific sites + year_filter=[2018, 2019, 2020], # Limit to specific years + include_metadata=True, # Include crown_id, species names, etc. 
+ rgb_size=(128, 128), + hsi_size=(12, 12), + rgb_norm_method='imagenet', + hsi_norm_method='per_sample' +) + +# Create custom DataLoader +train_loader = DataLoader( + dataset, + batch_size=64, + shuffle=True, + num_workers=8, + pin_memory=True +) +``` + +## Accessing Metadata + +Enable metadata in batches to access crown IDs, species names, and site information: + +```python +from torch.utils.data import DataLoader + +# Note: get_dataloaders doesn't support include_metadata yet +# Use NeonCrownDataset directly: +from neon_tree_classification.core.dataset import NeonCrownDataset + +dataset = NeonCrownDataset( + csv_path="path/to/dataset.csv", + hdf5_path="path/to/dataset.h5", + modalities=['rgb'], + include_metadata=True +) + +# Access metadata in batches +for batch in DataLoader(dataset, batch_size=32): + rgb = batch['rgb'] + labels = batch['species_idx'] + crown_ids = batch['crown_id'] + species_names = batch['species'] + sites = batch['site'] +``` + +## Multi-GPU Training + +For distributed training with PyTorch Lightning: + +```python +import pytorch_lightning as pl +from neon_tree_classification.core.datamodule import NeonCrownDataModule + +# Configure DataModule +datamodule = NeonCrownDataModule( + csv_path="path/to/dataset.csv", + hdf5_path="path/to/dataset.h5", + modalities=["rgb"], + batch_size=32 # Per-GPU batch size +) + +# Create trainer with multi-GPU support +trainer = pl.Trainer( + devices=4, # Number of GPUs + strategy='ddp', # Distributed Data Parallel + precision=16, # Mixed precision training + max_epochs=100 +) + +# `model` is your own LightningModule instance (not defined in this snippet) +trainer.fit(model, datamodule=datamodule) +``` + +## Custom Training Loop + +Example of a custom training loop without PyTorch Lightning: + +```python +import torch +from scripts.get_dataloaders import get_dataloaders + +# Get dataloaders +train_loader, test_loader = get_dataloaders( + config='large', + modalities=['rgb'], + batch_size=64 +) + +# Your model (YourModel is a placeholder for your own torch.nn.Module) +model = YourModel().cuda() +optimizer = 
torch.optim.Adam(model.parameters(), lr=1e-3) +criterion = torch.nn.CrossEntropyLoss() + +# Training loop +for epoch in range(100): + model.train() + for batch in train_loader: + rgb = batch['rgb'].cuda() + labels = batch['species_idx'].cuda() + + optimizer.zero_grad() + outputs = model(rgb) + loss = criterion(outputs, labels) + loss.backward() + optimizer.step() + + # Validation + model.eval() + correct = 0 + total = 0 + with torch.no_grad(): + for batch in test_loader: + rgb = batch['rgb'].cuda() + labels = batch['species_idx'].cuda() + outputs = model(rgb) + _, predicted = outputs.max(1) + total += labels.size(0) + correct += predicted.eq(labels).sum().item() + + accuracy = 100. * correct / total + print(f'Epoch {epoch}: Accuracy = {accuracy:.2f}%') +``` + +## Performance Tips + +1. **Use larger batch sizes**: The dataset fits in memory efficiently due to HDF5 compression +2. **Increase num_workers**: More workers can significantly speed up data loading +3. **Enable pin_memory**: Speeds up CPU-to-GPU transfer +4. **Use persistent_workers**: Reduces worker initialization overhead + +```python +train_loader, test_loader = get_dataloaders( + config='large', + modalities=['rgb'], + batch_size=256, # Larger batch size + num_workers=16, # More workers (adjust based on CPU cores) +) +``` diff --git a/docs/processing.md b/docs/processing.md new file mode 100644 index 0000000..4abb6b7 --- /dev/null +++ b/docs/processing.md @@ -0,0 +1,399 @@ +# Processing Pipeline + +This guide documents the NEON data processing pipeline for advanced users who want to process raw NEON data products or understand how the dataset was created. + +## Overview + +The `processing/` folder contains a comprehensive pipeline for converting raw NEON data products into the final training-ready dataset. 
This is useful for: +- Processing new NEON data +- Customizing the dataset creation process +- Understanding data quality control steps +- Creating similar datasets from other sources + +## Pipeline Steps + +### 1. Download NEON Tiles + +Download RGB, HSI, and LiDAR data from the NEON API: + +```bash +python processing/neon_data_processing/neon_downloader.py \ + --site HARV \ + --year 2018 \ + --output_dir /path/to/output +``` + +**What it does:** +- Downloads NEON data products from the API +- Organizes files by site, year, and modality +- Validates downloads and checks file integrity + +**Key parameters:** +- `--site`: NEON site code (e.g., HARV, MLBS, GRSM) +- `--year`: Data collection year +- `--products`: Which products to download (rgb, hsi, lidar) + +### 2. Curate Tiles + +Quality control and tile selection: + +```bash +python processing/neon_data_processing/curate_tiles.py \ + --input_dir /path/to/downloaded/tiles \ + --output_dir /path/to/curated/tiles \ + --quality_threshold 0.8 +``` + +**What it does:** +- Checks for data completeness (all three modalities present) +- Validates spatial alignment between modalities +- Filters out low-quality or corrupted tiles +- Creates metadata about tile quality + +**Quality checks:** +- Spatial overlap between RGB, HSI, and LiDAR +- Missing data percentage +- Coordinate system consistency +- File format validation + +### 3. Process Shapefiles + +Extract crown metadata and validate annotations: + +```bash +cd processing/shapefile_processing +python process_shapefiles.py \ + --shapefile_dir /path/to/shapefiles \ + --output_csv crowns_metadata.csv +``` + +**What it does:** +- Extracts tree crown polygons from shapefiles +- Links crowns to individual tree measurements +- Validates crown annotations +- Merges ecological metadata (height, diameter, etc.) + +**Output format:** +- CSV with crown ID, species, location, measurements +- Validated polygon geometries +- Quality flags for each annotation + +### 4. 
Crop Tree Crowns + +Extract individual tree crowns from tiles: + +```bash +python processing/neon_data_processing/crop_crowns_multimodal.py \ + --tiles_dir /path/to/curated/tiles \ + --crowns_csv crowns_metadata.csv \ + --output_dir /path/to/cropped/crowns \ + --rgb_size 128 \ + --hsi_size 12 \ + --lidar_size 12 +``` + +**What it does:** +- Extracts bounding boxes around each crown +- Crops corresponding regions from RGB, HSI, and LiDAR tiles +- Resamples to target resolutions +- Handles coordinate transformations between modalities + +**Parameters:** +- `--rgb_size`: Target RGB resolution (default: 128x128) +- `--hsi_size`: Target HSI resolution (default: 12x12) +- `--lidar_size`: Target LiDAR resolution (default: 12x12) +- `--padding`: Additional padding around crowns (default: 0) + +### 5. Convert Formats + +Optimize data storage: + +```bash +# Convert TIF to NumPy +python processing/neon_data_processing/convert_tif_to_npy.py \ + --input_dir /path/to/cropped/crowns \ + --output_dir /path/to/numpy/arrays + +# Convert HSI H5 to TIF +python processing/neon_data_processing/hsi_convert_h5_to_tif.py \ + --input_dir /path/to/hsi/h5 \ + --output_dir /path/to/hsi/tif +``` + +**What it does:** +- Converts various formats to efficient storage +- Applies compression where appropriate +- Validates converted data + +### 6. Generate Training CSV + +Create final training/test CSVs: + +```bash +python processing/neon_data_processing/create_training_csv.py \ + --crowns_dir /path/to/cropped/crowns \ + --metadata_csv crowns_metadata.csv \ + --output_csv training_dataset.csv +``` + +**What it does:** +- Combines all metadata +- Validates data availability for each sample +- Adds file paths to HDF5 dataset +- Creates train/val/test splits + +### 7. 
Filter and Combine + +Dataset refinement: + +```bash +# Filter rare species +python processing/misc/filter_rare_species.py \ + --input_csv training_dataset.csv \ + --output_csv filtered_dataset.csv \ + --min_samples 50 + +# Combine multiple datasets +python processing/misc/dataset_combiner.py \ + --input_csvs dataset1.csv dataset2.csv dataset3.csv \ + --output_csv combined_dataset.csv +``` + +**What it does:** +- Removes species with insufficient samples +- Combines datasets from different sites/years +- Ensures species compatibility across datasets +- Creates configuration subsets (large, high_quality, combined) + +## Repository Structure + +``` +processing/ +├── neon_data_processing/ # Main processing scripts +│ ├── neon_downloader.py # Download NEON data +│ ├── curate_tiles.py # Quality control +│ ├── crop_crowns_multimodal.py # Extract crowns +│ ├── convert_tif_to_npy.py # Format conversion +│ ├── hsi_convert_h5_to_tif.py # HSI format conversion +│ └── create_training_csv.py # Generate training CSVs +├── shapefile_processing/ # Shapefile tools +│ ├── process_shapefiles.py # Extract crown metadata +│ └── README.md # Shapefile processing guide +└── misc/ # Utility scripts + ├── filter_rare_species.py # Species filtering + └── dataset_combiner.py # Combine datasets +``` + +## NEON Data Products + +### RGB (DP3.30010.001) +**High-Resolution Orthorectified Camera Imagery** +- Resolution: 10 cm +- Format: GeoTIFF +- Bands: RGB (3 channels) +- Coverage: Full site mosaics + +### Hyperspectral (DP3.30006.002) +**Surface Directional Reflectance** +- Resolution: 1 m +- Format: HDF5 +- Bands: 426 spectral bands (380-2510 nm) +- Processing: Atmospheric correction applied +- Note: Reduced to 369 bands in dataset (removed noisy bands) + +### LiDAR (DP3.30015.001) +**Ecosystem Structure** +- Resolution: 1 m +- Format: GeoTIFF +- Data: Canopy Height Model (CHM) +- Derived from: Point cloud classification + +## 
Data Quality Control + +### Spatial Alignment + +Ensure all modalities are properly aligned: + +```python +from processing.utils import check_spatial_alignment + +# Verify alignment +aligned = check_spatial_alignment( + rgb_path='path/to/rgb.tif', + hsi_path='path/to/hsi.h5', + lidar_path='path/to/lidar.tif', + tolerance=0.5 # meters +) + +if not aligned: + print("Warning: Modalities not aligned!") +``` + +### Missing Data + +Handle missing or corrupted data: + +```python +from processing.utils import validate_data + +# Check data quality +quality = validate_data( + crown_id='HARV_123', + rgb_path='path/to/rgb.npy', + hsi_path='path/to/hsi.npy', + lidar_path='path/to/lidar.npy' +) + +print(f"Quality score: {quality['score']:.2f}") +print(f"Issues: {quality['issues']}") +``` + +## HDF5 Dataset Creation + +Convert processed crowns to HDF5: + +```python +import h5py +import numpy as np + +# Create HDF5 dataset +with h5py.File('neon_dataset.h5', 'w') as f: + # Create groups + rgb_group = f.create_group('rgb') + hsi_group = f.create_group('hsi') + lidar_group = f.create_group('lidar') + + # Add crown data + for crown_id, data in processed_crowns.items(): + rgb_group.create_dataset( + crown_id, + data=data['rgb'], + compression='gzip', + compression_opts=9 + ) + hsi_group.create_dataset( + crown_id, + data=data['hsi'], + compression='gzip', + compression_opts=9 + ) + lidar_group.create_dataset( + crown_id, + data=data['lidar'], + compression='gzip', + compression_opts=9 + ) +``` + +## Configuration Subsets + +Create different dataset configurations: + +### Combined Dataset +All available data (47,971 samples, 167 species) + +```bash +python processing/misc/create_config.py \ + --input_csv all_crowns.csv \ + --output_csv combined_dataset.csv \ + --config combined +``` + +### Large Dataset +Main training set (~42,000 samples, ~162 species) + +```bash +python processing/misc/create_config.py \ + --input_csv all_crowns.csv \ + --output_csv large_dataset.csv \ + --config 
large \ + --min_samples_per_species 50 +``` + +### High Quality Dataset +Curated subset (~5,500 samples, ~96 species) + +```bash +python processing/misc/create_config.py \ + --input_csv all_crowns.csv \ + --output_csv high_quality_dataset.csv \ + --config high_quality \ + --quality_threshold 0.9 \ + --min_samples_per_species 100 +``` + +## Processing Best Practices + +### 1. Start Small +Process one site first to validate the pipeline: + +```bash +# Test with a single site (shell script, so run it with bash, not python) +bash process_site.sh HARV 2018 +``` + +### 2. Parallel Processing +Use parallel processing for large-scale operations: + +```bash +# Process multiple sites in parallel with GNU parallel ({} is replaced by each site code) +parallel -j 4 bash process_site.sh {} 2018 ::: HARV MLBS GRSM DELA +``` + +### 3. Disk Space +Monitor disk usage, because raw and processed data can be large: +- Raw tiles: ~100 GB per site +- Processed crowns: ~50 GB per site +- Final HDF5 dataset: ~600 MB (compressed) + +### 4. Validation +Always validate processed data: + +```bash +python processing/utils/validate_dataset.py \ + --csv_path training_dataset.csv \ + --hdf5_path neon_dataset.h5 +``` + +## Troubleshooting + +### Issue: Spatial misalignment +**Solution:** Check coordinate reference systems (CRS) and reproject if needed + +### Issue: Missing HSI bands +**Solution:** Verify HSI data download and band extraction + +### Issue: Corrupted crowns +**Solution:** Increase quality thresholds in curation step + +### Issue: Memory errors +**Solution:** Process in batches or use more efficient data types + +## Custom Processing + +For custom processing workflows: + +```python +from processing.pipeline import ProcessingPipeline + +# Create custom pipeline +pipeline = ProcessingPipeline( + sites=['HARV', 'MLBS'], + years=[2018, 2019, 2020], + output_dir='custom_dataset' +) + +# Configure processing +pipeline.set_quality_threshold(0.85) +pipeline.set_crown_sizes(rgb=224, hsi=16, lidar=16) + +# Run pipeline +pipeline.run() +``` + +## Additional Resources + +- [NEON Data 
Portal](https://data.neonscience.org/) +- [NEON Data Products Catalog](https://data.neonscience.org/data-products/explore) +- [NEON API Documentation](https://data.neonscience.org/data-api) +- [Processing README](../processing/neon_data_processing/README.md) diff --git a/docs/training.md b/docs/training.md new file mode 100644 index 0000000..6b1abca --- /dev/null +++ b/docs/training.md @@ -0,0 +1,289 @@ +# Training Guide + +This guide covers model training, baseline results, and tips for achieving good performance on the NEON tree classification dataset. + +## Quick Training with Examples Script + +The repository includes a flexible training script that supports all modalities: + +```bash +# Train RGB classifier +uv run python examples/train.py \ + --modality rgb \ + --csv_path _neon_tree_classification_dataset_files/metadata/large_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 + +# Train hyperspectral classifier +uv run python examples/train.py \ + --modality hsi \ + --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ + --batch_size 16 + +# Train LiDAR classifier +uv run python examples/train.py \ + --modality lidar \ + --csv_path _neon_tree_classification_dataset_files/metadata/high_quality_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 + +# External test set (train on large, test on high_quality) +uv run python examples/train.py \ + --modality rgb \ + --csv_path _neon_tree_classification_dataset_files/metadata/large_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ + --external_test_csv _neon_tree_classification_dataset_files/metadata/high_quality_dataset.csv +``` + +## Baseline Results + +Preliminary single-modality baseline results for 167-species classification using the `combined` dataset configuration (seed=42, no hyperparameter optimization): + +| 
Modality | Test Accuracy | Model | Notes | +|----------|---------------|-------|-------| +| RGB | 53.5% | ResNet | Standard computer vision approach | +| HSI | 27.3% | Spectral CNN | 369-band hyperspectral data | +| LiDAR | 11.5% | Structural CNN | Canopy height model | + +**Important Notes:** +- 167-species classification is inherently challenging +- These are basic preliminary results with default parameters +- Significant improvements possible with hyperparameter tuning, data augmentation, and architectural improvements +- Multi-modal fusion is expected to significantly improve performance + +## Reproducing Baseline Results + +### Prerequisites + +First, download the dataset: +```python +from scripts.get_dataloaders import get_dataloaders +# This downloads the dataset to _neon_tree_classification_dataset_files/ +train_loader, test_loader = get_dataloaders(config='combined') +``` + +### With Comet ML (Exact Reproduction) + +The original experiments used Comet ML for logging and early stopping: + +```bash +# RGB baseline +uv run python examples/train.py \ + --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ + --modality rgb --model_type resnet --batch_size 1024 --seed 42 \ + --logger comet --early_stop_patience 15 + +# HSI baseline +uv run python examples/train.py \ + --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ + --modality hsi --model_type spectral_cnn --batch_size 128 --seed 42 \ + --logger comet --early_stop_patience 15 + +# LiDAR baseline +uv run python examples/train.py \ + --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ + --modality lidar --model_type structural --batch_size 1024 --seed 42 \ + --logger comet --early_stop_patience 15 +``` + 
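The `--early_stop_patience 15` flag in the commands above implies patience-based early stopping. The following is a minimal sketch of that general technique, not the repository's implementation: the `EarlyStopper` class, the `epochs_run` helper, and the assumption that the monitored metric is validation accuracy (higher is better) are all hypothetical.

```python
class EarlyStopper:
    """Patience-based early stopping for a metric where higher is better.

    Hypothetical helper for illustration; the training script may monitor
    a different metric or rely on a framework callback instead.
    """

    def __init__(self, patience: int = 15, min_delta: float = 0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum gain that counts as improvement
        self.best = float("-inf")     # best metric value seen so far
        self.bad_epochs = 0           # consecutive epochs without improvement

    def step(self, metric: float) -> bool:
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_accuracies, patience=15):
    """Return how many epochs would run before early stopping triggers."""
    stopper = EarlyStopper(patience=patience)
    for epoch, acc in enumerate(val_accuracies, start=1):
        if stopper.step(acc):
            return epoch
    return len(val_accuracies)
```

With a fixed `--epochs` value and no early stopping, training continues past the point where such a stopper would have halted, which is one reason the two reproduction modes can give different test accuracy.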
+### Without Comet ML (Approximate Reproduction) + +Without early stopping, results may vary: + +```bash +# RGB baseline (fixed epochs) +uv run python examples/train.py \ + --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ + --modality rgb --model_type resnet --batch_size 1024 --seed 42 --epochs 100 + +# HSI baseline (fixed epochs) +uv run python examples/train.py \ + --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ + --modality hsi --model_type spectral_cnn --batch_size 128 --seed 42 --epochs 100 + +# LiDAR baseline (fixed epochs) +uv run python examples/train.py \ + --csv_path _neon_tree_classification_dataset_files/metadata/combined_dataset.csv \ + --hdf5_path _neon_tree_classification_dataset_files/neon_dataset.h5 \ + --modality lidar --model_type structural --batch_size 1024 --seed 42 --epochs 100 +``` + +## Custom Model Architectures + +### Creating Custom Models + +Add new model architectures in `neon_tree_classification/models/` and reference them with the `--model_type` flag. + +Example custom model: + +```python +# neon_tree_classification/models/my_custom_model.py +import torch.nn as nn + +class MyCustomModel(nn.Module): + def __init__(self, num_classes, input_channels=3): + super().__init__() + self.features = nn.Sequential( + nn.Conv2d(input_channels, 64, kernel_size=3, padding=1), + nn.ReLU(), + nn.MaxPool2d(2), + # Add more layers... + ) + self.classifier = nn.Linear(64, num_classes) + + def forward(self, x): + x = self.features(x) + x = x.mean([2, 3]) # Global average pooling + x = self.classifier(x) + return x +``` + +## Training Best Practices + +### 1. 
Start with RGB + +RGB data is easiest to work with and provides good baseline performance: +- Standard computer vision techniques apply +- Pre-trained ImageNet models can be fine-tuned +- Faster training times + +### 2. Dataset Configuration Selection + +Choose based on your goals: +- `combined`: Maximum data, all species +- `large`: Good balance of data quantity and quality +- `high_quality`: Cleanest data, fewer species + +### 3. Hyperparameter Tuning + +Key hyperparameters to tune: +- Learning rate (start with 1e-3 to 1e-4) +- Batch size (larger is usually better, up to memory limits) +- Weight decay (0 to 1e-4) +- Augmentation parameters + +### 4. Data Augmentation + +For RGB: +```python +import torchvision.transforms as transforms + +train_transforms = transforms.Compose([ + transforms.RandomHorizontalFlip(), + transforms.RandomVerticalFlip(), + transforms.RandomRotation(90), + transforms.ColorJitter(brightness=0.2, contrast=0.2), +]) +``` + +### 5. Learning Rate Scheduling + +Use learning rate scheduling for better convergence: +```python +scheduler = torch.optim.lr_scheduler.CosineAnnealingLR( + optimizer, + T_max=100 +) +``` + +## Multi-Modal Training + +Combining multiple modalities typically improves performance: + +### Early Fusion +```python +# Concatenate features from different modalities +rgb_features = rgb_encoder(rgb_data) +hsi_features = hsi_encoder(hsi_data) +lidar_features = lidar_encoder(lidar_data) + +combined = torch.cat([rgb_features, hsi_features, lidar_features], dim=1) +output = classifier(combined) +``` + +### Late Fusion +```python +# Average predictions from different modalities +rgb_pred = rgb_model(rgb_data) +hsi_pred = hsi_model(hsi_data) +lidar_pred = lidar_model(lidar_data) + +final_pred = (rgb_pred + hsi_pred + lidar_pred) / 3 +``` + +## Experiment Tracking + +### Using Comet ML + +```bash +# Set up Comet ML +export COMET_API_KEY="your_api_key" + +# Train with Comet logging +uv run python examples/train.py \ + --modality rgb \ + 
--logger comet \ + --csv_path path/to/dataset.csv \ + --hdf5_path path/to/dataset.h5 +``` + +### Using Weights & Biases + +```bash +# Set up W&B +wandb login + +# Train with W&B logging +uv run python examples/train.py \ + --modality rgb \ + --logger wandb \ + --csv_path path/to/dataset.csv \ + --hdf5_path path/to/dataset.h5 +``` + +## Common Issues and Solutions + +### Issue: Out of Memory +**Solution:** Reduce batch size or image resolution +```bash +python examples/train.py --batch_size 16 --modality rgb +``` + +### Issue: Slow Training +**Solution:** Increase num_workers and use larger batches +```bash +python examples/train.py --batch_size 256 --num_workers 16 +``` + +### Issue: Poor Convergence +**Solution:** +1. Check learning rate (try 1e-4 or 1e-5) +2. Use learning rate warmup +3. Add data augmentation +4. Try different normalization methods + +### Issue: Overfitting +**Solution:** +1. Add dropout +2. Use weight decay +3. Add more data augmentation +4. Use early stopping + +## Performance Benchmarks + +Training times on NVIDIA A100 (40GB): + +| Modality | Batch Size | Epochs | Time per Epoch | Total Time | +|----------|------------|--------|----------------|------------| +| RGB | 1024 | 100 | ~2 min | ~3.5 hours | +| HSI | 128 | 100 | ~5 min | ~8 hours | +| LiDAR | 1024 | 100 | ~1 min | ~2 hours | + +Memory requirements: +- RGB: ~8GB GPU memory (batch_size=1024) +- HSI: ~12GB GPU memory (batch_size=128) +- LiDAR: ~4GB GPU memory (batch_size=1024) diff --git a/docs/visualization.md b/docs/visualization.md new file mode 100644 index 0000000..eef327c --- /dev/null +++ b/docs/visualization.md @@ -0,0 +1,353 @@ +# Visualization Guide + +This guide covers data visualization tools for exploring the NEON tree classification dataset. 
+ +## Overview + +The package includes comprehensive visualization tools for all three modalities: +- RGB imagery +- Hyperspectral (HSI) data +- LiDAR canopy height models + +## Example Visualizations + +| RGB Image | HSI Pseudo RGB | HSI PCA Decomposition | +|-----------|----------------|----------------------| +| ![RGB](../sample_plots/sample_rgb.png) | ![HSI](../sample_plots/sample_hsi.png) | ![HSI PCA](../sample_plots/sample_hsi_pca.png) | + +| HSI Spectral Signatures | LiDAR Canopy Height Model | +|-------------------------|---------------------------| +| ![Spectra](../sample_plots/sample_spectra.png) | ![LiDAR](../sample_plots/sample_lidar.png) | + +## Visualization Functions + +### Basic Usage + +```python +from neon_tree_classification.core.visualization import ( + plot_rgb, plot_hsi, plot_hsi_pca, plot_hsi_spectra, plot_lidar +) +from neon_tree_classification.core.dataset import NeonCrownDataset + +# Load dataset and get a sample +dataset = NeonCrownDataset.load() +sample_data = dataset[0] # Returns dict with rgb, hsi, lidar arrays + +# Visualize each modality +plot_rgb(sample_data['rgb']) # True color RGB image +plot_hsi(sample_data['hsi']) # Pseudo RGB from HSI bands +plot_hsi_pca(sample_data['hsi']) # PCA decomposition to 3 components +plot_hsi_spectra(sample_data['hsi']) # Spectral signatures +plot_lidar(sample_data['lidar']) # Canopy height model with colorbar +``` + +## RGB Visualization + +Display true-color RGB images: + +```python +import matplotlib.pyplot as plt +from neon_tree_classification.core.visualization import plot_rgb + +# Single sample +plot_rgb(sample_data['rgb']) +plt.title('Red Maple (Acer rubrum)') +plt.show() + +# Multiple samples +fig, axes = plt.subplots(2, 3, figsize=(12, 8)) +for i, ax in enumerate(axes.flat): + sample = dataset[i] + plot_rgb(sample['rgb'], ax=ax) + ax.set_title(f"Sample {i}") +plt.tight_layout() +plt.show() +``` + +## Hyperspectral Visualization + +### Pseudo RGB from HSI + +Display hyperspectral data as RGB 
by selecting specific wavelength bands: + +```python +from neon_tree_classification.core.visualization import plot_hsi + +# Default uses bands at ~660nm (red), ~550nm (green), ~450nm (blue) +plot_hsi(sample_data['hsi']) +plt.show() +``` + +### PCA Decomposition + +Visualize hyperspectral data using Principal Component Analysis: + +```python +from neon_tree_classification.core.visualization import plot_hsi_pca + +# Decompose 369 bands into 3 principal components +plot_hsi_pca(sample_data['hsi']) +plt.title('HSI PCA (RGB = PC1, PC2, PC3)') +plt.show() +``` + +### Spectral Signatures + +Plot spectral reflectance curves: + +```python +from neon_tree_classification.core.visualization import plot_hsi_spectra + +# Plot mean spectrum and sample pixels +plot_hsi_spectra(sample_data['hsi']) +plt.title('Hyperspectral Signature') +plt.xlabel('Wavelength (nm)') +plt.ylabel('Reflectance') +plt.show() +``` + +## LiDAR Visualization + +Display canopy height models: + +```python +from neon_tree_classification.core.visualization import plot_lidar + +# Display height map with colorbar +plot_lidar(sample_data['lidar']) +plt.title('Canopy Height Model') +plt.show() +``` + +## Interactive Visualization Notebook + +For comprehensive data exploration, use the included Jupyter notebook: + +```bash +# Start Jupyter +uv run jupyter notebook + +# Open the visualization notebook +# Navigate to: notebooks/visualization.ipynb +``` + +The notebook includes: +- Dataset overview and statistics +- Interactive visualizations for all modalities +- Species distribution plots +- Site-wise data exploration +- Temporal analysis + +## Custom Visualizations + +### Comparing Species + +```python +import matplotlib.pyplot as plt +from scripts.get_dataloaders import get_dataloaders +from neon_tree_classification.core.visualization import plot_rgb + +# Load data +train_loader, _ = get_dataloaders(config='large', batch_size=100) + +# Get samples from different species +species_samples = {} +for batch in 
train_loader: + for i in range(len(batch['species_idx'])): + species_idx = batch['species_idx'][i].item() + if species_idx not in species_samples: + species_samples[species_idx] = batch['rgb'][i] + if len(species_samples) >= 6: + break + if len(species_samples) >= 6: + break + +# Plot comparison +fig, axes = plt.subplots(2, 3, figsize=(12, 8)) +for (species_idx, rgb), ax in zip(species_samples.items(), axes.flat): + plot_rgb(rgb.numpy(), ax=ax) + ax.set_title(f"Species {species_idx}") +plt.tight_layout() +plt.show() +``` + +### Multi-Modal Comparison + +```python +import matplotlib.pyplot as plt +from neon_tree_classification.core.visualization import plot_rgb, plot_hsi, plot_lidar + +# Get a sample with all modalities (`dataset` as loaded in Basic Usage) +sample = dataset[0] + +# Create multi-modal visualization +fig, axes = plt.subplots(1, 3, figsize=(15, 5)) + +# RGB +plot_rgb(sample['rgb'], ax=axes[0]) +axes[0].set_title('RGB') + +# HSI Pseudo RGB +plot_hsi(sample['hsi'], ax=axes[1]) +axes[1].set_title('Hyperspectral (Pseudo RGB)') + +# LiDAR +plot_lidar(sample['lidar'], ax=axes[2]) +axes[2].set_title('LiDAR CHM') + +plt.tight_layout() +plt.show() +``` + +### Batch Visualization + +```python +import matplotlib.pyplot as plt +from scripts.get_dataloaders import get_dataloaders +from neon_tree_classification.core.visualization import plot_rgb + +# Get a batch +train_loader, _ = get_dataloaders(config='large', batch_size=16) +batch = next(iter(train_loader)) + +# Visualize batch +fig, axes = plt.subplots(4, 4, figsize=(12, 12)) +for i, ax in enumerate(axes.flat): + if i < len(batch['rgb']): + plot_rgb(batch['rgb'][i].numpy(), ax=ax) + label = batch['species_idx'][i].item() + ax.set_title(f"Label: {label}", fontsize=8) + ax.axis('off') +plt.tight_layout() +plt.show() +``` + +## Visualization Tips + +### 1. 
Color Normalization + +For better visualization, normalize data appropriately: + +```python +import numpy as np + +# RGB: clip to [0, 1] (assumes values are already roughly in that range) +rgb_normalized = np.clip(rgb_data, 0, 1) + +# HSI: min-max normalize for visualization (epsilon guards against constant input) +hsi_normalized = (hsi_data - hsi_data.min()) / (hsi_data.max() - hsi_data.min() + 1e-8) + +# LiDAR: Height values are already meaningful +lidar_viz = lidar_data +``` + +### 2. Handling Different Resolutions + +```python +from scipy.ndimage import zoom + +# Resize HSI to match RGB resolution for side-by-side comparison +hsi_upsampled = zoom(sample['hsi'], (1, 128/12, 128/12), order=1) +``` + +### 3. Saving Figures + +```python +import matplotlib.pyplot as plt + +# High-quality figure +fig, ax = plt.subplots(figsize=(10, 10), dpi=300) +plot_rgb(sample['rgb'], ax=ax) +plt.title('High Resolution RGB') +plt.savefig('output.png', dpi=300, bbox_inches='tight') +plt.close() +``` + +### 4. Creating Animations + +```python +import matplotlib.pyplot as plt +import matplotlib.animation as animation + +# Animate through spectral bands +fig, ax = plt.subplots() +ims = [] +for band in range(sample['hsi'].shape[0]): + im = ax.imshow(sample['hsi'][band], cmap='gray', animated=True) + ims.append([im]) + +ani = animation.ArtistAnimation(fig, ims, interval=50, blit=True) +# GIF output needs an image-capable writer; Pillow is a common choice +ani.save('spectral_animation.gif', writer='pillow') +``` + +## Exporting Visualizations + +### Generate Sample Images + +The repository includes a script to generate sample visualization images: + +```bash +python scripts/generate_sample_images.py +``` + +This creates example images in the `sample_plots/` directory. 
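To confirm that the export produced what the guide expects, a small check like the one below can help. It is a sketch under stated assumptions: the five file names are taken from the Example Visualizations table in this guide, while the `missing_sample_plots` helper is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names referenced by the Example Visualizations table in this guide.
EXPECTED_PLOTS = (
    "sample_rgb.png",
    "sample_hsi.png",
    "sample_hsi_pca.png",
    "sample_spectra.png",
    "sample_lidar.png",
)


def missing_sample_plots(directory="sample_plots"):
    """Return the expected plot file names that are absent or empty.

    Hypothetical helper: it only inspects the file system and does not
    validate image contents.
    """
    root = Path(directory)
    return [
        name
        for name in EXPECTED_PLOTS
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]
```

Calling `missing_sample_plots()` from the repository root after running the script should return an empty list if every expected image was written.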
+ +## Advanced Visualizations + +### Spectral Analysis + +```python +import numpy as np +import matplotlib.pyplot as plt + +# Extract mean spectrum per species +# (assumes train_loader was created with 'hsi' among its modalities) +species_spectra = {} +for batch in train_loader: + for i in range(len(batch['species_idx'])): + species = batch['species_idx'][i].item() + hsi = batch['hsi'][i].numpy() + mean_spectrum = hsi.mean(axis=(1, 2)) # Average over spatial dimensions + + if species not in species_spectra: + species_spectra[species] = [] + species_spectra[species].append(mean_spectrum) + +# Plot comparison +plt.figure(figsize=(12, 6)) +for species, spectra in list(species_spectra.items())[:5]: # First 5 species encountered + mean = np.mean(spectra, axis=0) + plt.plot(mean, label=f'Species {species}') +plt.xlabel('Band Number') +plt.ylabel('Mean Reflectance') +plt.legend() +plt.title('Spectral Signatures by Species') +plt.show() +``` + +### Height Distribution + +```python +# Analyze LiDAR heights +heights = [] +for batch in train_loader: + lidar = batch['lidar'].numpy() + heights.extend(lidar.flatten()) + +plt.figure(figsize=(10, 6)) +plt.hist(heights, bins=100, edgecolor='black') +plt.xlabel('Height (m)') +plt.ylabel('Frequency') +plt.title('Canopy Height Distribution') +plt.show() +``` + +## Troubleshooting + +### Issue: Images appear too dark/bright +**Solution:** Adjust normalization or use `plt.clim()` + +### Issue: HSI visualization looks strange +**Solution:** Check band selection or use PCA visualization instead + +### Issue: Can't see LiDAR features +**Solution:** Adjust colormap range or use logarithmic scale + +### Issue: Out of memory with large batches +**Solution:** Reduce batch size or visualize fewer samples at once diff --git a/scripts/update_readme.py b/scripts/update_readme.py index 4feba99..bb2f1cc 100644 --- a/scripts/update_readme.py +++ b/scripts/update_readme.py @@ -1,8 +1,21 @@ #!/usr/bin/env python3 """ -Simple script to update README.md with current dataset statistics using only pandas. 
+⚠️ DEPRECATED: This script generates the old verbose README format. -Usage: +The current README has been simplified (README.md) and focuses on essentials only. +Detailed documentation is now in the docs/ directory. + +This script is kept for reference but should not be used to regenerate the README. +To update dataset statistics, manually edit: +- README.md (lines 8-13: main statistics in Dataset Overview section) +- README.md (lines 113-117: top 5 species in Dataset Details section) +- README.md (lines 122-126: top 5 sites in Geographic Distribution section) + +For maintaining consistency, consider creating a simple stats updater that only +updates specific statistics in the simplified README rather than regenerating +the entire file. + +Original usage: python scripts/update_readme.py --csv /path/to/data.csv --readme /path/to/README.md """