
BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model

📢 Notices & New Features

Note

Video Processing Support: The model has been upgraded to support video input processing, expanding its capabilities beyond static images to dynamic temporal analysis.

Tip

Enhanced Change Extraction Modeling: The Change Extraction (CE) Module now supports the addition of Transformer layers, which allows more complex modeling of spatiotemporal correlations and further improves performance beyond standard convolutional approaches.

⚠️ Reproducibility Note: To reproduce the results reported in the BTCChat paper, please set the training parameter cc_n_layers to 0.


Project Overview

BTCChat is a multi-temporal Multimodal Large Language Model (MLLM) designed for remote sensing. It specifically addresses the gap in bi-temporal change understanding. By introducing a novel Change Extraction (CE) Module and a Prompt Augmentation (PA) mechanism, BTCChat achieves state-of-the-art performance in change captioning while retaining robust capabilities for single-image visual question answering.


Installation

conda create -n btcchat python=3.10 -y
conda activate btcchat

pip install --upgrade pip  # enable PEP 660 support
# optional: skip this if you prefer to use the system's built-in nvcc
conda install -c nvidia cuda-toolkit -y
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e .
pip install -e ".[train]"

pip install git+https://github.com/huggingface/transformers@v4.36.2
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* $site_pkg_path/transformers/
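
To confirm the environment is set up correctly, a quick check such as the following can be run. This is a minimal sketch that only verifies the pinned packages import and report the expected versions; the file name check_env.py is just a placeholder.

# check_env.py - quick sanity check of the pinned dependencies (illustrative only)
import torch
import transformers
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # expected: 4.36.2
print("flash-attn:", flash_attn.__version__)      # expected: 2.4.2

If flash_attn fails to import, make sure the wheel matches your Python, CUDA, and PyTorch versions; the wheel above targets Python 3.10, CUDA 11.8, and PyTorch 2.0.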

Directory Structure

VILA/
├── llava/                          # Core Model Code
│   ├── model/                      # Model Architecture
│   │   ├── llava_arch.py           # VILA Main Architecture (LlavaMetaModel)
│   │   ├── multimodal_encoder/     # Visual Encoder
│   │   │   ├── siglip_encoder.py   # SigLIP Vision Tower
│   │   │   └── siglip/             # SigLIP Implementation
│   │   ├── multimodal_projector/   # Multimodal Projector
│   │   │   ├── builder.py          # Projector Builder
│   │   │   ├── base_projector.py   # Projector Base Class & CE Integration
│   │   │   └── model_encoder.py    # AttentiveEncoder (Change Extraction Logic)
│   │   └── change_encoder/         # Change Extraction Module Configuration
│   │       └── builder.py          # Parameter Definitions
│   ├── train/                      # Training Code
│   │   ├── train.py                # Main Training Entry Point
│   │   ├── train_mem.py            # Memory Optimized Training Entry
│   │   ├── args.py                 # Training Argument Definitions
│   │   └── llava_trainer.py        # Custom Trainer
│   └── data/                       # Data Processing
│       ├── dataset.py              # Dataset Class Definitions
│       └── datasets_mixture.py     # Dataset Mixture Configuration
├── scripts/                        # Training Scripts
│   └── v1_5/release/3b/            # 3B Model Training Scripts
│       └── sft.sh                  # Joint Instruction Tuning
└── README.md                      # This Documentation

Model Architecture

Overall Architecture

Input (Images / Videos) --> Vision Tower (SigLIP) --> MM Projector --> LLM (Sheared-LLaMA) --> Textual Output
                                                           ↑
                                               Change Extraction Module

Core Components

  1. Visual Encoder
  • Model: google/siglip-so400m-patch14-384
  • Function: Encodes input images/videos into visual features.
  • Configuration:
    • mm_vision_select_feature: cls_patch - Uses the CLS token and patch features.
    • mm_vision_select_layer: -2 - Uses features from the second-to-last layer.
  2. Multimodal Projector
  • Type: MLP with 2x2 downsampling
  • Function: Maps visual features into the LLM's embedding space.
  • Core Class: MultipleApplicationLayers
  3. Change Extraction (CE) Module
  • Core Module: AttentiveEncoder (refers to $\mathcal{M}_c$ in the paper)
  • Function: Processes bi-temporal remote sensing image pairs to extract fine-grained spatiotemporal correlations (see the sketch after this list).
  • Enhanced Capability: As noted in the notices above, this module supports adding Transformer layers (self.selftrans) to enhance feature interaction, in addition to the cosine similarity and multi-layer CNN described in the BTCChat paper.
  4. Large Language Model
  • Base Model: princeton-nlp/Sheared-LLaMA-2.7B (VILA-1.5 3B base)
  • Function: Generates textual responses based on the multimodal embeddings.
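
The snippet below is a minimal, illustrative sketch of the mechanism described above, not the repository's AttentiveEncoder: cosine similarity between the bi-temporal patch features, a small multi-layer CNN that fuses them, and optional Transformer layers that are created only when cc_n_layers > 0. The class name ChangeExtractionSketch and the exact layer shapes are assumptions made for illustration.

# Illustrative sketch only -- not the repository's AttentiveEncoder.
# It mirrors the described mechanism: cosine similarity between bi-temporal
# patch features, a small multi-layer CNN fusion, and optional Transformer
# layers (cc_n_layers = 0 matches the setting used in the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChangeExtractionSketch(nn.Module):
    def __init__(self, dim, cc_n_layers=0, cc_head=8, cc_dropout=0.1):
        super().__init__()
        # CNN over the concatenated bi-temporal features plus the similarity map
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * dim + 1, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )
        # optional Transformer layers for deeper spatiotemporal interaction
        self.selftrans = None
        if cc_n_layers > 0:
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=cc_head, dropout=cc_dropout, batch_first=True
            )
            self.selftrans = nn.TransformerEncoder(layer, num_layers=cc_n_layers)

    def forward(self, feat_t1, feat_t2):
        # feat_t1, feat_t2: (B, N, C) patch features for the two acquisition dates
        B, N, C = feat_t1.shape
        H = W = int(N ** 0.5)                                          # assume a square patch grid
        sim = F.cosine_similarity(feat_t1, feat_t2, dim=-1)            # (B, N)
        x = torch.cat([feat_t1, feat_t2, sim.unsqueeze(-1)], dim=-1)   # (B, N, 2C+1)
        x = x.transpose(1, 2).reshape(B, 2 * C + 1, H, W)
        x = self.fuse(x).flatten(2).transpose(1, 2)                    # (B, N, C)
        if self.selftrans is not None:
            x = self.selftrans(x)
        return x                                                       # change-aware features for the projector

With cc_n_layers set to 0 the Transformer branch is skipped entirely, which corresponds to the configuration used for the results reported in the paper.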

Training

If you want to reproduce the results from the BTCChat paper, you do not need to add Transformer layers to the Change Extraction (CE) Module; simply set cc_n_layers to 0.

For other specific training parameters, please refer to the settings in scripts/v1_5/release/3b/sft.sh.

Key switches in the Stage 2 (joint instruction tuning) configuration: the vision tower, the multimodal projector, the CE Module (already trained in Stage 1), and the single projector are frozen (--tune_vision_tower, --tune_mm_projector, --tune_cc_projector, and --tune_single_projector set to False), and only the language model is trained (--tune_language_model True). The Change Extraction Module is configured with --chg True (enable change detection mode), --chg_type Chg2Cap (change captioning task), --from_origin True (load from original), --cc_n_layers 0 (number of Transformer layers in the CE Module, new feature), and --cc_head 8 (number of attention heads).

deepspeed --master_port=$((RANDOM + 10000)) --include localhost:0 \
    llava/train/train_mem.py \
    --deepspeed scripts/zero2_offload.json \
    --model_name_or_path /path/to/the/pre-trained/change_extraction_checkpoint/ \
    --version v1 \
    --data_mixture geochat_instruct+levir_cc \
    --vision_tower google/siglip-so400m-patch14-384 \
    --mm_vision_select_feature cls_patch \
    --mm_projector mlp_downsample \
    --tune_vision_tower False \
    --tune_mm_projector False \
    --tune_cc_projector False \
    --tune_single_projector False \
    --tune_language_model True \
    --chg True \
    --chg_type Chg2Cap \
    --from_origin True \
    --cc_n_layers 0 \
    --cc_head 8 \
    --bf16 True \
    --num_train_epochs 1 \
    --learning_rate 1e-4

FAQ

Q1: How do I add a new remote sensing dataset?

  1. Add the dataset configuration in datasets_mixture.py (see the sketch below).

  2. If special processing is needed, add a new Dataset class in dataset.py.

  3. Register the new data type in the build_datasets() function.
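
As an illustration, registering a new change-captioning dataset in datasets_mixture.py could look like the following. This is a minimal sketch: the Dataset fields and the add_dataset helper mirror the pattern used by the existing entries, and the my_change_cc name and paths are placeholders, so check the file for the exact field names before adding your own entry.

# In llava/data/datasets_mixture.py (illustrative sketch; field names are
# assumptions -- mirror an existing entry such as levir_cc)
my_change_cc = Dataset(
    dataset_name="my_change_cc",           # placeholder name used in --data_mixture
    dataset_type="torch",                  # or whichever type build_datasets() expects
    data_path="/path/to/annotations.json",
    image_path="/path/to/images",
)
add_dataset(my_change_cc)

Once registered, the dataset can be referenced in the --data_mixture training argument, e.g. geochat_instruct+levir_cc+my_change_cc.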

Q2: How do I adjust the complexity of the Change Extraction (CE) Module? (New Feature)

Modify the following arguments:

  • --cc_n_layers: (Enhanced) Increase the number of Transformer layers to improve modeling capability.

  • --cc_head: Increase the number of attention heads.

  • --cc_dropout: Adjust regularization strength.

Q3: What if I run out of VRAM?

  • Use the zero2_offload.json configuration for CPU offload.

  • Reduce --per_device_train_batch_size.

  • Increase --gradient_accumulation_steps to maintain the effective batch size (effective batch size = per-device batch size × gradient accumulation steps × number of GPUs).

  • Enable --gradient_checkpointing True.
