BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model
Note
Video Processing Support: The model has been upgraded to support video input processing, expanding its capabilities beyond static images to dynamic temporal analysis.
Tip
Enhanced Change Extraction Modeling: The Change Extraction (CE) Module now supports the addition of Transformer layers. This allows for more complex modeling of spatiotemporal correlations, further improving performance beyond standard convolutional approaches. To reproduce the original paper configuration, set cc_n_layers to 0.
BTCChat is a multi-temporal Multimodal Large Language Model (MLLM) designed for remote sensing. It specifically addresses the gap in bi-temporal change understanding. By introducing a novel Change Extraction (CE) Module and a Prompt Augmentation (PA) mechanism, BTCChat achieves state-of-the-art performance in change captioning while retaining robust capabilities for single-image visual question answering.
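As a purely conceptual sketch of how a bi-temporal change-captioning query is assembled in a LLaVA/VILA-style pipeline, one image placeholder is inserted per temporal frame ahead of the instruction. The function name, file paths, and question text below are illustrative only and are not part of the repository's API.

# Conceptual sketch only; not the repository's actual inference entry point.
from PIL import Image

IMAGE_TOKEN = "<image>"  # LLaVA/VILA-style visual placeholder

def build_change_caption_prompt(question: str, num_frames: int = 2) -> str:
    """Prepend one image placeholder per temporal frame to the instruction."""
    placeholders = "\n".join([IMAGE_TOKEN] * num_frames)
    return f"{placeholders}\n{question}"

# Bi-temporal pair: the same scene acquired at time T1 and at time T2 (placeholder paths).
frames = [Image.open("scene_t1.png"), Image.open("scene_t2.png")]
prompt = build_change_caption_prompt(
    "Describe the changes between these two remote sensing images."
)
# `frames` and `prompt` would then be passed to the model's preprocessing and
# generation routine; see the repository scripts for the real entry points.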
conda create -n btcchat python=3.10 -y
conda activate btcchat
pip install --upgrade pip # enable PEP 660 support
# This step is optional if you prefer to use the system's built-in nvcc.
conda install -c nvidia cuda-toolkit -y
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e .
pip install -e ".[train]"
pip install git+https://github.com/huggingface/transformers@v4.36.2
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* $site_pkg_path/transformers/
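An optional sanity check after installation: the expected versions follow directly from the pinned wheel and the pinned transformers commit above.

# Optional: confirm the pinned libraries import correctly.
import flash_attn
import transformers

print(transformers.__version__)  # expected: 4.36.2 (patched copy installed above)
print(flash_attn.__version__)    # expected: 2.4.2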
VILA/
├── llava/ # Core Model Code
│ ├── model/ # Model Architecture
│ │ ├── llava_arch.py # VILA Main Architecture (LlavaMetaModel)
│ │ ├── multimodal_encoder/ # Visual Encoder
│ │ │ ├── siglip_encoder.py # SigLIP Vision Tower
│ │ │ └── siglip/ # SigLIP Implementation
│ │ ├── multimodal_projector/ # Multimodal Projector
│ │ │ ├── builder.py # Projector Builder
│ │ │ ├── base_projector.py # Projector Base Class & CE Integration
│ │ │ └── model_encoder.py # AttentiveEncoder (Change Extraction Logic)
│ │ └── change_encoder/ # Change Extraction Module Configuration
│ │ └── builder.py # Parameter Definitions
│ ├── train/ # Training Code
│ │ ├── train.py # Main Training Entry Point
│ │ ├── train_mem.py # Memory Optimized Training Entry
│ │ ├── args.py # Training Argument Definitions
│ │ └── llava_trainer.py # Custom Trainer
│ └── data/ # Data Processing
│ ├── dataset.py # Dataset Class Definitions
│ └── datasets_mixture.py # Dataset Mixture Configuration
├── scripts/ # Training Scripts
│ └── v1_5/release/3b/ # 3B Model Training Scripts
│ └── sft.sh # Joint Instruction Tuning
└── README.md # This Documentation
Input (Image(s) / Videos) --> Vision Tower (SigLIP) --> MM Projector --> LLM (Sheared-LLaMA) --> Textual Output
                                                             ↑
                                               Change Extraction Module
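The flow above can be summarized with the following schematic pseudo-code. It is a simplified sketch: vision_tower, change_extractor, mm_projector, and llm are placeholders for the corresponding components, and the way the change features are merged with the per-frame features is abbreviated; consult the code under llava/model for the actual implementation.

import torch

def forward_flow(frames, vision_tower, change_extractor, mm_projector, llm, prompt_ids):
    # 1. The SigLIP vision tower encodes each input frame
    #    (a single image, a bi-temporal pair, or video frames).
    visual_feats = [vision_tower(f) for f in frames]                    # each: [B, N_patches, D_vis]

    # 2. For a bi-temporal pair, the Change Extraction Module produces
    #    fine-grained spatiotemporal change features from the two encodings.
    if len(visual_feats) == 2:
        visual_feats.append(change_extractor(visual_feats[0], visual_feats[1]))

    # 3. The multimodal projector (MLP with 2x2 downsampling) maps visual
    #    features into the LLM embedding space.
    visual_tokens = torch.cat([mm_projector(f) for f in visual_feats], dim=1)

    # 4. Sheared-LLaMA attends over the interleaved text and visual tokens
    #    and generates the textual output (e.g., a change caption).
    return llm.generate(prompt_ids, visual_tokens)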
- Visual Encoder
- Model: google/siglip-so400m-patch14-384
- Function: Encodes input images/video into visual features.
- Configuration:
- mm_vision_select_feature: cls_patch - Uses CLS token and patch features.
- mm_vision_select_layer: -2 - Uses features from the second to last layer.
- Multimodal Projector
- Type: MLP with 2x2 downsampling
- Function: Maps visual features into the LLM's embedding space.
- Core Class: MultipleApplicationLayers
- Change Extraction (CE) Module
- Core Module: AttentiveEncoder (corresponds to $\mathcal{M}_c$ in the paper)
- Function: Processes bi-temporal remote sensing image pairs to extract fine-grained spatiotemporal correlations.
- Enhanced Capability: As noted in the notices above, this module supports adding Transformer layers (self.selftrans) to enhance feature interaction, in addition to the cosine similarity and multi-layer CNN described in the BTCChat paper (see the sketch after this component list).
- Large Language Model
- Base Model: princeton-nlp/Sheared-LLaMA-2.7B (VILA-1.5 3B base)
- Function: Generates textual responses based on multimodal embeddings.
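To make the CE description above concrete, here is a minimal, self-contained sketch of the idea, reusing the flags from this README (cc_n_layers, cc_head, cc_dropout) and the self.selftrans attribute name. It is not the repository's AttentiveEncoder: the layer counts, kernel sizes, and the way the cosine-similarity map is fused are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChangeExtractionSketch(nn.Module):
    """Illustrative stand-in for the CE Module: per-patch cosine similarity
    between the two acquisitions, a small CNN over the fused features, and
    optional Transformer layers (cc_n_layers > 0) for richer interaction."""

    def __init__(self, dim: int, grid: int, cc_n_layers: int = 0,
                 cc_head: int = 8, cc_dropout: float = 0.1):
        super().__init__()
        self.grid = grid                                   # patch grid side length (N = grid * grid)
        self.cnn = nn.Sequential(                          # multi-layer CNN over concatenated features
            nn.Conv2d(2 * dim + 1, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )
        self.selftrans = None                              # optional Transformer layers (new feature)
        if cc_n_layers > 0:                                # dim must be divisible by cc_head
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=cc_head,
                                               dropout=cc_dropout, batch_first=True)
            self.selftrans = nn.TransformerEncoder(layer, num_layers=cc_n_layers)

    def forward(self, feat_t1: torch.Tensor, feat_t2: torch.Tensor) -> torch.Tensor:
        # feat_t1 / feat_t2: [B, N, D] patch features of the T1 / T2 acquisitions.
        B, N, D = feat_t1.shape
        assert N == self.grid * self.grid, "patch count must form a square grid"
        sim = F.cosine_similarity(feat_t1, feat_t2, dim=-1)               # [B, N] per-patch similarity
        fused = torch.cat([feat_t1, feat_t2, sim.unsqueeze(-1)], dim=-1)  # [B, N, 2D + 1]
        grid = fused.transpose(1, 2).reshape(B, 2 * D + 1, self.grid, self.grid)
        change = self.cnn(grid).flatten(2).transpose(1, 2)                # back to [B, N, D]
        if self.selftrans is not None:
            change = self.selftrans(change)
        return change

With cc_n_layers=0 the optional Transformer stack is skipped, which matches the configuration used to reproduce the paper results.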
If you want to reproduce the results from the BTCChat paper, you don't need to add Transformer layers to Change Extraction; simply set cc_n_layers to 0.
For other specific training parameters, please refer to the settings in scripts/v1_5/release/3b/sft.sh.
# Key training switches (Stage 2): freeze the Vision Tower, the MM Projector,
# the CE Module (already trained in Stage 1) and the single projector, and
# train the LLM only.
# Change Extraction config: --chg enables change detection mode, --chg_type Chg2Cap
# selects the change-captioning task, --from_origin loads from the original checkpoint,
# --cc_n_layers sets the number of Transformer layers in the CE Module (new feature),
# and --cc_head sets the number of attention heads.
deepspeed --master_port=$((RANDOM + 10000)) --include localhost:0 \
    llava/train/train_mem.py \
    --deepspeed scripts/zero2_offload.json \
    --model_name_or_path /path/to/the/pre-trained/change_extraction_checkpoint/ \
    --version v1 \
    --data_mixture geochat_instruct+levir_cc \
    --vision_tower google/siglip-so400m-patch14-384 \
    --mm_vision_select_feature cls_patch \
    --mm_projector mlp_downsample \
    --tune_vision_tower False \
    --tune_mm_projector False \
    --tune_cc_projector False \
    --tune_single_projector False \
    --tune_language_model True \
    --chg True \
    --chg_type Chg2Cap \
    --from_origin True \
    --cc_n_layers 0 \
    --cc_head 8 \
    --bf16 True \
    --num_train_epochs 1 \
    --learning_rate 1e-4

To add a new dataset:
- Add the dataset configuration in datasets_mixture.py (see the sketch below).
- If special processing is needed, add a new Dataset class in dataset.py.
- Register the new data type in the build_datasets() function.
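A hypothetical registration sketch for the first step, assuming datasets_mixture.py follows the upstream VILA convention of a Dataset record plus an add_dataset() helper; the exact field names should be checked against the file in this repository. The dataset name and paths below are placeholders.

# Inside llava/data/datasets_mixture.py (where Dataset and add_dataset are defined):
my_change_caption = Dataset(
    dataset_name="my_change_caption",                          # placeholder name
    dataset_type="torch",
    data_path="/path/to/my_change_caption/annotations.json",   # placeholder path
    image_path="/path/to/my_change_caption/images",            # placeholder path
    description="Bi-temporal change captioning data in conversation format.",
)
add_dataset(my_change_caption)

# The registered name can then be mixed into training via --data_mixture, e.g.
# --data_mixture geochat_instruct+levir_cc+my_change_caption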
To adjust the Change Extraction Module, modify the following arguments:
- --cc_n_layers: (Enhanced) Increase the number of Transformer layers to improve modeling capability.
- --cc_head: Increase the number of attention heads.
- --cc_dropout: Adjust regularization strength.
If you run out of GPU memory:
- Use the zero2_offload.json configuration for CPU offload.
- Reduce --per_device_train_batch_size.
- Increase --gradient_accumulation_steps to maintain the effective batch size.
- Enable --gradient_checkpointing True.
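For example, halving --per_device_train_batch_size from 8 to 4 while doubling --gradient_accumulation_steps from 4 to 8 keeps the effective batch size per GPU at 8 × 4 = 4 × 8 = 32, while roughly halving peak activation memory.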