
Node-Level Data Valuation on Graphs

This repository contains the implementation of data valuation techniques for graph-structured data, as described in our paper Node-Level Data Valuation on Graphs.

Installation

git clone https://github.com/siantonelli/graph_valuation.git
cd graph_valuation
uv sync

Data and Storage Management

The project automatically handles data and storage directories for you:

  • Datasets will be automatically downloaded to cfg.core.data_dir (defaults to ./data/) when first needed
  • Results will be saved to cfg.core.storage_dir (defaults to ./storage/)

These directories are created automatically if they don't exist. You can customize these paths by modifying the configuration or setting environment variables. The default paths are relative to your project root directory.
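For instance, the data and storage locations can be redirected with Hydra command-line overrides (introduced in the next section). The key names here are an assumption based on the cfg.core.* paths above, and the /scratch locations are placeholders:

# Hypothetical override; the exact keys follow the cfg.core.* paths named above
uv run src/graph_valuation/scripts/attribute_values.py core.data_dir=/scratch/data \
                                                        core.storage_dir=/scratch/storage \
                                                        approach=training/banzhaf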

Configuration Management

This project uses Hydra for configuration management. Hydra enables you to compose your configuration dynamically and override any parameter from the command line. For more details on how Hydra works, visit hydra.cc.
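The commands in this README rely on three flavors of Hydra's standard override grammar, which you will see in the examples below:

key=value     # override a value that exists in the composed config
+key=value    # add a key that is not part of the composed config
'~key'        # remove a key (quoted so the shell does not expand the ~)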

Usage

Running with Scripts

The main script for computing node values is attribute_values.py. Here's a basic example:

uv run src/graph_valuation/scripts/attribute_values.py approach=training/banzhaf \
                                                        train/model=sgc \
                                                        train.n_models=1 \
                                                        dataset=cora_ml \
                                                        data.data_seed=42 \
                                                        task.mode=transductive \
                                                        approach.num_subsets=50000 \
                                                        approach.alpha=0.1 \
                                                        approach.setting=all \
                                                        task.induced_subgraph=True

Available Approaches

Different data valuation methods can be used by changing the approach parameter. Available approaches are defined in conf/approach/training/:

  • banzhaf - Data Banzhaf valuation
  • shapley - Data Shapley valuation
  • loo - Leave-One-Out estimation
  • datamodel - DataModel approach
  • pc_winter - PC-Winter valuation

Simply replace approach=training/banzhaf with any other approach from the configuration files.
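For instance, to compute Data Shapley values instead of Data Banzhaf (a sketch; approach-specific parameters such as approach.num_subsets may differ between methods, so check the corresponding YAML in conf/approach/training/):

uv run src/graph_valuation/scripts/attribute_values.py approach=training/shapley \
                                                        train/model=sgc \
                                                        dataset=cora_ml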

Note: The inference pipeline (conf/approach/inference/) is currently a work in progress; use the training approaches for stable functionality.

Computing Through Code

You can also use the data valuation approaches programmatically in your Python code:

from pathlib import Path
import torch
from omegaconf import OmegaConf
from torch_geometric.datasets import CitationFull
import torch_geometric.transforms as T

from graph_valuation.approaches import DataBanzhaf, DataShapley, Datamodel
from graph_valuation.models import SGC

# Load dataset directly
pre_transform = T.Compose([
    T.LargestConnectedComponents(),
    T.ToUndirected()
])
transform = T.RandomNodeSplit(split="test_rest", num_val=140)
dataset = CitationFull(
    root="data/cora_ml", 
    name="cora_ml",
    pre_transform=pre_transform,
    transform=transform
)
data = dataset[0]

# Configure training settings directly
train_cfg = OmegaConf.create({
    "torch_device": None,
    "device": "cpu",
    "deterministic": True,
    "n_models": 1,
    "epochs": 3000,
    "patience": 50,
    "model_seed": 42,
    "n_jobs": -1,
    "model": {
        "_target_": "graph_valuation.models.SGC",
        "in_channels": data.num_features,
        "out_channels": data.y.max().item() + 1
    },
    "optimizer": {
        "_target_": "torch.optim.Adam",
        "lr": 1e-2,
        "betas": [0.9, 0.999],
        "eps": 1e-08,
        "weight_decay": 5e-4
    }
})

# Set up results directory
results_dir = Path("storage/example_run")
results_dir.mkdir(parents=True, exist_ok=True)

# Initialize a data valuation approach
valuator = DataBanzhaf(
    filename="example_banzhaf",
    n_subsets=1000,           # Number of subsets to sample
    alpha=0.1,                # Subset size parameter
    setting="all",            # Which nodes to value: "all", "train"
    induced_subgraph=True,
    training_mode="transductive",
    train_cfg=train_cfg,
    results_dir=results_dir,
)

# Compute node values
print("Computing node values...")
node_values = valuator.attribute_values(
    data=data, 
    utility_fn="margins",     # Utility function: "margins", "test_accs", "val_accs"
    learning_signal=True      # Whether to use learning signal
)

print(f"Computed values for {len(node_values)} nodes")
print(f"Top 5 most valuable nodes: {node_values.argsort()[-5:]}")

# You can also use other approaches:
# DataShapley, Datamodel, LOO, PCWinter
# Each has slightly different parameters - see the approach files for details

Experiments and Evaluation

The repository provides several experimental scripts to evaluate different aspects of data valuation methods.

Computing and Evaluating Node Values

To evaluate pre-computed node values, first run attribute_values.py to generate the valuation results, then use the evaluation scripts below.

Node Influence Evaluation

Evaluate how well the computed values identify influential nodes by removing the most highly valued nodes and measuring the impact on model performance:

uv run src/graph_valuation/scripts/evaluate_nodes_influence.py approach=training/banzhaf \
                                                               train/model=sgc \
                                                               train.n_models=1 \
                                                               dataset=cora_ml \
                                                               data.data_seed=42 \
                                                               task.mode=transductive \
                                                               approach.num_subsets=50000 \
                                                               approach.alpha=0.1 \
                                                               approach.setting=all \
                                                               task.induced_subgraph=True

Predictive Support Evaluation

Assess the quality of computed values by measuring their predictive power for model behavior:

uv run src/graph_valuation/scripts/evaluate_predictive_supports.py approach=training/banzhaf \
                                                                    train/model=sgc \
                                                                    train.n_models=1 \
                                                                    dataset=cora_ml \
                                                                    data.data_seed=42 \
                                                                    task.mode=transductive \
                                                                    approach.num_subsets=50000 \
                                                                    approach.alpha=0.1 \
                                                                    approach.setting=all \
                                                                    task.induced_subgraph=True

Note: These evaluation scripts expect that you have already computed node values using the same configuration parameters.
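One way to keep the two steps consistent is to reuse a single set of overrides, sketched here with a hypothetical shell variable (the flags are the same ones used in the examples above):

# Reuse one set of Hydra overrides for both the valuation and evaluation steps
FLAGS="approach=training/banzhaf train/model=sgc train.n_models=1 dataset=cora_ml \
       data.data_seed=42 task.mode=transductive approach.num_subsets=50000 \
       approach.alpha=0.1 approach.setting=all task.induced_subgraph=True"
uv run src/graph_valuation/scripts/attribute_values.py $FLAGS
uv run src/graph_valuation/scripts/evaluate_nodes_influence.py $FLAGS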

Standalone Experiments

The following experiments are standalone and compute their own valuations internally to measure specific characteristics of different approaches.

Linear Datamodeling Score (LDS) Experiments

Evaluate different approaches using the Linear Datamodeling Score framework across various subset sizes and approximation parameters:

uv run src/graph_valuation/scripts/evaluate_lds.py '~approach' \
                                                    train/model=sgc \
                                                    train.n_models=1 \
                                                    dataset=cora_ml \
                                                    data.data_seed=42 \
                                                    task.mode=transductive \
                                                    +induced_subgraph=True \
                                                    +n_subsets=50000 \
                                                    +alpha=0.1 \
                                                    +p_trunc=0.25 \
                                                    +label_trunc_ratio=0 \
                                                    +group_trunc_ratio_hop_1=0.99 \
                                                    +group_trunc_ratio_hop_2=0.99 \
                                                    +setting=all \
                                                    +subset_sizes='[1000,2500,5000,10000,25000,50000]' \
                                                    +alphas='[0.1, 0.25, 0.5, 0.75, 0.9]'

Memorization Experiments

Study memorization patterns across different valuation approaches using poisoned data detection:

uv run src/graph_valuation/scripts/run_memorization.py '~approach' \
                                                        train/model=sgc \
                                                        train.n_models=1 \
                                                        dataset=cora_ml \
                                                        data.data_seed=42 \
                                                        task.mode=transductive \
                                                        +poisoning_budget=0.1 \
                                                        +induced_subgraph=True \
                                                        +n_subsets=50000 \
                                                        +alpha=0.1 \
                                                        +p_trunc=0.25 \
                                                        +label_trunc_ratio=0 \
                                                        +group_trunc_ratio_hop_1=0.99 \
                                                        +group_trunc_ratio_hop_2=0.99 \
                                                        train.torch_device=cpu \
                                                        +high_logits=True \
                                                        +high_degree=False

Extensibility

The codebase is designed to be easily extensible for both custom models and data valuation approaches.

Custom Models

You can compute values for your own custom models by:

  1. Creating a new model class in src/graph_valuation/models/ following the existing patterns (see gcn.py, sgc.py, etc.)

  2. Adding a corresponding Hydra configuration YAML file in conf/train/model/

  3. Important: Ensure your model's forward pass follows the standard PyTorch Geometric conventions; if it doesn't, you'll need to implement your own pipeline in the training functions (training.py, inference.py). A minimal sketch of a conforming model follows below.
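As a reference point, here is a minimal sketch of a conforming model, assuming the standard PyG forward(x, edge_index) signature used by built-in layers. The class name TwoLayerGCN and the hidden size are illustrative, not part of the repository:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class TwoLayerGCN(torch.nn.Module):
    """Illustrative custom model; follows the PyG forward(x, edge_index) convention."""

    def __init__(self, in_channels: int, out_channels: int, hidden_channels: int = 64):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        # Two GCN layers with a ReLU in between; returns per-node logits
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

The matching Hydra YAML in conf/train/model/ would then set _target_ to the class path (e.g. graph_valuation.models.TwoLayerGCN), mirroring the _target_ entry in the train_cfg example above.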

Custom Data Valuation Approaches

You can implement your own data valuation methods by:

  1. Creating a new approach class in src/graph_valuation/approaches/ following the existing patterns (see the various approach implementations)

  2. Adding a corresponding Hydra configuration YAML file in conf/approach/training/ or conf/approach/inference/ depending on your approach type

  3. Key requirements for your custom approach:

    • Inherit from the appropriate base class
    • Implement the required methods for subset evaluation
    • Handle the interaction with the training/inference pipeline
    • Follow the expected input/output formats for compatibility with evaluation scripts
  4. Important: Your approach should be compatible with the existing experiment framework so that it works with the evaluation and analysis scripts; a hypothetical skeleton follows below.
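The sketch below shows roughly what such a class could look like. It is a hypothetical skeleton: the actual base class, hook names, and constructor arguments live in src/graph_valuation/approaches/ and may differ; only the attribute_values interface shown here is taken from the usage example above.

# Hypothetical skeleton; check src/graph_valuation/approaches/ for the real base class.
class MyValuation:  # in practice, inherit from the repository's approach base class
    def __init__(self, filename, train_cfg, results_dir, **approach_kwargs):
        self.filename = filename
        self.train_cfg = train_cfg
        self.results_dir = results_dir

    def attribute_values(self, data, utility_fn="margins", learning_signal=True):
        # Sample subsets, train models via the training pipeline, and aggregate
        # utilities into one value per node. Return a 1-D array of length
        # data.num_nodes so the evaluation scripts can consume it.
        raise NotImplementedError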

Accessing Experimental Data

If you need access to the specific datasets and experimental results used in our paper, please contact us at [antonelli at cs dot uni-koeln dot de]. We can provide guidance on obtaining the exact data configurations and pre-computed results that were used for the experiments described in the publication.

Citation

If you use this code in your research, please cite our paper:

@article{antonelli2025nodelevel,
  title = {Node-Level Data Valuation on Graphs},
  author = {Antonelli, Simone and Bojchevski, Aleksandar},
  journal = {Transactions on Machine Learning Research},
  issn = {2835-8856},
  year = {2025},
  url = {https://openreview.net/forum?id=tNyApIqDSJ},
}

License

This project is licensed under the MIT License.
