Preprocess command fails: missing args, logging bug, undefined var, parser/pipeline key mismatch, and cache filename inconsistency #4

@jh-source

Environment

  • OS: Linux (EL8, 4.18)
  • Python: 3.11 (conda)
  • PyTorch: 2.1.2+cu121
  • PyG: 2.6.1
  • Lightning: 2.4.0
  • torchmetrics: 1.8.1
  • RDKit: 2024.3.6
  • fair-esm: 2.0.0

Command

python -m scripts.preprocess.preprocess_data \
  --dataset pdbbind \
  --data_dir /path/to/PDBBIND_atomCorrected \
  --cache_path /path/to/processed/cache_xxx \
  --split_path /path/to/timesplit_xxx \
  --esm_embeddings_path /path/to/esm/esm_embeddings \
  --num_workers 20

Observed errors (from running the command)

  • Missing CLI arg in preprocess script
    • AttributeError: 'Namespace' object has no attribute 'bb_random_prior' (the attribute is read in parse_args, but the corresponding argument is never defined)
  • Incorrect logging usage in training pipeline
    • TypeError: 'module' object is not callable (the code calls the module itself, logging(...), instead of logging.info(...))
  • Wrong output path variable in training pipeline
    • Uses self.full_cache_path (undefined) when writing complex_names.pkl; should use self.config.cache_path
  • Missing attribute in training pipeline
    • AttributeError: 'TrainingDataPipeline' object has no attribute 'dataset'
  • Parser/pipeline dict key mismatch
    • ComplexParser.parse_protein() expects apo_rec_path/holo_rec_path, but TrainingDataPipeline.prepare_input_files() produces apo_protein_file/holo_protein_file, causing:
    • ValueError: Apo Path=None and Holo Path=None not found
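
The first two errors can be reproduced outside the repository. The snippet below is a minimal sketch using only the standard library; the parser here is a stand-in and not the actual one from scripts/preprocess/preprocess_data.py.

```python
import argparse
import logging

# Stand-in parser (assumption): it defines --dataset but never defines
# --bb_random_prior, mirroring the situation described in this report.
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="pdbbind")
args = parser.parse_args([])

# Reading an attribute that no add_argument() call created raises AttributeError.
try:
    args.bb_random_prior
except AttributeError as exc:
    missing_arg_error = str(exc)

# `logging` is a module, so calling it like a function raises TypeError.
try:
    logging("Processing complexes")
except TypeError as exc:
    logging_error = str(exc)

print(missing_arg_error)
print(logging_error)
```

Both messages should match the first two tracebacks listed above, which suggests neither failure depends on the dataset or the environment.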

Suggested fixes

  • Add missing arg in scripts/preprocess/preprocess_data.py:
    • parser.add_argument('--bb_random_prior', action='store_true', default=False, ...)
  • Replace logging(...) with logging.info(...) in flexdock/data/modules/training/pipeline.py
  • Replace self.full_cache_path with self.config.cache_path when writing complex_names.pkl
  • Set self.dataset = config.dataset in TrainingDataPipeline.__init__
  • Unify keys between pipeline and parser:
    • Use apo_rec_path/holo_rec_path in prepare_input_files() (or make parser accept both)
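
A rough sketch of three of these fixes follows, assuming nothing about the repository beyond the names quoted in this issue. build_parser and resolve_receptor_paths are hypothetical helpers written for illustration; the real changes belong in preprocess_data.py, pipeline.py, and ComplexParser.parse_protein().

```python
import argparse
import logging

logging.basicConfig(level=logging.INFO)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the preprocess script's parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="pdbbind")
    # Defining the flag guarantees args.bb_random_prior always exists.
    parser.add_argument("--bb_random_prior", action="store_true", default=False)
    return parser


def resolve_receptor_paths(files: dict) -> tuple:
    """Hypothetical helper: accept either key naming scheme.

    Prefers the parser's keys (apo_rec_path/holo_rec_path) and falls back
    to the pipeline's keys (apo_protein_file/holo_protein_file).
    """
    apo = files.get("apo_rec_path") or files.get("apo_protein_file")
    holo = files.get("holo_rec_path") or files.get("holo_protein_file")
    if apo is None and holo is None:
        raise ValueError(f"Apo Path={apo} and Holo Path={holo} not found")
    return apo, holo


args = build_parser().parse_args([])
# Call a function on the logging module rather than the module itself.
logging.info("bb_random_prior=%s", args.bb_random_prior)

paths = resolve_receptor_paths(
    {"apo_protein_file": "apo.pdb", "holo_protein_file": "holo.pdb"}
)
```

Accepting both key names keeps existing callers working, but renaming the keys in prepare_input_files() is the smaller change if nothing else depends on the current names.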
