
Conversation

@amd-ahyttine

Motivation

Updates to Silo Heuristic benchmarking pipeline. DO NOT MERGE.

Technical Details

Test Plan

Test Result

Submission Checklist

…e reliability

- Replace individual UPDATE queries with bulk UPDATE for job state changes
- Add retry logic with configurable max attempts for database operations
- Implement consecutive empty fetch tracking to prevent infinite loops
- Add proper error handling and recovery for database session failures
- Track enqueued jobs to prevent duplicate processing
- Add configurable TUNA_MAX_EMPTY_FETCHES environment variable
- Improve logging for better observability of enqueue process

This optimization significantly reduces database round-trips when updating
multiple job states and makes the enqueue process more resilient to
transient failures.
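
A minimal sketch of what the batched update with retry could look like, assuming a SQLAlchemy `Job` model, a `session_factory` sessionmaker, and a plain `MAX_RETRIES` constant; all three are stand-ins for the pipeline's own objects, not actual TUNA names:

```python
import logging
import time

from sqlalchemy import update
from sqlalchemy.exc import OperationalError

LOGGER = logging.getLogger(__name__)
MAX_RETRIES = 3  # placeholder for the configurable max attempts

def bulk_set_job_state(session_factory, job_ids, new_state):
    """Move all jobs to new_state with one UPDATE instead of one per job."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            with session_factory() as session:
                session.execute(
                    update(Job).where(Job.id.in_(job_ids)).values(state=new_state))
                session.commit()
            return True
        except OperationalError as err:
            # Transient DB failure: back off and retry up to MAX_RETRIES times.
            LOGGER.warning("bulk update attempt %d failed: %s", attempt, err)
            time.sleep(2 ** attempt)
    return False
```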
…w dependencies to be able to run on Python 3.12
Refactor the enqueue_jobs process from periodic restarts to a single continuous process with internal polling. This change:

- Moves poll interval configuration into enqueue_jobs loop
- Replaces early returns with sleep+continue when waiting for batch progress
- Eliminates repeated process creation/joining in favor of one long-running process
- Simplifies main process coordination by letting enqueue/consume run independently
- Makes TUNA_POLL_INTERVAL configurable (default 60s) for both wait scenarios

This reduces process overhead and simplifies state management: the enqueue process stays alive throughout the entire job lifecycle instead of being repeatedly spawned and joined.
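
A hedged sketch of the resulting loop; `fetch_jobs`, `enqueue_batch`, and `batch_in_progress` are placeholders for the real helpers, and the default of 10 empty fetches is an assumption:

```python
import os
import time

POLL_INTERVAL = int(os.environ.get("TUNA_POLL_INTERVAL", "60"))
MAX_EMPTY_FETCHES = int(os.environ.get("TUNA_MAX_EMPTY_FETCHES", "10"))  # default is a guess

def enqueue_jobs():
    """Single long-running enqueue loop with internal polling."""
    empty_fetches = 0
    while True:
        if batch_in_progress():
            # Previously this returned early and the process was restarted;
            # now we simply sleep and poll again.
            time.sleep(POLL_INTERVAL)
            continue
        jobs = fetch_jobs()
        if not jobs:
            empty_fetches += 1
            if empty_fetches >= MAX_EMPTY_FETCHES:
                break  # consecutive empty fetches: stop instead of spinning forever
            time.sleep(POLL_INTERVAL)
            continue
        empty_fetches = 0
        enqueue_batch(jobs)
```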
This ensures jobs that fail and get reset are properly unclaimed and can be
retried, preventing them from being incorrectly counted as completed.
Enable MIOPEN_USE_COMGR and MIOPEN_USE_HIPRTC in the CMake build configuration. This allows MIOpen to use the Code Object Manager and HIP runtime compilation for generating target-specific code objects,
replacing the previous offline clang compiler approach.

This was required because MIOpen was otherwise unable to run the GEMM kernels/solvers.
…cking

- Rename `should_enqueue_more_jobs` to `_should_wait_for_progress` for clarity
- Extract progress checking logic into a private method
- Streamline enqueue_jobs method by removing nested retry logic
- Improve separation of concerns between job fetching and progress tracking
- Make progress tracking more focused on instance-specific job state

This refactoring improves code readability and maintainability by better organizing the job enqueueing workflow and making the progress tracking logic more explicit.
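
One plausible shape of the refactor, with the attribute names assumed from the tracking containers mentioned elsewhere in this PR:

```python
import time

class MITunaInterface:  # stands in for the real interface class
    def __init__(self, poll_interval=60):
        self.poll_interval = poll_interval
        self.claimed_job_ids = []
        self.completed_job_ids = []

    def _should_wait_for_progress(self) -> bool:
        """True while jobs claimed by this instance have not all completed."""
        return len(self.completed_job_ids) < len(self.claimed_job_ids)

    def enqueue_jobs(self):
        # Wait for the current batch to make progress before fetching more work.
        while self._should_wait_for_progress():
            time.sleep(self.poll_interval)
```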
Replace set-based job tracking (claimed_job_ids, completed_job_ids) with
multiprocessing.Manager lists to enable proper sharing across processes.

Changes:
- Import Manager from multiprocessing module
- Initialize shared lists using Manager() for cross-process communication
- Update set operations to list operations (add→append, update→extend,
  discard→remove with exception handling)
- Convert to sets temporarily where set operations are needed
- Modify cleanup logic to work with lists instead of sets

This fixes race conditions and data inconsistencies that occurred when
multiple processes attempted to modify the job tracking sets, as regular
Python sets are not process-safe.
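
A sketch of the shared-state pattern; only the Manager usage and the set→list substitutions come from the commit, the surrounding names are illustrative:

```python
from multiprocessing import Manager

manager = Manager()
claimed_job_ids = manager.list()    # was: set()
completed_job_ids = manager.list()  # was: set()

claimed_job_ids.append(42)          # was: claimed_job_ids.add(42)
claimed_job_ids.extend([43, 44])    # was: claimed_job_ids.update([43, 44])

# set.discard() never raises; list.remove() does, so guard it explicitly.
try:
    claimed_job_ids.remove(42)
except ValueError:
    pass

# Convert to a set temporarily where set semantics are needed.
pending = set(claimed_job_ids) - set(completed_job_ids)
```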
- Register worker machines in database on celery worker startup
- Track machine_id throughout job execution pipeline
- Update set_job_state to record which machine processed each job
- Ensure machine exists in DB before processing jobs to maintain referential integrity

This change enables better job tracking and debugging by recording which physical machine executed each tuning job. The machine is registered during celery worker initialization and its ID is propagated through the context to be stored with job state updates.
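
A hedged sketch of the registration hook, assuming a `Machine` model and a `session_factory` sessionmaker (both stand-ins); the real code propagates the id through the task context rather than a module global:

```python
import socket

from celery.signals import worker_process_init

MACHINE_ID = None  # simplified stand-in for the context propagation

@worker_process_init.connect
def register_machine(**kwargs):
    """Make sure this worker's machine row exists before any job runs."""
    global MACHINE_ID
    hostname = socket.gethostname()
    with session_factory() as session:
        machine = session.query(Machine).filter_by(hostname=hostname).first()
        if machine is None:
            machine = Machine(hostname=hostname)
            session.add(machine)
            session.commit()
        MACHINE_ID = machine.id  # recorded later by set_job_state
```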
Add a unique index on the machine.hostname column to prevent duplicate
machine entries and race conditions. The migration includes:

- New Alembic migration to add unique constraint on hostname field
- Automatic cleanup of existing duplicate hostnames (keeps oldest entry)
- Runtime check to warn if unique constraint is missing
- Improved machine registration logic with IntegrityError handling
- Fixed import path for ModelEnum and FrameworkEnum enums

This change ensures each machine hostname is registered only once in the
database, preventing race conditions during worker initialization when
multiple Celery workers start simultaneously.
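
A hedged sketch of the migration; the revision identifiers and the duplicate-cleanup SQL are illustrative, and the DELETE syntax shown assumes a MySQL backend:

```python
from alembic import op

# placeholder revision identifiers
revision = "xxxx_machine_hostname_uq"
down_revision = None

def upgrade():
    # Drop duplicate hostnames first, keeping the oldest (lowest id) entry.
    op.execute("""
        DELETE m1 FROM machine m1
        JOIN machine m2
          ON m1.hostname = m2.hostname AND m1.id > m2.id
    """)
    op.create_unique_constraint("uq_machine_hostname", "machine", ["hostname"])

def downgrade():
    op.drop_constraint("uq_machine_hostname", "machine", type_="unique")
```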
- Move sqlalchemy.text import to module level for consistency
- Add socket import for hostname initialization
- Ensure cached_machine hostname is initialized before database operations
- Add validation and default values for machine attributes (arch, num_cu)
- Improve avail_gpus formatting with early conversion to string
- Add comprehensive error handling with detailed logging for machine registration failures
- Preserve IntegrityError exception context when re-raising
- Add broad exception handler to catch and log unexpected registration errors

These changes prevent potential NoneType errors and provide better diagnostics
when machine registration fails during Celery worker initialization.
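
A short sketch of the hardened commit path; `add_machine` is an illustrative helper name, not the actual function in `celery_tasks.py`:

```python
import logging

from sqlalchemy.exc import IntegrityError

LOGGER = logging.getLogger(__name__)

def add_machine(session, machine):
    """Commit a new machine row, tolerating the duplicate-hostname race."""
    try:
        session.add(machine)
        session.commit()
    except IntegrityError:
        # Another worker registered the same hostname first; log it and
        # re-raise with the original exception context intact.
        session.rollback()
        LOGGER.warning("machine %s already registered", machine.hostname)
        raise
    except Exception:
        LOGGER.exception("unexpected failure registering machine %s",
                         machine.hostname)
        raise
```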
Add @validates decorator to Machine class to automatically convert
avail_gpus list to comma-separated string for database storage. This
centralizes the conversion logic in the model layer instead of handling
it manually in celery_tasks.py.

- Add validates import from sqlalchemy.orm
- Implement validate_avail_gpus() method to handle list-to-string conversion
- Remove manual string conversion logic from capture_worker_name()
- Pass avail_gpus as list to Machine constructor, letting validator handle conversion
- Update comments to document the automatic conversion behavior

This improves code maintainability by following the DRY principle and
ensures consistent data formatting across all Machine instances.
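
A sketch of the validator on the model; the column set is trimmed down and `BASE` is a stand-in for the project's declarative base:

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, validates

BASE = declarative_base()

class Machine(BASE):
    __tablename__ = "machine"
    id = Column(Integer, primary_key=True)
    hostname = Column(String(255), unique=True)
    avail_gpus = Column(String(255))

    @validates("avail_gpus")
    def validate_avail_gpus(self, key, value):
        """Convert a GPU id list such as [0, 1, 2] to "0,1,2" before storage."""
        if isinstance(value, (list, tuple)):
            return ",".join(str(gpu) for gpu in value)
        return value
```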
…setter

Replace SQLAlchemy's `validates` decorator with `hybrid_property` for the
`avail_gpus` field to provide better separation between database storage
(comma-separated string) and application usage (list of integers).

Changes:
- Rename column to `_avail_gpus` with proper column mapping
- Add `@hybrid_property` getter that returns List[int] for application use
- Add setter that converts list to comma-separated string for DB storage
- Handle both list and string inputs with proper type conversion
- Minor code formatting improvements (whitespace, line breaks)

This improves type safety and makes the interface more intuitive while
maintaining backward compatibility with existing database schema.
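
A sketch of the `hybrid_property` variant; the real model has more columns, but the `_avail_gpus` column mapping follows the commit text:

```python
from typing import List

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import declarative_base

BASE = declarative_base()

class Machine(BASE):
    __tablename__ = "machine"
    id = Column(Integer, primary_key=True)
    # Stored as a comma-separated string, exposed as a list of ints.
    _avail_gpus = Column("avail_gpus", String(255))

    @hybrid_property
    def avail_gpus(self) -> List[int]:
        if not self._avail_gpus:
            return []
        return [int(gpu) for gpu in self._avail_gpus.split(",")]

    @avail_gpus.setter
    def avail_gpus(self, value):
        if isinstance(value, (list, tuple)):
            self._avail_gpus = ",".join(str(gpu) for gpu in value)
        else:
            self._avail_gpus = str(value) if value is not None else None
```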
Add type checking before converting avail_gpus to prevent double conversion
when the value is already a list returned by the hybrid property getter. This
fixes a bug where a string split was attempted on list objects, causing an
AttributeError. A short sketch of the check follows the list below.

- Check if avail_gpus is already a list before attempting string conversion
- Only split and convert when value is a non-empty string from database
- Prevents runtime errors when hybrid property returns list type
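
Sketch of the defensive conversion at the call site; `machine` and `gpu_list` are illustrative names:

```python
avail_gpus = machine.avail_gpus
if isinstance(avail_gpus, list):
    gpu_list = avail_gpus  # hybrid property getter already returned a list
elif avail_gpus:
    gpu_list = [int(gpu) for gpu in avail_gpus.split(",")]  # raw DB string
else:
    gpu_list = []
```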
Add a `detect_and_handle_locked_jobs` method to identify jobs that are stuck
behind stale database locks and are blocking pipeline progress. The method:

- Queries jobs without locking to detect if they're being skipped by
  FOR UPDATE SKIP LOCKED clauses
- Automatically marks jobs with high retry counts (≥2) as errored to
  unblock the pipeline
- Logs warnings about detected locked jobs for debugging

Also enhance logging in `mituna_interface.py` to provide better visibility
into job fetching and state transitions, including job counts and IDs.

This resolves issues where stale database locks cause jobs to be
perpetually skipped, blocking pipeline execution.
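
A hedged sketch of the detection idea; the `Job` columns, state names, and retry threshold follow the commit text, but the exact query shape is an assumption:

```python
import logging

LOGGER = logging.getLogger(__name__)

def detect_and_handle_locked_jobs(session):
    """Flag jobs that FOR UPDATE SKIP LOCKED keeps skipping."""
    # Same filter the normal fetch uses, but without row locking, so rows held
    # by a stale lock still show up here.
    visible = session.query(Job).filter(Job.state == "new").all()
    lockable_ids = {
        job.id
        for job in session.query(Job)
        .filter(Job.state == "new")
        .with_for_update(skip_locked=True)
        .all()
    }
    for job in visible:
        if job.id in lockable_ids:
            continue
        LOGGER.warning("job %s appears stuck behind a stale lock", job.id)
        if job.retries >= 2:
            job.state = "errored"  # unblock the pipeline
    session.commit()
```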
…rface

- Reorganize imports alphabetically and group by standard lib, third-party, and local
- Consolidate related imports using parentheses for better readability
- Remove excessive debug logging in get_jobs() method
- Add new reconcile_tracking_state() method for job state reconciliation
- Replace multiple verbose log statements with single summary log message

This improves code maintainability by following PEP 8 import conventions and reduces log noise while maintaining essential debugging information. The new reconciliation method helps prevent the distributor from getting stuck on stale job states.
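
A purely speculative sketch of `reconcile_tracking_state()`; the commit does not spell out the reconciliation rules, so this only shows one plausible shape using the shared tracking lists:

```python
def reconcile_tracking_state(self, active_job_ids):
    """Drop tracked ids that the database no longer reports as active."""
    active = set(active_job_ids)
    for job_id in list(self.claimed_job_ids):
        if job_id not in active and job_id not in self.completed_job_ids:
            try:
                self.claimed_job_ids.remove(job_id)
            except ValueError:
                pass  # already removed by another process
```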
Add `--new_only` and `--config_limit` command-line arguments to optimize applicability update operations. The `--new_only` flag skips configs with existing applicability data, while `--config_limit` restricts the number of configs processed.

Changes:
- Add new CLI arguments for filtering applicability updates
- Implement worker scaling based on GPU count (4x multiplier) for applicability operations
- Pass filtering parameters through worker initialization and query methods
- Add detailed logging for worker creation, config processing, and worker count calculations
- Reduce log verbosity by commenting out duplicate job warnings in load_job.py

This improves performance for large-scale applicability updates by allowing incremental processing and testing with limited datasets.
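
Only the two flags come from the commit text; the parser wiring and the GPU count are illustrative, and the scaling line paraphrases the 4x multiplier bullet:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--new_only",
    action="store_true",
    help="skip configs that already have applicability data")
parser.add_argument(
    "--config_limit",
    type=int,
    default=None,
    help="process at most this many configs (useful for test runs)")
args = parser.parse_args()

# Worker scaling for applicability updates: 4 workers per available GPU
# (num_gpus is however many GPUs the machine reports).
num_workers = 4 * num_gpus
```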
Update subquery filtering to use explicit select() method calls for
SQLAlchemy 2.0 compatibility. This change affects config tag filtering
and solver application queries, ensuring proper subquery execution in
newer SQLAlchemy versions where implicit select is deprecated.
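
An illustration of the 2.0-style pattern; `Config`, `ConfigTag`, `session`, and `tag` are placeholders for the actual tag-filtering and solver-applicability queries:

```python
from sqlalchemy import select

# Before (implicit coercion of a Query into a subquery; deprecated in 2.0):
#   session.query(Config).filter(
#       Config.id.in_(session.query(ConfigTag.config_id)
#                     .filter(ConfigTag.tag == tag)))

# After: build the subquery explicitly with select().
tagged_ids = select(ConfigTag.config_id).where(ConfigTag.tag == tag)
configs = session.query(Config).filter(Config.id.in_(tagged_ids)).all()
```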
…seem to be processed by "update_applicability" after import. Consider reverting this.

perf(miopen): optimize config import with batch processing and tensor caching

This commit significantly improves the performance of config import operations:

**Changes:**
- Add batch processing for database operations in import_configs
  - New `batch_insert_tensors()` function for bulk tensor insertion
  - New `get_or_create_tensor_ids()` for efficient tensor ID retrieval
  - New `batch_process_drivers()` to process drivers in configurable batches
- Optimize tensor caching in MIOpenDriver base class
  - Check cache before creating database session to avoid unnecessary connections
  - Reduce redundant cache lookups after loading tensor map
  - Early return on cache hits to minimize database operations
- Add CLI arguments for batch configuration
  - `--batch_size`: configurable batch size (default: 1000)
  - `--disable_batch_import`: fallback flag for debugging

**Why:**
The original implementation processed configs one-by-one, creating a database
session for each tensor lookup/insert. This caused significant overhead when
importing large numbers of configs. The new batch processing approach reduces
database round-trips and leverages bulk operations for better performance.
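
A hedged sketch of the batched path; `TensorTable`, `build_tensor_rows`, and `import_driver_batch` are stand-ins, and the real `get_or_create_tensor_ids()` and tensor-cache logic are omitted:

```python
from sqlalchemy import insert

def batch_insert_tensors(session, tensor_rows, batch_size=1000):
    """Bulk-insert tensor rows in chunks instead of one INSERT per tensor."""
    for start in range(0, len(tensor_rows), batch_size):
        session.execute(insert(TensorTable),
                        tensor_rows[start:start + batch_size])
    session.commit()

def batch_process_drivers(session, drivers, batch_size=1000):
    """Process driver lines in configurable batches (see --batch_size)."""
    for start in range(0, len(drivers), batch_size):
        batch = drivers[start:start + batch_size]
        batch_insert_tensors(session, build_tensor_rows(batch), batch_size)
        import_driver_batch(session, batch)
```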
…s w Antti

Change TUNA_DB_NAME environment variable from 'silo_heuristic' to
'silo_heuristic_2d' in the MI355 setup script to point to the
2D-specific heuristic database instance.
amd-ahyttine marked this pull request as draft on January 22, 2026 at 13:18.