
Conversation

@amd-ahyttine

Motivation

Updates to Silo Heuristic benchmarking pipeline. DO NOT MERGE.

Technical Details

Test Plan

Test Result

Submission Checklist

…e reliability

- Replace individual UPDATE queries with bulk UPDATE for job state changes
- Add retry logic with configurable max attempts for database operations
- Implement consecutive empty fetch tracking to prevent infinite loops
- Add proper error handling and recovery for database session failures
- Track enqueued jobs to prevent duplicate processing
- Add configurable TUNA_MAX_EMPTY_FETCHES environment variable
- Improve logging for better observability of enqueue process

This optimization significantly reduces database round-trips when updating
multiple job states and makes the enqueue process more resilient to
transient failures.
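
A minimal sketch of what the batched update with retry could look like, assuming a SQLAlchemy `Job` model, a `session_factory` sessionmaker, and a plain `MAX_RETRIES` constant; all three are stand-ins for the pipeline's own objects, not actual TUNA names:

```python
import logging
import time

from sqlalchemy import update
from sqlalchemy.exc import OperationalError

LOGGER = logging.getLogger(__name__)
MAX_RETRIES = 3  # placeholder for the configurable max attempts

def bulk_set_job_state(session_factory, job_ids, new_state):
    """Move all jobs to new_state with one UPDATE instead of one per job."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            with session_factory() as session:
                session.execute(
                    update(Job).where(Job.id.in_(job_ids)).values(state=new_state))
                session.commit()
            return True
        except OperationalError as err:
            # Transient DB failure: back off and retry up to MAX_RETRIES times.
            LOGGER.warning("bulk update attempt %d failed: %s", attempt, err)
            time.sleep(2 ** attempt)
    return False
```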
…w dependencies to be able to run on Python 3.12
Refactor the enqueue_jobs process from periodic restarts to a single continuous process with internal polling. This change:

- Moves poll interval configuration into enqueue_jobs loop
- Replaces early returns with sleep+continue when waiting for batch progress
- Eliminates repeated process creation/joining in favor of one long-running process
- Simplifies main process coordination by letting enqueue/consume run independently
- Makes TUNA_POLL_INTERVAL configurable (default 60s) for both wait scenarios

This reduces process overhead and simplifies state management: the enqueue process stays alive throughout the entire job lifecycle instead of being repeatedly spawned and joined.
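
A hedged sketch of the resulting loop; `fetch_jobs`, `enqueue_batch`, and `batch_in_progress` are placeholders for the real helpers, and the default of 10 empty fetches is an assumption:

```python
import os
import time

POLL_INTERVAL = int(os.environ.get("TUNA_POLL_INTERVAL", "60"))
MAX_EMPTY_FETCHES = int(os.environ.get("TUNA_MAX_EMPTY_FETCHES", "10"))  # default is a guess

def enqueue_jobs():
    """Single long-running enqueue loop with internal polling."""
    empty_fetches = 0
    while True:
        if batch_in_progress():
            # Previously this returned early and the process was restarted;
            # now we simply sleep and poll again.
            time.sleep(POLL_INTERVAL)
            continue
        jobs = fetch_jobs()
        if not jobs:
            empty_fetches += 1
            if empty_fetches >= MAX_EMPTY_FETCHES:
                break  # consecutive empty fetches: stop instead of spinning forever
            time.sleep(POLL_INTERVAL)
            continue
        empty_fetches = 0
        enqueue_batch(jobs)
```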
This ensures jobs that fail and get reset are properly unclaimed and can be
retried, preventing them from being incorrectly counted as completed.
Enable MIOPEN_USE_COMGR and MIOPEN_USE_HIPRTC in the CMake build configuration. This allows MIOpen to use the Code Object Manager and HIP runtime compilation for generating target-specific code objects,
replacing the previous offline clang compiler approach.

This was required because MIOpen was otherwise unable to run the GEMM kernels/solvers.
…cking

- Rename `should_enqueue_more_jobs` to `_should_wait_for_progress` for clarity
- Extract progress checking logic into a private method
- Streamline enqueue_jobs method by removing nested retry logic
- Improve separation of concerns between job fetching and progress tracking
- Make progress tracking more focused on instance-specific job state

This refactoring improves code readability and maintainability by better organizing the job enqueueing workflow and making the progress tracking logic more explicit.
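
One plausible shape of the refactor, with the attribute names assumed from the tracking containers mentioned elsewhere in this PR:

```python
import time

class MITunaInterface:  # stands in for the real interface class
    def __init__(self, poll_interval=60):
        self.poll_interval = poll_interval
        self.claimed_job_ids = []
        self.completed_job_ids = []

    def _should_wait_for_progress(self) -> bool:
        """True while jobs claimed by this instance have not all completed."""
        return len(self.completed_job_ids) < len(self.claimed_job_ids)

    def enqueue_jobs(self):
        # Wait for the current batch to make progress before fetching more work.
        while self._should_wait_for_progress():
            time.sleep(self.poll_interval)
```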
Replace set-based job tracking (claimed_job_ids, completed_job_ids) with
multiprocessing.Manager lists to enable proper sharing across processes.

Changes:
- Import Manager from multiprocessing module
- Initialize shared lists using Manager() for cross-process communication
- Update set operations to list operations (add→append, update→extend,
  discard→remove with exception handling)
- Convert to sets temporarily where set operations are needed
- Modify cleanup logic to work with lists instead of sets

This fixes race conditions and data inconsistencies that occurred when
multiple processes attempted to modify the job tracking sets, as regular
Python sets are not process-safe.
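
A sketch of the shared-state pattern; only the Manager usage and the set→list substitutions come from the commit, the surrounding names are illustrative:

```python
from multiprocessing import Manager

manager = Manager()
claimed_job_ids = manager.list()    # was: set()
completed_job_ids = manager.list()  # was: set()

claimed_job_ids.append(42)          # was: claimed_job_ids.add(42)
claimed_job_ids.extend([43, 44])    # was: claimed_job_ids.update([43, 44])

# set.discard() never raises; list.remove() does, so guard it explicitly.
try:
    claimed_job_ids.remove(42)
except ValueError:
    pass

# Convert to a set temporarily where set semantics are needed.
pending = set(claimed_job_ids) - set(completed_job_ids)
```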
- Register worker machines in database on celery worker startup
- Track machine_id throughout job execution pipeline
- Update set_job_state to record which machine processed each job
- Ensure machine exists in DB before processing jobs to maintain referential integrity

This change enables better job tracking and debugging by recording which physical machine executed each tuning job. The machine is registered during celery worker initialization and its ID is propagated through the context to be stored with job state updates.
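
A hedged sketch of the registration hook, assuming a `Machine` model and a `session_factory` sessionmaker (both stand-ins); the real code propagates the id through the task context rather than a module global:

```python
import socket

from celery.signals import worker_process_init

MACHINE_ID = None  # simplified stand-in for the context propagation

@worker_process_init.connect
def register_machine(**kwargs):
    """Make sure this worker's machine row exists before any job runs."""
    global MACHINE_ID
    hostname = socket.gethostname()
    with session_factory() as session:
        machine = session.query(Machine).filter_by(hostname=hostname).first()
        if machine is None:
            machine = Machine(hostname=hostname)
            session.add(machine)
            session.commit()
        MACHINE_ID = machine.id  # recorded later by set_job_state
```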
Add a unique index on the machine.hostname column to prevent duplicate
machine entries and race conditions. The migration includes:

- New Alembic migration to add unique constraint on hostname field
- Automatic cleanup of existing duplicate hostnames (keeps oldest entry)
- Runtime check to warn if unique constraint is missing
- Improved machine registration logic with IntegrityError handling
- Fixed import path for ModelEnum and FrameworkEnum enums

This change ensures each machine hostname is registered only once in the
database, preventing race conditions during worker initialization when
multiple Celery workers start simultaneously.
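
A hedged sketch of the migration; the revision identifiers and the duplicate-cleanup SQL are illustrative, and the DELETE syntax shown assumes a MySQL backend:

```python
from alembic import op

# placeholder revision identifiers
revision = "xxxx_machine_hostname_uq"
down_revision = None

def upgrade():
    # Drop duplicate hostnames first, keeping the oldest (lowest id) entry.
    op.execute("""
        DELETE m1 FROM machine m1
        JOIN machine m2
          ON m1.hostname = m2.hostname AND m1.id > m2.id
    """)
    op.create_unique_constraint("uq_machine_hostname", "machine", ["hostname"])

def downgrade():
    op.drop_constraint("uq_machine_hostname", "machine", type_="unique")
```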
- Move sqlalchemy.text import to module level for consistency
- Add socket import for hostname initialization
- Ensure cached_machine hostname is initialized before database operations
- Add validation and default values for machine attributes (arch, num_cu)
- Improve avail_gpus formatting with early conversion to string
- Add comprehensive error handling with detailed logging for machine registration failures
- Preserve IntegrityError exception context when re-raising
- Add broad exception handler to catch and log unexpected registration errors

These changes prevent potential NoneType errors and provide better diagnostics
when machine registration fails during Celery worker initialization.
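
A short sketch of the hardened commit path; `add_machine` is an illustrative helper name, not the actual function in `celery_tasks.py`:

```python
import logging

from sqlalchemy.exc import IntegrityError

LOGGER = logging.getLogger(__name__)

def add_machine(session, machine):
    """Commit a new machine row, tolerating the duplicate-hostname race."""
    try:
        session.add(machine)
        session.commit()
    except IntegrityError:
        # Another worker registered the same hostname first; log it and
        # re-raise with the original exception context intact.
        session.rollback()
        LOGGER.warning("machine %s already registered", machine.hostname)
        raise
    except Exception:
        LOGGER.exception("unexpected failure registering machine %s",
                         machine.hostname)
        raise
```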
Add @validates decorator to Machine class to automatically convert
avail_gpus list to comma-separated string for database storage. This
centralizes the conversion logic in the model layer instead of handling
it manually in celery_tasks.py.

- Add validates import from sqlalchemy.orm
- Implement validate_avail_gpus() method to handle list-to-string conversion
- Remove manual string conversion logic from capture_worker_name()
- Pass avail_gpus as list to Machine constructor, letting validator handle conversion
- Update comments to document the automatic conversion behavior

This improves code maintainability by following the DRY principle and
ensures consistent data formatting across all Machine instances.
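
A sketch of the validator on the model; the column set is trimmed down and `BASE` is a stand-in for the project's declarative base:

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, validates

BASE = declarative_base()

class Machine(BASE):
    __tablename__ = "machine"
    id = Column(Integer, primary_key=True)
    hostname = Column(String(255), unique=True)
    avail_gpus = Column(String(255))

    @validates("avail_gpus")
    def validate_avail_gpus(self, key, value):
        """Convert a GPU id list such as [0, 1, 2] to "0,1,2" before storage."""
        if isinstance(value, (list, tuple)):
            return ",".join(str(gpu) for gpu in value)
        return value
```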
…setter

Replace SQLAlchemy's `validates` decorator with `hybrid_property` for the
`avail_gpus` field to provide better separation between database storage
(comma-separated string) and application usage (list of integers).

Changes:
- Rename column to `_avail_gpus` with proper column mapping
- Add `@hybrid_property` getter that returns List[int] for application use
- Add setter that converts list to comma-separated string for DB storage
- Handle both list and string inputs with proper type conversion
- Minor code formatting improvements (whitespace, line breaks)

This improves type safety and makes the interface more intuitive while
maintaining backward compatibility with existing database schema.
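
A sketch of the `hybrid_property` variant; the real model has more columns, but the `_avail_gpus` column mapping follows the commit text:

```python
from typing import List

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import declarative_base

BASE = declarative_base()

class Machine(BASE):
    __tablename__ = "machine"
    id = Column(Integer, primary_key=True)
    # Stored as a comma-separated string, exposed as a list of ints.
    _avail_gpus = Column("avail_gpus", String(255))

    @hybrid_property
    def avail_gpus(self) -> List[int]:
        if not self._avail_gpus:
            return []
        return [int(gpu) for gpu in self._avail_gpus.split(",")]

    @avail_gpus.setter
    def avail_gpus(self, value):
        if isinstance(value, (list, tuple)):
            self._avail_gpus = ",".join(str(gpu) for gpu in value)
        else:
            self._avail_gpus = str(value) if value is not None else None
```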
Add type checking before converting avail_gpus to prevent double conversion
when the value is already a list returned by the hybrid property getter. This
fixes a bug where a string split was attempted on list objects, causing an
AttributeError. A short sketch of the check follows the list below.

- Check if avail_gpus is already a list before attempting string conversion
- Only split and convert when value is a non-empty string from database
- Prevents runtime errors when hybrid property returns list type
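
Sketch of the defensive conversion at the call site; `machine` and `gpu_list` are illustrative names:

```python
avail_gpus = machine.avail_gpus
if isinstance(avail_gpus, list):
    gpu_list = avail_gpus  # hybrid property getter already returned a list
elif avail_gpus:
    gpu_list = [int(gpu) for gpu in avail_gpus.split(",")]  # raw DB string
else:
    gpu_list = []
```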
Add a `detect_and_handle_locked_jobs` method to identify jobs that are stuck
behind stale database locks and are blocking pipeline progress. The method:

- Queries jobs without locking to detect if they're being skipped by
  FOR UPDATE SKIP LOCKED clauses
- Automatically marks jobs with high retry counts (≥2) as errored to
  unblock the pipeline
- Logs warnings about detected locked jobs for debugging

Also enhance logging in `mituna_interface.py` to provide better visibility
into job fetching and state transitions, including job counts and IDs.

This resolves issues where stale database locks cause jobs to be
perpetually skipped, blocking pipeline execution.
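
A hedged sketch of the detection idea; the `Job` columns, state names, and retry threshold follow the commit text, but the exact query shape is an assumption:

```python
import logging

LOGGER = logging.getLogger(__name__)

def detect_and_handle_locked_jobs(session):
    """Flag jobs that FOR UPDATE SKIP LOCKED keeps skipping."""
    # Same filter the normal fetch uses, but without row locking, so rows held
    # by a stale lock still show up here.
    visible = session.query(Job).filter(Job.state == "new").all()
    lockable_ids = {
        job.id
        for job in session.query(Job)
        .filter(Job.state == "new")
        .with_for_update(skip_locked=True)
        .all()
    }
    for job in visible:
        if job.id in lockable_ids:
            continue
        LOGGER.warning("job %s appears stuck behind a stale lock", job.id)
        if job.retries >= 2:
            job.state = "errored"  # unblock the pipeline
    session.commit()
```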
…rface

- Reorganize imports alphabetically and group by standard lib, third-party, and local
- Consolidate related imports using parentheses for better readability
- Remove excessive debug logging in get_jobs() method
- Add new reconcile_tracking_state() method for job state reconciliation
- Replace multiple verbose log statements with single summary log message

This improves code maintainability by following PEP 8 import conventions and reduces log noise while maintaining essential debugging information. The new reconciliation method helps prevent the distributor from getting stuck on stale job states.
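
A purely speculative sketch of `reconcile_tracking_state()`; the commit does not spell out the reconciliation rules, so this only shows one plausible shape using the shared tracking lists:

```python
def reconcile_tracking_state(self, active_job_ids):
    """Drop tracked ids that the database no longer reports as active."""
    active = set(active_job_ids)
    for job_id in list(self.claimed_job_ids):
        if job_id not in active and job_id not in self.completed_job_ids:
            try:
                self.claimed_job_ids.remove(job_id)
            except ValueError:
                pass  # already removed by another process
```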
Add `--new_only` and `--config_limit` command-line arguments to optimize applicability update operations. The `--new_only` flag skips configs with existing applicability data, while `--config_limit` restricts the number of configs processed.

Changes:
- Add new CLI arguments for filtering applicability updates
- Implement worker scaling based on GPU count (4x multiplier) for applicability operations
- Pass filtering parameters through worker initialization and query methods
- Add detailed logging for worker creation, config processing, and worker count calculations
- Reduce log verbosity by commenting out duplicate job warnings in load_job.py

This improves performance for large-scale applicability updates by allowing incremental processing and testing with limited datasets.
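
Only the two flags come from the commit text; the parser wiring and the GPU count are illustrative, and the scaling line paraphrases the 4x multiplier bullet:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--new_only",
    action="store_true",
    help="skip configs that already have applicability data")
parser.add_argument(
    "--config_limit",
    type=int,
    default=None,
    help="process at most this many configs (useful for test runs)")
args = parser.parse_args()

# Worker scaling for applicability updates: 4 workers per available GPU
# (num_gpus is however many GPUs the machine reports).
num_workers = 4 * num_gpus
```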
Update subquery filtering to use explicit select() method calls for
SQLAlchemy 2.0 compatibility. This change affects config tag filtering
and solver application queries, ensuring proper subquery execution in
newer SQLAlchemy versions where implicit select is deprecated.
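
An illustration of the 2.0-style pattern; `Config`, `ConfigTag`, `session`, and `tag` are placeholders for the actual tag-filtering and solver-applicability queries:

```python
from sqlalchemy import select

# Before (implicit coercion of a Query into a subquery; deprecated in 2.0):
#   session.query(Config).filter(
#       Config.id.in_(session.query(ConfigTag.config_id)
#                     .filter(ConfigTag.tag == tag)))

# After: build the subquery explicitly with select().
tagged_ids = select(ConfigTag.config_id).where(ConfigTag.tag == tag)
configs = session.query(Config).filter(Config.id.in_(tagged_ids)).all()
```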
…seem to be processed by "update_applicability" after import. Consider reverting this.

perf(miopen): optimize config import with batch processing and tensor caching

This commit significantly improves the performance of config import operations:

**Changes:**
- Add batch processing for database operations in import_configs
  - New `batch_insert_tensors()` function for bulk tensor insertion
  - New `get_or_create_tensor_ids()` for efficient tensor ID retrieval
  - New `batch_process_drivers()` to process drivers in configurable batches
- Optimize tensor caching in MIOpenDriver base class
  - Check cache before creating database session to avoid unnecessary connections
  - Reduce redundant cache lookups after loading tensor map
  - Early return on cache hits to minimize database operations
- Add CLI arguments for batch configuration
  - `--batch_size`: configurable batch size (default: 1000)
  - `--disable_batch_import`: fallback flag for debugging

**Why:**
The original implementation processed configs one-by-one, creating a database
session for each tensor lookup/insert. This caused significant overhead when
importing large numbers of configs. The new batch processing approach reduces
database round-trips and leverages bulk operations for better performance.
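
A hedged sketch of the batched path; `TensorTable`, `build_tensor_rows`, and `import_driver_batch` are stand-ins, and the real `get_or_create_tensor_ids()` and tensor-cache logic are omitted:

```python
from sqlalchemy import insert

def batch_insert_tensors(session, tensor_rows, batch_size=1000):
    """Bulk-insert tensor rows in chunks instead of one INSERT per tensor."""
    for start in range(0, len(tensor_rows), batch_size):
        session.execute(insert(TensorTable),
                        tensor_rows[start:start + batch_size])
    session.commit()

def batch_process_drivers(session, drivers, batch_size=1000):
    """Process driver lines in configurable batches (see --batch_size)."""
    for start in range(0, len(drivers), batch_size):
        batch = drivers[start:start + batch_size]
        batch_insert_tensors(session, build_tensor_rows(batch), batch_size)
        import_driver_batch(session, batch)
```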
…s w Antti

Change TUNA_DB_NAME environment variable from 'silo_heuristic' to
'silo_heuristic_2d' in the MI355 setup script to point to the
2D-specific heuristic database instance.
amd-ahyttine marked this pull request as draft on January 22, 2026 at 13:18.