Draft: Silo/3d conv benchmark mi355 #1024
amd-ahyttine wants to merge 34 commits into develop from silo/3d_conv_benchmark_mi355
Conversation
…e reliability

- Replace individual UPDATE queries with a bulk UPDATE for job state changes
- Add retry logic with configurable max attempts for database operations
- Implement consecutive empty fetch tracking to prevent infinite loops
- Add proper error handling and recovery for database session failures
- Track enqueued jobs to prevent duplicate processing
- Add configurable TUNA_MAX_EMPTY_FETCHES environment variable
- Improve logging for better observability of the enqueue process

This optimization significantly reduces database round-trips when updating multiple job states and makes the enqueue process more resilient to transient failures.
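A minimal sketch of what a bulk state update with retries could look like; the job model, session factory, and the `TUNA_DB_MAX_RETRIES` variable are illustrative assumptions, not the PR's actual identifiers:

```python
"""Sketch: bulk job-state update with retry, not the PR's actual code."""
import os
import time
from sqlalchemy import update

MAX_ATTEMPTS = int(os.getenv("TUNA_DB_MAX_RETRIES", "3"))  # hypothetical knob

def bulk_set_job_state(session_factory, job_model, job_ids, new_state):
  """Update the state of all job_ids in one statement, retrying transient errors."""
  for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
      with session_factory() as session:
        session.execute(
            update(job_model)
            .where(job_model.id.in_(job_ids))
            .values(state=new_state))
        session.commit()
      return
    except Exception:
      if attempt == MAX_ATTEMPTS:
        raise
      time.sleep(2 ** attempt)  # simple backoff before retrying
```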
…ocker build stage
…w dependencies to be able to run on Python 3.12
…ing limited to batch size)
Refactor the enqueue_jobs process from periodic restarts to a single continuous process with internal polling. This change:

- Moves poll interval configuration into the enqueue_jobs loop
- Replaces early returns with sleep+continue when waiting for batch progress
- Eliminates repeated process creation/joining in favor of one long-running process
- Simplifies main process coordination by letting enqueue/consume run independently
- Makes TUNA_POLL_INTERVAL configurable (default 60s) for both wait scenarios

This reduces process overhead and maintains clearer state management by keeping the enqueue process alive throughout the entire job lifecycle rather than repeatedly spawning new processes.
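A rough sketch of the single continuous enqueue loop; the helper callables stand in for the PR's actual fetch/enqueue/progress logic:

```python
"""Sketch of one long-running enqueue loop with internal polling."""
import os
import time

POLL_INTERVAL = int(os.getenv("TUNA_POLL_INTERVAL", "60"))  # seconds

def enqueue_jobs(fetch_batch, enqueue, batch_in_progress):
  """Continuously fetch and enqueue jobs, sleeping instead of exiting when idle."""
  while True:
    if batch_in_progress():
      # Previously this returned and a new process was spawned; now we just wait.
      time.sleep(POLL_INTERVAL)
      continue
    jobs = fetch_batch()
    if not jobs:
      time.sleep(POLL_INTERVAL)
      continue
    enqueue(jobs)
```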
This ensures jobs that fail and get reset are properly unclaimed and can be retried, preventing them from being incorrectly counted as completed.
Enable MIOPEN_USE_COMGR and MIOPEN_USE_HIPRTC in the CMake build configuration. This allows MIOpen to use the Code Object Manager and HIP runtime compilation to generate target-specific code objects, replacing the previous offline clang compiler approach. This was required because, without it, MIOpen was unable to run the GEMM kernels/solvers.
…cking

- Rename `should_enqueue_more_jobs` to `_should_wait_for_progress` for clarity
- Extract progress checking logic into a private method
- Streamline the enqueue_jobs method by removing nested retry logic
- Improve separation of concerns between job fetching and progress tracking
- Make progress tracking more focused on instance-specific job state

This refactoring improves code readability and maintainability by better organizing the job enqueueing workflow and making the progress tracking logic more explicit.
Replace set-based job tracking (claimed_job_ids, completed_job_ids) with multiprocessing.Manager lists to enable proper sharing across processes. Changes:

- Import Manager from the multiprocessing module
- Initialize shared lists using Manager() for cross-process communication
- Update set operations to list operations (add→append, update→extend, discard→remove with exception handling)
- Convert to sets temporarily where set operations are needed
- Modify cleanup logic to work with lists instead of sets

This fixes race conditions and data inconsistencies that occurred when multiple processes attempted to modify the job tracking sets, as regular Python sets are not process-safe.
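A small sketch of Manager-backed tracking lists and the set→list operation mapping described above; the helper functions are illustrative:

```python
"""Sketch: shared job tracking via multiprocessing.Manager lists."""
from multiprocessing import Manager

manager = Manager()
claimed_job_ids = manager.list()    # proxied list, visible to child processes
completed_job_ids = manager.list()

def claim(job_id):
  """set.add -> list.append, filtering duplicates manually."""
  if job_id not in claimed_job_ids:
    claimed_job_ids.append(job_id)

def unclaim(job_id):
  """set.discard -> list.remove guarded against missing entries."""
  try:
    claimed_job_ids.remove(job_id)
  except ValueError:
    pass

def pending_ids():
  """Convert to sets temporarily where set arithmetic is needed."""
  return set(claimed_job_ids) - set(completed_job_ids)
```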
…esn't prematurely finish
- Register worker machines in the database on Celery worker startup
- Track machine_id throughout the job execution pipeline
- Update set_job_state to record which machine processed each job
- Ensure the machine exists in the DB before processing jobs to maintain referential integrity

This change enables better job tracking and debugging by recording which physical machine executed each tuning job. The machine is registered during Celery worker initialization and its ID is propagated through the context to be stored with job state updates.
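A hedged sketch of registering the machine at Celery worker start-up; `worker_process_init` is standard Celery API, while `register_machine` and the module-level id are stand-ins for the PR's actual insert-or-get and context propagation:

```python
"""Sketch: register this worker's machine when the Celery worker process starts."""
import socket
from celery.signals import worker_process_init

MACHINE_ID = None  # later propagated through the task context into set_job_state

def register_machine(hostname):
  """Stand-in: the real code would insert/select a Machine row and return its id."""
  return hash(hostname) % 10000

@worker_process_init.connect
def capture_worker_machine(**_kwargs):
  """Ensure a machine record exists for this host and remember its id."""
  global MACHINE_ID
  MACHINE_ID = register_machine(socket.gethostname())
```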
Add a unique index on the machine.hostname column to prevent duplicate machine entries and race conditions. The migration includes:

- A new Alembic migration to add a unique constraint on the hostname field
- Automatic cleanup of existing duplicate hostnames (keeps the oldest entry)
- A runtime check that warns if the unique constraint is missing
- Improved machine registration logic with IntegrityError handling
- A fixed import path for the ModelEnum and FrameworkEnum enums

This change ensures each machine hostname is registered only once in the database, preventing race conditions during worker initialization when multiple Celery workers start simultaneously.
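A sketch of what such an Alembic migration might look like, assuming a MySQL backend and the `machine` table layout implied above; revision ids and constraint names are placeholders:

```python
"""Sketch of an Alembic migration adding a unique hostname constraint."""
from alembic import op
import sqlalchemy as sa

revision = "add_machine_hostname_uq"  # placeholder revision id
down_revision = None                  # placeholder

def upgrade():
  # Remove duplicate hostnames first, keeping the oldest row per hostname,
  # then enforce uniqueness going forward.
  op.execute(sa.text(
      "DELETE m1 FROM machine m1 "
      "JOIN machine m2 ON m1.hostname = m2.hostname AND m1.id > m2.id"))
  op.create_unique_constraint("uq_machine_hostname", "machine", ["hostname"])

def downgrade():
  op.drop_constraint("uq_machine_hostname", "machine", type_="unique")
```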
- Move the sqlalchemy.text import to module level for consistency
- Add a socket import for hostname initialization
- Ensure the cached_machine hostname is initialized before database operations
- Add validation and default values for machine attributes (arch, num_cu)
- Improve avail_gpus formatting with early conversion to string
- Add comprehensive error handling with detailed logging for machine registration failures
- Preserve IntegrityError exception context when re-raising
- Add a broad exception handler to catch and log unexpected registration errors

These changes prevent potential NoneType errors and provide better diagnostics when machine registration fails during Celery worker initialization.
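A possible shape of the registration error handling described above; the defaults and function name are assumptions rather than the PR's actual code:

```python
"""Sketch: machine registration with defaults and detailed error diagnostics."""
import logging
import socket
from sqlalchemy.exc import IntegrityError

LOGGER = logging.getLogger(__name__)

def register_cached_machine(session, machine):
  """Persist the cached machine row with safe defaults and clear diagnostics."""
  machine.hostname = machine.hostname or socket.gethostname()
  machine.arch = machine.arch or "unknown"
  machine.num_cu = machine.num_cu or 0
  try:
    session.add(machine)
    session.commit()
  except IntegrityError as err:
    LOGGER.error("Machine %s already registered: %s", machine.hostname, err)
    raise  # re-raise without losing the original exception context
  except Exception as err:
    LOGGER.exception("Unexpected machine registration failure: %s", err)
    raise
```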
Add the @validates decorator to the Machine class to automatically convert the avail_gpus list to a comma-separated string for database storage. This centralizes the conversion logic in the model layer instead of handling it manually in celery_tasks.py.

- Add the validates import from sqlalchemy.orm
- Implement a validate_avail_gpus() method to handle list-to-string conversion
- Remove the manual string conversion logic from capture_worker_name()
- Pass avail_gpus as a list to the Machine constructor, letting the validator handle conversion
- Update comments to document the automatic conversion behavior

This improves code maintainability by following the DRY principle and ensures consistent data formatting across all Machine instances.
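A minimal sketch of the @validates approach on a hypothetical Machine model; column names and sizes are assumed:

```python
"""Sketch: list-to-string conversion centralized in the model via @validates."""
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, validates

Base = declarative_base()

class Machine(Base):
  __tablename__ = "machine"
  id = Column(Integer, primary_key=True)
  hostname = Column(String(60))
  avail_gpus = Column(String(128))

  @validates("avail_gpus")
  def validate_avail_gpus(self, _key, value):
    """Accept a list of GPU ids and store it as a comma-separated string."""
    if isinstance(value, (list, tuple)):
      return ",".join(str(gpu) for gpu in value)
    return value
```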
…setter

Replace SQLAlchemy's `validates` decorator with `hybrid_property` for the `avail_gpus` field to provide better separation between database storage (comma-separated string) and application usage (list of integers). Changes:

- Rename the column to `_avail_gpus` with proper column mapping
- Add a `@hybrid_property` getter that returns List[int] for application use
- Add a setter that converts the list to a comma-separated string for DB storage
- Handle both list and string inputs with proper type conversion
- Minor code formatting improvements (whitespace, line breaks)

This improves type safety and makes the interface more intuitive while maintaining backward compatibility with the existing database schema.
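A minimal sketch of the hybrid_property replacement, again with assumed column details:

```python
"""Sketch: avail_gpus exposed as List[int], stored as a comma-separated string."""
from typing import List
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.hybrid import hybrid_property
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Machine(Base):
  __tablename__ = "machine"
  id = Column(Integer, primary_key=True)
  _avail_gpus = Column("avail_gpus", String(128))  # stored as "0,1,2,..."

  @hybrid_property
  def avail_gpus(self) -> List[int]:
    """Application-facing view: comma-separated string -> list of ints."""
    if not self._avail_gpus:
      return []
    return [int(gpu) for gpu in self._avail_gpus.split(",")]

  @avail_gpus.setter
  def avail_gpus(self, value):
    """Accept either a list of ints or an already formatted string."""
    if isinstance(value, str):
      self._avail_gpus = value
    else:
      self._avail_gpus = ",".join(str(gpu) for gpu in value)
```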
Add type checking before converting avail_gpus to prevent double conversion when the value is already a list from the hybrid property getter. This fixes a bug where a string split was attempted on list objects, causing an AttributeError.

- Check if avail_gpus is already a list before attempting string conversion
- Only split and convert when the value is a non-empty string from the database
- Prevents runtime errors when the hybrid property returns a list
Add a `detect_and_handle_locked_jobs` method to identify jobs that are stuck behind stale database locks and are blocking pipeline progress. The method:

- Queries jobs without locking to detect whether they are being skipped by FOR UPDATE SKIP LOCKED clauses
- Automatically marks jobs with high retry counts (≥2) as errored to unblock the pipeline
- Logs warnings about detected locked jobs for debugging

Also enhance logging in `mituna_interface.py` to provide better visibility into job fetching and state transitions, including job counts and IDs.

This resolves issues where stale database locks cause jobs to be perpetually skipped, blocking pipeline execution.
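A hedged sketch of how such a lock-detection pass could work; the job model, state values, and retry column are assumptions:

```python
"""Sketch: detect rows hidden by stale locks by diffing unlocked vs SKIP LOCKED queries."""
def detect_and_handle_locked_jobs(session, job_model, max_retries=2):
  """Find jobs visible without locking but unreachable with SKIP LOCKED."""
  pending = session.query(job_model).filter(job_model.state == "new")
  visible_ids = {row.id for row in pending}
  lockable_ids = {row.id for row in pending.with_for_update(skip_locked=True)}
  stuck_ids = visible_ids - lockable_ids  # rows another transaction still holds
  if stuck_ids:
    for job in pending.filter(job_model.id.in_(stuck_ids)):
      if job.retries >= max_retries:
        job.state = "errored"  # mark as errored to unblock the pipeline
  session.commit()
  return stuck_ids
```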
…rface

- Reorganize imports alphabetically and group them by standard library, third-party, and local
- Consolidate related imports using parentheses for better readability
- Remove excessive debug logging in the get_jobs() method
- Add a new reconcile_tracking_state() method for job state reconciliation
- Replace multiple verbose log statements with a single summary log message

This improves code maintainability by following PEP 8 import conventions and reduces log noise while maintaining essential debugging information. The new reconciliation method helps prevent the distributor from getting stuck on stale job states.
Add `--new_only` and `--config_limit` command-line arguments to optimize applicability update operations. The `--new_only` flag skips configs with existing applicability data, while `--config_limit` restricts the number of configs processed. Changes:

- Add new CLI arguments for filtering applicability updates
- Implement worker scaling based on GPU count (4x multiplier) for applicability operations
- Pass filtering parameters through worker initialization and query methods
- Add detailed logging for worker creation, config processing, and worker count calculations
- Reduce log verbosity by commenting out duplicate job warnings in load_job.py

This improves performance for large-scale applicability updates by allowing incremental processing and testing with limited datasets.
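A small sketch of the new flags and the 4x GPU worker scaling; everything beyond the flag names is assumed:

```python
"""Sketch: CLI flags for incremental applicability updates and worker scaling."""
import argparse

def parse_args():
  parser = argparse.ArgumentParser(description="Update solver applicability")
  parser.add_argument("--new_only", action="store_true",
                      help="skip configs that already have applicability data")
  parser.add_argument("--config_limit", type=int, default=None,
                      help="process at most this many configs")
  return parser.parse_args()

def applicability_worker_count(num_gpus):
  """Applicability work is DB/CPU bound, so scale workers to 4x the GPU count."""
  return max(1, num_gpus * 4)
```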
Update subquery filtering to use explicit select() method calls for SQLAlchemy 2.0 compatibility. This change affects config tag filtering and solver application queries, ensuring proper subquery execution in newer SQLAlchemy versions where implicit select is deprecated.
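A before/after sketch of the SQLAlchemy 2.0-style subquery usage described above, with placeholder model names:

```python
"""Sketch: explicit select() subqueries instead of implicit selects."""
from sqlalchemy import select

def configs_with_tag(session, config_model, tag_model, tag):
  """Filter configs by tag via an explicit subquery (2.0 style).

  1.x style relied on an implicit select when filtering with IN:
      session.query(config_model).filter(config_model.id.in_(tagged_query))
  """
  tagged_ids = select(tag_model.config_id).where(tag_model.tag == tag)
  return session.execute(
      select(config_model).where(config_model.id.in_(tagged_ids))
  ).scalars().all()
```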
…seem to be processed by "update_applicability" after import. Consider reverting this.

perf(miopen): optimize config import with batch processing and tensor caching

This commit significantly improves the performance of config import operations.

**Changes:**

- Add batch processing for database operations in import_configs
  - New `batch_insert_tensors()` function for bulk tensor insertion
  - New `get_or_create_tensor_ids()` for efficient tensor ID retrieval
  - New `batch_process_drivers()` to process drivers in configurable batches
- Optimize tensor caching in the MIOpenDriver base class
  - Check the cache before creating a database session to avoid unnecessary connections
  - Reduce redundant cache lookups after loading the tensor map
  - Early return on cache hits to minimize database operations
- Add CLI arguments for batch configuration
  - `--batch_size`: configurable batch size (default: 1000)
  - `--disable_batch_import`: fallback flag for debugging

**Why:**

The original implementation processed configs one-by-one, creating a database session for each tensor lookup/insert. This caused significant overhead when importing large numbers of configs. The new batch processing approach reduces database round-trips and leverages bulk operations for better performance.
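A sketch of what the bulk tensor helpers named above could look like under SQLAlchemy 2.0; the schema and the key column used for deduplication are assumptions:

```python
"""Sketch: bulk tensor insertion and id lookup instead of per-row sessions."""
from sqlalchemy import insert, select

def batch_insert_tensors(session, tensor_model, tensor_dicts):
  """Insert many tensor rows with one multi-row INSERT instead of one per row."""
  if tensor_dicts:
    session.execute(insert(tensor_model), tensor_dicts)
    session.commit()

def get_or_create_tensor_ids(session, tensor_model, tensor_dicts, key="uid"):
  """Return a mapping key -> tensor id, inserting tensors not yet present."""
  keys = [t[key] for t in tensor_dicts]
  lookup = select(tensor_model).where(getattr(tensor_model, key).in_(keys))
  existing = {getattr(row, key): row.id
              for row in session.execute(lookup).scalars()}
  missing = [t for t in tensor_dicts if t[key] not in existing]
  batch_insert_tensors(session, tensor_model, missing)
  # Re-query once to pick up the ids of the freshly inserted rows.
  for row in session.execute(lookup).scalars():
    existing[getattr(row, key)] = row.id
  return existing
```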
…s w Antti

Change the TUNA_DB_NAME environment variable from 'silo_heuristic' to 'silo_heuristic_2d' in the MI355 setup script to point to the 2D-specific heuristic database instance.
Motivation
Updates to Silo Heuristic benchmarking pipeline. DO NOT MERGE.
Technical Details
Test Plan
Test Result
Submission Checklist