@izaitsevfb izaitsevfb commented Dec 16, 2025

Adds support for more granular dispatches (test and job level filters, see pytorch/pytorch#168201) to autorevert.

  • Job/test filtering: When restarting workflows, only re-run specific failed jobs and tests instead of the entire workflow (uses workflow_dispatch inputs when supported)
  • Workflow resolver: Parses workflow YAML files to detect which inputs are available for filtering
  • New CLI subcommand: restart-workflow for manually triggering filtered workflow restarts
  • Fallback behavior: Workflows without input support (e.g., inductor) fall back to full workflow restart
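A rough sketch of how the workflow resolver and the fallback could fit together, assuming the workflow YAML has already been parsed into a mapping. The input names `jobs-to-include` and `tests-to-include` and the `InputSupport` type are hypothetical stand-ins, not taken from this PR.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Hypothetical input names; the real workflow inputs may be named differently.
JOBS_INPUT = "jobs-to-include"
TESTS_INPUT = "tests-to-include"


@dataclass(frozen=True)
class InputSupport:
    """Hypothetical result type: which filters a workflow accepts."""

    jobs_filter: bool = False
    tests_filter: bool = False


def detect_input_support(workflow: Mapping[Any, Any]) -> InputSupport:
    """Inspect a parsed workflow document for workflow_dispatch inputs."""
    # YAML 1.1 parsers such as PyYAML read the bare key `on` as boolean True,
    # so both spellings are checked.
    triggers = workflow.get("on", workflow.get(True))
    if not isinstance(triggers, Mapping):
        return InputSupport()  # e.g. `on: push` or `on: [push, pull_request]`
    dispatch = triggers.get("workflow_dispatch")
    if not isinstance(dispatch, Mapping):
        return InputSupport()  # no dispatch trigger, or one without inputs
    inputs = dispatch.get("inputs")
    if not isinstance(inputs, Mapping):
        return InputSupport()
    return InputSupport(JOBS_INPUT in inputs, TESTS_INPUT in inputs)
```

A workflow for which this returns `InputSupport()` would take the full-restart path described in the last bullet.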

Testing

(Links lead to the runs for each commit; look for the runs issued from my account, which are the results of local testing.)


  1. manual dispatch testing:
python -m pytorch_auto_revert restart-workflow pull  4816fd912210162bea4cdf34f7a39d2909477549 --jobs "linux-jammy-py3.10-gcc11" --tests "distributed/test_functional_differentials"

runs: https://github.com/pytorch/pytorch/actions/workflows/pull.yml?query=branch%3Atrunk%2F4816fd912210162bea4cdf34f7a39d2909477549


  2. granular restart on trunk:
python -m pytorch_auto_revert autorevert-checker pull --hours 12 --as-of "2025-12-18 06:25" --hud-html

log: P2090410466

runs (filters by job and test):
https://github.com/pytorch/pytorch/actions/workflows/pull.yml?query=branch%3Atrunk%2F9fe21ba6d0583790c1857485ede8e17c89ab9afd
https://github.com/pytorch/pytorch/actions/workflows/pull.yml?query=branch%3Atrunk%2F3fc6a055e09174135cd839e723c4f0bdab9589b3


  3. many restarts:
python -m pytorch_auto_revert --dry-run autorevert-checker pull --hours 18  --hud-html --as-of "2025-12-19 22:00"

log: P2090444414

runs:
https://github.com/pytorch/pytorch/actions/workflows/pull.yml?query=branch%3Atrunk%2Feafa4f67d2afdca606eebbca50571b0ba1ab922b
https://github.com/pytorch/pytorch/actions/workflows/pull.yml?query=branch%3Atrunk%2F96b3e7d78914f5db043e8b9ae3b3f72498abca4e
https://github.com/pytorch/pytorch/actions/workflows/pull.yml?query=branch%3Atrunk%2F7d49bd5060925055724d8976794cc1fd328066aa


  4. workflow without input support (inductor):
python -m pytorch_auto_revert  autorevert-checker inductor --hours 64 --hud-html --as-of "2025-12-18 17:31"

log: P2090462639

run: https://github.com/pytorch/pytorch/actions/workflows/inductor.yml?query=branch%3Atrunk%2Fa79fbc97065538f756418e6e3bde02a708e893b5

@pytorch-bot pytorch-bot bot added the ci-no-td label Dec 16, 2025
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 16, 2025
@izaitsevfb izaitsevfb marked this pull request as draft December 16, 2025 00:27
notes_parts.append(f"tests_filter={','.join(tests_to_include)}")
if notes:  # Error message from exception
    notes_parts.append(notes)
notes = "; ".join(notes_parts) if notes_parts else ""
Contributor

You are overwriting notes here; it may already have been set elsewhere in the code. Maybe create notes_parts right at the start of the function and populate it as you go?
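A minimal sketch of that suggestion, assuming the function receives the filters and an optional error message. The `build_notes` name, its parameters, and the `jobs_filter=` prefix are hypothetical; only `notes_parts` and `tests_filter=` appear in the hunk above.

```python
from typing import List, Optional, Sequence


def build_notes(
    jobs_to_include: Sequence[str] = (),
    tests_to_include: Sequence[str] = (),
    error: Optional[str] = None,
) -> str:
    """Collect every note fragment in one list, then join once at the end."""
    notes_parts: List[str] = []  # created up front so nothing is overwritten
    if jobs_to_include:
        # "jobs_filter" is an assumed label mirroring "tests_filter".
        notes_parts.append(f"jobs_filter={','.join(jobs_to_include)}")
    if tests_to_include:
        notes_parts.append(f"tests_filter={','.join(tests_to_include)}")
    if error:  # error message from an exception, if any
        notes_parts.append(error)
    return "; ".join(notes_parts)
```

Joining an empty list already yields an empty string, so the trailing `if notes_parts else ""` guard is not needed in this shape.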

)
)
deduped.append(
Signal(
Contributor

To avoid maintenance hell and simplify constructions like this, please use dataclasses.replace everywhere you copy an object this way.

Contributor Author

I remembered the issue I had with that: Signal, SignalCommit, and SignalEvent are regular classes, not dataclasses.

potentially a good suggestion, but should probably be done as a separate BE PR.
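For reference, a small sketch of what the suggestion would look like once the classes are dataclasses, and why it does not work today. The `Signal` fields below are hypothetical and much simpler than the project's class.

```python
import dataclasses
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Signal:  # hypothetical stand-in, not the project's Signal class
    key: str
    workflow_name: str
    commits: Tuple[str, ...] = ()


class PlainSignal:  # a regular class, like the ones described in the reply
    def __init__(self, key: str) -> None:
        self.key = key


original = Signal(key="test_a", workflow_name="pull", commits=("abc", "def"))
# Copy every field, overriding only the one that changes.
deduped = dataclasses.replace(original, commits=("abc",))

try:
    dataclasses.replace(PlainSignal("test_a"), key="test_b")
    plain_supported = True
except TypeError:
    plain_supported = False  # replace() rejects non-dataclass instances
```

With `replace`, adding a new field to the class does not require touching every call site that copies an instance, which is the maintenance concern raised in the review.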

filtered.append(e)
prev_key = key
new_commits.append(
SignalCommit(
Contributor

dataclasses.replace

)

out.append(
Signal(
Contributor

dataclasses.replace

self._by_display[name] = ref
self._by_file[base] = ref

def get_input_support(self, workflow_name: str) -> WorkflowInputSupport:
Contributor

remove the overcomplicated self._input_support_cache and replace with

@lru_cache(maxsize=512)
def get_input_support(self, workflow_name: str) -> WorkflowInputSupport:
   # move the logic from self._fetch_and_parse_workflow_inputs here

Contributor Author

>>> Lint for aws/lambda/pytorch-auto-revert/pytorch_auto_revert/workflow_resolver.py:

  Warning (FLAKE8) B019
    Use of `functools.lru_cache` or `functools.cache` on methods can lead to
    memory leaks. The cache may retain instance references, preventing garbage
    collection.
    See

        130  |        ref = self.require(workflow_name)
        131  |        return self._fetch_workflow_input_support(ref.file_name)
        132  |
    >>> 133  |    @lru_cache(maxsize=None)
        134  |    def _fetch_workflow_input_support(self, file_name: str) -> WorkflowInputSupport:
        135  |        """Fetch and parse workflow input support with caching.
        136  |
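One common workaround for this warning, sketched under the assumption that a per-instance cache is acceptable: wrap the bound method in `__init__` instead of decorating the method on the class. `WorkflowResolverSketch` and its placeholder fetch logic are hypothetical.

```python
from functools import lru_cache


class WorkflowResolverSketch:  # hypothetical, not the project's resolver
    def __init__(self) -> None:
        self.fetch_count = 0
        # Wrap the bound method per instance: the cache is stored on the
        # instance rather than on the class, so the garbage collector can
        # reclaim it together with the instance.
        self.get_input_support = lru_cache(maxsize=None)(self._fetch_input_support)

    def _fetch_input_support(self, file_name: str) -> bool:
        self.fetch_count += 1  # stands in for a network call
        return file_name != "inductor.yml"  # placeholder result


resolver = WorkflowResolverSketch()
resolver.get_input_support("pull.yml")
resolver.get_input_support("pull.yml")  # served from the cache
```

Each resolver instance gets its own cache here, so results are not shared between instances; whether that matters depends on how many resolvers the process creates.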

for attempt in RetryWithBackoff():
    with attempt:
        contents = self._repository.get_contents(path)
        yaml_content = contents.decoded_content.decode("utf-8")
Contributor

Call workflow = yaml.safe_load(yaml_content) here.

You can't trust that a successful request actually yields a valid response (even if the status is 200, etc.).

You can get garbage data, so you might want to retry at this point.
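A sketch of the idea with a plain loop instead of the project's RetryWithBackoff helper, whose exact behavior is not shown here. The `fetch_and_parse` helper is hypothetical, and `json.loads` stands in for `yaml.safe_load` only to keep the example dependency-free.

```python
import json
import time
from typing import Any, Callable, Dict


def fetch_and_parse(
    fetch: Callable[[], str],
    parse: Callable[[str], Any],
    attempts: int = 3,
    base_delay: float = 0.0,
) -> Dict[str, Any]:
    """Retry the fetch and the parse together, so garbage payloads are retried."""
    last_error: Exception = RuntimeError("no attempts made")
    for attempt in range(attempts):
        try:
            parsed = parse(fetch())
            if isinstance(parsed, dict):
                return parsed  # only a mapping counts as a valid workflow
            last_error = ValueError(f"unexpected payload type: {type(parsed).__name__}")
        except Exception as exc:  # network failure or parse failure
            last_error = exc
        if attempt + 1 < attempts:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_error


# Two bad payloads followed by a good one; json.loads stands in for the YAML parser.
payloads = iter(["<html>502 Bad Gateway</html>", "[]", '{"name": "pull"}'])
workflow = fetch_and_parse(lambda: next(payloads), json.loads)
```

The point is that validation happens inside the retried region, so a 200 response with an unusable body is treated the same as a failed request.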

wf_ref = self.resolver.require(workflow_name)

# Check what inputs this workflow supports
input_support = self.resolver.get_input_support(workflow_name)
Contributor

be resilient here.

Accept that this function could fail (raise an exception, etc).

And if it does, just assume the workflow does not support inputs and move on. It is usually better to be suboptimal when things are unstable than to not do anything at all and fail.
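A minimal sketch of that fallback, assuming input support can be summarized as a pair of booleans. `safe_input_support` and `NO_INPUT_SUPPORT` are hypothetical names; the real code returns a WorkflowInputSupport object.

```python
import logging
from typing import Callable, Tuple

# (jobs filter supported, tests filter supported); an assumed representation.
NO_INPUT_SUPPORT: Tuple[bool, bool] = (False, False)


def safe_input_support(
    lookup: Callable[[str], Tuple[bool, bool]], workflow_name: str
) -> Tuple[bool, bool]:
    """Return the workflow's input support, or 'no support' if the lookup fails."""
    try:
        return lookup(workflow_name)
    except Exception:
        # Degrade to a full, unfiltered restart rather than failing the run.
        logging.warning(
            "input support lookup failed for %s; assuming no inputs",
            workflow_name,
            exc_info=True,
        )
        return NO_INPUT_SUPPORT
```

The caller then dispatches with empty inputs whenever it gets `NO_INPUT_SUPPORT`, which matches the existing behavior for workflows such as inductor.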

repo = client.get_repo(f"{self.repo_owner}/{self.repo_name}")
workflow = repo.get_workflow(wf_ref.file_name)
- proper_workflow_create_dispatch(workflow, ref=tag_ref, inputs={})
+ proper_workflow_create_dispatch(workflow, ref=tag_ref, inputs=inputs)
Contributor

I am being pedantic, but there is a problem with handling external APIs this way.

If proper_workflow_create_dispatch fails, you will also retry get_repo and get_workflow. This is not ideal.

You should wrap each request in RetryWithBackoff and validate its output independently.
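A sketch of per-request retries using a small hypothetical `with_retry` helper and fake API calls; it does not use the project's RetryWithBackoff or PyGithub, and the three step functions only count how often they are called.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def with_retry(step: Callable[[], T], attempts: int = 3, base_delay: float = 0.0) -> T:
    """Run a single step, retrying only that step on failure."""
    for attempt in range(attempts):
        try:
            return step()
        except Exception:
            if attempt + 1 == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


calls: Dict[str, int] = {"get_repo": 0, "get_workflow": 0, "dispatch": 0}


def get_repo() -> str:  # fake API call
    calls["get_repo"] += 1
    return "repo"


def get_workflow(repo: str) -> str:  # fake API call
    calls["get_workflow"] += 1
    return f"{repo}/pull.yml"


def dispatch(workflow: str) -> str:  # fake API call that fails twice
    calls["dispatch"] += 1
    if calls["dispatch"] < 3:
        raise ConnectionError("transient failure")
    return f"dispatched {workflow}"


repo = with_retry(get_repo)
workflow = with_retry(lambda: get_workflow(repo))
result = with_retry(lambda: dispatch(workflow))
```

In this run the dispatch is attempted three times while `get_repo` and `get_workflow` are each called once, which is the behavior the comment asks for.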

@jeanschmidt
Contributor

There are a few things in the code that I believe we should go over and fix first (in other PRs) before implementing these changes, to avoid cascading bad standards.

Let me know if you need any help on this.

@izaitsevfb izaitsevfb marked this pull request as ready for review December 20, 2025 00:05
@izaitsevfb
Contributor Author

@jeanschmidt addressed your comments (except some where I left a reply), this PR is tested and ready for review!

@izaitsevfb
Contributor Author

consider reviewing commit-by-commit:

  1. the first one you already saw
  2. the second one adds the local CLI to restart jobs + minor fix
  3. addresses review comments
