
(Closes #3157) Initial implementation of maximal parallel region trans.#3205

Open
LonelyCat124 wants to merge 21 commits into master from 3157_parallel_routine_trans

Conversation

@LonelyCat124
Collaborator

No description provided.

@LonelyCat124 LonelyCat124 changed the title Initial implementation of maximal parallel region trans. Tests TODO (Closes #3157) Initial implementation of maximal parallel region trans. Oct 31, 2025
@codecov

codecov bot commented Oct 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.95%. Comparing base (59f796e) to head (5395733).

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #3205    +/-   ##
========================================
  Coverage   99.95%   99.95%            
========================================
  Files         380      382     +2     
  Lines       53949    54050   +101     
========================================
+ Hits        53927    54028   +101     
  Misses         22       22            


@LonelyCat124
Collaborator Author

@MetBenjaminWent Chris said you had some cases worth trying this with as functionality tests?

@LonelyCat124
Collaborator Author

Made a bit more progress with this now - there was definitely some missing logic for bdy_impl3 to work.

One thing that is apparent is that applying things this way results in barriers that live outside parallel regions; these should be purged by OMPMinimiseSyncTrans, but currently aren't.

@sergisiso @arporter Am I ok to make that change as part of this PR?

@LonelyCat124
Collaborator Author

I fixed the previous issue, and this has raised some other challenges with the cases I received from MO.

  1. The OpenMP sentinel isn't kept, and we don't seem to have a PSyclone option to keep statements with the OpenMP sentinel, which is a problem for some pre-existing files that use the sentinel for conditional compilation. @hiker I think you've mentioned this previously - do we have any branch of PSyclone that handles this at all? I know it's in fparser, so the frontend can presumably do something with it in PSyclone.
  2. Assignments are currently always excluded from parallel regions, but this is not a good idea. For example, we could have a statement such as x = y / omp_get_thread_num(), which would have to be inside a parallel region. I was wondering about just treating scalar assignments as valid inside parallel regions, but this is not so straightforward and I don't have a good answer for how to do it - perhaps scalar assignments to local variables are allowed and others aren't? I'm sure there are still some edge cases here though.
  3. We can't handle some loop structures, e.g. we don't parallelise over jj here:
omp_block = tdims%j_end
!$ omp_block = ceiling(tdims%j_end/real(omp_get_num_threads()))

!$OMP do SCHEDULE(STATIC)
do jj = tdims%j_start, tdims%j_end, omp_block
  do k = blm1, 2, -1
    l = 0
    do j = jj, min(jj+omp_block-1, tdims%j_end)
      do i = tdims%i_start, tdims%i_end
        r_sq = r_rho_levels(i,j,k)*r_rho_levels(i,j,k)
        rr_sq = r_rho_levels(i,j,k+1)*r_rho_levels(i,j,k+1)
        dqw(i,j,k) = (-dtrdz_charney_grid(i,j,k) * (rr_sq * fqw(i,j,k + 1) - r_sq * fqw(i,j,k)) + dqw_nt(i,j,k)) * gamma2(i,j)
...

I assume we just need to use force=True for this loop, but in theory we could evaluate the i,j,k indices and find that they are guaranteed non-overlapping (I think? since we step by omp_block at jj, and j is thus independent), but it's not straightforward. I can't remember if we already have an issue about this kind of analysis? @hiker @sergisiso
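
For reference, a minimal sketch of what forcing the jj loop might look like from a PSyclone script; this is not code from this PR, `routine` is an assumed handle on the PSyIR of bdy_impl3, and depending on the PSyclone version the "force" flag may be a direct keyword argument rather than an entry in the options dict:

    from psyclone.psyir.nodes import Loop
    from psyclone.transformations import OMPLoopTrans

    omp_loop_trans = OMPLoopTrans()
    for loop in routine.walk(Loop):
        if loop.variable.name == "jj":
            # Skip the dependence analysis that rejects the blocked jj loop.
            omp_loop_trans.apply(loop, options={"force": True})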

@sergisiso
Collaborator

Regarding 3: this is what @mn416's #3213 could solve; it may be worth trying with his dep_analysis.

@LonelyCat124
Collaborator Author

Yeah, I'd just been looking at the PR earlier today after I looked at this - I guess it's still a while until we'd have it ready, but if it could handle cases like this it would definitely help (assuming the rest is otherwise ok).

@mn416
Collaborator

mn416 commented Nov 26, 2025

I copied this chunk of code into a minimally compiling module:

module example_module
  implicit none

  type :: Dims
     integer :: j_start, j_end, i_start, i_end
  end type

contains

  subroutine sub(tdims, r_rho_levels, dtrdz_charney_grid, fqw, dqw, dqw_nt, gamma2, blm1, omp_block)
    type(Dims), intent(inout) :: tdims
    integer, intent(inout) :: r_rho_levels(:,:,:)
    integer, intent(inout) :: dtrdz_charney_grid(:,:,:)
    integer, intent(inout) :: fqw(:,:,:)
    integer, intent(inout) :: dqw(:,:,:)
    integer, intent(inout) :: dqw_nt(:,:,:)
    integer, intent(inout) :: gamma2(:,:)
    integer, intent(inout) :: blm1, omp_block
    integer :: jj, k, r_sq, rr_sq, l, i, j

    do jj = tdims%j_start, tdims%j_end, omp_block
      do k = blm1, 2, -1
        l = 0
        do j = jj, min(jj+omp_block-1, tdims%j_end)
          do i = tdims%i_start, tdims%i_end
            r_sq = r_rho_levels(i,j,k)*r_rho_levels(i,j,k)
            rr_sq = r_rho_levels(i,j,k+1)*r_rho_levels(i,j,k+1)
            dqw(i,j,k) = (-dtrdz_charney_grid(i,j,k) * (rr_sq * fqw(i,j,k + 1) - r_sq * fqw(i,j,k)) + dqw_nt(i,j,k)) * gamma2(i,j)
          end do
        end do
      end do
    end do

  end subroutine
end module

The analysis gives the following output:

Routine sub: 
  Loop jj: conflict free
  Loop k: conflict free
  Loop j: conflict free
  Loop i: conflict free

Interestingly, it does take a few seconds to prove that the jj loop is conflict free.

@LonelyCat124
Collaborator Author

I used force=True for the jj loops in the bdy_impl3.F90 file, and I get only 3 parallel regions.

There are 2 things left to potentially look at.

  1. How to determine whether an Assignment is safe to go inside a parallel region - or if we should worry about that right now (and if we can safely determine it and that the assigned variable is private). There are some slightly naive rules I could make (e.g. scalar assignments to local variables are fine), but it's difficult to determine if this is actually true. I can't even rely on them being read-only, as a value could need to be read from some location that ends up outside the parallel region, and since the parallel region is built up manually I'm not sure how to solve this. Any ideas? @sergisiso @hiker
  2. The common pattern of an omp end do nowait followed by an omp barrier. I end up (after barrier reduction) with some of these, and I think it might be slightly neater to convert this pattern into a plain omp end do, but it shouldn't move the needle in a relevant way w.r.t. performance, so I'm happy to leave this until later.

@sergisiso
Collaborator

@LonelyCat124 Could we finish this PR without the Assignments, as there are a few cases that are just contiguous loops that we could already do? And then do the Assignments in a follow-up, as they seem complicated and worth discussing separately.

@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

@LonelyCat124 Could we finish this PR without the Assignments, as there are a few cases that are just contiguous loops that we could already do? And then do the Assignments in a follow-up, as they seem complicated and worth discussing separately.

@sergisiso I'm slightly hesitant since some of the examples I was given by MO do need these (and in fact would create wrong code without them) if it were applied. I think at least I'd need/want to add a check for now to see whether any omp_... calls appear on the RHS of an assignment, and fail validation if so? What do you think?
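
A minimal sketch of the kind of check being proposed here (not the PR's implementation): walk the RHS of an Assignment and flag it if it calls any omp_* routine. It assumes the call appears as a PSyIR Call node rather than a CodeBlock, and that Call.routine gives access to the called routine's name; the helper name is hypothetical:

    from psyclone.psyir.nodes import Assignment, Call

    def rhs_uses_omp_runtime(assignment: Assignment) -> bool:
        '''Return True if the RHS of the assignment calls an omp_* routine
        (e.g. omp_get_thread_num), in which case validation should fail.'''
        for call in assignment.rhs.walk(Call):
            # Fortran routine names are case-insensitive.
            if call.routine.name.lower().startswith("omp_"):
                return True
        return False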

I had started implementing some logic for Assignment now, but the next-accesses check is somewhat complicated.

Edit: Though I think we're also limited without Matthew's #3213 anyway, so maybe the assignment checks aren't so urgent.

Edit 2: This was my current plan:

        # Assignments can be in the parallel region if they write to a locally
        # defined scalar variable (i.e. the symbol has an AutomaticInterface).
        # TODO The lhs variable needs to be private - not sure we can do that
        # yet.
        if isinstance(node, Assignment):
            if not isinstance(node.lhs.symbol.interface, AutomaticInterface):
                return False
            if node.lhs.symbol.is_array:
                return False
            # TODO We need to mark the LHS variable as private.
            # TODO We need to ensure the next access to the lhs variable is
            # also contained in the same parallel region.
            return True

But the next-access check needs to happen in apply, when applying the parallel transformation, and I think it becomes a bit of a mess (since we may end up having to split the parallel regions again at this point).

@sergisiso
Collaborator

I'm slightly hesitant since some of the examples I was given by MO do need these (and in fact would create wrong code without them) if it were applied.

@LonelyCat124 I need some context about this issue, can you paste an example here?

@sergisiso
Collaborator

@LonelyCat124 I thought that, without assignments, it would be encapsulating in a transformation what you already implemented in nemo/eg7/openmp_cpu_nowait_trans.py. Does this example also have the problem then?

@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

@LonelyCat124 I thought that, without assignments, it would be encapsulating in a transformation what you already implemented in nemo/eg7/openmp_cpu_nowait_trans.py. Does this example also have the problem then?

Yeah it would probably be similar, but I was testing it with some code (https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/physics_schemes/source/boundary_layer/bdy_impl3.F90) and we do a lot worse than the manual implementation. I think omp_block = ceiling(tdims%j_end/real(omp_get_num_threads())) is safe to have outside the parallel region, but since we can't put the following loop in an omp do anyway (maybe with force we can) it might not be important.

I'm happy to just go with this simple version initially and then improve it; it probably makes reviewing more straightforward. I'll write some tests for the current functionality and add some stuff to the docstrings.

@LonelyCat124
Collaborator Author

I'm slightly hesitant since some of the examples I was given by MO do need these (and in fact would create wrong code without them) if it were applied.

@LonelyCat124 I need some context about this issue, can you paste an example here?

@sergisiso An example of one we'd do wrong right now would be any code where we have x = omp_get_thread_num() as a child of a Routine, since I think this statement is only valid inside a parallel region, and we'd never consider putting the assignment in a parallel region.

@sergisiso
Collaborator

sergisiso commented Jan 12, 2026

These last 2 would also be an issue without this transformation if we do the typical for loop in psyir.walk(Loop): OMPLoopTrans. In both cases (new transformation or previous approach) they need to be found and added to the Parallel region independently. So I would still go ahead without the Assignments for this first PR.
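
For context, the "typical" pattern referred to above is roughly the following (a sketch only: `psyir` is assumed to be the tree being transformed, and loops that fail validation are simply skipped):

    from psyclone.psyir.nodes import Loop
    from psyclone.psyir.transformations import TransformationError
    from psyclone.transformations import OMPLoopTrans, OMPParallelTrans

    loop_trans = OMPLoopTrans()
    region_trans = OMPParallelTrans()
    for loop in psyir.walk(Loop):
        try:
            loop_trans.apply(loop)
            # Wrap the freshly created OMP DO directive in its own region.
            region_trans.apply(loop.parent.parent)
        except TransformationError:
            pass

An assignment such as x = omp_get_thread_num() sitting directly in the Routine is never picked up by this pattern either, which is the point being made here.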

@LonelyCat124 LonelyCat124 marked this pull request as ready for review January 12, 2026 14:15
@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

Blocked due to circular dependencies in examples - will update with blocking PR when created.
Blocked by #3280

@LonelyCat124 LonelyCat124 added the Blocked An issue/PR that is blocked by one or more issues/PRs. label Jan 12, 2026
@LonelyCat124
Collaborator Author

@sergisiso This is now passing CI, so ready for review I think.

@LonelyCat124 LonelyCat124 added ready for review and removed Blocked An issue/PR that is blocked by one or more issues/PRs. labels Jan 23, 2026
Collaborator

@sergisiso left a comment


Thanks for getting this started @LonelyCat124. I have some difficulty visualising how this will be applied to other use cases that have the same pattern, e.g. maximal OMPTarget, maximal profiling regions, maximal ACC kernels. It may be worth starting some of these to make sure the base class is capable of doing them.

TransformationError


class MaximalParallelRegionTrans(RegionTrans, metaclass=abc.ABCMeta):
Collaborator

Should this have the @transformation_documentation_wrapper and remove the validate params?

Collaborator Author

Yes, and I should update RegionTrans if it needs it (though I suspect it's done).

Collaborator Author

I need to think about whether the subclasses need it or not. I think they do (and therefore need to define a very basic apply method).

Collaborator Author

Done.

TransformationError


class MaximalParallelRegionTrans(RegionTrans, metaclass=abc.ABCMeta):
Collaborator

I am wondering if we should remove the "parallel" word from everywhere in this file? There is nothing specific to parallelism; it can be used for any outer_node that we want around a set of valid/invalid inner_nodes.

Collaborator

An example of this is in examples/nemo/scripts/utils.py::add_profiling, which creates the largest possible ProfileNode around sections, excluding Directives (maybe it would be good to add that, to check whether this abstract class is generic enough).

Collaborator Author

Yeah, that seems reasonable, I'll take a look at add_profiling and see if it can be modified to fit.

Collaborator Author

Renaming done; I still need to look at add_profiling.

Collaborator Author

@sergisiso I think add_profiling needs something like this (though maybe there's just a better, explicitly correct option?):

    import itertools

    from psyclone.psyir.nodes import (Directive, IfBlock, Loop, Node, Return,
                                      ScopingNode)
    from psyclone.psyir.transformations import MaximalRegionTrans

    def all_subclasses(cls):
        '''Return all direct and indirect subclasses of cls.'''
        return set(cls.__subclasses__()).union(
            [s for c in cls.__subclasses__() for s in all_subclasses(c)])

    # Use a set so repeated membership tests don't exhaust an iterator.
    disallowed_classes = set(itertools.chain(
        all_subclasses(Directive), all_subclasses(Return),
        all_subclasses(ScopingNode),
        [Directive, Return, ScopingNode, Loop, IfBlock]))

    class MaximalProfilingTrans(MaximalRegionTrans):
        '''Applies Profiling to the largest possible region.'''
        _allowed_nodes = [cls for cls in Node.__subclasses__() if
                          cls not in disallowed_classes]

What do you think about the allowed nodes - is this best, or shall I just list e.g. Assignment, Call, ...? I'm trying to think what other statements are needed (since IfBlocks/Loops/WhileLoops are handled anyway).

Collaborator Author

I settled on

    class MaximalProfilingTrans(MaximalRegionTrans):
        '''Applies Profiling to the largest possible region.'''
        _allowed_nodes = [Assignment, Call, CodeBlock]
        _transformation = ProfileTrans

for now, we can adjust this later if we find it to be insufficient.
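
Presumably (not shown in this thread) the class above would then be used from a transformation script in something like the following way; MaximalProfilingTrans is the sketch just defined, and the per-Routine application is an assumption:

    from psyclone.psyir.nodes import Routine

    def trans(psyir):
        '''Apply maximal profiling regions to every routine in the file.'''
        prof_trans = MaximalProfilingTrans()
        for routine in psyir.walk(Routine):
            prof_trans.apply(routine)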

for node in current_block:
    if node.walk(self._required_nodes,
                 stop_type=self._required_nodes):
        par_trans.apply(current_block)
Collaborator

What happens if this fails the validation? Maybe we want to catch it instead of propagating it, since we may have already partially updated the tree.

Collaborator

For example, if the region directive is an OMPTarget and it finds a call to a non-offloaded subroutine.

Collaborator Author

I've updated this now to try validating all blocks before applying the transformation; however, I'm not sure what should be done when the validation fails. I wonder about having a function to call which can either do nothing and just skip the apply for this block, raise a TransformationError, or do whatever the transformation decides?

Collaborator Author

I've gone with that implementation for now.

if isinstance(child, IfBlock):
    self.apply(child.if_body)
    if child.else_body:
        self.apply(child.else_body)
Collaborator

It doesn't seem right that the apply is recursive because each apply calls its own validate, and validate is also recursive. So the validations currently grow quadratically.

It may be better to move that to an independent recursive method and call that after the validation is complete.

Collaborator

sergisiso commented Jan 26, 2026

Alternatively, you could use something akin to the sets that you used in openmp_cpu_nowait_trans.py, so that validate populates the set and apply applies it, which is not recursive.

Collaborator Author

Yeah, when I was doing the recursive apply I forgot about the validation steps; that's pretty bad. I'll take a look at the nowait transformation and see if I can reuse (slash remember) the logic, as a non-recursive method is probably better if it doesn't involve significantly more complex code.

Collaborator Author

The openmp_cpu_nowait_trans.py on master still uses recursion for loops and ifblocks, so I think there's no escaping that recursion; I'll make this its own routine instead of part of apply.
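
A structural sketch of that restructuring (not the PR's actual code; the class and helper names here are stand-ins): apply validates once at the top level and then hands off to a private recursive worker, so validation no longer grows quadratically:

    from psyclone.psyir.nodes import IfBlock

    class _MaximalRegionSketch:
        '''Illustrative stand-in for the real transformation class.'''

        def apply(self, node):
            self.validate(node)        # single, top-level validation
            self._build_regions(node)  # recursion happens below, unvalidated

        def validate(self, node):
            '''Walk the whole subtree once, rejecting unsupported nodes.'''

        def _build_regions(self, node):
            '''Recursive worker: descend into IfBlock bodies, then group the
            remaining contiguous children into regions.'''
            for child in node.children:
                if isinstance(child, IfBlock):
                    self._build_regions(child.if_body)
                    if child.else_body:
                        self._build_regions(child.else_body)
            # ...group the block's children into maximal regions here...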

from psyclone.psyir.transformations.omp_parallel_trans import OMPParallelTrans


class OMPMaximalParallelRegionTrans(MaximalParallelRegionTrans):
Collaborator

I would prefer the "OMPParallel" words to be together, to make clear it adds OMPParallel directives.

I know I proposed the Maximal word, but do you think it may be confused to mean "the optimally largest" regions - which we don't do?

Is there any more fitting word we could use?

Collaborator Author

I can't think of a better word - I can add that "maximal" means "for what PSyclone can compute" in the docs, but otherwise it's hard.

Collaborator Author

I changed this to MaximalOMPParallel... for now, can continue to iterate if we have a better name.

_eliminate_final_parallel_barrier(parallel)
# Finally eliminate any barriers leftover outside of parallel
# regions, as these are now superfluous
self._eliminate_uncontained_barriers(node)
Collaborator Author

@sergisiso This should resolve #3306.

@LonelyCat124
Collaborator Author

@sergisiso This is ready for another look when you're back off leave.
