
(Closes #3157) Initial implementation of maximal parallel region trans.#3205

Open
LonelyCat124 wants to merge 21 commits into master from 3157_parallel_routine_trans

Conversation

@LonelyCat124
Collaborator

No description provided.

@LonelyCat124 LonelyCat124 changed the title Initial implementation of maximal parallel region trans. Tests TODO (Closes #3157) Initial implementation of maximal parallel region trans. Oct 31, 2025
@codecov

codecov bot commented Oct 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.95%. Comparing base (59f796e) to head (5395733).

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #3205    +/-   ##
========================================
  Coverage   99.95%   99.95%            
========================================
  Files         380      382     +2     
  Lines       53949    54050   +101     
========================================
+ Hits        53927    54028   +101     
  Misses         22       22            


@LonelyCat124
Collaborator Author

@MetBenjaminWent Chris said you had some cases worth trying this with as functionality tests?

@LonelyCat124
Collaborator Author

Made a bit more progress with this now - there was definitely some missing logic for bdy_impl3 to work.

One thing that is apparent is that applying things this way results in barriers that live outside parallel regions; these should be purged by OMPMinimiseSyncTrans, but currently aren't.

@sergisiso @arporter Am I ok to make that change as part of this PR?

@LonelyCat124
Collaborator Author

I fixed the previous issue, and this has raised some other challenges with the cases I received from MO.

  1. The OpenMP sentinel isn't kept, and we don't seem to have a PSyclone option to keep statements with the OpenMP sentinel, which is a problem for some pre-existing files that use the sentinel for conditional compilation. @hiker I think you've mentioned this previously - do we have any branch of PSyclone that handles this at all? I know it's in fparser, so the frontend can presumably do something with it in PSyclone.
  2. Assignments are currently always excluded from parallel regions, but this is not a good idea. For example, we could have a statement such as x = y / omp_get_thread_num(), which would have to be inside a parallel region. I was wondering about just treating scalar assignments as valid inside parallel regions, but this is not so straightforward and I don't have a good answer for how to do it - perhaps scalar assignments to local variables are allowed and others aren't? I'm sure there are still some edge cases here though.
  3. We can't handle some loop structures, e.g. we don't parallelise over jj here:
omp_block = tdims%j_end
!$ omp_block = ceiling(tdims%j_end/real(omp_get_num_threads()))

!$OMP do SCHEDULE(STATIC)
do jj = tdims%j_start, tdims%j_end, omp_block
  do k = blm1, 2, -1
    l = 0
    do j = jj, min(jj+omp_block-1, tdims%j_end)
      do i = tdims%i_start, tdims%i_end
        r_sq = r_rho_levels(i,j,k)*r_rho_levels(i,j,k)
        rr_sq = r_rho_levels(i,j,k+1)*r_rho_levels(i,j,k+1)
        dqw(i,j,k) = (-dtrdz_charney_grid(i,j,k) * (rr_sq * fqw(i,j,k + 1) - r_sq * fqw(i,j,k)) + dqw_nt(i,j,k)) * gamma2(i,j)
...

I assume we just need to use force=True for this loop, but in theory we could evaluate the i,j,k indices and find that they are guaranteed non-overlapping (I think? since we step by omp_block at jj, and j is thus independent), but it's not straightforward. I can't remember if we already have an issue about this kind of analysis? @hiker @sergisiso
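
For reference, a minimal sketch of what forcing the jj loop might look like from a PSyclone script; this is not code from this PR, `routine` is an assumed handle on the PSyIR of bdy_impl3, and depending on the PSyclone version the "force" flag may be a direct keyword argument rather than an entry in the options dict:

    from psyclone.psyir.nodes import Loop
    from psyclone.transformations import OMPLoopTrans

    omp_loop_trans = OMPLoopTrans()
    for loop in routine.walk(Loop):
        if loop.variable.name == "jj":
            # Skip the dependence analysis that rejects the blocked jj loop.
            omp_loop_trans.apply(loop, options={"force": True})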

@sergisiso
Collaborator

Regarding 3: this is what @mn416's #3213 could solve; it may be worth trying with his dep_analysis.

@LonelyCat124
Collaborator Author

Yeah, I'd just been looking at the PR earlier today after I looked at this - I guess it's still a while until we'd have it ready, but if it could handle cases like this it would definitely help (assuming the rest is otherwise ok).

@mn416
Collaborator

mn416 commented Nov 26, 2025

I copied this chunk of code into a minimally compiling module:

module example_module
  implicit none

  type :: Dims
     integer :: j_start, j_end, i_start, i_end
  end type

contains

  subroutine sub(tdims, r_rho_levels, dtrdz_charney_grid, fqw, dqw, dqw_nt, gamma2, blm1, omp_block)
    type(Dims), intent(inout) :: tdims
    integer, intent(inout) :: r_rho_levels(:,:,:)
    integer, intent(inout) :: dtrdz_charney_grid(:,:,:)
    integer, intent(inout) :: fqw(:,:,:)
    integer, intent(inout) :: dqw(:,:,:)
    integer, intent(inout) :: dqw_nt(:,:,:)
    integer, intent(inout) :: gamma2(:,:)
    integer, intent(inout) :: blm1, omp_block
    integer :: jj, k, r_sq, rr_sq, l, i, j

    do jj = tdims%j_start, tdims%j_end, omp_block
      do k = blm1, 2, -1
        l = 0
        do j = jj, min(jj+omp_block-1, tdims%j_end)
          do i = tdims%i_start, tdims%i_end
            r_sq = r_rho_levels(i,j,k)*r_rho_levels(i,j,k)
            rr_sq = r_rho_levels(i,j,k+1)*r_rho_levels(i,j,k+1)
            dqw(i,j,k) = (-dtrdz_charney_grid(i,j,k) * (rr_sq * fqw(i,j,k + 1) - r_sq * fqw(i,j,k)) + dqw_nt(i,j,k)) * gamma2(i,j)
          end do
        end do
      end do
    end do

  end subroutine
end module

The analysis gives the following output:

Routine sub: 
  Loop jj: conflict free
  Loop k: conflict free
  Loop j: conflict free
  Loop i: conflict free

Interestingly, it does take a few seconds to prove that the jj loop is conflict free.

@LonelyCat124
Collaborator Author

I used force=True for the jj loops in the bdy_impl3.F90 file, and I get only 3 parallel regions.

There are 2 things left to potentially look at.

  1. How to determine whether an Assignment is safe to go inside a parallel region - or if we should worry about that right now (and if we can safely determine it and that the assigned variable is private). There are some slightly naive rules I could make (e.g. scalar assignments to local variables are fine), but it's difficult to determine if this is actually true. I can't even rely on them being read-only, as a value could need to be read from some location that ends up outside the parallel region, and since the parallel region is built up manually I'm not sure how to solve this. Any ideas? @sergisiso @hiker
  2. The common pattern of an omp end do nowait followed by an omp barrier. I end up (after barrier reduction) with some of these, and I think it might be slightly neater to convert this pattern into a plain omp end do, but it shouldn't move the needle in a relevant way w.r.t. performance, so I'm happy to leave this until later.

@sergisiso
Collaborator

@LonelyCat124 Could we finish this PR without the Assignments, as there are a few cases that are just contiguous loops that we could already do? And then do the Assignments in a follow-up, as they seem complicated and worth discussing separately.

@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

@LonelyCat124 Could we finish this PR without the Assignments, as there are a few cases that are just contiguous loops that we could already do? And then do the Assignments in a follow-up, as they seem complicated and worth discussing separately.

@sergisiso I'm slightly hesitant since some of the examples I was given by MO do need these (and in fact would create wrong code without them) if it were applied. I think at least I'd need/want to add a check for now to see whether any omp_... calls appear on the RHS of an assignment, and fail validation if so? What do you think?
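
A minimal sketch of the kind of check being proposed here (not the PR's implementation): walk the RHS of an Assignment and flag it if it calls any omp_* routine. It assumes the call appears as a PSyIR Call node rather than a CodeBlock, and that Call.routine gives access to the called routine's name; the helper name is hypothetical:

    from psyclone.psyir.nodes import Assignment, Call

    def rhs_uses_omp_runtime(assignment: Assignment) -> bool:
        '''Return True if the RHS of the assignment calls an omp_* routine
        (e.g. omp_get_thread_num), in which case validation should fail.'''
        for call in assignment.rhs.walk(Call):
            # Fortran routine names are case-insensitive.
            if call.routine.name.lower().startswith("omp_"):
                return True
        return False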

I had started implementing some logic for Assignment now, but the next-accesses check is somewhat complicated.

Edit: Though I think we're also limited without Matthew's #3213 anyway, so maybe the assignment checks aren't so urgent.

Edit 2: This was my current plan:

        # Assignments can be in the parallel region if they write to a locally
        # defined scalar variable (i.e. the symbol has an AutomaticInterface).
        # TODO The lhs variable needs to be private - not sure we can do that
        # yet.
        if isinstance(node, Assignment):
            if not isinstance(node.lhs.symbol.interface, AutomaticInterface):
                return False
            if node.lhs.symbol.is_array:
                return False
            # TODO We need to mark the LHS variable as private.
            # TODO We need to ensure the next access to the lhs variable is
            # also contained in the same parallel region.
            return True

But the next-access check needs to happen in apply, when applying the parallel transformation, and I think it becomes a bit of a mess (since we may end up having to split the parallel regions again at this point).

@sergisiso
Collaborator

I'm slightly hesitant since some of the examples I was given by MO do need these (and in fact would create wrong code without them) if it were applied.

@LonelyCat124 I need some context about this issue, can you paste an example here?

@sergisiso
Collaborator

@LonelyCat124 I thought that, without assignments, it would be encapsulating in a transformation what you already implemented in nemo/eg7/openmp_cpu_nowait_trans.py. Does this example also have the problem then?

@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

@LonelyCat124 I thought that, without assignments, it would be encapsulating in a transformation what you already implemented in nemo/eg7/openmp_cpu_nowait_trans.py. Does this example also have the problem then?

Yeah it would probably be similar, but I was testing it with some code (https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/physics_schemes/source/boundary_layer/bdy_impl3.F90) and we do a lot worse than the manual implementation. I think omp_block = ceiling(tdims%j_end/real(omp_get_num_threads())) is safe to have outside the parallel region, but since we can't put the following loop in an omp do anyway (maybe with force we can) it might not be important.

I'm happy to just go with this simple version initially and then improve it; it probably makes reviewing more straightforward. I'll write some tests for the current functionality and add some stuff to the docstrings.

@LonelyCat124
Collaborator Author

I'm slightly hesitant since some of the examples I was given by MO do need these (and in fact would create wrong code without them) if it were applied.

@LonelyCat124 I need some context about this issue, can you paste an example here?

@sergisiso An example of one we'd do wrong right now would be any code where we have x = omp_get_thread_num() as a child of a Routine, since I think this statement is only valid inside a parallel region, and we'd never consider putting the assignment in a parallel region.

@sergisiso
Collaborator

sergisiso commented Jan 12, 2026

These last 2 would also be an issue without this transformation if we do the typical for loop in psyir.walk(Loop): OMPLoopTrans. In both cases (new transformation or previous approach) they need to be found and added to the Parallel region independently. So I would still go ahead without the Assignments for this first PR.
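
For context, the "typical" pattern referred to above is roughly the following (a sketch only: `psyir` is assumed to be the tree being transformed, and loops that fail validation are simply skipped):

    from psyclone.psyir.nodes import Loop
    from psyclone.psyir.transformations import TransformationError
    from psyclone.transformations import OMPLoopTrans, OMPParallelTrans

    loop_trans = OMPLoopTrans()
    region_trans = OMPParallelTrans()
    for loop in psyir.walk(Loop):
        try:
            loop_trans.apply(loop)
            # Wrap the freshly created OMP DO directive in its own region.
            region_trans.apply(loop.parent.parent)
        except TransformationError:
            pass

An assignment such as x = omp_get_thread_num() sitting directly in the Routine is never picked up by this pattern either, which is the point being made here.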

@LonelyCat124 LonelyCat124 marked this pull request as ready for review January 12, 2026 14:15
@LonelyCat124
Collaborator Author

LonelyCat124 commented Jan 12, 2026

Blocked due to circular dependencies in examples - will update with blocking PR when created.
Blocked by #3280

@LonelyCat124 LonelyCat124 added the Blocked An issue/PR that is blocked by one or more issues/PRs. label Jan 12, 2026
@LonelyCat124
Collaborator Author

@sergisiso This is now passing CI, so ready for review I think.

@LonelyCat124 LonelyCat124 added ready for review and removed Blocked An issue/PR that is blocked by one or more issues/PRs. labels Jan 23, 2026
Collaborator

@sergisiso left a comment


Thanks for getting this started @LonelyCat124. I have some difficulty visualising how this will be applied to other use cases that have the same pattern, e.g. maximal OMPTarget, maximal profiling regions, maximal ACC kernels. It may be worth starting some of these to make sure the base class is capable of doing them.

TransformationError


class MaximalParallelRegionTrans(RegionTrans, metaclass=abc.ABCMeta):
Collaborator

Should this have the @transformation_documentation_wrapper and remove the validate params?

Collaborator Author

Yes, and I should update RegionTrans if it needs it (though I suspect it's done).

Collaborator Author

I need to think about whether the subclasses need it or not. I think they do (and therefore need to define a very basic apply method).

Collaborator Author

Done.

TransformationError


class MaximalParallelRegionTrans(RegionTrans, metaclass=abc.ABCMeta):
Collaborator

I am wondering if we should remove the "parallel" word from everywhere in this file? There is nothing specific to parallelism; it can be used for any outer_node that we want around a set of valid/invalid inner_nodes.

Collaborator

An example of this is in examples/nemo/scripts/utils.py::add_profiling, which creates the largest possible ProfileNode around sections, excluding Directives (maybe it would be good to add that, to check whether this abstract class is generic enough).

Collaborator Author

Yeah, that seems reasonable, I'll take a look at add_profiling and see if it can be modified to fit.

Collaborator Author

Renaming done; I still need to look at add_profiling.

Collaborator Author

@sergisiso I think add_profiling needs something like this (though maybe there's just a better, explicitly correct option?):

    import itertools

    from psyclone.psyir.nodes import (Directive, IfBlock, Loop, Node, Return,
                                      ScopingNode)
    from psyclone.psyir.transformations import MaximalRegionTrans

    def all_subclasses(cls):
        '''Return all direct and indirect subclasses of cls.'''
        return set(cls.__subclasses__()).union(
            [s for c in cls.__subclasses__() for s in all_subclasses(c)])

    # Use a set so repeated membership tests don't exhaust an iterator.
    disallowed_classes = set(itertools.chain(
        all_subclasses(Directive), all_subclasses(Return),
        all_subclasses(ScopingNode),
        [Directive, Return, ScopingNode, Loop, IfBlock]))

    class MaximalProfilingTrans(MaximalRegionTrans):
        '''Applies Profiling to the largest possible region.'''
        _allowed_nodes = [cls for cls in Node.__subclasses__() if
                          cls not in disallowed_classes]

What do you think about the allowed nodes - is this best, or shall I just list e.g. Assignment, Call, ...? I'm trying to think what other statements are needed (since IfBlocks/Loops/WhileLoops are handled anyway).

Collaborator Author

I settled on

    class MaximalProfilingTrans(MaximalRegionTrans):
        '''Applies Profiling to the largest possible region.'''
        _allowed_nodes = [Assignment, Call, CodeBlock]
        _transformation = ProfileTrans

for now, we can adjust this later if we find it to be insufficient.
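
Presumably (not shown in this thread) the class above would then be used from a transformation script in something like the following way; MaximalProfilingTrans is the sketch just defined, and the per-Routine application is an assumption:

    from psyclone.psyir.nodes import Routine

    def trans(psyir):
        '''Apply maximal profiling regions to every routine in the file.'''
        prof_trans = MaximalProfilingTrans()
        for routine in psyir.walk(Routine):
            prof_trans.apply(routine)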

for node in current_block:
    if node.walk(self._required_nodes,
                 stop_type=self._required_nodes):
        par_trans.apply(current_block)
Collaborator

What happens if this fails the validation? Maybe we want to catch it instead of propagating it, since we may have already partially updated the tree.

Collaborator

For example, if the region directive is an OMPTarget and it finds a call to a non-offloaded subroutine.

Collaborator Author

I've updated this now to try validating all blocks before applying the transformation; however, I'm not sure what should be done when the validation fails. I wonder about having a function to call which can either do nothing and just skip the apply for this block, raise a TransformationError, or do whatever the transformation decides?

Collaborator Author

I've gone with that implementation for now.

if isinstance(child, IfBlock):
    self.apply(child.if_body)
    if child.else_body:
        self.apply(child.else_body)
Collaborator

It doesn't seem right that the apply is recursive because each apply calls its own validate, and validate is also recursive. So the validations currently grow quadratically.

It may be better to move that to an independent recursive method and call that after the validation is complete.

Collaborator

sergisiso commented Jan 26, 2026

Alternatively, you could use something akin to the sets that you used in openmp_cpu_nowait_trans.py, so that validate populates the set and apply applies it, which is not recursive.

Collaborator Author

Yeah, when I was doing the recursive apply I forgot about the validation steps; that's pretty bad. I'll take a look at the nowait transformation and see if I can reuse (slash remember) the logic, as a non-recursive method is probably better if it doesn't involve significantly more complex code.

Collaborator Author

The openmp_cpu_nowait_trans.py on master still uses recursion for loops and ifblocks, so I think there's no escaping that recursion; I'll make this its own routine instead of part of apply.
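
A structural sketch of that restructuring (not the PR's actual code; the class and helper names here are stand-ins): apply validates once at the top level and then hands off to a private recursive worker, so validation no longer grows quadratically:

    from psyclone.psyir.nodes import IfBlock

    class _MaximalRegionSketch:
        '''Illustrative stand-in for the real transformation class.'''

        def apply(self, node):
            self.validate(node)        # single, top-level validation
            self._build_regions(node)  # recursion happens below, unvalidated

        def validate(self, node):
            '''Walk the whole subtree once, rejecting unsupported nodes.'''

        def _build_regions(self, node):
            '''Recursive worker: descend into IfBlock bodies, then group the
            remaining contiguous children into regions.'''
            for child in node.children:
                if isinstance(child, IfBlock):
                    self._build_regions(child.if_body)
                    if child.else_body:
                        self._build_regions(child.else_body)
            # ...group the block's children into maximal regions here...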

from psyclone.psyir.transformations.omp_parallel_trans import OMPParallelTrans


class OMPMaximalParallelRegionTrans(MaximalParallelRegionTrans):
Collaborator

I would prefer the "OMPParallel" words to be together, to make clear it adds OMPParallel directives.

I know I proposed the Maximal word, but do you think it may be confused to mean "the optimally largest" regions - which we don't do?

Is there any more fitting word we could use?

Collaborator Author

I can't think of a better word - I can add that "maximal" means "for what PSyclone can compute" in the docs, but otherwise it's hard.

Collaborator Author

I changed this to MaximalOMPParallel... for now, can continue to iterate if we have a better name.

_eliminate_final_parallel_barrier(parallel)
# Finally eliminate any barriers leftover outside of parallel
# regions, as these are now superfluous
self._eliminate_uncontained_barriers(node)
Collaborator Author

@sergisiso This should resolve #3306.

@LonelyCat124
Collaborator Author

@sergisiso This is ready for another look when you're back off leave.
