(Closes #3157) Initial implementation of maximal parallel region trans.#3205
(Closes #3157) Initial implementation of maximal parallel region trans.#3205LonelyCat124 wants to merge 21 commits intomasterfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #3205 +/- ##
========================================
Coverage 99.95% 99.95%
========================================
Files 380 382 +2
Lines 53949 54050 +101
========================================
+ Hits 53927 54028 +101
Misses 22 22 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@MetBenjaminWent Chris said you had some cases worth trying this with as functionality tests? |
|
The examples I've been give by to look at are: |
|
Made a bit more progress with this now - there was definitely some missing logic for bdy_impl3 to work. One thing that is apparent that applying things in this way means we result in barriers that live outside parallel regions, which should be purged by @sergisiso @arporter Am I ok to make that change as part of this PR? |
…y on applying inside if/loop nodes
|
I fixed the previous issue, and this has raised some other challenges with the cases I receieved from MO.
I assume we just need to use |
|
Yeah, I'd just been looking at the PR earlier today after I looked at this - I guess its still a while until we'd have it ready, but if it could handle cases like this it would definitely help (assuming the rest is otherwise ok). |
|
I copied this chunk of code into a minimally compiling module: module example_module
implicit none
type :: Dims
integer :: j_start, j_end, i_start, i_end
end type
contains
subroutine sub(tdims, r_rho_levels, dtrdz_charney_grid, fqw, dqw, dqw_nt, gamma2, blm1, omp_block)
type(Dims), intent(inout) :: tdims
integer, intent(inout) :: r_rho_levels(:,:,:)
integer, intent(inout) :: dtrdz_charney_grid(:,:,:)
integer, intent(inout) :: fqw(:,:,:)
integer, intent(inout) :: dqw(:,:,:)
integer, intent(inout) :: dqw_nt(:,:,:)
integer, intent(inout) :: gamma2(:,:)
integer, intent(inout) :: blm1, omp_block
integer :: jj, k, r_sq, rr_sq, l, i, j
do jj = tdims%j_start, tdims%j_end, omp_block
do k = blm1, 2, -1
l = 0
do j = jj, min(jj+omp_block-1, tdims%j_end)
do i = tdims%i_start, tdims%i_end
r_sq = r_rho_levels(i,j,k)*r_rho_levels(i,j,k)
rr_sq = r_rho_levels(i,j,k+1)*r_rho_levels(i,j,k+1)
dqw(i,j,k) = (-dtrdz_charney_grid(i,j,k) * (rr_sq * fqw(i,j,k + 1) - r_sq * fqw(i,j,k)) + dqw_nt(i,j,k)) * gamma2(i,j)
end do
end do
end do
end do
end subroutine
end moduleThe analysis gives the following output: Interestingly, it does take a few seconds to prove that the |
|
I used There are 2 things left to potentially look at.
|
|
@LonelyCat124 Could we finish this PR without the Assignments as there are a few cases that are just contiguous loops that we could already do? And then do the Assignments in a follow up as they seem complicated and worth disucssing separately. |
@sergisiso I'm slightly hesitant since some of the examples I was given by MO do need these (and in fact create wrong code without it) if it were to be applied. I think at least I'd need/want to add a check to see if any I had started implemeting some logic now for Assignment but the next accesses check is somewhat complicated. Edit Though: I think we're also limited without Matthew's #3213 anyway so maybe the assignment checks aren't so urgent. Edit 2: This was my current plan: But the next access check needs to happen in apply when applying the parallel transformation and becomes a bit of a mess I think (since we may end up having to split the parallel regions again at this point). |
@LonelyCat124 I need some context about this issue, can you paste and example here |
|
@LonelyCat124 I thought without assignments it would be encapsulating in a transformation what your already implemented in |
Yeah it would probably be similar, but I was testing it with some code (https://code.metoffice.gov.uk/trac/lfric_apps/browser/main/trunk/science/physics_schemes/source/boundary_layer/bdy_impl3.F90) and we do a lot worse than the manual implementation. I think I'm happy to just go with this simple version initially though and then improve it, probably makes reviewing more straight forward. I'll write some tests for the current functionality and add some stuff to the docstrings. |
@sergisiso An example of one we'd do wrong right now would be any code where have |
|
These last 2 would also be an issue without this transformation if the do the typical |
|
Blocked due to circular dependencies in examples - will update with blocking PR when created. |
|
@sergisiso This is now passing CI, so ready for review I think. |
sergisiso
left a comment
There was a problem hiding this comment.
Thanks for getting this started @LonelyCat124 I have some difficulties visualising how this will be applied to other use cases that have the same pattern, e.g maximal OMPTrarget, maximal profiling regions, maximal acc kernel. It may be worth starting some of these to make sure the base class is capable of doing them.
| TransformationError | ||
|
|
||
|
|
||
| class MaximalParallelRegionTrans(RegionTrans, metaclass=abc.ABCMeta): |
There was a problem hiding this comment.
Should this have the @transformation_documentation_wrapper and remove the validate params?
There was a problem hiding this comment.
Yes, and I should update RegionTrans if it needs it (though is suspect its done).
There was a problem hiding this comment.
I need to think if the subclasses need it or not. I think they do (and to therefore to define a very basic apply method).
| TransformationError | ||
|
|
||
|
|
||
| class MaximalParallelRegionTrans(RegionTrans, metaclass=abc.ABCMeta): |
There was a problem hiding this comment.
I am wondering if we should remove the "parallel" word from everywhere in this file? There is nothing specific to parallelism, it can be used for any outer_node that we want around a set of valid/invalid inner_nodes.
There was a problem hiding this comment.
An example of this is in examples/nemo/scripts/utils.py::add_profiling which creates the largest posible ProfileNode to sections excluding Directives (maybe it would be good to add that to should is abstract class is generic enough)
There was a problem hiding this comment.
Yeah, that seems reasonable, I'll take a look at add_profiling and see if it can be modified to fit.
There was a problem hiding this comment.
Renaming done, need to look at add profiling still.
There was a problem hiding this comment.
@sergisiso I think add_profiling needs something like this (though maybe there's just a better explicit correct option?:
from psyclone.psyir.transformations import MaximalRegionTrans
import itertools
def all_subclasses(cls):
return set(cls.__subclasses__()).union(
[s for c in cls.__subclasses__() for s in all_subclasses(c)])
disallowed_classes = itertools.chain(all_subclasses(Directive),
all_subclasses(Return),
all_subclasses(ScopingNode),
[Directive, Return, ScopingNode,
Loop, IfBlock])
class MaximalProfilingTrans(MaximalRegionTrans):
'''Applies Profiling to the largest possible region.'''
_allowed_nodes = [cls for cls in Node.__subclasses__() if
cls not in disallowed_classes]
What do you think for allowed nodes - is this best or shall I just do e.g. Assignment, Call, ??? I'm trying to think what other statements are needed (since ifblock/loops/while_loops are handled anyway).
There was a problem hiding this comment.
I settled on
class MaximalProfilingTrans(MaximalRegionTrans):
'''Applies Profiling to the largest possible region.'''
_allowed_nodes = [Assignment, Call, CodeBlock]
_transformation = ProfileTrans
for now, we can adjust this later if we find it to be insufficient.
| for node in current_block: | ||
| if node.walk(self._required_nodes, | ||
| stop_type=self._required_nodes): | ||
| par_trans.apply(current_block) |
There was a problem hiding this comment.
What happens if this fails the validation? Maybe we want to catch it instead of propagate since we may already partially updated the tree.
There was a problem hiding this comment.
For example if the region directive is a OMPTarget and it finds a call to a non-offloaded subroutine.
There was a problem hiding this comment.
I've updated this now to try validating all blocks before applying the transformation, however I'm not sure what should be done when the validation fails? I wonder about having a function to call (which can either do nothing and just skip the apply method for this block) or raise a TransformationError or whatever the transformation decides?
There was a problem hiding this comment.
I've gone with that implemetation for now.
| if isinstance(child, IfBlock): | ||
| self.apply(child.if_body) | ||
| if child.else_body: | ||
| self.apply(child.else_body) |
There was a problem hiding this comment.
It doesn't seem right that the apply is recursive because each apply calls its own validate, and validate is also recursive. So the validations currently grow quadratically.
It maybe better to move that to an independent recursive method and call that after the validation is complete.
There was a problem hiding this comment.
Alternatively, you could use something akind to the sets that you used in openmp_cpu_nowait_trans.py and the validate populates the set and the apply applies it which is not recursive
There was a problem hiding this comment.
Yeah, when I was doing the recursive apply Iforgot about the validation steps, thats pretty bad. I'll take a look at the nowait transformation and see if I can reuse (slash remember) the logic as a non recursive method is probably better if it doesn't involve significantly more complex code.
There was a problem hiding this comment.
The openmp_cpu_nowait_trans.py on master still uses recursion for loops and ifblocks, so I think there's no escaping that recursion, so I'll make this its own routine instead of part of apply.
| from psyclone.psyir.transformations.omp_parallel_trans import OMPParallelTrans | ||
|
|
||
|
|
||
| class OMPMaximalParallelRegionTrans(MaximalParallelRegionTrans): |
There was a problem hiding this comment.
I would prefer the "OMPParallel" words to be together, to make clear it adds OMPParallel directives.
I know I proposed the Maximal word, but do you think it may be confused to mean "the optimally largest" regions? - that we don't do.
Is there any more fitting word we could use?
There was a problem hiding this comment.
I can't think of a better work - I can add that the maximal is "for what PSyclone can compute" in the docs but otherwise its hard.
There was a problem hiding this comment.
I changed this to MaximalOMPParallel... for now, can continue to iterate if we have a better name.
| _eliminate_final_parallel_barrier(parallel) | ||
| # Finally eliminate any barriers leftover outside of parallel | ||
| # regions, as these are now superfluous | ||
| self._eliminate_uncontained_barriers(node) |
|
@sergisiso This is ready for another look when you're back off leave. |
No description provided.