Skip to content

Conversation

@dindon-sournois
Copy link
Collaborator

No description provided.

Comment on lines +75 to +95
#ifdef _OPENACC
subroutine myalloc_ZDF_gpu()
allocate(zwd(jpk, dimen_jvzdf))
zwd = huge(zwd(1,1))
allocate(zws(jpk, dimen_jvzdf))
zws = huge(zws(1,1))
allocate(zwi(jpk, dimen_jvzdf))
zwi = huge(zwi(1,1))
allocate(zwx(jpk, dimen_jvzdf))
zwx = huge(zwx(1,1))
allocate(zwy(jpk, dimen_jvzdf))
zwy = huge(zwy(1,1))
allocate(zwz(jpk, dimen_jvzdf))
zwz = huge(zwz(1,1))
allocate(zwt(jpk, dimen_jvzdf))
zwt = huge(zwt(1,1))

!$acc enter data create(zwd,zwi,zwx,zws,zwz,zwy,zwt)
!$acc update device(zwd,zwi,zwx,zws,zwz,zwy,zwt)
END subroutine myalloc_ZDF_gpu
#endif
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create a new subroutine here that is called once in trczdf after dimen_jvzdf value is known

We could probably do the same for the CPU version to avoid duplicates, also the memory counter might needs to be adapted

Comment on lines -177 to 179

!$acc enter data create( e1t(1:jpj,1:jpi), e2t(1:jpj,1:jpi), e3t(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e1u(1:jpj,1:jpi), e2u(1:jpj,1:jpi), e3u(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e1v(1:jpj,1:jpi), e2v(1:jpj,1:jpi), e3v(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( e3w(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
!$acc enter data create( un(1:jpk,1:jpj,1:jpi), vn(1:jpk,1:jpj,1:jpi), wn(1:jpk,1:jpj,1:jpi) ) if(use_gpu)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not a good idea to declare these arrays here:

  • they are allocated and deallocated later, which is a waste of time
  • GPU allocation should be moved close to CPU allocate as the port progress

Comment on lines +136 to +137
! NOTE: kernel is too big, should be split
!$acc parallel loop gang vector default(present) async vector_length(32)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to think about clever ways to generate this kernel as it seems quite big, best performance on A100 was obtained with a vector length of 32 which isn't very high

Aij = e1t(jj,ji) * e2t(jj,ji)

#ifdef _OPENACC
ntx=jv
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for GPU version we parallelize on dimen_jvzdf

@dindon-sournois dindon-sournois marked this pull request as ready for review April 24, 2024 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants