Conversation
You need to add yourself to the contributors file (alphabetical order). You say that you tested this change; how did you test it? As I recall you were using an input mesh file that had inter-grid maps, so did it not fail in the partitioning? It looks like you just changed the number of MPI ranks used on existing tasks. The KGOs have changed, though I am suspicious of some of the checksums: there are a lot of differences, some only in the last few digits, though some much larger. I think you need to provide some plots comparing 1 CPU vs multiple CPUs to show that the fields are not significantly impacted.
Added myself to the contributor list (I added myself in a previous PR, but it has not been merged yet). I also fixed a detail in the new error message added to the code. I modified the existing lfric2lfric rose stem tests so that three of the jobs run with multiple CPUs: ral_seuk-C32_lam_MG, clim_gal9-C24_C12, and oasis_clim_gal9-C24_C12. That is the reason for the KGO changes. I am afraid the results change slightly with the partitioning.
    help=In the lfric2lfric program, this variable indicates the
        =mesh for which the described partitioning is taking place.
        =
        =The only valid values of this variable in lfric2lfric
        =are 'source' to indicate the source grid, and 'destination'
        =to indicate the destination grid.
Suggested change:

    help=The designation of the mesh to which this
        =partitioning profile is to be applied.
        =
        =Recognised designations are:
        = * 'source'
        = * 'destination'
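For illustration only, the two keyed partitioning profiles might then look something like the sketch below. The namelist name and the mesh_target member come from the metadata and error messages in this PR; all other members are omitted, and the exact instance syntax is an assumption rather than the actual configuration.

```fortran
! Hypothetical sketch: two instances of the partitioning namelist,
! distinguished by the mesh_target key member discussed in this PR.
! All other partitioning members are omitted.
&partitioning
  mesh_target = 'source'
/
&partitioning
  mesh_target = 'destination'
/
```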
    'Partitioning parameters for the source mesh were not found, '// &
    'please specify a partitioning namelist with mesh_target=source.'
Suggested change:

    'Source mesh partitioning namelist (partitioning:source) not found.'
If the code cannot find the required partitioning profile namelist, then that is all it should state: the code does not know why the namelist cannot be found, so it cannot suggest a solution. For example, if this error message had been given for Mark's problem, he would have checked the namelist file, which was not at fault. The message would also have to be updated if a different key instance member were used.
The Rose metadata is enough for the user to be able to check their configuration settings.
    if (.not. configuration%namelist_exists('partitioning', 'destination')) then
      write( log_scratch_space, '(A)' ) &
        'Partitioning parameters for the source mesh were not found, '// &
Suggested change:

    'Destination mesh partitioning namelist (partitioning:destination) not found.'
As before, the code does not know why the namelist is not found. Also, the error message still refers to the source mesh.
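To make the suggested wording concrete, here is a minimal sketch of how both checks might read after the change. The configuration%namelist_exists calls and log_scratch_space are taken from the diff above; the log_event call and LOG_LEVEL_ERROR constant are assumed from LFRic's logging module, so treat this as an illustration rather than a verified patch.

```fortran
! Sketch only: both partitioning-profile checks with the shorter messages.
! log_event / LOG_LEVEL_ERROR are assumed from LFRic's log_mod.
if (.not. configuration%namelist_exists('partitioning', 'source')) then
  write( log_scratch_space, '(A)' ) &
      'Source mesh partitioning namelist (partitioning:source) not found.'
  call log_event( log_scratch_space, LOG_LEVEL_ERROR )
end if

if (.not. configuration%namelist_exists('partitioning', 'destination')) then
  write( log_scratch_space, '(A)' ) &
      'Destination mesh partitioning namelist (partitioning:destination) not found.'
  call log_event( log_scratch_space, LOG_LEVEL_ERROR )
end if
```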
I think you should edit the contributors list anyway, as the missing entry will block this change from going in if this PR is ready to commit first. Please refer to my comments in the SciTech review. Please provide some evidence that there are no significant changes in the output (plots).
I wouldn't be concerned about speed at this juncture; my comments were based on the KGO differences and the attribution of them. You haven't shown what the 1 CPU OASIS regridded method produces, or is that not possible? If the differences are in the lons/lats, please provide example magnitudes of the differences.
We found in the past that the map and the OASIS methods gave the same results with one CPU, and this is still true, although I did not include the plot. We find the same here with 4 CPUs. I obtain lon/lat differences if I use ncdiff to generate the differences, but when I look at the actual longitude and latitude values in the lfric2lfric output using 1 or multiple CPUs, they are the same. I do not know why ncdiff behaves like this.
ncdiff appears to check differences at a lower tolerance than just dumping the output. Again, please post the differences in the lat/lon values (a representative sample).
There are no differences in lat/lon values when comparing the values directly. There are some differences when comparing using ncdiff. I am confident that the lat/lon values are not the problem. I suspect the problem is how LFRic reads different types of data; I am preparing a test to check this theory.
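For what it is worth, both observations can hold at the same time: values that print identically at a typical dump precision can still differ to a comparison tool working at machine precision. The following minimal Fortran sketch uses a made-up value (not taken from the lfric2lfric output) purely to illustrate this:

```fortran
program print_vs_compare
  ! Illustration only: two doubles differing by one unit in the last place
  ! print identically at a 7-decimal dump precision, yet a low-tolerance
  ! comparison still reports a non-zero difference.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: a, b
  a = -0.7626276_dp      ! made-up value of the same magnitude as those in the thread
  b = a + spacing(a)     ! differs from a only in the last bit
  print '(a,f12.7)',  'a printed to 7 d.p.: ', a
  print '(a,f12.7)',  'b printed to 7 d.p.: ', b
  print '(a,es10.2)', 'b - a              : ', b - a
end program print_vs_compare
```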
As requested, please post the differences in the values you have from ncdiff.
Taken directly from ncdiff:
Mesh2d_face_x = -0.7626276, -0.6977054, -0.6327863, -0.5678701, -0.502957,
Mesh2d_face_y = 51.98213, 51.98115, 51.98013, 51.97908, 51.97799, 51.97687,
No, that's not it. Have you used nccmp? I want to see the differences in the lat/lon values in the output, and similarly in the global mesh files you are using as input.
So far I have only used ncdiff. The differences reported by nccmp are, for the regional model:
Variable Group Count Sum AbsSum Min Max Range Mean StdDev
and for the global model:
Variable Group Count Sum AbsSum Min Max Range Mean StdDev
Sorry Juan, |
The output for 1 and multiple CPUs is on VDI:
- global to global: 1 CPU, 6 CPUs
- regional to regional: 1 CPU, 4 CPUs
Please post the links to the files; that is what I asked for.
I do not understand very well what you mean by 'links to the files'; these are files resulting from a cylc run and they are on VDI. Here is the link to the workflows: https://cylchub/services/cylc-review/taskjobs/juan.m.castillo/?suite=lfric2lfric_main/run1

(@mo-rickywong) I meant post the location of the output files, like you did for the global to global and regional to regional runs, but for the oasis_clim_gal9-C24_C12 runs.
mo-rickywong left a comment
Doesn't pass SciTech in its current state; please see the main PR for comments.
| "src_name": "", | ||
| "src_type": "", | ||
| "mpi_parts": 4, | ||
| }) %} |
| "src_name": "", | ||
| "src_type": "", | ||
| "mpi_parts": 6, | ||
| }) %} |
| "src_name": "C12", | ||
| "src_type": "global", | ||
| "mpi_parts": 6, | ||
| }) %} |
After some investigation I have reached some conclusions and found a solution. So far I have locally implemented a fix for the OASIS method; I will later concentrate on a fix for the map method, and push all the code once it is ready to be reviewed. This issue will require an associated PR in lfric_core.

Until now, every time I made changes to lfric2lfric I only checked three variables: theta, rho, and zh. When in this PR I looked at other variables, I realised there was something wrong. For variables that do not have an axis in the XIOS xml file, the vertical levels for a mesh point are stored consecutively in memory, and this is what lfric2lfric assumed. Variables with an axis definition instead store layer data consecutively in memory, so lfric2lfric does not process these variables properly: it mixes data on different vertical levels when doing the horizontal interpolation (a minimal sketch of the two layouts is included after the plots below).

To illustrate this point, I copy below the XIOS definitions for the variables theta, rho, and zh (used previously to assess lfric2lfric), and for wvar and canopy_height (the two variables that break lfric2lfric, and which will be used to test this PR). Please notice the grid_ref definition for "full_level_face_grid".

What lfric2lfric was missing was kernels that take this difference into account. I used lfric_xios subroutines to interrogate whether a given variable has an axis definition, and if so I modified the existing kernels to point to the right locations in memory in order to extract a field layer.

Below I show regional to regional results (using code still in a local branch) for the OASIS method, plotting, for each of theta, wvar, and canopy_height, in the same order: the original field to be interpolated; the interpolated field using the weights generated by the lfric2lfric apps and applied with a python script (this should be considered the 'right' solution); the OASIS-interpolated field using 1 CPU and the lfric_apps trunk; and the OASIS-interpolated field using 4 CPUs and the proposed fix.


























PR Summary
Sci/Tech Reviewer: @mo-rickywong
Code Reviewer: @TeranIvy
Change the type of the partitioning key instance member for lfric2lfric, so that lfric2lfric can run with more than one CPU
Tagging Iva and Steve so they are aware of the suspect output from lfric2lfric on multiple CPUs
@TeranIvy , @stevemullerworth
Code Quality Checklist
Testing
No other tests were performed; I just modified the existing tests so that the new functionality could be tested. Therefore, there were KGO changes, which have been addressed.
trac.log
Test Suite Results - lfric_apps - test_lfric2lfric_multi_CPU/run1
Suite Information
Task Information
✅ succeeded tasks - 1456
Security Considerations
Performance Impact
AI Assistance and Attribution
Documentation
PSyclone Approval
Sci/Tech Review
Please detail how you tested the functionality, i.e. what modifications did you make to test it? It doesn't seem like the rose-stem test should pass, given the partitioning checks, with the presence of inter-grid maps in the input mesh file.
The KGO changes are numerous and not all insignificant; please provide plots/evidence that output from runs with multiple CPUs is not significantly different from runs with a single CPU.
SciTech Review: Fail. Suspect results from runs with multiple CPUs. While this change may enable lfric2lfric to run with multiple CPUs, the output may be known but not necessarily good.
Code Review