
Z_R, L_R, T_RESCALE_POWER answers don't match for my Bering domain #54

Closed
kshedstrom opened this issue Dec 30, 2021 · 21 comments · Fixed by #104 or #117
Labels
bug Something isn't working

Comments

@kshedstrom

As the title says, answers don't match for my Bering domain when Z_RESCALE_POWER, L_RESCALE_POWER, or T_RESCALE_POWER is nonzero; R and Q are not a problem. Debugging tips welcome.

@Hallberg-NOAA
Member

There are well established procedures to debug dimensional consistency issues. Do a pair of short runs with DEBUG=True, one with all of the rescaling disabled by setting [ZLTRQH]_RESCALE_POWER = 0, and a second one with one or more of these rescaling factors set to another integer. These runs will write out a series of unscaled checksums with identifying messages (a few hundred lines per timestep) to stdout. The first place where the output differs is usually a pretty good indication of where in the code flow the inconsistency lies, although it is sometimes necessary to go back and add more of the unscaled checksum calls to find the source of the problem.
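For example, a minimal sketch of the MOM_input settings for the two runs (only the rescale powers need to differ; everything else stays as in your baseline):

DEBUG = True
! Run 1: all rescaling disabled
Z_RESCALE_POWER = 0
L_RESCALE_POWER = 0
T_RESCALE_POWER = 0
H_RESCALE_POWER = 0
R_RESCALE_POWER = 0
Q_RESCALE_POWER = 0

! Run 2: identical, except one or more powers set to another integer, e.g.
Z_RESCALE_POWER = 2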

Sometimes the first difference turns out to be a red herring where the problem is with the checksum call; in this case you will have the isolated line that differs followed by a bunch of lines that match. Once the lines start to differ repeatedly, you can be pretty sure that you are past the first problematic point.

After finding the source of the problem, it is up to you whether to suggest that we add any new checksum calls that you had to add to detect the problem, although of course they should be all wrapped in a logical test for whether DEBUG=True.

@kshedstrom
Author

kshedstrom commented Mar 17, 2022

I didn't realize the checksums would be unscaled. The first differing checksum of note happens after the first Coriolis computation. I tracked it down to update_OBC_segment_data in MOM_open_boundary.F90. With Z_SCALE of 1 vs. 0, the vertical remapping happens with matching values of h_stack, but not with matching values of segment%field(m)%dz_src. The wrong levels from buffer_src are being used to populate buffer_dst. Differences then show up in the Coriolis term when segment%tangential_vel is used at the open boundaries.

Should h_stack match with a nonzero Z_SCALE? Or how do I get dz_src to match? @Hallberg-NOAA @MJHarrison-GFDL

@kshedstrom
Author

As for T_SCALE of 1 vs. 0, there is trouble in the uv_sponge code. CS%var_u%p differs by a factor of two where the sponge is zero, but not by exactly two in the sponge area. The nudging timescales match and probably shouldn't. @sdbachman

@kshedstrom
Author

Wait, that was the L_SCALE case. Probably the same story for T_SCALE for similar reasons.

@kshedstrom
Author

I don't see any scale factors in set_up_ALE_sponge_vel_field_varying. It is similar to set_up_ALE_sponge_field_varying for the tracers - how and where would the tracers be scaled?

@Hallberg-NOAA
Member

It looks to me like the problem (or at least part of it) is the 3rd argument in the calls to horiz_interp_and_extrap_tracer() on lines 968 and 1017 of MOM_ALE_sponge.F90 for u and v. These calls are currently using conversion factors of 1.0, but I think that these should be changed to US%m_s_to_L_T. This particular bug would not explain why there are differences when the value of Z_RESCALE_POWER changes, but at least it is a start.
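A hedged sketch of what that change might look like; the argument lists are abbreviated, and the names fid_u, tmp_u and mask_u are placeholders rather than the actual variables in MOM_ALE_sponge.F90:

! Before: a conversion factor of 1.0 leaves the velocities in m s-1
!   call horiz_interp_and_extrap_tracer(fid_u, Time, 1.0, G, tmp_u, mask_u, ...)
! After: rescale to the model's internal [L T-1] units as the data is read
!   call horiz_interp_and_extrap_tracer(fid_u, Time, US%m_s_to_L_T, G, tmp_u, mask_u, ...)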

@kshedstrom
Author

I have a fix for the sponge trouble with L_SCALE and T_SCALE. As for Z_SCALE, should h change with non-zero Z_SCALE? It doesn't. The trouble comes about in adjustSegmentEtaToFitBathymetry in MOM_open_boundary.F90, which assumes vertical distances are scaled by Z_SCALE. The segment%Htot field comes from h and is therefore not scaled.

@Hallberg-NOAA
Member

Thanks for all of your work delving into this, @kshedstrom!

The thickness variables scale with H_SCALE, not the Z_SCALE scaling used by height variables. The reason for that is that in Boussinesq mode, the thickness variables have completely different units (m when unscaled) than in non-Boussinesq mode (kg m-2 when unscaled). To facilitate the conversions between the two vertical thickness and height units, use the factors GV%H_to_Z, GV%H_to_m, GV%H_to_kg_m2, GV%H_to_MKS or the reciprocal variables like GV%Z_to_H, etc.
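A minimal illustration of using these factors (the variable names here are placeholders, not from the MOM6 source):

thk_H = dz_Z * GV%Z_to_H      ! height [Z ~> m] -> thickness [H ~> m or kg m-2]
hgt_Z = h(i,j,k) * GV%H_to_Z  ! thickness [H] -> height [Z]
hgt_m = hgt_Z * US%Z_to_m     ! height [Z] -> unscaled [m]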

A quick examination of adjustSegmentEtaToFitBathymetry() suggests that it has the right conversion factors to first set segment%dz_src in units of Z (assuming it comes in with units of Z), and then to convert it to units of H:

! Now convert thicknesses to units of H.
do k=1,nz
  segment%field(fld)%dz_src(i,j,k) = segment%field(fld)%dz_src(i,j,k)*GV%Z_to_H
enddo

It seems like a really bad idea that segment%dz_src uses different units at different points in the code. However, I am not quite sure how to address this. It seems like the time_interp_external() capability does not offer the ability to rescale variables, but this is one part of the code that is still very poorly documented, with interfaces inherited directly from the FMS code without any documentation of their arguments. When I was going through the MOM6 code to ensure consistent dimensional rescaling, this appears to have been omitted because (1) I did not have a test actively exercising this code, and (2) the interactions between the time_interp_external() facilities and the buffers to interpolate onto the segments introduced enough complexity, array-reuse and indirection that I was not brave enough to try to guess at the units of the various re-used variables or the right solution within the code. In such cases (and there were very few), I confined my edits to comments (such as documenting the shifting dimensions of segment%dz_src), so as not to unwittingly break something that was working.

I think that the preferred outcome is that we have a single well-defined unit for segment%dz_src wherever it is used, and that the necessary rescaling occur as soon as possible when it is being read - either inside of time_interp_external() or immediately afterward. The fact that you have a case that is detecting these problems is exactly what we need to know that we have gotten them right.

@kshedstrom
Author

I added a scale factor right after the call to time_interp_external. It now differs in MOM_barotropic, BT_force_[uv]. The term forces%taux differs by the scale factor, but mass_to_Z is the same in both runs. This is with the Z_RESCALE set to non-zero, so that seems odd.
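For reference, the kind of change described here might look roughly like the following sketch (the field and buffer names are placeholders, and the US%m_to_Z factor assumes the source file holds layer thicknesses in m):

call time_interp_external(segment%field(m)%fid_dz, Time, tmp_buffer)
! Convert the just-read source thicknesses to rescaled height units [Z] immediately,
! so that segment%field(m)%dz_src has a single well-defined unit downstream.
tmp_buffer(:,:,:) = US%m_to_Z * tmp_buffer(:,:,:)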

Hallberg-NOAA added the bug label Apr 4, 2022
@Hallberg-NOAA
Member

It looks to me like the problem might be that when BT_NONLIN_STRESS = True (i.e., CS%nonlin_stress = .true.) in MOM_barotropic.F90, CS%IDatu and CS%IDatv are being set with units of [H-1], whereas they are otherwise being set with units of [Z-1]. Multiplying the expression for CS%IDatu by a factor of GV%Z_to_H on lines 1248, 1251 and 1254, and similarly for the expressions for CS%IDatv on lines 1274, 1277 and 1280, might be one way to solve this problem. This should not change answers in Boussinesq mode, but in non-Boussinesq mode this will change answers and correct a real bug.
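A schematic sketch of the suggested fix (the actual expressions on those lines are more involved, and in practice the factor would be folded into each assignment rather than applied afterward):

! Multiply the [Z-1] expressions for the inverse column depths by GV%Z_to_H so that
! CS%IDatu and CS%IDatv are consistently in [H-1], matching the BT_NONLIN_STRESS case.
CS%IDatu(I,j) = GV%Z_to_H * CS%IDatu(I,j)
CS%IDatv(i,J) = GV%Z_to_H * CS%IDatv(i,J)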

@kshedstrom
Author

Thanks, this did fix the barotropic scaling issue. The next trouble is in MOM_energetic_PBL. For the point i,j = 5,5 (the first one), the default case does not converge in 20 iterations, while the scaled case converges in 11 iterations inside ePBL_column. I need to investigate more.

@kshedstrom
Author

On the third iteration, it returns differing values of MLD_output. More digging needed...

@Hallberg-NOAA
Member

Do these values differ dramatically, or only at something resembling roundoff?

I find that using very large scaling factors (like 2^140 or 2^-140) can be helpful for making the problems obvious.

The other thing that I have found is that setting an explicit small underflow value (like VEL_UNDERFLOW) can sometimes be important for getting answers to reproduce. It is possible that we have missed one of these cases of underflow in the ePBL TKE budget, and that your Bering domain test case is tickling this issue when other cases do not.

However, the fact that the iteration does not converge at all is suggestive that this is another real bug and not a minor oversight.

@kshedstrom
Author

Sorry, the values differ the first time through. Not by much, but not at underflow levels either. Then MixLen_shape is different the second time through and things are rather more different. I'm using a scaling of 2^2.

@kshedstrom
Author

At line 1079 of MOM_energetic_PBL.F90, htot is not scaled but MLD_guess is. The result is differing values of Surface_Scale, and therefore vstar is off by a factor that isn't a power of 2.
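A hedged illustration of the kind of mismatch being described (the actual expression in MOM_energetic_PBL.F90 and its constants may differ): if htot is in thickness units [H] and MLD_guess is in height units [Z], the ratio needs a conversion factor along these lines:

! Convert htot [H] to [Z] before comparing it with MLD_guess [Z]; constants are illustrative.
Surface_Scale = max(0.05, 1.0 - (GV%H_to_Z*htot) / MLD_guess)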

@kshedstrom
Author

I fixed that one, now there's an H_RESCALE issue with pre-btstep accel. The fun never ends!

@kshedstrom
Author

It's in the Coriolis term, truly in the last few bits, such that the TotalView statistics match except for the checksum. Is this where you'd try all the Coriolis options?

@Hallberg-NOAA
Member

Do the differences remain in the last few bits if you try very large and very small rescaling factors? Trying more extreme rescaling may be a way to help smoke out the problem.

And yes, I think that trying the various Coriolis options would be a good way to figure out which ones are problematic. Which Coriolis scheme parameter settings are you using in your case?
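For reference, the relevant runtime parameters are roughly along these lines (a sketch only; check the MOM_parameter_doc files in your run directory for the exact names, options and defaults in your version):

CORIOLIS_SCHEME = "SADOURNY75_ENERGY"  ! other options include ARAKAWA_HSU90, SADOURNY75_ENSTRO, ARAKAWA_LAMB81
BOUND_CORIOLIS = False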

@kshedstrom
Author

It's that cursed OBC code again, specifically in remapping_core_h. I don't think it matters which Coriolis scheme we are using. TotalView really does not want to step into remapping_core_h, but buffer_dst comes out matching in the top four levels (all within the top ten meters of the source); the levels don't match for k between 5 and 10, and then they match again below that.

@kshedstrom
Author

Do h_neglect and h_neglect_edge need to be scaled with H_SCALE?

@kshedstrom
Author

We discussed it and agreed to pass GV%H_subroundoff for the two optional arguments to get the correct scalings. Most of the calls to remapping_core_h are straightforward to correct. However, MOM_lateral_boundary_diffusion.F90 has unit tests that call a routine that calls remapping_core_h, and those tests don't have access to GV.
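A sketch of the agreed change at a typical call site (the positional argument names are abbreviated placeholders; only the two optional arguments are the point):

call remapping_core_h(remap_CS, nz_src, h_src, u_src, nz_dst, h_dst, u_dst, &
                      h_neglect=GV%H_subroundoff, h_neglect_edge=GV%H_subroundoff)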

Hallberg-NOAA pushed a commit that referenced this issue Apr 22, 2022