
Bug/sframbat/memleak #3538 (Draft)

wants to merge 10 commits into develop
Conversation

@CusiniM (Collaborator) commented Feb 6, 2025

No description provided.

@rrsettgast (Member) commented:
@wrtobin @corbett5

@sframba found that this test resulted in a memory leak. It was initially assumed that the leak was in synchronizeFields. At some point it appeared that the issue was in the call:

fieldsToBeSync.addFields( FieldLocation::Node, { fields::acousticfields::Pressure_np1::key() } );

as the increase happened between usage1 and usage2. Then it started moving around a bit and sometimes appeared to be in the synchronizeFields call. I played around a bunch and it was kind of hit or miss, but when I change FieldIdentifiers::m_fields from a std::map< string, array1d< string > > to a std::map< string, std::vector< string > >, the leak appears to go away. This reminds me of the PR I abandoned a while back (#2968). I can't tell you why the leak occurs, and it doesn't occur at every call, but it is certainly strange. I propose that it may be a good idea to remove all array1d< string > from the code, as I intended to do back then. Any thoughts?
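For reference, a minimal sketch of the proposed type change (the surrounding class is simplified here and is not the actual GEOS declaration; only the mapped type of m_fields changes):

#include <map>
#include <string>
#include <vector>

// Sketch only: FieldIdentifiers with its mapped type switched from the
// LvArray-backed array1d< string > to std::vector< string >.
class FieldIdentifiers
{
private:
  // before (as described above): std::map< string, array1d< string > > m_fields;
  std::map< std::string, std::vector< std::string > > m_fields;
};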

@sframba can you confirm the leak had disappeared on your system?

@liyangrock (Contributor) commented:

Hello,

I found what may be a memory leak in the explicit solid mechanics solver and wanted to share my observations in case they help with diagnosing the issue.

I used an input file from the develop branch and modified the mesh resolution in each direction from 4 to 40, as well as forceDt from 1.0 to 1.0e-7, to make the memory usage behavior more apparent. Although this input file does not perform any meaningful computations, I noticed in Ubuntu's System Monitor that memory usage of geos steadily increases as the number of solver steps grows.

I suspect this memory leak might affect all solvers, but it has likely gone unnoticed in the implicit solver since the number of time increments is typically small. However, in the explicit solver, where a large number of time increments is required, the memory leak becomes more significant.

I used the following command:

export OMP_NUM_THREADS=10
geos -i multiBodyMeshGen_smoke.xml

I hope this information is helpful. Please let me know if any additional details are needed!

Thanks!

multiBodyMeshGen_smoke.xml.txt

@sframba (Contributor) commented Feb 11, 2025

@sframba can you confirm the leak had disappeared on your system?

Hi @rrsettgast, it seems that the leak between usage1 and usage2 has disappeared, but at least on Pangea3 (NVIDIA V100s, gcc 11.4.0, CUDA 11.8.0, Open MPI 4.1.6) there still seems to be a leak between usage4 and usage5.
Here's the beginning of my log file on 2 MPI ranks:

                             usage0    usage1    usage2    usage3    usage4    usage5    usage6
                             usage0    usage1    usage2    usage3    usage4    usage5    usage6
Cycle      0 Memory usage: 1627456  1627456  1627456  1627456  1627456  1640064  1640064
Cycle      0 Memory usage: 1630464  1630464  1630464  1630464  1630464  1643072  1643072
Cycle    618 Memory usage: 1640064  1640064  1640064  1640064  1640064  1644288  1644288
Cycle   2377 Memory usage: 1643072  1643072  1643072  1643072  1643072  1646976  1646976
Cycle   3475 Memory usage: 1644288  1644288  1644288  1644288  1644288  1648384  1648384
Cycle   4273 Memory usage: 1646976  1646976  1646976  1646976  1646976  1650432  1650432
Cycle   4435 Memory usage: 1650432  1650432  1650432  1650432  1650432  1650752  1650752
Cycle   5755 Memory usage: 1648384  1648384  1648384  1648384  1648384  1652608  1652608
Cycle   6576 Memory usage: 1650752  1650752  1650752  1650752  1650752  1654592  1654592
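
For context, here is a minimal sketch of how per-rank numbers like these could be sampled; this is not necessarily the probe used to produce the log above, just one standard way to read a process's high-water resident memory:

#include <sys/resource.h>

// Sketch only: return the calling process's peak resident set size.
// On Linux, ru_maxrss is reported in kilobytes.
long currentMaxRssKb()
{
  rusage usage{};
  getrusage( RUSAGE_SELF, &usage );
  return usage.ru_maxrss;
}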

I also added some prints in the CommunicationTools::synchronizeFields method, and I have seen the increase happen in the synchronizePackSendRecvSizes and synchronizePackSendRecv calls, and even in the icomm.resize() call, which I didn't expect.

I dug a little deeper, and one of the leaks, for example, comes from a line in the NeighborCommunicator method packCommSizeForSync:

bufferSize += nodeManager.packSize( iter.second, nodeGhostsToSend, 0, onDevice, events )

Again, this involves an array1d< string > (string_array), iter.second in this case. I do believe that there is a fundamental issue with those on GPU.
Do you think it would be possible to resume your PR #2968?

@liyangrock (Contributor) commented:


The memory usage in the case described in my earlier comment is now stable after replacing array1d< string >. Thanks @rrsettgast!

@sframba (Contributor) commented Feb 17, 2025

Thanks @rrsettgast for the array1d< string > replacement. As discussed, I think there is a residual (probably small) issue with all array1d containers of composite objects, for example array1d< MPI_Status > and array1d< MPI_Request > in MPI_iCommData.hpp. I have highlighted those in my last commit on this branch. We could use this PR to change those to std::vector as well, if you agree and if it doesn't have any other impact.
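
A minimal sketch of what the std::vector alternative could look like (illustrative only; the function, variable names, and the elided posting step are placeholders, not the actual MPI_iCommData members):

#include <mpi.h>
#include <vector>

// Sketch only: hold the MPI handles in std::vector instead of array1d and
// hand the contiguous storage straight to MPI.
void waitOnAll( int const numMessages )
{
  std::vector< MPI_Request > requests( numMessages, MPI_REQUEST_NULL );
  std::vector< MPI_Status > statuses( numMessages );
  // ... post the corresponding MPI_Isend / MPI_Irecv calls into requests[i] ...
  MPI_Waitall( numMessages, requests.data(), statuses.data() );
}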

@corbett5 (Contributor) commented:

@sframba @rrsettgast have either of you run the LvArray unit tests through valgrind? Specifically, I think testArray1D.cpp would be sufficient since it uses a composite object. Last time I ran these tests through valgrind they were clean (it was years ago, but not much has changed). We're still using the nested arrays in the code, so we should try to fix the leak with storing composite objects in arrays.

@corbett5 (Contributor) commented:

@sframba @rrsettgast I just ran testArray1D.cpp through memcheck on develop and there are no leaks. Even when I change from using my TestString object, which provides a convenient wrapper, to using std::string directly, there are no leaks. And this is without using the suppression file, to boot. testArray1DOfArray1D is similarly clean.
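
For anyone who wants to reproduce that kind of check, here is a hypothetical stand-alone pattern (not the actual testArray1D.cpp; std::vector and std::string stand in for the container and composite element type) that should come back clean when run under memcheck:

#include <string>
#include <vector>

int main()
{
  std::vector< std::string > strings;
  for( int i = 0; i < 1024; ++i )
  {
    // long enough to defeat the small-string optimization and force heap allocations
    strings.emplace_back( 64, 'x' );
  }
  // every element's destructor must run on scope exit for memcheck to report no leaks
  return 0;
}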
