Crashes caused by "Improve call counting mechanism" change #29934

Closed
jkotas opened this issue Feb 1, 2020 · 5 comments · Fixed by #32250

Comments

jkotas commented Feb 1, 2020

We have been seeing more intermittent crashes than usual in CI over the last few days. All crash dumps point to a problem with this PR. The typical crash is an access violation (AV) early during xunit process startup, with a stack trace like this:

#0  MethodDesc::HasNativeCodeSlot (this=0x25) at /__w/1/s/src/coreclr/src/vm/method.hpp:1886
#1  MethodDesc::GetNativeCode (this=0x25) at /__w/1/s/src/coreclr/src/vm/method.cpp:1028
#2  0x00007f2908dff137 in CallCountingManager::OnCallCountThresholdReached (transitionBlock=<optimized out>,
    stubIdentifyingToken=<optimized out>) at /__w/1/s/src/coreclr/src/vm/callcounting.cpp:743
#3  0x00007f2908de3289 in OnCallCountThresholdReachedStub2 ()
    at /__w/1/s/src/coreclr/src/pal/inc/unixasmmacrosamd64.inc:986
#4  0x00007f288f9887f9 in ?? ()
#5  0x00007f288df8b050 in ?? ()
#6  0x00007f288f4e91bb in ?? ()
#7  0x00007f286813041e in ?? ()

jkotas commented Feb 1, 2020

cc @kouvel

jkotas commented Feb 1, 2020

From #27540:

https://dev.azure.com/dnceng/public/_build/results?buildId=503929&view=ms.vss-test-web.build-test-results-tab&runId=16001644&paneView=attachments&resultId=175719

  Discovering: System.Data.Common.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Data.Common.Tests (found 1788 of 1791 test cases)
  Starting:    System.Data.Common.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 161:  3911 Segmentation fault      (core dumped)

jkotas commented Feb 1, 2020

From #22224:

https://dev.azure.com/dnceng/public/_build/results?buildId=503475&view=ms.vss-test-web.build-test-results-tab&runId=15997008&paneView=attachments&resultId=177338

  Discovering: System.Text.Encodings.Web.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.Encodings.Web.Tests (found 181 test cases)
  Starting:    System.Text.Encodings.Web.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 161:  2203 Segmentation fault

@stephentoub

#22786

jkotas commented Feb 1, 2020

Also, include #29892 as part of this.

@kouvel kouvel self-assigned this Feb 12, 2020
@kouvel kouvel added this to the 5.0 milestone Feb 12, 2020
kouvel added a commit to kouvel/runtime that referenced this issue Mar 2, 2020
- Commit 1
  - Reverts commit f954c6b, which reverted PR dotnet#1457 due to issues
- Commit 2
  - Fixes crashes and assertion failures seen by the original change, fixes dotnet#29934
  - The crashes were caused by commit dotnet@6aa3c70 in the original PR
  - Call counting infos cannot be deleted when the corresponding call counting stubs may still run, because:
    - The remaining call count decremented by the stub is in the call counting info
    - The only way to get a code version / method desc from a stub is to go through the call counting info
  - Got one repro of the assertion failure in dotnet#22786, and it is most likely caused by the same issue: modifying a deleted call counting info whose memory has been reused for a `NativeCodeVersionNode` corrupts the heap and messes up the method desc pointer
  - Fixed with a partial revert of the above commit. Added back the `Complete` stage and then call counting infos are deleted only after it's ensured that call counting stubs won't be used (shortly before deleting them).
- Commit 3
  - Public static functions of `CallCountingManager` that may be called through the debugger may run before static initialization; added a check for null as suggested in dotnet#29892
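
The lifetime rule described under Commit 2 can be illustrated with a small model. This is a minimal sketch, not the runtime's code: `CallCountingTable`, `Token`, `MarkComplete`, and `DeleteCompleted` are hypothetical names, and the step that ensures no stub can still run is assumed to happen outside the sketch. It only shows the ordering: reaching the threshold moves an info to a `Complete` stage, and the info is freed later, once stubs are known to be unused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical, simplified model. Names do not match the runtime's real types.
enum class Stage { StubMayBeActive, Complete };

struct CallCountingInfo {
    int remainingCallCount;      // decremented by the stub on each call
    std::uintptr_t methodDesc;   // stands in for the code version / method desc
    Stage stage;

    CallCountingInfo(std::uintptr_t md, int threshold)
        : remainingCallCount(threshold), methodDesc(md), stage(Stage::StubMayBeActive) {}
};

class CallCountingTable {
public:
    using Token = int;  // stands in for the stub-identifying token

    Token Add(std::uintptr_t methodDesc, int threshold) {
        Token t = nextToken_++;
        infos_.emplace(t, CallCountingInfo(methodDesc, threshold));
        return t;
    }

    // What a stub does on each call. Returns true once the threshold is reached.
    bool OnCall(Token t) {
        return --infos_.at(t).remainingCallCount <= 0;
    }

    // The only route from a stub back to the method: through the info.
    std::uintptr_t MethodDescFor(Token t) const { return infos_.at(t).methodDesc; }

    bool Contains(Token t) const { return infos_.count(t) != 0; }

    // Call counting is finished, but the info is kept: another thread may
    // still be running the stub and will read or write this info.
    void MarkComplete(Token t) { infos_.at(t).stage = Stage::Complete; }

    // Assumption: the caller has already ensured that no stub can still run.
    std::size_t DeleteCompleted() {
        std::size_t deleted = 0;
        for (auto it = infos_.begin(); it != infos_.end();) {
            if (it->second.stage == Stage::Complete) {
                it = infos_.erase(it);
                ++deleted;
            } else {
                ++it;
            }
        }
        return deleted;
    }

private:
    std::unordered_map<Token, CallCountingInfo> infos_;
    Token nextToken_ = 0;
};

// Walks through the ordering: a late lookup after completion still works,
// and the info disappears only in the explicit deletion step.
inline bool DemoLateStubCallIsSafe() {
    CallCountingTable table;
    CallCountingTable::Token token = table.Add(0x1000, 2);
    table.OnCall(token);
    bool reached = table.OnCall(token);
    table.MarkComplete(token);
    bool lateLookupOk = table.Contains(token) && table.MethodDescFor(token) == 0x1000;
    std::size_t deleted = table.DeleteCompleted();
    return reached && lateLookupOk && deleted == 1 && !table.Contains(token);
}
```

Deleting the info at the point where `MarkComplete` is called would correspond to the failure mode described above: a stub that is still running would then touch freed memory.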
kouvel added a commit that referenced this issue Mar 3, 2020
* Improve call counting mechanism

- Commit 1
  - Reverts commit f954c6b, which reverted PR #1457 due to issues
- Commit 2
  - Fixes crashes and assertion failures seen by the original change, fixes #29934
  - The crashes were caused by commit 6aa3c70 in the original PR
  - Call counting infos cannot be deleted when the corresponding call counting stubs may still run, because:
    - The remaining call count decremented by the stub is in the call counting info
    - The only way to get a code version / method desc from a stub is to go through the call counting info
  - Got one repro of the assertion failure in #22786, and it is most likely caused by the same issue: modifying a deleted call counting info whose memory has been reused for a `NativeCodeVersionNode` corrupts the heap and messes up the method desc pointer
  - Fixed with a partial revert of the above commit. Added back the `Complete` stage and then call counting infos are deleted only after it's ensured that call counting stubs won't be used (shortly before deleting them).
- Commit 3
  - Public static functions of `CallCountingManager` that may be called through the debugger may run before static initialization; added a check for null as suggested in #29892

* Fix crashes and assertion failures seen by the original change

* Add check for null for some functions callable from the debugger
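
The null check for debugger-callable functions can be pictured roughly as follows. This is a sketch under assumptions, not the actual change: `DebuggerVisibleRegistry`, `StaticInitialize`, and `CountForDebugger` are hypothetical names. It shows a static entry point that tolerates being called before the static state exists.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical stand-in for a class whose static state is created lazily at startup.
class DebuggerVisibleRegistry {
public:
    // Normally runs once during startup, before any instance is created.
    static void StaticInitialize() {
        if (s_instances == nullptr) {
            s_instances = new std::set<DebuggerVisibleRegistry *>();
        }
    }

    DebuggerVisibleRegistry() {
        assert(s_instances != nullptr);  // instances require prior initialization
        s_instances->insert(this);
    }

    ~DebuggerVisibleRegistry() { s_instances->erase(this); }

    // May be invoked by a debugger at any time, including before
    // StaticInitialize() has run, so it must not dereference a null pointer.
    static std::size_t CountForDebugger() {
        if (s_instances == nullptr) {
            return 0;  // nothing registered yet
        }
        return s_instances->size();
    }

private:
    static std::set<DebuggerVisibleRegistry *> *s_instances;
};

std::set<DebuggerVisibleRegistry *> *DebuggerVisibleRegistry::s_instances = nullptr;
```

Returning an empty result before initialization is one reasonable choice for a query function; other entry points might instead return early without doing any work.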
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020