Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logging: fix delegating log sink races #12298

Merged
merged 3 commits into from
Jul 28, 2020
Merged

Conversation

mattklein123
Copy link
Member

@mattklein123 mattklein123 commented Jul 26, 2020

This fixes two different issues:

  1. Previously there was no locking around log sink replacement,
    so it was possibles for a log sink to get removed by one
    thread while getting written to by another thread.
  2. Even with locking, the base class destructor pattern would
    do the swap after the derived class was destroyed, leading to
    undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used, or previously in the death test
stderr workaround (EXPECT_DEATH_LOG_TO_STDERR) for coverage which has
been removed because coverage no longer logs to a file and instead logs to stderr
like the rest of the tests.

Fixes #11841

Risk Level: Medium, code is a bit scary, though only really in tests
Testing: Existing tests
Docs Changes: N/A
Release Notes: N/A

This fixes two different issues:
1) Previously there was no locking around log sink replacement,
   so it was possibles for a link sink to get removed by one
   thread while getting written to be another thread.
2) Even with locking, the base class destructor pattern would
   do the swap after the derived class was destroyed, leading to
   undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used or death test
stderr workaround for coverage which has been removed because
it no longer logs to a file.

Fixes #11841

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Copy link
Member Author

cc @jmarantz

@mattklein123 mattklein123 changed the title logging: fix delegating lock sink races logging: fix delegating log sink races Jul 26, 2020
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch - thank you for debugging all of these!

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Copy link
Member Author

@alyssawilk updated

@mattklein123
Copy link
Member Author

I'm not sure why my clang-tidy fixes are not working. cc @lizan

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for digging into this. Is there more of a story. you can put into the description about why EXPECT_DEATH_LOG_TO_STDERR is lo longer needed?

// protection is really only needed in tests. It would be nice to figure out a test-only
// mechanism for this that does not require extra locking that we don't explicitly need in the
// prod code.
absl::ReaderMutexLock sink_lock(&sink_mutex_);
Copy link
Contributor

@jmarantz jmarantz Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was doing benchmarks around ReaderMutexLock for symbol-tables, I found that the overall performance was worse than just using normal locks.

I tried this with a co-workers 72-core machine as well. I wouldn't say my tests were conclusive but it made me shy away from these if performance is the goal.

I do have an idea for test-only locks: you could use BasicLockable and have an empty implementation of that, which you could pass into LoggerContext in prod, and use a real one when instantiating the logger in test/test_runner.cc. You might need to update LoggerContext to take two mutexes rather than one, for this purpose.

It's would need careful commenting/variable-naming as a mutex lock which is a no-op in production could be dangerous if misinterpreted by a later change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this with a co-workers 72-core machine as well. I wouldn't say my tests were conclusive but it made me shy away from these if performance is the goal.

I guess my feeling on this is that a) logging performance is not the most critical thing since this is already after log level checks and b) it seems like the read mutex should be faster here, though if it isn't that seems like a larger problem that I Would rather not tackle right now as part of this change, so my inclination is to just leave this for now as it's technically correct and we can revisit later if needed.

It's would need careful commenting/variable-naming as a mutex lock which is a no-op in production could be dangerous if misinterpreted by a later change.

Per above since it's not 100% critical that we maximize perf in this path I think I would rather get this flake fix in and possibly try out your idea as part of the TODO later.

Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo @jmarantz comments

@mattklein123
Copy link
Member Author

Thanks very much for digging into this. Is there more of a story. you can put into the description about why EXPECT_DEATH_LOG_TO_STDERR is lo longer needed?

Updated

@mattklein123 mattklein123 merged commit dbbcc69 into master Jul 28, 2020
@mattklein123 mattklein123 deleted the cx_limit_flake_fixes branch July 28, 2020 16:48
KBaichoo pushed a commit to KBaichoo/envoy that referenced this pull request Jul 30, 2020
This fixes two different issues:
1) Previously there was no locking around log sink replacement,
   so it was possibles for a log sink to get removed by one
   thread while getting written to by another thread.
2) Even with locking, the base class destructor pattern would
   do the swap after the derived class was destroyed, leading to
   undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used, or previously in the death test
stderr workaround (EXPECT_DEATH_LOG_TO_STDERR) for coverage which has
been removed because coverage no longer logs to a file and instead logs to stderr
like the rest of the tests.

Fixes envoyproxy#11841

Risk Level: Medium, code is a bit scary, though only really in tests
Testing: Existing tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this pull request Aug 7, 2020
This fixes two different issues:
1) Previously there was no locking around log sink replacement,
   so it was possibles for a log sink to get removed by one
   thread while getting written to by another thread.
2) Even with locking, the base class destructor pattern would
   do the swap after the derived class was destroyed, leading to
   undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used, or previously in the death test
stderr workaround (EXPECT_DEATH_LOG_TO_STDERR) for coverage which has 
been removed because coverage no longer logs to a file and instead logs to stderr
like the rest of the tests.

Fixes envoyproxy#11841

Risk Level: Medium, code is a bit scary, though only really in tests
Testing: Existing tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Matt Klein <mklein@lyft.com>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this pull request Aug 7, 2020
This fixes two different issues:
1) Previously there was no locking around log sink replacement,
   so it was possibles for a log sink to get removed by one
   thread while getting written to by another thread.
2) Even with locking, the base class destructor pattern would
   do the swap after the derived class was destroyed, leading to
   undefined behavior.

This was easy to reproduce in cx_limit_integration_test but is
an issue anywhere the log expectations are used, or previously in the death test
stderr workaround (EXPECT_DEATH_LOG_TO_STDERR) for coverage which has
been removed because coverage no longer logs to a file and instead logs to stderr
like the rest of the tests.

Fixes envoyproxy#11841

Risk Level: Medium, code is a bit scary, though only really in tests
Testing: Existing tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: chaoqinli <chaoqinli@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cx_limit_integration_test flakes
3 participants