Multi-threading issue in rocksdb::WriteThread::Writer::CreateMutex() #2388

Closed · jsteemann opened this issue May 31, 2017 · 8 comments

@jsteemann (Contributor)

When we import data into RocksDB from multiple threads, we sometimes see several threads hang inside RocksDB code and stop making progress.

Backtraces of the hanging threads all look like this:

#1  0x00007f9b66324dbc in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x0000000001b2f835 in std::condition_variable::wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::<lambda()> > (__p=..., __lock=..., this=0x7f993924efa0) at /usr/include/c++/6/condition_variable:99
#3  rocksdb::WriteThread::BlockingAwaitState (this=this@entry=0x7f9b5e553f60, w=w@entry=0x7f993924ef00, 
    goal_mask=goal_mask@entry=14 '\016') at ./3rdParty/rocksdb/v5.1.4/db/write_thread.cc:37
#4  0x0000000001b2f916 in rocksdb::WriteThread::AwaitState (this=this@entry=0x7f9b5e553f60, w=w@entry=0x7f993924ef00, 
    goal_mask=goal_mask@entry=14 '\016', 
    ctx=ctx@entry=0x362b1d0 <rocksdb::WriteThread::JoinBatchGroup(rocksdb::WriteThread::Writer*)::ctx>)
    at ./3rdParty/rocksdb/v5.1.4/db/write_thread.cc:157
#5  0x0000000001b2fc51 in rocksdb::WriteThread::JoinBatchGroup (this=this@entry=0x7f9b5e553f60, w=w@entry=0x7f993924ef00)
    at ./3rdParty/rocksdb/v5.1.4/db/write_thread.cc:227
#6  0x0000000001aa1406 in rocksdb::DBImpl::WriteImpl (this=0x7f9b5e553800, write_options=..., my_batch=0x7f9906a3f000, 
    callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=0, disable_memtable=false)
    at ./3rdParty/rocksdb/v5.1.4/db/db_impl.cc:4665
#7  0x0000000001aa3117 in rocksdb::DBImpl::Write (this=<optimized out>, write_options=..., my_batch=<optimized out>)
    at ./3rdParty/rocksdb/v5.1.4/db/db_impl.cc:4615
#8  0x0000000001c328b3 in rocksdb::TransactionImpl::Commit (this=0x7f9906a3a000)
    at ./3rdParty/rocksdb/v5.1.4/utilities/transactions/transaction_impl.cc:249
...

More precisely, the threads are all hanging in WriteThread::BlockingAwaitState(...) at write_thread.cc:37 (note: the code is from RocksDB version 5.1.4):

 21 uint8_t WriteThread::BlockingAwaitState(Writer* w, uint8_t goal_mask) {
 22   // We're going to block.  Lazily create the mutex.  We guarantee
 23   // propagation of this construction to the waker via the
 24   // STATE_LOCKED_WAITING state.  The waker won't try to touch the mutex
 25   // or the condvar unless they CAS away the STATE_LOCKED_WAITING that
 26   // we install below.
 27   w->CreateMutex();
 28 
 29   auto state = w->state.load(std::memory_order_acquire);
 30   assert(state != STATE_LOCKED_WAITING);
 31   if ((state & goal_mask) == 0 &&
 32       w->state.compare_exchange_strong(state, STATE_LOCKED_WAITING)) {
 33     // we have permission (and an obligation) to use StateMutex
 34     std::unique_lock<std::mutex> guard(w->StateMutex());
 35     w->StateCV().wait(guard, [w] {
 36       return w->state.load(std::memory_order_relaxed) != STATE_LOCKED_WAITING;
 37     });
 38     state = w->state.load(std::memory_order_relaxed);
 39   }

I have attached to the process with gdb and added breakpoints after the wait call, but the waits never return.

Browsing through the code of write_thread.cc, I found one obvious problem that may cause trouble when multiple threads want to write in parallel:
BlockingAwaitState() unconditionally calls Writer::CreateMutex() in line 27, as can be seen above, and multiple threads can enter it at the same time.
CreateMutex() is supposed to lazily initialize a mutex and a condition variable if they have not been initialized yet.
The actual problem is inside CreateMutex():

130     void CreateMutex() {
131       if (!made_waitable) {
132         // Note that made_waitable is tracked separately from state
133         // transitions, because we can't atomically create the mutex and
134         // link into the list.
135         made_waitable = true;
136         new (&state_mutex_bytes) std::mutex;
137         new (&state_cv_bytes) std::condition_variable;
138       }
139     }

In line 131 it checks the made_waitable member. If it is false, it sets it to true in line 135.
This is not thread-safe: multiple threads can enter CreateMutex() at the same time, and all of them can see a made_waitable value of false. They will then all enter the initialization section, which may initialize the mutex and the condition variable multiple times, potentially overwriting existing state in the mutex and/or the condition variable.
And if the mutex or condition variable state is corrupted, it is very likely that there are follow-up issues when using them, e.g. the hanging threads observed above.
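
For illustration: if CreateMutex() really can be entered concurrently, the standard race-free idiom for this kind of one-time in-place construction is std::call_once. A minimal sketch (hypothetical, not the actual patch; the struct and member names just mirror the Writer layout above):

#include <condition_variable>
#include <mutex>
#include <new>
#include <type_traits>

struct WaitableSketch {
  // Would replace the plain bool made_waitable; std::call_once guarantees
  // the lambda below runs exactly once even with concurrent callers, and
  // late arrivals block until construction has finished.
  std::once_flag init_once;
  std::aligned_storage<sizeof(std::mutex), alignof(std::mutex)>::type
      state_mutex_bytes;
  std::aligned_storage<sizeof(std::condition_variable),
                       alignof(std::condition_variable)>::type state_cv_bytes;

  void CreateMutex() {
    std::call_once(init_once, [this] {
      new (&state_mutex_bytes) std::mutex;
      new (&state_cv_bytes) std::condition_variable;
    });
  }
  // Note: the destructor side would still need to know whether the objects
  // were ever constructed, e.g. via an extra flag set inside the lambda.
};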

I am wondering why the mutex and condition variable are initialized lazily at all, and why they are not initialized unconditionally in the constructor of WriteThread::Writer.
I patched RocksDB locally as follows and it seems to fix the problem:

diff --git a/3rdParty/rocksdb/v5.1.4/db/write_thread.cc b/3rdParty/rocksdb/v5.1.4/db/write_thread.cc
index 66cec0e1b3..646eab1db8 100644
--- a/3rdParty/rocksdb/v5.1.4/db/write_thread.cc
+++ b/3rdParty/rocksdb/v5.1.4/db/write_thread.cc
@@ -24,7 +24,7 @@ uint8_t WriteThread::BlockingAwaitState(Writer* w, uint8_t goal_mask) {
   // STATE_LOCKED_WAITING state.  The waker won't try to touch the mutex
   // or the condvar unless they CAS away the STATE_LOCKED_WAITING that
   // we install below.
-  w->CreateMutex();
+  // w->CreateMutex();
 
   auto state = w->state.load(std::memory_order_acquire);
   assert(state != STATE_LOCKED_WAITING);
diff --git a/3rdParty/rocksdb/v5.1.4/db/write_thread.h b/3rdParty/rocksdb/v5.1.4/db/write_thread.h
index 7cad1d0d59..64e23a4233 100644
--- a/3rdParty/rocksdb/v5.1.4/db/write_thread.h
+++ b/3rdParty/rocksdb/v5.1.4/db/write_thread.h
@@ -111,7 +111,10 @@ class WriteThread {
           state(STATE_INIT),
           parallel_group(nullptr),
           link_older(nullptr),
-          link_newer(nullptr) {}
+          link_newer(nullptr) {
+      
+      CreateMutex();
+    }
 
     ~Writer() {
       if (made_waitable) {

An alternative would be to make made_waitable an atomic variable and use compare_exchange_strong in CreateMutex. However, that would still require checking the state of the atomic variable on every invocation of BlockingAwaitState(...). Initializing the objects just once and unconditionally in the ctor of WriteThread::Writer seems to be more efficient and is at least easier to implement correctly.
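
For completeness, the atomic variant would have to look roughly like this (an untested sketch; it needs a three-state flag so that losing threads wait until the winner has finished the placement-new, otherwise they could touch unconstructed objects):

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <new>
#include <thread>
#include <type_traits>

struct WaitableSketch {
  enum : uint8_t { kNotMade = 0, kMaking = 1, kMade = 2 };
  std::atomic<uint8_t> made_waitable{kNotMade};
  std::aligned_storage<sizeof(std::mutex), alignof(std::mutex)>::type
      state_mutex_bytes;
  std::aligned_storage<sizeof(std::condition_variable),
                       alignof(std::condition_variable)>::type state_cv_bytes;

  void CreateMutex() {
    uint8_t expected = kNotMade;
    if (made_waitable.compare_exchange_strong(expected, kMaking,
                                              std::memory_order_acq_rel)) {
      // We won the race: construct in place, then publish.
      new (&state_mutex_bytes) std::mutex;
      new (&state_cv_bytes) std::condition_variable;
      made_waitable.store(kMade, std::memory_order_release);
    } else {
      // Another thread is constructing; spin until it publishes.
      while (made_waitable.load(std::memory_order_acquire) != kMade) {
        std::this_thread::yield();
      }
    }
  }
};

The extra spin for losing threads is exactly the kind of subtlety that makes the unconditional initialization in the constructor attractive.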

Note that all of the above applies to RocksDB version 5.1.4, but I peeked into the latest RocksDB version from the master branch and found the above code section to still be the same.
So I think the problem is still present in current versions of RocksDB.

I browsed existing GitHub issues for similar bug reports, and I think that issue #1208 could be related.

@yiwu-arbug (Contributor)

Unless there is a bug, CreateMutex() shouldn't be called from multiple threads. Each thread entering DBImpl::WriteImpl() holds its own Writer object and calls AwaitState() on its own (see the sketch after the questions below). Some questions for you:

  • What's your OS and compiler?
  • How easy can you reproduce the deadlock? Can you try ASAN/TSAN to see if they help detect the real bug?
  • Before it gets to deadlock, does any writes return error status (status other than Status::OK)?
  • Are you using Transaction or OptimisticTransaction?
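
For reference, each call to DBImpl::WriteImpl() constructs its own Writer on the stack, roughly like this (heavily simplified; the parameters are the ones visible in your backtrace):

Status DBImpl::WriteImpl(const WriteOptions& write_options,
                         WriteBatch* my_batch, WriteCallback* callback,
                         uint64_t* log_used, uint64_t log_ref,
                         bool disable_memtable) {
  WriteThread::Writer w;             // stack-local: one Writer per writing thread
  w.batch = my_batch;
  // ... sync/disableWAL/etc. are copied from write_options ...
  write_thread_.JoinBatchGroup(&w);  // links w into the lock-free writer list
  // ... the leader writes the whole batch group, followers wait ...
}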

On the other hand, I need to do some benchmarking to see how much impact it would have to change from lazily initializing the mutex to not doing so. The mutex is lazily initialized because, unless we are in a congested scenario, we shouldn't need it.

@jsteemann (Contributor, Author)

Thanks for the reply. Here are the answers to your questions:

  • 4.10.0-21-generic #23-Ubuntu and g++-6.3
  • it's relatively easy to reproduce. I will try ASAN later and update this issue with any findings
  • all writes up to the deadlock report success
  • it's using Transaction with pessimistic locking

I'll also check whether the threads are using the same Writer objects or different ones. If they are different objects, then this is probably not the cause of the deadlock, and something else is.

@yiwu-arbug (Contributor)

And just in case, are you using allow_2pc?

@jsteemann (Contributor, Author)

No, two-phase commits are not involved.

@jsteemann (Contributor, Author)

Quick update on this issue:
we ran a few more imports, and the problem still occurred even with my suggested patch above. So I dug a bit more into the source and found that there was also an issue with Writer threads not being correctly notified in all cases, which then starve.

I peeked into the upstream RocksDB code, which seems to contain a fix for this problem already. The following is the change I made to our 5.1.4 copy of RocksDB; it seems to fix our problem of threads hanging in RocksDB and not making progress.
I think the main problem was that LinkOne changed the state when a Writer became the group leader, but did not notify other waiting Writers. This should be done, however, because another Writer may already be blocking on the condition variable and would otherwise never get out of its wait call.
Upstream RocksDB code does something similar.

Here's my patch:

diff --git a/3rdParty/rocksdb/v5.1.4/db/write_thread.cc b/3rdParty/rocksdb/v5.1.4/db/write_thread.cc
index 66cec0e1b3..426dc8920a 100644
--- a/3rdParty/rocksdb/v5.1.4/db/write_thread.cc
+++ b/3rdParty/rocksdb/v5.1.4/db/write_thread.cc
@@ -189,11 +189,13 @@ void WriteThread::LinkOne(Writer* w, bool* linked_as_leader) {
     Writer* writers = newest_writer_.load(std::memory_order_relaxed);
     w->link_older = writers;
     if (newest_writer_.compare_exchange_strong(writers, w)) {
+      /*
       if (writers == nullptr) {
         // this isn't part of the WriteThread machinery, but helps with
         // debugging and is checked by an assert in WriteImpl
         w->state.store(STATE_GROUP_LEADER, std::memory_order_relaxed);
       }
+      */
       *linked_as_leader = (writers == nullptr);
       return;
     }
@@ -218,6 +220,9 @@ void WriteThread::JoinBatchGroup(Writer* w) {
   assert(w->batch != nullptr);
   bool linked_as_leader;
   LinkOne(w, &linked_as_leader);
+  if (linked_as_leader) {
+    SetState(w, STATE_GROUP_LEADER);
+  }
 
   TEST_SYNC_POINT_CALLBACK("WriteThread::JoinBatchGroup:Wait", w);
 
@@ -435,6 +440,8 @@ void WriteThread::EnterUnbatched(Writer* w, InstrumentedMutex* mu) {
     TEST_SYNC_POINT("WriteThread::EnterUnbatched:Wait");
     AwaitState(w, STATE_GROUP_LEADER, &ctx);
     mu->Lock();
+  } else {
+    SetState(w, STATE_GROUP_LEADER);
   }
 }
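
For context, SetState() is already present in write_thread.cc; if I read the 5.1.4 sources correctly, it looks roughly like this. Unlike a plain relaxed store, it takes the state mutex and signals the condition variable when the Writer has parked itself in STATE_LOCKED_WAITING, so a blocked waiter is guaranteed to wake up:

void WriteThread::SetState(Writer* w, uint8_t new_state) {
  auto state = w->state.load(std::memory_order_acquire);
  if (state == STATE_LOCKED_WAITING ||
      !w->state.compare_exchange_strong(state, new_state)) {
    // The waiter has installed STATE_LOCKED_WAITING and is (or will be)
    // blocked on the condvar; we must wake it while holding the mutex.
    assert(state == STATE_LOCKED_WAITING);
    std::lock_guard<std::mutex> guard(w->StateMutex());
    w->state.store(new_state, std::memory_order_relaxed);
    w->StateCV().notify_one();
  }
}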
 

I think this issue can then be closed, because the problem seems to be fixed in upstream RocksDB already. The originally reported multi-threading issue in CreateMutex may or may not be an actual problem: if, as you said, CreateMutex is never called in parallel on the same Writer object, then that part of the report is bogus; if there is a chance of it being called in parallel on the same Writer object, it is not.

@yiwu-arbug (Contributor)

Thank you very much for the investigation.

Regarding your patch: both before and after the patch, w->state is set to STATE_GROUP_LEADER if w gets to the head of the lock-free linked list. The only difference is that before the patch it used memory_order_relaxed to set the state, and after the patch it uses a stronger memory order. It has nothing to do with notifying other waiting threads. The way WriteThread works is that the group leader does the work for the waiting threads and notifies them to exit in ExitAsBatchGroupLeader(). Also, as the inline comment says, the leader's state doesn't matter, as long as it knows itself to be the leader. We set it anyway to make debugging easier.

I suspect the bug is still hidden, but the changed memory order in your patch somehow mitigates it. Please keep an eye on it. Lock-free algorithms are hard to debug...

@jsteemann (Contributor, Author)

Yes, the fact that it works for us now may be a side-effect of the changed memory order... I'll investigate further should the problem reoccur.
We'll also try an unpatched RocksDB 5.5 (or whatever version is the latest then) in a few weeks and see if the problem reoccurs with it. If yes, I'll dig deeper; if not, then it's all good.
Please feel free to close this issue. Thanks for your support!

@yiwu-arbug (Contributor)

Sounds good.
