-
Notifications
You must be signed in to change notification settings - Fork 4.7k
Print additional error handling instructions on RocksDB error. #27170
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -254,6 +254,18 @@ struct Rocks { | |||
write_batch_perf_status: PerfSamplingStatus, | ||||
} | ||||
|
||||
macro_rules! rocks_try { | ||||
($expression:expr) => {{ | ||||
match $expression { | ||||
Ok(output) => output, | ||||
Err(error) => { | ||||
$crate::blockstore_db::Rocks::print_error_handling_instruction(&error); | ||||
return Err($crate::blockstore_db::BlockstoreError::RocksDb(error)); | ||||
} | ||||
} | ||||
}}; | ||||
} | ||||
|
||||
impl Rocks { | ||||
fn open(path: &Path, options: BlockstoreOptions) -> Result<Rocks> { | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. according to past bug reports, wrapping There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. RocksDB reports corruption only when it actually read that block, which IMO is a good thing because this will allow us to open the db and obtain file metadata in our ledger tool. For instance, #26790 allows our ledger tool to print out slot range of a sst file, and #26905 allows our ledger tool to purge older slots to free up space. They can both operate on corrupted sst files. |
||||
let access_type = options.access_type.clone(); | ||||
|
@@ -441,17 +453,17 @@ impl Rocks { | |||
} | ||||
|
||||
fn get_cf(&self, cf: &ColumnFamily, key: &[u8]) -> Result<Option<Vec<u8>>> { | ||||
let opt = self.db.get_cf(cf, key)?; | ||||
let opt = rocks_try!(self.db.get_cf(cf, key)); | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why are we certain this is the only source of the error? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. also, i think this is a bit maintenance annoyance, when adding more api surface into rocks. instead, i'd propose remove this auto-derived solana/ledger/src/blockstore_db.rs Line 104 in 1eba91a
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
These are just starters :p. Ideally, it should apply to all rocksdb calls. This PR only takes care of those rocksdb calls inside There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's another one that we cannot catch is the corruption reported via rocksdb's background compaction job. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This sounds like a good idea. Let me think more about it as we want it to be general enough to cover all types of severe/irrecoverable errors. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
oh, that's good point. btw, can we be notified these errors in some way like registering error listeners to rocks? if it's easy, let's do this. :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This could be done via the |
||||
Ok(opt) | ||||
} | ||||
|
||||
fn put_cf(&self, cf: &ColumnFamily, key: &[u8], value: &[u8]) -> Result<()> { | ||||
self.db.put_cf(cf, key, value)?; | ||||
rocks_try!(self.db.put_cf(cf, key, value)); | ||||
Ok(()) | ||||
} | ||||
|
||||
fn delete_cf(&self, cf: &ColumnFamily, key: &[u8]) -> Result<()> { | ||||
self.db.delete_cf(cf, key)?; | ||||
rocks_try!(self.db.delete_cf(cf, key)); | ||||
Ok(()) | ||||
} | ||||
|
||||
|
@@ -461,7 +473,7 @@ impl Rocks { | |||
from_key: &[u8], | ||||
to_key: &[u8], | ||||
) -> Result<()> { | ||||
self.db.delete_file_in_range_cf(cf, from_key, to_key)?; | ||||
rocks_try!(self.db.delete_file_in_range_cf(cf, from_key, to_key)); | ||||
Ok(()) | ||||
} | ||||
|
||||
|
@@ -505,7 +517,10 @@ impl Rocks { | |||
} | ||||
match result { | ||||
Ok(_) => Ok(()), | ||||
Err(e) => Err(BlockstoreError::RocksDb(e)), | ||||
Err(e) => { | ||||
Rocks::print_error_handling_instruction(&e); | ||||
Err(BlockstoreError::RocksDb(e)) | ||||
} | ||||
} | ||||
} | ||||
|
||||
|
@@ -526,6 +541,21 @@ impl Rocks { | |||
Err(e) => Err(BlockstoreError::RocksDb(e)), | ||||
} | ||||
} | ||||
|
||||
fn print_error_handling_instruction(rocks_error: &rocksdb::Error) { | ||||
let error_msg = format!("{:?}", rocks_error); | ||||
if error_msg.contains("Corruption:") { | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think rocksdb::Error::kind() can be used? https://docs.rs/rocksdb/latest/rocksdb/struct.Error.html#method.kind https://docs.rs/rocksdb/latest/rocksdb/enum.ErrorKind.html#variant.Corruption There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, oh, I didn't know this is available in rust-rocksdb. Thanks for pointing it out! |
||||
error!( | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. well, I don't think |
||||
"RocksDb detected data corruption. This usually means your \ | ||||
ledger integrity is irrevocably compromised due to \ | ||||
underlying hardware problem like silent data corruption \ | ||||
and can't continue to operate. If possible, bug report to \ | ||||
#9009 with the mentioned SST file attached. After that, \ | ||||
remove your current ledger to restore operation or restore \ | ||||
using your backup system." | ||||
); | ||||
} | ||||
} | ||||
} | ||||
|
||||
pub trait Column { | ||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why macro is needed? i think generic wrapper
fn
should suffice at quick glance.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn
will work as well, but since we are doing this only for rare errors,macro_rules!
should have less performance impact on the normal case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, i don't think using
macro_rules!
will bring better performance here... could you elaborate a bit? I'm currently thinking thefn
version would surely be inlined, just like the macro one.