-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correctly record the rawBytesRead_ metric in TableScan #8147
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@@ -274,6 +274,7 @@ std::vector<cache::CachePin> DirectCoalescedLoad::loadData(bool isPrefetch) { | |||
} | |||
input_->read(buffers, requests_[0].region.offset, LogType::FILE); | |||
ioStats_->read().increment(size); | |||
ioStats_->incRawBytesRead(size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks it's a duplicated metric. Can we use ioStats_->Read instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FelixYBW Yes, the duplicated metric has been removed. Additionally, the getCompletedBytes
method in HiveDataSource
has been updated to retrieve the ioStats_->read()
metric instead of the rawBytesRead_
metric. Here is a detailed explanation of the changes #8147 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiaoxmeng Got it. Updated. Please help to review again. Thanks.
@mbasmanova Can you help to review? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta Jimmy, would you help review this PR?
CC: @oerling @xiaoxmeng
Looks we still have two metrics used, like here. Should we unified to one metric? Or the two metrics count different thing? velox/velox/dwio/common/DirectInputStream.cpp Lines 148 to 155 in f12b055
|
@FelixYBW It seems the value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf LGTM. Thanks!
@oerling You added the velox/velox/dwio/common/DirectInputStream.cpp Lines 148 to 155 in f12b055
|
@xiaoxmeng @mbasmanova Can you help to merge if no further comment? Thanks for your help |
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@xiaoxmeng @mbasmanova Can you help to merge? Thanks. |
@JkSelf I synced with @oerling offline and we might need to first make a refactor on our side about these different counters as these are also used by Meta internal projects. The meaning of these read counters are a bit different with and without cache and we will update with better documentation on these to avoid the confusion. Thanks! |
From @oerling : |
@xiaoxmeng @FelixYBW Thank you for providing the crucial information. As said by @oerling, i think it is more accurate to use |
@xiaoxmeng Do you have any further comment? Thanks. |
@xiaoxmeng Can you help to review again? Thanks. |
@@ -274,6 +274,7 @@ std::vector<cache::CachePin> DirectCoalescedLoad::loadData(bool isPrefetch) { | |||
} | |||
input_->read(buffers, requests_[0].region.offset, LogType::FILE); | |||
ioStats_->read().increment(size); | |||
ioStats_->incRawBytesRead(size - overread); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf Can you also fix this in DwioCoalescedLoadBase::updateStats
? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta The rawBytesRead already updated in loadSync
and loadData
. It seems no need updated in DwioCoalescedLoadBase::updateStats
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is it updated in DwioCoalescedLoad::loadData
and SsdLoad::loadData
? I cannot find it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reviewed the code and found that the update of rawBytesRead
for CachedInputBuffer
is done in the loadSync
function. loadSync
is executed unconditionally after loadOrFuture
here. However, in the case of DirectInputBuffer
, the execution of loadSync
is conditional here, so we need to update rawBytesRead
in the loadData
function instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta Do you have any further comment? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. Can you add comment to those 2 places saying the counter is increased in loadSync
unconditionally so we don't need to do it in loadData
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta Added the comments. Please help to review again. Thanks.
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@xiaoxmeng merged this pull request in 6d2f0d4. |
Conbench analyzed the 1 benchmark run on commit There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details. |
This pull request has been reverted by 0f96c1c. |
Based on @oerling design:
rawBytesRead is the quantized size of data read by a query. Read is the volume copied from storage. The second is 0 if data comes from caches. With DirectBufferedInput these are close but then read has unused coalesced columns counted where raw read does not. A column can be coalesced into IO but never hit by the query due to filtering.
Currently,
DirectBufferInput
only tracksrawBytesRead
in theloadSync
method. It does not trackrawBytesRead
in the asynchronousloadData
method. This PR adds therawBytesRead
metric to theloadData
method.