Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release tensor by ReleaseTensor instruction #4737

Merged
merged 2 commits into from
Apr 26, 2021

Conversation

daquexian
Copy link
Contributor

添加对 ReleaseTensor 指令的调用、在 python 退出时跳过 tensor 析构

…leasing when exiting

Signed-off-by: daquexian <daquexian566@gmail.com>
@@ -77,6 +77,12 @@ EagerMirroredTensorImpl::EagerMirroredTensorImpl(
eager_blob_object_(eager_blob_object) {
dtype_ = CHECK_JUST(DType::GetDTypeByDataType(eager_blob_object->blob_desc().data_type()));
tensor_storage_ = std::make_shared<TensorStorage>(eager_blob_object->tensor_buffer());
tensor_storage_->set_releaser_hook(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto& eager_blob_object = this->eager_blob_object_;
const auto& parallel_desc = this->parallel_desc();

@@ -77,6 +77,12 @@ EagerMirroredTensorImpl::EagerMirroredTensorImpl(
eager_blob_object_(eager_blob_object) {
dtype_ = CHECK_JUST(DType::GetDTypeByDataType(eager_blob_object->blob_desc().data_type()));
tensor_storage_ = std::make_shared<TensorStorage>(eager_blob_object->tensor_buffer());
tensor_storage_->set_releaser_hook(
[this](const std::shared_ptr<eager::TensorBuffer>& tensor_buffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[eager_blob_object, parallel_desc]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this只是裸指针,不能保证执行的时候该指针还有效。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有道理

@@ -77,6 +77,12 @@ EagerMirroredTensorImpl::EagerMirroredTensorImpl(
eager_blob_object_(eager_blob_object) {
dtype_ = CHECK_JUST(DType::GetDTypeByDataType(eager_blob_object->blob_desc().data_type()));
tensor_storage_ = std::make_shared<TensorStorage>(eager_blob_object->tensor_buffer());
tensor_storage_->set_releaser_hook(
[this](const std::shared_ptr<eager::TensorBuffer>& tensor_buffer) {
PhysicalRun([this](const std::shared_ptr<InstructionsBuilder>& builder) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[&]

@oneflow-ci-bot oneflow-ci-bot removed their request for review April 26, 2021 14:28
@oneflow-ci-bot oneflow-ci-bot merged commit 5170ffa into master Apr 26, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the skip_releasing_tensor_when_exiting branch April 26, 2021 15:07
liujuncheng pushed a commit that referenced this pull request Jun 3, 2021
* release tensor by instructions, update shut_down_util, skip tensor releasing when exiting

Signed-off-by: daquexian <daquexian566@gmail.com>

* Captures shared_ptr instread of raw pointer

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Former-commit-id: 5170ffa
mergify bot added a commit that referenced this pull request Mar 22, 2023
1. <del>#4737 这个很早之前的 PR 在
python 退出后会跳过 tensor 的析构,导致 oom,当时这么做的原因已经不记得了,现在来看不跳过也想不到会有什么问题,所以本 PR
恢复为正常执行析构。</del>
    更新:CI 遇到问题,待进一步定位,这个 PR 先不包含这部分改动了。
原 oom 问题可能和 #9681 这个 PR 把设置
is_shutting_down 的时机提前了有关
3. 修复线程不安全导致的 python stack getter 偶发 segfault 的 bug

---------

Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants