-
Notifications
You must be signed in to change notification settings - Fork 802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Session misc. #9502
Session misc. #9502
Conversation
leaves-zwx
commented
Dec 1, 2022
•
edited
Loading
edited
- MultiClientSession (python/oneflow/framework/multi_client_session.py) 增加 Reset 接口
- 删除 SessionGlobalObjectsScope (oneflow/core/job/session_global_objects_scope.h)
- 删除 Cluster (oneflow/core/job/cluster.h)
- 删除 session 不再使用的 api (oneflow/api/python/session/session.cpp)
- 删除 session_util 中不再使用的 api (oneflow/api/python/framework/session_util.cpp)
Speed stats:
|
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9502/ |
Speed stats:
|
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9502/ |
@@ -43,6 +34,8 @@ ONEFLOW_API_PYBIND11_MODULE("", m) { | |||
[](MultiClientSessionContext& session, const std::string& config_proto_str) { | |||
return session.TryInit(config_proto_str).GetOrThrow(); | |||
}) | |||
.def("try_close", | |||
[](MultiClientSessionContext& session) { return session.TryClose().GetOrThrow(); }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里关闭 Session 的操作,只是调用了 try close,没有析构 session 本身对吧。
然后再 try init 把之前释放的资源再 init 出来?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
session 本身没法析构,以 shared_ptr 形式被 nn_graph 共享,我这里控制不了它的析构。
return torch.nn.functional.relu(x) | ||
|
||
def setUp(self): | ||
session_ctx.GetDefaultSession().Reset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reset 是不是会有个 bad case,主动调用 Rest 时, graph 虽然当前执行完成了,但是后面可能还要用,这时是不是会触发错误了?
所以这个 Reset 使用有个前置条件,就是调用 reset 之前,之前创建的 graph 都得析构了?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个需要用户控制了,用户也不需要这个接口,因为实际场景不会创建那么多 graph。
Speed stats:
|
Speed stats:
|
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9502/ |
config_proto.set_session_id(session_id); | ||
|
||
CHECK(of::RegsterSessionId(session_id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
session id 未来也可以优化掉。这个概念应该没有实际的用处。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个 session_id 我尝试删过了,删不掉。scope 还用到了 session_id。而且系统中全局搜索了 session id,还有很多地方使用,它们以什么方式取的得 session id,路劲上全部要修改。牵扯的改动挺多。所以还是另外的 pr 再去改了。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可以,我原本想的就是删掉 scope 中的 session id。不过可以后面再搞。 scope 本来也很重度,还有 symbol,后面一块儿重构吧。
static HashMap<int64_t, std::shared_ptr<Session>> id2session_map; | ||
return &id2session_map; | ||
} | ||
|
||
std::vector<int64_t>* RegsiteredSessionIds() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
要不顺道直接删掉 session id 吧,不需要记录 session id 2 session 的映射,没有多 session 同时存在的场景, scope 中的 session id 也可以删掉。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
只保留 default session 的全局对象就行。
new_default_sess = MultiClientSession(env, oneflow._oneflow_internal.NewSessionId()) | ||
session_id = oneflow._oneflow_internal.NewSessionId() | ||
assert oneflow._oneflow_internal.RegsterSessionId(session_id) | ||
new_default_sess = MultiClientSession(env, session_id) | ||
global _sess_id2sess |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不需要这个 global,可以直接用 global sess,去掉 id 相关的所有函数。
是不是可以用这个功能来试验下用这个来清理 CI 中的 session。 然后直接就加到 CI 的 unittest 里面。 |
os.environ.get("ONEFLOW_TEST_RESET_SESSION_PERIOD", "10") | ||
) | ||
if RESET_SESSION_COUNT >= reset_session_period: | ||
oneflow.framework.session_context.GetDefaultSession().Reset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里会影响 CI 吗?比如 CI 里自动进行 reset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个就是让 ci 每测试若干用例,就 reset 一次来重置 stream index 计数,避免出现 stream_index > 4096 的报错。
Speed stats:
|
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9502/ |
CI failed when running job: cuda-speed-test. PR label automerge has been removed |
Speed stats:
|
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9502/ |
Speed stats:
|
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9502/ |
Fixes https://github.com/Oneflow-Inc/OneTeam/issues/1861 ,在 TryCloseDefaultSession 之前设置 is_shutting_down 为 true(#9502 这个 PR 在 TryCloseDefaultSession 里引入了 Sync 操作),并把 python c 对象的指针保存在 PyFrame 对象里,GetCurrentFrame 里不再需要重复获取,修复因为没有拿到 gil 锁导致的 segfault Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>