-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dynamic timeline #2215
Dynamic timeline #2215
Conversation
4a2daae
to
f927588
Compare
Nice work, @Vikas-kum! Thanks for picking this up. I'll take a look today or tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! A few questions.
horovod/common/timeline.cc
Outdated
} else { | ||
// last event closed the json ']' , need to seek to one position back and write ',' to continue | ||
long pos = file_.tellp(); | ||
file_.seekp (pos-1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any performance penalty to doing this? Timeline already introduces a lot of overhead, especially when using markers. Can you test it with synthetic benchmark with and without this op?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think it should cause any perf penalty, given that we are already going to write bytes from next position.
Would you be able to help me with quickly running a small job and verifying on this PR as I will ahve to go through setups, which i dont currently have. :) I know I am asking for a favor but will appreciate if you can help.
Thanks @tgaddair for review. Will update PR soon addressing comments. |
77fe6e2
to
eb26a22
Compare
@tgaddair Gentle reminder. Would you be take another round at this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about, I believe your approach here makes sense then what I was trying to achieve in my last couple of commits to #2129, as we keep all the state modification in the RunLoopOnce
, which avoids race conditions.
The one thing I would still like to see is to make these calls blocking so the user doesn't need to sleep after each call (otherwise, they could miss allreduce information on the timeline when starting, or the end of the timeline if they stop and then exit the program). Does that sound reasonable to you?
I took this approach because it simplifies the implementation quite a bit and also makes sure that locks are held for much smaller and confined scope. Start takes into effect only when BGThread goes to next iteration. To make sure of correctness test checks for cycle markers and events getting captured we would need to wait for at least 2 full cycle, so that event and cycle markers are indeed written. hvd.start_Timeline()--> (sleep here in test code makes sure that timeline is started before next line test code)starts takes in to effect --> allreduce (event will be captured) --> events are written and cycles are written to queue --> stop_timeline(No more events)--> WriterThread dump any existing events to a file. Somehow, I don't feel strongly about the idea of blocking call because it blocks my training script completely. I do see a side effect of having non-blocking calls and that is actual events may be written in next cycle of thread. But at this point, I expect user would be calling start_timeline to stop_timeline for sufficient gap which gives them a good picture of what is happening. If user wants to block, they can put this kind of sleep 1s(and know that they are knowingly blocking their script) in their code to make sure that events get dumped. For this reason, I would like to have this non-blocking. Please let me know if you have strong feeling about blocking call and I can make stop timeline blocking. |
6365cce
to
896dca7
Compare
3bb245f
to
4613b7a
Compare
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Adding mroe test Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
remove debug statement addressed comments Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
4613b7a
to
24cb454
Compare
09a0832
to
90fb692
Compare
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
e42b5b5
to
2889be9
Compare
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
924bbdb
to
d07b715
Compare
Thanks for making these changes, @Vikas-kum! Last thing I would ask: can run If tests pass after that, we should be good to land :) |
thanks @tgaddair , I ran clang-format for only timeline.h timeline.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, will take a look at the follow-up PR!
Thanks. Created this PR: https://github.com/horovod/horovod/pull/2281/files , |
@Vikas-kum We found an issue introduced by this commit. Our distributed filesystem doesn't support random writes (including re-open and append). It used to work fine with horovod timeline because it only create, writes the data sequentially, and then close. I wonder if we can keep the old behavior if users do not use the dynamic timeline feature. Any pointers on what we should change? |
Any next steps on this @Vikas-kum ? |
This PR
1)introduces a new Python API for starting and stopping the timeline:
TODO
1/ thread detach() re-introduce.
Fixes #2008.
This is followup from this PR: #2129 and addresses concurrent access.
Checklist before submitting
Description
Fixes # (issue).
Review process to land