-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ray Cluster features #50
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Some small things primarily in testing and defaults.
setup.cfg
Outdated
@@ -65,7 +65,10 @@ doc= | |||
sphinx_rtd_theme>=0.5.0 | |||
sphinx-fortran==1.1.1 | |||
nbsphinx>=0.8.2 | |||
myst-parser>=0.15.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed anymore
smartsim/exp/ray/raycluster.py
Outdated
del dictionary[key] | ||
|
||
|
||
### TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be taken out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I left it till the last moment to make sure we did not forget any suggestion.
line = fp.readline() | ||
while line: | ||
plain_line = re.sub("\033\\[([0-9]+)(;[0-9]+)*m", "", line) | ||
if "Local node IP:" in plain_line: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this parse from the raystarter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this parses from the head node log.
smartsim/exp/ray/raycluster.py
Outdated
ray_port, | ||
run_args=None, | ||
ray_args=None, | ||
interface="eth0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if the default interface should be ipogif0
which is the high speed network on XC.
eth0
would be cloud and local....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
smartsim/exp/ray/raystarter.py
Outdated
cliargs += [f"--redis-password={args.redis_password}"] | ||
|
||
# On some systems, ssh to compute nodes (and port forwarding) is not allowed. | ||
# If that's the case, the user should bind the dashboard to 0.0.0.0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment seems to be about the host address but the following lines are about the port number. Im I missing something?
tests/with_ray/test_ray.py
Outdated
|
||
pytestmark = pytest.mark.skip(reason="Local launch is currently disabled for Ray") | ||
|
||
# pytestmark = pytest.mark.skipif( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming we are leaving this in as a future?
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include a simple example of port forwarding here? I knew how to do it, but others might not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I write it as text? Because I usually do this with ssh
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, minor changes and optimizations but I think this is finally ready to go in!!
Great work!! 😄
Makefile
Outdated
@@ -103,23 +103,23 @@ cov: | |||
# help: test - Build and run all tests | |||
.PHONY: test | |||
test: | |||
@cd ./tests/; python -m pytest --ignore=full_wlm/ | |||
@cd ./tests/; python -m pytest --ignore=full_wlm/ --ignore=with_ray/full_wlm/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put ray with full_wlm
doc/conf.py
Outdated
@@ -41,7 +41,8 @@ | |||
'sphinxfortran.fortran_domain', | |||
'sphinxfortran.fortran_autodoc', | |||
'breathe', | |||
'nbsphinx' | |||
'nbsphinx', | |||
'myst_parser' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont need this
smartsim/control/controller.py
Outdated
@@ -345,6 +350,31 @@ def _launch_orchestrator(self, orchestrator): | |||
self._save_orchestrator(orchestrator) | |||
logger.debug(f"Orchestrator launched on nodes: {orchestrator.hosts}") | |||
|
|||
def _launch_ray_cluster(self, ray_cluster): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this be the normal launching function?
smartsim/control/controller.py
Outdated
if on WLM, find the nodes where it was launched and | ||
set them in the JobManager | ||
|
||
:param orchestrator: ray cluster to launch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
orchestrator
?
smartsim/exp/ray/raycluster.py
Outdated
logger = get_logger(__name__) | ||
|
||
|
||
def delete_elements(dictionary, key_list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets put this in utilities
@@ -120,6 +120,8 @@ def stop(self, *args): | |||
self._control.stop_entity(entity) | |||
for entity_list in stop_manifest.ensembles: | |||
self._control.stop_entity_list(entity_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible optimization to make in manifest so that we don't have switch cases here. add method to manifest to obtain all EntityList
types despite execptions. then just loop over all of them and call _control.stop_entity_list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing these. This looks great!
This draft PR adds Ray cluster setup and deployment to SmartSim.
The
ray.raystarter
script file allows us to bypass some IP inconsistencies we encountered when starting the head node in the standard way. We need to be on the node already to correctly get the head node IP.smartsim.exp.ray
Missing: