Skip to content
This repository was archived by the owner on Aug 9, 2024. It is now read-only.

Drop needs_reboot conditional in _check_status(...) #11

Merged
merged 4 commits into from
Apr 28, 2023

Conversation

NucciTheBoss
Copy link
Member

Description

This pull request adds juju-reboot to the _check_status(...) function so that the machine will automatically restart if it needs security updates applied. Using juju-reboot helps with automatic deployments as a human-operator is not required to manually intervene when the slurmdbd charm is first deployed.

How was the code tested?

I tested this charm on a private OpenStack instance with ubuntu@22.04 base images that needed security updates to be deployed.

Checklist

  • I am the author of these changes, or I have the rights to submit them.
  • I have added the relevant changes to the README and/or documentation.
  • I have self reviewed my own code.
  • All requested changes and/or review comments have been resolved.

Notes:
* _check_status needs C901 because it is too complex with the addition of `juju-reboot`. It should be simplified in a later commit.
Copy link
Contributor

@jaimesouza jaimesouza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NucciTheBoss!

@jamesbeedy
Copy link
Contributor

Let’s talk about this further before it gets merged.

@NucciTheBoss
Copy link
Member Author

Sounds good @jamesbeedy! Is there anything in particular about this proposed change that you would like to discuss? My main reasoning for adding this automatic reboot feature is that the "Machine needs reboot" conditional is interfering with my deployment scripts on clouds where the Juju base images are not as update to date with the latest Ubuntu packages.

I can get around the block by manually rebooting the machines when this conditional is encountered, but the commands are cloud-specific. This makes it difficult for when you are writing a bootstrapping script intended to be used with multiple clouds. juju-reboot is a "cloud-agnostic" tool that can be used to schedule a machine reboot after the running hook has finished executing. I found this to be the best of both worlds where we can apply security updates and not require human intervention during initial cloud deployment. Let me know what your thoughts are.

@jamesbeedy
Copy link
Contributor

jamesbeedy commented Apr 26, 2023

Sounds good @jamesbeedy! Is there anything in particular about this proposed change that you would like to discuss? My main reasoning for adding this automatic reboot feature is that the "Machine needs reboot" conditional is interfering with my deployment scripts on clouds where the Juju base images are not as update to date with the latest Ubuntu packages.

I can get around the block by manually rebooting the machines when this conditional is encountered, but the commands are cloud-specific. This makes it difficult for when you are writing a bootstrapping script intended to be used with multiple clouds. juju-reboot is a "cloud-agnostic" tool that can be used to schedule a machine reboot after the running hook has finished executing. I found this to be the best of both worlds where we can apply security updates and not require human intervention during initial cloud deployment. Let me know what your thoughts are.

Hey @NucciTheBoss ,

The reasons we have avoided juju-reboot in the past are:

  1. `juju-reboot didn't work on rhel variants (I think it does now though)
  2. We didn't want to introduce the possibility for charm code to be able to disrupt running workloads.
  3. GPU and infiniband drivers were built to match a specific kernel version, and we did not want to upgrade the kernel.

#1 was a hard blocker which it is not anymore, #2 though, was what we felt strongly about, #3 was imperative for our driver provisioning.

There have been ideas discussed concerning a tighter charm code integration with slurm, where the slurmd charm code would be responsible for draining the node and then when no jobs were running and the node was drained, authorize a reboot (but this still did not account for #3). To this extent, we decided against automated reboots and figured we would simply signal to the user if a reboot is needed and let the admin/user take care of things.

We also have discussed triggering a reboot from an action ... but then the action never completes and we end up in a strange situation there too.

@jamesbeedy
Copy link
Contributor

jamesbeedy commented Apr 26, 2023

Possibly we just remove checking if a machine needs reboot .... thoughts?

@NucciTheBoss
Copy link
Member Author

We also have discussed triggering a reboot from an action...

Yeah... triggering reboots and then regaining control is always a pain. I'm certain we could find a "clever" way to trigger reboots using an action since juju-reboot only works with hooks. We could have something like:

def _on_reboot_action(self, _: ActionEvent) -> None:
    RebootRequiredEvent().emit()

def _on_reboot_event(self, _: RebootRequiredEvent) -> None:
    subprocess.run(["juju-reboot"])

This way you escape the inconvenient situation where Juju won't return the result of your reboot action and then the hook schedules the reboot.

Possibly we just remove checking if a machine needs reboot .... thoughts?

I would prefer this 1000 times over scheduling a reboot/blocking deployment for an update 😅. If you all are alright with this, I can update my flock of pull requests to focus on removing these conditionals. If we still want to signal to the user that the machine requires a reboot, we could keep the needs reboot conditional but change it to just throwing a warning message in juju debug-log .... Let me know your thoughts on this!

@jamesbeedy
Copy link
Contributor

100% lets drop it 👊

@NucciTheBoss
Copy link
Member Author

100% lets drop it fist_oncoming

Sweet! I'll get to work on removing the code then. We should discuss the issues with GPU and Infiniband drivers at the next sync.

@NucciTheBoss NucciTheBoss changed the title Use juju-reboot to automatically restart slurmdbd units Drop needs_reboot conditional in _check_status(...) Apr 27, 2023
Copy link
Contributor

@jamesbeedy jamesbeedy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Thanks!

Copy link
Contributor

@jaimesouza jaimesouza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@jaimesouza jaimesouza merged commit 814c7da into charmed-hpc:main Apr 28, 2023
@NucciTheBoss NucciTheBoss deleted the patch-reboot branch July 20, 2023 21:15
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants