# Volcano Job scale up and down

@hzxuzhonghu; April 24, 2020

## Motivation

Currently, Volcano does not support Job update: the `Job.Spec` cannot be changed on the fly.
However, many users want to run ML training jobs in an elastic manner. For example, ModelArts wants to dynamically adjust a Job's replicas according to the cluster's idle capacity
in order to achieve the highest efficiency on the GPU cards.

As a first step, before more intelligent elasticity, I propose supporting dynamic scale up/down of Volcano jobs.

## Design

Before diving into the design, let's recall how a Job is currently initialized.

### Job Initialization

When a Volcano Job is created, the job controller does the following to run and manage all of its tasks:

1. Execute all plugins' `OnJobAdd` callbacks to create the Service, the hosts ConfigMap, etc.

2. Create the PVCs for the job.

3. Create the PodGroup for the job.

4. Execute the plugins' `OnPodCreate` callbacks to set pod-related environment variables, mount the hosts file, etc.

5. Call the kube-apiserver to create as many pods as the replicas of the job.

All the above steps run in `syncJob`, which is called when external events happen; in this case it happens when the Job is newly created. A minimal sketch of this flow is shown below.

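To make the order of operations concrete, here is an illustrative sketch of the flow; `pluginsOnJobAdd`, `createJobPVCs`, `createPodGroup`, `newTaskPod`, `pluginsOnPodCreate`, and `createPod` are hypothetical helpers standing in for the real controller code.

```
// Illustrative sketch of the initialization flow run by syncJob.
func (cc *jobController) initializeJob(job *vcbatch.Job) error {
	// 1. Plugins create the Service, hosts ConfigMap, etc.
	if err := cc.pluginsOnJobAdd(job); err != nil {
		return err
	}
	// 2. Create the PVCs declared by the job.
	if err := cc.createJobPVCs(job); err != nil {
		return err
	}
	// 3. Create the PodGroup used for gang scheduling.
	if err := cc.createPodGroup(job); err != nil {
		return err
	}
	// 4 + 5. For every desired replica, let plugins mutate the pod, then create it.
	for _, task := range job.Spec.Tasks {
		for i := int32(0); i < task.Replicas; i++ {
			pod := cc.newTaskPod(job, task, i) // build the pod from the task template
			if err := cc.pluginsOnPodCreate(pod, job); err != nil {
				return err
			}
			if err := cc.createPod(pod); err != nil {
				return err
			}
		}
	}
	return nil
}
```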
### Volcano Job Scale Up/Down

Scaling a Job up or down comes down to reconciling the resources the job owns, such as the PVCs, PodGroup, Service, and hosts ConfigMap,
so the procedure is quite similar to the [Job Initialization](#job-initialization) above.

The differences are:

1. Job plugins' callbacks: only the `svc` plugin should update the ConfigMap that lists the job's tasks.

2. Create pods when scaling up.

3. Delete pods when scaling down.

However, the initialization only runs when the job has not yet started,
so we need a way to know whether it was a scale up/down event that triggered this round of sync.

What I propose is to add a new event, `JobUpdatedEvent`, to indicate that the job was updated (here we only care about scale up/down),
and, accordingly, a new action, `UpdateJobAction`, that runs the `UpdateJob` function. The overall workflow is described in the `UpdateJob` section below.

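As a rough, self-contained sketch (using stand-in types rather than Volcano's actual event/action definitions), the new event/action pair could be wired like this:

```
// Stand-in event and action types; the real definitions live in Volcano's APIs.
type Event string
type Action string

const (
	// JobUpdatedEvent is emitted when the Job spec's replicas/minAvailable change.
	JobUpdatedEvent Event = "JobUpdated"
	// UpdateJobAction makes the job controller run the UpdateJob function.
	UpdateJobAction Action = "UpdateJob"
)

// actionFor maps an incoming event to the action the job controller executes.
// Only the new mapping is shown; existing events keep their current actions.
func actionFor(e Event) Action {
	if e == JobUpdatedEvent {
		return UpdateJobAction
	}
	return Action("SyncJob") // placeholder for the existing behavior
}
```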
To scale up/down on the fly, Volcano should be responsible for notifying the original pods of the current status, including the hosts of all the pods.
This is done by plugins, so to distinguish it from the initialization phase, a new `OnJobUpdate` callback is introduced.
Its job is to reconcile all the associated configs of the job. Currently, the `svc` plugin should update the ConfigMap holding all the hosts.

**NOTE**:

1. Users should watch `/etc/volcano` to get the up-to-date hosts files if they want to be aware of the current set of training workers.

2. The envs `VC_{task name}_HOSTS` and `VC_{task name}_NUM` of the existing pods cannot be mutated on the fly, so be careful not to rely on them.

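For example, user code inside the container could watch the mounted directory; the sketch below uses the third-party `fsnotify` package, which is just one possible mechanism and not something Volcano mandates.

```
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// /etc/volcano holds the hosts files that the svc plugin keeps up to date.
	if err := watcher.Add("/etc/volcano"); err != nil {
		log.Fatal(err)
	}
	for {
		select {
		case ev := <-watcher.Events:
			// Re-read the hosts files here and rebuild the worker list.
			log.Printf("hosts changed: %s", ev)
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```

With the new `OnJobUpdate` callback added, the plugin interface becomes: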
```
type PluginInterface interface {
	// Name returns the unique name of the plugin.
	Name() string

	// OnPodCreate is called for every pod when createJobPod runs.
	OnPodCreate(pod *v1.Pod, job *vcbatch.Job) error

	// OnJobAdd is called once when syncJob runs.
	OnJobAdd(job *vcbatch.Job) error

	// OnJobDelete is called once when killJob runs.
	OnJobDelete(job *vcbatch.Job) error

	// OnJobUpdate is called when the job is updated (scaled up/down) to
	// reconcile the job's associated configs, e.g. the hosts ConfigMap.
	OnJobUpdate(job *vcbatch.Job) error
}
```

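To illustrate the `svc` plugin's part, its `OnJobUpdate` could regenerate the hosts data roughly as below; the one-file-per-task layout and the `updateConfigMap` helper are assumptions for the sketch, not the plugin's actual code.

```
// OnJobUpdate rebuilds one hosts entry per task so that already-running pods
// observe the new replica set after a scale up/down.
func (sp *servicePlugin) OnJobUpdate(job *vcbatch.Job) error {
	data := map[string]string{}
	for _, task := range job.Spec.Tasks {
		hosts := ""
		for i := int32(0); i < task.Replicas; i++ {
			// Assume host names follow the <job>-<task>-<index> convention.
			hosts += fmt.Sprintf("%s-%s-%d\n", job.Name, task.Name, i)
		}
		data[task.Name+".host"] = hosts
	}
	// Hypothetical helper: write data into the job's hosts ConfigMap.
	return sp.updateConfigMap(job, data)
}
```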
`UpdateJob` is much like the current `SyncJob`, and its workflow is:

1. Execute all plugins' `OnJobUpdate` callbacks, which update the envs, the Service, and the hosts ConfigMap.

2. Create PVCs for the job if necessary.

3. Update the PodGroup for the job if necessary.

4. Execute the plugins' `OnPodCreate` callbacks to set pod-related environment variables, mount the hosts file, etc.

5. Call the kube-apiserver to create or delete pods so that they match the replicas of the job.

**Note**: when scaling down, pods are deleted from the highest index to the lowest. This order is not guaranteed, however, since Kubernetes is an eventually consistent system. A sketch of the deletion order is shown below.

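For illustration, selecting which pods of a task to delete could look like the following; pod names are assumed to end with `-<index>`, and `deletePod` is a hypothetical helper.

```
// scaleDownTask removes surplus pods of a single task, highest index first.
func (cc *jobController) scaleDownTask(pods []*v1.Pod, desired int) error {
	if len(pods) <= desired {
		return nil
	}
	// Sort by the numeric index parsed from the pod name suffix, descending.
	sort.Slice(pods, func(i, j int) bool {
		return podIndex(pods[i].Name) > podIndex(pods[j].Name)
	})
	for _, pod := range pods[:len(pods)-desired] {
		if err := cc.deletePod(pod); err != nil { // hypothetical delete helper
			return err
		}
	}
	return nil
}

// podIndex extracts the trailing index from a pod name such as "job-worker-3".
func podIndex(name string) int {
	n, _ := strconv.Atoi(name[strings.LastIndex(name, "-")+1:])
	return n
}
```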

### Admission webhook

The admission webhook should prevent invalid mutations of the Job spec on the fly. In this proposal, only `replicas` and `minAvailable` updates are allowed; any other spec change is rejected.
An update is also rejected if the total number of replicas is less than `minAvailable`.

`minAvailable` must be greater than zero, since we depend on it to maintain the job status. A sketch of this validation follows.
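A minimal sketch of that validation, assuming the usual Volcano batch API fields (`Spec.MinAvailable`, `Spec.Tasks[i].Replicas`); zeroing the mutable fields and comparing the rest is just one way to reject other changes.

```
// validateJobUpdate enforces the rules above: minAvailable stays positive,
// total replicas never drop below minAvailable, and nothing except replicas
// and minAvailable may change.
func validateJobUpdate(oldJob, newJob *vcbatch.Job) error {
	if newJob.Spec.MinAvailable <= 0 {
		return fmt.Errorf("minAvailable must be greater than zero")
	}
	var total int32
	for _, task := range newJob.Spec.Tasks {
		total += task.Replicas
	}
	if total < newJob.Spec.MinAvailable {
		return fmt.Errorf("total replicas %d is less than minAvailable %d", total, newJob.Spec.MinAvailable)
	}
	// Neutralize the fields that are allowed to change; any remaining
	// difference means a prohibited spec update.
	oldSpec, newSpec := oldJob.Spec.DeepCopy(), newJob.Spec.DeepCopy()
	oldSpec.MinAvailable, newSpec.MinAvailable = 0, 0
	for i := range oldSpec.Tasks {
		oldSpec.Tasks[i].Replicas = 0
	}
	for i := range newSpec.Tasks {
		newSpec.Tasks[i].Replicas = 0
	}
	if !reflect.DeepEqual(oldSpec, newSpec) {
		return fmt.Errorf("only replicas and minAvailable may be updated on the fly")
	}
	return nil
}
```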