TensorFlow ParameterServer Training On FrameworkController
- Support both GPU and CPU Distributed Training
- Automatically clean up PS when the whole FrameworkAttempt is completed
- No need to adjust existing TensorFlow image
- No need to set up Kubernetes DNS and Kubernetes Service
- Common Feature
- See [PREREQUISITE] in each specific Framework yaml file
- Need to set up Kubernetes Cluster-Level Logging if you need to persist and expose the logs of deleted Pods