Documentation ¶
Overview ¶
Package trainer manages PyTorch training jobs.
Index ¶
- Constants
- type ClusterSpec
- type KubernetesLabels
- type PyTorchConfig
- type PyTorchReplicaSet
- func (s *PyTorchReplicaSet) Create(config *torchv1alpha1.ControllerConfig, worldSize int32) error
- func (s *PyTorchReplicaSet) CreatePodWithIndex(index int32, worldSize int32) (*v1.Pod, error)
- func (s *PyTorchReplicaSet) CreateServiceWithIndex(index int32) (*v1.Service, error)
- func (s *PyTorchReplicaSet) Delete() error
- func (s *PyTorchReplicaSet) GetSingleReplicaStatus(index int32) torchv1alpha1.ReplicaState
- func (s *PyTorchReplicaSet) GetStatus() (torchv1alpha1.PyTorchReplicaStatus, error)
- func (s *PyTorchReplicaSet) Labels() KubernetesLabels
- func (s *PyTorchReplicaSet) SyncPods(worldSize int32) error
- func (s *PyTorchReplicaSet) SyncServices() error
- type PyTorchReplicaSetInterface
- type TaskSpec
- type TrainingJob
- func (j *TrainingJob) ClusterSpec() ClusterSpec
- func (j *TrainingJob) Delete()
- func (j *TrainingJob) GetStatus() (torchv1alpha1.State, []*torchv1alpha1.PyTorchReplicaStatus, error)
- func (j *TrainingJob) Reconcile(config *torchv1alpha1.ControllerConfig) error
- func (j *TrainingJob) SchedulerName() string
- func (j *TrainingJob) UID() types.UID
- func (j *TrainingJob) Update(newJob *torchv1alpha1.PyTorchJob)
Constants ¶
const (
	SuccessfulCreateReason = "SuccessfulCreate"
	FailedCreateReason     = "FailedCreate"
)
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ClusterSpec ¶
TODO(jose5918): We don't really need the cluster spec for this operator but no harm in leaving it for POC.
ClusterSpec represents a cluster TensorFlow specification. https://www.tensorflow.org/deploy/distributed#create_a_tftrainclusterspec_to_describe_the_cluster It is a map from job names to network addresses.
type KubernetesLabels ¶
KubernetesLabels represents a set of labels to apply to Kubernetes resources.
func (KubernetesLabels) ToSelector ¶
func (l KubernetesLabels) ToSelector() (string, error)
ToSelector converts the labels to a selector matching the labels.
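To illustrate what converting labels to a selector can look like, here is a minimal sketch. It assumes KubernetesLabels is a plain map[string]string (the underlying type is not shown on this page) and builds an equality-based selector by hand. The function name toSelector and the label values are hypothetical, and the package's actual implementation may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// KubernetesLabels is assumed here to be a plain string map; the real
// underlying type is not shown in this documentation.
type KubernetesLabels map[string]string

// toSelector is a hypothetical stand-in for ToSelector. It joins the labels
// into an equality-based selector such as "a=1,b=2", with keys sorted so the
// result is deterministic.
func toSelector(l KubernetesLabels) (string, error) {
	keys := make([]string, 0, len(l))
	for k := range l {
		if k == "" {
			return "", fmt.Errorf("empty label key")
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+l[k])
	}
	return strings.Join(pairs, ","), nil
}

func main() {
	sel, err := toSelector(KubernetesLabels{"job_type": "WORKER", "app": "demo"})
	if err != nil {
		panic(err)
	}
	fmt.Println(sel) // prints "app=demo,job_type=WORKER"
}
```

A string of this form can be passed as the label selector when listing pods or services, which is presumably why the method returns a string rather than a structured selector.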
type PyTorchConfig ¶
type PyTorchConfig struct {
	Cluster     ClusterSpec `json:"cluster"`
	Task        TaskSpec    `json:"task"`
	Environment string      `json:"environment"`
}
PyTorchConfig is a struct representing the TensorFlow config. This struct is turned into an environment which is used by TensorFlow processes to configure themselves.
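The following sketch shows how a struct with these JSON tags serializes, which is the form that would be handed to a process through its environment. The shapes of ClusterSpec and TaskSpec are assumptions (their definitions are not shown on this page), the helper encodeConfig is hypothetical, and the addresses are made-up sample values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClusterSpec is assumed to map job names to network addresses,
// following the description on this page.
type ClusterSpec map[string][]string

// TaskSpec is an assumed shape; its real fields are not documented here.
type TaskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

// PyTorchConfig mirrors the struct shown above.
type PyTorchConfig struct {
	Cluster     ClusterSpec `json:"cluster"`
	Task        TaskSpec    `json:"task"`
	Environment string      `json:"environment"`
}

// encodeConfig is a hypothetical helper that serializes the config to the
// JSON string a controller could place in a container environment variable.
func encodeConfig(c PyTorchConfig) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	cfg := PyTorchConfig{
		Cluster:     ClusterSpec{"worker": {"worker-0:2222", "worker-1:2222"}},
		Task:        TaskSpec{Type: "worker", Index: 1},
		Environment: "cloud",
	}
	s, err := encodeConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Each replica would receive the same cluster section but its own task section, so a process can work out its role and position from a single value.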
type PyTorchReplicaSet ¶
type PyTorchReplicaSet struct {
	ClientSet kubernetes.Interface
	// Job is a pointer to the TrainingJob to which this replica belongs.
	Job  *TrainingJob
	Spec torchv1alpha1.PyTorchReplicaSpec
	// contains filtered or unexported fields
}
PyTorchReplicaSet is a set of PyTorch processes all acting as the same role (e.g. worker).
func NewPyTorchReplicaSet ¶
func NewPyTorchReplicaSet(clientSet kubernetes.Interface, recorder record.EventRecorder, tfReplicaSpec torchv1alpha1.PyTorchReplicaSpec, job *TrainingJob) (*PyTorchReplicaSet, error)
func (*PyTorchReplicaSet) Create ¶
func (s *PyTorchReplicaSet) Create(config *torchv1alpha1.ControllerConfig, worldSize int32) error
func (*PyTorchReplicaSet) CreatePodWithIndex ¶
func (s *PyTorchReplicaSet) CreatePodWithIndex(index int32, worldSize int32) (*v1.Pod, error)
CreatePodWithIndex creates a new pod with the specified index.
func (*PyTorchReplicaSet) CreateServiceWithIndex ¶
func (s *PyTorchReplicaSet) CreateServiceWithIndex(index int32) (*v1.Service, error)
CreateServiceWithIndex creates a new service with the specified index.
func (*PyTorchReplicaSet) Delete ¶
func (s *PyTorchReplicaSet) Delete() error
Delete deletes the replicas.
func (*PyTorchReplicaSet) GetSingleReplicaStatus ¶
func (s *PyTorchReplicaSet) GetSingleReplicaStatus(index int32) torchv1alpha1.ReplicaState
func (*PyTorchReplicaSet) GetStatus ¶
func (s *PyTorchReplicaSet) GetStatus() (torchv1alpha1.PyTorchReplicaStatus, error)
GetStatus returns the status of the replica set.
func (*PyTorchReplicaSet) Labels ¶
func (s *PyTorchReplicaSet) Labels() KubernetesLabels
Labels returns the labels for this replica set.
func (*PyTorchReplicaSet) SyncPods ¶
func (s *PyTorchReplicaSet) SyncPods(worldSize int32) error
SyncPods checks the current pods for this PyTorchReplicaSet and tries to bring them to the desired state.
func (*PyTorchReplicaSet) SyncServices ¶
func (s *PyTorchReplicaSet) SyncServices() error
SyncServices checks the current services for this PyTorchReplicaSet and tries to bring them to the desired state.
type PyTorchReplicaSetInterface ¶
type PyTorchReplicaSetInterface interface {
	Create() error
	Delete() error
	GetStatus() (torchv1alpha1.PyTorchReplicaStatus, error)
}
PyTorchReplicaSetInterface is an interface for managing a set of replicas.
type TrainingJob ¶
type TrainingJob struct {
	KubeCli  kubernetes.Interface
	Replicas []*PyTorchReplicaSet
	// contains filtered or unexported fields
}
TODO(jlewi): We should switch to a New pattern and make trainingJob private so we can ensure correctness on creation.
func NewJob ¶
func NewJob(kubeCli kubernetes.Interface, torchJobClient pytorchclient.Interface, recorder record.EventRecorder, job *torchv1alpha1.PyTorchJob, config *torchv1alpha1.ControllerConfig) (*TrainingJob, error)
func (*TrainingJob) ClusterSpec ¶
func (j *TrainingJob) ClusterSpec() ClusterSpec
func (*TrainingJob) Delete ¶
func (j *TrainingJob) Delete()
func (*TrainingJob) GetStatus ¶
func (j *TrainingJob) GetStatus() (torchv1alpha1.State, []*torchv1alpha1.PyTorchReplicaStatus, error)
func (*TrainingJob) Reconcile ¶
func (j *TrainingJob) Reconcile(config *torchv1alpha1.ControllerConfig) error
Reconcile tries to get the job into the desired state.
func (*TrainingJob) SchedulerName ¶
func (j *TrainingJob) SchedulerName() string
func (*TrainingJob) UID ¶
func (j *TrainingJob) UID() types.UID
func (*TrainingJob) Update ¶
func (j *TrainingJob) Update(newJob *torchv1alpha1.PyTorchJob)
Update replaces the PyTorchJob corresponding to TrainingJob with the provided job. This function is used when the Spec/Status of the job is modified outside the controller, for example when the user issues a delete request. Such a request updates the metadata on the object, so the spec needs to be replaced.