trainer

package
v0.3.0
Warning

This package is not in the latest version of its module.

Published: Sep 29, 2018 License: Apache-2.0 Imports: 22 Imported by: 0

Documentation

Overview

Package trainer manages PyTorch training jobs.

Index

Constants

const (
	SuccessfulCreateReason = "SuccessfulCreate"
	FailedCreateReason     = "FailedCreate"
)

Variables

This section is empty.

Functions

This section is empty.

Types

type ClusterSpec

type ClusterSpec map[string][]string

TODO(jose5918): We don't really need the cluster spec for this operator, but there is no harm in leaving it for the POC.

ClusterSpec represents a TensorFlow cluster specification. https://www.tensorflow.org/deploy/distributed#create_a_tftrainclusterspec_to_describe_the_cluster It is a map from job names to network addresses.

type KubernetesLabels

type KubernetesLabels map[string]string

KubernetesLabels represents a set of labels to apply to Kubernetes resources.

func (KubernetesLabels) ToSelector

func (l KubernetesLabels) ToSelector() (string, error)

ToSelector converts the labels to a selector matching the labels.
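The page does not show how ToSelector builds its string. The following is a minimal sketch of one plausible conversion, assuming the selector is a comma-separated list of key=value equality requirements. The helper toSelector and the label values are hypothetical and are not part of this package. Keys are sorted only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// KubernetesLabels mirrors the type documented above.
type KubernetesLabels map[string]string

// toSelector is a hypothetical stand-in for ToSelector. It joins the labels
// into a comma-separated list of key=value equality requirements.
// Keys are sorted so the result does not depend on map iteration order.
func toSelector(l KubernetesLabels) string {
	keys := make([]string, 0, len(l))
	for k := range l {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pieces := make([]string, 0, len(keys))
	for _, k := range keys {
		pieces = append(pieces, k+"="+l[k])
	}
	return strings.Join(pieces, ",")
}

func main() {
	// Illustrative labels only; the labels the operator really applies are not shown on this page.
	labels := KubernetesLabels{"job_type": "worker", "app": "pytorch"}
	fmt.Println(toSelector(labels)) // prints "app=pytorch,job_type=worker"
}
```

A string in this form can be passed as a label selector when listing pods or services.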

type PyTorchConfig

type PyTorchConfig struct {
	Cluster     ClusterSpec `json:"cluster"`
	Task        TaskSpec    `json:"task"`
	Environment string      `json:"environment"`
}

PyTorchConfig is a struct representing the TensorFlow config. This struct is turned into an environment which is used by TensorFlow processes to configure themselves.

type PyTorchReplicaSet

type PyTorchReplicaSet struct {
	ClientSet kubernetes.Interface

	// Job is a pointer to the TrainingJob to which this replica belongs.
	Job  *TrainingJob
	Spec torchv1alpha1.PyTorchReplicaSpec
	// contains filtered or unexported fields
}

PyTorchReplicaSet is a set of PyTorch processes all acting in the same role (e.g. worker).

func NewPyTorchReplicaSet

func NewPyTorchReplicaSet(clientSet kubernetes.Interface, recorder record.EventRecorder, tfReplicaSpec torchv1alpha1.PyTorchReplicaSpec, job *TrainingJob) (*PyTorchReplicaSet, error)

func (*PyTorchReplicaSet) Create

func (s *PyTorchReplicaSet) Create(config *torchv1alpha1.ControllerConfig, worldSize int32) error

func (*PyTorchReplicaSet) CreatePodWithIndex

func (s *PyTorchReplicaSet) CreatePodWithIndex(index int32, worldSize int32) (*v1.Pod, error)

CreatePodWithIndex creates a new pod with the specified index.

func (*PyTorchReplicaSet) CreateServiceWithIndex

func (s *PyTorchReplicaSet) CreateServiceWithIndex(index int32) (*v1.Service, error)

CreateServiceWithIndex creates a new service with the specified index.

func (*PyTorchReplicaSet) Delete

func (s *PyTorchReplicaSet) Delete() error

Delete deletes the replicas.

func (*PyTorchReplicaSet) GetSingleReplicaStatus

func (s *PyTorchReplicaSet) GetSingleReplicaStatus(index int32) torchv1alpha1.ReplicaState

func (*PyTorchReplicaSet) GetStatus

GetStatus returns the status of the replica set.

func (*PyTorchReplicaSet) Labels

func (s *PyTorchReplicaSet) Labels() KubernetesLabels

Labels returns the labels for this replica set.

func (*PyTorchReplicaSet) SyncPods

func (s *PyTorchReplicaSet) SyncPods(worldSize int32) error

SyncPods checks the current pods for this PyTorchReplicaSet and tries to bring them to the desired state.

func (*PyTorchReplicaSet) SyncServices

func (s *PyTorchReplicaSet) SyncServices() error

SyncServices checks the current services for this PyTorchReplicaSet and tries to bring them to the desired state.
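The check-then-create pattern described for SyncPods and SyncServices can be sketched as a pure function that compares the desired replica count with the indices that already exist. This is only an illustration of the idea, not the package's implementation: missingIndices and its inputs are hypothetical, and the real methods talk to the Kubernetes API through the replica set's ClientSet.

```go
package main

import "fmt"

// missingIndices returns, in ascending order, the replica indices in
// [0, desired) that are not marked present in existing.
// A sync loop would then create one pod or service for each returned index.
func missingIndices(desired int32, existing map[int32]bool) []int32 {
	var missing []int32
	for i := int32(0); i < desired; i++ {
		if !existing[i] {
			missing = append(missing, i)
		}
	}
	return missing
}

func main() {
	// Suppose replicas 0 and 2 already exist and 4 are desired.
	existing := map[int32]bool{0: true, 2: true}
	fmt.Println(missingIndices(4, existing)) // prints "[1 3]"
}
```

Each returned index would correspond to one call such as CreatePodWithIndex or CreateServiceWithIndex.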

type PyTorchReplicaSetInterface

type PyTorchReplicaSetInterface interface {
	Create() error
	Delete() error
	GetStatus() (torchv1alpha1.PyTorchReplicaStatus, error)
}

PyTorchReplicaSetInterface is an interface for managing a set of replicas.

type TaskSpec

type TaskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

type TrainingJob

type TrainingJob struct {
	KubeCli kubernetes.Interface

	Replicas []*PyTorchReplicaSet
	// contains filtered or unexported fields
}

TODO(jlewi): We should switch to a New pattern and make trainingJob private so we can ensure correctness on creation.

func (*TrainingJob) ClusterSpec

func (j *TrainingJob) ClusterSpec() ClusterSpec

func (*TrainingJob) Delete

func (j *TrainingJob) Delete()

func (*TrainingJob) GetStatus

func (*TrainingJob) Reconcile

func (j *TrainingJob) Reconcile(config *torchv1alpha1.ControllerConfig) error

Reconcile tries to get the job into the desired state.

func (*TrainingJob) SchedulerName

func (j *TrainingJob) SchedulerName() string

func (*TrainingJob) UID

func (j *TrainingJob) UID() types.UID

func (*TrainingJob) Update

func (j *TrainingJob) Update(newJob *torchv1alpha1.PyTorchJob)

Update replaces the PyTorchJob corresponding to TrainingJob with the provided job. This function is used when the Spec/Status of the job is modified outside the controller. For example, if the user issues a delete request, the metadata on the object is updated, so we need to replace the spec.
