trainer

package

v0.3.0 Published: Sep 29, 2018 License: Apache-2.0 Imports: 22 Imported by: 0

Repository

github.com/linquanisaac/pytorch-operator

Documentation ¶

Overview ¶

Package trainer manages PyTorch training jobs.

Index ¶

Constants
type ClusterSpec
type KubernetesLabels
- func (l KubernetesLabels) ToSelector() (string, error)
type PyTorchConfig
type PyTorchReplicaSet
- func NewPyTorchReplicaSet(clientSet kubernetes.Interface, recorder record.EventRecorder, ...) (*PyTorchReplicaSet, error)
type PyTorchReplicaSetInterface
type TaskSpec
type TrainingJob
- func NewJob(kubeCli kubernetes.Interface, torchJobClient pytorchclient.Interface, ...) (*TrainingJob, error)

Constants ¶

const (
	SuccessfulCreateReason = "SuccessfulCreate"
	FailedCreateReason     = "FailedCreate"
)

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type ClusterSpec ¶

type ClusterSpec map[string][]string

TODO(jose5918): We don't really need the cluster spec for this operator, but there is no harm in leaving it for the POC.

ClusterSpec represents a TensorFlow cluster specification. https://www.tensorflow.org/deploy/distributed#create_a_tftrainclusterspec_to_describe_the_cluster It is a map from job names to network addresses.

type KubernetesLabels ¶

type KubernetesLabels map[string]string

KubernetesLabels represents a set of labels to apply to a Kubernetes resource.

func (KubernetesLabels) ToSelector ¶

func (l KubernetesLabels) ToSelector() (string, error)

ToSelector converts the labels to a selector matching the labels.
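As a rough illustration of what converting labels to a selector can look like, the sketch below builds an equality-based Kubernetes selector string (`key=value` pairs joined by commas). It is not the package's implementation: `Labels` and `toSelector` are hypothetical stand-ins, the label names are made up, and sorting the keys is an assumption made here only to keep the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Labels is a local stand-in for the documented KubernetesLabels type
// (map[string]string), so the sketch compiles on its own.
type Labels map[string]string

// toSelector joins the labels into an equality-based selector string.
// Assumption: keys are sorted for deterministic output; the real
// ToSelector may order the pairs differently.
func toSelector(l Labels) string {
	keys := make([]string, 0, len(l))
	for k := range l {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+l[k])
	}
	return strings.Join(pairs, ",")
}

func main() {
	// Hypothetical label names and values, for illustration only.
	l := Labels{"job_type": "WORKER", "app": "pytorch-job"}
	fmt.Println(toSelector(l)) // app=pytorch-job,job_type=WORKER
}
```

A string in this form is what the `LabelSelector` field of a Kubernetes list request expects.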

type PyTorchConfig ¶

type PyTorchConfig struct {
	Cluster     ClusterSpec `json:"cluster"`
	Task        TaskSpec    `json:"task"`
	Environment string      `json:"environment"`
}

PyTorchConfig is a struct representing the TensorFlow config. This struct is turned into an environment which is used by TensorFlow processes to configure themselves.
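The JSON tags on the struct suggest that the configuration is serialized as JSON before it is handed to the training process. The sketch below shows the resulting shape under that assumption. The types are local copies of the documented definitions, `encodeConfig` is a hypothetical helper, and the addresses, port, and environment value are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented types so the sketch is self-contained.
type ClusterSpec map[string][]string

type TaskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

type PyTorchConfig struct {
	Cluster     ClusterSpec `json:"cluster"`
	Task        TaskSpec    `json:"task"`
	Environment string      `json:"environment"`
}

// encodeConfig is a hypothetical helper that serializes the config to JSON.
func encodeConfig(c PyTorchConfig) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	cfg := PyTorchConfig{
		// Job names map to network addresses; these values are made up.
		Cluster: ClusterSpec{
			"master": {"master-0:23456"},
			"worker": {"worker-0:23456"},
		},
		Task:        TaskSpec{Type: "worker", Index: 0},
		Environment: "cloud",
	}
	s, err := encodeConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	// {"cluster":{"master":["master-0:23456"],"worker":["worker-0:23456"]},"task":{"type":"worker","index":0},"environment":"cloud"}
}
```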

type PyTorchReplicaSet ¶

type PyTorchReplicaSet struct {
	ClientSet kubernetes.Interface

	// Job is a pointer to the TrainingJob to which this replica belongs.
	Job  *TrainingJob
	Spec torchv1alpha1.PyTorchReplicaSpec
	// contains filtered or unexported fields
}

PyTorchReplicaSet is a set of PyTorch processes all acting as the same role (e.g. worker).

func NewPyTorchReplicaSet ¶

func NewPyTorchReplicaSet(clientSet kubernetes.Interface, recorder record.EventRecorder, tfReplicaSpec torchv1alpha1.PyTorchReplicaSpec, job *TrainingJob) (*PyTorchReplicaSet, error)

func (*PyTorchReplicaSet) Create ¶

func (s *PyTorchReplicaSet) Create(config *torchv1alpha1.ControllerConfig, worldSize int32) error

func (*PyTorchReplicaSet) CreatePodWithIndex ¶

func (s *PyTorchReplicaSet) CreatePodWithIndex(index int32, worldSize int32) (*v1.Pod, error)

CreatePodWithIndex creates a new pod with the specified index.

func (*PyTorchReplicaSet) CreateServiceWithIndex ¶

func (s *PyTorchReplicaSet) CreateServiceWithIndex(index int32) (*v1.Service, error)

CreateServiceWithIndex creates a new service with the specified index.

func (*PyTorchReplicaSet) Delete ¶

func (s *PyTorchReplicaSet) Delete() error

Delete deletes the replicas.

func (*PyTorchReplicaSet) GetSingleReplicaStatus ¶

func (s *PyTorchReplicaSet) GetSingleReplicaStatus(index int32) torchv1alpha1.ReplicaState

func (*PyTorchReplicaSet) GetStatus ¶

func (s *PyTorchReplicaSet) GetStatus() (torchv1alpha1.PyTorchReplicaStatus, error)

GetStatus returns the status of the replica set.

func (*PyTorchReplicaSet) Labels ¶

func (s *PyTorchReplicaSet) Labels() KubernetesLabels

Labels returns the labels for this replica set.

func (*PyTorchReplicaSet) SyncPods ¶

func (s *PyTorchReplicaSet) SyncPods(worldSize int32) error

SyncPods checks the current pods for this PyTorchReplicaSet and tries to bring them to the desired state.
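The documentation does not describe how the comparison between current and desired pods is made. One plausible shape, given that pods are created per index via CreatePodWithIndex, is to find the replica indices that have no pod yet and create only those. The sketch below illustrates that idea with plain data; `missingIndices` is a hypothetical helper, not part of the package, and the real method may behave differently.

```go
package main

import "fmt"

// missingIndices returns the replica indices in [0, replicas) that have no
// existing pod. existing maps a replica index to true when a pod was found.
// This is an assumed reconciliation step, not the package's actual logic.
func missingIndices(replicas int32, existing map[int32]bool) []int32 {
	var missing []int32
	for i := int32(0); i < replicas; i++ {
		if !existing[i] {
			missing = append(missing, i)
		}
	}
	return missing
}

func main() {
	// Suppose 4 replicas are desired and pods for indices 0 and 2 exist.
	existing := map[int32]bool{0: true, 2: true}
	for _, idx := range missingIndices(4, existing) {
		// In the package, a call such as CreatePodWithIndex(idx, worldSize)
		// would fit here.
		fmt.Printf("would create pod for index %d\n", idx)
	}
}
```

Computing the missing indices separately from creating pods keeps the comparison step easy to test without a cluster.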

func (*PyTorchReplicaSet) SyncServices ¶

func (s *PyTorchReplicaSet) SyncServices() error

SyncServices checks the current services for this PyTorchReplicaSet and tries to bring them to the desired state.

type PyTorchReplicaSetInterface ¶

type PyTorchReplicaSetInterface interface {
	Create() error
	Delete() error
	GetStatus() (torchv1alpha1.PyTorchReplicaStatus, error)
}

PyTorchReplicaSetInterface is an interface for managing a set of replicas.

type TaskSpec ¶

type TaskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

type TrainingJob ¶

type TrainingJob struct {
	KubeCli kubernetes.Interface

	Replicas []*PyTorchReplicaSet
	// contains filtered or unexported fields
}

TODO(jlewi): We should switch to a New pattern and make trainingJob private so we can ensure correctness on creation.

func NewJob ¶

func NewJob(kubeCli kubernetes.Interface, torchJobClient pytorchclient.Interface, recorder record.EventRecorder, job *torchv1alpha1.PyTorchJob, config *torchv1alpha1.ControllerConfig) (*TrainingJob, error)

func (*TrainingJob) ClusterSpec ¶

func (j *TrainingJob) ClusterSpec() ClusterSpec

func (*TrainingJob) Delete ¶

func (j *TrainingJob) Delete()

func (*TrainingJob) GetStatus ¶

func (j *TrainingJob) GetStatus() (torchv1alpha1.State, []*torchv1alpha1.PyTorchReplicaStatus, error)

func (*TrainingJob) Reconcile ¶

func (j *TrainingJob) Reconcile(config *torchv1alpha1.ControllerConfig) error

Reconcile tries to get the job into the desired state.

func (*TrainingJob) SchedulerName ¶

func (j *TrainingJob) SchedulerName() string

func (*TrainingJob) UID ¶

func (j *TrainingJob) UID() types.UID

func (*TrainingJob) Update ¶

func (j *TrainingJob) Update(newJob *torchv1alpha1.PyTorchJob)

Update replaces the PyTorchJob corresponding to TrainingJob with the provided job. This function is used when the Spec/Status of the job is modified outside the controller, for example when the user issues a delete request. Such a request updates the metadata on the object, so the stored job needs to be replaced.
