training-operator

module
v1.8.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 10, 2024 License: Apache-2.0

README

Kubeflow Training Operator

Build Status Coverage Status Go Report Card

Overview

Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, Tensorflow, XGBoost, MPI, Paddle and others.

Training Operator allows you to use Kubernetes workloads to effectively train your large models via Kubernetes Custom Resources APIs or using Training Operator Python SDK.

Note: Before v1.2 release, Kubeflow Training Operator only supports TFJob on Kubernetes.

Prerequisites

  • Version >= 1.25 of Kubernetes cluster and kubectl

Installation

Master Branch
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
Stable Release
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
TensorFlow Release Only

For users who prefer to use original TensorFlow controllers, please checkout v1.2-branch, patches for bug fixes will still be accepted to this branch.

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.2.0"
Python SDK for Kubeflow Training Operator

Training Operator provides Python SDK for the custom resources. To learn more about available SDK APIs check the TrainingClient.

Use pip install command to install the latest release of the SDK:

pip install kubeflow-training

Training Operator controller and Python SDK have the same release versions.

Quickstart

Please refer to the getting started guide to quickly create your first Training Operator Job using Python SDK.

If you want to work directly with Kubernetes Custom Resources provided by Training Operator, follow the PyTorchJob MNIST guide.

API Documentation

Please refer to following API Documentation:

Community

The following links provide information about getting involved in the community:

This is a part of Kubeflow, so please see readme in kubeflow/kubeflow to get in touch with the community.

Contributing

Please refer to the DEVELOPMENT

Change Log

Please refer to CHANGELOG

Version Matrix

The following table lists the most recent few versions of the operator.

Operator Version API Version Kubernetes Version
v1.0.x v1 1.16+
v1.1.x v1 1.16+
v1.2.x v1 1.16+
v1.3.x v1 1.18+
v1.4.x v1 1.23+
v1.5.x v1 1.23+
v1.6.x v1 1.23+
v1.7.x v1 1.25+
latest (master HEAD) v1 1.25+

Acknowledgement

This project was originally started as a distributed training operator for TensorFlow and later we merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.

Directories

Path Synopsis
cmd
hack
swagger module
pkg
apis/kubeflow.org/v1
Package v1 is the v1 version of the API.
Package v1 is the v1 version of the API.
client/clientset/versioned/fake
This package has the automatically generated fake clientset.
This package has the automatically generated fake clientset.
client/clientset/versioned/scheme
This package contains the scheme of the automatically generated clientset.
This package contains the scheme of the automatically generated clientset.
client/clientset/versioned/typed/kubeflow.org/v1
This package has the automatically generated typed clients.
This package has the automatically generated typed clients.
client/clientset/versioned/typed/kubeflow.org/v1/fake
Package fake has the automatically generated clients.
Package fake has the automatically generated clients.
controller.v1/tensorflow
Package controller provides a Kubernetes controller for a TFJob resource.
Package controller provides a Kubernetes controller for a TFJob resource.
util/train
Package that various helper routines for training.
Package that various helper routines for training.
test_job
apis/test_job/v1
Package v1 is the v1 version of the API.
Package v1 is the v1 version of the API.
client/clientset/versioned
This package has the automatically generated clientset.
This package has the automatically generated clientset.
client/clientset/versioned/fake
This package has the automatically generated fake clientset.
This package has the automatically generated fake clientset.
client/clientset/versioned/scheme
This package contains the scheme of the automatically generated clientset.
This package contains the scheme of the automatically generated clientset.
client/clientset/versioned/typed/test_job/v1
This package has the automatically generated typed clients.
This package has the automatically generated typed clients.
client/clientset/versioned/typed/test_job/v1/fake
Package fake has the automatically generated clients.
Package fake has the automatically generated clients.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL