kubedl

command module
v0.1.0
Published: Oct 20, 2020 License: Apache-2.0 Imports: 14 Imported by: 0

README

KubeDL

KubeDL is short for Kubernetes-Deep-Learning. It is a unified operator that supports running multiple types of distributed deep learning/machine learning workloads on Kubernetes.


Currently, KubeDL supports the following types of ML/DL jobs: TensorFlow, PyTorch, XGBoost, and XDL.

KubeDL maintains API compatibility with certain Kubeflow job operators and provides the following additional features:

  • Support running prevalent ML/DL workloads in a single operator.
  • Support submitting a job with artifacts synced from a remote source such as GitHub, without rebuilding the image.
  • Support advanced scheduling features such as gang scheduling with pluggable backend schedulers.
  • Instrumented with unified Prometheus metrics for different types of DL jobs, such as job launch delay and the current number of pending/running jobs.
  • Support job metadata persistence with a pluggable storage backend such as MySQL.
  • Enable specific workload types automatically according to the installed CRDs, or explicitly through startup flags.
  • A modular architecture that can be easily extended to more types of DL/ML workloads with shared libraries; see how to add a custom job workload.
  • [Work-in-progress] Provide a dashboard for monitoring the jobs' lifecycle and stats.

Getting started

You can deploy KubeDL using a single Helm command or just YAML files.

Deploy KubeDL using Helm

KubeDL can be deployed with a single command using the Helm chart:

helm install kubedl ./helm/kubedl 

You can override the default values defined in ./helm/kubedl/values.yaml with the --set flag, for example:

helm install kubedl ./helm/kubedl --set kubedlSysNamespace=kube-system --set resources.requests.cpu=1024m --set resources.requests.memory=2Gi

Helm renders the templates and applies them to the cluster. Run the command above from the repository root directory and you are ready to go.

Alternatively, deploy KubeDL using YAML files

Install CRDs

kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/kubeflow.org_pytorchjobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/kubeflow.org_tfjobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/xdl.kubedl.io_xdljobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/xgboostjob.kubeflow.org_xgboostjobs.yaml

Install KubeDL operator

kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/manager/all_in_one.yaml

The official KubeDL operator image is hosted on Docker Hub.

Optional: Enable workload kind selectively

If you only need some of the workload types and want to disable the others, you can use any one of the following three options, or combine them:

  • Set the env variable WORKLOADS_ENABLE in the KubeDL container when deploying. The value is a list of the workload types that you want to enable. For example, WORKLOADS_ENABLE=TFJob,PytorchJob means that only the TFJob and PytorchJob workloads are enabled; the others are disabled.

  • Set the startup argument --workloads in the KubeDL container args when deploying. The value format is the same as for the WORKLOADS_ENABLE env variable.

  • [DEFAULT] Install only the CRDs you need, and KubeDL will automatically enable the corresponding workload controllers. You can also set --workloads auto or WORKLOADS_ENABLE=auto explicitly. This is the default approach.
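
To illustrate how a value such as WORKLOADS_ENABLE=TFJob,PytorchJob or WORKLOADS_ENABLE=auto could be interpreted, here is a minimal Go sketch. It is not KubeDL's actual implementation: the function name parseWorkloads is hypothetical, and treating an empty value the same as auto is an assumption based on auto being described as the default approach.

```go
package main

import (
	"fmt"
	"strings"
)

// parseWorkloads is a hypothetical helper (not part of KubeDL) that interprets
// a WORKLOADS_ENABLE / --workloads style value.
// It reports auto=true for "auto" or an empty value (assumption: unset means
// the default, auto-detection from installed CRDs). Otherwise it returns the
// comma-separated workload kinds with surrounding whitespace removed.
func parseWorkloads(value string) (auto bool, kinds []string) {
	value = strings.TrimSpace(value)
	if value == "" || value == "auto" {
		return true, nil
	}
	for _, k := range strings.Split(value, ",") {
		// Skip empty entries such as the one produced by a trailing comma.
		if k = strings.TrimSpace(k); k != "" {
			kinds = append(kinds, k)
		}
	}
	return false, kinds
}

func main() {
	for _, v := range []string{"auto", "TFJob,PytorchJob", ""} {
		auto, kinds := parseWorkloads(v)
		fmt.Printf("%q -> auto=%v kinds=%v\n", v, auto, kinds)
	}
}
```

In this sketch an explicit list always wins over auto-detection, which matches the description above that listing TFJob and PytorchJob disables every other workload type.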

Check the documents for a full list of operator startup flags.

Run an Example Job

This example demonstrates how to run a simple MNIST TensorFlow job with KubeDL.

Submit the TFJob

kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/example/tf/tf_job_mnist.yaml

Monitor the status of the TensorFlow job

kubectl get tfjobs -n kubedl
kubectl describe tfjob mnist -n kubedl

Delete the job

kubectl delete tfjob mnist -n kubedl

Workload types

The supported workload types are tfjob, xgboostjob, pytorchjob, and xdljob. For example:

kubectl get xgboostjob 

Tutorial

How to run an XDL job

KubeDL Metrics

Check the documents for the Prometheus metrics exposed by the KubeDL operator.

Sync Artifacts from Remote Repository

KubeDL supports submitting jobs with artifacts synced dynamically from a remote source, without rebuilding the image. Currently GitHub is supported. A pluggable interface allows other sources, such as HDFS, to be added. Check the documents for details.

Job Dashboard

A dashboard for monitoring the jobs' lifecycle and stats is currently in progress. The dashboard also provides convenient job operations, including job creation, termination, and deletion.

Developer Guide

Build the controller manager binary

make manager

Run the tests

make test

Generate manifests (e.g. CRD and RBAC YAML files)

make manifests

Build the docker image

export IMG=<your_image_name> && make docker-build

Push the image

docker push <your_image_name>

To develop or debug the KubeDL controller manager locally, please check the debug guide.

Community

If you have any questions or want to contribute, GitHub issues and pull requests are warmly welcome. You can also contact us via the following channels:

  • DingTalk discussion group

Certain implementations rely on existing code from the Kubeflow community, and credit goes to the original Kubeflow authors.

Documentation

There is no documentation for this package.

Directories

Path Synopsis
api
elasticdljob/v1alpha1
  +kubebuilder:object:generate=true +groupName=elasticdl.org
marsjob/v1alpha1
  Package v1beta1 contains API Schema definitions for the kubedl.io v1alpha1 API group +kubebuilder:object:generate=true +groupName=kubedl.io
pytorch/v1
  Package v1 is the v1 version of the API.
tensorflow/v1
  Package v1 is the v1 version of the API.
xdl/v1alpha1
  Package v1alpha1 contains API Schema definitions for the xdl v1alpha1 API group +k8s:openapi-gen=true +k8s:deepcopy-gen=package,register +k8s:conversion-gen=github.com/alibaba/kubedl/api/xdl +k8s:defaulter-gen=TypeMeta +groupName=xdl.kubedl.io
xgboost/v1alpha1
  Package v1alpha1 contains API Schema definitions for the xgboostjob v1alpha1 API group +k8s:openapi-gen=true +k8s:deepcopy-gen=package,register +k8s:conversion-gen=github.com/kubeflow/xgboost-operator/pkg/apis/xgboostjob +k8s:defaulter-gen=TypeMeta +groupName=xgboostjob.kubeflow.org
cmd
tensorflow
  Package tensorflow provides a Kubernetes controller for a TFJob resource.
xdl
pkg
job_controller/api/v1
  Package v1 is the v1 version of the API.
test_job/v1
  Package v1 is the v1 version of the API.
util
  Package util provides various helper routines.
util/train
  Package train provides various helper routines for training.
