package coordinator

Version: v0.3.0 (Latest)
Warning: This package is not in the latest version of its module.

Published: Aug 25, 2025 | License: Apache-2.0 | Imports: 10 | Imported by: 0

Documentation

Overview

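Package coordinator provides a coordinator for distributed training. The coordinator tracks worker registration, worker heartbeats, and checkpoint progress across the cluster.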

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CheckpointInfo

type CheckpointInfo struct {
	ID        string
	Epoch     int32
	Path      string
	Workers   map[string]bool
	Completed bool
}

CheckpointInfo holds information about a checkpoint.
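
The package does not document how the Workers map is interpreted. The sketch below assumes that it maps a worker ID to whether that worker has reported completion, and that Completed is set once every tracked worker has reported. The struct is a local copy of the exported fields above, and markWorkerDone is a hypothetical helper, not part of the package API.

```go
package main

import "fmt"

// CheckpointInfo is a local copy of the exported fields documented above.
type CheckpointInfo struct {
	ID        string
	Epoch     int32
	Path      string
	Workers   map[string]bool // assumption: worker ID -> has reported completion
	Completed bool
}

// markWorkerDone is a hypothetical helper (not part of the package).
// It records that workerID finished the checkpoint and sets Completed
// once every tracked worker has reported. It returns false if the
// worker is not tracked by this checkpoint.
func markWorkerDone(cp *CheckpointInfo, workerID string) bool {
	if _, ok := cp.Workers[workerID]; !ok {
		return false
	}
	cp.Workers[workerID] = true
	for _, done := range cp.Workers {
		if !done {
			return true // at least one worker is still outstanding
		}
	}
	cp.Completed = true
	return true
}

func main() {
	cp := &CheckpointInfo{
		ID:      "ckpt-1",
		Epoch:   3,
		Path:    "/tmp/ckpt-1",
		Workers: map[string]bool{"w0": false, "w1": false},
	}
	markWorkerDone(cp, "w0")
	fmt.Println(cp.Completed) // false
	markWorkerDone(cp, "w1")
	fmt.Println(cp.Completed) // true
}
```

The actual Coordinator keeps its state in unexported fields, so its real bookkeeping may differ from this sketch.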

type Coordinator

type Coordinator struct {
	pb.UnimplementedCoordinatorServer
	// contains filtered or unexported fields
}

Coordinator implements the pb.CoordinatorServer interface. It manages the state of the distributed training cluster.

func NewCoordinator

func NewCoordinator(out io.Writer, timeout time.Duration) *Coordinator

NewCoordinator creates a new Coordinator.

func (*Coordinator) Addr

func (c *Coordinator) Addr() net.Addr

Addr returns the address the coordinator is listening on.

func (*Coordinator) EndCheckpoint

EndCheckpoint is called by workers to report the completion of a checkpoint.

func (*Coordinator) GracefulStop

func (c *Coordinator) GracefulStop()

GracefulStop gracefully stops the coordinator service.

func (*Coordinator) Heartbeat

Heartbeat is called by workers to signal that they are still alive.

func (*Coordinator) RegisterWorker

RegisterWorker registers a new worker with the coordinator.

func (*Coordinator) Start

func (c *Coordinator) Start(address string) error

Start starts the coordinator service on the given address.

func (*Coordinator) StartCheckpoint

StartCheckpoint initiates a new checkpoint process.

func (*Coordinator) Stop

func (c *Coordinator) Stop()

Stop gracefully stops the coordinator service.

func (*Coordinator) UnregisterWorker

UnregisterWorker removes a worker from the coordinator.

type WorkerInfo

type WorkerInfo struct {
	ID            string
	Address       string
	Rank          int
	LastHeartbeat time.Time
}

WorkerInfo holds information about a worker in the cluster.
