Documentation
¶
Overview ¶
Package coordinator provides a distributed training coordinator.
Index ¶
- type CheckpointInfo
- type Coordinator
- func (c *Coordinator) Addr() net.Addr
- func (c *Coordinator) EndCheckpoint(_ context.Context, req *pb.EndCheckpointRequest) (*pb.EndCheckpointResponse, error)
- func (c *Coordinator) GracefulStop()
- func (c *Coordinator) Heartbeat(_ context.Context, req *pb.HeartbeatRequest) (*pb.HeartbeatResponse, error)
- func (c *Coordinator) RegisterWorker(_ context.Context, req *pb.RegisterWorkerRequest) (*pb.RegisterWorkerResponse, error)
- func (c *Coordinator) Start(address string) error
- func (c *Coordinator) StartCheckpoint(_ context.Context, req *pb.StartCheckpointRequest) (*pb.StartCheckpointResponse, error)
- func (c *Coordinator) Stop()
- func (c *Coordinator) UnregisterWorker(_ context.Context, req *pb.UnregisterWorkerRequest) (*pb.UnregisterWorkerResponse, error)
- type WorkerInfo
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type CheckpointInfo ¶
type CheckpointInfo struct { ID string Epoch int32 Path string Workers map[string]bool Completed bool }
CheckpointInfo holds information about a checkpoint.
type Coordinator ¶
type Coordinator struct { pb.UnimplementedCoordinatorServer // contains filtered or unexported fields }
Coordinator implements the pb.CoordinatorServer interface. It manages the state of the distributed training cluster.
func NewCoordinator ¶
func NewCoordinator(out io.Writer, timeout time.Duration) *Coordinator
NewCoordinator creates a new Coordinator.
func (*Coordinator) Addr ¶
func (c *Coordinator) Addr() net.Addr
Addr returns the address the coordinator is listening on.
func (*Coordinator) EndCheckpoint ¶
func (c *Coordinator) EndCheckpoint(_ context.Context, req *pb.EndCheckpointRequest) (*pb.EndCheckpointResponse, error)
EndCheckpoint is called by workers to report the completion of a checkpoint.
func (*Coordinator) GracefulStop ¶
func (c *Coordinator) GracefulStop()
GracefulStop gracefully stops the coordinator service.
func (*Coordinator) Heartbeat ¶
func (c *Coordinator) Heartbeat(_ context.Context, req *pb.HeartbeatRequest) (*pb.HeartbeatResponse, error)
Heartbeat is called by workers to signal that they are still alive.
func (*Coordinator) RegisterWorker ¶
func (c *Coordinator) RegisterWorker(_ context.Context, req *pb.RegisterWorkerRequest) (*pb.RegisterWorkerResponse, error)
RegisterWorker registers a new worker with the coordinator.
func (*Coordinator) Start ¶
func (c *Coordinator) Start(address string) error
Start starts the coordinator service on the given address.
func (*Coordinator) StartCheckpoint ¶
func (c *Coordinator) StartCheckpoint(_ context.Context, req *pb.StartCheckpointRequest) (*pb.StartCheckpointResponse, error)
StartCheckpoint initiates a new checkpoint process.
func (*Coordinator) Stop ¶
func (c *Coordinator) Stop()
Stop gracefully stops the coordinator service.
func (*Coordinator) UnregisterWorker ¶
func (c *Coordinator) UnregisterWorker(_ context.Context, req *pb.UnregisterWorkerRequest) (*pb.UnregisterWorkerResponse, error)
UnregisterWorker removes a worker from the coordinator.