gcpbatchtracker

package module · v0.1.0 · Published: Mar 29, 2023 · License: BSD-2-Clause · Imports: 18 · Imported by: 6

README
gcpbatchtracker

DRMAA2 JobTracker implementation for Google Batch

Experimental Google Batch support for DRMAA2os.

How gcpbatchtracker Works

The project is designed to be embedded as a backend in https://github.com/dgruber/drmaa2os

What gcpbatchtracker is

gcpbatchtracker is a basic DRMAA2 implementation for Google Batch, written in Go. The DRMAA2 JobTemplate is used for submitting Google Batch jobs, and the DRMAA2 JobInfo struct for querying job status. The Google Batch job state model is mapped to the DRMAA2 specification.

How to use it

See the examples directory, which uses the interface directly.

Converting a DRMAA2 Job Template to a Google Batch Job

DRMAA2 JobTemplate     Google Batch Job
RemoteCommand          Command to execute in the container, or the script or script path
Args                   For containers, the arguments of the command (if RemoteCommand is empty, the arguments of the entrypoint)
CandidateMachines[0]   Machine type; when prefixed with "template:", an instance template with that name is used
JobCategory            Container image, or $script$ / $scriptpath$ for other runnables, which interprets RemoteCommand as a script or a script path
JobName                JobID
AccountingID           Sets a tag "accounting"
MinSlots               Specifies the parallelism (how many tasks to run in parallel)
MaxSlots               Specifies the total number of tasks to run. For MPI set MinSlots = MaxSlots.

For StageInFiles and StageOutFiles see below.

When a container is used, the following paths are always mounted from the host:

    "/etc/cloudbatch-taskgroup-hosts:/etc/cloudbatch-taskgroup-hosts",
    "/etc/ssh:/etc/ssh",
    "/root/.ssh:/root/.ssh",

For a container the following runtime options are set:

  • "--network=host"

The default output destination is Cloud Logging. If "OutputPath" is set, the logs policy is changed to LogsPolicy_PATH with OutputPath as the destination.

JobTemplate Extensions
DRMAA2 JobTemplate Extension Key            DRMAA2 JobTemplate Extension Value
ExtensionProlog / "prolog"                  String containing a prolog script executed at machine level before the job starts
ExtensionEpilog / "epilog"                  String containing an epilog script executed at machine level after the job ends successfully
ExtensionSpot / "spot"                      "true"/"t"/... if the machine should be a spot instance
ExtensionAccelerators / "accelerators"      Accelerator name for the machine
ExtensionTasksPerNode / "tasks_per_node"    Number of tasks per node
ExtensionDockerOptions / "docker_options"   Overrides the docker run options when a container image is used

JobInfo Fields

DRMAA2 JobInfo   Batch Job
Slots            Parallelism

Job Control Mapping

There is currently no known way to hold, suspend, resume, or release a job in Google Batch. Terminating a job deletes it.

Job State Mapping

DRMAA2 State    Batch Job State
Done            JobStatus_SUCCEEDED
Failed          JobStatus_FAILED
Suspended       -
Running         JobStatus_RUNNING, JobStatus_DELETION_IN_PROGRESS
Queued          JobStatus_QUEUED, JobStatus_SCHEDULED
Undetermined    JobStatus_STATE_UNSPECIFIED

File staging using the Job Template

NFS (Google Filestore) and GCS are supported.

For NFS in containers, files can be specified in addition to directories. For a file, its directory is mounted on the host, and from there the file is mounted into the container at the path given by the key. For a directory, a leading "/" is required.

    StageInFiles: map[string]string{
        "/etc/script.sh": "nfs:10.20.30.40:/filestore/user/dir/script.sh",
        "/mnt/dir":       "nfs:10.20.30.40:/filestore/user/dir/",
        "/somedir":       "gs://benchmarkfiles", // mount a bucket into container or host
    },

StageOutFiles creates a bucket before the job is submitted if it does not exist. If that fails, the job submission call fails. Currently only gs:// is evaluated in the StageOutFiles map.

Examples

See examples directory.

Documentation

Index

Constants

const (

	// job categories (otherwise it is a container image)
	JobCategoryScriptPath = "$scriptpath$" // treats RemoteCommand as path to script and ignores args
	JobCategoryScript     = "$script$"     // treats RemoteCommand as script and ignores args
)
const (
	ExtensionProlog        = "prolog"
	ExtensionEpilog        = "epilog"
	ExtensionSpot          = "spot"
	ExtensionAccelerators  = "accelerators"
	ExtensionTasksPerNode  = "tasks_per_node"
	ExtensionDockerOptions = "docker_options"
)

Variables

This section is empty.

Functions

func BatchJobToJobInfo

func BatchJobToJobInfo(job *batchpb.Job) (drmaa2interface.JobInfo, error)

func ConvertJobState

func ConvertJobState(job *batchpb.Job) (drmaa2interface.JobState, string, error)

func ConvertJobTemplateToJobRequest

func ConvertJobTemplateToJobRequest(session, project, location string, jt drmaa2interface.JobTemplate) (batchpb.CreateJobRequest, error)

func CreateMissingStageOutBuckets

func CreateMissingStageOutBuckets(project string, stageOutFiles map[string]string) error

func CreateRunnables

func CreateRunnables(barriers bool, prolog string) []*batchpb.Runnable

func GetAcceleratorsExtension

func GetAcceleratorsExtension(jt drmaa2interface.JobTemplate) (string, int64, bool)

func GetDockerOptionsExtension

func GetDockerOptionsExtension(jt drmaa2interface.JobTemplate) (string, bool)

func GetMachineEpilogExtension

func GetMachineEpilogExtension(jt drmaa2interface.JobTemplate) (string, bool)

func GetMachinePrologExtension

func GetMachinePrologExtension(jt drmaa2interface.JobTemplate) (string, bool)

func GetSpotExtension

func GetSpotExtension(jt drmaa2interface.JobTemplate) (bool, bool)

func GetStorageClient

func GetStorageClient() (*storage.Client, error)

func GetTasksPerNodeExtension

func GetTasksPerNodeExtension(jt drmaa2interface.JobTemplate) (int64, bool)

func IsInDRMAA2Session

func IsInDRMAA2Session(client *batch.Client, session string, jobID string) bool

func IsInJobSession

func IsInJobSession(session string, job *batchpb.Job) bool

func MountBucket

func MountBucket(jobRequest *batchpb.CreateJobRequest, execPosition int, destination, source string) *batchpb.CreateJobRequest

func NewAllocator

func NewAllocator() *allocator

func SetAcceleratorsExtension

func SetAcceleratorsExtension(jt drmaa2interface.JobTemplate, count int64, accelerator string) drmaa2interface.JobTemplate

func SetDockerOptionsExtension

func SetDockerOptionsExtension(jt drmaa2interface.JobTemplate, dockerOptions string) drmaa2interface.JobTemplate

Types

type GCPBatchTracker

type GCPBatchTracker struct {
	// contains filtered or unexported fields
}

GCPBatchTracker implements the JobTracker interface so that it can be used as a backend in the drmaa2os project.

func NewGCPBatchTracker

func NewGCPBatchTracker(drmaa2session string, project, location string) (*GCPBatchTracker, error)

func (*GCPBatchTracker) AddArrayJob

func (t *GCPBatchTracker) AddArrayJob(jt drmaa2interface.JobTemplate, begin int, end int, step int, maxParallel int) (string, error)

AddArrayJob makes a mass submission of jobs defined by the same job template. Many HPC workload managers support job arrays for submitting tens of thousands of similar jobs with one call. The additional parameters define how many jobs are submitted by specifying a TASK_ID range. Begin is the first task ID (like 1), end is the last task ID (like 10), and step is a positive integer which defines the increment from one task ID to the next (like 1). maxParallel is an argument for an optional feature which instructs the backend to limit this job array to at most maxParallel tasks running in parallel. Note that jobs use the TASK_ID environment variable to identify which task they are and thereby determine what to do (like which data set is accessed).

func (*GCPBatchTracker) AddJob

AddJob typically submits or starts a new job at the backend. The function returns the unique job ID or an error if job submission (or starting of the job in case there is no queueing system) has failed.

func (*GCPBatchTracker) CloseMonitoringSession

func (t *GCPBatchTracker) CloseMonitoringSession(name string) error

func (*GCPBatchTracker) DeleteJob

func (t *GCPBatchTracker) DeleteJob(jobID string) error

DeleteJob removes a job from a potential internal database. It does not stop a job. A job must be in an end state (terminated, failed) in order to call DeleteJob. In case of an error, or if the job is not in an end state, an error must be returned. If the backend does not support cleaning up resources for a finished job, nil should be returned.

func (*GCPBatchTracker) GetAllJobIDs

func (t *GCPBatchTracker) GetAllJobIDs(filter *drmaa2interface.JobInfo) ([]string, error)

func (*GCPBatchTracker) GetAllMachines

func (t *GCPBatchTracker) GetAllMachines(filter []string) ([]drmaa2interface.Machine, error)

func (*GCPBatchTracker) GetAllQueueNames

func (t *GCPBatchTracker) GetAllQueueNames(filter []string) ([]string, error)

func (*GCPBatchTracker) JobControl

func (t *GCPBatchTracker) JobControl(jobID string, action string) error

JobControl sends a request to the backend to either "terminate", "suspend", "resume", "hold", or "release" a job. The strings are fixed and are defined by the JobControl constants. This could change in the future to be limited only to constants representing the actions. When the request is not accepted by the system the function must return an error.

func (*GCPBatchTracker) JobInfo

func (t *GCPBatchTracker) JobInfo(jobID string) (drmaa2interface.JobInfo, error)

JobInfo returns the job status of a job in form of a JobInfo struct or an error.

func (*GCPBatchTracker) JobInfoFromMonitor

func (t *GCPBatchTracker) JobInfoFromMonitor(jobID string) (drmaa2interface.JobInfo, error)

JobInfoFromMonitor might collect job state and job info in a different way than a JobSession with persistent storage does.

func (*GCPBatchTracker) JobState

func (t *GCPBatchTracker) JobState(jobID string) (drmaa2interface.JobState, string, error)

JobState returns the DRMAA2 state and substate (free form string) of the job.

func (*GCPBatchTracker) ListArrayJobs

func (t *GCPBatchTracker) ListArrayJobs(arrayjobID string) ([]string, error)

ListArrayJobs returns all job IDs a job array ID (array job ID) represents, or an error.

func (*GCPBatchTracker) ListJobCategories

func (t *GCPBatchTracker) ListJobCategories() ([]string, error)

ListJobCategories returns a list of job categories which can be used in the JobCategory field of the job template. The list is informational. An example is returning a list of supported container images. AddJob() and AddArrayJob() process a JobTemplate and hence also the JobCategory field.

func (*GCPBatchTracker) ListJobs

func (t *GCPBatchTracker) ListJobs() ([]string, error)

ListJobs returns all visible job IDs or an error.

func (*GCPBatchTracker) OpenMonitoringSession

func (t *GCPBatchTracker) OpenMonitoringSession(name string) error

func (*GCPBatchTracker) Wait

func (t *GCPBatchTracker) Wait(jobID string, timeout time.Duration, state ...drmaa2interface.JobState) error

Wait blocks until the job is in one of the given states, the maximum waiting time (specified by timeout) is reached, or another internal error occurred (like the job was not found). In case of a timeout an error must be returned as well.

type GoogleBatchTrackerParams

type GoogleBatchTrackerParams struct {
	GoogleProjectID string
	Region          string
}

GoogleBatchTrackerParams provides parameters which can be passed to the SessionManager in order to pass things like the Google project or region into the job tracker. This indirection is needed so that the tracker can be used by, but is not tightly integrated with, the SessionManager of the DRMAA2OS project, so that not all dependencies have to be compiled in.

Directories

Path Synopsis
examples
drmaa2job module
ssh module
