config

package
v0.9.0
Warning

This package is not in the latest version of its module.

Published: Dec 18, 2025 | License: Apache-2.0 | Imports: 12 | Imported by: 0

Documentation

Overview

Package config provides the gpud configuration data for the server.

Constants

const (
	DefaultAPIVersion = "v1"
	DefaultGPUdPort   = 15132
	DefaultDataDir    = "/var/lib/gpud"
)

Variables

var (
	DefaultRefreshPeriod = metav1.Duration{Duration: time.Minute}

	// keep the metrics only for the last 3 hours
	DefaultRetentionPeriod = metav1.Duration{Duration: 3 * time.Hour}

	// compact/vacuum is disruptive to existing queries (including reads)
	// but necessary to keep the state database from growing indefinitely
	// TODO: disabled for now, until we have a better way to detect the performance issue
	DefaultCompactPeriod = metav1.Duration{Duration: 0}
)

Functions

func DefaultFifoFile

func DefaultFifoFile() (string, error)

func DefaultStateFile

func DefaultStateFile() (string, error)

func FifoFilePath added in v0.9.0

func FifoFilePath(dataDir string) string

FifoFilePath returns the FIFO pipe path under the dataDir.

func PackagesDir added in v0.9.0

func PackagesDir(dataDir string) string

PackagesDir returns the packages directory under the dataDir.

func ResolveDataDir added in v0.9.0

func ResolveDataDir(dataDir string) (string, error)

ResolveDataDir resolves and validates a data directory path. If dataDir is empty or matches DefaultDataDir, it uses platform-specific logic:

  • For root users (or when /var/lib exists): /var/lib/gpud
  • For non-root users: $HOME/.gpud

For non-empty custom paths, it ensures the directory exists and is writable. The directory is created with 0755 permissions if it doesn't exist.

func StateFilePath added in v0.9.0

func StateFilePath(dataDir string) string

StateFilePath returns the state DB file path under the dataDir.

func VersionFilePath added in v0.9.0

func VersionFilePath(dataDir string) string

VersionFilePath returns the version file path under the dataDir.

Types

type Config

type Config struct {
	APIVersion string `json:"api_version"`

	// Address for the server to listen on.
	Address string `json:"address"`

	// DataDir is the root directory for GPUd state and package artifacts.
	DataDir string `json:"data_dir"`

	// State file that persists the latest status.
	// If empty, the states are not persisted to file.
	State string `json:"state"`

	// Amount of time to retain states/metrics for.
	// Once elapsed, old states/metrics are purged/compacted.
	RetentionPeriod metav1.Duration `json:"retention_period"`

	// Interval at which to compact the state database.
	CompactPeriod metav1.Duration `json:"compact_period"`

	// Set true to enable profiler.
	Pprof bool `json:"pprof"`

	// Set false to disable auto update
	EnableAutoUpdate bool `json:"enable_auto_update"`

	// Exit code to exit with when auto updating.
	// Only valid when the auto update is enabled.
	// Set -1 to disable the auto update by exit code.
	AutoUpdateExitCode int `json:"auto_update_exit_code"`

	// VersionFile is the file that contains the target version.
	// If empty, the version file is not used.
	VersionFile string `json:"version_file"`

	// A list of nvidia tool command paths to overwrite the default paths.
	NvidiaToolOverwrites pkgconfigcommon.ToolOverwrites `json:"nvidia_tool_overwrites"`

	// PluginSpecsFile is the file that contains the plugin specs.
	PluginSpecsFile string `json:"plugin_specs_file"`

	// Components specifies the components to enable.
	// Leave empty, "*", or "all" to enable all components.
	// Or prefix component names with "-" to disable them.
	Components []string `json:"components"`

	// FailureInjector is the failure injector.
	FailureInjector *components.FailureInjector `json:"failure_injector,omitempty"`

	// SkipSessionUpdateConfig skips processing of updateConfig session commands. Intended for testing.
	SkipSessionUpdateConfig bool `json:"skip_session_update_config"`

	// DBInMemory enables in-memory SQLite database mode.
	// When true, the database is opened as a shared in-memory database (file::memory:?cache=shared)
	// instead of using the State file path. Data will not persist across restarts.
	// ref. https://github.com/mattn/go-sqlite3?tab=readme-ov-file#faq
	DBInMemory bool `json:"db_in_memory"`

	// SessionToken is the session token for control plane authentication.
	// Used when DBInMemory is true and session credentials are passed via CLI flags.
	// This allows gpud up to pass the session token from login to gpud run.
	SessionToken string `json:"-"`

	// SessionMachineID is the machine ID assigned by the control plane.
	// Used when DBInMemory is true and session credentials are passed via CLI flags.
	// This allows gpud up to pass the assigned machine ID from login to gpud run.
	SessionMachineID string `json:"-"`

	// SessionEndpoint is the control plane endpoint.
	// Used when DBInMemory is true and session credentials are passed via CLI flags.
	// This allows gpud up to pass the endpoint from login to gpud run.
	// The server reads the endpoint from metadata DB, so it must be seeded for in-memory mode.
	SessionEndpoint string `json:"-"`
	// contains filtered or unexported fields
}

Config provides gpud configuration data for the server.

func DefaultConfig

func DefaultConfig(ctx context.Context, opts ...OpOption) (*Config, error)

func (*Config) ShouldDisable added in v0.5.0

func (config *Config) ShouldDisable(componentName string) bool

ShouldDisable reports whether the component should be disabled. If no disable set is specified, it returns false: the component is not disabled and remains enabled by default.

func (*Config) ShouldEnable added in v0.5.0

func (config *Config) ShouldEnable(componentName string) bool

ShouldEnable reports whether the component should be enabled. If no enable set is specified, it returns true: every component is enabled by default.
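The interplay between the Components field and these two methods can be illustrated with a small, self-contained sketch. It is an assumption about how the list might be split into enable and disable sets, not the package's code: `selection`, `parseComponents`, and the component names "cpu" and "disk" are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// selection is a hypothetical holder for the parsed Components list.
type selection struct {
	enabled  map[string]struct{} // empty: no explicit enable set
	disabled map[string]struct{} // empty: no explicit disable set
}

// parseComponents splits entries into enable and disable sets.
// "", "*" and "all" act as wildcards; a "-" prefix marks a disabled component.
func parseComponents(components []string) selection {
	s := selection{enabled: map[string]struct{}{}, disabled: map[string]struct{}{}}
	wildcard := false
	for _, c := range components {
		switch {
		case c == "" || c == "*" || c == "all":
			wildcard = true
		case strings.HasPrefix(c, "-"):
			s.disabled[strings.TrimPrefix(c, "-")] = struct{}{}
		default:
			s.enabled[c] = struct{}{}
		}
	}
	if wildcard {
		// Assumption: a wildcard overrides any explicit enable entries.
		s.enabled = map[string]struct{}{}
	}
	return s
}

// shouldEnable returns true when no enable set is given, as documented.
func (s selection) shouldEnable(name string) bool {
	if len(s.enabled) == 0 {
		return true
	}
	_, ok := s.enabled[name]
	return ok
}

// shouldDisable returns false when no disable set is given, as documented.
func (s selection) shouldDisable(name string) bool {
	if len(s.disabled) == 0 {
		return false
	}
	_, ok := s.disabled[name]
	return ok
}

func main() {
	s := parseComponents([]string{"-disk"})
	fmt.Println(s.shouldEnable("cpu"), s.shouldDisable("cpu"), s.shouldDisable("disk"))
}
```

In this sketch a caller has to consult both methods, because a component that is listed with a "-" prefix still passes shouldEnable when no enable set is given.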

func (*Config) Validate

func (config *Config) Validate() error

type Op

type Op struct {
	pkgconfigcommon.ToolOverwrites

	FailureInjector *components.FailureInjector

	DataDir    string
	DBInMemory bool

	// SessionToken is the session token for db-in-memory mode.
	// When DBInMemory is true and this is set, the server will seed
	// this token into the in-memory database.
	SessionToken string

	// SessionMachineID is the machine ID for db-in-memory mode.
	// When DBInMemory is true and this is set, the server will seed
	// this machine ID into the in-memory database.
	SessionMachineID string

	// SessionEndpoint is the control plane endpoint for db-in-memory mode.
	// When DBInMemory is true and this is set, the server will seed
	// this endpoint into the in-memory database.
	// The server reads the endpoint from metadata DB, so it must be seeded for in-memory mode.
	SessionEndpoint string
}

func (*Op) ApplyOpts

func (op *Op) ApplyOpts(opts []OpOption) error

type OpOption

type OpOption func(*Op)
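Op, OpOption, and ApplyOpts follow the functional options pattern: each option is a function that mutates an Op, and ApplyOpts runs the options in order. The sketch below reproduces the pattern with lowercase, hypothetical stand-ins (`op`, `opOption`, `applyOpts`, `withDataDir`, `withDBInMemory`) so that it compiles without the package. What the real ApplyOpts checks before returning an error is not documented, so this version always returns nil.

```go
package main

import "fmt"

// op is a hypothetical stand-in for Op with two of its fields.
type op struct {
	dataDir    string
	dbInMemory bool
}

// opOption mirrors the shape of OpOption: a function that mutates an op.
type opOption func(*op)

// applyOpts runs each option in order against the receiver.
// Assumption: no validation is performed in this sketch.
func (o *op) applyOpts(opts []opOption) error {
	for _, opt := range opts {
		opt(o)
	}
	return nil
}

// withDataDir mirrors the shape of WithDataDir.
func withDataDir(dataDir string) opOption {
	return func(o *op) { o.dataDir = dataDir }
}

// withDBInMemory mirrors the shape of WithDBInMemory.
func withDBInMemory(b bool) opOption {
	return func(o *op) { o.dbInMemory = b }
}

func main() {
	o := &op{}
	if err := o.applyOpts([]opOption{withDataDir("/tmp/gpud"), withDBInMemory(true)}); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *o)
}
```

With the real package the same options are passed variadically, as in the DefaultConfig(ctx, opts...) signature.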

func WithDBInMemory added in v0.9.0

func WithDBInMemory(b bool) OpOption

WithDBInMemory enables in-memory SQLite database mode. When b is true, the database is opened as file::memory:?cache=shared instead of using file-based storage.
ref. https://github.com/mattn/go-sqlite3?tab=readme-ov-file#faq
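The choice between the two storage modes comes down to which data source name is handed to SQLite. The following sketch shows that decision in isolation. `stateDSN` is a hypothetical helper and the file path is a placeholder; only the in-memory DSN string is taken from the documentation.

```go
package main

import "fmt"

// inMemoryDSN is the shared in-memory data source name from the documentation.
// cache=shared lets multiple connections in one process see the same database.
const inMemoryDSN = "file::memory:?cache=shared"

// stateDSN is a hypothetical helper that picks the data source name.
func stateDSN(dbInMemory bool, stateFile string) string {
	if dbInMemory {
		// Data will not persist across restarts in this mode.
		return inMemoryDSN
	}
	return stateFile
}

func main() {
	fmt.Println(stateDSN(true, "/path/to/state.db"))  // placeholder path, ignored here
	fmt.Println(stateDSN(false, "/path/to/state.db")) // placeholder path, used as-is
}
```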

func WithDataDir added in v0.9.0

func WithDataDir(dataDir string) OpOption

WithDataDir overrides the default data directory for GPUd artifacts.

func WithExcludedInfinibandDevices added in v0.9.0

func WithExcludedInfinibandDevices(devices []string) OpOption

WithExcludedInfinibandDevices sets the list of InfiniBand device names to exclude from monitoring. Device names should be like "mlx5_0", "mlx5_1", etc. (not full paths).

This is useful for excluding devices that have restricted Physical Functions (PFs) and cause kernel errors (mlx5_cmd_out_err ACCESS_REG) when queried. This is common on NVIDIA DGX, Umbriel, and GB200 systems with ConnectX-7 adapters.

ref. https://github.com/prometheus/node_exporter/issues/3434
ref. https://github.com/leptonai/gpud/issues/1164
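Excluding devices by name amounts to a set-difference over the discovered device list. The sketch below shows one way such a filter could look. `filterDevices` is a hypothetical helper, not part of the package, and it matches on bare device names (for example "mlx5_0"), not on full paths.

```go
package main

import "fmt"

// filterDevices is a hypothetical helper that drops excluded device names.
// Names are compared exactly, so entries must be bare names, not paths.
func filterDevices(all, excluded []string) []string {
	skip := make(map[string]struct{}, len(excluded))
	for _, name := range excluded {
		skip[name] = struct{}{}
	}
	kept := make([]string, 0, len(all))
	for _, name := range all {
		if _, ok := skip[name]; ok {
			continue // excluded: never queried, so no ACCESS_REG errors
		}
		kept = append(kept, name)
	}
	return kept
}

func main() {
	devices := []string{"mlx5_0", "mlx5_1", "mlx5_2"}
	fmt.Println(filterDevices(devices, []string{"mlx5_1"}))
}
```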

func WithFailureInjector added in v0.6.0

func WithFailureInjector(injector *components.FailureInjector) OpOption

func WithInfinibandClassRootDir added in v0.5.1

func WithInfinibandClassRootDir(p string) OpOption

WithInfinibandClassRootDir specifies the root directory of the InfiniBand class.

func WithSessionEndpoint added in v0.9.0

func WithSessionEndpoint(endpoint string) OpOption

WithSessionEndpoint sets the control plane endpoint for db-in-memory mode. When DBInMemory is true and this is set, the server will seed this endpoint into the in-memory database. The server reads the endpoint from metadata DB, so it must be seeded for in-memory mode.

func WithSessionMachineID added in v0.9.0

func WithSessionMachineID(machineID string) OpOption

WithSessionMachineID sets the machine ID for db-in-memory mode. When DBInMemory is true and this is set, the server will seed this machine ID into the in-memory database.

func WithSessionToken added in v0.9.0

func WithSessionToken(token string) OpOption

WithSessionToken sets the session token for db-in-memory mode. When DBInMemory is true and this is set, the server will seed this token into the in-memory database for session authentication.
