Documentation ¶
Overview ¶
Package config provides the gpud configuration data for the server.
Index ¶
- Constants
- Variables
- func DefaultFifoFile() (string, error)
- func DefaultStateFile() (string, error)
- func FifoFilePath(dataDir string) string
- func PackagesDir(dataDir string) string
- func ResolveDataDir(dataDir string) (string, error)
- func StateFilePath(dataDir string) string
- func VersionFilePath(dataDir string) string
- type Config
- type Op
- type OpOption
- func WithDBInMemory(b bool) OpOption
- func WithDataDir(dataDir string) OpOption
- func WithExcludedInfinibandDevices(devices []string) OpOption
- func WithFailureInjector(injector *components.FailureInjector) OpOption
- func WithInfinibandClassRootDir(p string) OpOption
- func WithSessionEndpoint(endpoint string) OpOption
- func WithSessionMachineID(machineID string) OpOption
- func WithSessionToken(token string) OpOption
Constants ¶
const (
DefaultAPIVersion = "v1"
DefaultGPUdPort = 15132
DefaultDataDir = "/var/lib/gpud"
)
Variables ¶
var (
DefaultRefreshPeriod = metav1.Duration{Duration: time.Minute}
// keep the metrics only for the last 3 hours
DefaultRetentionPeriod = metav1.Duration{Duration: 3 * time.Hour}
// compact/vacuum is disruptive to existing queries (including reads)
// but necessary to keep the state database from growing indefinitely
// TODO: disabled for now, until we have a better way to detect the performance issue
DefaultCompactPeriod = metav1.Duration{Duration: 0}
)
Functions ¶
func DefaultFifoFile ¶
func DefaultStateFile ¶
func FifoFilePath ¶ added in v0.9.0
FifoFilePath returns the FIFO pipe path under the dataDir.
func PackagesDir ¶ added in v0.9.0
PackagesDir returns the packages directory under the dataDir.
func ResolveDataDir ¶ added in v0.9.0
ResolveDataDir resolves and validates a data directory path. If dataDir is empty or matches DefaultDataDir, it uses platform-specific logic:
- For root users (or when /var/lib exists): /var/lib/gpud
- For non-root users: $HOME/.gpud
For non-empty custom paths, it ensures the directory exists and is writable. The directory is created with 0755 permissions if it doesn't exist.
func StateFilePath ¶ added in v0.9.0
StateFilePath returns the state DB file path under the dataDir.
func VersionFilePath ¶ added in v0.9.0
VersionFilePath returns the version file path under the dataDir.
Types ¶
type Config ¶
type Config struct {
APIVersion string `json:"api_version"`
// Address for the server to listen on.
Address string `json:"address"`
// DataDir is the root directory for GPUd state and package artifacts.
DataDir string `json:"data_dir"`
// State file that persists the latest status.
// If empty, the states are not persisted to file.
State string `json:"state"`
// Amount of time to retain states/metrics for.
// Once elapsed, old states/metrics are purged/compacted.
RetentionPeriod metav1.Duration `json:"retention_period"`
// Interval at which to compact the state database.
CompactPeriod metav1.Duration `json:"compact_period"`
// Set true to enable profiler.
Pprof bool `json:"pprof"`
// Set false to disable auto update
EnableAutoUpdate bool `json:"enable_auto_update"`
// Exit code to exit with when auto updating.
// Only valid when the auto update is enabled.
// Set -1 to disable the auto update by exit code.
AutoUpdateExitCode int `json:"auto_update_exit_code"`
// VersionFile is the file that contains the target version.
// If empty, the version file is not used.
VersionFile string `json:"version_file"`
// A list of nvidia tool command paths to overwrite the default paths.
NvidiaToolOverwrites pkgconfigcommon.ToolOverwrites `json:"nvidia_tool_overwrites"`
// PluginSpecsFile is the file that contains the plugin specs.
PluginSpecsFile string `json:"plugin_specs_file"`
// Components specifies the components to enable.
// Leave empty, "*", or "all" to enable all components.
// Or prefix component names with "-" to disable them.
Components []string `json:"components"`
// FailureInjector is the failure injector.
FailureInjector *components.FailureInjector `json:"failure_injector,omitempty"`
// SkipSessionUpdateConfig skips processing of updateConfig session commands. Intended for testing.
SkipSessionUpdateConfig bool `json:"skip_session_update_config"`
// DBInMemory enables in-memory SQLite database mode.
// When true, the database is opened as a shared in-memory database (file::memory:?cache=shared)
// instead of using the State file path. Data will not persist across restarts.
// ref. https://github.com/mattn/go-sqlite3?tab=readme-ov-file#faq
DBInMemory bool `json:"db_in_memory"`
// SessionToken is the session token for control plane authentication.
// Used when DBInMemory is true and session credentials are passed via CLI flags.
// This allows gpud up to pass the session token from login to gpud run.
SessionToken string `json:"-"`
// SessionMachineID is the machine ID assigned by the control plane.
// Used when DBInMemory is true and session credentials are passed via CLI flags.
// This allows gpud up to pass the assigned machine ID from login to gpud run.
SessionMachineID string `json:"-"`
// SessionEndpoint is the control plane endpoint.
// Used when DBInMemory is true and session credentials are passed via CLI flags.
// This allows gpud up to pass the endpoint from login to gpud run.
// The server reads the endpoint from metadata DB, so it must be seeded for in-memory mode.
SessionEndpoint string `json:"-"`
// contains filtered or unexported fields
}
Config provides gpud configuration data for the server.
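The struct tags above determine how a Config appears as JSON. The sketch below uses `configSketch`, a hypothetical trimmed-down mirror of a few Config fields (the real type has more fields and uses `metav1.Duration`), to show that the session fields tagged `json:"-"` are left out of the encoded output. The address and token values are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// configSketch is a hypothetical subset of Config, reusing the documented JSON tags.
type configSketch struct {
	APIVersion   string   `json:"api_version"`
	Address      string   `json:"address"`
	DataDir      string   `json:"data_dir"`
	Components   []string `json:"components"`
	DBInMemory   bool     `json:"db_in_memory"`
	SessionToken string   `json:"-"` // excluded from JSON, as in Config
}

// encodeSketch marshals the config and returns it as a string.
func encodeSketch(c configSketch) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeSketch(configSketch{
		APIVersion:   "v1",
		Address:      "0.0.0.0:15132", // illustrative listen address
		DataDir:      "/var/lib/gpud",
		Components:   []string{"*"},
		DBInMemory:   true,
		SessionToken: "example-token", // illustrative; never serialized
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```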
func (*Config) ShouldDisable ¶ added in v0.5.0
ShouldDisable returns true if the component should be disabled. If no set of disabled components is specified, it returns false: the component is not disabled and stays enabled by default.
func (*Config) ShouldEnable ¶ added in v0.5.0
ShouldEnable returns true if the component should be enabled. If no set of enabled components is specified, it returns true: the component is enabled by default.
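One way to read the Components field together with ShouldEnable and ShouldDisable is sketched below. The `selection` type and `parseComponents` function are hypothetical, the component names are placeholders, and treating a wildcard mixed with explicit names as "enable all" is an assumption of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// selection is a hypothetical holder for the parsed Components list.
type selection struct {
	all      bool                // "*" or "all" was given
	enabled  map[string]struct{} // explicitly enabled names
	disabled map[string]struct{} // names given with a "-" prefix
}

// parseComponents splits the list into enabled and disabled sets.
func parseComponents(components []string) selection {
	s := selection{enabled: map[string]struct{}{}, disabled: map[string]struct{}{}}
	for _, c := range components {
		switch {
		case c == "*" || c == "all":
			s.all = true
		case strings.HasPrefix(c, "-"):
			s.disabled[strings.TrimPrefix(c, "-")] = struct{}{}
		case c != "":
			s.enabled[c] = struct{}{}
		}
	}
	return s
}

// shouldEnable defaults to true when no enable set is specified.
func (s selection) shouldEnable(name string) bool {
	if s.all || len(s.enabled) == 0 {
		return true
	}
	_, ok := s.enabled[name]
	return ok
}

// shouldDisable defaults to false when no disable set is specified.
func (s selection) shouldDisable(name string) bool {
	_, ok := s.disabled[name]
	return ok
}

func main() {
	s := parseComponents([]string{"-infiniband"})
	fmt.Println(s.shouldEnable("cpu"), s.shouldDisable("infiniband"))
}
```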
type Op ¶
type Op struct {
pkgconfigcommon.ToolOverwrites
FailureInjector *components.FailureInjector
DataDir string
DBInMemory bool
// SessionToken is the session token for db-in-memory mode.
// When DBInMemory is true and this is set, the server will seed
// this token into the in-memory database.
SessionToken string
// SessionMachineID is the machine ID for db-in-memory mode.
// When DBInMemory is true and this is set, the server will seed
// this machine ID into the in-memory database.
SessionMachineID string
// SessionEndpoint is the control plane endpoint for db-in-memory mode.
// When DBInMemory is true and this is set, the server will seed
// this endpoint into the in-memory database.
// The server reads the endpoint from metadata DB, so it must be seeded for in-memory mode.
SessionEndpoint string
}
type OpOption ¶
type OpOption func(*Op)
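OpOption follows the functional options pattern: each option is a function that mutates an Op. The sketch below re-declares a minimal local `op` and `opOption` so it compiles on its own; `applyOpts` is a hypothetical helper, since how the package applies options internally is not shown here.

```go
package main

import "fmt"

// op is a local stand-in for Op with two of its documented fields.
type op struct {
	DataDir    string
	DBInMemory bool
}

// opOption mirrors the shape of OpOption.
type opOption func(*op)

// withDataDir mirrors WithDataDir.
func withDataDir(dataDir string) opOption {
	return func(o *op) { o.DataDir = dataDir }
}

// withDBInMemory mirrors WithDBInMemory.
func withDBInMemory(b bool) opOption {
	return func(o *op) { o.DBInMemory = b }
}

// applyOpts is a hypothetical helper that applies options in order.
func applyOpts(opts ...opOption) *op {
	o := &op{}
	for _, opt := range opts {
		opt(o)
	}
	return o
}

func main() {
	o := applyOpts(withDataDir("/tmp/gpud-data"), withDBInMemory(true))
	fmt.Printf("%+v\n", *o) // prints {DataDir:/tmp/gpud-data DBInMemory:true}
}
```

Because options are applied in order, a later option overrides an earlier one that sets the same field.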
func WithDBInMemory ¶ added in v0.9.0
WithDBInMemory enables in-memory SQLite database mode. When true, the database is opened as file::memory:?cache=shared instead of using file-based storage. ref. https://github.com/mattn/go-sqlite3?tab=readme-ov-file#faq
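The effect of this option on the database location can be sketched as a simple choice of data source name. `sqliteDSN` is a hypothetical function, and the state file path in `main` is illustrative; only the in-memory DSN string comes from the documentation above.

```go
package main

import "fmt"

// inMemoryDSN is the shared in-memory DSN quoted in the documentation.
const inMemoryDSN = "file::memory:?cache=shared"

// sqliteDSN is a hypothetical helper: it returns the in-memory DSN when
// dbInMemory is set, and the state file path otherwise.
func sqliteDSN(stateFile string, dbInMemory bool) string {
	if dbInMemory {
		return inMemoryDSN
	}
	return stateFile
}

func main() {
	fmt.Println(sqliteDSN("/var/lib/gpud/state.db", false)) // illustrative path
	fmt.Println(sqliteDSN("/var/lib/gpud/state.db", true))
}
```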
func WithDataDir ¶ added in v0.9.0
WithDataDir overrides the default data directory for GPUd artifacts.
func WithExcludedInfinibandDevices ¶ added in v0.9.0
WithExcludedInfinibandDevices sets the list of InfiniBand device names to exclude from monitoring. Device names should be like "mlx5_0", "mlx5_1", etc. (not full paths).
This is useful for excluding devices that have restricted Physical Functions (PFs) and cause kernel errors (mlx5_cmd_out_err ACCESS_REG) when queried. This is common on NVIDIA DGX, Umbriel, and GB200 systems with ConnectX-7 adapters.
ref. https://github.com/prometheus/node_exporter/issues/3434 and https://github.com/leptonai/gpud/issues/1164
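A minimal sketch of the exclusion described above, assuming devices are matched by exact name: `filterDevices` is a hypothetical function, not part of this package, and the device names are examples in the format mentioned above.

```go
package main

import "fmt"

// filterDevices returns the devices in all that are not listed in excluded.
// Names are compared exactly, e.g. "mlx5_0" (not a full path).
func filterDevices(all, excluded []string) []string {
	skip := make(map[string]struct{}, len(excluded))
	for _, d := range excluded {
		skip[d] = struct{}{}
	}
	kept := make([]string, 0, len(all))
	for _, d := range all {
		if _, ok := skip[d]; ok {
			continue // excluded device: do not query it
		}
		kept = append(kept, d)
	}
	return kept
}

func main() {
	devices := []string{"mlx5_0", "mlx5_1", "mlx5_2"}
	fmt.Println(filterDevices(devices, []string{"mlx5_1"})) // prints [mlx5_0 mlx5_2]
}
```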
func WithFailureInjector ¶ added in v0.6.0
func WithFailureInjector(injector *components.FailureInjector) OpOption
func WithInfinibandClassRootDir ¶ added in v0.5.1
WithInfinibandClassRootDir specifies the root directory of the InfiniBand class.
func WithSessionEndpoint ¶ added in v0.9.0
WithSessionEndpoint sets the control plane endpoint for db-in-memory mode. When DBInMemory is true and this is set, the server will seed this endpoint into the in-memory database. The server reads the endpoint from metadata DB, so it must be seeded for in-memory mode.
func WithSessionMachineID ¶ added in v0.9.0
WithSessionMachineID sets the machine ID for db-in-memory mode. When DBInMemory is true and this is set, the server will seed this machine ID into the in-memory database.
func WithSessionToken ¶ added in v0.9.0
WithSessionToken sets the session token for db-in-memory mode. When DBInMemory is true and this is set, the server will seed this token into the in-memory database for session authentication.