Documentation
¶
Overview ¶
Package setup provides internal testing utilities for configuring and managing Grove operator installations during e2e tests. This package is not intended for production use and its API may change without notice.
Functions are exported to allow access from e2e test packages (e.g., operator/e2e/tests) which need to modify Grove configuration during test scenarios.
Package setup provides utilities for connecting to Kubernetes clusters for E2E testing.
The cluster must be created beforehand with Grove operator, Kai scheduler, and required test infrastructure already deployed. For local development with k3d, you can use:
./operator/hack/e2e-cluster/create-e2e-cluster.py
This package only handles connecting to existing clusters - it does not create clusters.
Index ¶
- Constants
- func CreateDefaultKaiQueues(ctx context.Context, config *HelmInstallConfig) error
- func GetGroveChartDir() (string, error)
- func GetWorkerNodeLabelSelector() string
- func InstallHelmChart(config *HelmInstallConfig) (*release.Release, error)
- func InstallWithSkaffold(ctx context.Context, config *SkaffoldInstallConfig) (func(), error)
- func SetupRegistryTestImages(registryPort string, images []string) error
- func StartNodeMonitoring(ctx context.Context, clientset *kubernetes.Clientset, logger *utils.Logger) func()
- func UninstallHelmChart(config *HelmInstallConfig) error
- func UpdateGroveConfiguration(ctx context.Context, restConfig *rest.Config, chartDir string, ...) error
- func UpgradeHelmChart(config *HelmInstallConfig) (*release.Release, error)
- func WaitForKaiCRDs(ctx context.Context, config *HelmInstallConfig) error
- type GroveConfig
- type HelmInstallConfig
- type SharedClusterManager
- func (scm *SharedClusterManager) CleanupWorkloads(ctx context.Context) error
- func (scm *SharedClusterManager) GetCRClient() client.Client
- func (scm *SharedClusterManager) GetCleanupError() string
- func (scm *SharedClusterManager) GetClients() (*kubernetes.Clientset, *rest.Config, dynamic.Interface)
- func (scm *SharedClusterManager) GetRegistryPort() string
- func (scm *SharedClusterManager) GetWorkerNodes() []string
- func (scm *SharedClusterManager) HasCleanupFailed() bool
- func (scm *SharedClusterManager) IsSetup() bool
- func (scm *SharedClusterManager) MarkCleanupFailed(err error)
- func (scm *SharedClusterManager) PrepareForTest(ctx context.Context, requiredWorkerNodes int) error
- func (scm *SharedClusterManager) Setup(ctx context.Context, testImages []string) error
- func (scm *SharedClusterManager) Teardown()
- type SkaffoldInstallConfig
- type WebhooksConfig
Constants ¶
const ( // OperatorNamespace is the namespace where the Grove operator is deployed for E2E tests. // This is used during installation (via Skaffold) and for finding operator pods during diagnostics. OperatorNamespace = "grove-system" // OperatorDeploymentName is the name of the operator deployment (also the Helm release name). // This is used to find operator pods for log collection during test failures. OperatorDeploymentName = "grove-operator" // DefaultWebhookPort is the default port for the webhook server. // NOTE: If you change this, also update config.server.webhooks.port in operator/charts/values.yaml DefaultWebhookPort = 9443 // DefaultWebhookServerCertDir is the default directory for webhook certificates. // NOTE: If you change this, also update config.server.webhooks.serverCertDir in operator/charts/values.yaml DefaultWebhookServerCertDir = "/etc/grove-operator/webhook-certs" // OperatorConfigVolumeName is the volume name in the grove-operator Deployment // that mounts the operator ConfigMap. Matches deployment.yaml volume name. OperatorConfigVolumeName = "operator-config" // OperatorConfigDataKey is the key in the operator ConfigMap containing the config YAML. // Matches the key generated by the Helm chart _helpers.tpl. OperatorConfigDataKey = "config.yaml" )
const ( // WorkerNodeLabelKey is the label key used to identify worker nodes in e2e tests. // This can be changed if infrastructure changes. WorkerNodeLabelKey = "node_role.e2e.grove.nvidia.com" // WorkerNodeLabelValue is the label value for worker node identification in e2e tests. WorkerNodeLabelValue = "agent" // TopologyLabelZone is the Kubernetes label key for zone topology domain. TopologyLabelZone = "kubernetes.io/zone" // TopologyLabelBlock is the Kubernetes label key for the block topology domain. TopologyLabelBlock = "kubernetes.io/block" // TopologyLabelRack is the Kubernetes label key for the rack topology domain. TopologyLabelRack = "kubernetes.io/rack" // TopologyLabelHostname is the Kubernetes label key for the hostname topology domain. TopologyLabelHostname = "kubernetes.io/hostname" )
const (
// EnvRegistryPort specifies the container registry port for test images (optional)
EnvRegistryPort = "E2E_REGISTRY_PORT"
)
Variables ¶
This section is empty.
Functions ¶
func CreateDefaultKaiQueues ¶
func CreateDefaultKaiQueues(ctx context.Context, config *HelmInstallConfig) error
CreateDefaultKaiQueues creates queues using the k8s client YAML apply functionality
func GetGroveChartDir ¶
GetGroveChartDir returns the absolute path to the Grove Helm chart directory. It uses runtime.Caller to find the path relative to this source file. This function is exported for use by callers of UpdateGroveConfiguration.
func GetWorkerNodeLabelSelector ¶
func GetWorkerNodeLabelSelector() string
GetWorkerNodeLabelSelector returns the label selector for worker nodes in e2e tests. Returns a formatted string "key=value" for use with Kubernetes label selectors.
func InstallHelmChart ¶
func InstallHelmChart(config *HelmInstallConfig) (*release.Release, error)
InstallHelmChart installs a Helm chart with the given configuration.
func InstallWithSkaffold ¶
func InstallWithSkaffold(ctx context.Context, config *SkaffoldInstallConfig) (func(), error)
InstallWithSkaffold builds and deploys using Skaffold CLI. It runs in two phases to account for the push registry being different than the pull registry for k3d. Note: we're using Skaffold CLI instead of go libraries because there are depdenency conflicts between grove and Skaffold at the time of implentaiton.
func SetupRegistryTestImages ¶
SetupRegistryTestImages pulls images and pushes them to a local container registry. This is used for test images that need to be available inside the cluster. The registry is expected to be accessible at localhost:<registryPort>.
func StartNodeMonitoring ¶
func StartNodeMonitoring(ctx context.Context, clientset *kubernetes.Clientset, logger *utils.Logger) func()
StartNodeMonitoring starts a goroutine that monitors k3d cluster nodes for not ready status and automatically replaces them by deleting the node and restarting the corresponding Docker container. Returns a cleanup function that should be deferred to stop the monitoring.
Background: There's an intermitten issue where nodes go not ready which causes the tests to fail occasional. This is an issue with either k3d or docker on mac. This is a simple solution that is working flawlessly.
The monitoring process: 1. Checks for nodes that are not in Ready status every 5 seconds 2. Skips cordoned nodes (intentionally unschedulable for maintenance) 3. Deletes the not ready node from Kubernetes 4. Finds and restarts the corresponding Docker container (node names match container names exactly) 5. The restarted container will rejoin the cluster as a new node
func UninstallHelmChart ¶
func UninstallHelmChart(config *HelmInstallConfig) error
UninstallHelmChart uninstalls a Helm release.
func UpdateGroveConfiguration ¶
func UpdateGroveConfiguration(ctx context.Context, restConfig *rest.Config, chartDir string, config *GroveConfig, logger *utils.Logger) error
UpdateGroveConfiguration updates the Grove operator configuration.
This uses Helm upgrade (rather than Skaffold) because: 1. Grove is initially installed via Skaffold, which uses Helm under the hood 2. For config-only changes (like switching cert modes), rebuilding images is unnecessary 3. Helm upgrade with ReuseValues preserves the image configuration that Skaffold set
The chartDir parameter should be the path to the Grove Helm chart directory. Use GetGroveChartDir() to obtain the default chart directory path.
This approach avoids wasteful rebuilds while staying compatible with the Skaffold installation.
func UpgradeHelmChart ¶
func UpgradeHelmChart(config *HelmInstallConfig) (*release.Release, error)
UpgradeHelmChart upgrades a Helm chart with the given configuration.
func WaitForKaiCRDs ¶
func WaitForKaiCRDs(ctx context.Context, config *HelmInstallConfig) error
WaitForKaiCRDs waits for the Queue CRD from scheduling.run.ai/v2 to be available
Types ¶
type GroveConfig ¶
type GroveConfig struct {
// InstallCRDs controls whether CRDs should be installed/updated.
InstallCRDs bool
// Webhooks contains webhook-specific configuration.
Webhooks WebhooksConfig
}
GroveConfig holds typed configuration options for updating the Grove operator. This struct provides a user-friendly interface that gets translated to Helm values internally.
type HelmInstallConfig ¶
type HelmInstallConfig struct {
// RestConfig is the Kubernetes REST configuration. If nil, uses default kubeconfig.
RestConfig *rest.Config
// ReleaseName is the name of the Helm release. Required unless GenerateName is true.
ReleaseName string
// ChartRef is the chart reference (path, URL, or chart name). Required.
ChartRef string
// ChartVersion is the version of the chart to install. Required.
ChartVersion string
// Namespace is the Kubernetes namespace to install into. Required.
Namespace string
// CreateNamespace creates the namespace if it doesn't exist.
CreateNamespace bool
// Wait blocks until all resources are ready.
Wait bool
// GenerateName generates a random release name with ReleaseName as prefix.
GenerateName bool
// Values are the chart values to use for the installation.
Values map[string]interface{}
// HelmLoggerFunc is called for Helm operation logging.
HelmLoggerFunc func(format string, v ...interface{})
// Logger is the full logger for component operations.
Logger *utils.Logger
// RepoURL is the base URL of the Helm repository (optional, for direct chart downloads).
RepoURL string
// ReuseValues reuses the last release's values and merges in the new values.
ReuseValues bool
// Timeout is the time to wait for Kubernetes operations (default: 5 minutes).
Timeout time.Duration
}
HelmInstallConfig holds configuration for Helm chart installations.
func (*HelmInstallConfig) Validate ¶
func (c *HelmInstallConfig) Validate() error
Validate validates and sets defaults for the configuration.
type SharedClusterManager ¶
type SharedClusterManager struct {
// contains filtered or unexported fields
}
SharedClusterManager manages a shared Kubernetes cluster for E2E tests. It connects to an existing cluster via kubeconfig and manages test lifecycle operations like workload cleanup and node cordoning.
func SharedCluster ¶
func SharedCluster(logger *utils.Logger) *SharedClusterManager
SharedCluster returns the singleton shared cluster manager
func (*SharedClusterManager) CleanupWorkloads ¶
func (scm *SharedClusterManager) CleanupWorkloads(ctx context.Context) error
CleanupWorkloads removes all test workloads from the cluster
func (*SharedClusterManager) GetCRClient ¶
func (scm *SharedClusterManager) GetCRClient() client.Client
GetCRClient returns the controller-runtime client for typed CR access
func (*SharedClusterManager) GetCleanupError ¶
func (scm *SharedClusterManager) GetCleanupError() string
GetCleanupError returns the error message from the failed cleanup, or empty string if no failure.
func (*SharedClusterManager) GetClients ¶
func (scm *SharedClusterManager) GetClients() (*kubernetes.Clientset, *rest.Config, dynamic.Interface)
GetClients returns the kubernetes clients for tests to use
func (*SharedClusterManager) GetRegistryPort ¶
func (scm *SharedClusterManager) GetRegistryPort() string
GetRegistryPort returns the registry port for test image setup
func (*SharedClusterManager) GetWorkerNodes ¶
func (scm *SharedClusterManager) GetWorkerNodes() []string
GetWorkerNodes returns the list of worker node names
func (*SharedClusterManager) HasCleanupFailed ¶
func (scm *SharedClusterManager) HasCleanupFailed() bool
HasCleanupFailed returns true if a previous cleanup operation failed.
func (*SharedClusterManager) IsSetup ¶
func (scm *SharedClusterManager) IsSetup() bool
IsSetup returns whether the shared cluster is setup
func (*SharedClusterManager) MarkCleanupFailed ¶
func (scm *SharedClusterManager) MarkCleanupFailed(err error)
MarkCleanupFailed marks that a cleanup operation has failed. This causes all subsequent tests to fail immediately when they try to prepare the cluster.
func (*SharedClusterManager) PrepareForTest ¶
func (scm *SharedClusterManager) PrepareForTest(ctx context.Context, requiredWorkerNodes int) error
PrepareForTest prepares the cluster for a specific test by cordoning the appropriate nodes. It ensures exactly `requiredWorkerNodes` nodes are schedulable by cordoning excess nodes. Returns an error if a previous cleanup operation failed, preventing potentially corrupted test state.
func (*SharedClusterManager) Setup ¶
func (scm *SharedClusterManager) Setup(ctx context.Context, testImages []string) error
Setup initializes the shared cluster by connecting to an existing Kubernetes cluster.
The cluster must be created beforehand with:
- Grove operator deployed and ready
- Kai scheduler deployed and ready
- Required node labels and topology configuration
For local development with k3d:
./operator/hack/e2e-cluster/create-e2e-cluster.py
Optional environment variables:
- E2E_REGISTRY_PORT (default: 5001, for pushing test images to local registry)
The cluster connection is established via standard kubeconfig resolution:
- KUBECONFIG environment variable, or
- ~/.kube/config, or
- In-cluster config
func (*SharedClusterManager) Teardown ¶
func (scm *SharedClusterManager) Teardown()
Teardown cleans up the shared cluster
type SkaffoldInstallConfig ¶
type SkaffoldInstallConfig struct {
// SkaffoldYAMLPath is the path to the skaffold.yaml file. Required.
SkaffoldYAMLPath string
// RestConfig is the Kubernetes REST configuration. Required.
RestConfig *rest.Config
// Profiles are the Skaffold profiles to activate (optional).
Profiles []string
// PushRepo is the repository to push images to (e.g., "localhost:5001"). Required.
PushRepo string
// PullRepo is the repository to pull images from (e.g., "registry:5001"). Required.
PullRepo string
// Namespace is the target namespace for deployment (optional, defaults to "default").
Namespace string
// Env are environment variables required by skaffold.yaml (e.g., VERSION, LD_FLAGS).
Env map[string]string
// Logger is the logger for operations (optional, will use default if nil).
Logger *utils.Logger
}
SkaffoldInstallConfig holds configuration for Skaffold installations.
func (*SkaffoldInstallConfig) Validate ¶
func (c *SkaffoldInstallConfig) Validate() error
Validate validates the configuration.
type WebhooksConfig ¶
type WebhooksConfig struct {
// CertProvisionMode controls how webhook certificates are provisioned.
// Use configv1alpha1.CertProvisionModeAuto for automatic provisioning,
// configv1alpha1.CertProvisionModeManual for external cert management.
CertProvisionMode configv1alpha1.CertProvisionMode
// SecretName is the name of the Kubernetes secret containing TLS certificates.
SecretName string
// Annotations to apply to webhook configurations (e.g., for cert-manager CA injection).
// These annotations are applied to all webhook configurations.
Annotations map[string]string
}
WebhooksConfig holds configuration for Grove's webhook server.