Documentation ¶
Overview ¶
Package aksmachinepoller provides a GET-based poller for tracking individual AKS machine provisioning status by polling GET machine until a terminal state is reached. This is an alternative to the Azure SDK poller, which polls AKS operation objects (through GET operation).
This approach works because provisioning error details and success status are derived from the AKS machine object itself (through the ProvisioningError field). One use case is batched AKS machine provisioning, where the batch coordinator sends one API call for N machines and gets back one SDK poller for the entire batch — it cannot track individual machines. Each machine needs its own poller, and polling GET machine is the most straightforward approach.
The poller sits on top of the same SDK HTTP client, so each GET call still passes through the SDK pipeline's per-request retry policy. See the Options doc comment for a detailed comparison with the SDK poller.
Note: there is a proposal to stop relying on ProvisioningError from machine objects and rely on AKS operation errors instead. That would require batched request error returning (potentially via upcoming ARM batch API) and rewriting error handling based on AKS error formats instead of CRP error formats. If that transition happens, this approach would need to be revisited.
Index ¶
- Constants
- type AKSMachineGetter
- type AKSMachineNewListPager
- type MachineListCache
- func (c *MachineListCache) Get(machineName string) (*armcontainerservice.Machine, error)
- func (c *MachineListCache) List(ctx context.Context) ([]*armcontainerservice.Machine, error)
- func (c *MachineListCache) PollUntilDone(ctx context.Context, name string) (*armcontainerservice.ErrorDetail, error)
- func (c *MachineListCache) RequestUpdate()
- func (c *MachineListCache) Shutdown()
- type Options
- type Poller
Constants ¶
const (
	// DefaultMachineListCacheTTL is the default time-to-live for machine list cache entries.
	// GET Machine 429s at 1K-node scale cost 17-29 seconds each; caching LIST results
	// converts O(N) individual GETs into O(1) cached lookups. A 30-second TTL is
	// acceptable because drift and reconciliation checks re-run on subsequent cycles anyway.
	DefaultMachineListCacheTTL = 30 * time.Second

	// Provisioning state constants for the AKS Machine API.
	ProvisioningStateCreating  = "Creating"
	ProvisioningStateUpdating  = "Updating"
	ProvisioningStateDeleting  = "Deleting"
	ProvisioningStateSucceeded = "Succeeded"
	ProvisioningStateFailed    = "Failed"
)
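Of these states, only Succeeded and Failed are terminal; the others mean provisioning is still in flight. A small classifier along these lines is what a polling loop keys on (`isTerminal` is illustrative and not part of the package API):

```go
package main

import "fmt"

// Provisioning state constants, mirroring the package's values.
const (
	ProvisioningStateCreating  = "Creating"
	ProvisioningStateUpdating  = "Updating"
	ProvisioningStateDeleting  = "Deleting"
	ProvisioningStateSucceeded = "Succeeded"
	ProvisioningStateFailed    = "Failed"
)

// isTerminal reports whether a provisioning state is final.
func isTerminal(state string) bool {
	return state == ProvisioningStateSucceeded || state == ProvisioningStateFailed
}

func main() {
	fmt.Println(isTerminal(ProvisioningStateCreating)) // false
	fmt.Println(isTerminal(ProvisioningStateFailed))   // true
}
```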
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type AKSMachineGetter ¶
type AKSMachineGetter interface {
Get(ctx context.Context, resourceGroupName string, resourceName string, agentPoolName string, aksMachineName string, options *armcontainerservice.MachinesClientGetOptions) (armcontainerservice.MachinesClientGetResponse, error)
}
type AKSMachineNewListPager ¶
type AKSMachineNewListPager interface {
NewListPager(resourceGroupName string, resourceName string, agentPoolName string, options *armcontainerservice.MachinesClientListOptions) *runtime.Pager[armcontainerservice.MachinesClientListResponse]
}
type MachineListCache ¶
type MachineListCache struct {
// contains filtered or unexported fields
}
MachineListCache caches the results of LIST Machine API calls, keyed by machine name. It reduces O(N) individual GET Machine calls (for drift checks, reconciliation, etc.) to O(1) cached lookups between LIST refreshes.
Thread-safety: all methods are safe for concurrent use.
Invalidation strategy:
- TTL-based: entries expire after cacheTTL (default 30s)
- Explicit: mutating operations (Create, Update, Delete) invalidate the affected entry
- Full refresh: List() replaces the entire cache
Update strategy:
- Background worker goroutine handles all cache updates
- Periodic refresh every 5 minutes keeps the cache fresh
- On-demand updates via RequestUpdate() channel (non-blocking)
func NewMachineListCache ¶
func NewMachineListCache(ctx context.Context, ttl time.Duration, client AKSMachineNewListPager, interval time.Duration, clusterResourceGroup, clusterName, aksMachinesPoolName string) *MachineListCache
func (*MachineListCache) Get ¶
func (c *MachineListCache) Get(machineName string) (*armcontainerservice.Machine, error)
Get retrieves a machine from the cache by name. It returns the machine if the cache is fresh and contains it, and an error if the cache is stale or the machine is not found.
func (*MachineListCache) List ¶
func (c *MachineListCache) List(ctx context.Context) ([]*armcontainerservice.Machine, error)
List returns all machines in the cache. If the cache is stale, it returns an error and requests a background update.
func (*MachineListCache) PollUntilDone ¶
func (c *MachineListCache) PollUntilDone(ctx context.Context, name string) (*armcontainerservice.ErrorDetail, error)
func (*MachineListCache) RequestUpdate ¶
func (c *MachineListCache) RequestUpdate()
RequestUpdate sends a non-blocking update request to the background worker. If an update is already pending (the channel buffer is full), this is a no-op. This method never blocks the caller; use it to hint that a cache refresh would be beneficial.
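The never-blocks guarantee falls out of a size-1 buffered channel combined with a select that has a default case. A sketch under illustrative names:

```go
package main

import "fmt"

type cache struct {
	updateCh chan struct{} // buffered with capacity 1
}

// RequestUpdate sends an update hint without ever blocking: if a request
// is already pending (buffer full), the default case drops the duplicate.
func (c *cache) RequestUpdate() {
	select {
	case c.updateCh <- struct{}{}:
	default: // update already pending; no-op
	}
}

func main() {
	c := &cache{updateCh: make(chan struct{}, 1)}
	c.RequestUpdate()
	c.RequestUpdate() // coalesced with the pending request
	fmt.Println(len(c.updateCh)) // 1
}
```

Coalescing duplicate hints is what makes it safe to call RequestUpdate from hot paths such as failed Get lookups.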
func (*MachineListCache) Shutdown ¶
func (c *MachineListCache) Shutdown()
Shutdown stops the background update worker and waits for it to finish. Call this during provider shutdown to prevent goroutine leaks. After calling Shutdown, the cache will no longer receive automatic updates.
type Options ¶
type Options struct {
// PollInterval is the interval between GET requests to check operation state.
// Equivalent to SDK's PollUntilDoneOptions.Frequency (default 30s).
PollInterval time.Duration
// RetryDelay is the initial delay before retrying after a transient GET error or
// unexpected state (nil/unrecognized provisioning state). Doubles on each consecutive
// retry, capped at MaxRetryDelay (exponential backoff).
// Equivalent to SDK's policy.RetryOptions.RetryDelay (default 800ms), but applied at
// the polling loop level rather than per-HTTP-request.
RetryDelay time.Duration
// MaxRetryDelay is the maximum delay between retries (exponential backoff cap).
// Equivalent to SDK's policy.RetryOptions.MaxRetryDelay (default 60s).
MaxRetryDelay time.Duration
// MaxRetries is the maximum number of consecutive retry attempts for transient GET
// errors or unexpected states before giving up. Resets to its initial value whenever
// a healthy non-terminal state (Creating/Updating) is observed (see comment above).
// Equivalent to SDK's policy.RetryOptions.MaxRetries (default 3), but scoped to the
// polling session rather than individual HTTP requests.
MaxRetries int
}
Options contains configuration for polling long-running operations.
How this poller relates to the Azure SDK poller ¶
The Azure SDK's runtime.Poller has two layers:
- Polling loop — exposes one option: Frequency (interval between polls, default 30s).
- HTTP pipeline — each poll request passes through policy.RetryOptions, which handles transient HTTP errors with exponential backoff (RetryDelay=800ms, MaxRetryDelay=60s, MaxRetries=3, status codes 408/429/500/502/503/504). Retries are scoped per-request: each poll gets its own fresh retry budget.
Our GET-based poller sits on top of the same SDK HTTP client, so each GET call still benefits from the SDK pipeline's per-request retries. The options here control an additional retry layer at the polling loop level, handling cases the SDK pipeline cannot: successful GETs that return unexpected state (nil or unrecognized provisioning state), or transient errors that persist beyond the SDK pipeline's per-request budget.
Differences from the SDK poller and why ¶
Retry-After headers: The SDK poller honors Retry-After on poll responses, overriding Frequency. We use a fixed PollInterval because the server does not typically send Retry-After on successful provisioning GET responses. Per-request Retry-After (e.g., on 429s) is still handled by the SDK HTTP pipeline underneath.
Retry budget reset: We reset MaxRetries when a healthy non-terminal state (Creating/Updating) is observed. The SDK doesn't need this because its retries are per-request (each poll starts fresh). Ours are per-session (one budget across the entire loop), so without resetting, intermittent errors across a long-running operation would accumulate and exhaust the budget even though the operation is making progress. The reset makes our session-scoped budget behave equivalently to the SDK's per-request budget.
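The session-scoped budget with reset can be sketched as follows. This is a simplification: backoff delays are omitted, and the real loop also distinguishes transient GET errors from unrecognized states.

```go
package main

import (
	"errors"
	"fmt"
)

// pollWithBudget illustrates the session-scoped retry budget: consecutive
// unexpected results consume the budget, and the budget resets whenever a
// healthy non-terminal state (Creating/Updating) is observed, so a
// long-running but progressing operation is never starved.
func pollWithBudget(states []string, maxRetries int) (string, error) {
	retries := 0
	for _, state := range states {
		switch state {
		case "Creating", "Updating":
			retries = 0 // healthy progress: reset the budget
		case "Succeeded", "Failed":
			return state, nil
		default: // nil/unrecognized state or transient GET error
			retries++
			if retries > maxRetries {
				return "", errors.New("retry budget exhausted")
			}
		}
	}
	return "", errors.New("no terminal state observed")
}

func main() {
	// Intermittent unknowns interleaved with healthy states stay within budget.
	state, err := pollWithBudget([]string{"Creating", "", "Creating", "", "Succeeded"}, 1)
	fmt.Println(state, err) // Succeeded <nil>
	// Consecutive unknowns exhaust the budget.
	_, err = pollWithBudget([]string{"", ""}, 1)
	fmt.Println(err) // retry budget exhausted
}
```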
Per-try timeout (TryTimeout): Not implemented. The SDK HTTP pipeline's transport-level timeouts and context cancellation provide equivalent protection.
func DefaultOptions ¶
func DefaultOptions() Options
DefaultOptions returns production poller configuration.
func InstantOptions ¶
func InstantOptions() Options
InstantOptions returns poller configuration for tests where the fake returns Succeeded immediately. Uses minimal intervals to avoid delays while still exercising the polling code path.
type Poller ¶
type Poller struct {
// contains filtered or unexported fields
}
Poller polls AKS machine instances until they reach a terminal state. This follows Azure SDK polling patterns with exponential backoff for transient errors.
func (*Poller) PollUntilDone ¶
func (p *Poller) PollUntilDone(ctx context.Context) (*armcontainerservice.ErrorDetail, error)
PollUntilDone polls the AKS machine instance with GET calls until the provisioning state reaches a terminal value. If provisioning succeeds, it returns (nil, nil). If provisioning fails, it returns the provisioning error as the first return value. The second return value is non-nil only when the poller itself is not working as expected (e.g., context cancellation or an exhausted retry budget); receiving a proper provisioning error from the AKS machine API is this function's expected behavior and is not treated as a function error.
ASSUMPTION: the AKS machine creation has already begun, and is visible from the API (using GET).
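A hypothetical caller interprets the two return values as below. `ErrorDetail` here stands in for armcontainerservice.ErrorDetail, the error code is purely illustrative, and the poller's constructor is elided:

```go
package main

import "fmt"

// ErrorDetail is a stand-in for armcontainerservice.ErrorDetail.
type ErrorDetail struct {
	Code    string
	Message string
}

// interpret shows the calling convention: a non-nil *ErrorDetail means
// provisioning failed (an expected outcome the caller should handle),
// while a non-nil error means the poller itself malfunctioned.
func interpret(detail *ErrorDetail, err error) string {
	switch {
	case err != nil:
		return "poller error: " + err.Error()
	case detail != nil:
		return "provisioning failed: " + detail.Code
	default:
		return "provisioning succeeded"
	}
}

func main() {
	fmt.Println(interpret(nil, nil)) // provisioning succeeded
	// "ExampleCode" is illustrative, not a real AKS error code.
	fmt.Println(interpret(&ErrorDetail{Code: "ExampleCode"}, nil)) // provisioning failed: ExampleCode
}
```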