migration

package
v0.11.0
Warning

This package is not in the latest version of its module.

Published: Mar 4, 2026 License: Apache-2.0 Imports: 26 Imported by: 0

README

Migration

The migration package is used to orchestrate schema changes to one or more tables. Each change is tracked in the runner; the copier and checker are largely agnostic to the fact that they may be operating on multiple tables at once.

There is one replication client for all changes, and a subscription is added for each table:

┌─────────────────────────────────────────────────────────────────────────────────────────┐
│                                    SPIRIT MIGRATION                                     │
│                                                                                         │
│  ┌──────────────────────────────────────────────────────────────────────────────────┐   │
│  │                              RUNNER (Orchestrator)                               │   │
│  │                              pkg/migration/runner.go                             │   │
│  │                                                                                  │   │
│  │   Coordinates the entire migration lifecycle:                                    │   │
│  │   1. Setup (create _new table, checkpoint table)                                 │   │
│  │   2. Start replication client                                                    │   │
│  │   3. Run copier                                                                  │   │
│  │   4. Disable watermark optimization                                              │   │
│  │   5. Run checksum                                                                │   │
│  │   6. Perform cutover (RENAME TABLE)                                              │   │
│  └──────────────────────────────────────────────────────────────────────────────────┘   │
│         │                              │                              │                 │
│         │ owns                         │ owns                         │ owns            │
│         ▼                              ▼                              ▼                 │
│  ┌─────────────┐              ┌─────────────────┐              ┌─────────────┐          │
│  │   COPIER    │              │   REPL CLIENT   │              │   CHECKER   │          │
│  │             │              │                 │              │  (Checksum) │          │
│  └─────────────┘              └─────────────────┘              └─────────────┘          │
└─────────────────────────────────────────────────────────────────────────────────────────┘

The copier and replication client are running in parallel during step 3. The only hard requirement is that the replication client starts before the copier so we can ensure that all changes are tracked.

The copier performs what is essentially a dirty copy: individual chunks do not track changes made to rows while they are being copied. This is reconciled by the replication client, which detects any changes that have been made and applies them to the new table consistently.

Once copying is complete, a checksum process is started. This ensures that all data has safely made it to the new table and that it is safe to cut over.

What parts of the process are locking?

To answer this question, we need to understand that there are two types of locks. Both can be an issue:

  1. Data locks, aka InnoDB row-level locks. These are configurable via innodb_lock_wait_timeout: the server default is 50s, but Spirit overrides this to 3s.
  2. Metadata locks. These are configurable via lock_wait_timeout: the server default is 1 year(!), but Spirit overrides this to 30s (configurable via --lock-wait-timeout).

So when we describe Spirit as a "non-blocking schema change tool", that is a bit of a white lie. What it means is that we don't require a metadata lock for the entire 10h schema change, as built-in MySQL DDL often does. It does not mean that we do not require locks at all.

The following are common data lock issues:

  • Data locks are required on the specific rows when copying chunks from the existing table to the new table. This is because we use the INSERT .. SELECT syntax which does not use MVCC (i.e. non-locking reads) when reading from the SELECT side. We mitigate this effect by offering you a configurable target-chunk-time. Lower values will mean that these locks are held for less time because the copies are shorter.

  • Data locks are required on the specific rows when applying changes from the replication client. This is similar to the copier, except we use a combination of DELETE and REPLACE .. SELECT.

  • The copier and replication client conflict with each other and deadlock. This is less likely to occur now than in earlier versions of Spirit, see issue #479.

You may want to use --buffered for cases where there are hot rows or a high probability of contention on data locks. Because of its different design, it only requires data locks on the _new table, which effectively prevents all of the data lock contention described above.

The following are cases where metadata locks are required:

  • Spirit initially attempts INSTANT/INPLACE DDL. If the change is compatible, this requires an exclusive metadata lock on the table.
  • Starting a checksum requires an initial exclusive metadata lock to ensure that all data is synchronized between the checksum threads.
  • The cutover operation requires an exclusive metadata lock.

What causes all metadata lock issues? (hint: it's not Spirit)

Any open transactions hold shared metadata locks on the tables that you are modifying. If you have long transactions that have not yet committed or rolled back, Spirit's exclusive lock request will be queued waiting for them to finish. This then looks like a Spirit problem, because any shared lock requests that arrive after Spirit's exclusive lock request will be queued behind Spirit. So the solution is to keep your transactions as short as possible.

Spirit's mitigation is that, by default, it will force-kill the specific connections that are blocking it from acquiring an exclusive lock. This can be disabled with --skip-force-kill if needed.

Using Spirit migration as a Go package

I will assume that, similar to our use case, you are wrapping some sort of automation around Spirit. If that automation is written in Go, I would encourage you to use the Spirit API rather than the CLI executable.

The following is a simplification of what we use ourselves:

import (
	"context"
	"time"

	"github.com/cashapp/spirit/pkg/migration"
	"github.com/pkg/errors"
)

// Execute runs a single schema change via the Spirit migration API.
// (Spirit, ExecutableTask, and Cluster are our own wrapper types.)
func (sm *Spirit) Execute(ctx context.Context, m *ExecutableTask) error {
	startTime := time.Now()
	runner, err := migration.NewRunner(&migration.Migration{
		Host:              m.Cluster.Host,
		Username:          m.Cluster.Username,
		Password:          &m.Cluster.Password,
		Database:          m.Cluster.DatabaseName,
		Statement:         m.Statement,
		TargetChunkTime:   m.TargetChunkTime,
		Threads:           m.Concurrency,
		LockWaitTimeout:   m.LockWaitTimeout,
		InterpolateParams: true,
	})
	if err != nil {
		return errors.Wrap(err, "failed to create spirit migration runner")
	}
	defer runner.Close()
	if m.Metrics != nil {
		runner.SetMetricsSink(m.Metrics)
	}
	sm.Lock()
	sm.progressCallback = func() string {
		return runner.Progress().Summary
	}
	sm.Unlock()
	runner.SetLogger(m.Logger)
	if err = runner.Run(ctx); err != nil {
		return errors.Wrap(err, "failed to run spirit migration")
	}
	m.Logger.Infof("spirit migration completed in %s", time.Since(startTime))
	return nil
}

Documentation

Overview

Package migration contains the logic for running online schema changes.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CutOver

type CutOver struct {
	// contains filtered or unexported fields
}

func NewCutOver

func NewCutOver(db *sql.DB, config []*cutoverConfig, feed *repl.Client, dbConfig *dbconn.DBConfig, logger *slog.Logger) (*CutOver, error)

NewCutOver contains the logic to perform the final cut over. It can cut over multiple tables at once based on config. A replication feed is used to ensure consistency before the cut over.

func (*CutOver) Run

func (c *CutOver) Run(ctx context.Context) error

type Migration

type Migration struct {
	Host                 string        `name:"host" help:"Hostname" optional:""`
	Username             string        `name:"username" help:"User" optional:""`
	Password             *string       `name:"password" help:"Password" optional:""`
	Database             string        `name:"database" help:"Database" optional:""`
	ConfFile             string        `name:"conf" help:"MySQL conf file" optional:"" type:"existingfile"`
	Table                string        `name:"table" help:"Table" optional:""`
	Alter                string        `name:"alter" help:"The alter statement to run on the table" optional:""`
	Threads              int           `name:"threads" help:"Number of concurrent threads for copy and checksum tasks" optional:"" default:"4"`
	TargetChunkTime      time.Duration `name:"target-chunk-time" help:"The target copy time for each chunk" optional:"" default:"500ms"`
	ReplicaDSN           string        `name:"replica-dsn" help:"A DSN for a replica which (if specified) will be used for lag checking." optional:""`
	ReplicaMaxLag        time.Duration `` /* 127-byte string literal not displayed */
	LockWaitTimeout      time.Duration `name:"lock-wait-timeout" help:"The DDL lock_wait_timeout required for checksum and cutover" optional:"" default:"30s"`
	SkipDropAfterCutover bool          `name:"skip-drop-after-cutover" help:"Keep old table after completing cutover" optional:"" default:"false"`
	DeferCutOver         bool          `name:"defer-cutover" help:"Defer cutover (and checksum) until sentinel table is dropped" optional:"" default:"false"`
	SkipForceKill        bool          `` /* 168-byte string literal not displayed */
	Strict               bool          `name:"strict" help:"Exit on --alter mismatch when incomplete migration is detected" optional:"" default:"false"`
	Statement            string        `name:"statement" help:"The SQL statement to run (replaces --table and --alter)" optional:"" default:""`
	Lint                 bool          `name:"lint" help:"Run lint checks before running migration" optional:""`
	LintOnly             bool          `name:"lint-only" help:"Run lint checks and exit without performing migration" optional:""`

	// TLS Configuration
	TLSMode            string `` /* 142-byte string literal not displayed */
	TLSCertificatePath string `name:"tls-ca" help:"Path to custom TLS CA certificate file" optional:""`

	// Buffered copy uses the DBLog algorithm for copying and replication applying.
	// It reads rows from the source and inserts them into the target, rather than
	// using INSERT IGNORE .. SELECT. This is also required for cross-server moves.
	Buffered bool `name:"buffered" help:"Use the buffered copier based on the lock-free DBLog algorithm" optional:"" default:"false"`

	// Hidden options for now (supports more obscure cash/sq usecases)
	InterpolateParams bool `name:"interpolate-params" help:"Enable interpolate params for DSN" optional:"" default:"false" hidden:""`
	// Used for tests so we can concurrently execute without issues even though
	// the sentinel name is shared. Basically it will be true here, but false
	// in the tests unless we set it explicitly true.
	RespectSentinel bool `name:"respect-sentinel" help:"Look for sentinel table to exist and block if it does" optional:"" default:"true" hidden:""`
	// contains filtered or unexported fields
}

func (*Migration) Run

func (m *Migration) Run() error

func (*Migration) Validate added in v0.11.0

func (m *Migration) Validate() error

Validate is called by Kong after parsing to check for invalid flag combinations.

type Runner

type Runner struct {
	// contains filtered or unexported fields
}

func NewRunner

func NewRunner(m *Migration) (*Runner, error)

func (*Runner) Cancel

func (r *Runner) Cancel()

func (*Runner) Close

func (r *Runner) Close() error

func (*Runner) DumpCheckpoint

func (r *Runner) DumpCheckpoint(ctx context.Context) error

DumpCheckpoint is called approximately every minute. It writes the current state of the migration to the checkpoint table, which can be used in recovery. Previously resuming from checkpoint would always restart at the copier, but it can now also resume at the checksum phase.

func (*Runner) Progress

func (r *Runner) Progress() status.Progress

func (*Runner) Run

func (r *Runner) Run(ctx context.Context) error

func (*Runner) SetLogger

func (r *Runner) SetLogger(logger *slog.Logger)

func (*Runner) SetMetricsSink

func (r *Runner) SetMetricsSink(sink metrics.Sink)

func (*Runner) Status

func (r *Runner) Status() string

Directories

Path Synopsis
Package check provides various configuration and health checks that can be run against a sql.DB connection.
