migration

package

v0.15.0 Latest Latest Go to latest Published: Jun 15, 2026 License: Apache-2.0 Imports: 25 Imported by: 1

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/block/spirit

Links

Open Source Insights

README ¶

Migration

The migration package is used to orchestrate schema changes to one or more tables. Each change is tracked in the runner, with the copier and checker largely agnostic that they may be copying multiple tables at once.

There is one change source (change.Source) for all changes, and a subscription is added for each table:

┌─────────────────────────────────────────────────────────────────────────────────────────┐
│                                    SPIRIT MIGRATION                                     │
│                                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │
│ │                   RUNNER (Orchestrator) — pkg/migration/runner.go                   │ │
│ │                                                                                     │ │
│ │          Owns and coordinates every component below, across the lifecycle:          │ │
│ │            1. Setup (_new + checkpoint tables)   2. Start change source             │ │
│ │           3. Run copier   4. Disable watermark opt.   5. Initial checksum           │ │
│ │           6. Wait on defer-cutover sentinel    7. Cutover (RENAME TABLE)            │ │
│ └─────────────────────────────────────────────────────────────────────────────────────┘ │
│                                                                                         │
│   ┌────────────────────┐        ┌────────────────────┐        ┌────────────────────┐    │
│   │       COPIER       │        │   CHANGE SOURCE    │        │      CHECKER       │    │
│   │                    │        │   (binlogClient)   │        │     (Checksum)     │    │
│   │  reads the source  │        │   reads binlog;    │        │   verifies that    │    │
│   │  table in chunks   │        │ one sub. per table │        │   source == _new   │    │
│   └──────────┬─────────┘        └──────────┬─────────┘        └────────────────────┘    │
│              │ row images                  │ row images                                 │
│              │ (buffered copier)           │ (always)                                   │
│              └──────────────┬──────────────┘                                            │
│                             ▼                                                           │
│                     ┌───────────────┐                                                   │
│                     │    APPLIER    │ ── REPLACE INTO _new VALUES (…) ──►  _new table   │
│                     │  pkg/applier  │                                                   │
│                     └───────────────┘                                                   │
│                                                                                         │
└─────────────────────────────────────────────────────────────────────────────────────────┘

The copier and change source are running in parallel during step 3. The only hard requirement is that the change source starts before the copier so we can ensure that all changes are tracked.

The copier is essentially a dirty-copy, in that each chunk does not correctly keep track of changes being made. This is reconciled via the change source, which detects any changes that have been made and applies them to the new table consistently.

Once copying is complete, a checksum process is started. This ensures that all data has safely made it to the new table, and it is safe to cutover.

What parts of the process are locking?

To answer this question, we need to understand that there are two types of locks:

Data locks, aka InnoDB row-level locks. These are configurable via innodb_lock_wait_timeout: the server default is 50s, but spirit overwrites this to 3s.
Metadata locks (MDL). These are configurable via lock_wait_timeout: the server default is 1 year(!), but spirit overwrites this to 30s (configurable by --lock-wait-timeout).

By default, Spirit takes no data locks on the source table. The default buffered copier reads rows into Spirit and writes them to _new through the applier (REPLACE INTO _new VALUES (...)), and the change source is likewise always buffered — it applies binlog row images and never runs SELECT FROM original. Neither path holds shared row locks on the source, so there is no copier-vs-OLTP contention on hot rows. (The applier's REPLACE INTO does take locks, but only on _new, which nothing else touches.) See pkg/change/README.md for the subscription design.

Data locks only re-enter the picture if you opt into the legacy --unbuffered copier, which issues INSERT IGNORE INTO _new ... SELECT FROM original. The SELECT side takes shared row locks rather than using MVCC, so it can contend with production workloads touching the same rows. When using --unbuffered, the main knob to mitigate this is target-chunk-time: smaller chunks mean each copy statement holds locks for less time.

So the locking that does apply by default is metadata locks. When we describe Spirit as a "non-blocking schema change tool" that is a bit of a white lie: we don't hold an MDL for the entire 10h schema change, as built-in MySQL DDL often does, but we do need a brief exclusive MDL at a few points:

Spirit initially attempts INSTANT/INPLACE DDL. If this is compatible, it requires an exclusive metadata lock on the table.
Starting a checksum requires an initial exclusive metadata lock to ensure that all data is synchronized between the checksum threads.
The cutover operation requires an exclusive metadata lock.

What causes all metadata lock issues? (hint: it's not spirit)

Any open transactions will have shared metadata locks on any of the tables that you are modifying. If you have long transactions that have not yet committed/rolled back, Spirit's exclusive lock will be queued waiting for them to finish. This then looks like a Spirit problem because any shared lock requests that arrive after Spirit's exclusive lock request will then be queued behind Spirit. So the solution is to keep your transactions as short as possible.

The fix for Spirit, is that it will by default force-kill the specific connections that are blocking it from acquiring an exclusive lock. This can be disabled with --skip-force-kill if needed.

Using Spirit `migration` as a Go package

I will assume that similar to our use-case, you are probably wrapping some sort of automation around Spirit. If this automation is written in Go, I would encourage you to use the Spirit API and not the CLI executable.

The following is a simplification of what we use ourselves:

func (sm *Spirit) Execute(ctx context.Context, m *ExecutableTask) error {
	startTime := time.Now()
	runner, err := migration.NewRunner(&migration.Migration{
		Host:              m.Cluster.Host,
		Username:          m.Cluster.Username,
		Password:          &m.Cluster.Password,
		Database:          m.Cluster.DatabaseName,
		Statement:         m.Statement,
		TargetChunkTime:   m.TargetChunkTime,
		Threads:           m.Concurrency,
		LockWaitTimeout:   m.LockWaitTimeout,
		InterpolateParams: true,
	})
	if err != nil {
		return errors.Wrap(err, "failed to create spirit migration runner")
	}
	defer runner.Close()
	if m.Metrics != nil {
		runner.SetMetricsSink(m.Metrics)
	}
	sm.Lock()
	sm.progressCallback = func() string {
		return runner.Progress().Summary
	}
	sm.Unlock()
	runner.SetLogger(m.Logger)
	if err = runner.Run(ctx); err != nil {
		return errors.Wrap(err, "failed to run spirit migration")
	}
	m.Logger.Infof("spirit migration completed in %s", time.Since(startTime))
	return nil
}

Documentation ¶

Overview ¶

Package migration contains the logic for running online schema changes.

Index ¶

type CutOver
- func NewCutOver(db *sql.DB, config []*cutoverConfig, feed change.Source, ...) (*CutOver, error)
- func (c *CutOver) Run(ctx context.Context) error
type Migration
- func (m *Migration) Run() error
- func (m *Migration) Validate() error
type Runner
- func NewRunner(m *Migration) (*Runner, error)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type CutOver ¶

type CutOver struct {
	// contains filtered or unexported fields
}

func NewCutOver ¶

func NewCutOver(db *sql.DB, config []*cutoverConfig, feed change.Source, dbConfig *dbconn.DBConfig, logger *slog.Logger) (*CutOver, error)

NewCutOver contains the logic to perform the final cut over. It can cutover multiple tables at once based on config. A replication feed which is used to ensure consistency before the cut over.

func (*CutOver) Run ¶

func (c *CutOver) Run(ctx context.Context) error

type Migration ¶

type Migration struct {
	Host         string  `name:"host" help:"Hostname" optional:""`
	Username     string  `name:"username" help:"User" optional:""`
	Password     *string `name:"password" help:"Password" optional:""`
	Database     string  `name:"database" help:"Database" optional:""`
	ConfFile     string  `name:"conf" help:"MySQL conf file" optional:"" type:"existingfile"`
	Table        string  `name:"table" help:"Table" optional:""`
	Alter        string  `name:"alter" help:"The alter statement to run on the table" optional:""`
	Threads      int     `name:"threads" help:"Number of concurrent threads for copy and checksum tasks" optional:"" default:"4"`
	WriteThreads int     `` /* 206-byte string literal not displayed */

	// EnableExperimentalAutoscaling turns on dynamic write-thread scaling driven
	// by throttler feedback; WriteThreads becomes the starting value and the
	// cap is fixed at 2x that (deliberately not configurable for now, to keep
	// the experimental surface small). See issue #831.
	EnableExperimentalAutoscaling bool          `` /* 187-byte string literal not displayed */
	TargetChunkTime               time.Duration `name:"target-chunk-time" help:"The target copy time for each chunk" optional:"" default:"500ms"`
	ReplicaDSN                    string        `` /* 157-byte string literal not displayed */
	ReplicaMaxLag                 time.Duration `` /* 127-byte string literal not displayed */
	LockWaitTimeout               time.Duration `name:"lock-wait-timeout" help:"The DDL lock_wait_timeout required for checksum and cutover" optional:"" default:"30s"`
	SkipDropAfterCutover          bool          `name:"skip-drop-after-cutover" help:"Keep old table after completing cutover" optional:"" default:"false"`
	DeferCutOver                  bool          `name:"defer-cutover" help:"Defer cutover (and checksum) until sentinel table is dropped" optional:"" default:"false"`
	SkipForceKill                 bool          `` /* 168-byte string literal not displayed */
	Statement                     string        `name:"statement" help:"The SQL statement to run (replaces --table and --alter)" optional:"" default:""`
	Lint                          bool          `name:"lint" help:"Run lint checks before running migration" optional:""`
	LintOnly                      bool          `name:"lint-only" help:"Run lint checks and exit without performing migration" optional:""`

	// TLS Configuration
	TLSMode            string `` /* 142-byte string literal not displayed */
	TLSCertificatePath string `name:"tls-ca" help:"Path to custom TLS CA certificate file" optional:""`

	// Buffered copy (the default) uses the DBLog algorithm for copying and
	// replication applying. It reads rows from the source and inserts them into
	// the target, rather than using INSERT IGNORE .. SELECT, and is also required
	// for cross-server moves. Unbuffered opts back into the legacy
	// INSERT IGNORE .. SELECT copier.
	Unbuffered bool `` /* 156-byte string literal not displayed */

	// EnableExperimentalGTID switches the change source from binlog file+position to MySQL GTIDs.
	// EXPERIMENTAL — see pkg/change/gtid.go. Requires gtid_mode=ON and
	// enforce_gtid_consistency=ON on the source.
	EnableExperimentalGTID bool `` /* 141-byte string literal not displayed */

	CheckpointMaxAge     time.Duration `name:"checkpoint-max-age" help:"Maximum age of a checkpoint before refusing to resume from it" optional:"" default:"168h"`
	ChecksumYieldTimeout time.Duration `` /* 203-byte string literal not displayed */

	// MaxCommitLatency throttles when observed commit latency exceeds this
	// threshold. Currently auto-enabled only on Aurora (auto-detected); the
	// default 100ms is intentionally a high upper bound to only cut the most
	// extreme tail latencies. See issue #468.
	MaxCommitLatency time.Duration `` /* 160-byte string literal not displayed */

	// Hidden options for now (supports more obscure cash/sq usecases)
	InterpolateParams bool `name:"interpolate-params" help:"Enable interpolate params for DSN" optional:"" default:"false" hidden:""`
	// Used for tests so we can concurrently execute without issues even though
	// the sentinel name is shared. Basically it will be true here, but false
	// in the tests unless we set it explicitly true.
	RespectSentinel bool `name:"respect-sentinel" help:"Look for sentinel table to exist and block if it does" optional:"" default:"true" hidden:""`
	// contains filtered or unexported fields
}

func (*Migration) Run ¶

func (m *Migration) Run() error

func (*Migration) Validate ¶ added in v0.11.0

func (m *Migration) Validate() error

Validate is called by Kong after parsing to check for invalid flag combinations. Zero values mean "use the default" (normalizeOptions fills them in), so they are not rejected here; only explicitly-negative or otherwise invalid values are caught.

type Runner ¶

type Runner struct {
	// contains filtered or unexported fields
}

func NewRunner ¶

func NewRunner(m *Migration) (*Runner, error)

func (*Runner) Cancel ¶

func (r *Runner) Cancel()

func (*Runner) Close ¶

func (r *Runner) Close() error

func (*Runner) DumpCheckpoint ¶

func (r *Runner) DumpCheckpoint(ctx context.Context) error

DumpCheckpoint is called approximately every minute. It writes the current state of the migration to the checkpoint table, which can be used in recovery. Previously resuming from checkpoint would always restart at the copier, but it can now also resume at the checksum phase.

func (*Runner) Progress ¶

func (r *Runner) Progress() status.Progress

func (*Runner) Run ¶

func (r *Runner) Run(ctx context.Context) error

func (*Runner) SetLogger ¶

func (r *Runner) SetLogger(logger *slog.Logger)

func (*Runner) SetMetricsSink ¶

func (r *Runner) SetMetricsSink(sink metrics.Sink)

func (*Runner) Status ¶

func (r *Runner) Status() string

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
check Package check provides various configuration and health checks that can be run against a sql.DB connection.	Package check provides various configuration and health checks that can be run against a sql.DB connection.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL

README ¶

Migration

What parts of the process are locking?

Using Spirit migration as a Go package

Documentation ¶

Overview ¶

Index ¶

Constants ¶

Variables ¶

Functions ¶

Types ¶

type CutOver ¶

func NewCutOver ¶

func (*CutOver) Run ¶

type Migration ¶

func (*Migration) Run ¶

func (*Migration) Validate ¶ added in v0.11.0

type Runner ¶

func NewRunner ¶

func (*Runner) Cancel ¶

func (*Runner) Close ¶

func (*Runner) DumpCheckpoint ¶

func (*Runner) Progress ¶

func (*Runner) Run ¶

func (*Runner) SetLogger ¶

func (*Runner) SetMetricsSink ¶

func (*Runner) Status ¶

Source Files ¶

Directories ¶

Using Spirit `migration` as a Go package