compact

package

v2.1.1 Published: Oct 23, 2025 License: BSD-3-Clause Imports: 20 Imported by: 0

Repository

github.com/cockroachdb/pebble

Documentation ¶

Index ¶

func SplitAndEncodeSpan(cmp base.Compare, span *keyspan.Span, upToKey []byte, tw sstable.RawWriter) error
type Frontiers
- func (f *Frontiers) Advance(k []byte)
- func (f *Frontiers) Init(cmp base.Compare)
- func (f *Frontiers) String() string
type Iter
- func NewIter(cfg IterConfig, pointIter base.InternalIterator, ...) *Iter
- func (i *Iter) Close() error
- func (i *Iter) Error() error
- func (i *Iter) First() *base.InternalKV
- func (i *Iter) ForceObsoleteDueToRangeDel() bool
- func (i *Iter) Frontiers() *Frontiers
- func (i *Iter) Next() *base.InternalKV
- func (i *Iter) SnapshotPinned() bool
- func (i *Iter) Span() *keyspan.Span
- func (i *Iter) Stats() IterStats
type IterConfig
type IterStats
type NeverSeparateValues
- func (NeverSeparateValues) Add(tw sstable.RawWriter, kv *base.InternalKV, forceObsolete bool) error
- func (NeverSeparateValues) EstimatedFileSize() uint64
- func (NeverSeparateValues) EstimatedReferenceSize() uint64
- func (NeverSeparateValues) FinishOutput() (ValueSeparationMetadata, error)
type OutputBlob
type OutputSplitter
- func NewOutputSplitter(cmp base.Compare, startKey []byte, limit []byte, targetFileSize uint64, ...) *OutputSplitter
- func (s *OutputSplitter) ShouldSplitBefore(nextUserKey []byte, estimatedFileSize uint64, equalPrevFn func([]byte) bool) ShouldSplit
- func (s *OutputSplitter) SplitKey() []byte
type OutputTable
type RangeDelSpanCompactor
- func MakeRangeDelSpanCompactor(cmp base.Compare, equal base.Equal, snapshots Snapshots, ...) RangeDelSpanCompactor
- func (c *RangeDelSpanCompactor) Compact(span, output *keyspan.Span)
type RangeKeySpanCompactor
- func MakeRangeKeySpanCompactor(cmp base.Compare, suffixCmp base.CompareRangeSuffixes, snapshots Snapshots, ...) RangeKeySpanCompactor
- func (c *RangeKeySpanCompactor) Compact(span, output *keyspan.Span)
type Result
- func (r Result) WithError(err error) Result
type Runner
- func NewRunner(cfg RunnerConfig, iter *Iter) *Runner
- func (r *Runner) Finish() Result
- func (r *Runner) FirstKey() []byte
- func (r *Runner) MoreDataToWrite() bool
- func (r *Runner) TableSplitLimit(startKey []byte) []byte
- func (r *Runner) WriteTable(objMeta objstorage.ObjectMetadata, tw sstable.RawWriter, limitKey []byte, ...)
type RunnerConfig
type ShouldSplit
- func (s ShouldSplit) String() string
type Snapshots
- func (s Snapshots) Index(seq base.SeqNum) int
- func (s Snapshots) IndexAndSeqNum(seq base.SeqNum) (int, base.SeqNum)
type Stats
type TombstoneElision
- func ElideTombstonesOutsideOf(inUseRanges []base.UserKeyBounds) TombstoneElision
- func NoTombstoneElision() TombstoneElision
- func SetupTombstoneElision(cmp base.Compare, v *manifest.Version, l0Organizer *manifest.L0Organizer, ...) (dels, rangeKeys TombstoneElision)
- func (e TombstoneElision) ElidesEverything() bool
- func (e TombstoneElision) ElidesNothing() bool
- func (e TombstoneElision) String() string
type ValueSeparation
type ValueSeparationMetadata

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func SplitAndEncodeSpan ¶

func SplitAndEncodeSpan(
	cmp base.Compare, span *keyspan.Span, upToKey []byte, tw sstable.RawWriter,
) error

SplitAndEncodeSpan splits a span at upToKey and encodes the first part into the table writer, and updates the span to store the remaining part.

If upToKey is nil or the span ends before upToKey, we encode the entire span and reset it to the empty span.

Note that the span.Start slice will be reused (it will be replaced with a copy of upToKey, if appropriate).

The span can contain either only RANGEDEL keys or only range keys.
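The splitting rule can be illustrated with a minimal, self-contained sketch. The `span` type and `splitSpan` helper below are hypothetical stand-ins, not Pebble's `keyspan.Span` or the real function: encoding into the table writer is omitted, keys are compared with `bytes.Compare`, and the handling of a span that starts at or after `upToKey` is an assumption of the sketch.

```go
package main

import (
	"bytes"
	"fmt"
)

// span is a hypothetical, simplified stand-in for keyspan.Span: it covers
// the user keys in [Start, End).
type span struct {
	Start, End []byte
}

// empty reports whether the span has been reset to the empty span.
func (s span) empty() bool { return s.Start == nil && s.End == nil }

// splitSpan returns the part of s that lies before upToKey and updates s to
// hold the remainder. If upToKey is nil or s ends at or before upToKey, the
// whole span is returned and s is reset to the empty span.
func splitSpan(s *span, upToKey []byte) span {
	if upToKey == nil || bytes.Compare(s.End, upToKey) <= 0 {
		first := *s
		*s = span{}
		return first
	}
	if bytes.Compare(s.Start, upToKey) >= 0 {
		// Assumption of this sketch: nothing lies before upToKey, so
		// nothing is split off and s is left untouched.
		return span{}
	}
	first := span{Start: s.Start, End: append([]byte(nil), upToKey...)}
	// The remainder starts at a copy of upToKey.
	s.Start = append([]byte(nil), upToKey...)
	return first
}

func main() {
	s := span{Start: []byte("a"), End: []byte("k")}
	first := splitSpan(&s, []byte("f"))
	fmt.Printf("first=[%s,%s) rest=[%s,%s)\n", first.Start, first.End, s.Start, s.End)
}
```

Calling `splitSpan` again with a nil `upToKey` would return the remainder `[f,k)` and leave `s` empty, mirroring the "encode the entire span and reset it" case.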

Types ¶

type Frontiers ¶

type Frontiers struct {
	// contains filtered or unexported fields
}

Frontiers is used to track progression of a task (e.g., compaction) across the keyspace. Clients that want to be informed when the task advances to a key ≥ some frontier may register a frontier, providing a callback. The task calls `Advance(k)` with each user key encountered, which invokes the `reached` func on all tracked Frontiers with `key`s ≤ k.

Internally, Frontiers is implemented as a simple heap.

func (*Frontiers) Advance ¶

func (f *Frontiers) Advance(k []byte)

Advance notifies all member Frontiers with keys ≤ k.

func (*Frontiers) Init ¶

func (f *Frontiers) Init(cmp base.Compare)

Init initializes a Frontiers for use.

func (*Frontiers) String ¶

func (f *Frontiers) String() string

String implements fmt.Stringer.

type Iter ¶

type Iter struct {
	// contains filtered or unexported fields
}

Iter provides a forward-only iterator that encapsulates the logic for collapsing entries during compaction. It wraps an internal iterator and collapses entries that are no longer necessary because they are shadowed by newer entries.

The simplest example of this is when the internal iterator contains two keys: a.PUT.2 and a.PUT.1. Instead of returning both entries, compact.Iter collapses the second entry because it is no longer necessary.

The high-level structure for compact.Iter is to iterate over its internal iterator and output 1 entry for every user-key. There are four complications to this story.

1. Eliding Deletion Tombstones

Consider the entries a.DEL.2 and a.PUT.1. These entries collapse to a.DEL.2. Do we have to output the entry a.DEL.2? Only if a.DEL.2 possibly shadows an entry at a lower level. If we're compacting to the base-level in the LSM tree then a.DEL.2 is definitely not shadowing an entry at a lower level and can be elided.

We can do slightly better than only eliding deletion tombstones at the base level by observing that we can elide a deletion tombstone if there are no sstables that contain the entry's key. This check is performed by elideTombstone.

2. Merges

The MERGE operation merges the value for an entry with the existing value for an entry. The logical value of an entry can be composed of a series of merge operations. When compact.Iter sees a MERGE, it scans forward in its internal iterator collapsing MERGE operations for the same key until it encounters a SET or DELETE operation. For example, the keys a.MERGE.4, a.MERGE.3, a.MERGE.2 will be collapsed to a.MERGE.4 and the values will be merged using the specified Merger.

An interesting case here occurs when MERGE is combined with SET. Consider the entries a.MERGE.3 and a.SET.2. The collapsed key will be a.SET.3. The kind is changed to SET because the SET operation acts as a barrier preventing further merging. This can be seen better in the scenario a.MERGE.3, a.SET.2, a.MERGE.1. The entry a.MERGE.1 may be at a lower (older) level and not involved in the compaction. If the compaction of a.MERGE.3 and a.SET.2 produced a.MERGE.3, a subsequent compaction with a.MERGE.1 would merge the values together incorrectly.
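The MERGE collapsing rule can be sketched as follows. The `entry` type and `collapseMerges` function are hypothetical, the merger is assumed to be simple string concatenation (oldest operand first), and only the MERGE and SET kinds are modeled. Deletions are deliberately left out of this sketch.

```go
package main

import "fmt"

type kind int

const (
	kindSet kind = iota
	kindMerge
)

// entry is a hypothetical, simplified internal key/value for one user key.
type entry struct {
	seq   uint64
	kind  kind
	value string
}

// collapseMerges collapses a leading run of MERGE entries for a single user
// key. entries must be ordered newest first. The merged value concatenates
// operands oldest first, standing in for a user-supplied Merger. If the run
// ends at a SET, the SET's value becomes the base of the merged value and
// the result kind changes to SET; older entries are not consumed.
func collapseMerges(entries []entry) entry {
	out := entries[0]
	if out.kind != kindMerge {
		return out
	}
	for _, e := range entries[1:] {
		out.value = e.value + out.value
		if e.kind == kindSet {
			out.kind = kindSet // the SET acts as a barrier
			break
		}
	}
	return out
}

func main() {
	merged := collapseMerges([]entry{
		{seq: 4, kind: kindMerge, value: "c"},
		{seq: 3, kind: kindMerge, value: "b"},
		{seq: 2, kind: kindMerge, value: "a"},
	})
	fmt.Println(merged.seq, merged.kind == kindMerge, merged.value)

	barrier := collapseMerges([]entry{
		{seq: 3, kind: kindMerge, value: "+x"},
		{seq: 2, kind: kindSet, value: "base"},
		{seq: 1, kind: kindMerge, value: "old"},
	})
	fmt.Println(barrier.seq, barrier.kind == kindSet, barrier.value)
}
```

In the second call the result keeps sequence number 3 but becomes a SET, and the older a.MERGE.1 operand is never folded in, which is the barrier behaviour described above.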

3. Snapshots

Snapshots are lightweight point-in-time views of the DB state. At its core, a snapshot is a sequence number along with a guarantee from Pebble that it will maintain the view of the database at that sequence number. Part of this guarantee is relatively straightforward to achieve. When reading from the database Pebble will ignore sequence numbers that are larger than the snapshot sequence number. The primary complexity with snapshots occurs during compaction: the collapsing of entries that are shadowed by newer entries is at odds with the guarantee that Pebble will maintain the view of the database at the snapshot sequence number. Rather than collapsing entries up to the next user key, compact.Iter can only collapse entries up to the next snapshot boundary. That is, every snapshot boundary potentially causes another entry for the same user-key to be emitted. Another way to view this is that snapshots define stripes and entries are collapsed within stripes, but not across stripes. Consider the following scenario:

a.PUT.9
a.DEL.8
a.PUT.7
a.DEL.6
a.PUT.5

In the absence of snapshots these entries would be collapsed to a.PUT.9. What if there is a snapshot at sequence number 7? The entries can be divided into two stripes and collapsed within the stripes:

a.PUT.9        a.PUT.9
a.DEL.8  --->
a.PUT.7
--             --
a.DEL.6  --->  a.DEL.6
a.PUT.5

All of the rules described earlier still apply, but they are confined to operate within a snapshot stripe. Snapshots only affect compaction when the snapshot sequence number lies within the range of sequence numbers being compacted. In the above example, a snapshot at sequence number 10 or at sequence number 5 would not have any effect.
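A minimal sketch of stripe-confined collapsing, assuming hypothetical `entry`, `stripe`, and `collapse` helpers that are not part of the package. It keeps only the newest entry per stripe and ignores tombstone elision, merges, and range deletions. Following the example above, an entry whose sequence number equals the snapshot's is placed in the newer stripe; that placement is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical, simplified internal key for a single user key.
type entry struct {
	kind string
	seq  uint64
}

// stripe returns the index of the snapshot stripe containing seq. snapshots
// must be in ascending order. Stripe 0 is the oldest stripe.
func stripe(snapshots []uint64, seq uint64) int {
	return sort.Search(len(snapshots), func(i int) bool { return snapshots[i] > seq })
}

// collapse keeps only the newest entry of each snapshot stripe. entries must
// share one user key and be ordered newest first.
func collapse(snapshots []uint64, entries []entry) []entry {
	var out []entry
	prev := -1
	for i, e := range entries {
		s := stripe(snapshots, e.seq)
		if i == 0 || s != prev {
			out = append(out, e) // first (newest) entry seen in this stripe
		}
		prev = s
	}
	return out
}

func main() {
	entries := []entry{
		{kind: "PUT", seq: 9},
		{kind: "DEL", seq: 8},
		{kind: "PUT", seq: 7},
		{kind: "DEL", seq: 6},
		{kind: "PUT", seq: 5},
	}
	fmt.Println(collapse(nil, entries))          // no snapshots
	fmt.Println(collapse([]uint64{7}, entries))  // snapshot at 7
	fmt.Println(collapse([]uint64{10}, entries)) // snapshot outside the range
}
```

With a snapshot at 7 the sketch keeps a.PUT.9 and a.DEL.6, and with no snapshot or a snapshot at 10 it keeps only a.PUT.9, matching the scenario described above.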

4. Range Deletions

Range deletions provide the ability to delete all of the keys (and values) in a contiguous range. Range deletions are stored indexed by their start key. The end key of the range is stored in the value. In order to support lookup of the range deletions which overlap with a particular key, the range deletion tombstones need to be fragmented whenever they overlap. This fragmentation is performed by keyspan.Fragmenter. The fragments are then subject to the rules for snapshots. For example, consider the two range tombstones [a,e)#1 and [c,g)#2:

2:     c-------g
1: a-------e

These tombstones will be fragmented into:

2:     c---e---g
1: a---c---e

Do we output the fragment [c,e)#1? Since it is covered by [c,e)#2 the answer depends on whether it is in a new snapshot stripe.

In addition to the fragmentation of range tombstones, compaction also needs to take the range tombstones into consideration when outputting normal keys. Just as with point deletions, a range deletion covering an entry can cause the entry to be elided.
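The fragmentation step can be illustrated with a naive sketch. The `tombstone` type and `fragment` function are hypothetical and use string keys for brevity. This is not the algorithm used by keyspan.Fragmenter, only a demonstration of the property that fragments either cover identical ranges or do not overlap.

```go
package main

import (
	"fmt"
	"sort"
)

// tombstone is a hypothetical, simplified range tombstone covering
// [start, end) at sequence number seq.
type tombstone struct {
	start, end string
	seq        uint64
}

// fragment splits tombstones at every start/end boundary so that any two
// resulting fragments either cover exactly the same key range or are
// disjoint. Fragments are returned ordered by start key, newest first.
func fragment(ts []tombstone) []tombstone {
	// Collect the distinct boundary keys.
	seen := map[string]bool{}
	var bounds []string
	for _, t := range ts {
		for _, k := range []string{t.start, t.end} {
			if !seen[k] {
				seen[k] = true
				bounds = append(bounds, k)
			}
		}
	}
	sort.Strings(bounds)

	// Emit one fragment per (adjacent boundary pair, covering tombstone).
	var out []tombstone
	for i := 0; i+1 < len(bounds); i++ {
		lo, hi := bounds[i], bounds[i+1]
		for _, t := range ts {
			if t.start <= lo && hi <= t.end {
				out = append(out, tombstone{start: lo, end: hi, seq: t.seq})
			}
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].start != out[j].start {
			return out[i].start < out[j].start
		}
		return out[i].seq > out[j].seq
	})
	return out
}

func main() {
	frags := fragment([]tombstone{
		{start: "a", end: "e", seq: 1},
		{start: "c", end: "g", seq: 2},
	})
	for _, f := range frags {
		fmt.Printf("[%s,%s)#%d\n", f.start, f.end, f.seq)
	}
}
```

For the two tombstones from the text, the sketch yields [a,c)#1, [c,e)#2, [c,e)#1 and [e,g)#2.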

A note on the stability of keys and values.

The stability guarantees of keys and values returned by the iterator tree that backs a compact.Iter are nuanced and care must be taken when referencing any returned items.

Keys and values returned by exported functions (i.e. First, Next, etc.) have lifetimes that fall into two categories:

Lifetime valid for duration of compaction. Range deletion keys and values are stable for the duration of the compaction, due to the way in which a compact.Iter is typically constructed (i.e. via (*compaction).newInputIter, which wraps the iterator over the range deletion block in a noCloseIter, preventing the release of the backing memory until the compaction is finished).

Lifetime limited to duration of sstable block liveness. Point keys (SET, DEL, etc.) and values must be cloned / copied following the return from the exported function, and before a subsequent call to Next advances the iterator and mutates the contents of the returned key and value.
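The hazard for point keys can be demonstrated with a toy iterator. `reusingIter` below is hypothetical: it only mimics an iterator that overwrites a shared buffer on each call to Next, which is the behaviour the note above warns about.

```go
package main

import "fmt"

// reusingIter is a hypothetical iterator that returns keys backed by a
// buffer it overwrites on every call to Next.
type reusingIter struct {
	keys []string
	pos  int
	buf  []byte
}

// Next returns the next key, or nil when exhausted. The returned slice is
// only valid until the following call to Next.
func (it *reusingIter) Next() []byte {
	if it.pos >= len(it.keys) {
		return nil
	}
	it.buf = append(it.buf[:0], it.keys[it.pos]...)
	it.pos++
	return it.buf
}

func main() {
	it := &reusingIter{keys: []string{"apple", "berry"}}

	k := it.Next()
	alias := k                             // shares the iterator's buffer
	stable := append([]byte(nil), k...)    // independent copy made before Next

	it.Next() // overwrites the shared buffer

	fmt.Printf("alias=%s stable=%s\n", alias, stable) // prints "alias=berry stable=apple"
}
```

Copying with `append([]byte(nil), k...)` before advancing is one way to retain a point key or value beyond the next call.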

func NewIter ¶

func NewIter(
	cfg IterConfig,
	pointIter base.InternalIterator,
	rangeDelIter, rangeKeyIter keyspan.FragmentIterator,
) *Iter

NewIter creates a new compaction iterator. See the comment for Iter for a detailed description. rangeDelIter and rangeKeyIter can be nil.

func (*Iter) Close ¶

func (i *Iter) Close() error

Close the iterator.

func (*Iter) Error ¶

func (i *Iter) Error() error

Error returns any error encountered.

Note that Close will return the error as well.

func (*Iter) First ¶

func (i *Iter) First() *base.InternalKV

First has the same semantics as InternalIterator.First.

func (*Iter) ForceObsoleteDueToRangeDel ¶

func (i *Iter) ForceObsoleteDueToRangeDel() bool

ForceObsoleteDueToRangeDel returns true in a subset of the cases when SnapshotPinned returns true. This value is true when the point is obsolete due to a RANGEDEL but could not be deleted due to a snapshot.

func (*Iter) Frontiers ¶

func (i *Iter) Frontiers() *Frontiers

Frontiers returns the frontiers for the compaction iterator.

func (*Iter) Next ¶

func (i *Iter) Next() *base.InternalKV

Next has the same semantics as InternalIterator.Next. Note that when Next returns a RANGEDEL or a range key, the caller can use Span() to get the corresponding span.

func (*Iter) SnapshotPinned ¶

func (i *Iter) SnapshotPinned() bool

SnapshotPinned returns whether the last point key returned by the compaction iterator was only returned because an open snapshot prevents its elision. This only applies to point keys, and not to range deletions or range keys.

func (*Iter) Span ¶

func (i *Iter) Span() *keyspan.Span

Span returns the range deletion or range key span corresponding to the current key. Can only be called right after a Next() call that returned a RANGEDEL or a range key. The keys in the span should not be retained or modified.

func (*Iter) Stats ¶

func (i *Iter) Stats() IterStats

Stats returns the compaction iterator stats.

type IterConfig ¶

type IterConfig struct {
	Comparer *base.Comparer
	Merge    base.Merge

	// The snapshot sequence numbers that need to be maintained. These sequence
	// numbers define the snapshot stripes.
	Snapshots Snapshots

	TombstoneElision TombstoneElision
	RangeKeyElision  TombstoneElision

	// IsBottommostDataLayer indicates that the compaction inputs form the
	// bottommost layer of data for the compaction's key range. This allows the
	// sequence number of KVs in the bottom snapshot stripe to be simplified to
	// 0 (which improves compression and enables an optimization during forward
	// iteration). This can be enabled if there are no tables overlapping the
	// output at lower levels (than the output) in the LSM.
	//
	// This field may be false even when nothing is overlapping in lower levels.
	// At the time of writing, flushes always set this to false (because flushes
	// almost never form the bottommost layer of data).
	IsBottommostDataLayer bool

	// IneffectualSingleDeleteCallback is called if a SINGLEDEL is being elided
	// without deleting a point set/merge. False positives are rare but possible
	// (because of delete-only compactions).
	IneffectualSingleDeleteCallback func(userKey []byte)

	// NondeterministicSingleDeleteCallback is called in compactions/flushes if any
	// single delete has consumed a Set/Merge, and there is another immediately older
	// Set/SetWithDelete/Merge. False positives are rare but possible (because of
	// delete-only compactions).
	NondeterministicSingleDeleteCallback func(userKey []byte)

	// MissizedDeleteCallback is called in compactions/flushes when a DELSIZED
	// tombstone is found that did not accurately record the size of the value it
	// deleted. This can lead to incorrect behavior in compactions.
	//
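	// Two cases are reported: a tombstone for a key that doesn't exist, and a
	// tombstone whose recorded size differs from the size of the key it elided.
	// In the second case, elidedSize and expectedSize will be set to the actual
	// size of the elided key and the expected size that was recorded in the
	// tombstone. In the first case (when a key doesn't exist), these will be 0.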
	MissizedDeleteCallback func(userKey []byte, elidedSize, expectedSize uint64)
}

IterConfig contains the parameters necessary to create a compaction iterator.

type IterStats ¶

type IterStats struct {
	// Count of DELSIZED keys that were missized.
	CountMissizedDels uint64
}

IterStats are statistics produced by the compaction iterator.

type NeverSeparateValues ¶ added in v2.1.0

type NeverSeparateValues struct{}

NeverSeparateValues is a ValueSeparation implementation that never separates values into external blob files. It is the default value if no ValueSeparation implementation is explicitly provided.

func (NeverSeparateValues) Add ¶ added in v2.1.0

func (NeverSeparateValues) Add(
	tw sstable.RawWriter, kv *base.InternalKV, forceObsolete bool,
) error

Add implements the ValueSeparation interface.

func (NeverSeparateValues) EstimatedFileSize ¶ added in v2.1.0

func (NeverSeparateValues) EstimatedFileSize() uint64

EstimatedFileSize implements the ValueSeparation interface.

func (NeverSeparateValues) EstimatedReferenceSize ¶ added in v2.1.0

func (NeverSeparateValues) EstimatedReferenceSize() uint64

EstimatedReferenceSize implements the ValueSeparation interface.

func (NeverSeparateValues) FinishOutput ¶ added in v2.1.0

func (NeverSeparateValues) FinishOutput() (ValueSeparationMetadata, error)

FinishOutput implements the ValueSeparation interface.

type OutputBlob ¶ added in v2.1.0

type OutputBlob struct {
	Stats blob.FileWriterStats
	// ObjMeta is metadata for the object backing the blob file.
	ObjMeta objstorage.ObjectMetadata
	// Metadata is metadata for the blob file.
	Metadata *manifest.PhysicalBlobFile
}

OutputBlob contains metadata about a blob file that was created during a compaction.

type OutputSplitter ¶

type OutputSplitter struct {
	// contains filtered or unexported fields
}

OutputSplitter is used to determine where to split output tables in a compaction.

An OutputSplitter is initialized when we start an output file:

  s := NewOutputSplitter(...)
  for nextKey != nil && !s.ShouldSplitBefore(nextKey, ...) {
    ...
  }
  splitKey := s.SplitKey()

OutputSplitter enforces a target file size. This splitter splits to a new output file when the estimated file size is 0.5x-2x the target file size. If there are overlapping grandparent files, this splitter will attempt to split at a grandparent boundary. For example, suppose a compaction wrote 'd' to the current output file, and the next key has a user key 'g':

	                            previous key   next key
	                                 |           |
	                                 |           |
	                 +---------------|----+   +--|----------+
	  grandparents:  |       000006  |    |   |  | 000007   |
	                 +---------------|----+   +--|----------+
	                 a    b          d    e   f  g       i

Splitting the output file F before 'g' will ensure that the current output file F does not overlap the grandparent file 000007. Aligning sstable boundaries like this can significantly reduce write amplification, since a subsequent compaction of F into the grandparent level will avoid needlessly rewriting any keys within 000007 that do not overlap F's bounds. Consider the following compaction:

                 +----------------------+
input            |                      |
level            +----------------------+
                            \/
         +---------------+       +---------------+
output   |XXXXXXX|       |       |      |XXXXXXXX|
level    +---------------+       +---------------+

The input-level file overlaps two files in the output level, but only partially. The beginning of the first output-level file and the end of the second output-level file will be rewritten verbatim. This write I/O is "wasted" in the sense that no merging is being performed.

To prevent the above waste, this splitter attempts to split output files before the start key of grandparent files. It still strives to write output files of approximately the target file size, by constraining this splitting at grandparent points to apply only if the current output's file size is about the right order of magnitude.

OutputSplitter guarantees that we never split user keys between files.

The dominant cost of OutputSplitter is one key comparison per ShouldSplitBefore call.
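The split policy described above can be approximated by the sketch below. `shouldSplitBefore` is a hypothetical free function with string keys, not the method on OutputSplitter. The 0.5x and 2x thresholds come from the text; how they combine with grandparent boundaries here is an assumption, and the real splitter tracks boundaries through Frontiers rather than scanning a slice.

```go
package main

import "fmt"

// shouldSplitBefore reports whether the current output should be split
// before nextKey. prevKey is the last user key written, estimatedSize is the
// estimated size of the current output, and grandparentStarts holds the
// start keys of the overlapping grandparent files.
func shouldSplitBefore(
	prevKey, nextKey string, estimatedSize, targetSize uint64, grandparentStarts []string,
) bool {
	if prevKey == nextKey {
		return false // never split a user key across files
	}
	if estimatedSize >= 2*targetSize {
		return true // upper bound on the output size
	}
	if estimatedSize < targetSize/2 {
		return false // too small to split, even at a boundary
	}
	// Between 0.5x and 2x: split only if a grandparent file starts in
	// (prevKey, nextKey], so the output ends before that file begins.
	for _, start := range grandparentStarts {
		if prevKey < start && start <= nextKey {
			return true
		}
	}
	return false
}

func main() {
	// Grandparent files 000006 and 000007 start at 'a' and 'f'.
	starts := []string{"a", "f"}
	fmt.Println(shouldSplitBefore("d", "g", 60, 100, starts))  // boundary 'f' crossed at 0.6x
	fmt.Println(shouldSplitBefore("d", "g", 40, 100, starts))  // boundary crossed but only 0.4x
	fmt.Println(shouldSplitBefore("b", "c", 200, 100, starts)) // 2x reached
}
```

In the example from the diagram, the previous key 'd' and next key 'g' straddle the start of 000007, so the sketch splits once the output has reached half the target size.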

func NewOutputSplitter ¶

func NewOutputSplitter(
	cmp base.Compare,
	startKey []byte,
	limit []byte,
	targetFileSize uint64,
	grandparentLevel manifest.LevelIterator,
	frontiers *Frontiers,
) *OutputSplitter

NewOutputSplitter creates a new OutputSplitter. See OutputSplitter for more information.

The limit must be either nil (no limit) or a key greater than startKey.

NewOutputSplitter registers the splitter with the provided Frontiers.

Note: it is allowed for the startKey to be behind the current frontier, as long as the key in the first ShouldSplitBefore call is at the frontier.

func (*OutputSplitter) ShouldSplitBefore ¶

func (s *OutputSplitter) ShouldSplitBefore(
	nextUserKey []byte, estimatedFileSize uint64, equalPrevFn func([]byte) bool,
) ShouldSplit

ShouldSplitBefore returns whether we should split the output before the next key. It is passed the current estimated file size and a function that reports whether a given key equals the previous user key.

The equalPrevFn function is used to guarantee no split user keys, without OutputSplitter copying each key internally. It is not performance sensitive, as it is only called once we decide to split.

Once ShouldSplitBefore returns SplitNow, it must not be called again. SplitKey() can be used to retrieve the recommended split key.

INVARIANT: nextUserKey must match the current frontier.

func (*OutputSplitter) SplitKey ¶

func (s *OutputSplitter) SplitKey() []byte

SplitKey returns the suggested split key - the first key at which the next output file should start.

If ShouldSplitBefore never returned SplitNow, then SplitKey returns the limit passed to NewOutputSplitter (which can be nil).

Otherwise, it returns a key <= the key passed to the last ShouldSplitBefore call and > the key passed to the previous call to ShouldSplitBefore. This key is guaranteed to be larger than the start key.

type OutputTable ¶

type OutputTable struct {
	CreationTime time.Time
	// ObjMeta is metadata for the object backing the table.
	ObjMeta objstorage.ObjectMetadata
	// WriterMeta is populated once the table is fully written. On compaction
	// failure (see Result), WriterMeta might not be set.
	WriterMeta sstable.WriterMetadata
	// BlobReferences is the list of blob references for the table.
	BlobReferences manifest.BlobReferences
	// BlobReferenceDepth is the depth of the blob references for the table.
	BlobReferenceDepth manifest.BlobReferenceDepth
}

OutputTable contains metadata about a table that was created during a compaction.

type RangeDelSpanCompactor ¶

type RangeDelSpanCompactor struct {
	// contains filtered or unexported fields
}

RangeDelSpanCompactor coalesces RANGEDELs within snapshot stripes and elides RANGEDELs in the last stripe if possible.

func MakeRangeDelSpanCompactor ¶

func MakeRangeDelSpanCompactor(
	cmp base.Compare, equal base.Equal, snapshots Snapshots, elision TombstoneElision,
) RangeDelSpanCompactor

MakeRangeDelSpanCompactor creates a new compactor for RANGEDEL spans.

func (*RangeDelSpanCompactor) Compact ¶

func (c *RangeDelSpanCompactor) Compact(span, output *keyspan.Span)

Compact compacts the given range del span and stores the results in the given output span, reusing its slices.

Compaction of a span entails coalescing RANGEDEL keys within snapshot stripes, and eliding RANGEDELs in the last stripe if possible.

It is possible for the output span to be empty after the call (if all RANGEDELs in the span are elided).

The spans that are passed to Compact calls must be ordered and non-overlapping.

type RangeKeySpanCompactor ¶

type RangeKeySpanCompactor struct {
	// contains filtered or unexported fields
}

RangeKeySpanCompactor coalesces range keys within snapshot stripes and elides RangeKeyDeletes and RangeKeyUnsets when possible. It is used as a container for at most one "compacted" span.

func MakeRangeKeySpanCompactor ¶

func MakeRangeKeySpanCompactor(
	cmp base.Compare,
	suffixCmp base.CompareRangeSuffixes,
	snapshots Snapshots,
	elision TombstoneElision,
) RangeKeySpanCompactor

MakeRangeKeySpanCompactor creates a new compactor for range key spans.

func (*RangeKeySpanCompactor) Compact ¶

func (c *RangeKeySpanCompactor) Compact(span, output *keyspan.Span)

Compact compacts the given range key span and stores the results in the given output span, reusing its slices.

Compaction of a span entails coalescing range keys within snapshot stripes, and eliding RangeKeyUnset/RangeKeyDelete in the last stripe if possible.

It is possible for the output span to be empty after the call (if all range keys in the span are elided).

The spans that are passed to Compact calls must be ordered and non-overlapping.

type Result ¶

type Result struct {
	// Err is the result of the compaction. On success, Err is nil and Tables
	// stores the output tables. On failure, Err is set and Tables stores the
	// tables created so far (and which need to be cleaned up).
	Err    error
	Tables []OutputTable
	Blobs  []OutputBlob
	Stats  Stats
}

Result stores the result of a compaction - more specifically, the "data" part where we use the compaction iterator to write output tables.

func (Result) WithError ¶

func (r Result) WithError(err error) Result

WithError returns a modified Result which has the Err field set.

type Runner ¶

type Runner struct {
	// contains filtered or unexported fields
}

Runner is a helper for running the "data" part of a compaction (where we use the compaction iterator to write output tables).

Sample usage:

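r := NewRunner(cfg, iter)
for r.MoreDataToWrite() {
  objMeta, tw := ... // Create object and table writer.
  limitKey, valueSeparation := ... // Choose the split limit and value separation policy.
  r.WriteTable(objMeta, tw, limitKey, valueSeparation)
}
result := r.Finish()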

func NewRunner ¶

func NewRunner(cfg RunnerConfig, iter *Iter) *Runner

NewRunner creates a new Runner.

func (*Runner) Finish ¶

func (r *Runner) Finish() Result

Finish closes the compaction iterator and returns the result of the compaction.

func (*Runner) FirstKey ¶ added in v2.1.0

func (r *Runner) FirstKey() []byte

FirstKey returns the first key that will be written; this can be a point key or the beginning of a range del or range key span.

FirstKey can only be called right after MoreDataToWrite() was called and returned true.

func (*Runner) MoreDataToWrite ¶

func (r *Runner) MoreDataToWrite() bool

MoreDataToWrite returns true if there is more data to be written.

func (*Runner) TableSplitLimit ¶

func (r *Runner) TableSplitLimit(startKey []byte) []byte

TableSplitLimit returns a hard split limit for an output table that starts at startKey, or nil if there is no limit. A non-nil limit is strictly greater than startKey.

func (*Runner) WriteTable ¶

func (r *Runner) WriteTable(
	objMeta objstorage.ObjectMetadata,
	tw sstable.RawWriter,
	limitKey []byte,
	valueSeparation ValueSeparation,
)

WriteTable writes a new output table. This table will be part of Result.Tables. Should only be called if MoreDataToWrite() returned true.

limitKey (if non-empty) forces the sstable to be finished before reaching this key.

WriteTable always closes the Writer.

type RunnerConfig ¶

type RunnerConfig struct {
	// CompactionBounds are the bounds containing all the input tables. All output
	// tables must fall within these bounds as well.
	CompactionBounds base.UserKeyBounds

	// L0SplitKeys is only set for flushes and it contains the flush split keys
	// (see L0Sublevels.FlushSplitKeys). These are split points enforced for the
	// output tables.
	L0SplitKeys [][]byte

	// Grandparents are the tables in level+2 that overlap with the files being
	// compacted. Used to determine output table boundaries. Do not assume that
	// the actual files in the grandparent when this compaction finishes will be
	// the same.
	Grandparents manifest.LevelSlice

	// MaxGrandparentOverlapBytes is the maximum number of bytes of overlap
	// allowed for a single output table with the tables in the grandparent level.
	MaxGrandparentOverlapBytes uint64

	// TargetOutputFileSize is the desired size of an individual table created
	// during compaction. In practice, the sizes can vary between 50%-200% of this
	// value.
	TargetOutputFileSize uint64

	// GrantHandle is used to perform accounting of resource consumption by the
	// CompactionScheduler.
	GrantHandle base.CompactionGrantHandle
}

RunnerConfig contains the parameters needed for the Runner.

type ShouldSplit ¶

type ShouldSplit bool

ShouldSplit indicates whether a compaction should split between output files. See the OutputSplitter type.

const (
	// NoSplit may be returned by an OutputSplitter to indicate that it does NOT
	// recommend splitting compaction output sstables between the previous key
	// and the next key.
	NoSplit ShouldSplit = false
	// SplitNow may be returned by an OutputSplitter to indicate that it does
	// recommend splitting compaction output sstables between the previous key
	// and the next key.
	SplitNow ShouldSplit = true
)

func (ShouldSplit) String ¶

func (s ShouldSplit) String() string

String implements the Stringer interface.

type Snapshots ¶

type Snapshots []base.SeqNum

Snapshots stores a list of snapshot sequence numbers, in ascending order.

Snapshots are lightweight point-in-time views of the DB state. At its core, a snapshot is a sequence number along with a guarantee from Pebble that it will maintain the view of the database at that sequence number. Part of this guarantee is relatively straightforward to achieve. When reading from the database Pebble will ignore sequence numbers that are larger than the snapshot sequence number. The primary complexity with snapshots occurs during compaction: the collapsing of entries that are shadowed by newer entries is at odds with the guarantee that Pebble will maintain the view of the database at the snapshot sequence number. Rather than collapsing entries up to the next user key, compact.Iter can only collapse entries up to the next snapshot boundary. That is, every snapshot boundary potentially causes another entry for the same user-key to be emitted. Another way to view this is that snapshots define stripes and entries are collapsed within stripes, but not across stripes. Consider the following scenario:

a.PUT.9
a.DEL.8
a.PUT.7
a.DEL.6
a.PUT.5

In the absence of snapshots these entries would be collapsed to a.PUT.9. What if there is a snapshot at sequence number 7? The entries can be divided into two stripes and collapsed within the stripes:

a.PUT.9        a.PUT.9
a.DEL.8  --->
a.PUT.7
--             --
a.DEL.6  --->  a.DEL.6
a.PUT.5

func (Snapshots) Index ¶

func (s Snapshots) Index(seq base.SeqNum) int

Index returns the index of the first snapshot sequence number which is >= seq or len(s) if there is no such sequence number.

func (Snapshots) IndexAndSeqNum ¶

func (s Snapshots) IndexAndSeqNum(seq base.SeqNum) (int, base.SeqNum)

IndexAndSeqNum returns the index of the first snapshot sequence number which is >= seq and that sequence number, or len(s) and InternalKeySeqNumMax if there is no such sequence number.

type Stats ¶

type Stats struct {
	CumulativePinnedKeys uint64
	CumulativePinnedSize uint64
	// CumulativeWrittenSize is the total size of all data written to output
	// objects.
	CumulativeWrittenSize uint64
	// CumulativeBlobReferenceSize is the total size of all blob references
	// written to output objects.
	CumulativeBlobReferenceSize uint64
	// CumulativeBlobFileSize is the total size of all data written to blob
	// output objects specifically.
	CumulativeBlobFileSize uint64
	CountMissizedDels      uint64
}

Stats describes stats collected during the compaction.

type TombstoneElision ¶

type TombstoneElision struct {
	// contains filtered or unexported fields
}

TombstoneElision is the information required to determine which tombstones (in the bottom snapshot stripe) can be elided. For example, when compacting into L6 (the lowest level), we can elide all tombstones (in the bottom snapshot stripe).

TombstoneElision can indicate that no tombstones can be elided, or it can store a set of key ranges where only tombstones that do NOT overlap those key ranges can be elided.

Note that the concept of "tombstone" applies to range keys as well: RangeKeyUnset and RangeKeyDelete are considered tombstones w.r.t other range keys and can use TombstoneElision.

func ElideTombstonesOutsideOf ¶

func ElideTombstonesOutsideOf(inUseRanges []base.UserKeyBounds) TombstoneElision

ElideTombstonesOutsideOf is used when tombstones can be elided if they don't overlap with a set of "in use" key ranges. These ranges must be ordered and disjoint.
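The "outside of the in-use ranges" test can be sketched with a binary search over the ordered, disjoint ranges. `keyRange`, `canElidePoint`, and `canElideRange` are hypothetical helpers with string keys. Treating each range's end as exclusive is an assumption of this sketch, whereas base.UserKeyBounds can express either kind of end boundary.

```go
package main

import (
	"fmt"
	"sort"
)

// keyRange is a hypothetical, simplified stand-in for base.UserKeyBounds
// covering [start, end).
type keyRange struct{ start, end string }

// canElidePoint reports whether a point tombstone at key lies outside every
// in-use range. inUse must be ordered and disjoint.
func canElidePoint(inUse []keyRange, key string) bool {
	// Find the first range that ends after key; only it can contain key.
	i := sort.Search(len(inUse), func(i int) bool { return inUse[i].end > key })
	return i == len(inUse) || key < inUse[i].start
}

// canElideRange reports whether a range tombstone covering [start, end)
// overlaps no in-use range.
func canElideRange(inUse []keyRange, start, end string) bool {
	// Find the first range that ends after start; only it can be the first
	// range overlapping the tombstone.
	i := sort.Search(len(inUse), func(i int) bool { return inUse[i].end > start })
	return i == len(inUse) || end <= inUse[i].start
}

func main() {
	inUse := []keyRange{{start: "b", end: "d"}, {start: "m", end: "p"}}
	fmt.Println(canElidePoint(inUse, "a"))      // before all ranges
	fmt.Println(canElidePoint(inUse, "c"))      // inside [b,d)
	fmt.Println(canElideRange(inUse, "d", "m")) // fits in the gap
	fmt.Println(canElideRange(inUse, "c", "e")) // overlaps [b,d)
}
```

The ordering and disjointness requirement is what makes a single binary search sufficient for each check.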

func NoTombstoneElision ¶

func NoTombstoneElision() TombstoneElision

NoTombstoneElision is used when no tombstones can be elided (e.g. the entire compaction range is in use).

func SetupTombstoneElision ¶

func SetupTombstoneElision(
	cmp base.Compare,
	v *manifest.Version,
	l0Organizer *manifest.L0Organizer,
	outputLevel int,
	compactionBounds base.UserKeyBounds,
) (dels, rangeKeys TombstoneElision)

SetupTombstoneElision calculates the TombstoneElision policies for a compaction operating on the given version and output level.

func (TombstoneElision) ElidesEverything ¶

func (e TombstoneElision) ElidesEverything() bool

ElidesEverything returns true if all tombstones (in the bottom snapshot stripe) can be elided.

func (TombstoneElision) ElidesNothing ¶

func (e TombstoneElision) ElidesNothing() bool

ElidesNothing returns true if no tombstones will be elided.

func (TombstoneElision) String ¶

func (e TombstoneElision) String() string

type ValueSeparation ¶ added in v2.1.0

type ValueSeparation interface {
	// EstimatedFileSize returns an estimate of the disk space consumed by the
	// current, pending blob file if it were closed now. If no blob file has
	// been created, it returns 0.
	EstimatedFileSize() uint64
	// EstimatedReferenceSize returns an estimate of the disk space consumed by
	// the current output sstable's blob references so far.
	EstimatedReferenceSize() uint64
	// Add adds the provided key-value pair to the provided sstable writer,
	// possibly separating the value into a blob file.
	Add(tw sstable.RawWriter, kv *base.InternalKV, forceObsolete bool) error
	// FinishOutput is called when a compaction is finishing an output sstable.
	// It returns the table's blob references, which will be added to the
	// table's TableMetadata, and stats and metadata describing a newly
	// constructed blob file if any.
	FinishOutput() (ValueSeparationMetadata, error)
}

ValueSeparation defines an interface for writing some values to separate blob files.

type ValueSeparationMetadata ¶ added in v2.1.0

type ValueSeparationMetadata struct {
	BlobReferences     manifest.BlobReferences
	BlobReferenceSize  uint64
	BlobReferenceDepth manifest.BlobReferenceDepth

	// The below fields are only populated if a new blob file was created.
	BlobFileStats    blob.FileWriterStats
	BlobFileObject   objstorage.ObjectMetadata
	BlobFileMetadata *manifest.PhysicalBlobFile
}

ValueSeparationMetadata describes metadata about a table's blob references, and optionally a newly constructed blob file.
