Documentation
¶
Overview ¶
Package valblk implements value blocks for sstables.
Value blocks are supported in TableFormatPebblev3.
1. Motivation and overview
Value blocks are a mechanism designed for sstables storing MVCC data, where there can be many versions of a key that need to be kept, but only the latest value is typically read (see the documentation for Comparer.Split regarding MVCC keys). The goal is faster reads. Unlike Pebble versions, which can be eagerly thrown away (except when there are snapshots), MVCC versions are long-lived (e.g. default CockroachDB garbage collection threshold for older versions is 24 hours) and can significantly slow down reads. We have seen CockroachDB production workloads with very slow reads due to: - 100s of versions for each key in a table.
- Tables with mostly MVCC garbage consisting of 2 versions per key -- a real key-value pair, followed by a key-value pair whose value (usually with zero byte length) indicates it is an MVCC tombstone.
The value blocks mechanism attempts to improve read throughput in these cases when the key size is smaller than the value sizes of older versions. This is done by moving the value of an older version to a value block in a different part of the sstable. This improves spatial locality of the data being read by the workload, which increases caching effectiveness.
Additionally, even when the key size is not smaller than the value of older versions (e.g. secondary indexes in CockroachDB), TableFormatPebblev3 stores the result of key comparisons done at write time inside the sstable, which makes stepping from one key prefix to the next prefix (i.e., skipping over older versions of a MVCC key) more efficient by avoiding key comparisons and key decoding. See the results in https://github.com/cockroachdb/pebble/v2/pull/2149 and more details in the comment inside BenchmarkIteratorScanNextPrefix. These improvements are also visible in end-to-end CockroachDB tests, as outlined in https://github.com/cockroachdb/cockroach/pull/96652.
In TableFormatPebblev3, each SET has a one byte value prefix that tells us whether the value is in-place or in a value block. This 1 byte prefix encodes additional information:
ShortAttribute: This is an attribute of the value. Currently, CockroachDB uses it to represent whether the value is a tombstone or not. This avoids the need to fetch a value from the value block if the caller only wants to figure out whether it is an MVCC tombstone. The length of the value is another attribute that the caller can be interested in, and it is also accessible without reading the value in the value block (see the value handle in the details section).
SET-same-prefix: this enables the aforementioned optimization when stepping from one key prefix to the next key prefix.
We further optimize this iteration over prefixes by using the restart points in a block to encode whether the SET at a restart point has the same prefix since the last restart point. This allows us to skip over restart points within the same block. See the comment in blockWriter, and how both SET-same-prefix and the restart point information is used in blockIter.nextPrefixV3.
This flexibility of values that are in-place or in value blocks requires flexibility in the iterator interface. The InternalIterator interface returns a LazyValue instead of a byte slice. Additionally, pebble.Iterator allows the caller to ask for a LazyValue. See lazy_value.go for details, including the memory lifetime management.
For historical discussions about this feature, see the issue https://github.com/cockroachdb/pebble/v2/issues/1170 and the prototype in https://github.com/cockroachdb/pebble/v2/pull/1443.
The code in this file mainly covers value block and related encodings. We discuss these in the next section.
2. Details
Note that the notion of the latest value is local to the sstable. It is possible that that latest value has been deleted by a sstable in a higher level, and what is the latest value from the perspective of the whole LSM is an older MVCC version. This only affects performance and not correctness. This local knowledge is also why we continue to store these older versions in the same sstable -- we need to be able to conveniently read them. The code in this file is agnostic to the policy regarding what should be stored in value blocks -- it allows even the latest MVCC version to be stored in a value block. The policy decision in made in the sstable.Writer. See Writer.makeAddPointDecisionV3.
Data blocks contain two kinds of SET keys: those with in-place values and those with a value handle. To distinguish these two cases we use a single byte prefix (ValuePrefix). This single byte prefix is split into multiple parts, where nb represents information that is encoded in n bits.
+---------------+--------------------+-----------+--------------------+ | value-kind 2b | SET-same-prefix 1b | unused 2b | short-attribute 3b | +---------------+--------------------+-----------+--------------------+
The 2 bit value-kind specifies whether this is an in-place value or a value handle pointing to a value block. We use 2 bits here for future representation of values that are in separate files. The 1 bit SET-same-prefix is true if this key is a SET and is immediately preceded by a SET that shares the same prefix. The 3 bit short-attribute is described in base.ShortAttribute -- it stores user-defined attributes about the value. It is unused for in-place values.
Value Handle and Value Blocks: valueHandles refer to values in value blocks. Value blocks are simpler than normal data blocks (that contain key-value pairs, and allow for binary search), which makes them cheap for value retrieval purposes. A valueHandle is a tuple (valueLen, blockNum, offsetInBlock), where blockNum is the 0 indexed value block number and offsetInBlock is the byte offset in that block containing the value. The valueHandle.valueLen is included since there are multiple use cases in CockroachDB that need the value length but not the value, for which we can avoid reading the value in the value block (see https://github.com/cockroachdb/pebble/v2/issues/1170#issuecomment-958203245).
A value block has a checksum like other blocks, and is optionally compressed. An uncompressed value block is a sequence of values with no separator or length (we rely on the valueHandle to demarcate). The valueHandle.offsetInBlock points to the value, of length valueHandle.valueLen. While writing a sstable, all the (possibly compressed) value blocks need to be held in-memory until they can be written. Value blocks are placed after the "meta rangedel" and "meta range key" blocks since value blocks are considered less likely to be read.
Meta Value Index Block: Since the (key, valueHandle) pair are written before there is any knowledge of the byte offset of the value block in the file, or its compressed length, we need another lookup to map the valueHandle.blockNum to the information needed to read it from the file. This information is provided by the "value index block". The "value index block" is referred to by the metaindex block. The design intentionally avoids making the "value index block" a general purpose key-value block, since each caller wants to lookup the information for a particular blockNum (there is no need for SeekGE etc.). Instead, this index block stores a sequence of (blockNum, blockOffset, blockLength) tuples, where the blockNums are consecutive integers, and the tuples are encoded with a fixed width encoding. This allows a reader to find the tuple for block K by looking at the offset K*fixed-width. The fixed width for each field is decided by looking at the maximum value of each of these fields. As a concrete example of a large sstable with many value blocks, we constructed a 100MB sstable with many versions and had 2475 value blocks (~32KB each). This sstable had this tuple encoded using 2+4+2=8 bytes, which means the uncompressed value index block was 2475*8=~19KB, which is modest. Therefore, we don't support more than one value index block. Consider the example of 2 byte blockNum, 4 byte blockOffset and 2 byte blockLen. The value index block will look like:
+---------------+------------------+---------------+ | blockNum (2B) | blockOffset (4B) | blockLen (2B) | +---------------+------------------+---------------+ | 0 | 7,123,456 | 30,000 | +---------------+------------------+---------------+ | 1 | 7,153,456 | 20,000 | +---------------+------------------+---------------+ | 2 | 7,173,456 | 25,567 | +---------------+------------------+---------------+ | .... | ... | ... |
The metaindex block contains the valueBlocksIndexHandle which in addition to the BlockHandle also specifies the widths of these tuple fields. In the above example, the valueBlockIndexHandle.{blockNumByteLength,blockOffsetByteLength,blockLengthByteLength} will be (2,4,2).
Index ¶
- Constants
- func DecodeBlockHandleFromIndex(vbiBlock []byte, blockNum uint32, indexHandle IndexHandle) (block.Handle, error)
- func DecodeIndex(data []byte, vbih IndexHandle) ([]block.Handle, error)
- func DecodeLenFromHandle(src []byte) (uint32, []byte)
- func EncodeHandle(dst []byte, v Handle) int
- func EncodeIndexHandle(dst []byte, v IndexHandle) int
- type ExternalBlockReader
- type Handle
- type IndexHandle
- type IteratorBlockReader
- type LayoutWriter
- type Reader
- type ReaderProvider
- type Writer
- type WriterStats
Constants ¶
const HandleMaxLen = 3 * binary.MaxVarintLen32
HandleMaxLen is the maximum length of a variable-width encoded Handle.
Handle fields are varint encoded, so maximum 5 bytes each. This could alternatively be group varint encoded, but experiments were inconclusive (https://github.com/cockroachdb/pebble/v2/pull/1443#issuecomment-1270298802).
Variables ¶
This section is empty.
Functions ¶
func DecodeBlockHandleFromIndex ¶
func DecodeBlockHandleFromIndex( vbiBlock []byte, blockNum uint32, indexHandle IndexHandle, ) (block.Handle, error)
DecodeBlockHandleFromIndex decodes the block handle for the given block number from the provided index block.
func DecodeIndex ¶
func DecodeIndex(data []byte, vbih IndexHandle) ([]block.Handle, error)
DecodeIndex decodes the entirety of a value block index, returning a slice of the contained block handles.
func DecodeLenFromHandle ¶
DecodeLenFromHandle decodes the value length from the beginning of a variaible-width encoded Handle.
func EncodeHandle ¶
EncodeHandle encodes the Handle into dst in a variable-width encoding and returns the number of bytes encoded.
func EncodeIndexHandle ¶
func EncodeIndexHandle(dst []byte, v IndexHandle) int
EncodeIndexHandle encodes the IndexHandle into dst and returns the number of bytes written.
Types ¶
type ExternalBlockReader ¶
type ExternalBlockReader interface {
ReadValueBlockExternal(context.Context, block.Handle) (block.BufferHandle, error)
}
ExternalBlockReader is used to read value blocks from a file outside the context of a open sstable iterator reading the file.
type Handle ¶
Handle is stored with a key when the value is in a value block. This handle is the pointer to that value.
func DecodeHandle ¶
DecodeHandle decodes a Handle from a variable-length encoding.
func DecodeRemainingHandle ¶
DecodeRemainingHandle decodes the BlockNum and OffsetInBlock from a variable-width encoded Handle, assuming the ValueLen prefix has already been stripped from src.
type IndexHandle ¶
type IndexHandle struct {
Handle block.Handle
BlockNumByteLength uint8
BlockOffsetByteLength uint8
BlockLengthByteLength uint8
}
IndexHandle is placed in the metaindex if there are any value blocks. If there are no value blocks, there is no value blocks index, and no entry in the metaindex. Note that the lack of entry in the metaindex should not be used to ascertain whether the values are prefixed, since the former is an emergent property of the data that was written and not known until all the key-value pairs in the sstable are written.
func DecodeIndexHandle ¶
func DecodeIndexHandle(src []byte) (IndexHandle, int, error)
DecodeIndexHandle decodes an IndexHandle from src and returns the number of bytes read.
func (IndexHandle) RowWidth ¶
func (h IndexHandle) RowWidth() int
RowWidth returns the number of bytes that the referenced index block uses per value block.
func (IndexHandle) String ¶
func (h IndexHandle) String() string
String returns a string representation of the IndexHandle.
type IteratorBlockReader ¶
type IteratorBlockReader interface {
ReadValueBlock(block.Handle, *base.InternalIteratorStats) (block.BufferHandle, error)
}
IteratorBlockReader is used to read value blocks from within an open file.
type LayoutWriter ¶
type LayoutWriter interface {
// WriteValueBlock writes a pre-finished value block (with the trailer) to
// the writer.
WriteValueBlock(blk block.PhysicalBlock) (block.Handle, error)
// WriteValueIndexBlock writes a pre-finished value block index to the
// writer.
WriteValueIndexBlock(blk block.PhysicalBlock, vbih IndexHandle) (block.Handle, error)
}
LayoutWriter defines the interface for a writer that writes out serialized value and value index blocks.
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader implements GetLazyValueForPrefixAndValueHandler; it is used to create LazyValues (each of which can can be used to retrieve a value in a value block). It is used when the sstable was written with Properties.ValueBlocksAreEnabled. The lifetime of this object is tied to the lifetime of the sstable iterator.
func MakeReader ¶
func MakeReader( i IteratorBlockReader, rp ReaderProvider, vbih IndexHandle, stats *base.InternalIteratorStats, ) Reader
MakeReader constructs a Reader.
func (*Reader) GetInternalValueForPrefixAndValueHandle ¶
func (r *Reader) GetInternalValueForPrefixAndValueHandle(handle []byte) base.InternalValue
GetInternalValueForPrefixAndValueHandle returns an InternalValue for the given value prefix and value.
The result is only valid until the next call to GetInternalValueForPrefixAndValueHandle. Use InternalValue.Clone if the lifetime of the InternalValue needs to be extended. For more details, see the "memory management" comment where LazyValue is declared.
type ReaderProvider ¶
type ReaderProvider interface {
GetReader(context.Context) (ExternalBlockReader, error)
Close()
}
ReaderProvider supports the implementation of blockProviderWhenClosed. GetReader and Close can be called multiple times in pairs.
type Writer ¶
type Writer struct {
// contains filtered or unexported fields
}
Writer writes a sequence of value blocks, and the value blocks index, for a sstable.
func NewWriter ¶
func NewWriter( flushGovernor block.FlushGovernor, compressor *block.Compressor, checksumType block.ChecksumType, blockFinishedFunc func(compressedSize int), ) *Writer
NewWriter creates a new Writer of value blocks and value index blocks.
func (*Writer) AddValue ¶
AddValue adds a value to the writer, returning a Handle referring to the stored value.
func (*Writer) Finish ¶
func (w *Writer) Finish(layout LayoutWriter, fileOffset uint64) (IndexHandle, WriterStats, error)
Finish finishes writing the value blocks and value blocks index, writing the buffered blocks out to the provider layout writer.
type WriterStats ¶
type WriterStats struct {
NumValueBlocks uint64
NumValuesInValueBlocks uint64
// Includes both value blocks and value index block.
ValueBlocksAndIndexSize uint64
}
WriterStats contains statistics about the value blocks and value index block written by a Writer.