hashsplit

package module v1.0.1
Published: Jul 29, 2021 License: MIT Imports: 3 Imported by: 2

README

Hashsplit - content-based splitting of byte streams


Hashsplitting is a way of dividing a byte stream into pieces based on the stream's content rather than on any predetermined chunk size. As the Splitter reads the stream it maintains a rolling checksum of the last several bytes. A chunk boundary occurs when the rolling checksum has enough trailing bits set to zero (where “enough” is a configurable setting that determines the average chunk size).

Hashsplitting has benefits when it comes to representing multiple, slightly different versions of the same data. Consider, for example, the problem of adding EXIF tags to a JPEG image file. The tags appear near the beginning of the file, and the bulk of the image data follows. If the file were divided into chunks at (say) 8-kilobyte boundaries, then adding EXIF data near the beginning would alter every following chunk (except in the lucky case where the size of the added data is an exact multiple of 8KB). With hashsplitting, only the chunks in the vicinity of the change are affected.

A sequence of hashsplit chunks can additionally be organized into a tree for even better compaction. Each chunk has a “level” L determined by the rolling checksum, and each node in the tree has a level N. Tree nodes at level 0 collect chunks at level 0, up to and including a chunk at level L>0; then a new level-0 node begins. Tree nodes at level N>0 collect nodes at level N-1 up to and including a chunk at level L>N; then a new level-N node begins.

Hashsplitting is used to dramatically reduce storage and bandwidth requirements in projects like rsync, bup, and perkeep. More information, and a proposed standard, can be found at github.com/hashsplit/hashsplit-spec.

Note: an earlier version of this package included a Splitter.Split method, which allowed a Splitter s to consume all of the input from an io.Reader r. This has been removed. The same behavior can be obtained simply by doing:

_, err := io.Copy(s, r)
if err != nil { ... }
err = s.Close()

Documentation

Overview

Package hashsplit implements content-based splitting of byte streams.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Split

func Split(r io.Reader, f func([]byte, uint) error) error

Split hashsplits its input using a default Splitter and the given callback to process chunks. See NewSplitter for details about the callback.

Types

type Node

type Node struct {
	Nodes  []*Node
	Leaves [][]byte

	// Size is the number of bytes represented by this tree node.
	// For a level-0 node this is normally the lengths of the byte slices in Leaves, added together.
	// However, for some applications those byte slices are placeholders for the original data
	// (such as when the original data is saved aside to separate storage).
	// In those cases Size represents the original data, not the placeholder data in Leaves.
	//
	// For higher-level nodes, this is the sum of the Size fields in all child nodes.
	Size uint64

	// Offset is the byte position that this node represents in the original input stream,
	// before splitting.
	// It is equal to the sum of the Size fields of the siblings to this node's "left."
	// Applications can use the Offset field for random access by position to any chunk in the original input stream.
	Offset uint64
}

Node is a node in the tree built by a TreeBuilder. An interior node ("level 1" and higher) contains one or more subnodes as children. A leaf node ("level 0") contains one or more byte slices, which are normally hashsplit chunks of the input. Exactly one of Nodes and Leaves is non-empty.

func Seek

func Seek(node *Node, pos uint64) *Node

Seek finds the level-0 node representing the given byte position (i.e., the one where Offset <= pos < Offset+Size).
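Because every Node carries its own Offset and Size, and children partition their parent's byte range, the documented contract (Offset <= pos < Offset+Size) can be satisfied by a simple descent. The following is a sketch of how such a lookup can work on the exported Node shape, not the package's actual implementation:

```go
package main

import "fmt"

// Node mirrors the exported shape of hashsplit.Node.
type Node struct {
	Nodes  []*Node
	Leaves [][]byte
	Size   uint64
	Offset uint64
}

// seek descends to the level-0 node whose byte range contains pos,
// returning nil if pos is out of range. At each level it picks the
// first child whose range extends past pos; since children are
// contiguous and ordered by Offset, that child contains pos.
func seek(node *Node, pos uint64) *Node {
	if pos < node.Offset || pos >= node.Offset+node.Size {
		return nil
	}
	for len(node.Nodes) > 0 {
		found := false
		for _, child := range node.Nodes {
			if pos < child.Offset+child.Size {
				node, found = child, true
				break
			}
		}
		if !found {
			return nil
		}
	}
	return node
}

func main() {
	// A two-level tree over 10 bytes: leaves of sizes 4, 3, and 3.
	l0 := &Node{Leaves: [][]byte{[]byte("abcd")}, Size: 4, Offset: 0}
	l1 := &Node{Leaves: [][]byte{[]byte("efg")}, Size: 3, Offset: 4}
	l2 := &Node{Leaves: [][]byte{[]byte("hij")}, Size: 3, Offset: 7}
	root := &Node{Nodes: []*Node{l0, l1, l2}, Size: 10, Offset: 0}

	fmt.Println(seek(root, 5).Offset) // prints 4: byte 5 falls in the node starting at offset 4
}
```

Each step narrows the search to one child, so lookup cost is proportional to the tree's height rather than the number of chunks.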

type Splitter

type Splitter struct {
	// MinSize is the minimum chunk size.
	// Only the final chunk may be smaller than this.
	// This should always be >= 64,
	// which is the rolling checksum "window size."
	// If it's less than the size of the checksum window,
	// then the same window can span multiple chunks,
	// meaning a chunk boundary is not independent of the preceding chunk.
	// If you leave this set to zero,
	// 64 is what you'll get.
	// If you really mean "I want no minimum,"
	// set this to 1.
	MinSize int

	// SplitBits is the number of trailing zero bits in the rolling checksum required to produce a chunk.
	// The default (what you get if you leave it set to zero) is 13,
	// which means a chunk boundary occurs on average once every 8,192 bytes.
	//
	// (But thanks to math, that doesn't mean that 8,192 is the median chunk size.
	// The median chunk size is actually the logarithm, base (2^SplitBits-1)/(2^SplitBits), of 0.5.
	// That makes the median chunk size 5,678 when SplitBits==13.)
	SplitBits uint
	// contains filtered or unexported fields
}

Splitter hashsplits a byte sequence into chunks. It implements the io.WriteCloser interface.

Hashsplitting is a way of dividing a byte stream into pieces based on the stream's content rather than on any predetermined chunk size. As the Splitter reads the stream it maintains a rolling checksum of the last several bytes. A chunk boundary occurs when the rolling checksum has enough trailing bits set to zero (where "enough" is a configurable setting that determines the average chunk size).

Hashsplitting has benefits when it comes to representing multiple, slightly different versions of the same data. Consider, for example, the problem of adding EXIF tags to a JPEG image file. The tags appear near the beginning of the file, and the bulk of the image data follows. If the file were divided into chunks at (say) 8-kilobyte boundaries, then adding EXIF data near the beginning would alter every following chunk, except in the lucky case where the size of the added data is an exact multiple of 8KB. With hashsplitting, only the chunks in the vicinity of the change are affected.

Hashsplitting is used to dramatically reduce storage and bandwidth requirements in projects like rsync, bup, and perkeep.

func NewSplitter

func NewSplitter(f func([]byte, uint) error) *Splitter

NewSplitter produces a new Splitter with the given callback. The Splitter is an io.WriteCloser. As bytes are written to it, it finds chunk boundaries and calls the callback.

The callback receives the bytes of the chunk, and the chunk's "level," which is the number of extra trailing zeroes in the rolling checksum (in excess of Splitter.SplitBits).

Do not forget to call Close on the Splitter to flush any remaining chunk from its internal buffer.

func (*Splitter) Close

func (s *Splitter) Close() error

Close implements io.Closer. It is necessary to call Close to flush any buffered chunk remaining. Calling Close may result in a call to the callback in s. Close is idempotent: it can safely be called multiple times without error (and without producing the final chunk multiple times).

func (*Splitter) Write

func (s *Splitter) Write(inp []byte) (int, error)

Write implements io.Writer. May produce one or more calls to the callback in s, as chunks are discovered. Any error from the callback will cause Write to return early with that error.

type TreeBuilder

type TreeBuilder struct {
	// contains filtered or unexported fields
}

TreeBuilder assembles a sequence of chunks into a hashsplit tree.

A hashsplit tree provides another level of space-and-bandwidth savings over and above what Split gives you. Consider, again, the example of adding EXIF tags to a JPEG file. Although most chunks of the hashsplitted file will be the same before and after adding tags, the _list_ needed to reassemble those chunks into the original file will be very different: all the unaffected chunks must shift position to accommodate the new EXIF-containing chunks.

A hashsplit tree organizes that list into a tree instead, whose shape is determined by the content of the chunks, just as the chunk boundaries are. It has the property that only the tree nodes in the vicinity of the change will be affected. Most subtrees will remain the same.

Just as each chunk has a level L determined by the rolling checksum (see NewSplitter), so does each node in the tree have a level, N. Tree nodes at level 0 collect chunks at level 0, up to and including a chunk at level L>0; then a new level-0 node begins. Tree nodes at level N>0 collect nodes at level N-1 up to and including a chunk at level L>N; then a new level-N node begins.
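The collection rule above can be sketched with a toy builder. This is not the package's TreeBuilder (it omits pruning and other details, and the Node type merely mirrors the exported shape): it keeps one open node per level, and a chunk at level L closes the open nodes at levels 0 through L-1, each into its parent.

```go
package main

import "fmt"

// Node mirrors the exported shape of hashsplit.Node.
type Node struct {
	Nodes  []*Node
	Leaves [][]byte
	Size   uint64
	Offset uint64
}

// treeBuilder is a sketch of the rule described above: a level-N node
// collects level-(N-1) children up to and including one whose chunk
// level L exceeds N. Not the package's implementation.
type treeBuilder struct {
	levels []*Node // levels[n] is the open node at level n
	offset uint64
}

func (tb *treeBuilder) add(chunk []byte, level uint) {
	if len(tb.levels) == 0 {
		tb.levels = []*Node{{Offset: tb.offset}}
	}
	// The chunk joins the open level-0 node, even when it also ends it.
	tb.levels[0].Leaves = append(tb.levels[0].Leaves, chunk)
	for _, n := range tb.levels {
		n.Size += uint64(len(chunk))
	}
	tb.offset += uint64(len(chunk))

	// A chunk at level L closes the open nodes at levels 0..L-1.
	for n := uint(0); n < level; n++ {
		closed := tb.levels[n]
		if uint(len(tb.levels)) == n+1 {
			// Lazily open a parent; it inherits the closed child's span.
			tb.levels = append(tb.levels, &Node{Offset: closed.Offset, Size: closed.Size})
		}
		tb.levels[n+1].Nodes = append(tb.levels[n+1].Nodes, closed)
		tb.levels[n] = &Node{Offset: tb.offset}
	}
}

func (tb *treeBuilder) root() *Node {
	if len(tb.levels) == 0 {
		return &Node{}
	}
	// Close any still-open nodes into their parents, skipping empties.
	top := len(tb.levels) - 1
	for n := 0; n < top; n++ {
		node := tb.levels[n]
		if len(node.Leaves) > 0 || len(node.Nodes) > 0 {
			tb.levels[n+1].Nodes = append(tb.levels[n+1].Nodes, node)
		}
	}
	return tb.levels[top]
}

func main() {
	tb := new(treeBuilder)
	// Four chunks; the second has level 1, so it ends the first level-0 node.
	tb.add([]byte("aa"), 0)
	tb.add([]byte("bb"), 1)
	tb.add([]byte("cc"), 0)
	tb.add([]byte("dd"), 0)

	root := tb.root()
	fmt.Println(len(root.Nodes), root.Nodes[1].Offset) // prints: 2 4
}
```

The resulting root has two level-0 children: one holding "aa" and "bb" (ended by the level-1 chunk), one holding "cc" and "dd", with Size and Offset fields accounting for every byte.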

Example
var r io.Reader // Represents the source of some data.

tb := NewTreeBuilder()
err := Split(r, func(bytes []byte, level uint) error {
	tb.Add(bytes, len(bytes), level)
	return nil
})
if err != nil {
	panic(err)
}
// Get the root of the tree with tb.Root().
Example (FanOut)
var r io.Reader // Represents the source of some data.

tb := NewTreeBuilder()
err := Split(r, func(bytes []byte, level uint) error {
	// Map level to a smaller range for wider fan-out.
	tb.Add(bytes, len(bytes), level/4)
	return nil
})
if err != nil {
	panic(err)
}

// Get the root of the tree with tb.Root().
Example (SaveAside)
var r io.Reader // Represents the source of some data.

// Represents any function that replaces a chunk with a compact representation of that chunk
// (like a hash or a lookup key).
var saveAside func([]byte) ([]byte, error)

tb := NewTreeBuilder()
err := Split(r, func(chunk []byte, level uint) error {
	size := len(chunk)

	// Replace chunk with a compact representation.
	repr, err := saveAside(chunk)
	if err != nil {
		return err
	}

	// Store the compact representation in the tree,
	// not the chunk itself;
	// but use the original chunk's size.
	tb.Add(repr, size, level)
	return nil
})
if err != nil {
	panic(err)
}

// Get the root of the tree with tb.Root().

func NewTreeBuilder

func NewTreeBuilder() *TreeBuilder

NewTreeBuilder produces a new TreeBuilder.

func (*TreeBuilder) Add

func (tb *TreeBuilder) Add(bytes []byte, size int, level uint)

Add adds a new chunk to the TreeBuilder. It is typical to call this function in the callback of Split as each chunk is produced.

Normally, size should be len(bytes). However, some applications will prefer to save each split chunk aside to separate storage rather than place all chunks in the tree. In such a case, bytes will be a lookup key for recovering the original chunk, and size should be the original size of the chunk (not the size of the lookup key). This allows the Size and Offset fields of the nodes in the tree to be correct with respect to the original data.

The level of a chunk is normally the level value passed to the Split callback. It results in the creation of a new Node at the given level. However, this produces a tree with an average branching factor of 2. For wider fan-out (more children per node), the caller can reduce the value of level.

func (*TreeBuilder) Root

func (tb *TreeBuilder) Root() *Node

Root produces the root of the tree after all nodes have been added with calls to Add.
