Go-Sync

gosync is a library inspired by zsync and rsync. The intent is that it's easier to build upon than the zsync/rsync codebases. Writing it in Go makes it easier to keep cross-platform and to take advantage of multiple CPUs, and the codebase comes with built-in benchmarks, code documentation and unit tests.

There are many areas that benefit from the use of multiple threads & connections:

  • Making use of multiple HTTP connections, to stop the bandwidth-latency product from limiting transfer rates to remote servers
  • Making use of multiple CPUs while doing the comparisons

gosync includes a commandline tool that's intended as a starting point for building the functionality that you want.

Zsync modified rsync to allow it to be used against a "dumb" http server, but we can go further:

  • Arrange files by hashed path and checksum: if a file hasn't changed, you can serve the existing version (NB: this works well with s3 sync; a sketch follows this list)
  • Split the checksum blocks from the data: serve the checksum blocks securely over https, while allowing caching of the data over http
  • Split up the data files: improve the fallback behaviour when there's a non-HTTP/1.1 proxy between the client and server
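
For the first point, one possible arrangement is a content-addressed layout derived from the file's checksum. This is a minimal sketch, assuming a SHA-256 content hash and a blobs/<prefix>/<hash> path scheme (both are illustrative assumptions, not part of gosync):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path"
)

// contentAddressedPath derives a storage path from a file's content
// hash, so an unchanged file maps to the same object and the existing
// version can be served (or s3-synced) without re-uploading it.
func contentAddressedPath(filename string) (string, error) {
	f, err := os.Open(filename)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}

	sum := hex.EncodeToString(h.Sum(nil))
	// shard by the first two hex characters to keep directories small
	return path.Join("blobs", sum[:2], sum), nil
}

func main() {
	p, err := contentAddressedPath("payload.bin")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println(p)
}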

Current State

The command-line tools are fleshed out enough for testing comparison behaviour and profiling it against real input files. NB: The command-line tools are not in a state for use in production!

There is a basic HTTP BlockSource, which currently supports fixed-size blocks (no compression on the source), but which should be able to use multiple TCP connections to increase transfer speed where latency is a bottleneck.

Work needs to be done to add support for a BlockSourceResolver that can deal with compressed blocks. Some changes and refactoring need to happen where there are assumptions about a source block being of 'blocksize': in order to optimize the transmitted size, blocks should be gzipped (a sketch follows below). Being able to support multiple source files would also be good.
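
To make that concrete, a compressed source might be written roughly as follows: each block is gzipped independently so it can still be fetched on its own, and the compressed offsets are recorded so that a resolver could translate block indices into byte ranges. The names and structure here are illustrative assumptions, not gosync's API:

package blocksketch

import (
	"compress/gzip"
	"io"
	"os"
)

// WriteBlocksCompressed writes each fixed-size block of in as an
// independent gzip stream and returns the starting offset of every
// compressed block, so that a resolver could map a block index to a
// byte range in the output file.
func WriteBlocksCompressed(in io.Reader, out *os.File, blockSize int) ([]int64, error) {
	var offsets []int64
	buf := make([]byte, blockSize)

	for {
		n, err := io.ReadFull(in, buf)
		if n > 0 {
			pos, serr := out.Seek(0, io.SeekCurrent)
			if serr != nil {
				return nil, serr
			}
			offsets = append(offsets, pos)

			// each block is a self-contained gzip member, so it can be
			// fetched and decompressed without the rest of the file
			gz := gzip.NewWriter(out)
			if _, werr := gz.Write(buf[:n]); werr != nil {
				return nil, werr
			}
			if werr := gz.Close(); werr != nil {
				return nil, werr
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return offsets, nil
		}
		if err != nil {
			return nil, err
		}
	}
}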

After that, the CLI command code needs some cleanup: it's pretty verbose and duplicates a lot of logic. Things like version numbers in the files would be good, followed by the implementation of more features to make it potentially usable.

Performance

On an 8 MB file with few matches, I'm hitting about 16 MB/s with 4 threads. I think that we're mostly CPU bound, and should scale reasonably well with more processors. When the files are very similar, the speed is far higher (since the weak checksum match is significantly cheaper).

Some numbers:

Budget for an 8 MB/s byte-by-byte comparison on a single thread: ~120 ns per byte (1 s / 8,000,000 bytes ≈ 125 ns).

Current benchmark state (Go 1.4):

  • Checksum: 50.3 ns
  • Comparison (No index lookup)
    • Weak Rejection: 68.6 ns
    • Strong Rejection: 326 ns

Generating a gosync file for a 22 GB file (not on an SSD) took me around 2m31s ~= 145 MB/s sustained checksum generation. The resulting file was around 50 MB and does not compress well (which makes sense, since it's hashes with a hopefully near-random distribution).

The 32-bit Rollsum hash produces far fewer false positives than the 16-bit one, with the same 4-byte hash overhead.
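
For context on why weak-checksum rejection is so cheap, here is a minimal sketch of the classic rsync-style rolling sum; it illustrates the technique and is not necessarily the exact Rollsum implementation in this repository:

package rollsketch

// RollingSum is the classic rsync-style weak checksum: a is the sum of
// the bytes in the window, b is the sum of the running values of a.
type RollingSum struct {
	a, b, n uint32
}

// Init computes the sums over the initial window.
func (r *RollingSum) Init(window []byte) {
	r.a, r.b = 0, 0
	r.n = uint32(len(window))
	for i, c := range window {
		r.a += uint32(c)
		r.b += uint32(len(window)-i) * uint32(c)
	}
}

// Roll advances the window by one byte: drop out, add in. This is O(1)
// per byte, which is why rejecting a non-matching offset is so much
// cheaper than computing a strong checksum.
func (r *RollingSum) Roll(out, in byte) {
	r.a += uint32(in) - uint32(out)
	r.b += r.a - r.n*uint32(out)
}

// Sum32 packs a and b into a single 32-bit value.
func (r *RollingSum) Sum32() uint32 {
	return (r.a & 0xffff) | (r.b << 16)
}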

Index generation:

  • Easily hits 100 MB/s on my workstation, satisfying the idea that you should be able to build 12 GB payloads in ~1m
  • More likely to be bottlenecked by the disk throughput / seek time than the CPU

TODO

  • gzip source blocks (this involves writing out a version of the file that's compressed in block-increments)
  • support variable length source blocks
  • Provide more helpers to make common usage easier (multi-threading etc)
  • Clean up naming consistency and clarity: Block / Chunk etc
  • Flesh out full directory build / sync
  • Implement 'patch' payloads from a known start point to a desired end state
  • Validate downloaded blocks
  • Validate full file checksum after patching
  • Provide bandwidth limiting / monitoring as part of http blocksource
  • Think about turning the filechecksum into an interface
  • Avoid marshalling / unmarshalling blocks during checksum generation
  • Sequential patcher to resume after error?

Testing

Unit tests:

    go test github.com/Redundancy/go-sync/...

Command line & files:

    go build github.com/Redundancy/go-sync/gosync
    gosync build filenameToPatchTo
    gosync patch filenameToPatchFrom filenameToPatchTo.gosync filenameToPatchTo

Note that normally, patching would rely on a remote http/https file source.

Documentation

Overview

gosync is inspired by zsync, and rsync. It aims to take the fundamentals and create a very flexible library that can be adapted to work in many ways.

We rely heavily on built-in Go abstractions like io.Reader and hash.Hash, as well as our own interfaces; this makes the code easier to change and to test. In particular, no part of the core library should know anything about the transport or layout of the reference data. If you want to do rsync-style transfers with http/https range requests, that's just as good as zsync client-server over an SSH tunnel. The goal is also to allow support for multiple concurrent connections, so that you can make the best use of your line in the face of the bandwidth-latency product (or other concerns that require concurrency to solve).

The following optimizations are possible (a sketch of a ranged request follows this list):

  • Generate hashes with multiple threads (both during reference generation and local file interrogation)
  • Multiple ranged requests (can even be used to get the hashes)
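
The ranged requests themselves are plain net/http. This sketch fetches one byte range; the URL and offsets are placeholders, and this is an illustration rather than the library's HttpBlockSource:

package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchRange requests bytes [start, end] (inclusive) of a resource.
// Several of these can run on separate goroutines and connections to
// work around the bandwidth-latency product on high-latency links.
func fetchRange(url string, start, end int64) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// a server that honours Range replies with 206 Partial Content
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("server did not honour the range request: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	data, err := fetchRange("http://example.com/content", 0, 3)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Printf("%q\n", data)
}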

Example
// Note: this example uses the same imports as the HttpBlockSource example
// below; PrintLocalSpans, PrintReferenceSpans, ToPatcherMissingSpan and
// ToPatcherFoundSpan are helpers defined alongside the example code.
// Due to the short example strings, use a very small block size;
// using one this small in practice would increase your file transfer!
const BLOCK_SIZE = 4

// This is the "file" as described by the authoritative version
const REFERENCE = "The quick brown fox jumped over the lazy dog"

// This is what we have locally. Not too far off, but not correct.
const LOCAL_VERSION = "The qwik brown fox jumped 0v3r the lazy"

generator := filechecksum.NewFileChecksumGenerator(BLOCK_SIZE)

_, referenceFileIndex, _, err := indexbuilder.BuildIndexFromString(generator, REFERENCE)

if err != nil {
	return
}

compare := &comparer.Comparer{}

// This will result in a stream of blocks that match in the local version
// to those in the reference
// We could do this on two goroutines simultaneously, if we used two identical generators
matchStream := compare.StartFindMatchingBlocks(
	bytes.NewBufferString(LOCAL_VERSION),
	0,
	generator,
	referenceFileIndex,
)

merger := &comparer.MatchMerger{}

// Combine adjacent blocks. If finding concurrently, call once per stream
merger.StartMergeResultStream(matchStream, BLOCK_SIZE)

// a sorted list of ranges of blocks that match between the reference and the local version
matchingBlockRanges := merger.GetMergedBlocks()
PrintLocalSpans("Match", matchingBlockRanges, LOCAL_VERSION, BLOCK_SIZE)

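// GetMissingBlocks takes the ID of the last block; block IDs are zero-based, hence BlockCount-1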
missingBlockRanges := matchingBlockRanges.GetMissingBlocks(uint(referenceFileIndex.BlockCount) - 1)
PrintReferenceSpans("Missing", missingBlockRanges, REFERENCE, BLOCK_SIZE)

// the "file" to write to
// No verification on the source of the replacement data.
// See http example for hash checking in a block source.
patchedFile := bytes.NewBuffer(make([]byte, 0, len(REFERENCE)))
remoteReferenceSource := blocksources.NewReadSeekerBlockSource(
	bytes.NewReader([]byte(REFERENCE)),
	blocksources.MakeNullFixedSizeResolver(BLOCK_SIZE),
)

err = sequential.SequentialPatcher(
	bytes.NewReader([]byte(LOCAL_VERSION)),
	remoteReferenceSource,
	ToPatcherMissingSpan(missingBlockRanges, BLOCK_SIZE),
	ToPatcherFoundSpan(matchingBlockRanges, BLOCK_SIZE),
	1024,
	patchedFile,
)

if err != nil {
	fmt.Println("Error:", err)
	return
}

fmt.Printf("Patched result: \"%s\"\n", patchedFile.Bytes())
fmt.Println("Remotely requested bytes:", remoteReferenceSource.ReadBytes(), "(without the index!)")
fmt.Println("Full file length:", len(REFERENCE), "bytes")
Output:

Match: "The "
Match: "k brown fox jump"
Match: "the lazy"
Missing: "quic"
Missing: "ed over "
Missing: " dog"
Patched result: "The quick brown fox jumped over the lazy dog"
Remotely requested bytes: 16 (without the index!)
Full file length: 44 bytes
Example (HttpBlockSource)

This is exceedingly similar to the module Example, but uses the HTTP BlockSource and a local HTTP server.

package main

import (
	"bytes"
	"crypto/md5"
	"fmt"
	"net/http"
	"time"

	"github.com/Redundancy/go-sync/blocksources"
	"github.com/Redundancy/go-sync/comparer"
	"github.com/Redundancy/go-sync/filechecksum"
	"github.com/Redundancy/go-sync/indexbuilder"
	"github.com/Redundancy/go-sync/patcher/sequential"
)

// Due to the short example strings, use a very small block size;
// using one this small in practice would increase your file transfer!
const BLOCK_SIZE = 4

// This is the "file" as described by the authoritative version
const REFERENCE = "The quick brown fox jumped over the lazy dog"

// This is what we have locally. Not too far off, but not correct.
const LOCAL_VERSION = "The qwik brown fox jumped 0v3r the lazy"

var content = bytes.NewReader([]byte(REFERENCE))
var LOCAL_URL = ""
var PORT = 8000

func handler(w http.ResponseWriter, req *http.Request) {
	http.ServeContent(w, req, "", time.Now(), content)
}

// Set up an HTTP server locally that will respond predictably to ranged requests
func setupServer() {
	s := http.NewServeMux()
	s.HandleFunc("/content", handler)

	go func() {
		for {
			p := fmt.Sprintf(":%v", PORT)
			LOCAL_URL = "http://localhost" + p

			err := http.ListenAndServe(
				p,
				s,
			)

			// if the port is already in use, try the next one
			if err != nil {
				PORT++
			}
		}
	}()
}

// This is exceedingly similar to the module Example, but uses the http blocksource and a local http server
func main() {
	setupServer()

	generator := filechecksum.NewFileChecksumGenerator(BLOCK_SIZE)
	_, referenceFileIndex, checksumLookup, err := indexbuilder.BuildIndexFromString(generator, REFERENCE)

	if err != nil {
		return
	}

	compare := &comparer.Comparer{}

	// This will result in a stream of blocks that match in the local version
	// to those in the reference
	// We could do this on two goroutines simultaneously, if we used two identical generators
	matchStream := compare.StartFindMatchingBlocks(
		bytes.NewBufferString(LOCAL_VERSION),
		0,
		generator,
		referenceFileIndex,
	)

	merger := &comparer.MatchMerger{}
	merger.StartMergeResultStream(matchStream, BLOCK_SIZE)

	matchingBlockRanges := merger.GetMergedBlocks()
	missingBlockRanges := matchingBlockRanges.GetMissingBlocks(uint(referenceFileIndex.BlockCount) - 1)

	patchedFile := bytes.NewBuffer(make([]byte, 0, len(REFERENCE)))
	remoteReferenceSource := blocksources.NewHttpBlockSource(
		LOCAL_URL+"/content",
		2,
		blocksources.MakeNullFixedSizeResolver(BLOCK_SIZE),
		&filechecksum.HashVerifier{
			Hash:                md5.New(),
			BlockSize:           BLOCK_SIZE,
			BlockChecksumGetter: checksumLookup,
		},
	)

	err = sequential.SequentialPatcher(
		bytes.NewReader([]byte(LOCAL_VERSION)),
		remoteReferenceSource,
		ToPatcherMissingSpan(missingBlockRanges, BLOCK_SIZE),
		ToPatcherFoundSpan(matchingBlockRanges, BLOCK_SIZE),
		1024,
		patchedFile,
	)

	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	fmt.Printf("Patched content: \"%v\"\n", patchedFile.String())
	fmt.Printf("Downloaded Bytes: %v\n", remoteReferenceSource.ReadBytes())

}
Output:

Patched content: "The quick brown fox jumped over the lazy dog"
Downloaded Bytes: 16

Directories

  • chunks - Package chunks provides the basic structure for a pair of the weak and strong checksums.
  • comparer - package comparer is responsible for using a FileChecksumGenerator (filechecksum) and an index to move through a file, compare it to the index, and produce a FileDiffSummary.
  • filechecksum - package filechecksum provides the FileChecksumGenerator, whose main responsibility is to read a file and generate both weak and strong checksums for every block.
  • gosync - gosync is a command-line implementation of the gosync package functionality, primarily as a demonstration of usage, but supposed to be functional in itself.
  • index - Package index provides the functionality to describe a reference 'file' and its contents in terms of the weak and strong checksums, in such a way that you can check if a weak checksum is present, then check if there is a strong checksum that matches.
  • indexbuilder - Package indexbuilder provides a few shortcuts to building a checksum index by generating and then loading the checksums, and building an index from that.
  • patcher - Package patcher follows a pattern established by hash, which defines the interface in the top-level package and then provides implementations below it.
  • patcher/sequential - The sequential patcher streams the patched version of the file to output; since it works strictly in order, it cannot patch the local file directly (it might overwrite a block needed later), so a final copy is needed once patching is done.
  • rollsum - rollsum provides an implementation of a rolling checksum: a checksum that's efficient to advance a byte or more at a time.
  • util/readers - util/readers exists to provide convenient and composable io.Reader compatible streams to allow testing without having to check in large binary files.
