graphsplit

v0.4.1
Published: Jun 8, 2021 | License: MIT | Imports: 37 | Imported by: 4

README

Go-graphsplit

A tool for splitting a large dataset into graph slices suitable for making deals on the Filecoin network.

When storing a large dataset on the Filecoin network, we have to split it into smaller pieces that fit the sector size, which can be 32GiB or 64GiB.

At first, we packed the dataset into one large tarball, chunked the tarball into small pieces, and made deals with storage miners for those pieces. We worked this way for a while until we realized that it made data retrieval difficult: even to retrieve a single small file from the dataset, we first had to retrieve every piece of the tarball.

Graphsplit solves this problem. It takes advantage of IPLD concepts and follows the unixfs data structures. It treats the dataset, or one of its subdirectories, as one big graph and cuts it into small graphs. Each small graph keeps its file system structure as close as possible to the original. After that, we only need to organize each of these small graphs into a car file according to unixfs.
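
The cutting step can be pictured as packing files into slices with a fixed byte budget, splitting a file across two slices when it does not fit in the remaining space. The sketch below illustrates that idea only; it is not graphsplit's actual implementation, and `fileEntry`, `part`, and `planSlices` are hypothetical names introduced for the example.

```go
package main

import "fmt"

// fileEntry is a hypothetical stand-in for one file in the dataset.
type fileEntry struct {
	Path string // path relative to the dataset root
	Size int64  // size in bytes
}

// part is the byte range [Start, End) of one file assigned to a slice.
type part struct {
	Path       string
	Start, End int64
}

// planSlices greedily packs files into slices of at most sliceSize bytes.
// A file that does not fit in the remaining space is split across slices.
// Zero-length files are ignored in this sketch.
func planSlices(files []fileEntry, sliceSize int64) [][]part {
	if sliceSize <= 0 {
		return nil
	}
	var slices [][]part
	var cur []part
	var used int64
	for _, f := range files {
		var off int64
		for off < f.Size {
			n := f.Size - off
			if room := sliceSize - used; n > room {
				n = room
			}
			cur = append(cur, part{f.Path, off, off + n})
			off += n
			used += n
			if used == sliceSize {
				// The current slice is full: close it and start a new one.
				slices = append(slices, cur)
				cur, used = nil, 0
			}
		}
	}
	if len(cur) > 0 {
		slices = append(slices, cur)
	}
	return slices
}

func main() {
	files := []fileEntry{{"a/x.bin", 6}, {"a/y.bin", 7}, {"b/z.bin", 3}}
	for i, s := range planSlices(files, 8) {
		fmt.Println("slice", i, s)
	}
}
```

Because every part keeps its relative path, each slice can still be laid out as a directory tree that mirrors the original dataset, which is what makes retrieving a single file from a single slice possible.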

Build

git clone https://github.com/filedrive-team/go-graphsplit.git

cd go-graphsplit

# get submodules
git submodule update --init --recursive

# build filecoin-ffi
make ffi

make

Usage

See the work flow of graphsplit

Splitting dataset:

# car-dir: folder for the split smaller pieces, in .car format
# slice-size: size of each piece, in bytes (17179869184 bytes = 16 GiB)
# parallel: number of goroutines to run when building IPLD nodes
# graph-name: used as the filename prefix of the smaller pieces
# calc-commp: whether to calculate the pieceCID; defaults to false. Be careful: this consumes a lot of CPU, memory, and time if the slice size is very large.
# parent-path: usually the same as /path/to/dataset; it is only used to work out relative paths when building the IPLD graph
./graphsplit chunk \
--car-dir=path/to/car-dir \
--slice-size=17179869184 \
--parallel=2 \
--graph-name=gs-test \
--calc-commp=false \
--parent-path=/path/to/dataset \
/path/to/dataset

Note: a manifest.csv is created to record the mapping between each graph slice's name, its payload CID, and the slice's inner structure, as follows:

cat /path/to/car-dir/manifest.csv
payload_cid,filename,detail
Qm...,graph-slice-name.car,inner-structure-json

If --calc-commp=true is set, two more fields are added to manifest.csv:

cat /path/to/car-dir/manifest.csv
payload_cid,filename,piece_cid,piece_size,detail
Qm...,graph-slice-name.car,baga...,16646144,inner-structure-json
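
Because the set of columns depends on --calc-commp, a consumer of manifest.csv can key each row by the header names rather than by column position. The following is a minimal sketch using only the Go standard library; `manifestRow`, `readManifest`, and `parseManifest` are hypothetical helpers that are not part of graphsplit, and the sample values are the placeholders shown above.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// manifestRow holds one row of manifest.csv keyed by column name, so the
// 3-column and 5-column layouts are handled the same way.
type manifestRow map[string]string

// readManifest parses a manifest.csv stream, using its first line as the header.
func readManifest(r io.Reader) ([]manifestRow, error) {
	cr := csv.NewReader(r)
	header, err := cr.Read()
	if err != nil {
		return nil, fmt.Errorf("read header: %w", err)
	}
	var rows []manifestRow
	for {
		// csv.Reader rejects records whose field count differs from the header.
		rec, err := cr.Read()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		row := make(manifestRow, len(header))
		for i, name := range header {
			row[name] = rec[i]
		}
		rows = append(rows, row)
	}
}

// parseManifest is a convenience wrapper for in-memory data.
func parseManifest(data string) ([]manifestRow, error) {
	return readManifest(strings.NewReader(data))
}

func main() {
	const sample = "payload_cid,filename,piece_cid,piece_size,detail\n" +
		"Qm...,graph-slice-name.car,baga...,16646144,inner-structure-json\n"
	rows, err := parseManifest(sample)
	if err != nil {
		panic(err)
	}
	for _, row := range rows {
		// piece_cid and piece_size are present only when --calc-commp=true was used.
		pieceCid, hasPiece := row["piece_cid"]
		fmt.Println(row["filename"], row["payload_cid"], pieceCid, hasPiece)
	}
}
```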

Import car file to IPFS:

ipfs dag import /path/to/car-dir/car-file

Restore files:

# car-path: a directory or a single file, in .car format
# output-dir: directory where the restored files are written
# parallel: number of goroutines to run when restoring
./graphsplit restore \
--car-path=/path/to/car-path \
--output-dir=/path/to/output-dir \
--parallel=2

PieceCID Calculation for a single car file:

# Calculate pieceCID for a single car file
./graphsplit commP /path/to/carfile

Contribute

PRs are welcome!

License

MIT

Documentation

Constants

const UnixfsChunkSize uint64 = 1 << 20
const UnixfsLinksPerLevel = 1 << 10
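
To get a feel for what these two constants imply, the sketch below estimates how many leaf chunks a file needs and how many levels of link nodes sit above them. It assumes a balanced DAG layout in which every intermediate node holds at most UnixfsLinksPerLevel links; that layout is an assumption made for this example, and `leafCount` and `treeDepth` are hypothetical helpers, not part of the package.

```go
package main

import "fmt"

const (
	chunkSize     uint64 = 1 << 20 // mirrors UnixfsChunkSize (1 MiB)
	linksPerLevel uint64 = 1 << 10 // mirrors UnixfsLinksPerLevel (1024)
)

// leafCount returns how many chunkSize-sized leaves a file of the given size needs.
func leafCount(fileSize uint64) uint64 {
	if fileSize == 0 {
		return 1 // assumption: an empty file is still represented by one node
	}
	n := fileSize / chunkSize
	if fileSize%chunkSize != 0 {
		n++
	}
	return n
}

// treeDepth returns the number of link levels needed above the leaves when
// every intermediate node holds at most linksPerLevel links.
func treeDepth(leaves uint64) int {
	depth := 0
	for capacity := uint64(1); capacity < leaves; capacity *= linksPerLevel {
		depth++
	}
	return depth
}

func main() {
	for _, size := range []uint64{1 << 20, 1 << 30, 16 << 30} {
		leaves := leafCount(size)
		fmt.Printf("size=%d leaves=%d depth=%d\n", size, leaves, treeDepth(leaves))
	}
}
```

Under these assumptions a 16 GiB slice made of a single file would need 16384 leaves and two levels of link nodes above them.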

Variables

This section is empty.

Functions

func BuildFileNode

func BuildFileNode(item Finfo, bufDs ipld.DAGService, cidBuilder cid.Builder) (node ipld.Node, err error)

func BuildIpldGraph

func BuildIpldGraph(ctx context.Context, fileList []Finfo, graphName, parentPath, carDir string, parallel int, cb GraphBuildCallback)

func CarTo added in v0.2.0

func CarTo(carPath, outputDir string, parallel int)

func Chunk added in v0.3.0

func Chunk(ctx context.Context, sliceSize int64, parentPath, targetPath, carDir, graphName string, parallel int, cb GraphBuildCallback) error

func ExistDir added in v0.2.0

func ExistDir(path string) bool

func GenGraphName

func GenGraphName(graphName string, sliceCount, sliceTotal int) string

func GetFileList

func GetFileList(args []string) (fileList []string, err error)

func GetFileListAsync

func GetFileListAsync(args []string) chan Finfo

func GetGraphCount

func GetGraphCount(args []string, sliceSize int64) int

func Import added in v0.2.0

func Import(path string, st car.Store) (cid.Cid, error)

func Merge added in v0.2.0

func Merge(dir string, parallel int)

func NodeWriteTo added in v0.2.0

func NodeWriteTo(nd files.Node, fpath string) error

Types

type CommPRet added in v0.4.0

type CommPRet struct {
	Root cid.Cid
	Size abi.UnpaddedPieceSize
}

func CalcCommP added in v0.4.0

func CalcCommP(ctx context.Context, inpath string) (*CommPRet, error)

Almost a copy-paste of https://github.com/filecoin-project/lotus/node/impl/client/client.go#L749-L770
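
CommPRet.Size is typed as abi.UnpaddedPieceSize. In Filecoin, Fr32 padding stores every 127 bytes of data in 128 bytes, so the padded size is the unpadded size plus one 127th of it. The sketch below works through that arithmetic with plain integers; `paddedPieceSize` is a hypothetical helper, and treating the piece_size value from the README's manifest.csv example as an unpadded size is an assumption (the value is consistent with it).

```go
package main

import (
	"errors"
	"fmt"
)

// paddedPieceSize converts an unpadded piece size to its padded size.
// Fr32 padding stores every 127 bytes of data in 128 bytes.
func paddedPieceSize(unpadded uint64) (uint64, error) {
	if unpadded == 0 || unpadded%127 != 0 {
		return 0, errors.New("unpadded piece size must be a positive multiple of 127")
	}
	return unpadded + unpadded/127, nil
}

func main() {
	// 16646144 is the piece_size value from the manifest.csv example in the README.
	padded, err := paddedPieceSize(16646144)
	if err != nil {
		panic(err)
	}
	fmt.Println(padded) // prints 16777216, which is 16 MiB
}
```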

type FSBuilder added in v0.4.1

type FSBuilder struct {
	// contains filtered or unexported fields
}

func NewFSBuilder added in v0.4.1

func NewFSBuilder(root *dag.ProtoNode, ds ipld.DAGService) *FSBuilder

func (*FSBuilder) Build added in v0.4.1

func (b *FSBuilder) Build() (*fsNode, error)

type Finfo

type Finfo struct {
	Path      string
	Name      string
	Info      os.FileInfo
	SeekStart int64
	SeekEnd   int64
}

type GraphBuildCallback added in v0.3.0

type GraphBuildCallback interface {
	OnSuccess(node ipld.Node, graphName, fsDetail string)
	OnError(error)
}
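
A caller that wants something other than the provided callbacks can supply its own type with these two methods. The sketch below collects results in memory. To stay compilable with the standard library alone it replaces ipld.Node with a local stand-in interface, so `node`, `sliceResult`, and `collector` are hypothetical; real code would use the ipld.Node type from go-ipld-format. The mutex is there on the assumption that callbacks may be invoked from several goroutines when parallel is greater than 1.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// node is a stand-in for ipld.Node so this sketch needs no external modules.
type node interface{ String() string }

// sliceResult records one successfully built graph slice.
type sliceResult struct {
	GraphName string
	FsDetail  string
}

// collector has the same method set as GraphBuildCallback, with node swapped
// in for ipld.Node, and gathers results and errors in memory.
type collector struct {
	mu      sync.Mutex
	results []sliceResult
	errs    []error
}

// OnSuccess stores the graph name and file system detail of a finished slice.
func (c *collector) OnSuccess(_ node, graphName, fsDetail string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.results = append(c.results, sliceResult{graphName, fsDetail})
}

// OnError stores the error instead of aborting the whole run.
func (c *collector) OnError(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, err)
}

func main() {
	c := &collector{}
	c.OnSuccess(nil, "example-slice", "{}") // illustrative values only
	c.OnError(errors.New("example failure"))
	fmt.Println(len(c.results), len(c.errs))
}
```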

func CSVCallback added in v0.3.0

func CSVCallback(carDir string) GraphBuildCallback

func CommPCallback added in v0.4.0

func CommPCallback(carDir string) GraphBuildCallback

func ErrCallback added in v0.3.0

func ErrCallback() GraphBuildCallback

type Manifest added in v0.4.0

type Manifest struct {
	PayloadCid string `csv:"payload_cid"`
	Filename   string `csv:"filename"`
}

manifest

type PieceInfo added in v0.4.0

type PieceInfo struct {
	PayloadCid string `csv:"payload_cid"`
	Filename   string `csv:"filename"`
	PieceCid   string `csv:"piece_cid"`
	PieceSize  uint64 `csv:"piece_size"`
}

piece info

Directories

Path
cmd/graphsplit (command)
