iterfile

package module
v0.0.0-...-a212c9f Latest
Published: Dec 23, 2016 License: MIT Imports: 2 Imported by: 0

README

File Iteration Benchmarks

Benchmarking for various file iteration utilities

This small library provides various mechanisms for reading a file one line at a time. These utilities aren't primarily meant for production code (though you're more than welcome to use them that way) but rather for profiling and benchmarking various iteration constructs.

Read more at Benchmarking Readline Iterators, a quick writeup about this repository, and Yielding Functions for Iteration in Go.

Usage

All of the functions in this library are Readlines functions; that is, they take at least the path to a file as input and provide some iterable context for handling the file one line at a time. The usage examples here simply count the characters on each line (excluding newlines); the tests use line, word, and character counts. Currently we have implemented:

  • ChanReadlines: returns a channel to range on.
  • CallbackReadlines: accepts a per-line callback function.
  • IteratorReadlines: returns a stateful iterator.

Channel Readlines

Use the channel based readlines iterator as follows:

// construct the reader and the character count.
var chars int
reader, err := ChanReadlines("fixtures/small.txt")

// check if there was an error opening the file or scanning.
if err != nil {
    log.Fatal(err)
}

// iterate over the lines using range
for line := range reader {
    chars += len(line)
}

Variants of this reader could skip the up-front error check and instead yield errors alongside each line during iteration.
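
As a rough sketch of such a variant (not part of this package), the channel can carry a small struct holding either a line or an error. The names lineResult, chanReadlinesWithErr, and countChars are hypothetical and chosen only for illustration; the file is read with bufio.Scanner, which is an assumption about how one might implement it.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// lineResult is a hypothetical pair type: each value holds a line or an error.
type lineResult struct {
	Line string
	Err  error
}

// chanReadlinesWithErr is a hypothetical variant, not part of iterfile.
// Open and scan errors are sent on the channel instead of being returned.
func chanReadlinesWithErr(path string) <-chan lineResult {
	out := make(chan lineResult)
	go func() {
		defer close(out)

		file, err := os.Open(path)
		if err != nil {
			out <- lineResult{Err: err}
			return
		}
		defer file.Close()

		scanner := bufio.NewScanner(file)
		for scanner.Scan() {
			out <- lineResult{Line: scanner.Text()}
		}
		if err := scanner.Err(); err != nil {
			out <- lineResult{Err: err}
		}
	}()
	return out
}

// countChars sums line lengths and stops at the first error it receives.
// An error is always the last value sent, so returning early leaves no
// goroutine blocked on the channel.
func countChars(path string) (int, error) {
	chars := 0
	for res := range chanReadlinesWithErr(path) {
		if res.Err != nil {
			return chars, res.Err
		}
		chars += len(res.Line)
	}
	return chars, nil
}

func main() {
	chars, err := countChars("fixtures/small.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(chars)
}
```

The trade-off is that every iteration has to inspect the Err field, which moves the error handling into the loop body rather than removing it.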

Callback Readlines

Use the callback-style readlines iterator as follows:

var chars int

// Define the callback function
cb := func(line string) error {
    chars += len(line)
    return nil
}

// Pass the callback to the iterator
err := CallbackReadlines("fixtures/small.txt", cb)
if err != nil {
    log.Fatal(err)
}

Note that with this mechanism you can break out of the loop by returning an error from the callback; the calling iterator then returns that error and, hopefully, also closes the file and is done!
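
One way to stop early without treating it as a failure is a sentinel error that the caller recognizes afterwards. The sketch below is an assumption-laden illustration: callbackReadlines is a stand-in written for this example (not the package source), and errStop and firstLines are hypothetical names. It also shows how a deferred Close would cover the early-return path.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
)

// errStop is a hypothetical sentinel used to end iteration early.
var errStop = errors.New("stop iteration")

// callbackReadlines is an illustrative stand-in, not the iterfile source.
func callbackReadlines(path string, cb func(string) error) error {
	file, err := os.Open(path)
	if err != nil {
		return err
	}
	defer file.Close() // runs on every return path, including early exit

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		if err := cb(scanner.Text()); err != nil {
			return err
		}
	}
	return scanner.Err()
}

// firstLines collects at most n lines by returning errStop from the callback.
func firstLines(path string, n int) ([]string, error) {
	var lines []string
	err := callbackReadlines(path, func(line string) error {
		if len(lines) >= n {
			return errStop
		}
		lines = append(lines, line)
		return nil
	})
	if errors.Is(err, errStop) {
		err = nil // stopping early was intentional, not a failure
	}
	return lines, err
}

func main() {
	lines, err := firstLines("fixtures/small.txt", 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(lines))
}
```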

Iterator Readlines

Use the stateful iterator returned by the readlines iterator as follows:

var chars int
reader, err := IteratorReadlines("fixtures/small.txt")

// check if there was an error opening the file or scanning.
if err != nil {
    log.Fatal(err)
}

// iterate over the stateful LineIterator that has been returned.
for reader.Next() {
    chars += len(reader.Line())
}

Benchmarks

Benchmarks can be run with the go test -bench=. command. The current benchmarks are as follows:

BenchmarkChanReadlinesSmall-8         	   20000	     74958 ns/op
BenchmarkChallbackReadlinesSmall-8    	   50000	     28836 ns/op
BenchmarkIteratorReadlinesSmall-8     	   50000	     29006 ns/op

BenchmarkChanReadlinesMedium-8        	    2000	    621716 ns/op
BenchmarkChallbackReadlinesMedium-8   	   10000	    216734 ns/op
BenchmarkIteratorReadlinesMedium-8    	   10000	    219842 ns/op

BenchmarkChanReadlinesLarge-8         	     200	   6250004 ns/op
BenchmarkChallbackReadlinesLarge-8    	    1000	   2198904 ns/op
BenchmarkIteratorReadlinesLarge-8     	    1000	   2229104 ns/op

We benchmark each word count function on small (100 lines), medium (1000 lines) and large (10000 lines) text files.
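
For readers new to Go benchmarks, here is a minimal sketch of the general shape. Benchmarks run by go test live in _test.go files as BenchmarkXxx(b *testing.B) functions; this sketch uses testing.Benchmark instead so that it runs as a standalone program. The helpers writeFixture, countChars, and benchCountChars are hypothetical, the fixture files are generated on the fly rather than taken from this repository, and the timings it prints will not match the numbers above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"testing"
)

// writeFixture creates a temporary file with n identical lines
// and returns its path.
func writeFixture(n int) (string, error) {
	file, err := os.CreateTemp("", "iterfile-*.txt")
	if err != nil {
		return "", err
	}
	defer file.Close()
	_, err = file.WriteString(strings.Repeat("fizz buzz\n", n))
	return file.Name(), err
}

// countChars stands in for a Readlines function under test.
func countChars(path string) (int, error) {
	file, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer file.Close()

	chars := 0
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		chars += len(scanner.Text())
	}
	return chars, scanner.Err()
}

// benchCountChars returns a benchmark body bound to one fixture file.
func benchCountChars(path string) func(b *testing.B) {
	return func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			if _, err := countChars(path); err != nil {
				b.Fatal(err)
			}
		}
	}
}

func main() {
	testing.Init() // registers testing flags when not run via go test

	// Small, medium, and large fixtures, mirroring the sizes described above.
	for _, n := range []int{100, 1000, 10000} {
		path, err := writeFixture(n)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		result := testing.Benchmark(benchCountChars(path))
		fmt.Printf("%6d lines\t%s\n", n, result)
		os.Remove(path)
	}
}
```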

Profiling

Memory usage is just as critical as time performance, so I profiled memory usage using the mprof utility by Fabian Pedregosa and Philippe Gervais. The profiler ran a command-line script in cmd/readline.go that allows you to select an iteration function as an argument. For comparison, I also created a Python script that implemented the same functionality. All iterators are counting characters from a 3.9GB, 900,002 line file filled with "fizz buzz" text.

Memory Profiling of Readlines Iteration for a 3.9G Text File

Interestingly, while the channel readlines implementation took almost as long as Python, it used the least amount of memory. Both the iterator and callback implementations used slightly more memory, probably due to the state tracking each method was required to perform. These methods both took approximately the same time to complete, significantly faster than the channel method.

Help Wanted!

Have a method or mechanism for line-by-line reading of a file? Submit it with a pull request and add it to the list of benchmarks! In particular, I couldn't get the closure style of read-by-line iterator to work:

for gen, next, err := GeneratorReadlines("myfile.txt"); next; line, next, err = gen() {
    // do something with the line
}

I was either not getting all the lines or I was getting a final line that was simply the empty string, making all my counts incorrect. If you're interested in this problem, take a look at the current implementation and tests. Submit an issue if you'd like to discuss it!

About

Learning a new programming language often means that you want to explore everything as completely as possible. That's what this small repository is about for me: learning to write benchmarking code and to write quality iterators in idiomatic Go. Of course, then the repository gets out of control with repo images, etc. But hey, if you're not having fun, why are you programming?

Acknowledgements

Most of the iterators were implemented based on Ewan Cheslack-Postava's Iterators in Go blog post. Table-based testing was inspired by Dave Cheney's Writing table driven tests in Go blog post. Benchmarking was similarly inspired by How to write benchmarks in Go. Check those posts out if you haven't already.

The banner image used in this README, “lines & curves” by Josef Stuefer, is used under a Creative Commons BY-NC-ND license.

Documentation

Overview

Package iterfile provides various mechanisms for reading a file one line at a time. These utilities aren't primarily meant for production code (though you're more than welcome to use them that way) but rather for profiling and benchmarking various iteration constructs.

Functions

func CallbackReadlines

func CallbackReadlines(path string, cb func(string) error) error

CallbackReadlines allows the caller to specify a callback function whose input is the line being read. The callback is then called on each line in the file. Note that the CallbackReadlines function returns an error that should be checked along with the results of the callbacks. Basic usage is:

func linecb(line string) error {
    // do something with the line
    return nil
}

err := CallbackReadlines("myfile.txt", linecb)

Note that the callback function can return an error, which if detected will cause the loop to break and return the error from the callback.

func ChanReadlines

func ChanReadlines(path string) (<-chan string, error)

ChanReadlines returns a channel that can be used in conjunction with the range keyword for looping over every line in the file. Basic usage is:

reader, err := ChanReadlines("myfile.txt")
if err != nil {
    log.Fatal(err)
}
for line := range reader {
    // do something with the line
}

The channel will be closed by the reader when the entire file is read.
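
To make the closing behavior concrete, here is a minimal sketch of how a channel-based reader of this shape might be written. It is an illustration under assumptions, not the package source: chanReadlines is a hypothetical stand-in, it reads with bufio.Scanner, and it silently drops scan errors to keep the signature identical to the one documented above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// chanReadlines is an illustrative stand-in, not the iterfile source.
func chanReadlines(path string) (<-chan string, error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}

	out := make(chan string)
	go func() {
		defer file.Close()
		defer close(out) // closing the channel ends the caller's range loop

		scanner := bufio.NewScanner(file)
		for scanner.Scan() {
			out <- scanner.Text()
		}
	}()
	return out, nil
}

func main() {
	reader, err := chanReadlines("fixtures/small.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	lines := 0
	for range reader {
		lines++
	}
	fmt.Println(lines)
}
```

With an unbuffered channel like this one, a caller that breaks out of the range loop early leaves the goroutine blocked on its next send, so the file stays open. Draining the channel, or adding a cancellation signal, would avoid that.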

func GeneratorReadlines

func GeneratorReadlines(path string) (func() (string, bool, error), bool, error)

GeneratorReadlines returns a closure that can be called multiple times as though it were a generator function, creating kind of an interesting for expression construct that can be fit into a single line. Basic usage is:

for gen, next, err := GeneratorReadlines("myfile.txt"); next; line, next, err = gen() {
    // do something with the line
}

The loop stops when the bool returned by the generator is false.
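
A possible alternative shape for the closure style is sketched below. It is not the package's GeneratorReadlines; lineGenerator and countChars are hypothetical names, and the signature is deliberately different. The idea is to call the closure in both the init and post statements of the for loop, so the body only runs for lines that were actually read. That may avoid the missing-line or trailing-empty-line behavior, though the real cause in the current implementation is not confirmed here.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// lineGenerator is a hypothetical alternative to GeneratorReadlines.
// The first returned closure yields (line, true) for each line and then
// ("", false); the second closes the underlying file.
func lineGenerator(path string) (func() (string, bool), func() error, error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	scanner := bufio.NewScanner(file)
	next := func() (string, bool) {
		if scanner.Scan() {
			return scanner.Text(), true
		}
		return "", false
	}
	return next, file.Close, nil
}

// countChars sums the line lengths produced by the generator.
func countChars(path string) (int, error) {
	next, closeFile, err := lineGenerator(path)
	if err != nil {
		return 0, err
	}
	defer closeFile()

	chars := 0
	// Fetch once in the init statement and again in the post statement;
	// the condition is checked before the body ever sees a line.
	for line, ok := next(); ok; line, ok = next() {
		chars += len(line)
	}
	return chars, nil
}

func main() {
	chars, err := countChars("myfile.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(chars)
}
```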

Types

type LineIterator

type LineIterator interface {
	Next() bool   // Advances the iterator to the next line
	Line() string // Returns the current line of iteration
}

LineIterator specifies how an iterable object over file lines should work.

func IteratorReadlines

func IteratorReadlines(path string) (LineIterator, error)

IteratorReadlines returns a LineIterator to loop over by calling its Next() method and obtaining the value with its Line() method. Basic usage is:

reader, err := IteratorReadlines("myfile.txt")
if err != nil {
    log.Fatal(err)
}
for reader.Next() {
    line := reader.Line()
    // do something with the line
}

Once the LineIterator is exhausted it cannot be used again or reset.
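
For illustration, one way to satisfy this interface is to wrap a bufio.Scanner and close the file when the input runs out. The fileLines type and iteratorReadlines function below are hypothetical and written only for this sketch; the package's actual implementation may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// LineIterator mirrors the interface documented above.
type LineIterator interface {
	Next() bool
	Line() string
}

// fileLines is a hypothetical implementation of LineIterator.
type fileLines struct {
	file    *os.File
	scanner *bufio.Scanner
	line    string
}

// Next advances to the following line. Once the input runs out it closes
// the file, and every later call returns false.
func (it *fileLines) Next() bool {
	if it.file == nil {
		return false // already exhausted
	}
	if it.scanner.Scan() {
		it.line = it.scanner.Text()
		return true
	}
	it.file.Close()
	it.file = nil
	it.line = ""
	return false
}

// Line returns the line read by the most recent successful call to Next.
func (it *fileLines) Line() string { return it.line }

// iteratorReadlines is an illustrative stand-in, not the iterfile source.
func iteratorReadlines(path string) (LineIterator, error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	return &fileLines{file: file, scanner: bufio.NewScanner(file)}, nil
}

func main() {
	reader, err := iteratorReadlines("fixtures/small.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	chars := 0
	for reader.Next() {
		chars += len(reader.Line())
	}
	fmt.Println(chars)
}
```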