iterfile

package module

v0.0.0-...-a212c9f Published: Dec 23, 2016 License: MIT Imports: 2 Imported by: 0

Repository

github.com/bbengfort/iterfile

README ¶

File Iteration Benchmarks

Benchmarking for various file iteration utilities

This small library provides various mechanisms for reading a file one line at a time. These utilities are meant less as a production library (though you're more than welcome to use them that way) than as a way to profile and benchmark various iteration constructs. See Benchmarking Readline Iterators for a quick writeup about this repository.

Read more at Benchmarking Readline Iterators and Yielding Functions for Iteration in Go.

Usage

All of the functions in this library are Readlines functions; that is, they take as input at least the path to a file and then provide some iterable context with which to handle one line of the file at a time. The usage examples here simply count the characters on each line (less the newlines); the testing methodology uses line, word, and character counts. Currently we have implemented:

ChanReadlines: returns a channel to range on.
CallbackReadlines: accepts a per-line callback function.
IteratorReadlines: returns a stateful iterator.

Channel Readlines

Use the channel based readlines iterator as follows:

// construct the reader and the character count.
var chars int
reader, err := ChanReadlines("fixtures/small.txt")

// check if there was an error opening the file or scanning.
if err != nil {
    log.Fatal(err)
}

// iterate over the lines using range
for line := range reader {
    chars += len(line)
}

Variants of this reader would not require the error checking at the beginning, but would rather yield errors in iteration along with the line.

Callback Readlines

Use the callback-style readlines iterator as follows:

var chars int

// Define the callback function
cb := func(line string) error {
    chars += len(line)
    return nil
}

// Pass the callback to the iterator
err := CallbackReadlines("fixtures/small.txt", cb)
if err != nil {
    log.Fatal(err)
}

Note that in this mechanism, you can break out of the loop by returning an error from the callback, which will cause the calling iterator to return and hopefully also close the file and be done!

Iterator Readlines

Use the stateful iterator returned by the readlines iterator as follows:

var chars int
reader, err := IteratorReadlines("fixtures/small.txt")

// check if there was an error opening the file or scanning.
if err != nil {
    log.Fatal(err)
}

// iterate over the stateful LineIterator that has been returned.
for reader.Next() {
    chars += len(reader.Line())
}

Benchmarks

Benchmarks can be run with the go test -bench=. command. The current benchmarks are as follows:

BenchmarkChanReadlinesSmall-8         	   20000	     74958 ns/op
BenchmarkChallbackReadlinesSmall-8    	   50000	     28836 ns/op
BenchmarkIteratorReadlinesSmall-8     	   50000	     29006 ns/op

BenchmarkChanReadlinesMedium-8        	    2000	    621716 ns/op
BenchmarkChallbackReadlinesMedium-8   	   10000	    216734 ns/op
BenchmarkIteratorReadlinesMedium-8    	   10000	    219842 ns/op

BenchmarkChanReadlinesLarge-8         	     200	   6250004 ns/op
BenchmarkChallbackReadlinesLarge-8    	    1000	   2198904 ns/op
BenchmarkIteratorReadlinesLarge-8     	    1000	   2229104 ns/op

We benchmark each word count function on small (100 lines), medium (1000 lines) and large (10000 lines) text files.
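For readers new to Go benchmarks, here is a minimal sketch of the general shape of such a measurement. It is not the repository's benchmark code: it uses testing.Benchmark so that it can run from an ordinary main function, it generates "fizz buzz" text in memory instead of reading the fixture files, and countChars and benchmarkCount are hypothetical helpers.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result alive so the measured call is not optimized away.
var sink int

// countChars counts characters per line, excluding newlines.
func countChars(text string) int {
	chars := 0
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		chars += len(scanner.Text())
	}
	return chars
}

// benchmarkCount times countChars on an in-memory text with the given number
// of lines and returns the standard benchmark result.
func benchmarkCount(lines int) testing.BenchmarkResult {
	text := strings.Repeat("fizz buzz\n", lines)
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = countChars(text)
		}
	})
}

func main() {
	// Sizes mirror the small, medium, and large files described above.
	for _, n := range []int{100, 1000, 10000} {
		fmt.Printf("%d lines: %s\n", n, benchmarkCount(n))
	}
}
```

In a real test file the same loop body would live in a function of the form func BenchmarkXxx(b *testing.B) and be run with go test -bench=.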

Profiling

Memory usage is just as critical as time performance, so I profiled memory usage using the mprof utility by Fabian Pedregosa and Philippe Gervais. The profiler ran a command-line script in cmd/readline.go that allows you to select an iteration function as an argument. For comparison, I also created a Python script that implemented the same functionality. All iterators are counting characters from a 3.9GB, 900,002 line file filled with "fizz buzz" text.

Interestingly, while the channel readlines implementation took almost as long as Python, it used the least amount of memory. Both the iterator and callback implementations used slightly more memory, probably due to the state tracking each method was required to perform. These methods both took approximately the same time to complete, significantly faster than the channel method.

Help Wanted!

Have a method or mechanism for line-by-line reading of a file? Submit it with a pull request and add it to the list of benchmarks! In particular, I couldn't get the closure style of read-by-line iterator to work:

for gen, next, err := GeneratorReadlines("myfile.txt"); next; line, next, err = gen() {
    // do something with the line
}

I was either not getting all the lines or I was getting a final line that was simply the empty string, making all my counts incorrect. If you're interested in this problem, take a look at the current implementation and tests. Submit an issue if you'd like to discuss it!

About

Learning a new programming language often means that you want to explore everything as completely as possible. That's what this small repository is about for me: learning to write benchmarking code and to write quality iterators in idiomatic Go. Of course, then the repository gets out of control with Repo images, etc. But hey - if you're not having fun, why are you programming?

Acknowledgements

Most of the iterators were implemented based on Ewan Cheslack-Postava's Iterators in Go blog post. Table-based testing was inspired by Dave Cheney's Writing table driven tests in Go blog post. Benchmarking was similarly inspired by How to write benchmarks in Go. Check those posts out if you haven't already.

The banner image used in this README, “lines & curves” by Josef Stuefer, is used under a Creative Commons BY-NC-ND license.

Documentation ¶

Overview ¶

Package iterfile provides various mechanisms for reading a file one line at a time. These utilities are meant less as a production library (though you're more than welcome to use them that way) than as a way to profile and benchmark various iteration constructs.

Index ¶

func CallbackReadlines(path string, cb func(string) error) error
func ChanReadlines(path string) (<-chan string, error)
func GeneratorReadlines(path string) (func() (string, bool, error), bool, error)
type LineIterator
- func IteratorReadlines(path string) (LineIterator, error)

Functions ¶

func CallbackReadlines ¶

func CallbackReadlines(path string, cb func(string) error) error

CallbackReadlines allows the caller to specify a callback function whose input is the line being read. The callback is then called on each line in the file. Note that the CallbackReadlines function returns an error that should be checked along with the results of the callbacks. Basic usage is:

func linecb(line string) error {
    // do something with the line
    return nil
}

err := CallbackReadlines("myfile.txt", linecb)

Note that the callback function can return an error, which if detected will cause the loop to break and return the error from the callback.

func ChanReadlines ¶

func ChanReadlines(path string) (<-chan string, error)

ChanReadlines returns a channel that can be used in conjunction with the range keyword for looping over every line in the file. Basic usage is:

reader, err := ChanReadlines("myfile.txt")
for line := range reader {
    // do something with the line
}

The channel will be closed by the reader when the entire file is read.

func GeneratorReadlines ¶

func GeneratorReadlines(path string) (func() (string, bool, error), bool, error)

GeneratorReadlines returns a closure that can be called multiple times as though it were a generator function, creating kind of an interesting for expression construct that can be fit into a single line. Basic usage is:

for gen, next, err := GeneratorReadlines("myfile.txt"); next; line, next, err = gen() {
    // do something with the line
}

The loop stops when the generator next bool returns false.

Types ¶

type LineIterator ¶

type LineIterator interface {
	Next() bool   // Advances the iterator to the next line
	Line() string // Returns the current line of iteration
}

LineIterator specifies how an iterable object over file lines should work.

func IteratorReadlines ¶

func IteratorReadlines(path string) (LineIterator, error)

IteratorReadlines returns a LineIterator to loop over by calling its Next() method and obtaining the value with its Line() method. Basic usage is:

reader, err := IteratorReadlines("myfile.txt")
for reader.Next() {
    line := reader.Line()
    // do something with the line
}

Once the LineIterator is exhausted it cannot be used again or reset.

Source Files ¶

iterfile.go

Directories ¶

Path	Synopsis
cmd