nrt

package module
v0.0.1
Published: Feb 7, 2021 License: Apache-2.0 Imports: 15 Imported by: 0

README

dev-nrt

temporary home for early code relating to nrt

Contains some initial experiments to speed up data ingest for NAPLAN reporting.

Sample data needs to be a decent size to really exercise the code. I have run comparative tests against the existing n2 ingest using a sample file created with the perl script which can be found here:

https://github.com/nsip/naplan-results-reporting/tree/master/SampleData (Note: you'll need to install CPAN and the modules listed at the top of the file to get it working if you don't already have a comprehensive perl environment on your machine.)

I ran this as:

perl nap_platformdata_generator.pl 50 100

to generate a file of >500Mb.

On n2, ingest takes around 40 seconds; with nrt, (basic) ingest takes around 14 seconds.

The utilities expose two methods to handle the conversion: one sends the json to a k/v store as normal (badger in this case); the other writes the output to file/files, which is useful for debugging. The list of objects to consume from the xml stream is also configurable, so you could (for example) create one file with all codeframe data, one with results, one with students, etc. The default sample code writes everything to one file.

package main

import (
	"log"
	"os"

	nrt "github.com/nsip/dev-nrt"
)

func main() {

	//
	// obtain reader for file of interest
	//
	f, err := os.Open("../../testdata/n2sif.xml") // normal sample xml file
	// f, err := os.Open("../../testdata/rrd.xml") // large 500Mb sample file
	if err != nil {
		log.Fatalln("cannot open sample file:", err)
	}
	defer f.Close()

	//
	// superset of data objects we can extract from the
	// stream
	//
	var dataTypes = []string{
		"NAPStudentResponseSet",
		"NAPEventStudentLink",
		"StudentPersonal",
		"NAPTestlet",
		"NAPTestItem",
		"NAPTest",
		"NAPCodeFrame",
		"SchoolInfo",
		"NAPTestScoreSummary",
	}

	// err = nrt.StreamToJsonFile(f, "./out/rrd.json", dataTypes...)
	err = nrt.StreamToKVStore(f, "./kv/", nrt.IdxSifObjectByRefId(), dataTypes...)
	if err != nil {
		log.Println("error converting xml file:", err)
	}

}

So here's the question: given that the largest dataset within the xml file is the student results, parsing just those objects takes only around 10 seconds.

So in theory, if I wrap the methods above in goroutines (one for each data object, say, each creating its own file of output), the whole process should complete in the time of the longest extraction, since all the others are shorter.

However, this is much slower! I suspect it's because of the cpu usage and the small number of cores on my laptop: more than one process means too much timeslicing. I'd like to see what happens and what's possible on larger machines - we could go linear or concurrent based on detecting the size of the host machine?
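The fan-out experiment can be sketched generically (this is not the nrt API; the inline `sample` xml and the `countObjects` helper are illustrative stand-ins): one goroutine per object type, each with its own decoder, since a single file handle cannot safely be shared between concurrent xml decoders.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
	"sync"
)

// sample is a tiny stand-in for an RRD xml stream.
const sample = `<sif>
	<StudentPersonal/><StudentPersonal/>
	<SchoolInfo/>
	<NAPTest/><NAPTest/><NAPTest/>
</sif>`

// countObjects walks an xml token stream and counts elements
// whose local name matches target.
func countObjects(src, target string) int {
	dec := xml.NewDecoder(strings.NewReader(src))
	n := 0
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF or malformed xml ends the scan
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == target {
			n++
		}
	}
	return n
}

func main() {
	dataTypes := []string{"StudentPersonal", "SchoolInfo", "NAPTest"}

	// fan out: one goroutine per object type, each with its OWN
	// decoder - a single shared stream cannot be decoded concurrently
	var wg sync.WaitGroup
	counts := make([]int, len(dataTypes))
	for i, dt := range dataTypes {
		wg.Add(1)
		go func(i int, dt string) {
			defer wg.Done()
			counts[i] = countObjects(sample, dt) // each slot written by one goroutine only
		}(i, dt)
	}
	wg.Wait()

	for i, dt := range dataTypes {
		fmt.Printf("%s: %d\n", dt, counts[i])
	}
}
```

On the real data each goroutine would need to re-open (or be handed its own copy of) the input file; `runtime.NumCPU()` gives the core count if you want to gate the fan-out on the size of the host machine.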

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Option

type Option func(*Transformer) error

func CoreReports

func CoreReports(core bool) Option

indicate whether the most-used common reports will be included in this run of the transformer

func ForceIngest

func ForceIngest(fi bool) Option

Even if an existing datastore has been created in the past, this option ensures that the old data is removed and the ingest cycle is run again, reading all data files from the input folder. Default is true.

func InputFolder

func InputFolder(path string) Option

the folder containing RRD xml data files for processing

func ItemLevelReports

func ItemLevelReports(ilevel bool) Option

indicate whether the most heavyweight/detailed reports will be included in this run of the transformer; including these has the largest effect on overall processing time

func QAReports

func QAReports(qa bool) Option

indicate whether qa reports will be included in this run of the transformer

func Repository

func Repository(repo *repository.BadgerRepo) Option

the key-value store containing the ingested RRD data

func ShowProgress

func ShowProgress(sp bool) Option

Show progress bars for report processing (the progress bar for data ingest is always shown). Defaults to true. Reasons to disable: to get clear visibility of any std.Out console messages so they don't mix with the console progress bars; also, if piping the output to a file, the progress bars are written out sequentially, producing a lot of unnecessary noise data in the file.

func SkipIngest

func SkipIngest(si bool) Option

Tells NRT to go straight to the report processing activity, as data has already been ingested at an earlier point in time

func StopAfterIngest

func StopAfterIngest(sai bool) Option

Makes the transformer stop once data ingest is complete; various report configurations can then be run independently without reloading the results data. Default is false: the transformer will ingest data and move directly to report processing

func WritingExtractReports

func WritingExtractReports(wx bool) Option

indicate whether the writing-extract reports (input to downstream writing marking systems) will be included in this run of the transformer

type Transformer

type Transformer struct {
	// contains filtered or unexported fields
}

the core nrt engine; passes the streams of rrd data through pipelines of report processors, creating tabular and fixed-width reports from the xml data

func NewTransformer

func NewTransformer(userOpts ...Option) (*Transformer, error)

func (*Transformer) Run

func (tr *Transformer) Run() error

Ingest and process data from RRD files
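Pulling the options above together, a typical invocation might look like this (a sketch based on the documented options; the input folder path is illustrative):

```go
package main

import (
	"log"

	nrt "github.com/nsip/dev-nrt"
)

func main() {
	tr, err := nrt.NewTransformer(
		nrt.InputFolder("./in/"),    // folder containing RRD xml files (illustrative path)
		nrt.CoreReports(true),       // include the most-used common reports
		nrt.ItemLevelReports(false), // skip the heaviest reports for a faster run
		nrt.ShowProgress(true),
	)
	if err != nil {
		log.Fatalln("cannot create transformer:", err)
	}
	if err := tr.Run(); err != nil { // ingest RRD files, then run the report pipelines
		log.Fatalln("report run failed:", err)
	}
}
```

Because each option is just a `func(*Transformer) error`, new settings can be added without breaking existing callers - the usual motivation for the functional-options pattern.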

Directories

Path Synopsis
cmd
nrt command
processing of data within the context of the codeframe, e.g.
processing of data within the context of the codeframe, e.g.
