nrt

package module
v0.0.1
Published: Feb 7, 2021 License: Apache-2.0 Imports: 15 Imported by: 0

README

dev-nrt

temporary home for early code relating to nrt

Contains some initial experiments to speed up data ingest for NAPLAN reporting.

Sample data needs to be a decent size to really exercise the code. I have run comparative tests against the existing n2 ingest using a sample file created with the perl script which can be found here:

https://github.com/nsip/naplan-results-reporting/tree/master/SampleData (Note: you'll need to install CPAN and the modules listed at the top of the file to get it working if you don't already have a comprehensive perl environment on your machine.)

I ran this as:

perl nap_platformdata_generator.pl 50 100

to generate a file of >500Mb.

On n2, ingest takes around 40 seconds; with nrt, (basic) ingest takes around 14 seconds.

The utilities expose two methods to handle the conversion: one sends the json to a k/v store as normal (badger in this case); the other writes the output to file/files, which is useful for debugging. The list of objects to consume from the xml stream is also configurable, so you could (for example) create one file with all codeframe data, one with results, one with students, etc. The default sample code writes everything to one file.

package main

import (
	"log"
	"os"

	nrt "github.com/nsip/dev-nrt"
)

func main() {

	//
	// obtain reader for file of interest
	//
	f, err := os.Open("../../testdata/n2sif.xml") // normal sample xml file
	// f, err := os.Open("../../testdata/rrd.xml") // large 500Mb sample file
	if err != nil {
		log.Fatalln("cannot open sample file:", err)
	}
	defer f.Close()

	//
	// superset of data objects we can extract from the
	// stream
	//
	var dataTypes = []string{
		"NAPStudentResponseSet",
		"NAPEventStudentLink",
		"StudentPersonal",
		"NAPTestlet",
		"NAPTestItem",
		"NAPTest",
		"NAPCodeFrame",
		"SchoolInfo",
		"NAPTestScoreSummary",
	}

	// err = nrt.StreamToJsonFile(f, "./out/rrd.json", dataTypes...)
	err = nrt.StreamToKVStore(f, "./kv/", nrt.IdxSifObjectByRefId(), dataTypes...)
	if err != nil {
		log.Println("error converting xml file:", err)
	}

}

So here's the question: given that the largest dataset within the xml file is the student results, parsing just those objects takes only around 10 seconds.

So in theory, if I wrap the methods above in goroutines (one for each data object, say, each creating its own file of output), the whole process should complete in the time of the longest extraction, since all the others are shorter.

However, this is much slower! I suspect it's because of the cpu usage and the small number of cores on my laptop: more than one process means too much timeslicing. I'd like to see what happens and what's possible on larger machines - we could go linear or concurrent based on detecting the size of the host machine?
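The fan-out experiment can be sketched generically (this is not the nrt API; the inline `sample` xml and the `countObjects` helper are illustrative stand-ins): one goroutine per object type, each with its own decoder, since a single file handle cannot safely be shared between concurrent xml decoders.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
	"sync"
)

// sample is a tiny stand-in for an RRD xml stream.
const sample = `<sif>
	<StudentPersonal/><StudentPersonal/>
	<SchoolInfo/>
	<NAPTest/><NAPTest/><NAPTest/>
</sif>`

// countObjects walks an xml token stream and counts elements
// whose local name matches target.
func countObjects(src, target string) int {
	dec := xml.NewDecoder(strings.NewReader(src))
	n := 0
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF or malformed xml ends the scan
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == target {
			n++
		}
	}
	return n
}

func main() {
	dataTypes := []string{"StudentPersonal", "SchoolInfo", "NAPTest"}

	// fan out: one goroutine per object type, each with its OWN
	// decoder - a single shared stream cannot be decoded concurrently
	var wg sync.WaitGroup
	counts := make([]int, len(dataTypes))
	for i, dt := range dataTypes {
		wg.Add(1)
		go func(i int, dt string) {
			defer wg.Done()
			counts[i] = countObjects(sample, dt) // each slot written by one goroutine only
		}(i, dt)
	}
	wg.Wait()

	for i, dt := range dataTypes {
		fmt.Printf("%s: %d\n", dt, counts[i])
	}
}
```

On the real data each goroutine would need to re-open (or be handed its own copy of) the input file; `runtime.NumCPU()` gives the core count if you want to gate the fan-out on the size of the host machine.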

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Option

type Option func(*Transformer) error

func CoreReports

func CoreReports(core bool) Option

indicate whether the most-used common reports will be included in this run of the transformer

func ForceIngest

func ForceIngest(fi bool) Option

Even if an existing datastore has been created in the past, this option ensures that the old data is removed and the ingest cycle is run again, reading all data files from the input folder. Default is true.

func InputFolder

func InputFolder(path string) Option

the folder containing RRD xml data files for processing

func ItemLevelReports

func ItemLevelReports(ilevel bool) Option

indicate whether the most heavyweight/detailed reports will be included in this run of the transformer; including these has the largest effect on overall processing time

func QAReports

func QAReports(qa bool) Option

indicate whether qa reports will be included in this run of the transformer

func Repository

func Repository(repo *repository.BadgerRepo) Option

the key-value store containing the ingested RRD data

func ShowProgress

func ShowProgress(sp bool) Option

Show progress bars for report processing (the progress bar for data ingest is always shown). Defaults to true. Reasons to disable: to get clear visibility of any std.Out console messages so they don't mix with the console progress bars; also, if piping the output to a file, the progress bars are written out sequentially, producing a lot of unnecessary noise data in the file.

func SkipIngest

func SkipIngest(si bool) Option

Tells NRT to go straight to the report processing activity, as data has already been ingested at an earlier point in time

func StopAfterIngest

func StopAfterIngest(sai bool) Option

Makes the transformer stop once data ingest is complete; various report configurations can then be run independently without reloading the results data. Default is false: the transformer will ingest data and move directly to report processing

func WritingExtractReports

func WritingExtractReports(wx bool) Option

indicate whether the writing-extract reports (input to downstream writing marking systems) will be included in this run of the transformer

type Transformer

type Transformer struct {
	// contains filtered or unexported fields
}

the core nrt engine; passes the streams of rrd data through pipelines of report processors, creating tabular and fixed-width reports from the xml data

func NewTransformer

func NewTransformer(userOpts ...Option) (*Transformer, error)

func (*Transformer) Run

func (tr *Transformer) Run() error

Ingest and process data from RRD files
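Pulling the options above together, a typical invocation might look like this (a sketch based on the documented options; the input folder path is illustrative):

```go
package main

import (
	"log"

	nrt "github.com/nsip/dev-nrt"
)

func main() {
	tr, err := nrt.NewTransformer(
		nrt.InputFolder("./in/"),    // folder containing RRD xml files (illustrative path)
		nrt.CoreReports(true),       // include the most-used common reports
		nrt.ItemLevelReports(false), // skip the heaviest reports for a faster run
		nrt.ShowProgress(true),
	)
	if err != nil {
		log.Fatalln("cannot create transformer:", err)
	}
	if err := tr.Run(); err != nil { // ingest RRD files, then run the report pipelines
		log.Fatalln("report run failed:", err)
	}
}
```

Because each option is just a `func(*Transformer) error`, new settings can be added without breaking existing callers - the usual motivation for the functional-options pattern.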

Directories

Path Synopsis
cmd
nrt command
processing of data within the context of the codeframe, e.g.
processing of data within the context of the codeframe, e.g.
