corpus

package
v1.1.1 Latest
Warning

This package is not in the latest version of its module.

Published: Nov 4, 2024 License: Apache-2.0 Imports: 1 Imported by: 0

Documentation

Overview

Package corpus provides functionality for creating and managing corpora.

A corpus is a collection of text documents that are used for training and testing machine learning models. The documents in a corpus are typically sentences or paragraphs of text.

The corpus package provides an interface for working with corpora, as well as a set of built-in corpora that can be used for detecting which text will generate false positives in WAF rules.

The Corpus interface includes methods for retrieving the URL of a corpus, fetching the corpus file from that remote URL, and creating an iterator over the fetched file. Each corpus also has a size, year, source, and language. The Iterator interface includes methods for checking whether there is another sentence in the corpus and for fetching the next one; each sentence is returned as a Payload that carries the line number and content of the line it was read from. Every corpus must implement the Corpus interface. As this is an experimental package, the interface is subject to change.

Types

type Corpus

type Corpus interface {
	// URL returns the URL of the corpus
	URL() string

	// WithURL sets the URL of the corpus
	WithURL(url string) Corpus

	// FetchCorpusFile fetches the corpus file from the remote URL and returns a File for interacting with it.
	FetchCorpusFile() File

	// GetIterator returns an iterator for the corpus
	GetIterator(c File) Iterator

	// Size returns the size of the corpus
	Size() string

	// WithSize sets the size of the corpus
	// Most corpora will have a size like "100K", "1M", etc., related to the number of sentences in the corpus
	WithSize(size string) Corpus

	// Year returns the year of the corpus
	Year() string

	// WithYear sets the year of the corpus
	// Most corpora will have a year like "2023", "2022", etc.
	WithYear(year string) Corpus

	// Source returns the source of the corpus
	Source() string

	// WithSource sets the source of the corpus
	// Most corpora will have a source like "news", "web", "wikipedia", etc.
	WithSource(source string) Corpus

	// Language returns the language of the corpus
	Language() string

	// WithLanguage sets the language of the corpus
	// Most corpora will have a language like "eng", "de", etc.
	WithLanguage(lang string) Corpus
}

Corpus is the interface that must be implemented to make a corpus available to clients

type File

type File interface {
	// CacheDir is the directory where files are cached
	CacheDir() string

	// FilePath is the path to the cached file
	FilePath() string

	// WithCacheDir sets the cache directory
	WithCacheDir(cacheDir string) File

	// WithFilePath sets the file path
	WithFilePath(filePath string) File
}

The File interface is used to interact with corpus files. It provides methods for reading and setting the cache directory and the file path.

type Iterator

type Iterator interface {
	// Next returns the next sentence from the corpus
	Next() Payload
	// HasNext returns true if there is another sentence in the corpus,
	// and false once the end of the corpus has been reached
	HasNext() bool
}

Iterator is an interface for iterating over a corpus

type Payload

type Payload interface {
	// LineNumber returns the line number of the payload
	LineNumber() int
	// SetLineNumber sets the line number of the payload
	SetLineNumber(line int)
	// Content returns the content of the payload, i.e. the text of the line read by the Corpus Iterator
	Content() string
	// SetContent sets the content of the payload
	SetContent(content string)
}

type Type

type Type string

Type is the type of the corpus

const (
	Leipzig Type = "leipzig"
	NoType  Type = "none"
)

func (*Type) Set

func (t *Type) Set(value string) error

func (*Type) String

func (t *Type) String() string
