storage

package
v0.0.14 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 4, 2026 | License: MIT | Imports: 10 | Imported by: 0

Documentation

Overview

Package storage provides access to annotation data for form classification training.

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetDomain

func GetDomain(rawURL string) string

GetDomain extracts the domain name from a URL (for grouped cross-validation).
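
Grouped cross-validation keeps every form from one site in the same fold, so the domain acts as the group key. The following is a minimal sketch of that grouping built on the standard library's net/url. The helpers `hostOf` and `groupByHost` are hypothetical and are not part of this package; the real GetDomain may normalize differently (for example, by stripping subdomains).

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hostOf is a hypothetical stand-in for GetDomain: it returns the
// lower-cased host of rawURL without the port, or "" if parsing fails.
func hostOf(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	return strings.ToLower(u.Hostname())
}

// groupByHost buckets URLs by host so that each bucket can be
// assigned to a single cross-validation fold.
func groupByHost(urls []string) map[string][]string {
	groups := make(map[string][]string)
	for _, raw := range urls {
		h := hostOf(raw)
		groups[h] = append(groups[h], raw)
	}
	return groups
}

func main() {
	groups := groupByHost([]string{
		"https://example.com/login",
		"https://example.com:8443/signup",
		"http://shop.example.org/search?q=go",
	})
	for host, urls := range groups {
		fmt.Println(host, len(urls))
	}
}
```

Keying folds on the host rather than the full URL is what prevents near-identical forms from the same site from appearing in both the training and the evaluation split.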

Types

type AnnotationSchema

type AnnotationSchema struct {
	Types       map[string]string // full_name -> short_name
	TypesInv    map[string]string // short_name -> full_name
	NAValue     string
	SkipValue   string
	SimplifyMap map[string]string
}

AnnotationSchema holds the types and their mappings for form or field annotations.
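
The field comments suggest that Types and TypesInv are inverse maps of each other. As an illustration only, the sketch below mirrors the struct locally and derives TypesInv from Types. The constructor `newSchema` and the sample type names and codes are assumptions made for this example, not values taken from the package or its data files.

```go
package main

import "fmt"

// AnnotationSchema mirrors the documented struct so the sketch compiles on its own.
type AnnotationSchema struct {
	Types       map[string]string // full_name -> short_name
	TypesInv    map[string]string // short_name -> full_name
	NAValue     string
	SkipValue   string
	SimplifyMap map[string]string
}

// newSchema is a hypothetical constructor that derives TypesInv from Types.
func newSchema(types map[string]string, na, skip string) *AnnotationSchema {
	inv := make(map[string]string, len(types))
	for full, short := range types {
		inv[short] = full
	}
	return &AnnotationSchema{Types: types, TypesInv: inv, NAValue: na, SkipValue: skip}
}

func main() {
	// The type names and short codes here are made up for illustration.
	s := newSchema(map[string]string{"login": "l", "search": "s"}, "X", "-")
	fmt.Println(s.Types["login"], s.TypesInv["s"])
}
```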

type FormAnnotation

type FormAnnotation struct {
	FormHTML       string
	URL            string
	Type           string            // short form type
	TypeFull       string            // full form type
	FormIndex      int               // index of form on the page
	FieldTypes     map[string]string // field_name -> short_type
	FieldTypesFull map[string]string // field_name -> full_type
	FormSchema     *AnnotationSchema
	FieldSchema    *AnnotationSchema

	// Computed
	FormAnnotated   bool
	FieldsAnnotated bool
}

FormAnnotation represents a single annotated form.

type IterOptions

type IterOptions struct {
	DropDuplicates     bool
	DropNA             bool
	DropSkipped        bool
	SimplifyFormTypes  bool
	SimplifyFieldTypes bool
	Verbose            bool
}

IterOptions controls annotation iteration behavior.

func DefaultIterOptions

func DefaultIterOptions() IterOptions

DefaultIterOptions returns the default options for iterating annotations.
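
The documentation does not spell out what each flag does. One plausible reading, sketched below with trimmed local copies of the types, is that DropNA and DropSkipped discard annotations whose type equals the schema's NAValue or SkipValue, and that DropDuplicates discards forms with repeated markup. The function `filterForms` and these semantics are assumptions, not the package's confirmed behavior.

```go
package main

import "fmt"

// Trimmed local mirrors of the documented types, so the sketch is self-contained.
type IterOptions struct {
	DropDuplicates bool
	DropNA         bool
	DropSkipped    bool
}

type FormAnnotation struct {
	FormHTML string
	Type     string // short form type
}

// filterForms applies the assumed meaning of each flag. naValue and
// skipValue stand in for AnnotationSchema.NAValue and SkipValue.
func filterForms(in []FormAnnotation, opts IterOptions, naValue, skipValue string) []FormAnnotation {
	seen := make(map[string]bool)
	var out []FormAnnotation
	for _, a := range in {
		if opts.DropNA && a.Type == naValue {
			continue
		}
		if opts.DropSkipped && a.Type == skipValue {
			continue
		}
		if opts.DropDuplicates {
			if seen[a.FormHTML] {
				continue
			}
			seen[a.FormHTML] = true
		}
		out = append(out, a)
	}
	return out
}

func main() {
	forms := []FormAnnotation{
		{FormHTML: "<form id=a>", Type: "l"},
		{FormHTML: "<form id=a>", Type: "l"}, // duplicate markup
		{FormHTML: "<form id=b>", Type: "X"}, // assumed NA value
		{FormHTML: "<form id=c>", Type: "-"}, // assumed skip value
	}
	opts := IterOptions{DropDuplicates: true, DropNA: true, DropSkipped: true}
	fmt.Println(len(filterForms(forms, opts, "X", "-")))
}
```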

type PageAnnotation (added in v0.0.3)

type PageAnnotation struct {
	HTML     string
	URL      string
	Type     string // short page type
	TypeFull string // full page type
}

PageAnnotation represents a single annotated page.

type PageStorage (added in v0.0.3)

type PageStorage struct {
	Folder string
}

PageStorage wraps the page annotation data folder.

func NewPageStorage (added in v0.0.3)

func NewPageStorage(folder string) *PageStorage

NewPageStorage creates a PageStorage for the given data folder.

func (*PageStorage) GetPageIndex (added in v0.0.3)

func (s *PageStorage) GetPageIndex() (map[string]pageIndexEntry, error)

GetPageIndex reads the page index file.

func (*PageStorage) GetPageSchema (added in v0.0.3)

func (s *PageStorage) GetPageSchema() (*AnnotationSchema, error)

GetPageSchema reads the page type schema from config.json.

func (*PageStorage) IterPageAnnotations (added in v0.0.3)

func (s *PageStorage) IterPageAnnotations(opts IterOptions) ([]PageAnnotation, error)

IterPageAnnotations returns the PageAnnotation values read from the storage as a slice, with opts controlling the iteration.

type Storage

type Storage struct {
	Folder string
}

Storage wraps the annotation data folder.

func NewStorage

func NewStorage(folder string) *Storage

NewStorage creates a Storage for the given data folder.

func (*Storage) GetConfig

func (s *Storage) GetConfig() (*configJSON, error)

GetConfig reads the config file.

func (*Storage) GetFieldSchema

func (s *Storage) GetFieldSchema() (*AnnotationSchema, error)

GetFieldSchema returns the field annotation schema.

func (*Storage) GetFormSchema

func (s *Storage) GetFormSchema() (*AnnotationSchema, error)

GetFormSchema returns the form annotation schema.

func (*Storage) GetIndex

func (s *Storage) GetIndex() (map[string]indexEntry, error)

GetIndex reads the index file.

func (*Storage) IterAnnotations

func (s *Storage) IterAnnotations(opts IterOptions) ([]FormAnnotation, error)

IterAnnotations returns the FormAnnotation values read from the storage as a slice, with opts controlling the iteration.
