dit

v0.0.2
Published: Feb 6, 2026 License: MIT Imports: 8 Imported by: 2

README

dît

dît (Kurdish for "found") identifies the type of an HTML form and of each of its fields using machine learning.

It detects whether a form is a login, search, registration, password recovery, contact, mailing list, order form, or something else, and classifies each field (username, password, email, search query, etc.). Zero external ML dependencies.

Install

go get github.com/happyhackingspace/dit

Usage

As a Library
import (
    "fmt"
    "log"
    "github.com/happyhackingspace/dit"
)

// Load classifier (finds model.json automatically)
c, err := dit.New()
if err != nil {
    log.Fatal(err)
}

// Classify forms in HTML
results, err := c.ExtractForms(htmlString)
if err != nil {
    log.Fatal(err)
}
for _, r := range results {
    fmt.Println(r.Type)   // "login"
    fmt.Println(r.Fields) // {"username": "username or email", "password": "password"}
}

// With probabilities (values below 0.05 are omitted)
probas, err := c.ExtractFormsProba(htmlString, 0.05)
if err != nil {
    log.Fatal(err)
}
fmt.Println(probas)

// Train a new model and save it
trained, err := dit.Train("data/", &dit.TrainConfig{Verbose: true})
if err != nil {
    log.Fatal(err)
}
if err := trained.Save("model.json"); err != nil {
    log.Fatal(err)
}

// Evaluate via cross-validation
eval, err := dit.Evaluate("data/", &dit.EvalConfig{Folds: 10})
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Form accuracy: %.1f%%\n", eval.FormAccuracy*100)
As a CLI
# Classify forms on a URL
dit run https://github.com/login

# Classify forms in a local file
dit run login.html

# With probabilities
dit run https://github.com/login --proba

# Train a model
dit train model.json --data-folder data

# Evaluate model accuracy
dit evaluate --data-folder data

Form Types

| Type | Description |
| --- | --- |
| login | Login form |
| search | Search form |
| registration | Registration / signup form |
| password/login recovery | Password reset / recovery form |
| contact/comment | Contact or comment form |
| join mailing list | Newsletter / mailing list signup |
| order/add to cart | Order or add-to-cart form |
| other | Other form type |

Field Types

| Category | Types |
| --- | --- |
| Authentication | username, password, password confirmation, email, email confirmation, username or email |
| Names | first name, last name, middle name, full name, organization name, gender |
| Address | country, city, state, address, postal code |
| Contact | phone, fax, url |
| Search | search query, search category |
| Content | comment text, comment title, about me text |
| Buttons | submit button, cancel button, reset button |
| Verification | captcha, honeypot, TOS confirmation, remember me checkbox, receive emails confirmation |
| Security | security question, security answer |
| Time | full date, day, month, year, timezone |
| Product | product quantity, sorting option, style select |
| Other | other number, other read-only, other |

The full list of 79 type codes is in data/config.json.

Accuracy

Cross-validation results (10-fold, grouped by domain):

| Metric | Score |
| --- | --- |
| Form type accuracy | 83.1% (1138/1369) |
| Field type accuracy | 86.9% (4536/5218) |
| Sequence accuracy | 79.2% (1031/1302) |
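
Grouping by domain means that all forms from one website end up in the same fold, so a fold is never evaluated on a site the model saw during training. The sketch below shows one simple way to get that property by hashing the domain name. The `foldFor` helper and the hashing approach are assumptions made for illustration; how dit actually assigns folds is not described here.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// foldFor maps a domain to a fold index in [0, k).
// Hypothetical helper: hashing is an assumption of this sketch,
// not necessarily how dit splits its data.
func foldFor(domain string, k int) int {
	h := fnv.New32a()
	h.Write([]byte(domain))
	return int(h.Sum32() % uint32(k))
}

func main() {
	// Two forms from example.com always share a fold.
	domains := []string{"example.com", "example.org", "example.com"}
	for _, d := range domains {
		fmt.Printf("%s -> fold %d\n", d, foldFor(d, 10))
	}
}
```

Because the fold depends only on the domain, every form from the same site is held out together.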

The model was trained on more than 1,000 annotated web forms from Alexa Top 1M websites.

Contributing

See CONTRIBUTING.md.

Credits

dît is a Go port of Formasaurus.

License

MIT

Documentation

Overview

Package dit classifies HTML form and field types.

It provides a two-stage ML pipeline: logistic regression for form types and a CRF model for field types.

c, _ := dit.New()
results, _ := c.ExtractForms(htmlString)
for _, r := range results {
    fmt.Println(r.Type)   // "login"
    fmt.Println(r.Fields) // {"username": "username or email", "password": "password"}
}

Types

type Classifier

type Classifier struct {
	// contains filtered or unexported fields
}

Classifier wraps the form and field type classification models.

func Load

func Load(path string) (*Classifier, error)

Load loads a trained classifier from a model file.

func New

func New() (*Classifier, error)

New loads the classifier from "model.json", searching the current directory and parent directories up to the module root (where go.mod lives).
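
The upward search described above can be sketched as follows. This is an illustration of the general technique, not dit's implementation: `findUp` and `fileExists` are hypothetical helpers, and the rule of stopping at the directory that contains go.mod is inferred from the description.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findUp looks for name in dir, then in each parent directory.
// It gives up after checking the directory that contains go.mod,
// or when it reaches the filesystem root.
func findUp(dir, name string, exists func(string) bool) (string, bool) {
	for {
		candidate := filepath.Join(dir, name)
		if exists(candidate) {
			return candidate, true
		}
		if exists(filepath.Join(dir, "go.mod")) {
			return "", false // module root reached without a match
		}
		parent := filepath.Dir(dir)
		if parent == dir {
			return "", false // filesystem root reached
		}
		dir = parent
	}
}

// fileExists reports whether path names an existing regular file.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && !info.IsDir()
}

func main() {
	wd, err := os.Getwd()
	if err != nil {
		fmt.Println(err)
		return
	}
	if path, ok := findUp(wd, "model.json", fileExists); ok {
		fmt.Println("found", path)
	} else {
		fmt.Println("model.json not found")
	}
}
```

When the model file lives somewhere else, Load with an explicit path avoids the search altogether.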

func Train

func Train(dataDir string, config *TrainConfig) (*Classifier, error)

Train trains a classifier on annotated HTML forms in the given data directory.

func (*Classifier) ExtractForms

func (c *Classifier) ExtractForms(html string) ([]FormResult, error)

ExtractForms extracts and classifies all forms in the given HTML string. Returns an empty slice (not nil) if no forms are found.
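
The empty-versus-nil distinction matters mostly when results are serialized. The sketch below uses plain string slices rather than dit types to show the standard encoding/json behavior: a nil slice encodes as null, while an empty slice encodes as [].

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encode returns the JSON encoding of a slice of strings.
func encode(v []string) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	var nilSlice []string
	emptySlice := []string{}
	fmt.Println(encode(nilSlice))   // null
	fmt.Println(encode(emptySlice)) // []
}
```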

func (*Classifier) ExtractFormsProba

func (c *Classifier) ExtractFormsProba(html string, threshold float64) ([]FormResultProba, error)

ExtractFormsProba extracts forms and returns classification probabilities. Probabilities below threshold are omitted.
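
To make the threshold concrete, the following sketch filters a probability map the way the description suggests. `filterBelow` is a hypothetical helper, and treating a value exactly equal to the threshold as kept is an assumption, since the documentation only says that probabilities below the threshold are omitted.

```go
package main

import "fmt"

// filterBelow returns a copy of probs without entries below threshold.
// Assumption: values equal to the threshold are kept.
func filterBelow(probs map[string]float64, threshold float64) map[string]float64 {
	out := make(map[string]float64)
	for label, p := range probs {
		if p >= threshold {
			out[label] = p
		}
	}
	return out
}

func main() {
	probs := map[string]float64{"login": 0.93, "registration": 0.04, "other": 0.03}
	fmt.Println(filterBelow(probs, 0.05)) // map[login:0.93]
}
```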

func (*Classifier) Save

func (c *Classifier) Save(path string) error

Save writes the classifier to a model file.

type EvalConfig

type EvalConfig struct {
	Folds   int
	Verbose bool
}

EvalConfig holds configuration for evaluation.

type EvalResult

type EvalResult struct {
	FormAccuracy     float64
	FieldAccuracy    float64
	SequenceAccuracy float64
	FormCorrect      int
	FormTotal        int
	FieldCorrect     int
	FieldTotal       int
	SequenceCorrect  int
	SequenceTotal    int
}

EvalResult holds cross-validation evaluation results.
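
Each accuracy field presumably equals the matching Correct count divided by the Total count; that relationship is an assumption here, but it agrees with the figures in the README's accuracy table. A small sketch with a hypothetical `percent` helper reproduces those percentages from the published counts.

```go
package main

import "fmt"

// percent formats correct/total as a percentage with one decimal place.
func percent(correct, total int) string {
	if total == 0 {
		return "n/a"
	}
	return fmt.Sprintf("%.1f%%", 100*float64(correct)/float64(total))
}

func main() {
	fmt.Println(percent(1138, 1369)) // 83.1%
	fmt.Println(percent(4536, 5218)) // 86.9%
	fmt.Println(percent(1031, 1302)) // 79.2%
}
```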

func Evaluate

func Evaluate(dataDir string, config *EvalConfig) (*EvalResult, error)

Evaluate runs cross-validation evaluation on annotated data.

type FormResult

type FormResult struct {
	Type   string            `json:"type"`
	Fields map[string]string `json:"fields,omitempty"`
}

FormResult holds the classification result for a single form.
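
The struct tags determine how a result looks as JSON. The sketch below mirrors the documented struct in a local `formResult` type (a stand-in, so the example compiles without importing dit) and shows that `omitempty` drops the fields key when no fields are present.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// formResult mirrors the documented FormResult struct.
// It is a local stand-in for illustration only.
type formResult struct {
	Type   string            `json:"type"`
	Fields map[string]string `json:"fields,omitempty"`
}

// toJSON encodes a result, returning the error text on failure.
func toJSON(r formResult) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(toJSON(formResult{Type: "search"}))
	// {"type":"search"}

	login := formResult{
		Type:   "login",
		Fields: map[string]string{"username": "username or email", "password": "password"},
	}
	fmt.Println(toJSON(login))
	// {"type":"login","fields":{"password":"password","username":"username or email"}}
}
```

encoding/json sorts map keys, so the field order in the output is alphabetical and does not reflect the order of fields in the form.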

type FormResultProba

type FormResultProba struct {
	Type   map[string]float64            `json:"type"`
	Fields map[string]map[string]float64 `json:"fields,omitempty"`
}

FormResultProba holds probability-based classification results for a single form.
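
Callers that want a single label from a probability map can take the entry with the highest value. The sketch below shows one way to do that; `mostLikely` is a hypothetical helper, not part of the dit API, and the alphabetical tie-break is a choice made here for determinism.

```go
package main

import "fmt"

// mostLikely returns the label with the highest probability.
// Ties are broken alphabetically so the result is deterministic.
// For an empty map it returns "" and -1.
func mostLikely(probs map[string]float64) (string, float64) {
	bestLabel, best := "", -1.0
	for label, p := range probs {
		if p > best || (p == best && label < bestLabel) {
			bestLabel, best = label, p
		}
	}
	return bestLabel, best
}

func main() {
	typeProbs := map[string]float64{"login": 0.93, "registration": 0.04}
	label, p := mostLikely(typeProbs)
	fmt.Printf("%s (%.2f)\n", label, p) // login (0.93)
}
```

The same helper applies to each inner map of Fields, since those are also label-to-probability maps.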

type TrainConfig

type TrainConfig struct {
	Verbose bool
}

TrainConfig holds configuration for training.

Directories

Path Synopsis
Package classifier implements form and field type classification.
cmd
dit command
Package crf implements a linear-chain Conditional Random Field.
internal
htmlutil
Package htmlutil provides HTML form and field extraction utilities.
storage
Package storage provides access to annotation data for form classification training.
textutil
Package textutil provides text processing utilities for form classification.
vectorizer
Package vectorizer provides text vectorization utilities matching sklearn behavior.
