package unigram
v0.3.0
Warning

This package is not in the latest version of its module.

Published: Sep 18, 2025 | License: Apache-2.0 | Imports: 11 | Imported by: 2

Documentation

Constants

const (
	CacheExpiredTime = 5
	CacheCleanTime   = 10
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// contains filtered or unexported fields
}

Config holds the configuration for the Unigram model

type TokenScore

type TokenScore struct {
	Token string
	Score float64
}

TokenScore represents a token and its score in the Unigram model
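
As a minimal sketch of how a scored vocabulary might be assembled, the snippet below redeclares TokenScore locally so it compiles without the package. The helper buildIndex is hypothetical, and the ID scheme it uses (a token's ID is its position in the slice) is an assumption, not documented behavior. Scores in unigram models are typically log-probabilities, which is why the sample values are negative.

```go
package main

import "fmt"

// TokenScore mirrors the documented struct; it is redeclared here only so
// the sketch compiles on its own.
type TokenScore struct {
	Token string
	Score float64
}

// buildIndex (hypothetical helper) assigns each token an ID equal to its
// position in the slice. That ID scheme is an assumption.
func buildIndex(vocab []TokenScore) map[string]int {
	index := make(map[string]int, len(vocab))
	for id, ts := range vocab {
		index[ts.Token] = id
	}
	return index
}

func main() {
	vocab := []TokenScore{
		{Token: "<unk>", Score: 0},
		{Token: "hello", Score: -2.5},
		{Token: "world", Score: -3.1},
	}
	index := buildIndex(vocab)
	fmt.Println(len(index), index["world"]) // prints "3 2"
}
```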

type Unigram

type Unigram struct {
	// contains filtered or unexported fields
}

Unigram implements the Unigram language model for tokenization

func New

func New(vocab []TokenScore, opts *util.Params) (*Unigram, error)

New creates a new Unigram model with the given vocabulary and options

func (*Unigram) GetVocab

func (u *Unigram) GetVocab() map[string]int

GetVocab returns the vocabulary as a map from token to ID.

func (*Unigram) GetVocabSize

func (u *Unigram) GetVocabSize() int

GetVocabSize returns the size of the vocabulary

func (*Unigram) IdToToken

func (u *Unigram) IdToToken(id int) (string, bool)

IdToToken returns the token for the given ID

func (*Unigram) Save

func (u *Unigram) Save(dir string, prefixOpt ...string) error

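Save writes the Unigram model to the directory dir. The variadic prefixOpt parameter is not described here; it presumably supplies an optional file-name prefix.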

func (*Unigram) TokenToId

func (u *Unigram) TokenToId(token string) (int, bool)

TokenToId returns the ID for the given token
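
Both IdToToken and TokenToId return a second boolean value. The sketch below shows how a caller might handle that pair, assuming the boolean reports whether the lookup succeeded (the usual Go "comma ok" convention; the documentation does not state it). The lookup type and the idOrUnk helper are hypothetical stand-ins, not part of the package.

```go
package main

import "fmt"

// lookup is a hypothetical stand-in that exposes the two documented method
// signatures over an in-memory vocabulary.
type lookup struct {
	tokens []string
	ids    map[string]int
}

func newLookup(tokens []string) *lookup {
	ids := make(map[string]int, len(tokens))
	for id, tok := range tokens {
		ids[tok] = id
	}
	return &lookup{tokens: tokens, ids: ids}
}

// TokenToId reports the ID for token and whether the token was found.
func (l *lookup) TokenToId(token string) (int, bool) {
	id, ok := l.ids[token]
	return id, ok
}

// IdToToken reports the token for id and whether the ID is in range.
func (l *lookup) IdToToken(id int) (string, bool) {
	if id < 0 || id >= len(l.tokens) {
		return "", false
	}
	return l.tokens[id], true
}

// idOrUnk falls back to unkID when the token is not in the vocabulary.
func idOrUnk(l *lookup, token string, unkID int) int {
	if id, ok := l.TokenToId(token); ok {
		return id
	}
	return unkID
}

func main() {
	l := newLookup([]string{"<unk>", "hello", "world"})
	fmt.Println(idOrUnk(l, "world", 0), idOrUnk(l, "missing", 0)) // prints "2 0"
	if _, ok := l.IdToToken(42); !ok {
		fmt.Println("id 42 is not in the vocabulary")
	}
}
```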

func (*Unigram) Tokenize

func (u *Unigram) Tokenize(sequence string) ([]tokenizer.Token, error)

Tokenize splits the given sequence into tokens.
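
Unigram tokenizers commonly choose the segmentation whose pieces have the highest total score, found with a Viterbi-style dynamic program. The sketch below illustrates that general technique only; it is not this package's implementation, and the segment function, its map-based vocabulary, and its []string result (rather than []tokenizer.Token) are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// segment returns the split of text into vocabulary pieces with the highest
// total score, or nil when the text cannot be fully covered by the vocabulary.
func segment(text string, scores map[string]float64) []string {
	runes := []rune(text)
	n := len(runes)
	best := make([]float64, n+1) // best[i]: best total score for runes[:i]
	prev := make([]int, n+1)     // prev[i]: start of the last piece ending at i
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(-1)
		prev[i] = -1
	}
	for end := 1; end <= n; end++ {
		for start := 0; start < end; start++ {
			if math.IsInf(best[start], -1) {
				continue // no segmentation reaches this start position
			}
			score, ok := scores[string(runes[start:end])]
			if !ok {
				continue
			}
			if cand := best[start] + score; cand > best[end] {
				best[end] = cand
				prev[end] = start
			}
		}
	}
	if n > 0 && prev[n] == -1 {
		return nil
	}
	// Walk the back-pointers from the end to recover the pieces in order.
	var pieces []string
	for end := n; end > 0; end = prev[end] {
		pieces = append([]string{string(runes[prev[end]:end])}, pieces...)
	}
	return pieces
}

func main() {
	scores := map[string]float64{"un": -1, "i": -1.5, "gram": -1, "unigram": -6}
	fmt.Println(segment("unigram", scores)) // prints "[un i gram]"
}
```

Here "un" + "i" + "gram" scores -3.5 in total, which beats the single piece "unigram" at -6, so the three-piece split wins.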

type UnigramBuilder

type UnigramBuilder struct {
	// contains filtered or unexported fields
}

UnigramBuilder can be used to create a Unigram model with a custom configuration

func NewUnigramBuilder

func NewUnigramBuilder() *UnigramBuilder

NewUnigramBuilder creates a new UnigramBuilder with default configuration

func (*UnigramBuilder) Build

func (ub *UnigramBuilder) Build() (*Unigram, error)

Build creates a new Unigram model with the configured parameters

func (*UnigramBuilder) BytesFallback

func (ub *UnigramBuilder) BytesFallback(bytesFallback bool) *UnigramBuilder

BytesFallback sets whether to use byte fallback for unknown tokens
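
To illustrate what byte fallback usually means, the sketch below spells a piece that has no vocabulary entry as one token per UTF-8 byte. The "<0xNN>" token naming follows the SentencePiece convention and is an assumption about this package; byteFallback is a hypothetical helper.

```go
package main

import "fmt"

// byteFallback (hypothetical helper) spells an out-of-vocabulary piece as one
// "<0xNN>" token per UTF-8 byte. The token format is an assumption.
func byteFallback(piece string) []string {
	out := make([]string, 0, len(piece))
	for i := 0; i < len(piece); i++ {
		out = append(out, fmt.Sprintf("<0x%02X>", piece[i]))
	}
	return out
}

func main() {
	// "é" is encoded as two bytes in UTF-8: 0xC3 0xA9.
	fmt.Println(byteFallback("é")) // prints "[<0xC3> <0xA9>]"
}
```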

func (*UnigramBuilder) FuseUnk

func (ub *UnigramBuilder) FuseUnk(fuseUnk bool) *UnigramBuilder

FuseUnk sets whether to fuse unknown tokens together
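
Fusing unknown tokens generally means collapsing a run of consecutive unknown tokens into a single one. A minimal sketch of that idea, operating on token IDs, follows; fuseUnk here is a hypothetical free function and may not match how the package applies the option internally.

```go
package main

import "fmt"

// fuseUnk (hypothetical helper) collapses each run of consecutive unkID
// values into a single unkID and leaves all other IDs untouched.
func fuseUnk(ids []int, unkID int) []int {
	out := make([]int, 0, len(ids))
	for i, id := range ids {
		if id == unkID && i > 0 && ids[i-1] == unkID {
			continue // skip repeats within a run of unknowns
		}
		out = append(out, id)
	}
	return out
}

func main() {
	fmt.Println(fuseUnk([]int{5, 0, 0, 0, 7, 0}, 0)) // prints "[5 0 7 0]"
}
```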

func (*UnigramBuilder) UnkID

func (ub *UnigramBuilder) UnkID(unkID int) *UnigramBuilder

UnkID sets the unknown token ID for the Unigram model

func (*UnigramBuilder) Vocab

func (ub *UnigramBuilder) Vocab(vocab []TokenScore) *UnigramBuilder

Vocab sets the vocabulary for the Unigram model
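
The builder methods above each return *UnigramBuilder, so calls can be chained before Build. Because the package import path is not shown on this page, the sketch below uses hypothetical stand-in types (model, builder) that mirror the documented method signatures. The validation inside Build is an assumption for illustration; the real Build may check different conditions.

```go
package main

import (
	"errors"
	"fmt"
)

// TokenScore mirrors the documented struct.
type TokenScore struct {
	Token string
	Score float64
}

// model and builder are hypothetical stand-ins for Unigram and UnigramBuilder.
type model struct {
	vocab         []TokenScore
	unkID         int
	fuseUnk       bool
	bytesFallback bool
}

type builder struct{ m model }

func newBuilder() *builder { return &builder{} }

// Each setter records one option and returns the builder to allow chaining.
func (b *builder) Vocab(v []TokenScore) *builder  { b.m.vocab = v; return b }
func (b *builder) UnkID(id int) *builder          { b.m.unkID = id; return b }
func (b *builder) FuseUnk(on bool) *builder       { b.m.fuseUnk = on; return b }
func (b *builder) BytesFallback(on bool) *builder { b.m.bytesFallback = on; return b }

// Build validates the accumulated settings. These checks are assumptions.
func (b *builder) Build() (*model, error) {
	if len(b.m.vocab) == 0 {
		return nil, errors.New("empty vocabulary")
	}
	if b.m.unkID < 0 || b.m.unkID >= len(b.m.vocab) {
		return nil, errors.New("unknown-token ID out of range")
	}
	m := b.m // copy so later builder changes do not affect the model
	return &m, nil
}

func main() {
	vocab := []TokenScore{{Token: "<unk>", Score: 0}, {Token: "hello", Score: -2.5}}
	m, err := newBuilder().Vocab(vocab).UnkID(0).FuseUnk(true).BytesFallback(false).Build()
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println(len(m.vocab), m.unkID, m.fuseUnk) // prints "2 0 true"
}
```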
