tokenizers

package

Version: v0.3.0 (latest) | Published: Aug 25, 2025 | License: Apache-2.0 | Imports: 5 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/zerfoo/zerfoo

Documentation ¶

Index ¶

type TokenizerNode
- func NewTokenizerNode(vocab map[string]int32, unkTokenID int32) *TokenizerNode

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type TokenizerNode ¶

type TokenizerNode struct {
	// contains filtered or unexported fields
}

TokenizerNode converts a tensor of strings into a tensor of integer token IDs. Note: this implementation assumes a Node interface flexible enough to handle different tensor types, rather than one tied strictly to numeric tensors.

func NewTokenizerNode ¶

func NewTokenizerNode(vocab map[string]int32, unkTokenID int32) *TokenizerNode

NewTokenizerNode creates a new node for tokenization. The vocabulary maps string tokens to their integer IDs. unkTokenID is the ID to use for tokens not found in the vocabulary.
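The lookup rule described above can be illustrated without the library itself. The following is a minimal sketch using only the standard library; `lookupID` is a hypothetical helper introduced for illustration and is not part of this package's API.

```go
package main

import "fmt"

// lookupID is a hypothetical helper (not part of the tokenizers package).
// It returns the ID mapped to token, or unkTokenID when the token is not
// present in the vocabulary.
func lookupID(vocab map[string]int32, unkTokenID int32, token string) int32 {
	if id, ok := vocab[token]; ok {
		return id
	}
	return unkTokenID
}

func main() {
	// Example vocabulary; the "<unk>" entry and its ID 0 are illustrative choices.
	vocab := map[string]int32{"<unk>": 0, "hello": 1, "world": 2}

	fmt.Println(lookupID(vocab, 0, "hello")) // token is in the vocabulary
	fmt.Println(lookupID(vocab, 0, "zebra")) // token is missing, falls back to unkTokenID
}
```

The comma-ok form of the map lookup matters here: a plain `vocab[token]` would return 0 for a missing key, which is only correct when unkTokenID happens to be 0.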

func (*TokenizerNode) Attributes ¶

func (n *TokenizerNode) Attributes() map[string]any

Attributes returns no attributes for this node.

func (*TokenizerNode) Backward ¶

func (n *TokenizerNode) Backward(ctx context.Context, mode types.BackwardMode, outputGradient tensor.Tensor) ([]tensor.Tensor, error)

Backward is not implemented for TokenizerNode because tokenization is not a differentiable operation.

func (*TokenizerNode) Forward ¶

func (n *TokenizerNode) Forward(ctx context.Context, inputs ...tensor.Tensor) (tensor.Tensor, error)

Forward performs the tokenization. It expects a single input: a 1D TensorString. It outputs a 2D TensorNumeric[int32] with shape [1, sequence_length].
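To make the input and output contract concrete, here is a rough sketch of the mapping Forward is documented to perform, written against plain Go slices instead of the library's tensor types. The `tokenize` function and its return values are assumptions made for illustration; the real method operates on tensor.Tensor values whose constructors are not shown on this page.

```go
package main

import "fmt"

// tokenize is a hypothetical stand-in for the documented behavior of Forward.
// It maps a 1D list of string tokens to int32 IDs and reports the shape
// [1, sequence_length] that the documentation gives for the output.
func tokenize(vocab map[string]int32, unkTokenID int32, tokens []string) ([]int32, []int) {
	ids := make([]int32, len(tokens))
	for i, tok := range tokens {
		id, ok := vocab[tok]
		if !ok {
			id = unkTokenID // token not in the vocabulary
		}
		ids[i] = id
	}
	// A single batch row holding len(tokens) IDs.
	return ids, []int{1, len(tokens)}
}

func main() {
	vocab := map[string]int32{"<unk>": 0, "the": 1, "cat": 2}
	ids, shape := tokenize(vocab, 0, []string{"the", "cat", "sat"})
	fmt.Println(ids, shape) // prints "[1 2 0] [1 3]"
}
```

In this sketch the IDs are kept in a flat slice alongside a separate shape, which is one common way to represent a 2D tensor with a single row.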

func (*TokenizerNode) OpType ¶

func (n *TokenizerNode) OpType() string

OpType returns the type of the node.

func (*TokenizerNode) OutputShape ¶

func (n *TokenizerNode) OutputShape() []int

OutputShape returns the shape of the output tensor. Because the sequence length is dynamic, that dimension is represented with -1.
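A caller that needs a concrete shape has to substitute the actual sequence length for the -1 placeholder once the input is known. The sketch below shows one way to do that; `resolveShape` is a hypothetical helper, and the declared shape `[1, -1]` is an assumption inferred from the Forward documentation, not a value confirmed by this page.

```go
package main

import "fmt"

// resolveShape is a hypothetical helper (not part of the tokenizers package).
// It returns a copy of declared in which every -1 dimension is replaced by
// seqLen, leaving the input slice unmodified.
func resolveShape(declared []int, seqLen int) []int {
	out := make([]int, len(declared))
	for i, d := range declared {
		if d == -1 {
			d = seqLen // fill in the dynamic dimension
		}
		out[i] = d
	}
	return out
}

func main() {
	// Assumed declared shape: one batch row, dynamic sequence length.
	declared := []int{1, -1}
	fmt.Println(resolveShape(declared, 5)) // prints "[1 5]"
}
```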

Source Files ¶

tokenizer_node.go
