chunking

package
v0.0.1

Warning: This package is not in the latest version of its module.
Published: Mar 16, 2026 License: Apache-2.0 Imports: 11 Imported by: 4

Documentation

Overview

Package chunking provides primitives to interact with the openapi HTTP API.

Code generated by github.com/oapi-codegen/oapi-codegen/v2 version v2.5.1 DO NOT EDIT.

Index

Constants

const (
	// ModelFixedBert uses BERT's WordPiece tokenization (~30k vocab).
	// Good for general-purpose text and multilingual content.
	ModelFixedBert = "fixed-bert-tokenizer"

	// ModelFixedBPE uses OpenAI's tiktoken BPE tokenization (cl100k_base, ~100k vocab).
	// Good for GPT-style models and code.
	ModelFixedBPE = "fixed-bpe-tokenizer"
)

Fixed chunker model names

const MIMETypePlainText = "text/plain"

MIMETypePlainText is the MIME type for text/plain content chunks.

Variables

This section is empty.

Functions

func GetSwagger

func GetSwagger() (swagger *openapi3.T, err error)

GetSwagger returns the Swagger specification corresponding to the generated code in this file, with its external references resolved. Resolution of external references is tightly coupled to the "import-mapping" feature: externally referenced files must be embedded in the corresponding Go packages. Resolving references by URL is not currently supported.

func PathToRawSpec

func PathToRawSpec(pathToFile string) map[string]func() ([]byte, error)

PathToRawSpec constructs a synthetic filesystem for resolving external references when loading OpenAPI specifications.

Types

type AudioChunkOptions

type AudioChunkOptions struct {
	// OverlapDurationMs Overlap duration in milliseconds between audio chunks (default: 0).
	OverlapDurationMs int `json:"overlap_duration_ms,omitempty,omitzero"`

	// WindowDurationMs Window duration in milliseconds for fixed-window audio chunking (default: 30000).
	WindowDurationMs int `json:"window_duration_ms,omitempty,omitzero"`
}

AudioChunkOptions Options specific to audio chunking.

type BinaryContent

type BinaryContent struct {
	// Data Base64-encoded binary data (valid WAV, PNG, etc.)
	Data []byte `json:"data,omitempty,omitzero"`

	// EndTimeMs Audio: window end time in milliseconds
	EndTimeMs float32 `json:"end_time_ms,omitempty,omitzero"`

	// FrameDelayMs Animation: display delay in milliseconds
	FrameDelayMs int `json:"frame_delay_ms,omitempty,omitzero"`

	// FrameIndex Animation: frame number
	FrameIndex int `json:"frame_index,omitempty,omitzero"`

	// StartTimeMs Audio: window start time in milliseconds
	StartTimeMs float32 `json:"start_time_ms,omitempty,omitzero"`
}

BinaryContent Binary media content with format-specific metadata.
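The Data field is a []byte, which Go's encoding/json writes as a base64 string, matching the "Base64-encoded" note in the field comment. The following is a minimal sketch of that behavior using a local stand-in type; binaryContent and encodeBinary are hypothetical names, the struct mirrors only three of the documented fields, and the tags use plain omitempty so the sketch does not depend on omitzero support.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// binaryContent is a local stand-in that mirrors a subset of the documented
// BinaryContent fields. It is an assumption for illustration, not the
// package's own type.
type binaryContent struct {
	Data        []byte  `json:"data,omitempty"`
	EndTimeMs   float32 `json:"end_time_ms,omitempty"`
	StartTimeMs float32 `json:"start_time_ms,omitempty"`
}

// encodeBinary marshals the value to JSON. encoding/json encodes []byte
// as a base64 string and drops zero-valued omitempty fields.
func encodeBinary(b binaryContent) (string, error) {
	out, err := json.Marshal(b)
	return string(out), err
}

func main() {
	s, err := encodeBinary(binaryContent{Data: []byte("RIFF"), EndTimeMs: 30000})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(s) // prints {"data":"UklGRg==","end_time_ms":30000}
}
```

StartTimeMs is left at zero here, so it is omitted from the output entirely rather than serialized as 0.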

type Chunk

type Chunk struct {
	// Id Sequence number of the chunk (0, 1, 2, ...)
	Id uint32 `json:"id"`

	// MimeType MIME type: text/plain, audio/wav, image/png, etc.
	MimeType string `json:"mime_type"`
	// contains filtered or unexported fields
}

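Chunk is a single piece of chunked content. Besides Id and MimeType, it carries a union payload that holds either a TextContent or a BinaryContent.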

func NewTextChunk

func NewTextChunk(id uint32, text string, startChar, endChar int) Chunk

NewTextChunk creates a Chunk containing text content with the given parameters.

func (Chunk) AsBinaryContent

func (t Chunk) AsBinaryContent() (BinaryContent, error)

AsBinaryContent returns the union data inside the Chunk as a BinaryContent

func (Chunk) AsTextContent

func (t Chunk) AsTextContent() (TextContent, error)

AsTextContent returns the union data inside the Chunk as a TextContent

func (*Chunk) FromBinaryContent

func (t *Chunk) FromBinaryContent(v BinaryContent) error

FromBinaryContent overwrites any union data inside the Chunk as the provided BinaryContent

func (*Chunk) FromTextContent

func (t *Chunk) FromTextContent(v TextContent) error

FromTextContent overwrites any union data inside the Chunk as the provided TextContent

func (Chunk) GetText

func (c Chunk) GetText() string

GetText returns the text content of a text chunk. It returns an empty string if the chunk is not a text chunk or if its content cannot be decoded.

func (Chunk) MarshalJSON

func (t Chunk) MarshalJSON() ([]byte, error)

func (*Chunk) MergeBinaryContent

func (t *Chunk) MergeBinaryContent(v BinaryContent) error

MergeBinaryContent performs a merge with any union data inside the Chunk, using the provided BinaryContent

func (*Chunk) MergeTextContent

func (t *Chunk) MergeTextContent(v TextContent) error

MergeTextContent performs a merge with any union data inside the Chunk, using the provided TextContent

func (*Chunk) UnmarshalJSON

func (t *Chunk) UnmarshalJSON(b []byte) error
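The As*, From*, and GetText methods above suggest a union payload stored in an unexported field. The documentation does not show that field, so the sketch below is only an illustration of the general pattern, assuming the payload is kept as raw JSON. The chunk and textContent types, the union field, and the MIME-type check in getText are all assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// textContent is a local stand-in mirroring the documented TextContent.
type textContent struct {
	EndChar   int    `json:"end_char"`
	StartChar int    `json:"start_char"`
	Text      string `json:"text"`
}

// chunk is a local stand-in for Chunk. The union field is an assumption:
// the real unexported fields are not shown in the documentation.
type chunk struct {
	Id       uint32
	MimeType string
	union    json.RawMessage
}

// fromTextContent overwrites the union payload with v.
func (c *chunk) fromTextContent(v textContent) error {
	b, err := json.Marshal(v)
	if err != nil {
		return err
	}
	c.union = b
	return nil
}

// asTextContent decodes the union payload as a textContent.
func (c chunk) asTextContent() (textContent, error) {
	var v textContent
	err := json.Unmarshal(c.union, &v)
	return v, err
}

// getText follows the documented GetText contract: it returns "" for a
// non-text chunk or when decoding fails. Checking MimeType is an assumption.
func (c chunk) getText() string {
	if c.MimeType != "text/plain" {
		return ""
	}
	v, err := c.asTextContent()
	if err != nil {
		return ""
	}
	return v.Text
}

// roundTrip stores text in a chunk and reads it back.
func roundTrip(text string) string {
	c := chunk{Id: 0, MimeType: "text/plain"}
	if err := c.fromTextContent(textContent{StartChar: 0, EndChar: len(text), Text: text}); err != nil {
		return ""
	}
	return c.getText()
}

func main() {
	fmt.Println(roundTrip("hello")) // prints hello
}
```

With the real package, the same round trip would presumably go through NewTextChunk and GetText rather than these stand-ins.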

type ChunkOptions

type ChunkOptions struct {
	// Audio Options specific to audio chunking.
	Audio AudioChunkOptions `json:"audio,omitempty,omitzero"`

	// MaxChunks Maximum number of chunks to generate per document.
	MaxChunks int `json:"max_chunks,omitempty,omitzero"`

	// Text Options specific to text chunking.
	Text TextChunkOptions `json:"text,omitempty,omitzero"`

	// Threshold Confidence threshold for model-based chunking (0.0-1.0).
	Threshold float32 `json:"threshold,omitempty,omitzero"`
}

ChunkOptions Per-request configuration for chunking. All fields are optional - zero/omitted values use chunker defaults.
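To illustrate the "zero/omitted values use chunker defaults" rule, here is a small sketch that decodes a partial JSON options object and then resolves one default. The types are local stand-ins with plain omitempty tags, and parseOptions and effectiveWindowMs are hypothetical helpers; only the 30000 ms audio window default comes from the field comments above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins mirroring the documented option structs (assumptions).
type audioChunkOptions struct {
	OverlapDurationMs int `json:"overlap_duration_ms,omitempty"`
	WindowDurationMs  int `json:"window_duration_ms,omitempty"`
}

type textChunkOptions struct {
	OverlapTokens int    `json:"overlap_tokens,omitempty"`
	Separator     string `json:"separator,omitempty"`
	TargetTokens  int    `json:"target_tokens,omitempty"`
}

type chunkOptions struct {
	Audio     audioChunkOptions `json:"audio,omitempty"`
	MaxChunks int               `json:"max_chunks,omitempty"`
	Text      textChunkOptions  `json:"text,omitempty"`
	Threshold float32           `json:"threshold,omitempty"`
}

// parseOptions decodes a JSON options object. Fields missing from the
// input keep their Go zero value.
func parseOptions(raw string) (chunkOptions, error) {
	var o chunkOptions
	err := json.Unmarshal([]byte(raw), &o)
	return o, err
}

// effectiveWindowMs treats a zero window as "use the default" and applies
// the documented default of 30000 ms.
func effectiveWindowMs(o chunkOptions) int {
	if o.Audio.WindowDurationMs == 0 {
		return 30000
	}
	return o.Audio.WindowDurationMs
}

func main() {
	o, err := parseOptions(`{"max_chunks":8,"text":{"target_tokens":256}}`)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(o.MaxChunks, o.Text.TargetTokens, effectiveWindowMs(o)) // prints 8 256 30000
}
```

One consequence of this design is that an explicit zero cannot be distinguished from an omitted field, so a caller cannot request a literal 0 where the default is non-zero.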

type Chunker

type Chunker interface {
	// Chunk splits text using the provided per-request options.
	// Options left at their zero value use the chunker's default values.
	Chunk(ctx context.Context, text string, opts ChunkOptions) ([]Chunk, error)
	Close() error
}

Chunker splits text into semantically meaningful chunks. ChunkOptions is generated from openapi.yaml - see openapi.gen.go
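As a rough illustration of what an implementation of this interface could look like, the sketch below defines a toy chunker that splits on a separator. Everything in it is a local stand-in: the chunk and chunkOptions types are simplified, separatorChunker and splitParagraphs are hypothetical names, the "\n\n" fallback is an assumed default, and offsets are counted in bytes, which coincide with character positions only for ASCII input.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Simplified local stand-ins for Chunk and ChunkOptions (assumptions).
type chunk struct {
	ID        uint32
	Text      string
	StartChar int
	EndChar   int
}

type chunkOptions struct {
	MaxChunks int
	Separator string
}

// chunker mirrors the shape of the documented Chunker interface.
type chunker interface {
	Chunk(ctx context.Context, text string, opts chunkOptions) ([]chunk, error)
	Close() error
}

// separatorChunker is a toy implementation that splits on a separator.
type separatorChunker struct{}

func (separatorChunker) Chunk(ctx context.Context, text string, opts chunkOptions) ([]chunk, error) {
	sep := opts.Separator
	if sep == "" {
		sep = "\n\n" // assumed default for this sketch
	}
	var out []chunk
	offset := 0 // byte offset of the current part within text
	for _, part := range strings.Split(text, sep) {
		if err := ctx.Err(); err != nil {
			return nil, err // stop early when the caller cancels
		}
		start, end := offset, offset+len(part)
		offset = end + len(sep)
		if part == "" {
			continue // skip empty parts between adjacent separators
		}
		if opts.MaxChunks > 0 && len(out) >= opts.MaxChunks {
			break
		}
		out = append(out, chunk{ID: uint32(len(out)), Text: part, StartChar: start, EndChar: end})
	}
	return out, nil
}

func (separatorChunker) Close() error { return nil }

// splitParagraphs runs the toy chunker with a background context.
func splitParagraphs(text string, maxChunks int) ([]chunk, error) {
	var c chunker = separatorChunker{}
	defer c.Close()
	return c.Chunk(context.Background(), text, chunkOptions{MaxChunks: maxChunks})
}

func main() {
	chunks, err := splitParagraphs("a\n\nbb\n\nccc", 0)
	if err != nil {
		fmt.Println("chunking failed:", err)
		return
	}
	for _, c := range chunks {
		fmt.Printf("%d [%d,%d) %q\n", c.ID, c.StartChar, c.EndChar, c.Text)
	}
}
```

Passing a zero MaxChunks leaves the chunk count unlimited in this sketch, which is one plausible reading of "zero values use chunker defaults".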

type TextChunkOptions

type TextChunkOptions struct {
	// OverlapTokens Number of tokens to overlap between consecutive chunks. Helps maintain context across chunk boundaries. Only used by fixed-size chunkers.
	OverlapTokens int `json:"overlap_tokens,omitempty,omitzero"`

	// Separator Separator string for splitting (e.g., '\n\n' for paragraphs). Only used by fixed-size chunkers.
	Separator string `json:"separator,omitempty,omitzero"`

	// TargetTokens Target number of tokens per chunk.
	TargetTokens int `json:"target_tokens,omitempty,omitzero"`
}

TextChunkOptions Options specific to text chunking.
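The interplay of TargetTokens and OverlapTokens can be sketched with a simple sliding window. This example assumes whitespace-separated words stand in for tokens, whereas the package's chunkers use BERT or BPE tokenizers. It also assumes that each window starts TargetTokens minus OverlapTokens after the previous one. tokenWindows and the fallback default of 512 are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenWindows groups whitespace-separated words into windows of target
// words, where consecutive windows share overlap words. Words stand in
// for real tokenizer output in this sketch.
func tokenWindows(text string, target, overlap int) []string {
	tokens := strings.Fields(text)
	if target <= 0 {
		target = 512 // hypothetical default, not taken from the package
	}
	if overlap < 0 || overlap >= target {
		overlap = 0 // guard so the window always advances
	}
	stride := target - overlap
	var out []string
	for start := 0; start < len(tokens); start += stride {
		end := start + target
		if end > len(tokens) {
			end = len(tokens)
		}
		out = append(out, strings.Join(tokens[start:end], " "))
		if end == len(tokens) {
			break // the final window already reaches the end of the text
		}
	}
	return out
}

func main() {
	for _, w := range tokenWindows("a b c d e f g", 3, 1) {
		fmt.Println(w)
	}
	// prints:
	// a b c
	// c d e
	// e f g
}
```

Each window repeats the last word of its predecessor, which is the kind of shared context across chunk boundaries that the OverlapTokens comment describes.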

type TextContent

type TextContent struct {
	// EndChar Character position in original text where chunk ends (exclusive)
	EndChar int `json:"end_char"`

	// StartChar Character position in original text where chunk starts
	StartChar int `json:"start_char"`

	// Text The chunk text content
	Text string `json:"text"`
}

TextContent Text content with character offsets.
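Because EndChar is exclusive, StartChar and EndChar form a half-open range that can be used to slice the original text directly. The sketch below checks that a chunk's Text matches the span it claims to cover. It assumes offsets count Unicode code points (runes). The documentation says only "character position", so byte offsets are also possible. textContent, sliceChars, and consistent are names made up for this example.

```go
package main

import "fmt"

// textContent is a local stand-in mirroring the documented TextContent.
type textContent struct {
	EndChar   int
	StartChar int
	Text      string
}

// sliceChars returns the characters of original in the half-open range
// [start, end), counting runes rather than bytes (an assumption).
func sliceChars(original string, start, end int) (string, bool) {
	r := []rune(original)
	if start < 0 || end > len(r) || start > end {
		return "", false
	}
	return string(r[start:end]), true
}

// consistent reports whether tc.Text equals the span of original that
// tc.StartChar and tc.EndChar describe.
func consistent(original string, tc textContent) bool {
	s, ok := sliceChars(original, tc.StartChar, tc.EndChar)
	return ok && s == tc.Text
}

func main() {
	original := "héllo wörld"
	tc := textContent{StartChar: 6, EndChar: 11, Text: "wörld"}
	fmt.Println(consistent(original, tc)) // prints true
}
```

With an exclusive end, EndChar minus StartChar is the chunk length, and adjacent non-overlapping chunks satisfy next.StartChar == prev.EndChar.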
