utf16

package
v0.9.0 Latest
Warning

This package is not in the latest version of its module.

Published: May 17, 2026 License: MIT Imports: 1 Imported by: 0

Documentation

Overview

Package utf16 provides UTF-16 code-unit length and offset helpers over Go's UTF-8 strings. JS Yjs and the V1 wire format speak UTF-16 natively (Item.Len, Text indices, and split offsets are all in UTF-16 code units); we store strings as Go-idiomatic UTF-8 and convert at the boundary.

See docs/yrs-port-notes/types-text.md gotcha 1: Item.Len is always UTF-16 code units. Misreading it as bytes diverges from the JS peer on the first non-ASCII character, and misreading it as code points diverges on the first non-BMP character.

Constants

const Replacement = "�"

Replacement is the Unicode REPLACEMENT CHARACTER (U+FFFD) used by SplitAt when the requested split lands inside a surrogate pair.

Variables

This section is empty.

Functions

func ByteOffset

func ByteOffset(s string, utf16Offset uint64) (byteIdx int, ok bool)

ByteOffset returns the byte index in s that corresponds to the given UTF-16 code unit offset.

ok=true: the offset lands cleanly between two characters.

ok=false: the offset is interior to a surrogate pair, i.e. the caller asked to split a non-BMP character. In that case byteIdx is the byte boundary AFTER the straddled char, matching yrs's silent round-up. SplitAt below uses this signal to apply U+FFFD replacement.

Offsets past the end of s (in UTF-16 units) clip to len(s) with ok=true.
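
The contract above can be sketched with the standard library alone. This is a minimal illustration, not this package's source: the helper name byteOffset is hypothetical, and the body is one possible way to satisfy the documented behaviour (clean boundary, round-up inside a surrogate pair, clipping past the end).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteOffset is a hypothetical stand-in for ByteOffset, written only from
// the documented contract. It walks s rune by rune, tracking how many
// UTF-16 code units have been consumed so far.
func byteOffset(s string, utf16Offset uint64) (byteIdx int, ok bool) {
	var units uint64
	for i, r := range s {
		if units == utf16Offset {
			return i, true // clean boundary just before this rune
		}
		w := uint64(1)
		if r >= 0x10000 {
			w = 2 // non-BMP rune: one surrogate pair
		}
		if units+w > utf16Offset {
			// Only reachable when w == 2: the offset falls between the
			// two halves of a surrogate pair. Round up past the rune.
			return i + utf8.RuneLen(r), false
		}
		units += w
	}
	// Offset is at or past the end of s: clip to len(s).
	return len(s), true
}

func main() {
	s := "a😀b" // 1 + 4 + 1 bytes, 1 + 2 + 1 UTF-16 units
	for off := uint64(0); off <= 5; off++ {
		idx, ok := byteOffset(s, off)
		fmt.Printf("utf16Offset=%d -> byteIdx=%d ok=%v\n", off, idx, ok)
	}
}
```

For the sample string, offset 2 is the only one that lands inside the emoji, so it is the only one that reports ok=false.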

func Length

func Length(s string) uint64

Length returns the number of UTF-16 code units required to encode s. ASCII chars and BMP chars contribute 1; non-BMP chars (e.g. emoji) contribute 2 (a surrogate pair).

Equivalent to JS `s.length` for the same string.
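
As a rough illustration of the counting rule, the sketch below re-derives the count using only the standard library and cross-checks it against unicode/utf16.Encode. The helper name utf16Len is hypothetical and is not this package's implementation; it only mirrors the documented behaviour of Length.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Len is a hypothetical stand-in for Length: it counts 1 unit per
// BMP rune and 2 units per non-BMP rune (one surrogate pair).
func utf16Len(s string) uint64 {
	var n uint64
	for _, r := range s {
		if r >= 0x10000 {
			n += 2
		} else {
			n++
		}
	}
	return n
}

func main() {
	for _, s := range []string{"hello", "héllo", "a😀b"} {
		// Cross-check against the standard library's UTF-16 encoder.
		viaStdlib := len(utf16.Encode([]rune(s)))
		fmt.Printf("%q: bytes=%d utf16=%d stdlib=%d\n", s, len(s), utf16Len(s), viaStdlib)
	}
}
```

For "a😀b", len(s) reports 6 bytes while the UTF-16 count is 4, which is the value a JS peer would see as `s.length`.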

func SplitAt

func SplitAt(s string, utf16Offset uint64) (left, right string)

SplitAt splits s at the given UTF-16 code unit offset. Returns (left, right) such that Length(left) + Length(right) == Length(s) in the clean case.

If the offset lands inside a surrogate pair, the boundary char of each half is replaced with U+FFFD. This matches JS Yjs behaviour; see docs/yrs-port-notes/types-text.md gotcha 3 (yrs's no-op silently produces an orphan low surrogate). The replaced chars consume the same UTF-16 budget (1 unit each), so total Length is preserved.
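
The splitting rule can be sketched as follows. This is an illustration built from the description above, not the package's source: splitAt, byteOffset, utf16Len and the lowercase replacement constant are hypothetical names, and the sketch assumes that "both halves' boundary chars" means left ends with U+FFFD and right begins with U+FFFD.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const replacement = "\uFFFD" // hypothetical mirror of the Replacement constant

// utf16Len counts UTF-16 code units: 1 per BMP rune, 2 per non-BMP rune.
func utf16Len(s string) (n uint64) {
	for _, r := range s {
		if r >= 0x10000 {
			n++
		}
		n++
	}
	return n
}

// byteOffset maps a UTF-16 offset to a byte index; ok=false means the
// offset fell inside a surrogate pair and byteIdx was rounded up.
func byteOffset(s string, utf16Offset uint64) (byteIdx int, ok bool) {
	var units uint64
	for i, r := range s {
		if units == utf16Offset {
			return i, true
		}
		w := uint64(1)
		if r >= 0x10000 {
			w = 2
		}
		if units+w > utf16Offset {
			return i + utf8.RuneLen(r), false
		}
		units += w
	}
	return len(s), true
}

// splitAt is a hypothetical stand-in for SplitAt.
func splitAt(s string, utf16Offset uint64) (left, right string) {
	idx, ok := byteOffset(s, utf16Offset)
	if ok {
		return s[:idx], s[idx:] // clean case: plain byte slice
	}
	// idx sits just after the straddled rune; step back to its start and
	// stand in one U+FFFD for each surrogate half.
	_, size := utf8.DecodeLastRuneInString(s[:idx])
	return s[:idx-size] + replacement, replacement + s[idx:]
}

func main() {
	s := "a😀b"
	for off := uint64(0); off <= 4; off++ {
		l, r := splitAt(s, off)
		fmt.Printf("off=%d left=%q right=%q units=%d+%d\n",
			off, l, r, utf16Len(l), utf16Len(r))
	}
}
```

Under that reading, every split of the sample string yields halves whose UTF-16 lengths sum to 4, including the split at offset 2 that cuts the emoji.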

Types

This section is empty.
