utf16

package

Version: v0.9.0 Published: May 17, 2026 License: MIT Imports: 1 Imported by: 0


Repository

github.com/Deln0r/ygo


Documentation ¶

Overview ¶

Package utf16 provides UTF-16 code-unit length and offset helpers over Go's UTF-8 strings. JS Yjs and the V1 wire format speak UTF-16 natively (Item.Len, Text indices, split offsets are all UTF-16 units); we store strings as Go-idiomatic UTF-8 and convert at the boundary.

See docs/yrs-port-notes/types-text.md gotcha 1: Item.Len is always UTF-16 code units; misreading it as bytes diverges from the JS peer on the first non-BMP character.

Index ¶

Constants
func ByteOffset(s string, utf16Offset uint64) (byteIdx int, ok bool)
func Length(s string) uint64
func SplitAt(s string, utf16Offset uint64) (left, right string)

Constants ¶

const Replacement = "�"

Replacement is the Unicode REPLACEMENT CHARACTER (U+FFFD) used by SplitAt when the requested split lands inside a surrogate pair.

Variables ¶

This section is empty.

Functions ¶

func ByteOffset ¶

func ByteOffset(s string, utf16Offset uint64) (byteIdx int, ok bool)

ByteOffset returns the byte index in s that corresponds to the given UTF-16 code unit offset.

ok=true: the offset lands cleanly between two characters.

ok=false: the offset is interior to a surrogate pair, i.e. the caller asked to split a non-BMP character. In that case byteIdx is the byte boundary AFTER the straddled char, matching yrs's silent round-up. SplitAt below uses this signal to apply U+FFFD replacement.

Offsets past the end of s (in UTF-16 units) clip to len(s) with ok=true.
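The three cases above (clean boundary, interior of a surrogate pair, past the end) can be illustrated with a minimal sketch. The function byteOffsetSketch is a hypothetical stand-in written for this example, not the package's source; it assumes the rules are exactly as documented above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteOffsetSketch is a hypothetical re-implementation of the documented
// ByteOffset contract, for illustration only.
func byteOffsetSketch(s string, utf16Offset uint64) (byteIdx int, ok bool) {
	var units uint64 // UTF-16 code units consumed so far
	for i, r := range s {
		if units == utf16Offset {
			return i, true // clean boundary before this rune
		}
		if r >= 0x10000 { // non-BMP: encoded as a surrogate pair (2 units)
			if units+1 == utf16Offset {
				// Offset falls between the two surrogates:
				// round up to the byte boundary after the rune.
				return i + utf8.RuneLen(r), false
			}
			units += 2
		} else {
			units++
		}
	}
	// Offset equals or exceeds the UTF-16 length: clip to len(s).
	return len(s), true
}

func main() {
	s := "a😀b" // 1 + 4 + 1 bytes; 1 + 2 + 1 UTF-16 units
	for _, off := range []uint64{0, 1, 2, 3, 4, 99} {
		idx, ok := byteOffsetSketch(s, off)
		fmt.Printf("utf16 offset %d -> byte %d, ok=%v\n", off, idx, ok)
	}
}
```

In this sketch, offset 2 is the only one that reports ok=false, because it is the only offset that falls between the two surrogates of the emoji.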

func Length ¶

func Length(s string) uint64

Length returns the number of UTF-16 code units required to encode s. ASCII chars and BMP chars contribute 1; non-BMP chars (e.g. emoji) contribute 2 (a surrogate pair).

Equivalent to JS `s.length` for the same string.
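To make the counting rule concrete, here is a small sketch that compares byte length, rune count, and UTF-16 length for a few strings. The helper utf16Len is a hypothetical stand-in for Length, written only for this example; it is cross-checked against the standard library's unicode/utf16 encoder.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf16Len is a hypothetical stand-in for Length: BMP runes count as
// one UTF-16 code unit, non-BMP runes as two (a surrogate pair).
func utf16Len(s string) uint64 {
	var n uint64
	for _, r := range s {
		if r >= 0x10000 {
			n += 2
		} else {
			n++
		}
	}
	return n
}

func main() {
	for _, s := range []string{"abc", "héllo", "a😀b"} {
		// Cross-check against the standard library encoder.
		viaStdlib := len(utf16.Encode([]rune(s)))
		fmt.Printf("%+q: bytes=%d runes=%d utf16=%d (stdlib: %d)\n",
			s, len(s), utf8.RuneCountInString(s), utf16Len(s), viaStdlib)
	}
}
```

For "a😀b" the three measures disagree (6 bytes, 3 runes, 4 UTF-16 units), which is the case where treating Item.Len as a byte count would diverge from a JS peer.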

func SplitAt ¶

func SplitAt(s string, utf16Offset uint64) (left, right string)

SplitAt splits s at the given UTF-16 code unit offset. Returns (left, right) such that Length(left) + Length(right) == Length(s) in the clean case.

If the offset lands inside a surrogate pair, the straddled character is replaced by U+FFFD on both sides of the split: left ends with U+FFFD and right begins with U+FFFD. This matches JS Yjs behaviour, whereas yrs's no-op silently produces an orphan low surrogate (see docs/yrs-port-notes/types-text.md gotcha 3). Each U+FFFD occupies one UTF-16 code unit, the same as the surrogate it replaces, so total Length is preserved in this case as well.
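The split behaviour can be sketched as follows. splitAtSketch and the lowercase replacement constant are hypothetical names written for this example, not the package's source. The sketch assumes that a split inside a surrogate pair drops the straddled character and puts one U+FFFD at the end of left and one at the start of right.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

const replacement = "\uFFFD" // Unicode REPLACEMENT CHARACTER

// splitAtSketch is a hypothetical re-implementation of the documented
// SplitAt contract, for illustration only.
func splitAtSketch(s string, utf16Offset uint64) (left, right string) {
	var units uint64 // UTF-16 code units consumed so far
	for i, r := range s {
		if units == utf16Offset {
			return s[:i], s[i:] // clean split before this rune
		}
		if r >= 0x10000 { // non-BMP rune: two UTF-16 units
			if units+1 == utf16Offset {
				// Split lands inside the pair: each half gets one
				// U+FFFD in place of its surrogate.
				end := i + utf8.RuneLen(r)
				return s[:i] + replacement, replacement + s[end:]
			}
			units += 2
		} else {
			units++
		}
	}
	return s, "" // offset at or past the end
}

// units counts UTF-16 code units using the standard library encoder.
func units(s string) int { return len(utf16.Encode([]rune(s))) }

func main() {
	s := "a😀b"
	for _, off := range []uint64{1, 2, 3} {
		l, r := splitAtSketch(s, off)
		fmt.Printf("offset %d: left=%+q right=%+q units=%d+%d (total %d)\n",
			off, l, r, units(l), units(r), units(s))
	}
}
```

At offset 2 both halves carry a U+FFFD, and the unit counts still sum to the original length, since each replacement character occupies one UTF-16 unit.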

Types ¶

This section is empty.

Source Files ¶


utf16.go
