Documentation ¶
Overview ¶
Package utf16 provides UTF-16 code-unit length and offset helpers over Go's UTF-8 strings. JS Yjs and the V1 wire format speak UTF-16 natively: Item.Len, Text indices, and split offsets are all measured in UTF-16 code units. We store strings as Go-idiomatic UTF-8 and convert at the boundary.
See docs/yrs-port-notes/types-text.md gotcha 1: Item.Len is always UTF-16 code units; misreading it as bytes diverges from the JS peer on the first non-BMP character.
Constants ¶
const Replacement = "�"
Replacement is the Unicode REPLACEMENT CHARACTER (U+FFFD) used by SplitAt when the requested split lands inside a surrogate pair.
Variables ¶
This section is empty.
Functions ¶
func ByteOffset ¶
ByteOffset returns the byte index in s that corresponds to the given UTF-16 code unit offset.
ok=true: the offset lands cleanly between two characters.
ok=false: the offset is interior to a surrogate pair, i.e. the caller asked to split a non-BMP character. In that case byteIdx is the byte boundary AFTER the straddled char, matching yrs's silent round-up. SplitAt below uses this signal to apply U+FFFD replacement.
Offsets past the end of s (in UTF-16 units) clip to len(s) with ok=true.
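The contract above can be sketched with a single pass over the runes of s. This is a minimal illustration, not the package's source: the lowercase name byteOffset and its signature (s string, units int) (byteIdx int, ok bool) are assumptions inferred from the description, since the rendered page does not show the real signature.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteOffset is a hypothetical re-implementation of the documented contract.
// It maps a UTF-16 code unit offset to a byte index in the UTF-8 string s.
func byteOffset(s string, units int) (byteIdx int, ok bool) {
	seen := 0 // UTF-16 code units consumed so far
	for i, r := range s {
		if seen >= units {
			return i, true // clean boundary before the rune at byte i
		}
		w := 1
		if r >= 0x10000 {
			w = 2 // non-BMP runes need a surrogate pair
		}
		if seen+w > units {
			// The offset falls between the two surrogates of r:
			// round up to the byte boundary after r and report ok=false.
			return i + utf8.RuneLen(r), false
		}
		seen += w
	}
	// Offsets at or past the end clip to len(s).
	return len(s), true
}

func main() {
	s := "a\U0001F600b" // 6 bytes in UTF-8, 4 UTF-16 code units
	for units := 0; units <= 5; units++ {
		idx, ok := byteOffset(s, units)
		fmt.Printf("units=%d byteIdx=%d ok=%t\n", units, idx, ok)
	}
	// Only units=2 lands inside the surrogate pair: byteIdx=5, ok=false.
}
```

The sketch treats negative offsets like offset 0; how the real function handles them is not stated in the documentation.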
func Length ¶
Length returns the number of UTF-16 code units required to encode s. ASCII chars and BMP chars contribute 1; non-BMP chars (e.g. emoji) contribute 2 (a surrogate pair).
Equivalent to JS `s.length` for the same string.
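As a rough illustration of the counting rule, the sketch below tallies one unit per BMP rune and two per non-BMP rune, then cross-checks the result against the standard library's unicode/utf16 encoder. The helper name length is hypothetical and stands in for this package's Length; the actual implementation may differ.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// length is a hypothetical stand-in for Length: it counts the UTF-16
// code units needed to encode s without allocating.
func length(s string) int {
	n := 0
	for _, r := range s {
		if r >= 0x10000 {
			n += 2 // non-BMP: surrogate pair
		} else {
			n++ // ASCII and other BMP runes
		}
	}
	return n
}

func main() {
	samples := []string{"abc", "\u00e9", "\U0001F600", "a\U0001F600b"}
	for _, s := range samples {
		// Reference count via the standard library encoder.
		ref := len(utf16.Encode([]rune(s)))
		fmt.Printf("%q bytes=%d utf16=%d stdlib=%d\n", s, len(s), length(s), ref)
	}
}
```

For the emoji sample, len(s) is 4 bytes while the UTF-16 count is 2, which is the byte-versus-unit divergence the overview warns about.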
func SplitAt ¶
SplitAt splits s at the given UTF-16 code unit offset. Returns (left, right) such that Length(left) + Length(right) == Length(s) in the clean case.
If the offset lands inside a surrogate pair, each half's boundary char is replaced with U+FFFD. This matches JS Yjs behaviour; see docs/yrs-port-notes/types-text.md gotcha 3, which notes that yrs's no-op silently produces an orphan low surrogate. Each replacement char occupies 1 UTF-16 unit, the same budget as the surrogate half it stands in for, so total Length is preserved.
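The split rule can be sketched as follows, under one reading of the text above: when the offset straddles a surrogate pair, left ends with U+FFFD and right begins with U+FFFD. The names splitAt, utf16Len, and replacement are hypothetical, and the signature is assumed from the description rather than taken from the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// replacement mirrors the documented Replacement constant (U+FFFD).
const replacement = "\uFFFD"

// utf16Len counts UTF-16 code units; a hypothetical stand-in for Length.
func utf16Len(s string) int {
	n := 0
	for _, r := range s {
		if r >= 0x10000 {
			n++
		}
		n++
	}
	return n
}

// splitAt is a hypothetical sketch of the documented SplitAt behaviour.
func splitAt(s string, units int) (left, right string) {
	seen := 0 // UTF-16 code units consumed so far
	for i, r := range s {
		if seen >= units {
			return s[:i], s[i:] // clean split
		}
		w := 1
		if r >= 0x10000 {
			w = 2
		}
		if seen+w > units {
			// Split lands inside r's surrogate pair: drop r and give
			// each half one U+FFFD (1 unit each, so 2 units in total).
			end := i + utf8.RuneLen(r)
			return s[:i] + replacement, replacement + s[end:]
		}
		seen += w
	}
	return s, "" // offset at or past the end
}

func main() {
	s := "a\U0001F600b"
	for units := 0; units <= 4; units++ {
		l, r := splitAt(s, units)
		fmt.Printf("units=%d left=%q right=%q total=%d\n",
			units, l, r, utf16Len(l)+utf16Len(r))
	}
}
```

In this sketch the combined length of the two halves equals the length of s for every offset, including the straddling one at units=2.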
Types ¶
This section is empty.