Documentation ¶
Overview ¶
Package float16 implements the 16-bit floating point data type (IEEE 754-2008).
This implementation provides conversion between float16 and other floating-point types (float32 and float64) with support for various rounding modes and error handling.
Special Values ¶
The float16 type supports all IEEE 754-2008 special values:
- Positive and negative zero
- Positive and negative infinity
- Not-a-Number (NaN) values with payload
- Normalized numbers
- Subnormal (denormal) numbers
Subnormal Numbers ¶
When converting to higher-precision types (float32/float64), subnormal float16 values are preserved. When converting back from a higher-precision type to float16, however, a subnormal result may be rounded to the nearest representable normal float16 value. Many hardware implementations treat subnormals the same way for performance reasons.
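The widening direction can be sketched without the package itself. The helper below, halfToFloat32, is a hypothetical name that is not part of this API; it assumes only the standard binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits) and shows why subnormal inputs survive the conversion: every float16 subnormal is a multiple of 2^-24, which float32 represents exactly.

```go
package main

import (
	"fmt"
	"math"
)

// halfToFloat32 is a hypothetical helper (not part of the package API) that
// widens raw binary16 bits to float32.
func halfToFloat32(bits uint16) float32 {
	sign := uint32(bits&0x8000) << 16
	exp := uint32(bits&0x7C00) >> 10
	mant := uint32(bits & 0x03FF)

	switch {
	case exp == 0 && mant == 0: // signed zero
		return math.Float32frombits(sign)
	case exp == 0: // subnormal: value = mant * 2^-24, exact in float32
		v := float32(mant) * float32(math.Ldexp(1, -24))
		if sign != 0 {
			v = -v
		}
		return v
	case exp == 31: // infinity or NaN: keep the payload in the top mantissa bits
		return math.Float32frombits(sign | 0x7F800000 | mant<<13)
	default: // normal: rebias the exponent from 15 to 127
		return math.Float32frombits(sign | (exp+112)<<23 | mant<<13)
	}
}

func main() {
	fmt.Println(halfToFloat32(0x0001)) // smallest positive subnormal, 2^-24
	fmt.Println(halfToFloat32(0x03FF)) // largest subnormal, 1023 * 2^-24
	fmt.Println(halfToFloat32(0x3C00)) // 1
}
```

Multiplying the first two results by 2^24 gives exactly 1 and 1023, which confirms that no precision is lost on the way up.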
Rounding Modes ¶
The following rounding modes are supported for conversions:
- RoundNearestEven: Round to nearest, ties to even (default)
- RoundTowardZero: Round toward zero (truncate)
- RoundTowardPositive: Round toward positive infinity
- RoundTowardNegative: Round toward negative infinity
- RoundNearestAway: Round to nearest, ties away from zero
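The five modes differ only in how they resolve a value that falls between two representable neighbors. As a rough illustration, the sketch below applies each rule at integer precision using the standard math package; the roundingMode type and roundToInteger function are local stand-ins, not this package's API, and the package itself applies the same rules at the precision of the float16 significand.

```go
package main

import (
	"fmt"
	"math"
)

// roundingMode is a local stand-in for the package's RoundingMode type.
type roundingMode int

const (
	nearestEven roundingMode = iota
	nearestAway
	towardZero
	towardPositive
	towardNegative
)

// roundToInteger applies one rounding rule at integer precision.
func roundToInteger(x float64, m roundingMode) float64 {
	switch m {
	case nearestAway:
		return math.Round(x) // ties away from zero
	case towardZero:
		return math.Trunc(x)
	case towardPositive:
		return math.Ceil(x)
	case towardNegative:
		return math.Floor(x)
	default:
		return math.RoundToEven(x) // ties to even
	}
}

func main() {
	modes := []roundingMode{nearestEven, nearestAway, towardZero, towardPositive, towardNegative}
	for _, x := range []float64{2.5, -2.5} {
		for _, m := range modes {
			fmt.Print(roundToInteger(x, m), " ")
		}
		fmt.Println()
	}
}
```

For the tie 2.5 the five rules give 2, 3, 2, 3, 2; for -2.5 they give -2, -3, -2, -2, -3.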
Error Handling ¶
Conversion functions with a ConversionMode parameter can return errors for:
- Overflow: When a value is too large to be represented
- Underflow: When a value is too small to be represented (in strict mode)
- Inexact: When rounding occurs (in strict mode)
See: http://en.wikipedia.org/wiki/Half-precision_floating-point_format
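The overflow and underflow cases from the Error Handling list can be sketched as a range check performed before conversion. Everything below is an assumption made for illustration: checkHalfRange, errOverflow, and errUnderflow are hypothetical names, the thresholds are the largest finite float16 value (65504) and the smallest positive subnormal (2^-24), and the sketch does not cover inexact results or the strict-mode handling of NaN and infinity.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Hypothetical sentinel errors; the package exposes its own error values.
var (
	errOverflow  = errors.New("value too large for float16")
	errUnderflow = errors.New("value too small for float16")
)

const (
	maxHalf          = 65504.0         // largest finite float16 value
	minSubnormalHalf = 1.0 / (1 << 24) // smallest positive float16 subnormal
)

// checkHalfRange reports whether a finite, nonzero v lies outside the
// magnitude range that float16 can represent.
func checkHalfRange(v float64) error {
	abs := math.Abs(v)
	switch {
	case math.IsNaN(v) || math.IsInf(v, 0) || abs == 0:
		return nil // not a range problem; handled elsewhere in a real converter
	case abs > maxHalf:
		return fmt.Errorf("convert %g: %w", v, errOverflow)
	case abs < minSubnormalHalf:
		return fmt.Errorf("convert %g: %w", v, errUnderflow)
	}
	return nil
}

func main() {
	for _, v := range []float64{1.5, 1e6, 1e-9} {
		err := checkHalfRange(v)
		fmt.Println(v, err, errors.Is(err, errOverflow), errors.Is(err, errUnderflow))
	}
}
```

Wrapping the sentinel with %w keeps the offending value in the message while still letting callers branch with errors.Is.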
Index ¶
- Constants
- Variables
- func Configure(cfg *Config)
- func DebugInfo() map[string]interface{}
- func Equal(a, b Float16) bool
- func GetBenchmarkOperations() map[string]BenchmarkOperation
- func GetMemoryUsage() int
- func GetVersion() string
- func Greater(a, b Float16) bool
- func GreaterEqual(a, b Float16) bool
- func IsFinite(f Float16) bool
- func IsInf(f Float16, sign int) bool
- func IsNaN(f Float16) bool
- func IsNormal(f Float16) bool
- func IsSubnormal(f Float16) bool
- func Less(a, b Float16) bool
- func LessEqual(a, b Float16) bool
- func Signbit(f Float16) bool
- func ToSlice32(f16s []Float16) []float32
- func ToSlice64(f16s []Float16) []float64
- func ValidateSliceLength(a, b []Float16) error
- type ArithmeticMode
- type BenchmarkOperation
- type Config
- type ConversionMode
- type ErrorCode
- type Float16
- func Abs(f Float16) Float16
- func Acos(f Float16) Float16
- func Add(a, b Float16) Float16
- func AddSlice(a, b []Float16) []Float16
- func AddWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func Asin(f Float16) Float16
- func Atan(f Float16) Float16
- func Atan2(y, x Float16) Float16
- func Cbrt(f Float16) Float16
- func Ceil(f Float16) Float16
- func Clamp(f, min, max Float16) Float16
- func CopySign(f, sign Float16) Float16
- func Cos(f Float16) Float16
- func Cosh(f Float16) Float16
- func Dim(f, g Float16) Float16
- func Div(a, b Float16) Float16
- func DivSlice(a, b []Float16) []Float16
- func DivWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func DotProduct(a, b []Float16) Float16
- func Erf(f Float16) Float16
- func Erfc(f Float16) Float16
- func Exp(f Float16) Float16
- func Exp10(f Float16) Float16
- func Exp2(f Float16) Float16
- func FastAdd(a, b Float16) Float16
- func FastMul(a, b Float16) Float16
- func Floor(f Float16) Float16
- func Frexp(f Float16) (frac Float16, exp int)
- func FromBits(bits uint16) Float16
- func FromFloat32(f32 float32) Float16
- func FromFloat64(f64 float64) Float16
- func FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (Float16, error)
- func FromInt(i int) Float16
- func FromInt32(i int32) Float16
- func FromInt64(i int64) Float16
- func FromSlice64(f64s []float64) []Float16
- func Gamma(f Float16) Float16
- func Hypot(f, g Float16) Float16
- func Inf(sign int) Float16
- func J0(f Float16) Float16
- func J1(f Float16) Float16
- func Ldexp(frac Float16, exp int) Float16
- func Lerp(a, b, t Float16) Float16
- func Lgamma(f Float16) (Float16, int)
- func Log(f Float16) Float16
- func Log10(f Float16) Float16
- func Log2(f Float16) Float16
- func Max(a, b Float16) Float16
- func Min(a, b Float16) Float16
- func Mod(f, divisor Float16) Float16
- func Modf(f Float16) (integer, frac Float16)
- func Mul(a, b Float16) Float16
- func MulSlice(a, b []Float16) []Float16
- func MulWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func NaN() Float16
- func NextAfter(f, g Float16) Float16
- func Norm2(s []Float16) Float16
- func One() Float16
- func Parse(s string) (Float16, error)
- func Pow(f, exp Float16) Float16
- func Remainder(f, divisor Float16) Float16
- func Round(f Float16) Float16
- func RoundToEven(f Float16) Float16
- func ScaleSlice(s []Float16, scalar Float16) []Float16
- func Sign(f Float16) Float16
- func Sin(f Float16) Float16
- func Sinh(f Float16) Float16
- func Sqrt(f Float16) Float16
- func Sub(a, b Float16) Float16
- func SubSlice(a, b []Float16) []Float16
- func SubWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func SumSlice(s []Float16) Float16
- func Tan(f Float16) Float16
- func Tanh(f Float16) Float16
- func ToFloat16(f32 float32) Float16
- func ToFloat16WithMode(f32 float32, convMode ConversionMode, roundMode RoundingMode) (Float16, error)
- func ToSlice16(f32s []float32) []Float16
- func ToSlice16WithMode(f32s []float32, convMode ConversionMode, roundMode RoundingMode) ([]Float16, []error)
- func Trunc(f Float16) Float16
- func VectorAdd(a, b []Float16) []Float16
- func VectorMul(a, b []Float16) []Float16
- func Y0(f Float16) Float16
- func Y1(f Float16) Float16
- func Zero() Float16
- func (f Float16) Abs() Float16
- func (f Float16) Bits() uint16
- func (f Float16) Class() FloatClass
- func (f Float16) CopySign(sign Float16) Float16
- func (f Float16) GoString() string
- func (f Float16) IsFinite() bool
- func (f Float16) IsInf(sign int) bool
- func (f Float16) IsNaN() bool
- func (f Float16) IsNormal() bool
- func (f Float16) IsSubnormal() bool
- func (f Float16) IsZero() bool
- func (f Float16) Neg() Float16
- func (f Float16) Sign() int
- func (f Float16) Signbit() bool
- func (f Float16) String() string
- func (f Float16) ToFloat32() float32
- func (f Float16) ToFloat64() float64
- func (f Float16) ToInt() int
- func (f Float16) ToInt32() int32
- func (f Float16) ToInt64() int64
- type Float16Error
- type FloatClass
- type RoundingMode
- type SliceStats
Constants ¶
const (
	Version      = "1.0.0"
	VersionMajor = 1
	VersionMinor = 0
	VersionPatch = 0
)
Package version information
const (
	SignMask     = 0x8000 // 0b1000000000000000 - Sign bit mask
	ExponentMask = 0x7C00 // 0b0111110000000000 - Exponent bits mask
	MantissaMask = 0x03FF // 0b0000001111111111 - Mantissa bits mask
	MantissaLen  = 10     // Number of mantissa bits
	ExponentLen  = 5      // Number of exponent bits

	// Exponent bias and limits for IEEE 754 half-precision
	// bias = 2^(exponent_bits-1) - 1 = 2^4 - 1 = 15
	ExponentBias = 15 // Bias for 5-bit exponent
	ExponentMax  = 31 // Maximum exponent value (11111 binary)
	ExponentMin  = 0  // Minimum exponent value

	// Normalized exponent range
	ExponentNormalMin = 1  // Minimum normalized exponent
	ExponentNormalMax = 30 // Maximum normalized exponent (infinity at 31)

	// Float32 constants for conversion
	Float32ExponentBias = 127 // IEEE 754 single precision bias
	Float32ExponentLen  = 8   // Float32 exponent bits
	Float32MantissaLen  = 23  // Float32 mantissa bits

	// Special exponent values
	ExponentZero     = 0  // Zero and subnormal numbers
	ExponentInfinity = 31 // Infinity and NaN
)
IEEE 754 half-precision format constants
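As a quick illustration of how the masks and lengths above fit together, the following sketch splits a raw 16-bit pattern into its three fields. The lowercase constants are local copies of the documented values and fields is a hypothetical helper, so nothing here depends on the package being imported.

```go
package main

import "fmt"

// Local copies of the documented mask values.
const (
	signMask     = 0x8000
	exponentMask = 0x7C00
	mantissaMask = 0x03FF
	mantissaLen  = 10
	exponentBias = 15
)

// fields splits raw binary16 bits into sign, biased exponent, and mantissa.
func fields(bits uint16) (sign, exp, mant uint16) {
	sign = (bits & signMask) >> 15
	exp = (bits & exponentMask) >> mantissaLen
	mant = bits & mantissaMask
	return sign, exp, mant
}

func main() {
	sign, exp, mant := fields(0xC100) // bit pattern of -2.5
	fmt.Printf("sign=%d biased=%d unbiased=%d mantissa=%#x\n",
		sign, exp, int(exp)-exponentBias, mant)
	// prints: sign=1 biased=16 unbiased=1 mantissa=0x100
}
```

With an unbiased exponent of 1 and a mantissa of 0x100 (256/1024 = 0.25), the value is -(1 + 0.25) × 2^1 = -2.5.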
Variables ¶
var (
	DefaultArithmeticMode = ModeIEEEArithmetic
	DefaultRounding       = RoundNearestEven
)
Global arithmetic settings
var (
	DefaultConversionMode = ModeIEEE
	DefaultRoundingMode   = RoundNearestEven
)
Global conversion settings
var (
	// Common integer values
	Zero16  = PositiveZero
	One16   = ToFloat16(1.0)
	Two16   = ToFloat16(2.0)
	Three16 = ToFloat16(3.0)
	Four16  = ToFloat16(4.0)
	Five16  = ToFloat16(5.0)
	Ten16   = ToFloat16(10.0)

	// Common fractional values
	Half16    = ToFloat16(0.5)
	Quarter16 = ToFloat16(0.25)
	Third16   = ToFloat16(1.0 / 3.0)

	// Special mathematical values
	NaN16  = QuietNaN
	PosInf = PositiveInfinity
	NegInf = NegativeInfinity

	// Commonly used constants
	Deg2Rad = ToFloat16(float32(math.Pi / 180.0)) // Degrees to radians
	Rad2Deg = ToFloat16(float32(180.0 / math.Pi)) // Radians to degrees
)
Constants for common values
var (
	E       = ToFloat16(float32(math.E))       // Euler's number
	Pi      = ToFloat16(float32(math.Pi))      // Pi
	Phi     = ToFloat16(float32(math.Phi))     // Golden ratio
	Sqrt2   = ToFloat16(float32(math.Sqrt2))   // Square root of 2
	SqrtE   = ToFloat16(float32(math.SqrtE))   // Square root of E
	SqrtPi  = ToFloat16(float32(math.SqrtPi))  // Square root of Pi
	SqrtPhi = ToFloat16(float32(math.SqrtPhi)) // Square root of Phi
	Ln2     = ToFloat16(float32(math.Ln2))     // Natural logarithm of 2
	Log2E   = ToFloat16(float32(math.Log2E))   // Base-2 logarithm of E
	Ln10    = ToFloat16(float32(math.Ln10))    // Natural logarithm of 10
	Log10E  = ToFloat16(float32(math.Log10E))  // Base-10 logarithm of E
)
Mathematical constants as Float16 values
var (
	ErrOverflowError  = &Float16Error{Code: ErrOverflow, Msg: "value too large for float16"}
	ErrUnderflowError = &Float16Error{Code: ErrUnderflow, Msg: "value too small for float16"}
	ErrNaNError       = &Float16Error{Code: ErrNaN, Msg: "NaN in strict mode"}
	ErrInfinityError  = &Float16Error{Code: ErrInfinity, Msg: "infinity in strict mode"}
	ErrDivByZeroError = &Float16Error{Code: ErrDivisionByZero, Msg: "division by zero"}
)
Predefined error instances
Functions ¶
func Configure ¶
func Configure(cfg *Config)
Configure applies the given configuration to the package
func DebugInfo ¶
func DebugInfo() map[string]interface{}
DebugInfo returns debugging information about the package state
func GetBenchmarkOperations ¶
func GetBenchmarkOperations() map[string]BenchmarkOperation
GetBenchmarkOperations returns a map of operations suitable for benchmarking
func GetMemoryUsage ¶
func GetMemoryUsage() int
GetMemoryUsage returns the current memory usage of the package in bytes
func IsInf ¶
IsInf reports whether f is an infinity, according to sign. If sign > 0, IsInf reports whether f is positive infinity. If sign < 0, IsInf reports whether f is negative infinity. If sign == 0, IsInf reports whether f is either infinity.
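The three-way sign convention is the same one used by math.IsInf in the standard library. A minimal sketch on raw bit patterns follows, where isInf is a hypothetical stand-in for the package function and the two constants are the documented infinity encodings:

```go
package main

import "fmt"

// Documented bit patterns for the two infinities.
const (
	posInfBits = 0x7C00
	negInfBits = 0xFC00
)

// isInf mirrors the documented sign convention on raw binary16 bits.
func isInf(bits uint16, sign int) bool {
	switch {
	case sign > 0:
		return bits == posInfBits
	case sign < 0:
		return bits == negInfBits
	default:
		return bits == posInfBits || bits == negInfBits
	}
}

func main() {
	fmt.Println(isInf(0xFC00, 0))  // true: either infinity matches
	fmt.Println(isInf(0xFC00, 1))  // false: negative infinity is not positive
	fmt.Println(isInf(0x7BFF, 0))  // false: largest finite value
	fmt.Println(isInf(0x7C00, -1)) // false: positive infinity is not negative
}
```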
func IsNormal ¶
IsNormal reports whether f is a normal number (not zero, subnormal, infinite, or NaN)
func IsSubnormal ¶
IsSubnormal reports whether f is a subnormal number
func ValidateSliceLength ¶
ValidateSliceLength checks if two slices have the same length
Types ¶
type ArithmeticMode ¶
type ArithmeticMode int
ArithmeticMode defines the precision/performance trade-off for arithmetic operations
const (
	// ModeIEEEArithmetic provides full IEEE 754 compliance with proper rounding
	ModeIEEEArithmetic ArithmeticMode = iota
	// ModeFastArithmetic optimizes for speed, may sacrifice some precision
	ModeFastArithmetic
	// ModeExactArithmetic provides exact results when possible, errors on precision loss
	ModeExactArithmetic
)
type BenchmarkOperation ¶
BenchmarkOperation represents a benchmarkable operation
type Config ¶
type Config struct {
DefaultConversionMode ConversionMode
DefaultRoundingMode RoundingMode
DefaultArithmeticMode ArithmeticMode
EnableFastMath bool
}
Package configuration
func DefaultConfig ¶
func DefaultConfig() *Config
DefaultConfig returns the default package configuration
type ConversionMode ¶
type ConversionMode int
ConversionMode defines how conversions handle edge cases
const (
	// ModeIEEE uses standard IEEE 754 rounding and special value behavior
	ModeIEEE ConversionMode = iota
	// ModeStrict returns errors for overflow, underflow, and NaN
	ModeStrict
	// ModeFast optimizes for performance, may sacrifice some precision
	ModeFast
	// ModeExact preserves exact values when possible, errors on precision loss
	ModeExact
)
type Float16 ¶
type Float16 uint16
Float16 represents a 16-bit IEEE 754 half-precision floating-point value
const (
	PositiveZero     Float16 = 0x0000 // +0.0
	NegativeZero     Float16 = 0x8000 // -0.0
	PositiveInfinity Float16 = 0x7C00 // +∞
	NegativeInfinity Float16 = 0xFC00 // -∞

	// Largest finite values
	MaxValue Float16 = 0x7BFF // Largest positive finite value (65504)
	MinValue Float16 = 0xFBFF // Most negative finite value (-65504)

	// Smallest normalized positive value
	SmallestNormal Float16 = 0x0400 // 2^-14 = 6.103515625e-05

	// Largest subnormal value
	LargestSubnormal Float16 = 0x03FF // (1023/1024) * 2^-14 ≈ 6.097555161e-05

	// Smallest positive subnormal value
	SmallestSubnormal Float16 = 0x0001 // 2^-24 ≈ 5.960464478e-08

	// Common NaN representations
	QuietNaN     Float16 = 0x7E00 // Quiet NaN (most significant mantissa bit set)
	SignalingNaN Float16 = 0x7D00 // Signaling NaN
	NegativeQNaN Float16 = 0xFE00 // Negative quiet NaN
)
Special values following IEEE 754 half-precision standard
func AddWithMode ¶
func AddWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
AddWithMode performs addition with specified arithmetic and rounding modes
func DivWithMode ¶
func DivWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
DivWithMode performs division with specified arithmetic and rounding modes
func DotProduct ¶
DotProduct computes the dot product of two Float16 slices
func Frexp ¶
Frexp breaks f into a normalized fraction and an integral power of two. It returns frac and exp satisfying f == frac × 2^exp, with the absolute value of frac in the interval [0.5, 1) or zero.
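This is the same contract as math.Frexp in the standard library, so the invariant can be checked there without importing this package. The sketch below uses float64 and a hypothetical wrapper named decompose; it assumes, based only on the description above, that the Float16 version behaves analogously.

```go
package main

import (
	"fmt"
	"math"
)

// decompose wraps math.Frexp, whose contract matches the description above.
func decompose(x float64) (frac float64, exp int) {
	return math.Frexp(x)
}

func main() {
	for _, x := range []float64{8, 0.15625, -65504} {
		frac, exp := decompose(x)
		// math.Ldexp recombines the parts: frac × 2^exp.
		fmt.Println(x, frac, exp, math.Ldexp(frac, exp) == x)
	}
}
```

For example, 8 decomposes into 0.5 × 2^4 and 0.15625 into 0.625 × 2^-2.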
func FromFloat32 ¶
FromFloat32 converts a float32 to Float16 (with potential precision loss)
func FromFloat64 ¶
FromFloat64 converts a float64 to Float16 (with potential precision loss)
func FromFloat64WithMode ¶
func FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (Float16, error)
FromFloat64WithMode converts a float64 to Float16 with specified modes
func FromSlice64 ¶
FromSlice64 converts a slice of float64 to Float16 with optimized performance
func Inf ¶
Inf returns a Float16 infinity value. If sign >= 0, it returns positive infinity. If sign < 0, it returns negative infinity.
func Modf ¶
Modf returns integer and fractional floating-point numbers that sum to f. Both values have the same sign as f.
func MulWithMode ¶
func MulWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
MulWithMode performs multiplication with specified arithmetic and rounding modes
func NextAfter ¶
NextAfter returns the next representable Float16 value after f in the direction of g
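Because binary16 bit patterns of the same sign are ordered like their magnitudes, stepping to a neighboring value amounts to adding or subtracting one from the raw bits. The sketch below shows that idea on uint16 patterns; nextAfter, key, and isNaN are hypothetical helpers, and the edge-case choices (return NaN if either input is NaN, return f when the inputs compare equal) follow math.Nextafter rather than anything this page states.

```go
package main

import "fmt"

// key maps raw bits to an integer that orders like the represented value.
// Both zeros map to 0.
func key(b uint16) int {
	m := int(b & 0x7FFF)
	if b&0x8000 != 0 {
		return -m
	}
	return m
}

// isNaN reports whether the bits encode a NaN (exponent all ones, mantissa nonzero).
func isNaN(b uint16) bool { return b&0x7FFF > 0x7C00 }

// nextAfter returns the bit pattern adjacent to f in the direction of g.
func nextAfter(f, g uint16) uint16 {
	kf, kg := key(f), key(g)
	switch {
	case isNaN(f) || isNaN(g):
		return 0x7E00 // quiet NaN
	case kf == kg:
		return f
	case kf == 0: // step off zero to the smallest subnormal, signed toward g
		if kg > 0 {
			return 0x0001
		}
		return 0x8001
	case (kg > kf) == (kf > 0): // moving away from zero: magnitude grows
		return f + 1
	default: // moving toward zero: magnitude shrinks
		return f - 1
	}
}

func main() {
	fmt.Printf("%#04x\n", nextAfter(0x3C00, 0x7C00)) // just above 1
	fmt.Printf("%#04x\n", nextAfter(0x3C00, 0x0000)) // just below 1
	fmt.Printf("%#04x\n", nextAfter(0x7BFF, 0x7C00)) // max finite steps to +Inf
}
```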
func RoundToEven ¶
RoundToEven returns the nearest integer value to f, rounding ties to even
func ScaleSlice ¶
ScaleSlice multiplies each element in the slice by a scalar
func SubWithMode ¶
func SubWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
SubWithMode performs subtraction with specified arithmetic and rounding modes
func ToFloat16WithMode ¶
func ToFloat16WithMode(f32 float32, convMode ConversionMode, roundMode RoundingMode) (Float16, error)
ToFloat16WithMode converts a float32 to Float16 with specified conversion and rounding modes
func ToSlice16WithMode ¶
func ToSlice16WithMode(f32s []float32, convMode ConversionMode, roundMode RoundingMode) ([]Float16, []error)
ToSlice16WithMode converts a slice of float32 values to Float16 using the specified conversion and rounding modes. It is a SIMD-friendly batch conversion with error handling.
func VectorAdd ¶
VectorAdd performs vectorized addition (placeholder for future SIMD implementation)
func VectorMul ¶
VectorMul performs vectorized multiplication (placeholder for future SIMD implementation)
func (Float16) Class ¶
func (f Float16) Class() FloatClass
Class returns the IEEE 754 classification of the Float16 value
func (Float16) IsFinite ¶
IsFinite returns true if the Float16 value is finite (not infinity or NaN)
func (Float16) IsInf ¶
IsInf returns true if the Float16 value represents infinity. If sign > 0, it returns true only for positive infinity. If sign < 0, it returns true only for negative infinity. If sign == 0, it returns true for either infinity.
func (Float16) IsNormal ¶
IsNormal returns true if the Float16 value is normalized (not zero, subnormal, infinite, or NaN)
func (Float16) IsSubnormal ¶
IsSubnormal returns true if the Float16 value is subnormal (denormalized)
func (Float16) IsZero ¶
IsZero returns true if the Float16 value represents zero (positive or negative)
func (Float16) Sign ¶
Sign returns the sign of the Float16 value: 1 for positive, -1 for negative, 0 for zero
type Float16Error ¶
type Float16Error struct {
Op string // Operation that caused the error
Value interface{} // Input value that caused the error
Msg string // Error message
Code ErrorCode // Specific error code
}
Float16Error represents errors that can occur during Float16 operations
func (*Float16Error) Error ¶
func (e *Float16Error) Error() string
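Since Float16Error carries a Code field, callers will typically want to recover the concrete error type and branch on the code. The sketch below shows that pattern with errors.As on a local mirror of the struct. The halfError type, the errorCode values, the divide function, and the message format produced by Error are all assumptions made for illustration; the package's actual format is not documented on this page.

```go
package main

import (
	"errors"
	"fmt"
)

// errorCode and its values are hypothetical stand-ins for ErrorCode.
type errorCode int

const (
	codeOverflow errorCode = iota + 1
	codeDivisionByZero
)

// halfError mirrors the documented fields of Float16Error.
type halfError struct {
	Op    string      // Operation that caused the error
	Value interface{} // Input value that caused the error
	Msg   string      // Error message
	Code  errorCode   // Specific error code
}

// Error uses an assumed message format.
func (e *halfError) Error() string {
	return fmt.Sprintf("float16 %s(%v): %s", e.Op, e.Value, e.Msg)
}

// divide is a toy operation that reports division by zero as a structured error.
func divide(a, b float64) (float64, error) {
	if b == 0 {
		return 0, &halfError{Op: "div", Value: b, Msg: "division by zero", Code: codeDivisionByZero}
	}
	return a / b, nil
}

// codeOf extracts the error code, or 0 if err is not a *halfError.
func codeOf(err error) errorCode {
	var he *halfError
	if errors.As(err, &he) {
		return he.Code
	}
	return 0
}

func main() {
	_, err := divide(1, 0)
	fmt.Println(err, codeOf(err) == codeDivisionByZero)
}
```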
type FloatClass ¶
type FloatClass int
FloatClass represents the IEEE 754 class of a floating-point value
const (
	ClassSignalingNaN FloatClass = iota
	ClassQuietNaN
	ClassNegativeInfinity
	ClassNegativeNormal
	ClassNegativeSubnormal
	ClassNegativeZero
	ClassPositiveZero
	ClassPositiveSubnormal
	ClassPositiveNormal
	ClassPositiveInfinity
)
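Each of these classes is determined entirely by the sign, exponent, and mantissa fields. A minimal sketch of that decision, assuming the standard binary16 layout, is shown below; classify is a hypothetical helper that returns descriptive strings rather than FloatClass values, and it treats a set top mantissa bit as the quiet-NaN marker, matching the QuietNaN and SignalingNaN constants listed earlier.

```go
package main

import "fmt"

// classify names the IEEE 754 class of raw binary16 bits.
// The returned strings are illustrative, not package values.
func classify(bits uint16) string {
	exp := (bits & 0x7C00) >> 10
	mant := bits & 0x03FF
	sign := "positive"
	if bits&0x8000 != 0 {
		sign = "negative"
	}
	switch {
	case exp == 31 && mant == 0:
		return sign + " infinity"
	case exp == 31 && mant&0x0200 != 0:
		return "quiet NaN" // top mantissa bit set
	case exp == 31:
		return "signaling NaN"
	case exp == 0 && mant == 0:
		return sign + " zero"
	case exp == 0:
		return sign + " subnormal"
	default:
		return sign + " normal"
	}
}

func main() {
	for _, b := range []uint16{0x0000, 0x8000, 0x0001, 0x3C00, 0xFC00, 0x7E00, 0x7D00} {
		fmt.Printf("%#x %s\n", b, classify(b))
	}
}
```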
type RoundingMode ¶
type RoundingMode int
RoundingMode defines IEEE 754 rounding behavior
const (
	// RoundNearestEven rounds to nearest, ties to even (IEEE default)
	RoundNearestEven RoundingMode = iota
	// RoundNearestAway rounds to nearest, ties away from zero
	RoundNearestAway
	// RoundTowardZero truncates toward zero
	RoundTowardZero
	// RoundTowardPositive rounds toward +∞
	RoundTowardPositive
	// RoundTowardNegative rounds toward -∞
	RoundTowardNegative
)
type SliceStats ¶
SliceStats holds basic statistics for a Float16 slice
func ComputeSliceStats ¶
func ComputeSliceStats(s []Float16) SliceStats
ComputeSliceStats calculates statistics for a Float16 slice