float16

Version: v0.1.0
Published: Jul 28, 2025 License: Apache-2.0 Imports: 5 Imported by: 6

README

float16

A comprehensive Go implementation of IEEE 754-2008 16-bit floating-point (half-precision) arithmetic with full support for special values, multiple rounding modes, and high-performance operations.

Features

  • Full IEEE 754-2008 compliance for 16-bit floating-point arithmetic
  • Complete special value support: ±0, ±∞, NaN (with payload), normalized and subnormal numbers
  • Multiple rounding modes: nearest-even, toward zero, toward ±∞, nearest-away
  • Flexible conversion modes: IEEE standard, strict error handling, fast approximations
  • High-performance operations with optional fast math optimizations
  • Comprehensive test suite with extensive edge case coverage
  • Zero dependencies - pure Go implementation

Installation

go get github.com/zerfoo/float16

Quick Start

package main

import (
    "fmt"
    "github.com/zerfoo/float16"
)

func main() {
    // Create float16 values
    a := float16.FromFloat32(3.14159)
    b := float16.FromFloat64(2.71828)
    
    // Basic arithmetic (Add and Mul are package-level functions)
    sum := float16.Add(a, b)
    product := float16.Mul(a, b)
    
    // Convert back to other types
    fmt.Printf("Sum: %v (float32: %f)\n", sum, sum.ToFloat32())
    fmt.Printf("Product: %v (float64: %f)\n", product, product.ToFloat64())
    
    // Work with special values
    inf := float16.Inf(1)  // positive infinity
    nan := float16.NaN()   // quiet NaN
    zero := float16.Zero() // positive zero
    
    fmt.Printf("Infinity: %v\n", inf)
    fmt.Printf("NaN: %v\n", nan)
    fmt.Printf("Zero: %v\n", zero)
}

Core Types and Constants

Float16 Type

The Float16 type represents a 16-bit IEEE 754 half-precision floating-point value:

type Float16 uint16

Special Values

const (
    PositiveZero     Float16 = 0x0000 // +0.0
    NegativeZero     Float16 = 0x8000 // -0.0
    PositiveInfinity Float16 = 0x7C00 // +∞
    NegativeInfinity Float16 = 0xFC00 // -∞
    MaxValue         Float16 = 0x7BFF // 65504, largest finite value
    MinValue         Float16 = 0xFBFF // -65504, most negative finite value
)

Conversion Functions

From Other Types

// From float32/float64
f16 := float16.FromFloat32(3.14159)
f16 = float16.FromFloat64(2.71828)

// From bit representation
f16 = float16.FromBits(0x4200) // 3.0

// From string (Parse is documented as a placeholder for a future implementation)
f16, err := float16.Parse("3.14159")

To Other Types

f32 := f16.ToFloat32()
f64 := f16.ToFloat64()
bits := f16.Bits()
str := f16.String()

Arithmetic Operations

a := float16.FromFloat32(5.0)
b := float16.FromFloat32(3.0)

// Basic arithmetic (package-level functions)
sum := float16.Add(a, b)      // 8.0
diff := float16.Sub(a, b)     // 2.0
product := float16.Mul(a, b)  // 15.0
quotient := float16.Div(a, b) // 1.666...

// Mathematical functions
sqrt := float16.Sqrt(a) // √5
abs := a.Abs()          // |a|
neg := a.Neg()          // -a

Rounding Modes

Configure rounding behavior for conversions:

import "github.com/zerfoo/float16"

// Set global rounding mode
config := float16.GetConfig()
config.DefaultRoundingMode = float16.RoundTowardZero
float16.Configure(config)

// Available rounding modes:
// - RoundNearestEven (default)
// - RoundTowardZero
// - RoundTowardPositive  
// - RoundTowardNegative
// - RoundNearestAway

Conversion Modes

Control conversion behavior and error handling:

config := float16.GetConfig()
config.DefaultConversionMode = float16.ModeStrict
float16.Configure(config)

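// Available modes:
// - ModeIEEE: Standard IEEE 754 behavior
// - ModeStrict: Returns errors for overflow, underflow, and NaN
// - ModeFast: Optimized for performance, may sacrifice some precision
// - ModeExact: Preserves exact values when possible, errors on precision loss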

Special Value Handling

f := float16.FromFloat64(math.Inf(1)) // math.Inf returns a float64

// Check value types
if f.IsInf(0) {
    fmt.Println("Value is infinity")
}
if f.IsNaN() {
    fmt.Println("Value is NaN")
}
if f.IsFinite() {
    fmt.Println("Value is finite")
}
if f.IsNormal() {
    fmt.Println("Value is normalized")
}
if f.IsSubnormal() {
    fmt.Println("Value is subnormal")
}

// IEEE 754 classification
class := f.Class()
switch class {
case float16.ClassPositiveInfinity:
    fmt.Println("Positive infinity")
case float16.ClassQuietNaN:
    fmt.Println("Quiet NaN")
// ... other classes
}

Performance Features

Fast Math Operations

// Enable fast math for better performance (may sacrifice precision)
config := float16.GetConfig()
config.EnableFastMath = true
float16.Configure(config)

// Use fast operations
fastSum := float16.FastAdd(a, b)
fastProduct := float16.FastMul(a, b)

Vectorized Operations

// Vectorized operations (documented as placeholders for a future SIMD implementation)
a := []float16.Float16{...}
b := []float16.Float16{...}

sum := float16.VectorAdd(a, b)
product := float16.VectorMul(a, b)

Error Handling

// Strict mode returns errors for exceptional conditions
config := float16.GetConfig()
config.DefaultConversionMode = float16.ModeStrict
float16.Configure(config)

f16, err := float16.ToFloat16WithMode(1e10, float16.ModeStrict, float16.RoundNearestEven)
if err != nil {
    if float16Err, ok := err.(*float16.Float16Error); ok {
        switch float16Err.Code {
        case float16.ErrOverflow:
            fmt.Println("Value too large for float16")
        case float16.ErrUnderflow:
            fmt.Println("Value too small for float16")
        }
    }
}

Utilities

Statistics for Slices

values := []float16.Float16{
    float16.FromFloat32(1.0),
    float16.FromFloat32(2.0),
    float16.FromFloat32(3.0),
}

stats := float16.ComputeSliceStats(values)
fmt.Printf("Min: %v, Max: %v, Mean: %v\n", stats.Min, stats.Max, stats.Mean)

Debugging and Monitoring

// Get memory usage
usage := float16.GetMemoryUsage()
fmt.Printf("Memory usage: %d bytes\n", usage)

// Get debug information
debug := float16.DebugInfo()
fmt.Printf("Debug info: %+v\n", debug)

Benchmarking

The package includes built-in benchmarking utilities:

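ops := float16.GetBenchmarkOperations()
a, b := float16.One(), float16.FromFloat32(2.0)
for name, op := range ops {
    // Each operation has the signature func(Float16, Float16) Float16
    fmt.Printf("Benchmarking %s\n", name)
    _ = op(a, b) // time this call in a real benchmark
}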

Range and Precision

Float16 has the following characteristics:

  • Range: ±65,504 (about ±6.55×10⁴)
  • Precision: ~3-4 decimal digits
  • Smallest positive normal: ~6.10×10⁻⁵
  • Smallest positive subnormal: ~5.96×10⁻⁸
  • Machine epsilon: ~9.77×10⁻⁴

Use Cases

Float16 is ideal for:

  • Machine Learning: Reduced memory usage and faster training
  • Graphics Programming: Color values, texture coordinates
  • Scientific Computing: Large datasets where precision can be traded for memory
  • Embedded Systems: Memory-constrained environments
  • Data Compression: Storing floating-point data more efficiently

Performance Considerations

  • Conversions between float16 and float32/float64 have computational overhead
  • Native float16 arithmetic is generally faster than conversion-based approaches
  • Enable fast math mode for performance-critical applications where precision can be sacrificed
  • Use vectorized operations for bulk processing

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation

Overview

Package float16 implements the 16-bit floating point data type (IEEE 754-2008).

This implementation provides conversion between float16 and other floating-point types (float32 and float64) with support for various rounding modes and error handling.

Special Values

The float16 type supports all IEEE 754-2008 special values:

  • Positive and negative zero
  • Positive and negative infinity
  • Not-a-Number (NaN) values with payload
  • Normalized numbers
  • Subnormal (denormal) numbers

Subnormal Numbers

When converting to higher-precision types (float32/float64), subnormal float16 values are preserved. However, when converting back from higher-precision types to float16, subnormal values may be rounded to the nearest representable normal float16 value. This behavior is consistent with many hardware implementations that handle subnormals in a similar way for performance reasons.

Rounding Modes

The following rounding modes are supported for conversions:

  • RoundNearestEven: Round to nearest, ties to even (default)
  • RoundTowardZero: Round toward zero (truncate)
  • RoundTowardPositive: Round toward positive infinity
  • RoundTowardNegative: Round toward negative infinity
  • RoundNearestAway: Round to nearest, ties away from zero
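
The modes differ only in how they pick between the two nearest representable neighbours. As a rough illustration, the sketch below applies the same five rules to rounding a float64 to an integer using the standard math package. It does not reproduce the package's conversion code, and the type and constant names are local to the example:

```go
package main

import (
	"fmt"
	"math"
)

// roundingMode mirrors the five modes listed above; the names are local to
// this sketch and are not the package's identifiers.
type roundingMode int

const (
	nearestEven roundingMode = iota
	nearestAway
	towardZero
	towardPositive
	towardNegative
)

// roundToInt rounds x to an integer under the given mode.
func roundToInt(x float64, mode roundingMode) float64 {
	switch mode {
	case nearestEven:
		return math.RoundToEven(x) // ties go to the even neighbour
	case nearestAway:
		return math.Round(x) // ties go away from zero
	case towardZero:
		return math.Trunc(x)
	case towardPositive:
		return math.Ceil(x)
	default: // towardNegative
		return math.Floor(x)
	}
}

func main() {
	modes := []roundingMode{nearestEven, nearestAway, towardZero, towardPositive, towardNegative}
	for _, x := range []float64{2.5, -2.5, 2.25} {
		fmt.Printf("%5.2f:", x)
		for _, m := range modes {
			fmt.Printf(" %g", roundToInt(x, m))
		}
		fmt.Println()
	}
}
```

The tie cases (2.5 and -2.5) are where the two nearest modes disagree; for 2.25 they agree and only the directed modes differ.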

Error Handling

Conversion functions with a ConversionMode parameter can return errors for:

  • Overflow: When a value is too large to be represented
  • Underflow: When a value is too small to be represented (in strict mode)
  • Inexact: When rounding occurs (in strict mode)
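
To make the overflow and underflow conditions concrete, here is a simplified range check written against the standard library only. The thresholds are assumptions made for this sketch (magnitude above 65504 counts as too large, nonzero magnitude below 2^-24 as too small); the package may draw these boundaries differently, for example by taking the rounding mode into account, and its strict mode also reports NaN and infinity, which this sketch ignores:

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Local sentinel errors for this sketch; they are not the package's values.
var (
	errOverflow  = errors.New("value too large for float16")
	errUnderflow = errors.New("value too small for float16")
)

const (
	maxHalf      = 65504.0          // largest finite half-precision value
	minSubnormal = 1.0 / (1 << 24) // smallest positive subnormal, 2^-24
)

// checkHalfRange reports whether x falls outside the assumed representable
// range. Zero, NaN, and infinities are not treated as range errors here.
func checkHalfRange(x float64) error {
	mag := math.Abs(x)
	switch {
	case math.IsNaN(x) || math.IsInf(x, 0) || mag == 0:
		return nil
	case mag > maxHalf:
		return errOverflow
	case mag < minSubnormal:
		return errUnderflow
	}
	return nil
}

func main() {
	for _, x := range []float64{1.5, 1e10, 1e-10, -70000} {
		fmt.Printf("%g: %v\n", x, checkHalfRange(x))
	}
}
```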

See: http://en.wikipedia.org/wiki/Half-precision_floating-point_format

Constants

const (
	Version      = "1.0.0"
	VersionMajor = 1
	VersionMinor = 0
	VersionPatch = 0
)

Package version information

const (
	SignMask     = 0x8000 // 0b1000000000000000 - Sign bit mask
	ExponentMask = 0x7C00 // 0b0111110000000000 - Exponent bits mask
	MantissaMask = 0x03FF // 0b0000001111111111 - Mantissa bits mask
	MantissaLen  = 10     // Number of mantissa bits
	ExponentLen  = 5      // Number of exponent bits

	// Exponent bias and limits for IEEE 754 half-precision
	// bias = 2^(exponent_bits-1) - 1 = 2^4 - 1 = 15
	ExponentBias = 15 // Bias for 5-bit exponent
	ExponentMax  = 31 // Maximum exponent value (11111 binary)
	ExponentMin  = 0  // Minimum exponent value

	// Normalized exponent range
	ExponentNormalMin = 1  // Minimum normalized exponent
	ExponentNormalMax = 30 // Maximum normalized exponent (infinity at 31)

	// Float32 constants for conversion
	Float32ExponentBias = 127 // IEEE 754 single precision bias
	Float32ExponentLen  = 8   // Float32 exponent bits
	Float32MantissaLen  = 23  // Float32 mantissa bits

	// Special exponent values
	ExponentZero     = 0  // Zero and subnormal numbers
	ExponentInfinity = 31 // Infinity and NaN
)

IEEE 754 half-precision format constants

Variables

var (
	DefaultArithmeticMode = ModeIEEEArithmetic
	DefaultRounding       = RoundNearestEven
)

Global arithmetic settings

var (
	DefaultConversionMode = ModeIEEE
	DefaultRoundingMode   = RoundNearestEven
)

Global conversion settings

var (
	// Common integer values
	Zero16  = PositiveZero
	One16   = ToFloat16(1.0)
	Two16   = ToFloat16(2.0)
	Three16 = ToFloat16(3.0)
	Four16  = ToFloat16(4.0)
	Five16  = ToFloat16(5.0)
	Ten16   = ToFloat16(10.0)

	// Common fractional values
	Half16    = ToFloat16(0.5)
	Quarter16 = ToFloat16(0.25)
	Third16   = ToFloat16(1.0 / 3.0)

	// Special mathematical values
	NaN16  = QuietNaN
	PosInf = PositiveInfinity
	NegInf = NegativeInfinity

	// Commonly used constants
	Deg2Rad = ToFloat16(float32(math.Pi / 180.0)) // Degrees to radians
	Rad2Deg = ToFloat16(float32(180.0 / math.Pi)) // Radians to degrees
)

Constants for common values

var (
	E       = ToFloat16(float32(math.E))       // Euler's number
	Pi      = ToFloat16(float32(math.Pi))      // Pi
	Phi     = ToFloat16(float32(math.Phi))     // Golden ratio
	Sqrt2   = ToFloat16(float32(math.Sqrt2))   // Square root of 2
	SqrtE   = ToFloat16(float32(math.SqrtE))   // Square root of E
	SqrtPi  = ToFloat16(float32(math.SqrtPi))  // Square root of Pi
	SqrtPhi = ToFloat16(float32(math.SqrtPhi)) // Square root of Phi
	Ln2     = ToFloat16(float32(math.Ln2))     // Natural logarithm of 2
	Log2E   = ToFloat16(float32(math.Log2E))   // Base-2 logarithm of E
	Ln10    = ToFloat16(float32(math.Ln10))    // Natural logarithm of 10
	Log10E  = ToFloat16(float32(math.Log10E))  // Base-10 logarithm of E
)

Mathematical constants as Float16 values

var (
	ErrOverflowError  = &Float16Error{Code: ErrOverflow, Msg: "value too large for float16"}
	ErrUnderflowError = &Float16Error{Code: ErrUnderflow, Msg: "value too small for float16"}
	ErrNaNError       = &Float16Error{Code: ErrNaN, Msg: "NaN in strict mode"}
	ErrInfinityError  = &Float16Error{Code: ErrInfinity, Msg: "infinity in strict mode"}
	ErrDivByZeroError = &Float16Error{Code: ErrDivisionByZero, Msg: "division by zero"}
)

Predefined error instances

Functions

func Configure

func Configure(cfg *Config)

Configure applies the given configuration to the package

func DebugInfo

func DebugInfo() map[string]interface{}

DebugInfo returns debugging information about the package state

func Equal

func Equal(a, b Float16) bool

Equal returns true if two Float16 values are equal

func GetBenchmarkOperations

func GetBenchmarkOperations() map[string]BenchmarkOperation

GetBenchmarkOperations returns a map of operations suitable for benchmarking

func GetMemoryUsage

func GetMemoryUsage() int

GetMemoryUsage returns the current memory usage of the package in bytes

func GetVersion

func GetVersion() string

GetVersion returns the package version string

func Greater

func Greater(a, b Float16) bool

Greater returns true if a > b

func GreaterEqual

func GreaterEqual(a, b Float16) bool

GreaterEqual returns true if a >= b

func IsFinite

func IsFinite(f Float16) bool

IsFinite reports whether f is neither infinite nor NaN

func IsInf

func IsInf(f Float16, sign int) bool

IsInf reports whether f is an infinity, according to sign. If sign > 0, IsInf reports whether f is positive infinity. If sign < 0, IsInf reports whether f is negative infinity. If sign == 0, IsInf reports whether f is either infinity.

func IsNaN

func IsNaN(f Float16) bool

IsNaN reports whether f is an IEEE 754 "not-a-number" value

func IsNormal

func IsNormal(f Float16) bool

IsNormal reports whether f is a normal number (not zero, subnormal, infinite, or NaN)

func IsSubnormal

func IsSubnormal(f Float16) bool

IsSubnormal reports whether f is a subnormal number

func Less

func Less(a, b Float16) bool

Less returns true if a < b

func LessEqual

func LessEqual(a, b Float16) bool

LessEqual returns true if a <= b

func Signbit

func Signbit(f Float16) bool

Signbit reports whether f is negative or negative zero

func ToSlice32

func ToSlice32(f16s []Float16) []float32

ToSlice32 converts a slice of Float16 to float32 with optimized performance

func ToSlice64

func ToSlice64(f16s []Float16) []float64

ToSlice64 converts a slice of Float16 to float64 with optimized performance

func ValidateSliceLength

func ValidateSliceLength(a, b []Float16) error

ValidateSliceLength checks if two slices have the same length

Types

type ArithmeticMode

type ArithmeticMode int

ArithmeticMode defines the precision/performance trade-off for arithmetic operations

const (
	// ModeIEEEArithmetic provides full IEEE 754 compliance with proper rounding
	ModeIEEEArithmetic ArithmeticMode = iota
	// ModeFastArithmetic optimizes for speed, may sacrifice some precision
	ModeFastArithmetic
	// ModeExactArithmetic provides exact results when possible, errors on precision loss
	ModeExactArithmetic
)

type BenchmarkOperation

type BenchmarkOperation func(Float16, Float16) Float16

BenchmarkOperation represents a benchmarkable operation

type Config

type Config struct {
	DefaultConversionMode ConversionMode
	DefaultRoundingMode   RoundingMode
	DefaultArithmeticMode ArithmeticMode
	EnableFastMath        bool // Enables fast math optimizations (may sacrifice precision)
}

Package configuration

func DefaultConfig

func DefaultConfig() *Config

DefaultConfig returns the default package configuration

func GetConfig

func GetConfig() *Config

GetConfig returns the current package configuration

type ConversionMode

type ConversionMode int

ConversionMode defines how conversions handle edge cases

const (
	// ModeIEEE uses standard IEEE 754 rounding and special value behavior
	ModeIEEE ConversionMode = iota
	// ModeStrict returns errors for overflow, underflow, and NaN
	ModeStrict
	// ModeFast optimizes for performance, may sacrifice some precision
	ModeFast
	// ModeExact preserves exact values when possible, errors on precision loss
	ModeExact
)

type ErrorCode

type ErrorCode int

ErrorCode represents specific error types

const (
	ErrOverflow ErrorCode = iota
	ErrUnderflow
	ErrInvalidOperation
	ErrDivisionByZero
	ErrInexact
	ErrNaN
	ErrInfinity
)

type Float16

type Float16 uint16

Float16 represents a 16-bit IEEE 754 half-precision floating-point value

const (
	PositiveZero     Float16 = 0x0000 // +0.0
	NegativeZero     Float16 = 0x8000 // -0.0
	PositiveInfinity Float16 = 0x7C00 // +∞
	NegativeInfinity Float16 = 0xFC00 // -∞

	// Largest finite values
	MaxValue Float16 = 0x7BFF // Largest positive finite value (65504)
	MinValue Float16 = 0xFBFF // Most negative finite value (-65504)

	// Smallest normalized positive value
	SmallestNormal Float16 = 0x0400 // 2^-14 ≈ 6.103515625e-05

	// Largest subnormal value
	LargestSubnormal Float16 = 0x03FF // (1023/1024) * 2^-14 ≈ 6.097555161e-05

	// Smallest positive subnormal value
	SmallestSubnormal Float16 = 0x0001 // 2^-24 ≈ 5.960464478e-08

	// Common NaN representations
	QuietNaN     Float16 = 0x7E00 // Quiet NaN (most significant mantissa bit set)
	SignalingNaN Float16 = 0x7D00 // Signaling NaN
	NegativeQNaN Float16 = 0xFE00 // Negative quiet NaN
)

Special values following IEEE 754 half-precision standard

func Abs

func Abs(f Float16) Float16

Abs returns the absolute value of f

func Acos

func Acos(f Float16) Float16

Acos returns the arccosine of f

func Add

func Add(a, b Float16) Float16

Add performs addition of two Float16 values

func AddSlice

func AddSlice(a, b []Float16) []Float16

AddSlice performs element-wise addition of two Float16 slices

func AddWithMode

func AddWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

AddWithMode performs addition with specified arithmetic and rounding modes

func Asin

func Asin(f Float16) Float16

Asin returns the arcsine of f

func Atan

func Atan(f Float16) Float16

Atan returns the arctangent of f

func Atan2

func Atan2(y, x Float16) Float16

Atan2 returns the arctangent of y/x

func Cbrt

func Cbrt(f Float16) Float16

Cbrt returns the cube root of the Float16 value

func Ceil

func Ceil(f Float16) Float16

Ceil returns the smallest integer value greater than or equal to f

func Clamp

func Clamp(f, min, max Float16) Float16

Clamp restricts f to the range [min, max]

func CopySign

func CopySign(f, sign Float16) Float16

CopySign returns a Float16 with the magnitude of f and the sign of sign

func Cos

func Cos(f Float16) Float16

Cos returns the cosine of f (in radians)

func Cosh

func Cosh(f Float16) Float16

Cosh returns the hyperbolic cosine of f

func Dim

func Dim(f, g Float16) Float16

Dim returns the positive difference between f and g: max(f-g, 0)

func Div

func Div(a, b Float16) Float16

Div performs division of two Float16 values

func DivSlice

func DivSlice(a, b []Float16) []Float16

DivSlice performs element-wise division of two Float16 slices

func DivWithMode

func DivWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

DivWithMode performs division with specified arithmetic and rounding modes

func DotProduct

func DotProduct(a, b []Float16) Float16

DotProduct computes the dot product of two Float16 slices

func Erf

func Erf(f Float16) Float16

Erf returns the error function of f

func Erfc

func Erfc(f Float16) Float16

Erfc returns the complementary error function of f

func Exp

func Exp(f Float16) Float16

Exp returns e^f

func Exp10

func Exp10(f Float16) Float16

Exp10 returns 10^f

func Exp2

func Exp2(f Float16) Float16

Exp2 returns 2^f

func FastAdd

func FastAdd(a, b Float16) Float16

FastAdd performs addition optimized for speed (may sacrifice precision)

func FastMul

func FastMul(a, b Float16) Float16

FastMul performs multiplication optimized for speed (may sacrifice precision)

func Floor

func Floor(f Float16) Float16

Floor returns the largest integer value less than or equal to f

func Frexp

func Frexp(f Float16) (frac Float16, exp int)

Frexp breaks f into a normalized fraction and an integral power of two. It returns frac and exp satisfying f == frac × 2^exp, with the absolute value of frac in the interval [0.5, 1) or zero.

func FromBits

func FromBits(bits uint16) Float16

FromBits creates a Float16 from its bit representation

func FromFloat32

func FromFloat32(f32 float32) Float16

FromFloat32 converts a float32 to Float16 (with potential precision loss)

func FromFloat64

func FromFloat64(f64 float64) Float16

FromFloat64 converts a float64 to Float16 (with potential precision loss)

func FromFloat64WithMode

func FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (Float16, error)

FromFloat64WithMode converts a float64 to Float16 with specified modes

func FromInt

func FromInt(i int) Float16

FromInt converts an integer to Float16

func FromInt32

func FromInt32(i int32) Float16

FromInt32 converts an int32 to Float16

func FromInt64

func FromInt64(i int64) Float16

FromInt64 converts an int64 to Float16 (with potential precision loss)

func FromSlice64

func FromSlice64(f64s []float64) []Float16

FromSlice64 converts a slice of float64 to Float16 with optimized performance

func Gamma

func Gamma(f Float16) Float16

Gamma returns the Gamma function of f

func Hypot

func Hypot(f, g Float16) Float16

Hypot returns sqrt(f*f + g*g), taking care to avoid overflow and underflow

func Inf

func Inf(sign int) Float16

Inf returns a Float16 infinity value. If sign >= 0, it returns positive infinity. If sign < 0, it returns negative infinity.

func J0

func J0(f Float16) Float16

J0 returns the order-zero Bessel function of the first kind

func J1

func J1(f Float16) Float16

J1 returns the order-one Bessel function of the first kind

func Ldexp

func Ldexp(frac Float16, exp int) Float16

Ldexp returns frac × 2^exp

func Lerp

func Lerp(a, b, t Float16) Float16

Lerp performs linear interpolation between a and b by factor t

func Lgamma

func Lgamma(f Float16) (Float16, int)

Lgamma returns the natural logarithm and sign of Gamma(f)

func Log

func Log(f Float16) Float16

Log returns the natural logarithm of f

func Log10

func Log10(f Float16) Float16

Log10 returns the base-10 logarithm of f

func Log2

func Log2(f Float16) Float16

Log2 returns the base-2 logarithm of f

func Max

func Max(a, b Float16) Float16

Max returns the larger of two Float16 values

func Min

func Min(a, b Float16) Float16

Min returns the smaller of two Float16 values

func Mod

func Mod(f, divisor Float16) Float16

Mod returns the floating-point remainder of f/divisor

func Modf

func Modf(f Float16) (integer, frac Float16)

Modf returns integer and fractional floating-point numbers that sum to f. Both values have the same sign as f.

func Mul

func Mul(a, b Float16) Float16

Mul performs multiplication of two Float16 values

func MulSlice

func MulSlice(a, b []Float16) []Float16

MulSlice performs element-wise multiplication of two Float16 slices

func MulWithMode

func MulWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

MulWithMode performs multiplication with specified arithmetic and rounding modes

func NaN

func NaN() Float16

NaN returns a Float16 quiet NaN value

func NextAfter

func NextAfter(f, g Float16) Float16

NextAfter returns the next representable Float16 value after f in the direction of g
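
One common way to find a neighbouring value relies on the fact that, for a fixed sign, IEEE 754 bit patterns are ordered the same way as the values they encode. The sketch below steps a raw 16-bit pattern toward +∞ only. It is an illustration under that assumption, not the package's NextAfter, and nextUp is a hypothetical name:

```go
package main

import "fmt"

// nextUp returns the bit pattern of the next half-precision value toward +∞.
// NaN and +∞ are returned unchanged. Hypothetical helper for illustration.
func nextUp(bits uint16) uint16 {
	const (
		signMask = 0x8000
		posInf   = 0x7C00
	)
	switch {
	case bits&^signMask > posInf: // NaN stays NaN
		return bits
	case bits == posInf: // nothing above +∞
		return bits
	case bits&^signMask == 0: // ±0 steps to the smallest positive subnormal
		return 0x0001
	case bits&signMask == 0: // positive: larger magnitude is the next pattern
		return bits + 1
	default: // negative: smaller magnitude is the previous pattern
		return bits - 1
	}
}

func main() {
	for _, b := range []uint16{0x3C00, 0x0000, 0x7BFF, 0xFC00, 0x8001} {
		fmt.Printf("0x%04X -> 0x%04X\n", b, nextUp(b))
	}
}
```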

func Norm2

func Norm2(s []Float16) Float16

Norm2 computes the L2 norm (Euclidean norm) of a Float16 slice

func One

func One() Float16

One returns a Float16 value representing 1.0

func Parse

func Parse(s string) (Float16, error)

Parse converts a string to Float16 (placeholder for future implementation)

func Pow

func Pow(f, exp Float16) Float16

Pow returns f raised to the power of exp

func Remainder

func Remainder(f, divisor Float16) Float16

Remainder returns the IEEE 754 floating-point remainder of f/divisor

func Round

func Round(f Float16) Float16

Round returns the nearest integer value to f

func RoundToEven

func RoundToEven(f Float16) Float16

RoundToEven returns the nearest integer value to f, rounding ties to even

func ScaleSlice

func ScaleSlice(s []Float16, scalar Float16) []Float16

ScaleSlice multiplies each element in the slice by a scalar

func Sign

func Sign(f Float16) Float16

Sign returns -1, 0, or 1 depending on the sign of f

func Sin

func Sin(f Float16) Float16

Sin returns the sine of f (in radians)

func Sinh

func Sinh(f Float16) Float16

Sinh returns the hyperbolic sine of f

func Sqrt

func Sqrt(f Float16) Float16

Sqrt returns the square root of the Float16 value

func Sub

func Sub(a, b Float16) Float16

Sub performs subtraction of two Float16 values

func SubSlice

func SubSlice(a, b []Float16) []Float16

SubSlice performs element-wise subtraction of two Float16 slices

func SubWithMode

func SubWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

SubWithMode performs subtraction with specified arithmetic and rounding modes

func SumSlice

func SumSlice(s []Float16) Float16

SumSlice returns the sum of all elements in the slice

func Tan

func Tan(f Float16) Float16

Tan returns the tangent of f (in radians)

func Tanh

func Tanh(f Float16) Float16

Tanh returns the hyperbolic tangent of f

func ToFloat16

func ToFloat16(f32 float32) Float16

ToFloat16 converts a float32 value to Float16 format using default settings

func ToFloat16WithMode

func ToFloat16WithMode(f32 float32, convMode ConversionMode, roundMode RoundingMode) (Float16, error)

ToFloat16WithMode converts a float32 to Float16 with specified conversion and rounding modes

func ToSlice16

func ToSlice16(f32s []float32) []Float16

ToSlice16 converts a slice of float32 to Float16 with optimized performance

func ToSlice16WithMode

func ToSlice16WithMode(f32s []float32, convMode ConversionMode, roundMode RoundingMode) ([]Float16, []error)

ToSlice16WithMode converts a slice of float32 to Float16 with the specified conversion and rounding modes. It is a SIMD-friendly batch conversion with error handling.

func Trunc

func Trunc(f Float16) Float16

Trunc returns the integer part of f (truncated towards zero)

func VectorAdd

func VectorAdd(a, b []Float16) []Float16

VectorAdd performs vectorized addition (placeholder for future SIMD implementation)

func VectorMul

func VectorMul(a, b []Float16) []Float16

VectorMul performs vectorized multiplication (placeholder for future SIMD implementation)

func Y0

func Y0(f Float16) Float16

Y0 returns the order-zero Bessel function of the second kind

func Y1

func Y1(f Float16) Float16

Y1 returns the order-one Bessel function of the second kind

func Zero

func Zero() Float16

Zero returns a Float16 zero value

func (Float16) Abs

func (f Float16) Abs() Float16

Abs returns the absolute value of the Float16

func (Float16) Bits

func (f Float16) Bits() uint16

Bits returns the underlying uint16 representation

func (Float16) Class

func (f Float16) Class() FloatClass

Class returns the IEEE 754 classification of the Float16 value

func (Float16) CopySign

func (f Float16) CopySign(sign Float16) Float16

CopySign returns a Float16 with the magnitude of f and the sign of sign

func (Float16) GoString

func (f Float16) GoString() string

GoString returns a Go syntax representation of the Float16 value

func (Float16) IsFinite

func (f Float16) IsFinite() bool

IsFinite returns true if the Float16 value is finite (not infinity or NaN)

func (Float16) IsInf

func (f Float16) IsInf(sign int) bool

IsInf returns true if the Float16 value represents infinity. If sign > 0, it returns true only for positive infinity. If sign < 0, it returns true only for negative infinity. If sign == 0, it returns true for either infinity.

func (Float16) IsNaN

func (f Float16) IsNaN() bool

IsNaN returns true if the Float16 value represents NaN (Not a Number)

func (Float16) IsNormal

func (f Float16) IsNormal() bool

IsNormal returns true if the Float16 value is normalized (not zero, subnormal, infinite, or NaN)

func (Float16) IsSubnormal

func (f Float16) IsSubnormal() bool

IsSubnormal returns true if the Float16 value is subnormal (denormalized)

func (Float16) IsZero

func (f Float16) IsZero() bool

IsZero returns true if the Float16 value represents zero (positive or negative)

func (Float16) Neg

func (f Float16) Neg() Float16

Neg returns the negation of the Float16

func (Float16) Sign

func (f Float16) Sign() int

Sign returns the sign of the Float16 value: 1 for positive, -1 for negative, 0 for zero

func (Float16) Signbit

func (f Float16) Signbit() bool

Signbit returns true if the Float16 value has a negative sign bit

func (Float16) String

func (f Float16) String() string

String returns a string representation of the Float16 value

func (Float16) ToFloat32

func (f Float16) ToFloat32() float32

ToFloat32 converts a Float16 value to float32 with full precision

func (Float16) ToFloat64

func (f Float16) ToFloat64() float64

ToFloat64 converts a Float16 value to float64 with full precision

func (Float16) ToInt

func (f Float16) ToInt() int

ToInt converts a Float16 to int (truncated toward zero)

func (Float16) ToInt32

func (f Float16) ToInt32() int32

ToInt32 converts a Float16 to int32 (truncated toward zero)

func (Float16) ToInt64

func (f Float16) ToInt64() int64

ToInt64 converts a Float16 to int64 (truncated toward zero)

type Float16Error

type Float16Error struct {
	Op    string      // Operation that caused the error
	Value interface{} // Input value that caused the error
	Msg   string      // Error message
	Code  ErrorCode   // Specific error code
}

Float16Error represents errors that can occur during Float16 operations

func (*Float16Error) Error

func (e *Float16Error) Error() string

type FloatClass

type FloatClass int

FloatClass represents the IEEE 754 class of a floating-point value

const (
	ClassSignalingNaN FloatClass = iota
	ClassQuietNaN
	ClassNegativeInfinity
	ClassNegativeNormal
	ClassNegativeSubnormal
	ClassNegativeZero
	ClassPositiveZero
	ClassPositiveSubnormal
	ClassPositiveNormal
	ClassPositiveInfinity
)

func FpClassify

func FpClassify(f Float16) FloatClass

FpClassify returns the IEEE 754 class of f
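
Classification can be done from the bit pattern alone. The following sketch mirrors the FloatClass constants above with local names and assumes, as the QuietNaN comment states, that the most significant mantissa bit distinguishes quiet from signaling NaNs. It is an illustration, not the package's implementation:

```go
package main

import "fmt"

// floatClass mirrors the FloatClass constants above; names are local to this sketch.
type floatClass int

const (
	classSignalingNaN floatClass = iota
	classQuietNaN
	classNegativeInfinity
	classNegativeNormal
	classNegativeSubnormal
	classNegativeZero
	classPositiveZero
	classPositiveSubnormal
	classPositiveNormal
	classPositiveInfinity
)

// pick returns neg when negative is true and pos otherwise.
func pick(negative bool, neg, pos floatClass) floatClass {
	if negative {
		return neg
	}
	return pos
}

// classify determines the class of a half-precision bit pattern.
func classify(bits uint16) floatClass {
	negative := bits&0x8000 != 0
	exp := (bits & 0x7C00) >> 10
	mant := bits & 0x03FF

	switch {
	case exp == 31 && mant != 0: // NaN: top mantissa bit marks a quiet NaN
		if mant&0x0200 != 0 {
			return classQuietNaN
		}
		return classSignalingNaN
	case exp == 31:
		return pick(negative, classNegativeInfinity, classPositiveInfinity)
	case exp == 0 && mant == 0:
		return pick(negative, classNegativeZero, classPositiveZero)
	case exp == 0:
		return pick(negative, classNegativeSubnormal, classPositiveSubnormal)
	default:
		return pick(negative, classNegativeNormal, classPositiveNormal)
	}
}

func main() {
	for _, b := range []uint16{0x7E00, 0x7D00, 0xFC00, 0x8000, 0x0001, 0x3C00} {
		fmt.Printf("0x%04X -> class %d\n", b, classify(b))
	}
}
```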

type RoundingMode

type RoundingMode int

RoundingMode defines IEEE 754 rounding behavior

const (
	// RoundNearestEven rounds to nearest, ties to even (IEEE default)
	RoundNearestEven RoundingMode = iota
	// RoundNearestAway rounds to nearest, ties away from zero
	RoundNearestAway
	// RoundTowardZero truncates toward zero
	RoundTowardZero
	// RoundTowardPositive rounds toward +∞
	RoundTowardPositive
	// RoundTowardNegative rounds toward -∞
	RoundTowardNegative
)

type SliceStats

type SliceStats struct {
	Min    Float16
	Max    Float16
	Sum    Float16
	Mean   Float16
	Length int
}

SliceStats holds basic statistics for a Float16 slice

func ComputeSliceStats

func ComputeSliceStats(s []Float16) SliceStats

ComputeSliceStats calculates statistics for a Float16 slice
