float16

Version: v0.1.0 (latest) | Published: Jul 28, 2025 | License: Apache-2.0 | Imports: 5 | Imported by: 6

Repository

github.com/zerfoo/float16

README ¶

float16

A comprehensive Go implementation of IEEE 754-2008 16-bit floating-point (half-precision) arithmetic with full support for special values, multiple rounding modes, and high-performance operations.

Features

Full IEEE 754-2008 compliance for 16-bit floating-point arithmetic
Complete special value support: ±0, ±∞, NaN (with payload), normalized and subnormal numbers
Multiple rounding modes: nearest-even, toward zero, toward ±∞, nearest-away
Flexible conversion modes: IEEE standard, strict error handling, fast approximations
High-performance operations with optional fast math optimizations
Comprehensive test suite with extensive edge case coverage
Zero dependencies - pure Go implementation

Installation

go get github.com/zerfoo/float16

Quick Start

package main

import (
    "fmt"
    "github.com/zerfoo/float16"
)

func main() {
    // Create float16 values
    a := float16.FromFloat32(3.14159)
    b := float16.FromFloat64(2.71828)
    
    // Basic arithmetic (Add and Mul are package-level functions in the API index)
    sum := float16.Add(a, b)
    product := float16.Mul(a, b)
    
    // Convert back to other types
    fmt.Printf("Sum: %v (float32: %f)\n", sum, sum.ToFloat32())
    fmt.Printf("Product: %v (float64: %f)\n", product, product.ToFloat64())
    
    // Work with special values
    inf := float16.Inf(1)  // positive infinity
    nan := float16.NaN()   // quiet NaN
    zero := float16.Zero() // positive zero
    
    fmt.Printf("Infinity: %v\n", inf)
    fmt.Printf("NaN: %v\n", nan)
    fmt.Printf("Zero: %v\n", zero)
}

Core Types and Constants

Float16 Type

The Float16 type represents a 16-bit IEEE 754 half-precision floating-point value:

type Float16 uint16

Special Values

const (
    PositiveZero     Float16 = 0x0000 // +0.0
    NegativeZero     Float16 = 0x8000 // -0.0
    PositiveInfinity Float16 = 0x7C00 // +∞
    NegativeInfinity Float16 = 0xFC00 // -∞
    MaxValue         Float16 = 0x7BFF // ~65504
    MinValue         Float16 = 0xFBFF // ~-65504
)

Conversion Functions

From Other Types

// From float32/float64
f16 := float16.FromFloat32(3.14159)
f16 = float16.FromFloat64(2.71828)

// From bit representation
f16 = float16.FromBits(0x4200) // 3.0

// From string (the API docs describe Parse as a placeholder for a future implementation)
f16, err := float16.Parse("3.14159")

To Other Types

f32 := f16.ToFloat32()
f64 := f16.ToFloat64()
bits := f16.Bits()
str := f16.String()

Arithmetic Operations

a := float16.FromFloat32(5.0)
b := float16.FromFloat32(3.0)

// Basic arithmetic (package-level functions)
sum := float16.Add(a, b)        // 8.0
diff := float16.Sub(a, b)       // 2.0
product := float16.Mul(a, b)    // 15.0
quotient := float16.Div(a, b)   // 1.666...

// Mathematical functions
sqrt := float16.Sqrt(a) // √5
abs := a.Abs()          // |a|
neg := a.Neg()          // -a

Rounding Modes

Configure rounding behavior for conversions:

import "github.com/zerfoo/float16"

// Set global rounding mode
config := float16.GetConfig()
config.DefaultRoundingMode = float16.RoundTowardZero
float16.Configure(config)

// Available rounding modes:
// - RoundNearestEven (default)
// - RoundTowardZero
// - RoundTowardPositive  
// - RoundTowardNegative
// - RoundNearestAway

Conversion Modes

Control conversion behavior and error handling:

config := float16.GetConfig()
config.DefaultConversionMode = float16.ModeStrict
float16.Configure(config)

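// Available modes:
// - ModeIEEE: Standard IEEE 754 behavior
// - ModeStrict: Returns errors for overflow, underflow, and NaN
// - ModeFast: Optimized for performance, may sacrifice some precision
// - ModeExact: Preserves exact values when possible, errors on precision loss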

Special Value Handling

f := float16.FromFloat64(math.Inf(1)) // math.Inf returns a float64

// Check value types
if f.IsInf(0) {
    fmt.Println("Value is infinity")
}
if f.IsNaN() {
    fmt.Println("Value is NaN")
}
if f.IsFinite() {
    fmt.Println("Value is finite")
}
if f.IsNormal() {
    fmt.Println("Value is normalized")
}
if f.IsSubnormal() {
    fmt.Println("Value is subnormal")
}

// IEEE 754 classification
class := f.Class()
switch class {
case float16.ClassPositiveInfinity:
    fmt.Println("Positive infinity")
case float16.ClassQuietNaN:
    fmt.Println("Quiet NaN")
// ... other classes
}

Performance Features

Fast Math Operations

// Enable fast math for better performance (may sacrifice precision)
config := float16.GetConfig()
config.EnableFastMath = true
float16.Configure(config)

// Use fast operations
result := float16.FastAdd(a, b)
result = float16.FastMul(a, b)

Vectorized Operations

// Vectorized operations (the API docs describe these as placeholders for a future SIMD implementation)
a := []float16.Float16{...}
b := []float16.Float16{...}

sum := float16.VectorAdd(a, b)
product := float16.VectorMul(a, b)

Error Handling

// Strict mode returns errors for exceptional conditions
config := float16.GetConfig()
config.DefaultConversionMode = float16.ModeStrict
float16.Configure(config)

f16, err := float16.ToFloat16WithMode(1e10, float16.ModeStrict, float16.RoundNearestEven)
if err != nil {
    if float16Err, ok := err.(*float16.Float16Error); ok {
        switch float16Err.Code {
        case float16.ErrOverflow:
            fmt.Println("Value too large for float16")
        case float16.ErrUnderflow:
            fmt.Println("Value too small for float16")
        }
    }
}
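The type assertion above only matches when the error is returned unwrapped. As a minimal sketch of an alternative, the following uses `errors.As` from the standard library, which also matches wrapped errors. Everything here is a local stand-in: `halfError`, `errorCode`, `checkRange`, and `describe` are hypothetical names that mirror the shape of `Float16Error`, and the thresholds in `checkRange` are assumptions, not the package's actual strict-mode rules.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errorCode and halfError are local stand-ins that mirror the shape of the
// package's ErrorCode and Float16Error types (assumption, not the real types).
type errorCode int

const (
	codeOverflow errorCode = iota
	codeUnderflow
)

type halfError struct {
	Code errorCode
	Msg  string
}

func (e *halfError) Error() string { return e.Msg }

// checkRange is a hypothetical strict-mode check. It assumes that a magnitude
// above 65504 is an overflow and that a nonzero magnitude below 2^-24 is an
// underflow; the package may draw these lines differently.
func checkRange(v float64) error {
	const maxFinite = 65504.0
	const minSubnormal = 1.0 / (1 << 24)
	abs := math.Abs(v)
	switch {
	case abs > maxFinite:
		err := &halfError{Code: codeOverflow, Msg: "value too large for float16"}
		return fmt.Errorf("checking %g: %w", v, err) // wrapped on purpose
	case abs != 0 && abs < minSubnormal:
		err := &halfError{Code: codeUnderflow, Msg: "value too small for float16"}
		return fmt.Errorf("checking %g: %w", v, err)
	}
	return nil
}

// describe unwraps err with errors.As and reports the error code, if any.
func describe(err error) string {
	var he *halfError
	if !errors.As(err, &he) {
		return "none"
	}
	switch he.Code {
	case codeOverflow:
		return "overflow"
	case codeUnderflow:
		return "underflow"
	}
	return "other"
}

func main() {
	for _, v := range []float64{1e10, 1e-10, 1.5} {
		fmt.Printf("%g: %s\n", v, describe(checkRange(v)))
	}
}
```

With the real package, the same pattern would take a `*float16.Float16Error` target in place of `*halfError`.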

Utilities

Statistics for Slices

values := []float16.Float16{
    float16.FromFloat32(1.0),
    float16.FromFloat32(2.0),
    float16.FromFloat32(3.0),
}

stats := float16.ComputeSliceStats(values)
fmt.Printf("Min: %v, Max: %v, Mean: %v\n", stats.Min, stats.Max, stats.Mean)

Debugging and Monitoring

// Get memory usage
usage := float16.GetMemoryUsage()
fmt.Printf("Memory usage: %d bytes\n", usage)

// Get debug information
debug := float16.DebugInfo()
fmt.Printf("Debug info: %+v\n", debug)

Benchmarking

The package includes built-in benchmarking utilities:

ops := float16.GetBenchmarkOperations()
for name, op := range ops {
    // Benchmark operation
    fmt.Printf("Benchmarking %s\n", name)
}

Range and Precision

Float16 has the following characteristics:

Range: ±65,504 (≈ ±6.55×10⁴)
Precision: ~3-4 decimal digits
Smallest positive normal: ~6.10×10⁻⁵
Smallest positive subnormal: ~5.96×10⁻⁸
Machine epsilon: ~9.77×10⁻⁴
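These figures follow directly from the format's 10 mantissa bits and its normal exponent range of -14 to 15. The sketch below recomputes them with the standard library only; the function names are illustrative and are not part of the package.

```go
package main

import (
	"fmt"
	"math"
)

// Format parameters of IEEE 754 half precision.
const (
	mantissaBits = 10  // explicit fraction bits
	minNormalExp = -14 // smallest unbiased exponent of a normal number
	maxNormalExp = 15  // largest unbiased exponent of a finite number
)

// maxFinite is (2 - 2^-10) × 2^15.
func maxFinite() float64 {
	return (2 - math.Ldexp(1, -mantissaBits)) * math.Ldexp(1, maxNormalExp)
}

// smallestNormal is 2^-14.
func smallestNormal() float64 { return math.Ldexp(1, minNormalExp) }

// smallestSubnormal is 2^-14 × 2^-10 = 2^-24.
func smallestSubnormal() float64 { return math.Ldexp(1, minNormalExp-mantissaBits) }

// epsilon is the gap between 1.0 and the next representable value, 2^-10.
func epsilon() float64 { return math.Ldexp(1, -mantissaBits) }

// decimalDigits is the precision of an 11-bit significand in decimal digits.
func decimalDigits() float64 { return float64(mantissaBits+1) * math.Log10(2) }

func main() {
	fmt.Printf("max finite:         %g\n", maxFinite())
	fmt.Printf("smallest normal:    %.3g\n", smallestNormal())
	fmt.Printf("smallest subnormal: %.3g\n", smallestSubnormal())
	fmt.Printf("machine epsilon:    %.3g\n", epsilon())
	fmt.Printf("decimal digits:     %.2f\n", decimalDigits())
}
```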

Use Cases

Float16 is ideal for:

Machine Learning: Reduced memory usage and faster training
Graphics Programming: Color values, texture coordinates
Scientific Computing: Large datasets where precision can be traded for memory
Embedded Systems: Memory-constrained environments
Data Compression: Storing floating-point data more efficiently

Performance Considerations

Conversions between float16 and float32/float64 have computational overhead
Native float16 arithmetic is generally faster than conversion-based approaches
Enable fast math mode for performance-critical applications where precision can be sacrificed
Use vectorized operations for bulk processing

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation ¶

Overview ¶

Package float16 implements the 16-bit floating point data type (IEEE 754-2008).

This implementation provides conversion between float16 and other floating-point types (float32 and float64) with support for various rounding modes and error handling.

Special Values ¶

The float16 type supports all IEEE 754-2008 special values:

Positive and negative zero
Positive and negative infinity
Not-a-Number (NaN) values with payload
Normalized numbers
Subnormal (denormal) numbers

Subnormal Numbers ¶

When converting to higher-precision types (float32/float64), subnormal float16 values are preserved. However, when converting back from higher-precision types to float16, subnormal values may be rounded to the nearest representable normal float16 value. This behavior is consistent with many hardware implementations that handle subnormals in a similar way for performance reasons.

Rounding Modes ¶

The following rounding modes are supported for conversions:

RoundNearestEven: Round to nearest, ties to even (default)
RoundTowardZero: Round toward zero (truncate)
RoundTowardPositive: Round toward positive infinity
RoundTowardNegative: Round toward negative infinity
RoundNearestAway: Round to nearest, ties away from zero
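The five modes differ only in which neighbor they pick, and the difference is easiest to see on exact ties. The sketch below is an analogy rather than the package's code: it applies the same five rules at the integer position using the standard `math` package, whereas the package applies them at the float16 mantissa position. The helper `roundWith` and its string keys are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// roundWith rounds x to an integer under the named mode. The names mirror
// the list above; the function itself is illustrative only.
func roundWith(mode string, x float64) float64 {
	switch mode {
	case "RoundNearestEven":
		return math.RoundToEven(x) // ties go to the even neighbor
	case "RoundTowardZero":
		return math.Trunc(x) // drop the fraction
	case "RoundTowardPositive":
		return math.Ceil(x) // always up
	case "RoundTowardNegative":
		return math.Floor(x) // always down
	case "RoundNearestAway":
		return math.Round(x) // ties go away from zero
	}
	return math.NaN()
}

func main() {
	modes := []string{
		"RoundNearestEven",
		"RoundTowardZero",
		"RoundTowardPositive",
		"RoundTowardNegative",
		"RoundNearestAway",
	}
	for _, x := range []float64{2.5, -2.5} {
		for _, m := range modes {
			fmt.Printf("%-20s %5.1f -> %v\n", m, x, roundWith(m, x))
		}
	}
}
```

For 2.5 the five modes give 2, 2, 3, 2, 3; for -2.5 they give -2, -2, -2, -3, -3.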

Error Handling ¶

Conversion functions with a ConversionMode parameter can return errors for:

Overflow: When a value is too large to be represented
Underflow: When a value is too small to be represented (in strict mode)
Inexact: When rounding occurs (in strict mode)

See: http://en.wikipedia.org/wiki/Half-precision_floating-point_format

Index ¶

Constants
Variables
func Configure(cfg *Config)
func DebugInfo() map[string]interface{}
func Equal(a, b Float16) bool
func GetBenchmarkOperations() map[string]BenchmarkOperation
func GetMemoryUsage() int
func GetVersion() string
func Greater(a, b Float16) bool
func GreaterEqual(a, b Float16) bool
func IsFinite(f Float16) bool
func IsInf(f Float16, sign int) bool
func IsNaN(f Float16) bool
func IsNormal(f Float16) bool
func IsSubnormal(f Float16) bool
func Less(a, b Float16) bool
func LessEqual(a, b Float16) bool
func Signbit(f Float16) bool
func ToSlice32(f16s []Float16) []float32
func ToSlice64(f16s []Float16) []float64
func ValidateSliceLength(a, b []Float16) error
type ArithmeticMode
type BenchmarkOperation
type Config
- func DefaultConfig() *Config
- func GetConfig() *Config
type ConversionMode
type ErrorCode
type Float16
- func Abs(f Float16) Float16
- func Acos(f Float16) Float16
- func Add(a, b Float16) Float16
- func AddSlice(a, b []Float16) []Float16
- func AddWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func Asin(f Float16) Float16
- func Atan(f Float16) Float16
- func Atan2(y, x Float16) Float16
- func Cbrt(f Float16) Float16
- func Ceil(f Float16) Float16
- func Clamp(f, min, max Float16) Float16
- func CopySign(f, sign Float16) Float16
- func Cos(f Float16) Float16
- func Cosh(f Float16) Float16
- func Dim(f, g Float16) Float16
- func Div(a, b Float16) Float16
- func DivSlice(a, b []Float16) []Float16
- func DivWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func DotProduct(a, b []Float16) Float16
- func Erf(f Float16) Float16
- func Erfc(f Float16) Float16
- func Exp(f Float16) Float16
- func Exp10(f Float16) Float16
- func Exp2(f Float16) Float16
- func FastAdd(a, b Float16) Float16
- func FastMul(a, b Float16) Float16
- func Floor(f Float16) Float16
- func Frexp(f Float16) (frac Float16, exp int)
- func FromBits(bits uint16) Float16
- func FromFloat32(f32 float32) Float16
- func FromFloat64(f64 float64) Float16
- func FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (Float16, error)
- func FromInt(i int) Float16
- func FromInt32(i int32) Float16
- func FromInt64(i int64) Float16
- func FromSlice64(f64s []float64) []Float16
- func Gamma(f Float16) Float16
- func Hypot(f, g Float16) Float16
- func Inf(sign int) Float16
- func J0(f Float16) Float16
- func J1(f Float16) Float16
- func Ldexp(frac Float16, exp int) Float16
- func Lerp(a, b, t Float16) Float16
- func Lgamma(f Float16) (Float16, int)
- func Log(f Float16) Float16
- func Log10(f Float16) Float16
- func Log2(f Float16) Float16
- func Max(a, b Float16) Float16
- func Min(a, b Float16) Float16
- func Mod(f, divisor Float16) Float16
- func Modf(f Float16) (integer, frac Float16)
- func Mul(a, b Float16) Float16
- func MulSlice(a, b []Float16) []Float16
- func MulWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func NaN() Float16
- func NextAfter(f, g Float16) Float16
- func Norm2(s []Float16) Float16
- func One() Float16
- func Parse(s string) (Float16, error)
- func Pow(f, exp Float16) Float16
- func Remainder(f, divisor Float16) Float16
- func Round(f Float16) Float16
- func RoundToEven(f Float16) Float16
- func ScaleSlice(s []Float16, scalar Float16) []Float16
- func Sign(f Float16) Float16
- func Sin(f Float16) Float16
- func Sinh(f Float16) Float16
- func Sqrt(f Float16) Float16
- func Sub(a, b Float16) Float16
- func SubSlice(a, b []Float16) []Float16
- func SubWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)
- func SumSlice(s []Float16) Float16
- func Tan(f Float16) Float16
- func Tanh(f Float16) Float16
- func ToFloat16(f32 float32) Float16
- func ToFloat16WithMode(f32 float32, convMode ConversionMode, roundMode RoundingMode) (Float16, error)
- func ToSlice16(f32s []float32) []Float16
- func ToSlice16WithMode(f32s []float32, convMode ConversionMode, roundMode RoundingMode) ([]Float16, []error)
- func Trunc(f Float16) Float16
- func VectorAdd(a, b []Float16) []Float16
- func VectorMul(a, b []Float16) []Float16
- func Y0(f Float16) Float16
- func Y1(f Float16) Float16
- func Zero() Float16
- func (f Float16) Abs() Float16
- func (f Float16) Bits() uint16
- func (f Float16) Class() FloatClass
- func (f Float16) CopySign(sign Float16) Float16
- func (f Float16) GoString() string
- func (f Float16) IsFinite() bool
- func (f Float16) IsInf(sign int) bool
- func (f Float16) IsNaN() bool
- func (f Float16) IsNormal() bool
- func (f Float16) IsSubnormal() bool
- func (f Float16) IsZero() bool
- func (f Float16) Neg() Float16
- func (f Float16) Sign() int
- func (f Float16) Signbit() bool
- func (f Float16) String() string
- func (f Float16) ToFloat32() float32
- func (f Float16) ToFloat64() float64
- func (f Float16) ToInt() int
- func (f Float16) ToInt32() int32
- func (f Float16) ToInt64() int64
type Float16Error
- func (e *Float16Error) Error() string
type FloatClass
- func FpClassify(f Float16) FloatClass
type RoundingMode
type SliceStats
- func ComputeSliceStats(s []Float16) SliceStats

Constants ¶

const (
	Version      = "1.0.0"
	VersionMajor = 1
	VersionMinor = 0
	VersionPatch = 0
)

Package version information

const (
	SignMask     = 0x8000 // 0b1000000000000000 - Sign bit mask
	ExponentMask = 0x7C00 // 0b0111110000000000 - Exponent bits mask
	MantissaMask = 0x03FF // 0b0000001111111111 - Mantissa bits mask
	MantissaLen  = 10     // Number of mantissa bits
	ExponentLen  = 5      // Number of exponent bits

	// Exponent bias and limits for IEEE 754 half-precision
	// bias = 2^(exponent_bits-1) - 1 = 2^4 - 1 = 15
	ExponentBias = 15 // Bias for 5-bit exponent
	ExponentMax  = 31 // Maximum exponent value (11111 binary)
	ExponentMin  = 0  // Minimum exponent value

	// Normalized exponent range
	ExponentNormalMin = 1  // Minimum normalized exponent
	ExponentNormalMax = 30 // Maximum normalized exponent (infinity at 31)

	// Float32 constants for conversion
	Float32ExponentBias = 127 // IEEE 754 single precision bias
	Float32ExponentLen  = 8   // Float32 exponent bits
	Float32MantissaLen  = 23  // Float32 mantissa bits

	// Special exponent values
	ExponentZero     = 0  // Zero and subnormal numbers
	ExponentInfinity = 31 // Infinity and NaN
)

IEEE 754 half-precision format constants
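To show how these masks and the bias combine, here is a minimal sketch that decodes a raw bit pattern into a float64 using only the standard library. The helper `decodeHalf` is hypothetical and is not the package's conversion routine; the constant values are copied from the block above.

```go
package main

import (
	"fmt"
	"math"
)

// Masks and sizes copied from the constants listed above.
const (
	signMask     = 0x8000
	exponentMask = 0x7C00
	mantissaMask = 0x03FF
	mantissaLen  = 10
	exponentBias = 15
)

// decodeHalf is a hypothetical helper (not part of the package) that
// interprets a raw half-precision bit pattern as a float64.
func decodeHalf(bits uint16) float64 {
	sign := 1.0
	if bits&signMask != 0 {
		sign = -1.0
	}
	exp := int(bits&exponentMask) >> mantissaLen
	man := float64(bits & mantissaMask)

	switch exp {
	case 0:
		// Zero and subnormals: no implicit leading 1, exponent fixed at 1-bias.
		return sign * math.Ldexp(man, 1-exponentBias-mantissaLen)
	case 31:
		// All-ones exponent: infinity if the mantissa is zero, otherwise NaN.
		if man == 0 {
			return math.Inf(int(sign))
		}
		return math.NaN()
	default:
		// Normal numbers: implicit leading 1 (1024 in mantissa units).
		return sign * math.Ldexp(1024+man, exp-exponentBias-mantissaLen)
	}
}

func main() {
	fmt.Println(decodeHalf(0x4200)) // 3
	fmt.Println(decodeHalf(0x7BFF)) // 65504
	fmt.Println(decodeHalf(0x0400)) // smallest normal, 2^-14
	fmt.Println(decodeHalf(0x0001)) // smallest subnormal, 2^-24
	fmt.Println(decodeHalf(0xFC00)) // -Inf
}
```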

Variables ¶

var (
	DefaultArithmeticMode = ModeIEEEArithmetic
	DefaultRounding       = RoundNearestEven
)

Global arithmetic settings

var (
	DefaultConversionMode = ModeIEEE
	DefaultRoundingMode   = RoundNearestEven
)

Global conversion settings

var (
	// Common integer values
	Zero16  = PositiveZero
	One16   = ToFloat16(1.0)
	Two16   = ToFloat16(2.0)
	Three16 = ToFloat16(3.0)
	Four16  = ToFloat16(4.0)
	Five16  = ToFloat16(5.0)
	Ten16   = ToFloat16(10.0)

	// Common fractional values
	Half16    = ToFloat16(0.5)
	Quarter16 = ToFloat16(0.25)
	Third16   = ToFloat16(1.0 / 3.0)

	// Special mathematical values
	NaN16  = QuietNaN
	PosInf = PositiveInfinity
	NegInf = NegativeInfinity

	// Commonly used constants
	Deg2Rad = ToFloat16(float32(math.Pi / 180.0)) // Degrees to radians
	Rad2Deg = ToFloat16(float32(180.0 / math.Pi)) // Radians to degrees
)

Constants for common values

var (
	E       = ToFloat16(float32(math.E))       // Euler's number
	Pi      = ToFloat16(float32(math.Pi))      // Pi
	Phi     = ToFloat16(float32(math.Phi))     // Golden ratio
	Sqrt2   = ToFloat16(float32(math.Sqrt2))   // Square root of 2
	SqrtE   = ToFloat16(float32(math.SqrtE))   // Square root of E
	SqrtPi  = ToFloat16(float32(math.SqrtPi))  // Square root of Pi
	SqrtPhi = ToFloat16(float32(math.SqrtPhi)) // Square root of Phi
	Ln2     = ToFloat16(float32(math.Ln2))     // Natural logarithm of 2
	Log2E   = ToFloat16(float32(math.Log2E))   // Base-2 logarithm of E
	Ln10    = ToFloat16(float32(math.Ln10))    // Natural logarithm of 10
	Log10E  = ToFloat16(float32(math.Log10E))  // Base-10 logarithm of E
)

Mathematical constants as Float16 values

var (
	ErrOverflowError  = &Float16Error{Code: ErrOverflow, Msg: "value too large for float16"}
	ErrUnderflowError = &Float16Error{Code: ErrUnderflow, Msg: "value too small for float16"}
	ErrNaNError       = &Float16Error{Code: ErrNaN, Msg: "NaN in strict mode"}
	ErrInfinityError  = &Float16Error{Code: ErrInfinity, Msg: "infinity in strict mode"}
	ErrDivByZeroError = &Float16Error{Code: ErrDivisionByZero, Msg: "division by zero"}
)

Predefined error instances

Functions ¶

func Configure ¶

func Configure(cfg *Config)

Configure applies the given configuration to the package

func DebugInfo ¶

func DebugInfo() map[string]interface{}

DebugInfo returns debugging information about the package state

func Equal ¶

func Equal(a, b Float16) bool

Equal returns true if two Float16 values are equal

func GetBenchmarkOperations ¶

func GetBenchmarkOperations() map[string]BenchmarkOperation

GetBenchmarkOperations returns a map of operations suitable for benchmarking

func GetMemoryUsage ¶

func GetMemoryUsage() int

GetMemoryUsage returns the current memory usage of the package in bytes

func GetVersion ¶

func GetVersion() string

GetVersion returns the package version string

func Greater ¶

func Greater(a, b Float16) bool

Greater returns true if a > b

func GreaterEqual ¶

func GreaterEqual(a, b Float16) bool

GreaterEqual returns true if a >= b

func IsFinite ¶

func IsFinite(f Float16) bool

IsFinite reports whether f is neither infinite nor NaN

func IsInf ¶

func IsInf(f Float16, sign int) bool

IsInf reports whether f is an infinity, according to sign. If sign > 0, IsInf reports whether f is positive infinity. If sign < 0, IsInf reports whether f is negative infinity. If sign == 0, IsInf reports whether f is either infinity.

func IsNaN ¶

func IsNaN(f Float16) bool

IsNaN reports whether f is an IEEE 754 "not-a-number" value

func IsNormal ¶

func IsNormal(f Float16) bool

IsNormal reports whether f is a normal number (not zero, subnormal, infinite, or NaN)

func IsSubnormal ¶

func IsSubnormal(f Float16) bool

IsSubnormal reports whether f is a subnormal number

func Less ¶

func Less(a, b Float16) bool

Less returns true if a < b

func LessEqual ¶

func LessEqual(a, b Float16) bool

LessEqual returns true if a <= b

func Signbit ¶

func Signbit(f Float16) bool

Signbit reports whether f is negative or negative zero

func ToSlice32 ¶

func ToSlice32(f16s []Float16) []float32

ToSlice32 converts a slice of Float16 to float32 with optimized performance

func ToSlice64 ¶

func ToSlice64(f16s []Float16) []float64

ToSlice64 converts a slice of Float16 to float64 with optimized performance

func ValidateSliceLength ¶

func ValidateSliceLength(a, b []Float16) error

ValidateSliceLength checks if two slices have the same length

Types ¶

type ArithmeticMode ¶

type ArithmeticMode int

ArithmeticMode defines the precision/performance trade-off for arithmetic operations

const (
	// ModeIEEEArithmetic provides full IEEE 754 compliance with proper rounding
	ModeIEEEArithmetic ArithmeticMode = iota
	// ModeFastArithmetic optimizes for speed, may sacrifice some precision
	ModeFastArithmetic
	// ModeExactArithmetic provides exact results when possible, errors on precision loss
	ModeExactArithmetic
)

type BenchmarkOperation ¶

type BenchmarkOperation func(Float16, Float16) Float16

BenchmarkOperation represents a benchmarkable operation

type Config ¶

type Config struct {
	DefaultConversionMode ConversionMode
	DefaultRoundingMode   RoundingMode
	DefaultArithmeticMode ArithmeticMode
	EnableFastMath        bool
}

Package configuration

func DefaultConfig ¶

func DefaultConfig() *Config

DefaultConfig returns the default package configuration

func GetConfig ¶

func GetConfig() *Config

GetConfig returns the current package configuration

type ConversionMode ¶

type ConversionMode int

ConversionMode defines how conversions handle edge cases

const (
	// ModeIEEE uses standard IEEE 754 rounding and special value behavior
	ModeIEEE ConversionMode = iota
	// ModeStrict returns errors for overflow, underflow, and NaN
	ModeStrict
	// ModeFast optimizes for performance, may sacrifice some precision
	ModeFast
	// ModeExact preserves exact values when possible, errors on precision loss
	ModeExact
)

type ErrorCode ¶

type ErrorCode int

ErrorCode represents specific error types

const (
	ErrOverflow ErrorCode = iota
	ErrUnderflow
	ErrInvalidOperation
	ErrDivisionByZero
	ErrInexact
	ErrNaN
	ErrInfinity
)

type Float16 ¶

type Float16 uint16

Float16 represents a 16-bit IEEE 754 half-precision floating-point value

const (
	PositiveZero     Float16 = 0x0000 // +0.0
	NegativeZero     Float16 = 0x8000 // -0.0
	PositiveInfinity Float16 = 0x7C00 // +∞
	NegativeInfinity Float16 = 0xFC00 // -∞

	// Largest finite values
	MaxValue Float16 = 0x7BFF // Largest positive finite value (65504)
	MinValue Float16 = 0xFBFF // Most negative finite value (-65504)

	// Smallest normalized positive value
	SmallestNormal Float16 = 0x0400 // 2^-14 ≈ 6.103515625e-05

	// Largest subnormal value
	LargestSubnormal Float16 = 0x03FF // (1023/1024) * 2^-14 ≈ 6.097555161e-05

	// Smallest positive subnormal value
	SmallestSubnormal Float16 = 0x0001 // 2^-24 ≈ 5.960464478e-08

	// Common NaN representations
	QuietNaN     Float16 = 0x7E00 // Quiet NaN (most significant mantissa bit set)
	SignalingNaN Float16 = 0x7D00 // Signaling NaN
	NegativeQNaN Float16 = 0xFE00 // Negative quiet NaN
)

Special values following IEEE 754 half-precision standard
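Each of these constants falls into a class that can be read off the exponent and mantissa fields alone. The sketch below shows that classification on raw bits; `classify` and its string results are hypothetical and do not reproduce the package's `FloatClass` values.

```go
package main

import "fmt"

// classify is a hypothetical helper that names the IEEE 754 class of a raw
// half-precision bit pattern by inspecting its exponent and mantissa fields.
func classify(bits uint16) string {
	exp := bits & 0x7C00 // exponent field
	man := bits & 0x03FF // mantissa field
	switch {
	case exp == 0x7C00 && man == 0:
		return "infinity"
	case exp == 0x7C00 && man&0x0200 != 0:
		return "quiet NaN" // most significant mantissa bit set
	case exp == 0x7C00:
		return "signaling NaN"
	case exp == 0 && man == 0:
		return "zero"
	case exp == 0:
		return "subnormal"
	default:
		return "normal"
	}
}

func main() {
	patterns := []uint16{
		0x0000, 0x8000, // ±0
		0x7C00, 0xFC00, // ±infinity
		0x7BFF, 0x0400, // largest finite, smallest normal
		0x03FF, 0x0001, // largest and smallest subnormal
		0x7E00, 0x7D00, // quiet and signaling NaN
	}
	for _, p := range patterns {
		fmt.Printf("%#04x: %s\n", p, classify(p))
	}
}
```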

func Abs ¶

func Abs(f Float16) Float16

Abs returns the absolute value of f

func Acos ¶

func Acos(f Float16) Float16

Acos returns the arccosine of f

func Add ¶

func Add(a, b Float16) Float16

Add performs addition of two Float16 values

func AddSlice ¶

func AddSlice(a, b []Float16) []Float16

AddSlice performs element-wise addition of two Float16 slices

func AddWithMode ¶

func AddWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

AddWithMode performs addition with specified arithmetic and rounding modes

func Asin ¶

func Asin(f Float16) Float16

Asin returns the arcsine of f

func Atan ¶

func Atan(f Float16) Float16

Atan returns the arctangent of f

func Atan2 ¶

func Atan2(y, x Float16) Float16

Atan2 returns the arctangent of y/x

func Cbrt ¶

func Cbrt(f Float16) Float16

Cbrt returns the cube root of the Float16 value

func Ceil ¶

func Ceil(f Float16) Float16

Ceil returns the smallest integer value greater than or equal to f

func Clamp ¶

func Clamp(f, min, max Float16) Float16

Clamp restricts f to the range [min, max]

func CopySign ¶

func CopySign(f, sign Float16) Float16

CopySign returns a Float16 with the magnitude of f and the sign of sign

func Cos ¶

func Cos(f Float16) Float16

Cos returns the cosine of f (in radians)

func Cosh ¶

func Cosh(f Float16) Float16

Cosh returns the hyperbolic cosine of f

func Dim ¶

func Dim(f, g Float16) Float16

Dim returns the positive difference between f and g: max(f-g, 0)

func Div ¶

func Div(a, b Float16) Float16

Div performs division of two Float16 values

func DivSlice ¶

func DivSlice(a, b []Float16) []Float16

DivSlice performs element-wise division of two Float16 slices

func DivWithMode ¶

func DivWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

DivWithMode performs division with specified arithmetic and rounding modes

func DotProduct ¶

func DotProduct(a, b []Float16) Float16

DotProduct computes the dot product of two Float16 slices

func Erf ¶

func Erf(f Float16) Float16

Erf returns the error function of f

func Erfc ¶

func Erfc(f Float16) Float16

Erfc returns the complementary error function of f

func Exp ¶

func Exp(f Float16) Float16

Exp returns e^f

func Exp10 ¶

func Exp10(f Float16) Float16

Exp10 returns 10^f

func Exp2 ¶

func Exp2(f Float16) Float16

Exp2 returns 2^f

func FastAdd ¶

func FastAdd(a, b Float16) Float16

FastAdd performs addition optimized for speed (may sacrifice precision)

func FastMul ¶

func FastMul(a, b Float16) Float16

FastMul performs multiplication optimized for speed (may sacrifice precision)

func Floor ¶

func Floor(f Float16) Float16

Floor returns the largest integer value less than or equal to f

func Frexp ¶

func Frexp(f Float16) (frac Float16, exp int)

Frexp breaks f into a normalized fraction and an integral power of two. It returns frac and exp satisfying f == frac × 2^exp, with the absolute value of frac in the interval [0.5, 1) or zero.
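The same contract holds for `math.Frexp` in the standard library, which makes it a convenient way to see the decomposition on float64 values. The wrappers `split` and `join` below are hypothetical names for this sketch; it does not call the float16 package.

```go
package main

import (
	"fmt"
	"math"
)

// split returns frac and exp such that x == frac × 2^exp, with |frac| in
// [0.5, 1) for nonzero x. It wraps the standard library's math.Frexp.
func split(x float64) (frac float64, exp int) { return math.Frexp(x) }

// join is the inverse operation, wrapping math.Ldexp.
func join(frac float64, exp int) float64 { return math.Ldexp(frac, exp) }

func main() {
	for _, x := range []float64{8, 0.15625, -3} {
		frac, exp := split(x)
		fmt.Printf("%v = %v × 2^%d (round trip: %v)\n", x, frac, exp, join(frac, exp))
	}
}
```

For example, 8 splits into 0.5 × 2^4 and 0.15625 into 0.625 × 2^-2.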

func FromBits ¶

func FromBits(bits uint16) Float16

FromBits creates a Float16 from its bit representation

func FromFloat32 ¶

func FromFloat32(f32 float32) Float16

FromFloat32 converts a float32 to Float16 (with potential precision loss)

func FromFloat64 ¶

func FromFloat64(f64 float64) Float16

FromFloat64 converts a float64 to Float16 (with potential precision loss)

func FromFloat64WithMode ¶

func FromFloat64WithMode(f64 float64, convMode ConversionMode, roundMode RoundingMode) (Float16, error)

FromFloat64WithMode converts a float64 to Float16 with specified modes

func FromInt ¶

func FromInt(i int) Float16

FromInt converts an integer to Float16

func FromInt32 ¶

func FromInt32(i int32) Float16

FromInt32 converts an int32 to Float16

func FromInt64 ¶

func FromInt64(i int64) Float16

FromInt64 converts an int64 to Float16 (with potential precision loss)

func FromSlice64 ¶

func FromSlice64(f64s []float64) []Float16

FromSlice64 converts a slice of float64 to Float16 with optimized performance

func Gamma ¶

func Gamma(f Float16) Float16

Gamma returns the Gamma function of f

func Hypot ¶

func Hypot(f, g Float16) Float16

Hypot returns sqrt(f*f + g*g), taking care to avoid overflow and underflow

func Inf ¶

func Inf(sign int) Float16

Inf returns a Float16 infinity value. If sign >= 0, it returns positive infinity. If sign < 0, it returns negative infinity.

func J0 ¶

func J0(f Float16) Float16

J0 returns the order-zero Bessel function of the first kind

func J1 ¶

func J1(f Float16) Float16

J1 returns the order-one Bessel function of the first kind

func Ldexp ¶

func Ldexp(frac Float16, exp int) Float16

Ldexp returns frac × 2^exp

func Lerp ¶

func Lerp(a, b, t Float16) Float16

Lerp performs linear interpolation between a and b by factor t

func Lgamma ¶

func Lgamma(f Float16) (Float16, int)

Lgamma returns the natural logarithm and sign of Gamma(f)

func Log ¶

func Log(f Float16) Float16

Log returns the natural logarithm of f

func Log10 ¶

func Log10(f Float16) Float16

Log10 returns the base-10 logarithm of f

func Log2 ¶

func Log2(f Float16) Float16

Log2 returns the base-2 logarithm of f

func Max ¶

func Max(a, b Float16) Float16

Max returns the larger of two Float16 values

func Min ¶

func Min(a, b Float16) Float16

Min returns the smaller of two Float16 values

func Mod ¶

func Mod(f, divisor Float16) Float16

Mod returns the floating-point remainder of f/divisor

func Modf ¶

func Modf(f Float16) (integer, frac Float16)

Modf returns integer and fractional floating-point numbers that sum to f. Both values have the same sign as f.

func Mul ¶

func Mul(a, b Float16) Float16

Mul performs multiplication of two Float16 values

func MulSlice ¶

func MulSlice(a, b []Float16) []Float16

MulSlice performs element-wise multiplication of two Float16 slices

func MulWithMode ¶

func MulWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

MulWithMode performs multiplication with specified arithmetic and rounding modes

func NaN ¶

func NaN() Float16

NaN returns a Float16 quiet NaN value

func NextAfter ¶

func NextAfter(f, g Float16) Float16

NextAfter returns the next representable Float16 value after f in the direction of g
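Because finite half-precision values of the same sign are ordered like their bit patterns, stepping to a neighbor reduces to integer arithmetic on the raw bits. The sketch below covers only the upward direction (toward +∞) and is an illustration of the idea, not the package's `NextAfter`; `nextUp` is a hypothetical name.

```go
package main

import "fmt"

// nextUp returns the bit pattern of the next representable half-precision
// value above the one encoded by bits. NaN and +infinity are returned as-is.
func nextUp(bits uint16) uint16 {
	switch {
	case bits&0x7C00 == 0x7C00 && bits&0x03FF != 0:
		return bits // NaN stays NaN
	case bits == 0x7C00:
		return bits // nothing above +infinity
	case bits == 0x0000 || bits == 0x8000:
		return 0x0001 // from ±0 to the smallest positive subnormal
	case bits&0x8000 != 0:
		return bits - 1 // negative: moving up shrinks the magnitude
	default:
		return bits + 1 // positive: moving up grows the magnitude
	}
}

func main() {
	for _, b := range []uint16{0x3C00, 0x7BFF, 0x0000, 0xFC00, 0x8001} {
		fmt.Printf("%#04x -> %#04x\n", b, nextUp(b))
	}
}
```

Note that incrementing 0x7BFF (the largest finite value) yields 0x7C00, which is +∞, so the carry into the exponent field needs no special handling.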

func Norm2 ¶

func Norm2(s []Float16) Float16

Norm2 computes the L2 norm (Euclidean norm) of a Float16 slice

func One ¶

func One() Float16

One returns a Float16 value representing 1.0

func Parse ¶

func Parse(s string) (Float16, error)

Parse converts a string to Float16 (placeholder for future implementation)

func Pow ¶

func Pow(f, exp Float16) Float16

Pow returns f raised to the power of exp

func Remainder ¶

func Remainder(f, divisor Float16) Float16

Remainder returns the IEEE 754 floating-point remainder of f/divisor

func Round ¶

func Round(f Float16) Float16

Round returns the nearest integer value to f

func RoundToEven ¶

func RoundToEven(f Float16) Float16

RoundToEven returns the nearest integer value to f, rounding ties to even

func ScaleSlice ¶

func ScaleSlice(s []Float16, scalar Float16) []Float16

ScaleSlice multiplies each element in the slice by a scalar

func Sign ¶

func Sign(f Float16) Float16

Sign returns -1, 0, or 1 depending on the sign of f

func Sin ¶

func Sin(f Float16) Float16

Sin returns the sine of f (in radians)

func Sinh ¶

func Sinh(f Float16) Float16

Sinh returns the hyperbolic sine of f

func Sqrt ¶

func Sqrt(f Float16) Float16

Sqrt returns the square root of the Float16 value

func Sub ¶

func Sub(a, b Float16) Float16

Sub performs subtraction of two Float16 values

func SubSlice ¶

func SubSlice(a, b []Float16) []Float16

SubSlice performs element-wise subtraction of two Float16 slices

func SubWithMode ¶

func SubWithMode(a, b Float16, mode ArithmeticMode, rounding RoundingMode) (Float16, error)

SubWithMode performs subtraction with specified arithmetic and rounding modes

func SumSlice ¶

func SumSlice(s []Float16) Float16

SumSlice returns the sum of all elements in the slice

func Tan ¶

func Tan(f Float16) Float16

Tan returns the tangent of f (in radians)

func Tanh ¶

func Tanh(f Float16) Float16

Tanh returns the hyperbolic tangent of f

func ToFloat16 ¶

func ToFloat16(f32 float32) Float16

ToFloat16 converts a float32 value to Float16 format using default settings
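For readers curious what such a conversion involves, the following is a minimal sketch of float32 to half-precision conversion with round-to-nearest-even, overflow to infinity, and gradual underflow to subnormals. It is written from the IEEE 754 format definition using only the standard library. It is not the package's implementation, and `float32ToHalf` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// float32ToHalf converts f to a half-precision bit pattern using
// round-to-nearest-even. Illustrative only; not the package's code.
func float32ToHalf(f float32) uint16 {
	b := math.Float32bits(f)
	sign := uint16(b>>16) & 0x8000
	exp := int(b>>23) & 0xFF
	man := b & 0x7FFFFF

	if exp == 0xFF { // infinity or NaN
		if man == 0 {
			return sign | 0x7C00
		}
		return sign | 0x7E00 | uint16(man>>13) // quiet NaN, keep top payload bits
	}

	e := exp - 127 + 15 // re-bias the exponent for half precision
	if e >= 31 {
		return sign | 0x7C00 // too large: overflow to infinity
	}
	if e <= 0 {
		if e < -10 {
			return sign // below half of the smallest subnormal: signed zero
		}
		// Subnormal result: make the implicit 1 explicit, then shift it out.
		man |= 0x800000
		shift := uint(14 - e) // between 14 and 24
		half := man >> shift
		rem := man & (1<<shift - 1)
		halfway := uint32(1) << (shift - 1)
		if rem > halfway || (rem == halfway && half&1 == 1) {
			half++ // may round up into the smallest normal, which is correct
		}
		return sign | uint16(half)
	}

	// Normal result: keep the top 10 mantissa bits and round on the other 13.
	half := uint32(e)<<10 | man>>13
	rem := man & 0x1FFF
	if rem > 0x1000 || (rem == 0x1000 && half&1 == 1) {
		half++ // a carry into the exponent yields the next binade or infinity
	}
	return sign | uint16(half)
}

func main() {
	for _, f := range []float32{1, 3, 65504, 65520, 1e10, -2} {
		fmt.Printf("%v -> %#04x\n", f, float32ToHalf(f))
	}
}
```

The value 65520 is the tie between 65504 and the first value beyond the finite range, so under round-to-nearest-even it becomes infinity (0x7C00), while anything strictly below it rounds to 0x7BFF.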

func ToFloat16WithMode ¶

func ToFloat16WithMode(f32 float32, convMode ConversionMode, roundMode RoundingMode) (Float16, error)

ToFloat16WithMode converts a float32 to Float16 with specified conversion and rounding modes

func ToSlice16 ¶

func ToSlice16(f32s []float32) []Float16

ToSlice16 converts a slice of float32 to Float16 with optimized performance

func ToSlice16WithMode ¶

func ToSlice16WithMode(f32s []float32, convMode ConversionMode, roundMode RoundingMode) ([]Float16, []error)

ToSlice16WithMode converts a slice of float32 to Float16 with the specified conversion and rounding modes. It is a SIMD-friendly batch conversion with error handling.

func Trunc ¶

func Trunc(f Float16) Float16

Trunc returns the integer part of f (truncated towards zero)

func VectorAdd ¶

func VectorAdd(a, b []Float16) []Float16

VectorAdd performs vectorized addition (placeholder for future SIMD implementation)

func VectorMul ¶

func VectorMul(a, b []Float16) []Float16

VectorMul performs vectorized multiplication (placeholder for future SIMD implementation)

func Y0 ¶

func Y0(f Float16) Float16

Y0 returns the order-zero Bessel function of the second kind

func Y1 ¶

func Y1(f Float16) Float16

Y1 returns the order-one Bessel function of the second kind

func Zero ¶

func Zero() Float16

Zero returns a Float16 zero value

func (Float16) Abs ¶

func (f Float16) Abs() Float16

Abs returns the absolute value of the Float16

func (Float16) Bits ¶

func (f Float16) Bits() uint16

Bits returns the underlying uint16 representation

func (Float16) Class ¶

func (f Float16) Class() FloatClass

Class returns the IEEE 754 classification of the Float16 value

func (Float16) CopySign ¶

func (f Float16) CopySign(sign Float16) Float16

CopySign returns a Float16 with the magnitude of f and the sign of sign

func (Float16) GoString ¶

func (f Float16) GoString() string

GoString returns a Go syntax representation of the Float16 value

func (Float16) IsFinite ¶

func (f Float16) IsFinite() bool

IsFinite returns true if the Float16 value is finite (not infinity or NaN)

func (Float16) IsInf ¶

func (f Float16) IsInf(sign int) bool

IsInf returns true if the Float16 value represents infinity. If sign > 0, it returns true only for positive infinity. If sign < 0, it returns true only for negative infinity. If sign == 0, it returns true for either infinity.
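The sign argument follows the same convention as math.IsInf in the standard library. A minimal sketch of that convention on raw bit patterns, assuming the standard binary16 encodings 0x7C00 for +∞ and 0xFC00 for -∞; isInfBits is a hypothetical helper, not the package's implementation:

```go
package main

import "fmt"

// isInfBits is a hypothetical helper that applies the documented sign
// convention to a raw IEEE 754 binary16 bit pattern.
func isInfBits(bits uint16, sign int) bool {
	const (
		posInf = 0x7C00 // exponent all ones, mantissa zero, sign bit clear
		negInf = 0xFC00 // same pattern with the sign bit set
	)
	switch {
	case sign > 0:
		return bits == posInf
	case sign < 0:
		return bits == negInf
	default:
		return bits == posInf || bits == negInf
	}
}

func main() {
	fmt.Println(isInfBits(0x7C00, 1))  // true
	fmt.Println(isInfBits(0x7C00, -1)) // false
	fmt.Println(isInfBits(0xFC00, 0))  // true
	fmt.Println(isInfBits(0x7E00, 0))  // false: a non-zero mantissa means NaN
}
```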

func (Float16) IsNaN ¶

func (f Float16) IsNaN() bool

IsNaN returns true if the Float16 value represents NaN (Not a Number)

func (Float16) IsNormal ¶

func (f Float16) IsNormal() bool

IsNormal returns true if the Float16 value is normalized (not zero, subnormal, infinite, or NaN)

func (Float16) IsSubnormal ¶

func (f Float16) IsSubnormal() bool

IsSubnormal returns true if the Float16 value is subnormal (denormalized)

func (Float16) IsZero ¶

func (f Float16) IsZero() bool

IsZero returns true if the Float16 value represents zero (positive or negative)

func (Float16) Neg ¶

func (f Float16) Neg() Float16

Neg returns the negation of the Float16

func (Float16) Sign ¶

func (f Float16) Sign() int

Sign returns the sign of the Float16 value: 1 for positive, -1 for negative, 0 for zero

func (Float16) Signbit ¶

func (f Float16) Signbit() bool

Signbit returns true if the Float16 value has a negative sign bit
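In the binary16 layout the sign lives in the top bit, so Abs, Neg, CopySign, and Signbit can all be expressed as single mask operations. The sketch below shows those operations on raw uint16 patterns. The function names are hypothetical, and the package may implement these methods differently (for example, with special handling for NaN).

```go
package main

import "fmt"

// signMask selects the sign bit of a binary16 pattern.
const signMask uint16 = 0x8000

// absBits clears the sign bit.
func absBits(f uint16) uint16 { return f &^ signMask }

// negBits flips the sign bit.
func negBits(f uint16) uint16 { return f ^ signMask }

// copySignBits keeps the magnitude of f and takes the sign bit of sign.
func copySignBits(f, sign uint16) uint16 { return f&^signMask | sign&signMask }

// signbitBits reports whether the sign bit is set, including for -0.
func signbitBits(f uint16) bool { return f&signMask != 0 }

func main() {
	fmt.Printf("0x%04x\n", absBits(0xC000))              // 0x4000: |-2.0| = 2.0
	fmt.Printf("0x%04x\n", negBits(0x3C00))              // 0xbc00: -(1.0) = -1.0
	fmt.Printf("0x%04x\n", copySignBits(0x3C00, 0x8000)) // 0xbc00: 1.0 with the sign of -0
	fmt.Println(signbitBits(0x8000))                     // true: -0 carries a sign bit
}
```

The last line shows why Signbit and Sign are separate: negative zero has its sign bit set, while Sign is documented to return 0 for zero.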

func (Float16) String ¶

func (f Float16) String() string

String returns a string representation of the Float16 value

func (Float16) ToFloat32 ¶

func (f Float16) ToFloat32() float32

ToFloat32 converts a Float16 value to float32 with full precision

func (Float16) ToFloat64 ¶

func (f Float16) ToFloat64() float64

ToFloat64 converts a Float16 value to float64 with full precision
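Widening is exact because every binary16 value is also representable in float32 and float64. As an illustration, the sketch below decodes a raw bit pattern into a float64 using the standard layout (1 sign bit, 5 exponent bits with bias 15, 10 mantissa bits). The helper halfBitsToFloat64 is hypothetical and is not the package's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// halfBitsToFloat64 is a hypothetical decoder for a raw binary16 pattern.
func halfBitsToFloat64(bits uint16) float64 {
	sign := 1.0
	if bits&0x8000 != 0 {
		sign = -1.0
	}
	exp := int((bits >> 10) & 0x1F)
	mant := float64(bits & 0x03FF)

	switch exp {
	case 0x1F:
		// Exponent all ones: infinity if the mantissa is zero, else NaN.
		if mant != 0 {
			return math.NaN()
		}
		return math.Inf(int(sign))
	case 0:
		// Zero or subnormal: no implicit leading 1, value = mant * 2^-24.
		return sign * math.Ldexp(mant, -24)
	default:
		// Normal: value = (1 + mant/1024) * 2^(exp-15).
		return sign * math.Ldexp(1024+mant, exp-25)
	}
}

func main() {
	fmt.Println(halfBitsToFloat64(0x3C00)) // 1
	fmt.Println(halfBitsToFloat64(0xC000)) // -2
	fmt.Println(halfBitsToFloat64(0x7BFF)) // 65504
	fmt.Println(halfBitsToFloat64(0x0001)) // smallest positive subnormal, 2^-24
}
```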

func (Float16) ToInt ¶

func (f Float16) ToInt() int

ToInt converts a Float16 to int (truncated toward zero)

func (Float16) ToInt32 ¶

func (f Float16) ToInt32() int32

ToInt32 converts a Float16 to int32 (truncated toward zero)

func (Float16) ToInt64 ¶

func (f Float16) ToInt64() int64

ToInt64 converts a Float16 to int64 (truncated toward zero)

type Float16Error ¶

type Float16Error struct {
	Op    string      // Operation that caused the error
	Value interface{} // Input value that caused the error
	Msg   string      // Error message
	Code  ErrorCode   // Specific error code
}

Float16Error represents errors that can occur during Float16 operations

func (*Float16Error) Error ¶

func (e *Float16Error) Error() string
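Because Error has a pointer receiver, it is *Float16Error (not Float16Error) that satisfies the error interface, so callers that want the Op or Code fields should target the pointer type with errors.As. The sketch below demonstrates that pattern with local stand-in types. The errorCode type, the message layout, and convertStrict are assumptions made for illustration and may not match the package.

```go
package main

import (
	"errors"
	"fmt"
)

// errorCode and float16Error are local stand-ins that mirror the documented
// struct; the package's ErrorCode type and message format may differ.
type errorCode int

type float16Error struct {
	Op    string      // Operation that caused the error
	Value interface{} // Input value that caused the error
	Msg   string      // Error message
	Code  errorCode   // Specific error code
}

// Error formats the error using an assumed layout.
func (e *float16Error) Error() string {
	return fmt.Sprintf("float16.%s: %s (value: %v)", e.Op, e.Msg, e.Value)
}

// convertStrict is a hypothetical operation that always fails.
func convertStrict(v float32) error {
	return &float16Error{Op: "convert", Value: v, Msg: "overflow", Code: 1}
}

// describe extracts the structured fields if err wraps a *float16Error.
func describe(err error) string {
	var fe *float16Error
	if errors.As(err, &fe) {
		return fmt.Sprintf("op=%s code=%d", fe.Op, fe.Code)
	}
	return "other error"
}

func main() {
	// errors.As also sees through wrapping added with %w.
	err := fmt.Errorf("batch item 3: %w", convertStrict(1e9))
	fmt.Println(describe(err)) // op=convert code=1
}
```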

type FloatClass ¶

type FloatClass int

FloatClass represents the IEEE 754 class of a floating-point value

const (
	ClassSignalingNaN FloatClass = iota
	ClassQuietNaN
	ClassNegativeInfinity
	ClassNegativeNormal
	ClassNegativeSubnormal
	ClassNegativeZero
	ClassPositiveZero
	ClassPositiveSubnormal
	ClassPositiveNormal
	ClassPositiveInfinity
)

func FpClassify ¶

func FpClassify(f Float16) FloatClass

FpClassify returns the IEEE 754 class of f
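Each class corresponds to a distinct region of the 16-bit encoding. The following sketch classifies a raw bit pattern and returns the name of the matching constant. It assumes the usual IEEE 754-2008 convention that a set top mantissa bit (0x0200) marks a quiet NaN; classifyBits is a hypothetical helper and the package may decide these cases differently.

```go
package main

import "fmt"

// classifyBits is a hypothetical classifier for a raw binary16 pattern.
// It returns the name of the corresponding FloatClass constant.
func classifyBits(bits uint16) string {
	neg := bits&0x8000 != 0
	exp := (bits >> 10) & 0x1F
	mant := bits & 0x03FF

	// pick chooses between the negative and positive variant of a class.
	pick := func(negName, posName string) string {
		if neg {
			return negName
		}
		return posName
	}

	switch {
	case exp == 0x1F && mant != 0:
		if mant&0x0200 != 0 { // assumed quiet bit
			return "ClassQuietNaN"
		}
		return "ClassSignalingNaN"
	case exp == 0x1F:
		return pick("ClassNegativeInfinity", "ClassPositiveInfinity")
	case exp == 0 && mant == 0:
		return pick("ClassNegativeZero", "ClassPositiveZero")
	case exp == 0:
		return pick("ClassNegativeSubnormal", "ClassPositiveSubnormal")
	default:
		return pick("ClassNegativeNormal", "ClassPositiveNormal")
	}
}

func main() {
	fmt.Println(classifyBits(0x3C00)) // ClassPositiveNormal
	fmt.Println(classifyBits(0x8000)) // ClassNegativeZero
	fmt.Println(classifyBits(0x0001)) // ClassPositiveSubnormal
	fmt.Println(classifyBits(0x7E00)) // ClassQuietNaN
}
```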

type RoundingMode ¶

type RoundingMode int

RoundingMode defines IEEE 754 rounding behavior

const (
	// RoundNearestEven rounds to nearest, ties to even (IEEE default)
	RoundNearestEven RoundingMode = iota
	// RoundNearestAway rounds to nearest, ties away from zero
	RoundNearestAway
	// RoundTowardZero truncates toward zero
	RoundTowardZero
	// RoundTowardPositive rounds toward +∞
	RoundTowardPositive
	// RoundTowardNegative rounds toward -∞
	RoundTowardNegative
)
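The five modes differ only in how they resolve a value that falls between two representable numbers. A minimal sketch of that difference, using the integer grid and the standard math package in place of the float16 grid; roundingMode and roundToInt are hypothetical local names, not part of this package:

```go
package main

import (
	"fmt"
	"math"
)

// roundingMode is a local stand-in that mirrors the five documented modes.
type roundingMode int

const (
	nearestEven roundingMode = iota
	nearestAway
	towardZero
	towardPositive
	towardNegative
)

// roundToInt rounds x to an integer under the given mode. The same rules
// apply when rounding to the nearest representable half-precision value.
func roundToInt(x float64, m roundingMode) float64 {
	switch m {
	case nearestEven:
		return math.RoundToEven(x)
	case nearestAway:
		return math.Round(x) // math.Round resolves ties away from zero
	case towardZero:
		return math.Trunc(x)
	case towardPositive:
		return math.Ceil(x)
	default: // towardNegative
		return math.Floor(x)
	}
}

func main() {
	for _, x := range []float64{2.5, -2.5} {
		fmt.Println(x,
			roundToInt(x, nearestEven),
			roundToInt(x, nearestAway),
			roundToInt(x, towardZero),
			roundToInt(x, towardPositive),
			roundToInt(x, towardNegative))
	}
}
```

For 2.5 this prints 2 3 2 3 2 after the input, and for -2.5 it prints -2 -3 -2 -2 -3. The two nearest modes disagree only on exact ties, while the directed modes differ for any inexact value.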

type SliceStats ¶

type SliceStats struct {
	Min    Float16
	Max    Float16
	Sum    Float16
	Mean   Float16
	Length int
}

SliceStats holds basic statistics for a Float16 slice

func ComputeSliceStats ¶

func ComputeSliceStats(s []Float16) SliceStats

ComputeSliceStats calculates statistics for a Float16 slice
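The fields of SliceStats can all be gathered in a single pass. The sketch below shows one way to do that over float32 values with a local stand-in struct. The names sliceStats and computeStats are hypothetical, and returning the zero value for an empty slice is an assumption; the package's behaviour for empty input is not documented here.

```go
package main

import "fmt"

// sliceStats is a local stand-in that mirrors the documented fields,
// using float32 in place of Float16.
type sliceStats struct {
	Min    float32
	Max    float32
	Sum    float32
	Mean   float32
	Length int
}

// computeStats gathers min, max, sum, and mean in one pass.
func computeStats(s []float32) sliceStats {
	st := sliceStats{Length: len(s)}
	if len(s) == 0 {
		return st // assumption: empty input yields the zero value
	}
	st.Min, st.Max = s[0], s[0]
	for _, v := range s {
		if v < st.Min {
			st.Min = v
		}
		if v > st.Max {
			st.Max = v
		}
		st.Sum += v
	}
	st.Mean = st.Sum / float32(len(s))
	return st
}

func main() {
	st := computeStats([]float32{1, 2, 3, 6})
	fmt.Println(st.Min, st.Max, st.Sum, st.Mean, st.Length) // 1 6 12 3 4
}
```

Since Sum is itself a Float16 in the real struct, a total above 65504 (the largest finite half-precision value) cannot be stored as a finite Sum, which is worth keeping in mind for long slices.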

