float8

v0.3.0 (latest) Published: Aug 25, 2025 License: Apache-2.0 Imports: 4 Imported by: 4

Repository

github.com/zerfoo/float8

README ¶

float8

A high-performance Go library implementing IEEE 754 FP8 E4M3FN format for 8-bit floating-point arithmetic, commonly used in machine learning applications for reduced-precision computations.

Features

IEEE 754 FP8 E4M3FN Format: Complete implementation of the 8-bit floating-point format
High Performance: Optimized arithmetic operations with optional fast lookup tables
Comprehensive API: Full support for conversion, arithmetic, and mathematical operations
Machine Learning Ready: Designed for ML workloads requiring reduced precision
Zero Dependencies: Pure Go implementation with no external dependencies

Format Specification

The Float8 type uses the E4M3FN variant of IEEE 754 FP8:

1 bit: Sign (0 = positive, 1 = negative)
4 bits: Exponent (biased by 7, range [-6, 7])
3 bits: Mantissa (3 explicit bits, 1 implicit leading bit for normal numbers)
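The layout above can be illustrated with a small standalone sketch that splits a raw byte into its three fields. The mask values match the 1-4-3 layout (0x80, 0x78, 0x07); the `fields` helper is hypothetical and not part of the float8 API.

```go
package main

import "fmt"

// Masks for the 1-4-3 bit layout described above.
const (
	signMask     = 0x80
	exponentMask = 0x78
	mantissaMask = 0x07
)

// fields splits a raw FP8 byte into sign, biased exponent, and mantissa.
// Hypothetical helper for illustration; not part of the float8 package.
func fields(b uint8) (sign, exponent, mantissa uint8) {
	sign = (b & signMask) >> 7
	exponent = (b & exponentMask) >> 3
	mantissa = b & mantissaMask
	return sign, exponent, mantissa
}

func main() {
	s, e, m := fields(0xC4) // bit pattern 1.1000.100
	fmt.Printf("sign=%d exponent=%04b mantissa=%03b\n", s, e, m)
}
```

Subtracting the bias of 7 from the exponent field gives the unbiased exponent for normal numbers.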

Special Values

Zero: Exponent=0000, Mantissa=000 (both positive and negative)
NaN: Exponent=1111, Mantissa=111
No Infinities: The E4M3FN variant does not support infinity values
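A minimal sketch of how these rules translate a bit pattern into a number, assuming exactly what the README states here: bias 7, no infinities, and NaN only for S.1111.111. The `decode` function is hypothetical; the library's own decoder may treat some patterns differently.

```go
package main

import (
	"fmt"
	"math"
)

// decode interprets b as an E4M3FN value under the README's rules.
// Hypothetical helper for illustration; not the library's implementation.
func decode(b uint8) float64 {
	sign := 1.0
	if b&0x80 != 0 {
		sign = -1.0
	}
	exp := int(b>>3) & 0x0F
	man := float64(b & 0x07)
	switch {
	case exp == 0x0F && b&0x07 == 0x07:
		return math.NaN()
	case exp == 0:
		// Zero and subnormals: no implicit leading 1, fixed exponent 1-bias.
		return sign * math.Ldexp(man/8, -6)
	default:
		// Normal numbers: implicit leading 1, exponent field minus bias.
		return sign * math.Ldexp(1+man/8, exp-7)
	}
}

func main() {
	for _, b := range []uint8{0x00, 0x01, 0x38, 0x7E, 0x7F} {
		fmt.Printf("0x%02X -> %v\n", b, decode(b))
	}
}
```

Under these assumptions the largest finite magnitude is 0x7E, which decodes to 448.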

Installation

go get github.com/zerfoo/float8

Quick Start

package main

import (
    "fmt"
    "github.com/zerfoo/float8"
)

func main() {
    // Initialize the package (optional, done automatically)
    float8.Initialize()
    
    // Create Float8 values from float32
    a := float8.ToFloat8(3.14)
    b := float8.ToFloat8(2.71)
    
    // Perform arithmetic operations
    sum := float8.Add(a, b)
    product := float8.Mul(a, b)
    
    // Convert back to float32
    fmt.Printf("a = %f\n", a.ToFloat32())
    fmt.Printf("b = %f\n", b.ToFloat32())
    fmt.Printf("a + b = %f\n", sum.ToFloat32())
    fmt.Printf("a * b = %f\n", product.ToFloat32())
}

Configuration

The library supports various configuration options for performance optimization:

// Configure with custom settings
config := &float8.Config{
    EnableFastArithmetic: true,  // Enable lookup tables for faster arithmetic
    EnableFastConversion: true,  // Enable lookup tables for faster conversion
    DefaultMode:          float8.ModeDefault,
    ArithmeticMode:       float8.ArithmeticAuto,
}

float8.Configure(config)

API Reference

Core Types

Float8: The main 8-bit floating-point type
Config: Configuration options for the package

Conversion Functions

// From other numeric types
func ToFloat8(f32 float32) Float8
func FromFloat64(f float64) Float8
func FromInt(i int) Float8

// To other numeric types
func (f Float8) ToFloat32() float32
func (f Float8) ToFloat64() float64
func (f Float8) ToInt() int

Arithmetic Operations

func Add(a, b Float8) Float8
func Sub(a, b Float8) Float8
func Mul(a, b Float8) Float8
func Div(a, b Float8) Float8

Mathematical Functions

func (f Float8) Abs() Float8
func (f Float8) Neg() Float8
func Sqrt(f Float8) Float8
// ... and more

Utility Functions

func (f Float8) IsZero() bool
func (f Float8) IsNaN() bool
func (f Float8) IsInf() bool
func (f Float8) String() string

Performance

The library offers two performance modes:

Standard Mode: Compact implementation with minimal memory usage
Fast Mode: Uses pre-computed lookup tables for faster operations at the cost of memory

Enable fast mode for performance-critical applications:

float8.EnableFastArithmetic()
float8.EnableFastConversion()
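To make the memory-for-speed trade concrete, here is a hedged sketch of a lookup-table conversion. An 8-bit format has only 256 bit patterns, so every result can be precomputed once. The names `decodeSlow`, `decodeFast`, and `table` are hypothetical, and the decoding rules (bias 7, NaN only for S.1111.111) are an assumption; the library's tables may be organized differently.

```go
package main

import (
	"fmt"
	"math"
)

// decodeSlow computes the float32 value of a byte arithmetically.
// Hypothetical helper for illustration only.
func decodeSlow(b uint8) float32 {
	exp := int(b>>3) & 0x0F
	man := float64(b & 0x07)
	var v float64
	switch {
	case exp == 0x0F && man == 7:
		return float32(math.NaN())
	case exp == 0:
		v = math.Ldexp(man/8, -6)
	default:
		v = math.Ldexp(1+man/8, exp-7)
	}
	if b&0x80 != 0 {
		v = -v
	}
	return float32(v)
}

// table holds one precomputed float32 per bit pattern: 256 * 4 bytes = 1 KiB.
var table = func() (t [256]float32) {
	for i := range t {
		t[i] = decodeSlow(uint8(i))
	}
	return t
}()

// decodeFast replaces the arithmetic with a single indexed load.
func decodeFast(b uint8) float32 { return table[b] }

func main() {
	fmt.Println(decodeFast(0x38), decodeFast(0x7E), len(table)*4)
}
```

A binary operation table built the same way would need 256 × 256 one-byte entries (64 KiB) per operation, which is where most of the memory cost of a fast arithmetic mode would come from.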

Testing

Run the comprehensive test suite:

# Run all tests
go test ./...

# Run tests with coverage
go test -cover ./...

# Generate coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

Benchmarks

Run performance benchmarks:

go test -bench=. -benchmem ./...

Use Cases

Machine Learning: Reduced precision training and inference
Neural Networks: Memory-efficient model parameters
Scientific Computing: Applications requiring controlled precision
Embedded Systems: Resource-constrained environments

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

IEEE 754 standard for floating-point arithmetic
The machine learning community for driving FP8 adoption
Contributors and maintainers of this project

Documentation ¶

Index ¶

Constants
Variables
func Configure(config *Config)
func DebugInfo() map[string]interface{}
func DisableFastArithmetic()
func DisableFastConversion()
func EnableFastArithmetic()
func EnableFastConversion()
func Equal(a, b Float8) bool
func GetMemoryUsage() int
func GetVersion() string
func Greater(a, b Float8) bool
func GreaterEqual(a, b Float8) bool
func Initialize()
func Less(a, b Float8) bool
func LessEqual(a, b Float8) bool
func ToSlice32(f8s []Float8) []float32
type ArithmeticMode
type Config
- func DefaultConfig() *Config
type ConversionMode
type Float8
- func Add(a, b Float8) Float8
- func AddSlice(a, b []Float8) []Float8
- func AddWithMode(a, b Float8, mode ArithmeticMode) Float8
- func Ceil(f Float8) Float8
- func Clamp(f, min, max Float8) Float8
- func CopySign(f, sign Float8) Float8
- func Cos(f Float8) Float8
- func Div(a, b Float8) Float8
- func DivWithMode(a, b Float8, mode ArithmeticMode) Float8
- func Exp(f Float8) Float8
- func Floor(f Float8) Float8
- func Fmod(x, y Float8) Float8
- func FromBits(bits uint8) Float8
- func FromFloat64(f float64) Float8
- func FromInt(i int) Float8
- func Lerp(a, b, t Float8) Float8
- func Log(f Float8) Float8
- func Max(a, b Float8) Float8
- func Min(a, b Float8) Float8
- func Mul(a, b Float8) Float8
- func MulSlice(a, b []Float8) []Float8
- func MulWithMode(a, b Float8, mode ArithmeticMode) Float8
- func One() Float8
- func Parse(s string) (Float8, error)
- func Pow(f, exp Float8) Float8
- func Round(f Float8) Float8
- func ScaleSlice(s []Float8, scalar Float8) []Float8
- func Sign(f Float8) Float8
- func Sin(f Float8) Float8
- func Sqrt(f Float8) Float8
- func Sub(a, b Float8) Float8
- func SubWithMode(a, b Float8, mode ArithmeticMode) Float8
- func SumSlice(s []Float8) Float8
- func Tan(f Float8) Float8
- func ToFloat8(f32 float32) Float8
- func ToFloat8WithMode(f32 float32, mode ConversionMode) (Float8, error)
- func ToSlice8(f32s []float32) []Float8
- func Trunc(f Float8) Float8
- func Zero() Float8
- func (f Float8) Abs() Float8
- func (f Float8) Bits() uint8
- func (f Float8) GoString() string
- func (f Float8) IsFinite() bool
- func (f Float8) IsInf() bool
- func (f Float8) IsNaN() bool
- func (f Float8) IsNormal() bool
- func (f Float8) IsValid() bool
- func (f Float8) IsZero() bool
- func (f Float8) Neg() Float8
- func (f Float8) Sign() int
- func (f Float8) String() string
- func (f Float8) ToFloat32() float32
- func (f Float8) ToFloat64() float64
- func (f Float8) ToInt() int
type Float8Error
- func (e *Float8Error) Error() string

Constants ¶

const (
	Version      = "2.0.0"
	VersionMajor = 2
	VersionMinor = 0
	VersionPatch = 0
)

Version information

const (
	SignMask     = 0b10000000 // 0x80 - Sign bit mask
	ExponentMask = 0b01111000 // 0x78 - Exponent bits mask
	MantissaMask = 0b00000111 // 0x07 - Mantissa bits mask
	MantissaLen  = 3          // Number of mantissa bits

	// Exponent bias and limits
	// See https://en.wikipedia.org/wiki/Exponent_bias
	// bias = 2^(|exponent|-1) - 1
	ExponentBias = 7  // Bias for 4-bit exponent
	ExponentMax  = 15 // Maximum exponent value
	ExponentMin  = -7 // Minimum exponent value

	// Float32 constants for conversion
	Float32Bias = 127 // IEEE 754 single precision bias

	// Special values
	PositiveZero     Float8 = 0x00
	NegativeZero     Float8 = 0x80
	PositiveInfinity Float8 = 0x78 // IEEE 754 E4M3FN: S.1111.000 = 0.1111.000₂
	NegativeInfinity Float8 = 0xF8 // IEEE 754 E4M3FN: S.1111.000 = 1.1111.000₂
	NaN              Float8 = 0x7F // IEEE 754 E4M3FN: S.1111.111 (0x7F or 0xFF)
	MaxValue         Float8 = 0x7E // Largest finite positive value
	MinValue         Float8 = 0xFE // Most negative finite value
	SmallestPositive Float8 = 0x01 // Smallest positive subnormal value
)

Bit masks and constants for Float8 format

Variables ¶

var (
	E      = ToFloat8(2.718281828459045)  // Euler's number
	Pi     = ToFloat8(3.141592653589793)  // Pi
	Phi    = ToFloat8(1.618033988749895)  // Golden ratio
	Sqrt2  = ToFloat8(1.4142135623730951) // Square root of 2
	SqrtE  = ToFloat8(1.6487212707001282) // Square root of E
	SqrtPi = ToFloat8(1.7724538509055159) // Square root of Pi
	Ln2    = ToFloat8(0.6931471805599453) // Natural logarithm of 2
	Log2E  = ToFloat8(1.4426950408889634) // Base-2 logarithm of E
	Ln10   = ToFloat8(2.302585092994046)  // Natural logarithm of 10
	Log10E = ToFloat8(0.4342944819032518) // Base-10 logarithm of E
)

Constants as Float8 values

var (
	ErrOverflow  = &Float8Error{Op: "convert", Msg: "value too large for float8"}
	ErrUnderflow = &Float8Error{Op: "convert", Msg: "value too small for float8"}
	ErrNaN       = &Float8Error{Op: "convert", Msg: "NaN not representable in float8"}
)

Common error instances

var DefaultArithmeticMode = ArithmeticAuto

Global arithmetic mode

var DefaultConversionMode = ModeDefault

Global conversion mode (can be changed for different behavior)

Functions ¶

func Configure ¶

func Configure(config *Config)

Configure applies the given configuration to the package

func DebugInfo ¶

func DebugInfo() map[string]interface{}

DebugInfo returns debugging information about the package state

func DisableFastArithmetic ¶

func DisableFastArithmetic()

DisableFastArithmetic disables lookup tables and uses algorithmic operations

func DisableFastConversion ¶

func DisableFastConversion()

DisableFastConversion disables lookup table and uses algorithmic conversion

func EnableFastArithmetic ¶

func EnableFastArithmetic()

EnableFastArithmetic enables lookup tables for arithmetic operations

func EnableFastConversion ¶

func EnableFastConversion()

EnableFastConversion enables lookup table for ToFloat32 conversion

func Equal ¶

func Equal(a, b Float8) bool

Equal returns true if two Float8 values are equal

func GetMemoryUsage ¶

func GetMemoryUsage() int

GetMemoryUsage returns the current memory usage of lookup tables in bytes

func GetVersion ¶

func GetVersion() string

GetVersion returns the package version string

func Initialize ¶

func Initialize()

Initialize performs one-time package initialization

func ToSlice32 ¶

func ToSlice32(f8s []Float8) []float32

ToSlice32 converts a slice of Float8 to float32 with optimized performance.

This function is optimized for batch conversion of Float8 values to float32. It handles all special values correctly, including negative zero, infinity, and NaN.

Parameters:

f8s: The input slice of Float8 values to convert. May be nil or empty.

Returns:

nil if the input slice is nil
A new slice containing the converted float32 values

Note: The conversion from Float8 to float32 is always exact since Float8 is a subset of float32. For large slices, consider using a pool of []float32 to reduce allocations.

Types ¶

type ArithmeticMode ¶

type ArithmeticMode int

ArithmeticMode defines which implementation to use for arithmetic operations

const (
	// ArithmeticAuto chooses the best implementation automatically
	ArithmeticAuto ArithmeticMode = iota
	// ArithmeticAlgorithmic forces algorithmic implementation
	ArithmeticAlgorithmic
	// ArithmeticLookup forces lookup table implementation (if available)
	ArithmeticLookup
)

type Config ¶

type Config struct {
	EnableFastArithmetic bool
	EnableFastConversion bool
	DefaultMode          ConversionMode
	ArithmeticMode       ArithmeticMode
}

Config holds package configuration options

func DefaultConfig ¶

func DefaultConfig() *Config

DefaultConfig returns the default package configuration

type ConversionMode ¶

type ConversionMode int

ConversionMode defines how conversions handle edge cases

const (
	// ModeDefault uses standard IEEE 754 rounding behavior
	ModeDefault ConversionMode = iota
	// ModeStrict returns errors for overflow/underflow
	ModeStrict
	// ModeFast uses lookup tables when available (default for arithmetic)
	ModeFast
)

type Float8 ¶

type Float8 uint8

Float8 represents an 8-bit floating-point number using the IEEE 754 FP8 E4M3FN format. This format is commonly used in machine learning for reduced-precision arithmetic.

Bit layout:

1 bit : Sign (0 = positive, 1 = negative)
4 bits : Exponent (biased by 7, range [-6, 7])
3 bits : Mantissa (3 explicit bits, 1 implicit leading bit for normal numbers)

Special values:

PositiveZero/NegativeZero: Exponent=0000, Mantissa=000
PositiveInfinity/NegativeInfinity: Exponent=1111, Mantissa=000
NaN: Exponent=1111, Mantissa=111

This implementation follows the E4M3FN variant which has no infinities and two NaNs.

func Add ¶

func Add(a, b Float8) Float8

Add returns the sum of the operands a and b.

This is a convenience function that calls AddWithMode with DefaultArithmeticMode. For more control over the arithmetic behavior, use AddWithMode directly.

Special cases:

Add(+0, ±0) = +0
Add(-0, -0) = -0
Add(±Inf, ∓Inf) = NaN (but returns +0 in this implementation)
Add(NaN, x) = NaN
Add(x, NaN) = NaN

For finite numbers, the result is rounded to the nearest representable Float8 value using the current rounding mode (typically round-to-nearest-even).

func AddSlice ¶

func AddSlice(a, b []Float8) []Float8

AddSlice performs element-wise addition of two Float8 slices.

This function adds corresponding elements of the input slices and returns a new slice with the results. The input slices must have the same length; otherwise, the function will panic.

Parameters:

a, b: Slices of Float8 values to be added element-wise.

Returns:

A new slice where each element is the sum of the corresponding elements in a and b.

Panics:

If the input slices have different lengths.

Example:

a := ToSlice8([]float32{1.0, 2.0, 3.0})
b := ToSlice8([]float32{4.0, 5.0, 6.0})
result := AddSlice(a, b) // Elements convert back to [5.0, 7.0, 9.0]
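The length contract described above can be sketched as follows. This is a hypothetical stand-in that works on float32 for simplicity; it shows the check-then-allocate pattern, not the library's code.

```go
package main

import "fmt"

// addSlice adds two equal-length slices element by element and panics on a
// length mismatch. Hypothetical helper; not part of the float8 package.
func addSlice(a, b []float32) []float32 {
	if len(a) != len(b) {
		panic(fmt.Sprintf("addSlice: length mismatch: %d != %d", len(a), len(b)))
	}
	out := make([]float32, len(a)) // inputs are left unmodified
	for i := range a {
		out[i] = a[i] + b[i]
	}
	return out
}

func main() {
	fmt.Println(addSlice([]float32{1, 2, 3}, []float32{4, 5, 6}))
}
```

Callers that cannot guarantee equal lengths should compare them before the call rather than recovering from the panic.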

func AddWithMode ¶

func AddWithMode(a, b Float8, mode ArithmeticMode) Float8

AddWithMode returns the sum of the operands a and b using the specified arithmetic mode.

The arithmetic mode determines how the addition is performed:

ArithmeticAuto: Uses the fastest available method (lookup tables if enabled)
ArithmeticLookup: Forces use of lookup tables (panics if not available)
ArithmeticAlgorithmic: Uses the algorithmic implementation

Special cases are handled according to IEEE 754 rules:

If either operand is NaN, the result is NaN
Infinities of the same sign add to infinity of that sign
Infinities of opposite signs produce NaN (but this implementation returns +0)
The sign of a zero result is the sign of the sum of the operands

For finite numbers, the result is rounded to the nearest representable Float8 value. If the exact result is exactly halfway between two representable values, it is rounded to the value with an even least significant bit (round-to-nearest-even).
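Round-to-nearest-even at 3 mantissa bits can be illustrated in isolation. The sketch below rounds a float64 to 4 significant bits (1 implicit plus 3 explicit) and deliberately ignores the format's exponent limits, so it shows only the tie-breaking rule. The helper `roundToE4M3Grid` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// roundToE4M3Grid rounds x to 4 significant bits with ties going to the even
// mantissa. It ignores overflow and subnormals. Hypothetical helper only.
func roundToE4M3Grid(x float64) float64 {
	if x == 0 || math.IsNaN(x) || math.IsInf(x, 0) {
		return x
	}
	frac, exp := math.Frexp(x) // x = frac * 2^exp with 0.5 <= |frac| < 1
	scaled := math.RoundToEven(frac * 16)
	return math.Ldexp(scaled, exp-4)
}

func main() {
	// 2.125 is halfway between 2.0 (mantissa 000) and 2.25 (mantissa 001).
	fmt.Println(roundToE4M3Grid(2.125))
	// 2.375 is halfway between 2.25 (mantissa 001) and 2.5 (mantissa 010).
	fmt.Println(roundToE4M3Grid(2.375))
}
```

In both ties the result is the neighbor whose mantissa ends in 0: 2.0 in the first case and 2.5 in the second.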

func Ceil ¶

func Ceil(f Float8) Float8

Ceil returns the least integer value greater than or equal to f.

Special cases are:

Ceil(±0) = ±0
Ceil(±Inf) = ±Inf
Ceil(NaN) = NaN

For finite x, the result is the least integer value ≥ x. The result is exact (no rounding occurs).

func Clamp ¶

func Clamp(f, min, max Float8) Float8

Clamp restricts f to the range [min, max]

func CopySign ¶

func CopySign(f, sign Float8) Float8

CopySign returns a Float8 with the magnitude of f and the sign of sign

func Cos ¶

func Cos(f Float8) Float8

Cos returns the cosine of f (in radians).

Special cases are:

Cos(±0) = 1
Cos(±Inf) = NaN
Cos(NaN) = NaN

For finite x, the result is the cosine of x in the range [-1, 1]. The result is rounded to the nearest representable Float8 value.

func Div ¶

func Div(a, b Float8) Float8

Div returns the quotient a/b of the operands a and b.

This is a convenience function that calls DivWithMode with DefaultArithmeticMode. For more control over the arithmetic behavior, use DivWithMode directly.

Special cases:

Div(±0, ±0) = NaN
Div(±Inf, ±Inf) = NaN
Div(x, ±0) = ±Inf for x finite and not zero (sign obeys rule for signs)
Div(±Inf, y) = ±Inf for y finite and not zero (sign obeys rule for signs)
Div(x, y) = NaN if x or y is NaN

The sign of the result follows the standard sign rules for division. For finite numbers, the result is rounded to the nearest representable Float8 value. Division by zero results in ±Inf with the sign determined by the rule of signs.

func DivWithMode ¶

func DivWithMode(a, b Float8, mode ArithmeticMode) Float8

DivWithMode performs division with specified arithmetic mode

func Exp ¶

func Exp(f Float8) Float8

Exp returns e^f

func Floor ¶

func Floor(f Float8) Float8

Floor returns the greatest integer value less than or equal to f.

Special cases are:

Floor(±0) = ±0
Floor(±Inf) = ±Inf
Floor(NaN) = NaN

For finite x, the result is the greatest integer value ≤ x. The result is exact (no rounding occurs).

func Fmod ¶

func Fmod(x, y Float8) Float8

Fmod returns the floating-point remainder of x/y.

The result has the same sign as x and magnitude less than the magnitude of y.

Special cases are:

Fmod(±0, y) = ±0 for y != 0
Fmod(±Inf, y) = NaN
Fmod(x, 0) = NaN
Fmod(NaN, y) = NaN
Fmod(x, NaN) = NaN
Fmod(x, ±Inf) = x for x not infinite

For finite x and y (y ≠ 0), the result is x - n*y, where n is the quotient x/y truncated toward zero. This gives the result the sign of x and a magnitude less than that of y. The result is rounded to the nearest representable Float8 value.
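The sign rule stated above (the result takes the sign of x) corresponds to the truncated remainder, which Go's standard library exposes as math.Mod. math.Remainder instead picks the integer nearest to x/y and can return a result of the opposite sign. A small standalone comparison using only the standard library; the wrapper names are hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

// truncMod returns x - n*y with n = x/y truncated toward zero, so the result
// takes the sign of x and has magnitude less than |y|.
func truncMod(x, y float64) float64 { return math.Mod(x, y) }

// ieeeRem returns x - n*y with n the integer nearest x/y (ties to even),
// so the result can have the opposite sign from x.
func ieeeRem(x, y float64) float64 { return math.Remainder(x, y) }

func main() {
	fmt.Println(truncMod(5, 3), ieeeRem(5, 3))   // 2 -1
	fmt.Println(truncMod(-5, 3), ieeeRem(-5, 3)) // -2 1
}
```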

func FromBits ¶

func FromBits(bits uint8) Float8

FromBits creates a Float8 from its bit representation

func FromFloat64 ¶

func FromFloat64(f float64) Float8

FromFloat64 converts a float64 to Float8 (with potential precision loss)

func FromInt ¶

func FromInt(i int) Float8

FromInt converts an integer to Float8

func Lerp ¶

func Lerp(a, b, t Float8) Float8

Lerp performs linear interpolation between a and b by factor t

func Log ¶

func Log(f Float8) Float8

Log returns the natural logarithm of f.

Special cases are:

Log(+Inf) = +Inf
Log(0) = -Inf
Log(x < 0) = NaN
Log(NaN) = NaN

For finite x > 0, the result is the natural logarithm of x. The result is rounded to the nearest representable Float8 value.

func Max ¶

func Max(a, b Float8) Float8

Max returns the larger of two Float8 values. If either value is NaN, returns NaN.

Special cases are:

Max(+Inf, x) = +Inf
Max(-Inf, x) = x (if x is finite or +Inf)
Max(x, +Inf) = +Inf
Max(x, -Inf) = x (if x is finite or +Inf)

func Min ¶

func Min(a, b Float8) Float8

Min returns the smaller of two Float8 values. If either value is NaN, returns NaN.

Special cases are:

Min(+Inf, x) = x (if x is finite or -Inf)
Min(-Inf, x) = -Inf
Min(x, +Inf) = x (if x is finite or -Inf)
Min(x, -Inf) = -Inf

func Mul ¶

func Mul(a, b Float8) Float8

Mul returns the product of the operands a and b.

This is a convenience function that calls MulWithMode with DefaultArithmeticMode. For more control over the arithmetic behavior, use MulWithMode directly.

Special cases:

Mul(±0, ±Inf) = NaN
Mul(±Inf, ±0) = NaN
Mul(±0, ±0) = ±0 (sign obeys the rule for signs of zero products)
Mul(±0, y) = ±0 for y finite and not zero
Mul(±Inf, y) = ±Inf for y finite and not zero
Mul(x, y) = NaN if x or y is NaN

The sign of the result follows the standard sign rules for multiplication. For finite numbers, the result is rounded to the nearest representable Float8 value.

func MulSlice ¶

func MulSlice(a, b []Float8) []Float8

MulSlice performs element-wise multiplication of two Float8 slices.

This function multiplies corresponding elements of the input slices and returns a new slice with the results. The input slices must have the same length; otherwise, the function will panic.

Parameters:

a, b: Slices of Float8 values to be multiplied element-wise.

Returns:

A new slice where each element is the product of the corresponding elements in a and b.

Panics:

If the input slices have different lengths.

Example:

a := ToSlice8([]float32{1.0, 2.0, 3.0})
b := ToSlice8([]float32{4.0, 5.0, 6.0})
result := MulSlice(a, b) // Elements convert back to [4.0, 10.0, 18.0]

func MulWithMode ¶

func MulWithMode(a, b Float8, mode ArithmeticMode) Float8

MulWithMode performs multiplication with specified arithmetic mode

func One ¶

func One() Float8

One returns a Float8 value representing 1.0

func Parse ¶

func Parse(s string) (Float8, error)

Parse converts a string to Float8

func Pow ¶

func Pow(f, exp Float8) Float8

Pow returns f raised to the power of exp.

Special cases are:

Pow(±0, exp) = ±0 for exp > 0
Pow(±0, exp) = +Inf for exp < 0
Pow(1, exp) = 1 for any exp (even NaN)
Pow(f, 0) = 1 for any f (including NaN, +Inf, -Inf)
Pow(f, 1) = f for any f
Pow(NaN, exp) = NaN
Pow(f, NaN) = NaN
Pow(±0, -Inf) = +Inf
Pow(±0, +Inf) = +0
Pow(+Inf, exp) = +Inf for exp > 0
Pow(+Inf, exp) = +0 for exp < 0
Pow(-Inf, exp) = -0 for exp a negative odd integer
Pow(-Inf, exp) = +0 for exp a negative non-odd integer
Pow(-Inf, exp) = -Inf for exp a positive odd integer
Pow(-Inf, exp) = +Inf for exp a positive non-odd integer
Pow(-1, ±Inf) = 1
Pow(f, +Inf) = +Inf for |f| > 1
Pow(f, -Inf) = +0 for |f| > 1
Pow(f, +Inf) = +0 for |f| < 1
Pow(f, -Inf) = +Inf for |f| < 1

The result is rounded to the nearest representable Float8 value.

func Round ¶

func Round(f Float8) Float8

Round returns the nearest integer value to f, rounding ties to even.

Special cases are:

Round(±0) = ±0
Round(±Inf) = ±Inf
Round(NaN) = NaN

For finite x, the result is the nearest integer to x. Ties are rounded to the nearest even integer. The result is exact (no rounding occurs).

func ScaleSlice ¶

func ScaleSlice(s []Float8, scalar Float8) []Float8

ScaleSlice multiplies each element in the slice by a scalar

func Sign ¶

func Sign(f Float8) Float8

Sign returns -1, 0, or 1 depending on the sign of f

func Sin ¶

func Sin(f Float8) Float8

Sin returns the sine of f (in radians).

Special cases are:

Sin(±0) = ±0
Sin(±Inf) = NaN
Sin(NaN) = NaN

For finite x, the result is the sine of x in the range [-1, 1]. The result is rounded to the nearest representable Float8 value.

func Sqrt ¶

func Sqrt(f Float8) Float8

Sqrt returns the square root of the Float8 value.

Special cases are:

Sqrt(+0) = +0
Sqrt(-0) = -0
Sqrt(+Inf) = +Inf
Sqrt(x) = NaN if x < 0 (including -Inf)
Sqrt(NaN) = NaN

For finite x ≥ 0, the result is the square root of x, rounded to the nearest representable Float8 value.

func Sub ¶

func Sub(a, b Float8) Float8

Sub returns the difference of a-b, i.e., the result of subtracting b from a.

This is a convenience function that calls SubWithMode with DefaultArithmeticMode. For more control over the arithmetic behavior, use SubWithMode directly.

Special cases:

Sub(+0, +0) = +0
Sub(+0, -0) = +0
Sub(-0, +0) = -0
Sub(-0, -0) = +0
Sub(±Inf, ±Inf) = NaN (but returns +0 in this implementation)
Sub(NaN, x) = NaN
Sub(x, NaN) = NaN

For finite numbers, the result is rounded to the nearest representable Float8 value.

func SubWithMode ¶

func SubWithMode(a, b Float8, mode ArithmeticMode) Float8

SubWithMode performs subtraction with specified arithmetic mode

func SumSlice ¶

func SumSlice(s []Float8) Float8

SumSlice returns the sum of all elements in the slice.

This function computes the sum of all Float8 values in the input slice. If the slice is empty, it returns PositiveZero.

The summation is performed using the standard addition rules for Float8, including proper handling of special values (NaN, Inf, etc.).

Parameters:

s: The input slice of Float8 values to sum.

Returns:

The sum of all elements in the slice.
If the slice is empty, returns PositiveZero.
If any element is NaN, the result is NaN.

Example:

s := ToSlice8([]float32{1.0, 2.0, 3.0, 4.0})
sum := SumSlice(s) // Returns the Float8 encoding of 10.0

func Tan ¶

func Tan(f Float8) Float8

Tan returns the tangent of f (in radians).

Special cases are:

Tan(±0) = ±0
Tan(±Inf) = NaN
Tan(NaN) = NaN

For finite x, the result is the tangent of x. The result is rounded to the nearest representable Float8 value. Note that the result may be extremely large or small for inputs near (2n+1)π/2.

func ToFloat8 ¶

func ToFloat8(f32 float32) Float8

ToFloat8 converts a float32 value to Float8 format using the default conversion mode.

This is a convenience function that calls ToFloat8WithMode with DefaultConversionMode. For more control over the conversion process, use ToFloat8WithMode directly.

Special cases:

Converts +0.0 to PositiveZero (0x00)
Converts -0.0 to NegativeZero (0x80)
Converts +Inf to PositiveInfinity (0x78)
Converts -Inf to NegativeInfinity (0xF8)
Converts NaN to NaN (0x7F or 0xFF)

For finite numbers, the conversion may lose precision or result in overflow/underflow. The default mode handles these cases by saturating to the maximum/minimum representable values.

func ToFloat8WithMode ¶

func ToFloat8WithMode(f32 float32, mode ConversionMode) (Float8, error)

ToFloat8WithMode converts a float32 to Float8 with the specified conversion mode.

The conversion mode determines how edge cases are handled:

ModeDefault: Uses standard IEEE 754 rounding behavior, saturating on overflow
ModeStrict: Returns an error for overflow/underflow/NaN
ModeFast: Uses lookup tables when available (if enabled)

Special cases are handled as follows:

±0.0 is converted to the corresponding Float8 zero (preserving sign)
±Inf is converted to the corresponding Float8 infinity
NaN is handled according to the conversion mode

For finite numbers, the conversion follows these steps:

Extract sign, exponent, and mantissa from the float32
Adjust the exponent for the Float8 format (E4M3FN)
Round the mantissa to 3 bits (plus implicit leading bit)
Handle overflow/underflow according to the conversion mode

Returns the converted Float8 value and an error if the conversion fails in strict mode.
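The four steps above can be sketched as a standalone encoder. This is not the library's implementation. It assumes the README's rules: the largest finite value is 448 (0x7E), NaN maps to S.1111.111, and infinities are treated as overflow. The `encode` function, its `strict` flag, and the error variables are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var (
	errOverflow  = errors.New("value too large for float8")
	errUnderflow = errors.New("value too small for float8")
	errNaN       = errors.New("NaN rejected in strict mode")
)

// encode converts f to an E4M3FN byte under the assumptions stated above.
// With strict=false it saturates on overflow and flushes underflow to zero.
func encode(f float32, strict bool) (uint8, error) {
	// Step 1: extract the sign; the magnitude is handled separately.
	sign := uint8(math.Float32bits(f)>>24) & 0x80
	a := math.Abs(float64(f))
	switch {
	case math.IsNaN(a):
		if strict {
			return 0, errNaN
		}
		return sign | 0x7F, nil
	case a == 0:
		return sign, nil
	}
	// Step 2: unbiased exponent, clamped to -6 so subnormals share one scale.
	_, e := math.Frexp(a) // a = frac * 2^e with 0.5 <= frac < 1
	exp := e - 1
	if exp < -6 {
		exp = -6
	}
	// Step 3: round to a whole number of 2^(exp-3) steps, ties to even.
	q := math.RoundToEven(math.Ldexp(a, 3-exp))
	if q == 16 { // rounding carried into the next binade
		q, exp = 8, exp+1
	}
	// Step 4: handle overflow and underflow according to the mode.
	if math.IsInf(a, 0) || exp > 8 || (exp == 8 && q == 15) {
		if strict {
			return 0, errOverflow
		}
		return sign | 0x7E, nil
	}
	if q == 0 {
		if strict {
			return 0, errUnderflow
		}
		return sign, nil
	}
	if q < 8 { // subnormal: exponent field 0, no implicit leading 1
		return sign | uint8(q), nil
	}
	return sign | uint8(exp+7)<<3 | uint8(q-8), nil
}

func main() {
	for _, f := range []float32{1, 3.14, 448, 1000} {
		b, err := encode(f, false)
		fmt.Printf("%v -> 0x%02X (err=%v)\n", f, b, err)
	}
	_, err := encode(1000, true)
	fmt.Println("strict:", err)
}
```

Clamping the exponent to -6 before rounding lets the same rounding step serve both normal and subnormal inputs: a rounded count below 8 means the value is subnormal, and a count of exactly 8 at exponent -6 is the smallest normal number.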

func ToSlice8 ¶

func ToSlice8(f32s []float32) []Float8

ToSlice8 converts a slice of float32 to Float8 with optimized performance.

This function is optimized for batch conversion of float32 values to Float8. It handles special values correctly, including negative zero, infinity, and NaN.

Parameters:

f32s: The input slice of float32 values to convert. May be nil or empty.

Returns:

nil if the input slice is nil
A non-nil empty slice if the input slice is empty
A new slice containing the converted Float8 values

Note: This function preserves negative zero by checking the sign bit of zero values. For large slices, consider using a pool of []Float8 to reduce allocations.
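The nil-versus-empty contract listed above can be sketched as follows. The helper `convertAll` is hypothetical, and `fn` stands in for the real float32-to-Float8 conversion.

```go
package main

import "fmt"

// convertAll applies fn to every element. It returns nil for a nil input and
// a non-nil empty slice for an empty input, matching the contract above.
// Hypothetical helper; not the library's implementation.
func convertAll(in []float32, fn func(float32) uint8) []uint8 {
	if in == nil {
		return nil
	}
	out := make([]uint8, len(in)) // non-nil even when len(in) == 0
	for i, v := range in {
		out[i] = fn(v)
	}
	return out
}

func main() {
	// Placeholder conversion for the demo only: truncates to an integer byte.
	stub := func(f float32) uint8 { return uint8(f) }
	fmt.Println(convertAll(nil, stub) == nil)
	fmt.Println(convertAll([]float32{}, stub) == nil)
	fmt.Println(convertAll([]float32{1, 2}, stub))
}
```

Keeping nil and empty distinct matters mainly for callers that serialize the result or compare it against nil.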

func Trunc ¶

func Trunc(f Float8) Float8

Trunc returns the integer value of f with any fractional part removed.

Special cases are:

Trunc(±0) = ±0
Trunc(±Inf) = ±Inf
Trunc(NaN) = NaN

For finite x, the result is the integer part of x with the sign of x. This is equivalent to rounding toward zero. The result is exact (no rounding occurs).

func Zero ¶

func Zero() Float8

Zero returns a Float8 zero value

func (Float8) Abs ¶

func (f Float8) Abs() Float8

Abs returns the absolute value of f.

Special cases are:

Abs(±Inf) = +Inf
Abs(NaN) = NaN
Abs(±0) = +0

For all other values, Abs clears the sign bit to return a positive number.

func (Float8) Bits ¶

func (f Float8) Bits() uint8

Bits returns the underlying uint8 representation

func (Float8) GoString ¶

func (f Float8) GoString() string

GoString returns a Go syntax representation of the Float8 value

func (Float8) IsFinite ¶

func (f Float8) IsFinite() bool

IsFinite reports whether f is a finite value (not infinite and not NaN).

A Float8 value is finite if its exponent is not all 1s (0x0F). This includes both normal numbers (with an implicit leading 1 bit) and subnormal numbers (with an implicit leading 0 bit).

Returns:

true if f is a finite number (including zero and subnormals)
false if f is infinity or NaN

func (Float8) IsInf ¶

func (f Float8) IsInf() bool

IsInf reports whether f is an infinity, either positive or negative.

In the E4M3FN format, infinity values have all exponent bits set (0x78 for +Inf, 0xF8 for -Inf) and a zero mantissa. This is different from the standard IEEE 754 format used in float32/float64.

Returns:

true if f is positive or negative infinity
false otherwise, including for NaN and finite values

func (Float8) IsNaN ¶

func (f Float8) IsNaN() bool

IsNaN reports whether f is a "not-a-number" (NaN) value.

In the E4M3FN format, NaN is represented with all exponent bits set (0x0F) and all mantissa bits set (0x07). This results in two possible NaN values: 0x7F (positive NaN) and 0xFF (negative NaN).

Returns:

true if f is a NaN value
false otherwise, including for infinity and finite values
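Because the two NaN patterns differ only in the sign bit, a single masked comparison covers both. A minimal sketch on raw bytes; `isNaN` here is a hypothetical free function, not the method documented above:

```go
package main

import "fmt"

// isNaN reports whether b matches S.1111.111, that is 0x7F or 0xFF.
// Masking off the sign bit lets one comparison cover both patterns.
func isNaN(b uint8) bool { return b&0x7F == 0x7F }

func main() {
	count := 0
	for i := 0; i < 256; i++ {
		if isNaN(uint8(i)) {
			count++
		}
	}
	fmt.Println(isNaN(0x7F), isNaN(0xFF), isNaN(0x7E), count)
}
```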

func (Float8) IsNormal ¶

func (f Float8) IsNormal() bool

IsNormal returns true if the Float8 is a normal (non-zero, non-infinite) number

func (Float8) IsValid ¶

func (f Float8) IsValid() bool

IsValid returns true if the Float8 represents a valid number

func (Float8) IsZero ¶

func (f Float8) IsZero() bool

IsZero reports whether f represents the floating-point value zero (either positive or negative).

According to IEEE 754, both +0 and -0 are considered zero, though they may have different bit patterns and behave differently in certain operations (like 1/+0 = +Inf, 1/-0 = -Inf).

Returns:

true if f is +0 or -0
false otherwise, including for NaN and infinity values

func (Float8) Neg ¶

func (f Float8) Neg() Float8

Neg returns the negation of the Float8

func (Float8) Sign ¶

func (f Float8) Sign() int

Sign returns the sign of the Float8 value.

The return values are:

1 if f > 0
-1 if f < 0
0 if f is zero (including -0) or NaN

Note that negative zero is treated as zero (returns 0), following the IEEE 754 standard where +0 and -0 compare as equal. However, they can be distinguished using bitwise operations or by examining the sign bit directly.

For NaN values, Sign returns 0, consistent with math/big.Float's behavior.
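The three-way result described above can be sketched directly on the bit pattern. The free function `sign` is hypothetical and assumes NaN is S.1111.111.

```go
package main

import "fmt"

// sign returns 1, -1, or 0 for a raw byte; both zeros and both NaNs give 0.
// Hypothetical helper; not the method documented above.
func sign(b uint8) int {
	switch {
	case b&0x7F == 0, b&0x7F == 0x7F: // ±0 or NaN
		return 0
	case b&0x80 != 0:
		return -1
	default:
		return 1
	}
}

func main() {
	fmt.Println(sign(0x38), sign(0xB8), sign(0x80), sign(0xFF))
}
```

Code that needs to tell -0 from +0 can test the sign bit (0x80) instead of calling a three-way sign function.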

func (Float8) String ¶

func (f Float8) String() string

String returns a string representation of the Float8 value

func (Float8) ToFloat32 ¶

func (f Float8) ToFloat32() float32

ToFloat32 converts a Float8 value to float32.

This conversion is always exact since Float8 is a subset of float32. Special values are preserved:

PositiveZero/NegativeZero → ±0.0
PositiveInfinity/NegativeInfinity → ±Inf
NaN → NaN

The conversion uses a fast path for common values and falls back to algorithmic conversion for other values.

func (Float8) ToFloat64 ¶

func (f Float8) ToFloat64() float64

ToFloat64 converts a Float8 to float64

func (Float8) ToInt ¶

func (f Float8) ToInt() int

ToInt converts a Float8 to int (truncated)

type Float8Error ¶

type Float8Error struct {
	Op    string  // Operation that caused the error
	Value float32 // Input value that caused the error (if applicable)
	Msg   string  // Error message
}

Float8Error represents errors that can occur during Float8 operations

func (*Float8Error) Error ¶

func (e *Float8Error) Error() string
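A hedged sketch of how a structured error of this shape is typically consumed. The local `convError` type mirrors the documented fields, but the message format of `Error` and the `check` helper (with its 448 limit) are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// convError mirrors the documented Float8Error fields. The message format
// below is an assumption; the library's actual format is not shown here.
type convError struct {
	Op    string
	Value float32
	Msg   string
}

func (e *convError) Error() string {
	return fmt.Sprintf("float8.%s: %s (value: %g)", e.Op, e.Msg, e.Value)
}

// check is a hypothetical strict range check that reports overflow.
func check(v float32) error {
	if v > 448 || v < -448 {
		return &convError{Op: "convert", Value: v, Msg: "value too large for float8"}
	}
	return nil
}

func main() {
	// Wrap the error as a caller might, then recover the structured fields.
	err := fmt.Errorf("loading weights: %w", check(1000))
	var ce *convError
	if errors.As(err, &ce) {
		fmt.Println(ce.Op, ce.Value)
	}
}
```

Using errors.As rather than a direct type assertion keeps the check working when the error has been wrapped with %w.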
