float16

Arduino library that implements the float16 data type.

Description

This experimental library defines the float16 (2 byte) data type, including conversion functions to and from the float32 type. It is definitely a work in progress.

The library implements the Printable interface, so one can print float16 values directly to any stream, e.g. Serial.

The primary use of the float16 data type is to store and transport floating point numbers efficiently. As it uses only 2 bytes where float and double typically take 4 and 8 bytes, storage and bandwidth gains can be made at the price of range and precision.

Specifications

attribute	value	notes
size	2 bytes	layout s eeeee mmmmmmmmmm
sign	1 bit
exponent	5 bit
mantissa	10 bit	11 bit precision with the implicit leading 1, ~ 3 digits
minimum	5.96046 E−8	smallest positive number.
	1.0009765625	1 + 2^−10 = smallest nr larger than 1.
maximum	65504

example values

/*
   SIGN  EXP     MANTISSA
    0    01111    0000000000 = 1
    0    01111    0000000001 = 1 + 2^−10 = 1.0009765625 (smallest float larger than 1)
    1    10000    0000000000 = −2

    0    11110    1111111111 = 65504  (max half precision)

    0    00001    0000000000 = 2^−14 ≈ 6.10352 × 10^−5 (minimum positive normal)
    0    00000    1111111111 = 2^−14 − 2^−24 ≈ 6.09756 × 10^−5 (maximum subnormal)
    0    00000    0000000001 = 2^−24 ≈ 5.96046 × 10^−8 (minimum positive subnormal)

    0    00000    0000000000 = 0
    1    00000    0000000000 = −0

    0    11111    0000000000 = infinity
    1    11111    0000000000 = −infinity

    0    01101    0101010101 = 0.333251953125 ≈ 1/3
*/
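The example values above can be reproduced by decoding the three fields by hand. The following is a minimal sketch of that decoding in plain C++, not the library's own conversion code; the function name halfBitsToFloat is hypothetical. Any pattern with exponent 11111 and a non-zero mantissa is treated as NaN.

```cpp
#include <math.h>
#include <stdint.h>

// Hypothetical helper, not part of the library: decode a 16-bit
// "s eeeee mmmmmmmmmm" pattern into a 32-bit float.
float halfBitsToFloat(uint16_t bits) {
    const int sign     = (bits >> 15) & 0x1;
    const int exponent = (bits >> 10) & 0x1F;
    const int mantissa = bits & 0x3FF;

    float magnitude;
    if (exponent == 0) {
        // Zero or subnormal: mantissa * 2^-24, no implicit leading 1.
        magnitude = ldexpf(static_cast<float>(mantissa), -24);
    } else if (exponent == 31) {
        // All exponent bits set: infinity, or NaN if the mantissa is non-zero.
        magnitude = (mantissa == 0) ? INFINITY : NAN;
    } else {
        // Normal: (1 + mantissa / 1024) * 2^(exponent - 15)
        //       = (1024 + mantissa) * 2^(exponent - 25).
        magnitude = ldexpf(static_cast<float>(1024 + mantissa), exponent - 25);
    }
    return sign ? -magnitude : magnitude;
}
```

For instance, halfBitsToFloat(0x3555) decodes the pattern 0 01101 0101010101 from the list above and yields 0.333251953125.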

Interface

to elaborate

Constructors

float16(void) defaults to zero.
float16(double f) constructor.
float16(const float16 &f) copy constructor.
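Constructing a float16 from a wider value means narrowing it to 16 bits. The sketch below shows one way such a narrowing can be done for a 32-bit float, assuming IEEE 754 float32 input and round-to-nearest-even. It is an illustration only: the function name floatToHalfBits is hypothetical, and the library's constructor may round or handle edge cases differently.

```cpp
#include <stdint.h>
#include <string.h>

// Hypothetical helper, not part of the library: narrow a 32-bit float
// to a half-precision bit pattern, rounding to nearest even.
uint16_t floatToHalfBits(float value) {
    uint32_t f;
    memcpy(&f, &value, sizeof f);  // assumes float is 32-bit IEEE 754

    const uint16_t sign  = static_cast<uint16_t>((f >> 16) & 0x8000u);
    const uint32_t exp32 = (f >> 23) & 0xFFu;
    uint32_t man32       = f & 0x007FFFFFu;

    if (exp32 == 0xFFu) {  // infinity or NaN
        return static_cast<uint16_t>(sign | 0x7C00u | (man32 ? 0x0200u : 0u));
    }

    const int32_t exp16 = static_cast<int32_t>(exp32) - 127 + 15;  // rebias
    if (exp16 >= 31) {
        return static_cast<uint16_t>(sign | 0x7C00u);  // too large: infinity
    }

    uint32_t half, rem, halfway;
    if (exp16 <= 0) {
        if (exp16 < -10) return sign;  // below half the smallest subnormal: zero
        man32 |= 0x00800000u;          // make the implicit leading 1 explicit
        const uint32_t shift = static_cast<uint32_t>(14 - exp16);  // 14..24
        half    = man32 >> shift;      // subnormal result, in units of 2^-24
        rem     = man32 & ((static_cast<uint32_t>(1) << shift) - 1u);
        halfway = static_cast<uint32_t>(1) << (shift - 1u);
    } else {
        half    = (static_cast<uint32_t>(exp16) << 10) | (man32 >> 13);
        rem     = man32 & 0x1FFFu;     // the 13 mantissa bits that are dropped
        halfway = 0x1000u;
    }

    // Round to nearest, ties to even. A carry out of the mantissa moves
    // into the exponent field, which gives the correct next value.
    if (rem > halfway || (rem == halfway && (half & 1u))) half++;
    return static_cast<uint16_t>(sign | half);
}
```

Under these assumptions, 65504.0f maps to 0x7BFF (the maximum), while 65520.0f and anything larger rounds to 0x7C00 (infinity).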

Conversion

double toDouble(void) convert to double (or float).
uint16_t getBinary() get the 2 byte binary representation.
void setBinary(uint16_t u) set the 2 byte binary representation.
size_t printTo(Print& p) const Printable interface.
void setDecimals(uint8_t d) set the number of decimals used by printTo.
uint8_t getDecimals() get the number of decimals used by printTo.

Note that the decimals setting takes one extra byte per object, which is not efficient for arrays of float16. See the array example for efficient storage using the getBinary() and setBinary() functions.
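For storage or transport, the 16-bit patterns returned by getBinary() can be kept in a plain uint16_t array and written out as bytes. The sketch below packs such patterns into a byte buffer and reads them back. The function names are hypothetical, and the low-byte-first order is an assumption; both ends of a link just have to agree on it.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helpers, not part of the library.

// Write count 16-bit patterns into out (2 * count bytes), low byte first.
void packHalfBits(const uint16_t* patterns, size_t count, uint8_t* out) {
    for (size_t i = 0; i < count; i++) {
        out[2 * i]     = static_cast<uint8_t>(patterns[i] & 0xFF);
        out[2 * i + 1] = static_cast<uint8_t>(patterns[i] >> 8);
    }
}

// Read count 16-bit patterns back from in (2 * count bytes), low byte first.
void unpackHalfBits(const uint8_t* in, size_t count, uint16_t* patterns) {
    for (size_t i = 0; i < count; i++) {
        patterns[i] = static_cast<uint16_t>(in[2 * i] |
                                            (static_cast<uint16_t>(in[2 * i + 1]) << 8));
    }
}
```

Each unpacked pattern can then be handed to setBinary() to restore the value, so an array of N measurements costs 2 * N bytes.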

Compare

Standard comparison operators. Since 0.1.5 these are optimized, so comparing e.g. two measurements is fast.

bool operator == (const float16& f)
bool operator != (const float16& f)
bool operator > (const float16& f)
bool operator >= (const float16& f)
bool operator < (const float16& f)
bool operator <= (const float16& f)
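Comparisons can be done on the bit patterns without converting to float, because the sign-magnitude layout orders like a signed integer once the sign bit is applied. The sketch below illustrates that idea; it is not taken from the library, the function names are hypothetical, and it assumes neither operand is NaN.

```cpp
#include <stdint.h>

// Hypothetical helpers, not part of the library.

// Map a half-precision pattern to an integer whose ordering matches the
// numeric ordering of the encoded values (NaN patterns are not handled).
int32_t halfOrderKey(uint16_t bits) {
    const int32_t magnitude = bits & 0x7FFF;
    return (bits & 0x8000) ? -magnitude : magnitude;  // +0 and -0 both give 0
}

bool halfLess(uint16_t a, uint16_t b) {
    return halfOrderKey(a) < halfOrderKey(b);
}

bool halfEqual(uint16_t a, uint16_t b) {
    return halfOrderKey(a) == halfOrderKey(b);
}
```

With this mapping, 0x0000 (+0) and 0x8000 (−0) compare equal, and 0xFC00 (−infinity) sorts below every finite value.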

Math (basic)

Math is done by converting to double, doing the math, and converting back. These operators are added for convenience only; there are no plans to optimize them.

float16 operator + (const float16& f)
float16 operator - (const float16& f)
float16 operator * (const float16& f)
float16 operator / (const float16& f)
float16& operator += (const float16& f)
float16& operator -= (const float16& f)
float16& operator *= (const float16& f)
float16& operator /= (const float16& f)

Negation operator and sign tests.

float16 operator - () fast negation.
int sign() returns 1 for positive, 0 for zero, -1 for negative.
bool isZero() returns true if the value is zero. Slightly faster than sign().
bool isInf() returns true if the value is +infinity or -infinity.
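These operations are cheap because each one only needs to look at a few bits of the 16-bit pattern. The sketch below shows how they can be expressed on a raw pattern. It illustrates the bit layout only, not the library's code; the function names are hypothetical and NaN patterns get no special treatment.

```cpp
#include <stdint.h>

// Hypothetical helpers, not part of the library.

// Negation only flips the sign bit.
uint16_t halfNegate(uint16_t bits) {
    return static_cast<uint16_t>(bits ^ 0x8000);
}

// Zero means all exponent and mantissa bits are clear (+0 or -0).
bool halfIsZero(uint16_t bits) {
    return (bits & 0x7FFF) == 0;
}

// Infinity means exponent all ones and mantissa zero, with either sign.
bool halfIsInf(uint16_t bits) {
    return (bits & 0x7FFF) == 0x7C00;
}

// 1 for positive, 0 for zero, -1 for negative.
int halfSign(uint16_t bits) {
    if (halfIsZero(bits)) return 0;
    return (bits & 0x8000) ? -1 : 1;
}
```

Note that the zero test is a single mask and compare, while the sign test needs the zero check first, which matches the remark that isZero() is slightly faster than sign().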

Notes

Future

0.1.x

update documentation.
unit tests of the above.
isNan().

later

update documentation.
error handling.
- divide by zero errors.
look for optimizations.
rewrite f16tof32() with bit magic.
add storage example - with SD card, FRAM or EEPROM
add communication example - serial or Ethernet?
