4.7 KiB
float16
Arduino library to implement float16 data type.
Description
This experimental library defines the float16 (2 byte) data type, including conversion function to and from float32 type. It is definitely work in progress.
The library implements the Printable interface so one can directly print the float16 values in any stream e.g. Serial.
The primary usage of the float16 data type is to efficiently store and transport a floating point number. As it uses only 2 bytes where float and double have typical 4 and 8 bytes, gains can be made at the price of range and precision.
Specifications
attribute | value | notes |
---|---|---|
size | 2 bytes | layout s eeeee mmmmmmmmmm |
sign | 1 bit | |
exponent | 5 bit | |
mantissa | 11 bit | ~ 3 digits |
minimum | 5.96046 E−8 | smallest positive number. |
1.0009765625 | 1 + 2^−10 = smallest nr larger than 1. | |
maximum | 65504 | |
example values
/*
SIGN EXP MANTISSA
0 01111 0000000000 = 1
0 01111 0000000001 = 1 + 2−10 = 1.0009765625 (next smallest float after 1)
1 10000 0000000000 = −2
0 11110 1111111111 = 65504 (max half precision)
0 00001 0000000000 = 2−14 ≈ 6.10352 × 10−5 (minimum positive normal)
0 00000 1111111111 = 2−14 - 2−24 ≈ 6.09756 × 10−5 (maximum subnormal)
0 00000 0000000001 = 2−24 ≈ 5.96046 × 10−8 (minimum positive subnormal)
0 00000 0000000000 = 0
1 00000 0000000000 = −0
0 11111 0000000000 = infinity
1 11111 0000000000 = −infinity
0 01101 0101010101 = 0.333251953125 ≈ 1/3
*/
Interface
to elaborate
Constructors
- float16(void) defaults to zero.
- float16(double f) constructor.
- float16(const float16 &f) copy constructor.
Conversion
- double toDouble(void) convert to double (or float).
- uint16_t getBinary() get the 2 byte binary representation.
- void setBinary(uint16_t u) set the 2 bytes binary representation.
- size_t printTo(Print& p) const Printable interface.
- void setDecimals(uint8_t d) idem, used for printTo.
- uint8_t getDecimals() idem.
Note the setDecimals takes one byte per object which is not efficient for arrays of float16. See array example for efficient storage using set/getBinary() functions.
Compare
Standard compare functions. Since 0.1.5 these are quite optimized, so it is fast to compare e.g. 2 measurements.
- bool operator == (const float16& f)
- bool operator != (const float16& f)
- bool operator > (const float16& f)
- bool operator >= (const float16& f)
- bool operator < (const float16& f)
- bool operator <= (const float16& f)
Math (basic)
Math is done by converting to double, do the math and convert back. These operators are added for convenience only. Not planned to optimize these.
- float16 operator + (const float16& f)
- float16 operator - (const float16& f)
- float16 operator * (const float16& f)
- float16 operator / (const float16& f)
- float16& operator += (const float16& f)
- float16& operator -= (const float16& f)
- float16& operator *= (const float16& f)
- float16& operator /= (const float16& f)
negation operator.
-
float16 operator - () fast negation.
-
int sign() returns 1 == positive, 0 == zero, -1 == negative.
-
bool isZero() returns true if zero. slightly faster than sign().
-
bool isInf() returns true if value is (-)infinite.
Notes
Future
0.1.x
- update documentation.
- unit tests of the above.
- isNan().
later
- update documentation.
- error handling.
- divide by zero errors.
- look for optimizations.
- rewrite f16tof32() with bit magic.
- add storage example - with SD card, FRAM or EEPROM
- add communication example - serial or Ethernet?