mirror of https://github.com/RobTillaart/Arduino.git synced 2024-10-03 18:09:02 -04:00

Rob Tillaart 82f164b6bb 0.2.0 float16ext		2024-04-18 12:16:58 +02:00

float16ext

Arduino library to implement float16ext data type.

Description

This experimental library defines the float16ext (2 byte) data type, including conversion functions to and from the float32 type. It is an extension to the float16 library. Reference: https://en.wikipedia.org/wiki/Half-precision_floating-point_format#ARM_alternative_half-precision

The primary use of the float16ext data type is to efficiently store and transport a floating point number. As it uses only 2 bytes where float and double typically use 4 and 8 bytes, gains can be made at the price of range and precision.

Note that float16ext only has ~3 significant digits.
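
The "~3 significant digits" follow from the 11 significant bits of the format (10 stored mantissa bits plus the implicit leading 1). As a rough sketch, the gap between neighbouring representable values in the normal range can be estimated as shown here. Note that spacingAt() is a hypothetical helper for illustration only and is not part of the library.

```cpp
#include <cassert>
#include <cmath>

// Gap between adjacent representable values around magnitude a,
// assuming a lies in the normal range (2^-14 up to 131008).
// Hypothetical helper, not part of the float16ext API.
double spacingAt(double a)
{
  int e;
  std::frexp(std::fabs(a), &e);        // |a| = f * 2^e with 0.5 <= f < 1
  return std::ldexp(1.0, e - 1 - 10);  // one unit in the last of 10 mantissa bits
}

void spacingExamples()
{
  assert(spacingAt(1.0) == 0.0009765625);  // 2^-10
  assert(spacingAt(1000.0) == 0.5);
  assert(spacingAt(100000.0) == 64.0);
}
```

So around 1000 the stored values are 0.5 apart, and above 65536 they are 64 apart.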

To print a float16ext, one needs to convert it with toFloat(), toDouble() or toString(decimals). The latter allows concatenation and further conversion to a char array.

Versions before 0.2.0 implemented the Printable interface, but it has been removed as it caused excessive memory usage when declaring arrays of float16.

ARM alternative half-precision

Source: https://en.wikipedia.org/wiki/Half-precision_floating-point_format#ARM_alternative_half-precision

ARM processors support (via a floating point control register bit) an "alternative half-precision" format, which does away with the special case for an exponent value of 31 (11111 in binary). It is almost identical to the IEEE format, but there is no encoding for infinity or NaNs; instead, an exponent of 31 encodes normalized numbers in the range 65536 to 131008.

Implemented in https://github.com/RobTillaart/float16ext class.
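
To make the description above concrete, here is a minimal sketch of how a 16 bit pattern in this layout can be decoded. It is written as plain C++ for illustration; decodeAltHalf() is a hypothetical name and this is not the code used inside the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode a 16 bit pattern in the alternative half-precision layout
// (1 sign bit, 5 exponent bits, 10 mantissa bits) into a double.
// Hypothetical helper for illustration, not part of the float16ext API.
double decodeAltHalf(uint16_t bits)
{
  const int sign     = (bits >> 15) & 0x01;
  const int exponent = (bits >> 10) & 0x1F;
  const int mantissa = bits & 0x03FF;

  double value;
  if (exponent == 0)
  {
    // zero and subnormals: no implicit leading 1, fixed scale of 2^-24
    value = std::ldexp(static_cast<double>(mantissa), -24);
  }
  else
  {
    // normal numbers; exponent 31 is treated like any other exponent,
    // where IEEE half precision would reserve it for INF and NaN
    value = std::ldexp(1024.0 + mantissa, exponent - 25);
  }
  return sign ? -value : value;
}

void decodeExamples()
{
  assert(decodeAltHalf(0x3C00) == 1.0);
  assert(decodeAltHalf(0x7BFF) == 65504.0);   // largest IEEE half value
  assert(decodeAltHalf(0x7C00) == 65536.0);   // would be INF in IEEE
  assert(decodeAltHalf(0x7FFF) == 131008.0);  // largest float16ext value
}
```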

Difference between float16 and float16ext

The float16ext type has an extended range: the largest magnitude it supports is ±131008, where float16 stops at ±65504.

The float16ext does not support INF, -INF and NAN. These values are mapped onto the largest positive, the largest negative and the largest positive number respectively.

The -0 and 0 values will both exist.
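
The mapping of INF, -INF and NAN can be pictured with the following encoder sketch. It assumes round-to-nearest on the mantissa (ties away from zero, because it uses lround), which may differ from what the library does. encodeAltHalf() is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Encode a double into the alternative half-precision bit layout.
// Hypothetical helper for illustration, not the library's implementation.
uint16_t encodeAltHalf(double d)
{
  const uint16_t MAX_BITS = 0x7FFF;             // exponent 31, mantissa 1023 = 131008
  if (std::isnan(d)) return MAX_BITS;           // NAN -> largest positive

  const uint16_t sign = std::signbit(d) ? 0x8000 : 0x0000;
  const double a = std::fabs(d);
  if (a >= 131008.0)                            // INF and overflow saturate
  {
    return static_cast<uint16_t>(sign | MAX_BITS);
  }
  if (a < std::ldexp(1.0, -14))                 // zero and subnormal range
  {
    // a result of 1024 here is the bit pattern of the smallest normal, 2^-14
    return static_cast<uint16_t>(sign | std::lround(std::ldexp(a, 24)));
  }
  int e;
  const double f = std::frexp(a, &e);           // a = f * 2^e with 0.5 <= f < 1
  long m = std::lround((2.0 * f - 1.0) * 1024.0);
  int biased = e - 1 + 15;
  if (m == 1024) { m = 0; biased++; }           // rounding carried into the exponent
  return static_cast<uint16_t>(sign | (biased << 10) | m);
}

void encodeExamples()
{
  assert(encodeAltHalf(INFINITY)  == 0x7FFF);   // largest positive
  assert(encodeAltHalf(-INFINITY) == 0xFFFF);   // largest negative
  assert(encodeAltHalf(NAN)       == 0x7FFF);   // largest positive
  assert(encodeAltHalf(0.0)       == 0x0000);
  assert(encodeAltHalf(-0.0)      == 0x8000);   // -0 keeps its own encoding
}
```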

Although they share a lot of code, float16 and float16ext should not be mixed. In the future these libraries might merge, or one might be derived from the other.

Breaking change 0.2.0

Version 0.2.0 has a breaking change. The Printable interface is removed as it caused arrays of float16 to be larger than expected (See #16). On ESP8266 every float16 object was 8 bytes and on AVR it was 5 bytes instead of the expected 2 bytes.

To support printing the class added two new conversion functions:

f16.toFloat();
f16.toString(decimals);

Serial.println(f16.toFloat(), 4);
Serial.println(f16.toString(4));

This keeps printing relatively easy.

The footprint of the library is now smaller and one can now create compact arrays of float16 elements using only 2 bytes per element.
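
The size difference can be reproduced on a desktop compiler with the sketch below. PrintableLike is a hypothetical stand-in for an interface with a virtual method (such as Arduino's Printable). The exact object size depends on the platform, so the only assumption checked is that the class with the interface is larger than 2 bytes.

```cpp
#include <cstdint>

// Hypothetical stand-in for an interface with a virtual method.
struct PrintableLike
{
  virtual int printTo() const = 0;
};

// A 16 bit value that implements the interface:
// every object carries a hidden pointer to the virtual table.
struct WithInterface : public PrintableLike
{
  uint16_t bits = 0;
  int printTo() const override { return bits; }
};

// The same payload without virtual methods stays at 2 bytes.
struct Compact
{
  uint16_t bits = 0;
};

static_assert(sizeof(Compact) == 2, "payload only");
static_assert(sizeof(WithInterface) > sizeof(Compact), "vtable pointer adds overhead");
```

For an array of 100 elements the Compact variant needs 200 bytes, while the variant with the interface needs 100 times its larger object size.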

Specifications

The layout is the same as float16; however, the range is different.

Attribute	Value	Notes
size	2 bytes	layout s eeeee mmmmmmmmmm (1, 5, 10)
sign	1 bit
exponent	5 bits
mantissa	10 bits	3 - 4 digits
minimum	±5.96046 E−8	smallest non-zero magnitude (subnormal).
	±1.0009765625	1 + 2^−10 = smallest number larger than 1.
maximum	±131008

± = ALT 0177

Example values

Source: https://en.wikipedia.org/wiki/Half-precision_floating-point_format

/*
   SIGN  EXP     MANTISSA
    0    01111    0000000000 = 1
    0    01111    0000000001 = 1 + 2^−10 = 1.0009765625 (next smallest float after 1)
    1    10000    0000000000 = −2

    0    11110    1111111111 = 65504  (max IEEE half precision; float16ext continues up to 131008 with exponent 11111)

    0    00001    0000000000 = 2^−14 ≈ 6.10352 × 10^−5 (minimum positive normal)
    0    00000    1111111111 = 2^−14 − 2^−24 ≈ 6.09756 × 10^−5 (maximum subnormal)
    0    00000    0000000001 = 2^−24 ≈ 5.96046 × 10^−8 (minimum positive subnormal)

    0    00000    0000000000 = 0
    1    00000    0000000000 = −0

    0    01101    0101010101 = 0.333251953125 ≈ 1/3
*/

Interface

#include "float16ext.h"

Constructors

float16ext(void) defaults value to zero.
float16ext(double f) constructor.
float16ext(const float16ext &f) copy constructor.

Conversion

double toDouble(void) convert value to double (which is the same as float on some boards, e.g. UNO).
float toFloat(void) convert value to float.
String toString(unsigned int decimals = 2) convert value to a String with decimals. Please note that the accuracy is only 3-4 digits for the whole number so use decimals with care.

Export and store

To serialize the internal format e.g. to disk, two helper functions are available.

uint16_t getBinary() get the 2 byte binary representation.
void setBinary(uint16_t u) set the 2 bytes binary representation.
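
When the 16 bit patterns are written to a file or sent over a link, a fixed byte order keeps the data portable between boards. The following sketch packs the patterns low byte first. The values would come from getBinary() and be restored with setBinary(), while packLE() and unpackLE() are hypothetical helper names that are not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Write 16 bit patterns into a byte buffer, low byte first.
// out must have room for 2 * count bytes.
void packLE(const uint16_t* values, size_t count, uint8_t* out)
{
  for (size_t i = 0; i < count; i++)
  {
    out[2 * i]     = static_cast<uint8_t>(values[i] & 0xFF);
    out[2 * i + 1] = static_cast<uint8_t>(values[i] >> 8);
  }
}

// Read the index-th 16 bit pattern back from the byte buffer.
uint16_t unpackLE(const uint8_t* in, size_t index)
{
  return static_cast<uint16_t>(in[2 * index] | (in[2 * index + 1] << 8));
}

void packExample()
{
  const uint16_t values[2] = { 0x3C00, 0xC000 };  // bit patterns of 1 and -2
  uint8_t buffer[4];
  packLE(values, 2, buffer);
  assert(unpackLE(buffer, 0) == 0x3C00);
  assert(unpackLE(buffer, 1) == 0xC000);
}
```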

Compare

The library implements the standard compare functions. These are optimized, so it is fast to compare 2 float16ext values.

Note: comparison with a float or double always includes a conversion. You can improve performance by converting e.g. a threshold only once before comparison.

bool operator == (const float16ext& f)
bool operator != (const float16ext& f)
bool operator > (const float16ext& f)
bool operator >= (const float16ext& f)
bool operator < (const float16ext& f)
bool operator <= (const float16ext& f)
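
One way such comparisons can work without converting to float is to compare the bit patterns directly. This works because the format is sign-magnitude and has no NAN encodings. The sketch below shows the idea; it is not necessarily how the library implements it, and orderKey(), lessThan() and equalTo() are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Map a sign-magnitude bit pattern to an integer whose ordering
// matches the numeric ordering. +0 and -0 both map to 0.
int32_t orderKey(uint16_t bits)
{
  const int32_t magnitude = bits & 0x7FFF;
  return (bits & 0x8000) ? -magnitude : magnitude;
}

bool lessThan(uint16_t a, uint16_t b) { return orderKey(a) < orderKey(b); }
bool equalTo(uint16_t a, uint16_t b)  { return orderKey(a) == orderKey(b); }

void compareExamples()
{
  assert(lessThan(0xC000, 0x3C00));   // -2 < 1
  assert(lessThan(0xC000, 0xBC00));   // -2 < -1
  assert(lessThan(0x7BFF, 0x7C00));   // 65504 < 65536
  assert(equalTo(0x0000, 0x8000));    // 0 == -0
}
```

For a threshold given as a float, the same idea applies at the library level: construct a float16ext from it once (the float16ext(double) constructor) and reuse that object in the loop.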

Math (basic)

Math is done by converting to double, doing the math and converting back. These operators are added for convenience only. There are no plans to optimize them.

float16ext operator + (const float16ext& f)
float16ext operator - (const float16ext& f)
float16ext operator * (const float16ext& f)
float16ext operator / (const float16ext& f)
float16ext& operator += (const float16ext& f)
float16ext& operator -= (const float16ext& f)
float16ext& operator *= (const float16ext& f)
float16ext& operator /= (const float16ext& f)

Negation operator.

float16ext operator - () fast negation.

Math helpers.

int sign() returns 1 == positive, 0 == zero, -1 == negative.
bool isZero() returns true if zero. slightly faster than sign().
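
Negation and the helpers above can be expressed on the bit pattern alone, which is why they can be fast. This is a sketch under that assumption and not the library's code; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Negation only flips the sign bit.
uint16_t negateBits(uint16_t bits) { return bits ^ 0x8000; }

// Zero means all exponent and mantissa bits are clear (covers 0 and -0).
bool isZeroBits(uint16_t bits) { return (bits & 0x7FFF) == 0; }

// 1 == positive, 0 == zero, -1 == negative.
int signOfBits(uint16_t bits)
{
  if (isZeroBits(bits)) return 0;
  return (bits & 0x8000) ? -1 : 1;
}

void helperExamples()
{
  assert(negateBits(0x3C00) == 0xBC00);   // 1 -> -1
  assert(isZeroBits(0x8000));             // -0 counts as zero
  assert(signOfBits(0xC000) == -1);       // -2
}
```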

The float16ext does not support INF or NAN.

Future

Must

update documentation.
keep in sync with float16 lib where possible.

Should

Could

Wont

Support

If you appreciate my libraries, you can support the development and maintenance. Improve the quality of the libraries by providing issues and Pull Requests, or donate through PayPal or GitHub sponsors.

Thank you,
