2023-11-02 15:12:29 +01:00

[![Arduino CI](https://github.com/RobTillaart/float16/workflows/Arduino%20CI/badge.svg)](https://github.com/marketplace/actions/arduino_ci)
[![Arduino-lint](https://github.com/RobTillaart/float16/actions/workflows/arduino-lint.yml/badge.svg)](https://github.com/RobTillaart/float16/actions/workflows/arduino-lint.yml)
[![JSON check](https://github.com/RobTillaart/float16/actions/workflows/jsoncheck.yml/badge.svg)](https://github.com/RobTillaart/float16/actions/workflows/jsoncheck.yml)
[![GitHub issues](https://img.shields.io/github/issues/RobTillaart/float16.svg)](https://github.com/RobTillaart/float16/issues)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/RobTillaart/float16/blob/master/LICENSE)
[![GitHub release](https://img.shields.io/github/release/RobTillaart/float16.svg?maxAge=3600)](https://github.com/RobTillaart/float16/releases)
[![PlatformIO Registry](https://badges.registry.platformio.org/packages/robtillaart/library/float16.svg)](https://registry.platformio.org/libraries/robtillaart/float16)
# float16
Arduino library to implement float16 data type.
## Description
This **experimental** library defines the float16 (2 byte) data type, including conversion
functions to and from the float32 type. It is definitely a **work in progress**.
The library implements the **Printable** interface so one can print
float16 values directly to any stream, e.g. Serial.
The primary usage of the float16 data type is to efficiently store and transport
a floating point number. As it uses only 2 bytes where float and double typically use
4 and 8 bytes, gains can be made at the price of range and precision.
## Specifications
| attribute | value        | notes   |
|:----------|:-------------|:--------|
| size      | 2 bytes      | layout s eeeee mmmmmmmmmm (1,5,10) |
| sign      | 1 bit        | |
| exponent  | 5 bit        | |
| mantissa  | 10 bit       | 11 bits of precision including the implicit leading 1, ~ 3 digits |
| minimum   | 5.96046 E-8  | smallest positive number (subnormal). |
|           | 1.0009765625 | 1 + 2^-10 = smallest number larger than 1. |
| maximum   | 65504        | |
#### example values
```cpp
/*
SIGN EXP MANTISSA
0 01111 0000000000 = 1
0 01111 0000000001 = 1 + 2^-10 = 1.0009765625 (next smallest float after 1)
1 10000 0000000000 = -2
0 11110 1111111111 = 65504 (max half precision)
0 00001 0000000000 = 2^-14 ≈ 6.10352 × 10^-5 (minimum positive normal)
0 00000 1111111111 = 2^-14 - 2^-24 ≈ 6.09756 × 10^-5 (maximum subnormal)
0 00000 0000000001 = 2^-24 ≈ 5.96046 × 10^-8 (minimum positive subnormal)
0 00000 0000000000 = 0
1 00000 0000000000 = -0
0 11111 0000000000 = infinity
1 11111 0000000000 = -infinity
0 01101 0101010101 = 0.333251953125 ≈ 1/3
*/
```
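The values in the table above follow from the (1,5,10) layout. A minimal host-side sketch of the decoding, assuming standard IEEE 754 half precision semantics (exponent bias 15), could look as follows. The name **decodeHalf()** is hypothetical and not part of this library, which offers **toDouble()** for this purpose.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an IEEE 754 binary16 bit pattern into a double.
// Hypothetical helper for illustration; not part of the float16 library.
double decodeHalf(uint16_t bits)
{
  const int sign     = (bits >> 15) & 0x1;    // 1 bit
  const int exponent = (bits >> 10) & 0x1F;   // 5 bits, bias 15
  const int mantissa = bits & 0x3FF;          // 10 bits

  double value;
  if (exponent == 0)
  {
    // zero and subnormals: no implicit leading 1, value = mantissa * 2^-24
    value = std::ldexp(static_cast<double>(mantissa), -24);
  }
  else if (exponent == 31)
  {
    // all exponent bits set: infinity (mantissa 0) or NaN (mantissa != 0)
    value = (mantissa == 0) ? std::numeric_limits<double>::infinity()
                            : std::numeric_limits<double>::quiet_NaN();
  }
  else
  {
    // normal numbers: (1 + mantissa / 1024) * 2^(exponent - 15)
    value = std::ldexp(1024.0 + mantissa, exponent - 25);
  }
  return sign ? -value : value;
}
```

For example, the pattern 0 01101 0101010101 (0x3555) decodes to 1365 / 4096 = 0.333251953125.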
#### Related
- https://wokwi.com/projects/376313228108456961 (demo of its usage)
## Interface
```cpp
#include "float16.h"
```
#### Constructors
- **float16(void)** defaults to zero.
- **float16(double f)** constructor.
- **float16(const float16 &f)** copy constructor.
#### Conversion
- **double toDouble(void)** convert to double (or float).
- **uint16_t getBinary()** get the 2 byte binary representation.
- **void setBinary(uint16_t u)** set the 2 byte binary representation.
- **size_t printTo(Print& p) const** Printable interface.
- **void setDecimals(uint8_t d)** set the number of decimals used by printTo().
- **uint8_t getDecimals()** returns the number of decimals used by printTo().
Note that the decimals setting takes one byte per object, which is not efficient for arrays of float16.
See the array example for efficient storage using the set/getBinary() functions.
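The idea behind storing raw binary values can be sketched on a host machine without the library. The sketch below assumes IEEE 754 half precision with round-to-nearest; **encodeHalf()** and **packHalf()** are hypothetical names standing in for what **float16(value).getBinary()** would provide in an Arduino sketch, and the library's own rounding may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for float16(value).getBinary(): round a double to the
// nearest binary16 bit pattern (assumes the default round-to-nearest mode).
uint16_t encodeHalf(double value)
{
  const unsigned sign = std::signbit(value) ? 0x8000u : 0x0000u;
  if (std::isnan(value)) return static_cast<uint16_t>(sign | 0x7E00u);  // quiet NaN
  const double a = std::fabs(value);
  if (a >= 65520.0) return static_cast<uint16_t>(sign | 0x7C00u);       // rounds to infinity
  if (a < std::ldexp(1.0, -14))
  {
    // subnormal range: count multiples of 2^-24 (0..1024, 1024 == smallest normal)
    const unsigned m = static_cast<unsigned>(std::nearbyint(std::ldexp(a, 24)));
    return static_cast<uint16_t>(sign | m);
  }
  int e = 0;
  const double f = std::frexp(a, &e);                 // a = f * 2^e, f in [0.5, 1)
  unsigned m = static_cast<unsigned>(std::nearbyint(std::ldexp(f, 11)));  // 1024..2048
  unsigned exponent = static_cast<unsigned>(e - 1 + 15);                  // add the bias
  if (m == 2048u) { m = 1024u; ++exponent; }          // rounding carried into the exponent
  return static_cast<uint16_t>(sign | (exponent << 10) | (m - 1024u));
}

// Store n values as raw patterns: exactly 2 bytes each, no per-object decimals byte.
void packHalf(const double* in, uint16_t* out, std::size_t n)
{
  for (std::size_t i = 0; i < n; i++) out[i] = encodeHalf(in[i]);
}
```

An array of 100 measurements stored this way takes 200 bytes, versus 400 bytes as float.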
#### Compare
Standard compare functions. Since 0.1.5 these are quite optimized,
so it is fast to compare e.g. 2 measurements.
- **bool operator == (const float16& f)**
- **bool operator != (const float16& f)**
- **bool operator > (const float16& f)**
- **bool operator >= (const float16& f)**
- **bool operator < (const float16& f)**
- **bool operator <= (const float16& f)**
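Comparisons can be fast because the bit layout is sign-magnitude: for two values with the same sign, the larger exponent/mantissa bits belong to the larger magnitude, so no conversion to float is needed. The sketch below shows one common way to exploit this. Whether the library uses exactly this approach is an assumption; **orderKey()**, **halfEqual()** and **halfLess()** are hypothetical names, and NaN is not handled.

```cpp
#include <cstdint>

// Map a binary16 bit pattern to an integer that sorts like the value it encodes.
// Assumes the operand is not NaN. Hypothetical helper, not the library's code.
int32_t orderKey(uint16_t bits)
{
  const int32_t magnitude = bits & 0x7FFF;            // exponent and mantissa bits
  return (bits & 0x8000) ? -magnitude : magnitude;    // +0 and -0 both map to 0
}

bool halfEqual(uint16_t a, uint16_t b) { return orderKey(a) == orderKey(b); }
bool halfLess(uint16_t a, uint16_t b)  { return orderKey(a) <  orderKey(b); }
```

With the patterns from the example values, `halfLess(0xC000, 0x3C00)` compares -2 with 1, and `halfEqual(0x0000, 0x8000)` treats +0 and -0 as equal.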
#### Math (basic)
Math is done by converting to double, doing the math, and converting back.
These operators are added for convenience only.
There are no plans to optimize them.
- **float16 operator + (const float16& f)**
- **float16 operator - (const float16& f)**
- **float16 operator \* (const float16& f)**
- **float16 operator / (const float16& f)**
- **float16& operator += (const float16& f)**
- **float16& operator -= (const float16& f)**
- **float16& operator \*= (const float16& f)**
- **float16& operator /= (const float16& f)**
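A consequence of converting back after every operation is that each result is rounded to the nearest representable float16 value. The host-side sketch below models that rounding, assuming IEEE 754 half precision and round-to-nearest; **quantizeHalf()** and **halfAdd()** are hypothetical names and the library's own conversion may round differently.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Round a double to the nearest value representable in binary16.
// Hypothetical model of "convert back to float16"; not the library's code.
double quantizeHalf(double v)
{
  if (v == 0.0 || !std::isfinite(v)) return v;
  int e = 0;
  std::frexp(v, &e);                            // |v| lies in [2^(e-1), 2^e)
  const int step = std::max(e - 11, -24);       // spacing is 2^step, 2^-24 for subnormals
  const double q = std::ldexp(std::nearbyint(std::ldexp(v, -step)), step);
  if (std::fabs(q) > 65504.0)                   // beyond the largest float16
    return std::copysign(std::numeric_limits<double>::infinity(), v);
  return q;
}

// Addition as described above: do the math in double, then convert back.
double halfAdd(double a, double b) { return quantizeHalf(a + b); }
```

Above 2048 the spacing between float16 values is 2, so `halfAdd(2048.0, 1.0)` yields 2048 again: small increments can get lost when accumulating in float16.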
Negation operator and value tests.
- **float16 operator - ()** fast negation.
- **int sign()** returns 1 == positive, 0 == zero, -1 == negative.
- **bool isZero()** returns true if zero. slightly faster than **sign()**.
- **bool isInf()** returns true if value is (-)infinite.
#### Experimental 0.1.8
- **bool isNaN()** returns true if value is not a number.
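These tests map directly onto the bit layout, which is why they can be cheap. A minimal sketch on the raw 16-bit pattern, assuming IEEE 754 half precision encoding; the function names are hypothetical and the library's implementation may differ.

```cpp
#include <cstdint>

// Classification on the raw binary16 pattern; the sign bit is masked off first.
// Hypothetical helpers for illustration, not the library's code.
bool halfIsZero(uint16_t bits) { return (bits & 0x7FFF) == 0; }       // +0 or -0
bool halfIsInf(uint16_t bits)  { return (bits & 0x7FFF) == 0x7C00; }  // exponent 11111, mantissa 0
bool halfIsNaN(uint16_t bits)  { return (bits & 0x7FFF) >  0x7C00; }  // exponent 11111, mantissa != 0

// Same convention as sign() above: 1 == positive, 0 == zero, -1 == negative.
int halfSign(uint16_t bits)
{
  if (halfIsZero(bits)) return 0;
  return (bits & 0x8000) ? -1 : 1;
}
```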
## Notes
## Future
#### Must
- update documentation.
#### Should
- unit tests of the above.
#### Could
- update documentation.
- error handling.
- divide by zero errors.
- look for optimizations.
- rewrite **f16tof32()** with bit magic.
- add storage example - with SD card, FRAM or EEPROM
- add communication example - serial or Ethernet?
#### Wont
## Support
If you appreciate my libraries, you can support the development and maintenance.
Improve the quality of the libraries by providing issues and Pull Requests, or
donate through PayPal or GitHub sponsors.
Thank you,