[![Arduino CI](https://github.com/RobTillaart/Soundex/workflows/Arduino%20CI/badge.svg)](https://github.com/marketplace/actions/arduino_ci)
[![Arduino-lint](https://github.com/RobTillaart/Soundex/actions/workflows/arduino-lint.yml/badge.svg)](https://github.com/RobTillaart/Soundex/actions/workflows/arduino-lint.yml)
[![JSON check](https://github.com/RobTillaart/Soundex/actions/workflows/jsoncheck.yml/badge.svg)](https://github.com/RobTillaart/Soundex/actions/workflows/jsoncheck.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/RobTillaart/Soundex/blob/master/LICENSE)
[![GitHub release](https://img.shields.io/github/release/RobTillaart/Soundex.svg?maxAge=3600)](https://github.com/RobTillaart/Soundex/releases)
# Soundex
Arduino Library for calculating Soundex hash.
## Description
This library generates a string-based hash that reflects how a word sounds.
The algorithm is called Soundex.
The original algorithm was developed by Robert C. Russell and
Margaret King Odell over 100 years ago.

There are several variations of Soundex and these might be supported in the future.

Roughly, the algorithm copies the first letter of the word in uppercase,
followed by 3 digits that replace the consonants.
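As an illustration of that rule, here is a minimal sketch in portable C++. It is not the library's implementation: the function name `soundexSketch` and the table `kSoundexDigits` are made up for this example, ASCII input is assumed, and the code deliberately uses a simplified rule in which every letter without a digit (vowels, H, W, Y) separates consonant groups. Full Russell & Odell Soundex treats H and W differently, so results can differ for words such as "Ashcraft".

```cpp
#include <stddef.h>

// Digit for each letter A..Z (hypothetical table for this sketch).
// '0' marks letters that produce no digit: vowels plus H, W and Y.
// Groups: BFPV=1, CGJKQSXZ=2, DT=3, L=4, MN=5, R=6.
static const char kSoundexDigits[] = "01230120022455012623010202";

// Writes `length` characters plus a terminator into `out`
// (out must hold at least length + 1 bytes).
void soundexSketch(const char* word, char* out, size_t length = 4) {
    size_t n = 0;
    char previous = '0';
    for (const char* p = word; *p != '\0' && n < length; ++p) {
        char c = *p;
        if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
        if (c < 'A' || c > 'Z') continue;        // ignore non-letters
        const char digit = kSoundexDigits[c - 'A'];
        if (n == 0) {
            out[n++] = c;                        // keep the first letter
        } else if (digit != '0' && digit != previous) {
            out[n++] = digit;                    // start of a new consonant group
        }
        previous = digit;                        // repeats of this digit are skipped
    }
    while (n < length) out[n++] = '0';           // pad short codes with zeros
    out[n] = '\0';
}
```

With this sketch, "Robert" and "Rupert" both map to the same 4-character code, which is the point of a phonetic hash.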
The base Soundex has 26 x 7 x 7 x 7 = 8918 possible outcomes,
which easily fit in a uint16_t.
This insight triggered the experimental functions.
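The count above can be checked at compile time. The helper `soundexOutcomes` below is a hypothetical name used only for this sketch; it assumes 26 possible first letters and 7 possible values (digits 0..6) for every following position.

```cpp
#include <stdint.h>

// Number of distinct codes for a Soundex of the given length (1..10):
// 26 choices for the leading letter, 7 for every later position.
// Lengths above 10 overflow a uint32_t.
constexpr uint32_t soundexOutcomes(uint8_t length) {
    return length <= 1 ? 26u : 7u * soundexOutcomes(length - 1);
}

static_assert(soundexOutcomes(4) == 8918, "26 x 7 x 7 x 7");
static_assert(soundexOutcomes(4) <= UINT16_MAX, "fits in a uint16_t");
```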
#### 0.1.2 Experimental
The library has two experimental functions, **soundex16()** and **soundex32()**.
These functions pack a Soundex hash of length 5 into a uint16_t and one of length 10 into a uint32_t.
In effect they compress soundex() results.

Advantages (16 bit version):
- better hash as it adds 1 extra character.
- saves 60% of RAM (2 bytes instead of 5).
- allows faster comparisons (comparing 2 bytes is faster than comparing 5).
- less storage/communication needed.
- printable as HEX.

Disadvantages:
- unknown / new.
- needs extra processing.

The hash codes of these new soundexNN() functions form a continuous numeric range.
|  Checksum   |  bytes  |  chars  |      max value  |  used   |  notes       |
|:------------|:-------:|:-------:|----------------:|:-------:|:-------------|
|  soundex    |    5    |    4    |           8917  |  1e-6%  |  default     |
|  soundex16  |    2    |    5    |          62425  |  95.3%  |  0xF3D9      |
|  soundex32  |    4    |   10    |     1049193781  |  24.4%  |  0x3E89 6D35 |

Note that soundex16() and soundex32() compress the information much better than
the standard soundex().
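One way to obtain such a continuous range is to read the code as a mixed-radix number: the letter is the most significant position (A..Z as 0..25) and every digit is a base-7 position. The sketch below illustrates that idea only. `packSoundex` is a hypothetical helper, and the bit layout actually used by soundex16() and soundex32() may differ. It does reproduce the maximum values listed in the table.

```cpp
#include <stdint.h>

// Hypothetical packing of a well-formed code such as "R1630":
// letter A..Z -> 0..25, then one base-7 position per digit 0..6.
// Codes of up to 10 characters fit in the 32 bit result.
uint32_t packSoundex(const char* code) {
    if (code == nullptr || code[0] == '\0') return 0;
    uint32_t value = static_cast<uint32_t>(code[0] - 'A');
    for (const char* p = code + 1; *p != '\0'; ++p) {
        value = value * 7u + static_cast<uint32_t>(*p - '0');
    }
    return value;
}
```

Under this assumption "A0000" maps to 0 and "Z6666" maps to 62425 (0xF3D9), so a 5-character code always fits in a uint16_t.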
A soundex64() is possible and would use 8 bytes.
It would allow compressing very long soundex() results (up to 22 chars) into those 8 bytes.
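The 22 character limit follows from the same counting argument: a code of length n needs room for 26 x 7^(n-1) values. A small sketch that derives the longest code fitting in a given integer width (`maxSoundexLength` is a name invented for this example):

```cpp
#include <stdint.h>

// Largest Soundex length whose 26 * 7^(length-1) codes all fit in an
// unsigned integer of the given width. bits must be between 5 and 64.
uint8_t maxSoundexLength(uint8_t bits) {
    const uint64_t limit =
        (bits >= 64) ? UINT64_MAX : ((uint64_t{1} << bits) - 1);
    uint64_t maxCode = 25;      // largest code of length 1 ('Z')
    uint8_t length = 1;
    // Appending one digit turns the largest code c into c * 7 + 6;
    // the comparison is arranged so that it cannot overflow.
    while (maxCode <= (limit - 6) / 7) {
        maxCode = maxCode * 7 + 6;
        ++length;
    }
    return length;
}
```

For 16, 32 and 64 bits this yields 5, 10 and 22, matching the lengths mentioned in this section.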
#### Links
- https://en.wikipedia.org/wiki/Soundex
- https://en.wikipedia.org/wiki/Metaphone (not implemented)
## Interface
```cpp
#include "Soundex.h"
```
#### Core
- **Soundex()** Constructor.
- **void setLength(uint8_t length = 4)** Sets the length of the code, e.g. to include more digits.
Maximum length = SOUNDEX_MAX_LENGTH - 1, which is 11 by default.
- **uint8_t getLength()** returns current length.
- **char \* soundex(const char \* str)** determines the (Russell & Odell) Soundex code of the string.
#### Experimental
- **uint16_t soundex16(const char \* str)** determines the (Russell & Odell) Soundex code with
length == 5 of the string and packs the result in a uint16_t.
Note: preferably printed in HEX.
- **uint32_t soundex32(const char \* str)** determines the (Russell & Odell) Soundex code with
length == 10 of the string and packs the result in a uint32_t.
Note: preferably printed in HEX.
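When printing in HEX, note that Arduino's `Serial.print(value, HEX)` omits leading zeros, so small hashes print with fewer digits. If fixed-width output is wanted, a helper along the following lines could be used. `toHex16` and `toHex32` are hypothetical names and are not part of the library.

```cpp
#include <stdint.h>
#include <stdio.h>

// Writes value as exactly 4 uppercase hex digits plus a terminator.
// buffer must hold at least 5 bytes.
void toHex16(uint16_t value, char* buffer) {
    snprintf(buffer, 5, "%04X", static_cast<unsigned>(value));
}

// Writes value as exactly 8 uppercase hex digits plus a terminator.
// buffer must hold at least 9 bytes.
void toHex32(uint32_t value, char* buffer) {
    snprintf(buffer, 9, "%08lX", static_cast<unsigned long>(value));
}
```

The resulting buffer can be passed to `Serial.println()` as a normal C string.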
#### Performance
First measurements of **.soundex("Trichloroethylene")** with
a test sketch show the following timing per word.
Not tested on other platforms.

|  Checksum   |  digits  |  UNO 16 MHz  |  ESP32 240 MHz  |  notes          |
|:------------|:--------:|:------------:|:---------------:|:----------------|
|  soundex    |    4     |    28 us     |     4 us        |                 |
|  soundex16  |    5     |    48 us     |     6 us        |  not optimized  |
|  soundex32  |    10    |    120 us    |     10 us       |  not optimized  |
## Operation
See examples.
## Future ideas
#### Must
- improve documentation
- add examples
#### Should
- more testing
  - other platforms
  - different key lengths
  - string lengths
  - performance
#### Could
- use spare bits of soundex16/32 as parity / checksum.
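A rough sketch of the parity idea, for the 32 bit case only. It assumes the packed hash never exceeds the maximum 0x3E89 6D35 from the table in the Description, which is below 2^30, so bits 30 and 31 are free. All function names are invented for this example and nothing like this exists in the library.

```cpp
#include <stdint.h>

// Returns 1 if the number of set bits in value is odd, else 0.
uint8_t parityBit(uint32_t value) {
    value ^= value >> 16;
    value ^= value >> 8;
    value ^= value >> 4;
    value ^= value >> 2;
    value ^= value >> 1;
    return static_cast<uint8_t>(value & 1u);
}

// Stores an even-parity bit in bit 31 of a packed hash (assumed < 2^30).
uint32_t addParity(uint32_t hash) {
    return hash | (static_cast<uint32_t>(parityBit(hash)) << 31);
}

// True if the word has even parity, i.e. no single-bit error was detected.
bool checkParity(uint32_t word) {
    return parityBit(word) == 0;
}

// Recovers the original hash by clearing the parity bit.
uint32_t stripParity(uint32_t word) {
    return word & 0x7FFFFFFFu;
}
```

A single flipped bit in storage or transmission makes `checkParity()` return false. The 16 bit hash has no full spare bit (its maximum 0xF3D9 already uses bit 15), so this layout does not carry over directly.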
#### Won't
- efficient storage of the Soundex array
  - encode in nibbles. (13 bytes instead of 26) => more code, performance?
    0x01, 0x23, 0x01 etc.
    (performance test was slower, gain in RAM == PROGMEM loss).
- Other algorithms might be added in the future.
  - reverse_soundex()
  - Daitch-Mokotoff Soundex
  - Beider-Morse Soundex
  - Metaphone
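For reference, the nibble encoding mentioned in this list could look like the sketch below: two letters share one byte, which halves the table at the cost of extra shift and mask code per lookup. The table assumes the standard Soundex letter groups (BFPV=1, CGJKQSXZ=2, DT=3, L=4, MN=5, R=6, others 0), and the names `kNibbles` and `soundexDigit` are illustrative only.

```cpp
#include <stdint.h>

// 26 digits packed two per byte: high nibble = even letter index (A, C, E, ...),
// low nibble = odd letter index (B, D, F, ...).
static const uint8_t kNibbles[13] = {
    0x01, 0x23, 0x01, 0x20, 0x02, 0x24, 0x55,   // AB CD EF GH IJ KL MN
    0x01, 0x26, 0x23, 0x01, 0x02, 0x02          // OP QR ST UV WX YZ
};

// Returns the Soundex digit (0..6) for an uppercase letter 'A'..'Z'.
uint8_t soundexDigit(char letter) {
    const uint8_t index = static_cast<uint8_t>(letter - 'A');
    const uint8_t packed = kNibbles[index / 2];
    return static_cast<uint8_t>((index % 2 == 0) ? (packed >> 4) : (packed & 0x0F));
}
```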