## Creating a C++ Hash-Table quickly, that has a QString as its Key-Type.

This posting will assume that the reader has a rough idea, what a hash table is. In case this is not so, this WiKiPedia Article explains that as a generality. Just by reading that article, especially if the reader is of a slightly older generation like me, he or she could get the impression that, putting a hash-table into a practical C++ program is arduous and complex. In reality, the most efficient implementation possible for a hash table, requires some attention to such ideas as, whether the hashes will generally tend to be coprime with the modulus of the hash table, or at least, have the lowest GCDs possible. Often, bucket-arrays with sizes that are powers of (2), and hashes that contain higher prime factors help. But, when writing practical code in C++, one will find that the “Standard Template Library” (‘STL’) already provides one, that has been implemented by experts. It can be put into almost any C++ program, as an ‘unordered_map’. (:1)

But there is a caveat. Any valid data-type meant to act as a key needs to have a hashing function defined, as a template specialization, and not all data-types have one by far. And of course, this hashing function should be as close to unique as possible, for unique key-values, while never changing, if there could be more than one possible representation of the same key-value. Conveniently, C-strings, which are denoted in C or C++ as the old-fashioned ‘char *’ data-type, happen to have this template specialization. But, because C-strings are a very old type of data-structure, and, because the reader may be in the process of writing code that uses the Qt GUI library, the reader may want to use ‘QString’ as his key-data-type, only to find, that a template specialization has not been provided…

For such cases, C++ allows the programmer to define his own hashing function, for QStrings if so chosen. In fact, Qt even allows a quick-and-dirty way to do it, via the ‘qHash()’ function. But there is something I do not like about ‘qHash()’ and its relatives. They tend to produce 32-bit hash-codes! While this does not directly cause an error – only the least-significant 32 bits within a modern 64-bit data-field will contain non-zero values – I think it’s very weak to be using 32-bit hash-codes anyway.

Further, I read somewhere that the latest versions of Qt – as of 5.14? – do, in fact, define such a specialization, so that the programmer does not need to worry about it anymore. But, IF the programmer is still using an older version of Qt, like me, THEN he or she might want to define their own 64-bit hash-function. And so, the following is the result which I arrived at this afternoon. I’ve tested that it will not generate any warnings when compiled under Qt 5.7.1, and that a trivial ‘main()’ function that uses it, will only generate warnings, about the fact that the trivial ‘main()’ function also received the standard ‘argc’ and ‘argv’ parameters, but made no use of them. Otherwise, the resulting executable also produced no messages at run-time, while having put one key-value pair into a sample hash table… (:2)

/*  File 'Hash_QStr.h'
*
*  Regular use of a QString in an unordered_map doesn't
* work, because in earlier versons of Qt, there was no
* std::hash<QString>() specialization.
* Thus, one could be included everywhere the unordered_map
* is to be used...
*/

#ifndef HASH_QSTR_H
#define HASH_QSTR_H

#include <unordered_map>
#include <QString>
//#include <QHash>
#include <QChar>

#define PRIME 5351

/*
namespace cust {
template<typename T>
struct hash32 {};

template<> struct hash32<QString> {
std::size_t operator()(const QString& s) const noexcept {
return (size_t) qHash(s);
}
};
}

*/

namespace cust
{
inline size_t QStr_Orther(const QChar & mychar) noexcept {
return ((mychar.row() << 8) | mychar.cell());
}

template<typename T>
struct hash64 {};

template<> struct hash64<QString>
{
size_t operator()(const QString& s) const noexcept
{
size_t hash = 0;

for (int i = 0; i < s.size(); i++)
hash = (hash << 4) + hash + QStr_Orther(s.at(i));

return hash * PRIME;
}
};
}

#endif  //  HASH_QSTR_H



(Updated 4/12/2021, 8h00… )

## Hash Tables

There is already a lot of work published, that details numerous established ways to implement hash tables. The WiKiPedia article on the broad class of hash-tables, is already a good starting point to become familiar with them.

But just to make this interesting, I undertook the mental exercise of designing a hash table, which when grown, does not require that the contents of the original hash table be rearranged in any way, but in which the expansion is achieved, by simply appending zeroes to the original hash table, and then changing one of the parameters, with which the new, bigger hash table is accessed.

My exercise led me to the premise, that each size of my hash table be a power of two, and it seems reasonable to start with 2^16 entries, aka 65536 entries, giving the hash table an initial Power Level (L1) of 16. There would be a pointer I name P to a hash-table position, and two basic Operations to arrive at P:

1. The key can be hashed, by multiplying by the highest prime number smaller than 2^L1 – This would be the prime number 65521 – within the modulus of 2^L1, and yielding Position P0.
2. The current Position can have added the number A, which is intentionally some power of two lower than L1, within the modulus of 2^L, successively yielding P1, P2, P3, etc…

Because the original multiplier is a prime number lower than 2^L1, and the latter, the initial modulus of the hash table, a power of two, consecutive key-values will only lead to a repetition in P0 after 2^L key-values. But, because actual keys are essentially random, a repetition of P0 is possible by chance, and the resolution of the resulting hash-table collision is the main subject in the design of any hash table.

Default-Size Operation

By default, each position of the hash table contains a memory address, which is either NULL, meaning that the position is empty, or which points to an openly-addressable data-structure, from which the exact key can be retrieved again. When this retrieved exact key does not match the exact key used to perform the lookup, then Operation (2) above needs to be performed on P, and the lookup-attempt repeated.

But, because A is a power of 2 which also fits inside the modulus 2^L, we know that the values of P will repeat themselves exactly, and how small A is, will determine how many lookup-attempts can be undertaken, before P repeats, and this will also determine how many entries can be found at any one bucket. Effectively, if A = 2^14, then 2^L / A = 2^2, so that a series of 4 positions would form the maximum bucket-size. If none of those entries is NULL, the bucket is full, and if an attempt is made to insert a new entry to a full bucket, the hash-table must be expanded.

Finding the value for a key will predictably require, that all the positions following from one value of P0 be read, even if some of them were NULL, until an iteration of P reveals the original key.

It is a trivial fact that eventually, some of the positions in this series will have been written to by other buckets, because keys will be random again, thus leading to (their own values of P) which coincide with (current nA + P0) . But, because the exact key will be retrieved from non-NULL addresses before those are taken to have arisen from the key being searched for, those positions will merely reflect a performance-loss, not an accuracy-loss.

Because it is only legal, for 1 key to lead to a maximum of 1 value, as is the case with all hash tables, before an existing key can be set to a new value, the old key must be found and be deleted. In order for this type of hash table to enforce that policy, its own operation to insert a key would need to be preceded by an operation to retrieve it, which can either return a Boolean True value if the key exists, plus the position at which its address is located, or a Boolean False value if it does not exist, plus the position at which the first NULL address is located, at which a following operation could insert one, in a single step. I would be asking two operations to take place atomically, because this can pose further issues with multi-threaded access to the hash table.

(Edit 06/22/2017 : The premise here, in the case of a multi-threaded application, is that it should be possible to lock a global object, before the retrieval-function is called, as long as it is not already locked, but that if a thread gives the instruction to lock it when it already is, then that thread is put to sleep, until the other thread, which originally locked the object, unlocks it again.

I believe that Such an object is also known as a ‘Mutex’.

In my thought-exercise, this Mutex-object not being locked, should also be a signal to the retrieval-function, that the position returned, in case the key is not found, does not need to be of significance, specifically for the situation that the bucket could be full, in which case there should be no valid pointer to a NULL-address. )

Any defined operation to delete a key would follow according to the same logic – It must be found first, and if not found, cause the appropriate error-code to be returned. Any hash table with more than one entry for the same key is corrupted, but can conceivably be cleaned, if the program that uses it sends multiple commands to delete the same key, or one command to purge it…

Growing the Hash Table

(Last Updated 06/28/2017 … )

## Alists and Plists

One of the data-structures that even early LISP implementations possessed, were ‘plists‘. It was only by reading up on LISP lately, that I discovered that many LISP implementations maintained a hidden plist, in which certain Properties of Symbols were stored. This could easily have been found in the CIR of the Atom, the naming of which was later simplified to CDR. Alternatively, this plist could have been the only datum of the Symbol, hence stored at its CAR.

In short, this was an early associative list, consisting of Key:Value pairs, where the Key was supplied by the programmer, in order to retrieve the Value from the plist.

Plists were later superseded by ‘alists‘. An alist has a better storage format, but again stores Key:Value associations. One big advantage of an alist, is the ability to perform a reverse look-up, in which a sought Value is supplied, and the List of Keys returned, that point to this Value.

I think it is agreed, that Hash Tables eventually manage forward-look-ups better than either a plist or an alist, especially if large numbers of Keys are stored. Yet, it is written that for short collections of data, the simplicity with which an alist is stored, actually allows it to work more quickly than a Hash Table, in LISP.

Now, if it was necessary to perform reverse look-ups, yet the large number of objects require a Hash Table, then the most sensible approach might be to create two Hash Tables, one for the forward look-up, and another for the reverse look-up? In that case, if a Hash Table is predicted to serve a one-to-many relationship, then the care should also be taken from the beginning, to encapsulate all the value-sets that result from one Key, into Sets, even if in many cases, there may be only one element in the Set.

Alternatively a system could be devised, so that only one result from the reverse look-up is the correct one.

Dirk