Creating a C++ Hash-Table quickly, that has a QString as its Key-Type.

This posting will assume that the reader has a rough idea, what a hash table is. In case this is not so, this WiKiPedia Article explains that as a generality. Just by reading that article, especially if the reader is of a slightly older generation like me, he or she could get the impression that, putting a hash-table into a practical C++ program is arduous and complex. In reality, the most efficient implementation possible for a hash table, requires some attention to such ideas as, whether the hashes will generally tend to be coprime with the modulus of the hash table, or at least, have the lowest GCDs possible. Often, bucket-arrays with sizes that are powers of (2), and hashes that contain higher prime factors help. But, when writing practical code in C++, one will find that the “Standard Template Library” (‘STL’) already provides one, that has been implemented by experts. It can be put into almost any C++ program, as an ‘unordered_map’. (:1)

But there is a caveat. Any valid data-type meant to act as a key needs to have a hashing function defined, as a template specialization, and not all data-types have one by far. And of course, this hashing function should be as close to unique as possible, for unique key-values, while never changing, if there could be more than one possible representation of the same key-value. Conveniently, C-strings, which are denoted in C or C++ as the old-fashioned ‘char *’ data-type, happen to have this template specialization. But, because C-strings are a very old type of data-structure, and, because the reader may be in the process of writing code that uses the Qt GUI library, the reader may want to use ‘QString’ as his key-data-type, only to find, that a template specialization has not been provided…

For such cases, C++ allows the programmer to define his own hashing function, for QStrings if so chosen. In fact, Qt even allows a quick-and-dirty way to do it, via the ‘qHash()’ function. But there is something I do not like about ‘qHash()’ and its relatives. They tend to produce 32-bit hash-codes! While this does not directly cause an error – only the least-significant 32 bits within a modern 64-bit data-field will contain non-zero values – I think it’s very weak to be using 32-bit hash-codes anyway.

Further, I read somewhere that the latest versions of Qt – as of 5.14? – do, in fact, define such a specialization, so that the programmer does not need to worry about it anymore. But, IF the programmer is still using an older version of Qt, like me, THEN he or she might want to define their own 64-bit hash-function. And so, the following is the result which I arrived at this afternoon. I’ve tested that it will not generate any warnings when compiled under Qt 5.7.1, and that a trivial ‘main()’ function that uses it, will only generate warnings, about the fact that the trivial ‘main()’ function also received the standard ‘argc’ and ‘argv’ parameters, but made no use of them. Otherwise, the resulting executable also produced no messages at run-time, while having put one key-value pair into a sample hash table… (:2)

 

/*  File 'Hash_QStr.h'
 * 
 *  Regular use of a QString in an unordered_map doesn't
 * work, because in earlier versons of Qt, there was no
 * std::hash<QString>() specialization.
 * Thus, one could be included everywhere the unordered_map
 * is to be used...
 */

#ifndef HASH_QSTR_H
#define HASH_QSTR_H


#include <unordered_map>
#include <QString>
//#include <QHash>
#include <QChar>

#define PRIME 5351


/*
namespace cust {
	template<typename T>
	struct hash32 {};

  template<> struct hash32<QString> {
    std::size_t operator()(const QString& s) const noexcept {
      return (size_t) qHash(s);
    }
  };
}

*/


namespace cust
{
	inline size_t QStr_Orther(const QChar & mychar) noexcept {
		return ((mychar.row() << 8) | mychar.cell());
	}
	
	template<typename T>
	struct hash64 {};
	
    template<> struct hash64<QString>
    {
        size_t operator()(const QString& s) const noexcept
        {
            size_t hash = 0;

            for (int i = 0; i < s.size(); i++)
				hash = (hash << 4) + hash + QStr_Orther(s.at(i));

            return hash * PRIME;
        }
    };
}


#endif  //  HASH_QSTR_H

 

(Updated 4/12/2021, 8h00… )

 

And, the following was the test-program, which created an actual hash-table, while including my suggested header file above:

 

/*  unordered_map with QString key, for Qt versions
 * that do not already define it...
 */

#include <QDebug>
#include "Hash_QStr.h"

class Has_Hash {

public:
	Has_Hash();
	
	std::unordered_map<QString, QString, cust::hash64<QString>> m_map;
	
};

Has_Hash::Has_Hash() {
	m_map.reserve(1024);
}

int main(int argc, char* argv[]) {
	Has_Hash obj;
	QString ex_key("New Year's Eve");
	
	cust::hash64<QString> myFunctor;
	std::size_t ex_hash = myFunctor(ex_key);
	
	obj.m_map[ex_key] = QString("December 31");
	
	qDebug() << "The hash was:" << ex_hash;
	qDebug() << "The value-string was:" << obj.m_map[ex_key];
	
	return 0;
}

 

(As of 4/10/2021, 15h20…)

In case my reader has a hard time understanding the syntax of the template-structure, that’s a structure with a function-call operator defined. This type of structure is also known as “a functor”, and they sometimes get used, because actually to pass function-pointers around can be impractical, as an earlier posting of mine did point out. Functors can do much of what functions with pointers could do, but in the case of templates, with the big exception that to deploy them, means to be deploying source-code. They don’t tend to be compatible with a preference that programmers sometimes have, ‘to keep the source code closed’, or at least, ‘to deploy compiled binaries’.

What I have above needs to be included as a header file and is thus not compiled prior to the application being compiled.


 

(Update 4/10/2021, 17h15: )

I have just modified the test program, so that it does in fact output something. The newer version above will actually print the hash which was computed, to the debug console. And then, it will use the key, not the separately computed hash, to fetch and print the value-string that the key leads to.


 

(Update 4/10/2021, 19h20: )

I made another little adjustment in the suggested header file, which should reduce the probability of hash-table collisions. This adjustment is now displayed above, in place of the earlier version.

I suppose that there is more to be known about that subject. The way ‘unordered_map’ was implemented, each bucket in the bucket array points to a linked list. This means that aside from some performance issues, there is really no drastic consequence of hash-table collisions. Each element of the linked list states an exact key-value to be matched, after the hash was already matched, as well as the value being sought. Only, if the number of occupied buckets was to approach or exceed half the number of positions in the bucket array, those performance issues would start to rise, because in that case, the lengths of the linked lists would also increase. For that reason, what the STL implementation does is, to grow the bucket array once the number of buckets encroaches on some threshold.

This can cost time, and so in many cases, it is a good idea to set a size larger than the default, before any entries are added. The default is whatever number of bucket-positions, will take up one page of virtual memory. I guess, on a typical 64-bit machine, that would be an array of 512 bucket-positions.

For similar reasons, I also suppose there is no real need, for the hash to be 64 bits long instead of 32.


 

(Update 4/11/2021, 10h10: )

One related piece of information which the reader might actually find useful is, that the ‘unordered_map<>::reserve()’ function does not state, how many bucket positions should be allocated. The argument which is fed-in, is taken to be the expected number of hash-table entries (key-value pairs), so that the algorithm will attempt to derive the ideal bucket-array size that would follow.

(End of Update, 4/11/2021, 10h10. )


 

(Update 4/11/2021, 12h15: )

1:)

One of the facts which I’ve written about elsewhere in my blog was, that there is not just one way to implement a hash table, but there are several, out of which I’d say three are important:

  1. Hash-Tables in which the modulus is a prime number,
  2. Hash-Tables in which the modulus is a power of two,
  3. Hash-Tables in which the buckets do not point to a linked list.

One idea which will improve the performance of hash-tables in general is, that the hash could always be coprime with the modulus, so that as consecutive hashes are computed, they will land in different buckets, as many times as possible. One way to achieve that in fact, is to make the modulus itself prime, so that the hash is unlikely to share a prime factor with it. However, one drawback of implementing a hash-table that way is, that every time it needs to be sized or resized, the algorithm will need to find a suitable prime number as its array size. Also, a modulus that is a prime number is slightly more expensive to compute, than one which is a power of two. One operation actually requires that the remainder function be invoked once, while the other merely requires that some number of Least Significant Bits be masked out of a number. A bit-wise AND operation only takes 1 clock cycle.

And so, I heard through the grapevine somewhere, that hash-tables of type (2) above sometimes replace ones, of type (1) above. If the modulus is just a power of two, then the possibility of a hash which has some power of two in its factorization exists, but at least, the GCD will then probabilistically just be some small power of two. This still reduces the probability of hash-table collisions, though not as well, as hash tables of type (1) above. It seems to reduce this probability best then, if this initial deficiency is made up for, by multiplying the actual hash by a larger prime number. This can reduce computational overhead.

But there is another advantage to using hash tables of type (2) above, which my readers may not be acutely aware of. The operation can and will come at some point, to grow a hash table, that already has allocated buckets. If it’s of type (1) above, all the linked lists that the buckets pointed to, will also need to be traversed, new buckets computed, and new linked lists also computed. To my mind, a hash table of type (2) above can be implemented in such a way, that it can simply be upsized by one power of two, and neither existing buckets, nor existing linked lists be refactored. Existing buckets would still need to be copied over to the newly allocated array, however.

The way that would work is, that as soon as the modulus has changed, existing hashes will map, either to a pre-existing position in the bucket array, or to a new position, that belongs to the extended half of the bucket array. And the reason for which those are the two possible outcomes then, is the fact that the larger, new power of two, will be divisible by the older, smaller power of two.

But, if this approach is taken, then one side effect it would have is, that as part of its meta-data, the algorithm would need to keep track by how many powers of two the bucket array has been made larger than its minimum size, and that that number of additional look-ups into the bucket array need to be made, before an entry is either found with certainty, or found not to be in the bucket array. The hash would first be computed within the largest modulus, and then recomputed for every smaller power of two, that could also once have been a size of the bucket array. Simply finding a Null pointer at a larger modulus would not guarantee, that the corresponding bucket-position resulting from a smaller modulus, was never used, to store a bucket, all stemming from the same hash.

If one dwells on minutia, putting the previously modified hash into the lower modulus, and then performing another lookup, would only need to be done, every time the smaller modulus has indeed become equal to or smaller,  than the previously modified hash.

Yet, since it already happens frequently, that the hash table needs to be requeried when a key is sought, having to make an additional lookup in the bucket array, is not really worse, than having to traverse an added node in a linked list…

Now, I can’t just guess, what the ‘STL’ implementation of a hash table really does in this regard. I can trust that it was implemented properly, and know, that at its default size, its bucket array takes up one page of virtual memory, which is already known to have a size, which is a power of two…

If I allow myself to speculate greatly, I can speculate an ‘unordered_map<>::reserve()’ function which, if the number of allocated buckets is actually zero when it gets called, will resize the array / modulus, and will also reset the maximum number of look-ups to one…


 

Type (3) above also deserves some mention. There exist special situations, in which one cannot have every entry in the bucket array point to a linked list, yet, hash-table collisions can always take place. And so, what can be done instead is, that a bucket position in the array can be visited, and the question can be answered, of whether it corresponds to the exact key-value being sought, if that position isn’t Null. If the bucket is occupied, but does not correspond to the exact key being sought, then a next position in the bucket array can be visited, which can be recomputed quickly, and this can just be repeated, until either a Null position is found, or the sought key is found. It could also just be repeated until the exact key is found, if the fact is known with certainty, that a given key is in fact in the hash table…

This can be described in more complex language, as “the existence of two hashing functions”, one to compute the first bucket-position to be visited, and the other, to compute all the later bucket-positions to be visited.

In practice, the only situation where I really think that type (3) above is used, is when it comes to the swap partition or swap file. In this exact case, the actual choice of hashing functions need to be such, in practice, that hashes that were contiguous – virtual addresses’ page-numbers – need to lead to positions in the bucket array – in the swap partition – that are also contiguous, even if multiple bucket-positions and their back-references were visited in RAM. This is because the way traditional hard drives work, would also make the swap partition slow to access, every time sectors are read, which are not contiguous.

It’s an established fact, that operating systems that possess either swap partitions or swap files that hold (n) pages of memory, will also allocate a table in RAM, that possesses (n) entries, each of which states what virtual address one allocated, physical page in the swap-file holds.

And so, if type (3) of a hash table is being implemented to solve this one problem, then the function that recomputes hashes, basically needs to be kept as simple as an addition. Again, in more complex language, there could be a hashing function that computes a simple product between two numbers, one of which is a constant, and the other of which would be the lookup-number.

But obviously, since nobody on this blog is suggesting, that we re-implement an operating system, we can also limit our curiosity in hash table type (3) above, to purely theoretical curiosity. In fact, this blog isn’t even suggesting, that the reader re-implement the ‘STL’ implementation.


 

(Update 4/11/2021, 14h55: )

2:)

My readers could ask themselves, why I suddenly got the idea, just to throw some suggested code out, on such a narrow, specific subject. Honestly, I was inspired by These Examples, given by other people. But, more importantly, I also felt that the 64-bit example contained two errors:

  • ‘(str++)’ Has unpredictable behaviour, as my compiler warned me, And
  • The use of (5381) cited, does not have the same effect, which my use above does…

Simply starting a summation with a prime number, does not guarantee in any way, that the results will still be multiples of the same prime number. In fact, that use of the prime number was of no significance. Further, the expression ‘((hash << 5) + hash)’ has a well-defined effect. It multiplies the previous value by (33), which is also not prime. ‘((hash << 4) + hash)’ has the effect of multiplying the previous value by (17), which is prime.


 

Also, my readers might be asking themselves, why I did not just add a template specialization into ‘std::hash<>()’. The OP of the BB made the fact known, that his Qt library version did not already do so. My problem with that is, that I have no guarantee, that my readers’ Qt library version does not already do so. If it does, then my reader will get a compile-time error, over the duplication.

So, just because people will try out code without always analyzing it first, I felt it more prudent to put my own template-functor into a separate namespace, as one is usually supposed to do. That way, there would at least be no error messages.


 

(Update 4/11/2021, 18h55: )

I have just made an interesting discovery. Where I was avoiding the exact form of the code given on Stack Overflow, because it was giving me a compiler warning, I have now examined that code more closely, and found that the warning was due to an untrusting way my GNU compiler was coded, against programmers who cannot read idiosyncratic C++. The code would work, but was written in a way, that some programmers would not have understood.

Yet, since one does not wish to generate warnings if doing so can be avoided, I now have a version of that code, which is closer to what the contributor on Stack Overflow would have done, but which does not generate any compiler discontent:

 

/*  File 'Hash_QStr2.h'
 * 
 *  Regular use of a QString in an unordered_map doesn't
 * work, because in earlier versons of Qt, there was no
 * std::hash<QString>() specialization.
 * Thus, one could be included everywhere the unordered_map
 * is to be used...
 */

#ifndef HASH_QSTR2_H
#define HASH_QSTR2_H


#include <unordered_map>
#include <QString>
#include <QChar>

#define PRIME 5351


namespace cust
{
	template<typename T>
	struct hash64_i {};
	
    template<> struct hash64_i<QString>
    {
        size_t operator()(const QString& s) const noexcept
        {
            size_t hash = 0;
            QString::const_iterator str = s.cbegin();
//            const QChar * str = s.cbegin();

            for (int i = 0; i < s.size(); i++) {
				hash = (hash << 4) + hash + ((str->row() << 8) | str->cell());
				str++;
			}

            return hash * PRIME;
        }
    };
}


#endif  //  HASH_QSTR2_H

 

The initial problem seemed to be,  that instead of implementing a true iterator for the ‘QString’ class, what Qt does is, to typedef lines 31 and 32 above to mean exactly the same thing. If the post-incrementation which I now put by itself, on line 36, is left in parentheses instead, and simultaneously dereferenced, this warning results, even though the code does what it’s supposed to do either way.

 


 

(Update 4/12/2021, 8h00: )

Sometimes, it’s possible already to have a working implementation of an algorithm, but to want to optimize that algorithm, so that it consumes the fewest CPU cycles to complete its job. In reality, a Hash Table – aka, an ‘unordered_map’ – is only used, if the number of key-value pairs is far greater, than what can be managed efficiently using an associative array which, like the STL ‘map’ object, is based on a mere binary search.

Yet, even at that, I’m actually pushing my speculations about how an ‘unordered_map’ is going to be used in practice, by suggesting in my hypothetical code, that it will have 1024 key-value pairs. There may be few actual cases where this is done. However, if it is done, then it might also be worthwhile to suggest an implementation, which has been optimized as stated. And I find, that the following implementation will consume the fewer CPU cycles:

 

/*  File 'Hash_QStr3.h'
 * 
 *  Regular use of a QString in an unordered_map doesn't
 * work, because in earlier versons of Qt, there was no
 * std::hash<QString>() specialization.
 * Thus, one could be included everywhere the unordered_map
 * is to be used...
 */

#ifndef HASH_QSTR3_H
#define HASH_QSTR3_H

#include <unordered_map>
#include <QString>
#include <QChar>


#define PRIME 5351


namespace cust
{
	template<typename T>
	struct hash64_p {};
	
    template<> struct hash64_p<QString>
    {
        size_t operator()(const QString& s) const noexcept
        {
            size_t hash = 0;
            QString::const_iterator starts = s.cbegin();
//            const QChar * starts = s.cbegin();
			QString::const_iterator ends = s.cend();

            while (starts < ends) {
				hash = (hash << 4) + hash + ((starts->row() << 8) | starts->cell());
				starts++;
			}

            return hash * PRIME;
        }
    };
}


#endif  //  HASH_QSTR3_H

 

One thing I have done with this implementation is, also to test whether it malfunctions if given a key, which is an empty ‘QString’. According to my Math, this results in a hash of zero. But, according to my code, it might not be 100% reliable, how that code will behave. What I found is, that such a blank key-value, still leads consistently to the value-string, which the hash table is supposed to evoke from it.

 

Enjoy,

Dirk

 

Print Friendly, PDF & Email

One thought on “Creating a C++ Hash-Table quickly, that has a QString as its Key-Type.”

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>