C++ Unordered Map Under the Hood
The std::unordered_map in the C++ standard library is a powerful data structure offering insertion, deletion, and lookup in O(1) average time (amortized, in the case of insertion). std::unordered_map is implemented as a hash table (also called a hash map; the two terms mean the same thing). This article describes how the C++ standard library implements the unordered_map, how it achieves those time complexities, and when to use it.
Member Variables
class unordered_map {
private:
    // Array of buckets; each bucket is a linked chain of
    // key-value pairs (both ints in this simplified sketch).
    list<pair<int,int>>* buckets;
    int bucket_size;        // number of buckets in the array
    int total_elements;     // number of key-value pairs stored
    float max_load_factor;  // max allowed total_elements / bucket_size
};
std::unordered_map is essentially an array of linked lists of pairs. They're pairs because each node stores a key-value pair, both integers in our case. bucket_size is the size of the array, and total_elements is the current number of key-value pairs. max_load_factor is the maximum allowed ratio of total_elements to bucket_size; if the ratio exceeds it, the hash table is rehashed. Below is a picture representing buckets.
Keys are hashed to an index in the array, aka a bucket. In each bucket is the head of a linked list. For example, John Smith and Lisa Smith happen to hash to the same bucket, 152. Thus, they both get added to the linked list for bucket 152. When keys map to the same bucket, this is called a collision.
Functions
int hash(int key) {
    int hashed_key = strongHashingFunction(key);  // e.g. std::hash
    return hashed_key % bucket_size;              // map to a bucket index
}
hash takes in a key, runs it through a strong hashing function (which could be custom), then mods the result by bucket_size to get the bucket index. It's very important to use a good strongHashingFunction that distributes its outputs evenly and unpredictably. If it doesn't, you will get more collisions, which can degrade the time complexity significantly.
list<pair<int,int>>::iterator find(int key) {
    auto& bucket = buckets[hash(key)];
    // Walk the chain in this bucket looking for the key.
    for (auto it = bucket.begin(); it != bucket.end(); ++it) {
        if (it->first == key) {
            return it;
        }
    }
    return bucket.end();  // key not present in this bucket
}
find searches the key's bucket and returns a list iterator to the pair if it exists; otherwise it returns the bucket's end() iterator.
void insert(int key, int val) {
    auto it = find(key);
    if (it != buckets[hash(key)].end()) {
        it->second = val;  // key already present: update its value
        return;
    }
    buckets[hash(key)].push_back({key, val});
    total_elements++;
    rehashIfNeeded();  // grow the table if the load factor is exceeded
}
insert first checks whether the key already exists. If so, it updates that key's value. Otherwise, it pushes the new pair onto the bucket's linked list and, if needed, expands the buckets and rehashes everything.
void erase(int key) {
    // Named erase (as in std::unordered_map) because "delete" is a
    // reserved word in C++ and cannot be a function name.
    auto it = find(key);
    if (it != buckets[hash(key)].end()) {
        buckets[hash(key)].erase(it);
        total_elements--;
    }
}
erase locates the key by calling find and, if found, unlinks the pair from its bucket's list.
void rehashIfNeeded() {
    // Cast to float to avoid integer division truncating the ratio.
    if ((float)total_elements / bucket_size <= max_load_factor) {
        return;  // load factor still acceptable
    }
    bucket_size *= 2;
    auto new_buckets = new list<pair<int,int>>[bucket_size];
    // Re-insert every pair: hash() now mods by the doubled bucket_size,
    // so elements generally land in different buckets.
    for (int i = 0; i < bucket_size / 2; i++) {
        for (auto& kv : buckets[i]) {
            new_buckets[hash(kv.first)].push_back(kv);
        }
    }
    delete[] buckets;
    buckets = new_buckets;
}
rehashIfNeeded first checks whether max_load_factor has been exceeded. If so, it doubles the size of the array and rehashes everything into the new one. The reason we rehash is to decrease the chance of collisions. max_load_factor is set to 1.0 by default in the standard library. If total_elements equals bucket_size, there is an average of one element per bucket, so collisions can happen, but it's unlikely that many of them will pile up in the same bucket. However, if your hashing function is expensive or you have a lot of elements to copy over, you may want to give max_load_factor a bit more slack.
Complexity Analysis
find hashes in O(1), then traverses the bucket looking for the key. Assuming a good load factor and a good hash function, the number of elements in this linked list is constant; in fact, on average, the number of elements per bucket (collisions) should be no more than the max_load_factor. Thus, find is O(1) on average.
erase calls find in O(1) and gets a list iterator. C++ lists are implemented as doubly linked lists, and deleting a node from a doubly linked list is an O(1) operation. Thus, erase is O(1) on average.
insert is a bit trickier. Most of the time insert calls find in O(1) and does an O(1) modification or append, but every so often we have to rehash everything, which is an O(n) operation where n is total_elements. So how do we measure a time complexity that is mostly O(1) and rarely O(n)? Do we just say O(n)? No, we use amortized complexity.
Let's say bucket_size = 99 and max_load_factor = 1. We insert 99 elements in O(1) time each, then the 100th element triggers a rehash costing O(100). The amortized cost is (100 for rehashing + 1 for each insert) / 100 inserts = about 2 operations per insert, which is still constant. It's still an average in the everyday sense of the word, just a different flavor: over time we know we're going to hit a huge bump in cost, but that's okay because the cost was low for so long beforehand that things even out. Thus, insert is amortized O(1).
Using an unordered_map
You should use an unordered_map when you don't care about keeping the keys in sorted order. I'd say you should prefer an unordered_map over an (ordered) map in C++ unless you have a reason to preserve order. map comes with a big complexity hit (O(log n) per operation instead of O(1) on average), so you're usually better off with the unordered_map. For the full reference of std::unordered_map, visit https://en.cppreference.com/w/cpp/container/unordered_map