C++ Unordered Map Under the Hood

Josh Segal
4 min read · May 13, 2021


The std::unordered_map in the C++ standard library is a powerful data structure offering insertion, deletion, and lookup in O(1) amortized time. std::unordered_map is implemented as a hash map (also called a hash table; the two terms mean the same thing). This article describes how the C++ standard library implements unordered_map, how it achieves those time complexities, and when to use it.
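
Before looking inside, here's roughly how the container is used in practice (the keys and values below are purely illustrative):

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> ages;   // hypothetical example data
    ages["John Smith"] = 27;                     // insert (average O(1))
    ages["Lisa Smith"] = 31;
    auto it = ages.find("John Smith");           // lookup (average O(1))
    if (it != ages.end()) {
        std::cout << it->first << " is " << it->second << "\n";
    }
    ages.erase("Lisa Smith");                    // deletion (average O(1))
}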

Member Variables

class unordered_map {
private:
    list<pair<int,int>> *buckets;   // array of buckets; each bucket is a linked list of key-value pairs
    int bucket_size;                // number of buckets in the array
    int total_elements;             // number of key-value pairs currently stored
    float max_load_factor;          // max allowed ratio of total_elements to bucket_size
};

std::unordered_map is essentially an array of linked lists of pairs. They're pairs because each node stores a key and a value, both integers in our simplified example. bucket_size is the size of the array, total_elements is the current number of key-value pairs, and max_load_factor is the maximum allowed ratio of total_elements to bucket_size. If the load factor exceeds this limit, the hash table is rehashed. The diagram linked below (from Wikipedia's article on separate chaining) shows how the buckets are laid out.

https://en.wikipedia.org/wiki/Hash_table#Separate_chaining

Keys are hashed to an index in the array, aka a bucket. In each bucket is the head of a linked list. For example, John Smith and Lisa Smith happen to hash to the same bucket, 152. Thus, they both get added to the linked list for bucket 152. When keys map to the same bucket, this is called a collision.
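
As a rough sketch (the bucket count and key strings here are just for illustration), a key's bucket is simply its hash modulo the number of buckets, and two keys collide whenever those results coincide:

#include <functional>
#include <iostream>
#include <string>

int main() {
    const std::size_t bucket_count = 256;   // assumed array size for illustration
    std::hash<std::string> hasher;
    // Two different keys land in the same bucket when their hashes are
    // congruent modulo the bucket count -- that's a collision.
    std::cout << hasher("John Smith") % bucket_count << "\n";
    std::cout << hasher("Lisa Smith") % bucket_count << "\n";
}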

Functions

int hash(int key) {
    int hashed_key = strongHashingFunction(key);   // any strong, well-distributed hash
    return hashed_key % bucket_size;               // map the hash to a bucket index
}

hash takes in a key, runs it through a strong hash function (which could be user-provided), then mods the result by bucket_size to get the bucket index. It's very important to use a strongHashingFunction that distributes its outputs evenly and unpredictably. If it doesn't, you will get more collisions, which can degrade the time complexity significantly.
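
The real std::unordered_map lets you supply your own hasher as an extra template parameter when std::hash isn't a good fit for your key type. Below is a sketch using a simple FNV-1a-style hasher; the struct name and constants are just one common choice, not anything the standard mandates:

#include <cstdint>
#include <string>
#include <unordered_map>

struct Fnv1aHash {
    std::size_t operator()(const std::string& s) const {
        std::uint64_t h = 14695981039346656037ull;   // FNV-1a offset basis
        for (unsigned char c : s) {
            h ^= c;                                  // mix in one byte
            h *= 1099511628211ull;                   // FNV-1a prime
        }
        return static_cast<std::size_t>(h);
    }
};

int main() {
    // The third template parameter swaps in the custom hasher.
    std::unordered_map<std::string, int, Fnv1aHash> population;
    population["Boston"] = 675647;                   // illustrative value
}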

list<pair<int,int>>::iterator find(int key){
    auto& bucket = buckets[hash(key)];
    // Walk the bucket's linked list looking for a matching key
    for(auto it = bucket.begin(); it != bucket.end(); ++it){
        if(it->first == key){
            return it;
        }
    }
    return bucket.end();   // not found
}

find hashes the key to a bucket and scans that bucket's linked list, returning a list iterator to the matching pair if it exists and the bucket's end() iterator otherwise.

void insert(int key, int val){
    auto it = find(key);
    if(it != buckets[hash(key)].end()){
        it->second = val;                        // key already present: overwrite the value
        return;
    }
    buckets[hash(key)].push_back({key,val});     // otherwise append to the bucket's list
    total_elements++;
    rehashIfNeeded();                            // grow and rehash if the load factor is exceeded
}

insert first checks whether the key already exists. If so, it overwrites that key's value. Otherwise, it appends the new key-value pair to the bucket's linked list, bumps total_elements, and, if the load factor is now exceeded, grows the bucket array and rehashes everything.
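
One caveat when comparing this sketch to the real container: std::unordered_map::insert does not overwrite an existing key's value, while operator[] (and C++17's insert_or_assign) does, which is the behavior the toy insert above mimics:

#include <cassert>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> m;
    m.insert({"key", 1});       // inserts a new pair
    m.insert({"key", 2});       // key already present: no overwrite
    assert(m["key"] == 1);
    m["key"] = 2;               // operator[] (or insert_or_assign) does overwrite,
    assert(m["key"] == 2);      // matching the sketch's insert() above
}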

void erase(int key){                    // "delete" is a reserved word in C++, so call it erase
    auto it = find(key);
    if(it != buckets[hash(key)].end()){
        buckets[hash(key)].erase(it);   // O(1) removal given the iterator
        total_elements--;
    }
}

erase looks the key up by calling find and, if the key is present, removes it from its bucket.
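
For comparison, the real container's erase can take a key directly and reports how many elements it removed:

#include <cassert>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> m{{"a", 1}, {"b", 2}};
    std::size_t removed = m.erase("a");   // erase by key: returns how many were removed
    assert(removed == 1 && m.count("a") == 0);
    assert(m.erase("missing") == 0);      // erasing an absent key is a no-op
}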

void rehashIfNeeded(){
    if((float)total_elements / bucket_size <= max_load_factor){
        return;                          // load factor still within bounds
    }
    int old_bucket_size = bucket_size;
    bucket_size *= 2;                    // hash() now mods by the new, doubled size
    auto new_buckets = new list<pair<int,int>>[bucket_size];
    for(int i = 0; i < old_bucket_size; i++){
        for(auto& kv : buckets[i]){
            new_buckets[hash(kv.first)].push_back(kv);   // re-place every pair
        }
    }
    delete[] buckets;
    buckets = new_buckets;
}

rehashIfNeeded first checks whether max_load_factor has been exceeded. If so, it doubles the size of the array and rehashes every element into the new one. The reason we rehash is to keep the chance of collisions low. max_load_factor defaults to 1 in the standard library implementation: if total_elements equals bucket_size, there is an average of one element per bucket, so collisions can happen, but it's unlikely that many of them pile up in the same bucket. However, if your hash function is expensive or you have a lot of elements to copy over, you may want to give max_load_factor a bit more slack.
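
The real container exposes these knobs directly. For example, you can loosen max_load_factor, or reserve capacity up front to avoid rehashes during a bulk insert:

#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    std::cout << m.max_load_factor() << "\n";   // 1.0 by default
    m.max_load_factor(4.0f);                    // tolerate ~4 elements per bucket
    m.reserve(1000);                            // pre-size the buckets for 1000 elements,
                                                // avoiding rehashes during the loop below
    for (int i = 0; i < 1000; ++i) m[i] = i;
    std::cout << m.load_factor() << " "         // current elements-per-bucket ratio
              << m.bucket_count() << "\n";
}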

Complexity Analysis

find hashes in O(1), then traverses the bucket looking for the key. Assuming a good load factor and a good hash function, the number of elements in that linked list (the collisions) is constant; in fact, on average it should be no more than the max_load_factor. Thus, find is O(1) on average.

erase calls find in O(1) and gets a list iterator. std::list is implemented as a doubly linked list in C++, and removing a node from a doubly linked list through an iterator is an O(1) operation. Thus, erase is O(1) on average.

insert is a bit trickier. Most of the time insert calls find in O(1) and does an O(1) modification or append, but every so often we have to rehash everything, which is an O(n) operation where n is total_elements. So how do we measure the time complexity when it's mostly O(1) and rarely O(n)? Do we just say O(n)? No, we use amortized complexity.

Let's say bucket_size = 99 and max_load_factor = 1. We insert 99 elements in O(1) time each, then the 100th element triggers a rehash costing O(100). The amortized cost is O(2) per insert, because (100 for rehashing + 1 for each of the 100 inserts) = 200 units of work spread over 100 completed inserts. It's still an average in the everyday English sense, just a different flavor of average: over time we know we'll hit a huge bump in cost, but that's okay because the cost was low for so long beforehand that things even out. Thus, insert is amortized O(1).
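
You can actually watch this amortization on the real container by noting when bucket_count changes during a run of inserts (the growth policy is implementation-defined, so the exact sizes printed will vary):

#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    std::size_t last = m.bucket_count();
    for (int i = 0; i < 1000; ++i) {
        m[i] = i;
        if (m.bucket_count() != last) {          // a rehash just happened
            std::cout << "rehash at size " << m.size()
                      << ", buckets now " << m.bucket_count() << "\n";
            last = m.bucket_count();
        }
    }
    // The expensive rehashes get rarer as the table grows, which is why
    // the per-insert cost averages out to O(1).
}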

Using an unordered_map

You should use an unordered_map when you don't need the keys kept in sorted order. In practice, prefer unordered_map over the (ordered) map in C++ unless you have a reason to preserve ordering: map does every lookup and insertion in O(log n), so you're usually better off with unordered_map. For the full reference of std::unordered_map visit https://en.cppreference.com/w/cpp/container/unordered_map
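
As a quick side-by-side sketch of that trade-off (word counting here is just an illustrative workload):

#include <map>
#include <string>
#include <unordered_map>

int main() {
    // O(log n) per operation, keys kept sorted (a balanced tree underneath)
    std::map<std::string, int> sorted_counts;
    // O(1) average per operation, no ordering guarantees
    std::unordered_map<std::string, int> counts;
    for (std::string w : {"the", "map", "the"}) {
        ++sorted_counts[w];
        ++counts[w];
    }
    // Iterating sorted_counts visits keys in sorted order ("map", then "the");
    // iterating counts visits them in an unspecified order.
}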


Written by Josh Segal

C/C++ | Computer Systems | Low Latency | Distributed Systems | Computer Networks | Operating Systems
