Hash Functions
A hash function is any function that can be used to map data of arbitrary size to fixed-size values.
Values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table.
Use of a hash function to index a hash table is called hashing or scatter storage addressing.[1]
We will face the properties and the logic between hash functions.
Properties we covered
Uniformity
A good hash function should map the expected inputs as evenly as possible over its output range.
That is, every hash value in the output range should be generated with roughly the same probability!
If a typical set of
SUHA - Simple Uniform Hashing Assumption
It is a basic assumption that facilitates the mathematical analysis of hash tables.
- This assumption generalizes the details of the hash function and allows for certain assumptions about the stochastic system.
The assumption states that a hypothetical hashing function will evenly distribute items into the slots of a hash table. Moreover, each item to be hashed has an equal probability of being placed into a slot, regardless of the other elements already placed.
This means the assumption of uniform hashing, given a hash function
How do we hash? - Closed Addressing
Let's consider a lucky case:
- There are
real numbers which represent the keys and hash table of size . These keys are randomly and uniformly distributed (like i.i.d. random variables) inside the interval - The hash function
satisfies the SUHA.
Usually, hash functions assume the keys are natural numbers belonging to
Example: for a string "CLRS", we encode it through the ASCII values (128 total values):
The resulting positional notation would be:
Division Hashing
A standard technique is to use a modulo function on the key, by selecting a divisor
Pros:
- It is easy to analyze
- This technique works well in practice because many key sets are sufficiently random already, a and the probability that a key set will be cyclical by a large prime number is small.
Cons:
- One drawback is that it won't break up clustered keys.
- For example, the keys 123000, 456000, 789000, etc. modulo 1000 all map to the same address.
- We need to choose carefully
- Avoid choosing
as a power of , - The result would be considering just a portion of the key
- It is best for the hash function to depend on all the bits of the key
- If
is a string, interpreted on the base , it is a bad idea to choose - The permutation of the characters in
does not change the value of the hash function
- The permutation of the characters in
- Avoid choosing
Multiplicative Hashing
Given an integer key
- We get a constant
such that - We calculate
- We extract the fractional part
Pros:
is not a critical value anymore, it works well with any value can take - Typically
is a power of ,
- Typically
- Knuth saw that it works well with
To simplify the computing of the function
: the length of a memory word , must be of the length of memory word , is an arbitrary integer that must be in , must be , a constant, must be
Now
- Since
- Where
is the fractional part
- Where
- We get that
- The less significant part of the word created by
- The less significant part of the word created by

So, this justifies $h(k) = \lfloor m \cdot (kA \text{ mod } 1) \rfloor = \lfloor 2^{p} \cdot (kA \text{ mod } 1) \rfloor $
- Our hash function returns, indeed, the p-bits more significant of the less significant part of the product between
- It means we take the p-bits significant bits of
The problem is, if we receive many keys mapped to the same cell, our performances heavily drop.
- It is the case with
keys where
We can avoid this through universal hashing
Universal Hashing
Universal Hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property.
This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary.[4]
How do we hash? - Open Addressing
Since all the possible elements are in the table, the table must be as large as
For this reason when probing we always start from the index
To determine which slots to probe, we extend the hash function to include the probe number (starting from 0) as a second input.
To sum up, the hash function takes into account the key being inserted
With open addressing, we require that for every key
be a permutation of all the hash-table's indexes
So
represents the position of the key after failed inspections.
UHA - Uniform Hashing Assumption
In our analysis, we assume uniform hashing at every iteration: at each iteration, every key
has the same probability to return as a probe sequence any of the permutations of the indices $\langle 0, 1, \ldots , m-1 \rangle $ of our hash table.
For example:
is uniformly distributed on the cells is uniformly distributed on the cells is uniformly distributed on the cells
At each iteration the returned value of the hash function is the probe sequence: an entire sequence of inspections.
True uniform hashing is difficult to implement, however, and in practice suitable approximations (such as double hashing, defined below) are used. None of these techniques fulfills the assumption of uniform hashing, however, since none of them is capable of generating more than m2 different probe sequences (instead of the
Every probe method guarantees the returned permutation is a permutation of hash-table's slots
Linear Probing
Given an auxiliary ordinary hash function
The method of linear probing uses the hash function: `
We cannot alter the order of the keys we want to insert, this completely alters the outcome of the function.
Example: Inserting the following sequence in order:
- $h'(k)= k \text{ mod } m $

Pros:
- Linear probing is easy to implement
Cons:
- There are only
distinct probe sequences per key, - Observation: The first inspected determines the entire probe sequence.
- While ideally we would like to have
- It suffers from a problem known as primary clustering.
- There are long queues of occupied slots which increase the average search time and tend to get longer with time.
- Clusters arise because an empty slot preceded by
full slots gets filled next with probability . - The probability to occupy the cell
is .
- The probability to occupy the cell
Quadratic Probing
Quadratic probing uses a hash function of the form:
Where
is an auxiliary hash function are positive auxiliary constants - Possible trials
Pros:
- Better performances than linear probe
- Only the keys with the same hash get conflicts
Cons:
must not have arbitrary values, - We need to generate all the indices of the table
- As previously stated,
must be a permutation of the indices in the table. - For instance, some good options would be
- The first position determines the entire sequence, again the count of permutations is
and not - It suffers from a problem known as secondary clustering.
- Secondary clusters are generated when, given two distinct keys
, the auxiliary hash function returns . - Namely, they are mapped to the same cell, which implies their probe sequence turns out to be the same.
- Secondary clusters are generated when, given two distinct keys
Double Hashing
Here we combine two hashing auxiliary functions:
Where:
- Both
are auxiliary hash functions. determines the starting position determines the_step_, pace of probing (distance between keys). - How far we are going to insert the next key
The initial probe goes to position

Thus, unlike the case of linear or quadratic probing, the probe sequence here depends in two ways upon the key
Here's another example:

Ensuring correctness
The value of
How do we ensure this?
- A convenient way to ensure this condition is to let
be a power of and to design so that it always produces an odd number.
- A convenient way to ensure this condition is to let
- Another way is to let
be prime and to design so that it always returns a positive integer smaller than . - Example:
- Another way is to let
Pros:
- Double hashing uses
sequences of probing since every possible pair produces a distinct sequence of probing. - We get closer to the uniform distribution
- We avoid secondary clusters
Cons:
- Although values of
other than primes or powers of could in principle be used with double hashing, in practice it becomes more difficult to efficiently generate in a way that ensures that it is relatively prime to , in part because the relative density of such numbers may be small