Hash Table

This section concerns the Hash Table data structure and the design of the hash functions it relies on. A hash table can be seen as an improvement over the Direct-Access Table in operational efficiency.

Direct-Access Table

As a direct realization of the Symbol Table abstraction, the Direct-Access Table is satisfactory when the universe 𝒰 of keys is small, owing to its implementation simplicity and effectiveness.

Formal definition:

Suppose m distinct keys are taken from 𝒰 ⊆ {0, 1, ..., n-1}; set up an array T[0, 1, ..., n-1] with one slot per possible key:

wherein every position in array T is called a slot. Such a table offers Θ(1) access time, but if the range of keys is too large, e.g. 64-bit numbers with roughly 1.8 × 10¹⁹ distinct values, the direct-access table is overwhelmed in memory usage (and in the time needed to initialize all its slots). It is a poor choice at Internet data scale. Thus, the Hash Table is introduced.
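
For illustration, a minimal direct-access table might look like the following sketch (the class name, value type, and constructor are assumptions of this example, not a prescribed interface):

```java
// A minimal direct-access table over the key universe {0, 1, ..., universeSize-1}.
// Each possible key owns exactly one slot, so every operation is Θ(1),
// but memory is proportional to the universe size, not to the number of entries.
public class DirectAccessTable<V> {
    private final Object[] slots;

    public DirectAccessTable(int universeSize) {
        slots = new Object[universeSize];
    }

    public void insert(int key, V value) { slots[key] = value; }

    public void delete(int key) { slots[key] = null; }

    @SuppressWarnings("unchecked")
    public V lookup(int key) { return (V) slots[key]; }
}
```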

Design Idea

A Hash Table is an extended array structure, dynamically resizable on demand, that associates keys with values. It is designed to combine the good qualities of the [List](https://en.wikipedia.org/wiki/List_(abstract_data_type)) and Array data structures.

Suppose we are given an evolving set 𝒮 drawn from a global universe 𝒰.

Optimally, a Hash Table possesses both small storage space, Ο(|𝒮|), and swift entry access, Ο(1).

In order to support the three distinct operations of a Hash Table (INSERTION of a new record, DELETION of an existing record, and LOOKUP of a specified record), the table is built in several steps:

  1. decide n, the number of "buckets" for storing entries (possibly dynamically extensible if |𝒮| varies);

  2. choose a hash function 𝒽: 𝒰 → {0, 1, 2, ..., n-1};

  3. store entry 𝓍 in array A[𝒽(𝓍)].

Note: if the record does not exist, INSERTION of a new entry will complete but DELETE will fail; if the record exists, INSERTION will overwrite the existing value and DELETE will succeed.
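
A minimal sketch of these three steps follows (Java; the modular hash, String values, and table size are assumptions for illustration; collisions are deliberately ignored here and handled in the next section):

```java
// Skeleton of the three build steps: choose n buckets, pick h, store x in A[h(x)].
// Collisions are ignored: insert simply overwrites whatever occupies the bucket.
public class NaiveHashTable {
    private final String[] buckets;   // step 1: n buckets
    private final int n;

    public NaiveHashTable(int n) {
        this.n = n;
        this.buckets = new String[n];
    }

    // step 2: hash function h: U -> {0, 1, ..., n-1}
    private int hash(int key) {
        return Math.floorMod(key, n);
    }

    // step 3: store entry x in A[h(x)]; overwrites any existing record
    public void insert(int key, String value) { buckets[hash(key)] = value; }

    public String lookup(int key) { return buckets[hash(key)]; }

    public void delete(int key) { buckets[hash(key)] = null; }
}
```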

What if the "bucket" location is already occupied when inserting a new entry? In other words, a hash collision happens. Generally, there are two approaches to resolving this situation: (separate) chaining and open addressing.

Hash Collision

Note: hash collisions are unavoidable in general; by the pigeonhole principle, no hash function can avoid collisions once the number of possible keys exceeds the number of buckets, and no "perfect" collision-free hash function for arbitrary insertions has ever been discovered.

Load Factor

Define a load factor (α) as a hashing performance metric for hash table Τ:

α = n / m, where n is the total number of elements and m is the number of buckets for storing them. The value of α can be greater than, equal to, or less than 1 (e.g. 1,000 elements in 100 buckets give α = 10); when α grows far beyond 1, the hash table is overloaded and needs to expand itself by the techniques introduced in table doubling below.

The average-case performance of hashing depends on how well the function 𝒽 distributes the n keys among the m slots. Hence, assuming that the bucket position each element hashes into is independent of all others and that each of the m slots is equally likely to be chosen, we get

simple uniform hashing:

For ј = 0, 1, ..., m-1, denote the length of the list Τ[ј] by nⱼ, so that n = n₀ + n₁ + ... + nₘ₋₁. The expected value of nⱼ is E[nⱼ] = α = n/m.
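
This value follows from linearity of expectation: writing Xᵢⱼ for the indicator that key kᵢ lands in slot ј, simple uniform hashing gives Pr{𝒽(kᵢ) = ј} = 1/m, hence:

```latex
E[n_j] = E\left[\sum_{i=1}^{n} X_{ij}\right]
       = \sum_{i=1}^{n} \Pr\{h(k_i) = j\}
       = \sum_{i=1}^{n} \frac{1}{m}
       = \frac{n}{m} = \alpha
```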

The following content builds on this assumption.

(Separate) Chaining

In the chaining method, all entries with the same hash value are chained together in a linked list. Specifically, each bucket of the hash table holds either a pointer to the linked list (singly or doubly linked) of all entries hashed to that bucket, or NIL if no entry is present.

LOOKUP costs Θ(1 + α) in the chaining version of a Hash Table. INSERTION takes worst-case Ο(1) time, while DELETION may take up to Ο(n) if all entries hash into one bucket; with a doubly linked list, DELETION of an already-located entry speeds up to Ο(1).
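
A minimal chaining sketch (Java 16+ for the record syntax; the modular hash, integer keys, and String values are assumptions of the example):

```java
import java.util.LinkedList;

// Separate chaining: each bucket holds a linked list of (key, value) entries.
public class ChainedHashTable {
    private record Entry(int key, String value) {}

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int m) {
        buckets = new LinkedList[m];
        for (int i = 0; i < m; i++) buckets[i] = new LinkedList<>();
    }

    private int hash(int key) { return Math.floorMod(key, buckets.length); }

    // Worst-case Ο(1): prepend to the chain (assumes the key is not already present).
    public void insert(int key, String value) {
        buckets[hash(key)].addFirst(new Entry(key, value));
    }

    // Θ(1 + α) expected: scan the chain of the hashed bucket.
    public String lookup(int key) {
        for (Entry e : buckets[hash(key)])
            if (e.key() == key) return e.value();
        return null;
    }

    // Up to Ο(n) worst case if every entry hashed into this one bucket.
    public boolean delete(int key) {
        return buckets[hash(key)].removeIf(e -> e.key() == key);
    }
}
```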

Open Addressing

While the (separate) chaining method can operate with fewer buckets than the total number of stored elements, open addressing must keep enough spare space among the hash table buckets (m ⩾ n, α ⩽ 1).

In other words, all elements are stored directly in the buckets, one entry per bucket. While such a hash table can fill up entirely, open addressing saves the memory otherwise spent on pointers, which can instead be used to allocate more buckets, reducing hash collisions and improving search speed.

In order to INSERT an element into the hash table, open addressing performs successive bucket examinations, each termed a probe, until an empty bucket is located.

Probe sequence: for every key k, the INSERTION operation follows a probe sequence of length m:

< h(k,0), h(k, 1), ..., h(k, m-1) >

wherein the m probe positions are all distinct, so the sequence is a permutation of the bucket indices. There are three major probing techniques, which produce the probe sequence in different manners.

Note: when performing the DELETE operation, the designated entry should be labeled "deleted" (a tombstone) rather than emptied, to keep later operations from failing. (E.g. suppose k2 was inserted after its probe passed the occupied location of k1; if deleting k1 restored a truly empty bucket, there would be no way to know that k2 lies further along the sequence, and a later SEARCH or DELETE of k2 would stop early and fail. The sketch after this list illustrates the technique.)

  • Linear Probing

Given an auxiliary hash function 𝒽': 𝒰 → {0, 1, ..., m-1}, linear probing uses:

𝒽(k, i) = (𝒽'(k) + i) mod m

that is, INSERTION first probes bucket Τ[𝒽'(k)], then Τ[𝒽'(k) + 1], ..., wrapping around until Τ[𝒽'(k) - 1]; note that the initial probe position determines the entire sequence, so only m distinct probe sequences are possible.

  • Quadratic Probing

Under the same premise as linear probing, quadratic probing adopts a better strategy:

𝒽(k, i) = (𝒽'(k) + 𝒸₁ ⋅ i + 𝒸₂ ⋅ i²) mod m

wherein 𝒸₁ and 𝒸₂ are positive constants and i = 0, 1, ..., m-1;

instead of occupying long runs of adjacent buckets as linear probing tends to, quadratic probing leads to a more dispersed distribution of elements.

  • Double Hashing

This is known as the best strategy of the three in the open addressing method;

𝒽(k, i) = (𝒽1(k) + i ⋅ 𝒽2(k)) mod m

wherein both 𝒽₁ and 𝒽₂ are auxiliary hash functions; in order for the probe sequence to be able to examine the whole hash table, the value of 𝒽₂(k) must be relatively prime to the table size m. Double hashing yields Θ(m²) distinct probe sequences, one per (𝒽₁(k), 𝒽₂(k)) pair, approaching ideal uniform hashing, in which each key may receive any of the m! possible probe-sequence permutations.
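
As a concrete illustration of probing and of the tombstone deletion mentioned above, here is a hedged sketch of double hashing over bare integer keys (the choice of m as a power of two with 𝒽₂ always odd, which makes the step relatively prime to m, is one conventional setup, not the only one; the sentinel value is an assumption of the example):

```java
// Open addressing with double hashing and tombstone deletion.
// m is a power of two and h2 always returns an odd number, so every probe
// sequence visits all m buckets (an odd step is relatively prime to 2^p).
public class DoubleHashingTable {
    private static final Integer TOMBSTONE = Integer.MIN_VALUE; // "deleted" label

    private final Integer[] slots;
    private final int m;

    public DoubleHashingTable(int m) {       // m must be a power of two here
        this.m = m;
        this.slots = new Integer[m];
    }

    private int h1(int k) { return Math.floorMod(k, m); }
    private int h2(int k) { return 2 * Math.floorMod(k, m / 2) + 1; } // always odd

    private int probe(int k, int i) { return Math.floorMod(h1(k) + i * h2(k), m); }

    public boolean insert(int key) {
        for (int i = 0; i < m; i++) {
            int j = probe(key, i);
            if (slots[j] == null || slots[j].equals(TOMBSTONE)) {
                slots[j] = key;                // reuse empty slots and tombstones
                return true;
            }
            if (slots[j] == key) return true;  // already present
        }
        return false;                          // table full
    }

    public boolean lookup(int key) {
        for (int i = 0; i < m; i++) {
            int j = probe(key, i);
            if (slots[j] == null) return false; // a truly empty slot ends the search...
            if (slots[j] == key) return true;
        }
        return false;
    }

    public boolean delete(int key) {
        for (int i = 0; i < m; i++) {
            int j = probe(key, i);
            if (slots[j] == null) return false;
            if (slots[j] == key) {
                slots[j] = TOMBSTONE; // ...but a tombstone does not, so keys inserted
                return true;          // past this slot remain reachable
            }
        }
        return false;
    }
}
```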

Hash Function

What makes a "good" hash function? It should spread data out uniformly across the hash table, and it should be simple to compute and evaluate.

A common practice is to transform the universe of keys into natural numbers {0, 1, 2, ...} and then process those numbers to construct the hash value.

In designing the hash function 𝒽 for a key k, there are generally three methods, of which the first two are heuristic while the last is a randomized technique.

Division Method

By taking modular arithmetic in the following form, the key k is hashed into one of the m buckets of the hash table:

𝒽(k) = k mod m, where m ≠ 2ᵖ, p a natural number.

Note: it is wise to choose a prime number not too close to a power of 2 as the value of m; the method is practical, though finding such a prime can be inconvenient.

Multiplication Method

There are four steps to construct a multiplication hashing method:

  1. multiply k by a constant A (0 < A < 1), where k fits in a single machine word of w bits;

  2. extract the fractional part of the previous result;

  3. multiply m by the previous result;

  4. round down to an integer to obtain the hash value.

𝒽(k) = ⌊ m ⋅ (k ⋅ A mod 1) ⌋

Generally, this method outperforms the division method. If m is chosen as a power of 2 (m = 2ᵖ), the hash is especially easy to compute with bit shifts, as sketched below.
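
Both heuristics fit in a few lines. The multiplication variant below uses the word-size trick with w = 32 bits and Knuth's suggested A ≈ (√5 − 1)/2; these constants, like the prime 701 in the usage example, are conventional choices rather than requirements:

```java
public final class HashFunctions {
    // Division method: h(k) = k mod m, with m ideally a prime
    // not too close to a power of two.
    static int divisionHash(int k, int m) {
        return Math.floorMod(k, m);
    }

    // Multiplication method with m = 2^p and w = 32-bit words:
    // s = floor(A * 2^32) with A = (sqrt(5) - 1) / 2 ≈ 0.618...;
    // h(k) = top p bits of (k * s mod 2^32), obtained by an unsigned shift.
    private static final int S = (int) 2654435769L; // floor(0.618... * 2^32)

    static int multiplicationHash(int k, int p) {
        return (k * S) >>> (32 - p);  // int overflow realizes "mod 2^32" for free
    }

    public static void main(String[] args) {
        System.out.println(divisionHash(123456, 701));      // k mod 701
        System.out.println(multiplicationHash(123456, 14)); // bucket in [0, 2^14)
    }
}
```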

Universal Hashing

(this section is not yet covered)

Table Doubling

When the number of elements exceeds the number of buckets (assuming chaining is not used), the hash table needs to grow to keep access fast. Similarly, when the number of elements falls far below the number of buckets, the hash table should shrink to save memory.

The goal is to maintain m = Θ(n) whenever the table shrinks or grows. Then, how fast should it grow and shrink?

Suppose it grows from m to m+1: the hash table is then rebuilt on every insertion, for a total cost of Θ(1 + 2 + 3 + ... + n) = Θ(n²), which is no good.

What about growing to double the size, i.e. m to 2m? Then the table is rebuilt only at the 2ⁱ-th insertions, for a total of Θ(1 + 2 + 4 + ... + n) = Θ(n) when n is a power of 2. A few insertions trigger a linear-time rebuild, but the average cost per insertion is Θ(1) (see amortized analysis).

For table shrinkage, naively mirroring the rule (halving the size as soon as n drops to m/2) causes trouble: if n exceeds m by 1 and that new element is immediately deleted, the table doubles and then promptly shrinks back, so alternating inserts and deletes would trigger a rebuild every time. To avoid that thrashing pattern, the table shrinks to half its size only when n = m/4.
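
The grow/shrink policy can be sketched independently of the hashing details; the 2× growth and n = m/4 shrink thresholds are the ones derived above, while the initial size and elided placement logic are assumptions of the example:

```java
// Table-doubling policy: grow 2x when full, shrink to half only at n = m/4,
// so consecutive rebuilds are always Θ(n) operations apart.
public class ResizingTable {
    private int[] table = new int[8]; // m = 8 initial buckets (arbitrary)
    private int n = 0;                // number of stored elements

    public void insert(int key) {
        if (n == table.length) resize(2 * table.length);  // grow: m -> 2m
        // ... place key via the collision strategy of your choice ...
        n++;
    }

    public void delete(int key) {
        // ... locate and remove key ...
        n--;
        if (n > 0 && n == table.length / 4) resize(table.length / 2); // shrink at n = m/4
    }

    private void resize(int newSize) {
        int[] old = table;
        table = new int[newSize];
        // rehash every element of old into the new table: Θ(n) per rebuild,
        // but amortized Θ(1) per insert/delete
    }
}
```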

Examples

Various programming problems borrow solutions from Hash Table implementations to increase productivity.

De-Duplication

Given a stream of objects, say real-time data from the Web, the goal is to remove duplicates and keep track of each unique object's occurrence.

When a new object arrives, check whether it is already in the current hash table; insert it if absent (even if its hash collides with an existing key's) and drop it otherwise.

It is applicable in scenarios such as reporting unique visitors to a website or avoiding repeated search results for end users.
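
With the JDK's built-in hash table, the de-duplication loop is one membership test per object (the stream contents below are made up for the example):

```java
import java.util.HashSet;
import java.util.Set;

public class DeDuplication {
    public static void main(String[] args) {
        String[] stream = {"alice", "bob", "alice", "carol", "bob"}; // stand-in for live data

        Set<String> seen = new HashSet<>();   // hash table of unique objects
        for (String visitor : stream) {
            if (seen.add(visitor)) {          // add() returns false when already present
                System.out.println("unique visitor: " + visitor);
            }                                 // duplicates are simply dropped
        }
        System.out.println("total unique: " + seen.size());
    }
}
```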

2-SUM

Given an unsorted array A of n integers and a target sum t, determine whether or not there are two numbers x, y in A with x + y = t.

A naive solution is exhaustive search with two nested loops, costing Θ(n²). A slight improvement is to apply a comparison-based sort such as MergeSort and then binary-search for t - x for every x, for Θ(n ⋅ log n) overall.

The killer solution, though, uses a hash table: insert the values of A into a hash table, then for each value x look up t - x, giving Θ(n) expected time overall.
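
A one-pass variant of that idea (the array and targets are arbitrary test data; checking t - x before inserting x also rules out pairing an element with itself):

```java
import java.util.HashSet;
import java.util.Set;

public class TwoSum {
    // Returns true iff some pair x, y at distinct positions of a satisfies
    // x + y = t; Θ(n) expected time, one hash lookup and insert per element.
    static boolean twoSum(int[] a, int t) {
        Set<Integer> seen = new HashSet<>();
        for (int x : a) {
            if (seen.contains(t - x)) return true; // y = t - x was inserted earlier
            seen.add(x);
        }
        return false;
    }

    public static void main(String[] args) {
        int[] a = {3, 8, -1, 5, 12};
        System.out.println(twoSum(a, 7));   // true: 8 + (-1)
        System.out.println(twoSum(a, 100)); // false
    }
}
```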

Java 8 HashMap

Hash collisions have a significant impact on the performance of the LOOKUP operation in hash table data structures, raising its worst-case running time from Ο(1) to Ο(n).

To improve upon that, the Java 8 implementation of HashMap stores the entries of a bucket with many colliding keys in a balanced tree instead of a linked list once the bucket grows past a threshold. Hence, searching within such a bucket takes no more than Ο(log n).

References

  1. Google Security Blog - Announcing the first SHA1 collision, 2017.

  2. Hash tables and amortized analysis - CS 312 Lecture 20, Cornell, 2008.
