# Hash Table

This section concerns [Hash Table](#hash-table) data structure and its relevant hash functions designs. And it's an improved version of [Direct-Access Table](#direct-access-table) for operational efficiency.

## Direct-Access Table

As an inherited example of [Symbol Table](https://cs-notes.gitbook.io/algorithm-notes/outline/overview-5/overview), [Direct-Access Table](#direct-access-table) is satisfactory if given a small universe 𝒰 of *keys* for its implementation simplicity and effectiveness.

Formal Def:

Suppose distinct m *keys* are taken from 𝒰 ⊆ {0, 1, ..., n}, setup an array T\[0, 1, ...m-1]:

&#x20;![](https://204822229-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQpLOTVKDY3pIvuIk2h%2F-LQpLtuWuWX31O5763zE%2F-LQpLzns2enqQnt6mCjp%2Fdirect_access_table.png?generation=1541714740973073\&alt=media)Figure 1. Direct-Access Table Mathematical Representation

wherein every position in array T is called a *slot*. It is expected to have only Θ(1) access time but if the range of *keys* is too large, e.g. 64-bit numbers which have over trillions of different *keys*, the [direct-access table](#direct-access-table) may be overwhelmed in memory usage as well as entry access time. It is a poor choice in the context of Internet data size. Thus, the [Hash Table](#hash-table) is introduced.

## Design Idea

[Hash Table](#hash-table) is an extended array structure that associate *keys* with *values* that is dynamically resizable on demand. It is designed to combine the good qualities of \[List]\(<https://en.wikipedia.org/wiki/List_(abstract_data_type>)) and [Array](https://en.wikipedia.org/wiki/Array_data_structure) data structures.

Suppose given an evolving set 𝒮 from global universe 𝒰.

* \[List]\(<https://en.wikipedia.org/wiki/List_(abstract_data_type>)) only requires Θ(|𝒮|) space but lookup time Θ(|𝒮|) as well.
* [Array](https://en.wikipedia.org/wiki/Array_data_structure) only requires Ο(1) for lookup time but Θ(|𝒰|) space.

[Hash Table](#hash-table) possesses both the small space of storage Ο(|𝒮|) and swift entry access of Ο(1) optimally.

In order to support three distinct operations of [Hash Table](#hash-table): **INSERTION** of new record, **DELETE** existing record and **LOOKUP** for a specified record, [Hash Table](#hash-table) is built in several steps:

1. decide a n that is the number of "buckets" for storing entries; (might be dynamically extensible if |𝒮| is variant).
2. choose a [hash function](#hash-function) 𝒽: 𝒰 → {0,1,2, .., n-1}
3. store entry 𝓍 in array A\[𝒽(𝓍)]

*Note: if the record does not exist, **INSERTION** of new entries will complete but **DELETE** will fail; if the record exists, **INSERTION** of new entries will override the existing values and **DELETE** will succeed.*

What if the "bucket" location has already been occupied when trying to insert a new entry? In other words, a [hash collision](#hash-collision) happens. Generally, there are two approaches, [(separate) chaining](#separate-chaining) and [open-addressing](#open-addressing), in resolving this situation.

## Hash Collision

*Note: hash collision always happens, there is never a "perfect" hash function ever discovered to have no collisions during entry insertion*.

### Load Factor

Define a [load factor](#load-factor) (α) as a hashing performance metric for [hash table](#hash-table) Τ:

α = n / m, where n is the total number of elements, m is the number of buckets in [hash table](#hash-table) for storing elements. The value of α can vary as bigger than 1, equal to 1 and less than 1; when α is far larger than 1, the hash table is hence overloaded and need to expand itself by the techniques introduced in [table doubling](#table-doubling).

The [average-case](https://cs-notes.gitbook.io/algorithm-notes/outline/asymptotic-analysis) performance of hashing depends on how well the 𝒽 function distribute n *keys* in m slots. Hence, assuming that bucket position each element is hashed into is *independent* with each other and equally likely any of m slots would be chosen as the bucket, then

***simple uniform hashing***:

For ј = 0, 1, ..., m-1, let's denote the length of the list Τ\[ј] by nј, so that n = n0 + n1 + .. + nm-1. The **expected value** of nј is E\[nj] = α = n/m.

And the following content provides practices on the basis of such term.

### (Separate) Chaining

In [chaining](#chaining) method, all entries with same hash values are chained together in a [LinkedList](https://en.wikipedia.org/wiki/Linked_list); Specifically, either a pointer exists in a bucket of the [hash table](#hash-table) which points to the LinkedList (Singly or Doubly Linked list) of all entries hashed to that bucket, or there is only a NIL in that bucket.

LOOKUP costs Θ(1 + α) in [chaining](#chaining) version of [Hash Table](#hash-table). INSERTION operation in this structure takes [worst-case](https://cs-notes.gitbook.io/algorithm-notes/outline/asymptotic-analysis) running time of Ο(1), while the DELETION might takes up to Ο(n) if all entries hashed into one bucket; if using the doubly [LinkedList](https://en.wikipedia.org/wiki/Linked_list), DELETION may speed up.

### Open Addressing

Although it is possible for [(separate) chaining](#separate-chaining) method to have number of buckets smaller than the total number of elements to store, [open addressing](#open-addressing) must be kept with enough extra spaces in [hash table](#hash-table) buckets (m ⩾ n, α ⩽ 1).

In other words, all elements are stored in buckets and one entry per bucket. While it is possible for such hash table to fill up the entire buckets, [open addressing](#open-addressing) method saves memory occupation by [pointers](https://en.wikipedia.org/wiki/Pointer) to allocate more buckets, drastically minimizing hash collisions and improving searching speed.

In order to INSERT element into the hash table, [open addressing](#open-addressing) requires continuous buckets examinations, which is termed **probe**, until an empty bucket is located;

*probe sequence*: to make INSERTION operation successful, for each *key* k, there is a length m **probe sequence**:

< h(k,0), h(k, 1), ..., h(k, m-1) >

wherein each [hash function](#hash-function) generates a distinct position within the range of buckets. There are three major *probing* techniques in which *probe sequence* is produced in different manners.

*Note: when performing DELETE operation, the designated entry should be labeled "deleted" in case of further DELETE operation failure. (e.g. k2 is inserted after probing the location of k1, if k1 is deleted to recover an empty bucket, there is no way to know beforehand whether k2 has been inserted. Then, the further SEARCH and DELETE operation of k2 will fail)*

* Linear Probing

Given an auxiliary hash function 𝒽': 𝒰 -> {0, 1, ..., m-1}, then linear probing would use:

𝒽(k, i) = (𝒽'(k) + i) mod m

that before INSERTION, first probing the bucket Τ\[𝒽'(k)], and Τ\[𝒽'(k) + 1],... till the Τ\[𝒽'(k) - 1]; to be noted that the initial probing position determines the entire possible sequence.

* Quadratic Probing

Under the same premise of [linear probing](#linear-probing), quadratic probing adopts a better strategy:

𝒽(k, i) = (𝒽'(k) + 𝒸1 ⋅ i + 𝒸2 ⋅ i2) mod m

wherein 𝒸1 and 𝒸2 are positive constants, i = 0, 1, ..., m-1

instead of occupying large portions of adjacent buckets by using [linear probing](#linear-probing), quadratic probing leads to a more dispersed distribution of elements.

* Double Hashing

This is known as the best strategy of the three in [open addressing](#open-addressing) method;

𝒽(k, i) = (𝒽1(k) + i ⋅ 𝒽2(k)) mod m

where in both 𝒽1 and 𝒽2 are auxiliary hash functions; in order to search the whole hash table, value of 𝒽2 has to be relatively prime to the hash table size m. And it results in a Θ(m2) of probing sequence for each *key* k, approximating the *optimal* **simple uniform hashing** that for each *key* there is m! number of *probe sequence* permutations.

## Hash Function

What makes a "good" hash function? it should have properties of *spreading data out* uniformly in the [hash table](#hash-table) and simple to compute and evaluate.

A common practice in hash function is to transform the universe of *keys* into *natural numbers* {0, 1, 2, ...} and process such numbers to construct the hash function.

In designing the hash function 𝒽 against key *k*, there are genuinely three methods in which first two are heuristic while the last one is an randomized technique.

### Division Method

By taking modular arithmetic in the following form, the key \_k\_is hashed into one of m buckets of the [hash table](#hash-table):

𝒽(*k*) = *k* mod *m*, where *m* ≠ 2p, p is natural number.

*Note: it is wise to choose a* [*prime number*](http://whatis.techtarget.com/definition/prime-number) *as the value of m, this method is practical but inconvenient to find a prime number*.

### Multiplication Method

There are four steps to construct a multiplication hashing method:

1. multiply *k* with a constant *A* (0 < *A* < 1). (all of length *w* in terms of word length)
2. extract the decimal part of the previous result.
3. multiply *m* with the previous result.
4. round down to an integer and obtain the hash value.

𝒽(*k*) = ⌊ m ⋅ (\_k\_A mod 1) ⌋

Generally, this method outperforms the [division method](#division-method). If *m* is a power of 2, it is easier to compute the hash simply by shifting bits.

&#x20;![](https://204822229-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQpLOTVKDY3pIvuIk2h%2F-LQpLtuWuWX31O5763zE%2F-LQpLznzGqVuFcJNUdzp%2Fmultiplication-hashing.jpg?generation=1541714741398398\&alt=media)Figure 2. Multiplication Hashing Method

### Universal Hashing

(this section is not yet covered)

## Table Doubling

When the number of elements exceeds the number of buckets (assume chaining is not used), the hash table needs to grow its size to not slow down the access speed. Similar as when the number of elements are far less than the number of buckets, the hash table needs to shrink its size to save memory space.

The goal is to make sure Θ(n) = m whenever the table shrinks or grows. Then, how fast should it grow and shrink?

Suppose it grows from m to m+1, which means there is rebuild of hash table in every step of insertion, in total: Θ(1 + 2 + 3 + ... + n) = Θ(n2), not good.

What about grow it to its double size? e.g. m to 2m, which means there is only rebuild of hash table in 2ith step (i is the number of insertion operation), in total: Θ(1 + 2 + 4 + ... + n) = Θ(n) (n is a power of 2). There are few inserts cause linear time to rebuild the hash table while the overall average time would be Θ(1) (see [amortized analysis](https://stackoverflow.com/questions/11102585/what-is-amortized-analysis-of-algorithms))

For table shrinkage, if applying the similar rule to shrink the table size to its half, there will have problems when the n exceeds m by 1 then delete that new element (first double the table size then resize to original). Therefore, in order to avoid that frequent memory modification pattern, the table will shrink to half its size when n = m / 4

## Examples

Various programming problems borrow solutions from [Hash Table](#hash-table) implementations to increase productivity.

### De-Duplication

Given a *stream* of objects, say real time data from WEB, the goal is to remove duplicates and to keep track of unique object occurrence.

If a new object arrives, check if it exists in current hash table, insert it if not present even with same hash *key* and drop it otherwise.

It is applicable in scenarios such as reporting unique visitors to website or avoid repeated searching results by end users.

### [The 2-SUM Problem](https://leetcode.com/articles/two-sum/#approach-2-two-pass-hash-table-accepted)

Given an unsorted array A of n integers and a target sum t, determine whether or not there are two numbers x, y in A with x + y = t.

A naive solution may be to use exhaustive search in two encapsulated loops of Θ(n2). And to improve slightly upon that, applying a [comparison-based sort](https://cs-notes.gitbook.io/algorithm-notes/outline/overview-5/overview) such as [MergeSort](https://cs-notes.gitbook.io/algorithm-notes/outline/overview-2/merge-sort) piggybacking a [binary-search](https://cs-notes.gitbook.io/algorithm-notes/outline/overview-5/binary-search) for t - x given every x would have a Θ(n ⋅ log n) complexity.

Overall, a killer solution is by using the concept of [Hash Table](#hash-table). Insert values of A into a hash table and for each value x in the table, search the t - x in Θ(n) time.

### Java 8 HashMap

Hash collisions in hash table data structures have significant impact on performance of LOOKUP operation as to increase the [worst-case](https://cs-notes.gitbook.io/algorithm-notes/outline/asymptotic-analysis) running time from Ο(1) to Ο(n).

To improve upon that, Java8 spec of HashMap implementations requires the buckets containing colliding keys should store entries in a balanced tree instead of linked list. Hence, the searching operation takes no more than Ο(log(n)) in general.

## References

1. [Google Security Blog](https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html) - Announcing the first SHA1 collision, 2017.
2. [Hash tables and amortized analysis](http://www.cs.cornell.edu/courses/cs312/2008sp/lectures/lec20.html) - CS 312 Lecture 20, Cornell, 2008.
