algorithm-notes
  • Preface
  • Outline
    • Asymptotic-Analysis
    • Divide-and-Conquer (DnC)
      • Overview
      • Multiplication
      • DnC Strategy
      • Master Method
    • Randomization
      • Overview
      • Discrete Probability
      • Randomized Algorithms
      • Reservoir Sampling
    • Sorting
      • Overview
      • Selection Sort
      • Bubble Sort
      • Insertion Sort
      • Shell Sort
      • Merge Sort
      • Heap Sort
      • Quick Sort
      • Non-Comparison Sort
    • Graph Algorithms
      • Overview
      • Minimum Spanning Tree
      • Shortest Path
      • Topological Sort
    • Tree
      • Overview
      • Balanced Search Tree
        • AVL Tree
        • Red-Black Tree
        • B Tree
      • Heap
      • Disjoint Set
      • Threaded Binary Tree
      • Segment Tree
    • Searching
      • Overview
      • Binary Search
      • Graph Search
      • Hash Table
      • String Search
    • Dynamic Programming
      • Overview
      • Fibonacci Numbers
      • LCS problem
    • Greedy Algorithms
      • Overview
Powered by GitBook
On this page
  • Direct-Access Table
  • Design Idea
  • Hash Collision
  • Load Factor
  • (Separate) Chaining
  • Open Addressing
  • Hash Function
  • Division Method
  • Multiplication Method
  • Universal Hashing
  • Table Doubling
  • Examples
  • De-Duplication
  • The 2-SUM Problem
  • Java 8 HashMap
  • References

Was this helpful?

  1. Outline
  2. Searching

Hash Table

PreviousGraph SearchNextString Search

Last updated 6 years ago

Was this helpful?

This section concerns data structure and its relevant hash functions designs. And it's an improved version of for operational efficiency.

Direct-Access Table

As an inherited example of , is satisfactory if given a small universe 𝒰 of keys for its implementation simplicity and effectiveness.

Formal Def:

Suppose distinct m keys are taken from 𝒰 ⊆ {0, 1, ..., n}, setup an array T[0, 1, ...m-1]:

Figure 1. Direct-Access Table Mathematical Representation

wherein every position in array T is called a slot. It is expected to have only Θ(1) access time but if the range of keys is too large, e.g. 64-bit numbers which have over trillions of different keys, the may be overwhelmed in memory usage as well as entry access time. It is a poor choice in the context of Internet data size. Thus, the is introduced.

Design Idea

is an extended array structure that associate keys with values that is dynamically resizable on demand. It is designed to combine the good qualities of [List]()) and data structures.

Suppose given an evolving set 𝒮 from global universe 𝒰.

  • [List]()) only requires Θ(|𝒮|) space but lookup time Θ(|𝒮|) as well.

  • only requires Ο(1) for lookup time but Θ(|𝒰|) space.

possesses both the small space of storage Ο(|𝒮|) and swift entry access of Ο(1) optimally.

In order to support three distinct operations of : INSERTION of new record, DELETE existing record and LOOKUP for a specified record, is built in several steps:

  1. decide a n that is the number of "buckets" for storing entries; (might be dynamically extensible if |𝒮| is variant).

  2. choose a 𝒽: 𝒰 → {0,1,2, .., n-1}

  3. store entry 𝓍 in array A[𝒽(𝓍)]

Note: if the record does not exist, INSERTION of new entries will complete but DELETE will fail; if the record exists, INSERTION of new entries will override the existing values and DELETE will succeed.

Hash Collision

Note: hash collision always happens, there is never a "perfect" hash function ever discovered to have no collisions during entry insertion.

Load Factor

simple uniform hashing:

For ј = 0, 1, ..., m-1, let's denote the length of the list Τ[ј] by nј, so that n = n0 + n1 + .. + nm-1. The expected value of nј is E[nj] = α = n/m.

And the following content provides practices on the basis of such term.

(Separate) Chaining

Open Addressing

probe sequence: to make INSERTION operation successful, for each key k, there is a length m probe sequence:

< h(k,0), h(k, 1), ..., h(k, m-1) >

Note: when performing DELETE operation, the designated entry should be labeled "deleted" in case of further DELETE operation failure. (e.g. k2 is inserted after probing the location of k1, if k1 is deleted to recover an empty bucket, there is no way to know beforehand whether k2 has been inserted. Then, the further SEARCH and DELETE operation of k2 will fail)

  • Linear Probing

Given an auxiliary hash function 𝒽': 𝒰 -> {0, 1, ..., m-1}, then linear probing would use:

𝒽(k, i) = (𝒽'(k) + i) mod m

that before INSERTION, first probing the bucket Τ[𝒽'(k)], and Τ[𝒽'(k) + 1],... till the Τ[𝒽'(k) - 1]; to be noted that the initial probing position determines the entire possible sequence.

  • Quadratic Probing

𝒽(k, i) = (𝒽'(k) + 𝒸1 ⋅ i + 𝒸2 ⋅ i2) mod m

wherein 𝒸1 and 𝒸2 are positive constants, i = 0, 1, ..., m-1

  • Double Hashing

𝒽(k, i) = (𝒽1(k) + i ⋅ 𝒽2(k)) mod m

where in both 𝒽1 and 𝒽2 are auxiliary hash functions; in order to search the whole hash table, value of 𝒽2 has to be relatively prime to the hash table size m. And it results in a Θ(m2) of probing sequence for each key k, approximating the optimal simple uniform hashing that for each key there is m! number of probe sequence permutations.

Hash Function

A common practice in hash function is to transform the universe of keys into natural numbers {0, 1, 2, ...} and process such numbers to construct the hash function.

In designing the hash function 𝒽 against key k, there are genuinely three methods in which first two are heuristic while the last one is an randomized technique.

Division Method

𝒽(k) = k mod m, where m ≠ 2p, p is natural number.

Multiplication Method

There are four steps to construct a multiplication hashing method:

  1. multiply k with a constant A (0 < A < 1). (all of length w in terms of word length)

  2. extract the decimal part of the previous result.

  3. multiply m with the previous result.

  4. round down to an integer and obtain the hash value.

𝒽(k) = ⌊ m ⋅ (_k_A mod 1) ⌋

Universal Hashing

(this section is not yet covered)

Table Doubling

When the number of elements exceeds the number of buckets (assume chaining is not used), the hash table needs to grow its size to not slow down the access speed. Similar as when the number of elements are far less than the number of buckets, the hash table needs to shrink its size to save memory space.

The goal is to make sure Θ(n) = m whenever the table shrinks or grows. Then, how fast should it grow and shrink?

Suppose it grows from m to m+1, which means there is rebuild of hash table in every step of insertion, in total: Θ(1 + 2 + 3 + ... + n) = Θ(n2), not good.

For table shrinkage, if applying the similar rule to shrink the table size to its half, there will have problems when the n exceeds m by 1 then delete that new element (first double the table size then resize to original). Therefore, in order to avoid that frequent memory modification pattern, the table will shrink to half its size when n = m / 4

Examples

De-Duplication

Given a stream of objects, say real time data from WEB, the goal is to remove duplicates and to keep track of unique object occurrence.

If a new object arrives, check if it exists in current hash table, insert it if not present even with same hash key and drop it otherwise.

It is applicable in scenarios such as reporting unique visitors to website or avoid repeated searching results by end users.

Given an unsorted array A of n integers and a target sum t, determine whether or not there are two numbers x, y in A with x + y = t.

Java 8 HashMap

To improve upon that, Java8 spec of HashMap implementations requires the buckets containing colliding keys should store entries in a balanced tree instead of linked list. Hence, the searching operation takes no more than Ο(log(n)) in general.

References

What if the "bucket" location has already been occupied when trying to insert a new entry? In other words, a happens. Generally, there are two approaches, and , in resolving this situation.

Define a (α) as a hashing performance metric for Τ:

α = n / m, where n is the total number of elements, m is the number of buckets in for storing elements. The value of α can vary as bigger than 1, equal to 1 and less than 1; when α is far larger than 1, the hash table is hence overloaded and need to expand itself by the techniques introduced in .

The performance of hashing depends on how well the 𝒽 function distribute n keys in m slots. Hence, assuming that bucket position each element is hashed into is independent with each other and equally likely any of m slots would be chosen as the bucket, then

In method, all entries with same hash values are chained together in a ; Specifically, either a pointer exists in a bucket of the which points to the LinkedList (Singly or Doubly Linked list) of all entries hashed to that bucket, or there is only a NIL in that bucket.

LOOKUP costs Θ(1 + α) in version of . INSERTION operation in this structure takes running time of Ο(1), while the DELETION might takes up to Ο(n) if all entries hashed into one bucket; if using the doubly , DELETION may speed up.

Although it is possible for method to have number of buckets smaller than the total number of elements to store, must be kept with enough extra spaces in buckets (m ⩾ n, α ⩽ 1).

In other words, all elements are stored in buckets and one entry per bucket. While it is possible for such hash table to fill up the entire buckets, method saves memory occupation by to allocate more buckets, drastically minimizing hash collisions and improving searching speed.

In order to INSERT element into the hash table, requires continuous buckets examinations, which is termed probe, until an empty bucket is located;

wherein each generates a distinct position within the range of buckets. There are three major probing techniques in which probe sequence is produced in different manners.

Under the same premise of , quadratic probing adopts a better strategy:

instead of occupying large portions of adjacent buckets by using , quadratic probing leads to a more dispersed distribution of elements.

This is known as the best strategy of the three in method;

What makes a "good" hash function? it should have properties of spreading data out uniformly in the and simple to compute and evaluate.

By taking modular arithmetic in the following form, the key _k_is hashed into one of m buckets of the :

Note: it is wise to choose a as the value of m, this method is practical but inconvenient to find a prime number.

Generally, this method outperforms the . If m is a power of 2, it is easier to compute the hash simply by shifting bits.

Figure 2. Multiplication Hashing Method

What about grow it to its double size? e.g. m to 2m, which means there is only rebuild of hash table in 2ith step (i is the number of insertion operation), in total: Θ(1 + 2 + 4 + ... + n) = Θ(n) (n is a power of 2). There are few inserts cause linear time to rebuild the hash table while the overall average time would be Θ(1) (see )

Various programming problems borrow solutions from implementations to increase productivity.

A naive solution may be to use exhaustive search in two encapsulated loops of Θ(n2). And to improve slightly upon that, applying a such as piggybacking a for t - x given every x would have a Θ(n ⋅ log n) complexity.

Overall, a killer solution is by using the concept of . Insert values of A into a hash table and for each value x in the table, search the t - x in Θ(n) time.

Hash collisions in hash table data structures have significant impact on performance of LOOKUP operation as to increase the running time from Ο(1) to Ο(n).

- Announcing the first SHA1 collision, 2017.

- CS 312 Lecture 20, Cornell, 2008.

average-case
prime number
amortized analysis
The 2-SUM Problem
comparison-based sort
MergeSort
binary-search
worst-case
Google Security Blog
Hash tables and amortized analysis
hash collision
(separate) chaining
open-addressing
load factor
hash table
hash table
table doubling
LinkedList
chaining
hash table
worst-case
LinkedList
chaining
Hash Table
(separate) chaining
open addressing
hash table
pointers
open addressing
open addressing
hash function
linear probing
linear probing
open addressing
hash table
hash table
division method
Hash Table
Hash Table
Symbol Table
https://en.wikipedia.org/wiki/List_(abstract_data_type
Array
https://en.wikipedia.org/wiki/List_(abstract_data_type
Array
Hash Table
Direct-Access Table
Direct-Access Table
direct-access table
Hash Table
Hash Table
Hash Table
Hash Table
Hash Table
hash function