Hash Table
Last updated
Was this helpful?
Last updated
Was this helpful?
This section concerns data structure and its relevant hash functions designs. And it's an improved version of for operational efficiency.
As an inherited example of , is satisfactory if given a small universe 𝒰 of keys for its implementation simplicity and effectiveness.
Formal Def:
Suppose distinct m keys are taken from 𝒰 ⊆ {0, 1, ..., n}, setup an array T[0, 1, ...m-1]:
Figure 1. Direct-Access Table Mathematical Representation
wherein every position in array T is called a slot. It is expected to have only Θ(1) access time but if the range of keys is too large, e.g. 64-bit numbers which have over trillions of different keys, the may be overwhelmed in memory usage as well as entry access time. It is a poor choice in the context of Internet data size. Thus, the is introduced.
is an extended array structure that associate keys with values that is dynamically resizable on demand. It is designed to combine the good qualities of [List]()) and data structures.
Suppose given an evolving set 𝒮 from global universe 𝒰.
[List]()) only requires Θ(|𝒮|) space but lookup time Θ(|𝒮|) as well.
only requires Ο(1) for lookup time but Θ(|𝒰|) space.
possesses both the small space of storage Ο(|𝒮|) and swift entry access of Ο(1) optimally.
In order to support three distinct operations of : INSERTION of new record, DELETE existing record and LOOKUP for a specified record, is built in several steps:
decide a n that is the number of "buckets" for storing entries; (might be dynamically extensible if |𝒮| is variant).
choose a 𝒽: 𝒰 → {0,1,2, .., n-1}
store entry 𝓍 in array A[𝒽(𝓍)]
Note: if the record does not exist, INSERTION of new entries will complete but DELETE will fail; if the record exists, INSERTION of new entries will override the existing values and DELETE will succeed.
Note: hash collision always happens, there is never a "perfect" hash function ever discovered to have no collisions during entry insertion.
simple uniform hashing:
For ј = 0, 1, ..., m-1, let's denote the length of the list Τ[ј] by nј, so that n = n0 + n1 + .. + nm-1. The expected value of nј is E[nj] = α = n/m.
And the following content provides practices on the basis of such term.
probe sequence: to make INSERTION operation successful, for each key k, there is a length m probe sequence:
< h(k,0), h(k, 1), ..., h(k, m-1) >
Note: when performing DELETE operation, the designated entry should be labeled "deleted" in case of further DELETE operation failure. (e.g. k2 is inserted after probing the location of k1, if k1 is deleted to recover an empty bucket, there is no way to know beforehand whether k2 has been inserted. Then, the further SEARCH and DELETE operation of k2 will fail)
Linear Probing
Given an auxiliary hash function 𝒽': 𝒰 -> {0, 1, ..., m-1}, then linear probing would use:
𝒽(k, i) = (𝒽'(k) + i) mod m
that before INSERTION, first probing the bucket Τ[𝒽'(k)], and Τ[𝒽'(k) + 1],... till the Τ[𝒽'(k) - 1]; to be noted that the initial probing position determines the entire possible sequence.
Quadratic Probing
𝒽(k, i) = (𝒽'(k) + 𝒸1 ⋅ i + 𝒸2 ⋅ i2) mod m
wherein 𝒸1 and 𝒸2 are positive constants, i = 0, 1, ..., m-1
Double Hashing
𝒽(k, i) = (𝒽1(k) + i ⋅ 𝒽2(k)) mod m
where in both 𝒽1 and 𝒽2 are auxiliary hash functions; in order to search the whole hash table, value of 𝒽2 has to be relatively prime to the hash table size m. And it results in a Θ(m2) of probing sequence for each key k, approximating the optimal simple uniform hashing that for each key there is m! number of probe sequence permutations.
A common practice in hash function is to transform the universe of keys into natural numbers {0, 1, 2, ...} and process such numbers to construct the hash function.
In designing the hash function 𝒽 against key k, there are genuinely three methods in which first two are heuristic while the last one is an randomized technique.
𝒽(k) = k mod m, where m ≠ 2p, p is natural number.
There are four steps to construct a multiplication hashing method:
multiply k with a constant A (0 < A < 1). (all of length w in terms of word length)
extract the decimal part of the previous result.
multiply m with the previous result.
round down to an integer and obtain the hash value.
𝒽(k) = ⌊ m ⋅ (_k_A mod 1) ⌋
(this section is not yet covered)
When the number of elements exceeds the number of buckets (assume chaining is not used), the hash table needs to grow its size to not slow down the access speed. Similar as when the number of elements are far less than the number of buckets, the hash table needs to shrink its size to save memory space.
The goal is to make sure Θ(n) = m whenever the table shrinks or grows. Then, how fast should it grow and shrink?
Suppose it grows from m to m+1, which means there is rebuild of hash table in every step of insertion, in total: Θ(1 + 2 + 3 + ... + n) = Θ(n2), not good.
For table shrinkage, if applying the similar rule to shrink the table size to its half, there will have problems when the n exceeds m by 1 then delete that new element (first double the table size then resize to original). Therefore, in order to avoid that frequent memory modification pattern, the table will shrink to half its size when n = m / 4
Given a stream of objects, say real time data from WEB, the goal is to remove duplicates and to keep track of unique object occurrence.
If a new object arrives, check if it exists in current hash table, insert it if not present even with same hash key and drop it otherwise.
It is applicable in scenarios such as reporting unique visitors to website or avoid repeated searching results by end users.
Given an unsorted array A of n integers and a target sum t, determine whether or not there are two numbers x, y in A with x + y = t.
To improve upon that, Java8 spec of HashMap implementations requires the buckets containing colliding keys should store entries in a balanced tree instead of linked list. Hence, the searching operation takes no more than Ο(log(n)) in general.
What if the "bucket" location has already been occupied when trying to insert a new entry? In other words, a happens. Generally, there are two approaches, and , in resolving this situation.
Define a (α) as a hashing performance metric for Τ:
α = n / m, where n is the total number of elements, m is the number of buckets in for storing elements. The value of α can vary as bigger than 1, equal to 1 and less than 1; when α is far larger than 1, the hash table is hence overloaded and need to expand itself by the techniques introduced in .
The performance of hashing depends on how well the 𝒽 function distribute n keys in m slots. Hence, assuming that bucket position each element is hashed into is independent with each other and equally likely any of m slots would be chosen as the bucket, then
In method, all entries with same hash values are chained together in a ; Specifically, either a pointer exists in a bucket of the which points to the LinkedList (Singly or Doubly Linked list) of all entries hashed to that bucket, or there is only a NIL in that bucket.
LOOKUP costs Θ(1 + α) in version of . INSERTION operation in this structure takes running time of Ο(1), while the DELETION might takes up to Ο(n) if all entries hashed into one bucket; if using the doubly , DELETION may speed up.
Although it is possible for method to have number of buckets smaller than the total number of elements to store, must be kept with enough extra spaces in buckets (m ⩾ n, α ⩽ 1).
In other words, all elements are stored in buckets and one entry per bucket. While it is possible for such hash table to fill up the entire buckets, method saves memory occupation by to allocate more buckets, drastically minimizing hash collisions and improving searching speed.
In order to INSERT element into the hash table, requires continuous buckets examinations, which is termed probe, until an empty bucket is located;
wherein each generates a distinct position within the range of buckets. There are three major probing techniques in which probe sequence is produced in different manners.
Under the same premise of , quadratic probing adopts a better strategy:
instead of occupying large portions of adjacent buckets by using , quadratic probing leads to a more dispersed distribution of elements.
This is known as the best strategy of the three in method;
What makes a "good" hash function? it should have properties of spreading data out uniformly in the and simple to compute and evaluate.
By taking modular arithmetic in the following form, the key _k_is hashed into one of m buckets of the :
Note: it is wise to choose a as the value of m, this method is practical but inconvenient to find a prime number.
Generally, this method outperforms the . If m is a power of 2, it is easier to compute the hash simply by shifting bits.
Figure 2. Multiplication Hashing Method
What about grow it to its double size? e.g. m to 2m, which means there is only rebuild of hash table in 2ith step (i is the number of insertion operation), in total: Θ(1 + 2 + 4 + ... + n) = Θ(n) (n is a power of 2). There are few inserts cause linear time to rebuild the hash table while the overall average time would be Θ(1) (see )
Various programming problems borrow solutions from implementations to increase productivity.
A naive solution may be to use exhaustive search in two encapsulated loops of Θ(n2). And to improve slightly upon that, applying a such as piggybacking a for t - x given every x would have a Θ(n ⋅ log n) complexity.
Overall, a killer solution is by using the concept of . Insert values of A into a hash table and for each value x in the table, search the t - x in Θ(n) time.
Hash collisions in hash table data structures have significant impact on performance of LOOKUP operation as to increase the running time from Ο(1) to Ο(n).
- Announcing the first SHA1 collision, 2017.
- CS 312 Lecture 20, Cornell, 2008.