String Search
Last updated
Was this helpful?
Last updated
Was this helpful?
Given a text string T and a pattern P (a substring, a character or a ), which are both in the defined character set ∑, find some/all occurrences of P within T is termed the task of .
There are multiple string-matching applications in computing such as the popular UNIX tool in finding patterns within data sets.
Generally, there are two kinds of techniques in string matching tasks:
one-shot algorithmic approach: Ο(T) time in average, examples include , .
static data structures approach: preprocess string T and optimally obtain a Ο(P) time and Ο(T) space.
Obviously, algorithmic solutions that needs nearly no preprocessing are categorized as one-shot. There are various methods trying to improve upon the naive string match such as and .
A naive algorithm would be built under a principle of that beginning from each character occurrence of target strings (t), find whether the source string (s) matches a successive portion of target strings (t).
It takes Ο(|s|) for each substring comparison and the total number of comparison is Ο(|t| - |s|), then
Τ(n) = Ο(|s| ⋅ |t|) ≈
A rolling hash is a hash function that process inputs within a sliding window, it is powerful to rapidly calculate the hash value for the new entry without having to rehash it.
Given a string ABCDBBCA and a pattern BCDB; compute the hash value of pattern 𝒽(BCDB) and the first sub-text in the string 𝒽(ABCD), the result does not match, thus slide the window right by 1 slot; generate the new hash value 𝒽(BCDB) by rolling hash (append a character B to the end of old hash value and skip the tail character value A from the old hash value); the new results have a match, procedure ends.
(this section is not yet covered)
In the context of massive dataset of WEB, preprocessing before performing searching or statistical operations is necessary to reduce the time complexity, even demanding in the real-time systems.
A trie, also called radix tree or prefix tree, as the name suggests, is a tree-alike data structure for searching primitives (trie comes from the word retrieval). It sacrifices memory space in exchange of better searching performance and it satisfies the following properties:
the root of trie contains NULL entries.
strings with same prefixes share the common ancestors in the trie.
the root-to-leaf path represents the entire string and terminated with a special character $ that is not in the set ∑.
Specifically, given an input: {an, ana, ann, anna, anne}, constructing a trie:
Note: there are some denotations mark the edges as characters instead of nodes, which conveys the same meaning.
Since the generation of a trie is one-pass and remain static afterwards, it is wise to contract some of the adjacent edges (nodes) which have a single descendent to one edge (node) for searching simplicity. Then, a compressed trie would look like:
Note: it is valid to contract the termination character ($) to the final character node, the version provided here only assists comprehension.
Note: if the character set is ASCII based, it is recommended to use static data structures such as array (e.g. array[1..128]) to store children of a node instead of a dynamic structure like list (which costs linear time than constant to search for the prefixes and the cost of querying strings exceeds Ο(P)).
Applications
In the world of Internet, Trie has numerous useful applications. For instance, the autocomplete feature of search box in various websites such as Google, can be embedded with a Trie data structure whenever you type in a letter, the Trie gets pruned and greatly increase the speed of next search.
It is common that misspelled words share similar prefixes as the correct words, and Trie can be helpful in performing string matching algorithm in this context of spell check.
For instance, given a string "banana" and build a suffix trie for that, first a $ has to be appended to the end of the string, namely "banana$"; then list all its suffixes "banana$", "anana$", "nana$", "ana$", "na$", "a$" and "$"; finally, prefix them in a trie structure to obtain the suffix tree:
Applications
There are many practical uses of Suffix Trees such as Longest Common Substring problem, Genomic projects, etc. It is worth noted Suffix Tree is not practical in single searches since preprocess the given text requires a large amount of time.
Rabin-Karp or Karp-Rabin algorithm, is a string searching primitive that uses technique. Specifically, it requires to compare a hash value for the pattern with distinct hashes of the sub-text in the target string.
If hash computations are performed each time on sub-text, the overall complexity is even higher than the . Therefore, a technique is introduced. The design of such varies in different scenarios to provide a reasonable computation complexity and some magnitude of randomness. In the following discussions, it is assumed the hash function 𝒽 satisfies the .
Figure 1. Rolling Hash Example
Since the hash function processes the The expected running time for Rabin-Karp algorithm in string matching is Ο(|s| + (|t| - |s|) ⋅ cost(𝒽)); In the naive version of hashing, the cost of 𝒽 will be persistent as |s|, the overall complexity would be Ο(|t| ⋅ |s|) ≈ . By adopting , the cost of hash function could be reduced down to Ο(1), thus time complexity for is Ο(|t|) ≈
Figure 1. Trie Representation in a Tree
Figure 2. Compressed Trie
Roughly, query strings in the cost Ο(P) in time (LOOKUP time is independent of the number of strings in the trie!) and Ο(T) in space; Nonetheless, there are slight variations among different data representations, if you are interested, check .
Normally, suffix trie is also called suffix tree and as the name suggests, it stores the suffixes of a string in ways store prefixes. The difference between a compressed trie and a suffix trie is that suffix trie uses all suffixes of a single string which is appended with a $ and a trie terminates a bunch of strings with $.
Figure 3. Suffix Trie Representation
What is a rolling hash and when is it useful?
Suffix Trees, HackerEarth.
Applications of Suffix Trees