Given a string s
of length n
. The prefix function
for this string is defined as an array $\pi [i]$ is the length of proper prefix of substring s[0...i]
which is also a suffix of this substring. By definition $\pi[0] = 0$.
For example, prefix function of string “abcabcd” is $[0,0,0,1,2,3,0]$
Trivial Algorithm will be of complexity $O(n^3)$.
Efficient Algorithm was proposed by Knuth and Pratt and independently from them by Morris in 1977. It was used as the main function of a substring search algorithm.
Observe that the value of the prefix function can increase by at most one.
i.e. $\pi[i+1] \le \pi[i] + 1$.
This reduces the complexity to $O(n^2)$.
Lets compute prefix function for $i+1$.
If $s[i] = s[\pi[i]]$, then we can say with certainty that $\pi[i+1] = \pi[i]+1$, since we already know that the suffix at position $i$ of length $\pi[i]$ is equal to the prefix of length $\pi[i]$.
If this is not the case, $s[i] \ne s[\pi[i]]$, then we need to try a shorter string. In order to speed up things, we would like to immediately move to the longest length $j < \pi[i]$, such that the prefix property in the position $i$ holds, i.e. $s[0…j-1] = s[i-j+1…i]$.
Indeed, if we find such a length $j$, then we again only need to compare the characters $s[i+1]$ and $s[j]$ , if they equal , then we say $\pi[i+1] = j+1$ otherwise we will need to find the largest value smaller than $j$, for which prefix property holds, and so on.
When $j=0$, if $s[i+1] = s[0]$, we assign $\pi[i+1] = 1$ and $\pi[i+1] = 0$ otherwise.
vector<int> prefix_function(string s) {
int n = (int)s.length();
vector<int> pi(n);
for (int i = 1; i < n; i++) {
int j = pi[i-1];
while (j > 0 && s[i] != s[j])
j = pi[j-1];
if (s[i] == s[j])
j++;
pi[i] = j;
}
return pi;
}
Note : Its a online algorithm, i.e. it processes the data as it arrives.
Substring Searching (KMP)
given text $t$ and a string $s$, find and display all occurrences of the string $s$ in $t$.
size of s be n and size of t be m.
We generate a string s+#+t where # is a separator that doesn’t appear in both s and t. Calculate the prefix function for the string.
This of what prefix function represents for $i > n$. It represents longest length of a substring ending in the position i that coincides with the prefix. But in our cause this is nothing more than the largest block that coincides with $s$ and ends at $i$. This length cannot be bigger than $n$ due to separator. But if equality is approaches $\pi[i] = n$ then it means that the string $s$ appears completely in at this position i.e. it ends at position $i$. Just don’t forget indices in string s+#+t.
If at some $i$ $\pi[i] = n$ then at the position $i-(n+1) - n+1 = i - 2n$ in the string $t$ the string $s$ appears
Thus KMP solves the problem in $O(n+m)$ time and $O(n)$ memory.
Counting the number of occurrences of each prefix