Knuth Morris Pratt algorithm

O(n+m) pattern matching algorithm


KMP stands for Knuth, Morris and Pratt, the first ones to come up with the algorithm.

The naive approach has a time complexity of O(n*m), while KMP has O(n+m).
KMP preprocesses the pattern in O(m) and performs the search in O(n), hence O(m+n).
('m' is length of pattern, 'n' is length of input string against which we are matching the pattern).

The naive approach tests every substring of length m against the pattern.
That is, precisely speaking, n * (n - m) operations, because it will check m characters from first index, then second, then third, and from all other up until position (n - m).

KMP will touch each string character once.

KMP is easiest if explained prospectively.

If we encounter a full match or a mismatch, what should we be concerned with?

We should be concerned with the fact that the matching string portion or pattern prefix might contain 1 or more prefixes of the pattern in one or more of its suffixes.

IND: 0123456789...
TXT: AAAAAAAAAAAAA
PAT: AAAAAB

We have a mismatch at position 5 of pattern.
The matched portion from 0 to 4 contains exactly 4 possible prefixes, from pattern indices 1, 2, 3 and 4.

IND: 0123456789.............
TXT: AADAABCAADAADAABCAADAAA
PAT: AADAABCAADAAB

We are checking the results for position 13.
The length of the longest prefix for the matching portion (up until index 12) is 5, which is AADAA.
So we chack PAT[5] against TXT[13], C and D, respectively.
That is a mismatch.

So we look for the next longest prefix.

Its length is 2, which is AA.
So we check PAT[2] against TXT[13], D and D.
We have a match.

So next time PAT[3] will be matched against TXT[14].

Algorithm-wise, the important takeaway is that the next longest prefix checked must be both a prefix and suffix of the previous longest prefix.

1. AADAABCAADAAB -> AADAA is a proper prefix and proper suffix of length 5 of PAT[0] ... PAT[LEN - 2].
2. AADAA -> AA is a proper prefix and proper suffix of length 2 of PAT[0] ... PAT[PREV_PREFIX_LEN - 1].
3. AA -> A is a proper prefix and proper suffix of length 1 of PAT[0] ... PAT[PREV_PREFIX_LEN - 1].
4. A -> PAT[0] ... PAT[PREV_PREFIX_LEN - 1] has no proper suffix and prefix, it is zero in length.

Which, to the intuitive human eye, translates to:

S: AADAABCAADAA(D)

0. AADAABCAADAA(B) -> no match
1. AADAA(B).AADAA -> no match
2. AA(D).......AA -> match !
3. A(A).........A -> not considered already
4. (A)........... -> not considered already

The next shortest is 2 in length, so text character D is matched against pattern character D.
So the second prefix which was tried out proved to be good.

If it was not successful, it would have tried out the shortest prefix, which is zero, meaning it would have matched text character D against the first pattern character A.

That means these prefixes which are also suffixes of our matched pattern prefix are already matched, so there is no need for setting back our string position.
v
All these matched prefixes are possible candidates for a full match.

You can continue matching a prefix by simply matching the next string characters with the pattern characters after the prefix, whose position is length of the prefix. (Because indexing is 0 based).

But of course we can not choose just any of the possible candidates.
If we chose a prefix, and others exist that are longer than that, would mean that all of them would be lost, because in case of a full match (or mismatch), only shorter ones will be evaluated/considered as candidates, so it is essential that we choose the longest prefix so that each of the occurring prefixes will get evaluated.

So we end up writing a function that creates a sort of prefix table as large as the pattern, which stores, at a given index, the longest prefix of the pattern which is also a suffix of the pattern up until the given pattern index.

Or simply said, we find the longest prefix which is also suffix for every prefix of the pattern.

A very rightous question arises.
On full match or mismatch, we will start probing from longest to shortest known prefixes until one is found adequate either for the string character right after the full match, or the string character for which we had a mismatch.
If none was found adequate, then we are going to increase string position until it matches first pattern character.

While building the prefix table, almost the same thing happens.
If the next pattern character can not grow the longest so far, it will start probing down to the shortest until one is found adequate for current pattern character which is being considered in the for loop.
If none is adequate, then the "prefix value" at the current character is abovo 0.

This means that whether we are bulding the prefix table, or doing the string matching itself, it is evident that each string and pattern character will be accessed at most once, rendering it O(n).

But we also need to consider the probing part, which raises the question whether our algorithm is truly linear time.

If you read carefully, you know you have already seen an intuitive proof.
v
The functions can only be considered linear time if the number of probe iterations is not more than a constant multiple of "n" which is the length of the string or pattern, in the matcher or prefix function, respectively.

When we advance to the next character, our prefix length grows by at most 1.
1 such advance, not growth, allows for another (constant multiple of) probe iteration, so that the function is still linear time.
So there might be more allowable probe iterations due to an advance than what the prefix length is, because this extra allowance is not available for decrement, and (see below) spending allowance depends on prefix decrement.

1 such iteration in a round of probe allows for 1 (constant multiple) less of probe iterations.
1 single round of probing can not take more iterations than length of prefix because it can not go beyond zero, and because each subsequent iteration decreases it by at least 1.
So there might also be more allowable probing iterations due to decrements larger than 1, than what the prefix length is, because it means the prefix function can decrease in greater pace than the allowance can decrease, or than the number of iterations can increase.

So in worst case, each advance increases prefix length by 1, each iteration decreases it by 1, and prefix length is decreased down to zero before function finishes.

In this worst case we are doing 2*n operations, or 4*n operations for both the prefix function and the matcher function, which means it is indeed O(n) asymptotically.

QED

I prefer both the KMP matcher and the prefix table functions written in my style, but I will post here the more widely known (or original) implementation too.

My impls.:


    size_t * pfx_tbl(char *p, size_t len) {
        size_t i = 0;
        size_t lps = 0;
        size_t *tbl = (size_t *)malloc(sizeof(size_t) * len);
        // prefix value at PAT[0] is zero by definition
        // bc we are finding PROPER prefix that is suffix too
        tbl[i++] = lps;

        while (i < len) {
            /*
            Checking next longest prefix.
            Success increments it and
            assigns it to current char.
            DRAWING NEXT CHAR
            */
            if (p[i] == p[lps])
                tbl[i++] = ++lps;
            /* There was no success with
            the the given prefix value for PAT[i] */
            else {
                /*
                Prefix value can not decrease beyond zero.
                Current prefix value is then zero.
                DRAWING NEXT CHAR
                */
                if (lps == 0)
                    tbl[i++] = 0;
                /*
                NOT DRAWING NEXT CHAR.
                Preparing next longest prefix value
                to bechecked out in next iteration for
                SAME PAT CHAR.
                */
                else
                    lps = tbl[lps - 1];
            }
        }

        return tbl;
    }

    size_t * kmp(char *s, size_t slen, char *p, size_t plen) {
        size_t fromi, si, pi, ri;
        fromi = si = pi = ri = 0;
        size_t cap = 3;
        size_t *res = (size_t *)malloc(sizeof(size_t) * cap);
        size_t *tbl = pfx_tbl(p, plen);

        while (si < slen) {
            // PAT[PAT_INDEX] == TXT[TXT_INDEX]
            if (s[si] == p[pi]) {
                // There was a full match
                if (pi + 1 == plen) {
                    /*
                    fromi keeps track of TXT position
                    where PAT prefix was found.
                    It is now being added to result array.
                    */
                    res[ri++] = fromi;
                    // We double the size of the
                    result array if full
                    if (ri == cap)
                        res = (size_t *)realloc(res,
                        sizeof(size_t) * (cap *= 2));
                    /*
                    On full match, the matching portion
                    ends at PAT[PAT_LEN - 1].
                    Next  time, the next TXT char
                    is compared with PAT[TBL[PAT_LEN - 1]].
                    If PAT[TBL[PAT_LEN - 1]] is 0,
                    next fromi starts from si + 1.
                    If PAT[TBL[PAT_LEN - 1]] is 0 + N,
                    fromi starts from (si + 1) - N.
                    */
                    fromi = ++si - (pi = tbl[pi]);
                }
                /* There is match, but not full match,
                increasing PAT and TXT indices
                */
                else {
                    si++;
                    pi++;
                }
            }
            // mismatch occurred
            else {
                if (pi == 0)
                    fromi = ++si;
                /*
                NOT INCREASING TXT pos.
                We are preparing next
                longest prefix value to be
                checked out in next iteration
                for SAME TXT CHAR
                */
                else
                    fromi = si - (pi = tbl[pi - 1]);
            }

        }

        // (size_t)-1 signals end of array
        res[ri++] = std::numeric_limit<size_t>::max();
        res = (size_t *)realloc(res, sizeof(size_t) * ri);
        return res;
    }

"Mainstream" implementation for prefix function:



    size_t * prefix_function(char *p, size_t len) {
        size_t *tbl = (size_t *)malloc(sizeof(size_t) * len);
        tbl[0] = 0;

        for (size_t i = 1; i < len; i++) {
            size_t lps = tbl[i - 1];

            while (lps > 0 && p[i] != p[lps])
            lps = tbl[lps - 1];

            if (p[i] == p[lps])
                lps++;

            tbl[i] = lps;
        }
    }