Mergesort

O(nlogn) sorting algorithm


Mergesort is an array sorting algorithm.
It was invented by Hungarian informatician Neumann János.

Like any other sorting algorithm, it can be used on collections of any type where the relational operation "larger than" (or smaller than) can be defined for 2 objects of the same type. (If it is not a number, it could also be a string or date)
It is one of the 2 sorting algos that have a deterministic time complexity of O(nlogn).
It is also the faster one, as heapsort is slower in practise because of cache misses.

Only quicksort is able to beat it, which has a non-deterministic (randomised) average O(nlogn) time complexity.
That is because quicksort is an in-place algorithm requiring no auxiliary space, and that lack of extra memory management can make it perform somewhat better given you have luck to avoid the O(n^2) worst case.

It employs a concept called "divide and conquer".
It breaks down recursively a problem into subproblems, until it can solve the whole.

Mergesort makes use of a function called "merge", which turns 2 already sorted sequences into 1 sorted in deterministic O(n+m) time complexity, where 'm' and 'n' are lengths of the 2 input sequences.

How does mergesort use it? It will recursively halve the subsequences of the previous level into the next until their length is more than 1.
That will cause each level to have roughly equal sizes not considering odd sequence sizes because on the top level you have the original sequence whose halves might have at most a difference of one.
And the one lower level has twice the amount of subsequences by definition.

We can apply merge to the 2 halves on the 1 lower level of every subsequence on the 1 larger level once those halves are sorted.
On the lowest level where you have groups of halves of sizes 1, then those by definition already sorted sequences, and are ready to be merged group by group back into the subsequences of the one larger level, and when it is done, it means each subsequence on the 1 larger level are also sorted and ready for merging back into the even larger level and so on.

We have roughly equal subsequence sizes on each level, so it means log_2(array length) must tell you with an error of at most 1 how many levels there are going to be in this subsequence hierarchy because the sorted subsequence size grows by a factor of 2 on each level.

How does the merge function work? You need to imagine the 2 sorted sequences partitioned into subsequences such that for each of these subsequences it is true that the elements of the rest of the subsequences are either all smaller or all larger than all the elements of the selected one.
So basically these subsequences are like single numbers: they can be sorted into a resulting array.
You keep copying numbers from one array into the result until the current number in the other array is larger.
When a larger number is encountered then copying of the next subsequence from the other array begins.

Let's call the resulting sorted array our state.
The only property of our state which is maintained is that it contains numbers in ascending order.
The initial state has no numbers in it, so it is initially correct.

We need to prove that the change to this state produces another correct state.
Let's define change by appending a subsequence of one array from a given index into the state until at a larger index the number is larger than the number at some index at the other array.
Our state remains correct if: 1) The appended subsequence is sorted 2) All numbers in the subsequence are larger than the rest of the numbers in the state before the change.

Point 1) is correct since every subsequence of the 2 arrays are sorted, which means at 1 larger index there is always a larger number.

Point 2) is also correct, since copying of a new subsequence from the other array always starts when the next number at some index 'x' of one of the arrays is no longer smaller than the number at the iterator of the other array at index 'y'.
That means since numbers starting from 'x' and numbers starting from 'y' at the other array are all larger than the number at (x - 1) (which does not exist only in case the copying ends at index zero in which case the copied subsequence has a length of zero) because of the sorted order, the numbers copied in every change must all be larger than all ones previously copied into the state.

Merge also keeps the relative order of objects with equal keys, therefore mergesort is a stable sorting algorithm.

You see that on each level, each element of the original array is being processed by merge function.
We have log_2(n) levels, so it means O(n*log(n)) operations all in all.


    template <typename T>
    void merge(T *first, size_t firstLen, T *second, size_t secondLen, T *dest) {
        size_t toIndex = 0;
        size_t firstIndex = 0;
        size_t secondIndex = 0;
        T *toFirst = (T *)malloc(sizeof(T) * (firstLen + secondLen));

        while (firstIndex != firstLen && secondIndex != secondLen) {
            T &fromFirst = first[firstIndex];
            T &fromSecond = second[secondIndex];

            if (fromFirst > fromSecond) {
                toFirst[toIndex] = second[secondIndex];
                toIndex++;
                secondIndex++;
            }
            else {
                toFirst[toIndex] = first[firstIndex];
                toIndex++;
                firstIndex++;
            }
        }

        if (firstIndex < firstLen)
            memcpy(toFirst + toIndex, first + firstIndex, (firstLen - firstIndex) * sizeof(T));
        else if (secondIndex < secondLen)
            memcpy(toFirst + toIndex, second + secondIndex, (secondLen - secondIndex) * sizeof(T));

        for (size_t i = 0; i < firstLen + secondLen; i++)
            dest[i] = toFirst[i];

        free(toFirst);
    }

    template <typename T>
    void mergesort(T *first, size_t len) {
        if (len == 1) {
            *first = move(*first);
            return;
        }

        size_t m = (len / 2);
        mergesort(first, m);
        mergesort(first + m, len - m);
        merge(first, m, first + m, len - m, first);
    }