Matrix multiplication

Square matrix multiplication is the product of two n-by-n matrices. Each matrix has n² entries. As will be seen later, the naive algorithm for this operation performs in O(n³) running time. However there exist better algorithms like the Strassen matrix multiplication algorithm that performs in O(n^2.8074) running time.

The algorithm presented here performs in O(n²) running time. Precisely the total number of operations is at most (11n²+n+3). But this is true only for the special case where inputs are all natural numbers or zero. It is yet to be possibly improved to deal with real numbers.

'Square' matrix multiplication is used here only for explanation. Actually the algorithm can be used for any matrix multiplication of arbitrary input sizes. Precisely, if the two input matrices have sizes n-by-v and v-by-m respectively, then the total number of operations will be at most (3nm + 4v(n+m) + 2m-n+3). In general then, the algorithm does matrix multiplication in linear time with total matrix size. However there is often computation on very large numbers. Precisely, when the largest entry is equal to v, the storage space complexity is O(nm×log(v)).

In this article, the illustration and analysis of correctness and performance of the algorithm are done simultaneously. Note carefully that in the analysis of performance, the addition, subtraction, multiplication, division and modulo(remainder) operations between ANY two numbers are assumed to each take a constant time. Though, given a space complexity of O(f(n)) of the operands to an operation, it will actually take at least O(f(n)) time complexity. But the assumption is made so to emphasise on the basic idea of the matrix multiplication algorithm.

The special case of n=3 will be used all through.

Download the available implementation of the algorithm in C language or in Python3 language.

The naive algorithm

Let D and M be two 3-by-3 matrices, where:

        d e f          m n o
    D = g h i      M = p q r
        j k l          s t u

Let R = D × M .

Note: differentiate between the entry n in M and the size n of the matrices.

By using the naive algorithm for matrix multiplication, the following is obtained:

            d e f     m n o      (dm+ep+fs) (dn+eq+ft) (do+er+fu)
    D × M = g h i  ×  p q r  =   (gm+hp+is) (gn+hq+it) (go+hr+iu)  =  R
            j k l     s t u      (jm+kp+ls) (jn+kq+lt) (jo+kr+lu)

Performance:
Every element of the matrix R is obtained by doing 2n operations. There are n² elements. This forms a total of 2n × n² = 2n³ operations. Therefore the performance is O(n³).

The basic idea

Consider the matrices D, M and R mentioned above.

By summing the rows of R, we have:

    (dm+ep+fs) + (dn+eq+ft) + (do+er+fu)  =  d(m+n+o) + e(p+q+r) + f(s+t+u)
    (gm+hp+is) + (gn+hq+it) + (go+hr+iu)  =  g(m+n+o) + h(p+q+r) + i(s+t+u)
    (jm+kp+ls) + (jn+kq+lt) + (jo+kr+lu)  =  j(m+n+o) + k(p+q+r) + l(s+t+u)

Also, by summing the columns of R, we have:

    (dm+ep+fs)   (dn+eq+ft)   (do+er+fu)
        +            +            +
    (gm+hp+is)   (gn+hq+it)   (go+hr+iu)
        +            +            +
    (jm+kp+ls)   (jn+kq+lt)   (jo+kr+lu)
        =            =            =
    m(d+g+j)     n(d+g+j)     o(d+g+j)
        +            +            +
    p(e+h+k)     q(e+h+k)     r(e+h+k)
        +            +            +
    s(f+i+l)     t(f+i+l)     u(f+i+l)

From here on only the case of the rows of R will be considered. However what is to be done is also possible for the case of the columns of R.

Now, consider certain magical coefficients a, b and c.
These coefficients are such that when used in the matrix R we have:

    (dm+ep+fs)a + (dn+eq+ft)b + (do+er+fu)c  =  d(ma+nb+oc) + e(pa+qb+rc) + f(sa+tb+uc)
    (gm+hp+is)a + (gn+hq+it)b + (go+hr+iu)c  =  g(ma+nb+oc) + h(pa+qb+rc) + i(sa+tb+uc)
    (jm+kp+ls)a + (jn+kq+lt)b + (jo+kr+lu)c  =  j(ma+nb+oc) + k(pa+qb+rc) + l(sa+tb+uc)

As will be seen later, it is possible to get all the Right-Hand-Side (RHS) values in O(n²) running time. Note that these are essentially 'numbers'.

The basic idea is to choose the coefficients a, b and c, such that an algorithm is designed to use them, together with the RHS values, for the ultimate aim of obtaining the unknowns on the Left-Hand-Side (LHS). Of course, the unknowns on the LHS (what is found in parentheses on the LHS) are the elements of the result matrix R.

So a problem is as follows:
You are in a situation in which you choose three values a, b and c, such that when used with three unknowns x, y and z, you obtain a value k = ax+by+cz. The task is to use your three chosen values a, b and c, together with a given value of k, so to find the unknowns x, y and z. There must always be only one possible solution for the unknowns x, y and z.

Obtaining the RHS values in quadratic time

There are a total of n Right-Hand-Side (RHS) values. Note that these are essentially 'numbers'. This section is mainly characterised by the storage of already computed values in memory so to avoid recomputing them.

The matrix M is of the form:

        m n o
    M = p q r
        s t u

The array containing the coefficients {a, b, c} is stored in memory.

Now, consider a vector or array S, of the form:

    S = { (ma+nb+oc) , (pa+qb+rc) , (sa+tb+uc) }

Every element of the array S is obtained by doing 2n operations. There are n elements. This forms a total of 2n * n = 2n² operations. S must be stored in memory as it is to be used multiple times.

The matrix D is of the form:

        d e f
    D = g h i
        j k l

The array S is stored in memory.

Now, consider a vector or array T, of the form:

    T = {   d(ma+nb+oc) + e(pa+qb+rc) + f(sa+tb+uc) ,
            g(ma+nb+oc) + h(pa+qb+rc) + i(sa+tb+uc) ,
            j(ma+nb+oc) + k(pa+qb+rc) + l(sa+tb+uc)   }

Every element of the array T is obtained by doing 2n operations. There are n elements. This forms a total of 2n * n = 2n² operations. T must be stored in memory as it is to be used at multiple times.

The array T contains the RHS values. Every element of T will be used so to obtain the n unknowns found in the corresponding row of the result matrix R. The overall number of operations used to obtain T is 2n² + 2n² = 4n² operations.

Base n numeral system

Recall the problem that was previously mentioned about some chosen values a, b, c, some unknowns x, y, z, and a certain value k = ax+by+cz. The problem can be solved when the unknowns a, b and c are all natural numbers including zero. In this case the 'base n numeral system' is used.

In the base n numeral system, any non-negative integer can be correctly and distinctly represented as a sequence of whole numbers, also called digits. For example the number 'five' is represented in base 10 as '5', in base 3 as '12', and in base 2 as '101'. All digits are whole numbers between 0 and n-1 inclusive. As the word 'digit' is usually known to refer to a number between 0 and 9, it will not be used anymore so to avoid confusion.

Let I be the non-negative integer represented by a sequence s of the form {s₀, s₁, s₂, ..., s_m-2, s_m-1}. s_level in s is a whole number between 0 and n-1 inclusive, and is found at a certain level. There are m levels from 0 to m-1, where m is the size of the sequence. Let I be 'five'. Then in base 10, s = {5} = {0, 5} = {0, 0, 5} ... for different sizes m. In base 3, s = {1, 2} = {0, 1, 2}... In base 2, s = {1, 0, 1} = {0, 1, 0, 1}...

Given the base n, the sequence s and the size m, the equation below finds I:
    I = s₀*n⁰ + s₁*n¹ + s₂*n² + ... + s_m-2*n^m-2 + s_m-1*n^m-1
Proof:
Proof of that there is only one possible representation of a non-negative integer I as a sequence s, under the equation above.

First it should be noted that the largest computation of I with respect to the equation is for the sequence {n-1, n-1, n-1, ... m times}. In this case I evaluates to n^m-1 (that is I = n^m-1).

Now let any two sequences {s_a,0, s_a,1, s_a,2 , ..., s_a,m-2 , s_a,m-1} and {s_b,0, s_b,1 , s_b,2, ..., s_b,m-2, s_b,m-1} be representations of two non-negative integers I_a and I_b respectively, such that I_a=I_b. Then
    I_a-I_b=0 => (s_a,0 - s_b,0)*n⁰ + (s_a,1 - s_b,1)*n¹ + (s_a,2 - s_b,2)*n² + ... + (s_a,m-2 - s_b,m-2)*n^m-2 + (s_a,m-1 - s_b,m-1)*n^m-1 = 0 .

It turns out that for the above equation to hold, every term must evaluate to zero. Consider the worst case scenario where the term (s_a,m-1 - s_b,m-1)*n^m-1 becomes equal to (1)*n^m-1, meanwhile the terms (s_a,0 - s_b,0)*n⁰, (s_a,1 - s_b,1)*n¹, (s_a,2 - s_b,2)*n² , ... and (s_a,m-2 - s_b,m-2)*n^m-2 are all negative. The later all add up to form the most negative possible value of -(n^m-1-1). In this worst case the value of I is equal to ( n^m-1 - (n^m-1-1) ) = 1 , which is different from 0.

Only the values in brackets can evaluate to zero. Consequently for the equation to hold: s_a,0=s_b,0 , s_a,1=s_b,1, s_a,2=s_b,2, ... , s_a,m-2=s_b,m-2 and s_a,m-1=s_b,m-1. Therefore the two sequences must be the same. It follows that there is only one possible representation of I as a sequence s.

Also,
Given the base n, the number I, and a particular level in s, the equation below finds s_level:

     s_level = [ I / n^level ] - [ I / n^level+1 ]*n = [ (I mod n^level+1) / n^level ]

     where [x] = floor(x) = largest integer smaller than or equal to x.

Proof:
Consider true that I = s₀*n⁰ + s₁*n¹ + s₂*n² + ... + s_m-2*n^m-2 + s_m-1*n^m-1.
This represents computing the index I from the sequence s.

Now let y such that (substituting for s_level in the equation for I above):
   y = ([ I / n⁰ ] - [ I / n¹ ]*n)*n⁰ + ([ I / n¹ ] - [ I / n² ]*n)*n¹ + ([ I / n² ] - [ I / n³ ]*n)*n² + ... + ([ I / n^m-2 ] - [ I / n^m-1 ]*n)*n^m-2 + ([ I / n^m-1 ] - [ I / n^m ]*n)*n^m-1

Simplifying the expression,
=> y = [ I / n⁰ ]*n⁰ - [ I / n¹ ]*n¹ + [ I / n¹ ]*n¹ - [ I / n² ]*n² + [ I / n² ]*n² - [ I / n³ ]*n³ + ... + [ I / n^m-2 ]*n^m-2 - [ I / n^m-1 ]*n^m-1 + [ I / n^m-1 ]*n^m-1 - [ I / n^m ]*n^m

After massive cancellation, only the first and last terms remain,
=> y = [ I / n⁰ ]*n⁰ - [ I / n^m ]*n^m

But I < n^m,
=> y = I - 0
=> y = I
Since y=I, it follows that s_level = [ I / n^level ] - [ I / n^level+1 ]*n.

Now the problem initially mentioned can be solved.

Essentially, we have: k = ax+by+cz, where all are non-negative integers.
By analogy, we have: I = x*n⁰ + y*n¹ + z*n².
Therefore we have: s={x, y, z}, k=I, a=n⁰, b=n¹ and c=n².
The base 'n' is any integer greater than x, y and z.

When inputs are natural numbers or zero

Essentially, we have:

    (dm+ep+fs)a + (dn+eq+ft)b + (do+er+fu)c  =  d(ma+nb+oc) + e(pa+qb+rc) + f(sa+tb+uc)
    (gm+hp+is)a + (gn+hq+it)b + (go+hr+iu)c  =  g(ma+nb+oc) + h(pa+qb+rc) + i(sa+tb+uc)
    (jm+kp+ls)a + (jn+kq+lt)b + (jo+kr+lu)c  =  j(ma+nb+oc) + k(pa+qb+rc) + l(sa+tb+uc)

It is assumed that all inputs are natural numbers or zero. So the 'base n numeral system' can be used.

Following from the previous section, the choices of the coefficients a, b and c are:
    a = L⁰
    b = L¹
    c = L²
where L is a number larger than any element of the result matrix R.

Let g be the largest input of both matrices D and M. g can be obtained by going through the two matrices, taken 4n² comparison+act operations.
It is clear that L, where L = (g²)(n)+1, is always larger than any element of the result matrix R. This constitutes 4 extra operations (not 3 operations, in which case the analyses of performances made so far will have a problem!).

The coefficients a, b and c can now be obtained. Since they are stored in memory they can be computed one after another. That is a is used to compute b, then b is used to compute c. This forms an additional 2n-1 operations.

The array T can now be obtained, taking 4n² operations. This is where the largest numbers in the algorithm are encountered.
However is it really necessary for the elements of T to be essentially 'numbers'...?!

Following from the previous section, an element in T can be regarded as the value I. In this case the corresponding row of the result matrix R will be the sequence s that represents I. If for example:

    T[0]  =  d(ma+nb+oc) + e(pa+qb+rc) + f(sa+tb+uc)  =  (dm+ep+fs)a + (dn+eq+ft)b + (do+er+fu)c
then:
    (dm+ep+fs)  =  [ ( T[0] mod b) / a ]
    (dn+eq+ft)  =  [ ( T[0] mod c) / b ]
    (do+er+fu)  =  [ ( T[0]      ) / c ]

This forms 3n-1 operations (not 2n-1, in which case the analyses of performances made so far will have a problem!). For all n rows of R, there is a total of (3n-1)*n = 3n²-n operations.

Combining all the operations that have been analised, the overall total number of operations undertaken by the algorithm = 4n² + 4 + 2n-1 + 4n² + 3n²-n = (11n²+n+3) operations. However the storage space complexity is O(n² log(g)).

Constraints over a better algorithm

Recall the problem that was previously mentioned about some chosen values a, b, c, some unknowns x, y, z, and a certain value k = ax+by+cz. The main difficulty in this problem lies on the fact that x, y and z can be any value. For the special case where they are all whole numbers including zero, the solution previously mentioned regarding the base n numeral system can be used. There is no solution for the case where x, y and z are any real numbers. Essentially there is no numeral system under the set of real numbers.

As far as matrix multiplication is concerned, there is to wonder whether the elements of the result matrix R can be any values. If so then, based on the basic idea, there shouldn't be any better algorithm than the one discussed here. And also there shouldn't be any algorithm for the case where real numbers are involved. However, there is to wonder whether the elements of R are not just any values, but values with some relationship with the values of the input matrices D and M, the later which are of course any values. If so then there may be a better algorithm. So at the very least, based on the basic idea, a better algorithm should make use of a characteristic true to matrix multiplication.

Actually there is a characteristic of matrix multiplication. So far only the case of the rows of R was considered. But the case of the columns of R could have been used as well. It turns out that both cases share elements of the result matrix R. That is for example:

    T[0] for rows     =  d(ma+nb+oc) + e(pa+qb+rc) + f(sa+tb+uc)  =  (dm+ep+fs)a + (dn+eq+ft)b + (do+er+fu)c
    T[0] for columns  =  m(da+gb+jc) + p(ea+hb+kc) + s(fa+ib+lc)  =  (dm+ep+fs)a + (gm+hp+is)b + (jm+kp+ls)c

T[0] for both rows and columns share one unknown element of R, which is (dm+ep+fs). This characteristic of matrix multiplication may help to design a better algorithm.