Recently I wanted to speed up some very frequently called code using Intel MKL. Specifically, a series of weighted products should be calculated using a matrix multiplication. This can save a lot of time for large matrices, since good matrix multiplication algorithms can reach $O(n^{\approx2.8})$ complexity in the elements having to be processed without sacrificing precision, compared to $O(n^{3})$ in a non-optimized case [wiki].

Starting to go through the mkl documentation, I started using a CBLAS (C-interface to the BLAS routines) command to call the matrix multiplication:

1 2 | cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, alpha, A, p, B, n, beta, C, n); |

measuring the performance as also described in the tutorial here. I was absolutely puzzled when I used numpy to perform the same multiplication and numpy was (despite the calling overhead!) faster.

After digging a little, I found out that it is also possible to call the BLAS routines directly from C. Intel does this itself when measuring the library performance, e.g. here. After using the direct BLAS call, I was able to beat the numpy performance.

## The CBLAS interface is slower than BLAS with non-constant overhead

Taking things a little further, I did some measurements and realized that the overhead that the CBLAS interface adds is even **not constant**! See the attached plot (even though I did not add many measurement points).

Since it is possible to use the direct BLAS interface without any overhead, why not do it? The problem is, that BLAS works FORTRAN based, and the arrays in this language are stored in memory with column-major order. This means, that the array elements are stored column by column in memory. The usual convention for C, however, is to store the elements row by row in row-major order.

The CBLAS interface takes an additional parameter to account exactly for this fact. I don't know what it exactly does to solve it, but increasing execution time hints at an additional copying or rearranging step for the values in memory.

However, I thought about the problem and supposed it must be possible to perform the matrix multiplication in row-major layout with an algorithm assuming column-major order without copying.

## Matrix multiplication with a column-major algorithm and row-major memory layout

Consider the following matrix multiplication as an illustrative example:

$$ \overset{A}{\overbrace{\left(\begin{array}{cc} 1 & 2\ 3 & 4\ 5 & 6 \end{array}\right)}}\cdot\overset{B}{\overbrace{\left(\begin{array}{ccc} 1 & 2 & 3\ 4 & 5 & 6 \end{array}\right)}}=\overset{C}{\overbrace{\left(\begin{array}{ccc} 9 & 12 & 15\ 19 & 26 & 33\ 29 & 40 & 51 \end{array}\right)}} $$

$A$ is of size $3\times2$, $B$ of size $2\times3$ and the resulting matrix C of size $3\times3$. Assuming row-major memory layout as we usually want to have it in C, the first two matrices are represented as

1 |
2 |
3 |
4 |
5 |
6 |

and the result matrix as

9 |
12 |
15 |
19 |
26 |
33 |
29 |
40 |
51 |

An algorithm being implemented for processing column-major stored matrices could interpret the chunk of memory for matrix $A$ or $B$ (among other possibilities, depending what you provide as row and column counts) either as

$$\underset{s_{1}}{\underbrace{\left(\begin{array}{cc} 1 & 4\ 2 & 5\ 3 & 6 \end{array}\right)}}\,\mbox{or }\underset{s_{2}}{\underbrace{\left(\begin{array}{ccc} 1 & 3 & 5\ 2 & 4 & 6 \end{array}\right)}}$$

That is great, because the two presented correspond right to the transposed versions of the original matrices. Linear algebra provides us with the equivalency of

$$B^{T}\cdot A^{T}=\left(A\cdot B\right)^{T}$$

To make things work in this example, we must provide the row and column counts for shape $s_{2}$ for matrix $A$ and for shape $s_{1}$ for matrix $B$. Calling the algorithm then with the multiplication of $B\cdot A$ (which it uses as transposed) calculates $$ \left(\begin{array}{cc} 1 & 4\\\ 2 & 5\\\ 3 & 6 \end{array}\right)\cdot\left(\begin{array}{ccc} 1 & 3 & 5\\\ 2 & 4 & 6 \end{array}\right)=\left(\begin{array}{ccc} 9 & 19 & 29\\\ 12 & 26 & 40\\\ 15 & 33 & 51 \end{array}\right)$$ but in column-major order! This means, it is in memory stored as

9 |
12 |
15 |
19 |
26 |
33 |
29 |
40 |
51 |

which is just what we want! The reason is, that the algorithm calculates the result transposed in column-major layout - but with the original memory layout we obtain the wanted result of the original calculation in row-major order, without the need of copying any data!

## Calling BLAS DGEMM

How can this theory be applied on the existing methods? The original BLAS interface takes as arguments

- two pointers to a character, stating whether $A$ or $B$ should be transposed (if it is an
`n`

or`N`

not, if it is a`t`

or`T`

transpose). They define $\mbox{op}_{A}$ and $\mbox{op}_{B}$, - n, m, k, pointers to
`int`

s, describing the dimensionality of the matrices used for the calculation (that is, after $A$ or $B$ was eventually transposed). It is assumed that $\mbox{op}_{A}(A)$ is of dimension $m\times k$ and $\mbox{op}_{B}(B)$ is of dimension $k\times n$, - a scalar factor $\alpha $ with which the elements of the result of the computation $\mbox{op}_{A}(A)\cdot \mbox{op}_{B}(B)$ are multiplied,
- Matrices $A$ and $B$, each followed by its leading dimension descriptor. This is the length of one row in row-major order, or the length of one column in column-major order,
- a scalar factor $\beta $ with which the elements of the matrix $C$ are multiplied, before its contents are added to the result of $\alpha\cdot\mbox{op}_{A}(A)\cdot \mbox{op}_{B}(B)$. The result of this computation is stored in $C$: $$C\leftarrow\alpha\cdot\mbox{op}_{A}\left(A\right)\cdot\mbox{op}_{B}\left(B\right)+\beta\cdot C,$$
- the matrix $C$ and a pointer to its leading dimension value.

The call for this normal matrix multiplication with the matrices in column-major order would be:

1 | dgemm("N","N", m, n, k, alpha, A, k, B, n, beta, C, n); |

However, we can rewrite it with our findings from the last paragraph for matrices in row-major layout easily:

1 | dgemm("N","N", n, m, k, alpha, B, n, A, k, beta, C, n); |

Pretty neat, and no copying whatsoever is needed. To take care of the nasty parameter swapping automatically, you can simply add this C define to your header:

1 2 | #define DGEMM_ROWMAJOR(A,B,C,m,n,k,alpha,beta,transf_A,transf_B, lda, ldb, ldc) \ DGEMM(transf_B, transf_A, n, m, k, alpha, B, ldb, A, lda, beta, C, ldc) |

This will grant you perfectly sound and safe access to the method for row-major arrays with unprecedented speed ;). The macro does all the magic of changing and swapping directly before even the code is executed, meaning there is actually no loss in speed whatsoever. You can use it with the usual parameters as if the algorithm was written for arrays in row-major order.

It took me a while of thinking to figure it out, and I also didn't find resources on this topic on the net. Whether I save more time with the faster computations than I invested remains to be seen... ;) In any case I hope I can relieve some people from severe headaches about this.

## Comments