Matrix Multiplication Optimization
Abstract:
This example shows two ways of performing matrix multiplication: a simple version that gives moderate performance, and a slightly more complex version that achieves optimal performance with the e-gcc compiler. The optimized code is partially unrolled so that the compiler can take advantage of the architecture's double-word load/store instructions and avoid unnecessary pipeline stalls.
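As a rough illustration of what partial unrolling means here, the sketch below computes two adjacent output elements per inner loop. It is not the listing from this example: the matrix size `N`, the row-major layout, and the function names `sketch_matmul_ref`, `sketch_matmul_unroll2`, and `sketch_max_abs_diff` are all assumptions made for illustration. The two `b` operands read in each inner iteration are adjacent in memory, which is the kind of access pattern a compiler may be able to merge into one paired load, and the two accumulators are independent of each other. Whether a given compiler actually does so depends on the target, the alignment, and the flags, and has not been verified here.

```c
#include <assert.h>

#define N 8 /* hypothetical matrix dimension, not taken from the original listing */

/* Reference version: c = a * b for row-major N x N matrices. */
void sketch_matmul_ref(const float *restrict a, const float *restrict b,
                       float *restrict c)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
    }
}

/* Partially unrolled version: two adjacent columns of c per inner loop.
 * b[k*N + j] and b[k*N + j + 1] are neighbours in memory, and s0/s1 are
 * independent accumulators, so consecutive multiply-adds do not depend
 * on each other. */
void sketch_matmul_unroll2(const float *restrict a, const float *restrict b,
                           float *restrict c)
{
    assert(N % 2 == 0); /* the unroll factor of 2 must divide N */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += 2) {
            float s0 = 0.0f;
            float s1 = 0.0f;
            for (int k = 0; k < N; k++) {
                float av = a[i * N + k]; /* loaded once, used twice */
                s0 += av * b[k * N + j];
                s1 += av * b[k * N + j + 1];
            }
            c[i * N + j] = s0;
            c[i * N + j + 1] = s1;
        }
    }
}

/* Runs both versions on small integer-valued inputs (exactly representable
 * in float) and returns the largest absolute difference between results. */
float sketch_max_abs_diff(void)
{
    float a[N * N], b[N * N], c_ref[N * N], c_opt[N * N];
    for (int i = 0; i < N * N; i++) {
        a[i] = (float)(i % 7 - 3);
        b[i] = (float)(i % 5 - 2);
    }
    sketch_matmul_ref(a, b, c_ref);
    sketch_matmul_unroll2(a, b, c_opt);

    float max_diff = 0.0f;
    for (int i = 0; i < N * N; i++) {
        float d = c_ref[i] - c_opt[i];
        if (d < 0.0f)
            d = -d;
        if (d > max_diff)
            max_diff = d;
    }
    return max_diff;
}
```

Calling `sketch_max_abs_diff()` gives a quick check that the unrolled variant still computes the same product as the reference loop. The unroll factor of 2 is an arbitrary choice for this sketch; a larger factor trades more registers for fewer loop iterations.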
Naive Code:
unsigned matmul_naive(float * restrict a, float * restrict b, float * restrict c)
{
int i, j, k;
for (i=0; i
Optimized Code:
unsigned matmul(float * restrict aa, float * restrict bb, float * restrict cc)
{
int i = 0;
for (i=0; i
Compile Switches:
{-Wall -O3 -std=c99 -mlong-calls -mfp-mode=round-nearest -ffp-contract=fast -funroll-loops}