## Part I: Performance of Matrix multiplication in Python, Java and C++

This is Part I of my matrix multiplication series. Part I was about simple matrix multiplication algorithms and Part II was about the Strassen algorithm.
Part III is about parallel matrix multiplication.

This post is about simple implementations of matrix multiplications. The goal of this post is to find out how easy it is to implement a matrix multiplication in Python, Java and C++. Additionally, I want to get to know how good these solutions are.

The second post will be an implementation of the Strassen algorithm for matrix multiplication. Strassen algorithm does matrix multiplication in $$\cal O(n^{log_2(7)+o(1)}) \approx \cal O(n^{2.807})$$ instead of $$\cal O(n^3)$$. I am quite sure this will outperform almost every other change. See Part II: The Strassen algorithm in Python, Java and C++.

The third post will be about parallel programming. I have two cores and I want to see if it will be significantly faster if I use both of them.

## The implementations

I will post all scripts for this test and I’ve added a GIT repository, so feel free to test it on your machine. I am also happy if you post some of your solutions with running times I am quite sure that my Java and C++ code can be written much better. If you know how, please leave a comment.
If you know other languages, you could create a script for these. I focus on Python, Java and C++ as they are very often used.

I have implemented these three types of algorithms for this post:

• ijk-algorithm: This is a simple, straight forward implementation of a matrix multiplication. I’ve used the definition of matrix multiplication. I didn’t use multiple threads.
• ikj-algorithm: just like the ijk-algorithm, but I’ve switched two of the three the for-loops.
• Library-functions: I always prefer libraries over self-implemented solutions. I think they are faster than anything I could come up with in a reasonable amount of time.

If you post a solution, please consider these restrictions:

• Input: The input file should get passed with the parameter -i, e.g.:
python -i 2000.in or java Shell -i 2000.in
• The standard value for the command line parameter -i should be “2000.in” (a $$2000 \times 2000$$ matrix)
• The user should not have to give the size of the matrix!
• The two square-matrices that should get multiplied are …
• … read from a text-file.
• … represented like this:
• Every line of one matrix is one line in the text-file.
• Newlines are only “\n”.
• Every number is separated by “\t”.
• The both matrices are separated by one newline.
• Output: The result has to get printed to standard output.
• The result has to be formatted like the input (tabs for separation of number, \n for marks a new line)

## The Tests

I will check the speed of a multiplication of two big matrices following for Python, Java and C++ for all algorithms like this:

time python scriptABC.py -i ../2000.in > result.txt
diff result.txt bigMatrix.out

The bigMatrix.out was produced by the Python ijk-implementation. I make the diff to test if the result is correct.

## The Setting

I created two “random” matrices $$A, B \in \mathbb{N}^{2000 \times 2000}$$ with this script. The file that was created needs about 29.7 MB and is also in the GIT-Hub repository. But you can also create the matrices with this script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import random
random.seed(1234)

def createRandomMatrix(n):
maxVal = 1000 # I don't want to get Java / C++ into trouble
matrix = []
for i in xrange(n):
matrix.append([random.randint(0,maxVal) for el in xrange(n)])
return matrix

def saveMatrix(matrixA, matrixB, filename):
f = open(filename, 'w')
for i, matrix in enumerate([matrixA, matrixB]):
if i != 0:
f.write("\n")
for line in matrix:
f.write("\t".join(map(str, line)) + "\n")
n = 3
matrixA = createRandomMatrix(n)
matrixB = createRandomMatrix(n)
saveMatrix(matrixA, matrixB, "2000.in")

All scripts are tested on my computer:

 Acer TravelMate 5735Z CPU 2x Pentium(R) Dual-Core CPU T4500 @2.30GHz RAM 4 GB Video Card Intel GMA 4500MHD System Ubuntu 10.10.04 LTS

## Python

I’ve used Python 2.6.5.

### ijk-algorithm

#!/usr/bin/python
# -*- coding: utf-8 -*-

from optparse import OptionParser
parser = OptionParser()
help="input file with two matrices", metavar="FILE")
(options, args) = parser.parse_args()

A = []
B = []
matrix = A
for line in lines:
if line != "":
matrix.append(map(int, line.split("\t")))
else:
matrix = B
return A, B

def printMatrix(matrix):
for line in matrix:
print "\t".join(map(str,line))

def standardMatrixProduct(A, B):
n = len(A)
C = [[0 for i in xrange(n)] for j in xrange(n)]
for i in xrange(n):
for j in xrange(n):
for k in xrange(n):
C[i][j] += A[i][k] * B[k][j]
return C

C = standardMatrixProduct(A, B)
printMatrix(C)
real	56m49.266s
user	56m30.524s
sys	0m2.980s

### ikj-algorithm

def ikjMatrixProduct(A, B):
n = len(A)
C = [[0 for i in xrange(n)] for j in xrange(n)]
for i in xrange(n):
for k in xrange(n):
for j in xrange(n):
C[i][j] += A[i][k] * B[k][j]
return C
real	44m36.507s
user	44m13.458s
sys	0m2.000s

### Psyco ikj-algorithm

Psyco is a just in time compiler, which makes my scripts MUCH faster. It is very simple to use. Add these two lines at the top of the ikj-script:

import psyco
psyco.full()
real	6m14.820s
user	6m12.959s
sys	0m0.620s

Amazing, isn’t it?

### Libraries

#### NumPy

NumPy-Version: 1.3.0 (Current version is 1.6.2, see Wiki)

#!/usr/bin/python
# -*- coding: utf-8 -*-

import numpy

from optparse import OptionParser
parser = OptionParser()
help="input file with two matrices", metavar="FILE")
(options, args) = parser.parse_args()

A = []
B = []
matrix = A
for line in lines:
if line != "":
matrix.append(map(int, line.split("\t")))
else:
matrix = B
return A, B

def printMatrix(matrix):
matrix = numpy.array(matrix)
for line in matrix:
print "\t".join(map(str,line))

A = numpy.matrix(A)
B = numpy.matrix(B)
C = A * B # easy and intuitive, isn't it?
printMatrix(C)
real	1m38.425s
user	1m36.066s
sys	0m0.520s

#### SciPy

You might need to install python-scitools.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import numpy
import scipy

from optparse import OptionParser
parser = OptionParser()
help="input file with two matrices", metavar="FILE")
(options, args) = parser.parse_args()

A = []
B = []
matrix = A
for line in lines:
if line != "":
matrix.append(map(int, line.split("\t")))
else:
matrix = B
return A, B

def printMatrix(matrix):
matrix = numpy.array(matrix)
for line in matrix:
print "\t".join(map(str,line))

A = scipy.matrix(A)
B = scipy.matrix(B)
C = A * B # easy and intuitive, isn't it?
printMatrix(C)
real	1m35.795s
user	1m33.438s
sys	0m0.488s

### Conclusion for Python

Python execution times for matrix multiplication

Using NumPy is by far the easiest and fastest option. I’ve needed about five minutes for each of the non-library scripts and about 10 minutes for the NumPy/SciPy scripts.

By the way, it is useless to combine Psyco and NumPy. It gets a little bit faster (1 minute and 28 seconds), but this could also be a random effect. If you execute it many times, you will see that the execution time is never the same.

## Java

I am using this Java version:

### ijk-algorithm

#include <sstream>
#include <string>
#include <fstream>
#include <iostream>
#include <vector>

using namespace std;

struct Result {
vector< vector<int> > A;
vector< vector<int> > B;
};

vector< vector<int> > A, B;
Result ab;
string line;
ifstream infile;
infile.open (filename.c_str());

int i = 0;
while (getline(infile, line) && !line.empty()) {
istringstream iss(line);
A.resize(A.size() + 1);
int a, j = 0;
while (iss >> a) {
A[i].push_back(a);
j++;
}
i++;
}

i = 0;
while (getline(infile, line)) {
istringstream iss(line);
B.resize(B.size() + 1);
int a;
int j = 0;
while (iss >> a) {
B[i].push_back(a);
j++;
}
i++;
}

infile.close();
ab.A = A;
ab.B = B;
return ab;
}

vector< vector<int> > ijkalgorithm(vector< vector<int> > A,
vector< vector<int> > B) {
int n = A.size();

// initialise C with 0s
vector<int> tmp(n, 0);
vector< vector<int> > C(n, tmp);

for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
for (int k = 0; k < n; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
return C;
}

void printMatrix(vector< vector<int> > matrix) {
vector< vector<int> >::iterator it;
vector<int>::iterator inner;
for (it=matrix.begin(); it != matrix.end(); it++) {
for (inner = it->begin(); inner != it->end(); inner++) {
cout << *inner;
if(inner+1 != it->end()) {
cout << "\t";
}
}
cout << endl;
}
}

int main (int argc, char* argv[]) {
string filename;
if (argc < 3) {
filename = "2000.in";
} else {
filename = argv[2];
}
vector< vector<int> > C = ijkalgorithm(result.A, result.B);
printMatrix(C);
return 0;
}
real	1m40.439s
user	1m38.642s
sys	0m0.280s

### ikj-algorithm

Again, I’ve only switched line 61 and 62.

real	0m15.172s
user	0m14.877s
sys	0m0.248s

### Library: Boost

If you want to compile these scripts, you might have to install the boost libraries first. On Ubuntu you can enter:

sudo apt-get install libboost-math*
#include <sstream>
#include <string>
#include <fstream>
#include <iostream>
#include <vector>
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/io.hpp>

using namespace std;

struct Result {
boost::numeric::ublas::matrix<int> A;
boost::numeric::ublas::matrix<int> B;
};

int getMatrixSize(string filename) {
string line;
ifstream infile;
infile.open (filename.c_str());
getline(infile, line);
return count(line.begin(), line.end(), '\t') + 1;
}

void printMatrix(boost::numeric::ublas::matrix<int> matrix) {
for (unsigned int i=0; i < matrix.size1(); i++) {
for (unsigned int j=0; j < matrix.size2(); j++) {
cout << matrix(i, j);
if(j+1 != matrix.size2()) {
cout << "\t";
}
}
cout << endl;
}
}

Result ab;
string line;
ifstream infile;
infile.open (filename.c_str());

// get dimension
getline(infile, line);
int n = getMatrixSize(filename);

boost::numeric::ublas::matrix<int> A(n,n), B(n,n);

// process first line
istringstream iss(line);
int a, i = 0, j = 0;
while (iss >> a) {
A(i,j) = a;
j++;
}
i++;

while (getline(infile, line) && !line.empty()) {
istringstream iss(line);
j = 0;
while (iss >> a) {
A(i,j) = a;
j++;
}
i++;
}

i = 0;
while (getline(infile, line)) {
istringstream iss(line);
j = 0;
while (iss >> a) {
B(i,j) = a;
j++;
}
i++;
}

infile.close();
ab.A = A;
ab.B = B;
return ab;
}

int main (int argc, char* argv[]) {
string filename;
if (argc < 3) {
filename = "2000.in";
} else {
filename = argv[2];
}

boost::numeric::ublas::matrix<int> C;
C = boost::numeric::ublas::prod(result.A, result.B);
printMatrix(C);

return 0;
}
real	4m15.388s
user	4m10.272s
sys	0m0.588s

### Library: Blitz

This is a great example of useless library. I’ve installed the library:

sudo apt-get install libblitz*

Then I wanted to use it. Well, I have no clue how I could exactly use it! See my StackOverflow Question: Is a documentation of Blitz++ matrices available?

### Conclusion for C++

C++ execution times for matrix multiplication

Again, it brings a performance boost if you know how your CPU works. I was very astonished, that the library Boost is slower (actually MUCH slower) than my simplest approach was.

## Conclusion

If I want to create a working piece of code in a minimum amount of time, I will always take Python. It has been very easy to solve this task with the given restrictions. But C++ is amazing when speed is important.

It was astonishingly difficult to find working code examples for this task for Java and C++. I was searching for libraries and found some, but the search results were not satisfying.

Continue reading with Part II: The Strassen algorithm in Python, Java and C++
You can skip to the end and leave a response. Pinging is currently not allowed.

### 8 Responses to “Part I: Performance of Matrix multiplication in Python, Java and C++”

1. Martin Thoma says:

My professor for software engineering – who also inspired me to write this post – had some remarks:

* I should use 64-Bit double matrices, as int-matrices aren’t interesting.
* I probably don’t have a native JVM which would boost Java

A student suggested, that I should give a try to pypy. I’ll do both later

2. Moritz says:

Hi Martin,

nice article.

I think you should generally wrap the code that does the actual multiplication between calls to clock() to determine the computation time rather than measuring the overall execution time of the program. For the faster solutions, the latter will effectively measure the I/O time, which is not interesting here.

I’ve written a test for Fortran which is usually considered the gold standard in number crunching. On my machine, it performs only insignificantly better than the C++ solution. (Fortran has an intrinsic matrix multiplication function, matmul. I’ve also tried the BLAS degemm subroutine but it wasn’t faster.) Eventually, I’ll make a commit to your code later.

I don’t think, you did BOOST/uBLAS justice, however. First, please

#define NDEBUG 1

to switch off all debugging normally enabled by uBLAS. This reduced computation time by a factor of six or so, on my machine. Second, you may use

#include
// …
boost::numeric::ublas::matrix C(result.A.size1(), result.B.size2());
boost::numeric::ublas::axpy_prod(result.A, result.B, C, true);

as a better performing alternative.

When I did so, the code was nothing slower than your implementation. Of course, one would generally hope to improve performance when using a library (not just not make it worse), however, I’m not certain by any means that this is the end of all possible optimization.

Looking forward to see your Blitz++ solution as I couldn’t get it figured out either.

Eventually, I will also commit a makefile that compiles and runs all the different implementations and automatically logs the performance.

Regarding the Python solutions: Most of the time in the na(t)ive solution is wasted on the n**3 type checks, the Python interpreter must perform during the multiplication. I’m not sure if PyPy’s JIT compilation could help much here. I’m sure you know but I think it is worth to mention that NumPy uses external C modules which again link to the Fortran routines of the LAPACK / BLAS library so the additional time compared with the Fortran solution is basically used for caling the subroutines and passing around the data. As far as I know, numpy.matrix is the same as scipy.matrix.

The BLAS and other high-performance libraries make use of architecture-specific optimization so it might matter how one installs and sets them up. I think the manual iteration could be made even faster if you’d break the matrix into its memory pages so even less page swaps will be needed. This will, in general, also be a non-portable effort, of course.

Best

Moritz

• Martin Thoma says:

Hi Moritz,

thanks for your answer! I thought about measuring the time after the read operation and stopping it before the program is going to write. I didn’t do it for two reasons:
* I want to get comparable results. This includes …
* the resulting matrix which I want to diff with the correct result.
* using the same method of getting the time for every language.
* It is also important to get to know if the I/O speed of different languages is significantly different. My guess is that I could speed up I/O for Python as well as for Java. But how can I do so?
* I would like to get the time of matrix multiplication for many matrices. How does the time change when I use 100×100 matrices? I’m currently working on a detailed answer. Here is the short one: I want to create many matrices, loop in the bash over the files, execute the program for each file once, store the time in a CSV-file and plot this file with LaTeX. This task gets much more difficult if I can’t control the format of the output time. As matrix multiplication is in O(n^3) for the given algorithms I guess the results will not be very interesting, but this approach might be interesting for other problems, too. And I also thought that all libraries would give boring results which was obviously not true.

So I/O is also part of the task for every programming language. If I have an I/O heavy task and my preferred programming language is slow at I/O I can’t just say “hey, my language is very fast at computation, I/O isn’t interesting”. I admit that the title might be misleading, but calling it “Part I: Performance of Matrix reading, multiplying and writing in Python, Java and C++” would be much too long.

Nevertheless, I am also interested in exact I/O-time results. As you’re interested in them, you could time the I/O of Java and post the results.

I read about the Debugging-Flags on Stack Overflow. According to the author of this answer, Boost is still slower than the ikj-algorithm.
Thanks for your suggestions how to increase the performance for the boost part, I’ll include that as soon as I have some time.

3. Jaan says:

The Java code example could certainly be improved. One, who needs to efficiently hold and access large collections of numbers, should use primitive arrays (i. e. int[][] and double[][]) instead of ArrayList/Vector of Integers or Doubles. (Perhaps even better performance could be achieved by using one-dimensional arrays int[m*n] and double[m*n] for m-by-n matrices.)

Integers and Doubles are slow compared to int and double because of object dereferencing overhead, bad memory locality (due to high memory usage overhead per element), and time spent on object creation/garbage collection (for two 2000*2000 matrices, 2*2000*2000=8,000,000 Integer objects is quite a lot for the garbage collector to scan).

4. Kairat says:

The java code example is wrong.

Using reference types like Integer, Double , Long is not good a idea for such tasks, cause they use too much of dynamic memory.

So u should increase the memory for jvm , or use primitive types like int or double.

What parameters for JVM did you use ?

Did you use JIT-compiler of JVM ?

5. Marc Claesen says:

I am surprised you are benchmarking matrix multiplication without using the BLAS and/or ATLAS. Those are the standard libraries for such kind of thing. In ATLAS, a 2000×2000 matrix multiplication will require far less than 15 seconds (probably less than 2).

The Basic Linear Algebra Software (BLAS) is a collection of vector/matrix computations, organized in 3 levels of increasing complexity. It’s pretty much the standard API for any serious algebra (originally written in Fortran, comes with interfaces to C++, Java, Python …). Highly optimized vendor-specific implementations of the BLAS exist such as Intel Math Kernel Library (MKL). The Automatically Tuned Linear Algebra Software (ATLAS) is a package that self-optimizes for your own system by running various tests prior to compiling (it optimizes itself for cache use and more).

The BLAS is usually installed by default on most Linux distributions. If not, it should at least be listed in your favorite package manager.
For ATLAS: http://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software

Finally, I strongly advise to time in high priority mode (e.g. sudo nice -3 time …). The measurements you have made in normal priority can be very noisy if other processes are interfering.

• Martin Thoma says:

Hi Marc,

thanks for your comment. Could you please make a minimal example how to use BLAS and/or ATLAS?

What’s the difference between BLAS and uBlas (which I’ve used, see “Boost”)?

• Marc Claesen says:

Hi Martin,

Interfacing to BLAS and ATLAS is easy if you use their C interfaces. For example, for single precision matrix-matrix multiplication you need to interface with sgemm (dgemm for double precision).

In C++, this is done by declaring the API function as extern:

enum CBLAS_ORDER {CblasRowMajor=101, CblasColMajor=102};
enum CBLAS_TRANSPOSE {CblasNoTrans=111, CblasTrans=112,
CblasConjTrans=113, AtlasConj=114};

extern “C”{
void cblas_sgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE
TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
const int K, const float alpha, const float *A, const int lda,
const loat *B, const int ldb, const float beta, float *C, const int ldc);
}

You can google these functions to find their documentation and how to use them. You’ll need to link to BLAS/ATLAS libraries during compilation to use the functions (-lblas or -latlas).

I am not familiar with uBlas, so I can’t say what the difference are. The reference implementation of the BLAS is the netlib version (this is not optimized). If you want to use ATLAS, you can download it for free and tune/compile it yourself. ATLAS offers the xgemm routines in its API, so you don’t have to make any changes in your code.

To give you some idea with regards to performance, I recently timed a medium-scale matrix-matrix multiplication (2.000×37.000 times 37.000×2.000) with the following results (relative):
- optimal loop order + SIMD instructions: 1
- default BLAS on debian wheezy: 1/45
- ATLAS: 1/110

Hopefully this is useful for you!

Marc