We got some pretty interesting results for matrix multiplication so far. Now, I would like to get to know in how far performance increases (or decreases) if I make use of multiple cores. I only have two cores, so I hope somebody with a better computer will also run the ikj single core algorithm form part I and the parallel version from this article and post the results as a comment.
The implementations
As last time, I’ve added the scripts to a GIT repository. So you can test it on your machine.
Before we start implementing code for multiple processors, we have to get an algorithm that is actually parallelisable. You could use Cannon's algorithm, a algorithm that makes use of systolic arrays or try to find a solution by your own. The Scalable Universal Matrix Multiplication Algorithm (short: SUMMA) could also work. The paper that I've linked is well-written and easy to understand. You should definitively read it, if you're interested in matrix multiplication.
I will not use any advanced algorithm in this article. I will make the outer most for loop of the ikj-algorithm (see part I) execute in parallel.
More about parallel matrix multiplication:
- Case Study: Matrix Multiplication
- berrendorf
- Algorithms for efficient matrix multiplication
- Coppersmith-Winograd algorith
Python
Because of global interpreter lock (GIL), you need to start new processes in Python (source).
The ikj single core algorithm implemented in Python needs:
time python ikjMultiplication.py -i 2000.in > 2000-nonparallel.out
real 36m0.699s
user 35m53.463s
sys 0m2.356s
The most simple way to parallelize the ikj algorith is to use the multiprocessing module and compute every line of the result matrix C with a new process. But for the 2000x2000-example, this would mean we started 2000 processes. The overhead is much worse than the benefit:
time python ikjMultiplication.py -i 2000.in > 2000-parallel.out
real 20m47.693s
user 40m34.460s
sys 0m2.092s
When we share memory, the code looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import multiprocessing, numpy, ctypes
def read(filename):
lines = open(filename, "r").read().splitlines()
A = []
B = []
matrix = A
for line in lines:
if line != "":
matrix.append(map(int, line.split("\t")))
else:
matrix = B
return A, B
def printMatrix(matrix, f):
for line in matrix:
f.write("\t".join(map(str, line)) + "\n")
def lineMult(start):
global A, B, mp_arr, part
n = len(A)
# create a new numpy array using the same memory as mp_arr
arr = numpy.frombuffer(mp_arr.get_obj(), dtype=ctypes.c_int)
C = arr.reshape((n, n))
for i in range(start, start + part):
for k in range(n):
for j in range(n):
C[i][j] += A[i][k] * B[k][j]
def ikjMatrixProduct(A, B, threadNumber):
n = len(A)
pool = multiprocessing.Pool(threadNumber)
pool.map(lineMult, range(0, n, part))
# mp_arr and arr share the same memory
arr = numpy.frombuffer(mp_arr.get_obj(), dtype=ctypes.c_int)
C = arr.reshape((n, n))
return C
def extant_file(x):
"""
'Type' for argparse - checks that file exists but does not open.
"""
if not isfile(x):
raise argparse.ArgumentError("{0} does not exist".format(x))
return x
if __name__ == "__main__":
import argparse, sys
from os.path import isfile
from argparse import ArgumentParser
parser = ArgumentParser(description="ikjMatrix multiplication")
parser.add_argument(
"-i",
"--input",
dest="filename",
required=True,
type=extant_file,
help="input file with two matrices",
metavar="FILE",
)
parser.add_argument(
"-o",
"--output",
type=argparse.FileType(mode="w"),
default=sys.stdout,
dest="output",
help="file to write output to (default=stdout)",
)
args = parser.parse_args()
A, B = read(args.filename)
n, m, p = len(A), len(A[0]), len(B[0])
threadNumber = 2
part = len(A) / threadNumber
if part < 1:
part = 1
# shared, can be used from multiple processes
mp_arr = multiprocessing.Array(ctypes.c_int, n * p)
C = ikjMatrixProduct(A, B, threadNumber)
printMatrix(C, args.output)
and it needs MUCH more time:
time python ikjMultiplication-shared.py -i 2000.in > 2000-parallel-2threads.out
real 131m35.433s
user 250m36.820s
sys 0m9.533s
When we don't use shared memory, things run faster:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import multiprocessing, numpy, ctypes
def read(filename):
lines = open(filename, "r").read().splitlines()
A = []
B = []
matrix = A
for line in lines:
if line != "":
matrix.append(map(int, line.split("\t")))
else:
matrix = B
return A, B
def printMatrix(matrix, f):
for line in matrix:
f.write("\t".join(map(str, line)) + "\n")
def lineMult(start):
global A, B, C, part
n = len(A)
for i in range(start, start + part):
for k in range(n):
for j in range(n):
C[i][j] += A[i][k] * B[k][j]
def ikjMatrixProduct(A, B, threadNumber):
n = len(A)
pool = multiprocessing.Pool(threadNumber)
pool.map(lineMult, range(0, n, part))
return C
def extant_file(x):
"""
'Type' for argparse - checks that file exists but does not open.
"""
if not isfile(x):
raise argparse.ArgumentError("{0} does not exist".format(x))
return x
if __name__ == "__main__":
import argparse, sys
from os.path import isfile
from argparse import ArgumentParser
parser = ArgumentParser(description="ikjMatrix multiplication")
parser.add_argument(
"-i",
"--input",
dest="filename",
required=True,
type=extant_file,
help="input file with two matrices",
metavar="FILE",
)
parser.add_argument(
"-o",
"--output",
type=argparse.FileType(mode="w"),
default=sys.stdout,
dest="output",
help="file to write output to (default=stdout)",
)
args = parser.parse_args()
A, B = read(args.filename)
n, m, p = len(A), len(A[0]), len(B[0])
threadNumber = 2
part = len(A) / threadNumber
if part < 1:
part = 1
C = [[0 for i in range(n)] for j in range(n)]
C = ikjMatrixProduct(A, B, threadNumber)
printMatrix(C, args.output)
time python ikjMultiplication.py -i 2000.in > 2000-parallel-4threads.out
real 22m46.066s
user 41m42.396s
sys 0m2.324s
Java
Shell.java:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class Shell {
static List<ArrayList<ArrayList<Integer>>> read(String filename) {
ArrayList<ArrayList<Integer>> A = new ArrayList<ArrayList<Integer>>();
ArrayList<ArrayList<Integer>> B = new ArrayList<ArrayList<Integer>>();
String thisLine;
try {
BufferedReader br = new BufferedReader(new FileReader(filename));
// Begin reading A
while ((thisLine = br.readLine()) != null) {
if (thisLine.trim().equals("")) {
break;
} else {
ArrayList<Integer> line = new ArrayList<Integer>();
String[] lineArray = thisLine.split("\t");
for (String number : lineArray) {
line.add(Integer.parseInt(number));
}
A.add(line);
}
}
// Begin reading B
while ((thisLine = br.readLine()) != null) {
ArrayList<Integer> line = new ArrayList<Integer>();
String[] lineArray = thisLine.split("\t");
for (String number : lineArray) {
line.add(Integer.parseInt(number));
}
B.add(line);
}
br.close();
} catch (IOException e) {
System.err.println("Error: " + e);
}
List<ArrayList<ArrayList<Integer>>> res = new LinkedList<ArrayList<ArrayList<Integer>>>();
res.add(A);
res.add(B);
return res;
}
static void printMatrix(int[][] matrix) {
for (int[] line : matrix) {
int i = 0;
StringBuilder sb = new StringBuilder(matrix.length);
for (int number : line) {
if (i != 0) {
sb.append("\t");
} else {
i++;
}
sb.append(number);
}
System.out.println(sb.toString());
}
}
public static int[][] parallelMult(ArrayList<ArrayList<Integer>> A,
ArrayList<ArrayList<Integer>> B, int threadNumber) {
int[][] C = new int[A.size()][B.get(0).size()];
ExecutorService executor = Executors.newFixedThreadPool(threadNumber);
List<Future<int[][]>> list = new ArrayList<Future<int[][]>>();
int part = A.size() / threadNumber;
if (part < 1) {
part = 1;
}
for (int i = 0; i < A.size(); i += part) {
System.err.println(i);
Callable<int[][]> worker = new LineMultiplier(A, B, i, i+part);
Future<int[][]> submit = executor.submit(worker);
list.add(submit);
}
// now retrieve the result
int start = 0;
int CF[][];
for (Future<int[][]> future : list) {
try {
CF = future.get();
for (int i=start; i < start+part; i += 1) {
C[i] = CF[i];
}
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
start+=part;
}
executor.shutdown();
return C;
}
public static void main(String[] args) {
String filename;
int cores = Runtime.getRuntime().availableProcessors();
System.err.println("Number of cores:\t" + cores);
int threads;
if (args.length < 3) {
filename = "3.in";
threads = cores;
} else {
filename = args[1];
threads = Integer.parseInt(args[2]);
}
List<ArrayList<ArrayList<Integer>>> matrices = read(filename);
ArrayList<ArrayList<Integer>> A = matrices.get(0);
ArrayList<ArrayList<Integer>> B = matrices.get(1);
int[][] C = parallelMult(A, B, threads);
printMatrix(C);
}
}
LineMultiplier.java:
import java.util.ArrayList;
import java.util.concurrent.Callable;
public class LineMultiplier implements Callable<int[][]> {
ArrayList<ArrayList<Integer>> A;
ArrayList<ArrayList<Integer>> B;
int start;
int end;
public int[][] C;
public LineMultiplier(ArrayList<ArrayList<Integer>> a,
ArrayList<ArrayList<Integer>> b, int s, int e) {
A = a;
B = b;
C = new int[a.size()][b.get(0).size()];
start = s;
end = e;
}
@Override
public int[][] call() {
for (int i = start; i < end; i++) {
for (int k = 0; k < B.size(); k++) {
for (int j = 0; j < B.get(0).size(); j++) {
C[i][j] += A.get(i).get(k) * B.get(k).get(j);
}
}
}
return C;
}
}
Execute it with only one thread:
time java Shell -i 2000.in 1 > 2000-paralllel.out
Number of cores: 2
0
real 0m40.571s
user 0m42.259s
sys 0m0.388s
Execute it with two threads:
time java Shell -i 2000.in 2 > 2000-paralllel.out
Number of cores: 2
0
1000
real 0m30.188s
user 0m54.999s
sys 0m0.512s
Note that real time is lower than user time. The reason is simply that the execution time on each processor is added. So user time might be double as high as real time!
We got from 40.571s down to 0m30.188s!
Information for parallelism
C++
Making the ikj-algorithm parallel is trivial with C++. You only need to add #pragma omp parallel for
before the outer most for loop and add -fopenmp
as a compile flag!
(If you really want to see the code, go to my Git repository.)
$ time ./ikj-algorithm.out 2 2000.in > 2000-parallel.out
real 0m12.563s
user 0m20.569s
sys 0m0.156s
So we got from 20.407 seconds down to 12.563 seconds by adding only one line!