• Martin Thoma
  • Home
  • Categories
  • Tags
  • Archives
  • Support me

Matrix Multiplication: 2020 Update

Contents

  • Matrix Multiplication: 2020 Update
    • Python
      • pypy
    • C++
    • Using C++ from Python
    • See also

In 2012 I wrote a series of articles about matrix multiplication. Now I'm preparing a course about speeding up Python. For this reason I need an example of code that is fairly simple to understand and can be optimized. So let's update the results of my old articles πŸ™‚

The code can be found in a git repository on GitHub (github.com/MartinThoma/matrix-multiplication) and you can have a look at the old article as well, if you want.

All scripts are tested on my new Thinkpad T460p. For comparision, I've also added the laptop I had before when I wrote the 2012 matrix multiplication article:

  Acer TravelMate 5735Z Thinkpad T460p
CPU (comparison on ark.inten.com) 2x Pentium(R) Dual-Core CPU T4500 @2.30GHz 8x Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
RAM 4 GB 8 GB
Video Card Intel GMA 4500MHD Nvidia GeForce 940MX
System Ubuntu 10.10.04 LTS Ubuntu 18.04.3 LTS

Python

In the following table you can see the execution times for the different algorithms and different Python versions. As input, I took the 2000.in test set. To switch Python versions, I used pyenv

For Python 2.7, you can see the speedup compared to my 2012 machine. A speedup of 2.6x means that you could run the code on the new machine 2.6x in the time the old machine needed.

Algorithm Python 2.7.16 Python 3.8.1 pypy3.6-7.3.0 pypy-c-jit-latest Python 3.8.1 + cython
ijk 1253 (2.6x) 1884 158 147 1088
ikj 901s (2.9x) 1072 52 50 441
NumPy 20 (~6x)
version: 1.14.4
20
version: 1.18.1
42 36 22
SciPy 45 (2.6x)
version: 1.1.0
20
version: 1.4.1
Installation problems -
Strassen (LEAF_SIZE=8) 1709 (1.7x) 1780 190 125 -
Strassen (LEAF_SIZE=64) 855 (3.4x) 1022 44 41 441
ikj (2 threads) 953 1611 37 32 -
ikj (4 threads) 459 762 19 17 -

Things to note:

  • pypy is crazy fast compared to CPython - up to a speedup of 34x 😲 So JIT is worth a try. I actually wanted to try Pyston (comparison) as well, but the built failed with pyenv.
  • Python 2.7 is faster than 3.8 for this benchmark 😒
  • Ways to improve; speedups are always compared to the naive ijk algorithm:
    • Cache optimization: The ikj algorithm gave a 1.08x speedup
    • Algorithmic: The Strassen algorithm gave a 1.84x speedup
    • Parallelization: Using 4 threads instead of 1 with a super simple algorithm gave a 2.47x speedup
    • Time:
      • New Hardware which is 8 years newer gave a 2.6x speedup on the naive solution
      • Running the old code with updated libraries on my new machine gave a 6x speedup compared to running the older version of numpy on the old machine.
    • Libraries: Using numpy gave a 94.2x speedup!
  • Numpy and scipy are the way to go, just as expected 🀷‍♂️

I was interested in the influence of the machine, so I ran the algorithms with Python 3.8 on others as well:

  Thinkpad T460p
(Reference Machine)
EliteBook-1040
CPU (comparison on ark.inten.com) 8x Intel(R) Core(TM) i7-6700HQ
CPU @ 2.60GHz
4x Intel(R) Core(TM) i5-6300U
CPU @ 2.40GHz
RAM 8 GB 16 GB
Video Card Nvidia GeForce 940MX Intel HD Graphics 520
System Ubuntu 18.04.3 LTS Ubuntu 18.04.4 LTS
Algorithm Thinkpad T460p EliteBook-1040
ijk 1884 1891
ikj 1072 1455
NumPy 1.18.1 20 21
Strassen (LEAF_SIZE=8) 1780 TODO
Strassen (LEAF_SIZE=64) 1022 TODO
ikj (2 threads) 1611 TODO
ikj (4 threads) 762 TODO

The interesting thing to note here is that the EliteBook is about the same speed for the ikj-algorithm (even a tiny bit slower), but the EliteBook is way slower for the ikj algorithm.

You might also be interested in pybenchmarks.org which seems to be similar to The Benchmarks Game.

pypy

Quoting from the PyPy 1.2 Release notes:

PyPy is a reimplementation of Python in Python [...]. [PyPy] speed results often beat CPython, ranging from being slightly slower, to speedups of up to 2x on real application code, to speedups of up to 10x on small benchmarks.

Another resource which states that PyPy is faster than CPython is pybenchmarks.org (also PyPy 3).

PyPy uses Just-in-time compilation (JIT) to get those speedups.

Go to PyPy — How can it possibly beat CPython? for more information why PyPy is that fast.

C++

Algorithm g++-O0 -D _DEBUG -g -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG
ijk 168.06 45.05
ikj 89.67 5.24
library-boost 1002.36 4.56

Things to note:

  • Debug flags / Optimization settings while compiling make a huge difference
  • It's way faster than Python

Using C++ from Python

Obviously, C++ is way faster than pure Python. Astonishingly, C++ is also faster than Numpy for this simple number crunching task. So the idea is to combine both: Have the nice interface of Python with the raw computing power of C++.

However, there are many ways to combine the two:

  • subprocess πŸ™„
  • ctypes: beginner tutorial
  • cffi
  • cython
  • c extensions: examples
  • pybind11
  • Py++

Cython versus CFFI helped me to wrap my head around this topic a bit.

See also

  • Matti Picus: CFFI, Ctypes, Cython: the Good, the Bad and the Ugly at EuroSciPy, 2018. (first presented at PyCon Israel 2017)
    • 2:10 Begins
    • 4:05 Matti Picus introduces himself
    • 5:02 He will use Mandelbrot as an example
    • 15:00 C implementation
    • 18:00 C done: 130ms (Python 3057ms)
    • 18:15 ctypes is writing python in c
      • Take a shared object (*.so) or ddl
      • 20:48 ctypes.Structure
      • 24:00 Memory managment / free-ing memory
      • 36:00 Create *.so
      • 38:40 Load dll / *.so
      • 44:24 Timing - 1554ms
    • 45:10 cffi - cffi came out as a replacement for ctypes. cffi is writing c in python. Use c as directly as possible.
      • 46:20 Load C header in Python; header needs to be in very specific format. IT CAN'T HANDLE MACROS!
      • 50:20 cffi memory management
      • 52:37 Call function
      • 53:33 Timing - 640ms
      • 53:50 Q/A
    • 55:36 Cython - can be used from jupyter notebook
      • Cython has a own language; like a mix between C and Python
      • 56:58 Cython creates a shared object
      • 1:00:07 cpdef - a mixture between c and python
      • 1:02:12 Timing - 594ms
    • 1:02:49 cppyy - C++!
      • 1:03:30 LLVM
    • 1:09 Final thoughts
  • Armin Rigo: CFFI: calling C from Python at EuroPython, 2016.
  • Comparing Python, Numpy, Numba and C++ for matrix multiplication
  • Nuitka (Benchmark)

Published

Feb 16, 2020
by Martin Thoma

Category

Code

Tags

  • C++ 23
  • pypy 1
  • Python 141

Contact

  • Martin Thoma - A blog about Code, the Web and Cyberculture
  • E-mail subscription
  • RSS-Feed
  • Privacy/Datenschutzerklärung
  • Impressum
  • Powered by Pelican. Theme: Elegant by Talha Mansoor