In 2012 I wrote a series of articles about matrix multiplication. Now I'm preparing a course about speeding up Python. For this reason I need an example of code that is fairly simple to understand and can be optimized. So let's update the results of my old articles 🙂
The code can be found in a git repository on GitHub (github.com/MartinThoma/matrix-multiplication) and you can have a look at the old article as well, if you want.
All scripts are tested on my new Thinkpad T460p. For comparision, I've also added the laptop I had before when I wrote the 2012 matrix multiplication article:
|Acer TravelMate 5735Z||Thinkpad T460p|
|CPU (comparison on ark.inten.com)||2x Pentium(R) Dual-Core CPU T4500 @2.30GHz||8x Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz|
|RAM||4 GB||8 GB|
|Video Card||Intel GMA 4500MHD||Nvidia GeForce 940MX|
|System||Ubuntu 10.10.04 LTS||Ubuntu 18.04.3 LTS|
In the following table you can see the execution times for the different
algorithms and different Python versions. As input, I took the
2000.in test set. To switch Python versions, I used
For Python 2.7, you can see the speedup compared to my 2012 machine. A speedup of 2.6x means that you could run the code on the new machine 2.6x in the time the old machine needed.
|Algorithm||Python 2.7.16||Python 3.8.1||pypy3.6-7.3.0||pypy-c-jit-latest||Python 3.8.1 + cython|
|Strassen (LEAF_SIZE=8)||1709 (1.7x)||1780||190||125||-|
|Strassen (LEAF_SIZE=64)||855 (3.4x)||1022||44||41||441|
|ikj (2 threads)||953||1611||37||32||-|
|ikj (4 threads)||459||762||19||17||-|
Things to note:
- pypy is crazy fast compared to CPython - up to a speedup of 34x 😲 So JIT is worth a try. I actually wanted to try Pyston (comparison) as well, but the built failed with pyenv.
- Python 2.7 is faster than 3.8 for this benchmark 😢
- Ways to improve; speedups are always compared to the naive ijk algorithm:
- Cache optimization: The ikj algorithm gave a 1.08x speedup
- Algorithmic: The Strassen algorithm gave a 1.84x speedup
- Parallelization: Using 4 threads instead of 1 with a super simple algorithm gave a 2.47x speedup
- New Hardware which is 8 years newer gave a 2.6x speedup on the naive solution
- Running the old code with updated libraries on my new machine gave a 6x speedup compared to running the older version of numpy on the old machine.
- Libraries: Using numpy gave a 94.2x speedup!
- Numpy and scipy are the way to go, just as expected 🤷♂️
I was interested in the influence of the machine, so I ran the algorithms with Python 3.8 on others as well:
|CPU (comparison on ark.inten.com)||8x Intel(R) Core(TM) i7-6700HQ
CPU @ 2.60GHz
|4x Intel(R) Core(TM) i5-6300U
CPU @ 2.40GHz
|RAM||8 GB||16 GB|
|Video Card||Nvidia GeForce 940MX||Intel HD Graphics 520|
|System||Ubuntu 18.04.3 LTS||Ubuntu 18.04.4 LTS|
|ikj (2 threads)||1611||TODO|
|ikj (4 threads)||762||TODO|
The interesting thing to note here is that the EliteBook is about the same speed for the ikj-algorithm (even a tiny bit slower), but the EliteBook is way slower for the ikj algorithm.
You might also be interested in pybenchmarks.org which seems to be similar to The Benchmarks Game.
Quoting from the PyPy 1.2 Release notes:
PyPy is a reimplementation of Python in Python [...]. [PyPy] speed results often beat CPython, ranging from being slightly slower, to speedups of up to 2x on real application code, to speedups of up to 10x on small benchmarks.
Another resource which states that PyPy is faster than CPython is pybenchmarks.org (also PyPy 3).
PyPy uses Just-in-time compilation (JIT) to get those speedups.
Go to PyPy — How can it possibly beat CPython? for more information why PyPy is that fast.
|Algorithm||g++-O0 -D _DEBUG -g||-O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG|
Things to note:
- Debug flags / Optimization settings while compiling make a huge difference
- It's way faster than Python
Using C++ from Python
Obviously, C++ is way faster than pure Python. Astonishingly, C++ is also faster than Numpy for this simple number crunching task. So the idea is to combine both: Have the nice interface of Python with the raw computing power of C++.
However, there are many ways to combine the two:
- subprocess 🙄
- ctypes: beginner tutorial
- c extensions: examples
Cython versus CFFI helped me to wrap my head around this topic a bit.
- Matti Picus: CFFI, Ctypes, Cython: the Good, the Bad and the Ugly at EuroSciPy, 2018. (first presented at PyCon Israel 2017)
- 2:10 Begins
- 4:05 Matti Picus introduces himself
- 5:02 He will use Mandelbrot as an example
- 15:00 C implementation
- 18:00 C done: 130ms (Python 3057ms)
- 18:15 ctypes is writing python in c
- Take a shared object (*.so) or ddl
- 20:48 ctypes.Structure
- 24:00 Memory managment / free-ing memory
- 36:00 Create *.so
- 38:40 Load dll / *.so
- 44:24 Timing - 1554ms
- 45:10 cffi - cffi came out as a replacement for ctypes. cffi is writing c in python. Use c as directly as possible.
- 46:20 Load C header in Python; header needs to be in very specific format. IT CAN'T HANDLE MACROS!
- 50:20 cffi memory management
- 52:37 Call function
- 53:33 Timing - 640ms
- 53:50 Q/A
- 55:36 Cython - can be used from jupyter notebook
- Cython has a own language; like a mix between C and Python
- 56:58 Cython creates a shared object
- 1:00:07 cpdef - a mixture between c and python
- 1:02:12 Timing - 594ms
- 1:02:49 cppyy - C++!
- 1:03:30 LLVM
- 1:09 Final thoughts
- Armin Rigo: CFFI: calling C from Python at EuroPython, 2016.
- Comparing Python, Numpy, Numba and C++ for matrix multiplication
- Nuitka (Benchmark)