This article is about optimizing a tiny bit of Python code by replacing it with its C++ counterpart.
Beware, geek stuff follows.
We are interested in implementing the opening handshake of the WebSocket protocol (RFC 6455).
It is a fairly simple task to understand, yet it involves enough number crunching and intermediate object allocations for the difference to show up in the results.
To be clear, this is a demo project with no real-world benefit except for the methodology used.
Let’s get started.
First, we implement a plain Python function that returns the Sec-WebSocket-Accept value calculated from the Sec-WebSocket-Key header.
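The Python source itself is not reproduced in the article, but based on the disassembly that follows, a minimal pure-Python version might look roughly like this (the function name sec_websocket_accept comes from the article; the body is an illustrative reconstruction):

    from base64 import b64encode
    from hashlib import sha1

    def sec_websocket_accept(key: str) -> str:
        # Append the fixed GUID from RFC 6455 to the client's key,
        # hash the result with SHA-1, and return the Base64-encoded digest.
        return b64encode(
            sha1((key + '258EAFA5-E914-47DA-95CA-C5AB0DC85B11').encode()).digest()
        ).decode()

Disassembling it with dis.dis shows the relevant bytecode instructions: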
 26 LOAD_FAST            0 (key)
 28 LOAD_CONST           1 ('258EAFA5-E914-47DA-95CA-C5AB0DC85B11')
 30 BINARY_OP            0 (+)
 60 CALL                 0 (encode)
 74 CALL                 1 (sha1)
110 CALL                 0 (digest)
124 CALL                 1 (b64encode)
160 CALL                 0 (decode)
170 RETURN_VALUE
Except for the temporary values generated between the steps, the code itself looks about as fast as it can get: all the methods invoked are implemented in C inside CPython.
Now, we are going to implement all this as a single step in C++. To do that, we can initialize a Python extension and add our new function.
The C++ implementation of sec_websocket_accept()
is cumbersome enough to be worth omitting from this article.
Here is a link to the full code.
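For orientation, wiring up and building such an extension takes little more than a setuptools configuration. A minimal sketch, assuming the module is called wsaccept and lives in wsaccept.cpp (both names are illustrative, not taken from the article):

    # setup.py - hypothetical build configuration for the C++ extension
    from setuptools import setup, Extension

    setup(
        name='wsaccept',
        ext_modules=[
            Extension(
                'wsaccept',                # import name of the compiled module
                sources=['wsaccept.cpp'],  # C++ source exposing sec_websocket_accept()
            ),
        ],
    )

After building (for example with pip install .), the compiled sec_websocket_accept() can be imported and called like any other Python function.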
It might not be obvious at first glance, but neither the Python nor the C++ variant contains micro-optimizations or anything hardware-specific.
These are just naive implementations; we can achieve significant results without any of that.
We can add our tests and see the results.
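The result tables below look like pytest-benchmark output, so a benchmark file might look roughly like this (the module names handshake and wsaccept are assumptions for illustration; the test names mirror the tables):

    # Hypothetical benchmark file; the real one lives in the linked repository.
    import handshake   # the pure-Python implementation sketched earlier
    import wsaccept    # the compiled C++ extension

    KEY = 'dGhlIHNhbXBsZSBub25jZQ=='           # sample key from RFC 6455
    EXPECTED = 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='  # its well-known accept value

    def test_python_code(benchmark):
        assert benchmark(handshake.sec_websocket_accept, KEY) == EXPECTED

    def test_optimized_code(benchmark):
        assert benchmark(wsaccept.sec_websocket_accept, KEY) == EXPECTED

Running pytest with the pytest-benchmark plugin installed produces a comparison table like the one below.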
----------------------------------------------------------------------------------------------------------
Name (time in ns) Mean StdDev Median OPS (Kops/s)
----------------------------------------------------------------------------------------------------------
test_optimized_code 317.0102 (1.0) 0.0384 (1.0) 317.0121 (1.0) 3,154.4725 (1.0)
test_python_code 1,149.9825 (3.63) 0.8864 (23.10) 1,149.8373 (3.63) 869.5785 (0.28)
----------------------------------------------------------------------------------------------------------
- It seems our Python code did really well: it can execute about 869k calls per second.
- It is also clear that our C++ variant is 3.63x faster, clocking in at about 3.15M calls per second.
Amazing! Replacing a tiny bit of code that seems impossible to optimize further has a significant effect.
If you wish to run these tests yourself, you will find all the necessary steps in the GitHub Actions workflow here.
Non-Goals of this Article
- This article does not address maintainability or any of the other burdens introduced by replacing simple Python code with cumbersome low-level C++ code.
- We are not interested in micro-optimizations, SSE, or hardware-implemented hashing.
- We are not interested in multi-threaded approaches or concurrency.
- We are not interested in implementing it in Rust or any other language not supported out of the box by Python extensions.
Fun Fact
We can implement a magic function that may also work.
----------------------------------------------------------------------------------------------------------
Name (time in ns) Mean StdDev Median OPS (Kops/s)
----------------------------------------------------------------------------------------------------------
test_magic_code 83.1061 (1.0) 0.7204 (22.09) 82.8925 (1.0) 12,032.8104 (1.0)
test_optimized_code 338.3712 (4.07) 0.0326 (1.0) 338.3669 (4.08) 2,955.3341 (0.25)
test_python_code 1,288.4041 (15.50) 0.6493 (19.91) 1,288.4681 (15.54) 776.1540 (0.06)
----------------------------------------------------------------------------------------------------------
The disassembled version seems to be just as simple.
Surprising as it may sound, our C++ implementation is now the one that is about 4x slower.
(From measurements and interpretation, we are now entering the realm of guesswork.)
This could be due to the overhead of function calls, the interpreter dispatching bytecode, or the measuring tools themselves.
At 12 million calls per second on a single core, overhead of this kind is inevitable.
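To get a feel for how large that per-call overhead is, here is a quick, hypothetical measurement (not part of the article) using timeit:

    # Rough gauge of pure Python call overhead: time a function that does nothing.
    import timeit

    def noop():
        pass

    per_call = timeit.timeit(noop, number=1_000_000) / 1_000_000
    print(f'{per_call * 1e9:.1f} ns per call')  # typically tens of nanoseconds on CPython

Fixed per-call costs of this magnitude stop being negligible once the useful work itself only takes a few hundred nanoseconds.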
Summary
By implementing a simple task in C++ instead of Python, even where the underlying function calls are already implemented in C inside CPython, we can still get a significant boost.