gh-145044: Increase performance in unsafe_object_compare by not using Py_DECREF() for immortal res_obj #145045
benediktjohannes wants to merge 6 commits into python:main
Conversation
eendebakpt left a comment:
The change itself is correct, but the performance gain is small.
Small addition for clarity: we measure A/B, which is the lower speed of A in comparison to B; this means that B is effectively not only 1.66% faster, but 1.688%.
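To spell out that conversion (assuming the reported 1.66% is A's slowdown measured relative to A's own runtime, which is my reading of the comment above):

$$\frac{t_A - t_B}{t_A} = 1.66\% \;\Rightarrow\; \frac{t_A - t_B}{t_B} = \frac{0.0166}{1 - 0.0166} \approx 1.688\%$$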
Please use pyperf instead of a custom tool for performance measurements; this is true for all benchmark proposals. Also, be sure to run the tests with a PGO+LTO build, not a DEBUG build. For immortal objects, we also don't have a clear pattern of whether to optimize or not. While it is an optimization, it complicates the branches a bit and adds mental burden (we need to remember that we are working on an immortal object here). So unless we have (1) more benchmarks and (2) macro benchmarks as well (small lists are not relevant IMO, they are in the realm of noise here), I'm not willing to support this proposal.
Thanks for the review! I’ll add benchmarks using pyperf as soon as possible (probably tomorrow), including larger sizes as well. I guess larger lists will show a similar or even better effect, because comparisons then matter even more relative to the other work done in sort, but the effects will probably be nearly the same since comparisons are important in all cases. And if it turns out that this actually improves performance by the expected amount, I think that merging this should be positive because the global impact in
I'm currently working on this and I have really great news to announce! This change seems to be even more positive for performance (around 5% or even more), because what was measured earlier was only list.sort(); many other list operations also seem to be improved by this change (some even more). For example, if you move the start of the measurement earlier in the scripts above, so that more of the executed code is included (which is also more realistic for many use cases, because the overall impact is what matters), you get a much bigger improvement in performance, and this probably applies to many other functions as well. So I think this should definitely be merged, because the improvement should matter for lists in many other respects too! I'll be working on testing this with
Unless I see real improvements with pyperf, I don't want to complicate the current code. The code is fine and easier to read as it is. Also, I don't know if you are using an LLM to write your answers, but could you shorten them or at least use paragraphs? It is hard to read them and catch your point.
I would also suggest running the pyperformance macrobenchmarks.
I'll try to shorten them, I'm sorry; I sometimes use too many words for simple sentences. 🤣 And I see your point, I'll add
I'll have a look at that! 👍
Thanks!
Hmmm, I guess the problem with macro benchmarks will be that sorting (or even lists in general) is only one part of those benchmarks, so measuring the impact will be extremely difficult because of noise: if we have around a 1%-5% impact on lists, this is maybe around 0.01%-0.05% overall, which is probably extremely hard to distinguish from noise. But I'll (of course) add some pyperf tests, and I think that if we measure an impact there, it's worth adding these changes, because the code should not become too complicated even with this change, and then we have a real user-visible effect.
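To put rough numbers on that estimate (the roughly 1% share of runtime spent in the affected list comparisons is only an assumption here, not something measured): with f the fraction of a macrobenchmark's runtime spent in those comparisons and s the relative speedup of that part,

$$\text{overall effect} \approx f \cdot s \approx 1\% \cdot (1\%\text{–}5\%) \approx 0.01\%\text{–}0.05\%$$

which matches the 0.01%-0.05% figure above.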
I hope that this message was not too long 😅 and is understandable.
Increase performance in unsafe_object_compare by not using Py_DECREF() for immortal res_obj
In Objects/listobject.c, the function unsafe_object_compare() currently performs an unconditional Py_DECREF(res_obj); after evaluating the result of tp_richcompare, regardless of whether res_obj is a boolean or another object type. Since PyBool_Check(res_obj) implies that res_obj is either Py_True or Py_False, both of which are immortal singletons, decrementing their reference count has no semantic effect. While Py_DECREF() is safe for immortal objects, it still triggers the decrement, which introduces measurable overhead (see benchmarks below).
Rationale
- PyBool_Check(res_obj) guarantees res_obj is either Py_True or Py_False. Both are immortal objects.
- Calling Py_DECREF() on immortal objects has no observable semantic effect.
- Avoiding the unnecessary decrement improves performance in microbenchmarks (see benchmarks below).
- This function sits on the hot path of list sorting (list.sort()), so even small savings per comparison accumulate measurably.
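For illustration, a minimal sketch of the proposed boolean fast path (a sketch of the idea only, not the exact diff; the earlier branches of unsafe_object_compare() are omitted):

```c
/* Tail of unsafe_object_compare() in Objects/listobject.c (sketch only).
   res_obj is the result of the cached tp_richcompare call and has already
   been checked against NULL and Py_NotImplemented in the omitted branches. */
if (PyBool_Check(res_obj)) {
    /* res_obj is Py_True or Py_False.  Both are immortal singletons, so the
       unconditional Py_DECREF(res_obj) of the current code can be skipped. */
    return res_obj == Py_True;
}
/* Non-boolean results keep the existing behaviour: convert to a truth value
   and drop the reference exactly once. */
int res = PyObject_IsTrue(res_obj);
Py_DECREF(res_obj);
return res;
```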
Correctness Considerations
- The Py_NotImplemented branch remains unchanged and still decrefs before the fallback.
- The res_obj == NULL case remains unchanged.
- Non-boolean results continue to be decremented exactly once.
- For boolean results Py_DECREF() is not triggered anymore, which is safe due to their immortal status.
- The existing comment regarding non-deterministic comparison functions (e.g. lambda a, b: int(random.random() * 3) - 1 in test_sort.py) should not be any problem for this change in my opinion (but please correct me if I'm mistaken anywhere).
Benchmarks
This was tested with two combined scripts in order to compare the original cpython (also called CPython in the benchmark results) and the so-called cpython_patch (also called CPython PATCH in the benchmark results). cpython_patch is an exact copy of cpython (created at the same time, from the same commits) with only the change proposed in this PR.
I use controller.py, which starts 2000 independent runs of both scripts, one after another and alternating each time, so that effects such as CPU boost are no longer relevant. I've also tested this with the order of the two scripts switched, so that starting the next iteration does not cause any relevant overhead before the test (one of the most representative benchmark results is shown down below).
Scripts
controller.py
performancetest.py
Results
The results printed to the console for my microbenchmark were:
The full log is viewable online (because it includes around 4000 lines of individual results) at: https://justpaste.it/fwgog