Conversation
Relacy helped identify a race in the existing MPSC algorithm. I am having a hard time explaining exactly what's going on, but in the newly added unit test (the non-Relacy one), I am able to observe three different odd behaviors:

- a consumer consuming the same element in an infinite loop, apparently because the internal next pointers form some sort of cycle
- the consumer returning `&__nil_`!
- the consumer never being able to consume a produced value (the node is lost)

With the non-Relacy unit test, in the existing algorithm, if I insert a random sleep of 0-10 microseconds in `push_back` after `__back_` is exchanged, I can observe one of the above behaviors nearly every single time. The most common was the first behavior.

The existing algorithm claims it came from Dmitry Vyukov's implementation, though one key difference is that the existing one uses an atomic pointer to a Node for the "nil" object, whereas Dmitry's stores an actual Node object embedded in the queue (sketched below). I re-implemented the version in stdexec exactly as it appears on Dmitry's website (which I had to dig up on archive.org), and it passes the newly added Relacy (exploring many thread interleavings) and non-Relacy unit tests.

I originally tracked down a bug in timed_thread_scheduler.cpp, where sometimes `STDEXEC_ASSERT(op->command_ == command_type::command_type::stop);` failed.
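For reference, the shape of Dmitry's intrusive MPSC queue, with the stub ("nil") node stored by value inside the queue, looks roughly like the minimal sketch below. This is an illustration, not the exact stdexec code: the names `MpscQueue`, `Node`, `nil_`, `back_`, and `front_` are mine, and the memory orderings shown are one plausible choice.

```cpp
#include <atomic>

struct Node {
  std::atomic<Node*> next_{nullptr};
};

class MpscQueue {
 public:
  MpscQueue() : back_(&nil_), front_(&nil_) {}

  // Multiple producers: swing back_ to the new node, then link the previous
  // back node to it. Between the exchange and the store to prev->next_, the
  // new node is not yet reachable from the front.
  void push_back(Node* n) noexcept {
    n->next_.store(nullptr, std::memory_order_relaxed);
    Node* prev = back_.exchange(n, std::memory_order_acq_rel);
    // <-- the random 0-10us sleep described above was inserted here
    prev->next_.store(n, std::memory_order_release);
  }

  // Single consumer: returns nullptr when the queue is empty, or when a
  // producer has exchanged back_ but not yet linked its node.
  Node* pop_front() noexcept {
    Node* front = front_;
    Node* next = front->next_.load(std::memory_order_acquire);
    if (front == &nil_) {
      if (next == nullptr)
        return nullptr;           // truly empty
      front_ = next;              // skip over the stub
      front = next;
      next = next->next_.load(std::memory_order_acquire);
    }
    if (next != nullptr) {
      front_ = next;
      return front;
    }
    // front is the last linked node; re-insert the stub behind it before
    // popping, so front never remains as both the front and the back.
    if (front != back_.load(std::memory_order_acquire))
      return nullptr;             // a producer is mid-push
    push_back(&nil_);
    next = front->next_.load(std::memory_order_acquire);
    if (next != nullptr) {
      front_ = next;
      return front;
    }
    return nullptr;
  }

 private:
  Node nil_;                      // stub node embedded by value
  std::atomic<Node*> back_;       // producers exchange this
  Node* front_;                   // owned by the single consumer
};
```

The window between the `exchange` on `back_` and the store to `prev->next_` is exactly where the random sleep was injected: in that window the consumer can see a node whose `next_` is still null even though newer nodes already exist.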
|
cc @maikel - can you check this over, since I think you implemented the original version of the MPSC queue? |
|
Dmitry's website appears to have stopped working a few months ago, though the original link is on archive.org: https://web.archive.org/web/20221127081044/https://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue (Dmitry is also the author of Relacy). |
|
Also of note, I originally noticed the race/bug in timed_thread_scheduler while running the exec unit tests with a version of TSAN I am trying to upstream that fuzzes the scheduler with random sleeps; I think this is another success of that approach: llvm/llvm-project#178836 |
|
@maikel as an experiment to help narrow down the race that leads to the queue state becoming corrupt, I added a mutex in two places: https://gist.github.com/ccotter/37b0d3cb3ad88f8c4fea4b2adbb5dce7 ... Relacy says this queue is correct... |
|
Yes, after revisiting the old code I agree that the management of nil is the culprit here. We should simplify the implementation towards Dmitry's original one. The only thing I would check is whether your usage of memory orderings and atomics can be improved. |
|
/ok to test 0a9df03 |
|
/ok to test 7e4e3c6 |
|
/ok to test b33534f |
|
My changes to build and run the Relacy tests are failing because TBB ends up needing std::shared_mutex. My local build does not run with TBB, so I didn't encounter this. I will soon have a PR against upstream Relacy to add support for std::shared_mutex, which should address this. If I can't get that merged in the next few days, I'll disable the Relacy tests in this PR for the time being. |
- Use STDEXEC::__std::atomic
- Update compiler requirements for Relacy
- Compile Relacy with TBB off
|
Ok, I believe this should be good to go. Do you mind squashing this, as the individual commits are more noisy than valuable in the history? Of note, I found two compiler crashes while working on this, one in GCC (not sure if the Bugzilla report is in yet) and one in Clang (llvm/llvm-project#181088)! |
This commit introduces an "adaptive delay" feature to the ThreadSanitizer runtime to improve race detection by perturbing thread schedules. At various synchronization points (atomic operations, mutexes, and thread lifecycle events), the runtime may inject small delays (spin loops, yields, or sleeps) to explore different thread interleavings and expose data races that would otherwise occur only in rare execution orders.

This change is inspired by prior work, which is discussed in more detail on https://discourse.llvm.org/t/rfc-tsan-implementing-a-fuzz-scheduler-for-tsan/80969. In short, https://reviews.llvm.org/D65383 was an earlier unmerged attempt at adding random delays. Feedback on the RFC led to the version in this commit, which aims to limit the amount of delay.

The adaptive delay feature uses a configurable time budget and tiered sampling strategy to balance race exposure against performance impact. It prioritizes high-value synchronization points with clear happens-before relationships: relaxed atomics receive lightweight spin delays with low sampling, synchronizing atomics (acquire / release / seq_cst) receive moderate delays with higher sampling, and mutex and thread lifecycle operations receive the longest delays with the highest sampling.

The feature is disabled by default and incurs minimal overhead when not enabled. Nearly all checks are guarded by an inline check on a global variable that is only set when enable_adaptive_delay=1. Microbenchmarks with tight loops of atomic operations showed no meaningful performance difference between an unmodified TSAN runtime and this version when running with empty TSAN_OPTIONS.

An LLM assisted in writing portions of the adaptive delay logic, including the TimeBudget class, tiering concept, address sampler, and per-thread quota system. I reviewed the output and made amendments to reduce duplication and simplify the behavior. I also replaced the LLM's original double-based calculation logic with the integer-based Percent class. The LLM also helped write unit test cases for Percent.

cc @dvyukov

## Examples

I used the delay scheduler to find novel bugs that rarely or never occurred with the unmodified TSAN runtime. Some of the bugs below were found with earlier versions of the delay scheduler that I iterated on, but with the most recent implementation in this PR, I can still find the bugs far more reliably than with the standard TSAN runtime.

- A use-after-free in the [BlazingMQ](https://github.com/bloomberg/blazingmq) broker during ungraceful producer disconnect.
- Race in stdexec: NVIDIA/stdexec#1395
- Race in stdexec's MPSC queue: NVIDIA/stdexec#1812
- A few races in [BDE](https://github.com/bloomberg/bde) thread-enabled data structures/algorithms.
- The "Data race on variable a" test from https://ceur-ws.org/Vol-2344/paper9.pdf is more reliably reproduced with more aggressive adaptive scheduler options.

## Outstanding work

- The [RFC](https://discourse.llvm.org/t/rfc-tsan-implementing-a-fuzz-scheduler-for-tsan/80969) suggests moving the scheduler to sanitizer_common, so that ASAN can leverage it. This should be done (should it be done in this PR?).
- Missing interceptors for libdispatch
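For intuition, here is a hypothetical, self-contained sketch of the tiered sampling idea described in the commit message above: cheap, rarely sampled delays for relaxed atomics, and longer, more frequently sampled delays for synchronizing atomics and mutex/thread events, all behind a single fast-path flag check. This is not the actual TSAN runtime code; the names, sampling rates, and delay lengths are made up for illustration, and the real implementation additionally enforces a time budget and per-thread quotas.

```cpp
#include <atomic>
#include <chrono>
#include <random>
#include <thread>

// Hypothetical tiers mirroring the commit message: relaxed atomics,
// synchronizing atomics, and mutex/thread lifecycle events.
enum class DelayTier { RelaxedAtomic, SyncAtomic, MutexOrThread };

// Stand-in for the global set when enable_adaptive_delay=1; the real runtime
// uses its own flag machinery, not a std::atomic like this.
std::atomic<bool> g_adaptive_delay_enabled{false};

inline void maybe_delay(DelayTier tier) {
  // Fast path: a single relaxed load when the feature is off.
  if (!g_adaptive_delay_enabled.load(std::memory_order_relaxed))
    return;

  thread_local std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<int> percent(0, 99);

  switch (tier) {
    case DelayTier::RelaxedAtomic:
      if (percent(rng) < 1) {                  // low sampling rate
        volatile int sink = 0;                 // lightweight spin delay
        for (int i = 0; i < 100; ++i) sink = i;
      }
      break;
    case DelayTier::SyncAtomic:
      if (percent(rng) < 5)                    // moderate sampling rate
        std::this_thread::yield();             // moderate delay
      break;
    case DelayTier::MutexOrThread:
      if (percent(rng) < 20)                   // highest sampling rate
        std::this_thread::sleep_for(std::chrono::microseconds(50));
      break;
  }
}
```

Per the commit message, the real feature's fast-path flag corresponds to running with `TSAN_OPTIONS=enable_adaptive_delay=1`; everything else in this sketch is illustrative.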
|
@ericniebler bump? This should be ready for review/testing. Also ... the custom changes to TSAN I've been using to try to shake out other bugs have just landed in upstream Clang. If anyone has easy access to trunk builds, https://github.com/llvm/llvm-project/blob/main/clang/docs/ThreadSanitizer.rst#adaptive-delay describes the feature and various flags. |
|
/ok to test 208fde6 |
|
/ok to test 659ad65 |