This is old news for those who are reading the mailing list (this thread), but I thought I should tell the rest of you too, in case you run into it: there seems to be a bug with Python's locks on Linux. As far as I have tested, the bug never happens on the Windows version of Python, and all the reports I got were on Linux, so it seems safe to assume it's a contained problem. On the other hand, it being a "platform issue" means I can't think of any solution or workaround other than opening a ticket at [bugs.python.org].
And now for the bug. In short, when two threads use the same RPyC connection, one of them might lock up forever. It happens about 25% of the time according to my empirical results, which is just lovely for debugging. After much research, it turned out that a thread.lock object may have a thread pending on it after the lock has been released, and that thread will not get any runtime. Instead, another thread will attempt to acquire the lock and *succeed*! This behavior makes no sense, and I'm guessing it has to do with the infamous GIL.
Here are the debug prints I added to see what's going wrong there:
[T0 L(send 1)] acq
[T0 L(send 1)] ACQ
[T0 L(send 1)] rel
[T1 L(recv 0)] acq
[T1 L(recv 0)] ACQ
[T0 L(send 1)] REL
[T0 L(recv 0)] acq
[T1 L(recv 0)] rel
[T1 L(recv 0)] REL
[T1 L(send 1)] acq
[T0 L(recv 0)] ACQ
[T1 L(send 1)] ACQ
[T1 L(send 1)] rel
[T1 L(send 1)] REL
[T1 L(recv 0)] acq <<<<<<< [1]
[T0 L(recv 0)] rel
[T0 L(recv 0)] REL <<<<<<< [2]
[T0 L(recv 0)] acq
[T0 L(recv 0)] ACQ <<<<<<< [3]
[T0 L(recv 0)] rel
[T0 L(recv 0)] REL
[T0 L(recv 0)] acq
[T0 L(recv 0)] ACQ
.
.
.
T0 and T1 are the two threads, and L(recv 0) and L(send 1) are the two locks (one for sending, another for receiving). acq is printed when a thread requests to acquire the lock, ACQ when the lock has actually been acquired, rel when the thread requests to release the lock, and REL when the lock has actually been released.
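For anyone who wants to produce the same kind of trace, here is a minimal sketch of a lock wrapper that prints those four events. It is not the actual instrumentation I put into RPyC: the `TracedLock` class, its `name` argument and the `log` callback are all names made up for this illustration.

```python
import threading


class TracedLock:
    """Hypothetical wrapper around threading.Lock that logs acq/ACQ/rel/REL."""

    def __init__(self, name, log=print):
        self._lock = threading.Lock()
        self._name = name
        self._log = log  # any callable taking one string

    def _emit(self, event):
        thread_name = threading.current_thread().name
        self._log("[%s L(%s)] %s" % (thread_name, self._name, event))

    def acquire(self):
        self._emit("acq")      # the thread is asking for the lock
        self._lock.acquire()   # may block here
        self._emit("ACQ")      # the thread now holds the lock

    def release(self):
        self._emit("rel")      # the thread is about to release
        self._lock.release()
        self._emit("REL")      # logged after release, so other lines may interleave

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc_info):
        self.release()
        return False
```

Note that REL is logged after the lock is already free, so another thread's lines can show up between rel and REL, just like in the trace above.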
As you can clearly see at point [1], T1 requests to acquire L(recv 0) and blocks, because T0 currently holds it. A moment later, at point [2], T0 releases the lock. Now we would have expected T1 to get hold of the lock, but that doesn't happen. Meanwhile (and I'm talking about milliseconds), T0 requests to acquire the lock again. Up to this point everything was permissible (low-level timing issues, OS context switches, GIL issues, etc.), but lo and behold what happens next! At point [3], instead of T0 blocking and T1 getting hold of the lock (and unblocking), it's T0 who gets the lock!
How is that possible? A lock has one thread pending on acquiring it, and another thread manages to acquire it?! And once chaos theory got into the race, we no longer hear anything from T1 — it simply remains blocked forever, while T0 keeps releasing and acquiring the lock like a maniac.
So I tried something weird: let's give a penalty to the thread that got the lock. Simply put, after a thread gets the lock, recvs its data, and releases the lock, I added a sleep(0.1). Ta da! Now all threads get their turn! Well, almost… the blocking still happens, but by sleeping I managed to reduce the odds greatly. So I'm pretty confident in blaming it on the GIL. Or maybe it's an interaction between the GIL and the CFS, with the scheduling policy clashing with the GIL? Even if it is, a lock can't allow another thread to acquire it when there's already someone in the queue!
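The penalty trick looks roughly like this. It's only a sketch of the idea, not the real RPyC code: `polite_recv`, the `recv` callable and the `penalty` parameter are hypothetical names, and the 0.1 default simply mirrors the sleep(0.1) mentioned above.

```python
import threading
import time


def polite_recv(lock, recv, penalty=0.1):
    """Receive under the lock, then sleep so a waiting thread can grab it."""
    with lock:
        data = recv()      # hypothetical callable that reads from the connection
    # The lock is already released here; sleeping only lowers the odds that
    # this same thread wins the next race for it. It is not a guarantee.
    time.sleep(penalty)
    return data
```

As said, this only reduces the odds of starvation, and it costs `penalty` seconds on every receive.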
So I'll try to reduce the problem as much as possible and open an issue for Python… nothing else to do about it.
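A reduced test case might start from something like the sketch below: two threads hammering one lock, with a count of how often each one got it. This is a guess at a starting point, not a confirmed reproduction of the bug, and `measure_contention` with its parameters is a made-up name.

```python
import threading
import time


def measure_contention(duration=0.5, workers=2):
    """Let `workers` threads fight over one lock; return acquisitions per thread."""
    lock = threading.Lock()
    counts = [0] * workers
    stop = threading.Event()

    def worker(index):
        while not stop.is_set():
            with lock:
                counts[index] += 1   # each thread only touches its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    time.sleep(duration)             # let the threads compete for a while
    stop.set()
    for t in threads:
        t.join()
    return counts


if __name__ == "__main__":
    print(measure_contention())
```

A very lopsided result (one thread with nearly all the acquisitions) would point in the same direction as the trace above, though the exact numbers will vary by machine and Python version.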







