Describe the bug
On certain occasions, the GPU gets stuck in the while(1) loop in fence_wait and never escapes, leaving the GPU "locked" at 100% usage until the machine is restarted.
We've looked into the issue for a while but can't pin down the specific cause. From my investigation, it appears that the GPU spinning in fence_wait never observes the timestamp update made by the CPU. It may be a timing issue or some sort of memory-coherence issue.
To Reproduce
This is fairly difficult to reproduce (it happens non-deterministically as far as I can tell). The most consistent trigger I have found is running two large models in parallel using RDMA; it is also more likely with 4 nodes than with 2.
The context in which we encounter this issue is as follows:
- Using the standard stream_generate with either a tensor- or pipeline-sharded model in Exo, we see this issue occasionally.
- I would like to check whether this also occurs in MLX LM, but I have not gotten around to it yet.
Additional context
#3141 attempts to resolve this, and it drastically reduces the number of GPU locks. However, it is not an exhaustive fix.