Describe the bug
On certain occasions, the GPU gets stuck in the while(1) loop in fence_wait and never escapes, leaving the GPU "locked" at 100% usage until the machine is restarted.
We've looked into the issue for a while but can't pin down the specific cause. From my investigation, it appears that the GPU spinning in fence_wait never observes the timestamp update made by the CPU. It may be a timing issue or some sort of memory-coherence issue.
To Reproduce
This is fairly difficult to reproduce (it happens non-deterministically as far as I can tell). The most consistent trigger I have found is running two large models in parallel using RDMA; it is also more likely with 4 nodes than with 2.
The context in which we encounter this issue is as follows:
- Using the standard stream_generate with either a tensor- or pipeline-sharded model in Exo, we see this issue occasionally.
- I would like to check whether this also occurs in MLX LM, but I have not gotten around to it yet.
Additional context
#3141 attempts to resolve this, and it drastically reduces the number of GPU locks. However, it is not an exhaustive fix.