This change modifies `nccl_manager_test` so that it runs with multiple physical
GPUs. The main changes are to pick the number of nodes and ranks based on the
actual devices available.
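As a rough illustration of sizing a test from the available devices, the sketch below splits a device count into nodes and ranks per node. `TestTopology`, `PickTopology`, and the default cap of two nodes are hypothetical names and choices made for this example; they are not taken from `nccl_manager_test`.

```cpp
#include <cassert>

// Hypothetical result type for this sketch: how many simulated nodes to use
// and how many ranks (devices) each node gets.
struct TestTopology {
  int num_nodes;
  int num_ranks_per_node;
};

// Assumption: use at most `max_nodes` nodes and divide the devices evenly
// between them. Devices left over after integer division stay unused.
inline TestTopology PickTopology(int num_gpus, int max_nodes = 2) {
  assert(num_gpus >= 1 && max_nodes >= 1);
  const int num_nodes = num_gpus < max_nodes ? num_gpus : max_nodes;
  const int num_ranks_per_node = num_gpus / num_nodes;
  return TestTopology{num_nodes, num_ranks_per_node};
}
```

With this rule, four devices give two nodes of two ranks each, and five devices give the same layout with one device unused.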
PiperOrigin-RevId: 289146110
Change-Id: I5d06ac39eee3ffe69311194485fc64974bc5410f
For that we need to generate multiple device events (one per device) and one driver callback event.
Add some TraceMe annotations to nccl_manager.cc.
PiperOrigin-RevId: 282467571
Change-Id: I1ce081c062876d2d504ada06c3c53b73ac8abfa9
The following commit breaks the --config=rocm build:
921003e1c4
The above commit adds `//tensorflow/core/profiler/lib:traceme` as a build dependency on the CUDA side but not on the ROCm side, leading to bazel errors during the ROCm build.
The "fix" is to add the dependency on the ROCm side as well.
For that we need to generate multiple device events (one per device) and one driver callback event.
Add some TraceMe annotations to nccl_manager.cc.
PiperOrigin-RevId: 280077382
Change-Id: Ice0ebe4b98f3a6a222e210d266666b0e7fbf8171
Imported from GitHub PR #31485
Copybara import of the project:
- ba5748981bb02b9d0e91114cdc30eb64d1650a46 add ROCm RCCL support by Jeff Daily <jeff.daily@amd.com>
- 6f887a19731f030be58495ae4fea98b3ad1f1cc3 run buildifier against tensorflow/core/nccl/BUILD by Jeff Daily <jeff.daily@amd.com>
- 55ce583cf484953d90eb9b9310dc77cf63b4c0c9 Merge 6f887a19731f030be58495ae4fea98b3ad1f1cc3 into f9233... by Jeff Daily <jeff.daily@amd.com>
PiperOrigin-RevId: 264892468
The NcclManager records and waits on an Event as each Participant is added,
rather than synchronizing with the compute stream only after all Participants
have been added. Otherwise, most compute kernels are added to the compute
stream prior to the NCCL sync Event, delaying the start of the collective.
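The ordering effect described above can be pictured with a toy model. This is a minimal sketch that does not use CUDA or NcclManager: `ToyStream` and `KernelsAheadOfCollective` are hypothetical names, and a stream is reduced to a counter of enqueued kernels, with an "event" capturing the counter value at the moment it is recorded.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a compute stream: it only counts enqueued kernels.
struct ToyStream {
  std::size_t enqueued = 0;
  void EnqueueKernel() { ++enqueued; }
  // "Recording an event" captures the current position in the stream.
  std::size_t RecordEvent() const { return enqueued; }
};

// Returns how many compute kernels sit ahead of the sync event on one device.
// kernels_before: kernels enqueued before this participant is added.
// kernels_after:  kernels enqueued between adding this participant and the
//                 point where all participants have been added.
// record_on_add:  true records the event when the participant is added,
//                 false records it only after all participants are added.
inline std::size_t KernelsAheadOfCollective(std::size_t kernels_before,
                                            std::size_t kernels_after,
                                            bool record_on_add) {
  ToyStream compute;
  for (std::size_t i = 0; i < kernels_before; ++i) compute.EnqueueKernel();
  std::size_t event_position = 0;
  if (record_on_add) event_position = compute.RecordEvent();
  for (std::size_t i = 0; i < kernels_after; ++i) compute.EnqueueKernel();
  if (!record_on_add) event_position = compute.RecordEvent();
  assert(event_position <= compute.enqueued);
  return event_position;
}
```

In this model, with 2 kernels enqueued before the participant is added and 8 afterwards, recording on add leaves 2 kernels ahead of the collective, while recording late leaves all 10 ahead of it.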
This change adds an implementation of the CollectiveImplementationInterface
that uses NCCL for all-reduce. The main bits include setting up an NCCL
communicator key during collective param resolution, and calling the relevant
functionality in NcclManager.
PiperOrigin-RevId: 229277364
After this change, we check the return value of every CUDA and NCCL call in
NcclManager. If any call is unsuccessful, we call the NCCL callback with an
error status.
This change also re-enables NCCL tests.
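The check-and-report pattern described above might look roughly like the sketch below. It is not the NcclManager code: `SketchStatus`, `DoneCallback`, `FakeLibraryCall`, and `RunCollectiveSketch` are hypothetical stand-ins, and the fake call simply returns 0 on success in the style of a C library result code.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical status type for this sketch; not TensorFlow's Status.
struct SketchStatus {
  bool ok;
  std::string message;
};

using DoneCallback = std::function<void(const SketchStatus&)>;

// Stand-in for a CUDA- or NCCL-style call: 0 means success, non-zero failure.
inline int FakeLibraryCall(bool succeed) { return succeed ? 0 : 1; }

// Runs two library calls in order and checks each return value. On the first
// failure the callback receives an error status and the remaining work is
// skipped. The callback is invoked exactly once on every path.
inline bool RunCollectiveSketch(bool first_ok, bool second_ok,
                                const DoneCallback& done) {
  assert(static_cast<bool>(done));
  int rc = FakeLibraryCall(first_ok);
  if (rc != 0) {
    done(SketchStatus{false, "first call failed with code " + std::to_string(rc)});
    return false;
  }
  rc = FakeLibraryCall(second_ok);
  if (rc != 0) {
    done(SketchStatus{false, "second call failed with code " + std::to_string(rc)});
    return false;
  }
  done(SketchStatus{true, ""});
  return true;
}
```

The point of the pattern is that a failed call surfaces to the caller through the callback's status instead of being silently ignored.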
PiperOrigin-RevId: 224038066