26 Commits

Author SHA1 Message Date
Derek Murray
7f09d08f5d [BUILD] Clean up unused aliases left over from the "common_runtime/BUILD" split.
PiperOrigin-RevId: 304650362
Change-Id: Ifc27d8d87d7c7319462235f661668467494dd252
2020-04-03 11:07:56 -07:00
George Karpenkov
9b544af08d Change std::call_once to absl::call_once
absl::call_once is faster and supports fibers.
PiperOrigin-RevId: 292148213
Change-Id: I66e96d735b722a2642508a7e7a1e73de254234d7
2020-01-29 08:40:19 -08:00
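
For readers unfamiliar with the pattern touched by this commit, here is a minimal sketch of one-time initialization. It uses `std::call_once` so that it builds with only the standard library; the Abseil form, `absl::call_once` with `absl::once_flag` from `absl/base/call_once.h`, has the same shape. The names `InitSharedState`, `EnsureInitialized`, and `CallRepeatedly` are hypothetical and do not come from the TensorFlow sources.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical one-time setup; the counter records how often it really runs.
static int g_init_count = 0;
static std::once_flag g_init_once;

void InitSharedState() { ++g_init_count; }

// May be called any number of times, from any thread.
// InitSharedState runs only on the first call.
// With Abseil this would read: absl::call_once(flag, InitSharedState),
// where flag is an absl::once_flag.
void EnsureInitialized() { std::call_once(g_init_once, InitSharedState); }

// Calls EnsureInitialized `times` times and reports how often the
// initializer actually ran.
int CallRepeatedly(int times) {
  for (int i = 0; i < times; ++i) EnsureInitialized();
  return g_init_count;
}
```

Calling `CallRepeatedly(5)` runs the initializer once, however many calls are made.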
Ayush Dubey
d65a5f1bdf Disable nccl_manager_test on single GPU and re-enable with multiple GPUs.
This change modifies `nccl_manager_test` so that it runs with multiple physical
GPUs.  The main changes are to pick the number of nodes and ranks based on the
actual devices available.

PiperOrigin-RevId: 289146110
Change-Id: I5d06ac39eee3ffe69311194485fc64974bc5410f
2020-01-10 12:49:08 -08:00
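
The idea of sizing a test from the devices that are actually present can be sketched as follows. The rule used here is an assumption made for illustration only: the log does not show the policy in `nccl_manager_test`, and `TestTopology` and `PickTopology` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical description of how a test spreads ranks over emulated nodes.
struct TestTopology {
  int num_nodes;
  int ranks_per_node;
};

// Assumed policy: never use more nodes than devices, and give every node
// the same number of ranks. Devices that do not divide evenly stay unused.
TestTopology PickTopology(int num_devices, int desired_nodes) {
  if (num_devices <= 0) return TestTopology{0, 0};
  const int nodes = std::max(1, std::min(desired_nodes, num_devices));
  return TestTopology{nodes, num_devices / nodes};
}
```

Under this assumed rule, a machine with one GPU collapses to a single node with one rank, while four GPUs and two desired nodes give two ranks per node.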
TensorFlower Gardener
034ae07f1a Merge pull request from ROCmSoftwarePlatform:google_upstream_rocm_fix_191113
PiperOrigin-RevId: 282889616
Change-Id: I0f1e0f8d099ad128b9e33b26529de89decfeb422
2019-11-27 23:29:45 -08:00
Mihai Maruseac
6c64232060 Track the ncclAllReduce API because it uses the cuLaunchCooperativeKernelMultiDevice interface.
For that we need to generate multiple device events (one per device) and one driver callback event.

Add some TraceMe annotations to nccl_manager.cc

PiperOrigin-RevId: 282467571
Change-Id: I1ce081c062876d2d504ada06c3c53b73ac8abfa9
2019-11-25 17:44:31 -08:00
Brian Atkinson
b5d2f3677f Add a redirection point to core/platform for build_config_root.bzl
PiperOrigin-RevId: 282394372
Change-Id: Iea26860cafba0304fe0846f4a992ad535292029c
2019-11-25 12:25:58 -08:00
TensorFlower Gardener
f46e758677 Merge pull request from lamarrr:patch-5
PiperOrigin-RevId: 282381036
Change-Id: I9870a352cf7a664ec4f7dfab13c6bb9bff41ec72
2019-11-25 11:24:46 -08:00
Deven Desai
4382fbe39f [ROCm] Fix for the broken ROCm CSB.
The following commit breaks the --config=rocm build

921003e1c4

The above commit adds `//tensorflow/core/profiler/lib:traceme` as a build dependency on the CUDA side but not on the ROCm side, leading to bazel errors during the ROCm build.
The "fix" is to add the dependency on the ROCm side as well.
2019-11-14 18:55:26 +00:00
A. Unique TensorFlower
921003e1c4 Track the ncclAllReduce API because it uses the cuLaunchCooperativeKernelMultiDevice interface.
For that we need to generate multiple device events (one per device) and one driver callback event.

Add some TraceMe annotations to nccl_manager.cc

PiperOrigin-RevId: 280077382
Change-Id: Ice0ebe4b98f3a6a222e210d266666b0e7fbf8171
2019-11-12 16:13:53 -08:00
Christian Sigg
114292369a Automated rollback of commit b960daa3e8
PiperOrigin-RevId: 272601402
2019-10-03 02:26:52 -07:00
Smit Hinsu
b960daa3e8 Automated rollback of commit 8ebe98c378. Revert .
PiperOrigin-RevId: 271232366
2019-09-25 18:45:50 -07:00
Jeff Daily
f223d58aa0 fix whitespace error in tensorflow/core/nccl/BUILD
2019-09-12 14:35:34 +00:00
Jeff Daily
4befe14530 add test tag rocm_multi_gpu, run those separately for CSB builds
2019-09-11 20:15:04 +00:00
Jeff Daily
ad8b9ffd5f early nccl stream is now specific to ROCm platform
This also disables the multi-node emulation of the nccl_manager_test
for the ROCm platform.
2019-09-04 21:25:50 +00:00
Deven Desai
189a16d076 Adding 'no_rocm' tag to the '//tensorflow/core/nccl:nccl_manager_test'.
A recent change broke this test for the ROCm platform. We are looking into fixing this test, but need to disable it in the meantime because it gets run as part of the ROCm Community Supported Build.
2019-08-29 19:45:39 +00:00
Christian Sigg
b8f3b8d28b PR : [ROCm] add ROCm RCCL support
Imported from GitHub PR 

Copybara import of the project:

  - ba5748981bb02b9d0e91114cdc30eb64d1650a46 add ROCm RCCL support by Jeff Daily <jeff.daily@amd.com>
  - 6f887a19731f030be58495ae4fea98b3ad1f1cc3 run buildifier against tensorflow/core/nccl/BUILD by Jeff Daily <jeff.daily@amd.com>
  - 55ce583cf484953d90eb9b9310dc77cf63b4c0c9 Merge 6f887a19731f030be58495ae4fea98b3ad1f1cc3 into f9233... by Jeff Daily <jeff.daily@amd.com>

PiperOrigin-RevId: 264892468
2019-08-22 13:08:13 -07:00
Jeff Daily
7dbb5dd1c4 improve concurrency between compute and nccl streams
The NcclManager records and waits on an Event as each Participant is added,
rather than synchronizing with the compute stream only after all Participants
have been added. Otherwise, most compute kernels are added to the compute
stream prior to the NCCL sync Event, delaying the start of the collective.
2019-08-09 15:49:01 +00:00
Gunhan Gulsoy
a1019d9526 Create the initial BUILD file for tensorflow/core/platform folder
PiperOrigin-RevId: 261762636
2019-08-05 14:17:12 -07:00
A. Unique TensorFlower
d7f09bf0be Adjust structure of all BUILD files to recommended style (https://docs.bazel.build/versions/master/skylark/build-style.html#file-structure), moving loads to top.
PiperOrigin-RevId: 252864999
2019-06-12 12:36:28 -07:00
A. Unique TensorFlower
4e7bf7f554 Use unordered maps since key ordering is not needed; switched to using
absl::flat_hash_map.
PiperOrigin-RevId: 252507005
2019-06-10 16:29:56 -07:00
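
A small sketch of the trade-off named in this commit: when callers only look keys up and never iterate in key order, a hash map is sufficient. The example uses `std::unordered_map` as a stand-in so it builds without Abseil; `absl::flat_hash_map` (from `absl/container/flat_hash_map.h`) offers a largely compatible interface. `CounterMap`, `BuildCounts`, and `CountKey` are hypothetical names, not taken from the TensorFlow sources.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-in for absl::flat_hash_map<std::string, int>.
using CounterMap = std::unordered_map<std::string, int>;

// Builds a small map of key occurrence counts. Iteration order over the
// result is unspecified, which is fine when only lookups are needed.
CounterMap BuildCounts() {
  CounterMap counts;
  const char* keys[] = {"a", "b", "a"};
  for (const char* key : keys) ++counts[key];
  return counts;
}

// Returns the count stored for `key`, or 0 if the key is absent.
int CountKey(const CounterMap& counts, const std::string& key) {
  auto it = counts.find(key);
  return it == counts.end() ? 0 : it->second;
}
```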
A. Unique TensorFlower
e44f32560d Apply 'buildozer fix moveLicensesAndDistribs movePackageToTop' to all BUILD files.
PiperOrigin-RevId: 249812574
2019-05-24 04:53:01 -07:00
Ayush Dubey
f6b81f458d Automated rollback of commit 681f6a6ac9
PiperOrigin-RevId: 229930698
2019-01-18 08:25:30 -08:00
Ayush Dubey
b58039978d Automated rollback of commit 681f6a6ac9
PiperOrigin-RevId: 229415244
2019-01-15 12:39:46 -08:00
Ayush Dubey
681f6a6ac9 Add a CollectiveReduce implementation that uses NCCL.
This change adds an implementation of the CollectiveImplementationInterface
that uses NCCL for all-reduce.  The main bits include setting up a NCCL
communicator key during collective param resolution, and calling the relevant
functionality in NcclManager.

PiperOrigin-RevId: 229277364
2019-01-14 16:31:11 -08:00
Ayush Dubey
1fa274b1e2 Better error checking and testing in NcclManager.
After this change, we check the return value of every CUDA and NCCL call in
NcclManager.  If any call is unsuccessful, we call the NCCL callback with an
error status.

This change also re-enables NCCL tests.

PiperOrigin-RevId: 224038066
2018-12-04 13:44:57 -08:00
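
The error-handling approach described in this commit, checking every call and reporting the first failure through the completion callback, can be sketched roughly as below. This is an illustration under stated assumptions, not the NcclManager code: `Status` is a hypothetical minimal type (TensorFlow's own status class is richer), and the steps are plain functions returning 0 on success to mimic C-style result codes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical minimal status type used only for this sketch.
struct Status {
  bool ok;
  std::string message;
};

using DoneCallback = std::function<void(const Status&)>;
using Step = std::function<int()>;  // Returns 0 on success, non-zero on error.

// Runs the steps in order and checks every return code. On the first
// failure, `done` is called with an error status and later steps are
// skipped. If all steps succeed, `done` is called with an OK status.
void RunSteps(const std::vector<Step>& steps, const DoneCallback& done) {
  for (std::size_t i = 0; i < steps.size(); ++i) {
    const int rc = steps[i]();
    if (rc != 0) {
      done(Status{false, "step " + std::to_string(i) +
                             " failed with code " + std::to_string(rc)});
      return;
    }
  }
  done(Status{true, ""});
}
```

The callback fires exactly once per `RunSteps` call, so the caller always learns whether the sequence finished or where it stopped.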
A. Unique TensorFlower
fc6cd33c33 Move contrib/nccl to core/nccl.
PiperOrigin-RevId: 218908694
2018-10-26 14:05:30 -07:00