[ROCm] Raising the memory allocation cap for GPU unit tests from 1GB to 2GB

This commit updates the `parallel_gpu_execute.sh` script to raise the GPU memory allocation cap from 1GB to 2GB when running unit tests.

Recently a couple of unit tests started failing on the ROCm platform because they were running out of memory:

```
//tensorflow/python/kernel_tests:extract_image_patches_grad_test_gpu
//tensorflow/python/ops/numpy_ops:np_interop_test_gpu
```

GPU unit tests (at least on the ROCm platform) are run with a memory cap that is set and enforced as shown here (a simplified sketch follows the list of links):

* https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh#L26-L32
* https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/stream_executor_pimpl.cc#L130-L137
* https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/stream_executor_pimpl.cc#L151
* https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/stream_executor_pimpl.cc#L487-L503
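
In other words, `parallel_gpu_execute.sh` exports `TF_PER_DEVICE_MEMORY_LIMIT_MB`, and the StreamExecutor allocation path reads that env var and rejects any device allocation that would push usage past the cap. Below is a minimal, self-contained C++ sketch of that logic, not the actual TF code; `CappedDeviceAllocator` and its members are made-up names for illustration:

```cpp
#include <cstdint>
#include <cstdlib>
#include <iostream>

// Hypothetical stand-in for the capped StreamExecutor allocation path.
// Reads the per-device cap from TF_PER_DEVICE_MEMORY_LIMIT_MB (unset/0
// means "no limit"), then rejects any allocation that would exceed it.
class CappedDeviceAllocator {
 public:
  CappedDeviceAllocator() {
    if (const char* env = std::getenv("TF_PER_DEVICE_MEMORY_LIMIT_MB")) {
      limit_bytes_ = static_cast<int64_t>(std::atoll(env)) * 1024 * 1024;
    }
  }

  // Returns true if the allocation fits under the cap; otherwise logs a
  // warning mirroring the one emitted by stream_executor_pimpl.cc.
  bool Allocate(int64_t size, int device_ordinal) {
    if (limit_bytes_ > 0 && used_bytes_ + size > limit_bytes_) {
      std::cerr << "W Not enough memory to allocate " << size
                << " on device " << device_ordinal
                << " within provided limit. [used=" << used_bytes_
                << ", limit=" << limit_bytes_ << "]\n";
      return false;
    }
    used_bytes_ += size;  // Real code would also perform the device malloc.
    return true;
  }

 private:
  int64_t limit_bytes_ = 0;  // 0 => unlimited (env var unset).
  int64_t used_bytes_ = 0;
};
```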

The `parallel_gpu_execute.sh` script does not seem to be used on the CUDA platform anymore (it may have been in the past); there is no reference to it in the `Invocation Details` tab of the `Linux GPU` CI job.

For example: https://source.cloud.google.com/results/invocations/09d63e6a-f7a9-4fc6-9708-2fdd40b8b193/details

GPU unit tests on the CUDA platform also do not appear to be subjected to the 1GB memory cap. This can be verified by looking at the `Target Log` for the `//tensorflow/python/ops/numpy_ops:np_interop_test_gpu` test (or, in fact, any GPU unit test) in the `Linux GPU` CI job.

For example: https://source.cloud.google.com/results/invocations/09d63e6a-f7a9-4fc6-9708-2fdd40b8b193/targets/%2F%2Ftensorflow%2Fpython%2Fops%2Fnumpy_ops:np_interop_test_gpu/log

On the ROCm platform, we see the following log messages, which are generated as a consequence of the memory cap when TF tries to grab the entire available GPU memory on startup:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/stream_executor_pimpl.cc#L488-L494

```
 W tensorflow/stream_executor/stream_executor_pimpl.cc:490] Not enough memory to allocate 16133306368 on device 0 within provided limit. [used=0, limit=1073741824]
 W tensorflow/stream_executor/stream_executor_pimpl.cc:490] Not enough memory to allocate 14519974912 on device 0 within provided limit. [used=0, limit=1073741824]
 W tensorflow/stream_executor/stream_executor_pimpl.cc:490] Not enough memory to allocate 13067976704 on device 0 within provided limit. [used=0, limit=1073741824]
 W tensorflow/stream_executor/stream_executor_pimpl.cc:490] Not enough memory to allocate 11761178624 on device 0 within provided limit. [used=0, limit=1073741824]
...
...
...
```
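
Note the ~0.9 ratio between successive requests in the log: when the initial attempt to grab all free GPU memory fails against the cap, TF retries with a smaller request. The following is a hypothetical sketch of that back-off loop, inferred from the log output above rather than copied from the TF sources (`GrabGpuMemoryWithBackoff` and `TryAllocate` are made-up names):

```cpp
#include <cstdint>

// Hypothetical sketch of the startup behavior implied by the log above:
// ask for all free GPU memory first, and on each rejected attempt retry
// with ~90% of the previous request until one succeeds. `TryAllocate`
// stands in for the capped StreamExecutor allocation call.
int64_t GrabGpuMemoryWithBackoff(int64_t free_bytes,
                                 bool (*TryAllocate)(int64_t)) {
  int64_t request = free_bytes;
  constexpr int64_t kMinRequest = 1 << 20;  // Give up below 1MB.
  while (request > kMinRequest) {
    if (TryAllocate(request)) {
      return request;  // Succeeded; this becomes the GPU memory pool size.
    }
    // Each failed attempt printed one "Not enough memory ..." warning;
    // shrink the request and retry (16.1GB -> 14.5GB -> 13.1GB -> ...).
    request = static_cast<int64_t>(request * 0.9);
  }
  return 0;  // Could not allocate anything within the limit.
}
```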

These warning messages are not present in the unit test logs for the `Linux GPU` CI job, which suggests that the env var `TF_PER_DEVICE_MEMORY_LIMIT_MB` is not set when those tests are run. Either that, or the GPU on which the tests run has only 1GB of total memory, which is unlikely.
The commit changes 2 files, with 2 additions and 8 deletions: the `BUILD` change removes the `no_rocm` tag from `np_interop_test` (one of the failing tests listed above), and the script change raises the default memory cap to 2GB.

`tensorflow/python/ops/numpy_ops/BUILD`:
```diff
@@ -110,7 +110,6 @@ cuda_py_test(
 cuda_py_test(
     name = "np_interop_test",
     srcs = ["np_interop_test.py"],
-    tags = ["no_rocm"],
     deps = [
         ":numpy",
         "//tensorflow:tensorflow_py",
```

`tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh`:

```diff
@@ -23,13 +23,8 @@
 TF_GPU_COUNT=${TF_GPU_COUNT:-4}
 TF_TESTS_PER_GPU=${TF_TESTS_PER_GPU:-8}
-# We want to allow running one of the following configs:
-# - 4 tests per GPU on k80
-# - 8 tests per GPU on p100
-# p100 has minimum 12G memory. Therefore, we should limit each test to 1.5G.
-# To leave some room in case we want to run more tests in parallel in the
-# future and to use a rounder number, we set it to 1G.
-export TF_PER_DEVICE_MEMORY_LIMIT_MB=${TF_PER_DEVICE_MEMORY_LIMIT_MB:-1024}
+export TF_PER_DEVICE_MEMORY_LIMIT_MB=${TF_PER_DEVICE_MEMORY_LIMIT_MB:-2048}
 # *******************************************************************
 # This section of the script is needed to
```