Merge pull request #38355 from Flamefire:slurm_cluster_resolver_docu
PiperOrigin-RevId: 305834198 Change-Id: I7bfeffbcd21a7245043c67769a3cfe6c0e423c59
commit b083ceafd4
@ -1,50 +0,0 @@
# Slurm Cluster Resolver

The Slurm Cluster Resolver resolves the cluster specification for distributing TensorFlow work launched on HPC systems running Slurm. This implementation can handle homogeneous task allocation on compute nodes with the default task distribution plane. The resolution is done by determining the job configuration through a number of Slurm output variables and user input. The resolver requires the specification of the total number of tasks launched, the process ID/rank of the running process, the number of tasks launched per node, the number of GPUs present on each node, and the number of GPUs to allocate for each task.

The process ID/rank is extracted from the environment variable `SLURM_PROCID` and the total number of tasks launched is extracted from `SLURM_NTASKS`. The number of tasks per node is extracted from `SLURM_NTASKS_PER_NODE`, unless a value is specified by the user. The number of GPUs present on each node and the number of GPUs for each task have to be specified by the user. A base port can be specified by the user; if more than one task is launched per node, the port number is incremented for each additional task on that node. The hostnames are resolved by running `scontrol show hostname` in a subprocess, which returns a list of hostnames; the distribution of rank/process ID by default follows that order. By default, allocated GPUs are automatically exposed to processes according to the specification by setting `CUDA_VISIBLE_DEVICES`.

## Example

- Slurm allocation in shell: `salloc --nodes=2 -t 01:30:00 -A <project ID> --ntasks-per-node=2 --gres=gpu:k80:2 --exclusive`
- Creating cluster in Python:

```python
import tensorflow as tf

cluster_resolver = tf.contrib.cluster_resolver.SlurmClusterResolver(
    {'ps': 1, 'worker': 3},
    port_base=8888,
    tasks_per_node=2,
    gpus_per_node=2,
    gpus_per_task=1,
    auto_set_gpu=True)

cluster = cluster_resolver.cluster_spec()
job_name, task_index = cluster_resolver.get_task_info()
```

The above example resolves a cluster specification for a Slurm job allocation with two compute nodes, each having two GPUs, where two tasks are launched on each node. The jobs are specified as a dictionary whose keys are strings representing the job names and whose values are integers specifying the number of tasks in each job. `cluster_resolver.cluster_spec()` returns a cluster specification object, which will have the following form as protobuf.

```
job {
  name: "ps"
  tasks {
    value: "t02n13:8888"
  }
}
job {
  name: "worker"
  tasks {
    value: "t02n13:8889"
  }
  tasks {
    key: 1
    value: "t02n41:8888"
  }
  tasks {
    key: 2
    value: "t02n41:8889"
  }
}
```

After calling `cluster_resolver.cluster_spec()` the internal data structures of the resolver are populated. By looking at the process ID/rank and comparing it with the cluster specification, each task can determine which job and task index it corresponds to. This information can be retrieved by calling `cluster_resolver.get_task_info()`, which returns a string specifying the job name and an integer specifying the task index.
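
As a minimal sketch of how the returned values could be used (assuming the TF 1.x-style `tf.train.Server` API; this is not part of the resolver itself):

```python
import tensorflow as tf

# Sketch: start one server per task from the resolved specification.
cluster = cluster_resolver.cluster_spec()
job_name, task_index = cluster_resolver.get_task_info()

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
if job_name == 'ps':
    server.join()  # parameter servers only serve variables
```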

GPUs will be automatically allocated to the processes. In the above example, `t02n41:8888` will see GPU 0 and `t02n41:8889` will see GPU 1.
@ -0,0 +1,97 @@

# Slurm Cluster Resolver

The Slurm Cluster Resolver resolves the cluster specification for distributing
TensorFlow work launched on HPC systems running Slurm. This implementation is
able to handle homogeneous and heterogeneous tasks as long as the number of GPUs
per node and per task is the same. This means that on nodes with 4 GPUs each it
is possible to allocate 4 processes on node A and only 2 on node B. The
resolution is done by determining the job configuration through a number of
Slurm variables, which can be overwritten by user input. By default everything
is determined from the Slurm environment, hence for most use cases no manual
setting of parameters is required.

## How it works

`SlurmClusterResolver` reads the environment variables that are set inside a job
step launched by Slurm. This means it will only work correctly for applications
launched via `srun`.

The process ID/rank is extracted from the environment variable `SLURM_PROCID`
and the total number of tasks launched is extracted from `SLURM_STEP_NUM_TASKS`.
The hostnames are resolved by inspecting `SLURM_STEP_NODELIST`. The number of
tasks per node is extracted from `SLURM_STEP_TASKS_PER_NODE`, unless a value is
specified by the user. Through this variable heterogeneous task distributions
are possible. The user can set `tasks_per_node` to a single integer for
homogeneous tasks or to a dictionary mapping node names to numbers of tasks for
heterogeneous distributions. However, setting this is **NOT** recommended, as
there is a chance it makes `SLURM_PROCID` wrong.
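
For illustration, such an override could look like the following sketch (the
node names are placeholders; in most cases the constructor should be called
without arguments):

```python
import tensorflow as tf

# Sketch only: explicitly overriding the task distribution that would
# otherwise be read from SLURM_STEP_TASKS_PER_NODE. Node names are examples.
cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(
    tasks_per_node={'t02n13': 4, 't02n41': 2})
```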

A base port can be specified by the user; if more than one task is launched per
node, the port number is incremented for each additional task on that node.
However, a reasonable default is used.

The number of GPUs present on each node and the number of GPUs for each task are
automatically detected. This is done by checking `CUDA_VISIBLE_DEVICES` first
(which is set by Slurm to a list of GPUs for the current node), with a fallback
to using `nvidia-smi`. If this doesn't work, or non-NVIDIA GPUs are used, those
two values have to be specified by the user. By default, allocated GPUs will be
automatically exposed to processes according to the specification by setting
`CUDA_VISIBLE_DEVICES`.
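
If auto-detection is not possible, a manual override might look like this sketch
(the numbers are just examples):

```python
import tensorflow as tf

# Sketch only: manually specifying the GPU counts when neither
# CUDA_VISIBLE_DEVICES nor nvidia-smi can be used for auto-detection.
cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(
    gpus_per_node=4,    # GPUs physically present on each node
    gpus_per_task=2,    # GPUs to expose to each task
    auto_set_gpu=True)  # let the resolver set CUDA_VISIBLE_DEVICES
```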

## Basic example

- Slurm allocation in shell: `salloc --nodes=2 -t 01:30:00 --ntasks-per-node=2
  --gres=gpu:k80:4 --exclusive`
- Run the example: `srun python tf_example.py`
- Creating cluster in Python:

  ```python
  import tensorflow as tf

  cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
      cluster_resolver=cluster_resolver)
  with strategy.scope():
      ...  # Load and compile model and data
  ```

The above example will allocate 4 jobs on 2 nodes, with each node having 2 jobs
and 4 GPUs. `cluster_resolver.cluster_spec()` will return a cluster
specification object in protobuf format with the following value (host names may
vary):

```
job {
  name: "worker"
  tasks {
    key: 0
    value: "t02n13:8888"
  }
  tasks {
    key: 1
    value: "t02n13:8889"
  }
  tasks {
    key: 2
    value: "t02n41:8888"
  }
  tasks {
    key: 3
    value: "t02n41:8889"
  }
}
```

The `job_name` will be `worker` for all nodes and `task_index` will be `0` to
`3`. GPUs will also be allocated automatically, so the first job on each node
will see GPUs 0 and 1, and the second job GPUs 2 and 3.
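
To sanity-check this assignment, each task can list the devices it sees (a small
sketch, not part of the resolver itself):

```python
import os
import tensorflow as tf

# Sketch: run this after creating the strategy / cluster spec, so that
# CUDA_VISIBLE_DEVICES has already been set by the resolver.
print('Rank', os.environ.get('SLURM_PROCID'),
      'sees', tf.config.list_physical_devices('GPU'))
```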

## Advanced example

- Assuming the same job parameters (`salloc` & `srun`) as above
- Creating cluster in Python:

  ```python
  cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(
      {'ps': 1, 'worker': 3},
      port_base=1337,
      tasks_per_node=2,
      gpus_per_node=2,
      gpus_per_task=1,
      auto_set_gpu=False)

  cluster = cluster_resolver.cluster_spec()
  job_name, task_index = cluster_resolver.get_task_info()
  ```

In this case 1 parameter server job and 3 worker jobs are used. The resulting
protobuf specification will look similar to this:

```
job {
  name: "ps"
  tasks {
    key: 0
    value: "t02n13:1337"
  }
}
job {
  name: "worker"
  tasks {
    key: 0
    value: "t02n13:1338"
  }
  tasks {
    key: 1
    value: "t02n41:1337"
  }
  tasks {
    key: 2
    value: "t02n41:1338"
  }
}
```

The value of `job_name` will be `ps` for `t02n13:1337` and `worker` for all
others. There will be no GPU allocation done by the cluster resolver, so this
has to be done manually, which is useful if e.g. GPU 0 should go to the first
process and GPU 3 to the second process on each node. Also note that only 1 GPU
will be used per task.
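
A possible way to do this manual assignment (a sketch, assuming `SLURM_LOCALID`
holds the task's index on its own node) is to set `CUDA_VISIBLE_DEVICES` before
TensorFlow initializes the GPUs:

```python
import os

# Sketch only: with auto_set_gpu=False the resolver leaves GPU assignment
# to the user. SLURM_LOCALID is the task's index on its own node.
local_rank = int(os.environ['SLURM_LOCALID'])
# E.g. give GPU 0 to the first task and GPU 3 to the second task on each node.
os.environ['CUDA_VISIBLE_DEVICES'] = '0' if local_rank == 0 else '3'
```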

## Extension points

The class `SlurmClusterResolver` provides some methods that are meant to be
overridden by deriving classes:

- `_resolve_own_rank`
- `_resolve_num_tasks`
- `_resolve_hostlist`
- `_resolve_task_configuration`

Those can be used to implement a cluster resolver that gets its information from
a different source, e.g. via MPI, a file or other environment variables. See
the documentation of these methods on what to return.
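
For illustration only, a deriving class that reads its configuration from
generic environment variables might look roughly like the sketch below (the
`MY_*` variable names are hypothetical, and the exact return types should be
checked against the method documentation):

```python
import os
import tensorflow as tf


class EnvVarClusterResolver(tf.distribute.cluster_resolver.SlurmClusterResolver):
  """Sketch: resolve the cluster from generic environment variables."""

  def _resolve_own_rank(self):
    # Rank/process ID of this task (hypothetical MY_RANK variable).
    return int(os.environ['MY_RANK'])

  def _resolve_num_tasks(self):
    # Total number of tasks in the job step.
    return int(os.environ['MY_NUM_TASKS'])

  def _resolve_hostlist(self):
    # Ordered list of host names, e.g. "node1,node2".
    return os.environ['MY_HOSTLIST'].split(',')

  def _resolve_task_configuration(self):
    # Assumed to map host names to the number of tasks on each host;
    # check the method documentation for the exact contract.
    return {host: int(os.environ['MY_TASKS_PER_NODE'])
            for host in self._resolve_hostlist()}
```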
@ -201,7 +201,7 @@ class SlurmClusterResolver(ClusterResolver):

- SLURM_PROCID
- (opt) SLURM_STEP_NUM_TASKS
- (opt) SLURM_STEP_NODELIST
- (opt) SLURM_TASKS_PER_NODE
- (opt) SLURM_STEP_TASKS_PER_NODE

Args:
  jobs: Dictionary with job names as key and number of tasks in the job as