Merge pull request #38355 from Flamefire:slurm_cluster_resolver_docu

PiperOrigin-RevId: 305834198
Change-Id: I7bfeffbcd21a7245043c67769a3cfe6c0e423c59
TensorFlower Gardener 2020-04-09 23:49:21 -07:00
commit b083ceafd4
3 changed files with 98 additions and 51 deletions


@ -1,50 +0,0 @@
# Slurm Cluster Resolver
The Slurm Cluster Resolver resolves the cluster specification for distributing TensorFlow work launched on HPC systems running Slurm. This implementation can handle homogeneous task allocation on compute nodes with the default task distribution plan. The resolution is done by determining the job configuration through a number of Slurm output variables and user input. The resolver requires the specification of the total number of tasks launched, the process ID/rank of the running process, the number of tasks launched per node, the number of GPUs present on each node, and the number of GPUs to allocate for each task.
The process ID/rank is extracted from the environment variable ```SLURM_PROCID``` and the total number of tasks launched is extracted from ```SLURM_NTASKS```. The number of tasks per node is extracted from ```SLURM_NTASKS_PER_NODE```, unless a value is specified by the user. The number of GPUs present on each node and the number of GPUs for each task have to be specified by the user. A base port can be specified by the user; if more than one task is launched per node, the port number is incremented for each additional task on that node. The hostnames are resolved by running the command ```scontrol show hostname``` through a subprocess, which returns a list of hostnames. The distribution of ranks/process IDs follows that order by default. By default, allocated GPUs are automatically exposed to processes according to the specification by setting ```CUDA_VISIBLE_DEVICES```.
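For illustration, the hostname expansion described above can be reproduced with a short sketch like the following. This is not the resolver's actual implementation, and it assumes it runs inside a Slurm allocation so that ```scontrol``` can see the node list:
```
import subprocess

# Expand the compact Slurm node list (e.g. "t02n[13,41]") into one
# hostname per line, as the resolver does via a subprocess call.
output = subprocess.check_output(['scontrol', 'show', 'hostname'])
hostnames = output.decode().strip().split('\n')
print(hostnames)  # e.g. ['t02n13', 't02n41']
```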
## Example
- Slurm allocation in shell ```salloc --nodes=2 -t 01:30:00 -A <project ID> --ntasks-per-node=2 --gres=gpu:k80:2 --exclusive```
- Creating cluster in Python
```
cluster_resolver = tf.contrib.cluster_resolver.SlurmClusterResolver(
    {'ps': 1, 'worker': 3},
    port_base=8888,
    tasks_per_node=2,
    gpus_per_node=2,
    gpus_per_task=1,
    auto_set_gpu=True)
cluster = cluster_resolver.cluster_spec()
job_name, task_index = cluster_resolver.get_task_info()
```
The above example resolves a cluster specification for a Slurm job allocation with two compute nodes, each having two GPUs, and two tasks launched on each node. The jobs are specified in the form of a dictionary where the key is a string representing the job name and the value is an integer specifying the number of tasks in that job. ```cluster_resolver.cluster_spec()``` will return a cluster specification object, and the cluster specification will have the following form as a protobuf:
```
job {
  name: "ps"
  tasks {
    value: "t02n13:8888"
  }
}
job {
  name: "worker"
  tasks {
    value: "t02n13:8889"
  }
  tasks {
    key: 1
    value: "t02n41:8888"
  }
  tasks {
    key: 2
    value: "t02n41:8889"
  }
}
```
After calling ```cluster_resolver.cluster_spec()``` the internal data structures of the resolver are populated. By looking at the process ID/rank and comparing it with the cluster specification, each process can determine which task it belongs to. This can be retrieved by calling ```cluster_resolver.get_task_info()```, which returns a string specifying the job name and an integer specifying the task index.
GPUs will be automatically allocated to the processes. For example, in the specification above ```t02n41:8888``` will see GPU 0 and ```t02n41:8889``` will see GPU 1.


@ -0,0 +1,97 @@
# Slurm Cluster Resolver
The Slurm Cluster Resolver resolves the cluster specification for distributing
TensorFlow work launched on HPC systems running Slurm. This implementation is
able to handle homogeneous and heterogeneous tasks as long as the number of GPUs
per node and per task are the same. This means that on nodes with 4 GPUs each it
is possible to allocate 4 processes on node A and only 2 on node B. The
resolution is done by determining the job configuration through a number of
Slurm variables and can be overridden by user input. By default everything is
determined from the Slurm environment, hence for most use cases no manual
setting of parameters is required.
## How it works
`SlurmClusterResolver` reads the environment variables that are set inside a job
step launched by Slurm. This means it will only work correctly for applications
launched via `srun`.
The process ID/rank is extracted from the environment variable `SLURM_PROCID`
and the total number of tasks launched is extracted from
`SLURM_STEP_NUM_TASKS`. The hostnames are resolved by inspecting
`SLURM_STEP_NODELIST`. The number of tasks per node is extracted from
`SLURM_STEP_TASKS_PER_NODE`, unless a value is specified by the user. This
variable makes heterogeneous task distributions possible. The user can set
`tasks_per_node` to a single integer for homogeneous tasks or to a dictionary
mapping node names to numbers of tasks for heterogeneous distributions. However,
setting this is **NOT** recommended, as there is a chance it makes the value of
`SLURM_PROCID` inconsistent with the resolved task layout.
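For illustration, the sketch below reads the same variables inside a job step.
It is not the resolver's internal code, and the parsing of
`SLURM_STEP_TASKS_PER_NODE` assumes Slurm's compact `count(xN)` notation:
```
import os
import re

# Rank of this process and total number of tasks in the job step.
rank = int(os.environ['SLURM_PROCID'])
num_tasks = int(os.environ['SLURM_STEP_NUM_TASKS'])

# Compact node list, e.g. "t02n[13,41]"; expanding it requires
# `scontrol show hostname` or an equivalent hostlist parser.
nodelist = os.environ['SLURM_STEP_NODELIST']

# Tasks per node in compact form, e.g. "2(x2)" or "3,1".
tasks_per_node = []
for part in os.environ['SLURM_STEP_TASKS_PER_NODE'].split(','):
  match = re.match(r'(\d+)(?:\(x(\d+)\))?', part)
  count, repeat = int(match.group(1)), int(match.group(2) or 1)
  tasks_per_node.extend([count] * repeat)

print(rank, num_tasks, nodelist, tasks_per_node)
```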
A base port can be specified by the user; if more than one task is launched per
node, the port number is incremented for each additional task on that node. If
no base port is given, a reasonable default is used.
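For example, with base port 8888 and two tasks per node, the tasks on a node get
ports 8888 and 8889. A minimal sketch of this scheme (the helper name and the
default port here are illustrative, not part of the API):
```
def assign_ports(hostnames, tasks_per_node, port_base=8888):
  """Assigns one host:port target per task, restarting at the base port on each node."""
  targets = []
  for host, num_tasks in zip(hostnames, tasks_per_node):
    for task in range(num_tasks):
      targets.append('%s:%d' % (host, port_base + task))
  return targets

# ['t02n13:8888', 't02n13:8889', 't02n41:8888', 't02n41:8889']
print(assign_ports(['t02n13', 't02n41'], [2, 2]))
```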
The number of GPUs present on each node and the number of GPUs for each task
are detected automatically. This is done by checking `CUDA_VISIBLE_DEVICES`
first (which Slurm sets to the list of GPUs for the current node), with a
fallback to using `nvidia-smi`. If this doesn't work, or if non-NVIDIA GPUs are
used, those two values have to be specified by the user. By default, allocated
GPUs are automatically exposed to processes according to the specification by
setting `CUDA_VISIBLE_DEVICES`.
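A sketch of this detection order (the function below is illustrative and not
part of the resolver's API; it assumes `nvidia-smi` is available on the path for
the fallback):
```
import os
import subprocess

def detect_gpus_per_node():
  # Prefer CUDA_VISIBLE_DEVICES, which Slurm sets to the GPUs
  # allocated on the current node, e.g. "0,1,2,3".
  devices = os.environ.get('CUDA_VISIBLE_DEVICES')
  if devices:
    return len(devices.split(','))
  # Fall back to asking the driver; `nvidia-smi -L` prints one line per GPU.
  try:
    output = subprocess.check_output(['nvidia-smi', '-L'])
    return len(output.decode().strip().split('\n'))
  except (OSError, subprocess.CalledProcessError):
    # Neither source worked: gpus_per_node and gpus_per_task
    # have to be passed to the resolver explicitly.
    return None
```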
## Basic example
- Slurm allocation in shell `salloc --nodes=2 -t 01:30:00 --ntasks-per-node=2
--gres=gpu:k80:4 --exclusive`
- Run the example `srun python tf_example.py`
- Creating cluster in Python:
```
import tensorflow as tf

cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    cluster_resolver=cluster_resolver)
with strategy.scope():
  # Load and compile model and data
```
The above example will allocate 4 jobs on 2 nodes, with each node having 2 jobs
and 4 GPUs. `cluster_resolver.cluster_spec()` will return a cluster
specification object in protobuf format with the following value (host names may
vary):
```
job {
  name: "worker"
  tasks {
    key: 0
    value: "t02n13:8888"
  }
  tasks {
    key: 1
    value: "t02n13:8889"
  }
  tasks {
    key: 2
    value: "t02n41:8888"
  }
  tasks {
    key: 3
    value: "t02n41:8889"
  }
}
```
The `job_name` will be `worker` for all nodes and `task_index` will range from
`0` to `3`. GPUs will also be allocated automatically, so the first job on each
node will see GPUs 0 and 1, and the second job GPUs 2 and 3.
## Advanced example
- Assuming the same job parameters (`salloc` & `srun`) as above
- Creating cluster in Python:
```
cluster_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver(
    {'ps': 1, 'worker': 3},
    port_base=1337,
    tasks_per_node=2,
    gpus_per_node=2,
    gpus_per_task=1,
    auto_set_gpu=False)
cluster = cluster_resolver.cluster_spec()
job_name, task_index = cluster_resolver.get_task_info()
```
In this case 1 parameter server job and 3 worker jobs are used. The resulting
protobuf specification will look similar to this:
```
job {
  name: "ps"
  tasks {
    key: 0
    value: "t02n13:1337"
  }
}
job {
  name: "worker"
  tasks {
    key: 0
    value: "t02n13:1338"
  }
  tasks {
    key: 1
    value: "t02n41:1337"
  }
  tasks {
    key: 2
    value: "t02n41:1338"
  }
}
```
The value of `job_name` will be `ps` for `t02n13:1337` and `worker` for all
others. No GPU allocation is done by the cluster resolver in this case, so it
has to be done manually (as in the sketch below), which is useful if e.g. GPU 0
should go to the first process and GPU 3 to the second process on each node.
Also note that only 1 GPU will be used per task.
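With `auto_set_gpu=False`, the mapping has to be applied before TensorFlow
initializes its devices. A sketch of the manual assignment described above,
using Slurm's `SLURM_LOCALID` (the node-local task index); the exact mapping is
up to the user:
```
import os

# First task on each node gets GPU 0, second task gets GPU 3.
local_task_index = int(os.environ['SLURM_LOCALID'])
os.environ['CUDA_VISIBLE_DEVICES'] = {0: '0', 1: '3'}[local_task_index]
```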
## Extension points
The class `SlurmClusterResolver` provides some methods that are meant to be
overridden by derived classes:
- `_resolve_own_rank`
- `_resolve_num_tasks`
- `_resolve_hostlist`
- `_resolve_task_configuration`
These can be used to implement a cluster resolver that gets its information
from a different source, e.g. via MPI, a file, or other environment variables.
See the documentation of these methods for what to return; a sketch of such a
derived resolver is shown below.
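For instance, a hypothetical resolver that takes its layout from MPI instead of
the Slurm environment could be sketched as follows. The return types used here
(integers for rank and task count, a list of host names, and a mapping from host
name to task count) are assumptions and should be checked against the methods'
docstrings, as should whether the base constructor can be reused unchanged:
```
import collections

import tensorflow as tf
from mpi4py import MPI


class MPIClusterResolver(tf.distribute.cluster_resolver.SlurmClusterResolver):
  """Hypothetical resolver that derives the cluster layout from MPI."""

  def _resolve_own_rank(self):
    # Global rank of this process.
    return MPI.COMM_WORLD.Get_rank()

  def _resolve_num_tasks(self):
    # Total number of launched tasks.
    return MPI.COMM_WORLD.Get_size()

  def _resolve_hostlist(self):
    # Unique host names, in rank order.
    hosts = MPI.COMM_WORLD.allgather(MPI.Get_processor_name())
    return list(dict.fromkeys(hosts))

  def _resolve_task_configuration(self):
    # Mapping from host name to the number of tasks running on it.
    hosts = MPI.COMM_WORLD.allgather(MPI.Get_processor_name())
    return dict(collections.Counter(hosts))
```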


@ -201,7 +201,7 @@ class SlurmClusterResolver(ClusterResolver):
- SLURM_PROCID
- (opt) SLURM_STEP_NUM_TASKS
- (opt) SLURM_STEP_NODELIST
- (opt) SLURM_TASKS_PER_NODE
- (opt) SLURM_STEP_TASKS_PER_NODE
Args:
jobs: Dictionary with job names as key and number of tasks in the job as