PSv2: Docstring minor rephrase and typo/example corrections.
PiperOrigin-RevId: 339929158
Change-Id: I8592aa6e2cec32a2ba97743a6f022f263f0f65e2

commit 84384703c0 (parent 60ac36f504)
@@ -291,11 +291,11 @@ class PerWorkerValues(object):
 """A container that holds a list of values, one value per worker.
 
 `tf.distribute.experimental.coordinator.PerWorkerValues` contains a collection
-of values, where each of the value is located one worker respectively, and
-upon being used as one of the `args` or `kwargs` of
+of values, where each of the values is located on its corresponding worker,
+and upon being used as one of the `args` or `kwargs` of
 `tf.distribute.experimental.coordinator.ClusterCoordinator.schedule()`, the
 value specific to a worker will be passed into the function being executed at
-that particular worker.
+that corresponding worker.
 
 Currently, the only supported path to create an object of
 `tf.distribute.experimental.coordinator.PerWorkerValues` is through calling
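For reference, a minimal sketch of the creation path described above. This is illustrative only: `cluster_resolver` is a placeholder for a resolver pointing at an already-running parameter server cluster, and the rest follows the public `ClusterCoordinator` API.

```
import tensorflow as tf

# Assumed setup: `cluster_resolver` points at a running PS/worker cluster.
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)

@tf.function
def worker_fn(iterator):
  return next(iterator)

per_worker_dataset = coordinator.create_per_worker_dataset(
    lambda: tf.data.Dataset.from_tensor_slices([3] * 3))
# Calling iter() on the per-worker dataset returns a PerWorkerValues of
# iterators; `schedule` then passes each worker its own iterator.
per_worker_iterator = iter(per_worker_dataset)
remote_value = coordinator.schedule(worker_fn, args=(per_worker_iterator,))
```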
@@ -948,14 +948,13 @@ class ClusterCoordinator(object):
 failed worker, it will be added for function execution after datasets created
 by `create_per_worker_dataset` are re-built on it.
 
-When a parameter server the coordinator fails, a
-`tf.errors.UnavailableError` is raised by `schedule`, `join` or `done`. In
-this case, in addition to bringing back the failed parameter server, users
-should restart the coordinator to so that it reconnects to the parameter
-server, re-creates the variables and loads checkpoints. If the coordinator
-fails, users need to bring it back as well. The program will automatically
-connect to the parameter servers and workers, and continue the progress from a
-checkpoint.
+When a parameter server fails, a `tf.errors.UnavailableError` is raised by
+`schedule`, `join` or `done`. In this case, in addition to bringing back the
+failed parameter server, users should restart the coordinator so that it
+reconnects to workers and parameter servers, re-creates the variables, and
+loads checkpoints. If the coordinator fails, after the user brings it back,
+the program will automatically connect to workers and parameter servers, and
+continue the progress from a checkpoint.
 
 It is thus essential that in user's program, a checkpoint file is periodically
 saved, and restored at the start of the program. If an
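A minimal sketch of the save-periodically / restore-at-start pattern this fault-tolerance story relies on; `coordinator`, `model`, `optimizer`, `ckpt_dir`, and `num_epochs` are assumed to be defined elsewhere.

```
import tensorflow as tf

# Assumed names: `coordinator`, `model`, `optimizer`, `ckpt_dir`, `num_epochs`.
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, directory=ckpt_dir, max_to_keep=3)

# Restore at program start, so a restarted coordinator resumes from the
# latest checkpoint instead of training from scratch.
if manager.latest_checkpoint:
  checkpoint.restore(manager.latest_checkpoint)

for epoch in range(num_epochs):
  # ... coordinator.schedule(...) the training steps for this epoch ...
  coordinator.join()   # wait for scheduled functions to finish
  manager.save()       # save periodically, e.g. once per epoch
```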
@@ -1137,7 +1136,7 @@ class ClusterCoordinator(object):
 
   def per_worker_dataset_fn():
     return strategy.distribute_datasets_from_function(
-        lambda x: tf.data.from_tensor_slices([3] * 3)
+        lambda x: tf.data.Dataset.from_tensor_slices([3] * 3))
 
   per_worker_dataset = coordinator.create_per_worker_dataset(
       per_worker_dataset_fn)
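The `x` the lambda ignores is a `tf.distribute.InputContext`. A slightly fuller sketch of the corrected `per_worker_dataset_fn` that uses it for per-worker sharding (illustrative only; assumes `strategy` as in the snippet):

```
def per_worker_dataset_fn():
  def dataset_fn(input_context):
    ds = tf.data.Dataset.from_tensor_slices([3] * 3)
    # Shard the data across input pipelines using the InputContext that
    # `distribute_datasets_from_function` passes in.
    return ds.shard(input_context.num_input_pipelines,
                    input_context.input_pipeline_id)
  return strategy.distribute_datasets_from_function(dataset_fn)
```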
@@ -52,22 +52,22 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
 synchronizing with each other. Under this configuration, it is known as
 asynchronous training.
 
-In TensorFlow 2, we recommend a central coordiantion-based architecture for
-parameter server training, where workers and parameter servers run a
-`tf.distribute.Server` and there is another task that creates resources on
-workers and parameter servers, dispatches functions, and coordinates the
-training. We refer to this task as “coordinator”. The coordinator uses a
+In TensorFlow 2, we recommend an architecture based on central coordination
+for parameter server training. Each worker and parameter server runs a
+`tf.distribute.Server`, and on top of that, a coordinator task is responsible
+for creating resources on workers and parameter servers, dispatching
+functions, and coordinating the training. The coordinator uses a
 `tf.distribute.experimental.coordinator.ClusterCoordinator` to coordinate the
 cluster, and a `tf.distribute.experimental.ParameterServerStrategy` to define
 variables on parameter servers and computation on workers.
 
 For the training to work, the coordinator dispatches `tf.function`s to be
-executed on remote workers. Upon receiving requests from
-the coordinator, a worker executes the `tf.function` by reading the variables
-from parameter servers, executing the ops, and updating the variables on the
-parameter servers. Each of the worker only processes the requests from the
-coordinator, and communicates with parameter servers, without direct
-interactions with other workers in the cluster.
+executed on remote workers. Upon receiving requests from the coordinator, a
+worker executes the `tf.function` by reading the variables from parameter
+servers, executing the ops, and updating the variables on the parameter
+servers. Each of the worker only processes the requests from the coordinator,
+and communicates with parameter servers, without direct interactions with
+other workers in the cluster.
 
 As a result, failures of some workers do not prevent the cluster from
 continuing the work, and this allows the cluster to train with instances that
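A condensed sketch of this topology. The addresses are placeholders: each worker and parameter server task would run the `tf.distribute.Server` part (commented out here), while the coordinator task builds the strategy and the `ClusterCoordinator`.

```
import tensorflow as tf

# Placeholder cluster definition shared by all tasks.
cluster_spec = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
})

# On each worker / parameter server task: join the cluster and wait for
# requests from the coordinator.
# server = tf.distribute.Server(cluster_spec, job_name="worker", task_index=0)
# server.join()

# On the coordinator task: variables are placed on "ps" tasks and
# tf.function's are dispatched to "worker" tasks.
resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, rpc_layer="grpc")
strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)
```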
@@ -77,7 +77,7 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
 
 Note that the coordinator is not one of the training workers. Instead, it
 creates resources such as variables and datasets, dispatchs `tf.function`s,
-saving checkpoints and so on. In addition to workers, parameter servers and
+saves checkpoints and so on. In addition to workers, parameter servers and
 the coordinator, an optional evaluator can be run on the side that
 periodically reads the checkpoints saved by the coordinator and runs
 evaluations against each checkpoint.
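A sketch of the side-car evaluator pattern mentioned above; `eval_model`, `eval_dataset`, `eval_step`, and `ckpt_dir` are placeholder names rather than part of the API.

```
import tensorflow as tf

# Hypothetical side-car evaluator: watch the checkpoint directory written by
# the coordinator, restore each new checkpoint, and evaluate it.
checkpoint = tf.train.Checkpoint(model=eval_model)
for ckpt_path in tf.train.checkpoints_iterator(ckpt_dir, timeout=3600):
  checkpoint.restore(ckpt_path).expect_partial()
  for batch in eval_dataset:
    eval_step(batch)  # placeholder evaluation step
```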
@@ -226,8 +226,8 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
 ```
 Alternatively, you can also start a bunch of TensorFlow servers in advance and
 connect to them later. The coordinator can be in the same cluster or on any
-machine that has connectivity to workers and parameter server. This is covered
-in our guide and tutorial.
+machine that has connectivity to workers and parameter servers. This is
+covered in our guide and tutorial.
 
 __Variable creation with `strategy.scope()`__
 
@@ -270,9 +270,9 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
 "shard" the variables across the ps. Partitioning large variable among ps is a
 commonly used technique to boost training throughput and mitigate memory
 constraints. It enables parallel computations and updates on different shards
-of a variable, and often yields better load balancing across parameter servers
-. Without sharding, models with large variables (e.g, embeddings) that can't
-fit into one machine's memory would otherwise be unable to train.
+of a variable, and often yields better load balancing across parameter
+servers. Without sharding, models with large variables (e.g, embeddings) that
+can't fit into one machine's memory would otherwise be unable to train.
 
 With `tf.distribute.experimental.ParameterServerStrategy`, if a
 `variable_partitioner` is provided to `__init__` and certain conditions are
@@ -294,40 +294,41 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
     return x * self.w
 
   # Partition the dense layer into 2 shards.
-  variable_partitioiner = (
+  variable_partitioner = (
     tf.distribute.experimental.partitioners.FixedShardsPartitioner(
       num_shards = 2))
-  strategy = ParameterServerStrategy(cluster_resolver=...,
+  strategy = tf.distribute.experimental.ParameterServerStrategy(
+    cluster_resolver=...,
     variable_partitioner = variable_partitioner)
   with strategy.scope():
     dense = Dense()
   assert len(dense.variables) == 2
   assert isinstance(dense.variables[0], tf.Variable)
   assert isinstance(dense.variables[1], tf.Variable)
   assert dense.variables[0].name == "w/part_0"
   assert dense.variables[1].name == "w/part_1"
   assert dense.variables[0].shape == (50, 10)
   assert dense.variables[1].shape == (50, 10)
   ```
 
 The sharded variable container can be converted to a `Tensor` via
 `tf.convert_to_tensor`. This means the container can be directly used in most
-Python Ops where such `Tensor` convertion automatically happens. For example
+Python Ops where such `Tensor` conversion automatically happens. For example,
 in the above code snippet, `x * self.w` would implicitly apply the said tensor
-convertion. Note that such convertion can be expensive, as the variable
+conversion. Note that such conversion can be expensive, as the variable
 components need to be transferred from multiple parameter servers to where
 the value is used.
 
-`tf.nn.embedding_lookup` on the other hand doesn't apply the tensor convertion
-, and performs parallel lookups on the variable components instead. This is
-crutial to scale up embedding lookups when the embedding table variable is
-large.
+`tf.nn.embedding_lookup` on the other hand doesn't apply the tensor
+conversion, and performs parallel lookups on the variable components instead.
+This is crucial to scale up embedding lookups when the embedding table
+variable is large.
 
-When a partitioned variable is saved to `SavedModel`, it will be saved as if
+When a partitioned variable is saved to a `SavedModel`, it will be saved as if
 it is one single variable. This improves serving efficiency by eliminating
 a number of Ops that handle the partiton aspects.
 
 Known limitations of variable partitioning:
 
-* Number of parttions must not change across Checkpoint save/load.
+* Number of partitions must not change across Checkpoint saving/loading.
 
 * After saving partitioned variables to a SavedModel, the SavedModel can't be
   loaded via `tf.saved_model.load`.
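A small sketch contrasting the two access paths described above. It is illustrative only and assumes `strategy` was created with a `variable_partitioner`, as in the snippet, so the variable created in scope is a sharded container.

```
import tensorflow as tf

# Assumes `strategy` has a variable_partitioner, so `emb` is a sharded
# variable container rather than a single tf.Variable.
with strategy.scope():
  emb = tf.Variable(tf.random.uniform([100, 16]), name="emb")

# Conversion (implicit or explicit) gathers all shards into one Tensor,
# which can be expensive for large variables.
dense_copy = tf.convert_to_tensor(emb)

# embedding_lookup avoids the conversion and looks up rows on the shards
# in parallel.
rows = tf.nn.embedding_lookup(emb, [0, 7, 42])
```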
@@ -358,7 +359,6 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
   coordinator =
       tf.distribute.experimental.coordinator.ClusterCoordinator(strategy=...)
   distributed_dataset = coordinator.create_per_worker_dataset(dataset_fn)
-
   ```
 
 __Limitations__
@@ -404,7 +404,7 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
 * `variable_partitioner` will be called for each variable created under
   strategy `scope` to instruct how the variable should be partitioned.
   Variables that have only one partition along the partitioning axis
-  (i.e., no need for partition) will be created as normal `tf.Variable`.
+  (i.e., no need for partition) will be created as a normal `tf.Variable`.
 
 * Only the first / outermost axis partitioning is supported.
 
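An illustrative sketch of the single-partition behaviour noted in the first bullet, using `MinSizePartitioner`; `cluster_resolver` is assumed to exist.

```
import tensorflow as tf

# MinSizePartitioner only shards variables larger than min_shard_bytes, so
# small variables come back as plain tf.Variable objects.
partitioner = tf.distribute.experimental.partitioners.MinSizePartitioner(
    min_shard_bytes=256 << 10, max_shards=2)
strategy = tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver, variable_partitioner=partitioner)
with strategy.scope():
  small = tf.Variable(tf.zeros([16]))        # one partition -> normal tf.Variable
  large = tf.Variable(tf.zeros([1000000]))   # sharded along the first axis
```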