Move "supervisor.md" from programmer's guide to api_guides.
PiperOrigin-RevId: 165732026
commit 5f5c3eb0ab (parent d001b58de9)

@@ -35,9 +35,6 @@ The units are now as follows:
     Embedding Projector.
   * @{$programmers_guide/debugger$Debugging TensorFlow Programs}, which
     explains how to use the TensorFlow debugger (tfdbg).
-  * @{$programmers_guide/supervisor$Supervisor: Training Helper for Days-Long Trainings},
-    which explains how to gracefully handle system crashes during lengthy
-    training sessions. (We have not revised this document for v1.3.)
   * @{$programmers_guide/version_compat$TensorFlow Version Compatibility},
     which explains backward compatibility guarantees and non-guarantees.
   * @{$programmers_guide/faq$FAQ}, which contains frequently asked

@@ -1,402 +0,0 @@

# Supervisor: Training Helper for Days-Long Trainings

To train a model with TensorFlow you can simply run a training op a number of
times and save a checkpoint of the trained parameters when you're done. This
works well for small models that can train in a few hours.

Larger models that require days of training, possibly across multiple replicas,
need a more robust training process that:

* Handles shutdowns and crashes cleanly.
* Can be resumed after a shutdown or a crash.
* Can be monitored through TensorBoard.

To be able to resume training after a shutdown or a crash, the training process
must save checkpoints regularly. On restart, it must look for the most recent
checkpoint and load it before resuming training.

To be monitored through TensorBoard, the training process must run summary ops
regularly and append the returned values to an events file as explained in
@{$summaries_and_tensorboard$TensorBoard: Visualizing Learning}.
TensorBoard monitors events files and displays graphs reporting training
progress over time.

The @{tf.train.Supervisor} provides a set of services that helps implement a
robust training process.

This how-to shows how to use the supervisor directly. Please also consider
using one of several frameworks built on top of the supervisor that provide
richer training loops and numerous customization options:
@{$python/contrib.learn$`tf.learn`} is a good choice.

Note that the supervisor is very helpful for training large models, but can
also be used for smaller models without any penalty.

## Very Simple Scenario

The simplest scenario for using a supervisor is to:

* Create a `Supervisor` object, passing it the path to a directory in which to
  save checkpoints and summaries.

* Ask the supervisor for a session with
  @{tf.train.Supervisor.managed_session}.

* Use the session to execute a train op, checking at each step whether the
  supervisor requests that training stop.

```python
  ...create graph...
  my_train_op = ...

  sv = tf.train.Supervisor(logdir="/my/training/directory")
  with sv.managed_session() as sess:
    for step in range(100000):
      if sv.should_stop():
        break
      sess.run(my_train_op)
```

### Started Services

In the very simple scenario, the `managed_session()` call starts a few
services, which run in their own threads and use the managed session to run
ops in your graph.

If your graph contains an integer variable named `global_step`, the services
use its value to measure the number of training steps executed. See the
@{$mechanics#training$MNIST training tutorial} for how to create a
`global_step` variable.
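
As a minimal sketch (the loss tensor and optimizer below are illustrative, not
part of the supervisor API), a `global_step` variable can be created and passed
to the optimizer so that each training step increments it:

```python
  # Create a scalar integer variable that the supervisor services can read.
  global_step = tf.Variable(0, name="global_step", trainable=False)

  # Illustrative: wiring it into the optimizer increments it on every step.
  my_train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
      my_loss, global_step=global_step)
```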

* _Checkpointing_ service: Saves a copy of the graph variables in the logdir.
  The checkpoint filename uses the value of the `global_step` variable if one
  was added to your graph. Runs every 10 minutes by default.

* _Summary_ service: Runs all the summary ops and appends their output to an
  @{$summaries_and_tensorboard$events file} in the logdir. Runs
  every 2 minutes by default.

* _Step counter_: Counts how many steps have been executed by looking at
  changes in the `global_step` variable. Appends a summary to the events file
  reporting the number of global steps per second. The summary tag is
  "global_step/sec". This also runs every 2 minutes by default.

* _Queue Runners_: If any @{tf.train.QueueRunner} objects were added to the
  graph, the supervisor launches them in their own threads.

All time intervals can be changed when constructing the supervisor object. See
the [supervisor reference](#supervisor_reference) for details.
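
For example, a minimal sketch (the interval values are arbitrary) that
checkpoints every 5 minutes and writes summaries every minute:

```python
  sv = tf.train.Supervisor(logdir="/my/training/directory",
                           save_model_secs=300,      # checkpoint every 5 minutes
                           save_summaries_secs=60)   # summaries every minute
```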

### Checking for Stop

The check for stop in the main training loop is important and necessary.

Exceptions raised in the service threads are reported to the supervisor, which
then sets its `should_stop()` condition to true. Other service threads notice
that condition and terminate properly. The main training loop, within the
`managed_session()` block, must also check for the stop condition and
terminate.

Note that `managed_session()` takes care of catching exceptions raised from the
training loop and reporting them to the supervisor. The main loop does not need
to do anything special about exceptions. It only needs to check for the stop
condition.
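
Your own code can also ask for a stop. As a minimal sketch (the stopping
condition and `my_loss` tensor are illustrative), the main loop can call the
supervisor's `request_stop()` method, after which `should_stop()` returns true
in every thread:

```python
  sv = tf.train.Supervisor(logdir="/my/training/directory")
  with sv.managed_session() as sess:
    for step in range(100000):
      if sv.should_stop():
        break
      loss_value, _ = sess.run([my_loss, my_train_op])
      # Illustrative condition: once the loss is good enough, ask every
      # thread, including the service threads, to stop.
      if loss_value < 0.01:
        sv.request_stop()
```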

### Recovery

If the training program shuts down or crashes, its most recent checkpoint and
event files are left in the logdir. When you restart the program,
`managed_session()` restores the graph from the most recent checkpoint and
resumes training where it stopped.

A new events file is created. If you start TensorBoard and point it to the
logdir, it will know how to merge the contents of the two events files and will
show the training resuming at the last global step from the checkpoint.

## Larger Model Scenario

The very simple scenario is sufficient for most small to medium sized models.
Larger models may run out of memory when the summary service runs: the summary
ops are run in parallel with the main loop running the train op. This can cause
memory usage to peak at up to two times the normal use.

For a larger model you can tell the supervisor not to run the summary service
and instead run it yourself in your main training loop: pass `summary_op=None`
when constructing the supervisor.

For example, this code runs the summary op every 100 steps in the training loop:

```python
  ...create graph...
  my_train_op = ...
  my_summary_op = tf.summary.merge_all()

  sv = tf.train.Supervisor(logdir="/my/training/directory",
                           summary_op=None)  # Do not run the summary service
  with sv.managed_session() as sess:
    for step in range(100000):
      if sv.should_stop():
        break
      if step % 100 == 0:
        _, summ = sess.run([my_train_op, my_summary_op])
        sv.summary_computed(sess, summ)
      else:
        sess.run(my_train_op)
```

## Pre-trained Model Scenario

The `managed_session()` call takes care of initializing the model in the
session. The model is restored from a checkpoint if one is available,
or initialized from scratch otherwise.

One common scenario is to initialize the model by loading a "pre-trained"
checkpoint that was saved while training a (usually slightly different) model
on a different dataset.

You can load a pre-trained checkpoint by passing an "init function" to the
supervisor. This function is called only if the model needs to be initialized
from scratch, not when the model can be recovered from a checkpoint in the
logdir.

To load the pre-trained model, the init function needs a @{tf.train.Saver}
object, so you should create a saver for this purpose. This is usually a good
idea because the new model may contain variables that are not present in the
pre-trained checkpoint: this saver must restore only the pre-trained variables.
If you were using the default saver, you could get an error trying to restore
all the variables of the new model from the pre-trained checkpoint.

```python
  ...create graph...
  # Create a saver that restores only the pre-trained variables.
  pre_train_saver = tf.train.Saver([pre_train_var1, pre_train_var2])

  # Define an init function that loads the pretrained checkpoint.
  def load_pretrain(sess):
    pre_train_saver.restore(sess, "<path to pre-trained-checkpoint>")

  # Pass the init function to the supervisor.
  #
  # The init function is called _after_ the variables have been initialized
  # by running the init_op.
  sv = tf.train.Supervisor(logdir="/my/training/directory",
                           init_fn=load_pretrain)
  with sv.managed_session() as sess:
    # Here sess was either initialized from the pre-trained checkpoint or
    # recovered from a checkpoint saved in a previous run of this code.
    ...
```

## Running Your Own Services

Supervisor services, such as the checkpointing service, run in threads parallel
to the main training loop. You sometimes want to add your own services, for
example to fetch different sets of summaries on a different schedule than the
usual summary service.

Use the @{tf.train.Supervisor.loop} method of the supervisor for this purpose.
It repeatedly calls a function of your choice on a timer until the supervisor
stop condition becomes true, so it plays nicely with the other services.

Example: Call `my_additional_summaries()` every 20 minutes:

```python
  def my_additional_summaries(sv, sess):
    ...fetch and write summaries, see below...

  ...
  sv = tf.train.Supervisor(logdir="/my/training/directory")
  with sv.managed_session() as sess:
    # Call my_additional_summaries() every 1200 seconds, or 20 minutes,
    # passing (sv, sess) as arguments.
    sv.loop(1200, my_additional_summaries, args=(sv, sess))
    ...main training loop...
```

## Writing Summaries

The supervisor always creates an events file in its logdir, as well as a
@{tf.summary.FileWriter} to append events and summaries to that file. If you
want to write your own summaries it is a good idea to append them to that same
events file: TensorBoard likes it better when only one events file in a
directory is being actively appended to.

The supervisor provides a helper function to append summaries:
@{tf.train.Supervisor.summary_computed}. Just pass it the output returned by a
summary op. Here is an example of using that function to implement
`my_additional_summaries()` from the previous example:

```python
  def my_additional_summaries(sv, sess):
    summaries = sess.run(my_additional_summary_op)
    sv.summary_computed(sess, summaries)
```

For more advanced usage, the supervisor provides access to its summary writer
through its @{tf.train.Supervisor.summary_writer} attribute.
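
For instance, here is a rough sketch (the tag and value are illustrative, and a
`global_step` tensor is assumed to exist in the graph) of appending a
hand-built summary through that writer:

```python
  def my_additional_summaries(sv, sess):
    # Build a summary protocol buffer by hand (tag and value are illustrative).
    summary = tf.Summary(value=[
        tf.Summary.Value(tag="my_custom/metric", simple_value=0.5)])
    # Append it to the same events file the supervisor manages.
    sv.summary_writer.add_summary(summary, sess.run(sv.global_step))
```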

## Supervisor Reference

The [Very Simple Scenario](#very_simple_scenario) and the [Larger Model
Scenario](#larger_model_scenario) show basic uses of a supervisor. More
advanced scenarios can be constructed by using the many options provided by the
supervisor.

### Checkpointing: Where and When.

The `managed_session()` call launches the checkpointing service, which can be
configured by the following keyword arguments to the `Supervisor()`
constructor:

* `logdir`: Path to a directory where the checkpointing service creates
  checkpoints. The directory is created if needed. Passing `None` disables
  the checkpointing and the summary services.

* `checkpoint_basename`: Name of the checkpoint files to create, defaults to
  "model.ckpt".

  If the model contains a scalar integer variable named `global_step`, the
  value of that variable is appended to the checkpoint filename.

  For example, at global step 1234 the checkpoint filename is
  "model.ckpt-1234".

* `save_model_secs`: Number of seconds between each checkpoint. Defaults to
  600, or 10 minutes.

  When choosing a value, consider how much work you want to lose in case of a
  crash: you will never lose more than `save_model_secs` seconds of work.
  Setting this to 0 disables the checkpointing service.

* `saver`: A @{tf.train.Saver} object to use for checkpointing.

  If you do not pass one, the supervisor creates one for you by calling
  `tf.train.Saver()`, which adds ops to save and restore all variables in your
  model. This is usually what you need.

Example: Use a custom Saver and checkpoint every 30 seconds.

```python
  ...create graph...
  my_saver = tf.train.Saver(<only some variables>)
  sv = tf.train.Supervisor(logdir="/my/training/directory",
                           saver=my_saver,
                           save_model_secs=30)
  with sv.managed_session() as sess:
    ...training loop...
```

### Summaries: Where and When.

The `managed_session()` call launches the summary service, which fetches
summaries and reports the number of steps executed per second. It can be
configured by the following keyword arguments to the `Supervisor()`
constructor:

* `logdir`: Path to a directory where the summary service creates event files.
  The directory is created if needed. Passing `None` disables the summary
  service as well as the checkpointing service.

* `save_summaries_secs`: Number of seconds between each run of the summary
  service. Defaults to 120, or 2 minutes.

  When choosing a value, consider how expensive your summaries are, and how
  much disk space they will occupy. Pass 0 to disable the summary service.

* `summary_op`: Op to use to fetch the summaries.

  If not specified, the supervisor uses the first op in the
  `tf.GraphKeys.SUMMARY_OP` @{tf.Graph.add_to_collection$graph collection}. If
  the collection is empty, the supervisor creates an op that aggregates all
  summaries in the graph using `tf.summary.merge_all()`.

  Passing `None` disables the summary service.

* `global_step`: Tensor to use to count the global step.

  If not specified, the supervisor uses the first tensor in the
  `tf.GraphKeys.GLOBAL_STEP` @{tf.Graph.add_to_collection$graph collection}. If
  the collection is empty, the supervisor looks for a scalar integer variable
  named `global_step` in the graph.

  If found, the global step tensor is used to measure the number of training
  steps executed. Note that your training op is responsible for incrementing
  the global step value.
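
Putting these options together, here is a minimal sketch (the five-minute
interval is arbitrary) that points the summary service at an explicit summary
op and a slower schedule:

```python
  ...create graph, including tf.summary.* ops...
  my_summary_op = tf.summary.merge_all()

  sv = tf.train.Supervisor(logdir="/my/training/directory",
                           summary_op=my_summary_op,    # explicit summary op
                           save_summaries_secs=300)     # run every 5 minutes
  with sv.managed_session() as sess:
    ...training loop...
```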

### Model Initialization and Recovery

The `managed_session()` call takes care of initializing or recovering a
session. It returns a session with a fully initialized model, ready to run
ops. If a checkpoint exists in the logdir when `managed_session()` is called,
the model is initialized by loading that checkpoint, otherwise it is
initialized by calling an init op and optionally an init function.

When no checkpoint is available, model initialization is controlled by the
following keyword arguments to the `Supervisor()` constructor:

* `init_op`: Op to run to initialize the model.

  If not specified, the supervisor uses the first op in the
  `tf.GraphKeys.INIT_OP` collection. If the collection is empty, the
  supervisor adds an op to initialize all the variables in the graph by
  calling `tf.global_variables_initializer()`.

  Pass `None` to not use an init op.

* `init_fn`: Python function to call to initialize the model.

  If specified, called as `init_fn(sess)` where `sess` is the managed session.
  If an init op is also used, the init function is called _after_ the init op.

* `local_init_op`: An additional op to initialize parts of the graph that are
  not saved in checkpoints, such as tables and
  @{tf.contrib.framework.local_variable$local variables}. The
  local init op is run _before_ the init op and the init function.

  If not specified, the supervisor uses the first op in the
  `tf.GraphKeys.LOCAL_INIT_OP` collection. If the collection is empty, the
  supervisor adds an op to initialize all the tables and local variables in
  the graph by calling `tf.tables_initializer()` and
  `tf.local_variables_initializer()`.

  Pass `None` to not use a local init op.

* `ready_op`: Op to check if the model is initialized.

  After running the local init op, the init op, and the init function, the
  supervisor verifies that the model is fully initialized by running the ready
  op. This is an op that returns an empty string if the model is initialized,
  or a description of the parts of the model that are not initialized
  otherwise.

  If not specified, the supervisor uses the first op in the
  `tf.GraphKeys.READY_OP` collection. If the collection is empty, the
  supervisor creates a ready op that verifies that all variables are
  initialized by calling `tf.report_uninitialized_variables()`.

  Pass `None` to disable the ready op. In that case the model is not
  checked after initialization.
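
As a rough sketch of how these options fit together (the explicit init op and
init function below are illustrative):

```python
  ...create graph...
  # Explicit init op; this is what the supervisor would create by default.
  my_init_op = tf.global_variables_initializer()

  # Illustrative init function, called after my_init_op has run.
  def my_init_fn(sess):
    ...custom initialization, e.g. assigning pre-computed values...

  sv = tf.train.Supervisor(logdir="/my/training/directory",
                           init_op=my_init_op,
                           init_fn=my_init_fn,
                           ready_op=None)  # Skip the initialization check.
  with sv.managed_session() as sess:
    ...training loop...
```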

Checkpoint recovery is controlled by the following keyword arguments to the
`Supervisor()` constructor:

* `logdir`: Path to a directory in which to look for checkpoints. The
  checkpoint service saves a metadata file, named "checkpoint", in the
  checkpoint directory that indicates the path to the most recent checkpoint.

  This file is in text format. In a pinch, you can edit it manually to
  recover from a different checkpoint than the most recent one.

* `ready_op`: (see above). The ready op is run before and after loading the
  checkpoint. The first run checks if the model needs to be initialized and
  the second run verifies that the model is fully initialized.

* `local_init_op`: (see above). The local init op is run before running the
  ready op the first time, to initialize local variables and tables.

* `saver`: (see above). Saver object used to load the checkpoint.