Performance guide update

PiperOrigin-RevId: 168159289
Toby Boyd 2017-09-10 10:14:47 -07:00 committed by TensorFlower Gardener
parent 3bce4f9a0d
commit ce9a2b00fa


# Performance Guide
This guide contains a collection of best practices for optimizing your
TensorFlow code. The best practices apply to both new and experienced
TensorFlow users. As a complement to the best practices in this document, the
@{$performance_models$High-Performance Models} document links to example code
and details for creating models that scale on a variety of hardware.
The guide is divided into a few sections:
* [General best practices](#general_best_practices) covers topics that are
common across a variety of model types and hardware.
* [Optimizing for GPU](#optimizing_for_gpu) details tips specifically relevant
to GPUs.
* [Optimizing for CPU](#optimizing_for_cpu) details CPU specific information.
## General best practices
The sections below cover best practices that are relevant to a variety of
hardware and models. The best practices section is broken down into the
following sections:
* [Input pipeline optimizations](#input-pipeline-optimization)
* [Data formats](#data-formats)
* [Common fused Ops](#common-fused-ops)
* [Building and installing from source](#building-and-installing-from-source)
### Input pipeline optimization
Typical models retrieve data from disk and preprocess it before sending the data
through the network. For example, models that process JPEG images will follow
this flow: load image from disk, decode JPEG into a tensor, crop and pad,
possibly flip and distort, and then batch. This flow is referred to as the input
pipeline. As GPUs and other hardware accelerators get faster, preprocessing of
data can be a bottleneck.
Determining if the input pipeline is the bottleneck can be complicated. One of
the most straightforward methods is to reduce the model to a single operation
(trivial model) after the input pipeline and measure the examples per second. If
the difference in examples per second for the full model and the trivial model
is minimal then the input pipeline is likely a bottleneck. Below are some other
approaches to identifying issues:
* Check if a GPU is underutilized by running `watch -n 2 nvidia-smi`. If GPU
utilization is not approaching 80-100%, then the input pipeline may be the
bottleneck.
* Generate a timeline and look for large blocks of white space (waiting). An
example of generating a timeline exists as part of the @{$jit$XLA JIT}
tutorial, and a minimal sketch follows this list.
* Check CPU usage. It is possible to have an optimized input pipeline and lack
the CPU cycles to process the pipeline.
* Estimate the throughput needed and verify the disk used is capable of that
level of throughput. Some cloud solutions have network attached disks that
start as low as 50 MB/sec, which is slower than spinning disks (150 MB/sec),
SATA SSDs (500 MB/sec), and PCIe SSDs (2,000+ MB/sec).
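As a minimal sketch of the timeline suggestion above, a Chrome trace can be
captured with `tf.RunMetadata` and the `timeline` module; the matmul graph below
is only a stand-in for a real training step (see the XLA JIT tutorial linked
above for a complete walkthrough):
```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Stand-in graph; trace a real training op in practice.
x = tf.random_normal([1000, 1000])
y = tf.matmul(x, x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
  sess.run(y, options=run_options, run_metadata=run_metadata)

# Open timeline.json in chrome://tracing and look for large gaps (white space)
# where the device is waiting on the input pipeline.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
  f.write(trace.generate_chrome_trace_format())
```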
#### Preprocessing on the CPU
Placing input pipeline operations on the CPU can significantly improve
performance. Utilizing the CPU for the input pipeline frees the GPU to focus on
training. To ensure preprocessing is on the CPU, wrap the preprocessing
operations as shown below:
```python
with tf.device('/cpu:0'):
  # Function to get and process images or data.
  distorted_inputs = load_and_distort_images()
```
If using `tf.estimator.Estimator`, the input function is automatically placed
on the CPU.
#### Using the Dataset API
The @{$datasets$Dataset API} is replacing `queue_runner` as the recommended API
for building input pipelines. The API was added to contrib as part of TensorFlow
1.2 and will move to core in the near future. This
[ResNet example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/cifar10_main.py)
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385))
training CIFAR-10 illustrates the use of the Dataset API along with
`tf.estimator.Estimator`. The Dataset API utilizes C++ multi-threading and has a
much lower overhead than the Python-based `queue_runner` that is limited by
Python's multi-threading performance.
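The snippet below is a sketch of a Dataset-based input pipeline using the
contrib API available in TensorFlow 1.2/1.3; the feature keys, image shape, and
shuffle buffer size are illustrative assumptions rather than part of the linked
example:
```python
import tensorflow as tf

def _parse_record(serialized_example):
  # Assumed TFRecord schema: a raw image string plus an integer label.
  features = tf.parse_single_example(
      serialized_example,
      features={
          'image': tf.FixedLenFeature([], tf.string),
          'label': tf.FixedLenFeature([], tf.int64),
      })
  # Assumes 32x32x3 images (CIFAR-10 sized); adjust to the real data.
  image = tf.reshape(tf.decode_raw(features['image'], tf.uint8), [32, 32, 3])
  label = tf.cast(features['label'], tf.int32)
  return image, label

def input_fn(filenames, batch_size=128):
  """Builds a C++-backed pipeline: read, parse, shuffle, repeat, and batch."""
  dataset = tf.contrib.data.TFRecordDataset(filenames)
  dataset = dataset.map(_parse_record)
  dataset = dataset.shuffle(buffer_size=10000).repeat().batch(batch_size)
  images, labels = dataset.make_one_shot_iterator().get_next()
  return images, labels
```
An input function like this can be handed to an `Estimator`, e.g.
`estimator.train(input_fn=lambda: input_fn(train_files))`.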
While feeding data using a `feed_dict` offers a high level of flexibility, in
most instances using `feed_dict` does not scale optimally. However, in instances
where only a single GPU is being used the difference can be negligible. Using
the Dataset API is still strongly recommended. Try to avoid the following:
```python
# feed_dict often results in suboptimal performance when using large inputs.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```
#### Use large files
Reading large numbers of small files significantly impacts I/O performance.
One approach to get maximum I/O throughput is to preprocess input data into
larger (~100MB) `TFRecord` files. For smaller data sets (200MB-1GB), the best
approach is often to load the entire data set into memory. The document
[Downloading and converting to TFRecord format](https://github.com/tensorflow/models/tree/master/slim#Data)
includes information and scripts for creating `TFRecords` and this
[script](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py)
converts the CIFAR-10 data set into `TFRecords`.
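As a rough sketch of packing many small files into larger `TFRecord` shards
(the linked scripts are the authoritative references; the feature keys and the
source of `image_paths` and `labels` here are placeholders):
```python
import tensorflow as tf

def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def write_shard(image_paths, labels, output_path):
  """Packs many small image files into one larger TFRecord file."""
  with tf.python_io.TFRecordWriter(output_path) as writer:
    for path, label in zip(image_paths, labels):
      with open(path, 'rb') as f:
        encoded_image = f.read()
      example = tf.train.Example(features=tf.train.Features(feature={
          'image': _bytes_feature(encoded_image),
          'label': _int64_feature(label),
      }))
      writer.write(example.SerializeToString())
```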
### Data formats
Data format refers to the structure of the Tensor passed to a given Op. The
discussion below is specifically about 4D Tensors representing images. In
TensorFlow the parts of the 4D tensor are often referred to by the following
letters:
* N refers to the number of images in a batch.
* H refers to the number of pixels in the vertical (height) dimension.
* W refers to the number of pixels in the horizontal (width) dimension.
* C refers to the channels. For example, 1 for black and white or grayscale
and 3 for RGB.
Within TensorFlow there are two naming conventions representing the two most
common data formats:
* `NCHW` or `channels_first`
* `NHWC` or `channels_last`
`NHWC` is the TensorFlow default and `NCHW` is the optimal format to use when
training on NVIDIA GPUs using [cuDNN](https://developer.nvidia.com/cudnn).
The best practice is to build models that work with both data formats. This
simplifies training on GPUs and then running inference on CPUs. If TensorFlow is
compiled with the [Intel MKL](#tensorflow_with_intel_mkl_dnn) optimizations,
many operations, especially those related to CNN based models, will be optimized
and support `NCHW`. If not using the MKL, some operations are not supported on
CPU when using `NCHW`.
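One common pattern for supporting both formats is to thread a single
`data_format` argument through the layer calls. The sketch below uses
`tf.layers`; the filter counts and layer choices are arbitrary:
```python
import tensorflow as tf

def conv_block(inputs, data_format='channels_first'):
  """Example layer stack parameterized on the data format.

  Pass 'channels_first' (NCHW) when training on GPUs with cuDNN and
  'channels_last' (NHWC) when running on CPUs without MKL.
  """
  net = tf.layers.conv2d(inputs, filters=64, kernel_size=3, padding='same',
                         data_format=data_format, activation=tf.nn.relu)
  net = tf.layers.max_pooling2d(net, pool_size=2, strides=2,
                                data_format=data_format)
  return net
```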
The brief history of these two formats is that TensorFlow started by using
`NHWC` because it was a little faster on CPUs. In the long term, we are working
on tools to auto rewrite graphs to make switching between the formats
transparent and take advantage of micro optimizations where a GPU Op may be
faster using `NHWC` than the normally most efficient `NCHW`.
### Common fused Ops
Fused Ops combine multiple operations into a single kernel for improved
performance. There are many fused Ops within TensorFlow and @{$xla$XLA} will
create fused Ops when possible to automatically improve performance. Collected
below are select fused Ops that can greatly improve performance and may be
overlooked.
#### Fused batch norm
Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process that for
some models makes up a large percentage of the operation time. Using fused batch
norm can result in a 12%-30% speedup.
There are two commonly used batch norms and both support fusing. The core
@{tf.layers.batch_normalization} added fused starting in TensorFlow 1.3.
```python
bn = tf.layers.batch_normalization(
    input_layer, fused=True, data_format='NCHW')
```
The contrib @{tf.contrib.layers.batch_norm} method has had fused as an option
since before TensorFlow 1.0.
```python
bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')
```
### Building and installing from source
The default TensorFlow binaries target the broadest range of hardware to make
TensorFlow accessible to everyone. If using CPUs for training or inference, it
is recommended to compile TensorFlow with all of the optimizations available for
the CPU in use. Speedups for training and inference on CPU are documented below
in [Comparing compiler optimizations](#comparing-compiler-optimizations).
To install the most optimized version of TensorFlow,
@{$install_sources$build and install} from source. If there is a need to build
TensorFlow on a platform that has different hardware than the target, then
cross-compile with the highest optimizations for the target platform. The
following command is an example of using `bazel` to compile for a specific
platform:
```bash
# This command optimizes for Intel's Broadwell processor.
bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
```
#### Environment, build, and install tips
* `./configure` asks which compute capability to include in the build. This
does not impact overall performance but does impact initial startup. After
running TensorFlow once, the compiled kernels are cached by CUDA. If using
a docker container, the data is not cached and the penalty is paid each time
TensorFlow starts. The best practice is to include the
[compute capabilities](http://developer.nvidia.com/cuda-gpus)
of the GPUs that will be used, e.g. P100: 6.0, Titan X (Pascal): 6.1, Titan
X (Maxwell): 5.2, and K80: 3.7.
* Use a version of gcc that supports all of the optimizations of the target
CPU. The recommended minimum gcc version is 4.8.3. On OS X, upgrade to the
latest Xcode version and use the version of clang that comes with Xcode.
* Install the latest stable CUDA platform and cuDNN libraries supported by
TensorFlow.
## Optimizing for GPU
This section contains GPU-specific tips that are not covered in the
[General best practices](#general-best-practices). Obtaining optimal performance
on multiple GPUs is a challenge. A common approach is to use data parallelism.
Scaling through the use of data parallelism involves making multiple copies of
the model, which are referred to as "towers", and then placing one tower on each
of the GPUs. Each tower operates on a different mini-batch of data and then
updates variables, also known as parameters, that need to be shared between
each of the towers. How each tower gets the updated variables and how the
gradients are applied has an impact on the performance, scaling, and convergence
of the model. The rest of this section provides an overview of variable
placement and the towering of a model on multiple GPUs.
@{$performance_models$High-Performance Models} details more complex methods
that can be used to share and update variables between towers.
The best approach to handling variable updates depends on the model, hardware,
and even how the hardware has been configured. An example of this is that two
systems can be built with NVIDIA Tesla P100s, but one may be using PCIe and the
other [NVLink](http://www.nvidia.com/object/nvlink.html). In that scenario, the
optimal solution for each system may be different. For real-world examples, read
the @{$benchmarks$benchmark} page which details the settings that were optimal
for a variety of platforms. Below is a summary of what was learned from
benchmarking various platforms and configurations:
* **Tesla K80**: If the GPUs are on the same PCI Express root complex and are
able to use [NVIDIA GPUDirect](https://developer.nvidia.com/gpudirect) Peer
to Peer, then placing the variables equally across the GPUs used for
training is the best approach. If the GPUs cannot use GPUDirect, then
placing the variables on the CPU is the best option.
* **Titan X (Maxwell and Pascal), M40, P100, and similar**: For models like
ResNet and InceptionV3, placing variables on the CPU is the optimal setting,
but for models with a lot of variables like AlexNet and VGG, using GPUs with
`NCCL` is better.
A common approach to managing where variables are placed is to create a method
to determine where each Op is to be placed and use that method in place of a
specific device name when calling `with tf.device():`. Consider a scenario where
a model is being trained on 2 GPUs and the variables are to be placed on the
CPU. There would be a loop for creating and placing the "towers" on each of the
2 GPUs. A custom device placement method would be created that watches for Ops
of type `Variable`, `VariableV2`, and `VarHandleOp` and indicates that they are
to be placed on the CPU. All other Ops would be placed on the target GPU.
The building of the graph would proceed as follows:
* On the first loop a "tower" of the model would be created for `gpu:0`.
During the placement of the Ops, the custom device placement method would
indicate that variables are to be placed on `cpu:0` and all other Ops on
`gpu:0`.
* On the second loop, `reuse` is set to `True` to indicate that variables are
to be reused and then the "tower" is created on `gpu:1`. During the
placement of the Ops associated with the "tower", the variables that were
placed on `cpu:0` are reused and all other Ops are created and placed on
`gpu:1`.
The final result is all of the variables are placed on the CPU with each GPU
having a copy of all of the computational Ops associated with the model.
The code snippet below illustrates two different approaches for variable
placement: one is placing variables on the CPU; the other is placing variables
equally across the GPUs.
```python
import operator

import tensorflow as tf


class GpuParamServerDeviceSetter(object):
  """Used with tf.device() to place variables on the least loaded GPU.

  A common use for this class is to pass a list of GPU devices, e.g. ['gpu:0',
  'gpu:1','gpu:2'], as ps_devices. When each variable is placed, it will be
  placed on the least loaded gpu. All other Ops, which will be the computation
  Ops, will be placed on the worker_device.
  """

  def __init__(self, worker_device, ps_devices):
    """Initializer for GpuParamServerDeviceSetter.

    Args:
      worker_device: the device to use for computation Ops.
      ps_devices: a list of devices to use for Variable Ops. Each variable is
        assigned to the least loaded device.
    """
    self.ps_devices = ps_devices
    self.worker_device = worker_device
    self.ps_sizes = [0] * len(self.ps_devices)

  def __call__(self, op):
    if op.device:
      return op.device
    if op.type not in ['Variable', 'VariableV2', 'VarHandleOp']:
      return self.worker_device

    # Gets the least loaded ps_device.
    device_index, _ = min(enumerate(self.ps_sizes), key=operator.itemgetter(1))
    device_name = self.ps_devices[device_index]
    var_size = op.outputs[0].get_shape().num_elements()
    self.ps_sizes[device_index] += var_size

    return device_name


def _create_device_setter(is_cpu_ps, worker, num_gpus):
  """Create device setter object."""
  if is_cpu_ps:
    # tf.train.replica_device_setter supports placing variables on the CPU, all
    # on one GPU, or on ps_servers defined in a cluster_spec.
    return tf.train.replica_device_setter(
        worker_device=worker, ps_device='/cpu:0', ps_tasks=1)
  else:
    gpus = ['/gpu:%d' % i for i in range(num_gpus)]
    return GpuParamServerDeviceSetter(worker, gpus)


# The method below is a modified snippet from the full example.
def _resnet_model_fn():
  # When set to False, variables are placed on the least loaded GPU. If set
  # to True, the variables will be placed on the CPU.
  is_cpu_ps = False

  # Loops over the number of GPUs and creates a copy ("tower") of the model on
  # each GPU.
  for i in range(num_gpus):
    worker = '/gpu:%d' % i
    # Creates a device setter used to determine where Ops are to be placed.
    device_setter = _create_device_setter(is_cpu_ps, worker, FLAGS.num_gpus)
    # Creates variables on the first loop. On subsequent loops reuse is set
    # to True, which results in the "towers" sharing variables.
    with tf.variable_scope('resnet', reuse=bool(i != 0)):
      with tf.name_scope('tower_%d' % i) as name_scope:
        # tf.device calls the device_setter for each Op that is created.
        # device_setter returns the device the Op is to be placed on.
        with tf.device(device_setter):
          # Creates the "tower".
          _tower_fn(is_training, weight_decay, tower_features[i],
                    tower_labels[i], tower_losses, tower_gradvars,
                    tower_preds, False)
```
In the near future the above code will be for illustration purposes only as
there will be easy-to-use, high-level methods to support a wide range of popular
approaches. This
[example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator)
will continue to get updated as the API expands and evolves to address multi-GPU
scenarios.
## Optimizing for CPU
CPUs, which include Intel® Xeon Phi™, achieve optimal performance when
TensorFlow is @{$install_sources$built from source} with all of the instructions
supported by the target CPU.
Beyond using the latest instruction sets, Intel® has added support for the
Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to
TensorFlow. While the name is not completely accurate, these optimizations are
often simply referred to as 'MKL' or 'TensorFlow with MKL'. [TensorFlow
with Intel® MKL-DNN](#tensorflow_with_intel_mkl_dnn) contains details on the
MKL optimizations.
The two configurations listed below are used to optimize CPU performance by
adjusting the thread pools.
* `intra_op_parallelism_threads`: Nodes that can use multiple threads to
parallelize their execution will schedule the individual pieces into this
pool.
* `inter_op_parallelism_threads`: All ready nodes are scheduled in this pool.
These configurations are set via `tf.ConfigProto` and passed to `tf.Session`
in the `config` attribute as shown in the snippet below. If either option is
unset or set to 0, it defaults to the number of logical CPU cores. Testing has
shown that the default is effective for systems ranging from one CPU with 4
cores to multiple CPUs with 70+ combined logical cores.
A common alternative optimization is to set the number of threads in both pools
equal to the number of physical cores rather than logical cores.
```python
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 44
config.inter_op_parallelism_threads = 44
tf.Session(config=config)
```
The [Comparing compiler optimizations](#comparing-compiler-optimizations)
section contains the results of tests that used different compiler
optimizations.
### TensorFlow with Intel® MKL DNN
Intel® has added optimizations to TensorFlow for Intel® Xeon® and Intel® Xeon
Phi™ through the use of Intel® Math Kernel Library for Deep Neural Networks
(Intel® MKL-DNN) optimized primitives. The optimizations also provide speedups
for the consumer line of processors, e.g. i5 and i7 Intel processors. The Intel
published paper
[TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
contains additional details on the implementation.
> Note: MKL was added as of TensorFlow 1.2 and currently only works on Linux. It
> also does not work when using `--config=cuda`.
In addition to providing significant performance improvements for training CNN
based models, compiling with the MKL creates a binary that is optimized for AVX
and AVX2. The result is a single binary that is optimized and compatible with
most modern (post-2011) processors.
TensorFlow can be compiled with the MKL optimizations using the following
commands, which depend on the version of the TensorFlow source used.
For TensorFlow source versions after 1.3.0:
```bash
./configure
# Pick the desired options
bazel build --config=mkl -c opt //tensorflow/tools/pip_package:build_pip_package
```
For TensorFlow versions 1.2.0 through 1.3.0:
```bash
./configure
Do you wish to build TensorFlow with MKL support? [y/N] Y
Do you wish to download MKL LIB from the web? [Y/n] Y
# Select the defaults for the rest of the options.
bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
```
#### Tuning MKL for the best performance
This section details the different configurations and environment variables that
can be used to tune the MKL to get optimal performance. Before tweaking various
environment variables make sure the model is using the `NCHW` (`channels_first`)
[data format](#data-formats). The MKL is optimized for `NCHW` and Intel is
working to get near performance parity when using `NHWC`.
MKL uses the following environment variables to tune performance:
* KMP_BLOCKTIME - Sets the time, in milliseconds, that a thread should wait,
after completing the execution of a parallel region, before sleeping.
* KMP_AFFINITY - Enables the run-time library to bind threads to physical
processing units.
* KMP_SETTINGS - Enables (true) or disables (false) the printing of OpenMP*
run-time library environment variables during program execution.
* OMP_NUM_THREADS - Specifies the number of threads to use.
More details on the KMP variables are on
[Intel's](https://software.intel.com/en-us/node/522775) site and the OMP
variables on
[gnu.org](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html).
While there can be substantial gains from adjusting the environment variables,
which is discussed below, the simplified advice is to set the
`inter_op_parallelism_threads` equal to the number of physical CPUs and to set
the following environment variables:
* KMP_BLOCKTIME=0
* KMP_AFFINITY=granularity=fine,verbose,compact,1,0
Example setting MKL variables with command-line arguments:
```bash
KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 \
KMP_SETTINGS=1 python your_python_script.py
```
Example setting MKL variables with Python `os.environ`:
```python
os.environ["KMP_BLOCKTIME"] = str(FLAGS.kmp_blocktime)
os.environ["KMP_SETTINGS"] = str(FLAGS.kmp_settings)
os.environ["KMP_AFFINITY"] = FLAGS.kmp_affinity
if FLAGS.num_intra_threads > 0:
  os.environ["OMP_NUM_THREADS"] = str(FLAGS.num_intra_threads)
```
There are models and hardware platforms that benefit from different settings.
Each variable that impacts performance is discussed below, and a sketch
combining the recommended settings follows the list.
* **KMP_BLOCKTIME**: The MKL default is 200ms, which was not optimal in our
testing. 0 (0ms) was a good default for CNN based models that were tested.
The best performance for AlexNet was achieved at 30ms and both GoogleNet and
VGG11 performed best set at 1ms.
* **KMP_AFFINITY**: The recommended setting is
`granularity=fine,verbose,compact,1,0`.
* **OMP_NUM_THREADS**: This defaults to the number of physical cores.
Adjusting this parameter beyond matching the number of cores can have an
impact when using Intel® Xeon Phi™ (Knights Landing) for some models. See
[TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
for optimal settings.
* **intra_op_parallelism_threads**: Setting this equal to the number of
physical cores is recommended. Setting the value to 0, which is the default
and will result in the value being set to the number of logical cores, is an
option to try for some architectures. This value and `OMP_NUM_THREADS`
should be equal.
* **inter_op_parallelism_threads**: Setting this equal to the number of
sockets is recommended. Setting the value to 0, which is the default,
results in the value being set to the number of logical cores.
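The sketch below combines the preceding recommendations; the socket and core
counts are assumptions for a hypothetical 2-socket, 44-core machine and should
be replaced with the values for the machine in use:
```python
import os
import tensorflow as tf

# Assumed hardware: 2 sockets x 22 physical cores; adjust for your machine.
NUM_PHYSICAL_CORES = 44
NUM_SOCKETS = 2

# MKL/OpenMP settings recommended above.
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["OMP_NUM_THREADS"] = str(NUM_PHYSICAL_CORES)

# Thread pools: intra-op equal to physical cores, inter-op equal to sockets.
config = tf.ConfigProto()
config.intra_op_parallelism_threads = NUM_PHYSICAL_CORES
config.inter_op_parallelism_threads = NUM_SOCKETS

sess = tf.Session(config=config)
```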
### Comparing compiler optimizations
Collected below are performance results running training and inference on
different types of CPUs on different platforms with various compiler
optimizations. The models used were ResNet-50
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)) and
InceptionV3 ([arXiv:1512.00567](https://arxiv.org/abs/1512.00567)).
For each test, when the MKL optimization was used the environment variable
KMP_BLOCKTIME was set to 0 (0ms) and KMP_AFFINITY to
`granularity=fine,verbose,compact,1,0`.
#### Inference InceptionV3
**Environment**
* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)
**Batch Size: 1**
Command executed for the MKL test:
```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```
| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 6.8 (147ms)            | 4             | 0             |
| MKL          | NCHW        | 6.6 (151ms)            | 4             | 1             |
| MKL          | NHWC        | 5.95 (168ms)           | 4             | 1             |
| AVX          | NHWC        | 4.7 (211ms)            | 4             | 0             |
| SSE3         | NHWC        | 2.7 (370ms)            | 4             | 0             |
**Batch Size: 32**
Command executed for the MKL test:
```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```
| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 10.24 (3125ms)         | 4             | 1             |
| MKL          | NHWC        | 8.9 (3595ms)           | 4             | 1             |
| AVX2         | NHWC        | 7.3 (4383ms)           | 4             | 0             |
| AVX          | NHWC        | 5.1 (6275ms)           | 4             | 0             |
| SSE3         | NHWC        | 2.8 (11428ms)          | 4             | 0             |
#### Inference ResNet-50
**Environment**
* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)
**Batch Size: 1**
Command executed for the MKL test:
```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```
| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 6.8 (147ms)            | 4             | 0             |
| MKL          | NCHW        | 6.6 (151ms)            | 4             | 1             |
| MKL          | NHWC        | 5.95 (168ms)           | 4             | 1             |
| AVX          | NHWC        | 4.7 (211ms)            | 4             | 0             |
| SSE3         | NHWC        | 2.7 (370ms)            | 4             | 0             |
**Batch Size: 32**
Command executed for the MKL test:
```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```
| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 10.24 (3125ms)         | 4             | 1             |
| MKL          | NHWC        | 8.9 (3595ms)           | 4             | 1             |
| AVX2         | NHWC        | 7.3 (4383ms)           | 4             | 0             |
| AVX          | NHWC        | 5.1 (6275ms)           | 4             | 0             |
| SSE3         | NHWC        | 2.8 (11428ms)          | 4             | 0             |
#### Training InceptionV3
**Environment**
* Instance Type: Dedicated AWS EC2 r4.16xlarge (Broadwell)
* CPU: Intel Xeon E5-2686 v4 (Broadwell) Processors
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)
Command executed for MKL test:
```bash
python tf_cnn_benchmarks.py --device=cpu --mkl=True --kmp_blocktime=0 \
--nodistortions --model=resnet50 --data_format=NCHW --batch_size=32 \
--num_inter_threads=2 --num_intra_threads=36 \
--data_dir=<path to ImageNet TFRecords>
```
Optimization | Data Format | Images/Sec | Intra threads | Inter Threads
------------ | ----------- | ---------- | ------------- | -------------
MKL | NCHW | 20.8 | 36 | 2
AVX2 | NHWC | 6.2 | 36 | 0
AVX | NHWC | 5.7 | 36 | 0
SSE3 | NHWC | 4.3 | 36 | 0
ResNet and [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
were also run on this configuration, but in an ad hoc manner. There were not
enough runs executed to publish a coherent table of results. The incomplete
results strongly indicated the final result would be similar to the table above,
with MKL providing significant 3x+ gains over AVX2.