Performance guide update
PiperOrigin-RevId: 168159289

# Performance Guide

This guide contains a collection of best practices for optimizing TensorFlow
code. The guide is divided into a few sections:

* [General best practices](#general_best_practices) covers topics that are
  common across a variety of model types and hardware.
* [Optimizing for GPU](#optimizing_for_gpu) details tips specifically relevant
  to GPUs.
* [Optimizing for CPU](#optimizing_for_cpu) details CPU specific information.

## General best practices

The sections below cover best practices that are relevant to a variety of
hardware and models, broken down into the following topics:

* [Input pipeline optimizations](#input-pipeline-optimization)
* [Data formats](#data-formats)
* [Common fused Ops](#common-fused-ops)
* [Building and installing from source](#building-and-installing-from-source)

### Input pipeline optimization

Typical models retrieve data from disk and preprocess it before sending the data
through the network. For example, models that process JPEG images will follow
this flow: load the image from disk, decode the JPEG into a tensor, crop and
pad, possibly flip and distort it, and then batch. This flow is referred to as
the input pipeline. As GPUs and other hardware accelerators get faster,
preprocessing of data can become a bottleneck.

Determining whether the input pipeline is the bottleneck can be complicated. One
of the most straightforward methods is to reduce the model to a single operation
(trivial model) after the input pipeline and measure the examples per second. If
the difference in examples per second between the full model and the trivial
model is minimal, then the input pipeline is likely the bottleneck. Below are
some other approaches to identifying issues:

* Check if a GPU is underutilized by running `watch -n 2 nvidia-smi`. If GPU
  utilization is not approaching 80-100%, then the input pipeline may be the
  bottleneck.
* Generate a timeline and look for large blocks of white space (waiting). An
  example of generating a timeline exists as part of the @{$jit$XLA JIT}
  tutorial, and a minimal sketch is shown after this list.
* Check CPU usage. It is possible to have an optimized input pipeline and lack
  the CPU cycles to process the pipeline.
* Estimate the throughput needed and verify the disk used is capable of that
  level of throughput. Some cloud solutions have network attached disks that
  start as low as 50 MB/sec, which is slower than spinning disks (150 MB/sec),
  SATA SSDs (500 MB/sec), and PCIe SSDs (2,000+ MB/sec).
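
The snippet below is a minimal sketch of capturing a Chrome-trace timeline for a
single training step with the `tensorflow.python.client.timeline` module; the
`sess`, `train_op`, and output filename are placeholders for whatever the
surrounding training loop defines.

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Collect full trace metadata for one training step. `sess` and `train_op`
# are assumed to exist in the surrounding training code.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write the step stats as a Chrome trace viewable in chrome://tracing.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as trace_file:
  trace_file.write(trace.generate_chrome_trace_format())
```

Large gaps between Ops in the resulting trace usually point at the input
pipeline rather than the model itself.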

#### Preprocessing on the CPU

Placing input pipeline operations on the CPU can significantly improve
performance. Utilizing the CPU for the input pipeline frees the GPU to focus on
training. To ensure preprocessing is on the CPU, wrap the preprocessing
operations as shown below:

```python
with tf.device('/cpu:0'):
  # function to get and process images or data.
  distorted_inputs = load_and_distort_images()
```

If using `tf.estimator.Estimator`, the input function is automatically placed on
the CPU.

#### Using the Dataset API

The @{$datasets$Dataset API} is replacing `queue_runner` as the recommended API
for building input pipelines. The API was added to contrib as part of TensorFlow
1.2 and will move to core in the near future. This
[ResNet example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/cifar10_main.py)
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385))
training CIFAR-10 illustrates the use of the Dataset API along with
`tf.estimator.Estimator`. The Dataset API utilizes C++ multi-threading and has a
much lower overhead than the Python-based `queue_runner`, which is limited by
Python's multi-threading performance.

While feeding data using a `feed_dict` offers a high level of flexibility, in
most instances using `feed_dict` does not scale optimally. However, in instances
where only a single GPU is being used the difference can be negligible. Using
the Dataset API is still strongly recommended. Try to avoid the following:

```python
# feed_dict often results in suboptimal performance when using large inputs.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```
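
For contrast, below is a minimal sketch of a Dataset-based input pipeline
feeding a `tf.estimator.Estimator`; the `parse_and_distort` function, the
TFRecord filenames, the batch size, and the `estimator` object are illustrative
placeholders, and in the TensorFlow 1.2 timeframe the classes live under
`tf.contrib.data`.

```python
import tensorflow as tf

def train_input_fn():
  # Read pre-processed examples from large TFRecord files (filenames are
  # placeholders) and let the C++ runtime handle the threading.
  dataset = tf.contrib.data.TFRecordDataset(['train-00000-of-00008.tfrecord'])
  dataset = dataset.map(parse_and_distort)  # decode, crop, flip, distort, etc.
  dataset = dataset.shuffle(buffer_size=10000)
  dataset = dataset.repeat()
  dataset = dataset.batch(128)
  images, labels = dataset.make_one_shot_iterator().get_next()
  return images, labels

# The Estimator runs the input_fn on the CPU and the model on the accelerator.
estimator.train(input_fn=train_input_fn)
```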

#### Use large files

Reading large numbers of small files significantly impacts I/O performance.
One approach to get maximum I/O throughput is to preprocess input data into
larger (~100MB) `TFRecord` files. For smaller data sets (200MB-1GB), the best
approach is often to load the entire data set into memory. The document
[Downloading and converting to TFRecord format](https://github.com/tensorflow/models/tree/master/slim#Data)
includes information and scripts for creating `TFRecords` and this
[script](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py)
converts the CIFAR-10 data set into `TFRecords`.
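
As one possible illustration of the approach, the sketch below writes
already-decoded `(image_bytes, label)` pairs into a single large `TFRecord`
shard using `tf.python_io.TFRecordWriter`; the feature keys and the `examples`
list are placeholders rather than a required schema.

```python
import tensorflow as tf

def write_tfrecord_shard(examples, output_path):
  """Writes (image_bytes, label) pairs into one large TFRecord file."""
  with tf.python_io.TFRecordWriter(output_path) as writer:
    for image_bytes, label in examples:
      example = tf.train.Example(features=tf.train.Features(feature={
          'image/encoded': tf.train.Feature(
              bytes_list=tf.train.BytesList(value=[image_bytes])),
          'image/label': tf.train.Feature(
              int64_list=tf.train.Int64List(value=[label])),
      }))
      writer.write(example.SerializeToString())
```

Grouping many small inputs into shards of roughly 100MB keeps the reader doing
large sequential reads instead of many small random ones.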

### Data formats

Data format refers to the structure of the Tensor passed to a given Op. The
discussion below is specifically about 4D Tensors representing images. In
TensorFlow the parts of the 4D tensor are often referred to by the following
letters:

* N refers to the number of images in a batch.
* H refers to the number of pixels in the vertical (height) dimension.
* W refers to the number of pixels in the horizontal (width) dimension.
* C refers to the channels. For example, 1 for black and white or grayscale
  and 3 for RGB.

Within TensorFlow there are two naming conventions representing the two most
common data formats:

* `NCHW` or `channels_first`
* `NHWC` or `channels_last`

`NHWC` is the TensorFlow default and `NCHW` is the optimal format to use when
training on NVIDIA GPUs using [cuDNN](https://developer.nvidia.com/cudnn).

The best practice is to build models that work with both data formats. This
simplifies training on GPUs and then running inference on CPUs. If TensorFlow is
compiled with the [Intel MKL](#tensorflow_with_intel_mkl-dnn) optimizations,
many operations, especially those related to CNN based models, will be optimized
and support `NCHW`. If not using the MKL, some operations are not supported on
CPU when using `NCHW`.

The brief history of these two formats is that TensorFlow started by using
`NHWC` because it was a little faster on CPUs. In the long term, we are working
on tools to automatically rewrite graphs to make switching between the formats
transparent and to take advantage of micro optimizations where a GPU Op may be
faster using `NHWC` than the normally most efficient `NCHW`.
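
The sketch below is one possible way to parameterize a model on the data format
and to convert between the two layouts with `tf.transpose`; the layer and tensor
shapes are illustrative only.

```python
import tensorflow as tf

def conv_block(inputs, filters, data_format):
  # data_format is 'channels_first' (NCHW) for GPU/cuDNN training, or
  # 'channels_last' (NHWC), e.g. for CPU inference without MKL.
  return tf.layers.conv2d(inputs, filters=filters, kernel_size=3,
                          padding='same', data_format=data_format)

# Converting between the two layouts with tf.transpose:
nhwc = tf.placeholder(tf.float32, [None, 224, 224, 3])
nchw = tf.transpose(nhwc, [0, 3, 1, 2])        # NHWC -> NCHW
nhwc_again = tf.transpose(nchw, [0, 2, 3, 1])  # NCHW -> NHWC
```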

### Common fused Ops

Fused Ops combine multiple operations into a single kernel for improved
performance. There are many fused Ops within TensorFlow and @{$xla$XLA} will
create fused Ops when possible to automatically improve performance. Collected
below are select fused Ops that can greatly improve performance and may be
overlooked.

#### Fused batch norm

Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process that for
some models makes up a large percentage of the operation time. Using fused batch
norm can result in a 12%-30% speedup.

There are two commonly used batch norms and both support fusing. The core
@{tf.layers.batch_normalization} added the fused option starting in TensorFlow
1.3. Note that this layer selects the data format through its `axis` argument
(`axis=1` for `NCHW`) rather than a `data_format` argument:

```python
bn = tf.layers.batch_normalization(
    input_layer, fused=True, axis=1)  # axis=1 is the channel axis of NCHW
```

The contrib @{tf.contrib.layers.batch_norm} method has had fused as an option
since before TensorFlow 1.0.

```python
bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')
```

### Building and installing from source

The default TensorFlow binaries target the broadest range of hardware to make
TensorFlow accessible to everyone. If using CPUs for training or inference, it
is recommended to compile TensorFlow with all of the optimizations available for
the CPU in use. Speedups for training and inference on CPU are documented below
in [Comparing compiler optimizations](#comparing-compiler-optimizations).

To install the most optimized version of TensorFlow,
@{$install_sources$build and install} from source. If there is a need to build
TensorFlow on a platform that has different hardware than the target, then
cross-compile with the highest optimizations for the target platform. The
following command is an example of using `bazel` to compile for a specific
platform:

```bash
# This command optimizes for Intel’s Broadwell processor
bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
```

#### Environment, build, and install tips

* `./configure` asks which compute capability to include in the build. This
  does not impact overall performance but does impact initial startup. After
  running TensorFlow once, the compiled kernels are cached by CUDA. If using
  a docker container, the data is not cached and the penalty is paid each time
  TensorFlow starts. The best practice is to include the
  [compute capabilities](http://developer.nvidia.com/cuda-gpus)
  of the GPUs that will be used, e.g. P100: 6.0, Titan X (Pascal): 6.1, Titan
  X (Maxwell): 5.2, and K80: 3.7.
* Use a version of gcc that supports all of the optimizations of the target
  CPU. The recommended minimum gcc version is 4.8.3. On OS X, upgrade to the
  latest Xcode version and use the version of clang that comes with Xcode.
* Install the latest stable CUDA platform and cuDNN libraries supported by
  TensorFlow.

## Optimizing for GPU

This section contains GPU-specific tips that are not covered in the
[General best practices](#general-best-practices). Obtaining optimal performance
on multi-GPUs is a challenge. A common approach is to use data parallelism.
Scaling through the use of data parallelism involves making multiple copies of
the model, which are referred to as "towers", and then placing one tower on each
of the GPUs. Each tower operates on a different mini-batch of data and then
updates variables, also known as parameters, that need to be shared between
each of the towers. How each tower gets the updated variables and how the
gradients are applied has an impact on the performance, scaling, and convergence
of the model. The rest of this section provides an overview of variable
placement and the towering of a model on multiple GPUs.
@{$performance_models$High-Performance Models} gets into more details regarding
more complex methods that can be used to share and update variables between
towers.

The best approach to handling variable updates depends on the model, hardware,
and even how the hardware has been configured. An example of this is that two
systems can be built with NVIDIA Tesla P100s, but one may be using PCIe and the
other [NVLink](http://www.nvidia.com/object/nvlink.html). In that scenario, the
optimal solution for each system may be different. For real world examples, read
the @{$benchmarks$benchmark} page, which details the settings that were optimal
for a variety of platforms. Below is a summary of what was learned from
benchmarking various platforms and configurations:

* **Tesla K80**: If the GPUs are on the same PCI Express root complex and are
  able to use [NVIDIA GPUDirect](https://developer.nvidia.com/gpudirect) Peer
  to Peer, then placing the variables equally across the GPUs used for
  training is the best approach. If the GPUs cannot use GPUDirect, then
  placing the variables on the CPU is the best option.

* **Titan X (Maxwell and Pascal), M40, P100, and similar**: For models like
  ResNet and InceptionV3, placing variables on the CPU is the optimal setting,
  but for models with a lot of variables like AlexNet and VGG, using GPUs with
  `NCCL` is better.

A common approach to managing where variables are placed is to create a method
to determine where each Op is to be placed and use that method in place of a
specific device name when calling `with tf.device():`. Consider a scenario where
a model is being trained on 2 GPUs and the variables are to be placed on the
CPU. There would be a loop for creating and placing the "towers" on each of the
2 GPUs. A custom device placement method would be created that watches for Ops
of type `Variable`, `VariableV2`, and `VarHandleOp` and indicates that they are
to be placed on the CPU. All other Ops would be placed on the target GPU.
The building of the graph would proceed as follows:

* On the first loop a "tower" of the model would be created for `gpu:0`.
  During the placement of the Ops, the custom device placement method would
  indicate that variables are to be placed on `cpu:0` and all other Ops on
  `gpu:0`.

* On the second loop, `reuse` is set to `True` to indicate that variables are
  to be reused and then the "tower" is created on `gpu:1`. During the
  placement of the Ops associated with the "tower", the variables that were
  placed on `cpu:0` are reused and all other Ops are created and placed on
  `gpu:1`.

The final result is that all of the variables are placed on the CPU, with each
GPU having a copy of all of the computational Ops associated with the model.

The code snippet below illustrates two different approaches for variable
placement: one is placing variables on the CPU; the other is placing variables
equally across the GPUs.

```python
import operator  # used to pick the least loaded parameter device


class GpuParamServerDeviceSetter(object):
  """Used with tf.device() to place variables on the least loaded GPU.

  A common use for this class is to pass a list of GPU devices, e.g. ['gpu:0',
  'gpu:1','gpu:2'], as ps_devices. When each variable is placed, it will be
  placed on the least loaded gpu. All other Ops, which will be the computation
  Ops, will be placed on the worker_device.
  """

  def __init__(self, worker_device, ps_devices):
    """Initializer for GpuParamServerDeviceSetter.

    Args:
      worker_device: the device to use for computation Ops.
      ps_devices: a list of devices to use for Variable Ops. Each variable is
        assigned to the least loaded device.
    """
    self.ps_devices = ps_devices
    self.worker_device = worker_device
    self.ps_sizes = [0] * len(self.ps_devices)

  def __call__(self, op):
    if op.device:
      return op.device
    if op.type not in ['Variable', 'VariableV2', 'VarHandleOp']:
      return self.worker_device

    # Gets the least loaded ps_device
    device_index, _ = min(enumerate(self.ps_sizes), key=operator.itemgetter(1))
    device_name = self.ps_devices[device_index]
    var_size = op.outputs[0].get_shape().num_elements()
    self.ps_sizes[device_index] += var_size

    return device_name


def _create_device_setter(is_cpu_ps, worker, num_gpus):
  """Create device setter object."""
  if is_cpu_ps:
    # tf.train.replica_device_setter supports placing variables on the CPU, all
    # on one GPU, or on ps_servers defined in a cluster_spec.
    return tf.train.replica_device_setter(
        worker_device=worker, ps_device='/cpu:0', ps_tasks=1)
  else:
    gpus = ['/gpu:%d' % i for i in range(num_gpus)]
    return GpuParamServerDeviceSetter(worker, gpus)


# The method below is a modified snippet from the full example.
def _resnet_model_fn():
  # When set to False, variables are placed on the least loaded GPU. If set
  # to True, the variables will be placed on the CPU.
  is_cpu_ps = False

  # Loops over the number of GPUs and creates a copy ("tower") of the model on
  # each GPU.
  for i in range(num_gpus):
    worker = '/gpu:%d' % i
    # Creates a device setter used to determine where Ops are to be placed.
    device_setter = _create_device_setter(is_cpu_ps, worker, FLAGS.num_gpus)
    # Creates variables on the first loop. On subsequent loops reuse is set
    # to True, which results in the "towers" sharing variables.
    with tf.variable_scope('resnet', reuse=bool(i != 0)):
      with tf.name_scope('tower_%d' % i) as name_scope:
        # tf.device calls the device_setter for each Op that is created.
        # device_setter returns the device the Op is to be placed on.
        with tf.device(device_setter):
          # Creates the "tower".
          _tower_fn(is_training, weight_decay, tower_features[i],
                    tower_labels[i], tower_losses, tower_gradvars,
                    tower_preds, False)
```

In the near future the above code will be for illustration purposes only as
there will be easy-to-use high level methods to support a wide range of popular
approaches. This
[example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator)
will continue to get updated as the API expands and evolves to address multi-GPU
scenarios.

## Optimizing for CPU

CPUs, which include Intel® Xeon Phi™, achieve optimal performance when
TensorFlow is @{$install_sources$built from source} with all of the instructions
supported by the target CPU.

Beyond using the latest instruction sets, Intel® has added support for the
Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to
TensorFlow. While the name is not completely accurate, these optimizations are
often simply referred to as 'MKL' or 'TensorFlow with MKL'. [TensorFlow
with Intel® MKL-DNN](#tensorflow_with_intel_mkl_dnn) contains details on the
MKL optimizations.

The two configurations listed below are used to optimize CPU performance by
adjusting the thread pools.

* `intra_op_parallelism_threads`: Nodes that can use multiple threads to
  parallelize their execution will schedule the individual pieces into this
  pool.
* `inter_op_parallelism_threads`: All ready nodes are scheduled in this pool.

These configurations are set via `tf.ConfigProto` and passed to `tf.Session`
in the `config` attribute as shown in the snippet below. If either configuration
option is unset or set to 0, it will default to the number of logical CPU cores.
Testing has shown that the default is effective for systems ranging
from one CPU with 4 cores to multiple CPUs with 70+ combined logical cores.
A common alternative optimization is to set the number of threads in both pools
equal to the number of physical cores rather than logical cores.

```python
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 44
config.inter_op_parallelism_threads = 44
tf.Session(config=config)
```

The [Comparing compiler optimizations](#comparing-compiler-optimizations)
section contains the results of tests that used different compiler
optimizations.

### TensorFlow with Intel® MKL DNN

Intel® has added optimizations to TensorFlow for Intel® Xeon® and Intel® Xeon
Phi™ through the use of Intel® Math Kernel Library for Deep Neural Networks
(Intel® MKL-DNN) optimized primitives. The optimizations also provide speedups
for the consumer line of processors, e.g. i5 and i7 Intel processors. The Intel
published paper
[TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
contains additional details on the implementation.

> Note: MKL was added as of TensorFlow 1.2 and currently only works on Linux. It
> also does not work when `--config=cuda` is used.

In addition to providing significant performance improvements for training CNN
based models, compiling with the MKL creates a binary that is optimized for AVX
and AVX2. The result is a single binary that is optimized and compatible with
most modern (post-2011) processors.

TensorFlow can be compiled with the MKL optimizations using the following
commands, depending on the version of the TensorFlow source used.

For TensorFlow source versions after 1.3.0:

```bash
./configure
# Pick the desired options
bazel build --config=mkl -c opt //tensorflow/tools/pip_package:build_pip_package
```

For TensorFlow versions 1.2.0 through 1.3.0:

```bash
./configure
Do you wish to build TensorFlow with MKL support? [y/N] Y
Do you wish to download MKL LIB from the web? [Y/n] Y
# Select the defaults for the rest of the options.

bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
```

#### Tuning MKL for the best performance

This section details the different configurations and environment variables that
can be used to tune the MKL to get optimal performance. Before tweaking various
environment variables, make sure the model is using the `NCHW` (`channels_first`)
[data format](#data-formats). The MKL is optimized for `NCHW` and Intel is
working to get near performance parity when using `NHWC`.

MKL uses the following environment variables to tune performance:

* KMP_BLOCKTIME - Sets the time, in milliseconds, that a thread should wait,
  after completing the execution of a parallel region, before sleeping.
* KMP_AFFINITY - Enables the run-time library to bind threads to physical
  processing units.
* KMP_SETTINGS - Enables (true) or disables (false) the printing of OpenMP*
  run-time library environment variables during program execution.
* OMP_NUM_THREADS - Specifies the number of threads to use.

More details on the KMP variables are on
[Intel's](https://software.intel.com/en-us/node/522775) site and the OMP
variables on
[gnu.org](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html).

While there can be substantial gains from adjusting the environment variables,
which is discussed below, the simplified advice is to set the
`inter_op_parallelism_threads` equal to the number of physical CPUs and to set
the following environment variables:

* KMP_BLOCKTIME=0
* KMP_AFFINITY=granularity=fine,verbose,compact,1,0

Example setting MKL variables with command-line arguments:

```bash
KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 \
KMP_SETTINGS=1 python your_python_script.py
```

Example setting MKL variables with python `os.environ`:

```python
os.environ["KMP_BLOCKTIME"] = str(FLAGS.kmp_blocktime)
os.environ["KMP_SETTINGS"] = str(FLAGS.kmp_settings)
os.environ["KMP_AFFINITY"] = FLAGS.kmp_affinity
if FLAGS.num_intra_threads > 0:
  os.environ["OMP_NUM_THREADS"] = str(FLAGS.num_intra_threads)
```

There are models and hardware platforms that benefit from different settings.
Each variable that impacts performance is discussed below.

* **KMP_BLOCKTIME**: The MKL default is 200ms, which was not optimal in our
  testing. 0 (0ms) was a good default for the CNN based models that were tested.
  The best performance for AlexNet was achieved at 30ms, and both GoogleNet and
  VGG11 performed best set at 1ms.

* **KMP_AFFINITY**: The recommended setting is
  `granularity=fine,verbose,compact,1,0`.

* **OMP_NUM_THREADS**: This defaults to the number of physical cores.
  Adjusting this parameter beyond matching the number of cores can have an
  impact when using Intel® Xeon Phi™ (Knights Landing) for some models. See
  [TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
  for optimal settings.

* **intra_op_parallelism_threads**: Setting this equal to the number of
  physical cores is recommended. Setting the value to 0, which is the default
  and results in the value being set to the number of logical cores, is an
  option to try for some architectures. This value and `OMP_NUM_THREADS`
  should be equal.

* **inter_op_parallelism_threads**: Setting this equal to the number of
  sockets is recommended. Setting the value to 0, which is the default,
  results in the value being set to the number of logical cores.

### Comparing compiler optimizations

Collected below are performance results running training and inference on
different types of CPUs on different platforms with various compiler
optimizations. The models used were ResNet-50
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)) and
InceptionV3 ([arXiv:1512.00567](https://arxiv.org/abs/1512.00567)).

For each test, when the MKL optimization was used the environment variable
KMP_BLOCKTIME was set to 0 (0ms) and KMP_AFFINITY to
`granularity=fine,verbose,compact,1,0`.

#### Inference InceptionV3

**Environment**

* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)

**Batch Size: 1**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
  --kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
  --batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
  --data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 6.8 (147ms)            | 4             | 0             |
| MKL          | NCHW        | 6.6 (151ms)            | 4             | 1             |
| MKL          | NHWC        | 5.95 (168ms)           | 4             | 1             |
| AVX          | NHWC        | 4.7 (211ms)            | 4             | 0             |
| SSE3         | NHWC        | 2.7 (370ms)            | 4             | 0             |

**Batch Size: 32**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
  --kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
  --batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
  --data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 10.24 (3125ms)         | 4             | 1             |
| MKL          | NHWC        | 8.9 (3595ms)           | 4             | 1             |
| AVX2         | NHWC        | 7.3 (4383ms)           | 4             | 0             |
| AVX          | NHWC        | 5.1 (6275ms)           | 4             | 0             |
| SSE3         | NHWC        | 2.8 (11428ms)          | 4             | 0             |

#### Inference ResNet-50

**Environment**

* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)

**Batch Size: 1**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
  --kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
  --batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
  --data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 6.8 (147ms)            | 4             | 0             |
| MKL          | NCHW        | 6.6 (151ms)            | 4             | 1             |
| MKL          | NHWC        | 5.95 (168ms)           | 4             | 1             |
| AVX          | NHWC        | 4.7 (211ms)            | 4             | 0             |
| SSE3         | NHWC        | 2.7 (370ms)            | 4             | 0             |

**Batch Size: 32**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
  --kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
  --batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
  --data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 10.24 (3125ms)         | 4             | 1             |
| MKL          | NHWC        | 8.9 (3595ms)           | 4             | 1             |
| AVX2         | NHWC        | 7.3 (4383ms)           | 4             | 0             |
| AVX          | NHWC        | 5.1 (6275ms)           | 4             | 0             |
| SSE3         | NHWC        | 2.8 (11428ms)          | 4             | 0             |

#### Training InceptionV3

**Environment**

* Instance Type: Dedicated AWS EC2 r4.16xlarge (Broadwell)
* CPU: Intel Xeon E5-2686 v4 (Broadwell) Processors
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)

Command executed for MKL test:

```bash
python tf_cnn_benchmarks.py --device=cpu --mkl=True --kmp_blocktime=0 \
  --nodistortions --model=resnet50 --data_format=NCHW --batch_size=32 \
  --num_inter_threads=2 --num_intra_threads=36 \
  --data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec | Intra threads | Inter Threads |
| ------------ | ----------- | ---------- | ------------- | ------------- |
| MKL          | NCHW        | 20.8       | 36            | 2             |
| AVX2         | NHWC        | 6.2        | 36            | 0             |
| AVX          | NHWC        | 5.7        | 36            | 0             |
| SSE3         | NHWC        | 4.3        | 36            | 0             |

ResNet and [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
were also run on this configuration but in an ad hoc manner. There were not
enough runs executed to publish a coherent table of results. The incomplete
results strongly indicated the final result would be similar to the table above
with MKL providing significant 3x+ gains over AVX2.