Performance guide update

PiperOrigin-RevId: 168159289
Toby Boyd 2017-09-10 10:14:47 -07:00 committed by TensorFlower Gardener
parent 3bce4f9a0d
commit ce9a2b00fa


# Performance Guide
This guide contains a collection of best practices for optimizing your
TensorFlow code. The best practices apply to both new and experienced
TensorFlow users. As a complement to the best practices in this document, the
@{$performance_models$High-Performance Models} document links to example code
and details for creating models that scale on a variety of hardware.
The guide is divided into a few sections:
* [General best practices](#general_best_practices) covers topics that are
common across a variety of model types and hardware.
* [Optimizing for GPU](#optimizing_for_gpu) details tips specifically relevant
to GPUs.
* [Optimizing for CPU](#optimizing_for_cpu) details CPU specific information.
## General best practices
The sections below cover best practices that are relevant to a variety of
hardware and models. The best practices section is broken down into the
following sections:
* [Input pipeline optimizations](#input-pipeline-optimization)
* [Data formats](#data-formats)
* [Common fused Ops](#common-fused-ops)
* [Building and installing from source](#building-and-installing-from-source)
### Input pipeline optimization
Typical models retrieve data from disk and preprocess it before sending the data
through the network. For example, models that process JPEG images will follow
this flow: load image from disk, decode JPEG into a tensor, crop and pad,
possibly flip and distort, and then batch. This flow is referred to as the input
pipeline. As GPUs and other hardware accelerators get faster, preprocessing of
data can be a bottleneck.
Determining if the input pipeline is the bottleneck can be complicated. One of
the most straightforward methods is to reduce the model to a single operation
(trivial model) after the input pipeline and measure the examples per second. If
the difference in examples per second for the full model and the trivial model
is minimal then the input pipeline is likely a bottleneck. Below are some other
approaches to identifying issues:
* Check if a GPU is underutilized by running `watch -n 2 nvidia-smi`. If GPU
utilization is not approaching 80-100%, then the input pipeline may be the
bottleneck.
* Generate a timeline and look for large blocks of white space (waiting). An
example of generating a timeline exists as part of the @{$jit$XLA JIT}
tutorial, and a minimal sketch follows this list.
* Check CPU usage. It is possible to have an optimized input pipeline and lack
the CPU cycles to process the pipeline.
* Estimate the throughput needed and verify the disk used is capable of that
level of throughput. Some cloud solutions have network attached disks that
start as low as 50 MB/sec, which is slower than spinning disks (150 MB/sec),
SATA SSDs (500 MB/sec), and PCIe SSDs (2,000+ MB/sec).
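As a minimal sketch of the timeline suggestion above, a Chrome trace can be
captured with `tf.RunMetadata` and the `timeline` module; the matmul graph below
is only a stand-in for a real training step (see the XLA JIT tutorial linked
above for a complete walkthrough):
```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Stand-in graph; trace a real training op in practice.
x = tf.random_normal([1000, 1000])
y = tf.matmul(x, x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
  sess.run(y, options=run_options, run_metadata=run_metadata)

# Open timeline.json in chrome://tracing and look for large gaps (white space)
# where the device is waiting on the input pipeline.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
  f.write(trace.generate_chrome_trace_format())
```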
#### Preprocessing on the CPU
Placing input pipeline operations on the CPU can significantly improve
performance. Utilizing the CPU for the input pipeline frees the GPU to focus on
training. To ensure preprocessing is on the CPU, wrap the preprocessing
operations as shown below:
```python
with tf.device('/cpu:0'):
  # Function to get and process images or data.
  distorted_inputs = load_and_distort_images()
```
If using `tf.estimator.Estimator`, the input function is automatically placed
on the CPU.
#### Using the Dataset API
The @{$datasets$Dataset API} is replacing `queue_runner` as the recommended API
for building input pipelines. The API was added to contrib as part of TensorFlow
1.2 and will move to core in the near future. This
[ResNet example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/cifar10_main.py)
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385))
training CIFAR-10 illustrates the use of the Dataset API along with
`tf.estimator.Estimator`. The Dataset API utilizes C++ multi-threading and has a
much lower overhead than the Python-based `queue_runner` that is limited by
Python's multi-threading performance.
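The snippet below is a sketch of a Dataset-based input pipeline using the
contrib API available in TensorFlow 1.2/1.3; the feature keys, image shape, and
shuffle buffer size are illustrative assumptions rather than part of the linked
example:
```python
import tensorflow as tf

def _parse_record(serialized_example):
  # Assumed TFRecord schema: a raw image string plus an integer label.
  features = tf.parse_single_example(
      serialized_example,
      features={
          'image': tf.FixedLenFeature([], tf.string),
          'label': tf.FixedLenFeature([], tf.int64),
      })
  # Assumes 32x32x3 images (CIFAR-10 sized); adjust to the real data.
  image = tf.reshape(tf.decode_raw(features['image'], tf.uint8), [32, 32, 3])
  label = tf.cast(features['label'], tf.int32)
  return image, label

def input_fn(filenames, batch_size=128):
  """Builds a C++-backed pipeline: read, parse, shuffle, repeat, and batch."""
  dataset = tf.contrib.data.TFRecordDataset(filenames)
  dataset = dataset.map(_parse_record)
  dataset = dataset.shuffle(buffer_size=10000).repeat().batch(batch_size)
  images, labels = dataset.make_one_shot_iterator().get_next()
  return images, labels
```
An input function like this can be handed to an `Estimator`, e.g.
`estimator.train(input_fn=lambda: input_fn(train_files))`.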
While feeding data using a `feed_dict` offers a high level of flexibility, in
most instances using `feed_dict` does not scale optimally. However, in instances
where only a single GPU is being used the difference can be negligible. Using
the Dataset API is still strongly recommended. Try to avoid the following:
```python
# feed_dict often results in suboptimal performance when using large inputs.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```
#### Use large files
Reading large numbers of small files significantly impacts I/O performance.
One approach to get maximum I/O throughput is to preprocess input data into
larger (~100MB) `TFRecord` files. For smaller data sets (200MB-1GB), the best
approach is often to load the entire data set into memory. The document
[Downloading and converting to TFRecord format](https://github.com/tensorflow/models/tree/master/slim#Data)
includes information and scripts for creating `TFRecords` and this
[script](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py)
converts the CIFAR-10 data set into `TFRecords`.
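As a rough sketch of packing many small files into larger `TFRecord` shards
(the linked scripts are the authoritative references; the feature keys and the
source of `image_paths` and `labels` here are placeholders):
```python
import tensorflow as tf

def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def write_shard(image_paths, labels, output_path):
  """Packs many small image files into one larger TFRecord file."""
  with tf.python_io.TFRecordWriter(output_path) as writer:
    for path, label in zip(image_paths, labels):
      with open(path, 'rb') as f:
        encoded_image = f.read()
      example = tf.train.Example(features=tf.train.Features(feature={
          'image': _bytes_feature(encoded_image),
          'label': _int64_feature(label),
      }))
      writer.write(example.SerializeToString())
```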
### Data formats
Data format refers to the structure of the Tensor passed to a given Op. The
discussion below is specifically about 4D Tensors representing images. In
TensorFlow the parts of the 4D tensor are often referred to by the following
letters:
* N refers to the number of images in a batch.
* H refers to the number of pixels in the vertical (height) dimension.
* W refers to the number of pixels in the horizontal (width) dimension.
* C refers to the channels. For example, 1 for black and white or grayscale
and 3 for RGB.
Within TensorFlow there are two naming conventions representing the two most
common data formats:
* `NCHW` or `channels_first`
* `NHWC` or `channels_last`
`NHWC` is the TensorFlow default and `NCHW` is the optimal format to use when
training on NVIDIA GPUs using [cuDNN](https://developer.nvidia.com/cudnn).
The best practice is to build models that work with both data formats. This
simplifies training on GPUs and then running inference on CPUs. If TensorFlow is
compiled with the [Intel MKL](#tensorflow_with_intel_mkl_dnn) optimizations,
many operations, especially those related to CNN based models, will be optimized
and support `NCHW`. If not using the MKL, some operations are not supported on
CPU when using `NCHW`.
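One common pattern for supporting both formats is to thread a single
`data_format` argument through the layer calls. The sketch below uses
`tf.layers`; the filter counts and layer choices are arbitrary:
```python
import tensorflow as tf

def conv_block(inputs, data_format='channels_first'):
  """Example layer stack parameterized on the data format.

  Pass 'channels_first' (NCHW) when training on GPUs with cuDNN and
  'channels_last' (NHWC) when running on CPUs without MKL.
  """
  net = tf.layers.conv2d(inputs, filters=64, kernel_size=3, padding='same',
                         data_format=data_format, activation=tf.nn.relu)
  net = tf.layers.max_pooling2d(net, pool_size=2, strides=2,
                                data_format=data_format)
  return net
```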
The brief history of these two formats is that TensorFlow started by using
`NHWC` because it was a little faster on CPUs. In the long term, we are working
on tools to auto rewrite graphs to make switching between the formats
transparent and take advantage of micro optimizations where a GPU Op may be
faster using `NHWC` than the normally most efficient `NCHW`.
### Common fused Ops
Fused Ops combine multiple operations into a single kernel for improved
performance. There are many fused Ops within TensorFlow and @{$xla$XLA} will
create fused Ops when possible to automatically improve performance. Collected
below are select fused Ops that can greatly improve performance and may be
overlooked.
#### Fused batch norm
Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process that for
some models makes up a large percentage of the operation time. Using fused batch
norm can result in a 12%-30% speedup.
There are two commonly used batch norms and both support fusing. The core
@{tf.layers.batch_normalization} added fused starting in TensorFlow 1.3.
```python
bn = tf.layers.batch_normalization(
    input_layer, fused=True, data_format='NCHW')
```
The contrib @{tf.contrib.layers.batch_norm} method has had fused as an option
since before TensorFlow 1.0.
```python
bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')
```
### Building and installing from source
The default TensorFlow binaries target the broadest range of hardware to make
TensorFlow accessible to everyone. If using CPUs for training or inference, it
is recommended to compile TensorFlow with all of the optimizations available for
the CPU in use. Speedups for training and inference on CPU are documented below
in [Comparing compiler optimizations](#comparing-compiler-optimizations).
To install the most optimized version of TensorFlow,
@{$install_sources$build and install} from source. If there is a need to build
TensorFlow on a platform that has different hardware than the target, then
cross-compile with the highest optimizations for the target platform. The
following command is an example of using `bazel` to compile for a specific
platform:
```bash
# This command optimizes for Intel's Broadwell processor.
bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
```
#### Environment, build, and install tips
* `./configure` asks which compute capability to include in the build. This
does not impact overall performance but does impact initial startup. After
running TensorFlow once, the compiled kernels are cached by CUDA. If using
a docker container, the data is not cached and the penalty is paid each time
TensorFlow starts. The best practice is to include the
[compute capabilities](http://developer.nvidia.com/cuda-gpus)
of the GPUs that will be used, e.g. P100: 6.0, Titan X (Pascal): 6.1, Titan
X (Maxwell): 5.2, and K80: 3.7.
* Use a version of gcc that supports all of the optimizations of the target
CPU. The recommended minimum gcc version is 4.8.3. On OS X, upgrade to the
latest Xcode version and use the version of clang that comes with Xcode.
* Install the latest stable CUDA platform and cuDNN libraries supported by
TensorFlow.
## Optimizing for GPU
This section contains GPU-specific tips that are not covered in the
[General best practices](#general-best-practices). Obtaining optimal performance
on multiple GPUs is a challenge. A common approach is to use data parallelism.
Scaling through the use of data parallelism involves making multiple copies of
the model, which are referred to as "towers", and then placing one tower on each
of the GPUs. Each tower operates on a different mini-batch of data and then
updates variables, also known as parameters, that need to be shared between
each of the towers. How each tower gets the updated variables and how the
gradients are applied has an impact on the performance, scaling, and convergence
of the model. The rest of this section provides an overview of variable
placement and the towering of a model on multiple GPUs.
@{$performance_models$High-Performance Models} details more complex methods
that can be used to share and update variables between towers.
The best approach to handling variable updates depends on the model, hardware,
and even how the hardware has been configured. An example of this is that two
systems can be built with NVIDIA Tesla P100s, but one may be using PCIe and the
other [NVLink](http://www.nvidia.com/object/nvlink.html). In that scenario, the
optimal solution for each system may be different. For real-world examples, read
the @{$benchmarks$benchmark} page which details the settings that were optimal
for a variety of platforms. Below is a summary of what was learned from
benchmarking various platforms and configurations:
* **Tesla K80**: If the GPUs are on the same PCI Express root complex and are
able to use [NVIDIA GPUDirect](https://developer.nvidia.com/gpudirect) Peer
to Peer, then placing the variables equally across the GPUs used for
training is the best approach. If the GPUs cannot use GPUDirect, then
placing the variables on the CPU is the best option.
* **Titan X (Maxwell and Pascal), M40, P100, and similar**: For models like
ResNet and InceptionV3, placing variables on the CPU is the optimal setting,
but for models with a lot of variables like AlexNet and VGG, using GPUs with
`NCCL` is better.
A common approach to managing where variables are placed is to create a method
to determine where each Op is to be placed and use that method in place of a
specific device name when calling `with tf.device():`. Consider a scenario where
a model is being trained on 2 GPUs and the variables are to be placed on the
CPU. There would be a loop for creating and placing the "towers" on each of the
2 GPUs. A custom device placement method would be created that watches for Ops
of type `Variable`, `VariableV2`, and `VarHandleOp` and indicates that they are
to be placed on the CPU. All other Ops would be placed on the target GPU.
The building of the graph would proceed as follows:
* On the first loop a "tower" of the model would be created for `gpu:0`.
During the placement of the Ops, the custom device placement method would
indicate that variables are to be placed on `cpu:0` and all other Ops on
`gpu:0`.
* On the second loop, `reuse` is set to `True` to indicate that variables are
to be reused and then the "tower" is created on `gpu:1`. During the
placement of the Ops associated with the "tower", the variables that were
placed on `cpu:0` are reused and all other Ops are created and placed on
`gpu:1`.
The final result is all of the variables are placed on the CPU with each GPU
having a copy of all of the computational Ops associated with the model.
The code snippet below illustrates two different approaches for variable
placement: one is placing variables on the CPU; the other is placing variables
equally across the GPUs.
```python
import operator

import tensorflow as tf


class GpuParamServerDeviceSetter(object):
  """Used with tf.device() to place variables on the least loaded GPU.

  A common use for this class is to pass a list of GPU devices, e.g. ['gpu:0',
  'gpu:1','gpu:2'], as ps_devices. When each variable is placed, it will be
  placed on the least loaded gpu. All other Ops, which will be the computation
  Ops, will be placed on the worker_device.
  """

  def __init__(self, worker_device, ps_devices):
    """Initializer for GpuParamServerDeviceSetter.

    Args:
      worker_device: the device to use for computation Ops.
      ps_devices: a list of devices to use for Variable Ops. Each variable is
        assigned to the least loaded device.
    """
    self.ps_devices = ps_devices
    self.worker_device = worker_device
    self.ps_sizes = [0] * len(self.ps_devices)

  def __call__(self, op):
    if op.device:
      return op.device
    if op.type not in ['Variable', 'VariableV2', 'VarHandleOp']:
      return self.worker_device

    # Gets the least loaded ps_device.
    device_index, _ = min(enumerate(self.ps_sizes), key=operator.itemgetter(1))
    device_name = self.ps_devices[device_index]
    var_size = op.outputs[0].get_shape().num_elements()
    self.ps_sizes[device_index] += var_size

    return device_name


def _create_device_setter(is_cpu_ps, worker, num_gpus):
  """Create device setter object."""
  if is_cpu_ps:
    # tf.train.replica_device_setter supports placing variables on the CPU, all
    # on one GPU, or on ps_servers defined in a cluster_spec.
    return tf.train.replica_device_setter(
        worker_device=worker, ps_device='/cpu:0', ps_tasks=1)
  else:
    gpus = ['/gpu:%d' % i for i in range(num_gpus)]
    return GpuParamServerDeviceSetter(worker, gpus)


# The method below is a modified snippet from the full example.
def _resnet_model_fn():
  # When set to False, variables are placed on the least loaded GPU. If set
  # to True, the variables will be placed on the CPU.
  is_cpu_ps = False

  # Loops over the number of GPUs and creates a copy ("tower") of the model on
  # each GPU.
  for i in range(num_gpus):
    worker = '/gpu:%d' % i
    # Creates a device setter used to determine where Ops are to be placed.
    device_setter = _create_device_setter(is_cpu_ps, worker, FLAGS.num_gpus)
    # Creates variables on the first loop. On subsequent loops reuse is set
    # to True, which results in the "towers" sharing variables.
    with tf.variable_scope('resnet', reuse=bool(i != 0)):
      with tf.name_scope('tower_%d' % i) as name_scope:
        # tf.device calls the device_setter for each Op that is created.
        # device_setter returns the device the Op is to be placed on.
        with tf.device(device_setter):
          # Creates the "tower".
          _tower_fn(is_training, weight_decay, tower_features[i],
                    tower_labels[i], tower_losses, tower_gradvars,
                    tower_preds, False)
```
In the near future the above code will be for illustration purposes only as
there will be easy-to-use, high-level methods to support a wide range of popular
approaches. This
[example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator)
will continue to get updated as the API expands and evolves to address multi-GPU
scenarios.
## Optimizing for CPU
CPUs, which include Intel® Xeon Phi™, achieve optimal performance when
TensorFlow is @{$install_sources$built from source} with all of the instructions
supported by the target CPU.
Beyond using the latest instruction sets, Intel® has added support for the
Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to
TensorFlow. While the name is not completely accurate, these optimizations are
often simply referred to as 'MKL' or 'TensorFlow with MKL'. [TensorFlow
with Intel® MKL-DNN](#tensorflow_with_intel_mkl_dnn) contains details on the
MKL optimizations.
The two configurations listed below are used to optimize CPU performance by
adjusting the thread pools.
* `intra_op_parallelism_threads`: Nodes that can use multiple threads to
parallelize their execution will schedule the individual pieces into this
pool.
* `inter_op_parallelism_threads`: All ready nodes are scheduled in this pool.
These configurations are set via `tf.ConfigProto` and passed to `tf.Session`
in the `config` attribute as shown in the snippet below. If either option is
unset or set to 0, it defaults to the number of logical CPU cores. Testing has
shown that the default is effective for systems ranging from one CPU with 4
cores to multiple CPUs with 70+ combined logical cores.
A common alternative optimization is to set the number of threads in both pools
equal to the number of physical cores rather than logical cores.
```python
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 44
config.inter_op_parallelism_threads = 44
tf.Session(config=config)
```
The [Comparing compiler optimizations](#comparing-compiler-optimizations)
section contains the results of tests that used different compiler
optimizations.
### TensorFlow with Intel® MKL DNN
Intel® has added optimizations to TensorFlow for Intel® Xeon® and Intel® Xeon
Phi™ through the use of Intel® Math Kernel Library for Deep Neural Networks
(Intel® MKL-DNN) optimized primitives. The optimizations also provide speedups
for the consumer line of processors, e.g. i5 and i7 Intel processors. The Intel
published paper
[TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
contains additional details on the implementation.
> Note: MKL was added as of TensorFlow 1.2 and currently only works on Linux. It
> also does not work when using `--config=cuda`.
In addition to providing significant performance improvements for training CNN
based models, compiling with the MKL creates a binary that is optimized for AVX
and AVX2. The result is a single binary that is optimized and compatible with
most modern (post-2011) processors.
TensorFlow can be compiled with the MKL optimizations using the following
commands, which depend on the version of the TensorFlow source used.
For TensorFlow source versions after 1.3.0:
```bash
./configure
# Pick the desired options
bazel build --config=mkl -c opt //tensorflow/tools/pip_package:build_pip_package
```
For TensorFlow versions 1.2.0 through 1.3.0:
```bash
./configure
Do you wish to build TensorFlow with MKL support? [y/N] Y
Do you wish to download MKL LIB from the web? [Y/n] Y
# Select the defaults for the rest of the options.
bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
```
#### Tuning MKL for the best performance
This section details the different configurations and environment variables that
can be used to tune the MKL to get optimal performance. Before tweaking various
environment variables make sure the model is using the `NCHW` (`channels_first`)
[data format](#data-formats). The MKL is optimized for `NCHW` and Intel is
working to get near performance parity when using `NHWC`.
MKL uses the following environment variables to tune performance:
* KMP_BLOCKTIME - Sets the time, in milliseconds, that a thread should wait,
after completing the execution of a parallel region, before sleeping.
* KMP_AFFINITY - Enables the run-time library to bind threads to physical
processing units.
* KMP_SETTINGS - Enables (true) or disables (false) the printing of OpenMP*
run-time library environment variables during program execution.
* OMP_NUM_THREADS - Specifies the number of threads to use.
More details on the KMP variables are on
[Intel's](https://software.intel.com/en-us/node/522775) site and the OMP
variables on
[gnu.org](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html).
While there can be substantial gains from adjusting the environment variables,
which is discussed below, the simplified advice is to set the
`inter_op_parallelism_threads` equal to the number of physical CPUs and to set
the following environment variables:
* KMP_BLOCKTIME=0
* KMP_AFFINITY=granularity=fine,verbose,compact,1,0
Example setting MKL variables with command-line arguments:
```bash
KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 \
KMP_SETTINGS=1 python your_python_script.py
```
Example setting MKL variables with Python `os.environ`:
```python
os.environ["KMP_BLOCKTIME"] = str(FLAGS.kmp_blocktime)
os.environ["KMP_SETTINGS"] = str(FLAGS.kmp_settings)
os.environ["KMP_AFFINITY"] = FLAGS.kmp_affinity
if FLAGS.num_intra_threads > 0:
  os.environ["OMP_NUM_THREADS"] = str(FLAGS.num_intra_threads)
```
There are models and hardware platforms that benefit from different settings.
Each variable that impacts performance is discussed below, and a sketch
combining the recommended settings follows the list.
* **KMP_BLOCKTIME**: The MKL default is 200ms, which was not optimal in our
testing. 0 (0ms) was a good default for CNN based models that were tested.
The best performance for AlexNet was achieved at 30ms and both GoogleNet and
VGG11 performed best set at 1ms.
* **KMP_AFFINITY**: The recommended setting is
`granularity=fine,verbose,compact,1,0`.
* **OMP_NUM_THREADS**: This defaults to the number of physical cores.
Adjusting this parameter beyond matching the number of cores can have an
impact when using Intel® Xeon Phi™ (Knights Landing) for some models. See
[TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
for optimal settings.
* **intra_op_parallelism_threads**: Setting this equal to the number of
physical cores is recommended. Setting the value to 0, which is the default
and will result in the value being set to the number of logical cores, is an
option to try for some architectures. This value and `OMP_NUM_THREADS`
should be equal.
* **inter_op_parallelism_threads**: Setting this equal to the number of
sockets is recommended. Setting the value to 0, which is the default,
results in the value being set to the number of logical cores.
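The sketch below combines the preceding recommendations; the socket and core
counts are assumptions for a hypothetical 2-socket, 44-core machine and should
be replaced with the values for the machine in use:
```python
import os
import tensorflow as tf

# Assumed hardware: 2 sockets x 22 physical cores; adjust for your machine.
NUM_PHYSICAL_CORES = 44
NUM_SOCKETS = 2

# MKL/OpenMP settings recommended above.
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["OMP_NUM_THREADS"] = str(NUM_PHYSICAL_CORES)

# Thread pools: intra-op equal to physical cores, inter-op equal to sockets.
config = tf.ConfigProto()
config.intra_op_parallelism_threads = NUM_PHYSICAL_CORES
config.inter_op_parallelism_threads = NUM_SOCKETS

sess = tf.Session(config=config)
```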
### Comparing compiler optimizations
Collected below are performance results running training and inference on
different types of CPUs on different platforms with various compiler
optimizations. The models used were ResNet-50
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)) and
InceptionV3 ([arXiv:1512.00567](https://arxiv.org/abs/1512.00567)).
For each test, when the MKL optimization was used the environment variable
KMP_BLOCKTIME was set to 0 (0ms) and KMP_AFFINITY to
`granularity=fine,verbose,compact,1,0`.
#### Inference InceptionV3
**Environment**
* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)
**Batch Size: 1**
Command executed for the MKL test:
```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```
| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 6.8 (147ms)            | 4             | 0             |
| MKL          | NCHW        | 6.6 (151ms)            | 4             | 1             |
| MKL          | NHWC        | 5.95 (168ms)           | 4             | 1             |
| AVX          | NHWC        | 4.7 (211ms)            | 4             | 0             |
| SSE3         | NHWC        | 2.7 (370ms)            | 4             | 0             |
**Batch Size: 32**
Command executed for the MKL test:
```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```
| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 10.24 (3125ms)         | 4             | 1             |
| MKL          | NHWC        | 8.9 (3595ms)           | 4             | 1             |
| AVX2         | NHWC        | 7.3 (4383ms)           | 4             | 0             |
| AVX          | NHWC        | 5.1 (6275ms)           | 4             | 0             |
| SSE3         | NHWC        | 2.8 (11428ms)          | 4             | 0             |
#### Inference ResNet-50
**Environment**
* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)
**Batch Size: 1**
Command executed for the MKL test:
```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```
| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 6.8 (147ms)            | 4             | 0             |
| MKL          | NCHW        | 6.6 (151ms)            | 4             | 1             |
| MKL          | NHWC        | 5.95 (168ms)           | 4             | 1             |
| AVX          | NHWC        | 4.7 (211ms)            | 4             | 0             |
| SSE3         | NHWC        | 2.7 (370ms)            | 4             | 0             |
**Batch Size: 32**
Command executed for the MKL test:
```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```
| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 10.24 (3125ms)         | 4             | 1             |
| MKL          | NHWC        | 8.9 (3595ms)           | 4             | 1             |
| AVX2         | NHWC        | 7.3 (4383ms)           | 4             | 0             |
| AVX          | NHWC        | 5.1 (6275ms)           | 4             | 0             |
| SSE3         | NHWC        | 2.8 (11428ms)          | 4             | 0             |
#### Training InceptionV3
**Environment**
* Instance Type: Dedicated AWS EC2 r4.16xlarge (Broadwell)
* CPU: Intel Xeon E5-2686 v4 (Broadwell) Processors
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)
Command executed for MKL test:
```bash
python tf_cnn_benchmarks.py --device=cpu --mkl=True --kmp_blocktime=0 \
--nodistortions --model=resnet50 --data_format=NCHW --batch_size=32 \
--num_inter_threads=2 --num_intra_threads=36 \
--data_dir=<path to ImageNet TFRecords>
```
Optimization | Data Format | Images/Sec | Intra threads | Inter Threads
------------ | ----------- | ---------- | ------------- | -------------
MKL | NCHW | 20.8 | 36 | 2
AVX2 | NHWC | 6.2 | 36 | 0
AVX | NHWC | 5.7 | 36 | 0
SSE3 | NHWC | 4.3 | 36 | 0
ResNet and [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
were also run on this configuration, but in an ad hoc manner. There were not
enough runs executed to publish a coherent table of results. The incomplete
results strongly indicated the final result would be similar to the table above,
with MKL providing significant 3x+ gains over AVX2.