Performance guide update
PiperOrigin-RevId: 168159289

# Performance Guide

This guide contains a collection of best practices for optimizing TensorFlow
code. The best practices apply to both new and experienced TensorFlow users. As
a complement to the best practices in this document, the
@{$performance_models$High-Performance Models} document links to example code
and details for creating models that scale on a variety of hardware. The guide
is divided into a few sections:

* [General best practices](#general-best-practices) covers topics that are
  common across a variety of model types and hardware.
* [Optimizing for GPU](#optimizing-for-gpu) details tips specifically relevant
  to GPUs.
* [Optimizing for CPU](#optimizing-for-cpu) details CPU specific information.

## General best practices

While the optimal settings differ across model types, the sections below cover
best practices that are relevant to a variety of hardware and models. Although
these suggestions focus on image-based models, tips for all kinds of models are
added regularly. The best practices are broken down into the following sections:

* [Input pipeline optimizations](#input-pipeline-optimization)
* [Data formats](#data-formats)
* [Common fused Ops](#common-fused-ops)
* [Building and installing from source](#building-and-installing-from-source)

### Input pipeline optimization

Typical models retrieve data from disk and preprocess it before sending the data
through the network. For example, models that process JPEG images will follow
this flow: load image from disk, decode JPEG into a tensor, crop and pad,
possibly flip and distort, and then batch. This flow is referred to as the input
pipeline. As GPUs and other hardware accelerators get faster, preprocessing of
data can become a bottleneck.

Determining if the input pipeline is the bottleneck can be complicated. One of
the most straightforward methods is to reduce the model to a single operation
(trivial model) after the input pipeline and measure the examples per second. If
the difference in examples per second for the full model and the trivial model
is minimal then the input pipeline is likely the bottleneck. Below are some other
approaches to identifying issues:

* Check if a GPU is underutilized by running `watch -n 2 nvidia-smi`. If GPU
  utilization is not approaching 80-100%, then the input pipeline may be the
  bottleneck.
* Generate a timeline and look for large blocks of white space (waiting). An
  example of generating a timeline exists as part of the @{$jit$XLA JIT}
  tutorial, and a minimal sketch is shown after this list.
* Check CPU usage. It is possible to have an optimized input pipeline and lack
  the CPU cycles to process the pipeline.
* Estimate the throughput needed and verify the disk used is capable of that
  level of throughput. Some cloud solutions have network attached disks that
  start as low as 50 MB/sec, which is slower than spinning disks (150 MB/sec),
  SATA SSDs (500 MB/sec), and PCIe SSDs (2,000+ MB/sec).
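
The following is a minimal sketch of generating a Chrome-trace timeline for a
single step; it is not taken from the XLA tutorial, and the traced op is a
trivial stand-in for whatever op your session normally runs.

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# A trivial op standing in for a real training step.
train_op = tf.reduce_sum(tf.random_normal([1000, 1000]))

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
  sess.run(train_op, options=run_options, run_metadata=run_metadata)

  # Write a trace that can be opened with chrome://tracing.
  trace = timeline.Timeline(step_stats=run_metadata.step_stats)
  with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())
```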

#### Preprocessing on the CPU

Placing input pipeline operations on the CPU can significantly improve
performance. When preprocessing occurs on the GPU, the flow of data is
CPU -> GPU (preprocessing) -> CPU -> GPU (training), bouncing the data back and
forth between the CPU and GPU. When preprocessing is placed on the CPU, the data
flow is simply CPU (preprocessing) -> GPU (training). Utilizing the CPU for the
input pipeline also frees the GPU to focus on training, and for some models it
has resulted in a 6X+ increase in samples per second, which could lead to
training in 1/6th of the time. To ensure preprocessing is on the CPU, wrap the
preprocessing operations as shown below:

```python
with tf.device('/cpu:0'):
  # function to get and process images or data.
  distorted_inputs = load_and_distort_images()
```

If using `tf.estimator.Estimator` the input function is automatically placed on
the CPU.

#### Using the Dataset API

The @{$datasets$Dataset API} is replacing `queue_runner` as the recommended API
for building input pipelines; for pipelines still built on queues, the
@{$reading_data#reading_from_files$Reading Data guide} covers implementation
details. The Dataset API was added to contrib as part of TensorFlow 1.2 and will
move to core in the near future. This
[ResNet example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/cifar10_main.py)
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385))
training CIFAR-10 illustrates the use of the Dataset API along with
`tf.estimator.Estimator`. The Dataset API utilizes C++ multi-threading and has a
much lower overhead than the Python-based `queue_runner`, which is limited by
Python's multi-threading performance.
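
Below is a minimal sketch of a Dataset-based input pipeline for `TFRecord`
files; the file name and the record schema in `parse_fn` are hypothetical and
would be replaced with your own data.

```python
import tensorflow as tf

def parse_fn(serialized_example):
  # Hypothetical record schema with a raw image and an integer label.
  features = tf.parse_single_example(
      serialized_example,
      features={'image': tf.FixedLenFeature([], tf.string),
                'label': tf.FixedLenFeature([], tf.int64)})
  image = tf.decode_raw(features['image'], tf.uint8)
  return image, features['label']

def input_fn():
  dataset = tf.contrib.data.TFRecordDataset(['train-00000-of-00008.tfrecord'])
  dataset = dataset.map(parse_fn)          # decode and preprocess each record
  dataset = dataset.shuffle(buffer_size=10000)
  dataset = dataset.batch(32)
  dataset = dataset.repeat()               # loop over the data indefinitely
  return dataset.make_one_shot_iterator().get_next()

# An input function like this can be passed directly to an Estimator,
# e.g. estimator.train(input_fn=input_fn).
```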

While feeding data using a `feed_dict` offers a high level of flexibility, in
most instances using `feed_dict` does not scale optimally. However, in instances
where only a single GPU is being used the difference can be negligible. Using
the Dataset API is still strongly recommended. Try to avoid the following:

```python
# feed_dict often results in suboptimal performance when using large inputs.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```

#### Use large files

Reading large numbers of small files significantly impacts I/O performance. One
approach to get maximum I/O throughput is to preprocess input data into larger
(~100MB) `TFRecord` files. For smaller data sets (200MB-1GB), the best approach
is often to load the entire data set into memory. One sign of being I/O bound is
that the training loop runs noticeably faster when the input data is stored on
SSDs rather than spinning disks. The document
[Downloading and converting to TFRecord format](https://github.com/tensorflow/models/tree/master/slim#Data)
includes information and scripts for creating `TFRecords`, and this
[script](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py)
converts the CIFAR-10 data set into `TFRecords`.
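
As a rough sketch of the conversion step, each example can be serialized into a
large `TFRecord` file with `tf.python_io.TFRecordWriter`; the feature names and
input format below are hypothetical.

```python
import tensorflow as tf

def write_tfrecord(examples, output_path):
  """Writes (image_bytes, label) pairs into a single large TFRecord file."""
  with tf.python_io.TFRecordWriter(output_path) as writer:
    for image_bytes, label in examples:
      example = tf.train.Example(features=tf.train.Features(feature={
          'image': tf.train.Feature(
              bytes_list=tf.train.BytesList(value=[image_bytes])),
          'label': tf.train.Feature(
              int64_list=tf.train.Int64List(value=[label])),
      }))
      writer.write(example.SerializeToString())
```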

### Data formats

Data format refers to the structure of the Tensor passed to a given Op. The
discussion below is specifically about 4D Tensors representing images. In
TensorFlow the parts of the 4D tensor are often referred to by the following
letters:

* N refers to the number of images in a batch.
* H refers to the number of pixels in the vertical (height) dimension.
* W refers to the number of pixels in the horizontal (width) dimension.
* C refers to the channels. For example, 1 for black and white or grayscale
  and 3 for RGB.

Within TensorFlow there are two naming conventions representing the two most
common data formats:

* `NCHW` or `channels_first`
* `NHWC` or `channels_last`

`NHWC` is the TensorFlow default and `NCHW` is the optimal format to use when
training on NVIDIA GPUs using [cuDNN](https://developer.nvidia.com/cudnn).

The best practice is to build models that work with both data formats. This
simplifies training on GPUs and then running inference on CPUs. If TensorFlow is
compiled with the [Intel MKL](#tensorflow_with_intel_mkl_dnn) optimizations,
many operations, especially those related to CNN based models, will be optimized
and support `NCHW`. If not using the MKL, some operations are not supported on
CPU when using `NCHW`.
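
The snippet below is a minimal sketch of keeping a model format-agnostic by
converting the input once and passing a `data_format` argument through the
layers; the layer choice and function name are illustrative, not from this
guide.

```python
import tensorflow as tf

def conv_block(images_nhwc, data_format='channels_first'):
  """Applies a conv layer in the requested format, given NHWC input."""
  if data_format == 'channels_first':
    # Convert NHWC -> NCHW, the layout preferred by cuDNN on GPUs.
    net = tf.transpose(images_nhwc, [0, 3, 1, 2])
  else:
    net = images_nhwc
  return tf.layers.conv2d(
      net, filters=64, kernel_size=3, padding='same',
      data_format=data_format)
```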

The brief history of these two formats is that TensorFlow started by using
`NHWC` because it was a little faster on CPUs. In the long term, we are working
on tools to auto rewrite graphs to make switching between the formats
transparent and to take advantage of micro optimizations where a GPU Op may be
faster using `NHWC` than the normally most efficient `NCHW`. There are also edge
cases where `NCHW` can be slower on GPU than `NHWC`; one
[case](https://github.com/tensorflow/tensorflow/issues/7551#issuecomment-280421351)
is using non-fused batch norm on WRN-16-4 without dropout. In that case, using
fused batch norm, which is also recommended, is the optimal solution.

### Common fused Ops

Fused Ops combine multiple operations into a single kernel for improved
performance. There are many fused Ops within TensorFlow and @{$xla$XLA} will
create fused Ops when possible to automatically improve performance. Collected
below are select fused Ops that can greatly improve performance and may be
overlooked.

#### Fused batch norm

Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process that for
some models makes up a large percentage of the operation time. Using fused batch
norm can result in a 12%-30% speedup.

There are two commonly used batch norms and both support fusing. The core
@{tf.layers.batch_normalization} added fused starting in TensorFlow 1.3.

```python
bn = tf.layers.batch_normalization(
    input_layer, fused=True, data_format='NCHW')
```

The contrib @{tf.contrib.layers.batch_norm} method has had fused as an option
since before TensorFlow 1.0.

```python
bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')
```

### Building and installing from source

The default TensorFlow binaries target the broadest range of hardware to make
TensorFlow accessible to everyone. If using CPUs for training or inference, it
is recommended to compile TensorFlow with all of the optimizations available for
the CPU in use. Speedups for training and inference on CPU are documented below
in [Comparing compiler optimizations](#comparing-compiler-optimizations).

To install the most optimized version of TensorFlow,
@{$install_sources$build and install} from source. Building from source with
compiler optimizations for the target hardware, and ensuring the latest CUDA
platform and cuDNN libraries are installed, results in the highest performing
installs. For the most stable experience, build from the
[latest release](https://github.com/tensorflow/tensorflow/releases) branch. To
get the latest performance changes and accept some stability risk, build from
[master](https://github.com/tensorflow/tensorflow).

If there is a need to build TensorFlow on a platform that has different hardware
than the target, then cross-compile with the highest optimizations for the
target platform. The following command is an example of using `bazel` to compile
for a specific platform:

```bash
# This command optimizes for Intel’s Broadwell processor
bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
```

#### Environment, build, and install tips

* `./configure` asks which compute capability to include in the build. This
  does not impact overall performance but does impact initial startup. After
  running TensorFlow once, the compiled kernels are cached by CUDA. If using
  a docker container, the data is not cached and the penalty is paid each time
  TensorFlow starts. The best practice is to include the
  [compute capabilities](http://developer.nvidia.com/cuda-gpus)
  of the GPUs that will be used, e.g. P100: 6.0, Titan X (Pascal): 6.1, Titan
  X (Maxwell): 5.2, and K80: 3.7.
* Use a version of gcc that supports all of the optimizations of the target
  CPU. The recommended minimum gcc version is 4.8.3. On OS X, upgrade to the
  latest Xcode version and use the version of clang that comes with Xcode.
* Install the latest stable CUDA platform and cuDNN libraries supported by
  TensorFlow.
* TensorFlow checks on startup whether it has been compiled with the
  optimizations available on the CPU. If the optimizations are not included,
  TensorFlow will emit warnings, e.g. AVX, AVX2, and FMA instructions not
  included; rebuilding from source on the target machine removes these warnings.

## Optimizing for GPU

This section contains GPU-specific tips that are not covered in the
[General best practices](#general-best-practices). Obtaining optimal performance
on multi-GPUs is a challenge. A common approach is to use data parallelism.
Scaling through the use of data parallelism involves making multiple copies of
the model, which are referred to as "towers", and then placing one tower on each
of the GPUs. Each tower operates on a different mini-batch of data and then
updates variables, also known as parameters, that need to be shared between
each of the towers. How each tower gets the updated variables and how the
gradients are applied has an impact on the performance, scaling, and convergence
of the model. The rest of this section provides an overview of variable
placement and the towering of a model on multiple GPUs.
@{$performance_models$High-Performance Models} gets into more details regarding
more complex methods that can be used to share and update variables between
towers.

The best approach to handling variable updates depends on the model, hardware,
and even how the hardware has been configured. An example of this is that two
systems can be built with NVIDIA Tesla P100s, but one may be using PCIe and the
other [NVLink](http://www.nvidia.com/object/nvlink.html). In that scenario, the
optimal solution for each system may be different. For real world examples, read
the @{$benchmarks$benchmark} page, which details the settings that were optimal
for a variety of platforms. Below is a summary of what was learned from
benchmarking various platforms and configurations:

* **Tesla K80**: If the GPUs are on the same PCI Express root complex and are
  able to use [NVIDIA GPUDirect](https://developer.nvidia.com/gpudirect) Peer
  to Peer, then placing the variables equally across the GPUs used for
  training is the best approach. If the GPUs cannot use GPUDirect, then
  placing the variables on the CPU is the best option.

* **Titan X (Maxwell and Pascal), M40, P100, and similar**: For models like
  ResNet and InceptionV3, placing variables on the CPU is the optimal setting,
  but for models with a lot of variables like AlexNet and VGG, using GPUs with
  `NCCL` is better. A minimal NCCL sketch follows this list.
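
The following is a minimal sketch of an NCCL all-reduce over per-tower tensors;
it assumes a machine with 2 GPUs and a TensorFlow build that includes
`tf.contrib.nccl`, and is not taken from the benchmark code.

```python
import tensorflow as tf
from tensorflow.contrib import nccl

# Hypothetical per-GPU copies of the same gradient, one per tower.
tower_grads = []
for i in range(2):
  with tf.device('/gpu:%d' % i):
    tower_grads.append(tf.random_normal([1000]))

# all_sum returns one summed tensor per device, so each tower can apply the
# aggregated gradient locally without routing the data through the CPU.
summed_grads = nccl.all_sum(tower_grads)
```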

A common approach to managing where variables are placed is to create a method
to determine where each Op is to be placed and use that method in place of a
specific device name when calling `with tf.device():`. Consider a scenario where
a model is being trained on 2 GPUs and the variables are to be placed on the
CPU. There would be a loop for creating and placing the "towers" on each of the
2 GPUs. A custom device placement method would be created that watches for Ops
of type `Variable`, `VariableV2`, and `VarHandleOp` and indicates that they are
to be placed on the CPU. All other Ops would be placed on the target GPU.
The building of the graph would proceed as follows:

* On the first loop a "tower" of the model would be created for `gpu:0`.
  During the placement of the Ops, the custom device placement method would
  indicate that variables are to be placed on `cpu:0` and all other Ops on
  `gpu:0`.

* On the second loop, `reuse` is set to `True` to indicate that variables are
  to be reused and then the "tower" is created on `gpu:1`. During the
  placement of the Ops associated with the "tower", the variables that were
  placed on `cpu:0` are reused and all other Ops are created and placed on
  `gpu:1`.

The final result is that all of the variables are placed on the CPU, with each
GPU having a copy of all of the computational Ops associated with the model.
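
A minimal sketch of such a device function is shown here; the name
`assign_to_device` is hypothetical, and the fuller snippet from the CIFAR-10
estimator example follows below.

```python
import tensorflow as tf

def assign_to_device(worker_device, ps_device='/cpu:0'):
  """Returns a device function that pins variable Ops to ps_device."""
  def _assign(op):
    if op.type in ('Variable', 'VariableV2', 'VarHandleOp'):
      return ps_device
    return worker_device
  return _assign

# Usage: variables created in this block land on the CPU, all other Ops on gpu:0.
with tf.device(assign_to_device('/gpu:0')):
  weights = tf.get_variable('w', shape=[10, 10])
  logits = tf.matmul(tf.random_normal([4, 10]), weights)
```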

The code snippet below illustrates two different approaches for variable
placement: one is placing variables on the CPU; the other is placing variables
equally across the GPUs.

```python
import operator

class GpuParamServerDeviceSetter(object):
  """Used with tf.device() to place variables on the least loaded GPU.

  A common use for this class is to pass a list of GPU devices, e.g. ['gpu:0',
  'gpu:1','gpu:2'], as ps_devices. When each variable is placed, it will be
  placed on the least loaded gpu. All other Ops, which will be the computation
  Ops, will be placed on the worker_device.
  """

  def __init__(self, worker_device, ps_devices):
    """Initializer for GpuParamServerDeviceSetter.

    Args:
      worker_device: the device to use for computation Ops.
      ps_devices: a list of devices to use for Variable Ops. Each variable is
        assigned to the least loaded device.
    """
    self.ps_devices = ps_devices
    self.worker_device = worker_device
    self.ps_sizes = [0] * len(self.ps_devices)

  def __call__(self, op):
    if op.device:
      return op.device
    if op.type not in ['Variable', 'VariableV2', 'VarHandleOp']:
      return self.worker_device

    # Gets the least loaded ps_device
    device_index, _ = min(enumerate(self.ps_sizes), key=operator.itemgetter(1))
    device_name = self.ps_devices[device_index]
    var_size = op.outputs[0].get_shape().num_elements()
    self.ps_sizes[device_index] += var_size

    return device_name

def _create_device_setter(is_cpu_ps, worker, num_gpus):
  """Create device setter object."""
  if is_cpu_ps:
    # tf.train.replica_device_setter supports placing variables on the CPU, all
    # on one GPU, or on ps_servers defined in a cluster_spec.
    return tf.train.replica_device_setter(
        worker_device=worker, ps_device='/cpu:0', ps_tasks=1)
  else:
    gpus = ['/gpu:%d' % i for i in range(num_gpus)]
    return GpuParamServerDeviceSetter(worker, gpus)

# The method below is a modified snippet from the full example.
def _resnet_model_fn():
  # When set to False, variables are placed on the least loaded GPU. If set
  # to True, the variables will be placed on the CPU.
  is_cpu_ps = False

  # Loops over the number of GPUs and creates a copy ("tower") of the model on
  # each GPU.
  for i in range(num_gpus):
    worker = '/gpu:%d' % i
    # Creates a device setter used to determine where Ops are to be placed.
    device_setter = _create_device_setter(is_cpu_ps, worker, FLAGS.num_gpus)
    # Creates variables on the first loop. On subsequent loops reuse is set
    # to True, which results in the "towers" sharing variables.
    with tf.variable_scope('resnet', reuse=bool(i != 0)):
      with tf.name_scope('tower_%d' % i) as name_scope:
        # tf.device calls the device_setter for each Op that is created.
        # device_setter returns the device the Op is to be placed on.
        with tf.device(device_setter):
          # Creates the "tower".
          _tower_fn(is_training, weight_decay, tower_features[i],
                    tower_labels[i], tower_losses, tower_gradvars,
                    tower_preds, False)
```

In the near future the above code will be for illustration purposes only as
there will be easy to use high level methods to support a wide range of popular
approaches. This
[example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator)
will continue to get updated as the API expands and evolves to address multi-GPU
scenarios.

## Optimizing for CPU

CPUs, which includes Intel® Xeon Phi™, achieve optimal performance when
TensorFlow is @{$install_sources$built from source} with all of the instructions
supported by the target CPU.

Beyond using the latest instruction sets, Intel® has added support for the
Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to
TensorFlow. While the name is not completely accurate, these optimizations are
often simply referred to as 'MKL' or 'TensorFlow with MKL'. [TensorFlow
with Intel® MKL-DNN](#tensorflow_with_intel_mkl_dnn) contains details on the
MKL optimizations.

The two configurations listed below are used to optimize CPU performance by
adjusting the thread pools.

* `intra_op_parallelism_threads`: Nodes that can use multiple threads to
  parallelize their execution will schedule the individual pieces into this
  pool.
* `inter_op_parallelism_threads`: All ready nodes are scheduled in this pool.

These configurations are set via `tf.ConfigProto` and passed to `tf.Session`
in the `config` attribute as shown in the snippet below. If either option is
unset or set to 0, it will default to the number of logical CPU cores. Testing
has shown that the default is effective for systems ranging from one CPU with
4 cores to multiple CPUs with 70+ combined logical cores. A common alternative
optimization is to set the number of threads in both pools equal to the number
of physical cores rather than logical cores.

```python
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 44
config.inter_op_parallelism_threads = 44
tf.Session(config=config)
```

The [Comparing compiler optimizations](#comparing-compiler-optimizations)
section contains the results of tests that used different compiler
optimizations.

### TensorFlow with Intel® MKL DNN

Intel® has added optimizations to TensorFlow for Intel® Xeon® and Intel® Xeon
Phi™ through the use of Intel® Math Kernel Library for Deep Neural Networks
(Intel® MKL-DNN) optimized primitives. The optimizations also provide speedups
for the consumer line of processors, e.g. i5 and i7 Intel processors. The Intel
published paper
[TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
contains additional details on the implementation.

> Note: MKL was added as of TensorFlow 1.2 and currently only works on Linux. It
> also does not work when also using `--config=cuda`.

In addition to providing significant performance improvements for training CNN
based models, compiling with the MKL creates a binary that is optimized for AVX
and AVX2. The result is a single binary that is optimized and compatible with
most modern (post-2011) processors.

TensorFlow can be compiled with the MKL optimizations using the following
commands, depending on the version of the TensorFlow source used.

For TensorFlow source versions after 1.3.0:

```bash
./configure
# Pick the desired options
bazel build --config=mkl -c opt //tensorflow/tools/pip_package:build_pip_package
```

For TensorFlow versions 1.2.0 through 1.3.0:

```bash
./configure
Do you wish to build TensorFlow with MKL support? [y/N] Y
Do you wish to download MKL LIB from the web? [Y/n] Y
# Select the defaults for the rest of the options.

bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
```

#### Tuning MKL for the best performance

This section details the different configurations and environment variables that
can be used to tune the MKL to get optimal performance. Before tweaking various
environment variables make sure the model is using the `NCHW` (`channels_first`)
[data format](#data-formats). The MKL is optimized for `NCHW` and Intel is
working to get near performance parity when using `NHWC`.

MKL uses the following environment variables to tune performance:

* KMP_BLOCKTIME - Sets the time, in milliseconds, that a thread should wait,
  after completing the execution of a parallel region, before sleeping.
* KMP_AFFINITY - Enables the run-time library to bind threads to physical
  processing units.
* KMP_SETTINGS - Enables (true) or disables (false) the printing of OpenMP*
  run-time library environment variables during program execution.
* OMP_NUM_THREADS - Specifies the number of threads to use.

More details on the KMP variables are on
[Intel's](https://software.intel.com/en-us/node/522775) site and the OMP
variables on
[gnu.org](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html).

While there can be substantial gains from adjusting the environment variables,
which is discussed below, the simplified advice is to set
`inter_op_parallelism_threads` equal to the number of physical CPUs and to set
the following environment variables:

* KMP_BLOCKTIME=0
* KMP_AFFINITY=granularity=fine,verbose,compact,1,0

Example setting MKL variables with command-line arguments:

```bash
KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 \
KMP_SETTINGS=1 python your_python_script.py
```

Example setting MKL variables with python `os.environ`:

```python
import os

# FLAGS values are defined by the surrounding script.
os.environ["KMP_BLOCKTIME"] = str(FLAGS.kmp_blocktime)
os.environ["KMP_SETTINGS"] = str(FLAGS.kmp_settings)
os.environ["KMP_AFFINITY"] = FLAGS.kmp_affinity
if FLAGS.num_intra_threads > 0:
  os.environ["OMP_NUM_THREADS"] = str(FLAGS.num_intra_threads)
```

There are models and hardware platforms that benefit from different settings.
Each variable that impacts performance is discussed below.

* **KMP_BLOCKTIME**: The MKL default is 200ms, which was not optimal in our
  testing. 0 (0ms) was a good default for the CNN based models that were tested.
  The best performance for AlexNet was achieved at 30ms, and both GoogleNet and
  VGG11 performed best set at 1ms.

* **KMP_AFFINITY**: The recommended setting is
  `granularity=fine,verbose,compact,1,0`.

* **OMP_NUM_THREADS**: This defaults to the number of physical cores.
  Adjusting this parameter beyond matching the number of cores can have an
  impact when using Intel® Xeon Phi™ (Knights Landing) for some models. See
  [TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
  for optimal settings.

* **intra_op_parallelism_threads**: Setting this equal to the number of
  physical cores is recommended. Setting the value to 0, which is the default
  and results in the value being set to the number of logical cores, is an
  option to try for some architectures. This value and `OMP_NUM_THREADS`
  should be equal.

* **inter_op_parallelism_threads**: Setting this equal to the number of
  sockets is recommended. Setting the value to 0, which is the default,
  results in the value being set to the number of logical cores. A combined
  sketch of these settings follows this list.
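
Pulling these recommendations together, the sketch below shows one way to apply
them for a hypothetical machine with 2 sockets and 36 physical cores; the
numbers are illustrative only.

```python
import os
import tensorflow as tf

physical_cores = 36  # hypothetical hardware
sockets = 2

os.environ['KMP_BLOCKTIME'] = '0'
os.environ['KMP_AFFINITY'] = 'granularity=fine,verbose,compact,1,0'
os.environ['OMP_NUM_THREADS'] = str(physical_cores)

config = tf.ConfigProto()
config.intra_op_parallelism_threads = physical_cores
config.inter_op_parallelism_threads = sockets
sess = tf.Session(config=config)
```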

### Comparing compiler optimizations

Collected below are performance results running training and inference on
different types of CPUs on different platforms with various compiler
optimizations. The models used were ResNet-50
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)) and
InceptionV3 ([arXiv:1512.00567](https://arxiv.org/abs/1512.00567)).

For each test, when the MKL optimization was used the environment variable
KMP_BLOCKTIME was set to 0 (0ms) and KMP_AFFINITY to
`granularity=fine,verbose,compact,1,0`.

#### Inference InceptionV3

**Environment**

* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)

**Batch Size: 1**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 6.8 (147ms)            | 4             | 0             |
| MKL          | NCHW        | 6.6 (151ms)            | 4             | 1             |
| MKL          | NHWC        | 5.95 (168ms)           | 4             | 1             |
| AVX          | NHWC        | 4.7 (211ms)            | 4             | 0             |
| SSE3         | NHWC        | 2.7 (370ms)            | 4             | 0             |

**Batch Size: 32**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 10.24 (3125ms)         | 4             | 1             |
| MKL          | NHWC        | 8.9 (3595ms)           | 4             | 1             |
| AVX2         | NHWC        | 7.3 (4383ms)           | 4             | 0             |
| AVX          | NHWC        | 5.1 (6275ms)           | 4             | 0             |
| SSE3         | NHWC        | 2.8 (11428ms)          | 4             | 0             |

#### Inference ResNet-50

**Environment**

* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)

**Batch Size: 1**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 6.8 (147ms)            | 4             | 0             |
| MKL          | NCHW        | 6.6 (151ms)            | 4             | 1             |
| MKL          | NHWC        | 5.95 (168ms)           | 4             | 1             |
| AVX          | NHWC        | 4.7 (211ms)            | 4             | 0             |
| SSE3         | NHWC        | 2.7 (370ms)            | 4             | 0             |

**Batch Size: 32**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 10.24 (3125ms)         | 4             | 1             |
| MKL          | NHWC        | 8.9 (3595ms)           | 4             | 1             |
| AVX2         | NHWC        | 7.3 (4383ms)           | 4             | 0             |
| AVX          | NHWC        | 5.1 (6275ms)           | 4             | 0             |
| SSE3         | NHWC        | 2.8 (11428ms)          | 4             | 0             |

#### Training InceptionV3

**Environment**

* Instance Type: Dedicated AWS EC2 r4.16xlarge (Broadwell)
* CPU: Intel Xeon E5-2686 v4 (Broadwell) Processors
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --device=cpu --mkl=True --kmp_blocktime=0 \
--nodistortions --model=resnet50 --data_format=NCHW --batch_size=32 \
--num_inter_threads=2 --num_intra_threads=36 \
--data_dir=<path to ImageNet TFRecords>
```

Optimization | Data Format | Images/Sec | Intra threads | Inter Threads
------------ | ----------- | ---------- | ------------- | -------------
MKL          | NCHW        | 20.8       | 36            | 2
AVX2         | NHWC        | 6.2        | 36            | 0
AVX          | NHWC        | 5.7        | 36            | 0
SSE3         | NHWC        | 4.3        | 36            | 0

ResNet and [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
were also run on this configuration but in an ad hoc manner. There were not
enough runs executed to publish a coherent table of results. The incomplete
results strongly indicated the final result would be similar to the table above
with MKL providing significant 3x+ gains over AVX2.