diff --git a/tensorflow/docs_src/performance/performance_guide.md b/tensorflow/docs_src/performance/performance_guide.md index 2448ffac8bb..bf69b7e6fc2 100644 --- a/tensorflow/docs_src/performance/performance_guide.md +++ b/tensorflow/docs_src/performance/performance_guide.md @@ -1,43 +1,182 @@ # Performance Guide -This guide contains a collection of best practices for optimizing your -TensorFlow code. The best practices apply to both new and experienced -Tensorflow users. As a complement to the best practices in this document, the -@{$performance_models$High-Performance Models} document links to example code -and details for creating models that scale on a variety of hardware. +This guide contains a collection of best practices for optimizing TensorFlow +code. The guide is divided into a few sections: -## Best Practices -While optimizing implementations of different types of models can be different, -the topics below cover best practices to get the most performance from -TensorFlow. Although these suggestions focus on image-based models, we will -regularly add tips for all kinds of models. The following list highlights key -best practices: +* [General best practices](#general_best_practices) covers topics that are + common across a variety of model types and hardware. +* [Optimizing for GPU](#optimizing_for_gpu) details tips specifically relevant + to GPUs. +* [Optimizing for CPU](#optimizing_for_cpu) details CPU specific information. -* Build and install from source -* Utilize queues for reading data -* Preprocessing on the CPU -* Use `NCHW` image data format -* Place shared parameters on the GPU -* Use fused batch norm +## General best practices -The following sections detail the preceding suggestions. +The sections below cover best practices that are relevant to a variety of +hardware and models. The best practices section is broken down into the +following sections: -### Build and install from source +* [Input pipeline optimizations](#input-pipeline-optimization) +* [Data formats](#data-formats) +* [Common fused Ops](#common-fused-ops) +* [Building and installing from source](#building-and-installing-from-source) -To install the most optimized version of TensorFlow, build and install -TensorFlow from source by following [Installing TensorFlow from Source](../install/install_sources). -Building from source with compiler optimizations for the target hardware and -ensuring the latest CUDA platform and cuDNN libraries are installed results in -the highest performing installs. +### Input pipeline optimization -For the most stable experience, build from the [latest release](https://github.com/tensorflow/tensorflow/releases) -branch. To get the latest performance changes and accept some stability risk, -build from [master](https://github.com/tensorflow/tensorflow). +Typical models retrieve data from disk and preprocess it before sending the data +through the network. For example, models that process JPEG images will follow +this flow: load image from disk, decode JPEG into a tensor, crop and pad, +possibly flip and distort, and then batch. This flow is referred to as the input +pipeline. As GPUs and other hardware accelerators get faster, preprocessing of +data can be a bottleneck. -If there is a need to build TensorFlow on a platform that has different hardware -than the target, then cross-compile with the highest optimizations for the target -platform. 
The following command is an example of telling `bazel` to compile for -a specific platform: +Determining if the input pipeline is the bottleneck can be complicated. One of +the most straightforward methods is to reduce the model to a single operation +(trivial model) after the input pipeline and measure the examples per second. If +the difference in examples per second for the full model and the trivial model +is minimal then the input pipeline is likely a bottleneck. Below are some other +approaches to identifying issues: + +* Check if a GPU is underutilized by running `watch -n 2 nvidia-smi`. If GPU + utilization is not approaching 80-100%, then the input pipeline may be the + bottleneck. +* Generate a timeline and look for large blocks of white space (waiting). An + example of generating a timeline exists as part of the @{$jit$XLA JIT} + tutorial. +* Check CPU usage. It is possible to have an optimized input pipeline and lack + the CPU cycles to process the pipeline. +* Estimate the throughput needed and verify the disk used is capable of that + level of throughput. Some cloud solutions have network attached disks that + start as low as 50 MB/sec, which is slower than spinning disks (150 MB/sec), + SATA SSDs (500 MB/sec), and PCIe SSDs (2,000+ MB/sec). + +#### Preprocessing on the CPU + +Placing input pipeline operations on the CPU can significantly improve +performance. Utilizing the CPU for the input pipeline frees the GPU to focus on +training. To ensure preprocessing is on the CPU, wrap the preprocessing +operations as shown below: + +```python +with tf.device('/cpu:0'): + # function to get and process images or data. + distorted_inputs = load_and_distort_images() +``` + +If using `tf.estimator.Estimator` the input function is automatically placed on +the CPU. + +#### Using the Dataset API + +The @{$datasets$Dataset API} is replacing `queue_runner` as the recommended API +for building input pipelines. The API was added to contrib as part of TensorFlow +1.2 and will move to core in the near future. This +[ResNet example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/cifar10_main.py) +([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)) +training CIFAR-10 illustrates the use of the Dataset API along with +`tf.estimator.Estimator`. The Dataset API utilizes C++ multi-threading and has a +much lower overhead than the Python-based `queue_runner` that is limited by +Python's multi-threading performance. + +While feeding data using a `feed_dict` offers a high level of flexibility, in +most instances using `feed_dict` does not scale optimally. However, in instances +where only a single GPU is being used the difference can be negligible. Using +the Dataset API is still strongly recommended. Try to avoid the following: + +```python +# feed_dict often results in suboptimal performance when using large inputs. +sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys}) +``` + +#### Use large files + +Reading large numbers of small files significantly impacts I/O performance. +One approach to get maximum I/O throughput is to preprocess input data into +larger (~100MB) `TFRecord` files. For smaller data sets (200MB-1GB), the best +approach is often to load the entire data set into memory. 
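+
+As a minimal sketch of the two suggestions above, the snippet below streams
+larger `TFRecord` files through the Dataset API (here via the `tf.contrib.data`
+namespace used in TensorFlow 1.2/1.3) and keeps the decoding work on the CPU.
+The file name, feature keys, and sizes are illustrative placeholders rather
+than part of the examples referenced in this guide.
+
+```python
+import tensorflow as tf
+
+def parse_fn(serialized_example):
+  # Hypothetical schema: a JPEG-encoded image and an integer label.
+  features = tf.parse_single_example(
+      serialized_example,
+      features={
+          'image/encoded': tf.FixedLenFeature([], tf.string),
+          'image/label': tf.FixedLenFeature([], tf.int64),
+      })
+  image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
+  image = tf.image.resize_images(image, [224, 224])
+  return image, features['image/label']
+
+with tf.device('/cpu:0'):
+  dataset = tf.contrib.data.TFRecordDataset(['train-00000-of-00001.tfrecord'])
+  dataset = dataset.map(parse_fn)       # decode and preprocess on the CPU
+  dataset = dataset.shuffle(buffer_size=10000)
+  dataset = dataset.batch(32)
+  dataset = dataset.repeat()
+  images, labels = dataset.make_one_shot_iterator().get_next()
+```
+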
The document
+[Downloading and converting to TFRecord format](https://github.com/tensorflow/models/tree/master/slim#Data)
+includes information and scripts for creating `TFRecords`, and this
+[script](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py)
+converts the CIFAR-10 data set into `TFRecords`.
+
+### Data formats
+
+Data format refers to the structure of the Tensor passed to a given Op. The
+discussion below is specifically about 4D Tensors representing images. In
+TensorFlow the parts of the 4D tensor are often referred to by the following
+letters:
+
+* N refers to the number of images in a batch.
+* H refers to the number of pixels in the vertical (height) dimension.
+* W refers to the number of pixels in the horizontal (width) dimension.
+* C refers to the channels. For example, 1 for black and white or grayscale
+  and 3 for RGB.
+
+Within TensorFlow there are two naming conventions representing the two most
+common data formats:
+
+* `NCHW` or `channels_first`
+* `NHWC` or `channels_last`
+
+`NHWC` is the TensorFlow default and `NCHW` is the optimal format to use when
+training on NVIDIA GPUs using [cuDNN](https://developer.nvidia.com/cudnn).
+
+The best practice is to build models that work with both data formats. This
+simplifies training on GPUs and then running inference on CPUs. If TensorFlow is
+compiled with the [Intel MKL](#tensorflow_with_intel_mkl_dnn) optimizations,
+many operations, especially those related to CNN-based models, will be optimized
+and support `NCHW`. If not using the MKL, some operations are not supported on
+CPU when using `NCHW`.
+
+The brief history of these two formats is that TensorFlow started by using
+`NHWC` because it was a little faster on CPUs. In the long term, we are working
+on tools to automatically rewrite graphs to make switching between the formats
+transparent and to take advantage of micro-optimizations where a GPU Op may be
+faster using `NHWC` than the normally more efficient `NCHW`.
+
+### Common fused Ops
+
+Fused Ops combine multiple operations into a single kernel for improved
+performance. There are many fused Ops within TensorFlow, and @{$xla$XLA} will
+create fused Ops when possible to automatically improve performance. Collected
+below are select fused Ops that can greatly improve performance and may be
+overlooked.
+
+#### Fused batch norm
+
+Fused batch norm combines the multiple operations needed to do batch
+normalization into a single kernel. Batch norm is an expensive process that for
+some models makes up a large percentage of the operation time. Using fused batch
+norm can result in a 12%-30% speedup.
+
+There are two commonly used batch norm implementations, and both support fusing.
+The core @{tf.layers.batch_normalization} added the `fused` option starting in
+TensorFlow 1.3.
+
+```python
+bn = tf.layers.batch_normalization(
+    input_layer, fused=True, data_format='NCHW')
+```
+
+The contrib @{tf.contrib.layers.batch_norm} method has had `fused` as an option
+since before TensorFlow 1.0.
+
+```python
+bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')
+```
+
+### Building and installing from source
+
+The default TensorFlow binaries target the broadest range of hardware to make
+TensorFlow accessible to everyone. If using CPUs for training or inference, it
+is recommended to compile TensorFlow with all of the optimizations available for
+the CPU in use.
Speedups for training and inference on CPU are documented below +in [Comparing compiler optimizations](#comparing-compiler-optimizations). + +To install the most optimized version of TensorFlow, +@{$install_sources$build and install} from source. If there is a need to build +TensorFlow on a platform that has different hardware than the target, then +cross-compile with the highest optimizations for the target platform. The +following command is an example of using `bazel` to compile for a specific +platform: ```python # This command optimizes for Intel’s Broadwell processor @@ -47,106 +186,467 @@ bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pi #### Environment, build, and install tips -* Compile with the highest level of compute the [GPU - supports](http://developer.nvidia.com/cuda-gpus), e.g. P100: 6.0, Titan X - (pascal): 6.2, Titan X (maxwell): 5.2, and K80: 3.7. -* Install the latest CUDA platform and cuDNN libraries. -* Make sure to use a version of gcc that supports all of the optimizations of - the target CPU. The recommended minimum gcc version is 4.8.3. On OS X upgrade - to the latest Xcode version and use the version of clang that comes with Xcode. -* TensorFlow checks on startup whether it has been compiled with the - optimizations available on the CPU. If the optimizations are not included, - TensorFlow will emit warnings, e.g. AVX, AVX2, and FMA instructions not - included. +* `./configure` asks which compute capability to include in the build. This + does not impact overall performance but does impact initial startup. After + running TensorFlow once, the compiled kernels are cached by CUDA. If using + a docker container, the data is not cached and the penalty is paid each time + TensorFlow starts. The best practice is to include the + [compute capabilities](http://developer.nvidia.com/cuda-gpus) + of the GPUs that will be used, e.g. P100: 6.0, Titan X (Pascal): 6.1, Titan + X (Maxwell): 5.2, and K80: 3.7. +* Use a version of gcc that supports all of the optimizations of the target + CPU. The recommended minimum gcc version is 4.8.3. On OS X, upgrade to the + latest Xcode version and use the version of clang that comes with Xcode. +* Install the latest stable CUDA platform and cuDNN libraries supported by + TensorFlow. -### Utilize queues for reading data +## Optimizing for GPU -One common cause of poor performance is underutilizing GPUs, or essentially -"starving" them of data by not setting up an efficient pipeline. Make sure to -set up an input pipeline to utilize queues and stream data effectively. Review -the @{$reading_data#reading_from_files$Reading Data guide} for implementation -details. One way to identify a "starved" GPU is to generate and review -timelines. A detailed tutorial for timelines does not exist, but a quick example -of generating a timeline exists as part of the @{$jit$XLA JIT} tutorial. Another -simple way to check if a GPU is underutilized is to run `watch nvidia-smi`, and -if GPU utilization is not approaching 100% then the GPU is not getting data fast -enough. +This section contains GPU-specific tips that are not covered in the +[General best practices](#general-best-practices). Obtaining optimal performance +on multi-GPUs is a challenge. A common approach is to use data parallelism. +Scaling through the use of data parallelism involves making multiple copies of +the model, which are referred to as "towers", and then placing one tower on each +of the GPUs. 
Each tower operates on a different mini-batch of data and then
+updates variables, also known as parameters, that need to be shared between
+each of the towers. How each tower gets the updated variables and how the
+gradients are applied has an impact on the performance, scaling, and convergence
+of the model. The rest of this section provides an overview of variable
+placement and the replication of a model across multiple GPUs as towers.
+@{$performance_models$High-Performance Models} goes into more detail on more
+complex methods that can be used to share and update variables between towers.
+
+The best approach to handling variable updates depends on the model, hardware,
+and even how the hardware has been configured. An example of this is that two
+systems can be built with NVIDIA Tesla P100s, but one may be using PCIe and the
+other [NVLink](http://www.nvidia.com/object/nvlink.html). In that scenario, the
+optimal solution for each system may be different. For real-world examples, read
+the @{$benchmarks$benchmark} page, which details the settings that were optimal
+for a variety of platforms. Below is a summary of what was learned from
+benchmarking various platforms and configurations:
+
+* **Tesla K80**: If the GPUs are on the same PCI Express root complex and are
+  able to use [NVIDIA GPUDirect](https://developer.nvidia.com/gpudirect) Peer
+  to Peer, then placing the variables equally across the GPUs used for
+  training is the best approach. If the GPUs cannot use GPUDirect, then
+  placing the variables on the CPU is the best option.
+
+* **Titan X (Maxwell and Pascal), M40, P100, and similar**: For models like
+  ResNet and InceptionV3, placing variables on the CPU is the optimal setting,
+  but for models with a lot of variables like AlexNet and VGG, using GPUs with
+  `NCCL` is better.
+
+A common approach to managing where variables are placed is to create a method
+to determine where each Op is to be placed and use that method in place of a
+specific device name when calling `with tf.device():`. Consider a scenario where
+a model is being trained on 2 GPUs and the variables are to be placed on the
+CPU. There would be a loop for creating and placing the "towers" on each of the
+2 GPUs. A custom device placement method would be created that watches for Ops
+of type `Variable`, `VariableV2`, and `VarHandleOp` and indicates that they are
+to be placed on the CPU. All other Ops would be placed on the target GPU.
+The building of the graph would proceed as follows:
+
+* On the first loop a "tower" of the model would be created for `gpu:0`.
+  During the placement of the Ops, the custom device placement method would
+  indicate that variables are to be placed on `cpu:0` and all other Ops on
+  `gpu:0`.
+
+* On the second loop, `reuse` is set to `True` to indicate that variables are
+  to be reused and then the "tower" is created on `gpu:1`. During the
+  placement of the Ops associated with the "tower", the variables that were
+  placed on `cpu:0` are reused and all other Ops are created and placed on
+  `gpu:1`.
+
+The final result is that all of the variables are placed on the CPU, with each
+GPU having a copy of all of the computational Ops associated with the model.
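+
+The walkthrough above can be reduced to a minimal sketch. The
+`variables_on_cpu` device function and the trivial `build_tower` model are
+hypothetical stand-ins, not part of the full example shown next; they only
+illustrate the Op-type check and the `reuse` pattern just described.
+
+```python
+import tensorflow as tf
+
+def variables_on_cpu(gpu_device):
+  """Returns a device function: variable Ops on cpu:0, all else on the GPU."""
+  def device_fn(op):
+    if op.type in ['Variable', 'VariableV2', 'VarHandleOp']:
+      return '/cpu:0'
+    return gpu_device
+  return device_fn
+
+def build_tower(images):
+  # Trivial stand-in for a real model function.
+  return tf.layers.dense(images, 10)
+
+images = tf.placeholder(tf.float32, [None, 784])
+for i in range(2):
+  # Variables are created on the first loop and reused on the second.
+  with tf.variable_scope('model', reuse=bool(i != 0)):
+    with tf.device(variables_on_cpu('/gpu:%d' % i)):
+      logits = build_tower(images)
+```
+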
+The code snippet below illustrates two different approaches for variable
+placement: one is placing variables on the CPU; the other is placing variables
+equally across the GPUs.
+
```python
-# Using feed_dict often results in suboptimal performance when using large inputs.
-sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
+import operator
+
+import tensorflow as tf
+
+class GpuParamServerDeviceSetter(object):
+  """Used with tf.device() to place variables on the least loaded GPU.
+
+    A common use for this class is to pass a list of GPU devices, e.g. ['gpu:0',
+    'gpu:1','gpu:2'], as ps_devices.  When each variable is placed, it will be
+    placed on the least loaded gpu. All other Ops, which will be the computation
+    Ops, will be placed on the worker_device.
+  """
+
+  def __init__(self, worker_device, ps_devices):
+    """Initializer for GpuParamServerDeviceSetter.
+    Args:
+      worker_device: the device to use for computation Ops.
+      ps_devices: a list of devices to use for Variable Ops. Each variable is
+      assigned to the least loaded device.
+    """
+    self.ps_devices = ps_devices
+    self.worker_device = worker_device
+    self.ps_sizes = [0] * len(self.ps_devices)
+
+  def __call__(self, op):
+    if op.device:
+      return op.device
+    if op.type not in ['Variable', 'VariableV2', 'VarHandleOp']:
+      return self.worker_device
+
+    # Gets the least loaded ps_device
+    device_index, _ = min(enumerate(self.ps_sizes), key=operator.itemgetter(1))
+    device_name = self.ps_devices[device_index]
+    var_size = op.outputs[0].get_shape().num_elements()
+    self.ps_sizes[device_index] += var_size
+
+    return device_name
+
+def _create_device_setter(is_cpu_ps, worker, num_gpus):
+  """Create device setter object."""
+  if is_cpu_ps:
+    # tf.train.replica_device_setter supports placing variables on the CPU, all
+    # on one GPU, or on ps_servers defined in a cluster_spec.
+    return tf.train.replica_device_setter(
+        worker_device=worker, ps_device='/cpu:0', ps_tasks=1)
+  else:
+    gpus = ['/gpu:%d' % i for i in range(num_gpus)]
+    return GpuParamServerDeviceSetter(worker, gpus)
+
+# The method below is a modified snippet from the full example.
+def _resnet_model_fn():
+  # When set to False, variables are placed on the least loaded GPU. If set
+  # to True, the variables will be placed on the CPU.
+  is_cpu_ps = False
+
+  # Loops over the number of GPUs and creates a copy ("tower") of the model on
+  # each GPU.
+  for i in range(num_gpus):
+    worker = '/gpu:%d' % i
+    # Creates a device setter used to determine where Ops are to be placed.
+    device_setter = _create_device_setter(is_cpu_ps, worker, FLAGS.num_gpus)
+    # Creates variables on the first loop. On subsequent loops reuse is set
+    # to True, which results in the "towers" sharing variables.
+    with tf.variable_scope('resnet', reuse=bool(i != 0)):
+      with tf.name_scope('tower_%d' % i) as name_scope:
+        # tf.device calls the device_setter for each Op that is created.
+        # device_setter returns the device the Op is to be placed on.
+        with tf.device(device_setter):
+          # Creates the "tower".
+          _tower_fn(is_training, weight_decay, tower_features[i],
+                    tower_labels[i], tower_losses, tower_gradvars,
+                    tower_preds, False)
+
```
-### Preprocessing on the CPU
+In the near future, the above code will be for illustration purposes only, as
+there will be easy-to-use, high-level methods to support a wide range of popular
+approaches. This
+[example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator)
+will continue to get updated as the API expands and evolves to address multi-GPU
+scenarios.
-Placing preprocessing operations on the CPU can significantly improve
-performance. When preprocessing occurs on the GPU the flow of data is
-CPU -> GPU (preprocessing) -> CPU -> GPU (training).
The data is bounced back
-and forth between the CPU and GPU. When preprocessing is placed on the CPU,
-the data flow is CPU (preprocessing) -> GPU (training). Another benefit is
-preprocessing on the CPU frees GPU time to focus on training.
+## Optimizing for CPU
-Placing preprocessing on the CPU can result in a 6X+ increase in samples/sec
-processed, which could lead to training in 1/6th of the time. To ensure
-preprocessing is on the CPU, wrap the preprocessing operations as shown below:
+CPUs, including Intel® Xeon Phi™, achieve optimal performance when TensorFlow
+is @{$install_sources$built from source} with all of the instructions supported
+by the target CPU.
+
+Beyond using the latest instruction sets, Intel® has added support for the
+Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to
+TensorFlow. While the name is not completely accurate, these optimizations are
+often simply referred to as 'MKL' or 'TensorFlow with MKL'. [TensorFlow
+with Intel® MKL-DNN](#tensorflow_with_intel_mkl_dnn) contains details on the
+MKL optimizations.
+
+The two configurations listed below are used to optimize CPU performance by
+adjusting the thread pools:
+
+* `intra_op_parallelism_threads`: Nodes that can use multiple threads to
+  parallelize their execution will schedule the individual pieces into this
+  pool.
+* `inter_op_parallelism_threads`: All ready nodes are scheduled in this pool.
+
+These configurations are set via `tf.ConfigProto` and passed to `tf.Session`
+in the `config` argument, as shown in the snippet below. If either option is
+unset or set to 0, it will default to the number of logical CPU cores. Testing
+has shown that the default is effective for systems ranging from one CPU with
+4 cores to multiple CPUs with 70+ combined logical cores. A common alternative
+optimization is to set the number of threads in both pools equal to the number
+of physical cores rather than logical cores.
```python
-with tf.device('/cpu:0'):
-  # function to get and process images or data.
-  distorted_inputs = load_and_distort_images()
+
+config = tf.ConfigProto()
+config.intra_op_parallelism_threads = 44
+config.inter_op_parallelism_threads = 44
+sess = tf.Session(config=config)
+
```
-### Use large files
+The [Comparing compiler optimizations](#comparing-compiler-optimizations)
+section contains the results of tests that used different compiler
+optimizations.
-Under some circumstances, both the CPU and GPU can be starved for data by the
-I/O system. If you are using many small files to form your input data set, you
-may be limited by the speed of your filesystem. If your training loop runs
-faster when using SSDs vs HDDs for storing your input data, you could be
-I/O bottlenecked.
+### TensorFlow with Intel® MKL DNN
-If this is the case, you should pre-process your input data, creating a few
-large TFRecord files.
+Intel® has added optimizations to TensorFlow for Intel® Xeon® and Intel® Xeon
+Phi™ through the use of Intel® Math Kernel Library for Deep Neural Networks
+(Intel® MKL-DNN) optimized primitives. The optimizations also provide speedups
+for the consumer line of processors, e.g. i5 and i7 Intel processors. The
+Intel-published paper
+[TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
+contains additional details on the implementation.
-### Use NCHW image data format
+> Note: MKL was added as of TensorFlow 1.2 and currently only works on Linux. It
+> also does not work when using `--config=cuda`.
-Image data format refers to the representation of batches of images. TensorFlow
-supports `NHWC` (TensorFlow default) and `NCHW` (cuDNN default). N refers to the
-number of images in a batch, H refers to the number of pixels in the vertical
-dimension, W refers to the number of pixels in the horizontal dimension, and C
-refers to the channels (e.g. 1 for black and white, 3 for RGB, etc.) Although
-cuDNN can operate on both formats, it is faster to operate in its default
-format.
+In addition to providing significant performance improvements for training
+CNN-based models, compiling with the MKL creates a binary that is optimized for
+AVX and AVX2. The result is a single binary that is optimized and compatible
+with most modern (post-2011) processors.
-The best practice is to build models that work with both `NCHW` and `NHWC` as it
-is common to train using `NCHW` on GPU, and then do inference with `NHWC` on CPU.
+TensorFlow can be compiled with the MKL optimizations using the following
+commands, depending on the version of the TensorFlow source used.
-There are edge cases where `NCHW` can be slower on GPU than `NHWC`. One
-[case](https://github.com/tensorflow/tensorflow/issues/7551#issuecomment-280421351)
-is using non-fused batch norm on WRN-16-4 without dropout. In that case using
-fused batch norm, which is also recommended, is the optimal solution.
+For TensorFlow source versions after 1.3.0:
-The very brief history of these two formats is that TensorFlow started by using
-`NHWC` because it was a little faster on CPUs. Then the TensorFlow team
-discovered that `NCHW` performs better when using the NVIDIA cuDNN library. The
-current recommendation is that users support both formats in their models. In
-the long term, we plan to rewrite graphs to make switching between the formats
-transparent.
+```bash
+./configure
+# Pick the desired options
+bazel build --config=mkl -c opt //tensorflow/tools/pip_package:build_pip_package
+
+```
-### Use fused batch norm
+For TensorFlow versions 1.2.0 through 1.3.0:
-When using batch norm
-@{tf.contrib.layers.batch_norm} set the attribute `fused=True`:
+```bash
+./configure
+Do you wish to build TensorFlow with MKL support? [y/N] Y
+Do you wish to download MKL LIB from the web? [Y/n] Y
+# Select the defaults for the rest of the options.
+
+bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
+
+```
+
+#### Tuning MKL for the best performance
+
+This section details the different configurations and environment variables that
+can be used to tune the MKL to get optimal performance. Before tweaking various
+environment variables, make sure the model is using the `NCHW` (`channels_first`)
+[data format](#data-formats). The MKL is optimized for `NCHW`, and Intel is
+working to get near performance parity when using `NHWC`.
+
+MKL uses the following environment variables to tune performance:
+
+* KMP_BLOCKTIME - Sets the time, in milliseconds, that a thread should wait,
+  after completing the execution of a parallel region, before sleeping.
+* KMP_AFFINITY - Enables the run-time library to bind threads to physical
+  processing units.
+* KMP_SETTINGS - Enables (true) or disables (false) the printing of OpenMP*
+  run-time library environment variables during program execution.
+* OMP_NUM_THREADS - Specifies the number of threads to use.
+
+More details on the KMP variables are on
+[Intel's](https://software.intel.com/en-us/node/522775) site, and the OMP
+variables on
+[gnu.org](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html).
+
+While there can be substantial gains from adjusting the environment variables,
+as discussed below, the simplified advice is to set
+`inter_op_parallelism_threads` equal to the number of physical CPUs (sockets)
+and to set the following environment variables:
+
+* KMP_BLOCKTIME=0
+* KMP_AFFINITY=granularity=fine,verbose,compact,1,0
+
+Example setting MKL variables with command-line arguments:
+
+```bash
+KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 \
+KMP_SETTINGS=1 python your_python_script.py
+```
+
+Example setting MKL variables with Python `os.environ`:
+
```python
-bn = tf.contrib.layers.batch_norm(
-    input_layer, fused=True, data_format='NCHW'
-    scope=scope, **kwargs)
+os.environ["KMP_BLOCKTIME"] = str(FLAGS.kmp_blocktime)
+os.environ["KMP_SETTINGS"] = str(FLAGS.kmp_settings)
+os.environ["KMP_AFFINITY"] = FLAGS.kmp_affinity
+if FLAGS.num_intra_threads > 0:
+  os.environ["OMP_NUM_THREADS"] = str(FLAGS.num_intra_threads)
+
```
-The non-fused batch norm does computations using several individual Ops. Fused
-batch norm combines the individual operations into a single kernel, which runs
-faster.
+There are models and hardware platforms that benefit from different settings.
+Each variable that impacts performance is discussed below.
+
+* **KMP_BLOCKTIME**: The MKL default is 200ms, which was not optimal in our
+  testing. 0 (0ms) was a good default for the CNN-based models that were tested.
+  The best performance for AlexNet was achieved at 30ms, and both GoogleNet and
+  VGG11 performed best when set to 1ms.
+
+* **KMP_AFFINITY**: The recommended setting is
+  `granularity=fine,verbose,compact,1,0`.
+
+* **OMP_NUM_THREADS**: This defaults to the number of physical cores.
+  Adjusting this parameter beyond matching the number of cores can have an
+  impact when using Intel® Xeon Phi™ (Knights Landing) for some models. See
+  [TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
+  for optimal settings.
+
+* **intra_op_parallelism_threads**: Setting this equal to the number of
+  physical cores is recommended. Setting the value to 0, which is the default
+  and results in the value being set to the number of logical cores, is an
+  option to try for some architectures. This value and `OMP_NUM_THREADS`
+  should be equal.
+
+* **inter_op_parallelism_threads**: Setting this equal to the number of
+  sockets is recommended. Setting the value to 0, which is the default,
+  results in the value being set to the number of logical cores.
+
+### Comparing compiler optimizations
+
+Collected below are performance results running training and inference on
+different types of CPUs on different platforms with various compiler
+optimizations. The models used were ResNet-50
+([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)) and
+InceptionV3 ([arXiv:1512.00567](https://arxiv.org/abs/1512.00567)).
+
+For each test, when the MKL optimization was used, the environment variable
+KMP_BLOCKTIME was set to 0 (0ms) and KMP_AFFINITY to
+`granularity=fine,verbose,compact,1,0`.
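+
+Putting the tuning recommendations above together, the sketch below shows one
+way the environment variables and thread-pool settings might be combined. The
+core and socket counts are illustrative placeholders, not values taken from the
+tests that follow.
+
+```python
+import os
+
+import tensorflow as tf
+
+# Illustrative placeholders: use the physical core and socket counts of the
+# machine being tuned.
+NUM_PHYSICAL_CORES = 8
+NUM_SOCKETS = 1
+
+# Environment variables recommended in the tuning section above.
+os.environ["KMP_BLOCKTIME"] = "0"
+os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
+os.environ["OMP_NUM_THREADS"] = str(NUM_PHYSICAL_CORES)
+
+# Thread pools sized per the recommendations above.
+config = tf.ConfigProto()
+config.intra_op_parallelism_threads = NUM_PHYSICAL_CORES
+config.inter_op_parallelism_threads = NUM_SOCKETS
+sess = tf.Session(config=config)
+```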
+ +#### Inference InceptionV3 + +**Environment** + +* Instance Type: AWS EC2 m4.xlarge +* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell) +* Dataset: ImageNet +* TensorFlow Version: 1.2.0 RC2 +* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py) + +**Batch Size: 1** + +Command executed for the MKL test: + +```bash +python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \ +--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \ +--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \ +--data_dir= +``` + +| Optimization | Data Format | Images/Sec | Intra threads | Inter Threads | +: : : (step time) : : : +| ------------ | ----------- | ------------ | ------------- | ------------- | +| AVX2 | NHWC | 6.8 (147ms) | 4 | 0 | +| MKL | NCHW | 6.6 (151ms) | 4 | 1 | +| MKL | NHWC | 5.95 (168ms) | 4 | 1 | +| AVX | NHWC | 4.7 (211ms) | 4 | 0 | +| SSE3 | NHWC | 2.7 (370ms) | 4 | 0 | + +**Batch Size: 32** + +Command executed for the MKL test: + +```bash +python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \ +--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \ +--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \ +--data_dir= +``` + +| Optimization | Data Format | Images/Sec | Intra threads | Inter Threads | +: : : (step time) : : : +| ------------ | ----------- | ------------- | ------------- | ------------- | +| MKL | NCHW | 10.24 | 4 | 1 | +: : : (3125ms) : : : +| MKL | NHWC | 8.9 (3595ms) | 4 | 1 | +| AVX2 | NHWC | 7.3 (4383ms) | 4 | 0 | +| AVX | NHWC | 5.1 (6275ms) | 4 | 0 | +| SSE3 | NHWC | 2.8 (11428ms) | 4 | 0 | + +#### Inference ResNet-50 + +**Environment** + +* Instance Type: AWS EC2 m4.xlarge +* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell) +* Dataset: ImageNet +* TensorFlow Version: 1.2.0 RC2 +* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py) + +**Batch Size: 1** + +Command executed for the MKL test: + +```bash +python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \ +--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \ +--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \ +--data_dir= +``` + +| Optimization | Data Format | Images/Sec | Intra threads | Inter Threads | +: : : (step time) : : : +| ------------ | ----------- | ------------ | ------------- | ------------- | +| AVX2 | NHWC | 6.8 (147ms) | 4 | 0 | +| MKL | NCHW | 6.6 (151ms) | 4 | 1 | +| MKL | NHWC | 5.95 (168ms) | 4 | 1 | +| AVX | NHWC | 4.7 (211ms) | 4 | 0 | +| SSE3 | NHWC | 2.7 (370ms) | 4 | 0 | + +**Batch Size: 32** + +Command executed for the MKL test: + +```bash +python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \ +--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \ +--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \ +--data_dir= +``` + +| Optimization | Data Format | Images/Sec | Intra threads | Inter Threads | +: : : (step time) : : : +| ------------ | ----------- | ------------- | ------------- | ------------- | +| MKL | NCHW | 10.24 | 4 | 1 | +: : : (3125ms) : : : +| MKL | NHWC | 8.9 (3595ms) | 4 | 1 | +| AVX2 | NHWC | 7.3 (4383ms) | 4 | 0 | +| AVX | NHWC | 5.1 (6275ms) | 4 | 0 | +| SSE3 | NHWC | 2.8 (11428ms) | 4 | 0 | + +#### Training InceptionV3 + +**Environment** + +* Instance Type: Dedicated AWS EC2 r4.16xlarge 
(Broadwell) +* CPU: Intel Xeon E5-2686 v4 (Broadwell) Processors +* Dataset: ImageNet +* TensorFlow Version: 1.2.0 RC2 +* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py) + +Command executed for MKL test: + +```bash +python tf_cnn_benchmarks.py --device=cpu --mkl=True --kmp_blocktime=0 \ +--nodistortions --model=resnet50 --data_format=NCHW --batch_size=32 \ +--num_inter_threads=2 --num_intra_threads=36 \ +--data_dir= +``` + +Optimization | Data Format | Images/Sec | Intra threads | Inter Threads +------------ | ----------- | ---------- | ------------- | ------------- +MKL | NCHW | 20.8 | 36 | 2 +AVX2 | NHWC | 6.2 | 36 | 0 +AVX | NHWC | 5.7 | 36 | 0 +SSE3 | NHWC | 4.3 | 36 | 0 + +ResNet and [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) +were also run on this configuration but in an ad hoc manner. There were not +enough runs executed to publish a coherent table of results. The incomplete +results strongly indicated the final result would be similar to the table above +with MKL providing significant 3x+ gains over AVX2.