
Benchmarks for keras model examples

Keras benchmarks

These are benchmark tests running on Keras models from keras/examples. Benchmarks in the current folder (tensorflow/python/keras/benchmarks/keras_examples_benchmarks) use Keras built-in datasets. In addition, these benchmarks support different distribution strategies on multiple GPUs.
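
For context, the off and mirrored values in the result tables below correspond to training without a distribution strategy and under tf.distribute.MirroredStrategy, respectively. Below is a minimal sketch of the mirrored case; the model and data are placeholders, not the actual benchmark code in this folder:

    import tensorflow as tf

    # Illustrative only: replicate the model across all visible GPUs.
    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        # Variables created inside the scope are mirrored on each replica.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    # fit() shards each batch across the available replicas.
    x = tf.random.normal((1024, 20))
    y = tf.random.normal((1024, 1))
    model.fit(x, y, batch_size=256, epochs=1, verbose=0)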

Available models

These examples are implemented using the Functional API and the Sequential API.
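
For reference, here is the same tiny model expressed with each API; this is an illustrative sketch, not one of the benchmarked models:

    import tensorflow as tf

    # Sequential API: a linear stack of layers.
    sequential_model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Functional API: explicit input/output tensors, which also supports
    # non-linear topologies (multiple inputs/outputs, shared layers).
    inputs = tf.keras.Input(shape=(16,))
    hidden = tf.keras.layers.Dense(32, activation="relu")(inputs)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(hidden)
    functional_model = tf.keras.Model(inputs=inputs, outputs=outputs)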

Computer Vision examples

Text & Sequence examples

Other examples

Available benchmark results

The listed benchmark results are obtained by running on Google Cloud Platform (GCP) with the following setup:

  • GPU: 2 x Tesla V100
  • OS: Ubuntu 18.04
  • CPU: 8 x vCPUs, 30 GB memory
  • CUDA: 10.1
  • Bazel: 3.1.0

If you want to run benchmark tests on GPU, please make sure you have already installed CUDA and other dependencies by following the instructions in the official tutorial for GPU support.

Metrics for the following benchmarks:

  • Batch_size: Number of samples per batch of computation.
  • Wall_time: Total time to run the benchmark test, in seconds.
  • Avg_epoch_time: Average time per epoch, in seconds.
  • Exp_per_sec: Number of examples processed per second.
  • Distribution_Strategy: The distribution strategy used in the benchmark.
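
As a rough sanity check (assuming an epoch covers the full Keras built-in training split, e.g. 60,000 images for MNIST), Exp_per_sec is approximately the number of examples per epoch divided by Avg_epoch_time:

    # Illustrative arithmetic only; the exact number of examples per epoch
    # depends on the dataset and any validation split used by each benchmark.
    examples_per_epoch = 60000      # assumed MNIST training-set size
    avg_epoch_time_s = 12.19        # CPU row of the MNIST Conv table below
    print(examples_per_epoch / avg_epoch_time_s)  # ~4922, close to the reported 4915.26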

Cifar10 CNN benchmark

| Hardware | Batch_size | Wall_time | Avg_epoch_time | Exp_per_sec | Distribution_Strategy |
| :------: | :--------: | :-------: | :------------: | :---------: | :-------------------: |
| CPU | 256 | 1393.4896 | 3.21 | 15397.69 | off |
| GPU:2 | 256 | 76.49 | 2.59 | 18758.01 | mirrored |

MNIST Conv benchmark

| Hardware | Batch_size | Wall_time | Avg_epoch_time | Exp_per_sec | Distribution_Strategy |
| :------: | :--------: | :-------: | :------------: | :---------: | :-------------------: |
| CPU | 256 | 196.52 | 12.19 | 4915.26 | off |
| GPU:2 | 256 | 24.5794 | 1.21 | 47899.32 | mirrored |

MNIST Hierarchical RNN (HRNN) benchmark

| Hardware | Batch_size | Wall_time | Avg_epoch_time | Exp_per_sec | Distribution_Strategy |
| :------: | :--------: | :-------: | :------------: | :---------: | :-------------------: |
| CPU | 256 | 654.05 | 218.68 | 274.24 | off |
| GPU:2 | 256 | 20.77 | 3.73 | 15088.06 | mirrored |

Bidirectional LSTM benchmark

| Hardware | Batch_size | Wall_time | Avg_epoch_time | Exp_per_sec | Distribution_Strategy |
| :------: | :--------: | :-------: | :------------: | :---------: | :-------------------: |
| CPU | 512 | 225.57 | 72.55 | 344.70 | off |
| GPU:2 | 512 | 23.54 | 3.23 | 7532.53 | mirrored |

Text classification with transformer benchmark

| Hardware | Batch_size | Wall_time | Avg_epoch_time | Exp_per_sec | Distribution_Strategy |
| :------: | :--------: | :-------: | :------------: | :---------: | :-------------------: |
| CPU | 512 | 109.22 | 35.93 | 698.10 | off |
| GPU:2 | 512 | 9.28 | 0.83 | 26567.54 | mirrored |

MLP benchmark

| Hardware | Batch_size | Wall_time | Avg_epoch_time | Exp_per_sec | Distribution_Strategy |
| :------: | :--------: | :-------: | :------------: | :---------: | :-------------------: |
| CPU | 128 | 3.76 | 0.54 | 17678.54 | off |
| GPU:2 | 128 | 5.91 | 0.30 | 25435.14 | mirrored |

Antirectifier benchmark

| Hardware | Batch_size | Wall_time | Avg_epoch_time | Exp_per_sec | Distribution_Strategy |
| :------: | :--------: | :-------: | :------------: | :---------: | :-------------------: |
| CPU | 512 | 6.77 | 1.79 | 30916.39 | off |
| GPU:2 | 512 | 6.81 | 0.66 | 66563.17 | mirrored |

IRNN benchmark

| Hardware | Batch_size | Wall_time | Avg_epoch_time | Exp_per_sec | Distribution_Strategy |
| :------: | :--------: | :-------: | :------------: | :---------: | :-------------------: |
| CPU | 1024 | 213.00 | 69.01 | 868.08 | off |
| GPU:2 | 1024 | 92.71 | 29.12 | 2042.94 | mirrored |

Note: For small models, running on GPU might be even slower than on CPU. A likely reason is that training small models is not computation-dominant, and distributed training on GPUs adds overhead for model replication and data sharding.

Install Bazel

This step can be skipped if Bazel is already installed.

Bazel is used to build targets based on BUILD files. The first build will take a while because Bazel compiles all dependencies from your BUILD file; subsequent builds use the cache and are much faster. On Ubuntu, please use the following steps to install Bazel. For other platforms, you may follow the corresponding guide for installation.

  1. Add Bazel as a package source

    sudo apt install curl gnupg
    
    curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
    
    echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
    

    Before installing Bazel, check which Bazel version can build the specific TensorFlow version; you can check it from here. In addition, you can follow the instructions from the Bazel website.

  2. Install Bazel

    sudo apt update && sudo apt install bazel-`version`
    

Run benchmarks

To run benchmarks in keras/benchmarks, please take the following steps:

  1. Pull the latest tensorflow repo from GitHub.

  2. Install a Bazel version that works with TensorFlow; see the Install Bazel section above.

  3. To run benchmarks with Bazel, use the --benchmarks flag to specify which benchmarks to run.

    • To run all benchmarks on CPU

      bazel run -c opt benchmark_test -- --benchmarks=.
      
    • To run all benchmarks on GPU

      bazel run --config=cuda -c opt --copt="-mavx" benchmarks_test -- --benchmarks=.
      
    • To run a subset of benchmarks, use the --benchmarks flag: the specified value is interpreted as a regular expression, and any benchmark whose name contains a partial match to the regular expression is executed. For example, --benchmarks=".*lstm.*" will run all LSTM layer related benchmarks.

Add new benchmarks

To add a new benchmark, please take the following steps:

  1. Create your own benchmark test file, xxxx_benchmark_test.py.
  2. Import benchmark_util to measure and track performance if needed.
  3. Create a class that inherits from tf.test.Benchmark.
  4. Define and load the dataset in the __init__ method.
  5. Design and create a model in the _build_model method.
  6. Define the benchmark_xxx method to measure the performance of the benchmark with different hyperparameters, such as batch_size, run_iters, and distribution_strategy. You can check examples from here; a minimal sketch follows this list.
  7. Add the benchmark target to the BUILD file.
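
A minimal sketch of what such a benchmark file might look like. The class name, model, dataset, and hyperparameters below are illustrative assumptions; real benchmarks in this folder typically delegate measurement and metric aggregation to benchmark_util instead of timing fit() by hand:

    # my_mlp_benchmark_test.py (hypothetical example)
    import time

    import tensorflow as tf


    class MyMLPBenchmark(tf.test.Benchmark):
        """Benchmark for a small MLP on the Keras built-in MNIST dataset."""

        def __init__(self):
            super().__init__()
            # Step 4: define and load the dataset in __init__.
            (self.x_train, self.y_train), _ = tf.keras.datasets.mnist.load_data()
            self.x_train = self.x_train.reshape(-1, 784).astype("float32") / 255.0

        def _build_model(self):
            # Step 5: design and create the model under test.
            return tf.keras.Sequential([
                tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
                tf.keras.layers.Dense(10, activation="softmax"),
            ])

        def benchmark_mlp_bs_128(self):
            # Step 6: measure performance for one hyperparameter setting.
            model = self._build_model()
            model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
            start = time.time()
            model.fit(self.x_train, self.y_train, batch_size=128, epochs=2, verbose=0)
            wall_time = time.time() - start
            # report_benchmark is inherited from tf.test.Benchmark.
            self.report_benchmark(wall_time=wall_time,
                                  extras={"batch_size": 128, "epochs": 2})


    if __name__ == "__main__":
        tf.test.main()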

Troubleshooting

  1. tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

    • Make sure CUDA is installed on your machine.
    • Pull the latest tensorflow repo and run ./configure in the root folder of tensorflow. It will help you create the configuration file that describes your local environment. Please check this post for more details.