Merge changes from github.

PiperOrigin-RevId: 183100142
This commit is contained in:
Jianwei Xie 2018-01-24 10:02:35 -08:00 committed by TensorFlower Gardener
parent 7b62a71e2d
commit d9f93c42a5
143 changed files with 5281 additions and 1432 deletions

View File

@ -13,46 +13,146 @@
* [TensorFlow Lite](https://github.com/tensorflow/tensorflow/tree/r1.5/tensorflow/contrib/lite)
dev preview is now available.
* CUDA 9 and cuDNN 7 support.
* Accelerated Linear Algebra (XLA):
* Add `complex64` support to XLA compiler.
* `bfloat` support is now added to XLA infrastructure.
* Make `ClusterSpec` propagation work with XLA devices.
* Use a deterministic executor to generate the XLA graph.
* `tf.contrib`:
* `tf.contrib.distributions`:
* Add `tf.contrib.distributions.Autoregressive`.
* Make `tf.contrib.distributions` QuadratureCompound classes support batch.
* Infer `tf.contrib.distributions.RelaxedOneHotCategorical` `dtype` from arguments.
* Make `tf.contrib.distributions` quadrature family parameterized by
`quadrature_grid_and_prob` vs `quadrature_degree`.
* `auto_correlation` added to `tf.contrib.distributions`
* Add `tf.contrib.bayesflow.layers`, a collection of probabilistic (neural) layers.
* Add `tf.contrib.bayesflow.halton_sequence`.
* Add `tf.contrib.data.make_saveable_from_iterator`.
* Add `tf.contrib.data.shuffle_and_repeat` (see the sketch after this list).
* Add new custom transformation: `tf.contrib.data.scan()`.
* `tf.contrib.distributions.bijectors`:
* Add `tf.contrib.distributions.bijectors.MaskedAutoregressiveFlow`.
* Add `tf.contrib.distributions.bijectors.Permute`.
* Add `tf.contrib.distributions.bijectors.Gumbel`.
* Add `tf.contrib.distributions.bijectors.Reshape`.
* Support shape inference (i.e., shapes containing -1) in the Reshape bijector.
* Add `streaming_precision_recall_at_equal_thresholds`, a method for computing
streaming precision and recall with `O(num_thresholds + size of predictions)`
time and space complexity.
* Change `RunConfig` default behavior to not set a random seed, so that random
behavior is independent across distributed workers. We expect this to
generally improve training performance. Models that do rely on determinism
should set a random seed explicitly.
* Replaced the implementation of `tf.flags` with `absl.flags`.
* Add support for `CUBLAS_TENSOR_OP_MATH` in fp16 GEMM
* Add support for CUDA on NVIDIA Tegra devices
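The fused `tf.contrib.data.shuffle_and_repeat` transformation listed above is applied to a `tf.data.Dataset` via `Dataset.apply`. A minimal sketch, assuming the TF 1.5 contrib API; the input file, buffer size, and epoch count are illustrative placeholders:

```python
import tensorflow as tf

# Hypothetical input file; any tf.data source works the same way.
dataset = tf.data.TFRecordDataset("train.tfrecord")
# Fused shuffle + repeat: shuffle with a 10k-element buffer, repeat for 5 epochs.
dataset = dataset.apply(
    tf.contrib.data.shuffle_and_repeat(buffer_size=10000, count=5))
dataset = dataset.batch(32)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
```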
## Bug Fixes and Other Changes
* Documentation updates:
* Clarified that you can only install TensorFlow on 64-bit machines.
* Added a short doc explaining how `Estimator`s save checkpoints.
* Add documentation for ops supported by the `tf2xla` bridge.
* Fix minor typos in the doc of `SpaceToDepth` and `DepthToSpace`.
* Updated documentation comments in `mfcc_mel_filterbank.h` and `mfcc.h` to
clarify that the input domain is squared magnitude spectra and the weighting
is done on linear magnitude spectra (sqrt of inputs).
* Change `tf.contrib.distributions` docstring examples to use `tfd` alias
rather than `ds`, `bs`.
* Fix docstring typos in `tf.distributions.bijectors.Bijector`.
* `tf.assert_equal` no longer raises `ValueError`. It now raises
`InvalidArgumentError`, as documented.
* Update Getting Started docs and API intro.
* Google Cloud Storage (GCS):
* Add userspace DNS caching for the GCS client.
* Customize request timeouts for the GCS filesystem.
* Improve GCS filesystem caching.
* Bug Fixes:
* Fix bug where partitioned integer variables got the wrong shapes. Before this
change, all partitions of an integer variable were initialized with the shape of
the unpartitioned variable; after this change they are initialized correctly.
* Fix correctness bug in CPU and GPU implementations of Adadelta.
* Fix a bug in `import_meta_graph`'s handling of partitioned variables when
importing into a scope. WARNING: This may break loading checkpoints of
graphs with partitioned variables saved after using `import_meta_graph` with
a non-empty `import_scope` argument.
* Fix bug in offline debugger which prevented viewing events.
* Added the `WorkerService.DeleteWorkerSession` method to the gRPC interface,
to fix a memory leak. Ensure that your master and worker servers are running
the same version of TensorFlow to avoid compatibility issues.
* Fix bug in peephole implementation of BlockLSTM cell.
* Fix bug by casting dtype of `log_det_jacobian` to match `log_prob` in
`TransformedDistribution`.
* Ensure `tf.distributions.Multinomial` doesn't underflow in `log_prob`.
* Other:
* Add necessary shape util support for bfloat16.
* Add a way to run ops using a step function to MonitoredSession.
* Add `DenseFlipout` probabilistic layer.
* A new flag `ignore_live_threads` is available on `train`. If set to `True`, it
will ignore threads that remain running when tearing down infrastructure
after successfully completing training, instead of raising a `RuntimeError`.
* Restandardize `DenseVariational` as simpler template for other probabilistic
layers.
* `tf.data` now supports `tf.SparseTensor` components in dataset elements.
* It is now possible to iterate over `Tensor`s.
* Allow `SparseSegmentReduction` ops to have missing segment IDs.
* Modify custom export strategy to account for multidimensional sparse float
splits.
* `Conv2D`, `Conv2DBackpropInput`, and `Conv2DBackpropFilter` now support arbitrary
dilations with GPU and cuDNN v6 support.
* `Estimator` now supports `Dataset`: `input_fn` can return a `Dataset`
instead of `Tensor`s (see the sketches after this list).
* Add `RevBlock`, a memory-efficient implementation of reversible residual layers.
* Reduce BFCAllocator internal fragmentation.
* Add `cross_entropy` and `kl_divergence` to `tf.distributions.Distribution`.
* Add `tf.nn.softmax_cross_entropy_with_logits_v2` which enables backprop
w.r.t. the labels (see the sketches after this list).
* GPU back-end now uses `ptxas` to compile generated PTX.
* `BufferAssignment`'s protocol buffer dump is now deterministic.
* Change embedding op to use parallel version of `DynamicStitch`.
* Add support for sparse multidimensional feature columns.
* Speed up the case for sparse float columns that have only 1 value.
* Allow sparse float splits to support multivalent feature columns.
* Add `quantile` to `tf.distributions.TransformedDistribution`.
* Add `NCHW_VECT_C` support for `tf.depth_to_space` on GPU.
* Add `NCHW_VECT_C` support for `tf.space_to_depth` on GPU.
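A minimal sketch of the `Estimator`/`Dataset` integration noted in the list above, assuming the TF 1.5 core `tf.estimator` and `tf.data` APIs; the toy feature values, labels, and step count are illustrative:

```python
import tensorflow as tf

def input_fn():
  # input_fn may now return a Dataset directly instead of (features, labels) tensors.
  dataset = tf.data.Dataset.from_tensor_slices(
      ({"x": [[1.0], [2.0], [3.0], [4.0]]}, [0, 0, 1, 1]))
  return dataset.shuffle(4).repeat().batch(2)

feature_columns = [tf.feature_column.numeric_column("x")]
estimator = tf.estimator.LinearClassifier(feature_columns=feature_columns)
estimator.train(input_fn=input_fn, steps=100)
```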
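A minimal sketch of `tf.nn.softmax_cross_entropy_with_logits_v2`, also from the list above; the logits and soft labels are illustrative. Unlike the original op, the gradient with respect to the labels is defined:

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])
labels = tf.constant([[0.7, 0.2, 0.1]])  # soft labels
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)
# With the v1 op the labels gradient would be None; with v2 it is backpropagated.
grad_logits, grad_labels = tf.gradients(loss, [logits, labels])

with tf.Session() as sess:
  print(sess.run([grad_logits, grad_labels]))
```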
## API Changes
* Rename `SqueezeDims` attribute to `Axis` in C++ API for Squeeze op.
* `Stream::BlockHostUntilDone` now returns Status rather than bool.
* Customize request timeouts for the GCS filesystem.
* Minor refactor: move stats files from `stochastic` to `common` and remove
`stochastic`.
## Thanks to our Contributors
This release contains contributions from many people at Google, as well as:
4d55397500, Abdullah Alrasheed, abenmao, Adam Salvail, Aditya Dhulipala, Ag Ramesh,
Akimasa Kimura, Alan Du, Alan Yee, Alexander, Amit Kushwaha, Amy, Andrei Costinescu,
Andrei Nigmatulin, Andrew Erlichson, Andrew Myers, Andrew Stepanov, Androbin, AngryPowman,
Anish Shah, Anton Daitche, Artsiom Chapialiou, asdf2014, Aseem Raj Baranwal, Ash Hall,
Bart Kiers, Batchu Venkat Vishal, ben, Ben Barsdell, Bill Piel, Carl Thomé, Catalin Voss,
Changming Sun, Chengzhi Chen, Chi Zeng, Chris Antaki, Chris Donahue, Chris Oelmueller,
Chris Tava, Clayne Robison, Codrut, Courtial Florian, Dalmo Cirne, Dan J, Darren Garvey,
David Kristoffersson, David Norman, David RöThlisberger, DavidNorman, Dhruv, DimanNe,
Dorokhov, Duncan Mac-Vicar P, EdwardDixon, EMCP, error.d, FAIJUL, Fan Xia,
Francois Xavier, Fred Reiss, Freedom" Koan-Sin Tan, Fritz Obermeyer, Gao, Xiang,
Guenther Schmuelling, Guo Yejun (郭叶军), Hans Gaiser, HectorSVC, Hyungsuk Yoon,
James Pruegsanusak, Jay Young, Jean Wanka, Jeff Carpenter, Jeremy Rutman, Jeroen BéDorf,
Jett Jones, Jimmy Jia, jinghuangintel, jinze1994, JKurland, Joel Hestness, joetoth,
John B Nelson, John Impallomeni, John Lawson, Jonas, Jonathan Dekhtiar, joshkyh, Jun Luan,
Jun Mei, Kai Sasaki, Karl Lessard, karl@kubx.ca, Kb Sriram, Kenichi Ueno, Kevin Slagle,
Kongsea, Lakshay Garg, lhlmgr, Lin Min, liu.guangcong, Loki Der Quaeler, Louie Helm,
lucasmoura, Luke Iwanski, Lyndon White, Mahmoud Abuzaina, Marcel Puyat, Mark Aaron Shirley,
Michele Colombo, MtDersvan, Namrata-Ibm, Nathan Luehr, Naurril, Nayana Thorat, Nicolas Lopez,
Niranjan Hasabnis, Nolan Liu, Nouce, Oliver Hennigh, osdamv, Patrik Erdes,
Patryk Chrabaszcz, Pavel Christof, Penghao Cen, postBG, Qingqing Cao, Qingying Chen, qjivy,
Raphael, Rasmi, raymondxyang, Renze Yu, resec, Roffel, Ruben Vereecken, Ryohei Kuroki,
sandipmgiri, Santiago Castro, Scott Kirkland, Sean Vig, Sebastian Raschka, Sebastian Weiss,
Sergey Kolesnikov, Sergii Khomenko, Shahid, Shivam Kotwalia, Stuart Berg, Sumit Gouthaman,
superzerg, Sven Mayer, tetris, Ti Zhou, Tiago Freitas Pereira, Tian Jin, Tomoaki Oiki,
Vaibhav Sood, vfdev, Vivek Rane, Vladimir Moskva, wangqr, Weber Xie, Will Frey,
Yan Facai (颜发才), yanivbl6, Yaroslav Bulatov, Yixing Lao, Yong Tang, youkaichao,
Yuan (Terry) Tang, Yue Zhang, Yuxin Wu, Ziming Dong, ZxYuan, 黄璞
Adam Zahran, Ag Ramesh, Alan Lee, Alan Yee, Alex Sergeev, Alexander, Amir H. Jadidinejad,
Amy, Anastasios Doumoulakis, Andrei Costinescu, Andrei Nigmatulin, Anthony Platanios,
Anush Elangovan, arixlin, Armen Donigian, ArtëM Sobolev, Atlas7, Ben Barsdell, Bill Prin,
Bo Wang, Brett Koonce, Cameron Thomas, Carl Thomé, Cem Eteke, cglewis, Changming Sun,
Charles Shenton, Chi-Hung, Chris Donahue, Chris Filo Gorgolewski, Chris Hoyean Song,
Chris Tava, Christian Grail, Christoph Boeddeker, cinqS, Clayne Robison, codrut3, concerttttt,
CQY, Dan Becker, Dan Jarvis, Daniel Zhang, David Norman, dmaclach, Dmitry Trifonov,
Donggeon Lim, dongpilYu, Dr. Kashif Rasul, Edd Wilder-James, Eric Lv, fcharras, Felix Abecassis,
FirefoxMetzger, formath, FredZhang, Gaojin Cao, Gary Deer, Guenther Schmuelling, Hanchen Li,
Hanmin Qin, hannesa2, hyunyoung2, Ilya Edrenkin, Jackson Kontny, Jan, Javier Luraschi,
Jay Young, Jayaram Bobba, Jeff, Jeff Carpenter, Jeremy Sharpe, Jeroen BéDorf, Jimmy Jia,
Jinze Bai, Jiongyan Zhang, Joe Castagneri, Johan Ju, Josh Varty, Julian Niedermeier,
JxKing, Karl Lessard, Kb Sriram, Keven Wang, Koan-Sin Tan, Kyle Mills, lanhin, LevineHuang,
Loki Der Quaeler, Loo Rong Jie, Luke Iwanski, LáSzló Csomor, Mahdi Abavisani, Mahmoud Abuzaina,
ManHyuk, Marek ŠUppa, MathSquared, Mats Linander, Matt Wytock, Matthew Daley, Maximilian Bachl,
mdymczyk, melvyniandrag, Michael Case, Mike Traynor, miqlas, Namrata-Ibm, Nathan Luehr,
Nathan Van Doorn, Noa Ezra, Nolan Liu, Oleg Zabluda, opensourcemattress, Ouwen Huang,
Paul Van Eck, peisong, Peng Yu, PinkySan, pks, powderluv, Qiao Hai-Jun, Qiao Longfei,
Rajendra Arora, Ralph Tang, resec, Robin Richtsfeld, Rohan Varma, Ryohei Kuroki, SaintNazaire,
Samuel He, Sandeep Dcunha, sandipmgiri, Sang Han, scott, Scott Mudge, Se-Won Kim, Simon Perkins,
Simone Cirillo, Steffen Schmitz, Suvojit Manna, Sylvus, Taehoon Lee, Ted Chang, Thomas Deegan,
Till Hoffmann, Tim, Toni Kunic, Toon Verstraelen, Tristan Rice, Urs KöSter, Utkarsh Upadhyay,
Vish (Ishaya) Abrams, Winnie Tsang, Yan Chen, Yan Facai (颜发才), Yi Yang, Yong Tang,
Youssef Hesham, Yuan (Terry) Tang, Zhengsheng Wei, zxcqwe4906, 张志豪, 田传武
We are also grateful to all who filed issues or helped resolve them, asked and
answered questions, and were part of inspiring discussions.
@ -60,7 +160,15 @@ answered questions, and were part of inspiring discussions.
# Release 1.4.1
## Bug Fixes and Other Changes
* `LinearClassifier` fix for the Google Cloud Machine Learning Engine.
* `LinearClassifier` fix.
# Release 1.4.0
## Major Features And Improvements
* `tf.keras` is now part of the core TensorFlow API.
* [`tf.data`](http://tensorflow.org/programmers_guide/datasets) is now part of
the core TensorFlow API.
* The API is now subject to backwards compatibility guarantees.
# Release 1.4.0

View File

@ -2,11 +2,11 @@ workspace(name = "org_tensorflow")
http_archive(
name = "io_bazel_rules_closure",
sha256 = "110fe68753413777944b473c25eed6368c4a0487cee23a7bac1b13cc49d3e257",
strip_prefix = "rules_closure-4af89ef1db659eb41f110df189b67d4cf14073e1",
sha256 = "6691c58a2cd30a86776dd9bb34898b041e37136f2dc7e24cadaeaf599c95c657",
strip_prefix = "rules_closure-08039ba8ca59f64248bb3b6ae016460fe9c9914f",
urls = [
"https://mirror.bazel.build/github.com/bazelbuild/rules_closure/archive/4af89ef1db659eb41f110df189b67d4cf14073e1.tar.gz",
"https://github.com/bazelbuild/rules_closure/archive/4af89ef1db659eb41f110df189b67d4cf14073e1.tar.gz", # 2017-08-28
"https://mirror.bazel.build/github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz",
"https://github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz", # 2018-01-16
],
)

View File

@ -662,6 +662,9 @@ filegroup(
"//tensorflow/tools/quantization:all_files",
"//tensorflow/tools/test:all_files",
"//tensorflow/user_ops:all_files",
"//third_party/eigen3:all_files",
"//third_party/fft2d:all_files",
"//third_party/flatbuffers:all_files",
"//third_party/hadoop:all_files",
"//third_party/sycl:all_files",
"//third_party/sycl/sycl:all_files",

View File

@ -190,6 +190,7 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
bool is_same_device =
(srcd == dstd) || (DeviceName(srcd) == DeviceName(dstd));
const bool dst_cpu = IsCPU(dstd);
const bool src_cpu = IsCPU(srcd);
if (is_same_device) {
return new TFE_TensorHandle(h->t, dst_cpu ? nullptr : dstd);
}
@ -213,7 +214,7 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
return new TFE_TensorHandle(dst, dst_cpu ? nullptr : dstd);
}
tensorflow::DeviceContext* src_device_context = nullptr;
if (!IsCPU(srcd)) {
if (!src_cpu) {
src_device_context = srcd->tensorflow_gpu_device_info()->default_context;
}
tensorflow::DeviceContext* dst_device_context = nullptr;

View File

@ -595,6 +595,11 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousSingleElement) {
// Single element, no wrap.
std::vector<int32> operand_shape({4, 5, 2});
RunR3Contiguous<float>(operand_shape, /*index=*/1, /*size=*/1);
}
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousSingleElementBF16) {
// Single element, no wrap.
std::vector<int32> operand_shape({4, 5, 2});
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/1, /*size=*/1);
}
@ -602,6 +607,11 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousMultipleElements) {
// Multiple element, no wrap.
std::vector<int32> operand_shape({4, 5, 2});
RunR3Contiguous<float>(operand_shape, /*index=*/1, /*size=*/2);
}
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousMultipleElementsBF16) {
// Multiple element, no wrap.
std::vector<int32> operand_shape({4, 5, 2});
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/1, /*size=*/2);
}
@ -609,6 +619,11 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousMultipleWrapping) {
// Multiple element, wrapping.
std::vector<int32> operand_shape({4, 5, 2});
RunR3Contiguous<float>(operand_shape, /*index=*/3, /*size=*/2);
}
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousMultipleWrappingBF16) {
// Multiple element, wrapping.
std::vector<int32> operand_shape({4, 5, 2});
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/3, /*size=*/2);
}
@ -616,12 +631,21 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousTooLarge) {
// Multiple element, update size larger than operand.
std::vector<int32> operand_shape({4, 5, 2});
RunR3Contiguous<float>(operand_shape, /*index=*/5, /*size=*/2);
}
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousTooLargeBF16) {
// Multiple element, update size larger than operand.
std::vector<int32> operand_shape({4, 5, 2});
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/5, /*size=*/2);
}
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousUnaligned) {
std::vector<int32> operand_shape({3, 123, 247});
RunR3Contiguous<float>(operand_shape, /*index=*/1, /*size=*/1);
}
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousUnalignedBF16) {
std::vector<int32> operand_shape({3, 123, 247});
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/1, /*size=*/1);
}
@ -629,6 +653,10 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousUnaligned) {
XLA_TEST_F(DynamicUpdateSliceTest, DISABLED_ON_GPU(R3ContiguousLarger)) {
std::vector<int32> operand_shape({32, 128, 1024});
RunR3Contiguous<float>(operand_shape, /*index=*/7, /*size=*/1);
}
XLA_TEST_F(DynamicUpdateSliceTest, DISABLED_ON_GPU(R3ContiguousLargerBF16)) {
std::vector<int32> operand_shape({32, 128, 1024});
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/7, /*size=*/1);
}

View File

@ -30,7 +30,7 @@ bindings.
* CMake version 3.5 or later.
* [Git](http://git-scm.com)
* [Git](https://git-scm.com)
* [SWIG](http://www.swig.org/download.html)
@ -48,7 +48,7 @@ bindings.
* Microsoft Windows 10
- Microsoft Visual Studio Enterprise 2015 with Visual C++ 2015
- [Anaconda 4.1.1 (Python 3.5 64-bit)](https://www.continuum.io/downloads)
- [Anaconda 4.1.1 (Python 3.5 64-bit)](https://www.anaconda.com/download/)
- [Git for Windows version 2.9.2.windows.1](https://git-scm.com/download/win)
- [swigwin-3.0.10](http://www.swig.org/download.html)
- [NVidia CUDA Toolkit 8.0](https://developer.nvidia.com/cuda-downloads)

View File

@ -47,4 +47,4 @@ ExternalProject_Add(snappy
)
# actually enables snappy in the source code
add_definitions(-DTF_USE_SNAPPY)
add_definitions(-DTF_USE_SNAPPY)

View File

@ -1,3 +1,5 @@
# python_sanity_test.py will complain about invalid or missing entries
# problematic entries can be commented for temporary whitelisting
tensorflow
tensorflow/core
tensorflow/core/example
@ -307,6 +309,8 @@ tensorflow/contrib/metrics
tensorflow/contrib/metrics/python
tensorflow/contrib/metrics/python/metrics
tensorflow/contrib/metrics/python/ops
tensorflow/contrib/mpi_collectives/python
tensorflow/contrib/mpi_collectives/python/ops
tensorflow/contrib/model_pruning
tensorflow/contrib/model_pruning/examples
tensorflow/contrib/model_pruning/examples/cifar10

View File

@ -0,0 +1,124 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
Complain about invalid or missing entries in python_*.txt files.
Problematic entries can be commented for temporary whitelisting.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import unittest
def abs_path(path):
root = os.path.dirname(__file__)
for _ in range(3):
root = os.path.join(root, os.pardir)
path = os.path.join(root, path)
path = os.path.abspath(path)
return path
def read_entries(test):
with open(abs_path(test.entries_file), "r") as f:
lines = f.readlines()
lines = [line.strip() for line in lines]
lines = [line for line in lines if line]
test.entries = []
test.whitelist = []
for line in lines:
# line is comment
if line.startswith('#'):
line = line[1:].strip()
# whitelist entry
if line.startswith('tensorflow/'):
test.whitelist.append(line)
# line has comment -> strip comment
elif line.find('#') != -1:
line = line[:line.find('#')].strip()
test.entries.append(line)
else:
test.entries.append(line)
def test_invalid_directories(test):
for entry in test.entries:
if not os.path.isdir(abs_path(entry)):
problem = "'" + test.entries_file + "' contains invalid '" + entry + "'"
solution = "Please remove the invalid entry (or add the missing directory)."
raise AssertionError(problem + "\n" + solution)
def test_missing_directory(test, path):
if path in test.whitelist:
return
dir_exists = os.path.isdir(abs_path(path))
entry_exists = path in test.entries
if dir_exists and not entry_exists:
problem = "'" + test.entries_file + "' is missing '" + path + "'"
solution = "Please add the missing entry (comment to whitelist if needed)."
raise AssertionError(problem + "\n" + solution)
class PythonModuleTest(unittest.TestCase):
def setUp(self):
self.entries_file = "tensorflow/contrib/cmake/python_modules.txt"
read_entries(self)
def testInvalidEntries(self):
test_invalid_directories(self)
def testMissingModules(self):
module_names = next(os.walk(abs_path("tensorflow/contrib")))[1]
for module_name in module_names:
path = "tensorflow/contrib/" + module_name
test_missing_directory(self, path + "/python")
test_missing_directory(self, path + "/python/ops")
test_missing_directory(self, path + "/python/kernels")
test_missing_directory(self, path + "/python/layers")
class PythonProtoTest(unittest.TestCase):
def setUp(self):
self.entries_file = "tensorflow/contrib/cmake/python_protos.txt"
read_entries(self)
def testInvalidEntries(self):
test_invalid_directories(self)
class PythonProtoCCTest(unittest.TestCase):
def setUp(self):
self.entries_file = "tensorflow/contrib/cmake/python_protos_cc.txt"
read_entries(self)
def testInvalidEntries(self):
test_invalid_directories(self)
if __name__ == "__main__":
unittest.main()

View File

@ -126,7 +126,9 @@ endfunction()
file(GLOB_RECURSE tf_protos_cc_srcs RELATIVE ${tensorflow_source_dir}
"${tensorflow_source_dir}/tensorflow/core/*.proto"
"${tensorflow_source_dir}/tensorflow/contrib/boosted_trees/proto/*.proto"
"${tensorflow_source_dir}/tensorflow/contrib/tpu/proto/*.proto"
)
RELATIVE_PROTOBUF_GENERATE_CPP(PROTO_SRCS PROTO_HDRS
${tensorflow_source_dir} ${tf_protos_cc_srcs}
)

View File

@ -126,7 +126,8 @@ STRING(REGEX REPLACE ";" "\\\\;" python_protos "${python_protos}")
STRING(REGEX REPLACE "\n" ";" python_protos "${python_protos}")
foreach(python_proto ${python_protos})
if(NOT python_proto MATCHES "\#")
if(NOT python_proto MATCHES "^\#")
STRING(REGEX REPLACE " *\#.*" "" python_proto "${python_proto}")
if(NOT EXISTS "${tensorflow_source_dir}/${python_proto}")
message(SEND_ERROR "Python proto directory not found: ${python_proto}")
endif()
@ -147,7 +148,8 @@ STRING(REGEX REPLACE ";" "\\\\;" python_protos_cc "${python_protos_cc}")
STRING(REGEX REPLACE "\n" ";" python_protos_cc "${python_protos_cc}")
foreach(python_proto_cc ${python_protos_cc})
if(NOT python_proto_cc MATCHES "\#")
if(NOT python_proto_cc MATCHES "^\#")
STRING(REGEX REPLACE " *\#.*" "" python_proto_cc "${python_proto_cc}")
if(NOT EXISTS "${tensorflow_source_dir}/${python_proto_cc}")
message(SEND_ERROR "Python proto CC directory not found: ${python_proto_cc}")
endif()
@ -209,7 +211,8 @@ STRING(REGEX REPLACE ";" "\\\\;" python_modules "${python_modules}")
STRING(REGEX REPLACE "\n" ";" python_modules "${python_modules}")
foreach(python_module ${python_modules})
if(NOT python_module MATCHES "\#")
if(NOT python_module MATCHES "^\#")
STRING(REGEX REPLACE " *\#.*" "" python_module "${python_module}")
if(NOT EXISTS "${tensorflow_source_dir}/${python_module}")
message(SEND_ERROR "Python module not found: ${python_module}")
endif()

View File

@ -649,7 +649,7 @@ class CudnnRNNParamsToCanonical<GPUDevice, T> : public CudnnRNNKernelCommon {
}
const int num_params_per_layer = num_params_ / num_layers / num_dirs;
// Number of params applied on inputs. The rest are applied on recurrent
// hiddden states.
// hidden states.
const int num_params_input_state = num_params_per_layer / 2;
CHECK(num_params_ % (num_layers * num_dirs) == 0)
<< "Number of params is not a multiple of num_layers * num_dirs.";

View File

@ -1542,7 +1542,7 @@ class _CudnnRNNNoInputC(_CudnnRNN):
params: the parameter buffer created for this model.
is_training: whether this operation will be used in training or inference.
Returns:
output: the output sequuence.
output: the output sequence.
output_h: the final state for h.
"""
return _cudnn_rnn_no_input_c(

View File

@ -292,7 +292,7 @@ def loss(weight, bias):
error = prediction(training_inputs, weight, bias) - training_outputs
return tf.reduce_mean(tf.square(error))
# Function that returns the the derivative of loss with respect to
# Function that returns the derivative of loss with respect to
# weight and bias
grad = tfe.gradients_function(loss)

View File

@ -100,7 +100,7 @@ def add_metrics(estimator, metric_fn):
def clip_gradients_by_norm(optimizer, clip_norm):
"""Returns an optimizer which clips gradients before appliying them.
"""Returns an optimizer which clips gradients before applying them.
Example:

View File

@ -82,7 +82,9 @@ std::vector<string> FfmpegVideoCommandLine(const string& input_filename,
"-probesize",
StrCat(kDefaultProbeSize),
"-loglevel",
"error", // Print errors only.
// Info is needed to get the information about stream, etc.
// It is generated to a separate file, not stdout/stderr.
"info",
"-hide_banner", // Skip printing build options, version, etc.
"-vcodec",
"rawvideo",

View File

@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Script Language Operators. See the @{$python/script_ops} guide.
@@py_func

View File

@ -279,14 +279,16 @@ def acgan_model(
generator_inputs = _convert_tensor_or_l_or_d(generator_inputs)
generated_data = generator_fn(generator_inputs)
with variable_scope.variable_scope(discriminator_scope) as dis_scope:
(discriminator_gen_outputs, discriminator_gen_classification_logits
) = _validate_acgan_discriminator_outputs(
discriminator_fn(generated_data, generator_inputs))
with ops.name_scope(dis_scope.name+'/generated/'):
(discriminator_gen_outputs, discriminator_gen_classification_logits
) = _validate_acgan_discriminator_outputs(
discriminator_fn(generated_data, generator_inputs))
with variable_scope.variable_scope(dis_scope, reuse=True):
real_data = ops.convert_to_tensor(real_data)
(discriminator_real_outputs, discriminator_real_classification_logits
) = _validate_acgan_discriminator_outputs(
discriminator_fn(real_data, generator_inputs))
with ops.name_scope(dis_scope.name+'/real/'):
real_data = ops.convert_to_tensor(real_data)
(discriminator_real_outputs, discriminator_real_classification_logits
) = _validate_acgan_discriminator_outputs(
discriminator_fn(real_data, generator_inputs))
if check_shapes:
if not generated_data.shape.is_compatible_with(real_data.shape):
raise ValueError(

View File

@ -479,8 +479,12 @@ def batch_norm(inputs,
Sergey Ioffe, Christian Szegedy
Can be used as a normalizer function for conv2d and fully_connected.
Can be used as a normalizer function for conv2d and fully_connected. The
normalization is over all but the last dimension if `data_format` is `NHWC`
and all but the second dimension if `data_format` is `NCHW`. In case of a 2D
tensor this corresponds to the batch dimension, while in case of a 4D tensor this
corresponds to the batch and space dimensions.
Note: when training, the moving_mean and moving_variance need to be updated.
By default the update ops are placed in `tf.GraphKeys.UPDATE_OPS`, so they
need to be added as a dependency to the `train_op`. For example:

View File

@ -22,7 +22,14 @@ cd "$SCRIPT_DIR/../../.."
DOWNLOADS_DIR=tensorflow/contrib/lite/downloads
BZL_FILE_PATH=tensorflow/workspace.bzl
EIGEN_URL="$(grep -o 'http.*bitbucket.org/eigen/eigen/get/.*tar\.gz' "${BZL_FILE_PATH}" | grep -v bazel-mirror | head -n1)"
# Ensure it is being run from repo root
if [ ! -f $BZL_FILE_PATH ]; then
echo "Could not find ${BZL_FILE_PATH}":
echo "Likely you are not running this from the root directory of the repository.";
exit 1;
fi
EIGEN_URL="$(grep -o 'http.*bitbucket.org/eigen/eigen/get/.*tar\.gz' "${BZL_FILE_PATH}" | grep -v mirror.bazel | head -n1)"
GEMMLOWP_URL="$(grep -o 'https://mirror.bazel.build/github.com/google/gemmlowp/.*zip' "${BZL_FILE_PATH}" | head -n1)"
GOOGLETEST_URL="https://github.com/google/googletest/archive/release-1.8.0.tar.gz"
ABSL_URL="$(grep -o 'https://github.com/abseil/abseil-cpp/.*tar.gz' "${BZL_FILE_PATH}" | head -n1)"

View File

@ -42,9 +42,10 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
TF_LITE_ENSURE_EQ(context, positions->type, kTfLiteInt32);
// Check that input and output types match.
TF_LITE_ENSURE_EQ(context, input->type, output->type);
// TODO(mgubin): only 1D positions are currently supported.
TF_LITE_ENSURE_EQ(context, NumDimensions(positions), 1);
// TODO(mgubin): only 0D or 1D positions are currently supported.
TF_LITE_ENSURE(context, NumDimensions(positions) <= 1);
// TODO(mgubin): Only default axis == 0 is supported.
TF_LITE_ENSURE_EQ(context, params->axis, 0);
// Check conditions for different types.
switch (input->type) {
case kTfLiteFloat32:
@ -64,7 +65,7 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
}
const int num_dimensions =
NumDimensions(input) + NumDimensions(positions) - 1;
TF_LITE_ENSURE(context, params->axis < num_dimensions);
TF_LITE_ENSURE(context, params->axis <= num_dimensions);
TfLiteIntArray* output_shape = TfLiteIntArrayCreate(num_dimensions);
int output_index = 0;
for (int i = 0; i < params->axis; ++i) {

View File

@ -48,8 +48,8 @@ class GatherOpModel : public SingleOpModel {
PopulateStringTensor(input_, data);
}
void SetPositions(std::initializer_list<int32> data) {
PopulateTensor<int32>(positions_, data);
void SetPositions(std::initializer_list<int> data) {
PopulateTensor<int>(positions_, data);
}
std::vector<float> GetOutputFloat() { return ExtractVector<float>(output_); }
@ -76,6 +76,29 @@ TEST(GatherOpTest, Shuffle) {
ElementsAreArray(ArrayFloatNear({0.7, 0.8, -2, 0.2})));
}
TEST(GatherOpTest, Test0DIndex) {
GatherOpModel m({2, 2}, TensorType_FLOAT32, {});
m.SetInputFloat({-2.0, 0.2, 0.7, 0.8});
m.SetPositions({1});
m.Invoke();
EXPECT_THAT(m.GetOutputFloat(),
ElementsAreArray(ArrayFloatNear({0.7, 0.8})));
EXPECT_THAT(m.GetOutputShape(),
ElementsAreArray({2}));
}
TEST(GatherOpTest, Test0DIndexWith0DResult) {
// 0D tensor is special case in current TFLite. Test it once to make sure
// existing workarounds are fine with it.
GatherOpModel m({3}, TensorType_FLOAT32, {});
m.SetInputFloat({1.0, 2.0, 3.0});
m.SetPositions({1});
m.Invoke();
EXPECT_THAT(m.GetOutputFloat(),
ElementsAreArray(ArrayFloatNear({2.0})));
EXPECT_TRUE(m.GetOutputShape().empty());
}
TEST(FloatGatherOpTest, Duplicate) {
GatherOpModel m({1, 2, 2}, TensorType_FLOAT32, {2});
m.SetInputFloat({-2.0, 0.2, 0.7, 0.8});

View File

@ -53,7 +53,7 @@ with the corresponding parameters as shown in the figure.
### Automatic Speech Recognizer (ASR) Acoustic Model (AM)
The acoustic model for automatic speech recognition is the neural network model
for matching phonemes to the input autio features. It generates posterior
for matching phonemes to the input audio features. It generates posterior
probabilities of phonemes from speech frontend features (log-mel filterbanks).
It has an input size of 320 (float), an output size of 42 (float), five LSTM
layers and one fully connected layer with a Softmax activation function, with
@ -68,7 +68,7 @@ for predicting the probability of a word given previous words in a sentence.
It generates posterior probabilities of the next word based on a sequence of
words. The words are encoded as indices in a fixed size dictionary.
The model has two inputs both of size one (integer): the current word index and
next word index, an output size of one (float): the log probability. It consits
next word index, an output size of one (float): the log probability. It consists
of three embedding layers, three LSTM layers, followed by a multiplication, a
fully connected layer and an addition.
The corresponding parameters are shown in the figure.

View File

@ -370,7 +370,7 @@ enum {
* Looks up items from a given tensor.
*
* Each item in the output is a raw copy of the corresponding item in
* the input values. If the the given lookup indices are out of bounds,
* the input values. If the given lookup indices are out of bounds,
* the op will fail and an error will be reported.
*
* Inputs:

View File

@ -1170,7 +1170,7 @@ def make_pad_tests(zip_path):
def make_reshape_tests(zip_path):
"""Make a set of tests to do reshape."""
# Alll shapes below are suitable for tensors with 420 elements.
# All shapes below are suitable for tensors with 420 elements.
test_parameters = [{
"dtype": [tf.float32, tf.int32],
"input_shape": [[3, 4, 5, 7], [4, 105], [21, 5, 2, 2], [420]],

View File

@ -229,7 +229,7 @@ additional information about the multiple input arrays:
well-formed quantized representation of these graphs. Such graphs should be
fixed, but as a temporary work-around, setting this
reorder_across_fake_quant flag allows the converter to perform necessary
graph transformaitons on them, at the cost of no longer faithfully matching
graph transformations on them, at the cost of no longer faithfully matching
inference and training arithmetic.
### Logging flags

View File

@ -27,6 +27,27 @@ tf_cc_binary(
],
)
tf_cc_binary(
name = "benchmark_model",
srcs = ["benchmark_model.cc"],
linkopts = select({
"//tensorflow:android": [
"-pie",
"-landroid",
"-lm",
"-z defs",
"-Wl,--exclude-libs,ALL", # Exclude syms in all libs from auto export
],
"//conditions:default": [],
}),
deps = [
":mutable_op_resolver",
"//tensorflow/contrib/lite:framework",
"//tensorflow/contrib/lite:string_util",
"//tensorflow/contrib/lite/kernels:builtin_ops",
],
)
cc_library(
name = "gen_op_registration",
srcs = ["gen_op_registration.cc"],

View File

@ -63,12 +63,17 @@ download_and_extract() {
elif [[ "${url}" == *zip ]]; then
tempdir=$(mktemp -d)
tempdir2=$(mktemp -d)
wget -P ${tempdir} ${url}
unzip ${tempdir}/* -d ${tempdir2}
if [[ "$OSTYPE" == "darwin"* ]]; then
# macOS (AKA darwin) doesn't have wget.
(cd "${tempdir}"; curl --remote-name --silent --location "${url}")
else
wget -P "${tempdir}" "${url}"
fi
unzip "${tempdir}"/* -d "${tempdir2}"
# unzip has no strip components, so unzip to a temp dir, and move the files
# we want from the tempdir to destination.
cp -R ${tempdir2}/*/* ${dir}/
rm -rf ${tempdir2} ${tempdir}
cp -R "${tempdir2}"/*/* "${dir}"/
rm -rf "${tempdir2}" "${tempdir}"
fi
# Delete any potential BUILD files, which would interfere with Bazel builds.

View File

@ -19,6 +19,7 @@ py_library(
"python/training/elastic_average_optimizer.py",
"python/training/external_optimizer.py",
"python/training/lazy_adam_optimizer.py",
"python/training/model_average_optimizer.py",
"python/training/moving_average_optimizer.py",
"python/training/multitask_optimizer_wrapper.py",
"python/training/nadam_optimizer.py",
@ -193,6 +194,27 @@ tf_py_test(
],
)
tf_py_test(
name = "model_average_optimizer_test",
srcs = ["python/training/model_average_optimizer_test.py"],
additional_deps = [
":opt_py",
"//tensorflow/python:client",
"//tensorflow/python:client_testlib",
"//tensorflow/python:array_ops",
"//tensorflow/python:variables",
"//tensorflow/python:framework",
"//tensorflow/python:platform",
"//tensorflow/python:training",
"//tensorflow/python:ops",
"//tensorflow/python:framework_for_generated_wrappers",
"//third_party/py/numpy",
],
tags = [
"notap", # This test launches local server.
],
)
py_test(
name = "sign_decay_test",
srcs = ["python/training/sign_decay_test.py"],

View File

@ -29,6 +29,7 @@ from tensorflow.contrib.opt.python.training.nadam_optimizer import *
from tensorflow.contrib.opt.python.training.powersign import *
from tensorflow.contrib.opt.python.training.variable_clipping_optimizer import *
from tensorflow.contrib.opt.python.training.elastic_average_optimizer import *
from tensorflow.contrib.opt.python.training.model_average_optimizer import *
# pylint: enable=wildcard-import
from tensorflow.python.util.all_util import remove_undocumented
@ -48,7 +49,9 @@ _allowed_symbols = [
'MultitaskOptimizerWrapper',
'clip_gradients_by_global_norm',
'ElasticAverageOptimizer',
'ElasticAverageCustomGetter'
'ElasticAverageCustomGetter',
'ModelAverageOptimizer',
'ModelAverageCustomGetter'
]
remove_undocumented(__name__, _allowed_symbols)

View File

@ -0,0 +1,299 @@
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Wrapper optimizer for Model Average """
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from tensorflow.python.framework import ops
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import constant_op
from tensorflow.python.training import optimizer
from tensorflow.python.training import session_run_hook
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import variable_scope
from tensorflow.python.ops import variables
from tensorflow.python.ops import state_ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import data_flow_ops
GLOBAL_VARIABLE_NAME = 'global_center_variable'
class ModelAverageCustomGetter(object):
"""Custom_getter class is used to do:
1. Change trainable variables to local collection and place them at worker
device
2. Generate global variables
Notice that the class should be used with tf.replica_device_setter,
so that the global center variables and global step variable can be placed
at ps device. Besides, use 'tf.get_variable' instead of 'tf.Variable' to
use this custom getter.
For example,
ma_custom_getter = ModelAverageCustomGetter(worker_device)
with tf.device(
tf.train.replica_device_setter(
worker_device=worker_device,
ps_device="/job:ps/cpu:0",
cluster=cluster)),
tf.variable_scope('',custom_getter=ma_custom_getter):
hid_w = tf.get_variable(
initializer=tf.truncated_normal(
[IMAGE_PIXELS * IMAGE_PIXELS, FLAGS.hidden_units],
stddev=1.0 / IMAGE_PIXELS),
name="hid_w")
hid_b = tf.get_variable(initializer=tf.zeros([FLAGS.hidden_units]),
name="hid_b")
"""
def __init__(self, worker_device):
"""Create a new `ElasticAverageCustomGetter`.
Args:
worker_device: String. Name of the `worker` job.
"""
self._worker_device = worker_device
self._local_2_global = {}
def __call__(self, getter, name, trainable, collections, *args, **kwargs):
if trainable:
with ops.device(self._worker_device):
local_var = getter(name, trainable=True,
collections=[ops.GraphKeys.LOCAL_VARIABLES],
*args, **kwargs)
global_variable = variable_scope.variable(
name='%s/%s' % (GLOBAL_VARIABLE_NAME, name),
initial_value=local_var.initialized_value(),
trainable=False,
collections=[ops.GraphKeys.GLOBAL_VARIABLES])
self._local_2_global[local_var] = global_variable
return local_var
else:
return getter(name, trainable, collections, *args, **kwargs)
class ModelAverageOptimizer(optimizer.Optimizer):
"""Wrapper optimizer that implements the Model Average algorithm.
This is a sync optimizer. During the training, each worker will update
the local variables and maintains its own local_step, which starts from 0
and is incremented by 1 after each update of local variables. Whenever the
interval_steps divides the local step, the local variables from all the
workers will be averaged and assigned to global center variables. Then the
local variables will be assigned by global center variables.
"""
def __init__(
self,
opt,
num_worker,
is_chief,
ma_custom_getter,
interval_steps=100,
use_locking=True,
name="ModelAverageOptimizer"):
"""Construct a new model average optimizer.
Args:
opt: The actual optimizer that will be used to update local variables
num_worker: The number of workers
is_chief: whether chief worker
ma_custom_getter: ModelAverageCustomGetter
interval_steps: An int value that controls the frequency of the
averaging of local variables.
use_locking: If True use locks for update operations
name: string. Optional name of the returned operation
"""
super(ModelAverageOptimizer, self).__init__(use_locking, name)
self._opt = opt
self._num_worker = num_worker
self._is_chief = is_chief
self._local_2_global = ma_custom_getter._local_2_global
self._interval_steps = interval_steps
self._accumulator_list = []
self._chief_init_op = None
self._local_step = variable_scope.get_variable(
initializer=0,
trainable=False,
collections=[ops.GraphKeys.LOCAL_VARIABLES],
name="local_step")
self._opt._prepare()
def compute_gradients(self, *args, **kwargs):
"""Compute gradients of "loss" for the variables in "var_list".
This simply wraps the compute_gradients() from the real optimizer.
Args:
*args: Arguments for compute_gradients().
**kwargs: Keyword arguments for compute_gradients().
Returns:
A list of (gradient, variable) pairs.
"""
return self._opt.compute_gradients(*args, **kwargs)
def _local_vars_update(self, var_list):
"""Get the update ops for the local variables in "var_list".
Args:
var_list: Optional list or tuple of 'tf.Variable' to update
Returns:
An update op
"""
if not var_list:
raise ValueError(
'The list of local_variables should not be empty')
update_ops = []
global_center_vars = [self._local_2_global[var] for var in var_list]
for lvar, gvar in zip(var_list, global_center_vars):
with ops.device(lvar.device):
update_ops.append(state_ops.assign(lvar, gvar.read_value()))
return control_flow_ops.group(*(update_ops))
def apply_gradients(self, grads_and_vars, global_step=None, name=None):
"""Apply gradients to variables.
This contains most of the synchronization implementation and also wraps the
apply_gradients() from the real optimizer. The chief worker updates the global
variables.
Args:
grads_and_vars: List of (gradient, variable) pairs as returned by
compute_gradients().
global_step: Optional Variable to increment by one after the
variables have been updated.
name: Optional name for the returned operation. Default to the
name passed to the Optimizer constructor.
Returns:
A conditional 'Operation' that updates both local and global variables or
just local variables
Raises:
ValueError: If the grads_and_vars is empty.
ValueError: If global step is not provided, the staleness cannot be
checked.
"""
# update local variables
if not grads_and_vars:
raise ValueError("Must supply at least one variable")
if global_step is None:
raise ValueError("Global step is required")
apply_updates = self._opt.apply_gradients(grads_and_vars)
with ops.control_dependencies([apply_updates]):
local_update = state_ops.assign_add(
self._local_step, 1, name='local_step_update').op
# update global variables.
def _Update_global_variables():
local_vars = [v for g, v in grads_and_vars if g is not None]
global_vars = [self._local_2_global[v] for v in local_vars]
# sync queue
with ops.colocate_with(global_step):
sync_queue = data_flow_ops.FIFOQueue(-1, [dtypes.bool], shapes=[[]],
shared_name='sync_queue')
train_ops = []
aggregated_vars = []
with ops.name_scope(None, self._name + '/global'):
for var, gvar in zip(local_vars, global_vars):
with ops.device(gvar.device):
if isinstance(var._ref(), ops.Tensor):
var_accum = data_flow_ops.ConditionalAccumulator(
var.dtype,
shape=var.get_shape(),
shared_name=gvar.name + "/var_accum")
train_ops.append(
var_accum.apply_grad(var._ref(), local_step=global_step))
aggregated_vars.append(var_accum.take_grad(self._num_worker))
else:
raise ValueError("Unknown local variable type!")
self._accumulator_list.append((var_accum, gvar.device))
# chief worker updates global vars and enqueues tokens to the sync queue
if self._is_chief:
update_ops = []
with ops.control_dependencies(train_ops):
for avg_var, gvar in zip(aggregated_vars, global_vars):
with ops.device(gvar.device):
update_ops.append(state_ops.assign(gvar, avg_var))
with ops.device(global_step.device):
update_ops.append(state_ops.assign_add(global_step, 1))
with ops.control_dependencies(update_ops), ops.device(
global_step.device):
tokens = array_ops.fill([self._num_worker - 1],
constant_op.constant(False))
sync_op = sync_queue.enqueue_many(tokens)
else:
with ops.control_dependencies(train_ops), ops.device(
global_step.device):
sync_op = sync_queue.dequeue()
with ops.control_dependencies([sync_op]):
local_update_op = self._local_vars_update(local_vars)
return local_update_op
with ops.control_dependencies([local_update]):
condition = math_ops.equal(math_ops.mod(
self._local_step, self._interval_steps), 0)
conditional_update = control_flow_ops.cond(
condition, _Update_global_variables, control_flow_ops.no_op)
chief_init_ops = []
for accum, dev in self._accumulator_list:
with ops.device(dev):
chief_init_ops.append(
accum.set_global_step(
global_step, name="SetGlobalStep"))
self._chief_init_op = control_flow_ops.group(*(chief_init_ops))
return conditional_update
def get_init_op(self):
"""Returns the op to let all the local variables equal to the global
variables before the training begins"""
return self._local_vars_update(variables.trainable_variables())
def make_session_run_hook(self):
"""Creates a hook to handle ModelAverage ops such as initialization."""
return _ModelAverageOptimizerHook(self, self._is_chief)
class _ModelAverageOptimizerHook(session_run_hook.SessionRunHook):
def __init__(self, ma_optimizer, is_chief):
"""Creates hook to handle ModelAverageOptimizer initialization ops.
Args:
ma_optimizer: `ModelAverageOptimizer` which this hook will initialize.
is_chief: `Bool`, whether is this a chief replica or not.
"""
self._ma_optimizer = ma_optimizer
self._is_chief = is_chief
def begin(self):
self._local_init_op = variables.local_variables_initializer()
self._global_init_op = None
if self._is_chief:
self._global_init_op = variables.global_variables_initializer()
self._chief_init_op = self._ma_optimizer._chief_init_op
self._variable_init_op = self._ma_optimizer.get_init_op()
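A condensed usage sketch of the new `ModelAverageOptimizer`, pieced together from the `ModelAverageCustomGetter` docstring above and the unit test that follows. The cluster addresses, the toy variable and gradient, and `interval_steps` are illustrative assumptions, and the snippet presumes ps and worker servers are already running at those addresses:

```python
import tensorflow as tf
from tensorflow.contrib.opt import ModelAverageCustomGetter, ModelAverageOptimizer

# Assumed cluster layout; in practice this comes from your own configuration.
cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                "worker": ["worker0:2222", "worker1:2222"]})
task_index, num_workers = 0, 2
is_chief = (task_index == 0)
worker_device = "/job:worker/task:%d/cpu:0" % task_index

ma_getter = ModelAverageCustomGetter(worker_device=worker_device)
with tf.device(tf.train.replica_device_setter(
    cluster=cluster, worker_device=worker_device, ps_device="/job:ps")), \
    tf.variable_scope("", custom_getter=ma_getter):
  # tf.Variable (not tf.get_variable) so the global step bypasses the custom
  # getter, matching the unit test below.
  global_step = tf.Variable(0, name="global_step", trainable=False)
  # Toy "model": one trainable variable with a constant gradient.
  w = tf.get_variable(initializer=0.0, name="w")
  grad = tf.constant(-1.0)

opt = ModelAverageOptimizer(
    opt=tf.train.GradientDescentOptimizer(1.0),
    num_worker=num_workers,
    is_chief=is_chief,
    ma_custom_getter=ma_getter,
    interval_steps=100)  # average local variables every 100 local steps
train_op = opt.apply_gradients([(grad, w)], global_step=global_step)
ma_hook = opt.make_session_run_hook()

with tf.train.MonitoredTrainingSession(
    master="grpc://worker%d:2222" % task_index,
    is_chief=is_chief,
    hooks=[ma_hook]) as sess:
  for _ in range(1000):
    sess.run(train_op)
```

Non-chief workers build the same graph with `is_chief=False`; every `interval_steps` local steps the workers' variables are averaged into the global center variables and then copied back to the locals, as described in the class docstring.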

View File

@ -0,0 +1,200 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for ModelAverageOptimizer."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import portpicker
from tensorflow.python.framework import constant_op
from tensorflow.python.framework import ops
from tensorflow.python.ops import variables
from tensorflow.python.platform import test
from tensorflow.python.training import gradient_descent
from tensorflow.python.training import server_lib
from tensorflow.python.training import training
from tensorflow.python.training import training_util
from tensorflow.python.ops import variable_scope
from tensorflow.python.training import device_setter
from tensorflow.contrib.opt.python.training.model_average_optimizer import \
ModelAverageOptimizer, ModelAverageCustomGetter, GLOBAL_VARIABLE_NAME
def create_local_cluster(num_workers, num_ps, protocol="grpc"):
"""Create local GRPC servers and return them."""
worker_ports = [portpicker.pick_unused_port() for _ in range(num_workers)]
ps_ports = [portpicker.pick_unused_port() for _ in range(num_ps)]
cluster_dict = {
"worker": ["localhost:%s" % port for port in worker_ports],
"ps": ["localhost:%s" % port for port in ps_ports]
}
cs = server_lib.ClusterSpec(cluster_dict)
workers = [
server_lib.Server(
cs, job_name="worker", protocol=protocol, task_index=ix, start=True)
for ix in range(num_workers)
]
ps_servers = [
server_lib.Server(
cs, job_name="ps", protocol=protocol, task_index=ix, start=True)
for ix in range(num_ps)
]
return cluster_dict, workers, ps_servers
# Creates the workers and returns their sessions, graphs, and train_ops.
# The chief worker updates last.
def _get_workers(num_workers, steps, workers):
sessions = []
graphs = []
train_ops = []
for worker_id in range(num_workers):
graph = ops.Graph()
is_chief = (worker_id == 0)
with graph.as_default():
worker_device = "/job:worker/task:%d/cpu:0" % (worker_id)
ma_coustom = ModelAverageCustomGetter(
worker_device=worker_device)
with variable_scope.variable_scope('',
custom_getter=ma_coustom), ops.device(
device_setter.replica_device_setter(worker_device=worker_device,
ps_device="/job:ps/task:0/cpu:0",
ps_tasks=1)):
global_step = variables.Variable(0, name='global_step',
trainable=False)
var_0 = variable_scope.get_variable(initializer=0.0, name="v0")
var_1 = variable_scope.get_variable(initializer=1.0, name="v1")
with ops.device("/job:worker/task:" + str(worker_id)):
if worker_id == 0:
grads_0 = constant_op.constant(-1.0)
grads_1 = constant_op.constant(-1.0)
else:
grads_0 = constant_op.constant(-2.0)
grads_1 = constant_op.constant(-2.0)
sgd_opt = gradient_descent.GradientDescentOptimizer(1.0)
opt = ModelAverageOptimizer(
opt=sgd_opt,
num_worker=num_workers,
ma_custom_getter=ma_coustom,
is_chief=is_chief,
interval_steps=steps
)
train_op = [
opt.apply_gradients(
[[grads_0, var_0],
[grads_1, var_1]], global_step)
]
easgd_hook = opt.make_session_run_hook()
# Creates MonitoredSession
sess = training.MonitoredTrainingSession(workers[worker_id].target,
hooks=[easgd_hook])
sessions.append(sess)
graphs.append(graph)
train_ops.append(train_op)
return sessions, graphs, train_ops
class ModelAverageOptimizerTest(test.TestCase):
def _run(self, train_op, sess):
sess.run(train_op)
def test1Workers2Period(self):
num_workers = 2
steps = 2
num_ps = 1
cluster, workers, _ = create_local_cluster(num_workers=num_workers,
num_ps=num_ps)
sessions, graphs, train_ops = _get_workers(num_workers,
steps,
workers)
var_0 = graphs[0].get_tensor_by_name('v0:0')
var_1 = graphs[0].get_tensor_by_name('v1:0')
global_step = training_util.get_global_step(graphs[0])
global_var_0 = graphs[0].get_tensor_by_name(GLOBAL_VARIABLE_NAME + "/v0:0")
global_var_1 = graphs[0].get_tensor_by_name(GLOBAL_VARIABLE_NAME + "/v1:0")
# Verify the initialized value.
self.assertAllEqual(0.0, sessions[0].run(var_0))
self.assertAllEqual(1.0, sessions[0].run(var_1))
self.assertAllEqual(0.0, sessions[0].run(global_var_0))
self.assertAllEqual(1.0, sessions[0].run(global_var_1))
self.assertAllEqual(0, sessions[0].run(global_step))
sessions[0].run(train_ops[0])
sessions[1].run(train_ops[1])
self.assertAllEqual(1.0, sessions[0].run(var_0))
self.assertAllEqual(2.0, sessions[0].run(var_1))
self.assertAllEqual(0.0, sessions[0].run(global_var_0))
self.assertAllEqual(1.0, sessions[0].run(global_var_1))
self.assertAllEqual(0, sessions[0].run(global_step))
# iteration 2, global variable update
thread_0 = self.checkedThread(
target=self._run, args=(train_ops[0], sessions[0]))
thread_1 = self.checkedThread(
target=self._run, args=(train_ops[1], sessions[1]))
thread_0.start()
thread_1.start()
thread_0.join()
thread_1.join()
self.assertAllEqual(3.0, sessions[0].run(var_0))
self.assertAllEqual(4.0, sessions[0].run(var_1))
self.assertAllEqual(3.0, sessions[0].run(global_var_0))
self.assertAllEqual(4.0, sessions[0].run(global_var_1))
self.assertAllEqual(1, sessions[0].run(global_step))
# iteration 3
sessions[0].run(train_ops[0])
self.assertAllEqual(4.0, sessions[0].run(var_0))
self.assertAllEqual(5.0, sessions[0].run(var_1))
self.assertAllEqual(3.0, sessions[0].run(global_var_0))
self.assertAllEqual(4.0, sessions[0].run(global_var_1))
self.assertAllEqual(1, sessions[0].run(global_step))
def testPS2TasksWithClusterSpecClass(self):
cluster_spec = server_lib.ClusterSpec({
"ps": ["ps0:2222", "ps1:2222"],
"worker": ["worker0:2222", "worker1:2222", "worker2:2222"]
})
worker_device = "/job:worker/task:0"
ma_coustom = ModelAverageCustomGetter(
worker_device=worker_device)
from tensorflow.python.training import device_setter
with ops.device(
device_setter.replica_device_setter(cluster=cluster_spec,
worker_device=worker_device,
ps_device="/job:ps")), \
variable_scope.variable_scope('', custom_getter=ma_coustom):
v = variable_scope.get_variable(initializer=[1, 2], name="v")
w = variable_scope.get_variable(initializer=[2, 1], name='w')
v_g, w_g = ma_coustom._local_2_global[v], ma_coustom._local_2_global[w]
self.assertDeviceEqual("/job:worker/task:0", v.device)
self.assertDeviceEqual("job:ps/task:0", v_g.device)
self.assertDeviceEqual("/job:worker/task:0", w.device)
self.assertDeviceEqual("job:ps/task:1", w_g.device)
if __name__ == '__main__':
test.main()

View File

@ -6,6 +6,7 @@ exports_files(["LICENSE"])
load(
"//tensorflow:tensorflow.bzl",
"py_test",
"tf_gen_op_libs",
"tf_custom_op_library",
"tf_custom_op_py_library",
@ -64,11 +65,28 @@ py_library(
"python/__init__.py",
],
srcs_version = "PY2AND3",
tags = [
"notap",
],
deps = [
":periodic_resample_op_py",
],
)
py_test(
name = "periodic_resample_op_test",
srcs = ["python/kernel_tests/periodic_resample_op_test.py"],
srcs_version = "PY2AND3",
tags = [
"notap",
],
deps = [
":init_py",
"//tensorflow/contrib/util:util_py",
"//tensorflow/python:framework_test_lib",
],
)
# py_library(
# name = "periodic_resample_op_py",
# srcs = ["python/ops/periodic_resample_op.py"],

View File

@ -100,6 +100,8 @@ template <class InputDataT,
desired_shape.size(), "."));
bool found = false;
const auto& input_tensor_shape = input_tensor.shape();
for (int i = 0; i < rank; ++i) {
// if (desired_shape(i) < 1) {
if (desired_shape[i] < 1) {
@ -111,6 +113,15 @@ template <class InputDataT,
adjustable_dimension = i;
found = true;
} else {
OP_REQUIRES(
context, desired_shape[i] >= input_tensor_shape.dim_size(i),
tensorflow::errors::InvalidArgument(
"periodic_resample expects the size of non-adjustable "
"dimensions be at least as large as size of input tensor."
" Dimension ", i, " input tensor has size ",
input_tensor_shape.dim_size(i), ", desired shape has size ",
desired_shape[i], "."));
// target_dimensions[i] = desired_shape(i);
target_dimensions[i] = desired_shape[i];
new_sliced_size *= target_dimensions[i];

View File

@ -34,26 +34,40 @@ This function implements a slightly more generic version of the subpixel
convolutions found in this [paper](https://arxiv.org/abs/1609.05158).
The formula for computing the elements in the `output` tensor is as follows:
`T` = `values` tensor of rank `R`
`S` = desired `shape` of output tensor (vector of length `R`)
`P` = `output` tensor of rank `R`
\((T_1,\ldots,T_R)\) = shape(`T`)
\([S_1,\ldots,S_q,\ldots,S_R]\) = elements of vector `S`
A single element in `S` is left unspecified (denoted \(S_q=-1\)).
Let \(f_i\) denote the (possibly non-integer) factor that relates the original
dimension to the desired dimensions, \(S_i=f_i T_i\), for \(i\neq q\) where
\(f_i>0\).
`T` = `values` tensor of rank `R`
`S` = desired `shape` of output tensor (vector of length `R`)
`P` = `output` tensor of rank `R`
\\((T_1,\\ldots,T_R)\\) = shape(`T`)
\\([S_1,\\ldots,S_q,\\ldots,S_R]\\) = elements of vector `S`
A single element in `S` is left unspecified (denoted \\(S_q=-1\\)).
Let \\(f_i\\) denote the (possibly non-integer) factor that relates the original
dimension to the desired dimensions, \\(S_i=f_i T_i\\), for \\(i\\neq q\\) where
\\(f_i>0\\).
Define the following:
\(g_i=\lceil f_i\rceil\)
\(t=\prod_i T_i\)
\(s=\prod_{i\neq q} S_i\)
\(S_q\) can then be defined as by \(S_q=\lfloor t/s\rfloor\).
\\(g_i=\\lceil f_i\\rceil\\)
\\(t=\\prod_i T_i\\)
\\(s=\\prod_{i\\neq q} S_i\\)
\\(S_q\\) can then be defined by \\(S_q=\\lfloor t/s\\rfloor\\).
The elements of the resulting tensor are defined as
\(P_{s_1,\ldots,s_R}=T_{h_1,\ldots,h_q,\ldots,h_R}\).
The \(h_i\) (\(i\neq q\)) are defined by \(h_i=\lfloor s_i/g_i\rfloor\).
\(h_q=S_q\sum_{j\neq q}^{q-1}G_j \mathrm{mod}(s_j,g_j) + s_q\), where
\(G_j=\prod_{i}^{j-1}g_i\) (\(G_0=1\)).
\\(P_{s_1,\\ldots,s_R}=T_{h_1,\\ldots,h_q,\\ldots,h_R}\\).
The \\(h_i\\) (\\(i\\neq q\\)) are defined by \\(h_i=\\lfloor s_i/g_i\\rfloor\\).
\\(h_q=S_q\\sum_{j\\neq q}^{q-1}G_j \\mathrm{mod}(s_j,g_j) + s_q\\), where
\\(G_j=\\prod_{i}^{j-1}g_i\\) (\\(G_0=1\\)).
One drawback of this method is that whenever the output dimensions are slightly
less than integer multiples of the input dimensions, many of the tensor elements

View File

@ -19,8 +19,9 @@ from __future__ import division
from __future__ import print_function
import numpy
import tensorflow
from tensorflow.contrib.periodic_resample import periodic_resample
from tensorflow.python.framework import errors_impl
from tensorflow.python.framework import test_util
from tensorflow.python.ops import variables
from tensorflow.python.platform import googletest
@ -96,6 +97,19 @@ class PeriodicResampleTest(test_util.TensorFlowTestCase):
result = periodic_resample(input_tensor, desired_shape).eval()
self.assertAllEqual(result, output_tensor)
def testPeriodicResampleErrors(self):
input_tensor = numpy.zeros(shape=[1, 2, 2, 4])
with self.test_session():
variables.global_variables_initializer().run()
with self.assertRaisesWithPredicateMatch(
errors_impl.InvalidArgumentError,
'Dimension 3 input tensor has size 4, desired shape has size 1'):
periodic_resample(input_tensor, [None, 4, 4, 1]).eval()
with self.assertRaisesWithPredicateMatch(
errors_impl.InvalidArgumentError,
'4, to be the same as the length of the desired shape, 3'):
periodic_resample(input_tensor, [None, 4, 4]).eval()
if __name__ == "__main__":
googletest.main()

View File

@ -663,6 +663,12 @@ class DropoutWrapperTest(test.TestCase):
self.assertEqual(res[1].h.shape, (batch_size, 3))
return res
def testWrappedCellProperty(self):
cell = rnn_cell_impl.BasicRNNCell(10)
wrapper = rnn_cell_impl.DropoutWrapper(cell)
# Github issue 15810
self.assertEqual(wrapper.wrapped_cell, cell)
def testDropoutWrapperKeepAllConstantInput(self):
keep = array_ops.ones([])
res = self._testDropoutWrapper(

View File

@ -1549,5 +1549,100 @@ class BenchmarkLSTMCellXLA(test.Benchmark):
benchmark_results["wall_time"]]]))
class WeightNormLSTMCellTest(test.TestCase):
"""Compared cell output with pre-calculated values."""
def _cell_output(self, cell):
"""Calculate cell output"""
with self.test_session() as sess:
init = init_ops.constant_initializer(0.5)
with variable_scope.variable_scope("root",
initializer=init):
x = array_ops.zeros([1, 2])
c0 = array_ops.zeros([1, 2])
h0 = array_ops.zeros([1, 2])
state0 = rnn_cell.LSTMStateTuple(c0, h0)
xout, sout = cell()(x, state0)
sess.run([variables.global_variables_initializer()])
res = sess.run([xout, sout], {
x.name: np.array([[1., 1.]]),
c0.name: 0.1 * np.asarray([[0, 1]]),
h0.name: 0.1 * np.asarray([[2, 3]]),
})
actual_state_c = res[1].c
actual_state_h = res[1].h
return actual_state_c, actual_state_h
def testBasicCell(self):
"""Tests cell w/o peepholes and w/o normalisation"""
def cell():
return contrib_rnn_cell.WeightNormLSTMCell(2,
norm=False,
use_peepholes=False)
actual_c, actual_h = self._cell_output(cell)
expected_c = np.array([[0.65937078, 0.74983585]])
expected_h = np.array([[0.44923624, 0.49362513]])
self.assertAllClose(expected_c, actual_c, 1e-5)
self.assertAllClose(expected_h, actual_h, 1e-5)
def testNonbasicCell(self):
"""Tests cell with peepholes and w/o normalisation"""
def cell():
return contrib_rnn_cell.WeightNormLSTMCell(2,
norm=False,
use_peepholes=True)
actual_c, actual_h = self._cell_output(cell)
expected_c = np.array([[0.65937084, 0.7574988]])
expected_h = np.array([[0.4792085, 0.53470564]])
self.assertAllClose(expected_c, actual_c, 1e-5)
self.assertAllClose(expected_h, actual_h, 1e-5)
def testBasicCellWithNorm(self):
"""Tests cell w/o peepholes and with normalisation"""
def cell():
return contrib_rnn_cell.WeightNormLSTMCell(2,
norm=True,
use_peepholes=False)
actual_c, actual_h = self._cell_output(cell)
expected_c = np.array([[0.50125383, 0.58805949]])
expected_h = np.array([[0.32770363, 0.37397948]])
self.assertAllClose(expected_c, actual_c, 1e-5)
self.assertAllClose(expected_h, actual_h, 1e-5)
def testNonBasicCellWithNorm(self):
"""Tests cell with peepholes and with normalisation"""
def cell():
return contrib_rnn_cell.WeightNormLSTMCell(2,
norm=True,
use_peepholes=True)
actual_c, actual_h = self._cell_output(cell)
expected_c = np.array([[0.50125383, 0.59587258]])
expected_h = np.array([[0.35041603, 0.40873795]])
self.assertAllClose(expected_c, actual_c, 1e-5)
self.assertAllClose(expected_h, actual_h, 1e-5)
if __name__ == "__main__":
test.main()

View File

@ -38,6 +38,7 @@ from tensorflow.python.ops import random_ops
from tensorflow.python.ops import rnn_cell_impl
from tensorflow.python.ops import variable_scope as vs
from tensorflow.python.ops import partitioned_variables
from tensorflow.python.ops import nn_impl
from tensorflow.python.platform import tf_logging as logging
from tensorflow.python.util import nest
@ -328,7 +329,7 @@ class TimeFreqLSTMCell(rnn_cell_impl.RNNCell):
def __init__(self, num_units, use_peepholes=False,
cell_clip=None, initializer=None,
num_unit_shards=1, forget_bias=1.0,
feature_size=None, frequency_skip=None,
feature_size=None, frequency_skip=1,
reuse=None):
"""Initialize the parameters for an LSTM cell.
@ -2723,3 +2724,257 @@ class SRUCell(rnn_cell_impl._LayerRNNCell):
h = r * self._activation(c) + (1.0 - r) * inputs
return h, c
class WeightNormLSTMCell(rnn_cell_impl.RNNCell):
"""Weight normalized LSTM Cell. Adapted from `rnn_cell_impl.LSTMCell`.
The weight-norm implementation is based on:
https://arxiv.org/abs/1602.07868
Tim Salimans, Diederik P. Kingma.
Weight Normalization: A Simple Reparameterization to Accelerate
Training of Deep Neural Networks
The default LSTM implementation based on:
http://www.bioinf.jku.at/publications/older/2604.pdf
S. Hochreiter and J. Schmidhuber.
"Long Short-Term Memory". Neural Computation, 9(8):1735-1780, 1997.
The class uses optional peephole connections, optional cell clipping
and an optional projection layer.
The optional peephole implementation is based on:
https://research.google.com/pubs/archive/43905.pdf
Hasim Sak, Andrew Senior, and Francoise Beaufays.
"Long short-term memory recurrent neural network architectures for
large scale acoustic modeling." INTERSPEECH, 2014.
"""
def __init__(self, num_units, norm=True, use_peepholes=False,
cell_clip=None, initializer=None, num_proj=None,
proj_clip=None, forget_bias=1, activation=None,
reuse=None):
"""Initialize the parameters of a weight-normalized LSTM cell.
Args:
num_units: int, The number of units in the LSTM cell
norm: If `True`, apply normalization to the weight matrices. If `False`,
the result is identical to that obtained from `rnn_cell_impl.LSTMCell`.
use_peepholes: bool, set `True` to enable diagonal/peephole connections.
cell_clip: (optional) A float value, if provided the cell state is clipped
by this value prior to the cell output activation.
initializer: (optional) The initializer to use for the weight matrices.
num_proj: (optional) int, The output dimensionality for the projection
matrices. If None, no projection is performed.
proj_clip: (optional) A float value. If `num_proj > 0` and `proj_clip` is
provided, then the projected values are clipped elementwise to within
`[-proj_clip, proj_clip]`.
forget_bias: Biases of the forget gate are initialized by default to 1
in order to reduce the scale of forgetting at the beginning of
the training.
activation: Activation function of the inner states. Default: `tanh`.
reuse: (optional) Python boolean describing whether to reuse variables
in an existing scope. If not `True`, and the existing scope already has
the given variables, an error is raised.
"""
super(WeightNormLSTMCell, self).__init__(_reuse=reuse)
self._scope = 'wn_lstm_cell'
self._num_units = num_units
self._norm = norm
self._initializer = initializer
self._use_peepholes = use_peepholes
self._cell_clip = cell_clip
self._num_proj = num_proj
self._proj_clip = proj_clip
self._activation = activation or math_ops.tanh
self._forget_bias = forget_bias
self._weights_variable_name = "kernel"
self._bias_variable_name = "bias"
if num_proj:
self._state_size = rnn_cell_impl.LSTMStateTuple(num_units, num_proj)
self._output_size = num_proj
else:
self._state_size = rnn_cell_impl.LSTMStateTuple(num_units, num_units)
self._output_size = num_units
@property
def state_size(self):
return self._state_size
@property
def output_size(self):
return self._output_size
def _normalize(self, weight, name):
"""Apply weight normalization.
Args:
weight: a 2D tensor with known number of columns.
name: string, variable name for the normalizer.
Returns:
A tensor with the same shape as `weight`.
"""
output_size = weight.get_shape().as_list()[1]
g = vs.get_variable(name, [output_size], dtype=weight.dtype)
return nn_impl.l2_normalize(weight, dim=0) * g
def _linear(self, args,
output_size,
norm,
bias,
bias_initializer=None,
kernel_initializer=None):
"""Linear map: sum_i(args[i] * W[i]), where W[i] is a variable.
Args:
args: a 2D Tensor or a list of 2D, batch x n, Tensors.
output_size: int, second dimension of W[i].
norm: bool, whether to apply weight normalization to W[i].
bias: boolean, whether to add a bias term or not.
bias_initializer: starting value to initialize the bias
(default is all zeros).
kernel_initializer: starting value to initialize the weight.
Returns:
A 2D Tensor with shape [batch x output_size] equal to
sum_i(args[i] * W[i]), where W[i]s are newly created matrices.
Raises:
ValueError: if some of the arguments have an unspecified or wrong shape.
"""
if args is None or (nest.is_sequence(args) and not args):
raise ValueError("`args` must be specified")
if not nest.is_sequence(args):
args = [args]
# Calculate the total size of arguments on dimension 1.
total_arg_size = 0
shapes = [a.get_shape() for a in args]
for shape in shapes:
if shape.ndims != 2:
raise ValueError("linear is expecting 2D arguments: %s" % shapes)
if shape[1].value is None:
raise ValueError("linear expects shape[1] to be provided for shape %s, "
"but saw %s" % (shape, shape[1]))
else:
total_arg_size += shape[1].value
dtype = [a.dtype for a in args][0]
# Now the computation.
scope = vs.get_variable_scope()
with vs.variable_scope(scope) as outer_scope:
weights = vs.get_variable(
self._weights_variable_name, [total_arg_size, output_size],
dtype=dtype,
initializer=kernel_initializer)
if norm:
wn = []
st = 0
with ops.control_dependencies(None):
for i in range(len(args)):
en = st + shapes[i][1].value
wn.append(self._normalize(weights[st:en, :],
name='norm_{}'.format(i)))
st = en
weights = array_ops.concat(wn, axis=0)
if len(args) == 1:
res = math_ops.matmul(args[0], weights)
else:
res = math_ops.matmul(array_ops.concat(args, 1), weights)
if not bias:
return res
with vs.variable_scope(outer_scope) as inner_scope:
inner_scope.set_partitioner(None)
if bias_initializer is None:
bias_initializer = init_ops.constant_initializer(0.0, dtype=dtype)
biases = vs.get_variable(
self._bias_variable_name, [output_size],
dtype=dtype,
initializer=bias_initializer)
return nn_ops.bias_add(res, biases)
def call(self, inputs, state):
"""Run one step of LSTM.
Args:
inputs: input Tensor, 2D, batch x num_units.
state: A tuple of state Tensors, both `2-D`, with column sizes
`c_state` and `m_state`.
Returns:
A tuple containing:
- A `2-D, [batch x output_dim]`, Tensor representing the output of the
LSTM after reading `inputs` when previous state was `state`.
Here output_dim is:
num_proj if num_proj was set,
num_units otherwise.
- Tensor(s) representing the new state of LSTM after reading `inputs` when
the previous state was `state`. Same type and shape(s) as `state`.
Raises:
ValueError: If input size cannot be inferred from inputs via
static shape inference.
"""
dtype = inputs.dtype
num_units = self._num_units
sigmoid = math_ops.sigmoid
c, h = state
input_size = inputs.get_shape().with_rank(2)[1]
if input_size.value is None:
raise ValueError("Could not infer input size from inputs.get_shape()[-1]")
with vs.variable_scope(self._scope, initializer=self._initializer):
concat = self._linear([inputs, h], 4 * num_units,
norm=self._norm, bias=True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)
if self._use_peepholes:
w_f_diag = vs.get_variable("w_f_diag", shape=[num_units], dtype=dtype)
w_i_diag = vs.get_variable("w_i_diag", shape=[num_units], dtype=dtype)
w_o_diag = vs.get_variable("w_o_diag", shape=[num_units], dtype=dtype)
new_c = (c * sigmoid(f + self._forget_bias + w_f_diag * c)
+ sigmoid(i + w_i_diag * c) * self._activation(j))
else:
new_c = (c * sigmoid(f + self._forget_bias)
+ sigmoid(i) * self._activation(j))
if self._cell_clip is not None:
# pylint: disable=invalid-unary-operand-type
new_c = clip_ops.clip_by_value(new_c, -self._cell_clip, self._cell_clip)
# pylint: enable=invalid-unary-operand-type
if self._use_peepholes:
new_h = sigmoid(o + w_o_diag * new_c) * self._activation(new_c)
else:
new_h = sigmoid(o) * self._activation(new_c)
if self._num_proj is not None:
with vs.variable_scope("projection"):
new_h = self._linear(new_h,
self._num_proj,
norm=self._norm,
bias=False)
if self._proj_clip is not None:
# pylint: disable=invalid-unary-operand-type
new_h = clip_ops.clip_by_value(new_h,
-self._proj_clip,
self._proj_clip)
# pylint: enable=invalid-unary-operand-type
new_state = rnn_cell_impl.LSTMStateTuple(new_c, new_h)
return new_h, new_state

View File

@ -225,6 +225,94 @@ class TestBeamStep(test.TestCase):
self.assertAllEqual(next_state_.log_probs, expected_log_probs)
class TestLargeBeamStep(test.TestCase):
"""
Tests a single step of beam search in such
case that beam size is larger than vocabulary size.
"""
def setUp(self):
super(TestLargeBeamStep, self).setUp()
self.batch_size = 2
self.beam_width = 8
self.vocab_size = 5
self.end_token = 0
self.length_penalty_weight = 0.6
def test_step(self):
def get_probs():
"""this simulates the initialize method in BeamSearchDecoder"""
log_prob_mask = array_ops.one_hot(array_ops.zeros([self.batch_size],
dtype=dtypes.int32),
depth=self.beam_width, on_value=True,
off_value=False, dtype=dtypes.bool)
log_prob_zeros = array_ops.zeros([self.batch_size, self.beam_width],
dtype=dtypes.float32)
log_prob_neg_inf = array_ops.ones([self.batch_size, self.beam_width],
dtype=dtypes.float32) * -np.Inf
log_probs = array_ops.where(log_prob_mask, log_prob_zeros,
log_prob_neg_inf)
return log_probs
log_probs = get_probs()
dummy_cell_state = array_ops.zeros([self.batch_size, self.beam_width])
_finished = array_ops.one_hot(
array_ops.zeros([self.batch_size], dtype=dtypes.int32),
depth=self.beam_width, on_value=False,
off_value=True, dtype=dtypes.bool)
_lengths = np.zeros([self.batch_size, self.beam_width], dtype=np.int64)
_lengths[:, 0]=2
_lengths = constant_op.constant(_lengths, dtype=dtypes.int64)
beam_state = beam_search_decoder.BeamSearchDecoderState(
cell_state=dummy_cell_state,
log_probs=log_probs,
lengths=_lengths,
finished=_finished)
logits_ = np.full([self.batch_size, self.beam_width, self.vocab_size],
0.0001)
logits_[0, 0, 2] = 1.9
logits_[0, 0, 3] = 2.1
logits_[0, 1, 3] = 3.1
logits_[0, 1, 4] = 0.9
logits_[1, 0, 1] = 0.5
logits_[1, 1, 2] = 2.7
logits_[1, 2, 2] = 10.0
logits_[1, 2, 3] = 0.2
logits = constant_op.constant(logits_, dtype=dtypes.float32)
log_probs = nn_ops.log_softmax(logits)
outputs, next_beam_state = beam_search_decoder._beam_search_step(
time=2,
logits=logits,
next_cell_state=dummy_cell_state,
beam_state=beam_state,
batch_size=ops.convert_to_tensor(self.batch_size),
beam_width=self.beam_width,
end_token=self.end_token,
length_penalty_weight=self.length_penalty_weight)
with self.test_session() as sess:
outputs_, next_state_, state_, log_probs_ = sess.run(
[outputs, next_beam_state, beam_state, log_probs])
self.assertEqual(outputs_.predicted_ids[0, 0], 3)
self.assertEqual(outputs_.predicted_ids[0, 1], 2)
self.assertEqual(outputs_.predicted_ids[1, 0], 1)
neg_inf = -np.Inf
self.assertAllEqual(next_state_.log_probs[:, -3:],
[[neg_inf, neg_inf, neg_inf],
[neg_inf, neg_inf, neg_inf]])
self.assertEqual((next_state_.log_probs[:, :-3] > neg_inf).all(), True)
self.assertEqual((next_state_.lengths[:, :-3] > 0).all(), True)
self.assertAllEqual(next_state_.lengths[:, -3:], [[0, 0, 0],
[0, 0, 0]])
class BeamSearchDecoderTest(test.TestCase):
def _testDynamicDecodeRNN(self, time_major, has_attention):

View File

@ -19,7 +19,6 @@ from __future__ import division
from __future__ import print_function
import collections
import numpy as np
from tensorflow.contrib.seq2seq.python.ops import beam_search_ops
@ -229,8 +228,11 @@ class BeamSearchDecoder(decoder.Decoder):
self._start_tokens = array_ops.tile(
array_ops.expand_dims(self._start_tokens, 1), [1, self._beam_width])
self._start_inputs = self._embedding_fn(self._start_tokens)
self._finished = array_ops.zeros(
[self._batch_size, self._beam_width], dtype=dtypes.bool)
self._finished = array_ops.one_hot(
array_ops.zeros([self._batch_size], dtype=dtypes.int32),
depth=self._beam_width, on_value=False,
off_value=True, dtype=dtypes.bool)
@property
def batch_size(self):
@ -298,11 +300,15 @@ class BeamSearchDecoder(decoder.Decoder):
"""
finished, start_inputs = self._finished, self._start_inputs
log_probs = array_ops.one_hot( # shape(batch_sz, beam_sz)
array_ops.zeros([self._batch_size], dtype=dtypes.int32),
depth=self._beam_width, on_value=0.0, off_value=-np.Inf,
dtype=nest.flatten(self._initial_cell_state)[0].dtype)
initial_state = BeamSearchDecoderState(
cell_state=self._initial_cell_state,
log_probs=array_ops.zeros(
[self._batch_size, self._beam_width],
dtype=nest.flatten(self._initial_cell_state)[0].dtype),
log_probs=log_probs,
finished=finished,
lengths=array_ops.zeros(
[self._batch_size, self._beam_width], dtype=dtypes.int64))
@ -563,18 +569,11 @@ def _beam_search_step(time, logits, next_cell_state, beam_state, batch_size,
time = ops.convert_to_tensor(time, name="time")
# During the first time step we only consider the initial beam
scores_shape = array_ops.shape(scores)
scores_flat = control_flow_ops.cond(
time > 0,
lambda: array_ops.reshape(scores, [batch_size, -1]),
lambda: scores[:, 0])
num_available_beam = control_flow_ops.cond(
time > 0, lambda: math_ops.reduce_prod(scores_shape[1:]),
lambda: math_ops.reduce_prod(scores_shape[2:]))
scores_flat = array_ops.reshape(scores, [batch_size, -1])
# Pick the next beams according to the specified successors function
next_beam_size = math_ops.minimum(
ops.convert_to_tensor(beam_width, dtype=dtypes.int32, name="beam_width"),
num_available_beam)
next_beam_size = ops.convert_to_tensor(beam_width, dtype=dtypes.int32,
name="beam_width")
next_beam_scores, word_indices = nn_ops.top_k(scores_flat, k=next_beam_size)
next_beam_scores.set_shape([static_batch_size, beam_width])

View File

@ -99,7 +99,7 @@ cc_library(
alwayslink = 1,
)
tf_cuda_library(
cc_library(
name = "rdma_rendezvous_mgr",
srcs = ["rdma_rendezvous_mgr.cc"],
hdrs = ["rdma_rendezvous_mgr.h"],
@ -114,7 +114,7 @@ tf_cuda_library(
],
)
cc_library(
tf_cuda_library(
name = "rdma_mgr",
srcs = ["rdma_mgr.cc"],
hdrs = ["rdma_mgr.h"],
@ -141,6 +141,8 @@ tf_cuda_library(
"//conditions:default": [],
}),
deps = [
":grpc_verbs_client",
":verbs_service_proto_cc",
":verbs_util",
"//tensorflow/core:core_cpu_internal",
"//tensorflow/core:framework",

View File

@ -24,66 +24,144 @@ The design is based on TensorFlow r1.0. An RDMA path is added between servers fo
During the server setup, an RDMA manager is created to manage low-level RDMA components such as the RDMA channel and RDMA adapter, and an RDMA rendezvous manager is created to oversee send/recv operations between servers. Following the distributed TensorFlow design philosophy, the send operation is passive, i.e. merely placing a tensor in the local out-going table. It is the receive operation that actually initiates the tensor transfer.
TensorFlow dynamically allocates memory for tensors that are to be sent or received. This causes difficulty for RDMA operations where pinned memory is required. Two remedies are possible: either the memory is pinned, transferred, then unpinned for each and every tensor to be transferred, or a buffer is pre-allocated and pinned for each tensor. The former incurs significant operation overhead since pinning and unpinning memory for each dynamically generated tensor is slow. The latter incurs large memory overhead and extra copying from the tensor to its pinned buffer, but may still be faster than the former. The second approach is adopted in this design. Each RDMA channel, representing an RDMA connection to a peer, contains a table of pinned buffers for all the seen tensors that require transfer. It is assumed that the tensor size rarely changes across different steps, so only one buffer is created for the same tensor across all the steps. In the rare case when the tensor size does increase, the old buffer is discarded and a new buffer of larger size is created and pinned.
TensorFlow dynamically allocates memory for tensors that are to be sent or received. This causes difficulty for RDMA operations where pinned memory is required. A few remedies are possible:
1. The memory is pinned, transferred, then unpinned for each and every tensor to be transferred. This incurs significant operation overhead since pinning and unpinning memory for each dynamically generated tensor is slow.
2. A buffer is pre-allocated and pinned for each tensor. This incurs large memory overhead and extra copying from the tensor to its pinned buffer, but may still be faster than the former approach.
3. Following HKUST research on the use of GPU direct, and their [GDR implementation](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/gdr/README.md), there is a smart way to benefit from the TensorFlow allocation scheme, which is mostly pool based, i.e. allocators pre-allocate a large memory block and allocate the tensors from there. By attaching a custom Visitor to the relevant allocators, we can do a single registration of the entire memory block, which reduces the registration overhead to practically zero. Once the block is registered, each newly allocated tensor will be at a registered address, which allows us to do direct RDMA writes to it.
When a tensor is prepared for transfer, it is first converted to a TensorProto, then the proto is serialized to a byte array and copied to the pinned buffer. The content of the buffer is transferred to the remote node via RDMA write. On the remote side, the process is reversed. This is illustrated in the diagram below. The conversion to TensorProto is introduced to simplify the transfer of string tensors. Also, since the TensorProto lives in host memory, even if the original tensor lives on the device, the pinned buffers are all allocated in host memory.
![TensorFlow RDMA path](./design_diagram.png)
For best performance, we will adopt HKUST's 0-copies approach in our solution (summarized in the sketch after this list). This means:
1. Tensor writes will be done directly from the source tensor to the **result** tensor, with no memory copies in between. This should be done for all DMAable tensors which are located either on CPU or on an RDMA-compatible GPU device (GPU direct).
2. Non DMAable tensors (CanMemCopy == false) will be serialized to a TensorProto on the sender side, RDMA written to a registered buffer on the receiver side, and then deserialized by the receiver.
3. Tensors which are located on a non-RDMA-compatible GPU will be RDMA written to a registered CPU **proxy** buffer on the receiver side, and then copied to GPU by the receiver.
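The three receive paths above can be summarized by a small decision sketch (illustration only; the function and flag names below are made up and not part of the implementation):
```python
def choose_receive_path(can_mem_copy, dst_is_gpu, gpu_direct_available):
  """Pick one of the three transfer paths described in the list above."""
  if not can_mem_copy:
    return "proto"   # serialize to TensorProto, RDMA write, deserialize on the receiver
  if dst_is_gpu and not gpu_direct_available:
    return "proxy"   # RDMA write into a CPU proxy buffer, then copy to the GPU
  return "direct"    # RDMA write straight into the result tensor (CPU or GPU direct)
```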
The following improvements can be made in the future. First, conversion to TensorProto and serialization can be avoided for numeric (float/int) tensors since their internal buffer can be accessed directly as a byte array. Second, the pinned buffer may be allocated on the device if the tensor is located on the device. This avoids the extra device-to-host copy at the expense of extra device memory consumption.
## Design details
### Terminology
* **Sender** - The node which sends the tensor.
* **Receiver** - The node which receives the tensor.
* **Result tensor** - The destination tensor, allocated on its appropriate device.
* **Proxy tensor** - A CPU allocated tensor, which will be used in the case where the result tensor cannot be RDMA written to directly (GPU direct is disabled or not available). The RDMA write will therefore be done to the proxy tensor, and afterwards we will do a manual local copy from it to the result tensor.
### Messages
* RDMA_MESSAGE_TENSOR_REQUEST
* RDMA_MESSAGE_META_DATA_RESPONSE
* RDMA_MESSAGE_TENSOR_RE_REQUEST
### Transport protocol
The tensor transfer process is initiated when the receiver requests a tensor. In code it is done by calling **Rendezvous::Recv()** or **Rendezvous::RecvAsync()**. The TensorFlow base implementation handles the case where the requested tensor is located on the same node. The more interesting case, where the requested tensor is located on a remote node (receiver != sender), is handled in a derivation of the pure virtual **BaseRemoteRendezvous::RecvFromRemoteAsync()**. TensorFlow provides a default GRPC-based implementation, which ships in the vanilla version but suffers in scalability when running large models. Our RDMA-based implementation aims to be more scalable. HKUST's contrib GDR implementation is more scalable than GRPC and less scalable than ours, whose design evolved from it.
Our entry point is the implementation of **RdmaRemoteRendezvous::RecvFromRemoteAsync()**, located in rdma_rendezvous_mgr.cc. The implementation creates a new **RdmaTensorRequest** object, keyed by request index (uint32_t), stores it in a list of pending requests, and calls its **Start()** method. The **Start()** method basically does 2 things:
1. Allocate the result tensor (and the proxy tensor if required).
2. Send a **RDMA_MESSAGE_TENSOR_REQUEST** to the sender, containing the address of the destination tensor (result/proxy) for RDMA write.
In order to allocate the result and proxy tensors, we need to know the tensor's meta-data, i.e. shape and data-type for DMAable tensors, and proto-size for serialized tensors. Unfortunately, this information is only available on the sender side, which complicates matters. In order to avoid sending extra messages for querying the meta-data at each step, we store a local meta-data cache per tensor, which is only updated when the meta-data changes. Based on the assumption that the meta-data of a tensor rarely changes between steps, we expect that in most cases the cache will only be updated once. The sender is responsible for detecting changes in the meta-data and updating the receiver. In order for the sender to know that the meta-data has changed, each **RDMA_MESSAGE_TENSOR_REQUEST** will contain the meta-data that the receiver grabbed from its local cache. The sender will then compare the meta-data from the message to the tensor's current meta-data.
When the sender receives an **RDMA_MESSAGE_TENSOR_REQUEST**, it will create a new **RdmaTensorResponse** object for the given request message, store it in a list of pending responses, and will invoke its **Start()** method. The **Start()** method does the following:
1. Grab the source tensor from the local table (In code, **RecvLocalAsync()**).
2. If the source tensor is not DMAable, serialize it to a TensorProto.
3. If the source tensor is located on a device which cannot be DMA written from, copy it to CPU.
4. If it is the first time this tensor is requested, or if the tensor's meta-data changed:
1. Clone the tensor's data to be sent later.
2. Send a **RDMA_MESSAGE_META_DATA_RESPONSE** containing the new meta-data.
5. Otherwise:
1. RDMA write the tensor (or TensorProto) to the destination address and rkey specified in the request message. The immediate value for the write will be the request index.
When the receiver receives the **RDMA_MESSAGE_META_DATA_RESPONSE**, it will locate the relevant **RdmaTensorRequest** using the request index specified in the message, and invoke its **RecvTensorMetaData()** which does the following:
1. Update the local meta-data cache.
2. Reallocate the result/proxy tensors.
3. Re-send the tensor request. For traceability, the new message has a different name: **RDMA_MESSAGE_TENSOR_RE_REQUEST**.
When the sender receives a **RDMA_MESSAGE_TENSOR_RE_REQUEST**, it will locate the relevant **RdmaTensorResponse** using the request index specified in the message, and invoke its **Resume()** method, which will RDMA write the contents of the tensor that was cloned earlier, to the new remote address specified in the re-request.
When the receiver receives the RDMA write, it will locate the relevant **RdmaTensorRequest** using the request index which is the immediate value. It will then invoke its **RecvTensorContent()** which does the following:
1. Proxy copy/deserialize if required.
2. Invoke the done callback.
3. Deallocate the result/proxy tensors and remove the request from the pending list.
![alt text](verbs_with_0_copies.png "Transport protocol")
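The flow above can be condensed into a toy Python sketch that only counts protocol messages (purely illustrative; it is not the C++ implementation, and the cache is just a dict keyed by tensor name):
```python
def transfer(name, sender_meta_data, receiver_cache):
  """Simulate one tensor transfer and return the number of protocol messages used."""
  messages = 1                                  # RDMA_MESSAGE_TENSOR_REQUEST
  if receiver_cache.get(name) != sender_meta_data:
    # First request or meta-data change: the sender clones the tensor and replies with
    # RDMA_MESSAGE_META_DATA_RESPONSE; the receiver reallocates and re-requests.
    receiver_cache[name] = sender_meta_data
    messages += 2                               # META_DATA_RESPONSE + TENSOR_RE_REQUEST
  # The tensor content itself is RDMA-written directly, with the request index
  # passed as the immediate value (no extra protocol message).
  return messages

cache = {}
print(transfer("w:0", ("float32", (1024, 1024)), cache))  # 3 messages on the first step
print(transfer("w:0", ("float32", (1024, 1024)), cache))  # 1 message on subsequent steps
```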
### Additional design notes
1. When the sender receives a tensor request, the source tensor may or may not be ready yet. The situation is handled through a process of tag matching:
* If the request arrives before the tensor is ready, then a callback is put in a local table, and will be invoked once the tensor arrives.
* If the tensor is ready before the request arrives, then the tensor is put in a local table. When the request arrives, it will invoke the callback immediately.
In code it is done by calling **RecvLocalAsync()**, which receives the tensor's key, step-id, and the callback.
2. When the callback is invoked, the relevant tensor is removed from the tag matching table. In the case where we need to send the tensor's meta-data, the **RdmaTensorResponse** will store a copy of the tensor until the re-request arrives.
3. The sending of protocol messages (**RDMA_MESSAGE_TENSOR_REQUEST**, **RDMA_MESSAGE_META_DATA_RESPONSE** and **RDMA_MESSAGE_TENSOR_RE_REQUEST**) is done by the class **RdmaMessageBuffer**. All messages are sent using RDMA writes from/to fixed message buffers. This implies that we cannot send more than one message at a time on a specific channel. In order to synchronize the messages, the **RdmaMessageBuffer** holds local and remote buffer statuses which can be either busy or idle. When a write is issued, both statuses are changed to busy. When the write-complete event is received, the local status is changed to idle. When the write is received on the remote side, the remote side parses the message and returns an ACK to the sending side, upon which the sending side updates the remote status to idle. When both the local and remote statuses are idle, the next message can be sent (see the sketch after this list).
4. ACK writes are empty writes (hence they require no buffer) with immediate value 0xFFFFFFFE. Message writes have the immediate value 0xFFFFFFFF. All other writes are tensor-content writes whose immediate value is the request-index.
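A minimal sketch of the busy/idle flow control described in note 3 (illustrative only; the actual logic lives in **RdmaMessageBuffer**, and the callable passed in stands for the RDMA write):
```python
class MessageBufferSketch(object):
  """Send a control message only when both the local and remote buffers are idle."""

  def __init__(self, rdma_write):
    self.local_idle = True
    self.remote_idle = True
    self.pending = []
    self.rdma_write = rdma_write

  def send(self, message):
    if self.local_idle and self.remote_idle:
      self.local_idle = self.remote_idle = False  # both sides become busy
      self.rdma_write(message)
    else:
      self.pending.append(message)                # queue until both are idle again

  def on_write_complete(self):                    # local write-complete event
    self.local_idle = True
    self._maybe_send_next()

  def on_ack(self):                               # remote side parsed the message and ACKed
    self.remote_idle = True
    self._maybe_send_next()

  def _maybe_send_next(self):
    if self.pending and self.local_idle and self.remote_idle:
      self.send(self.pending.pop(0))
```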
### RDMA components
* **RDMA adapter:** The base for RDMA communications. It may contain multiple channels and buffers. It is responsible for handling various incoming RDMA messages.
* **RDMA channel:** Responsible for RDMA connection to a particular node. It manages multiple buffers. A channel has a callback table which stores all the callbacks for the requested tensors.
* **RDMA buffer:** Responsible for sending or receiving data. It has a fixed size memory to store the data. It has a queue to store the pending jobs. There are three types of buffers, message buffer, ACK buffer and tensor buffer. A channel has two message buffers, two ack buffers and many tensor buffers.
* **RDMA manager:** Manages the adapter and channels, including channel creation, channel setup via GRPC service, channel lookup, etc.
* **RDMA rendezvous manager:** manages multiple rdma rendezvous.
* **RDMA rendezvous:** a derived class of BaseRemoteRendezvous. This class is the back end for "send" and "recv" ops. When the sendrecv_op wants to send or receive a tensor, it calls the rendezvous' "send" and "recv" functions respectively. Rendezvous are identified by "step_id", a random number, so that tensors for different iterations don't get mixed up.
* **enum RdmaImmDataType** - Immediate types to distinguish between different RDMA writes on the remote side. Ack writes and control-message writes have a fixed immediate value. The rest of the writes are tensor writes and the immediate value is the relevant request index.
* **enum RdmaWriteIDType** - Types to distinguish between different RDMA write-complete events: Ack, control message and tensor writes.
* **class RdmaWriteID** - Context for RDMA write complete events. Holds the RdmaWriteIDType and additional data.
* **class RdmaTensorMetaData** - Meta-data for a tensor (type, shape, is_dead, proto_size).
* **class RdmaMemoryMgr** - Manages the meta-data cache, and the registered memory regions.
* **class RdmaTensorRequest** - Holds and manages information for a single tensor request throughout the entire receive cycle. API:
* **Start()** - Start the request sequence.
* Allocate the result tensor (and proxy tensor if required).
* Send RDMA_MESSAGE_TENSOR_REQUEST to the remote side.
* **RecvTensorMetaData()** - Receive meta-data from the remote side.
* Update the local meta-data cache.
* Reallocate the result tensor (and proxy tensor if required).
* Re-send the request to the remote side.
* **RecvTensorContent()** - Receive tensor content from the remote side (RDMA write was completed).
* Decode proto if required and/or move to GPU if the content was not written to it directly (GPU direct is not available).
* Invoke the done callback.
* **class RdmaTensorResponse** - Holds and manages information for a single tensor response throughout the entire send cycle. API:
* **Start()** - Start the response sequence.
* Find the tensor in the local tag-match table.
* Compare the tensor's meta-data to the meta-data in the message (taken from the requester's local cache).
* If meta-data changed:
* Clone the tensor to be sent later.
* Send a meta-data update message and wait for re-request.
* Else:
* Send the tensor's content (using direct RDMA write).
* **Resume()** - Resume the response sequence after a re-request. Send the tensor's content that was cloned earlier.
* **Destroy()** - Destroy the response's resources and remove it from the pending list.
* **class RdmaAdapter** - The base for RDMA communications. It may contain multiple channels and buffers. It is responsible for handling various incoming RDMA messages.
* **class RdmaChannel** - Responsible for RDMA connection to a particular node. It manages message buffers. A channel has a request table which stores all the pending tensor requests.
* **class RdmaMessageBuffer** - Responsible for sending or receiving messages. It has a fixed-size memory to store the data and a queue to store the pending jobs. A channel has two message buffers, one for tx and one for rx.
* **class RdmaMgr** - Manages the adapter and channels, including channel creation, channel setup via GRPC service, channel lookup, etc.
* **class RdmaRendezvousMgr** - Manages multiple rdma rendezvous.
* **class RdmaRemoteRendezvous** - A derived class of BaseRemoteRendezvous. This class is the back end for "send" and "recv" ops. When the sendrecv_op wants to send or receive a tensor, it calls the rendezvous' "send" and "recv" functions respectively. Rendezvous are identified by "step_id", a random number, so that tensors for different iterations don't get mixed up.
### The SEND operation
### Message structure:
In TensorFlow, when rendezvous sends a tensor, it merely puts a tensor in a local table in the corresponding rendezvous. If the tensor has been requested, a callback exists in the table. "send" will activate the callback, which tries to send the tensor across the node.
| type | name_size | name | step_id | request_index | remote_addr/checksum | rkey | is_dead | data_type | tensor_shape | tensor_bytes | error_status |
|------|---------- |------|---------|---------------|----------------------|------|---------|-----------|--------------|--------------|-----------------------|
| 1B | 2B | 512 | 8B | 8B | 8B | 4B | 1B | XB | XB | 8B | Size - 4B, proto - XB |
### The RECV operation
When a tensor is requested, rendezvous' recv function is called. The function first places a callback in the channel's callback table, which will be activated once the tensor is sent from the source. In the next step, a message is sent to notify the source of the requested tensor. Once the source receives the message, it will check locally for the tensor, if not found, a callback is placed in the table, otherwise, the tensor id will be placed at corresponding RDMA buffer's job queue for future transmission. When a tensor is scheduled to be transmitted, the RDMA buffer needs to have the memory allocated and initialized (registered with the remote buffer info). If the memory is not ready, the transmission is deferred, a message is sent to the destination to establish the memory first. The other case a transmission can be deferred is when the buffer is still being used by an on-going transmission.
### Three types of RDMA buffers
* **Message buffer:** responsible for sending message only.
* **Ack buffer:** once a message is sent, the recipient needs to send an ack via the ack buffer to free up the message buffer. An ack buffer is exclusively for its coupled message buffer.
* **Tensor buffer:** responsible for sending tensors. The recipient needs to send back a message to free up the sending buffer.
### RDMA packet format
|type|name_size|name|step_id|buffer_size|remote_addr|rkey|is_dead|data_type|tensor_shape|tensor_bytes|tensor_buffer|
### Six types of RDMA messages
* RDMA_MESSAGE_ACK
* RDMA_MESSAGE_BUFFER_IDLE
* RDMA_MESSAGE_BUFFER_REQUEST
* RDMA_MESSAGE_BUFFER_RESPONSE
* RDMA_MESSAGE_TENSOR_REQUEST
* RDMA_MESSAGE_TENSOR_WRITE
### Actions upon receiving RDMA messages
* RDMA_MESSAGE_ACK
* sender: mark local ack buffer idle.
* receiver: mark remote message buffer idle, send next item.
* RDMA_MESSAGE_BUFFER_IDLE
* sender: mark local message buffer idle, send next item.
* receiver: send ack, set remote tensor buffer idle, send next item.
* RDMA_MESSAGE_BUFFER_REQUEST
* sender: mark local message buffer idle, send next item.
* receiver: send ack, find or create tensor buffer, send BUFFER_RESPONSE.
* RDMA_MESSAGE_BUFFER_RESPONSE
* sender: mark local message buffer idle, send next item.
* receiver: send ack, set remote buffer info, set local and remote buffer idle, send next item.
* RDMA_MESSAGE_TENSOR_REQUEST
* sender: mark local message buffer idle, send next item.
* receiver: send ack, find or create tensor buffer, enqueue tensor id, send next item.
* RDMA_MESSAGE_TENSOR_WRITE
* sender: mark local message buffer idle, send next item.
* receiver: run callback.
* **RDMA_MESSAGE_TENSOR_REQUEST** - (receiver ==> sender) The original tensor request.
* type - The message type.
* name (name_size) - Name of the requested tensor.
* step_id - Step ID.
* request_index - Request index.
* remote_addr/rkey - Address/rkey of the result/proxy tensor. Irrelevant for first-time request.
* is_dead/data_type/tensor_shape/tensor_bytes - The current meta-data as stored in the receiver local cache. The sender will use that information to know if the receiver's cache requires updating.
* **RDMA_MESSAGE_META_DATA_RESPONSE** - (sender ==> receiver) The meta-data update message in case meta-data had changed (or if it is the first time the tensor is requested).
* type - The message type.
* request_index - Request index.
* is_dead/data_type/tensor_shape/tensor_bytes - The up-to-date meta-data.
* checksum - In data validation mode, this will hold the checksum of the source tensor.
* **RDMA_MESSAGE_TENSOR_RE_REQUEST** - (receiver ==> sender) Tensor re-request after meta-data update and reallocation of result/proxy tensors.
* type - The message type.
* name (name_size) - Name of the requested tensor.
* step_id - Step ID.
* request_index - Request index.
* remote_addr/rkey - Address/rkey of the reallocated result/proxy tensor.
* **RDMA_MESSAGE_ERROR_STATUS** - (sender ==> receiver) Notify the receiver that an error occurred on the sender side, so it can be propagated to the upper levels.
* type - The message type.
* name (name_size) - Name of the requested tensor.
* step_id - Step ID.
* request_index - Request index.
* error_status - The error status (code, message, details).
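For illustration only, the fixed-size portion of the message layout above could be packed with Python's `struct` as follows (field widths follow the table; the authoritative layout, byte offsets and real field sizes are defined by the C++ types in `rdma.h`, so treat this strictly as a sketch):
```python
import struct

# 1B type | 2B name_size | 512B name | 8B step_id | 8B request_index |
# 8B remote_addr/checksum | 4B rkey | 1B is_dead   (variable-size fields omitted)
FIXED_HEADER = struct.Struct("<B H 512s q Q Q I ?")

def pack_tensor_request(name, step_id, request_index, remote_addr, rkey):
  return FIXED_HEADER.pack(0,                 # hypothetical id for RDMA_MESSAGE_TENSOR_REQUEST
                           len(name), name.encode(),
                           step_id, request_index, remote_addr, rkey,
                           False)             # is_dead

buf = pack_tensor_request("edge_1;worker/cpu:0;tensor_a", 7, 42, 0x7f0000001000, 0x1234)
print(len(buf))  # 544 bytes of fixed header
```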

View File

@ -122,17 +122,15 @@ Status GrpcVerbsService::GetRemoteAddressSync(
rc->SetRemoteAddress(ra, false);
rc->Connect();
int i = 0;
int idx[] = {1, 0, 3, 2};
std::vector<RdmaBuffer*> mb(rc->message_buffers());
CHECK_EQ(request->mr_size(), 4);
int idx[] = {1, 0};
std::vector<RdmaMessageBuffer*> mb(rc->message_buffers());
CHECK_EQ(request->mr_size(), RdmaChannel::kNumMessageBuffers);
for (const auto& mr : request->mr()) {
// the connections are crossed, i.e.
// local tx_message_buffer <---> remote rx_message_buffer_
// local rx_message_buffer <---> remote tx_message_buffer_
// local tx_ack_buffer <---> remote rx_ack_buffer_
// local rx_ack_buffer <---> remote tx_ack_buffer_
// hence idx[] = {1, 0, 3, 2}.
RdmaBuffer* rb = mb[idx[i]];
// hence idx[] = {1, 0}.
RdmaMessageBuffer* rb = mb[idx[i]];
RemoteMR rmr;
rmr.remote_addr = mr.remote_addr();
rmr.rkey = mr.rkey();

View File

@ -0,0 +1,87 @@
## Verbs implementation to use direct tensor writes (0 copies)
### Motivation:
Following HKUST research on the use of GPU direct, and their [GDR implementation](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/gdr/README.md), we wish to adopt the 0-copies approach and apply it to the current verbs implementation, while keeping the current implementation's advantages, such as configurability and the use of RDMA for control messages.
### Performance:
Compared with the current GRPC, verbs and GDR implementations, the resulting implementation gave the best performance for every model, with any number of nodes. For VGG16 on 8 nodes with 4 P100 GPUs each, the prototype beat the second-best implementation by over 15%.
### Implementation requirements:
1. Tensor writes need to be done directly from the source Tensor to the destination Tensor, with no memory copies in between. This should be done for all DMAble tensors which are located either on CPU or on an RDMA-compatible GPU device (GPU direct).
2. Non DMAble tensors (CanMemCopy == false) will be serialized to proto on the sender side, RDMA written to a registered buffer on the receiver side, and then deserialized by the receiver.
3. Tensors which are located on a non-RDMA-compatible GPU will be RDMA written to a registered CPU proxy buffer on the receiver side, and then copied to GPU by the receiver.
### Implementation constraints:
For best stability and proof of correctness, we will divide the implementation into two stages:
1. In the first stage, we will keep changes to the current implementation to the minimum possible. The cost is that we may have unused or unnecessary code leftovers, which may also affect performance.
2. In the second stage, we will iterate over the code again and remove the irrelevant parts.
The design of the solution aims to let us achieve both stages with relative ease.
### Design guidelines:
1. Since we do not want to do any unnecessary memory copying, we will no longer allocate a fixed CPU buffer as the destination for the RDMA write. Instead we will do the writing directly to the result tensor, or if the result tensor is on a device which does not support RDMA, we will do the writing to a proxy CPU tensor and then copy its content to the result tensor.
2. The address of the destination Tensor needs to be sent to the sender side for writing, meaning that the result/proxy tensor should be pre-allocated on the receiver side, prior to sending the tensor request. In order to do that, we need to know its meta-data, i.e. shape and data-type for DMAble tensors, and proto-size for serialized tensors. Unfortunately, this information is only available on the sender side, which complicates matters. In order to avoid sending extra messages for querying the meta-data on each step, we store a local meta-data cache per tensor. Based on the assumption that the meta-data of a tensor rarely changes between steps, we expect that in most cases the cache will only be updated once. When the sender receives a request for a tensor, if it is the first time this tensor is requested, or in the rare case that the meta-data did change, the sender will first send a meta-data response, upon which the receiver will update its local cache and reallocate the result/proxy tensors if required. The tensor request sent by the receiver also contains the meta-data currently stored in its local cache, so the sender can compare it to see if there was a change.
3. When the sender writes the tensor content to the result tensor, no additional data is written with it. That means we need to rely on the ibverbs immediate value (uint32_t) to indicate which request we are responding to (in order to trigger the receive callback). The easiest and most elegant way is to key the recv callback with a unique request_index (uint32_t), instead of the current key_with_step_id (string). See the sketch after this list.
4. Since the sender no longer writes the tensor from/to fixed buffers, we no longer need to schedule the writes using the local/remote status. In addition, we no longer rely on the RdmaTensorBuffer members as the source/destination addresses and rkey/lkey. Instead, each RdmaTensorBuffer will hold multiple "Response" objects (one per step-id), from which we derive the destination address and rkey. The source address and lkey are always the ones of the source Tensor.
5. With the addition of tensor pre-allocation, we noticed there is a large code similarity between sending the first tensor request and re-sending the request in case of meta-data changes. After implementing a common method for tensor pre-allocation, it turned out that the implementation becomes much simpler by encapsulating the process of request sending/re-sending, meta-data response callback and content response callback, all in a single "Request" class. The request class holds all the relevant request information, which reduces excessive parameter passing and lambda capturing. This decision is purely for elegance and code simplicity, and we decided to implement it in the first stage because it makes the implementation much easier.
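A small sketch of guideline 3 above: pending receive callbacks keyed by a 32-bit request index that can ride on the ibverbs immediate value (illustrative only; the names below are made up):
```python
import itertools

_next_index = itertools.count()
_pending_requests = {}   # request_index -> done callback

def register_recv_callback(done):
  """Key the callback by a uint32 request index instead of a key_with_step_id string."""
  index = next(_next_index) % 0xFFFFFFFD   # stay below the reserved ACK/MESSAGE immediates
  _pending_requests[index] = done
  return index

def on_tensor_write_completion(immediate_value):
  """The immediate value of the incoming RDMA write identifies the pending request."""
  _pending_requests.pop(immediate_value)()

idx = register_recv_callback(lambda: print("tensor content received"))
on_tensor_write_completion(idx)
```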
### New types/classes:
* **enum RdmaImmDataType** - Immediate types to distinguish between different RDMA writes on the remote side. Ack writes and control-message writes have a fixed immediate value. The rest of the writes are tensor writes and the immediate value is the relevant request index.
* **enum RdmaWriteIDType** - Types to distinguish between different RDMA write-complete events: Ack, control message, tensor DMA write and tensor proto write.
* **class RdmaWriteID** - Context for RDMA write complete events. Holds the RdmaWriteIDType and additional data.
* **class RemoteAddressContext** - Remote address information (address + mr). Will be passed as write context for tensor proto writes.
* **class RdmaTensorMetaData** - Meta-data for a tensor (type, shape, is_dead, proto_size).
* **class RdmaMemoryMgr** - Manages the meta-data cache, and the registered memory regions.
* **class RdmaTensorRequest** - Holds and manages information for a single tensor request throughout the entire receive cycle. API:
* Start() - Start the request.
* RecvTensorMetaData() - Receive meta-data from the remote side.
* RecvTensorContent() - Receive tensor content from the remote side and invoke the done() callback.
* **class RdmaTensorResponse** - Holds information for a single tensor response, such as destination address and rkey.
### Protocol changes:
The protocol messages themselves will remain mostly unchanged at the first stage, but will be used differently, as described below. The current message structures already have most of the required fields for the new implementation. The only change is to the "buffer_size" field, which is no longer used since we are no longer sending additional information with the tensor, and thus it would always equal the "tensor_bytes" field. Instead, we use that field to pass the "request_index".
### Message structure:
| type | name_size | name | step_id | request_index | remote_addr | rkey | is_dead | data_type | tensor_shape | tensor_bytes |
|------|---------- |------|---------|---------------|-------------|------|---------|-----------|--------------|--------------|
| 1B | 2B | 512 | 8B | 8B | 8B | 4B | 1B | XB | XB | 8B |
* **RDMA_MESSAGE_TENSOR_REQUEST** - (receiver ==> sender) The original tensor request.
* type - The message type.
* name (name_size) - Name of the requested tensor.
* step_id - Step ID.
* request_index - Request index.
* remote_addr/rkey - Address/rkey of the result/proxy tensor. Irrelevant for first-time request.
* is_dead/data_type/tensor_shape/tensor_bytes - The current meta-data as stored in the receiver local cache. The sender will use that information to know if the receiver's cache requires updating.
* **RDMA_MESSAGE_BUFFER_REQUEST** - (sender ==> receiver) The meta-data update message in case meta-data had changed (or if it is the first time the tensor is requested).
* type - The message type.
* request_index - Request index.
* is_dead/data_type/tensor_shape/tensor_bytes - The up-to-date meta-data.
* **RDMA_MESSAGE_BUFFER_RESPONSE** - (receiver ==> sender) Tensor re-request after meta-data update and reallocation of result/proxy tensors.
* type - The message type.
* name (name_size) - Name of the requested tensor.
* step_id - Step ID.
* request_index - Request index.
* remote_addr/rkey - Address/rkey of the reallocated result/proxy tensor.
* is_dead/data_type/tensor_shape/tensor_bytes - The new meta-data. Will be removed in the next phase.
* **RDMA_MESSAGE_TENSOR_WRITE** - (sender ==> receiver) No longer sent. There is only a direct write of the tensor content to the result/proxy tensor. Request index passed as the immediate value of the write.
* **RDMA_MESSAGE_TENSOR_IDLE** - (receiver ==> sender) No longer sent.
![alt text](verbs_with_0_copies_phase1_protocol.jpg "Phase 1 message protocol")
### Second stage optimizations:
1. Remove unused code leftovers.
2. Remove the ACK buffer completely, since we can rely completely on its immediate value.
### Future optimizations:
1. Map the tensor names to indexes, to significantly reduce the request message size.
2. Understand the purpose of empty tensors and if we can skip remote fetching for them.
3. Consider concatenating multiple requests and/or using multiple message buffers.
4. Consider a no-request architecture.

File diff suppressed because it is too large Load Diff

View File

@ -27,6 +27,7 @@ limitations under the License.
#include <unordered_map>
#include <vector>
#include "tensorflow/contrib/verbs/verbs_util.h"
#include "tensorflow/core/distributed_runtime/worker_env.h"
#include "tensorflow/core/framework/rendezvous.h"
#include "tensorflow/core/framework/tensor.h"
@ -43,6 +44,11 @@ namespace tensorflow {
#define SL_DEFAULT 0
#define TRAFFIC_CLASS 0
#define RDMA_LOG_0 LOG(INFO)
#define RDMA_LOG_1 VLOG(1)
#define RDMA_LOG_2 VLOG(2)
#define RDMA_LOG(LEVEL) RDMA_LOG_##LEVEL
struct RdmaParams {
uint8_t port_num;
uint8_t sgid_index;
@ -76,29 +82,303 @@ enum Location {
local,
remote
};
enum BufferType {
ACK,
MESSAGE,
TENSOR
};
enum RdmaMessageType {
RDMA_MESSAGE_ACK,
RDMA_MESSAGE_BUFFER_IDLE,
RDMA_MESSAGE_BUFFER_REQUEST,
RDMA_MESSAGE_BUFFER_RESPONSE,
RDMA_MESSAGE_META_DATA_UPDATE,
RDMA_MESSAGE_TENSOR_RE_REQUEST,
RDMA_MESSAGE_TENSOR_REQUEST,
RDMA_MESSAGE_TENSOR_WRITE
RDMA_MESSAGE_ERROR_STATUS,
};
class RdmaBuffer;
struct RdmaMessage {
RdmaMessageType type_;
uint16_t name_size_;
string name_;
int64 step_id_;
uint64_t request_index_;
union {
uint64_t remote_addr_;
#ifdef RDMA_DATA_VALIDATION
uint64_t checksum_;
#endif
};
uint32_t rkey_;
bool is_dead_;
DataType data_type_;
TensorShape tensor_shape_;
size_t tensor_bytes_;
// For error status:
Status status_;
// type|name_size|name|step_id|request_index|remote_addr/checksum|rkey|...
// 1B| 2B | 512| 8B | 8B | 8B | 4B |...
// ...|is_dead|data_type|tensor_shape|tensor_bytes|error_status |
// ...| 1B | XB | XB | 8B |size - 4B, proto - XB |
static const size_t kNameCapacity = 512;
static const size_t kTypeStartIndex = 0;
static const size_t kNameSizeStartIndex = kTypeStartIndex + sizeof(type_);
static const size_t kNameStartIndex =
kNameSizeStartIndex + sizeof(name_size_);
static const size_t kStepIdStartIndex = kNameStartIndex + kNameCapacity;
static const size_t kRequestIndexStartIndex =
kStepIdStartIndex + sizeof(step_id_);
static const size_t kRemoteAddrStartIndex =
kRequestIndexStartIndex + sizeof(request_index_);
static const size_t kChecksumStartIndex = kRemoteAddrStartIndex;
static const size_t kRkeyStartIndex =
kRemoteAddrStartIndex + sizeof(remote_addr_);
static const size_t kIsDeadStartIndex = kRkeyStartIndex + sizeof(rkey_);
static const size_t kDataTypeStartIndex =
kIsDeadStartIndex + sizeof(is_dead_);
static const size_t kTensorShapeStartIndex =
kDataTypeStartIndex + sizeof(data_type_);
static const size_t kTensorBytesStartIndex =
kTensorShapeStartIndex + sizeof(TensorShape);
static const size_t kErrorStatusStartIndex =
kTensorBytesStartIndex + sizeof(tensor_bytes_);
static const size_t kErrorStatusMaxSize = 4096;
static const size_t kMessageTotalBytes = kErrorStatusStartIndex;
static const size_t kRdmaMessageBufferSize =
kMessageTotalBytes + kErrorStatusMaxSize;
static string CreateMessage(const RdmaMessage& rm);
static void ParseMessage(RdmaMessage& rm, void* buffer);
};
// Immediate types for RDMA write
enum RdmaImmDataType {
RDMA_IMM_MAX_REQUEST_ID = 0xFFFFFFFD,
RDMA_IMM_DATA_ACK = 0xFFFFFFFE,
RDMA_IMM_DATA_MESSAGE = 0xFFFFFFFF
};
// Write types for RDMA write-complete events
enum RdmaWriteIDType {
RDMA_WRITE_ID_ACK,
RDMA_WRITE_ID_MESSAGE,
RDMA_WRITE_ID_TENSOR_WRITE
};
// Context for RDMA write-complete events
class RdmaWriteID {
public:
RdmaWriteID(RdmaWriteIDType write_type, void* write_context)
: write_type(write_type), write_context(write_context) {}
RdmaWriteIDType write_type;
void* write_context;
};
// Tensor meta-data
class TensorMetaData {
public:
TensorShape tensor_shape_;
DataType data_type_;
size_t proto_size_;
bool is_dead_;
std::ostream& print(std::ostream& out) const {
out << "Dtype = " << DataTypeString(data_type_)
<< ", Shape = " << tensor_shape_.DebugString() << ", Proto size = 0x"
<< std::hex << proto_size_ << ", Is dead = " << is_dead_;
return out;
}
};
inline std::ostream& operator<<(std::ostream& out,
const TensorMetaData& meta_data) {
return meta_data.print(out);
}
class RdmaChannel;
void MRDeleter(ibv_mr* mr);
using MemoryRegionPtr = std::unique_ptr<ibv_mr, decltype(&MRDeleter)>;
// RdmaMemoryMgr
// Manages the local meta-data cache, and the registered RDMA memory regions.
class RdmaMemoryMgr {
public:
static RdmaMemoryMgr& Singleton() {
static RdmaMemoryMgr instance;
return instance;
}
// Memory regions
ibv_mr* FindMemoryRegion(void* addr, size_t length);
void InsertMemoryRegion(void* addr, size_t length,
const std::string& allocator_name);
void EvictMemoryRegion(void* addr, size_t length);
// Tensor meta-data cache
const TensorMetaData* GetTensorMetaData(const std::string& tensor_name);
const TensorMetaData* SetTensorMetaData(const std::string& tensor_name,
DataType dtype,
const TensorShape& shape,
bool is_dead, size_t proto_size);
struct ibv_pd* pd_;
protected:
RdmaMemoryMgr() : pd_(nullptr) {}
static bool Comparator(const void* ptr, const MemoryRegionPtr& other) {
return ptr < reinterpret_cast<char*>(other->addr) + other->length;
}
private:
mutex tensor_meta_data_mu_;
std::unordered_map<std::string, TensorMetaData> tensors_meta_data_;
// Managed memory regions
mutex mrs_mu_;
std::vector<MemoryRegionPtr> mrs_ GUARDED_BY(mrs_mu_);
};
// RdmaTensorRequest
// Represents a single tensor request.
class RdmaTensorRequest {
public:
typedef Rendezvous::DoneCallback RecvDoneCallback;
// Creates a tensor request identified by index.
RdmaTensorRequest(uint32_t index, const string& key, int64 step_id,
RdmaChannel* channel, Device* dst_dev,
const Rendezvous::Args recv_args,
const RecvDoneCallback& done);
~RdmaTensorRequest();
// Request unique index.
uint32_t index() { return index_; }
// Start the tensor request sequence.
//
// 1. Allocate the result tensor (and proxy tensor if required).
// 2. Send RDMA_MESSAGE_TENSOR_REQUEST to the remote side.
void Start();
// Receive tensor meta-data.
//
// 1. Update the local meta-data cache.
// 2. Reallocate the result tensor (and proxy tensor if required).
// 3. Re-send the request to the remote side.
void RecvTensorMetaData(DataType dtype, TensorShape shape, bool is_dead,
size_t proto_size);
// Receive tensor content (RDMA write was completed).
//
// Decode proto if required and/or move to GPU if the content was not
// written to it directly (GPU direct is not available). Afterwards,
// invoke Done().
void RecvTensorContent();
// Receive error status (in case of a remote error).
// Invoke Done() with the status code.
void RecvErrorStatus(const Status& status);
#ifdef RDMA_DATA_VALIDATION
// Receive tensor checksum
//
// For validation: Get and store the Tensor's expected checksum for the
// current request. Compare the result Tensor's checksum with the stored
// checksum right before invoking Done().
void RecvTensorChecksum(uint64_t checksum) { checksum_ = checksum; }
#endif
private:
void Done(const Status& s);
void Send(RdmaMessageType message_type);
bool AllocateTensors();
void AllocateTensorsAsync(StatusCallback done);
void DeallocateTensors();
uint32_t index_;
string key_;
int64 step_id_;
RdmaChannel* channel_;
Device* dst_dev_;
Rendezvous::Args recv_args_;
const TensorMetaData* meta_data_;
Tensor* result_tensor_;
Tensor* proxy_tensor_;
void* rdma_addr_;
ibv_mr* mr_;
RecvDoneCallback done_;
#ifdef RDMA_DATA_VALIDATION
uint64_t checksum_;
#endif
};
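// Receiver-side lifecycle sketch (illustrative; `channel`, `key`, `step_id`,
// `dst_dev`, `recv_args` and `done` are assumed to come from the rendezvous,
// as in RdmaRemoteRendezvous::RecvFromRemoteAsync later in this change):
//
//   RdmaTensorRequest* request =
//       channel->InsertTensorRequest(key, step_id, dst_dev, recv_args, done);
//   request->Start();  // allocates tensors, sends RDMA_MESSAGE_TENSOR_REQUEST
//
// The request is then driven by incoming events: a meta-data update invokes
// RecvTensorMetaData() and re-sends the request, a completed tensor write
// invokes RecvTensorContent(), and a remote failure invokes RecvErrorStatus().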
// RdmaTensorResponse
// Represents a single tensor response.
class RdmaTensorResponse {
public:
// Creates a response for request message.
RdmaTensorResponse(RdmaChannel* channel, const RdmaMessage& rm)
: channel_(channel), rm_(rm) {}
void Update(const RdmaMessage& rm) { rm_ = rm; }
// Start the tensor response sequence.
//
// 1. Find the tensor in the local tag-match table and invoke RecvHandler.
// (Using RecvLocalAsync()).
// 2. Compare the tensor's meta-data to the meta-data in the message (taken
// from the requester's local cache).
// If meta-data changed:
// a. Clone the tensor to be sent later.
// b. Send a meta-data update message and wait for re-request.
// Else:
// a. Send the tensor's content (using direct RDMA write).
void Start();
// Resume the response sequence, after a re-request.
//
// 1. Send the tensor's content that was cloned earlier.
void Resume();
// Destroy the response's resources and remove it from the pending list.
void Destroy();
private:
void RecvHandler(Rendezvous::ParsedKey parsed,
const Rendezvous::Args& send_args,
const Rendezvous::Args& recv_args, const Tensor& in,
bool is_dead);
void Clone(const Tensor& in, const TensorProto& proto, bool is_dead);
void Send(const Tensor& in, const TensorProto& proto, bool is_dead,
const Status& status);
bool TensorMetaDataChanged(const Tensor& in, bool is_dead);
Status PrepareRecvTensor(const Rendezvous::ParsedKey& parsed,
Device** src_dev);
void SendMetaData(const Tensor& in, const TensorProto& proto, bool is_dead);
void SendContent(const Tensor& in, const TensorProto& proto, bool is_dead);
void SendErrorStatus(const Status& status);
RdmaChannel* channel_;
RdmaMessage rm_; // The request message
Device* src_dev_ = nullptr;
TensorBuffer* src_buffer_ = nullptr;
void* src_addr_ = nullptr;
ibv_mr* mr_ = nullptr;
uint64_t checksum_ = 0;
bool meta_data_changed_ = false;
// Items cloned for re-sending the response after a meta-data update:
TensorProto* proto_ = nullptr;
Tensor* tensor_ = nullptr;
bool is_dead_ = false;
};
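// Sender-side lifecycle sketch (illustrative; `channel` and the request
// message `rm` are assumed to come from the incoming-message handler):
//
//   RdmaTensorResponse* response = channel->AddTensorResponse(rm);
//   response->Start();  // sends the content directly, or a meta-data update
//                       // first if the requester's cached meta-data is stale
//   ...
//   // If a re-request arrives after a meta-data update:
//   channel->UpdateTensorResponse(rm)->Resume();
//   ...
//   // Once the write completes:
//   response->Destroy();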
class RdmaMessageBuffer;
// Class that represents the Rdma Adapter.
// Responsible for creation of the completion queue, and handling
// of work completions.
class RdmaAdapter {
friend class RdmaChannel;
friend class RdmaBuffer;
friend class RdmaAckBuffer;
friend class RdmaMessageBuffer;
friend class RdmaTensorBuffer;
friend class RdmaTensorResponse;
friend class RdmaMgr;
friend class RdmaRemoteRendezvous;
@ -133,10 +413,10 @@ class RdmaAdapter {
// Responsible for connecting queue pairs.
class RdmaChannel {
friend class RdmaAdapter;
friend class RdmaBuffer;
friend class RdmaAckBuffer;
friend class RdmaMessageBuffer;
friend class RdmaTensorBuffer;
friend class RdmaTensorRequest;
friend class RdmaTensorResponse;
friend class RdmaMgr;
friend class RdmaRemoteRendezvous;
@ -146,22 +426,28 @@ class RdmaChannel {
~RdmaChannel();
inline const RdmaAddress& self() { return self_; }
RdmaAddress address() const;
inline const std::vector<RdmaBuffer*>& message_buffers() const {
inline const std::vector<RdmaMessageBuffer*>& message_buffers() const {
return message_buffers_;
}
void Connect(const RdmaAddress& remoteAddr);
void Connect();
void Recv();
RdmaBuffer* FindBuffer(const uint32_t index);
RdmaBuffer* FindBuffer(const string& name);
RdmaBuffer* FindOrCreateBuffer(const string& name,
BufferType buffer_type = TENSOR);
uint32_t LookupBufferIndex(const string& buffer_name);
void SetRemoteAddress(const RdmaAddress& ra, bool override);
void InsertRecvCallback(const string& key, std::function<void()> recv_done);
void RemoveRecvCallback(const string& key);
void RunRecvCallback(const string& key);
static const int kNumMessageBuffers = 4;
// Requests:
RdmaTensorRequest* InsertTensorRequest(
const string& key, int64 step_id, Device* dst_dev,
const Rendezvous::Args recv_args,
const RdmaTensorRequest::RecvDoneCallback& done);
void RemoveTensorRequest(uint32_t request_index);
RdmaTensorRequest* GetTensorRequest(uint32_t request_index);
// Responses:
RdmaTensorResponse* AddTensorResponse(const RdmaMessage& rm);
RdmaTensorResponse* UpdateTensorResponse(const RdmaMessage& rm);
void RemoveTensorResponse(uint32_t request_index);
static const int kNumMessageBuffers = 2;
static const int kPingRecvWrid = 0;
private:
@ -179,36 +465,31 @@ class RdmaChannel {
string remote_name_;
ibv_qp* qp_;
mutex mu_;
bool connected_ GUARDED_BY(bt_mu_) = false;
RdmaAddress remote_ GUARDED_BY(bt_mu_);
bool remote_set_ GUARDED_BY(bt_mu_) = false;
bool connected_ GUARDED_BY(mu_) = false;
RdmaAddress remote_ GUARDED_BY(mu_);
bool remote_set_ GUARDED_BY(mu_) = false;
mutex ct_mu_;
typedef std::unordered_map<string, std::function<void()> > CallbackTable;
CallbackTable callback_table_ GUARDED_BY(ct_mu_);
mutex bt_mu_;
typedef std::unordered_map<unsigned int, RdmaBuffer*> BufferTable;
BufferTable buffer_table_ GUARDED_BY(bt_mu_);
typedef std::unordered_map<uint32_t, string> BufferIndexNameTable;
BufferIndexNameTable buffer_index_name_table_ GUARDED_BY(bt_mu_);
typedef std::unordered_map<string, uint32_t> BufferNameIndexTable;
BufferNameIndexTable buffer_name_index_table_ GUARDED_BY(bt_mu_);
RdmaBuffer* tx_message_buffer_;
RdmaBuffer* rx_message_buffer_;
RdmaBuffer* tx_ack_buffer_;
RdmaBuffer* rx_ack_buffer_;
std::vector<RdmaBuffer*> message_buffers_;
typedef std::unordered_map<uint32_t, RdmaTensorRequest> RequestTable;
RequestTable request_table_ GUARDED_BY(ct_mu_);
uint32_t request_serial_ GUARDED_BY(ct_mu_);
mutex responses_mu_;
typedef std::unordered_map<uint32_t, RdmaTensorResponse> ResponsesTable;
ResponsesTable responses_table_ GUARDED_BY(responses_mu_);
RdmaMessageBuffer* tx_message_buffer_;
RdmaMessageBuffer* rx_message_buffer_;
std::vector<RdmaMessageBuffer*> message_buffers_;
};
// Class that represents a buffer for Rdma writes and reads.
class RdmaBuffer {
// Class that represents a buffer for Rdma message sending.
class RdmaMessageBuffer {
friend class RdmaChannel;
friend class RdmaAdapter;
friend class RdmaMgr;
friend class RdmaRemoteRendezvous;
public:
explicit RdmaBuffer(RdmaChannel* channel, string name);
virtual ~RdmaBuffer();
explicit RdmaMessageBuffer(RdmaChannel* channel, string name);
~RdmaMessageBuffer();
inline void* buffer() const { return buffer_; }
inline ibv_mr* self() const { return self_; }
@ -223,13 +504,15 @@ class RdmaBuffer {
}
void FreeBuffer();
void EnqueueItem(string Item);
virtual void SendNextItem() {};
void SendNextItem();
void CreateCPUBuffer(size_t size, bool lock = true);
void SetRemoteMR(RemoteMR rmi, bool override);
uint32_t LookupBufferIndex(const string& buffer_name) {
return const_cast<RdmaChannel*>(channel_)->LookupBufferIndex(buffer_name);
}
void Write(uint32_t imm_data, size_t buffer_size);
static void Write(const RdmaChannel* channel, uint32_t imm_data,
size_t buffer_size, uint64_t src_addr, uint32_t lkey,
uint64_t remote_addr, uint32_t rkey,
RdmaWriteIDType write_type, void* write_context);
static void SendAck(const RdmaChannel* channel);
protected:
const RdmaChannel* channel_;
@ -245,125 +528,6 @@ class RdmaBuffer {
BufferStatus remote_status_ GUARDED_BY(mu_) = none;
};
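// Illustrative sketch of sending the head of the queue with the new static
// Write() helper (assumptions: this runs inside SendNextItem(), `remote_mr`
// holds the peer's RemoteMR previously passed to SetRemoteMR(), and this
// buffer is used as the write context):
//
//   RdmaMessageBuffer::Write(channel_, RDMA_IMM_DATA_MESSAGE,
//                            RdmaMessage::kMessageTotalBytes,
//                            reinterpret_cast<uint64_t>(buffer_), self_->lkey,
//                            remote_mr.remote_addr, remote_mr.rkey,
//                            RDMA_WRITE_ID_MESSAGE, this);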
class RdmaAckBuffer : public RdmaBuffer {
public:
explicit RdmaAckBuffer(RdmaChannel* channel, string name);
virtual ~RdmaAckBuffer() override {}
void SendNextItem() override;
};
class RdmaMessageBuffer : public RdmaBuffer {
friend class RdmaChannel;
friend class RdmaAapater;
public:
explicit RdmaMessageBuffer(RdmaChannel* channel, string name);
virtual ~RdmaMessageBuffer() override {}
void SendNextItem() override;
};
class RdmaTensorBuffer : public RdmaBuffer {
public:
explicit RdmaTensorBuffer(RdmaChannel* channel, string name);
virtual ~RdmaTensorBuffer() override;
void SendNextItem() override;
void PostCopyOperations(bool can_memcpy, size_t buffer_size,
size_t tensor_bytes, const string& key,
const Tensor& in, int64 step_id, bool is_dead,
const string& key_with_step_id, const Tensor* copy,
const TensorProto* proto, const StringPiece* copy_buf,
const Rendezvous::Args& send_args,
const Rendezvous::Args& recv_args);
void ReSendNextItem();
private:
Rendezvous::DoneCallback getRecvTensorCallback(
const string& key_with_step_id, const string& key, int64 step_id,
const Rendezvous::ParsedKey& parsed);
struct ReItem {
Rendezvous::Args send_args;
Rendezvous::Args recv_args;
Tensor in;
bool is_dead;
ReItem(const Rendezvous::Args& send_args_,
const Rendezvous::Args& recv_args_, const Tensor& in_, bool is_dead_)
: send_args(send_args_),
recv_args(recv_args_),
in(in_),
is_dead(is_dead_) {
if (send_args.device_context) {
send_args.device_context->Ref();
}
if (recv_args.device_context) {
recv_args.device_context->Ref();
}
}
~ReItem() {
if (send_args.device_context) {
send_args.device_context->Unref();
}
if (recv_args.device_context) {
recv_args.device_context->Unref();
}
}
};
typedef std::map<string, ReItem*> Table;
typedef Table::iterator Itable;
std::queue<string> requeue GUARDED_BY(mu_);
Table retable GUARDED_BY(mu_);
};
struct RdmaMessage {
RdmaMessageType type_;
uint16_t name_size_;
string name_;
int64 step_id_;
uint64_t buffer_size_;
uint64_t remote_addr_;
uint32_t rkey_;
bool is_dead_;
DataType data_type_;
TensorShape tensor_shape_;
size_t tensor_bytes_;
// type|name_size|name|step_id|buffer_size|remote_addr|rkey|is_dead|...
// 1B| 2B | 512| 8B | 8B | 8B | 4B | 1B |...
// ...|data_type|tensor_shape|tensor_bytes|tensor_buffer
// ...| XB | XB | 8B |...
//
static const size_t kNameCapacity = 512;
static const size_t kTypeStartIndex = 0;
static const size_t kNameSizeStartIndex = kTypeStartIndex + sizeof(type_);
static const size_t kNameStartIndex =
kNameSizeStartIndex + sizeof(name_size_);
static const size_t kStepIdStartIndex = kNameStartIndex + kNameCapacity;
static const size_t kBufferSizeStartIndex =
kStepIdStartIndex + sizeof(step_id_);
static const size_t kRemoteAddrStartIndex =
kBufferSizeStartIndex + sizeof(buffer_size_);
static const size_t kRkeyStartIndex =
kRemoteAddrStartIndex + sizeof(remote_addr_);
static const size_t kIsDeadStartIndex = kRkeyStartIndex + sizeof(rkey_);
static const size_t kDataTypeStartIndex =
kIsDeadStartIndex + sizeof(is_dead_);
static const size_t kTensorShapeStartIndex =
kDataTypeStartIndex + sizeof(data_type_);
static const size_t kTensorBytesStartIndex =
kTensorShapeStartIndex + sizeof(TensorShape);
static const size_t kTensorBufferStartIndex =
kTensorBytesStartIndex + sizeof(tensor_bytes_);
static const size_t kMessageTotalBytes = kTensorBufferStartIndex;
static const size_t kRdmaMessageBufferSize = kMessageTotalBytes;
static const size_t kRdmaAckBufferSize = kMessageTotalBytes;
static string CreateMessage(const RdmaMessage& rm);
static void ParseMessage(RdmaMessage& rm, void* buffer);
};
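// Usage sketch (mirrors the request path shown elsewhere in this change;
// `key`, `step_id` and `buffer` are placeholders):
//
//   RdmaMessage rm;
//   rm.type_ = RDMA_MESSAGE_TENSOR_REQUEST;
//   rm.name_size_ = key.size();
//   rm.name_ = key;
//   rm.step_id_ = step_id;
//   string message = RdmaMessage::CreateMessage(rm);
//   // The serialized string is copied into a registered message buffer and
//   // written to the peer with imm_data = RDMA_IMM_DATA_MESSAGE.
//
//   RdmaMessage received;
//   RdmaMessage::ParseMessage(received, buffer);  // receiver side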
} // namespace tensorflow
#endif // TENSORFLOW_USE_VERBS

@ -16,11 +16,16 @@ limitations under the License.
#ifdef TENSORFLOW_USE_VERBS
#include "tensorflow/contrib/verbs/rdma_mgr.h"
#include <fstream>
#include <vector>
#include "tensorflow/contrib/verbs/grpc_verbs_client.h"
#include "tensorflow/contrib/verbs/verbs_service.pb.h"
#include "tensorflow/core/common_runtime/bfc_allocator.h"
#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
#include "tensorflow/core/common_runtime/gpu/process_state.h"
#include "tensorflow/core/distributed_runtime/rpc/grpc_worker_cache.h"
#include "tensorflow/core/distributed_runtime/session_mgr.h"
#include "tensorflow/core/framework/allocator_registry.h"
#include "tensorflow/core/lib/core/status.h"
namespace tensorflow {
@ -53,7 +58,7 @@ RdmaMgr::RdmaMgr(const WorkerEnv* const worker_env,
void RdmaMgr::SetupChannels() {
for (const auto& p : channel_table_) {
string worker_name = p.first;
LOG(INFO) << "connecting to remote node " << worker_name;
RDMA_LOG(2) << "Connecting to remote node " << worker_name;
RdmaChannel* rc = p.second;
GetRemoteAddressRequest req;
GetRemoteAddressResponse resp;
@ -78,39 +83,49 @@ void RdmaMgr::SetupChannels() {
mr->set_rkey(rc->message_buffers_[i]->self_->rkey);
}
// synchronous call
Status s = client->GetRemoteAddress(&req, &resp);
// save obtained remote addresses
// connect to the remote channel
if (s.ok()) {
CHECK(worker_name.compare(resp.host_name()) == 0);
RdmaAddress ra;
ra.lid = resp.channel().lid();
ra.qpn = resp.channel().qpn();
ra.psn = resp.channel().psn();
ra.snp = resp.channel().snp();
ra.iid = resp.channel().iid();
rc->SetRemoteAddress(ra, false);
rc->Connect();
int i = 0;
int idx[] = {1, 0, 3, 2};
for (const auto& mr : resp.mr()) {
// the connections are crossed, i.e.
// local tx_message_buffer <---> remote rx_message_buffer_
// local rx_message_buffer <---> remote tx_message_buffer_
// local tx_ack_buffer <---> remote rx_ack_buffer_
// local rx_ack_buffer <---> remote tx_ack_buffer_
// hence idx[] = {1, 0, 3, 2}.
RdmaBuffer* rb = rc->message_buffers_[idx[i]];
RemoteMR rmr;
rmr.remote_addr = mr.remote_addr();
rmr.rkey = mr.rkey();
rb->SetRemoteMR(rmr, false);
i++;
Status s;
int attempts = 0;
static const int max_num_attempts = 5;
do {
s = client->GetRemoteAddress(&req, &resp);
// save obtained remote addresses
// connect to the remote channel
if (s.ok()) {
CHECK(worker_name.compare(resp.host_name()) == 0);
RdmaAddress ra;
ra.lid = resp.channel().lid();
ra.qpn = resp.channel().qpn();
ra.psn = resp.channel().psn();
ra.snp = resp.channel().snp();
ra.iid = resp.channel().iid();
rc->SetRemoteAddress(ra, false);
rc->Connect();
int i = 0;
int idx[] = {1, 0};
for (const auto& mr : resp.mr()) {
// the connections are crossed, i.e.
// local tx_message_buffer <---> remote rx_message_buffer_
// local rx_message_buffer <---> remote tx_message_buffer_
// hence idx[] = {1, 0}.
RdmaMessageBuffer* rb = rc->message_buffers_[idx[i]];
RemoteMR rmr;
rmr.remote_addr = mr.remote_addr();
rmr.rkey = mr.rkey();
rb->SetRemoteMR(rmr, false);
i++;
}
CHECK(i == RdmaChannel::kNumMessageBuffers);
} else {
LOG(ERROR) << "Connecting to " << worker_name
<< ": Got " << s.error_message() << ". Retrying ("
<< (attempts + 1) << "/" << max_num_attempts << ")...";
if (++attempts == max_num_attempts) {
break;
}
worker_env_->env->SleepForMicroseconds(2000000);
}
CHECK(i == RdmaChannel::kNumMessageBuffers);
} else {
LOG(ERROR) << s.error_message();
}
} while (!s.ok());
RDMA_LOG(0) << "Connected to remote node " << worker_name;
delete client;
}
}
@ -183,6 +198,138 @@ RdmaChannel* RdmaMgr::FindChannel(const string& name) {
return iter->second;
}
bool IsGDRAvailable() {
#if defined(__APPLE__)
return false;
#elif defined(PLATFORM_WINDOWS)
return false;
#else
std::ifstream ifs("/proc/modules");
string line;
while (std::getline(ifs, line)) {
auto sep = line.find(' ');
CHECK_NE(sep, std::string::npos);
if (line.substr(0, sep) == "nv_peer_mem") {
return true;
}
}
return false;
#endif
}
int TryToReadNumaNode(ibv_device* device) {
#if defined(__APPLE__)
LOG(INFO) << "OS X does not support NUMA - returning NUMA node 0";
return 0;
#elif defined(PLATFORM_WINDOWS)
// Windows support for NUMA is not currently implemented. Return node 0.
return 0;
#else
VLOG(2) << "Trying to read NUMA node for device: " << device->name;
static const int kUnknownNumaNode = -1;
auto filename = string(device->ibdev_path) + "/device/numa_node";
std::ifstream ifs(filename.c_str());
string content;
CHECK(std::getline(ifs, content));
int32 value;
if (strings::safe_strto32(content, &value)) {
if (value < 0) {
LOG(INFO) << "Successful NUMA node read from SysFS had negative value ("
<< value << "), but there must be at least one NUMA node"
", so returning NUMA node zero";
return 0;
}
LOG(INFO) << "NUMA node for device: " << device->name << " is " << value;
return value;
}
return kUnknownNumaNode;
#endif
}
void MRDeleter(ibv_mr* mr) {
if (mr) {
ibv_dereg_mr(mr);
}
}
// TODO(byronyi): remove this class duplicated from the one in
// common/runtime/gpu/pool_allocator.h when it is available in common_runtime
class BasicCPUAllocator : public SubAllocator {
public:
~BasicCPUAllocator() override {}
void* Alloc(size_t alignment, size_t num_bytes) override {
return port::AlignedMalloc(num_bytes, alignment);
}
void Free(void* ptr, size_t) override { port::AlignedFree(ptr); }
};
// TODO(byronyi): remove this class and its registration when the default
// cpu_allocator() returns visitable allocator
class BFCRdmaAllocator : public BFCAllocator {
public:
BFCRdmaAllocator()
: BFCAllocator(new BasicCPUAllocator(), 1LL << 36, true, "cpu_rdma_bfc") {
}
};
REGISTER_MEM_ALLOCATOR("BFCRdmaAllocator", 101, BFCRdmaAllocator);
void RdmaMgr::InitAllocators() {
RdmaMemoryMgr::Singleton().pd_ = rdma_adapter_->pd_;
Allocator* allocators[] = {
#if GOOGLE_CUDA
ProcessState::singleton()->GetCUDAHostAllocator(0),
ProcessState::singleton()->GetCPUAllocator(0),
#endif // GOOGLE_CUDA
cpu_allocator(),
};
using namespace std::placeholders;
std::set<Allocator*> instrumented_;
// Host memory allocators
for (Allocator* allocator : allocators) {
VisitableAllocator::Visitor alloc_visitor =
std::bind(&RdmaMemoryMgr::InsertMemoryRegion,
&RdmaMemoryMgr::Singleton(), _1, _2, allocator->Name());
VisitableAllocator::Visitor free_visitor = std::bind(
&RdmaMemoryMgr::EvictMemoryRegion, &RdmaMemoryMgr::Singleton(), _1, _2);
auto* visitable_allocator = dynamic_cast<VisitableAllocator*>(allocator);
    CHECK(visitable_allocator) << "is not visitable for instrumentation: "
                               << allocator->Name();
// Make sure we don't instrument the same allocator twice
if (instrumented_.find(allocator) == std::end(instrumented_)) {
visitable_allocator->AddAllocVisitor(alloc_visitor);
visitable_allocator->AddFreeVisitor(free_visitor);
instrumented_.insert(allocator);
LOG(INFO) << "Instrumenting CPU allocator " << allocator->Name();
}
}
#if GOOGLE_CUDA
if (IsGDRAvailable()) {
// Note we don't free allocated GPU memory so there is no free visitor
int32_t bus_id = TryToReadNumaNode(rdma_adapter_->context_->device) + 1;
char buf[8];
sprintf(buf, "gpu");
VisitableAllocator::Visitor cuda_alloc_visitor =
std::bind(&RdmaMemoryMgr::InsertMemoryRegion,
&RdmaMemoryMgr::Singleton(), _1, _2, std::string(buf));
ProcessState::singleton()->AddGPUAllocVisitor(bus_id, cuda_alloc_visitor);
LOG(INFO) << "Instrumenting GPU allocator with bus_id " << bus_id;
}
#endif // GOOGLE_CUDA
}
} // end namespace tensorflow
#endif

@ -38,6 +38,7 @@ class RdmaMgr {
RdmaChannel* FindChannel(const string& key);
void SetupChannels();
bool ConnectivityCheck();
void InitAllocators();
const string& local_worker() { return local_worker_; }
private:

@ -21,10 +21,6 @@ limitations under the License.
#include "tensorflow/core/common_runtime/device.h"
#include "tensorflow/core/common_runtime/device_mgr.h"
#include "tensorflow/core/common_runtime/dma_helper.h"
#if GOOGLE_CUDA
#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
#include "tensorflow/core/common_runtime/gpu/process_state.h"
#endif // GOOGLE_CUDA
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/lib/strings/numbers.h"
#include "tensorflow/core/lib/strings/str_util.h"
@ -36,11 +32,6 @@ class RdmaRemoteRendezvous : public BaseRemoteRendezvous {
RdmaRemoteRendezvous(const WorkerEnv* env, int64 step_id, RdmaMgr* rdma_mgr)
: BaseRemoteRendezvous(env, step_id), rdma_mgr_(rdma_mgr) {}
void RecvPostCopyOps(const string& key, const string& key_with_step_id,
const Rendezvous::Args& recv_args,
const DoneCallback& done, const RdmaMessage& rm,
RdmaChannel* rc, Tensor& val, const Status& s);
protected:
void RecvFromRemoteAsync(const Rendezvous::ParsedKey& parsed,
const Rendezvous::Args& args,
@ -74,101 +65,18 @@ void RdmaRemoteRendezvous::RecvFromRemoteAsync(
RdmaChannel* rc = rdma_mgr_->FindChannel(src_name);
string key(std::move(parsed.FullKey().ToString()));
string key_with_step_id = VerbsUtil::AppendStepidToKey(key, step_id_);
// insert callback
rc->InsertRecvCallback(key_with_step_id, [this, key, key_with_step_id, rc,
recv_args, parsed, done]() {
Status src_s, dst_s, s;
Device* src_dev, *dst_dev;
src_s = env_->device_mgr->LookupDevice("CPU:0", &src_dev);
dst_s = env_->device_mgr->LookupDevice(parsed.dst_device, &dst_dev);
if (!src_s.ok() || !dst_s.ok()) {
s = src_s.ok() ? dst_s : src_s;
LOG(ERROR) << "s is not ok, error code " << s.error_message();
done(s, Args(), recv_args, Tensor(), true);
return;
}
RdmaBuffer* rb = rc->FindBuffer(key);
RdmaMessage rm;
CHECK(rb->size_ >= RdmaMessage::kMessageTotalBytes);
RdmaMessage::ParseMessage(rm, rb->buffer_);
CHECK(rm.type_ == RDMA_MESSAGE_TENSOR_WRITE);
Tensor val;
if (!rm.is_dead_) {
void* input = static_cast<char*>(rb->buffer_) +
RdmaMessage::kTensorBufferStartIndex;
bool can_memcpy = DataTypeCanUseMemcpy(rm.data_type_);
if (can_memcpy) {
if (dst_dev->tensorflow_gpu_device_info() &&
(!recv_args.alloc_attrs.on_host())) {
#if GOOGLE_CUDA
CHECK(recv_args.device_context)
<< "send dev name: " << src_dev->name()
<< " gpu_info: " << src_dev->tensorflow_gpu_device_info();
Allocator* alloc = ProcessState::singleton()->GetCUDAHostAllocator(0);
Tensor copy(alloc, rm.data_type_, rm.tensor_shape_);
memcpy(DMAHelper::base(&copy), input, rm.tensor_bytes_);
Allocator* dst_alloc = dst_dev->GetAllocator(recv_args.alloc_attrs);
Tensor gpu_copy(dst_alloc, rm.data_type_, rm.tensor_shape_);
Device* dst_dev;
s = env_->device_mgr->LookupDevice(parsed.dst_device, &dst_dev);
CHECK(s.ok()) << "s is not ok, error code " << s.error_message();
if (!s.ok()) {
done(s, Args(), recv_args, Tensor(), true);
return;
}
GPUUtil::CopyCPUTensorToGPU(
&copy, recv_args.device_context, dst_dev, &gpu_copy,
[this, gpu_copy, key, key_with_step_id, recv_args, done, rm, rc](
const Status& s) {
CHECK(s.ok()) << "copy tensor to gpu sync";
Tensor val;
val = std::move(gpu_copy);
RecvPostCopyOps(key, key_with_step_id, recv_args, done, rm, rc,
val, s);
});
#endif // GOOGLE_CUDA
return;
} else {
AllocatorAttributes host_alloc_attrs;
host_alloc_attrs.set_gpu_compatible(true);
host_alloc_attrs.set_on_host(true);
Allocator* alloc = dst_dev->GetAllocator(host_alloc_attrs);
Tensor copy(alloc, rm.data_type_, rm.tensor_shape_);
memcpy(DMAHelper::base(&copy), input, rm.tensor_bytes_);
val = std::move(copy);
}
} else {
TensorProto proto;
CHECK(rm.tensor_bytes_ + RdmaMessage::kTensorBufferStartIndex <=
rb->size_);
CHECK(ParseProtoUnlimited(&proto, input, rm.tensor_bytes_))
<< "fail to parse proto from array";
s = dst_dev->MakeTensorFromProto(proto, recv_args.alloc_attrs, &val);
}
}
RecvPostCopyOps(key, key_with_step_id, recv_args, done, rm, rc, val, s);
});
// append key to message queue
RdmaBuffer* rb = rc->tx_message_buffer_;
RdmaMessage rm;
rm.type_ = RDMA_MESSAGE_TENSOR_REQUEST;
rm.name_size_ = key.size();
rm.name_ = key;
rm.step_id_ = step_id_;
string message = RdmaMessage::CreateMessage(rm);
rb->EnqueueItem(message);
rb->SendNextItem();
}
void RdmaRemoteRendezvous::RecvPostCopyOps(
const string& key, const string& key_with_step_id,
const Rendezvous::Args& recv_args, const DoneCallback& done,
const RdmaMessage& rm, RdmaChannel* rc, Tensor& val, const Status& s) {
rc->RemoveRecvCallback(key_with_step_id);
RdmaMessage br;
br.type_ = RDMA_MESSAGE_BUFFER_IDLE;
br.name_size_ = key.size();
br.name_ = key;
string message = RdmaMessage::CreateMessage(br);
RdmaBuffer* tb = rc->tx_message_buffer_;
tb->EnqueueItem(message);
tb->SendNextItem();
done(s, Args(), recv_args, val, rm.is_dead_);
RdmaTensorRequest* request =
rc->InsertTensorRequest(key, step_id_, dst_dev, recv_args, done);
request->Start();
}
RdmaRendezvousMgr::RdmaRendezvousMgr(const WorkerEnv* env)

@ -104,6 +104,7 @@ Status VerbsServer::Start() {
[this] { verbs_service_->HandleRPCsLoop(); }));
rdma_mgr_->SetupChannels();
CHECK(rdma_mgr_->ConnectivityCheck()) << "Connectivity check failed!";
rdma_mgr_->InitAllocators();
verbs_state_ = CONNECTED;
}
}

@ -50,6 +50,12 @@ message GetRemoteAddressResponse {
repeated MemoryRegion mr = 3;
}
message ErrorStatusProto {
int32 error_code = 1;
string error_message = 2;
string error_details = 3;
}
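// Sender-side sketch (C++; illustrative only, `status` is assumed to be the
// tensorflow::Status carrying the remote error):
//
//   ErrorStatusProto error_proto;
//   error_proto.set_error_code(status.code());
//   error_proto.set_error_message(status.error_message());
//   // The serialized proto is shipped back to the requester, which rebuilds
//   // a Status and completes the pending tensor request with it.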
////////////////////////////////////////////////////////////////////////////////
//
// VerbsService

Binary file not shown (new image added, 61 KiB).
@ -0,0 +1 @@
<mxfile userAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36" version="7.8.7" editor="www.draw.io" type="device"><diagram name="Page-1" id="74e2e168-ea6b-b213-b513-2b3c1d86103e">7Vxtc9o4EP41zKQfmsGW3/hIgPQ60/RyIZ1rPzHClsFXY1FZEOivP8mW8ZsAB2yXtHQ6jb2SJXl3n0e7K6cdMFhsPhC4nD9gB/kdtetsOmDYUVVFUw32g0u2scRUu7FgRjxHdEoFY+8nEsKk28pzUJjrSDH2qbfMC20cBMimORkkBL/ku7nYz8+6hDNUEoxt6Jel/3oOnQup0u2mDX8hbzYXU1u6aJhC+/uM4FUg5uuowI3+xM0LmIwl+odz6OCXjAiMOmBAMKbx1WIzQD7XbaK2+Ln7Pa27dRMU0CoP6CB+Yg39FUqWHC2MbhNlRK+D+APdDrh7mXsUjZfQ5q0vzPxMNqcLn90p7NL1fH+AfUzYfYAD1ulOzIAIRZu9y1R2L8+cCuEFomTLumx2mo8fEf5kiduX1DhWIptn7GIkQigcYrYbOlUKuxB6kevIkqjI8NkMd463zqnK+LHihrtjL0rfQ9+bBR3QZz185NK0lV3NxM9olHAJg0Q2ppDQm3dJE1tatjUjjqbOS+tfTSKbEskK2lqYcsua+r6PbUgRJwIUhJiEt03OqfI5xyhwbp5Hn8d/P02eRv98GY2f37VhAam2c7PVBVDGTg5Elmtzu1OCv6NMi2FbaOru5iuBVQLp/fgFefwqRhnAalcC4B3jngPghGxrR/ATstfPkTs+IAqHkMKbC/GQ2oD3ZenEsGNah+/ZNeTbLrTnqHkAPiHYLuxlHAjKVCBjg8q0+XsBGahtAlkxjkcryGGRnLjFhM7xDAfQH6XSu7yWMxr9D1G6FcEoXFHMROkInzBein579RjiFbGTEFIsjW3oM5R002IZX+NBbRPkQ+qt89HoWZpTG6LAiCQGBMUgJShc4iBshRvs9SfGDX4/3AZ2s7QbUcAAL7dRGsKvH79wyCPisWd/8hf/6EZv/2PlEeTsffs3B3dT+aX7tv4r0M20RbZfszff+GC3Or/dePRr7u6bmKgaJ2hlTkhS5fo4QTz6iD22lJ0pe728KU2jYKF4UeKp1Eh9QuA2023JO4QH5jELO4RZyECP9E9SttRH4hWkHrPTSTXm00rMd++RkD57Cw7cjjngf1UDLjjggmm4LPJGjN4kwhvMYTBDDmMccF8V53O8mK7C4xh3GHvY1MOMjYbMbzjCLgP3LW/zvePbfDiHS37p+mjT5xWfCKyOuBzaPgxDzz5Ioa5lI9vOIr5bRnxJzVNL1/TKiLckQYBaEfAZXesSVSeyM3kBFCI6tcgL8euUeKE0kNbt3jK/MUxLUSxD10wjP65SjW9OgHjiHm35y5k+Id0FuhflFIbSu1WtHlAVy1KApqs5U2pFU1Z1kcPDgoo70ikeUi5zfkNhyUl4OJh3gbypRUFTUuMUMeTQZoZHTH7Hudbj4aloWHiOE8Unsi0gH7M0wd+gzN+axH3UOqK28ob7Gf++qraK8U6bqpblw00VQqJM75GFyfI0r61yMM/+5MFadgVPwwc+xAvxorz0Jq7SdaITI8qM/e61S3/zqZsh8cvmQjigH9+S28zlRMK2i+2q7tWqKdmrW8rYrKIFtWr74/6M7Zwd1CwZdFcaztDBm4eJ1mqFA1Q4fn0jmU6yn+WQYlZESjtRrVpIdTTzxDi2OJBe9IWaaimleZR6ayOgQj29EfeTlNbOdT+j7H7gstzvcPZjgkaSqtIXEPUlVaC8JdR9rDqIo7Xf6VRVUlqMI2uCN9soQOXnDPcOyh4veJWOF0oDyyLj6PTkY7BmYGMXDs+p+IGu7/NPl45Fxa9S0ZEz1aHkdJf7aeBE77rAayReGoX0tEzjzQfxxePWdoP4JN6UAHyuVIKAyNFlCEeMSEnGUHzEPXY6vVbAXMX2ghkT6Ondc5QevFf3GZ05HnH9aHebe46DgibKBoqp5yzbKxt299Fb1rDFHOAku6pN2ZV/J/EnW9U0jlq115RRZamEYO0iL7o4CiDsHWmldgSu2+1y8ihBdvjQnzyMxuP+h9Ek/1Vcxt7xyCUWnjbgBRdWBwRWnqoVyTeq0uiyjkKgVq65Nmf8h9FzfzLss3+eRuPHvz+PR1cHkDgA0Np0AFm9rXH0XwnggP21Vglg/0lALfYv1s+vFpdY3NDbtHhT2XfNSXUi+LhYxP2ruCF3Qv47M8VRBQO9yvsOJYiyVLLm9773kO+E+VfPLpBulyzPHaSp7sRjXrqJRQFciMaQouWE2V70XGCKJtBxiBB8R9v4in+mPYk/0z6SGZ9w6nUMuTwvNqaGbnTKNUDXVaMa4KVh2MhjWJV86QRkGbZeB4ab/tWihjAsg1m31PvPBbpUPweBfoXtebAFkmCrOdjKvk+8wvYPhO3r9ucrsk9Att7mhpyMcUV2ceqC818+viueVF1xtwd3hqR2XRfu2G36XxzEJ8/p/yMBRv8D</diagram></mxfile>

Binary file not shown (new image added, 87 KiB).
@ -0,0 +1 @@
<mxfile userAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" version="7.8.4" editor="www.draw.io" type="device"><diagram name="Page-1" id="74e2e168-ea6b-b213-b513-2b3c1d86103e">7Vxbc5s4FP41nuk+pMMd8ujYTrYzTZqNk9nmyaOAbNhgRIWc2P31K4G4yzZ1AHtaZzJjOBLS8TnnOzeRDNTRcn2DQejeIgf6A0Vy1gN1PFAUWVMM+sEom4RiGpcJYYE9h0/KCVPvJ+REiVNXngOj0kSCkE+8sEy0URBAm5RoAGP0Xp42R3551xAsYI0wtYFfp/7rOcTlVFmS8oG/obdw+daWzgdegP26wGgV8P0GijqPf5LhJUjX4vMjFzjovUBSJwN1hBEiydVyPYI+k20qtuS56y2jGd8YBqTJA1bywBvwVzDl2PDpo1eO98b4IxsuE+PHijF1ReCaXADfWwQDdUhn+HBO8lF6teCf8SpRCIKUNiUAk09/pUOUqeJogRxvXaa2z01Ke8ECDnpkDCxDehG8ROzjgs4c+j6yAYGPMIgQjkoC64WBKQycT4+Tu+m3h9nD5J+nyfSxax62a6K0m1LaRYlxBpklS3T43fUInIbAZqPv1C9RmkuWPr2Ts6ffIKZMbQWLnEGQujaIlpDgDZ3CH7A4aLlTkw1+/567CCX1EG7BO2RuA3C3tMiWzqFJLzg6xUhNPUbrUH2A9ltia7eQgDEgoHOTa6juNnZjBj2GoE9M1UHc/X4xZq+erq8nDLPT+29308nWTU8LRqrSJ4xkQwCjikCgQ5MBfoswcdECBcCf5NSrssgK4vkPErLh+QxYEURJ+QpfEQpLYmQb7RYi5QutsJ3O4rzSQLqA6TRNLGwMfUC8t/L6H5Kc0pEDivHiON/wU+hQyDzAKERBBLsHDfN8XylM/WG0Cezu9/syj/XyY+VhZjvDPgP7gKGsJRL7LiMUbm7unx7R6F6UXT2dUp7XqSCmEHuUsZ/wqN/45HMnQztq8qQfw8lT0eDN9+LNM1vss85u1x75Xrp75hsdGBq0emhIqvAPhAb+6D3y6M6ZKi+VsipNo6KhhAf+VK6kIcZgU5gWsglR831Us1LL7plvWLvnW9rO+fQi4Ti3sEyGzQKmVguY1x6OyKO3hMyZmCP2W/MyBbeQYDdNy0cuCBbQoX5GvW6KchctX1bRfoQ7NCTZxEPU2YypWTFEdoH6nnO9y/25XuSCkF3Ofbgess5RDFWHX45tH0SRZ5eFNfd8f4R8hOMl0gZPAe9SHe+HodoS5HuKWOIFieoCgaa0D2K/mrwrVewn3NewX1tI1aXPpiwZpiXLlqFrplFeV27mUw6AZWoEfVlFi/5cGhxR9bpx+VmxLlVFtixZ1XSlpDCtqrCmhrB7WbVhbDnEDtSaHTzDqGYKLA8rKzoiGL3CVNUBCmBF+5zEk7exTfUMKf2KuVKPlRt8YOk5TpxpiJxzOfvowherdV+sCcxHaSP/qofCO/T7itqSjihqUYOjq+KKFUD3KBSW7H2VQBfbUqgiAw/jW7bCO6bap5+fkr7cID5BIlSrv8r4iVVThsDAusurVH1/BO2zvOI1VJZwHRx0FVMQdLsposyqBrVmgW57EfWRUGjWFLqlIXdidq/12kVQ6xlDTSKnXU+kAaZk4KZY5P1klXKloNDMA/PI6kJ6VeMtdSVq+8jtdg3UBgcUnRiZoEl1oJEZdSNTj2pku2sMU+2kdEnbSR2ULmrdX7d9FjxK8qLf6ShY0Fo78FSmvyOWETtiubl/aqSHftgaw0h45LGbq/hJWqzte+TEEm3rmHl2muwIYO7KjYDA62ERziF1p7ggdbbiFqEfXpfTUsr2ggUl6PndY5zBXyjbNIgoY3M/jmQuLdth0E2JXlLsZUO9VrP0g9QqakC2olb2FsifrNRqedCrVkXFAQ9vVSc3R3EaYWfpWK5ImphJEuOxBtnx7XB2O5lOhzeTWfntvILCk5VrLvWlAzM4sZ5b1mNLD5gtgfJFOWYbTTet3t/sTvnZa15n5W9Tvqr1qXxRO6xz5Sfv+J21L9C+1iv0t/fbW9F+loLHK1T71tloVe1na8hydr1Pa+iqMm+54E4JX5bLZH4TE2UGyv6Spboq902/ZH27akDRAUzL3/vag74Tlb96kUGyCeFAGfHOAIzIzKNWuk5IAVjywYjAcEZ1z2cuEYEz4DiYE17hJrmidgRmDiBgb/F7wOHTPuRSzTnGi6EbA1EXULHtE8SwXMawInhvSBVl8nobGO76j6I6wrAIZlJt9p8LdKF8dgL9DNuPwVYVJGLdwVb0tt8Ztn8gbD8Un7e+kXvG/i9hX+8zZKdrnLFf3boCj9P3AA2PAs+424I7Q9D0bgt39Db/1wTJuXX+/x/Uyf8=</diagram></mxfile>

@ -279,6 +279,7 @@ cc_library(
"platform/platform.h",
"platform/protobuf.h",
"platform/types.h",
"platform/windows/cpu_info.h",
"lib/bfloat16/bfloat16.h",
] + tf_additional_proto_hdrs() + glob(tf_env_time_hdrs()),
copts = tf_copts(),
@ -865,6 +866,7 @@ cc_library(
"//tensorflow/core/kernels:mkl_pooling_ops",
"//tensorflow/core/kernels:mkl_relu_op",
"//tensorflow/core/kernels:mkl_reshape_op",
"//tensorflow/core/kernels:mkl_softmax_op",
"//tensorflow/core/kernels:mkl_tfconv_op",
"//tensorflow/core/kernels:mkl_aggregate_ops",
]),
@ -2831,6 +2833,7 @@ tf_cc_test_mkl(
"//tensorflow/core/kernels:mkl_pooling_ops",
"//tensorflow/core/kernels:mkl_relu_op",
"//tensorflow/core/kernels:mkl_reshape_op",
"//tensorflow/core/kernels:mkl_softmax_op",
"//tensorflow/core/kernels:mkl_tfconv_op",
]),
)

@ -453,6 +453,13 @@ inline bool DataTypeIsInteger(DataType dt) {
return kDataTypeIsInteger.Contains(dt);
}
// Is the dtype a signed integral type?
constexpr DataTypeSet kDataTypeIsSigned =
ToSet(DT_INT8) | ToSet(DT_INT16) | ToSet(DT_INT32) | ToSet(DT_INT64);
inline bool DataTypeIsSigned(DataType dt) {
return kDataTypeIsSigned.Contains(dt);
}
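// For instance, given the set above (sketch):
//   DataTypeIsSigned(DT_INT32);   // true
//   DataTypeIsSigned(DT_UINT16);  // false: unsigned integral
//   DataTypeIsSigned(DT_FLOAT);   // false: not an integral type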
// Is the dtype an unsigned integral type?
constexpr DataTypeSet kDataTypeIsUnsigned =
ToSet(DT_UINT8) | ToSet(DT_UINT16) | ToSet(DT_UINT32) | ToSet(DT_UINT64);

@ -2456,9 +2456,9 @@ class MklLayoutRewritePass : public GraphOptimizationPass {
// NOTE: names are alphabetically sorted.
rinfo_.push_back({csinfo_.addn, mkl_op_registry::GetMklOpName(csinfo_.addn),
CopyAttrsAddN, AddNRewrite});
rinfo_.push_back({csinfo_.add,
/* rinfo_.push_back({csinfo_.add,
mkl_op_registry::GetMklOpName(csinfo_.add),
CopyAttrsDataType, AlwaysRewrite});
CopyAttrsDataType, AlwaysRewrite}); */
rinfo_.push_back({csinfo_.avg_pool,
mkl_op_registry::GetMklOpName(csinfo_.avg_pool),
CopyAttrsPooling, AlwaysRewrite});
@ -3117,7 +3117,9 @@ void MklLayoutRewritePass::GetDummyMklTensorNode(std::unique_ptr<Graph>* g,
Node* orig_input0 = nullptr;
TF_CHECK_OK(orig_node->input_node(0,
const_cast<const Node**>(&orig_input0)));
CHECK_NOTNULL((*g)->AddControlEdge(orig_input0, *out));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(orig_input0, *out, true));
}
(*out)->set_assigned_device_name(orig_node->assigned_device_name());
@ -3382,8 +3384,8 @@ void MklLayoutRewritePass::GetDummyWorkspaceTensorNode(
std::unique_ptr<Graph>* g, Node** out, Node* orig_node) {
// We use a tensor of shape {1} and value 0 to represent
// dummy float tensor. We need this as a dummy workspace tensor.
// Workspace tensor has type float.
const DataType dt = DataTypeToEnum<float>::v();
// Workspace tensor has type uint8.
const DataType dt = DataTypeToEnum<uint8>::v();
TensorProto proto;
proto.set_dtype(dt);
float zero[1] = {0};
@ -3413,7 +3415,9 @@ void MklLayoutRewritePass::GetDummyWorkspaceTensorNode(
Node* orig_input0 = nullptr;
TF_CHECK_OK(orig_node->input_node(0,
const_cast<const Node**>(&orig_input0)));
CHECK_NOTNULL((*g)->AddControlEdge(orig_input0, *out));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(orig_input0, *out, true));
}
(*out)->set_assigned_device_name(orig_node->assigned_device_name());
@ -3863,12 +3867,16 @@ Status MklLayoutRewritePass::MergeConv2DWithBiasAdd(std::unique_ptr<Graph>* g,
// node are already copied in BuildNode. We handle control edges now.
for (const Edge* e : pred->in_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
}
}
for (const Edge* e : succ->in_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
}
}
@ -3876,14 +3884,18 @@ Status MklLayoutRewritePass::MergeConv2DWithBiasAdd(std::unique_ptr<Graph>* g,
// First, we will fix outgoing control edges from 'pred' node.
for (const Edge* e : pred->out_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
}
}
// Second, we will fix outgoing control and data edges from 'succ' node.
for (const Edge* e : succ->out_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
} else {
// BiasAdd has only 1 output (at slot 0) and merged node also has only 1
// output (at slot 0).
@ -3966,12 +3978,16 @@ Status MklLayoutRewritePass::MergeConv2DBackpropFilterWithBiasAddGrad(
// edges now.
for (const Edge* e : badd->in_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
}
}
for (const Edge* e : fltr->in_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
}
}
@ -3987,7 +4003,9 @@ Status MklLayoutRewritePass::MergeConv2DBackpropFilterWithBiasAddGrad(
for (const Edge* e : badd->out_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
} else {
CHECK_NOTNULL((*g)->AddEdge(new_node, kMergedNodeBiasGradOutputIdx,
e->dst(), e->dst_input()));
@ -3997,7 +4015,11 @@ Status MklLayoutRewritePass::MergeConv2DBackpropFilterWithBiasAddGrad(
// Second, we will fix outgoing control and data edges from 'fltr' node.
for (const Edge* e : fltr->out_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
      // We allow a duplicate edge here because a control edge from new_node
      // to this destination may already have been added above (line 3990).
      // Without allowing duplicates, adding the same edge again would fail.
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
} else {
CHECK_NOTNULL((*g)->AddEdge(new_node, kMergedNodeFilterGradOutputIdx,
e->dst(), e->dst_input()));
@ -4091,7 +4113,9 @@ Status MklLayoutRewritePass::RewriteNode(std::unique_ptr<Graph>* g,
// already copied in BuildNode. We need to handle control edges now.
for (const Edge* e : orig_node->in_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
}
}
@ -4104,7 +4128,9 @@ Status MklLayoutRewritePass::RewriteNode(std::unique_ptr<Graph>* g,
// GetTensorDataIndex provides this mapping function.
for (const Edge* e : orig_node->out_edges()) {
if (e->IsControlEdge()) {
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
// Allow duplicate while adding control edge as it would fail (return
// NULL) if we try to add duplicate edge.
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
} else {
CHECK_NOTNULL((*g)->AddEdge(new_node, GetTensorDataIndex(e->src_output(),
e->src()->num_outputs()),

@ -5846,6 +5846,23 @@ tf_mkl_kernel_library(
]),
)
tf_mkl_kernel_library(
name = "mkl_softmax_op",
prefix = "mkl_softmax",
deps = [
":bounds_check",
":ops_util",
"//tensorflow/core:core_cpu",
"//tensorflow/core:framework",
"//tensorflow/core:lib",
"//tensorflow/core:lib_internal",
"//tensorflow/core:nn_ops_op_lib",
] + if_mkl([
"//third_party/mkl:intel_binary_blob",
"@mkl_dnn//:mkl_dnn",
]),
)
tf_mkl_kernel_library(
name = "mkl_fused_batch_norm_op",
srcs = ["mkl_fused_batch_norm_op.cc"],

@ -427,7 +427,7 @@ inline DeviceLapackInfo CudaSolver::GetDeviceLapackInfo(
int64 size, const string& debug_info) {
DeviceLapackInfo new_dev_info(context_, size, debug_info);
scratch_tensor_refs_.emplace_back(new_dev_info.tensor());
return std::move(new_dev_info);
return new_dev_info;
}
} // namespace tensorflow

@ -16,8 +16,9 @@ limitations under the License.
#include "tensorflow/core/kernels/cwise_ops_common.h"
namespace tensorflow {
REGISTER7(BinaryOp, CPU, "Pow", functor::pow, float, Eigen::half, double, int32,
int64, complex64, complex128);
REGISTER5(BinaryOp, CPU, "Pow", functor::pow, float, Eigen::half, double,
complex64, complex128);
REGISTER2(BinaryOp, CPU, "Pow", functor::safe_pow, int32, int64);
#if GOOGLE_CUDA
REGISTER4(BinaryOp, GPU, "Pow", functor::pow, float, Eigen::half, double,
@ -25,5 +26,5 @@ REGISTER4(BinaryOp, GPU, "Pow", functor::pow, float, Eigen::half, double,
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER2(BinaryOp, SYCL, "Pow", functor::pow, float, double);
#endif // TENSORFLOW_USE_SYCL
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow

@ -21,6 +21,7 @@ limitations under the License.
#include <type_traits>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/numeric_types.h"
#include "tensorflow/core/framework/tensor_types.h"
#include "tensorflow/core/kernels/bounds_check.h"
@ -115,6 +116,35 @@ struct functor_traits<scalar_binary_pow_op_google<Scalar, Exponent>> {
enum { Cost = 5 * NumTraits<Scalar>::MulCost, PacketAccess = false };
};
template <typename Scalar, typename Exponent>
struct safe_scalar_binary_pow_op {
static_assert(std::is_integral<Scalar>::value, "Integer type expected");
static_assert(std::is_integral<Exponent>::value &&
std::is_signed<Exponent>::value,
"Signed integer type expected");
bool* const error;
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE safe_scalar_binary_pow_op(bool* error)
: error(error) {}
EIGEN_DEVICE_FUNC inline Scalar operator()(const Scalar& a,
const Exponent& b) const {
const Exponent safe_b = tensorflow::internal::SubtleMustCopy(b);
if (TF_PREDICT_TRUE(safe_b >= 0)) {
return numext::pow(a, safe_b);
} else {
*error = true;
return 0;
}
}
};
template <typename Scalar, typename Exponent>
struct functor_traits<safe_scalar_binary_pow_op<Scalar, Exponent>> {
enum { Cost = 5 * NumTraits<Scalar>::MulCost, PacketAccess = false };
};
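// Behavioral sketch (illustrative; `error` is a local flag here, while the
// Pow kernel wires it to its own error state):
//
//   bool error = false;
//   Eigen::internal::safe_scalar_binary_pow_op<tensorflow::int64,
//                                              tensorflow::int64> op(&error);
//   op(2, 10);  // == 1024, error stays false
//   op(2, -3);  // returns 0 and sets error; BinaryOpShared then reports
//               // "Integers to negative integer powers are not allowed"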
template <typename T, typename DivOrMod>
struct safe_div_or_mod_op {
static_assert(std::is_integral<T>::value, "Integer type expected");
@ -741,6 +771,11 @@ struct floor_div_real : base<T, Eigen::internal::google_floor_div_real<T>> {};
template <typename T>
struct pow : base<T, Eigen::internal::scalar_binary_pow_op_google<T, T>> {};
template <typename T>
struct safe_pow : base<T, Eigen::internal::safe_scalar_binary_pow_op<T, T>> {
static const bool has_errors = true;
};
template <typename T>
struct maximum : base<T, Eigen::internal::scalar_max_op<T>> {};

@ -40,6 +40,11 @@ void BinaryOpShared::SetComputeError(OpKernelContext* ctx) {
if ((op == "Div" || op == "Mod" || op == "FloorMod" || op == "FloorDiv") &&
DataTypeIsInteger(ctx->op_kernel().input_type(0))) {
ctx->CtxFailure(errors::InvalidArgument("Integer division by zero"));
} else if ((op == "Pow") &&
DataTypeIsInteger(ctx->op_kernel().input_type(0)) &&
DataTypeIsSigned(ctx->op_kernel().input_type(1))) {
ctx->CtxFailure(errors::InvalidArgument(
"Integers to negative integer powers are not allowed"));
} else {
ctx->CtxFailure(
errors::Internal("Unexpected error in binary operator "

@ -87,11 +87,10 @@ class DecodeImageOp : public OpKernel {
channels_ = 3;
} else {
OP_REQUIRES_OK(context, context->GetAttr("channels", &channels_));
OP_REQUIRES(
context,
channels_ == 0 || channels_ == 1 || channels_ == 3 || channels_ == 4,
errors::InvalidArgument("channels must be 0, 1, 3, or 4, got ",
channels_));
OP_REQUIRES(context, channels_ == 0 || channels_ == 1 || channels_ == 3 ||
channels_ == 4,
errors::InvalidArgument(
"channels must be 0, 1, 3, or 4, got ", channels_));
}
flags_.components = channels_;
@ -115,9 +114,8 @@ class DecodeImageOp : public OpKernel {
if (format_ == kJpgFormat) {
OP_REQUIRES_OK(context, context->GetAttr("ratio", &flags_.ratio));
OP_REQUIRES(context,
flags_.ratio == 1 || flags_.ratio == 2 || flags_.ratio == 4 ||
flags_.ratio == 8,
OP_REQUIRES(context, flags_.ratio == 1 || flags_.ratio == 2 ||
flags_.ratio == 4 || flags_.ratio == 8,
errors::InvalidArgument("ratio must be 1, 2, 4, or 8, got ",
flags_.ratio));
OP_REQUIRES_OK(context, context->GetAttr("fancy_upscaling",
@ -132,9 +130,8 @@ class DecodeImageOp : public OpKernel {
string dct_method;
OP_REQUIRES_OK(context, context->GetAttr("dct_method", &dct_method));
OP_REQUIRES(
context,
(dct_method.empty() || dct_method == "INTEGER_FAST" ||
dct_method == "INTEGER_ACCURATE"),
context, (dct_method.empty() || dct_method == "INTEGER_FAST" ||
dct_method == "INTEGER_ACCURATE"),
errors::InvalidArgument("dct_method must be one of "
"{'', 'INTEGER_FAST', 'INTEGER_ACCURATE'}"));
if (dct_method == "INTEGER_FAST") {
@ -160,9 +157,9 @@ class DecodeImageOp : public OpKernel {
errors::InvalidArgument("Expected image (JPEG, PNG, or GIF), got ",
FileFormatString(magic, input)));
OP_REQUIRES(context, input.size() <= std::numeric_limits<int>::max(),
errors::InvalidArgument(
FileFormatString(magic, input),
" contents are too large for int: ", input.size()));
errors::InvalidArgument(FileFormatString(magic, input),
" contents are too large for int: ",
input.size()));
OP_REQUIRES(context, magic == kPngFormat || channel_bits_ == 8,
errors::InvalidArgument(FileFormatString(magic, input),
" does not support uint16 output"));
@ -215,10 +212,9 @@ class DecodeImageOp : public OpKernel {
input.data(), input.size(), flags, nullptr /* nwarn */,
[=, &output](int width, int height, int channels) -> uint8* {
Status status(context->allocate_output(
0,
format_ == kGifFormat
? TensorShape({1, height, width, channels})
: TensorShape({height, width, channels}),
0, format_ == kGifFormat
? TensorShape({1, height, width, channels})
: TensorShape({height, width, channels}),
&output));
if (!status.ok()) {
VLOG(1) << status;
@ -294,6 +290,7 @@ class DecodeImageOp : public OpKernel {
// Decode GIF, allocating tensor once the size is known.
Tensor* output = nullptr;
string error_string;
OP_REQUIRES(
context,
gif::Decode(input.data(), input.size(),
@ -320,8 +317,10 @@ class DecodeImageOp : public OpKernel {
return nullptr;
}
return output->flat<uint8>().data();
}),
errors::InvalidArgument("Invalid GIF data, size ", input.size()));
},
&error_string),
errors::InvalidArgument("Invalid GIF data (size ", input.size(), "), ",
error_string));
}
private:

@ -539,6 +539,7 @@ struct MatMulFunctor<SYCLDevice, T> {
REGISTER_KERNEL_BUILDER( \
Name("MatMul").Device(DEVICE_CPU).TypeConstraint<T>("T").Label("eigen"), \
MatMulOp<CPUDevice, T, false /* cublas, ignored for CPU */>);
#define REGISTER_CPU(T) \
REGISTER_KERNEL_BUILDER( \
Name("MatMul").Device(DEVICE_CPU).TypeConstraint<T>("T"), \

@ -61,6 +61,18 @@ class MklAddNOp : public OpKernel {
GetMklShape(ctx, src2_idx, &(mkl_context.input2_shape));
bool input2_in_mkl_format = mkl_context.input2_shape.IsMklTensor();
      // If the shapes of the two input tensors are not the same, raise an
      // op error.
      TensorShape src1_shape, src2_shape;
      src1_shape = input0.shape();
      src2_shape = input1.shape();
      if (!src1_shape.IsSameSize(src2_shape)) {
        ctx->SetStatus(errors::InvalidArgument(
            "Inputs to operation ", this->name(), " of type ",
            this->type_string(),
            " must have the same size and shape. Input 0: ",
            src1_shape.DebugString(), " != input 1: ",
            src2_shape.DebugString()));
      }
// handle the case of a scalar
if (!input1_in_mkl_format && input0.dims() == 0) {
const TensorShape& o_shape = input0.shape();
@ -307,6 +319,18 @@ class MklAddNOp : public OpKernel {
src1_mkl_shape.GetDimension(): src1_tensor.dims();
int src2_dims_size = input2_in_mkl_format?
src2_mkl_shape.GetDimension(): src2_tensor.dims();
      // If the shapes of the two input tensors are not the same, raise an
      // op error.
      TensorShape src1_shape, src2_shape;
      src1_shape = src1_tensor.shape();
      src2_shape = src2_tensor.shape();
      if (!src1_shape.IsSameSize(src2_shape)) {
        ctx->SetStatus(errors::InvalidArgument(
            "Inputs to operation ", this->name(), " of type ",
            this->type_string(),
            " must have the same size and shape. Input 0: ",
            src1_shape.DebugString(), " != input 1: ",
            src2_shape.DebugString()));
      }
if (!input1_in_mkl_format && src1_dims_size == 0) {
Tensor* dst_tensor = nullptr;

@ -598,7 +598,6 @@ class MklConcatOp : public OpKernel {
concat_dim_tensor.shape().DebugString()));
int32 concat_dim = internal::SubtleMustCopy(
concat_dim_tensor.scalar<int32>()());
if (concat_dim < 0) concat_dim = N + concat_dim;
// check that ranks of all tensors match
// and that their shapes match except for concat_dim.
@ -609,6 +608,9 @@ class MklConcatOp : public OpKernel {
input_shapes[0].GetTfShape() :
input_tensors[0].shape();
size_t expected_dims = expected_shape.dims();
if (concat_dim < 0) concat_dim = expected_dims + concat_dim;
for (auto& s : input_shapes) {
if (s == expected_shape) {++i; continue;}

@ -467,6 +467,13 @@ class MklConv2DCustomBackpropFilterOp :
return filter_tf_shape;
}
TensorShape GetOutputTfShape(const TensorShape& input_shape,
const TensorShape& filter_shape,
const TensorShape& outbprop_shape) {
// Shape of output of Conv2DBackpropFilter is same as shape of filter.
return filter_shape;
}
const memory::dims& GetOutputDims(const memory::dims& fwd_input_dims,
const memory::dims& fwd_filter_dims) {
// Shape of output of Conv2DBackpropFilter is same as shape of filter.

@ -396,6 +396,13 @@ class MklConv2DCustomBackpropInputOp :
return GetTfShape(context, kInputIndex_Filter);
}
TensorShape GetOutputTfShape(const TensorShape& input_shape,
const TensorShape& filter_shape,
const TensorShape& outbprop_shape) {
// Output Shape of Conv2DBackpropInput is same as shape of Conv2D 'input'.
return input_shape;
}
const memory::dims& GetOutputDims(const memory::dims& fwd_input_dims,
const memory::dims& fwd_filter_dims) {
// Output Shape of Conv2DBackpropInput is same as shape of Conv2D 'input'.

@ -551,6 +551,13 @@ class MklConv2DOp : public OpKernel {
output_mkl_shape.SetMklTensor(false);
AllocateOutputSetMklShape(context, kOutputIndex_Dst, &output_tensor,
src_tf_shape, output_mkl_shape);
// MklConv2D also outputs converted filter as 2nd output of Conv2D.
filter_mkl_shape.SetMklTensor(false);
Tensor* output_filter_tensor = nullptr;
AllocateOutputSetMklShape(context, kOutputIndex_Filter,
&output_filter_tensor,
filter_tf_shape, filter_mkl_shape);
return;
}

@ -390,6 +390,29 @@ class MklConv2DBackpropCommonOp : public OpKernel {
TensorShape filter_tf_shape = MakeFilterTfShape(context, filter_tensor);
TensorShape outbprop_tf_shape = GetTfShape(context, kOutbpropIdx);
// Corner cases: output with 0 elements and 0 batch size.
Tensor* output_tensor = nullptr;
if (input_tf_shape.num_elements() == 0 ||
filter_tf_shape.num_elements() == 0 ||
outbprop_tf_shape.num_elements() == 0) {
MklDnnShape output_mkl_shape;
output_mkl_shape.SetMklTensor(false);
TensorShape output_tf_shape = GetOutputTfShape(input_tf_shape,
filter_tf_shape,
outbprop_tf_shape);
const int kOutputIdx = 0;
AllocateOutputSetMklShape(context, kOutputIdx, &output_tensor,
output_tf_shape, output_mkl_shape);
CHECK_NOTNULL(output_tensor);
      // If the output tensor has more than 0 elements, zero them out.
for (size_t i = 0; i < output_tf_shape.num_elements(); ++i) {
output_tensor->flat<T>().data()[i] = 0;
}
return;
}
// By default, all dims are in MKL order. Only dims in TF order
// are those with prefix tf_order.
memory::dims outbprop_dims, fwd_input_dims, fwd_filter_dims;
@ -471,7 +494,6 @@ class MklConv2DBackpropCommonOp : public OpKernel {
output.SetOpMemDesc(bwd_output_dims, memory::format::any);
// Operator-specific call to create and execute primitive.
Tensor* output_tensor = nullptr;
CreatePrimitive(context, cpu_engine, fwd_pd, &input, &filter,
&outbackprop, &output, &output_tensor,
strides, padding_l, padding_r,
@ -507,6 +529,11 @@ class MklConv2DBackpropCommonOp : public OpKernel {
virtual TensorShape MakeFilterTfShape(OpKernelContext* context,
const Tensor& filter_tensor) = 0;
/// Get the TensorFlow shape of output tensor.
virtual TensorShape GetOutputTfShape(const TensorShape& input_shape,
const TensorShape& filter_shape,
const TensorShape& outbprop_shape) = 0;
/// Get shape of output in MKL-DNN order. Computes shape of output from
/// input shape (fwd_input_dims) and filter shape (fwd_filter_dims).
virtual

@ -703,27 +703,31 @@ class MklFusedBatchNormOp : public OpKernel {
void Compute(OpKernelContext* context) override {
try {
auto cpu_engine = engine(engine::cpu, 0);
const size_t src_index = 0; // index of src input tensor
const size_t scale_index = 1; // index of scale tensor
const size_t shift_index = 2; // index of shift tensor
const size_t mean_index = 3; // index of est_mean tensor
const size_t var_index = 4; // index of est_variance tensor
const size_t kSrcIndex = 0; // index of src input tensor
const size_t kScaleIndex = 1; // index of scale tensor
const size_t kShiftIndex = 2; // index of shift tensor
const size_t kMeanIndex = 3; // index of est_mean tensor
const size_t kVarianceIndex = 4; // index of est_variance tensor
const Tensor& src_tensor = MklGetInput(context, src_index);
const Tensor& scale_tensor = MklGetInput(context, scale_index);
const Tensor& shift_tensor = MklGetInput(context, shift_index);
const Tensor& est_mean_tensor = MklGetInput(context, mean_index);
const Tensor& est_variance_tensor = MklGetInput(context, var_index);
const Tensor& src_tensor = MklGetInput(context, kSrcIndex);
const Tensor& scale_tensor = MklGetInput(context, kScaleIndex);
const Tensor& shift_tensor = MklGetInput(context, kShiftIndex);
const Tensor& est_mean_tensor = MklGetInput(context, kMeanIndex);
const Tensor& est_variance_tensor = MklGetInput(context,
kVarianceIndex);
TensorShape tf_shape_src;
MklDnnShape dnn_shape_src;
GetMklShape(context, src_index, &dnn_shape_src);
GetMklShape(context, kSrcIndex, &dnn_shape_src);
if (dnn_shape_src.IsMklTensor()) {
tf_shape_src = dnn_shape_src.GetTfShape();
OP_REQUIRES(context, dnn_shape_src.GetDimension() == 4,
errors::InvalidArgument(
"input must be 4-dimensional",
src_tensor.shape().DebugString()));
} else {
tf_shape_src = src_tensor.shape();
OP_REQUIRES(context, src_tensor.dims() == 4,
errors::InvalidArgument(
"input must be 4-dimensional",
@ -756,39 +760,35 @@ class MklFusedBatchNormOp : public OpKernel {
est_variance_tensor.shape().DebugString()));
}
      // Special case: input with 0 elements and 0 batch size.
Tensor* dst_tensor = nullptr;
if (tf_shape_src.num_elements() == 0) {
HandleEmptyInput(context,
tf_shape_src,
scale_tensor.shape(),
&dst_tensor);
return;
}
if (dnn_shape_src.IsMklTensor())
depth_ = dnn_shape_src.DimSize(MklDnnDims::Dim_C);
else
ExtractParams(context);
// Indices of output tensors
const size_t dst_index = 0;
const size_t batch_mean_index = 1;
const size_t batch_variance_index = 2;
const size_t saved_mean_index = 3;
const size_t saved_variance_index = 4;
const size_t kDstIndex = 0;
// allocate batch mean output tensor
// allocate 4 output TF tensors
Tensor* batch_mean_tensor = nullptr;
MklDnnShape mkl_shape_batch_mean;
mkl_shape_batch_mean.SetMklTensor(false);
AllocateOutputSetMklShape(context,
batch_mean_index,
&batch_mean_tensor,
scale_tensor.shape(),
mkl_shape_batch_mean);
CHECK_NOTNULL(batch_mean_tensor);
// Batch variance
Tensor* batch_variance_tensor = nullptr;
MklDnnShape mkl_shape_batch_variance;
mkl_shape_batch_variance.SetMklTensor(false);
AllocateOutputSetMklShape(context,
batch_variance_index,
&batch_variance_tensor,
scale_tensor.shape(),
mkl_shape_batch_variance);
CHECK_NOTNULL(batch_variance_tensor);
Tensor* saved_mean_tensor = nullptr;
Tensor* saved_variance_tensor = nullptr;
AllocateTFOutputs(context,
scale_tensor.shape(),
&batch_mean_tensor,
&batch_variance_tensor,
&saved_mean_tensor,
&saved_variance_tensor);
if (is_training_)
SetMeanVariance(*batch_mean_tensor, *batch_variance_tensor);
@ -844,26 +844,6 @@ class MklFusedBatchNormOp : public OpKernel {
weights_data[k + depth_] = shift_tf[k];
}
// Mean and variance (without Bessel's correction) saved for backward
// computation to serve as pre-computed mean and variance.
Tensor* saved_mean_tensor = nullptr;
MklDnnShape mkl_shape_saved_mean;
mkl_shape_saved_mean.SetMklTensor(false);
AllocateOutputSetMklShape(context, saved_mean_index,
&saved_mean_tensor,
scale_tensor.shape(),
mkl_shape_saved_mean);
CHECK_NOTNULL(saved_mean_tensor);
Tensor* saved_variance_tensor = nullptr;
MklDnnShape mkl_shape_saved_variance;
mkl_shape_saved_variance.SetMklTensor(false);
AllocateOutputSetMklShape(context, saved_variance_index,
&saved_variance_tensor,
scale_tensor.shape(),
mkl_shape_saved_variance);
CHECK_NOTNULL(saved_variance_tensor);
// set mean primitive
auto mean_desc = memory::desc({1, depth_},
MklDnnType<T>(),
@ -902,7 +882,6 @@ class MklFusedBatchNormOp : public OpKernel {
// allocate dst tensor
MklDnnShape dnn_shape_dst;
TensorShape tf_shape_dst;
Tensor* dst_tensor = nullptr;
if (dnn_shape_src.IsMklTensor()) {
dnn_shape_dst.SetMklTensor(true);
auto dst_pd = bnrm_fwd_pd.dst_primitive_desc();
@ -915,7 +894,7 @@ class MklFusedBatchNormOp : public OpKernel {
dnn_shape_dst.SetMklTensor(false);
tf_shape_dst = src_tensor.shape();
}
AllocateOutputSetMklShape(context, dst_index, &dst_tensor,
AllocateOutputSetMklShape(context, kDstIndex, &dst_tensor,
tf_shape_dst, dnn_shape_dst);
// Output of batchnorm has same shape as input.
@ -958,10 +937,8 @@ class MklFusedBatchNormOp : public OpKernel {
size_t adjust_size = orig_size - 1;
adjust_factor = (static_cast<float>(orig_size)) / adjust_size;
}
T* batch_variance_data_tf = reinterpret_cast<T*>(
batch_variance_tensor->flat<T>().data());
for (int k=0; k < depth_; k++)
batch_variance_data_tf[k] =
batch_variance_tensor->flat<T>().data()[k] =
(reinterpret_cast<T*>(variance_m.get_data_handle()))[k]
* adjust_factor;
} catch (mkldnn::error &e) {
@ -994,8 +971,100 @@ class MklFusedBatchNormOp : public OpKernel {
variance_values_ = reinterpret_cast<T*>(
const_cast<T*>(variance.flat<T>().data()));
}
};
void HandleEmptyInput(OpKernelContext* context,
TensorShape tf_shape_src,
TensorShape tf_shape_scale,
Tensor** dst_tensor) {
CHECK_NOTNULL(dst_tensor);
const size_t kDstIndex = 0;
MklDnnShape dnn_shape_dst;
dnn_shape_dst.SetMklTensor(false);
AllocateOutputSetMklShape(context, kDstIndex, dst_tensor,
tf_shape_src, dnn_shape_dst);
CHECK_NOTNULL(*dst_tensor);
memset(const_cast<char*>((*dst_tensor)->tensor_data().data()), 0,
(*dst_tensor)->tensor_data().size());
Tensor* batch_mean_tensor = nullptr;
Tensor* batch_variance_tensor = nullptr;
Tensor* saved_mean_tensor = nullptr;
Tensor* saved_variance_tensor = nullptr;
AllocateTFOutputs(context, tf_shape_scale,
&batch_mean_tensor,
&batch_variance_tensor,
&saved_mean_tensor,
&saved_variance_tensor);
}
void AllocateTFOutputs(OpKernelContext* context,
TensorShape tf_shape_scale,
Tensor** batch_mean_tensor,
Tensor** batch_variance_tensor,
Tensor** saved_mean_tensor,
Tensor** saved_variance_tensor) {
CHECK_NOTNULL(batch_mean_tensor);
CHECK_NOTNULL(batch_variance_tensor);
CHECK_NOTNULL(saved_mean_tensor);
CHECK_NOTNULL(saved_variance_tensor);
const size_t kBatchMeanIndex = 1;
const size_t kBatchVarianceIndex = 2;
const size_t kSavedMeanIndex = 3;
const size_t kSavedVarianceIndex = 4;
// allocate batch mean output tensor
MklDnnShape mkl_shape_batch_mean;
mkl_shape_batch_mean.SetMklTensor(false);
AllocateOutputSetMklShape(context,
kBatchMeanIndex,
batch_mean_tensor,
tf_shape_scale,
mkl_shape_batch_mean);
CHECK_NOTNULL(*batch_mean_tensor);
// set NAN mean value in case of empty input tensor
for (int k=0; k < tf_shape_scale.num_elements(); k++)
(*batch_mean_tensor)->flat<T>().data()[k] = NAN;
// allocate batch variance output tensor
MklDnnShape mkl_shape_batch_variance;
mkl_shape_batch_variance.SetMklTensor(false);
AllocateOutputSetMklShape(context,
kBatchVarianceIndex,
batch_variance_tensor,
tf_shape_scale,
mkl_shape_batch_variance);
CHECK_NOTNULL(*batch_variance_tensor);
// set NAN variance value in case of empty input tensor
for (int k=0; k < tf_shape_scale.num_elements(); k++)
(*batch_variance_tensor)->flat<T>().data()[k] = NAN;
// Mean and variance (without Bessel's correction) saved for backward
// computation to serve as pre-computed mean and variance.
MklDnnShape mkl_shape_saved_mean;
mkl_shape_saved_mean.SetMklTensor(false);
AllocateOutputSetMklShape(context, kSavedMeanIndex,
saved_mean_tensor,
tf_shape_scale,
mkl_shape_saved_mean);
CHECK_NOTNULL(*saved_mean_tensor);
// set NAN mean value in case of empty input tensor
for (int k=0; k < tf_shape_scale.num_elements(); k++)
(*saved_mean_tensor)->flat<T>().data()[k] = NAN;
MklDnnShape mkl_shape_saved_variance;
mkl_shape_saved_variance.SetMklTensor(false);
AllocateOutputSetMklShape(context, kSavedVarianceIndex,
saved_variance_tensor,
tf_shape_scale,
mkl_shape_saved_variance);
CHECK_NOTNULL(*saved_variance_tensor);
// set NAN variance value in case of empty input tensor
for (int k=0; k < tf_shape_scale.num_elements(); k++)
(*saved_variance_tensor)->flat<T>().data()[k] = NAN;
}
};
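As a side note on the adjust_factor logic earlier in Compute(): the MKL-DNN primitive produces the population (divide-by-n) variance, while the op's batch_variance output is expected to carry the sample (divide-by-(n-1)) variance, hence the n/(n-1) scaling. A minimal numeric sketch of that relationship in plain Python (values are illustrative):

import statistics

xs = [1.0, 2.0, 3.0, 4.0]            # stand-in for the values feeding one channel's statistics
pop_var = statistics.pvariance(xs)   # 1.25: what the MKL-DNN primitive reports
sample_var = statistics.variance(xs) # ~1.667: what the batch_variance output should report
# adjust_factor = n / (n - 1), exactly the scaling applied above
assert abs(sample_var - pop_var * len(xs) / (len(xs) - 1)) < 1e-12
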
template <typename Device, typename T>
class MklFusedBatchNormGradOp : public OpKernel {
@ -1009,34 +1078,37 @@ class MklFusedBatchNormGradOp : public OpKernel {
OP_REQUIRES_OK(context, context->GetAttr("data_format", &tensor_format));
OP_REQUIRES(context, FormatFromString(tensor_format, &tensor_format_),
errors::InvalidArgument("Invalid data format"));
OP_REQUIRES_OK(context, context->GetAttr("is_training", &is_training_));
}
void Compute(OpKernelContext* context) override {
try {
auto cpu_engine = engine(engine::cpu, 0);
const size_t diff_dst_index = 0; // index of diff_dst tensor
const size_t src_index = 1; // index of src input tensor
const size_t scale_index = 2; // index of scale tensor
const size_t mean_index = 3; // index of saved_mean tensor
const size_t variance_index = 4; // index of saved_variance tensor
const Tensor& diff_dst_tensor = MklGetInput(context, diff_dst_index);
const Tensor& src_tensor = MklGetInput(context, src_index);
const Tensor& scale_tensor = MklGetInput(context, scale_index);
const Tensor& saved_mean_tensor = MklGetInput(context, mean_index);
const size_t kDiffDstIndex = 0; // index of diff_dst tensor
const size_t kSrcIndex = 1; // index of src input tensor
const size_t kScaleIndex = 2; // index of scale tensor
const size_t kMeanIndex = 3; // index of saved_mean tensor
const size_t kVarianceIndex = 4; // index of saved_variance tensor
const Tensor& diff_dst_tensor = MklGetInput(context, kDiffDstIndex);
const Tensor& src_tensor = MklGetInput(context, kSrcIndex);
const Tensor& scale_tensor = MklGetInput(context, kScaleIndex);
const Tensor& saved_mean_tensor = MklGetInput(context, kMeanIndex);
const Tensor& saved_variance_tensor = MklGetInput(context,
variance_index);
kVarianceIndex);
MklDnnShape dnn_shape_src, dnn_shape_diff_dst;
GetMklShape(context, src_index, &dnn_shape_src);
GetMklShape(context, diff_dst_index, &dnn_shape_diff_dst);
GetMklShape(context, kSrcIndex, &dnn_shape_src);
GetMklShape(context, kDiffDstIndex, &dnn_shape_diff_dst);
TensorShape tf_shape_src, tf_shape_diff_dst;
if (dnn_shape_diff_dst.IsMklTensor()) {
tf_shape_diff_dst = dnn_shape_diff_dst.GetTfShape();
OP_REQUIRES(context, dnn_shape_diff_dst.GetDimension() == 4,
errors::InvalidArgument(
"input must be 4-dimensional",
diff_dst_tensor.shape().DebugString()));
} else {
tf_shape_diff_dst = diff_dst_tensor.shape();
OP_REQUIRES(context, diff_dst_tensor.dims() == 4,
errors::InvalidArgument(
"input must be 4-dimensional",
@ -1044,11 +1116,13 @@ class MklFusedBatchNormGradOp : public OpKernel {
}
if (dnn_shape_src.IsMklTensor()) {
tf_shape_src = dnn_shape_src.GetTfShape();
OP_REQUIRES(context, dnn_shape_src.GetDimension() == 4,
errors::InvalidArgument(
"input must be 4-dimensional",
src_tensor.shape().DebugString()));
} else {
tf_shape_src = src_tensor.shape();
OP_REQUIRES(context, src_tensor.dims() == 4,
errors::InvalidArgument(
"input must be 4-dimensional",
@ -1069,6 +1143,15 @@ class MklFusedBatchNormGradOp : public OpKernel {
"saved variance must be 1-dimensional",
saved_variance_tensor.shape().DebugString()));
Tensor* diff_src_tensor = nullptr;
if (tf_shape_src.num_elements() == 0 ||
tf_shape_diff_dst.num_elements() == 0) {
HandleEmptyInput(context, tf_shape_src,
scale_tensor.shape(),
&diff_src_tensor);
return;
}
if (dnn_shape_src.IsMklTensor())
depth_ = dnn_shape_src.DimSize(MklDnnDims::Dim_C);
else
@ -1165,25 +1248,21 @@ class MklFusedBatchNormGradOp : public OpKernel {
auto diff_weights_m = memory(diff_weights_pd);
auto bnrm_fwd_desc = batch_normalization_forward::desc(
prop_kind::forward_training,
src.GetUsrMemDesc(),
epsilon_,
use_scale_shift);
prop_kind::forward_training,
src.GetUsrMemDesc(),
epsilon_,
is_training_ ? use_scale_shift :
(use_scale_shift | use_global_stats));
auto bnrm_fwd_pd = batch_normalization_forward::primitive_desc(
bnrm_fwd_desc,
cpu_engine);
// Indices of output tensors
const size_t diff_src_index = 0; // index of diff_src tensor
const size_t diff_scale_index = 1; // index of diff_scale tensor
const size_t diff_shift_index = 2; // index of diff_shift tensor
const size_t p1_index = 3; // index of 1st placeholder tensor
const size_t p2_index = 4; // index of 2nd placeholder tensor
const size_t kDiffSrcIndex = 0; // index of diff_src tensor
// allocate diff_src tensor
MklDnnShape dnn_shape_diff_src;
TensorShape tf_shape_diff_src;
Tensor* diff_src_tensor = nullptr;
if (dnn_shape_src.IsMklTensor()) {
dnn_shape_diff_src.SetMklTensor(true);
auto diff_src_pd = bnrm_fwd_pd.dst_primitive_desc();
@ -1201,7 +1280,7 @@ class MklFusedBatchNormGradOp : public OpKernel {
dnn_shape_diff_src.SetMklTensor(false);
tf_shape_diff_src = src_tensor.shape();
}
AllocateOutputSetMklShape(context, diff_src_index, &diff_src_tensor,
AllocateOutputSetMklShape(context, kDiffSrcIndex, &diff_src_tensor,
tf_shape_diff_src, dnn_shape_diff_src);
diff_src.SetUsrMem(src_md, diff_src_tensor);
@ -1212,7 +1291,15 @@ class MklFusedBatchNormGradOp : public OpKernel {
diff_src.GetUsrMemDesc(),
src.GetUsrMemDesc(),
epsilon_,
use_scale_shift);
/* for inference, specify use_global_stats
1. on fwd prop, use mean and variance
provided as inputs
2. on bwd prop, mean and variance are
considered as constants. Thus,
reduce the amount of MKL computations
*/
is_training_ ? use_scale_shift :
(use_scale_shift | use_global_stats));
auto bnrm_bwd_pd = batch_normalization_backward::primitive_desc(
bnrm_bwd_desc,
cpu_engine,
@ -1232,41 +1319,22 @@ class MklFusedBatchNormGradOp : public OpKernel {
net.push_back(bnrm_bwd_op);
stream(stream::kind::eager).submit(net).wait();
// separate out scale and shift grad and copy to individual tensors
const TensorShape& tf_shape_scale_shift = scale_tensor.shape();
// allocate 4 output TF tensors
Tensor* diff_scale_tensor = nullptr;
MklDnnShape mkl_shape_diff_scale;
mkl_shape_diff_scale.SetMklTensor(false);
AllocateOutputSetMklShape(context, diff_scale_index, &diff_scale_tensor,
tf_shape_scale_shift, mkl_shape_diff_scale);
Tensor* diff_shift_tensor = nullptr;
MklDnnShape mkl_shape_diff_shift;
mkl_shape_diff_shift.SetMklTensor(false);
AllocateOutputSetMklShape(context, diff_shift_index, &diff_shift_tensor,
tf_shape_scale_shift, mkl_shape_diff_shift);
AllocateTFOutputs(context, scale_tensor.shape(),
&diff_scale_tensor,
&diff_shift_tensor);
// copy data: diff_scale and diff_shift
T* diff_weights_data_dnn = reinterpret_cast<T*>
(diff_weights_m.get_data_handle());
float* diff_scale_data_tf = const_cast<float*>(
static_cast<const float*>(diff_scale_tensor->flat<T>().data()));
float* diff_shift_data_tf = const_cast<float*>(
static_cast<const float*>(diff_shift_tensor->flat<T>().data()));
for (int i = 0; i < depth_; i++) {
diff_scale_data_tf[i] = diff_weights_data_dnn[i];
diff_shift_data_tf[i] = diff_weights_data_dnn[i + depth_];
diff_scale_tensor->flat<T>().data()[i] =
diff_weights_data_dnn[i];
diff_shift_tensor->flat<T>().data()[i] =
diff_weights_data_dnn[i + depth_];
}
// Placeholders for estimated_mean and estimated_variance, which are
// used for inference and thus not needed here for gradient computation.
Tensor* p1_tensor = nullptr, *p2_tensor = nullptr;
MklDnnShape mkl_shape_p;
mkl_shape_p.SetMklTensor(false);
AllocateOutputSetMklShape(context, p1_index, &p1_tensor,
TensorShape({}), mkl_shape_p);
AllocateOutputSetMklShape(context, p2_index, &p2_tensor,
TensorShape({}), mkl_shape_p);
} catch (mkldnn::error &e) {
string error_msg = "Status: " + std::to_string(e.status) +
", message: " + string(e.message) +
@ -1282,12 +1350,74 @@ class MklFusedBatchNormGradOp : public OpKernel {
T epsilon_;
TensorFormat tensor_format_;
int depth_; // batch normalization is done per channel.
bool is_training_;
void ExtractParams(OpKernelContext* context) {
const Tensor& input = MklGetInput(context, 0);
depth_ = static_cast<int>(GetTensorDim(input, tensor_format_, 'C'));
}
void HandleEmptyInput(OpKernelContext* context,
TensorShape tf_shape_src,
TensorShape tf_shape_scale_shift,
Tensor** diff_src_tensor) {
const size_t kDiffSrcIndex = 0;
MklDnnShape dnn_shape_diff_src;
dnn_shape_diff_src.SetMklTensor(false);
AllocateOutputSetMklShape(context, kDiffSrcIndex, diff_src_tensor,
tf_shape_src, dnn_shape_diff_src);
for (size_t i=0; i < (*diff_src_tensor)->shape().num_elements(); i++)
(*diff_src_tensor)->flat<T>().data()[i] = 0;
Tensor* diff_scale_tensor = nullptr;
Tensor* diff_shift_tensor = nullptr;
AllocateTFOutputs(context,
tf_shape_scale_shift,
&diff_scale_tensor,
&diff_shift_tensor);
}
void AllocateTFOutputs(OpKernelContext* context,
TensorShape tf_shape_scale_shift,
Tensor** diff_scale_tensor,
Tensor** diff_shift_tensor) {
CHECK_NOTNULL(diff_scale_tensor);
CHECK_NOTNULL(diff_shift_tensor);
const size_t kDiffScaleIndex = 1;
const size_t kDiffShiftIndex = 2;
const size_t kP1Index = 3;
const size_t kP2Index = 4;
// separate out scale and shift grad and copy to individual tensors
MklDnnShape mkl_shape_diff_scale;
mkl_shape_diff_scale.SetMklTensor(false);
AllocateOutputSetMklShape(context, kDiffScaleIndex, diff_scale_tensor,
tf_shape_scale_shift, mkl_shape_diff_scale);
CHECK_NOTNULL(*diff_scale_tensor);
for (size_t i=0; i < (*diff_scale_tensor)->shape().num_elements(); i++)
(*diff_scale_tensor)->flat<T>().data()[i] = 0;
MklDnnShape mkl_shape_diff_shift;
mkl_shape_diff_shift.SetMklTensor(false);
AllocateOutputSetMklShape(context, kDiffShiftIndex, diff_shift_tensor,
tf_shape_scale_shift, mkl_shape_diff_shift);
CHECK_NOTNULL(*diff_shift_tensor);
for (size_t i=0; i < (*diff_shift_tensor)->shape().num_elements(); i++)
(*diff_shift_tensor)->flat<T>().data()[i] = 0;
// Placeholders for estimated_mean and estimated_variance, which are
// used for inference and thus not needed here for gradient computation.
Tensor* p1_tensor = nullptr, *p2_tensor = nullptr;
MklDnnShape mkl_shape_p;
mkl_shape_p.SetMklTensor(false);
AllocateOutputSetMklShape(context, kP1Index, &p1_tensor,
TensorShape({}), mkl_shape_p);
AllocateOutputSetMklShape(context, kP2Index, &p2_tensor,
TensorShape({}), mkl_shape_p);
}
memory::dims GetMeanVarianceDims() {
return memory::dims({1, depth_});
}
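The is_training_ flag threaded through both descriptors above selects use_global_stats for inference, so the supplied mean and variance are treated as constants on both passes. A hedged Python-level sketch of the inference call that feeds such precomputed statistics (tensor names and shapes are illustrative):

import tensorflow as tf

x = tf.random_uniform([8, 4, 4, 3])   # NHWC input
scale = tf.ones([3])
offset = tf.zeros([3])
moving_mean = tf.zeros([3])
moving_variance = tf.ones([3])

# In inference mode the op normalizes with the provided statistics instead of
# computing batch statistics, which is what use_global_stats expresses above.
y, _, _ = tf.nn.fused_batch_norm(
    x, scale, offset, mean=moving_mean, variance=moving_variance,
    is_training=False)
print(y.shape)   # (8, 4, 4, 3)
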

View File

@ -396,7 +396,7 @@ class MklInputConversionOp : public OpKernel {
auto cpu_engine = engine(engine::cpu, 0);
MklDnnData<T> tf_input(&cpu_engine);
auto input_tf_md = mkl_output_mkl_shape.GetTfLayout();
tf_input.SetUsrMem(input_tf_md, &tf_tensor);
tf_input.SetUsrMem(input_tf_md, tf_tensor);
// Create reorder between tensorflow layout and Mkl layout.
std::vector<primitive> net;

View File

@ -43,7 +43,7 @@ limitations under the License.
using mkldnn::lrn_forward;
using mkldnn::lrn_backward;
using mkldnn::prop_kind;
using mkldnn::algorithm::lrn_across_channels;
using mkldnn::lrn_across_channels;
using mkldnn::stream;
#endif
@ -910,17 +910,23 @@ class MklLRNOp : public OpKernel {
Eigen::Tensor<T, 2, Eigen::RowMajor> multiplier(depth, depth);
GetBandMatrix<T>(depth, depth_radius_, &multiplier);
Tensor *output_dnn_data, *workspace;
MklDnnShape mkl_output_mkl_shape, mkl_workspace_mkl_shape;
Tensor *output_dnn_data = nullptr;
MklDnnShape mkl_output_mkl_shape;
mkl_output_mkl_shape.SetMklTensor(false);
mkl_output_mkl_shape.SetDimensions(4);
AllocateOutputSetMklShape(context, kIdxOutput, &output_dnn_data,
input.shape(), mkl_output_mkl_shape);
CHECK_NOTNULL(output_dnn_data);
mkl_workspace_mkl_shape.SetMklTensor(false);
mkl_workspace_mkl_shape.SetDimensions(4);
AllocateOutputSetMklShape(context, kIdxWorkspace, &workspace,
input.shape(), mkl_workspace_mkl_shape);
Tensor* workspace_tensor = nullptr;
MklDnnShape workspace_mkl_shape;
workspace_mkl_shape.SetMklTensor(false);
TensorShape workspace_tf_shape;
workspace_tf_shape.AddDim(0);
AllocateOutputSetMklShape(context, kIdxWorkspace,
&workspace_tensor,
workspace_tf_shape, workspace_mkl_shape);
CHECK_NOTNULL(workspace_tensor);
auto out_shaped = output_dnn_data->shaped<T, 2>({nodes * batch, depth});
Eigen::array<DimPair, 1> dims = {{DimPair(1, 0)}};
@ -1344,12 +1350,14 @@ class MklLRNGradOp : public OpKernel {
errors::InvalidArgument("Output image must be 4-dimensional"));
}
if (workspace_dnn_shape.IsMklTensor()) {
OP_REQUIRES(context, workspace_dnn_shape.IsMklTensor() == false,
errors::InvalidArgument("Workspace should not be MKL Tensor."));
} else {
OP_REQUIRES(context, workspace_tensor.dims() == 1,
errors::InvalidArgument("Workspace must be 1-dimensional"));
if (workspace_enabled_) {
if (workspace_dnn_shape.IsMklTensor()) {
OP_REQUIRES(context, workspace_dnn_shape.IsMklTensor() == false,
errors::InvalidArgument("Workspace should not be MKL Tensor."));
} else {
OP_REQUIRES(context, workspace_tensor.dims() == 1,
errors::InvalidArgument("Workspace must be 1-dimensional"));
}
}
}

View File

@ -517,7 +517,7 @@ class MklMaxPoolingOp : public MklPoolingForwardOpBase<T> {
MklDnnData<T> dnn_data_input(&cpu_engine);
MklDnnData<T> dnn_data_output(&cpu_engine);
MklDnnData<T> dnn_data_wksp(&cpu_engine);
MklDnnData<uint8> dnn_data_wksp(&cpu_engine);
// initialize variables for the pooling op
MklPoolParameters pool_params;
@ -588,16 +588,16 @@ class MklMaxPoolingOp : public MklPoolingForwardOpBase<T> {
void AllocateWorkspaceTensor(OpKernelContext* context,
const pooling_forward::primitive_desc& pool_fwd_prim_desc,
MklDnnData<T>* dnn_data_wksp) {
MklDnnData<uint8>* dnn_data_wksp) {
CHECK_NOTNULL(dnn_data_wksp);
Tensor* workspace_tensor = nullptr;
memory::primitive_desc workspace_pd
= pool_fwd_prim_desc.workspace_primitive_desc();
size_t workspace_t_elems = this->GetNumTElements(workspace_pd);
size_t workspace_bytes = workspace_pd.get_size();
MklDnnShape workspace_mkl_shape;
workspace_mkl_shape.SetMklTensor(false);
TensorShape workspace_tf_shape;
workspace_tf_shape.AddDim(workspace_t_elems);
workspace_tf_shape.AddDim(workspace_bytes);
AllocateOutputSetMklShape(context, kOutputTensorIndexWorkspace,
&workspace_tensor,
workspace_tf_shape, workspace_mkl_shape);
@ -651,7 +651,7 @@ class MklMaxPoolingGradOp : public MklPoolingBackwardOpBase<T> {
if (!context->status().ok()) return;
MklDnnData<T> grad_dnn_data(&cpu_engine);
MklDnnData<T> workspace_dnn_data(&cpu_engine);
MklDnnData<uint8> workspace_dnn_data(&cpu_engine);
MklDnnData<T> output_dnn_data(&cpu_engine);
Tensor* output_tensor = nullptr;
MklPoolParameters pool_params;
@ -770,7 +770,7 @@ class MklMaxPoolingGradOp : public MklPoolingBackwardOpBase<T> {
void ConfigureWorkspace(const Tensor& workspace_tensor,
memory::primitive_desc workspace_pd,
MklDnnData<T> *workspace_dnn_data) {
MklDnnData<uint8> *workspace_dnn_data) {
CHECK_NOTNULL(workspace_dnn_data);
workspace_dnn_data->SetUsrMem(workspace_pd, &workspace_tensor);
@ -811,7 +811,7 @@ class MklMaxPoolingGradOp : public MklPoolingBackwardOpBase<T> {
errors::InvalidArgument("Gradient must be "
"4-dimensional"));
}
if (this->workspace_enabled_){
if (this->workspace_enabled_) {
// The workspace should not be an MKL tensor
OP_REQUIRES(context, workspace_mkl_shape.IsMklTensor() == false,
errors::InvalidArgument("Workspace tensor should not"

View File

@ -231,7 +231,7 @@ class MklPoolingForwardOpBase : public MklPoolingOpBase<T> {
const pooling_forward::primitive_desc& pool_fwd_desc,
const MklDnnData<T>* src,
MklDnnData<T>* dst,
MklDnnData<T>* wksp = nullptr) {
MklDnnData<uint8>* wksp = nullptr) {
std::vector<primitive> net;
// Create pooling primitive and add it to net
@ -307,7 +307,7 @@ class MklPoolingBackwardOpBase : public MklPoolingOpBase<T> {
MklDnnData<T>* input_gradient_diff_dst,
MklDnnData<T>* output_diff_src,
const memory::primitive_desc& target_diff_dst_pd,
const MklDnnData<T>* workspace = nullptr) {
const MklDnnData<uint8>* workspace = nullptr) {
std::vector<primitive> net;

View File

@ -256,11 +256,18 @@ class MklReshapeOp : public OpKernel {
AllocateOutputSetMklShape(context, kOutputSlotIdx, &output_tensor,
shape_to, mkl_shape_output);
// Insert reorder between Mkl layout and TensorFlow layout.
// Insert reorder between Mkl layout and TensorFlow layout if
// needed. If reorder is not needed but reshape is needed (since
// shape_from != shape_to), then we just copy input tensor to
// output tensor with the target shape (we cannot forward the Mkl
// layout in that case because the shape has changed).
std::vector<primitive> net;
CHECK_EQ(dnn_data_input.CheckReorderToOpMem(output_tf_pd,
output_tensor, &net), true);
stream(stream::kind::eager).submit(net).wait();
if (dnn_data_input.CheckReorderToOpMem(output_tf_pd,
output_tensor, &net)) {
stream(stream::kind::eager).submit(net).wait();
} else {
output_tensor->CopyFrom(input_tensor, shape_to);
}
return;
} else {
// If dimensions that are being expanded or collapsed are

View File

@ -0,0 +1,163 @@
/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
// See docs in ../ops/nn_ops.cc.
#ifdef INTEL_MKL
#ifdef INTEL_MKL_DNN
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/util/tensor_format.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "mkldnn.h"
#include "mkldnn_types.h"
#include "tensorflow/core/platform/default/logging.h"
#include "tensorflow/core/util/mkl_util.h"
#include "mkldnn.hpp"
using mkldnn::stream;
using mkldnn::prop_kind;
using mkldnn::softmax_forward;
namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
template <typename Device, typename T>
class MklSoftmaxOp : public OpKernel {
public:
~MklSoftmaxOp() {}
explicit MklSoftmaxOp(OpKernelConstruction* context) : OpKernel(context) {}
void Compute(OpKernelContext* context) override {
try {
auto cpu_engine = engine(engine::cpu, 0);
// src_tensor points to the 0-th input held by the OpKernelContext "context"
size_t src_idx = 0;
const Tensor& src_tensor = MklGetInput(context, src_idx);
// Get the MklShape of the input.
MklDnnShape src_mkl_shape;
GetMklShape(context, src_idx, &src_mkl_shape);
// src_dims is the dimension of src_tensor
// dst dims will be the same as src_dims
auto src_tf_shape = src_mkl_shape.IsMklTensor() ?
src_mkl_shape.GetTfShape() : src_tensor.shape();
auto src_dims = TFShapeToMklDnnDims(src_tf_shape);
auto output_dims = src_dims;
// Create softmax memory for src and dst; both are MklDnnData wrappers
// defined in mkl_util.h.
MklDnnData<T> src(&cpu_engine);
MklDnnData<T> dst(&cpu_engine);
// If input is in MKL layout, then simply grab input layout; otherwise,
// construct the input TF layout. For the TF layout, although the
// required input shape (src_dims) is in MKL-DNN order, the layout
// itself is TensorFlow's.
auto src_md = src_mkl_shape.IsMklTensor()
? src_mkl_shape.GetMklLayout()
: memory::desc(src_dims, MklDnnType<T>(),
memory::format::nc);
// src: set the memory descriptor and the op memory descriptor.
// The following two calls map the TF "src_tensor" onto the MKL tensor
// object "src"; both helpers are defined in mkl_util.h.
// The data format is "nc" for src and dst, since the src and dst
// buffers are always 2-D.
src.SetUsrMem(src_md, &src_tensor);
src.SetOpMemDesc(src_dims, memory::format::nc);
// creating a memory descriptor
int axis = 1; // axis to which softmax will be applied
auto softmax_fwd_desc = softmax_forward::desc(prop_kind::forward_scoring,
src.GetOpMemDesc(), axis);
auto softmax_fwd_pd = softmax_forward::primitive_desc(softmax_fwd_desc,
cpu_engine);
// Allocate the output tensor.
Tensor* output_tensor = nullptr;
MklDnnShape output_mkl_shape;
TensorShape output_tf_shape; // shape of output TF tensor.
// Softmax MklDnn output layout is same as input layout.
auto dst_pd = src.GetUsrMemPrimDesc();
// if input is MKL shape, output is also MKL shape.
// if input is TF shape, output is also TF shape
if (src_mkl_shape.IsMklTensor()) {
output_mkl_shape.SetMklTensor(true);
output_mkl_shape.SetMklLayout(&dst_pd);
output_mkl_shape.SetElemType(MklDnnType<T>());
output_mkl_shape.SetTfLayout(output_dims.size(), output_dims,
memory::format::nc);
output_tf_shape.AddDim((dst_pd.get_size() / sizeof(T)));
} else { // then output is also TF shape
output_mkl_shape.SetMklTensor(false);
output_tf_shape = MklDnnDimsToTFShape(output_dims);
}
// Allocate output shape (MKL or TF based on the above)
AllocateOutputSetMklShape(context, 0, &output_tensor, output_tf_shape,
output_mkl_shape);
// Output dims and input dims are the same.
dst.SetUsrMem(src_md, output_tensor);
// finally creating the "softmax op" using the primitive descriptor, src
// and dst
auto softmax_fwd =
softmax_forward(softmax_fwd_pd, src.GetOpMem(), dst.GetOpMem());
// execute net (pushing to the stream)
// following 3 are common for all mkl dnn ops
std::vector<primitive> net;
net.push_back(softmax_fwd);
stream(stream::kind::eager).submit(net).wait();
} catch (mkldnn::error& e) {
string error_msg = "Status: " + std::to_string(e.status) + ", message: " +
string(e.message) + ", in file " + string(__FILE__) +
":" + std::to_string(__LINE__);
OP_REQUIRES_OK(
context,
errors::Aborted("Operation received an exception:", error_msg));
}
}
};
/* Register DNN kernels for supported operations and supported types - right now
* it is only Softmax and f32 */
#define REGISTER_SOFTMAX_MKL_SUPPORTED_KERNELS_TYPES(type) \
REGISTER_KERNEL_BUILDER(Name("_MklSoftmax") \
.Device(DEVICE_CPU) \
.TypeConstraint<type>("T") \
.Label(mkl_op_registry::kMklOpLabel), \
MklSoftmaxOp<CPUDevice, type>);
TF_CALL_float(REGISTER_SOFTMAX_MKL_SUPPORTED_KERNELS_TYPES);
} // namespace tensorflow
#endif // INTEL_MKL_DNN
#endif // INTEL_MKL
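For orientation only: the kernel above treats its input as a 2-D (batch, class) buffer in "nc" format and applies softmax along axis 1, which corresponds to the ordinary Python-level call below. Whether a given graph actually runs through _MklSoftmax depends on the MKL build and its layout-rewrite pass, so this is just an illustration of the expected math:

import tensorflow as tf

logits = tf.constant([[1.0, 2.0, 3.0],
                      [3.0, 2.0, 1.0]])
probs = tf.nn.softmax(logits)   # softmax over the class dimension (axis 1)

with tf.Session() as sess:
    print(sess.run(probs))      # each row sums to 1.0
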

View File

@ -222,7 +222,7 @@ void DnnPoolingOp<T>::Compute(
output_desc, &output_data)
.ok();
OP_REQUIRES(context, status,
errors::Internal("cudnn PoolBackward launch failed"));
errors::Internal("cudnn PoolForward launch failed"));
if (data_format == FORMAT_NHWC) {
/// Transform the output data from NCHW back to NHWC

View File

@ -70,10 +70,24 @@ bool ReadRawFloatFileToComplexVector(
int offset = 0;
const int end = data_string.size();
while (offset < end) {
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
char arr[4];
for (int i = 0; i < kBytesPerValue; ++i ) {
arr[3 - i] = *(data_string.data() + offset + i);
}
memcpy(&real_out, arr, kBytesPerValue);
offset += kBytesPerValue;
for (int i = 0; i < kBytesPerValue; ++i ) {
arr[3 - i] = *(data_string.data() + offset + i);
}
memcpy(&imag_out, arr, kBytesPerValue);
offset += kBytesPerValue;
#else
memcpy(&real_out, data_string.data() + offset, kBytesPerValue);
offset += kBytesPerValue;
memcpy(&imag_out, data_string.data() + offset, kBytesPerValue);
offset += kBytesPerValue;
#endif
if (row_counter >= row_length) {
data->push_back(data_row);
data_row.clear();
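The new __ORDER_BIG_ENDIAN__ branch above byte-swaps every 4-byte value before the memcpy, which implies the raw file stores its floats little-endian. A small Python sketch of why the swap matters (values are illustrative):

import struct

raw = struct.pack('<f', 1.0)        # b'\x00\x00\x80\x3f': the little-endian bytes on disk
print(struct.unpack('<f', raw)[0])  # 1.0 on any host when the byte order is honored
print(struct.unpack('>f', raw)[0])  # ~4.6e-41: what a plain memcpy would read on a big-endian host
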

View File

@ -88,6 +88,18 @@ struct Transpose<CPUDevice, T, conjugate> {
internal::TransposeUsingEigen<CPUDevice, T, 5>(d, in, perm, conjugate,
out);
break;
case 6:
internal::TransposeUsingEigen<CPUDevice, T, 6>(d, in, perm, conjugate,
out);
break;
case 7:
internal::TransposeUsingEigen<CPUDevice, T, 7>(d, in, perm, conjugate,
out);
break;
case 8:
internal::TransposeUsingEigen<CPUDevice, T, 8>(d, in, perm, conjugate,
out);
break;
default:
TransposeSimple<T, conjugate>(d, in, perm, out);
break;
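Ranks 6 through 8 previously fell through to TransposeSimple; with the cases added above (and the matching GPU cases in the next hunk) they now take the specialized Eigen or tiled paths. A quick rank-6 example at the Python level:

import tensorflow as tf

x = tf.reshape(tf.range(2 * 3 * 4 * 5 * 6 * 7), [2, 3, 4, 5, 6, 7])
y = tf.transpose(x, perm=[5, 4, 3, 2, 1, 0])   # rank-6 permutation

with tf.Session() as sess:
    print(sess.run(tf.shape(y)))   # [7 6 5 4 3 2]
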

View File

@ -201,6 +201,27 @@ struct Transpose<GPUDevice, T, conjugate> {
out);
}
break;
case 6:
if (!internal::TransposeUsingTile<T, conjugate>::run(d, in, perm,
out)) {
internal::TransposeUsingEigen<GPUDevice, T, 6>(d, in, perm, conjugate,
out);
}
break;
case 7:
if (!internal::TransposeUsingTile<T, conjugate>::run(d, in, perm,
out)) {
internal::TransposeUsingEigen<GPUDevice, T, 7>(d, in, perm, conjugate,
out);
}
break;
case 8:
if (!internal::TransposeUsingTile<T, conjugate>::run(d, in, perm,
out)) {
internal::TransposeUsingEigen<GPUDevice, T, 8>(d, in, perm, conjugate,
out);
}
break;
default:
internal::TransposeSimple<T, conjugate>(d, in, perm, out);
break;

View File

@ -67,10 +67,12 @@ class SoftmaxXentWithLogitsOp : public OpKernel {
// Try to reuse the logits_in buffer for the backprop output.
OP_REQUIRES_OK(context, context->forward_input_or_allocate_output(
{0}, 1, logits_in.shape(), &back_out));
functor::XentFunctor<Device, T> functor;
functor(context->eigen_device<Device>(), logits_in.matrix<T>(),
labels_in.matrix<T>(), scratch.matrix<T>(), loss_out->vec<T>(),
back_out->matrix<T>());
if (logits_in.dim_size(0) > 0) {
functor::XentFunctor<Device, T> functor;
functor(context->eigen_device<Device>(), logits_in.matrix<T>(),
labels_in.matrix<T>(), scratch.matrix<T>(), loss_out->vec<T>(),
back_out->matrix<T>());
}
}
};
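The new guard above skips the Eigen functor when the batch dimension is zero, so a degenerate call simply produces an empty loss vector. A minimal sketch at the Python level:

import tensorflow as tf

logits = tf.zeros([0, 5])   # zero-row batch
labels = tf.zeros([0, 5])
loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(loss).shape)   # (0,): no work is done for the empty batch
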

View File

@ -17,6 +17,7 @@ limitations under the License.
#include "tensorflow/core/lib/gif/gif_io.h"
#include "tensorflow/core/lib/gtl/cleanup.h"
#include "tensorflow/core/lib/strings/strcat.h"
#include "tensorflow/core/platform/gif.h"
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/mem.h"
@ -44,7 +45,8 @@ int input_callback(GifFileType* gif_file, GifByteType* buf, int size) {
}
uint8* Decode(const void* srcdata, int datasize,
std::function<uint8*(int, int, int, int)> allocate_output) {
const std::function<uint8*(int, int, int, int)>& allocate_output,
string* error_string) {
int error_code = D_GIF_SUCCEEDED;
InputBufferInfo info = {reinterpret_cast<const uint8*>(srcdata), datasize};
GifFileType* gif_file =
@ -57,17 +59,17 @@ uint8* Decode(const void* srcdata, int datasize,
}
});
if (error_code != D_GIF_SUCCEEDED) {
LOG(ERROR) << "Fail to open gif file, reason: "
<< GifErrorString(error_code);
*error_string = strings::StrCat("failed to open gif file: ",
GifErrorString(error_code));
return nullptr;
}
if (DGifSlurp(gif_file) != GIF_OK) {
LOG(ERROR) << "Fail to slurp gif file, reason: "
<< GifErrorString(gif_file->Error);
*error_string = strings::StrCat("failed to slurp gif file: ",
GifErrorString(gif_file->Error));
return nullptr;
}
if (gif_file->ImageCount <= 0) {
LOG(ERROR) << "Gif file does not contain any image";
*error_string = strings::StrCat("gif file does not contain any image");
return nullptr;
}
@ -83,7 +85,7 @@ uint8* Decode(const void* srcdata, int datasize,
GifImageDesc* img_desc = &this_image->ImageDesc;
if (img_desc->Left != 0 || img_desc->Top != 0 || img_desc->Width != width ||
img_desc->Height != height) {
LOG(ERROR) << "Can't process optimized gif.";
*error_string = strings::StrCat("can't process optimized gif");
return nullptr;
}

View File

@ -43,7 +43,8 @@ namespace tensorflow {
namespace gif {
uint8* Decode(const void* srcdata, int datasize,
std::function<uint8*(int, int, int, int)> allocate_output);
const std::function<uint8*(int, int, int, int)>& allocate_output,
string* error_string);
} // namespace gif
} // namespace tensorflow

View File

@ -1818,7 +1818,11 @@ REGISTER_OP("_MklMaxPool")
.Input("input: T")
.Input("mkl_input: uint8")
.Output("output: T")
#ifndef INTEL_MKL_DNN
.Output("workspace: T")
#else
.Output("workspace: uint8")
#endif
.Output("mkl_output: uint8")
.Output("mkl_workspace: uint8")
.SetShapeFn(shape_inference::MaxPoolShape)
@ -1840,7 +1844,11 @@ REGISTER_OP("_MklMaxPoolGrad")
.Input("orig_input: T")
.Input("orig_output: T")
.Input("grad: T")
#ifndef INTEL_MKL_DNN
.Input("workspace: T")
#else
.Input("workspace: uint8")
#endif
.Input("mkl_orig_input: uint8")
.Input("mkl_orig_output: uint8")
.Input("mkl_grad: uint8")

View File

@ -48,6 +48,7 @@ void AWSLogSystem::LogStream(Aws::Utils::Logging::LogLevel log_level,
void AWSLogSystem::LogMessage(Aws::Utils::Logging::LogLevel log_level,
const std::string& message) {
if (message == "Initializing Curl library") return;
switch (log_level) {
case Aws::Utils::Logging::LogLevel::Info:
LOG(INFO) << message;

View File

@ -14,11 +14,13 @@ limitations under the License.
==============================================================================*/
#include "tensorflow/core/platform/s3/s3_file_system.h"
#include "tensorflow/core/lib/io/path.h"
#include "tensorflow/core/lib/strings/str_util.h"
#include "tensorflow/core/platform/mutex.h"
#include "tensorflow/core/platform/s3/aws_logging.h"
#include "tensorflow/core/platform/s3/s3_crypto.h"
#include <aws/core/Aws.h>
#include <aws/core/config/AWSProfileConfigLoader.h>
#include <aws/core/utils/FileSystemUtils.h>
#include <aws/core/utils/logging/AWSLogging.h>
#include <aws/core/utils/logging/LogSystemInterface.h>
@ -54,13 +56,37 @@ Aws::Client::ClientConfiguration& GetDefaultClientConfig() {
cfg.endpointOverride = Aws::String(endpoint);
}
const char* region = getenv("AWS_REGION");
if (!region) {
// TODO (yongtang): `S3_REGION` should be deprecated after 2.0.
region = getenv("S3_REGION");
}
if (region) {
cfg.region = Aws::String(region);
} else {
// TODO (yongtang): `S3_REGION` should be deprecated after 2.0.
const char* region = getenv("S3_REGION");
if (region) {
cfg.region = Aws::String(region);
// Load config file (e.g., ~/.aws/config) only if AWS_SDK_LOAD_CONFIG
// is set with a truthy value.
const char* load_config_env = getenv("AWS_SDK_LOAD_CONFIG");
string load_config =
load_config_env ? str_util::Lowercase(load_config_env) : "";
if (load_config == "true" || load_config == "1") {
Aws::String config_file;
// If AWS_CONFIG_FILE is set then use it, otherwise use ~/.aws/config.
const char* config_file_env = getenv("AWS_CONFIG_FILE");
if (config_file_env) {
config_file = config_file_env;
} else {
const char* home_env = getenv("HOME");
if (home_env) {
config_file = home_env;
config_file += "/.aws/config";
}
}
Aws::Config::AWSConfigFileProfileConfigLoader loader(config_file);
loader.Load();
auto profiles = loader.GetProfiles();
if (!profiles["default"].GetRegion().empty()) {
cfg.region = profiles["default"].GetRegion();
}
}
}
const char* use_https = getenv("S3_USE_HTTPS");
@ -79,6 +105,22 @@ Aws::Client::ClientConfiguration& GetDefaultClientConfig() {
cfg.verifySSL = true;
}
}
const char* connect_timeout = getenv("S3_CONNECT_TIMEOUT_MSEC");
if (connect_timeout) {
int64 timeout;
if (strings::safe_strto64(connect_timeout, &timeout)) {
cfg.connectTimeoutMs = timeout;
}
}
const char* request_timeout = getenv("S3_REQUEST_TIMEOUT_MSEC");
if (request_timeout) {
int64 timeout;
if (strings::safe_strto64(request_timeout, &timeout)) {
cfg.requestTimeoutMs = timeout;
}
}
init = true;
}
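The new knobs above are read from the environment once, when the S3 client configuration is first built. A hedged usage sketch, assuming an S3-enabled build; the bucket, key, and timeout values are placeholders:

import os

# Must be set before the first s3:// access in the process.
os.environ["AWS_SDK_LOAD_CONFIG"] = "1"            # honor ~/.aws/config (or AWS_CONFIG_FILE)
os.environ["S3_CONNECT_TIMEOUT_MSEC"] = "30000"
os.environ["S3_REQUEST_TIMEOUT_MSEC"] = "600000"

import tensorflow as tf

print(tf.gfile.Exists("s3://my-bucket/some/key"))  # placeholder bucket/key
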

View File

@ -24,7 +24,7 @@ limitations under the License.
// TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
// "-beta", "-rc", "-rc.1")
#define TF_VERSION_SUFFIX "-rc0"
#define TF_VERSION_SUFFIX "-rc1"
#define TF_STR_HELPER(x) #x
#define TF_STR(x) TF_STR_HELPER(x)

View File

@ -14,16 +14,16 @@ suitable if fast sharding or other non-sequential access is desired.
## TFRecords Format Details
A TFRecords file contains a sequence of strings with CRC hashes. Each record
has the format
A TFRecords file contains a sequence of strings with CRC32C (32-bit CRC using
the Castagnoli polynomial) hashes. Each record has the format
uint64 length
uint32 masked_crc32_of_length
byte data[length]
uint32 masked_crc32_of_data
and the records are concatenated together to produce the file. The CRC32s
are [described here](https://en.wikipedia.org/wiki/Cyclic_redundancy_check),
and the mask of a CRC is
and the records are concatenated together to produce the file. CRCs are
[described here](https://en.wikipedia.org/wiki/Cyclic_redundancy_check), and
the mask of a CRC is
masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
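In Python terms, the masking and its inverse look like the sketch below; crc is the CRC32C (Castagnoli) of the length bytes or of the data bytes, computed by a CRC32C routine that is not shown here:

def mask_crc32c(crc):
    # masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8, kept in 32 bits
    return (((crc >> 15) | (crc << 17)) + 0xa282ead8) & 0xffffffff

def unmask_crc32c(masked):
    rot = (masked - 0xa282ead8) & 0xffffffff
    return ((rot << 15) | (rot >> 17)) & 0xffffffff

assert unmask_crc32c(mask_crc32c(0x12345678)) == 0x12345678
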

View File

@ -38,7 +38,7 @@ enable TensorFlow for C:
OS="linux" # Change to "darwin" for macOS
TARGET_DIRECTORY="/usr/local"
curl -L \
"https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.5.0-rc0.tar.gz" |
"https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.5.0-rc1.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`

View File

@ -38,7 +38,7 @@ steps to install this library and enable TensorFlow for Go:
TF_TYPE="cpu" # Change to "gpu" for GPU support
TARGET_DIRECTORY='/usr/local'
curl -L \
"https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.5.0-rc0.tar.gz" |
"https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.5.0-rc1.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`

View File

@ -36,7 +36,7 @@ following to the project's `pom.xml` to use the TensorFlow Java APIs:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
<version>1.5.0-rc0</version>
<version>1.5.0-rc1</version>
</dependency>
```
@ -65,7 +65,7 @@ As an example, these steps will create a Maven project that uses TensorFlow:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
<version>1.5.0-rc0</version>
<version>1.5.0-rc1</version>
</dependency>
</dependencies>
</project>
@ -123,12 +123,12 @@ instead:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>libtensorflow</artifactId>
<version>1.4.0</version>
<version>1.5.0-rc1</version>
</dependency>
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>libtensorflow_jni_gpu</artifactId>
<version>1.4.0</version>
<version>1.5.0-rc1</version>
</dependency>
```
@ -147,7 +147,7 @@ refer to the simpler instructions above instead.
Take the following steps to install TensorFlow for Java on Linux or macOS:
1. Download
[libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.5.0-rc0.jar),
[libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.5.0-rc1.jar),
which is the TensorFlow Java Archive (JAR).
2. Decide whether you will run TensorFlow for Java on CPU(s) only or with
@ -166,7 +166,7 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
mkdir -p ./jni
curl -L \
"https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.5.0-rc0.tar.gz" |
"https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.5.0-rc1.tar.gz" |
tar -xz -C ./jni
### Install on Windows
@ -174,10 +174,10 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
Take the following steps to install TensorFlow for Java on Windows:
1. Download
[libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.5.0-rc0.jar),
[libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.5.0-rc1.jar),
which is the TensorFlow Java Archive (JAR).
2. Download the following Java Native Interface (JNI) file appropriate for
[TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.5.0-rc0.zip).
[TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.5.0-rc1.zip).
3. Extract this .zip file.
@ -225,7 +225,7 @@ must be part of your `classpath`. For example, you can include the
downloaded `.jar` in your `classpath` by using the `-cp` compilation flag
as follows:
<pre><b>javac -cp libtensorflow-1.5.0-rc0.jar HelloTF.java</b></pre>
<pre><b>javac -cp libtensorflow-1.5.0-rc1.jar HelloTF.java</b></pre>
### Running
@ -239,11 +239,11 @@ two files are available to the JVM:
For example, the following command line executes the `HelloTF` program on Linux
and macOS X:
<pre><b>java -cp libtensorflow-1.5.0-rc0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
<pre><b>java -cp libtensorflow-1.5.0-rc1.jar:. -Djava.library.path=./jni HelloTF</b></pre>
And the following command line executes the `HelloTF` program on Windows:
<pre><b>java -cp libtensorflow-1.5.0-rc0.jar;. -Djava.library.path=jni HelloTF</b></pre>
<pre><b>java -cp libtensorflow-1.5.0-rc1.jar;. -Djava.library.path=jni HelloTF</b></pre>
If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
installed TensorFlow for Java and are ready to use the API. If the program

View File

@ -188,7 +188,7 @@ Take the following steps to install TensorFlow with Virtualenv:
Virtualenv environment:
<pre>(tensorflow)$ <b>pip3 install --upgrade \
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp34-cp34m-linux_x86_64.whl</b></pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
If you encounter installation problems, see
[Common Installation Problems](#common_installation_problems).
@ -293,7 +293,7 @@ take the following steps:
<pre>
$ <b>sudo pip3 install --upgrade \
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp34-cp34m-linux_x86_64.whl</b>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp34-cp34m-linux_x86_64.whl</b>
</pre>
If this step fails, see
@ -480,7 +480,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
<pre>
(tensorflow)$ <b>pip install --ignore-installed --upgrade \
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp34-cp34m-linux_x86_64.whl</b></pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
<a name="ValidateYourInstallation"></a>
@ -648,14 +648,14 @@ This section documents the relevant values for Linux installations.
CPU only:
<pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp27-none-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp27-none-linux_x86_64.whl
</pre>
GPU support:
<pre>
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc0-cp27-none-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc1-cp27-none-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@ -667,14 +667,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp34-cp34m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp34-cp34m-linux_x86_64.whl
</pre>
GPU support:
<pre>
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc0-cp34-cp34m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc1-cp34-cp34m-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@ -686,14 +686,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp35-cp35m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp35-cp35m-linux_x86_64.whl
</pre>
GPU support:
<pre>
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc0-cp35-cp35m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc1-cp35-cp35m-linux_x86_64.whl
</pre>
@ -705,14 +705,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp36-cp36m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp36-cp36m-linux_x86_64.whl
</pre>
GPU support:
<pre>
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc0-cp36-cp36m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc1-cp36-cp36m-linux_x86_64.whl
</pre>

View File

@ -115,7 +115,7 @@ Take the following steps to install TensorFlow with Virtualenv:
TensorFlow in the active Virtualenv is as follows:
<pre> $ <b>pip3 install --upgrade \
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-any.whl</b></pre>
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py2-none-any.whl</b></pre>
If you encounter installation problems, see
[Common Installation Problems](#common-installation-problems).
@ -238,7 +238,7 @@ take the following steps:
issue the following command:
<pre> $ <b>sudo pip3 install --upgrade \
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-any.whl</b> </pre>
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py2-none-any.whl</b> </pre>
If the preceding command fails, see
[installation problems](#common-installation-problems).
@ -347,7 +347,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
TensorFlow for Python 2.7:
<pre> (<i>targetDirectory</i>)$ <b>pip install --ignore-installed --upgrade \
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-any.whl</b></pre>
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py2-none-any.whl</b></pre>
<a name="ValidateYourInstallation"></a>
@ -520,7 +520,7 @@ This section documents the relevant values for Mac OS installations.
<pre>
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-any.whl
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py2-none-any.whl
</pre>
@ -528,5 +528,5 @@ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-a
<pre>
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py3-none-any.whl
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py3-none-any.whl
</pre>

View File

@ -361,10 +361,10 @@ Invoke `pip install` to install that pip package.
The filename of the `.whl` file depends on your platform.
For example, the following command will install the pip package
for TensorFlow 1.5.0rc0 on Linux:
for TensorFlow 1.5.0rc1 on Linux:
<pre>
$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.5.0rc0-py2-none-any.whl</b>
$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.5.0rc1-py2-none-any.whl</b>
</pre>
## Validate your installation
@ -462,9 +462,12 @@ Stack Overflow and specify the `tensorflow` tag.
**Linux**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
<tr><td>tensorflow-1.5.0-rc1</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.8.0</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.5.0-rc1</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.8.0</td><td>7</td><td>9</td></tr>
<tr><td>tensorflow-1.4.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.5.4</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.4.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.5.4</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.3.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.2.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.2.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>5.1</td><td>8</td></tr>
@ -477,8 +480,9 @@ Stack Overflow and specify the `tensorflow` tag.
**Mac**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
<tr><td>tensorflow-1.5.0-rc1</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.8.1</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.4.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.5.4</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.2.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.1.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.2</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.1.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.2</td><td>5.1</td><td>8</td></tr>
@ -489,6 +493,8 @@ Stack Overflow and specify the `tensorflow` tag.
**Windows**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
<tr><td>tensorflow-1.5.0-rc1</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.5.0-rc1</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>7</td><td>9</td></tr>
<tr><td>tensorflow-1.4.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.4.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>

View File

@ -21,6 +21,8 @@ from __future__ import print_function
import collections
import math
import os
import sys
import argparse
import random
from tempfile import gettempdir
import zipfile
@ -30,6 +32,24 @@ from six.moves import urllib
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector
# Give a folder path as an argument with '--log_dir' to save
# TensorBoard summaries. Default is a log folder in current directory.
current_path = os.path.dirname(os.path.realpath(sys.argv[0]))
parser = argparse.ArgumentParser()
parser.add_argument(
'--log_dir',
type=str,
default=os.path.join(current_path, 'log'),
help='The log directory for TensorBoard summaries.')
FLAGS, unparsed = parser.parse_known_args()
# Create the directory for TensorBoard variables if it does not exist.
if not os.path.exists(FLAGS.log_dir):
os.makedirs(FLAGS.log_dir)
# Step 1: Download the data.
url = 'http://mattmahoney.net/dc/'
@ -156,38 +176,47 @@ graph = tf.Graph()
with graph.as_default():
# Input data.
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
with tf.name_scope('inputs'):
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
# Ops and variables pinned to the CPU because of missing GPU implementation
with tf.device('/cpu:0'):
# Look up embeddings for inputs.
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
with tf.name_scope('embeddings'):
embeddings = tf.Variable(
tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
with tf.name_scope('weights'):
nce_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=1.0 / math.sqrt(embedding_size)))
with tf.name_scope('biases'):
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Compute the average NCE loss for the batch.
# tf.nce_loss automatically draws a new sample of the negative labels each
# time we evaluate the loss.
# Explanation of the meaning of NCE loss:
# http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
loss = tf.reduce_mean(
tf.nn.nce_loss(weights=nce_weights,
biases=nce_biases,
labels=train_labels,
inputs=embed,
num_sampled=num_sampled,
num_classes=vocabulary_size))
with tf.name_scope('loss'):
loss = tf.reduce_mean(
tf.nn.nce_loss(weights=nce_weights,
biases=nce_biases,
labels=train_labels,
inputs=embed,
num_sampled=num_sampled,
num_classes=vocabulary_size))
# Add the loss value as a scalar to summary.
tf.summary.scalar('loss', loss)
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
with tf.name_scope('optimizer'):
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
@ -197,13 +226,22 @@ with graph.as_default():
similarity = tf.matmul(
valid_embeddings, normalized_embeddings, transpose_b=True)
# Merge all summaries.
merged = tf.summary.merge_all()
# Add variable initializer.
init = tf.global_variables_initializer()
# Create a saver.
saver = tf.train.Saver()
# Step 5: Begin training.
num_steps = 100001
with tf.Session(graph=graph) as session:
# Open a writer to write summaries.
writer = tf.summary.FileWriter(FLAGS.log_dir, session.graph)
# We must initialize all variables before we use them.
init.run()
print('Initialized')
@ -214,10 +252,21 @@ with tf.Session(graph=graph) as session:
batch_size, num_skips, skip_window)
feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
# Define metadata variable.
run_metadata = tf.RunMetadata()
# We perform one update step by evaluating the optimizer op (including it
# in the list of returned values for session.run()
_, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
# Also, evaluate the merged op to get all summaries from the returned "summary" variable.
# Feed metadata variable to session for visualizing the graph in TensorBoard.
_, summary, loss_val = session.run([optimizer, merged, loss], feed_dict=feed_dict, run_metadata=run_metadata)
average_loss += loss_val
# Add returned summaries to writer in each step.
writer.add_summary(summary, step)
# Add metadata to visualize the graph for the last run.
if step == (num_steps - 1):
writer.add_run_metadata(run_metadata, 'step%d' % step)
if step % 2000 == 0:
if step > 0:
@ -240,6 +289,23 @@ with tf.Session(graph=graph) as session:
print(log_str)
final_embeddings = normalized_embeddings.eval()
# Write corresponding labels for the embeddings.
with open(FLAGS.log_dir + '/metadata.tsv', 'w') as f:
for i in xrange(vocabulary_size):
f.write(reverse_dictionary[i] + '\n')
# Save the model for checkpoints.
saver.save(session, os.path.join(FLAGS.log_dir, "model.ckpt"))
# Create a configuration for visualizing embeddings with the labels in TensorBoard.
config = projector.ProjectorConfig()
embedding_conf = config.embeddings.add()
embedding_conf.tensor_name = embeddings.name
embedding_conf.metadata_path = os.path.join(FLAGS.log_dir, 'metadata.tsv')
projector.visualize_embeddings(writer, config)
writer.close()
# Step 6: Visualize the embeddings.

View File

@ -1,3 +1,4 @@
# -*- coding: utf-8 -*-
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -301,6 +302,27 @@ class BatchDatasetTest(test.TestCase):
with self.assertRaises(errors.OutOfRangeError):
sess.run(get_next)
def testPaddedBatchDatasetUnicode(self):
# See GitHub issue 16149
def generator():
data = [
[u'Простой', u'тест', u'юникода'],
[u'никогда', u'не', u'бывает', u'простым']]
for seq in data:
yield seq, [0, 1, 2, 3]
dataset = dataset_ops.Dataset.from_generator(
generator,
(dtypes.string, dtypes.int32),
(tensor_shape.TensorShape([None]), tensor_shape.TensorShape([None])))
padded_dataset = dataset.padded_batch(2, padded_shapes=([None], [None]),
padding_values=('', 0))
with self.test_session() as sess:
next_element = padded_dataset.make_one_shot_iterator().get_next()
sess.run(next_element)
def testPaddedBatchDatasetShapeSpecifications(self):
int_placeholder = array_ops.placeholder(dtypes.int32)
float_placeholder = array_ops.placeholder(dtypes.float32)

View File

@ -128,9 +128,10 @@ class Estimator(object):
model_dir: Directory to save model parameters, graph and etc. This can
also be used to load checkpoints from the directory into a estimator to
continue training a previously saved model. If `None`, the model_dir in
`config` will be used if set. If both are set, they must be same. If
both are `None`, a temporary directory will be used.
continue training a previously saved model. If `PathLike` object, the
path will be resolved. If `None`, the model_dir in `config` will be used
if set. If both are set, they must be the same. If both are `None`, a
temporary directory will be used.
config: Configuration object.
params: `dict` of hyper parameters that will be passed into `model_fn`.
Keys are names of parameters, values are basic python types.
@ -158,6 +159,7 @@ class Estimator(object):
self._config = config
# Model directory.
model_dir = compat.path_to_str(model_dir)
if (model_dir is not None) and (self._config.model_dir is not None):
if model_dir != self._config.model_dir:
# TODO(alanyee): remove this suppression after it is no longer needed
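With compat.path_to_str in place, model_dir accepts any os.PathLike as well as a plain string. A hedged sketch; the do-nothing model_fn below exists only to make the example self-contained:

import tempfile
from pathlib import Path

import tensorflow as tf

def model_fn(features, labels, mode):
    # Minimal model_fn, just enough to construct an Estimator.
    loss = tf.constant(0.0)
    train_op = tf.assign_add(tf.train.get_or_create_global_step(), 1)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

model_dir = Path(tempfile.mkdtemp()) / "my_model"   # an os.PathLike, not a str
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=model_dir)
print(estimator.model_dir)                          # resolved to a string path
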

Some files were not shown because too many files have changed in this diff.