Merge changes from github.
PiperOrigin-RevId: 183100142
Parent: 7b62a71e2d
Commit: d9f93c42a5

Changed file: RELEASE.md (174 changed lines)
@@ -13,46 +13,146 @@
* [TensorFlow Lite](https://github.com/tensorflow/tensorflow/tree/r1.5/tensorflow/contrib/lite)
  dev preview is now available.
* CUDA 9 and cuDNN 7 support.
* Accelerated Linear Algebra (XLA):
  * Add `complex64` support to XLA compiler.
  * `bfloat` support is now added to XLA infrastructure.
  * Make `ClusterSpec` propagation work with XLA devices.
  * Use a deterministic executor to generate XLA graph.
* `tf.contrib`:
  * `tf.contrib.distributions`:
    * Add `tf.contrib.distributions.Autoregressive`.
    * Make `tf.contrib.distributions` QuadratureCompound classes support batch.
    * Infer `tf.contrib.distributions.RelaxedOneHotCategorical` `dtype` from arguments.
    * Make `tf.contrib.distributions` quadrature family parameterized by
      `quadrature_grid_and_prob` vs `quadrature_degree`.
    * `auto_correlation` added to `tf.contrib.distributions`.
  * Add `tf.contrib.bayesflow.layers`, a collection of probabilistic (neural) layers.
  * Add `tf.contrib.bayesflow.halton_sequence`.
  * Add `tf.contrib.data.make_saveable_from_iterator`.
  * Add `tf.contrib.data.shuffle_and_repeat` (see the sketch after this list).
  * Add new custom transformation: `tf.contrib.data.scan()`.
  * `tf.contrib.distributions.bijectors`:
    * Add `tf.contrib.distributions.bijectors.MaskedAutoregressiveFlow`.
    * Add `tf.contrib.distributions.bijectors.Permute`.
    * Add `tf.contrib.distributions.bijectors.Gumbel`.
    * Add `tf.contrib.distributions.bijectors.Reshape`.
  * Support shape inference (i.e., shapes containing -1) in the Reshape bijector.
* Add `streaming_precision_recall_at_equal_thresholds`, a method for computing
  streaming precision and recall with `O(num_thresholds + size of predictions)`
  time and space complexity.
* Change `RunConfig` default behavior to not set a random seed, making random
  behavior independently random on distributed workers. We expect this to
  generally improve training performance. Models that do rely on determinism
  should set a random seed explicitly.
* Replaced the implementation of `tf.flags` with `absl.flags`.
* Add support for `CUBLAS_TENSOR_OP_MATH` in fp16 GEMM.
* Add support for CUDA on NVIDIA Tegra devices.
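The two `tf.contrib.data` additions above (`make_saveable_from_iterator` and
`shuffle_and_repeat`) are easiest to see in a short sketch. The snippet below
is illustrative only; the dataset, buffer size, repeat count, and saver wiring
are assumptions, not code from this commit:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(100)
# Fused shuffle-and-repeat transformation, applied with Dataset.apply().
dataset = dataset.apply(
    tf.contrib.data.shuffle_and_repeat(buffer_size=10, count=2))

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Register the iterator state as saveable so a regular tf.train.Saver
# checkpoints and restores it along with the model variables.
saveable = tf.contrib.data.make_saveable_from_iterator(iterator)
tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable)
saver = tf.train.Saver()
```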

## Bug Fixes and Other Changes
* `auto_correlation` added to `tf.contrib.distributions`.
* Add `DenseFlipout` probabilistic layer.
* Restandardize `DenseVariational` as simpler template for other probabilistic layers.
* Make `tf.contrib.distributions` QuadratureCompound classes support batch.
* Documentation updates:
  * Clarified that you can only install TensorFlow on 64-bit machines.
  * Added a short doc explaining how `Estimator`s save checkpoints.
  * Add documentation for ops supported by the `tf2xla` bridge.
  * Fix minor typos in the doc of `SpaceToDepth` and `DepthToSpace`.
  * Updated documentation comments in `mfcc_mel_filterbank.h` and `mfcc.h` to
    clarify that the input domain is squared magnitude spectra and the weighting
    is done on linear magnitude spectra (sqrt of inputs).
  * Change `tf.contrib.distributions` docstring examples to use `tfd` alias
    rather than `ds`, `bs`.
  * Fix docstring typos in `tf.distributions.bijectors.Bijector`.
  * `tf.assert_equal` no longer raises `ValueError`. It now raises
    `InvalidArgumentError`, as documented.
  * Update Getting Started docs and API intro.
* Google Cloud Storage (GCS):
  * Add userspace DNS caching for the GCS client.
  * Customize request timeouts for the GCS filesystem.
  * Improve GCS filesystem caching.
* Bug Fixes:
  * Fix bug where partitioned integer variables got their wrong shapes. Before
    this change, all partitions of an integer variable were initialized with the
    shape of the unpartitioned variable; after this change they are initialized
    correctly.
  * Fix correctness bug in CPU and GPU implementations of Adadelta.
  * Fix a bug in `import_meta_graph`'s handling of partitioned variables when
    importing into a scope. WARNING: This may break loading checkpoints of
    graphs with partitioned variables saved after using `import_meta_graph` with
    a non-empty `import_scope` argument.
  * Fix bug in offline debugger which prevented viewing events.
  * Added the `WorkerService.DeleteWorkerSession` method to the gRPC interface,
    to fix a memory leak. Ensure that your master and worker servers are running
    the same version of TensorFlow to avoid compatibility issues.
  * Fix bug in peephole implementation of BlockLSTM cell.
  * Fix bug by casting dtype of `log_det_jacobian` to match `log_prob` in
    `TransformedDistribution`.
  * Ensure `tf.distributions.Multinomial` doesn't underflow in `log_prob`.
* Other:
  * Add necessary shape util support for bfloat16.
  * Add a way to run ops using a step function to MonitoredSession.
  * Add `DenseFlipout` probabilistic layer.
  * A new flag `ignore_live_threads` is available on train. If set to `True`, it
    will ignore threads that remain running when tearing down infrastructure
    after successfully completing training, instead of throwing a RuntimeError.
  * Restandardize `DenseVariational` as simpler template for other probabilistic
    layers.
  * `tf.data` now supports `tf.SparseTensor` components in dataset elements.
  * It is now possible to iterate over `Tensor`s.
  * Allow `SparseSegmentReduction` ops to have missing segment IDs.
  * Modify custom export strategy to account for multidimensional sparse float
    splits.
  * `Conv2D`, `Conv2DBackpropInput`, and `Conv2DBackpropFilter` now support arbitrary
    dilations with GPU and cuDNNv6 support.
  * `Estimator` now supports `Dataset`: `input_fn` can return a `Dataset`
    instead of `Tensor`s.
  * Add `RevBlock`, a memory-efficient implementation of reversible residual layers.
  * Reduce BFCAllocator internal fragmentation.
  * Add `cross_entropy` and `kl_divergence` to `tf.distributions.Distribution`.
  * Add `tf.nn.softmax_cross_entropy_with_logits_v2` which enables backprop
    w.r.t. the labels (see the sketch after this list).
  * GPU back-end now uses `ptxas` to compile generated PTX.
  * `BufferAssignment`'s protocol buffer dump is now deterministic.
  * Change embedding op to use parallel version of `DynamicStitch`.
  * Add support for sparse multidimensional feature columns.
  * Speed up the case for sparse float columns that have only 1 value.
  * Allow sparse float splits to support multivalent feature columns.
  * Add `quantile` to `tf.distributions.TransformedDistribution`.
  * Add `NCHW_VECT_C` support for `tf.depth_to_space` on GPU.
  * Add `NCHW_VECT_C` support for `tf.space_to_depth` on GPU.
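The `softmax_cross_entropy_with_logits_v2` item above is the one most easily
shown in code. A minimal sketch follows; the toy logits/labels and the use of
`tf.gradients` are illustrative assumptions, not code from this commit:

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])
labels = tf.nn.softmax(tf.constant([[0.5, 1.5, 0.2]]))  # soft labels

loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)

# Unlike the original op, the _v2 variant lets gradients flow into `labels`,
# so this gradient is a Tensor rather than None.
grad_wrt_labels = tf.gradients(loss, [labels])[0]
```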

## API Changes
* Rename `SqueezeDims` attribute to `Axis` in C++ API for Squeeze op.
* `Stream::BlockHostUntilDone` now returns Status rather than bool.
* Customize request timeouts for the GCS filesystem.
* Minor refactor: move stats files from `stochastic` to `common` and remove
  `stochastic`.

## Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

4d55397500, Abdullah Alrasheed, abenmao, Adam Salvail, Aditya Dhulipala, Ag Ramesh,
|
||||
Akimasa Kimura, Alan Du, Alan Yee, Alexander, Amit Kushwaha, Amy, Andrei Costinescu,
|
||||
Andrei Nigmatulin, Andrew Erlichson, Andrew Myers, Andrew Stepanov, Androbin, AngryPowman,
|
||||
Anish Shah, Anton Daitche, Artsiom Chapialiou, asdf2014, Aseem Raj Baranwal, Ash Hall,
|
||||
Bart Kiers, Batchu Venkat Vishal, ben, Ben Barsdell, Bill Piel, Carl Thomé, Catalin Voss,
|
||||
Changming Sun, Chengzhi Chen, Chi Zeng, Chris Antaki, Chris Donahue, Chris Oelmueller,
|
||||
Chris Tava, Clayne Robison, Codrut, Courtial Florian, Dalmo Cirne, Dan J, Darren Garvey,
|
||||
David Kristoffersson, David Norman, David RöThlisberger, DavidNorman, Dhruv, DimanNe,
|
||||
Dorokhov, Duncan Mac-Vicar P, EdwardDixon, EMCP, error.d, FAIJUL, Fan Xia,
|
||||
Francois Xavier, Fred Reiss, Freedom" Koan-Sin Tan, Fritz Obermeyer, Gao, Xiang,
|
||||
Guenther Schmuelling, Guo Yejun (郭叶军), Hans Gaiser, HectorSVC, Hyungsuk Yoon,
|
||||
James Pruegsanusak, Jay Young, Jean Wanka, Jeff Carpenter, Jeremy Rutman, Jeroen BéDorf,
|
||||
Jett Jones, Jimmy Jia, jinghuangintel, jinze1994, JKurland, Joel Hestness, joetoth,
|
||||
John B Nelson, John Impallomeni, John Lawson, Jonas, Jonathan Dekhtiar, joshkyh, Jun Luan,
|
||||
Jun Mei, Kai Sasaki, Karl Lessard, karl@kubx.ca, Kb Sriram, Kenichi Ueno, Kevin Slagle,
|
||||
Kongsea, Lakshay Garg, lhlmgr, Lin Min, liu.guangcong, Loki Der Quaeler, Louie Helm,
|
||||
lucasmoura, Luke Iwanski, Lyndon White, Mahmoud Abuzaina, Marcel Puyat, Mark Aaron Shirley,
|
||||
Michele Colombo, MtDersvan, Namrata-Ibm, Nathan Luehr, Naurril, Nayana Thorat, Nicolas Lopez,
|
||||
Niranjan Hasabnis, Nolan Liu, Nouce, Oliver Hennigh, osdamv, Patrik Erdes,
|
||||
Patryk Chrabaszcz, Pavel Christof, Penghao Cen, postBG, Qingqing Cao, Qingying Chen, qjivy,
|
||||
Raphael, Rasmi, raymondxyang, Renze Yu, resec, Roffel, Ruben Vereecken, Ryohei Kuroki,
|
||||
sandipmgiri, Santiago Castro, Scott Kirkland, Sean Vig, Sebastian Raschka, Sebastian Weiss,
|
||||
Sergey Kolesnikov, Sergii Khomenko, Shahid, Shivam Kotwalia, Stuart Berg, Sumit Gouthaman,
|
||||
superzerg, Sven Mayer, tetris, Ti Zhou, Tiago Freitas Pereira, Tian Jin, Tomoaki Oiki,
|
||||
Vaibhav Sood, vfdev, Vivek Rane, Vladimir Moskva, wangqr, Weber Xie, Will Frey,
|
||||
Yan Facai (颜发才), yanivbl6, Yaroslav Bulatov, Yixing Lao, Yong Tang, youkaichao,
|
||||
Yuan (Terry) Tang, Yue Zhang, Yuxin Wu, Ziming Dong, ZxYuan, 黄璞
|
||||
Adam Zahran, Ag Ramesh, Alan Lee, Alan Yee, Alex Sergeev, Alexander, Amir H. Jadidinejad,
|
||||
Amy, Anastasios Doumoulakis, Andrei Costinescu, Andrei Nigmatulin, Anthony Platanios,
|
||||
Anush Elangovan, arixlin, Armen Donigian, ArtëM Sobolev, Atlas7, Ben Barsdell, Bill Prin,
|
||||
Bo Wang, Brett Koonce, Cameron Thomas, Carl Thomé, Cem Eteke, cglewis, Changming Sun,
|
||||
Charles Shenton, Chi-Hung, Chris Donahue, Chris Filo Gorgolewski, Chris Hoyean Song,
|
||||
Chris Tava, Christian Grail, Christoph Boeddeker, cinqS, Clayne Robison, codrut3, concerttttt,
|
||||
CQY, Dan Becker, Dan Jarvis, Daniel Zhang, David Norman, dmaclach, Dmitry Trifonov,
|
||||
Donggeon Lim, dongpilYu, Dr. Kashif Rasul, Edd Wilder-James, Eric Lv, fcharras, Felix Abecassis,
|
||||
FirefoxMetzger, formath, FredZhang, Gaojin Cao, Gary Deer, Guenther Schmuelling, Hanchen Li,
|
||||
Hanmin Qin, hannesa2, hyunyoung2, Ilya Edrenkin, Jackson Kontny, Jan, Javier Luraschi,
|
||||
Jay Young, Jayaram Bobba, Jeff, Jeff Carpenter, Jeremy Sharpe, Jeroen BéDorf, Jimmy Jia,
|
||||
Jinze Bai, Jiongyan Zhang, Joe Castagneri, Johan Ju, Josh Varty, Julian Niedermeier,
|
||||
JxKing, Karl Lessard, Kb Sriram, Keven Wang, Koan-Sin Tan, Kyle Mills, lanhin, LevineHuang,
|
||||
Loki Der Quaeler, Loo Rong Jie, Luke Iwanski, LáSzló Csomor, Mahdi Abavisani, Mahmoud Abuzaina,
|
||||
ManHyuk, Marek ŠUppa, MathSquared, Mats Linander, Matt Wytock, Matthew Daley, Maximilian Bachl,
|
||||
mdymczyk, melvyniandrag, Michael Case, Mike Traynor, miqlas, Namrata-Ibm, Nathan Luehr,
|
||||
Nathan Van Doorn, Noa Ezra, Nolan Liu, Oleg Zabluda, opensourcemattress, Ouwen Huang,
|
||||
Paul Van Eck, peisong, Peng Yu, PinkySan, pks, powderluv, Qiao Hai-Jun, Qiao Longfei,
|
||||
Rajendra Arora, Ralph Tang, resec, Robin Richtsfeld, Rohan Varma, Ryohei Kuroki, SaintNazaire,
|
||||
Samuel He, Sandeep Dcunha, sandipmgiri, Sang Han, scott, Scott Mudge, Se-Won Kim, Simon Perkins,
|
||||
Simone Cirillo, Steffen Schmitz, Suvojit Manna, Sylvus, Taehoon Lee, Ted Chang, Thomas Deegan,
|
||||
Till Hoffmann, Tim, Toni Kunic, Toon Verstraelen, Tristan Rice, Urs KöSter, Utkarsh Upadhyay,
|
||||
Vish (Ishaya) Abrams, Winnie Tsang, Yan Chen, Yan Facai (颜发才), Yi Yang, Yong Tang,
|
||||
Youssef Hesham, Yuan (Terry) Tang, Zhengsheng Wei, zxcqwe4906, 张志豪, 田传武

We are also grateful to all who filed issues or helped resolve them, asked and
answered questions, and were part of inspiring discussions.

@@ -60,7 +160,15 @@ answered questions, and were part of inspiring discussions.

# Release 1.4.1

## Bug Fixes and Other Changes
* `LinearClassifier` fix for the Google Cloud Machine Learning Engine.
* `LinearClassifier` fix.

# Release 1.4.0

## Major Features And Improvements
* `tf.keras` is now part of the core TensorFlow API.
* [`tf.data`](http://tensorflow.org/programmers_guide/datasets) is now part of
  the core TensorFlow API.
  * The API is now subject to backwards compatibility guarantees.

# Release 1.4.0

@@ -2,11 +2,11 @@ workspace(name = "org_tensorflow")

http_archive(
    name = "io_bazel_rules_closure",
    sha256 = "110fe68753413777944b473c25eed6368c4a0487cee23a7bac1b13cc49d3e257",
    strip_prefix = "rules_closure-4af89ef1db659eb41f110df189b67d4cf14073e1",
    sha256 = "6691c58a2cd30a86776dd9bb34898b041e37136f2dc7e24cadaeaf599c95c657",
    strip_prefix = "rules_closure-08039ba8ca59f64248bb3b6ae016460fe9c9914f",
    urls = [
        "https://mirror.bazel.build/github.com/bazelbuild/rules_closure/archive/4af89ef1db659eb41f110df189b67d4cf14073e1.tar.gz",
        "https://github.com/bazelbuild/rules_closure/archive/4af89ef1db659eb41f110df189b67d4cf14073e1.tar.gz",  # 2017-08-28
        "https://mirror.bazel.build/github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz",
        "https://github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz",  # 2018-01-16
    ],
)

|
@ -662,6 +662,9 @@ filegroup(
|
||||
"//tensorflow/tools/quantization:all_files",
|
||||
"//tensorflow/tools/test:all_files",
|
||||
"//tensorflow/user_ops:all_files",
|
||||
"//third_party/eigen3:all_files",
|
||||
"//third_party/fft2d:all_files",
|
||||
"//third_party/flatbuffers:all_files",
|
||||
"//third_party/hadoop:all_files",
|
||||
"//third_party/sycl:all_files",
|
||||
"//third_party/sycl/sycl:all_files",
|
||||
|
@ -190,6 +190,7 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
|
||||
bool is_same_device =
|
||||
(srcd == dstd) || (DeviceName(srcd) == DeviceName(dstd));
|
||||
const bool dst_cpu = IsCPU(dstd);
|
||||
const bool src_cpu = IsCPU(srcd);
|
||||
if (is_same_device) {
|
||||
return new TFE_TensorHandle(h->t, dst_cpu ? nullptr : dstd);
|
||||
}
|
||||
@ -213,7 +214,7 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
|
||||
return new TFE_TensorHandle(dst, dst_cpu ? nullptr : dstd);
|
||||
}
|
||||
tensorflow::DeviceContext* src_device_context = nullptr;
|
||||
if (!IsCPU(srcd)) {
|
||||
if (!src_cpu) {
|
||||
src_device_context = srcd->tensorflow_gpu_device_info()->default_context;
|
||||
}
|
||||
tensorflow::DeviceContext* dst_device_context = nullptr;
|
||||
|
@ -595,6 +595,11 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousSingleElement) {
|
||||
// Single element, no wrap.
|
||||
std::vector<int32> operand_shape({4, 5, 2});
|
||||
RunR3Contiguous<float>(operand_shape, /*index=*/1, /*size=*/1);
|
||||
}
|
||||
|
||||
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousSingleElementBF16) {
|
||||
// Single element, no wrap.
|
||||
std::vector<int32> operand_shape({4, 5, 2});
|
||||
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/1, /*size=*/1);
|
||||
}
|
||||
|
||||
@ -602,6 +607,11 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousMultipleElements) {
|
||||
// Multiple element, no wrap.
|
||||
std::vector<int32> operand_shape({4, 5, 2});
|
||||
RunR3Contiguous<float>(operand_shape, /*index=*/1, /*size=*/2);
|
||||
}
|
||||
|
||||
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousMultipleElementsBF16) {
|
||||
// Multiple element, no wrap.
|
||||
std::vector<int32> operand_shape({4, 5, 2});
|
||||
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/1, /*size=*/2);
|
||||
}
|
||||
|
||||
@ -609,6 +619,11 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousMultipleWrapping) {
|
||||
// Multiple element, wrapping.
|
||||
std::vector<int32> operand_shape({4, 5, 2});
|
||||
RunR3Contiguous<float>(operand_shape, /*index=*/3, /*size=*/2);
|
||||
}
|
||||
|
||||
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousMultipleWrappingBF16) {
|
||||
// Multiple element, wrapping.
|
||||
std::vector<int32> operand_shape({4, 5, 2});
|
||||
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/3, /*size=*/2);
|
||||
}
|
||||
|
||||
@ -616,12 +631,21 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousTooLarge) {
|
||||
// Multiple element, update size larger than operand.
|
||||
std::vector<int32> operand_shape({4, 5, 2});
|
||||
RunR3Contiguous<float>(operand_shape, /*index=*/5, /*size=*/2);
|
||||
}
|
||||
|
||||
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousTooLargeBF16) {
|
||||
// Multiple element, update size larger than operand.
|
||||
std::vector<int32> operand_shape({4, 5, 2});
|
||||
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/5, /*size=*/2);
|
||||
}
|
||||
|
||||
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousUnaligned) {
|
||||
std::vector<int32> operand_shape({3, 123, 247});
|
||||
RunR3Contiguous<float>(operand_shape, /*index=*/1, /*size=*/1);
|
||||
}
|
||||
|
||||
XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousUnalignedBF16) {
|
||||
std::vector<int32> operand_shape({3, 123, 247});
|
||||
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/1, /*size=*/1);
|
||||
}
|
||||
|
||||
@ -629,6 +653,10 @@ XLA_TEST_F(DynamicUpdateSliceTest, R3ContiguousUnaligned) {
|
||||
XLA_TEST_F(DynamicUpdateSliceTest, DISABLED_ON_GPU(R3ContiguousLarger)) {
|
||||
std::vector<int32> operand_shape({32, 128, 1024});
|
||||
RunR3Contiguous<float>(operand_shape, /*index=*/7, /*size=*/1);
|
||||
}
|
||||
|
||||
XLA_TEST_F(DynamicUpdateSliceTest, DISABLED_ON_GPU(R3ContiguousLargerBF16)) {
|
||||
std::vector<int32> operand_shape({32, 128, 1024});
|
||||
RunR3Contiguous<bfloat16>(operand_shape, /*index=*/7, /*size=*/1);
|
||||
}
|
||||
|
||||
|
@ -30,7 +30,7 @@ bindings.
|
||||
|
||||
* CMake version 3.5 or later.
|
||||
|
||||
* [Git](http://git-scm.com)
|
||||
* [Git](https://git-scm.com)
|
||||
|
||||
* [SWIG](http://www.swig.org/download.html)
|
||||
|
||||
@ -48,7 +48,7 @@ bindings.
|
||||
|
||||
* Microsoft Windows 10
|
||||
- Microsoft Visual Studio Enterprise 2015 with Visual C++ 2015
|
||||
- [Anaconda 4.1.1 (Python 3.5 64-bit)](https://www.continuum.io/downloads)
|
||||
- [Anaconda 4.1.1 (Python 3.5 64-bit)](https://www.anaconda.com/download/)
|
||||
- [Git for Windows version 2.9.2.windows.1](https://git-scm.com/download/win)
|
||||
- [swigwin-3.0.10](http://www.swig.org/download.html)
|
||||
- [NVidia CUDA Toolkit 8.0](https://developer.nvidia.com/cuda-downloads)
|
||||
|
@ -47,4 +47,4 @@ ExternalProject_Add(snappy
|
||||
)
|
||||
|
||||
# actually enables snappy in the source code
|
||||
add_definitions(-DTF_USE_SNAPPY)
|
||||
add_definitions(-DTF_USE_SNAPPY)
|
||||
|
@ -1,3 +1,5 @@
|
||||
# python_sanity_test.py will complain about invalid or missing entries
|
||||
# problematic entries can be commented for temporary whitelisting
|
||||
tensorflow
|
||||
tensorflow/core
|
||||
tensorflow/core/example
|
||||
@ -307,6 +309,8 @@ tensorflow/contrib/metrics
|
||||
tensorflow/contrib/metrics/python
|
||||
tensorflow/contrib/metrics/python/metrics
|
||||
tensorflow/contrib/metrics/python/ops
|
||||
tensorflow/contrib/mpi_collectives/python
|
||||
tensorflow/contrib/mpi_collectives/python/ops
|
||||
tensorflow/contrib/model_pruning
|
||||
tensorflow/contrib/model_pruning/examples
|
||||
tensorflow/contrib/model_pruning/examples/cifar10
|
||||
|
124
tensorflow/contrib/cmake/python_sanity_test.py
Normal file
124
tensorflow/contrib/cmake/python_sanity_test.py
Normal file
@ -0,0 +1,124 @@
|
||||
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
# ==============================================================================
|
||||
"""
|
||||
Complain about invalid or missing entries in python_*.txt files.
|
||||
Problematic entries can be commented for temporary whitelisting.
|
||||
"""
|
||||
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import os
|
||||
import unittest
|
||||
|
||||
|
||||
def abs_path(path):
|
||||
root = os.path.dirname(__file__)
|
||||
|
||||
for _ in range(3):
|
||||
root = os.path.join(root, os.pardir)
|
||||
|
||||
path = os.path.join(root, path)
|
||||
path = os.path.abspath(path)
|
||||
return path
|
||||
|
||||
def read_entries(test):
|
||||
with open(abs_path(test.entries_file), "r") as f:
|
||||
lines = f.readlines()
|
||||
|
||||
lines = [line.strip() for line in lines]
|
||||
lines = [line for line in lines if line]
|
||||
|
||||
test.entries = []
|
||||
test.whitelist = []
|
||||
|
||||
for line in lines:
|
||||
# line is comment
|
||||
if line.startswith('#'):
|
||||
line = line[1:].strip()
|
||||
# whitelist entry
|
||||
if line.startswith('tensorflow/'):
|
||||
test.whitelist.append(line)
|
||||
# line has comment -> strip comment
|
||||
elif line.find('#') != -1:
|
||||
line = line[:line.find('#')].strip()
|
||||
test.entries.append(line)
|
||||
else:
|
||||
test.entries.append(line)
|
||||
|
||||
def test_invalid_directories(test):
|
||||
for entry in test.entries:
|
||||
if not os.path.isdir(abs_path(entry)):
|
||||
problem = "'" + test.entries_file + "' contains invalid '" + entry + "'"
|
||||
solution = "Please remove the invalid entry (or add the missing directory)."
|
||||
raise AssertionError(problem + "\n" + solution)
|
||||
|
||||
def test_missing_directory(test, path):
|
||||
if path in test.whitelist:
|
||||
return
|
||||
|
||||
dir_exists = os.path.isdir(abs_path(path))
|
||||
entry_exists = path in test.entries
|
||||
|
||||
if dir_exists and not entry_exists:
|
||||
problem = "'" + test.entries_file + "' is missing '" + path + "'"
|
||||
solution = "Please add the missing entry (comment to whitelist if needed)."
|
||||
raise AssertionError(problem + "\n" + solution)
|
||||
|
||||
|
||||
class PythonModuleTest(unittest.TestCase):
|
||||
|
||||
def setUp(self):
|
||||
self.entries_file = "tensorflow/contrib/cmake/python_modules.txt"
|
||||
read_entries(self)
|
||||
|
||||
def testInvalidEntries(self):
|
||||
test_invalid_directories(self)
|
||||
|
||||
def testMissingModules(self):
|
||||
module_names = next(os.walk(abs_path("tensorflow/contrib")))[1]
|
||||
|
||||
for module_name in module_names:
|
||||
path = "tensorflow/contrib/" + module_name
|
||||
|
||||
test_missing_directory(self, path + "/python")
|
||||
test_missing_directory(self, path + "/python/ops")
|
||||
test_missing_directory(self, path + "/python/kernels")
|
||||
test_missing_directory(self, path + "/python/layers")
|
||||
|
||||
|
||||
class PythonProtoTest(unittest.TestCase):
|
||||
|
||||
def setUp(self):
|
||||
self.entries_file = "tensorflow/contrib/cmake/python_protos.txt"
|
||||
read_entries(self)
|
||||
|
||||
def testInvalidEntries(self):
|
||||
test_invalid_directories(self)
|
||||
|
||||
|
||||
class PythonProtoCCTest(unittest.TestCase):
|
||||
|
||||
def setUp(self):
|
||||
self.entries_file = "tensorflow/contrib/cmake/python_protos_cc.txt"
|
||||
read_entries(self)
|
||||
|
||||
def testInvalidEntries(self):
|
||||
test_invalid_directories(self)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
@ -126,7 +126,9 @@ endfunction()
|
||||
file(GLOB_RECURSE tf_protos_cc_srcs RELATIVE ${tensorflow_source_dir}
|
||||
"${tensorflow_source_dir}/tensorflow/core/*.proto"
|
||||
"${tensorflow_source_dir}/tensorflow/contrib/boosted_trees/proto/*.proto"
|
||||
"${tensorflow_source_dir}/tensorflow/contrib/tpu/proto/*.proto"
|
||||
)
|
||||
|
||||
RELATIVE_PROTOBUF_GENERATE_CPP(PROTO_SRCS PROTO_HDRS
|
||||
${tensorflow_source_dir} ${tf_protos_cc_srcs}
|
||||
)
|
||||
|
@ -126,7 +126,8 @@ STRING(REGEX REPLACE ";" "\\\\;" python_protos "${python_protos}")
|
||||
STRING(REGEX REPLACE "\n" ";" python_protos "${python_protos}")
|
||||
|
||||
foreach(python_proto ${python_protos})
|
||||
if(NOT python_proto MATCHES "\#")
|
||||
if(NOT python_proto MATCHES "^\#")
|
||||
STRING(REGEX REPLACE " *\#.*" "" python_proto "${python_proto}")
|
||||
if(NOT EXISTS "${tensorflow_source_dir}/${python_proto}")
|
||||
message(SEND_ERROR "Python proto directory not found: ${python_proto}")
|
||||
endif()
|
||||
@ -147,7 +148,8 @@ STRING(REGEX REPLACE ";" "\\\\;" python_protos_cc "${python_protos_cc}")
|
||||
STRING(REGEX REPLACE "\n" ";" python_protos_cc "${python_protos_cc}")
|
||||
|
||||
foreach(python_proto_cc ${python_protos_cc})
|
||||
if(NOT python_proto_cc MATCHES "\#")
|
||||
if(NOT python_proto_cc MATCHES "^\#")
|
||||
STRING(REGEX REPLACE " *\#.*" "" python_proto_cc "${python_proto_cc}")
|
||||
if(NOT EXISTS "${tensorflow_source_dir}/${python_proto_cc}")
|
||||
message(SEND_ERROR "Python proto CC directory not found: ${python_proto_cc}")
|
||||
endif()
|
||||
@ -209,7 +211,8 @@ STRING(REGEX REPLACE ";" "\\\\;" python_modules "${python_modules}")
|
||||
STRING(REGEX REPLACE "\n" ";" python_modules "${python_modules}")
|
||||
|
||||
foreach(python_module ${python_modules})
|
||||
if(NOT python_module MATCHES "\#")
|
||||
if(NOT python_module MATCHES "^\#")
|
||||
STRING(REGEX REPLACE " *\#.*" "" python_module "${python_module}")
|
||||
if(NOT EXISTS "${tensorflow_source_dir}/${python_module}")
|
||||
message(SEND_ERROR "Python module not found: ${python_module}")
|
||||
endif()
|
||||
|
@ -649,7 +649,7 @@ class CudnnRNNParamsToCanonical<GPUDevice, T> : public CudnnRNNKernelCommon {
|
||||
}
|
||||
const int num_params_per_layer = num_params_ / num_layers / num_dirs;
|
||||
// Number of params applied on inputs. The rest are applied on recurrent
|
||||
// hiddden states.
|
||||
// hidden states.
|
||||
const int num_params_input_state = num_params_per_layer / 2;
|
||||
CHECK(num_params_ % (num_layers * num_dirs) == 0)
|
||||
<< "Number of params is not a multiple of num_layers * num_dirs.";
|
||||
|
@ -1542,7 +1542,7 @@ class _CudnnRNNNoInputC(_CudnnRNN):
|
||||
params: the parameter buffer created for this model.
|
||||
is_training: whether this operation will be used in training or inference.
|
||||
Returns:
|
||||
output: the output sequuence.
|
||||
output: the output sequence.
|
||||
output_h: the final state for h.
|
||||
"""
|
||||
return _cudnn_rnn_no_input_c(
|
||||
|
@ -292,7 +292,7 @@ def loss(weight, bias):
|
||||
error = prediction(training_inputs, weight, bias) - training_outputs
|
||||
return tf.reduce_mean(tf.square(error))
|
||||
|
||||
# Function that returns the the derivative of loss with respect to
|
||||
# Function that returns the derivative of loss with respect to
|
||||
# weight and bias
|
||||
grad = tfe.gradients_function(loss)
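# Usage note (an illustrative assumption, not part of this hunk): the function
# returned by tfe.gradients_function can be called directly with example
# parameter values and returns one gradient per positional argument of loss().
dW, dB = grad(3.0, 2.0)  # hypothetical weight and bias values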
|
||||
|
||||
|
@ -100,7 +100,7 @@ def add_metrics(estimator, metric_fn):
|
||||
|
||||
|
||||
def clip_gradients_by_norm(optimizer, clip_norm):
|
||||
"""Returns an optimizer which clips gradients before appliying them.
|
||||
"""Returns an optimizer which clips gradients before applying them.
|
||||
|
||||
Example:
|
||||
|
||||
|
@ -82,7 +82,9 @@ std::vector<string> FfmpegVideoCommandLine(const string& input_filename,
|
||||
"-probesize",
|
||||
StrCat(kDefaultProbeSize),
|
||||
"-loglevel",
|
||||
"error", // Print errors only.
|
||||
// Info is needed to get the information about stream, etc.
|
||||
// It is generated to a separate file, not stdout/stderr.
|
||||
"info",
|
||||
"-hide_banner", // Skip printing build options, version, etc.
|
||||
"-vcodec",
|
||||
"rawvideo",
|
||||
|
@ -12,6 +12,7 @@
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
# ==============================================================================
|
||||
|
||||
"""Script Language Operators. See the @{$python/script_ops} guide.
|
||||
|
||||
@@py_func
|
||||
|
@ -279,14 +279,16 @@ def acgan_model(
|
||||
generator_inputs = _convert_tensor_or_l_or_d(generator_inputs)
|
||||
generated_data = generator_fn(generator_inputs)
|
||||
with variable_scope.variable_scope(discriminator_scope) as dis_scope:
|
||||
(discriminator_gen_outputs, discriminator_gen_classification_logits
|
||||
) = _validate_acgan_discriminator_outputs(
|
||||
discriminator_fn(generated_data, generator_inputs))
|
||||
with ops.name_scope(dis_scope.name+'/generated/'):
|
||||
(discriminator_gen_outputs, discriminator_gen_classification_logits
|
||||
) = _validate_acgan_discriminator_outputs(
|
||||
discriminator_fn(generated_data, generator_inputs))
|
||||
with variable_scope.variable_scope(dis_scope, reuse=True):
|
||||
real_data = ops.convert_to_tensor(real_data)
|
||||
(discriminator_real_outputs, discriminator_real_classification_logits
|
||||
) = _validate_acgan_discriminator_outputs(
|
||||
discriminator_fn(real_data, generator_inputs))
|
||||
with ops.name_scope(dis_scope.name+'/real/'):
|
||||
real_data = ops.convert_to_tensor(real_data)
|
||||
(discriminator_real_outputs, discriminator_real_classification_logits
|
||||
) = _validate_acgan_discriminator_outputs(
|
||||
discriminator_fn(real_data, generator_inputs))
|
||||
if check_shapes:
|
||||
if not generated_data.shape.is_compatible_with(real_data.shape):
|
||||
raise ValueError(
|
||||
|
@ -479,8 +479,12 @@ def batch_norm(inputs,
|
||||
|
||||
Sergey Ioffe, Christian Szegedy
|
||||
|
||||
Can be used as a normalizer function for conv2d and fully_connected.
|
||||
|
||||
Can be used as a normalizer function for conv2d and fully_connected. The
|
||||
normalization is over all but the last dimension if `data_format` is `NHWC`
|
||||
and all but the second dimension if `data_format` is `NCHW`. In case of a 2D
|
||||
tensor this corresponds to the batch dimension, while in case of a 4D tensor this
|
||||
corresponds to the batch and space dimensions.
|
||||
|
||||
Note: when training, the moving_mean and moving_variance need to be updated.
|
||||
By default the update ops are placed in `tf.GraphKeys.UPDATE_OPS`, so they
|
||||
need to be added as a dependency to the `train_op`. For example:
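The example referenced above continues beyond this hunk; the usual pattern (a
sketch assuming an `optimizer` and a `loss` tensor are already defined) is:

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
      train_op = optimizer.minimize(loss)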
|
||||
|
@ -22,7 +22,14 @@ cd "$SCRIPT_DIR/../../.."
|
||||
DOWNLOADS_DIR=tensorflow/contrib/lite/downloads
|
||||
BZL_FILE_PATH=tensorflow/workspace.bzl
|
||||
|
||||
EIGEN_URL="$(grep -o 'http.*bitbucket.org/eigen/eigen/get/.*tar\.gz' "${BZL_FILE_PATH}" | grep -v bazel-mirror | head -n1)"
|
||||
# Ensure it is being run from repo root
|
||||
if [ ! -f $BZL_FILE_PATH ]; then
|
||||
echo "Could not find ${BZL_FILE_PATH}":
|
||||
echo "Likely you are not running this from the root directory of the repository.";
|
||||
exit 1;
|
||||
fi
|
||||
|
||||
EIGEN_URL="$(grep -o 'http.*bitbucket.org/eigen/eigen/get/.*tar\.gz' "${BZL_FILE_PATH}" | grep -v mirror.bazel | head -n1)"
|
||||
GEMMLOWP_URL="$(grep -o 'https://mirror.bazel.build/github.com/google/gemmlowp/.*zip' "${BZL_FILE_PATH}" | head -n1)"
|
||||
GOOGLETEST_URL="https://github.com/google/googletest/archive/release-1.8.0.tar.gz"
|
||||
ABSL_URL="$(grep -o 'https://github.com/abseil/abseil-cpp/.*tar.gz' "${BZL_FILE_PATH}" | head -n1)"
|
||||
|
@ -42,9 +42,10 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
|
||||
TF_LITE_ENSURE_EQ(context, positions->type, kTfLiteInt32);
|
||||
// Check that input and output types match.
|
||||
TF_LITE_ENSURE_EQ(context, input->type, output->type);
|
||||
// TODO(mgubin): only 1D positions are currently supported.
|
||||
TF_LITE_ENSURE_EQ(context, NumDimensions(positions), 1);
|
||||
// TODO(mgubin): only 0D or 1D positions are currently supported.
|
||||
TF_LITE_ENSURE(context, NumDimensions(positions) <= 1);
|
||||
// TODO(mgubin): Only default axis == 0 is supported.
|
||||
TF_LITE_ENSURE_EQ(context, params->axis, 0);
|
||||
// Check conditions for different types.
|
||||
switch (input->type) {
|
||||
case kTfLiteFloat32:
|
||||
@ -64,7 +65,7 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
|
||||
}
|
||||
const int num_dimensions =
|
||||
NumDimensions(input) + NumDimensions(positions) - 1;
|
||||
TF_LITE_ENSURE(context, params->axis < num_dimensions);
|
||||
TF_LITE_ENSURE(context, params->axis <= num_dimensions);
|
||||
TfLiteIntArray* output_shape = TfLiteIntArrayCreate(num_dimensions);
|
||||
int output_index = 0;
|
||||
for (int i = 0; i < params->axis; ++i) {
|
||||
|
@ -48,8 +48,8 @@ class GatherOpModel : public SingleOpModel {
|
||||
PopulateStringTensor(input_, data);
|
||||
}
|
||||
|
||||
void SetPositions(std::initializer_list<int32> data) {
|
||||
PopulateTensor<int32>(positions_, data);
|
||||
void SetPositions(std::initializer_list<int> data) {
|
||||
PopulateTensor<int>(positions_, data);
|
||||
}
|
||||
|
||||
std::vector<float> GetOutputFloat() { return ExtractVector<float>(output_); }
|
||||
@ -76,6 +76,29 @@ TEST(GatherOpTest, Shuffle) {
|
||||
ElementsAreArray(ArrayFloatNear({0.7, 0.8, -2, 0.2})));
|
||||
}
|
||||
|
||||
TEST(GatherOpTest, Test0DIndex) {
|
||||
GatherOpModel m({2, 2}, TensorType_FLOAT32, {});
|
||||
m.SetInputFloat({-2.0, 0.2, 0.7, 0.8});
|
||||
m.SetPositions({1});
|
||||
m.Invoke();
|
||||
EXPECT_THAT(m.GetOutputFloat(),
|
||||
ElementsAreArray(ArrayFloatNear({0.7, 0.8})));
|
||||
EXPECT_THAT(m.GetOutputShape(),
|
||||
ElementsAreArray({2}));
|
||||
}
|
||||
|
||||
TEST(GatherOpTest, Test0DIndexWith0DResult) {
|
||||
// 0D tensor is a special case in current TFLite. Test it once to make sure
|
||||
// existing workarounds are fine with it.
|
||||
GatherOpModel m({3}, TensorType_FLOAT32, {});
|
||||
m.SetInputFloat({1.0, 2.0, 3.0});
|
||||
m.SetPositions({1});
|
||||
m.Invoke();
|
||||
EXPECT_THAT(m.GetOutputFloat(),
|
||||
ElementsAreArray(ArrayFloatNear({2.0})));
|
||||
EXPECT_TRUE(m.GetOutputShape().empty());
|
||||
}
|
||||
|
||||
TEST(FloatGatherOpTest, Duplicate) {
|
||||
GatherOpModel m({1, 2, 2}, TensorType_FLOAT32, {2});
|
||||
m.SetInputFloat({-2.0, 0.2, 0.7, 0.8});
|
||||
|
@ -53,7 +53,7 @@ with the corresponding parameters as shown in the figure.
|
||||
### Automatic Speech Recognizer (ASR) Acoustic Model (AM)
|
||||
|
||||
The acoustic model for automatic speech recognition is the neural network model
|
||||
for matching phonemes to the input autio features. It generates posterior
|
||||
for matching phonemes to the input audio features. It generates posterior
|
||||
probabilities of phonemes from speech frontend features (log-mel filterbanks).
|
||||
It has an input size of 320 (float), an output size of 42 (float), five LSTM
|
||||
layers and one fully connected layers with a Softmax activation function, with
|
||||
@ -68,7 +68,7 @@ for predicting the probability of a word given previous words in a sentence.
|
||||
It generates posterior probabilities of the next word based from a sequence of
|
||||
words. The words are encoded as indices in a fixed size dictionary.
|
||||
The model has two inputs both of size one (integer): the current word index and
|
||||
next word index, an output size of one (float): the log probability. It consits
|
||||
next word index, an output size of one (float): the log probability. It consists
|
||||
of three embedding layer, three LSTM layers, followed by a multiplication, a
|
||||
fully connected layers and an addition.
|
||||
The corresponding parameters as shown in the figure.
|
||||
|
@ -370,7 +370,7 @@ enum {
|
||||
* Looks up items from a given tensor.
|
||||
*
|
||||
* Each item in the output is a raw copy of the corresponding item in
|
||||
* the input “values”. If the the given “lookup” indices are out of bounds,
|
||||
* the input “values”. If the given “lookup” indices are out of bounds,
|
||||
* the op will fail and an error will be reported.
|
||||
*
|
||||
* Inputs:
|
||||
|
@ -1170,7 +1170,7 @@ def make_pad_tests(zip_path):
|
||||
def make_reshape_tests(zip_path):
|
||||
"""Make a set of tests to do reshape."""
|
||||
|
||||
# Alll shapes below are suitable for tensors with 420 elements.
|
||||
# All shapes below are suitable for tensors with 420 elements.
|
||||
test_parameters = [{
|
||||
"dtype": [tf.float32, tf.int32],
|
||||
"input_shape": [[3, 4, 5, 7], [4, 105], [21, 5, 2, 2], [420]],
|
||||
|
@ -229,7 +229,7 @@ additional information about the multiple input arrays:
|
||||
well-formed quantized representation of these graphs. Such graphs should be
|
||||
fixed, but as a temporary work-around, setting this
|
||||
reorder_across_fake_quant flag allows the converter to perform necessary
|
||||
graph transformaitons on them, at the cost of no longer faithfully matching
|
||||
graph transformations on them, at the cost of no longer faithfully matching
|
||||
inference and training arithmetic.
|
||||
|
||||
### Logging flags
|
||||
|
@ -27,6 +27,27 @@ tf_cc_binary(
|
||||
],
|
||||
)
|
||||
|
||||
tf_cc_binary(
|
||||
name = "benchmark_model",
|
||||
srcs = ["benchmark_model.cc"],
|
||||
linkopts = select({
|
||||
"//tensorflow:android": [
|
||||
"-pie",
|
||||
"-landroid",
|
||||
"-lm",
|
||||
"-z defs",
|
||||
"-Wl,--exclude-libs,ALL", # Exclude syms in all libs from auto export
|
||||
],
|
||||
"//conditions:default": [],
|
||||
}),
|
||||
deps = [
|
||||
":mutable_op_resolver",
|
||||
"//tensorflow/contrib/lite:framework",
|
||||
"//tensorflow/contrib/lite:string_util",
|
||||
"//tensorflow/contrib/lite/kernels:builtin_ops",
|
||||
],
|
||||
)
|
||||
|
||||
cc_library(
|
||||
name = "gen_op_registration",
|
||||
srcs = ["gen_op_registration.cc"],
|
||||
|
@ -63,12 +63,17 @@ download_and_extract() {
|
||||
elif [[ "${url}" == *zip ]]; then
|
||||
tempdir=$(mktemp -d)
|
||||
tempdir2=$(mktemp -d)
|
||||
wget -P ${tempdir} ${url}
|
||||
unzip ${tempdir}/* -d ${tempdir2}
|
||||
if [[ "$OSTYPE" == "darwin"* ]]; then
|
||||
# macOS (AKA darwin) doesn't have wget.
|
||||
(cd "${tempdir}"; curl --remote-name --silent --location "${url}")
|
||||
else
|
||||
wget -P "${tempdir}" "${url}"
|
||||
fi
|
||||
unzip "${tempdir}"/* -d "${tempdir2}"
|
||||
# unzip has no strip components, so unzip to a temp dir, and move the files
|
||||
# we want from the tempdir to destination.
|
||||
cp -R ${tempdir2}/*/* ${dir}/
|
||||
rm -rf ${tempdir2} ${tempdir}
|
||||
cp -R "${tempdir2}"/*/* "${dir}"/
|
||||
rm -rf "${tempdir2}" "${tempdir}"
|
||||
fi
|
||||
|
||||
# Delete any potential BUILD files, which would interfere with Bazel builds.
|
||||
|
@ -19,6 +19,7 @@ py_library(
|
||||
"python/training/elastic_average_optimizer.py",
|
||||
"python/training/external_optimizer.py",
|
||||
"python/training/lazy_adam_optimizer.py",
|
||||
"python/training/model_average_optimizer.py",
|
||||
"python/training/moving_average_optimizer.py",
|
||||
"python/training/multitask_optimizer_wrapper.py",
|
||||
"python/training/nadam_optimizer.py",
|
||||
@ -193,6 +194,27 @@ tf_py_test(
|
||||
],
|
||||
)
|
||||
|
||||
tf_py_test(
|
||||
name = "model_average_optimizer_test",
|
||||
srcs = ["python/training/model_average_optimizer_test.py"],
|
||||
additional_deps = [
|
||||
":opt_py",
|
||||
"//tensorflow/python:client",
|
||||
"//tensorflow/python:client_testlib",
|
||||
"//tensorflow/python:array_ops",
|
||||
"//tensorflow/python:variables",
|
||||
"//tensorflow/python:framework",
|
||||
"//tensorflow/python:platform",
|
||||
"//tensorflow/python:training",
|
||||
"//tensorflow/python:ops",
|
||||
"//tensorflow/python:framework_for_generated_wrappers",
|
||||
"//third_party/py/numpy",
|
||||
],
|
||||
tags = [
|
||||
"notap", # This test launches local server.
|
||||
],
|
||||
)
|
||||
|
||||
py_test(
|
||||
name = "sign_decay_test",
|
||||
srcs = ["python/training/sign_decay_test.py"],
|
||||
|
@ -29,6 +29,7 @@ from tensorflow.contrib.opt.python.training.nadam_optimizer import *
|
||||
from tensorflow.contrib.opt.python.training.powersign import *
|
||||
from tensorflow.contrib.opt.python.training.variable_clipping_optimizer import *
|
||||
from tensorflow.contrib.opt.python.training.elastic_average_optimizer import *
|
||||
from tensorflow.contrib.opt.python.training.model_average_optimizer import *
|
||||
# pylint: enable=wildcard-import
|
||||
|
||||
from tensorflow.python.util.all_util import remove_undocumented
|
||||
@ -48,7 +49,9 @@ _allowed_symbols = [
|
||||
'MultitaskOptimizerWrapper',
|
||||
'clip_gradients_by_global_norm',
|
||||
'ElasticAverageOptimizer',
|
||||
'ElasticAverageCustomGetter'
|
||||
'ElasticAverageCustomGetter',
|
||||
'ModelAverageOptimizer',
|
||||
'ModelAverageCustomGetter'
|
||||
]
|
||||
|
||||
remove_undocumented(__name__, _allowed_symbols)
|
||||
|
@ -0,0 +1,299 @@
|
||||
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
# ==============================================================================
|
||||
|
||||
"""Wrapper optimizer for Model Average """
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
from tensorflow.python.framework import ops
|
||||
from tensorflow.python.framework import dtypes
|
||||
from tensorflow.python.framework import constant_op
|
||||
from tensorflow.python.training import optimizer
|
||||
from tensorflow.python.training import session_run_hook
|
||||
from tensorflow.python.ops import math_ops
|
||||
from tensorflow.python.ops import control_flow_ops
|
||||
from tensorflow.python.ops import variable_scope
|
||||
from tensorflow.python.ops import variables
|
||||
from tensorflow.python.ops import state_ops
|
||||
from tensorflow.python.ops import array_ops
|
||||
from tensorflow.python.ops import data_flow_ops
|
||||
|
||||
GLOBAL_VARIABLE_NAME = 'global_center_variable'
|
||||
|
||||
|
||||
class ModelAverageCustomGetter(object):
|
||||
"""Custom_getter class is used to do:
|
||||
1. Change trainable variables to local collection and place them at worker
|
||||
device
|
||||
2. Generate global variables
|
||||
Notice that the class should be used with tf.replica_device_setter,
|
||||
so that the global center variables and global step variable can be placed
|
||||
at ps device. Besides, use 'tf.get_variable' instead of 'tf.Variable' to
|
||||
use this custom getter.
|
||||
|
||||
For example,
|
||||
ma_custom_getter = ModelAverageCustomGetter(worker_device)
|
||||
with tf.device(
|
||||
tf.train.replica_device_setter(
|
||||
worker_device=worker_device,
|
||||
ps_device="/job:ps/cpu:0",
|
||||
cluster=cluster)),
|
||||
tf.variable_scope('',custom_getter=ma_custom_getter):
|
||||
hid_w = tf.get_variable(
|
||||
initializer=tf.truncated_normal(
|
||||
[IMAGE_PIXELS * IMAGE_PIXELS, FLAGS.hidden_units],
|
||||
stddev=1.0 / IMAGE_PIXELS),
|
||||
name="hid_w")
|
||||
hid_b = tf.get_variable(initializer=tf.zeros([FLAGS.hidden_units]),
|
||||
name="hid_b")
|
||||
"""
|
||||
|
||||
def __init__(self, worker_device):
|
||||
"""Create a new `ElasticAverageCustomGetter`.
|
||||
|
||||
Args:
|
||||
worker_device: String. Name of the `worker` job.
|
||||
"""
|
||||
self._worker_device = worker_device
|
||||
self._local_2_global = {}
|
||||
|
||||
def __call__(self, getter, name, trainable, collections, *args, **kwargs):
|
||||
if trainable:
|
||||
with ops.device(self._worker_device):
|
||||
local_var = getter(name, trainable=True,
|
||||
collections=[ops.GraphKeys.LOCAL_VARIABLES],
|
||||
*args, **kwargs)
|
||||
|
||||
global_variable = variable_scope.variable(
|
||||
name='%s/%s' % (GLOBAL_VARIABLE_NAME, name),
|
||||
initial_value=local_var.initialized_value(),
|
||||
trainable=False,
|
||||
collections=[ops.GraphKeys.GLOBAL_VARIABLES])
|
||||
|
||||
self._local_2_global[local_var] = global_variable
|
||||
return local_var
|
||||
else:
|
||||
return getter(name, trainable, collections, *args, **kwargs)
|
||||
|
||||
|
||||
class ModelAverageOptimizer(optimizer.Optimizer):
|
||||
"""Wrapper optimizer that implements the Model Average algorithm.
|
||||
This is a sync optimizer. During the training, each worker will update
|
||||
the local variables and maintains its own local_step, which starts from 0
|
||||
and is incremented by 1 after each update of local variables. Whenever the
|
||||
interval_steps divides the local step, the local variables from all the
|
||||
workers will be averaged and assigned to global center variables. Then the
|
||||
local variables will be assigned by global center variables.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
opt,
|
||||
num_worker,
|
||||
is_chief,
|
||||
ma_custom_getter,
|
||||
interval_steps=100,
|
||||
use_locking=True,
|
||||
name="ModelAverageOptimizer"):
|
||||
"""Construct a new model average optimizer.
|
||||
|
||||
Args:
|
||||
opt: The actual optimizer that will be used to update local variables
|
||||
num_worker: The number of workers
|
||||
is_chief: whether chief worker
|
||||
ma_custom_getter: ModelAverageCustomGetter
|
||||
interval_steps: An integer value that controls the frequency of the
average of local variables
|
||||
use_locking: If True use locks for update operations
|
||||
name: string. Optional name of the returned operation
|
||||
"""
|
||||
super(ModelAverageOptimizer, self).__init__(use_locking, name)
|
||||
self._opt = opt
|
||||
self._num_worker = num_worker
|
||||
self._is_chief = is_chief
|
||||
self._local_2_global = ma_custom_getter._local_2_global
|
||||
self._interval_steps = interval_steps
|
||||
self._accumulator_list = []
|
||||
self._chief_init_op = None
|
||||
|
||||
self._local_step = variable_scope.get_variable(
|
||||
initializer=0,
|
||||
trainable=False,
|
||||
collections=[ops.GraphKeys.LOCAL_VARIABLES],
|
||||
name="local_step")
|
||||
|
||||
self._opt._prepare()
|
||||
|
||||
def compute_gradients(self, *args, **kwargs):
|
||||
"""Compute gradients of "loss" for the variables in "var_list".
|
||||
|
||||
This simply wraps the compute_gradients() from the real optimizer.
|
||||
|
||||
Args:
|
||||
*args: Arguments for compute_gradients().
|
||||
**kwargs: Keyword arguments for compute_gradients().
|
||||
|
||||
Returns:
|
||||
A list of (gradient, variable) pairs.
|
||||
"""
|
||||
return self._opt.compute_gradients(*args, **kwargs)
|
||||
|
||||
def _local_vars_update(self, var_list):
|
||||
"""Get the update ops for the local variables in "var_list".
|
||||
|
||||
Args:
|
||||
var_list: Optional list or tuple of 'tf.Variable' to update
|
||||
|
||||
Returns:
|
||||
An update op
|
||||
"""
|
||||
if not var_list:
|
||||
raise ValueError(
|
||||
'The list of local_variables should not be empty')
|
||||
update_ops = []
|
||||
global_center_vars = [self._local_2_global[var] for var in var_list]
|
||||
for lvar, gvar in zip(var_list, global_center_vars):
|
||||
with ops.device(lvar.device):
|
||||
update_ops.append(state_ops.assign(lvar, gvar.read_value()))
|
||||
return control_flow_ops.group(*(update_ops))
|
||||
|
||||
def apply_gradients(self, grads_and_vars, global_step=None, name=None):
|
||||
"""Apply gradients to variables.
|
||||
|
||||
This contains most of the synchronization implementation and also wraps the
|
||||
apply_gradients() from the real optimizer. The chief worker updates the global
variables.
|
||||
|
||||
Args:
|
||||
grads_and_vars: List of (gradient, variable) pairs as returned by
|
||||
compute_gradients().
|
||||
global_step: Optional Variable to increment by one after the
|
||||
variables have been updated.
|
||||
name: Optional name for the returned operation. Default to the
|
||||
name passed to the Optimizer constructor.
|
||||
|
||||
Returns:
|
||||
A conditional 'Operation' that update both local and global variables or
|
||||
just local variables
|
||||
|
||||
Raises:
|
||||
ValueError: If the grads_and_vars is empty.
|
||||
ValueError: If global step is not provided, the staleness cannot be
|
||||
checked.
|
||||
"""
|
||||
|
||||
# update local variables
|
||||
if not grads_and_vars:
|
||||
raise ValueError("Must supply at least one variable")
|
||||
if global_step is None:
|
||||
raise ValueError("Global step is required")
|
||||
|
||||
apply_updates = self._opt.apply_gradients(grads_and_vars)
|
||||
with ops.control_dependencies([apply_updates]):
|
||||
local_update = state_ops.assign_add(
|
||||
self._local_step, 1, name='local_step_update').op
|
||||
|
||||
# update global variables.
|
||||
def _Update_global_variables():
|
||||
local_vars = [v for g, v in grads_and_vars if g is not None]
|
||||
global_vars = [self._local_2_global[v] for v in local_vars]
|
||||
# sync queue
|
||||
with ops.colocate_with(global_step):
|
||||
sync_queue = data_flow_ops.FIFOQueue(-1, [dtypes.bool], shapes=[[]],
|
||||
shared_name='sync_queue')
|
||||
train_ops = []
|
||||
aggregated_vars = []
|
||||
with ops.name_scope(None, self._name + '/global'):
|
||||
for var, gvar in zip(local_vars, global_vars):
|
||||
with ops.device(gvar.device):
|
||||
if isinstance(var._ref(), ops.Tensor):
|
||||
var_accum = data_flow_ops.ConditionalAccumulator(
|
||||
var.dtype,
|
||||
shape=var.get_shape(),
|
||||
shared_name=gvar.name + "/var_accum")
|
||||
train_ops.append(
|
||||
var_accum.apply_grad(var._ref(), local_step=global_step))
|
||||
aggregated_vars.append(var_accum.take_grad(self._num_worker))
|
||||
else:
|
||||
raise ValueError("Unknown local variable type!")
|
||||
self._accumulator_list.append((var_accum, gvar.device))
|
||||
# chief worker updates global vars and enqueues tokens to the sync queue
|
||||
if self._is_chief:
|
||||
update_ops = []
|
||||
with ops.control_dependencies(train_ops):
|
||||
for avg_var, gvar in zip(aggregated_vars, global_vars):
|
||||
with ops.device(gvar.device):
|
||||
update_ops.append(state_ops.assign(gvar, avg_var))
|
||||
with ops.device(global_step.device):
|
||||
update_ops.append(state_ops.assign_add(global_step, 1))
|
||||
with ops.control_dependencies(update_ops), ops.device(
|
||||
global_step.device):
|
||||
tokens = array_ops.fill([self._num_worker - 1],
|
||||
constant_op.constant(False))
|
||||
sync_op = sync_queue.enqueue_many(tokens)
|
||||
else:
|
||||
with ops.control_dependencies(train_ops), ops.device(
|
||||
global_step.device):
|
||||
sync_op = sync_queue.dequeue()
|
||||
|
||||
with ops.control_dependencies([sync_op]):
|
||||
local_update_op = self._local_vars_update(local_vars)
|
||||
return local_update_op
|
||||
|
||||
with ops.control_dependencies([local_update]):
|
||||
condition = math_ops.equal(math_ops.mod(
|
||||
self._local_step, self._interval_steps), 0)
|
||||
conditional_update = control_flow_ops.cond(
|
||||
condition, _Update_global_variables, control_flow_ops.no_op)
|
||||
|
||||
chief_init_ops = []
|
||||
for accum, dev in self._accumulator_list:
|
||||
with ops.device(dev):
|
||||
chief_init_ops.append(
|
||||
accum.set_global_step(
|
||||
global_step, name="SetGlobalStep"))
|
||||
self._chief_init_op = control_flow_ops.group(*(chief_init_ops))
|
||||
|
||||
return conditional_update
|
||||
|
||||
def get_init_op(self):
|
||||
"""Returns the op to let all the local variables equal to the global
|
||||
variables before the training begins"""
|
||||
return self._local_vars_update(variables.trainable_variables())
|
||||
|
||||
def make_session_run_hook(self):
|
||||
"""Creates a hook to handle ModelAverage ops such as initialization."""
|
||||
return _ModelAverageOptimizerHook(self, self._is_chief)
|
||||
|
||||
|
||||
class _ModelAverageOptimizerHook(session_run_hook.SessionRunHook):
|
||||
def __init__(self, ma_optimizer, is_chief):
|
||||
"""Creates hook to handle ModelAverageOptimizer initialization ops.
|
||||
|
||||
Args:
|
||||
ma_optimizer: `ModelAverageOptimizer` which this hook will initialize.
|
||||
is_chief: `Bool`, whether is this a chief replica or not.
|
||||
"""
|
||||
self._ma_optimizer = ma_optimizer
|
||||
self._is_chief = is_chief
|
||||
|
||||
def begin(self):
|
||||
self._local_init_op = variables.local_variables_initializer()
|
||||
self._global_init_op = None
|
||||
if self._is_chief:
|
||||
self._global_init_op = variables.global_variables_initializer()
|
||||
self._chief_init_op = self._ma_optimizer._chief_init_op
|
||||
self._variable_init_op = self._ma_optimizer.get_init_op()
|
@ -0,0 +1,200 @@
|
||||
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
# ==============================================================================
|
||||
"""Tests for ModelAverageOptimizer."""
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import portpicker
|
||||
from tensorflow.python.framework import constant_op
|
||||
from tensorflow.python.framework import ops
|
||||
from tensorflow.python.ops import variables
|
||||
from tensorflow.python.platform import test
|
||||
from tensorflow.python.training import gradient_descent
|
||||
from tensorflow.python.training import server_lib
|
||||
from tensorflow.python.training import training
|
||||
from tensorflow.python.training import training_util
|
||||
from tensorflow.python.ops import variable_scope
|
||||
from tensorflow.python.training import device_setter
|
||||
from tensorflow.contrib.opt.python.training.model_average_optimizer import \
|
||||
ModelAverageOptimizer, ModelAverageCustomGetter, GLOBAL_VARIABLE_NAME
|
||||
|
||||
|
||||
def create_local_cluster(num_workers, num_ps, protocol="grpc"):
|
||||
"""Create local GRPC servers and return them."""
|
||||
worker_ports = [portpicker.pick_unused_port() for _ in range(num_workers)]
|
||||
ps_ports = [portpicker.pick_unused_port() for _ in range(num_ps)]
|
||||
cluster_dict = {
|
||||
"worker": ["localhost:%s" % port for port in worker_ports],
|
||||
"ps": ["localhost:%s" % port for port in ps_ports]
|
||||
}
|
||||
cs = server_lib.ClusterSpec(cluster_dict)
|
||||
|
||||
workers = [
|
||||
server_lib.Server(
|
||||
cs, job_name="worker", protocol=protocol, task_index=ix, start=True)
|
||||
for ix in range(num_workers)
|
||||
]
|
||||
ps_servers = [
|
||||
server_lib.Server(
|
||||
cs, job_name="ps", protocol=protocol, task_index=ix, start=True)
|
||||
for ix in range(num_ps)
|
||||
]
|
||||
|
||||
return cluster_dict, workers, ps_servers
|
||||
|
||||
|
||||
# Create the workers and return their sessions, graphs, and train_ops.
# The chief worker updates last.
|
||||
def _get_workers(num_workers, steps, workers):
|
||||
sessions = []
|
||||
graphs = []
|
||||
train_ops = []
|
||||
for worker_id in range(num_workers):
|
||||
graph = ops.Graph()
|
||||
is_chief = (worker_id == 0)
|
||||
with graph.as_default():
|
||||
worker_device = "/job:worker/task:%d/cpu:0" % (worker_id)
|
||||
ma_coustom = ModelAverageCustomGetter(
|
||||
worker_device=worker_device)
|
||||
with variable_scope.variable_scope('',
|
||||
custom_getter=ma_coustom), ops.device(
|
||||
device_setter.replica_device_setter(worker_device=worker_device,
|
||||
ps_device="/job:ps/task:0/cpu:0",
|
||||
ps_tasks=1)):
|
||||
|
||||
global_step = variables.Variable(0, name='global_step',
|
||||
trainable=False)
|
||||
var_0 = variable_scope.get_variable(initializer=0.0, name="v0")
|
||||
var_1 = variable_scope.get_variable(initializer=1.0, name="v1")
|
||||
|
||||
with ops.device("/job:worker/task:" + str(worker_id)):
|
||||
if worker_id == 0:
|
||||
grads_0 = constant_op.constant(-1.0)
|
||||
grads_1 = constant_op.constant(-1.0)
|
||||
else:
|
||||
grads_0 = constant_op.constant(-2.0)
|
||||
grads_1 = constant_op.constant(-2.0)
|
||||
sgd_opt = gradient_descent.GradientDescentOptimizer(1.0)
|
||||
opt = ModelAverageOptimizer(
|
||||
opt=sgd_opt,
|
||||
num_worker=num_workers,
|
||||
ma_custom_getter=ma_coustom,
|
||||
is_chief=is_chief,
|
||||
interval_steps=steps
|
||||
)
|
||||
train_op = [
|
||||
opt.apply_gradients(
|
||||
[[grads_0, var_0],
|
||||
[grads_1, var_1]], global_step)
|
||||
]
|
||||
easgd_hook = opt.make_session_run_hook()
|
||||
# Creates MonitoredSession
|
||||
sess = training.MonitoredTrainingSession(workers[worker_id].target,
|
||||
hooks=[easgd_hook])
|
||||
|
||||
sessions.append(sess)
|
||||
graphs.append(graph)
|
||||
train_ops.append(train_op)
|
||||
return sessions, graphs, train_ops
|
||||
|
||||
|
||||
class ModelAverageOptimizerTest(test.TestCase):
|
||||
def _run(self, train_op, sess):
|
||||
sess.run(train_op)
|
||||
|
||||
def test1Workers2Period(self):
|
||||
num_workers = 2
|
||||
steps = 2
|
||||
num_ps = 1
|
||||
cluster, workers, _ = create_local_cluster(num_workers=num_workers,
|
||||
num_ps=num_ps)
|
||||
|
||||
sessions, graphs, train_ops = _get_workers(num_workers,
|
||||
steps,
|
||||
workers)
|
||||
|
||||
var_0 = graphs[0].get_tensor_by_name('v0:0')
|
||||
var_1 = graphs[0].get_tensor_by_name('v1:0')
|
||||
global_step = training_util.get_global_step(graphs[0])
|
||||
global_var_0 = graphs[0].get_tensor_by_name(GLOBAL_VARIABLE_NAME + "/v0:0")
|
||||
global_var_1 = graphs[0].get_tensor_by_name(GLOBAL_VARIABLE_NAME + "/v1:0")
|
||||
|
||||
# Verify the initialized value.
|
||||
self.assertAllEqual(0.0, sessions[0].run(var_0))
|
||||
self.assertAllEqual(1.0, sessions[0].run(var_1))
|
||||
self.assertAllEqual(0.0, sessions[0].run(global_var_0))
|
||||
self.assertAllEqual(1.0, sessions[0].run(global_var_1))
|
||||
self.assertAllEqual(0, sessions[0].run(global_step))
|
||||
|
||||
sessions[0].run(train_ops[0])
|
||||
sessions[1].run(train_ops[1])
|
||||
|
||||
self.assertAllEqual(1.0, sessions[0].run(var_0))
|
||||
self.assertAllEqual(2.0, sessions[0].run(var_1))
|
||||
self.assertAllEqual(0.0, sessions[0].run(global_var_0))
|
||||
self.assertAllEqual(1.0, sessions[0].run(global_var_1))
|
||||
self.assertAllEqual(0, sessions[0].run(global_step))
|
||||
|
||||
# iteration 2, global variable update
|
||||
thread_0 = self.checkedThread(
|
||||
target=self._run, args=(train_ops[0], sessions[0]))
|
||||
thread_1 = self.checkedThread(
|
||||
target=self._run, args=(train_ops[1], sessions[1]))
|
||||
thread_0.start()
|
||||
thread_1.start()
|
||||
thread_0.join()
|
||||
thread_1.join()
|
||||
|
||||
self.assertAllEqual(3.0, sessions[0].run(var_0))
|
||||
self.assertAllEqual(4.0, sessions[0].run(var_1))
|
||||
self.assertAllEqual(3.0, sessions[0].run(global_var_0))
|
||||
self.assertAllEqual(4.0, sessions[0].run(global_var_1))
|
||||
self.assertAllEqual(1, sessions[0].run(global_step))
|
||||
|
||||
# iteration 3
|
||||
sessions[0].run(train_ops[0])
|
||||
|
||||
self.assertAllEqual(4.0, sessions[0].run(var_0))
|
||||
self.assertAllEqual(5.0, sessions[0].run(var_1))
|
||||
self.assertAllEqual(3.0, sessions[0].run(global_var_0))
|
||||
self.assertAllEqual(4.0, sessions[0].run(global_var_1))
|
||||
self.assertAllEqual(1, sessions[0].run(global_step))
|
||||
|
||||
def testPS2TasksWithClusterSpecClass(self):
|
||||
cluster_spec = server_lib.ClusterSpec({
|
||||
"ps": ["ps0:2222", "ps1:2222"],
|
||||
"worker": ["worker0:2222", "worker1:2222", "worker2:2222"]
|
||||
})
|
||||
worker_device = "/job:worker/task:0"
|
||||
ma_coustom = ModelAverageCustomGetter(
|
||||
worker_device=worker_device)
|
||||
from tensorflow.python.training import device_setter
|
||||
with ops.device(
|
||||
device_setter.replica_device_setter(cluster=cluster_spec,
|
||||
worker_device=worker_device,
|
||||
ps_device="/job:ps")), \
|
||||
variable_scope.variable_scope('', custom_getter=ma_coustom):
|
||||
v = variable_scope.get_variable(initializer=[1, 2], name="v")
|
||||
w = variable_scope.get_variable(initializer=[2, 1], name='w')
|
||||
v_g, w_g = ma_coustom._local_2_global[v], ma_coustom._local_2_global[w]
|
||||
self.assertDeviceEqual("/job:worker/task:0", v.device)
|
||||
self.assertDeviceEqual("job:ps/task:0", v_g.device)
|
||||
self.assertDeviceEqual("/job:worker/task:0", w.device)
|
||||
self.assertDeviceEqual("job:ps/task:1", w_g.device)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
test.main()
|
@ -6,6 +6,7 @@ exports_files(["LICENSE"])
|
||||
|
||||
load(
|
||||
"//tensorflow:tensorflow.bzl",
|
||||
"py_test",
|
||||
"tf_gen_op_libs",
|
||||
"tf_custom_op_library",
|
||||
"tf_custom_op_py_library",
|
||||
@ -64,11 +65,28 @@ py_library(
|
||||
"python/__init__.py",
|
||||
],
|
||||
srcs_version = "PY2AND3",
|
||||
tags = [
|
||||
"notap",
|
||||
],
|
||||
deps = [
|
||||
":periodic_resample_op_py",
|
||||
],
|
||||
)
|
||||
|
||||
py_test(
|
||||
name = "periodic_resample_op_test",
|
||||
srcs = ["python/kernel_tests/periodic_resample_op_test.py"],
|
||||
srcs_version = "PY2AND3",
|
||||
tags = [
|
||||
"notap",
|
||||
],
|
||||
deps = [
|
||||
":init_py",
|
||||
"//tensorflow/contrib/util:util_py",
|
||||
"//tensorflow/python:framework_test_lib",
|
||||
],
|
||||
)
|
||||
|
||||
# py_library(
|
||||
# name = "periodic_resample_op_py",
|
||||
# srcs = ["python/ops/periodic_resample_op.py"],
|
||||
|
@ -100,6 +100,8 @@ template <class InputDataT,
|
||||
desired_shape.size(), "."));
|
||||
|
||||
bool found = false;
|
||||
const auto& input_tensor_shape = input_tensor.shape();
|
||||
|
||||
for (int i = 0; i < rank; ++i) {
|
||||
// if (desired_shape(i) < 1) {
|
||||
if (desired_shape[i] < 1) {
|
||||
@ -111,6 +113,15 @@ template <class InputDataT,
|
||||
adjustable_dimension = i;
|
||||
found = true;
|
||||
} else {
|
||||
OP_REQUIRES(
|
||||
context, desired_shape[i] >= input_tensor_shape.dim_size(i),
|
||||
tensorflow::errors::InvalidArgument(
|
||||
"periodic_resample expects the size of non-adjustable "
|
||||
"dimensions be at least as large as size of input tensor."
|
||||
" Dimension ", i, " input tensor has size ",
|
||||
input_tensor_shape.dim_size(i), ", desired shape has size ",
|
||||
desired_shape[i], "."));
|
||||
|
||||
// target_dimensions[i] = desired_shape(i);
|
||||
target_dimensions[i] = desired_shape[i];
|
||||
new_sliced_size *= target_dimensions[i];
|
||||
|
@ -34,26 +34,40 @@ This function implements a slightly more generic version of the subpixel
|
||||
convolutions found in this [paper](https://arxiv.org/abs/1609.05158).
|
||||
|
||||
The formula for computing the elements in the `output` tensor is as follows:
|
||||
`T` = `values` tensor of rank `R`
|
||||
`S` = desired `shape` of output tensor (vector of length `R`)
|
||||
`P` = `output` tensor of rank `R`
|
||||
\((T_1,\ldots,T_R)\) = shape(`T`)
|
||||
\([S_1,\ldots,S_q,\ldots,S_R]\) = elements of vector `S`
|
||||
|
||||
A single element in `S` is left unspecified (denoted \(S_q=-1\)).
|
||||
Let \(f_i\) denote the (possibly non-integer) factor that relates the original
|
||||
dimension to the desired dimensions, \(S_i=f_i T_i\), for \(i\neq q\) where
|
||||
\(f_i>0\).
|
||||
`T` = `values` tensor of rank `R`
|
||||
|
||||
`S` = desired `shape` of output tensor (vector of length `R`)
|
||||
|
||||
`P` = `output` tensor of rank `R`
|
||||
|
||||
\\((T_1,\\ldots,T_R)\\) = shape(`T`)
|
||||
|
||||
\\([S_1,\\ldots,S_q,\\ldots,S_R]\\) = elements of vector `S`
|
||||
|
||||
A single element in `S` is left unspecified (denoted \\(S_q=-1\\)).
|
||||
|
||||
Let \\(f_i\\) denote the (possibly non-integer) factor that relates the original
|
||||
dimension to the desired dimensions, \\(S_i=f_i T_i\\), for \\(i\\neq q\\) where
|
||||
\\(f_i>0\\).
|
||||
|
||||
Define the following:
|
||||
\(g_i=\lceil f_i\rceil\)
|
||||
\(t=\prod_i T_i\)
|
||||
\(s=\prod_{i\neq q} S_i\)
|
||||
\(S_q\) can then be defined as by \(S_q=\lfloor t/s\rfloor\).
|
||||
|
||||
\\(g_i=\\lceil f_i\\rceil\\)
|
||||
|
||||
\\(t=\\prod_i T_i\\)
|
||||
|
||||
\\(s=\\prod_{i\\neq q} S_i\\)
|
||||
|
||||
\\(S_q\\) can then be defined by \\(S_q=\\lfloor t/s\\rfloor\\).
|
||||
The elements of the resulting tensor are defined as
|
||||
\(P_{s_1,\ldots,s_R}=T_{h_1,\ldots,h_q,\ldots,h_R}\).
|
||||
The \(h_i\) (\(i\neq q\)) are defined by \(h_i=\lfloor s_i/g_i\rfloor\).
|
||||
\(h_q=S_q\sum_{j\neq q}^{q-1}G_j \mathrm{mod}(s_j,g_j) + s_q\), where
|
||||
\(G_j=\prod_{i}^{j-1}g_i\) (\(G_0=1\)).
|
||||
|
||||
\\(P_{s_1,\\ldots,s_R}=T_{h_1,\\ldots,h_q,\\ldots,h_R}\\).
|
||||
|
||||
The \\(h_i\\) (\\(i\\neq q\\)) are defined by \\(h_i=\\lfloor s_i/g_i\\rfloor\\).
|
||||
|
||||
\\(h_q=S_q\\sum_{j\\neq q}^{q-1}G_j \\mathrm{mod}(s_j,g_j) + s_q\\), where
|
||||
\\(G_j=\\prod_{i}^{j-1}g_i\\) (\\(G_0=1\\)).
|
||||
|
||||
One drawback of this method is that whenever the output dimensions are slightly
|
||||
less than integer multiples of the input dimensions, many of the tensor elements
|
||||
|
@ -19,8 +19,9 @@ from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import numpy
|
||||
import tensorflow
|
||||
|
||||
from tensorflow.contrib.periodic_resample import periodic_resample
|
||||
from tensorflow.python.framework import errors_impl
|
||||
from tensorflow.python.framework import test_util
|
||||
from tensorflow.python.ops import variables
|
||||
from tensorflow.python.platform import googletest
|
||||
@ -96,6 +97,19 @@ class PeriodicResampleTest(test_util.TensorFlowTestCase):
|
||||
result = periodic_resample(input_tensor, desired_shape).eval()
|
||||
self.assertAllEqual(result, output_tensor)
|
||||
|
||||
def testPeriodicResampleErrors(self):
|
||||
input_tensor = numpy.zeros(shape=[1, 2, 2, 4])
|
||||
with self.test_session():
|
||||
variables.global_variables_initializer().run()
|
||||
with self.assertRaisesWithPredicateMatch(
|
||||
errors_impl.InvalidArgumentError,
|
||||
'Dimension 3 input tensor has size 4, desired shape has size 1'):
|
||||
periodic_resample(input_tensor, [None, 4, 4, 1]).eval()
|
||||
with self.assertRaisesWithPredicateMatch(
|
||||
errors_impl.InvalidArgumentError,
|
||||
'4, to be the same as the length of the desired shape, 3'):
|
||||
periodic_resample(input_tensor, [None, 4, 4]).eval()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
googletest.main()
|
||||
|
@ -663,6 +663,12 @@ class DropoutWrapperTest(test.TestCase):
|
||||
self.assertEqual(res[1].h.shape, (batch_size, 3))
|
||||
return res
|
||||
|
||||
def testWrappedCellProperty(self):
|
||||
cell = rnn_cell_impl.BasicRNNCell(10)
|
||||
wrapper = rnn_cell_impl.DropoutWrapper(cell)
|
||||
# Github issue 15810
|
||||
self.assertEqual(wrapper.wrapped_cell, cell)
|
||||
|
||||
def testDropoutWrapperKeepAllConstantInput(self):
|
||||
keep = array_ops.ones([])
|
||||
res = self._testDropoutWrapper(
|
||||
|
@ -1549,5 +1549,100 @@ class BenchmarkLSTMCellXLA(test.Benchmark):
|
||||
benchmark_results["wall_time"]]]))
|
||||
|
||||
|
||||
class WeightNormLSTMCellTest(test.TestCase):
|
||||
"""Compared cell output with pre-calculated values."""
|
||||
|
||||
def _cell_output(self, cell):
|
||||
"""Calculate cell output"""
|
||||
|
||||
with self.test_session() as sess:
|
||||
init = init_ops.constant_initializer(0.5)
|
||||
with variable_scope.variable_scope("root",
|
||||
initializer=init):
|
||||
x = array_ops.zeros([1, 2])
|
||||
c0 = array_ops.zeros([1, 2])
|
||||
h0 = array_ops.zeros([1, 2])
|
||||
|
||||
state0 = rnn_cell.LSTMStateTuple(c0, h0)
|
||||
|
||||
xout, sout = cell()(x, state0)
|
||||
|
||||
sess.run([variables.global_variables_initializer()])
|
||||
res = sess.run([xout, sout], {
|
||||
x.name: np.array([[1., 1.]]),
|
||||
c0.name: 0.1 * np.asarray([[0, 1]]),
|
||||
h0.name: 0.1 * np.asarray([[2, 3]]),
|
||||
})
|
||||
|
||||
actual_state_c = res[1].c
|
||||
actual_state_h = res[1].h
|
||||
|
||||
return actual_state_c, actual_state_h
|
||||
|
||||
def testBasicCell(self):
|
||||
"""Tests cell w/o peepholes and w/o normalisation"""
|
||||
|
||||
def cell():
|
||||
return contrib_rnn_cell.WeightNormLSTMCell(2,
|
||||
norm=False,
|
||||
use_peepholes=False)
|
||||
|
||||
actual_c, actual_h = self._cell_output(cell)
|
||||
|
||||
expected_c = np.array([[0.65937078, 0.74983585]])
|
||||
expected_h = np.array([[0.44923624, 0.49362513]])
|
||||
|
||||
self.assertAllClose(expected_c, actual_c, 1e-5)
|
||||
self.assertAllClose(expected_h, actual_h, 1e-5)
|
||||
|
||||
def testNonbasicCell(self):
|
||||
"""Tests cell with peepholes and w/o normalisation"""
|
||||
|
||||
def cell():
|
||||
return contrib_rnn_cell.WeightNormLSTMCell(2,
|
||||
norm=False,
|
||||
use_peepholes=True)
|
||||
|
||||
actual_c, actual_h = self._cell_output(cell)
|
||||
|
||||
expected_c = np.array([[0.65937084, 0.7574988]])
|
||||
expected_h = np.array([[0.4792085, 0.53470564]])
|
||||
|
||||
self.assertAllClose(expected_c, actual_c, 1e-5)
|
||||
self.assertAllClose(expected_h, actual_h, 1e-5)
|
||||
|
||||
|
||||
def testBasicCellWithNorm(self):
|
||||
"""Tests cell w/o peepholes and with normalisation"""
|
||||
|
||||
def cell():
|
||||
return contrib_rnn_cell.WeightNormLSTMCell(2,
|
||||
norm=True,
|
||||
use_peepholes=False)
|
||||
|
||||
actual_c, actual_h = self._cell_output(cell)
|
||||
|
||||
expected_c = np.array([[0.50125383, 0.58805949]])
|
||||
expected_h = np.array([[0.32770363, 0.37397948]])
|
||||
|
||||
self.assertAllClose(expected_c, actual_c, 1e-5)
|
||||
self.assertAllClose(expected_h, actual_h, 1e-5)
|
||||
|
||||
def testNonBasicCellWithNorm(self):
|
||||
"""Tests cell with peepholes and with normalisation"""
|
||||
|
||||
def cell():
|
||||
return contrib_rnn_cell.WeightNormLSTMCell(2,
|
||||
norm=True,
|
||||
use_peepholes=True)
|
||||
|
||||
actual_c, actual_h = self._cell_output(cell)
|
||||
|
||||
expected_c = np.array([[0.50125383, 0.59587258]])
|
||||
expected_h = np.array([[0.35041603, 0.40873795]])
|
||||
|
||||
self.assertAllClose(expected_c, actual_c, 1e-5)
|
||||
self.assertAllClose(expected_h, actual_h, 1e-5)
|
||||
|
||||
if __name__ == "__main__":
|
||||
test.main()
|
||||
|
@ -38,6 +38,7 @@ from tensorflow.python.ops import random_ops
|
||||
from tensorflow.python.ops import rnn_cell_impl
|
||||
from tensorflow.python.ops import variable_scope as vs
|
||||
from tensorflow.python.ops import partitioned_variables
|
||||
from tensorflow.python.ops import nn_impl
|
||||
from tensorflow.python.platform import tf_logging as logging
|
||||
from tensorflow.python.util import nest
|
||||
|
||||
@ -328,7 +329,7 @@ class TimeFreqLSTMCell(rnn_cell_impl.RNNCell):
|
||||
def __init__(self, num_units, use_peepholes=False,
|
||||
cell_clip=None, initializer=None,
|
||||
num_unit_shards=1, forget_bias=1.0,
|
||||
feature_size=None, frequency_skip=None,
|
||||
feature_size=None, frequency_skip=1,
|
||||
reuse=None):
|
||||
"""Initialize the parameters for an LSTM cell.
|
||||
|
||||
@ -2723,3 +2724,257 @@ class SRUCell(rnn_cell_impl._LayerRNNCell):
|
||||
h = r * self._activation(c) + (1.0 - r) * inputs
|
||||
|
||||
return h, c
|
||||
|
||||
|
||||
class WeightNormLSTMCell(rnn_cell_impl.RNNCell):
|
||||
"""Weight normalized LSTM Cell. Adapted from `rnn_cell_impl.LSTMCell`.
|
||||
|
||||
The weight-norm implementation is based on:
|
||||
https://arxiv.org/abs/1602.07868
|
||||
Tim Salimans, Diederik P. Kingma.
|
||||
Weight Normalization: A Simple Reparameterization to Accelerate
|
||||
Training of Deep Neural Networks
|
||||
|
||||
The default LSTM implementation is based on:
|
||||
http://www.bioinf.jku.at/publications/older/2604.pdf
|
||||
S. Hochreiter and J. Schmidhuber.
|
||||
"Long Short-Term Memory". Neural Computation, 9(8):1735-1780, 1997.
|
||||
|
||||
The class uses optional peephole connections, optional cell clipping
|
||||
and an optional projection layer.
|
||||
|
||||
The optional peephole implementation is based on:
|
||||
https://research.google.com/pubs/archive/43905.pdf
|
||||
Hasim Sak, Andrew Senior, and Francoise Beaufays.
|
||||
"Long short-term memory recurrent neural network architectures for
|
||||
large scale acoustic modeling." INTERSPEECH, 2014.
|
||||
"""
|
||||
|
||||
def __init__(self, num_units, norm=True, use_peepholes=False,
|
||||
cell_clip=None, initializer=None, num_proj=None,
|
||||
proj_clip=None, forget_bias=1, activation=None,
|
||||
reuse=None):
|
||||
"""Initialize the parameters of a weight-normalized LSTM cell.
|
||||
|
||||
Args:
|
||||
num_units: int, The number of units in the LSTM cell
|
||||
norm: If `True`, apply normalization to the weight matrices. If False,
|
||||
the result is identical to that obtained from `rnn_cell_impl.LSTMCell`
|
||||
use_peepholes: bool, set `True` to enable diagonal/peephole connections.
|
||||
cell_clip: (optional) A float value, if provided the cell state is clipped
|
||||
by this value prior to the cell output activation.
|
||||
initializer: (optional) The initializer to use for the weight matrices.
|
||||
num_proj: (optional) int, The output dimensionality for the projection
|
||||
matrices. If None, no projection is performed.
|
||||
proj_clip: (optional) A float value. If `num_proj > 0` and `proj_clip` is
|
||||
provided, then the projected values are clipped elementwise to within
|
||||
`[-proj_clip, proj_clip]`.
|
||||
forget_bias: Biases of the forget gate are initialized by default to 1
|
||||
in order to reduce the scale of forgetting at the beginning of
|
||||
the training.
|
||||
activation: Activation function of the inner states. Default: `tanh`.
|
||||
reuse: (optional) Python boolean describing whether to reuse variables
|
||||
in an existing scope. If not `True`, and the existing scope already has
|
||||
the given variables, an error is raised.
|
||||
"""
|
||||
super(WeightNormLSTMCell, self).__init__(_reuse=reuse)
|
||||
|
||||
self._scope = 'wn_lstm_cell'
|
||||
self._num_units = num_units
|
||||
self._norm = norm
|
||||
self._initializer = initializer
|
||||
self._use_peepholes = use_peepholes
|
||||
self._cell_clip = cell_clip
|
||||
self._num_proj = num_proj
|
||||
self._proj_clip = proj_clip
|
||||
self._activation = activation or math_ops.tanh
|
||||
self._forget_bias = forget_bias
|
||||
|
||||
self._weights_variable_name = "kernel"
|
||||
self._bias_variable_name = "bias"
|
||||
|
||||
if num_proj:
|
||||
self._state_size = rnn_cell_impl.LSTMStateTuple(num_units, num_proj)
|
||||
self._output_size = num_proj
|
||||
else:
|
||||
self._state_size = rnn_cell_impl.LSTMStateTuple(num_units, num_units)
|
||||
self._output_size = num_units
|
||||
|
||||
@property
|
||||
def state_size(self):
|
||||
return self._state_size
|
||||
|
||||
@property
|
||||
def output_size(self):
|
||||
return self._output_size
|
||||
|
||||
def _normalize(self, weight, name):
|
||||
"""Apply weight normalization.
|
||||
|
||||
Args:
|
||||
weight: a 2D tensor with known number of columns.
|
||||
name: string, variable name for the normalizer.
|
||||
Returns:
|
||||
A tensor with the same shape as `weight`.
|
||||
"""
|
||||
|
||||
output_size = weight.get_shape().as_list()[1]
|
||||
g = vs.get_variable(name, [output_size], dtype=weight.dtype)
|
||||
return nn_impl.l2_normalize(weight, dim=0) * g
|
||||
|
||||
def _linear(self, args,
|
||||
output_size,
|
||||
norm,
|
||||
bias,
|
||||
bias_initializer=None,
|
||||
kernel_initializer=None):
|
||||
"""Linear map: sum_i(args[i] * W[i]), where W[i] is a variable.
|
||||
|
||||
Args:
|
||||
args: a 2D Tensor or a list of 2D, batch x n, Tensors.
|
||||
output_size: int, second dimension of W[i].
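norm: bool, whether to apply weight normalization to W[i].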
|
||||
bias: boolean, whether to add a bias term or not.
|
||||
bias_initializer: starting value to initialize the bias
|
||||
(default is all zeros).
|
||||
kernel_initializer: starting value to initialize the weight.
|
||||
|
||||
Returns:
|
||||
A 2D Tensor with shape [batch x output_size] equal to
|
||||
sum_i(args[i] * W[i]), where W[i]s are newly created matrices.
|
||||
|
||||
Raises:
|
||||
ValueError: if one of the arguments has an unspecified or wrong shape.
|
||||
"""
|
||||
if args is None or (nest.is_sequence(args) and not args):
|
||||
raise ValueError("`args` must be specified")
|
||||
if not nest.is_sequence(args):
|
||||
args = [args]
|
||||
|
||||
# Calculate the total size of arguments on dimension 1.
|
||||
total_arg_size = 0
|
||||
shapes = [a.get_shape() for a in args]
|
||||
for shape in shapes:
|
||||
if shape.ndims != 2:
|
||||
raise ValueError("linear is expecting 2D arguments: %s" % shapes)
|
||||
if shape[1].value is None:
|
||||
raise ValueError("linear expects shape[1] to be provided for shape %s, "
|
||||
"but saw %s" % (shape, shape[1]))
|
||||
else:
|
||||
total_arg_size += shape[1].value
|
||||
|
||||
dtype = [a.dtype for a in args][0]
|
||||
|
||||
# Now the computation.
|
||||
scope = vs.get_variable_scope()
|
||||
with vs.variable_scope(scope) as outer_scope:
|
||||
weights = vs.get_variable(
|
||||
self._weights_variable_name, [total_arg_size, output_size],
|
||||
dtype=dtype,
|
||||
initializer=kernel_initializer)
|
||||
if norm:
|
||||
wn = []
|
||||
st = 0
|
||||
with ops.control_dependencies(None):
|
||||
for i in range(len(args)):
|
||||
en = st + shapes[i][1].value
|
||||
wn.append(self._normalize(weights[st:en, :],
|
||||
name='norm_{}'.format(i)))
|
||||
st = en
|
||||
|
||||
weights = array_ops.concat(wn, axis=0)
|
||||
|
||||
if len(args) == 1:
|
||||
res = math_ops.matmul(args[0], weights)
|
||||
else:
|
||||
res = math_ops.matmul(array_ops.concat(args, 1), weights)
|
||||
if not bias:
|
||||
return res
|
||||
|
||||
with vs.variable_scope(outer_scope) as inner_scope:
|
||||
inner_scope.set_partitioner(None)
|
||||
if bias_initializer is None:
|
||||
bias_initializer = init_ops.constant_initializer(0.0, dtype=dtype)
|
||||
|
||||
biases = vs.get_variable(
|
||||
self._bias_variable_name, [output_size],
|
||||
dtype=dtype,
|
||||
initializer=bias_initializer)
|
||||
|
||||
return nn_ops.bias_add(res, biases)
|
||||
|
||||
def call(self, inputs, state):
|
||||
"""Run one step of LSTM.
|
||||
|
||||
Args:
|
||||
inputs: input Tensor, 2D, batch x num_units.
|
||||
state: A tuple of state Tensors, both `2-D`, with column sizes
|
||||
`c_state` and `m_state`.
|
||||
|
||||
Returns:
|
||||
A tuple containing:
|
||||
|
||||
- A `2-D, [batch x output_dim]`, Tensor representing the output of the
|
||||
LSTM after reading `inputs` when previous state was `state`.
|
||||
Here output_dim is:
|
||||
num_proj if num_proj was set,
|
||||
num_units otherwise.
|
||||
- Tensor(s) representing the new state of LSTM after reading `inputs` when
|
||||
the previous state was `state`. Same type and shape(s) as `state`.
|
||||
|
||||
Raises:
|
||||
ValueError: If input size cannot be inferred from inputs via
|
||||
static shape inference.
|
||||
"""
|
||||
dtype = inputs.dtype
|
||||
num_units = self._num_units
|
||||
sigmoid = math_ops.sigmoid
|
||||
c, h = state
|
||||
|
||||
input_size = inputs.get_shape().with_rank(2)[1]
|
||||
if input_size.value is None:
|
||||
raise ValueError("Could not infer input size from inputs.get_shape()[-1]")
|
||||
|
||||
with vs.variable_scope(self._scope, initializer=self._initializer):
|
||||
|
||||
concat = self._linear([inputs, h], 4 * num_units,
|
||||
norm=self._norm, bias=True)
|
||||
|
||||
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
|
||||
i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)
|
||||
|
||||
if self._use_peepholes:
|
||||
w_f_diag = vs.get_variable("w_f_diag", shape=[num_units], dtype=dtype)
|
||||
w_i_diag = vs.get_variable("w_i_diag", shape=[num_units], dtype=dtype)
|
||||
w_o_diag = vs.get_variable("w_o_diag", shape=[num_units], dtype=dtype)
|
||||
|
||||
new_c = (c * sigmoid(f + self._forget_bias + w_f_diag * c)
|
||||
+ sigmoid(i + w_i_diag * c) * self._activation(j))
|
||||
else:
|
||||
new_c = (c * sigmoid(f + self._forget_bias)
|
||||
+ sigmoid(i) * self._activation(j))
|
||||
|
||||
if self._cell_clip is not None:
|
||||
# pylint: disable=invalid-unary-operand-type
|
||||
new_c = clip_ops.clip_by_value(new_c, -self._cell_clip, self._cell_clip)
|
||||
# pylint: enable=invalid-unary-operand-type
|
||||
if self._use_peepholes:
|
||||
new_h = sigmoid(o + w_o_diag * new_c) * self._activation(new_c)
|
||||
else:
|
||||
new_h = sigmoid(o) * self._activation(new_c)
|
||||
|
||||
if self._num_proj is not None:
|
||||
with vs.variable_scope("projection"):
|
||||
new_h = self._linear(new_h,
|
||||
self._num_proj,
|
||||
norm=self._norm,
|
||||
bias=False)
|
||||
|
||||
if self._proj_clip is not None:
|
||||
# pylint: disable=invalid-unary-operand-type
|
||||
new_h = clip_ops.clip_by_value(new_h,
|
||||
-self._proj_clip,
|
||||
self._proj_clip)
|
||||
# pylint: enable=invalid-unary-operand-type
|
||||
|
||||
new_state = rnn_cell_impl.LSTMStateTuple(new_c, new_h)
|
||||
return new_h, new_state
|
||||
|
@ -225,6 +225,94 @@ class TestBeamStep(test.TestCase):
|
||||
self.assertAllEqual(next_state_.log_probs, expected_log_probs)
|
||||
|
||||
|
||||
class TestLargeBeamStep(test.TestCase):
|
||||
"""
|
||||
Tests a single step of beam search in the case where the beam size
is larger than the vocabulary size.
|
||||
"""
|
||||
|
||||
def setUp(self):
|
||||
super(TestLargeBeamStep, self).setUp()
|
||||
self.batch_size = 2
|
||||
self.beam_width = 8
|
||||
self.vocab_size = 5
|
||||
self.end_token = 0
|
||||
self.length_penalty_weight = 0.6
|
||||
|
||||
|
||||
def test_step(self):
|
||||
def get_probs():
|
||||
"""this simulates the initialize method in BeamSearchDecoder"""
|
||||
log_prob_mask = array_ops.one_hot(array_ops.zeros([self.batch_size],
|
||||
dtype=dtypes.int32),
|
||||
depth=self.beam_width, on_value=True,
|
||||
off_value=False, dtype=dtypes.bool)
|
||||
|
||||
log_prob_zeros = array_ops.zeros([self.batch_size, self.beam_width],
|
||||
dtype=dtypes.float32)
|
||||
log_prob_neg_inf = array_ops.ones([self.batch_size, self.beam_width],
|
||||
dtype=dtypes.float32) * -np.Inf
|
||||
|
||||
log_probs = array_ops.where(log_prob_mask, log_prob_zeros,
|
||||
log_prob_neg_inf)
|
||||
return log_probs
|
||||
|
||||
log_probs = get_probs()
|
||||
dummy_cell_state = array_ops.zeros([self.batch_size, self.beam_width])
|
||||
|
||||
_finished = array_ops.one_hot(
|
||||
array_ops.zeros([self.batch_size], dtype=dtypes.int32),
|
||||
depth=self.beam_width, on_value=False,
|
||||
off_value=True, dtype=dtypes.bool)
|
||||
_lengths = np.zeros([self.batch_size, self.beam_width], dtype=np.int64)
|
||||
_lengths[:, 0] = 2
|
||||
_lengths = constant_op.constant(_lengths, dtype=dtypes.int64)
|
||||
|
||||
beam_state = beam_search_decoder.BeamSearchDecoderState(
|
||||
cell_state=dummy_cell_state,
|
||||
log_probs=log_probs,
|
||||
lengths=_lengths,
|
||||
finished=_finished)
|
||||
|
||||
logits_ = np.full([self.batch_size, self.beam_width, self.vocab_size],
|
||||
0.0001)
|
||||
logits_[0, 0, 2] = 1.9
|
||||
logits_[0, 0, 3] = 2.1
|
||||
logits_[0, 1, 3] = 3.1
|
||||
logits_[0, 1, 4] = 0.9
|
||||
logits_[1, 0, 1] = 0.5
|
||||
logits_[1, 1, 2] = 2.7
|
||||
logits_[1, 2, 2] = 10.0
|
||||
logits_[1, 2, 3] = 0.2
|
||||
logits = constant_op.constant(logits_, dtype=dtypes.float32)
|
||||
log_probs = nn_ops.log_softmax(logits)
|
||||
|
||||
outputs, next_beam_state = beam_search_decoder._beam_search_step(
|
||||
time=2,
|
||||
logits=logits,
|
||||
next_cell_state=dummy_cell_state,
|
||||
beam_state=beam_state,
|
||||
batch_size=ops.convert_to_tensor(self.batch_size),
|
||||
beam_width=self.beam_width,
|
||||
end_token=self.end_token,
|
||||
length_penalty_weight=self.length_penalty_weight)
|
||||
|
||||
with self.test_session() as sess:
|
||||
outputs_, next_state_, state_, log_probs_ = sess.run(
|
||||
[outputs, next_beam_state, beam_state, log_probs])
|
||||
|
||||
self.assertEqual(outputs_.predicted_ids[0, 0], 3)
|
||||
self.assertEqual(outputs_.predicted_ids[0, 1], 2)
|
||||
self.assertEqual(outputs_.predicted_ids[1, 0], 1)
|
||||
neg_inf = -np.Inf
|
||||
self.assertAllEqual(next_state_.log_probs[:, -3:],
|
||||
[[neg_inf, neg_inf, neg_inf],
|
||||
[neg_inf, neg_inf, neg_inf]])
|
||||
self.assertEqual((next_state_.log_probs[:, :-3] > neg_inf).all(), True)
|
||||
self.assertEqual((next_state_.lengths[:, :-3] > 0).all(), True)
|
||||
self.assertAllEqual(next_state_.lengths[:, -3:], [[0, 0, 0],
|
||||
[0, 0, 0]])
|
||||
|
||||
class BeamSearchDecoderTest(test.TestCase):
|
||||
|
||||
def _testDynamicDecodeRNN(self, time_major, has_attention):
|
||||
|
@ -19,7 +19,6 @@ from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import collections
|
||||
|
||||
import numpy as np
|
||||
|
||||
from tensorflow.contrib.seq2seq.python.ops import beam_search_ops
|
||||
@ -229,8 +228,11 @@ class BeamSearchDecoder(decoder.Decoder):
|
||||
self._start_tokens = array_ops.tile(
|
||||
array_ops.expand_dims(self._start_tokens, 1), [1, self._beam_width])
|
||||
self._start_inputs = self._embedding_fn(self._start_tokens)
|
||||
self._finished = array_ops.zeros(
|
||||
[self._batch_size, self._beam_width], dtype=dtypes.bool)
|
||||
|
||||
self._finished = array_ops.one_hot(
|
||||
array_ops.zeros([self._batch_size], dtype=dtypes.int32),
|
||||
depth=self._beam_width, on_value=False,
|
||||
off_value=True, dtype=dtypes.bool)
|
||||
|
||||
@property
|
||||
def batch_size(self):
|
||||
@ -298,11 +300,15 @@ class BeamSearchDecoder(decoder.Decoder):
|
||||
"""
|
||||
finished, start_inputs = self._finished, self._start_inputs
|
||||
|
||||
log_probs = array_ops.one_hot( # shape(batch_sz, beam_sz)
|
||||
array_ops.zeros([self._batch_size], dtype=dtypes.int32),
|
||||
depth=self._beam_width, on_value=0.0, off_value=-np.Inf,
|
||||
dtype=nest.flatten(self._initial_cell_state)[0].dtype)
|
||||
|
||||
|
||||
initial_state = BeamSearchDecoderState(
|
||||
cell_state=self._initial_cell_state,
|
||||
log_probs=array_ops.zeros(
|
||||
[self._batch_size, self._beam_width],
|
||||
dtype=nest.flatten(self._initial_cell_state)[0].dtype),
|
||||
log_probs=log_probs,
|
||||
finished=finished,
|
||||
lengths=array_ops.zeros(
|
||||
[self._batch_size, self._beam_width], dtype=dtypes.int64))
|
||||
@ -563,18 +569,11 @@ def _beam_search_step(time, logits, next_cell_state, beam_state, batch_size,
|
||||
time = ops.convert_to_tensor(time, name="time")
|
||||
# During the first time step we only consider the initial beam
|
||||
scores_shape = array_ops.shape(scores)
|
||||
scores_flat = control_flow_ops.cond(
|
||||
time > 0,
|
||||
lambda: array_ops.reshape(scores, [batch_size, -1]),
|
||||
lambda: scores[:, 0])
|
||||
num_available_beam = control_flow_ops.cond(
|
||||
time > 0, lambda: math_ops.reduce_prod(scores_shape[1:]),
|
||||
lambda: math_ops.reduce_prod(scores_shape[2:]))
|
||||
scores_flat = array_ops.reshape(scores, [batch_size, -1])
|
||||
|
||||
# Pick the next beams according to the specified successors function
|
||||
next_beam_size = math_ops.minimum(
|
||||
ops.convert_to_tensor(beam_width, dtype=dtypes.int32, name="beam_width"),
|
||||
num_available_beam)
|
||||
next_beam_size = ops.convert_to_tensor(beam_width, dtype=dtypes.int32,
|
||||
name="beam_width")
|
||||
next_beam_scores, word_indices = nn_ops.top_k(scores_flat, k=next_beam_size)
|
||||
|
||||
next_beam_scores.set_shape([static_batch_size, beam_width])
|
||||
|
@ -99,7 +99,7 @@ cc_library(
|
||||
alwayslink = 1,
|
||||
)
|
||||
|
||||
tf_cuda_library(
|
||||
cc_library(
|
||||
name = "rdma_rendezvous_mgr",
|
||||
srcs = ["rdma_rendezvous_mgr.cc"],
|
||||
hdrs = ["rdma_rendezvous_mgr.h"],
|
||||
@ -114,7 +114,7 @@ tf_cuda_library(
|
||||
],
|
||||
)
|
||||
|
||||
cc_library(
|
||||
tf_cuda_library(
|
||||
name = "rdma_mgr",
|
||||
srcs = ["rdma_mgr.cc"],
|
||||
hdrs = ["rdma_mgr.h"],
|
||||
@ -141,6 +141,8 @@ tf_cuda_library(
|
||||
"//conditions:default": [],
|
||||
}),
|
||||
deps = [
|
||||
":grpc_verbs_client",
|
||||
":verbs_service_proto_cc",
|
||||
":verbs_util",
|
||||
"//tensorflow/core:core_cpu_internal",
|
||||
"//tensorflow/core:framework",
|
||||
|
@ -24,66 +24,144 @@ The design is based on TensorFlow r1.0. An RDMA path is added between servers fo
|
||||
|
||||
During server setup, an RDMA manager is created to manage low-level RDMA components such as the RDMA channel and RDMA adapter, and an RDMA rendezvous manager is created to oversee send/recv operations between servers. Following the distributed TensorFlow design philosophy, the send operation is passive, i.e. it merely places a tensor in the local out-going table. It is the receive operation that actually initiates the tensor transfer.
|
||||
|
||||
TensorFlow dynamically allocates memory for tensors that are to be sent or received. This causes difficulty for RDMA operations where pinned memory is required. Two remedies are possible, either the memory is pinned, transfer, then unpinned for each and every tensor to be transferred, or a buffer is pre-allocated and pinned for each tensor. The former incurs significant operation overhead since pinning and unpinning memory for each dynamically generated tensor is slow. The latter incurs large memory overhead and extra copying from the tensor to its pinned buffer, but may still be faster than the former. The second approach is adopted in this design. Each RDMA channel, representing a RDMA connection to a peer, contains a table of pinned buffers for all the seen tensors that requires transfer. It is assumed that the tensor size rarely changes across different steps. So only one buffer is created for the same tensor across all the steps. In the rare case when the tensor size does increases, the old buffer is discarded and new buffer of larger size is created and pinned.
|
||||
TensorFlow dynamically allocates memory for tensors that are to be sent or received. This causes difficulty for RDMA operations where pinned memory is required. A few remedies are possible:
|
||||
1. The memory is pinned, transferred, then unpinned for each and every tensor to be transferred. This incurs significant operation overhead since pinning and unpinning memory for each dynamically generated tensor is slow.
|
||||
2. A buffer is pre-allocated and pinned for each tensor. This incurs a large memory overhead and extra copying from the tensor to its pinned buffer, but may still be faster than the former approach.
|
||||
3. Following HKUST research on the use of GPU direct, and their [GDR implementation](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/gdr/README.md), there is a smart way to benefit from the TensorFlow allocation scheme, which is mostly pool based, i.e. allocators pre-allocate a large memory block and allocate tensors from there. By attaching a custom Visitor to the relevant allocators, we can do a single registration of the entire memory block, which eliminates the registration overhead. Once the block is registered, each new tensor will be allocated at a registered address, which allows us to do direct RDMA writes to it (see the sketch below).
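A minimal sketch of remedy 3, assuming the pool allocator exposes a hook that reports each large block it grabs (the hook name `OnBlockAllocated` and the surrounding wiring are hypothetical; only `ibv_reg_mr` is the real verbs call):

```c++
#include <cstddef>
#include <infiniband/verbs.h>

// Hypothetical allocator hook: called once per large block the pool
// allocator obtains from the system, NOT once per tensor.
void OnBlockAllocated(ibv_pd* pd, void* addr, size_t length) {
  // A single registration covers every tensor later carved out of this
  // block, so the per-tensor pinning overhead disappears.
  ibv_mr* mr = ibv_reg_mr(pd, addr, length,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
  // The returned region (lkey/rkey) would be stored in an address-range
  // lookup table for later direct RDMA writes.
  (void)mr;
}
```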
|
||||
|
||||
When a tensor is prepared for transfer, it is first converted to TensorProto, then the proto is serialized to byte array and copied to the pinned buffer. The content of the buffer is transferred to the remote node via RDMA write. On the remote side, the process is reversed. This is illustrated in the diagram below. The conversion of TensorProto is introduced to simplify transfer of string-tensors. Also since the TensorProto lives in host memory, even if the origin tensor lives in the device, the pinned buffers are all allocated in the host memory.
|
||||

|
||||
For best performance, we will adopt HKUST's zero-copies approach in our solution. This means:
|
||||
|
||||
1. Tensor writes will be done directly from the source tensor to the **result** tensor, with no memory copies in between. This should be done for all DMAable tensors which are located either on the CPU or on an RDMA-compatible GPU device (GPU direct).
2. Non-DMAable tensors (CanMemCopy == false) will be serialized to a TensorProto on the sender side, RDMA written to a registered buffer on the receiver side, and then deserialized by the receiver.
3. Tensors which are located on a non-RDMA-compatible GPU will be RDMA written to a registered CPU **proxy** buffer on the receiver side, and then copied to the GPU by the receiver.
|
||||
|
||||
The following improvements can be made in the future. First, conversion to TensorProto and serialization can be avoided for numeric (float/int) tensors since their internal buffer can be access directly as byte array. Second, the pinned buffer may be allocated on device if the tensor is located in the device. This avoids extra device-to-host copy at the expense of extra device memory consumption.
|
||||
## Design details
|
||||
|
||||
### Terminology
|
||||
|
||||
* **Sender** - The node which sends the tensor.
|
||||
* **Receiver** - The node which receives the tensor.
|
||||
* **Result tensor** - The destination tensor, allocated on its appropriate device.
|
||||
* **Proxy tensor** - A CPU allocated tensor, which will be used in the case where the result tensor cannot be RDMA written to directly (GPU direct is disabled or not available). The RDMA write will therefore be done to the proxy tensor, and afterwards we will do a manual local copy from it to the result tensor.
|
||||
|
||||
### Messages
|
||||
|
||||
* RDMA_MESSAGE_TENSOR_REQUEST
|
||||
* RDMA_MESSAGE_META_DATA_RESPONSE
|
||||
* RDMA_MESSAGE_TENSOR_RE_REQUEST
|
||||
|
||||
### Transport protocol
|
||||
|
||||
The tensor transfer process is initiated when the receiver requests a tensor. In code it is done by calling **Rendezvous::Recv()** or **Rendezvous::RecvAsync()**. The TensorFlow base implementation handles the case where the requested tensor is located on the same node. The more interesting case, where the requested tensor is located on a remote node (receiver != sender), is handled in a derivation of the pure virtual **BaseRemoteRendezvous::RecvFromRemoteAsync()**. TensorFlow provides a default GRPC-based implementation which comes in the vanilla version but suffers in scalability when running large models. Our RDMA-based implementation is expected to be more scalable. HKUST's contrib GDR implementation is more scalable than GRPC, and less scalable than ours only because our solution evolved from theirs.
|
||||
|
||||
Our entry point is the implementation of **RdmaRemoteRendezvous::RecvFromRemoteAsync()**, located in rdma_rendezvous_mgr.cc. The implementation creates a new **RdmaTensorRequest** object, keyed by request index (uint32_t), stores it in a list of pending requests, and calls its **Start()** method. The **Start()** method basically does 2 things:
|
||||
|
||||
1. Allocate the result tensor (and the proxy tensor if required).
|
||||
2. Send a **RDMA_MESSAGE_TENSOR_REQUEST** to the sender, containing the address of the destination tensor (result/proxy) for RDMA write.
|
||||
|
||||
In order to allocate the result and proxy tensors, we need to know the tensor's meta-data, i.e. shape and data-type for DMAable tensors, and proto-size for serialized tensors. Unfortunately, this information is only available on the sender side, which complicates matters. In order to avoid sending extra messages for querying the meta-data at each step, we store a local meta-data cache per tensor, which will only be updated upon changes. Based on the assumption that the meta-data of a tensor rarely changes between steps, we expect that most of the time the cache will only be updated once. The sender is responsible for detecting changes in the meta-data and updating the receiver. In order for the sender to know that the meta-data has changed, each **RDMA_MESSAGE_TENSOR_REQUEST** will contain the meta-data that the receiver had grabbed from its local cache. The sender will then compare the meta-data from the message to the tensor's current meta-data (see the sketch below).
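To make the cache comparison concrete, here is a hedged sketch of the sender-side check; the struct and function names below are invented for illustration and do not correspond to the actual classes:

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-tensor meta-data record kept in the caches on both sides.
struct TensorMetaDataSketch {
  int32_t dtype;
  std::vector<int64_t> shape;
  bool is_dead;
  size_t proto_size;  // only meaningful for serialized (non-DMAable) tensors

  bool operator==(const TensorMetaDataSketch& o) const {
    return dtype == o.dtype && shape == o.shape && is_dead == o.is_dead &&
           proto_size == o.proto_size;
  }
};

// Sender side: the request echoes the receiver's cached meta-data; if it no
// longer matches the tensor, an RDMA_MESSAGE_META_DATA_RESPONSE is sent first,
// otherwise the content can be RDMA written immediately.
bool NeedsMetaDataUpdate(const TensorMetaDataSketch& echoed_in_request,
                         const TensorMetaDataSketch& current) {
  return !(echoed_in_request == current);
}
```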
|
||||
|
||||
When the sender receives an **RDMA_MESSAGE_TENSOR_REQUEST**, it will create a new **RdmaTensorResponse** object for the given request message, store it in a list of pending responses, and will invoke its **Start()** method. The **Start()** method does the following:
|
||||
|
||||
1. Grab the source tensor from the local table (In code, **RecvLocalAsync()**).
|
||||
2. If the source tensor is not DMAable, serialize it to a TensorProto.
|
||||
3. If the source tensor is located on a device which cannot be DMA written from, copy it to CPU.
|
||||
4. If it is the first time this tensor is requested, or if the tensor's meta-data changed:
|
||||
1. Clone the tensor's data to be sent later.
|
||||
2. Send a **RDMA_MESSAGE_META_DATA_RESPONSE** containing the new meta-data.
|
||||
5. Otherwise:
|
||||
1. RDMA write the tensor (or TensorProto) to the destination address and rkey specified in the request message. The immediate value for the write will be the request index.
|
||||
|
||||
|
||||
When the receiver receives the **RDMA_MESSAGE_META_DATA_RESPONSE**, it will locate the relevant **RdmaTensorRequest** using the request index specified in the message, and invoke its **RecvTensorMetaData()** which does the following:
|
||||
|
||||
1. Update the local meta-data cache.
|
||||
2. Reallocate the result/proxy tensors.
|
||||
3. Re-send the tensor request. For traceability, the new message has a different name: **RDMA_MESSAGE_TENSOR_RE_REQUEST**.
|
||||
|
||||
When the sender receives a **RDMA_MESSAGE_TENSOR_RE_REQUEST**, it will locate the relevant **RdmaTensorResponse** using the request index specified in the message, and invoke its **Resume()** method, which will RDMA write the contents of the tensor that was cloned earlier, to the new remote address specified in the re-request.
|
||||
|
||||
When the receiver receives the RDMA write, it will locate the relevant **RdmaTensorRequest** using the request index which is the immediate value. It will then invoke its **RecvTensorContent()** which does the following:
|
||||
|
||||
1. Proxy copy/deserialize if required.
|
||||
2. Invoke the done callback.
|
||||
3. Deallocate the result/proxy tensors and remove the request from the pending list.
|
||||
|
||||

|
||||
|
||||
### Additional design notes
|
||||
|
||||
1. When the sender receives a tensor request, the source tensor may or may not be ready yet. The situation is handled through a process of tag matching:
|
||||
* If the request arrives before the tensor is ready, then a callback is put in a local table, and will be invoked once the tensor arrives.
|
||||
* If the tensor is ready before the request arrives, then the tensor is put in a local table. When the request arrives, it will invoke the callback immediately.
|
||||
In code it is done by calling **RecvLocalAsync()**, which receives the tensor's key, step-id, and the callback.
|
||||
2. When the callback is invoked, the relevant tensor is removed from the tag matching table. In the case where we need to send the tensor's meta-data, the **RdmaTensorResponse** will store a copy of the tensor until the re-request arrives.
|
||||
3. The sending of protocol messages (**RDMA_MESSAGE_TENSOR_REQUEST**, **RDMA_MESSAGE_META_DATA_RESPONSE** and **RDMA_MESSAGE_TENSOR_RE_REQUEST**) is done by the class **RdmaMessageBuffer**. All messages are sent using RDMA writes from/to fixed message buffers. This implies that we cannot send more than one message at a time on a specific channel. In order to synchronize the messages, the **RdmaMessageBuffer** holds local and remote buffer statuses which can be either busy or idle. When a write is issued, both statuses are changed to busy. When the write-complete event is received, the local status is changed to idle. When the write is received on the remote side, the remote side will parse the message and return an ACK to the sending side, upon which the sending side updates the remote status to idle. When both the local and remote statuses are idle, the next message can be sent.
|
||||
4. ACK writes are empty writes (hence they require no buffer) with immediate value 0xFFFFFFFE. Message writes have the immediate value 0xFFFFFFFF. All other writes are tensor-content writes whose immediate value is the request-index (see the sketch below).
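As an illustration of the immediate values above, a completion handler on the receiving side could dispatch roughly as follows (the handler names are hypothetical stubs; only the constants come from the text):

```c++
#include <cstdint>
#include <cstdio>

// Hypothetical helpers, stubbed out for illustration.
void HandleAck() { std::puts("ack"); }
void HandleControlMessage() { std::puts("control message"); }
void HandleTensorContent(uint32_t request_index) {
  std::printf("tensor content for request %u\n",
              static_cast<unsigned>(request_index));
}

// Dispatch on the 32-bit immediate value carried by an incoming RDMA write.
void OnRdmaWriteReceived(uint32_t imm_data) {
  if (imm_data == 0xFFFFFFFE) {
    // Empty ACK write: the remote side consumed our last message, so the
    // remote message-buffer status can go back to idle.
    HandleAck();
  } else if (imm_data == 0xFFFFFFFF) {
    // Control-message write: parse the message and reply with an ACK.
    HandleControlMessage();
  } else {
    // Tensor-content write: the immediate value is the request index that
    // identifies the pending tensor request to complete.
    HandleTensorContent(imm_data);
  }
}
```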
|
||||
|
||||
### RDMA components
|
||||
|
||||
* **RDMA adapter:** The base for RDMA communications. It may contain multiple channels and buffers. It is responsible for handling various incoming RDMA messages.
|
||||
* **RDMA channel:** Responsible for RDMA connection to a particular node. It manages multiple buffers. A channel has a callback table which stores all the callbacks for the requested tensors.
|
||||
* **RDMA buffer:** Responsible for sending or receiving data. It has a fixed size memory to store the data. It has a queue to store the pending jobs. There are three types of buffers, message buffer, ACK buffer and tensor buffer. A channel has two message buffers, two ack buffers and many tensor buffers.
|
||||
* **RDMA manager:** Manages the adapter and channels, including channel creation, channel setup via GRPC service, channel lookup, etc.
|
||||
* **RDMA rendezvous manager:** manages multiple rdma rendezvous.
|
||||
* **RDMA rendezvous:** a derived class of BaseRemoteRendezvous. This class is the back end for "send" and "recv" ops. When the sendrecv_op wants to send or receive a tensor, it calls the rendezvous' "send" and "recv" functions respectively. Rendezvous are identified by "step_id", a random number, so that tensors for different iterations don't get mixed up.
|
||||
* **enum RdmaImmDataType** - Immediate types to distinguish between different RDMA writes on the remote side. Ack writes and control-message writes have a fixed immediate value. The rest of the writes are tensor writes and the immediate value is the relevant request index.
|
||||
* **enum RdmaWriteIDType** - Types to distinguish between different RDMA write-complete events: Ack, control message and tensor writes.
|
||||
* **class RdmaWriteID** - Context for RDMA write complete events. Holds the RdmaWriteIDType and additional data.
|
||||
* **class RdmaTensorMetaData** - Meta-data for a tensor (type, shape, is_dead, proto_size).
|
||||
* **class RdmaMemoryMgr** - Manages the meta-data cache, and the registered memory regions.
|
||||
* **class RdmaTensorRequest** - Holds and manages information for a single tensor request throughout the entire receive cycle. API:
|
||||
* **Start()** - Start the request sequence.
|
||||
* Allocate the result tensor (and proxy tensor if required).
|
||||
* Send RDMA_MESSAGE_TENSOR_REQUEST to the remote side.
|
||||
* **RecvTensorMetaData()** - Receive meta-data from the remote side.
|
||||
* Update the local meta-data cache.
|
||||
* Reallocate the result tensor (and proxy tensor if required).
|
||||
* Re-send the request to the remote side.
|
||||
* **RecvTensorContent()** - Receive tensor content from the remote side (RDMA write was completed).
|
||||
* Decode proto if required and/or move to GPU if the content was not written to it directly (GPU direct is not avaliable).
|
||||
* Invoke the done callback.
|
||||
* **class RdmaTensorResponse** - Holds and manages information for a single tensor response throughout the entire send cycle. API:
|
||||
* **Start()** - Start the response sequence.
|
||||
* Find the tensor in the local tag-match table.
|
||||
* Compare the tensor's meta-data to the meta-data in the message (taken from the requester's local cache).
|
||||
* If meta-data changed:
|
||||
* Clone the tensor to be sent later.
|
||||
* Send a meta-data update message and wait for re-request.
|
||||
* Else:
|
||||
* Send the tensor's content (using direct RDMA write).
|
||||
* **Resume()** - Resume the response sequence after a re-request. Send the tensor's content that was cloned earlier.
|
||||
* **Destroy()** - Destroy the response's resources and remove it from the pending list.
|
||||
* **class RdmaAdapter** - The base for RDMA communications. It may contain multiple channels and buffers. It is responsible for handling various incoming RDMA messages.
|
||||
* **class RdmaChannel** - Responsible for RDMA connection to a particular node. It manages message buffers. A channel has a request table which stores all the pending tensor requests.
|
||||
* **class RdmaMessageBuffer** - Responsible for sending or receiving messages. It has a fixed-size memory block to store the data and a queue to store pending jobs. A channel has two message buffers: one for tx and one for rx.
|
||||
* **class RdmaMgr** - Manages the adapter and channels, including channel creation, channel setup via GRPC service, channel lookup, etc.
|
||||
* **class RdmaRendezvousMgr** - Manages multiple rdma rendezvous.
|
||||
* **class RdmaRemoteRendezvous** - A derived class of BaseRemoteRendezvous. This class is the back end for "send" and "recv" ops. When the sendrecv_op wants to send or receive a tensor, it calls the rendezvous' "send" and "recv" functions respectively. Rendezvous are identified by "step_id", a random number, so that tensors for different iterations don't get mixed up.
|
||||
|
||||
### The SEND operation
|
||||
### Message structure:
|
||||
|
||||
In TensorFlow, when rendezvous sends a tensor, it merely puts a tensor in a local table in the corresponding rendezvous. If the tensor has been requested, a callback exists in the table. "send" will activate the callback, which tries to send the tensor across the node.
|
||||
| type | name_size | name | step_id | request_index | remote_addr/checksum | rkey | is_dead | data_type | tensor_shape | tensor_bytes | error_status |
|
||||
|------|---------- |------|---------|---------------|----------------------|------|---------|-----------|--------------|--------------|-----------------------|
|
||||
| 1B | 2B | 512B | 8B | 8B | 8B | 4B | 1B | XB | XB | 8B | Size - 4B, proto - XB |
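As a rough, illustrative picture of the table above (field names and byte widths follow the table; the concrete C++ types, the struct itself and the handling of the variable-sized tail are assumptions, not the actual wire format):

```c++
#include <cstdint>

#pragma pack(push, 1)
struct RdmaMessageLayoutSketch {
  uint8_t  type;           // 1B: message type
  uint16_t name_size;      // 2B: valid length of `name`
  char     name[512];      // 512B: tensor name
  int64_t  step_id;        // 8B: step ID
  uint64_t request_index;  // 8B: request index
  uint64_t remote_addr;    // 8B: destination address, or checksum in a
                           //     meta-data response
  uint32_t rkey;           // 4B: remote key of the destination buffer
  uint8_t  is_dead;        // 1B: is_dead flag
  // Variable-sized tail (not modeled here): data_type and tensor_shape
  // (XB each), tensor_bytes (8B) and, for RDMA_MESSAGE_ERROR_STATUS, a 4B
  // size followed by a serialized error-status proto.
};
#pragma pack(pop)
```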
|
||||
|
||||
|
||||
### The RECV operation
|
||||
|
||||
When a tensor is requested, rendezvous' recv function is called. The function first places a callback in the channel's callback table, which will be activated once the tensor is sent from the source. In the next step, a message is sent to notify the source of the requested tensor. Once the source receives the message, it will check locally for the tensor, if not found, a callback is placed in the table, otherwise, the tensor id will be placed at corresponding RDMA buffer's job queue for future transmission. When a tensor is scheduled to be transmitted, the RDMA buffer needs to have the memory allocated and initialized (registered with the remote buffer info). If the memory is not ready, the transmission is deferred, a message is sent to the destination to establish the memory first. The other case a transmission can be deferred is when the buffer is still being used by an on-going transmission.
|
||||
|
||||
### Three types of RDMA buffers
|
||||
|
||||
* **Message buffer:** responsible for sending message only.
|
||||
* **Ack buffer:** once a message is sent, the recipient needs to send an ack via the ack buffer to free up the message buffer. An ack buffer is exclusively for its coupled message buffer.
|
||||
* **Tensor buffer:** responsible for sending tensors. The recipient needs to send back a message to free up the sending buffer.
|
||||
|
||||
### RDMA packet format
|
||||
|
||||
|type|name_size|name|step_id|buffer_size|remote_addr|rkey|is_dead|data_type|tensor_shape|tensor_bytes|tensor_buffer|
|
||||
|
||||
### Six types of RDMA messages
|
||||
* RDMA_MESSAGE_ACK
|
||||
* RDMA_MESSAGE_BUFFER_IDLE
|
||||
* RDMA_MESSAGE_BUFFER_REQUEST
|
||||
* RDMA_MESSAGE_BUFFER_RESPONSE
|
||||
* RDMA_MESSAGE_TENSOR_REQUEST
|
||||
* RDMA_MESSAGE_TENSOR_WRITE
|
||||
|
||||
### Actions upon receiving RDMA messages
|
||||
* RDMA_MESSAGE_ACK
|
||||
* sender: mark local ack buffer idle.
|
||||
* receiver: mark remote message buffer idle, send next item.
|
||||
* RDMA_MESSAGE_BUFFER_IDLE
|
||||
* sender: mark local message buffer idle, send next item.
|
||||
* receiver: send ack, set remote tensor buffer idle, send next item.
|
||||
* RDMA_MESSAGE_BUFFER_REQUEST
|
||||
* sender: mark local message buffer idle, send next item.
|
||||
* receiver: send ack, find or create tensor buffer, send BUFFER_RESPONSE.
|
||||
* RDMA_MESSAGE_BUFFER_RESPONSE
|
||||
* sender: mark local message buffer idle, send next item.
|
||||
* receiver: send ack, set remote buffer info, set local and remote buffer idle, send next item.
|
||||
* RDMA_MESSAGE_TENSOR_REQUEST
|
||||
* sender: mark local message buffer idle, send next item.
|
||||
* receiver: send ack, find or create tensor buffer, enqueue tensor id, send next item.
|
||||
* RDMA_MESSAGE_TENSOR_WRITE
|
||||
* sender: mark local message buffer idle, send next item.
|
||||
* receiver: run callback.
|
||||
* **RDMA_MESSAGE_TENSOR_REQUEST** - (receiver ==> sender) The original tensor request.
|
||||
* type - The message type.
|
||||
* name (name_size) - Name of the requested tensor.
|
||||
* step_id - Step ID.
|
||||
* request_index - Request index.
|
||||
* remote_addr/rkey - Address/rkey of the result/proxy tensor. Irrelevant for first-time request.
|
||||
* is_dead/data_type/tensor_shape/tensor_bytes - The current meta-data as stored in the receiver local cache. The sender will use that information to know if the receiver's cache requires updating.
|
||||
* **RDMA_MESSAGE_META_DATA_RESPONSE** - (sender ==> receiver) The meta-data update message in case meta-data had changed (or if it is the first time the tensor is requested).
|
||||
* type - The message type.
|
||||
* request_index - Request index.
|
||||
* is_dead/data_type/tensor_shape/tensor_bytes - The up-to-date meta-data.
|
||||
* checksum - In data validation mode, this will hold the checksum of the source tensor.
|
||||
* **RDMA_MESSAGE_TENSOR_RE_REQUEST** - (receiver ==> sender) Tensor re-request after meta-data update and reallocation of result/proxy tensors.
|
||||
* type - The message type.
|
||||
* name (name_size) - Name of the requested tensor.
|
||||
* step_id - Step ID.
|
||||
* request_index - Request index.
|
||||
* remote_addr/rkey - Address/rkey of the reallocated result/proxy tensor.
|
||||
* **RDMA_MESSAGE_ERROR_STATUS** - (sender ==> receiver) Notify the receiver that an error has occurred on the sender side, so it can propagate it to the upper levels.
|
||||
* type - The message type.
|
||||
* name (name_size) - Name of the requested tensor.
|
||||
* step_id - Step ID.
|
||||
* request_index - Request index.
|
||||
* error_status - The error status (code, message, details).
|
||||
|
@ -122,17 +122,15 @@ Status GrpcVerbsService::GetRemoteAddressSync(
|
||||
rc->SetRemoteAddress(ra, false);
|
||||
rc->Connect();
|
||||
int i = 0;
|
||||
int idx[] = {1, 0, 3, 2};
|
||||
std::vector<RdmaBuffer*> mb(rc->message_buffers());
|
||||
CHECK_EQ(request->mr_size(), 4);
|
||||
int idx[] = {1, 0};
|
||||
std::vector<RdmaMessageBuffer*> mb(rc->message_buffers());
|
||||
CHECK_EQ(request->mr_size(), RdmaChannel::kNumMessageBuffers);
|
||||
for (const auto& mr : request->mr()) {
|
||||
// the connections are crossed, i.e.
|
||||
// local tx_message_buffer <---> remote rx_message_buffer_
|
||||
// local rx_message_buffer <---> remote tx_message_buffer_
|
||||
// local tx_ack_buffer <---> remote rx_ack_buffer_
|
||||
// local rx_ack_buffer <---> remote tx_ack_buffer_
|
||||
// hence idx[] = {1, 0, 3, 2}.
|
||||
RdmaBuffer* rb = mb[idx[i]];
|
||||
// hence idx[] = {1, 0}.
|
||||
RdmaMessageBuffer* rb = mb[idx[i]];
|
||||
RemoteMR rmr;
|
||||
rmr.remote_addr = mr.remote_addr();
|
||||
rmr.rkey = mr.rkey();
|
||||
|
87
tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md
Normal file
87
tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md
Normal file
@ -0,0 +1,87 @@
|
||||
## Verbs implementation to use direct tensor writes (0 copies)
|
||||
|
||||
### Motivation:
|
||||
|
||||
Following HKUST research on the use of GPU direct, and their [GDR implementation](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/gdr/README.md), we wish to adopt the zero-copies approach and apply it to the current verbs implementation, while keeping the current implementation's advantages, such as configurability and the use of RDMA for control messages.
|
||||
|
||||
### Performance:
|
||||
|
||||
Compared with the current gRPC, verbs and GDR implementations, the resulting implementation gave the best performance for every model, with any number of nodes. For VGG16 on 8 nodes with 4 P100 GPUs each, the prototype beat the second-best by over 15%.
|
||||
|
||||
### Implementation requirements:
|
||||
|
||||
1. Tensor writes need to be done directly from the source Tensor to the destination Tensor, with no memory copies in between. This should be done for all DMAble tensors which are located either on the CPU or on an RDMA-compatible GPU device (GPU direct).
|
||||
2. Non-DMAble tensors (CanMemCopy == false) will be serialized to a proto on the sender side, RDMA-written to a registered buffer on the receiver side, and then deserialized by the receiver.
|
||||
3. Tensors which are located on a non-RDMA-compatible GPU will be RDMA-written to a registered CPU proxy buffer on the receiver side, and then copied to the GPU by the receiver (see the sketch after this list).
|
||||
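A minimal sketch of the receiver-side allocation decision behind these three cases, using `DataTypeCanUseMemcpy` to detect DMAble types. `AllocateDestination` and `GpuSupportsRdma()` are hypothetical names used only for illustration; the production logic is expected to live in `RdmaTensorRequest::AllocateTensors()` (declared in `rdma.h`).

```c++
#include "tensorflow/contrib/verbs/rdma.h"
#include "tensorflow/core/common_runtime/device.h"
#include "tensorflow/core/framework/types.h"

namespace tensorflow {

bool GpuSupportsRdma();  // Assumed helper: true if GPU direct is usable.

// Hypothetical helper mirroring the three cases listed above:
//   can_memcpy == true  -> write directly into the result tensor (case 1),
//                          or into a CPU proxy tensor if the destination GPU
//                          does not support GPU direct (case 3).
//   can_memcpy == false -> write the serialized proto into a registered
//                          buffer and deserialize on the receiver (case 2).
void AllocateDestination(Device* dst_dev, const TensorMetaData& meta,
                         const AllocatorAttributes& attrs,
                         Tensor** result, Tensor** proxy) {
  const bool can_memcpy = DataTypeCanUseMemcpy(meta.data_type_);
  Allocator* dst_alloc = dst_dev->GetAllocator(attrs);
  *result = new Tensor(dst_alloc, meta.data_type_, meta.tensor_shape_);
  *proxy = nullptr;
  if (can_memcpy && dst_dev->tensorflow_gpu_device_info() != nullptr &&
      !attrs.on_host() && !GpuSupportsRdma()) {
    // Case 3: RDMA-write into a CPU proxy, then copy to the GPU.
    AllocatorAttributes host_attrs;
    host_attrs.set_gpu_compatible(true);
    host_attrs.set_on_host(true);
    *proxy = new Tensor(dst_dev->GetAllocator(host_attrs),
                        meta.data_type_, meta.tensor_shape_);
  }
  // Case 2 (can_memcpy == false) would instead register a buffer of
  // meta.proto_size_ bytes for the serialized TensorProto.
}

}  // namespace tensorflow
```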
|
||||
### Implementation constraints:
|
||||
|
||||
For best stability and proof of correctness, we will divide the implementation into two stages:
|
||||
1. In the first stage we will keep changes to the current implementation to the minimum possible. The cost is that we may have unused or unnecessary code leftovers, which may also affect performance.
|
||||
2. In the second stage, we will iterate over the code again and remove the irrelevant parts.
|
||||
The solution is designed so that both stages can be achieved with relative ease.
|
||||
|
||||
### Design guidelines:
|
||||
|
||||
1. Since we do not want to do any unnecessary memory copying, we will no longer allocate a fixed CPU buffer as the destination for the RDMA write. Instead we will write directly to the result tensor, or, if the result tensor is on a device which does not support RDMA, to a proxy CPU tensor whose content is then copied to the result tensor.
|
||||
2. The address of the destination Tensor needs to be sent to the sender side for writing, meaning that the result/proxy tensor should be pre-allocated on the receiver side, prior to sending the tensor request. In order to do that, we need to know its meta-data, i.e. shape and data-type for DMAble tensors, and proto-size for serialized tensors. Unfortunately, this information is only available on the sender side, which complicates matters. In order to avoid sending extra messages for querying the meta-data on each step, we store a local meta-data cache per tensor. Based on the assumption that the meta-data of a tensor rarely changes between steps, we expect that in most cases the cache will only be updated once. When the sender receives a request for a tensor, if it is the first time this tensor is requested, or in the rare case that the meta-data did change, the sender will first send a meta-data response, on which the receiver will update the local cache, and reallocate the result/proxy tensors if required. When the receiver sends the tensor request, it will also contain the meta-data currently stored in its local cache, so the sender can compare it to see if there was a change (see the sketch after this list).
|
||||
3. When the sender writes the tensor content to the result tensor, no additional data is written with it. That means we need to rely on the ibverbs immediate value (uint32_t) to indicate which request we are responding to (in order to trigger the receive callback). The easiest and most elegant way is to key the recv callback with a unique request_index (uint32_t), instead of the current key_with_step_id (string).
|
||||
4. Since the sender no longer writes the tensor from/to fixed buffers, we no longer need to schedule the writes using the local/remote status. In addition, we no longer rely on the RdmaTensorBuffer members as the source/destination addresses and rkey/lkey. Instead, each RdmaTensorBuffer will hold multiple "Response" objects (one per step-id), from which we derive the destination address and rkey. The source address and lkey are always those of the source Tensor.
|
||||
5. With the addition of tensor pre-allocation, we noticed there is a large code similarity between sending the first tensor request and re-sending the request in case of meta-data changes. After implementing a common method for tensor pre-allocation, it turned out that the implementation becomes much simpler by encapsulating the process of request sending/re-sending, the meta-data response callback and the content response callback, all in a single "Request" class. The request class holds all the relevant request information, which reduces excessive parameter passing and lambda capturing. This decision is purely for elegance and code simplicity, and we decided to implement it in the first stage because it makes the implementation much easier.
|
||||
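To make guideline 2 concrete, here is a minimal sketch of the sender-side decision: compare the meta-data cached on the receiver (carried inside the tensor request message) with the locally found tensor, and pick the next protocol step. The production check is expected to be `RdmaTensorResponse::TensorMetaDataChanged()` in `rdma.cc`; the `SenderAction` enum and `OnTensorRequest` helper are assumptions made for this sketch.

```c++
#include "tensorflow/contrib/verbs/rdma.h"
#include "tensorflow/core/framework/tensor.h"

namespace tensorflow {

enum class SenderAction { kSendMetaDataUpdate, kWriteTensorContent };

// Sketch of the sender-side decision from guideline 2. 'rm' is the parsed
// RDMA_MESSAGE_TENSOR_REQUEST carrying the receiver's cached meta-data;
// 'in'/'is_dead' are the locally found tensor and its dead-ness.
SenderAction OnTensorRequest(const RdmaMessage& rm, const Tensor& in,
                             bool is_dead) {
  const bool meta_data_changed = rm.data_type_ != in.dtype() ||
                                 rm.tensor_shape_ != in.shape() ||
                                 rm.is_dead_ != is_dead;
  if (meta_data_changed) {
    // First-time request or rare meta-data change: the receiver must update
    // its cache and reallocate the result/proxy tensor, so send a meta-data
    // update and wait for RDMA_MESSAGE_TENSOR_RE_REQUEST before writing.
    return SenderAction::kSendMetaDataUpdate;
  }
  // Cache is up to date: RDMA-write the content directly to
  // rm.remote_addr_ / rm.rkey_, with rm.request_index_ as the immediate.
  return SenderAction::kWriteTensorContent;
}

}  // namespace tensorflow
```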
|
||||
### New types/classes:
|
||||
|
||||
* **enum RdmaImmDataType** - Immediate types to distinguish between different RDMA writes on the remote side. Ack writes and control-message writes have a fixed immediate value. The rest of the writes are tensor writes and the immediate value is the relevant request index.
|
||||
* **enum RdmaWriteIDType** - Types to distinguish between different RDMA write-complete events: Ack, control message, tensor DMA write and tensor proto write.
|
||||
* **class RdmaWriteID** - Context for RDMA write complete events. Holds the RdmaWriteIDType and additional data.
|
||||
* **class RemoteAddressContext** - Remote address information (address + mr). Will be passed as write context for tensor proto writes.
|
||||
* **class RdmaTensorMetaData** - Meta-data for a tensor (type, shape, is_dead, proto_size).
|
||||
* **class RdmaMemoryMgr** - Manages the meta-data cache, and the registered memory regions.
|
||||
* **class RdmaTensorRequest** - Holds and manages information for a single tensor request throughout the entire receive cycle. API:
|
||||
* Start() - Start the request.
|
||||
* RecvTensorMetaData() - Receive meta-data from the remote side.
|
||||
* RecvTensorContent() - Receive tensor content from the remote side and invoke the done() callback.
|
||||
* **class RdmaTensorResponse** - Holds information for a single tensor response, such as destination address and rkey.
|
||||
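A short usage sketch of the request class from the receiver side, mirroring the `RdmaRemoteRendezvous::RecvFromRemoteAsync` change further below in this diff; device lookup and error handling are elided, and `RecvTensor` is a hypothetical wrapper.

```c++
#include "tensorflow/contrib/verbs/rdma.h"

namespace tensorflow {

// Receiver side: create a request keyed by a unique request_index and start
// the receive cycle. The rest of the cycle (RecvTensorMetaData /
// RecvTensorContent / RecvErrorStatus) is driven by incoming completions.
void RecvTensor(RdmaChannel* rc, const string& key, int64 step_id,
                Device* dst_dev, const Rendezvous::Args& recv_args,
                const Rendezvous::DoneCallback& done) {
  RdmaTensorRequest* request =
      rc->InsertTensorRequest(key, step_id, dst_dev, recv_args, done);
  request->Start();  // Allocates result/proxy tensors and sends
                     // RDMA_MESSAGE_TENSOR_REQUEST to the remote side.
}

}  // namespace tensorflow
```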
|
||||
### Protocol changes:
|
||||
|
||||
The protocol messages themselves will remain mostly unchanged in the first stage, but will be used differently, as described below. The current message structures already have most of the required fields for the new implementation. The only change is the "buffer_size" field, which is no longer used since we are no longer sending additional information with the tensor, and thus would always be equal to the "tensor_bytes" field. Instead, we use that field to pass the "request_index".
|
||||
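For example, a sketch of how the receiver could populate the tensor request message under the new scheme, using the `RdmaMessage` and `TensorMetaData` fields declared in `rdma.h`. `BuildTensorRequest` is a hypothetical helper; the actual population happens inside the request class.

```c++
#include "tensorflow/contrib/verbs/rdma.h"

namespace tensorflow {

// Sketch: populate RDMA_MESSAGE_TENSOR_REQUEST under the new scheme. The
// request_index_ field occupies what used to be buffer_size, and the cached
// meta-data fields let the sender detect whether the receiver's cache is
// stale. For a first-time request the address and meta-data fields are
// irrelevant and may be left at their defaults.
RdmaMessage BuildTensorRequest(const string& key, int64 step_id,
                               uint32_t request_index, void* rdma_addr,
                               uint32_t rkey, const TensorMetaData& cached) {
  RdmaMessage rm;
  rm.type_ = RDMA_MESSAGE_TENSOR_REQUEST;
  rm.name_size_ = key.size();
  rm.name_ = key;
  rm.step_id_ = step_id;
  rm.request_index_ = request_index;
  rm.remote_addr_ = reinterpret_cast<uint64_t>(rdma_addr);
  rm.rkey_ = rkey;
  rm.is_dead_ = cached.is_dead_;
  rm.data_type_ = cached.data_type_;
  rm.tensor_shape_ = cached.tensor_shape_;
  rm.tensor_bytes_ = cached.proto_size_;
  // The wire form is then obtained with RdmaMessage::CreateMessage(rm) and
  // enqueued on the channel's tx message buffer.
  return rm;
}

}  // namespace tensorflow
```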
|
||||
### Message structure:
|
||||
|
||||
| type | name_size | name | step_id | request_index | remote_addr | rkey | is_dead | data_type | tensor_shape | tensor_bytes |
|
||||
|------|---------- |------|---------|---------------|-------------|------|---------|-----------|--------------|--------------|
|
||||
| 1B | 2B | 512 | 8B | 8B | 8B | 4B | 1B | XB | XB | 8B |
|
||||
|
||||
* **RDMA_MESSAGE_TENSOR_REQUEST** - (receiver ==> sender) The original tensor request.
|
||||
* type - The message type.
|
||||
* name (name_size) - Name of the requested tensor.
|
||||
* step_id - Step ID.
|
||||
* request_index - Request index.
|
||||
* remote_addr/rkey - Address/rkey of the result/proxy tensor. Irrelevant for first-time request.
|
||||
* is_dead/data_type/tensor_shape/tensor_bytes - The current meta-data as stored in the receiver's local cache. The sender uses this information to determine whether the receiver's cache requires updating.
|
||||
* **RDMA_MESSAGE_BUFFER_REQUEST** - (sender ==> receiver) The meta-data update message, sent in case the meta-data has changed (or if it is the first time the tensor is requested).
|
||||
* type - The message type.
|
||||
* request_index - Request index.
|
||||
* is_dead/data_type/tensor_shape/tensor_bytes - The up-to-date meta-data.
|
||||
* **RDMA_MESSAGE_BUFFER_RESPONSE** - (receiver ==> sender) Tensor re-request after meta-data update and reallocation of result/proxy tensors.
|
||||
* type - The message type.
|
||||
* name (name_size) - Name of the requested tensor.
|
||||
* step_id - Step ID.
|
||||
* request_index - Request index.
|
||||
* remote_addr/rkey - Address/rkey of the reallocated result/proxy tensor.
|
||||
* is_dead/data_type/tensor_shape/tensor_bytes - The new meta-data. Will be removed in the next phase.
|
||||
* **RDMA_MESSAGE_TENSOR_WRITE** - (sender ==> receiver) No longer sent. There is only a direct write of the tensor content to the result/proxy tensor. The request index is passed as the immediate value of the write (see the sketch below this list).
|
||||
* **RDMA_MESSAGE_TENSOR_IDLE** - (receiver ==> sender) No longer sent.
|
||||
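To illustrate the direct write that replaces RDMA_MESSAGE_TENSOR_WRITE, here is a sketch using the static `RdmaMessageBuffer::Write()` declared in `rdma.h`. `WriteTensorContent` is a hypothetical wrapper, and the source address/lkey are assumed to come from the source tensor's registered memory region.

```c++
#include "tensorflow/contrib/verbs/rdma.h"

namespace tensorflow {

// Sketch: direct RDMA write of tensor content to the pre-allocated
// result/proxy tensor on the receiver. No protocol message is sent; the
// request index travels as the write's immediate value and triggers the
// receiver-side callback on completion.
void WriteTensorContent(const RdmaChannel* channel, uint32_t request_index,
                        uint64_t src_addr, uint32_t lkey, size_t num_bytes,
                        uint64_t dst_addr, uint32_t dst_rkey) {
  RdmaMessageBuffer::Write(channel, request_index /* imm_data */, num_bytes,
                           src_addr, lkey, dst_addr, dst_rkey,
                           RDMA_WRITE_ID_TENSOR_WRITE,
                           nullptr /* write_context */);
}

}  // namespace tensorflow
```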
|
||||

|
||||
|
||||
### Second stage optimizations:
|
||||
1. Remove unused code leftovers.
|
||||
2. Remove the ACK buffer completely, since we can rely entirely on the immediate value of the RDMA write.
|
||||
|
||||
### Future optimizations:
|
||||
1. Map the tensor names to indexes, to significantly reduce the request message size.
|
||||
2. Understand the purpose of empty tensors and whether we can skip remote fetching for them.
|
||||
3. Consider concatenating multiple requests and/or using multiple message buffers.
|
||||
4. Consider a no-request architecture.
|
File diff suppressed because it is too large
@ -27,6 +27,7 @@ limitations under the License.
|
||||
#include <unordered_map>
|
||||
#include <vector>
|
||||
|
||||
#include "tensorflow/contrib/verbs/verbs_util.h"
|
||||
#include "tensorflow/core/distributed_runtime/worker_env.h"
|
||||
#include "tensorflow/core/framework/rendezvous.h"
|
||||
#include "tensorflow/core/framework/tensor.h"
|
||||
@ -43,6 +44,11 @@ namespace tensorflow {
|
||||
#define SL_DEFAULT 0
|
||||
#define TRAFFIC_CLASS 0
|
||||
|
||||
#define RDMA_LOG_0 LOG(INFO)
|
||||
#define RDMA_LOG_1 VLOG(1)
|
||||
#define RDMA_LOG_2 VLOG(2)
|
||||
#define RDMA_LOG(LEVEL) RDMA_LOG_##LEVEL
|
||||
|
||||
struct RdmaParams {
|
||||
uint8_t port_num;
|
||||
uint8_t sgid_index;
|
||||
@ -76,29 +82,303 @@ enum Location {
|
||||
local,
|
||||
remote
|
||||
};
|
||||
enum BufferType {
|
||||
ACK,
|
||||
MESSAGE,
|
||||
TENSOR
|
||||
};
|
||||
|
||||
enum RdmaMessageType {
|
||||
RDMA_MESSAGE_ACK,
|
||||
RDMA_MESSAGE_BUFFER_IDLE,
|
||||
RDMA_MESSAGE_BUFFER_REQUEST,
|
||||
RDMA_MESSAGE_BUFFER_RESPONSE,
|
||||
RDMA_MESSAGE_META_DATA_UPDATE,
|
||||
RDMA_MESSAGE_TENSOR_RE_REQUEST,
|
||||
RDMA_MESSAGE_TENSOR_REQUEST,
|
||||
RDMA_MESSAGE_TENSOR_WRITE
|
||||
RDMA_MESSAGE_ERROR_STATUS,
|
||||
};
|
||||
class RdmaBuffer;
|
||||
|
||||
struct RdmaMessage {
|
||||
RdmaMessageType type_;
|
||||
uint16_t name_size_;
|
||||
string name_;
|
||||
int64 step_id_;
|
||||
uint64_t request_index_;
|
||||
union {
|
||||
uint64_t remote_addr_;
|
||||
#ifdef RDMA_DATA_VALIDATION
|
||||
uint64_t checksum_;
|
||||
#endif
|
||||
};
|
||||
uint32_t rkey_;
|
||||
bool is_dead_;
|
||||
DataType data_type_;
|
||||
TensorShape tensor_shape_;
|
||||
size_t tensor_bytes_;
|
||||
|
||||
// For error status:
|
||||
Status status_;
|
||||
|
||||
// type|name_size|name|step_id|request_index|remote_addr/checksum|rkey|...
|
||||
// 1B| 2B | 512| 8B | 8B | 8B | 4B |...
|
||||
// ...|is_dead|data_type|tensor_shape|tensor_bytes|error_status |
|
||||
// ...| 1B | XB | XB | 8B |size - 4B, proto - XB |
|
||||
static const size_t kNameCapacity = 512;
|
||||
static const size_t kTypeStartIndex = 0;
|
||||
static const size_t kNameSizeStartIndex = kTypeStartIndex + sizeof(type_);
|
||||
static const size_t kNameStartIndex =
|
||||
kNameSizeStartIndex + sizeof(name_size_);
|
||||
static const size_t kStepIdStartIndex = kNameStartIndex + kNameCapacity;
|
||||
static const size_t kRequestIndexStartIndex =
|
||||
kStepIdStartIndex + sizeof(step_id_);
|
||||
static const size_t kRemoteAddrStartIndex =
|
||||
kRequestIndexStartIndex + sizeof(request_index_);
|
||||
static const size_t kChecksumStartIndex = kRemoteAddrStartIndex;
|
||||
static const size_t kRkeyStartIndex =
|
||||
kRemoteAddrStartIndex + sizeof(remote_addr_);
|
||||
static const size_t kIsDeadStartIndex = kRkeyStartIndex + sizeof(rkey_);
|
||||
static const size_t kDataTypeStartIndex =
|
||||
kIsDeadStartIndex + sizeof(is_dead_);
|
||||
static const size_t kTensorShapeStartIndex =
|
||||
kDataTypeStartIndex + sizeof(data_type_);
|
||||
static const size_t kTensorBytesStartIndex =
|
||||
kTensorShapeStartIndex + sizeof(TensorShape);
|
||||
static const size_t kErrorStatusStartIndex =
|
||||
kTensorBytesStartIndex + sizeof(tensor_bytes_);
|
||||
static const size_t kErrorStatusMaxSize = 4096;
|
||||
|
||||
static const size_t kMessageTotalBytes = kErrorStatusStartIndex;
|
||||
static const size_t kRdmaMessageBufferSize =
|
||||
kMessageTotalBytes + kErrorStatusMaxSize;
|
||||
static string CreateMessage(const RdmaMessage& rm);
|
||||
static void ParseMessage(RdmaMessage& rm, void* buffer);
|
||||
};
|
||||
|
||||
// Immediate types for RDMA write
|
||||
enum RdmaImmDataType {
|
||||
RDMA_IMM_MAX_REQUEST_ID = 0xFFFFFFFD,
|
||||
RDMA_IMM_DATA_ACK = 0xFFFFFFFE,
|
||||
RDMA_IMM_DATA_MESSAGE = 0xFFFFFFFF
|
||||
};
|
||||
|
||||
// Write types for RDMA write-complete events
|
||||
enum RdmaWriteIDType {
|
||||
RDMA_WRITE_ID_ACK,
|
||||
RDMA_WRITE_ID_MESSAGE,
|
||||
RDMA_WRITE_ID_TENSOR_WRITE
|
||||
};
|
||||
|
||||
// Context for RDMA write-complete events
|
||||
class RdmaWriteID {
|
||||
public:
|
||||
RdmaWriteID(RdmaWriteIDType write_type, void* write_context)
|
||||
: write_type(write_type), write_context(write_context) {}
|
||||
|
||||
RdmaWriteIDType write_type;
|
||||
void* write_context;
|
||||
};
|
||||
|
||||
// Tensor meta-data
|
||||
class TensorMetaData {
|
||||
public:
|
||||
TensorShape tensor_shape_;
|
||||
DataType data_type_;
|
||||
size_t proto_size_;
|
||||
bool is_dead_;
|
||||
|
||||
std::ostream& print(std::ostream& out) const {
|
||||
out << "Dtype = " << DataTypeString(data_type_)
|
||||
<< ", Shape = " << tensor_shape_.DebugString() << ", Proto size = 0x"
|
||||
<< std::hex << proto_size_ << ", Is dead = " << is_dead_;
|
||||
return out;
|
||||
}
|
||||
};
|
||||
|
||||
inline std::ostream& operator<<(std::ostream& out,
|
||||
const TensorMetaData& meta_data) {
|
||||
return meta_data.print(out);
|
||||
}
|
||||
|
||||
class RdmaChannel;
|
||||
|
||||
void MRDeleter(ibv_mr* mr);
|
||||
using MemoryRegionPtr = std::unique_ptr<ibv_mr, decltype(&MRDeleter)>;
|
||||
|
||||
// RdmaMemoryMgr
|
||||
// Manages the local meta-data cache, and the registered RDMA memory regions.
|
||||
class RdmaMemoryMgr {
|
||||
public:
|
||||
static RdmaMemoryMgr& Singleton() {
|
||||
static RdmaMemoryMgr instance;
|
||||
return instance;
|
||||
}
|
||||
|
||||
// Memory regions
|
||||
ibv_mr* FindMemoryRegion(void* addr, size_t length);
|
||||
void InsertMemoryRegion(void* addr, size_t length,
|
||||
const std::string& allocator_name);
|
||||
void EvictMemoryRegion(void* addr, size_t length);
|
||||
|
||||
// Tensor meta-data cache
|
||||
const TensorMetaData* GetTensorMetaData(const std::string& tensor_name);
|
||||
const TensorMetaData* SetTensorMetaData(const std::string& tensor_name,
|
||||
DataType dtype,
|
||||
const TensorShape& shape,
|
||||
bool is_dead, size_t proto_size);
|
||||
|
||||
struct ibv_pd* pd_;
|
||||
|
||||
protected:
|
||||
RdmaMemoryMgr() : pd_(nullptr) {}
|
||||
|
||||
static bool Comparator(const void* ptr, const MemoryRegionPtr& other) {
|
||||
return ptr < reinterpret_cast<char*>(other->addr) + other->length;
|
||||
}
|
||||
|
||||
private:
|
||||
mutex tensor_meta_data_mu_;
|
||||
std::unordered_map<std::string, TensorMetaData> tensors_meta_data_;
|
||||
|
||||
// Managed memory regions
|
||||
mutex mrs_mu_;
|
||||
std::vector<MemoryRegionPtr> mrs_ GUARDED_BY(mrs_mu_);
|
||||
};
|
||||
|
||||
// RdmaTensorRequest
|
||||
// Represents a single tensor request.
|
||||
class RdmaTensorRequest {
|
||||
public:
|
||||
typedef Rendezvous::DoneCallback RecvDoneCallback;
|
||||
|
||||
// Creates a tensor request identified by index.
|
||||
RdmaTensorRequest(uint32_t index, const string& key, int64 step_id,
|
||||
RdmaChannel* channel, Device* dst_dev,
|
||||
const Rendezvous::Args recv_args,
|
||||
const RecvDoneCallback& done);
|
||||
~RdmaTensorRequest();
|
||||
|
||||
// Request unique index.
|
||||
uint32_t index() { return index_; }
|
||||
|
||||
// Start the tensor request sequence.
|
||||
//
|
||||
// 1. Allocate the result tensor (and proxy tensor if required).
|
||||
// 2. Send RDMA_MESSAGE_TENSOR_REQUEST to the remote side.
|
||||
void Start();
|
||||
|
||||
// Receive tensor meta-data.
|
||||
//
|
||||
// 1. Update the local meta-data cache.
|
||||
// 2. Reallocate the result tensor (and proxy tensor if required).
|
||||
// 3. Re-send the request to the remote side.
|
||||
void RecvTensorMetaData(DataType dtype, TensorShape shape, bool is_dead,
|
||||
size_t proto_size);
|
||||
|
||||
// Receive tensor content (RDMA write was completed).
|
||||
//
|
||||
// Decode proto if required and/or move to GPU if the content was not
|
||||
// written to it directly (GPU direct is not available). Afterwards,
|
||||
// invoke Done().
|
||||
void RecvTensorContent();
|
||||
|
||||
// Receive error status (in case of a remote error).
|
||||
// Invoke Done() with the status code.
|
||||
void RecvErrorStatus(const Status& status);
|
||||
|
||||
#ifdef RDMA_DATA_VALIDATION
|
||||
// Receive tensor checksum
|
||||
//
|
||||
// For validation: Get and store the Tensor's expected checksum for the
|
||||
// current request. Compare the result Tensor's checksum with the stored
|
||||
// checksum right before invoking Done().
|
||||
void RecvTensorChecksum(uint64_t checksum) { checksum_ = checksum; }
|
||||
#endif
|
||||
|
||||
private:
|
||||
void Done(const Status& s);
|
||||
void Send(RdmaMessageType message_type);
|
||||
bool AllocateTensors();
|
||||
void AllocateTensorsAsync(StatusCallback done);
|
||||
void DeallocateTensors();
|
||||
|
||||
uint32_t index_;
|
||||
string key_;
|
||||
int64 step_id_;
|
||||
RdmaChannel* channel_;
|
||||
Device* dst_dev_;
|
||||
Rendezvous::Args recv_args_;
|
||||
const TensorMetaData* meta_data_;
|
||||
Tensor* result_tensor_;
|
||||
Tensor* proxy_tensor_;
|
||||
void* rdma_addr_;
|
||||
ibv_mr* mr_;
|
||||
RecvDoneCallback done_;
|
||||
#ifdef RDMA_DATA_VALIDATION
|
||||
uint64_t checksum_;
|
||||
#endif
|
||||
};
|
||||
|
||||
// RdmaTensorResponse
|
||||
// Represents a single tensor response.
|
||||
class RdmaTensorResponse {
|
||||
public:
|
||||
// Creates a response for request message.
|
||||
RdmaTensorResponse(RdmaChannel* channel, const RdmaMessage& rm)
|
||||
: channel_(channel), rm_(rm) {}
|
||||
|
||||
void Update(const RdmaMessage& rm) { rm_ = rm; }
|
||||
|
||||
// Start the tensor response sequence.
|
||||
//
|
||||
// 1. Find the tensor in the local tag-match table and invoke RecvHandler.
|
||||
// (Using RecvLocalAsync()).
|
||||
// 2. Compare the tensor's meta-data to the meta-data in the message (taken
|
||||
// from the requester's local cache).
|
||||
// If meta-data changed:
|
||||
// a. Clone the tensor to be sent later.
|
||||
// b. Send a meta-data update message and wait for re-request.
|
||||
// Else:
|
||||
// a. Send the tensor's content (using direct RDMA write).
|
||||
void Start();
|
||||
|
||||
// Resume the response sequence, after a re-request.
|
||||
//
|
||||
// 1. Send the tensor's content that was cloned earlier.
|
||||
void Resume();
|
||||
|
||||
// Destroy the response's resources and remove it from the pending list.
|
||||
void Destroy();
|
||||
|
||||
private:
|
||||
void RecvHandler(Rendezvous::ParsedKey parsed,
|
||||
const Rendezvous::Args& send_args,
|
||||
const Rendezvous::Args& recv_args, const Tensor& in,
|
||||
bool is_dead);
|
||||
void Clone(const Tensor& in, const TensorProto& proto, bool is_dead);
|
||||
void Send(const Tensor& in, const TensorProto& proto, bool is_dead,
|
||||
const Status& status);
|
||||
bool TensorMetaDataChanged(const Tensor& in, bool is_dead);
|
||||
Status PrepareRecvTensor(const Rendezvous::ParsedKey& parsed,
|
||||
Device** src_dev);
|
||||
void SendMetaData(const Tensor& in, const TensorProto& proto, bool is_dead);
|
||||
void SendContent(const Tensor& in, const TensorProto& proto, bool is_dead);
|
||||
void SendErrorStatus(const Status& status);
|
||||
|
||||
RdmaChannel* channel_;
|
||||
RdmaMessage rm_; // The request message
|
||||
Device* src_dev_ = nullptr;
|
||||
TensorBuffer* src_buffer_ = nullptr;
|
||||
void* src_addr_ = nullptr;
|
||||
ibv_mr* mr_ = nullptr;
|
||||
uint64_t checksum_ = 0;
|
||||
bool meta_data_changed_ = false;
|
||||
|
||||
// Re-item:
|
||||
TensorProto* proto_ = nullptr;
|
||||
Tensor* tensor_ = nullptr;
|
||||
bool is_dead_ = false;
|
||||
};
|
||||
|
||||
class RdmaMessageBuffer;
|
||||
// Class that represents the Rdma Adapter.
|
||||
// Responsible for creation of the completion queue, and handling
|
||||
// of work completions.
|
||||
class RdmaAdapter {
|
||||
friend class RdmaChannel;
|
||||
friend class RdmaBuffer;
|
||||
friend class RdmaAckBuffer;
|
||||
friend class RdmaMessageBuffer;
|
||||
friend class RdmaTensorBuffer;
|
||||
friend class RdmaTensorResponse;
|
||||
friend class RdmaMgr;
|
||||
friend class RdmaRemoteRendezvous;
|
||||
|
||||
@ -133,10 +413,10 @@ class RdmaAdapter {
|
||||
// Responsible for connecting queue pairs.
|
||||
class RdmaChannel {
|
||||
friend class RdmaAdapter;
|
||||
friend class RdmaBuffer;
|
||||
friend class RdmaAckBuffer;
|
||||
friend class RdmaMessageBuffer;
|
||||
friend class RdmaTensorBuffer;
|
||||
friend class RdmaTensorRequest;
|
||||
friend class RdmaTensorResponse;
|
||||
friend class RdmaMgr;
|
||||
friend class RdmaRemoteRendezvous;
|
||||
|
||||
@ -146,22 +426,28 @@ class RdmaChannel {
|
||||
~RdmaChannel();
|
||||
inline const RdmaAddress& self() { return self_; }
|
||||
RdmaAddress address() const;
|
||||
inline const std::vector<RdmaBuffer*>& message_buffers() const {
|
||||
inline const std::vector<RdmaMessageBuffer*>& message_buffers() const {
|
||||
return message_buffers_;
|
||||
}
|
||||
void Connect(const RdmaAddress& remoteAddr);
|
||||
void Connect();
|
||||
void Recv();
|
||||
RdmaBuffer* FindBuffer(const uint32_t index);
|
||||
RdmaBuffer* FindBuffer(const string& name);
|
||||
RdmaBuffer* FindOrCreateBuffer(const string& name,
|
||||
BufferType buffer_type = TENSOR);
|
||||
uint32_t LookupBufferIndex(const string& buffer_name);
|
||||
void SetRemoteAddress(const RdmaAddress& ra, bool override);
|
||||
void InsertRecvCallback(const string& key, std::function<void()> recv_done);
|
||||
void RemoveRecvCallback(const string& key);
|
||||
void RunRecvCallback(const string& key);
|
||||
static const int kNumMessageBuffers = 4;
|
||||
|
||||
// Requests:
|
||||
RdmaTensorRequest* InsertTensorRequest(
|
||||
const string& key, int64 step_id, Device* dst_dev,
|
||||
const Rendezvous::Args recv_args,
|
||||
const RdmaTensorRequest::RecvDoneCallback& done);
|
||||
void RemoveTensorRequest(uint32_t request_index);
|
||||
RdmaTensorRequest* GetTensorRequest(uint32_t request_index);
|
||||
|
||||
// Responses:
|
||||
RdmaTensorResponse* AddTensorResponse(const RdmaMessage& rm);
|
||||
RdmaTensorResponse* UpdateTensorResponse(const RdmaMessage& rm);
|
||||
void RemoveTensorResponse(uint32_t request_index);
|
||||
|
||||
static const int kNumMessageBuffers = 2;
|
||||
static const int kPingRecvWrid = 0;
|
||||
|
||||
private:
|
||||
@ -179,36 +465,31 @@ class RdmaChannel {
|
||||
string remote_name_;
|
||||
ibv_qp* qp_;
|
||||
mutex mu_;
|
||||
bool connected_ GUARDED_BY(bt_mu_) = false;
|
||||
RdmaAddress remote_ GUARDED_BY(bt_mu_);
|
||||
bool remote_set_ GUARDED_BY(bt_mu_) = false;
|
||||
bool connected_ GUARDED_BY(mu_) = false;
|
||||
RdmaAddress remote_ GUARDED_BY(mu_);
|
||||
bool remote_set_ GUARDED_BY(mu_) = false;
|
||||
mutex ct_mu_;
|
||||
typedef std::unordered_map<string, std::function<void()> > CallbackTable;
|
||||
CallbackTable callback_table_ GUARDED_BY(ct_mu_);
|
||||
mutex bt_mu_;
|
||||
typedef std::unordered_map<unsigned int, RdmaBuffer*> BufferTable;
|
||||
BufferTable buffer_table_ GUARDED_BY(bt_mu_);
|
||||
typedef std::unordered_map<uint32_t, string> BufferIndexNameTable;
|
||||
BufferIndexNameTable buffer_index_name_table_ GUARDED_BY(bt_mu_);
|
||||
typedef std::unordered_map<string, uint32_t> BufferNameIndexTable;
|
||||
BufferNameIndexTable buffer_name_index_table_ GUARDED_BY(bt_mu_);
|
||||
RdmaBuffer* tx_message_buffer_;
|
||||
RdmaBuffer* rx_message_buffer_;
|
||||
RdmaBuffer* tx_ack_buffer_;
|
||||
RdmaBuffer* rx_ack_buffer_;
|
||||
std::vector<RdmaBuffer*> message_buffers_;
|
||||
typedef std::unordered_map<uint32_t, RdmaTensorRequest> RequestTable;
|
||||
RequestTable request_table_ GUARDED_BY(ct_mu_);
|
||||
uint32_t request_serial_ GUARDED_BY(ct_mu_);
|
||||
mutex responses_mu_;
|
||||
typedef std::unordered_map<uint32_t, RdmaTensorResponse> ResponsesTable;
|
||||
ResponsesTable responses_table_ GUARDED_BY(responses_mu_);
|
||||
RdmaMessageBuffer* tx_message_buffer_;
|
||||
RdmaMessageBuffer* rx_message_buffer_;
|
||||
std::vector<RdmaMessageBuffer*> message_buffers_;
|
||||
};
|
||||
|
||||
// Class that represents a buffer for Rdma writes and reads.
|
||||
class RdmaBuffer {
|
||||
// Class that represents a buffer for Rdma message sending.
|
||||
class RdmaMessageBuffer {
|
||||
friend class RdmaChannel;
|
||||
friend class RdmaAdapter;
|
||||
friend class RdmaMgr;
|
||||
friend class RdmaRemoteRendezvous;
|
||||
|
||||
public:
|
||||
explicit RdmaBuffer(RdmaChannel* channel, string name);
|
||||
virtual ~RdmaBuffer();
|
||||
explicit RdmaMessageBuffer(RdmaChannel* channel, string name);
|
||||
~RdmaMessageBuffer();
|
||||
|
||||
inline void* buffer() const { return buffer_; }
|
||||
inline ibv_mr* self() const { return self_; }
|
||||
@ -223,13 +504,15 @@ class RdmaBuffer {
|
||||
}
|
||||
void FreeBuffer();
|
||||
void EnqueueItem(string Item);
|
||||
virtual void SendNextItem() {};
|
||||
void SendNextItem();
|
||||
void CreateCPUBuffer(size_t size, bool lock = true);
|
||||
void SetRemoteMR(RemoteMR rmi, bool override);
|
||||
uint32_t LookupBufferIndex(const string& buffer_name) {
|
||||
return const_cast<RdmaChannel*>(channel_)->LookupBufferIndex(buffer_name);
|
||||
}
|
||||
void Write(uint32_t imm_data, size_t buffer_size);
|
||||
static void Write(const RdmaChannel* channel, uint32_t imm_data,
|
||||
size_t buffer_size, uint64_t src_addr, uint32_t lkey,
|
||||
uint64_t remote_addr, uint32_t rkey,
|
||||
RdmaWriteIDType write_type, void* write_context);
|
||||
static void SendAck(const RdmaChannel* channel);
|
||||
|
||||
protected:
|
||||
const RdmaChannel* channel_;
|
||||
@ -245,125 +528,6 @@ class RdmaBuffer {
|
||||
BufferStatus remote_status_ GUARDED_BY(mu_) = none;
|
||||
};
|
||||
|
||||
class RdmaAckBuffer : public RdmaBuffer {
|
||||
public:
|
||||
explicit RdmaAckBuffer(RdmaChannel* channel, string name);
|
||||
virtual ~RdmaAckBuffer() override {}
|
||||
void SendNextItem() override;
|
||||
};
|
||||
|
||||
class RdmaMessageBuffer : public RdmaBuffer {
|
||||
friend class RdmaChannel;
|
||||
friend class RdmaAapater;
|
||||
|
||||
public:
|
||||
explicit RdmaMessageBuffer(RdmaChannel* channel, string name);
|
||||
virtual ~RdmaMessageBuffer() override {}
|
||||
void SendNextItem() override;
|
||||
};
|
||||
|
||||
class RdmaTensorBuffer : public RdmaBuffer {
|
||||
public:
|
||||
explicit RdmaTensorBuffer(RdmaChannel* channel, string name);
|
||||
virtual ~RdmaTensorBuffer() override;
|
||||
void SendNextItem() override;
|
||||
void PostCopyOperations(bool can_memcpy, size_t buffer_size,
|
||||
size_t tensor_bytes, const string& key,
|
||||
const Tensor& in, int64 step_id, bool is_dead,
|
||||
const string& key_with_step_id, const Tensor* copy,
|
||||
const TensorProto* proto, const StringPiece* copy_buf,
|
||||
const Rendezvous::Args& send_args,
|
||||
const Rendezvous::Args& recv_args);
|
||||
|
||||
void ReSendNextItem();
|
||||
|
||||
private:
|
||||
Rendezvous::DoneCallback getRecvTensorCallback(
|
||||
const string& key_with_step_id, const string& key, int64 step_id,
|
||||
const Rendezvous::ParsedKey& parsed);
|
||||
|
||||
struct ReItem {
|
||||
Rendezvous::Args send_args;
|
||||
Rendezvous::Args recv_args;
|
||||
Tensor in;
|
||||
bool is_dead;
|
||||
|
||||
ReItem(const Rendezvous::Args& send_args_,
|
||||
const Rendezvous::Args& recv_args_, const Tensor& in_, bool is_dead_)
|
||||
: send_args(send_args_),
|
||||
recv_args(recv_args_),
|
||||
in(in_),
|
||||
is_dead(is_dead_) {
|
||||
if (send_args.device_context) {
|
||||
send_args.device_context->Ref();
|
||||
}
|
||||
if (recv_args.device_context) {
|
||||
recv_args.device_context->Ref();
|
||||
}
|
||||
}
|
||||
|
||||
~ReItem() {
|
||||
if (send_args.device_context) {
|
||||
send_args.device_context->Unref();
|
||||
}
|
||||
if (recv_args.device_context) {
|
||||
recv_args.device_context->Unref();
|
||||
}
|
||||
}
|
||||
};
|
||||
typedef std::map<string, ReItem*> Table;
|
||||
typedef Table::iterator Itable;
|
||||
|
||||
std::queue<string> requeue GUARDED_BY(mu_);
|
||||
Table retable GUARDED_BY(mu_);
|
||||
};
|
||||
|
||||
struct RdmaMessage {
|
||||
RdmaMessageType type_;
|
||||
uint16_t name_size_;
|
||||
string name_;
|
||||
int64 step_id_;
|
||||
uint64_t buffer_size_;
|
||||
uint64_t remote_addr_;
|
||||
uint32_t rkey_;
|
||||
bool is_dead_;
|
||||
DataType data_type_;
|
||||
TensorShape tensor_shape_;
|
||||
size_t tensor_bytes_;
|
||||
|
||||
// type|name_size|name|step_id|buffer_size|remote_addr|rkey|is_dead|...
|
||||
// 1B| 2B | 512| 8B | 8B | 8B | 4B | 1B |...
|
||||
// ...|data_type|tensor_shape|tensor_bytes|tensor_buffer
|
||||
// ...| XB | XB | 8B |...
|
||||
//
|
||||
static const size_t kNameCapacity = 512;
|
||||
static const size_t kTypeStartIndex = 0;
|
||||
static const size_t kNameSizeStartIndex = kTypeStartIndex + sizeof(type_);
|
||||
static const size_t kNameStartIndex =
|
||||
kNameSizeStartIndex + sizeof(name_size_);
|
||||
static const size_t kStepIdStartIndex = kNameStartIndex + kNameCapacity;
|
||||
static const size_t kBufferSizeStartIndex =
|
||||
kStepIdStartIndex + sizeof(step_id_);
|
||||
static const size_t kRemoteAddrStartIndex =
|
||||
kBufferSizeStartIndex + sizeof(buffer_size_);
|
||||
static const size_t kRkeyStartIndex =
|
||||
kRemoteAddrStartIndex + sizeof(remote_addr_);
|
||||
static const size_t kIsDeadStartIndex = kRkeyStartIndex + sizeof(rkey_);
|
||||
static const size_t kDataTypeStartIndex =
|
||||
kIsDeadStartIndex + sizeof(is_dead_);
|
||||
static const size_t kTensorShapeStartIndex =
|
||||
kDataTypeStartIndex + sizeof(data_type_);
|
||||
static const size_t kTensorBytesStartIndex =
|
||||
kTensorShapeStartIndex + sizeof(TensorShape);
|
||||
static const size_t kTensorBufferStartIndex =
|
||||
kTensorBytesStartIndex + sizeof(tensor_bytes_);
|
||||
static const size_t kMessageTotalBytes = kTensorBufferStartIndex;
|
||||
static const size_t kRdmaMessageBufferSize = kMessageTotalBytes;
|
||||
static const size_t kRdmaAckBufferSize = kMessageTotalBytes;
|
||||
static string CreateMessage(const RdmaMessage& rm);
|
||||
static void ParseMessage(RdmaMessage& rm, void* buffer);
|
||||
};
|
||||
|
||||
} // namespace tensorflow
|
||||
|
||||
#endif // TENSORFLOW_USE_VERBS
|
||||
|
@ -16,11 +16,16 @@ limitations under the License.
|
||||
#ifdef TENSORFLOW_USE_VERBS
|
||||
|
||||
#include "tensorflow/contrib/verbs/rdma_mgr.h"
|
||||
#include <fstream>
|
||||
#include <vector>
|
||||
#include "tensorflow/contrib/verbs/grpc_verbs_client.h"
|
||||
#include "tensorflow/contrib/verbs/verbs_service.pb.h"
|
||||
#include "tensorflow/core/common_runtime/bfc_allocator.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/process_state.h"
|
||||
#include "tensorflow/core/distributed_runtime/rpc/grpc_worker_cache.h"
|
||||
#include "tensorflow/core/distributed_runtime/session_mgr.h"
|
||||
#include "tensorflow/core/framework/allocator_registry.h"
|
||||
#include "tensorflow/core/lib/core/status.h"
|
||||
|
||||
namespace tensorflow {
|
||||
@ -53,7 +58,7 @@ RdmaMgr::RdmaMgr(const WorkerEnv* const worker_env,
|
||||
void RdmaMgr::SetupChannels() {
|
||||
for (const auto& p : channel_table_) {
|
||||
string worker_name = p.first;
|
||||
LOG(INFO) << "connecting to remote node " << worker_name;
|
||||
RDMA_LOG(2) << "Connecting to remote node " << worker_name;
|
||||
RdmaChannel* rc = p.second;
|
||||
GetRemoteAddressRequest req;
|
||||
GetRemoteAddressResponse resp;
|
||||
@ -78,39 +83,49 @@ void RdmaMgr::SetupChannels() {
|
||||
mr->set_rkey(rc->message_buffers_[i]->self_->rkey);
|
||||
}
|
||||
// synchronous call
|
||||
Status s = client->GetRemoteAddress(&req, &resp);
|
||||
// save obtained remote addresses
|
||||
// connect to the remote channel
|
||||
if (s.ok()) {
|
||||
CHECK(worker_name.compare(resp.host_name()) == 0);
|
||||
RdmaAddress ra;
|
||||
ra.lid = resp.channel().lid();
|
||||
ra.qpn = resp.channel().qpn();
|
||||
ra.psn = resp.channel().psn();
|
||||
ra.snp = resp.channel().snp();
|
||||
ra.iid = resp.channel().iid();
|
||||
rc->SetRemoteAddress(ra, false);
|
||||
rc->Connect();
|
||||
int i = 0;
|
||||
int idx[] = {1, 0, 3, 2};
|
||||
for (const auto& mr : resp.mr()) {
|
||||
// the connections are crossed, i.e.
|
||||
// local tx_message_buffer <---> remote rx_message_buffer_
|
||||
// local rx_message_buffer <---> remote tx_message_buffer_
|
||||
// local tx_ack_buffer <---> remote rx_ack_buffer_
|
||||
// local rx_ack_buffer <---> remote tx_ack_buffer_
|
||||
// hence idx[] = {1, 0, 3, 2}.
|
||||
RdmaBuffer* rb = rc->message_buffers_[idx[i]];
|
||||
RemoteMR rmr;
|
||||
rmr.remote_addr = mr.remote_addr();
|
||||
rmr.rkey = mr.rkey();
|
||||
rb->SetRemoteMR(rmr, false);
|
||||
i++;
|
||||
Status s;
|
||||
int attempts = 0;
|
||||
static const int max_num_attempts = 5;
|
||||
do {
|
||||
s = client->GetRemoteAddress(&req, &resp);
|
||||
// save obtained remote addresses
|
||||
// connect to the remote channel
|
||||
if (s.ok()) {
|
||||
CHECK(worker_name.compare(resp.host_name()) == 0);
|
||||
RdmaAddress ra;
|
||||
ra.lid = resp.channel().lid();
|
||||
ra.qpn = resp.channel().qpn();
|
||||
ra.psn = resp.channel().psn();
|
||||
ra.snp = resp.channel().snp();
|
||||
ra.iid = resp.channel().iid();
|
||||
rc->SetRemoteAddress(ra, false);
|
||||
rc->Connect();
|
||||
int i = 0;
|
||||
int idx[] = {1, 0};
|
||||
for (const auto& mr : resp.mr()) {
|
||||
// the connections are crossed, i.e.
|
||||
// local tx_message_buffer <---> remote rx_message_buffer_
|
||||
// local rx_message_buffer <---> remote tx_message_buffer_
|
||||
// hence idx[] = {1, 0}.
|
||||
RdmaMessageBuffer* rb = rc->message_buffers_[idx[i]];
|
||||
RemoteMR rmr;
|
||||
rmr.remote_addr = mr.remote_addr();
|
||||
rmr.rkey = mr.rkey();
|
||||
rb->SetRemoteMR(rmr, false);
|
||||
i++;
|
||||
}
|
||||
CHECK(i == RdmaChannel::kNumMessageBuffers);
|
||||
} else {
|
||||
LOG(ERROR) << "Connecting to " << worker_name
|
||||
<< ": Got " << s.error_message() << ". Retrying ("
|
||||
<< (attempts + 1) << "/" << max_num_attempts << ")..." ;
|
||||
if (++attempts == max_num_attempts) {
|
||||
break;
|
||||
}
|
||||
worker_env_->env->SleepForMicroseconds(2000000);
|
||||
}
|
||||
CHECK(i == RdmaChannel::kNumMessageBuffers);
|
||||
} else {
|
||||
LOG(ERROR) << s.error_message();
|
||||
}
|
||||
} while (!s.ok());
|
||||
RDMA_LOG(0) << "Connected to remote node " << worker_name;
|
||||
delete client;
|
||||
}
|
||||
}
|
||||
@ -183,6 +198,138 @@ RdmaChannel* RdmaMgr::FindChannel(const string& name) {
|
||||
return iter->second;
|
||||
}
|
||||
|
||||
bool IsGDRAvailable() {
|
||||
#if defined(__APPLE__)
|
||||
return false;
|
||||
#elif defined(PLATFORM_WINDOWS)
|
||||
return false;
|
||||
#else
|
||||
std::ifstream ifs("/proc/modules");
|
||||
string line;
|
||||
while (std::getline(ifs, line)) {
|
||||
auto sep = line.find(' ');
|
||||
CHECK_NE(sep, std::string::npos);
|
||||
if (line.substr(0, sep) == "nv_peer_mem") {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
return false;
|
||||
#endif
|
||||
}
|
||||
|
||||
int TryToReadNumaNode(ibv_device* device) {
|
||||
#if defined(__APPLE__)
|
||||
LOG(INFO) << "OS X does not support NUMA - returning NUMA node 0";
|
||||
return 0;
|
||||
#elif defined(PLATFORM_WINDOWS)
|
||||
// Windows support for NUMA is not currently implemented. Return node 0.
|
||||
return 0;
|
||||
#else
|
||||
VLOG(2) << "Trying to read NUMA node for device: " << device->name;
|
||||
static const int kUnknownNumaNode = -1;
|
||||
|
||||
auto filename = string(device->ibdev_path) + "/device/numa_node";
|
||||
|
||||
std::ifstream ifs(filename.c_str());
|
||||
string content;
|
||||
CHECK(std::getline(ifs, content));
|
||||
|
||||
int32 value;
|
||||
if (strings::safe_strto32(content, &value)) {
|
||||
if (value < 0) {
|
||||
LOG(INFO) << "Successful NUMA node read from SysFS had negative value ("
|
||||
<< value << "), but there must be at least one NUMA node"
|
||||
", so returning NUMA node zero";
|
||||
return 0;
|
||||
}
|
||||
LOG(INFO) << "NUMA node for device: " << device->name << " is " << value;
|
||||
return value;
|
||||
}
|
||||
return kUnknownNumaNode;
|
||||
#endif
|
||||
}
|
||||
|
||||
void MRDeleter(ibv_mr* mr) {
|
||||
if (mr) {
|
||||
ibv_dereg_mr(mr);
|
||||
}
|
||||
}
|
||||
|
||||
// TODO(byronyi): remove this class duplicated from the one in
|
||||
// common/runtime/gpu/pool_allocator.h when it is available in common_runtime
|
||||
class BasicCPUAllocator : public SubAllocator {
|
||||
public:
|
||||
~BasicCPUAllocator() override {}
|
||||
|
||||
void* Alloc(size_t alignment, size_t num_bytes) override {
|
||||
return port::AlignedMalloc(num_bytes, alignment);
|
||||
}
|
||||
void Free(void* ptr, size_t) override { port::AlignedFree(ptr); }
|
||||
};
|
||||
|
||||
// TODO(byronyi): remove this class and its registration when the default
|
||||
// cpu_allocator() returns visitable allocator
|
||||
class BFCRdmaAllocator : public BFCAllocator {
|
||||
public:
|
||||
BFCRdmaAllocator()
|
||||
: BFCAllocator(new BasicCPUAllocator(), 1LL << 36, true, "cpu_rdma_bfc") {
|
||||
}
|
||||
};
|
||||
|
||||
REGISTER_MEM_ALLOCATOR("BFCRdmaAllocator", 101, BFCRdmaAllocator);
|
||||
|
||||
void RdmaMgr::InitAllocators() {
|
||||
RdmaMemoryMgr::Singleton().pd_ = rdma_adapter_->pd_;
|
||||
|
||||
Allocator* allocators[] = {
|
||||
#if GOOGLE_CUDA
|
||||
ProcessState::singleton()->GetCUDAHostAllocator(0),
|
||||
ProcessState::singleton()->GetCPUAllocator(0),
|
||||
#endif // GOOGLE_CUDA
|
||||
cpu_allocator(),
|
||||
};
|
||||
|
||||
using namespace std::placeholders;
|
||||
|
||||
std::set<Allocator*> instrumented_;
|
||||
|
||||
// Host memory allocators
|
||||
for (Allocator* allocator : allocators) {
|
||||
VisitableAllocator::Visitor alloc_visitor =
|
||||
std::bind(&RdmaMemoryMgr::InsertMemoryRegion,
|
||||
&RdmaMemoryMgr::Singleton(), _1, _2, allocator->Name());
|
||||
VisitableAllocator::Visitor free_visitor = std::bind(
|
||||
&RdmaMemoryMgr::EvictMemoryRegion, &RdmaMemoryMgr::Singleton(), _1, _2);
|
||||
|
||||
auto* visitable_allocator = dynamic_cast<VisitableAllocator*>(allocator);
|
||||
CHECK(visitable_allocator) << "is not visitable for instrumentation"
|
||||
<< allocator->Name();
|
||||
// Make sure we don't instrument the same allocator twice
|
||||
if (instrumented_.find(allocator) == std::end(instrumented_)) {
|
||||
visitable_allocator->AddAllocVisitor(alloc_visitor);
|
||||
visitable_allocator->AddFreeVisitor(free_visitor);
|
||||
instrumented_.insert(allocator);
|
||||
LOG(INFO) << "Instrumenting CPU allocator " << allocator->Name();
|
||||
}
|
||||
}
|
||||
|
||||
#if GOOGLE_CUDA
|
||||
if (IsGDRAvailable()) {
|
||||
// Note we don't free allocated GPU memory so there is no free visitor
|
||||
int32_t bus_id = TryToReadNumaNode(rdma_adapter_->context_->device) + 1;
|
||||
|
||||
char buf[8];
|
||||
sprintf(buf, "gpu");
|
||||
VisitableAllocator::Visitor cuda_alloc_visitor =
|
||||
std::bind(&RdmaMemoryMgr::InsertMemoryRegion,
|
||||
&RdmaMemoryMgr::Singleton(), _1, _2, std::string(buf));
|
||||
|
||||
ProcessState::singleton()->AddGPUAllocVisitor(bus_id, cuda_alloc_visitor);
|
||||
LOG(INFO) << "Instrumenting GPU allocator with bus_id " << bus_id;
|
||||
}
|
||||
#endif // GOOGLE_CUDA
|
||||
}
|
||||
|
||||
} // end namespace tensorflow
|
||||
|
||||
#endif
|
||||
|
@ -38,6 +38,7 @@ class RdmaMgr {
|
||||
RdmaChannel* FindChannel(const string& key);
|
||||
void SetupChannels();
|
||||
bool ConnectivityCheck();
|
||||
void InitAllocators();
|
||||
const string& local_worker() { return local_worker_; }
|
||||
|
||||
private:
|
||||
|
@ -21,10 +21,6 @@ limitations under the License.
|
||||
#include "tensorflow/core/common_runtime/device.h"
|
||||
#include "tensorflow/core/common_runtime/device_mgr.h"
|
||||
#include "tensorflow/core/common_runtime/dma_helper.h"
|
||||
#if GOOGLE_CUDA
|
||||
#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
|
||||
#include "tensorflow/core/common_runtime/gpu/process_state.h"
|
||||
#endif // GOOGLE_CUDA
|
||||
#include "tensorflow/core/lib/core/errors.h"
|
||||
#include "tensorflow/core/lib/strings/numbers.h"
|
||||
#include "tensorflow/core/lib/strings/str_util.h"
|
||||
@ -36,11 +32,6 @@ class RdmaRemoteRendezvous : public BaseRemoteRendezvous {
|
||||
RdmaRemoteRendezvous(const WorkerEnv* env, int64 step_id, RdmaMgr* rdma_mgr)
|
||||
: BaseRemoteRendezvous(env, step_id), rdma_mgr_(rdma_mgr) {}
|
||||
|
||||
void RecvPostCopyOps(const string& key, const string& key_with_step_id,
|
||||
const Rendezvous::Args& recv_args,
|
||||
const DoneCallback& done, const RdmaMessage& rm,
|
||||
RdmaChannel* rc, Tensor& val, const Status& s);
|
||||
|
||||
protected:
|
||||
void RecvFromRemoteAsync(const Rendezvous::ParsedKey& parsed,
|
||||
const Rendezvous::Args& args,
|
||||
@ -74,101 +65,18 @@ void RdmaRemoteRendezvous::RecvFromRemoteAsync(
|
||||
RdmaChannel* rc = rdma_mgr_->FindChannel(src_name);
|
||||
string key(std::move(parsed.FullKey().ToString()));
|
||||
string key_with_step_id = VerbsUtil::AppendStepidToKey(key, step_id_);
|
||||
// insert callback
|
||||
rc->InsertRecvCallback(key_with_step_id, [this, key, key_with_step_id, rc,
|
||||
recv_args, parsed, done]() {
|
||||
Status src_s, dst_s, s;
|
||||
Device* src_dev, *dst_dev;
|
||||
src_s = env_->device_mgr->LookupDevice("CPU:0", &src_dev);
|
||||
dst_s = env_->device_mgr->LookupDevice(parsed.dst_device, &dst_dev);
|
||||
if (!src_s.ok() || !dst_s.ok()) {
|
||||
s = src_s.ok() ? dst_s : src_s;
|
||||
LOG(ERROR) << "s is not ok, error code " << s.error_message();
|
||||
done(s, Args(), recv_args, Tensor(), true);
|
||||
return;
|
||||
}
|
||||
RdmaBuffer* rb = rc->FindBuffer(key);
|
||||
RdmaMessage rm;
|
||||
CHECK(rb->size_ >= RdmaMessage::kMessageTotalBytes);
|
||||
RdmaMessage::ParseMessage(rm, rb->buffer_);
|
||||
CHECK(rm.type_ == RDMA_MESSAGE_TENSOR_WRITE);
|
||||
Tensor val;
|
||||
if (!rm.is_dead_) {
|
||||
void* input = static_cast<char*>(rb->buffer_) +
|
||||
RdmaMessage::kTensorBufferStartIndex;
|
||||
bool can_memcpy = DataTypeCanUseMemcpy(rm.data_type_);
|
||||
if (can_memcpy) {
|
||||
if (dst_dev->tensorflow_gpu_device_info() &&
|
||||
(!recv_args.alloc_attrs.on_host())) {
|
||||
#if GOOGLE_CUDA
|
||||
CHECK(recv_args.device_context)
|
||||
<< "send dev name: " << src_dev->name()
|
||||
<< " gpu_info: " << src_dev->tensorflow_gpu_device_info();
|
||||
Allocator* alloc = ProcessState::singleton()->GetCUDAHostAllocator(0);
|
||||
Tensor copy(alloc, rm.data_type_, rm.tensor_shape_);
|
||||
memcpy(DMAHelper::base(©), input, rm.tensor_bytes_);
|
||||
|
||||
Allocator* dst_alloc = dst_dev->GetAllocator(recv_args.alloc_attrs);
|
||||
Tensor gpu_copy(dst_alloc, rm.data_type_, rm.tensor_shape_);
|
||||
Device* dst_dev;
|
||||
s = env_->device_mgr->LookupDevice(parsed.dst_device, &dst_dev);
|
||||
CHECK(s.ok()) << "s is not ok, error code " << s.error_message();
|
||||
if (!s.ok()) {
|
||||
done(s, Args(), recv_args, Tensor(), true);
|
||||
return;
|
||||
}
|
||||
|
||||
GPUUtil::CopyCPUTensorToGPU(
|
||||
©, recv_args.device_context, dst_dev, &gpu_copy,
|
||||
[this, gpu_copy, key, key_with_step_id, recv_args, done, rm, rc](
|
||||
const Status& s) {
|
||||
CHECK(s.ok()) << "copy tensor to gpu sync";
|
||||
Tensor val;
|
||||
val = std::move(gpu_copy);
|
||||
RecvPostCopyOps(key, key_with_step_id, recv_args, done, rm, rc,
|
||||
val, s);
|
||||
});
|
||||
#endif // GOOGLE_CUDA
|
||||
return;
|
||||
} else {
|
||||
AllocatorAttributes host_alloc_attrs;
|
||||
host_alloc_attrs.set_gpu_compatible(true);
|
||||
host_alloc_attrs.set_on_host(true);
|
||||
Allocator* alloc = dst_dev->GetAllocator(host_alloc_attrs);
|
||||
Tensor copy(alloc, rm.data_type_, rm.tensor_shape_);
|
||||
memcpy(DMAHelper::base(©), input, rm.tensor_bytes_);
|
||||
val = std::move(copy);
|
||||
}
|
||||
} else {
|
||||
TensorProto proto;
|
||||
CHECK(rm.tensor_bytes_ + RdmaMessage::kTensorBufferStartIndex <=
|
||||
rb->size_);
|
||||
CHECK(ParseProtoUnlimited(&proto, input, rm.tensor_bytes_))
|
||||
<< "fail to parse proto from array";
|
||||
s = dst_dev->MakeTensorFromProto(proto, recv_args.alloc_attrs, &val);
|
||||
}
|
||||
}
|
||||
RecvPostCopyOps(key, key_with_step_id, recv_args, done, rm, rc, val, s);
|
||||
});
|
||||
// append key to message queue
|
||||
RdmaBuffer* rb = rc->tx_message_buffer_;
|
||||
RdmaMessage rm;
|
||||
rm.type_ = RDMA_MESSAGE_TENSOR_REQUEST;
|
||||
rm.name_size_ = key.size();
|
||||
rm.name_ = key;
|
||||
rm.step_id_ = step_id_;
|
||||
string message = RdmaMessage::CreateMessage(rm);
|
||||
rb->EnqueueItem(message);
|
||||
rb->SendNextItem();
|
||||
}
|
||||
|
||||
void RdmaRemoteRendezvous::RecvPostCopyOps(
|
||||
const string& key, const string& key_with_step_id,
|
||||
const Rendezvous::Args& recv_args, const DoneCallback& done,
|
||||
const RdmaMessage& rm, RdmaChannel* rc, Tensor& val, const Status& s) {
|
||||
rc->RemoveRecvCallback(key_with_step_id);
|
||||
RdmaMessage br;
|
||||
br.type_ = RDMA_MESSAGE_BUFFER_IDLE;
|
||||
br.name_size_ = key.size();
|
||||
br.name_ = key;
|
||||
string message = RdmaMessage::CreateMessage(br);
|
||||
RdmaBuffer* tb = rc->tx_message_buffer_;
|
||||
tb->EnqueueItem(message);
|
||||
tb->SendNextItem();
|
||||
done(s, Args(), recv_args, val, rm.is_dead_);
|
||||
RdmaTensorRequest* request =
|
||||
rc->InsertTensorRequest(key, step_id_, dst_dev, recv_args, done);
|
||||
request->Start();
|
||||
}
|
||||
|
||||
RdmaRendezvousMgr::RdmaRendezvousMgr(const WorkerEnv* env)
|
||||
|
@ -104,6 +104,7 @@ Status VerbsServer::Start() {
|
||||
[this] { verbs_service_->HandleRPCsLoop(); }));
|
||||
rdma_mgr_->SetupChannels();
|
||||
CHECK(rdma_mgr_->ConnectivityCheck()) << "Connectivity check failed!";
|
||||
rdma_mgr_->InitAllocators();
|
||||
verbs_state_ = CONNECTED;
|
||||
}
|
||||
}
|
||||
|
@ -50,6 +50,12 @@ message GetRemoteAddressResponse {
|
||||
repeated MemoryRegion mr = 3;
|
||||
}
|
||||
|
||||
message ErrorStatusProto {
|
||||
int32 error_code = 1;
|
||||
string error_message = 2;
|
||||
string error_details = 3;
|
||||
}
|
||||
|
||||
////////////////////////////////////////////////////////////////////////////////
|
||||
//
|
||||
// VerbsService
|
||||
|
BIN
tensorflow/contrib/verbs/verbs_with_0_copies.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 61 KiB |
1
tensorflow/contrib/verbs/verbs_with_0_copies.xml
Normal file
@ -0,0 +1 @@
|
||||
<mxfile userAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36" version="7.8.7" editor="www.draw.io" type="device"><diagram name="Page-1" id="74e2e168-ea6b-b213-b513-2b3c1d86103e">7Vxtc9o4EP41zKQfmsGW3/hIgPQ60/RyIZ1rPzHClsFXY1FZEOivP8mW8ZsAB2yXtHQ6jb2SJXl3n0e7K6cdMFhsPhC4nD9gB/kdtetsOmDYUVVFUw32g0u2scRUu7FgRjxHdEoFY+8nEsKk28pzUJjrSDH2qbfMC20cBMimORkkBL/ku7nYz8+6hDNUEoxt6Jel/3oOnQup0u2mDX8hbzYXU1u6aJhC+/uM4FUg5uuowI3+xM0LmIwl+odz6OCXjAiMOmBAMKbx1WIzQD7XbaK2+Ln7Pa27dRMU0CoP6CB+Yg39FUqWHC2MbhNlRK+D+APdDrh7mXsUjZfQ5q0vzPxMNqcLn90p7NL1fH+AfUzYfYAD1ulOzIAIRZu9y1R2L8+cCuEFomTLumx2mo8fEf5kiduX1DhWIptn7GIkQigcYrYbOlUKuxB6kevIkqjI8NkMd463zqnK+LHihrtjL0rfQ9+bBR3QZz185NK0lV3NxM9olHAJg0Q2ppDQm3dJE1tatjUjjqbOS+tfTSKbEskK2lqYcsua+r6PbUgRJwIUhJiEt03OqfI5xyhwbp5Hn8d/P02eRv98GY2f37VhAam2c7PVBVDGTg5Elmtzu1OCv6NMi2FbaOru5iuBVQLp/fgFefwqRhnAalcC4B3jngPghGxrR/ATstfPkTs+IAqHkMKbC/GQ2oD3ZenEsGNah+/ZNeTbLrTnqHkAPiHYLuxlHAjKVCBjg8q0+XsBGahtAlkxjkcryGGRnLjFhM7xDAfQH6XSu7yWMxr9D1G6FcEoXFHMROkInzBein579RjiFbGTEFIsjW3oM5R002IZX+NBbRPkQ+qt89HoWZpTG6LAiCQGBMUgJShc4iBshRvs9SfGDX4/3AZ2s7QbUcAAL7dRGsKvH79wyCPisWd/8hf/6EZv/2PlEeTsffs3B3dT+aX7tv4r0M20RbZfszff+GC3Or/dePRr7u6bmKgaJ2hlTkhS5fo4QTz6iD22lJ0pe728KU2jYKF4UeKp1Eh9QuA2023JO4QH5jELO4RZyECP9E9SttRH4hWkHrPTSTXm00rMd++RkD57Cw7cjjngf1UDLjjggmm4LPJGjN4kwhvMYTBDDmMccF8V53O8mK7C4xh3GHvY1MOMjYbMbzjCLgP3LW/zvePbfDiHS37p+mjT5xWfCKyOuBzaPgxDzz5Ioa5lI9vOIr5bRnxJzVNL1/TKiLckQYBaEfAZXesSVSeyM3kBFCI6tcgL8euUeKE0kNbt3jK/MUxLUSxD10wjP65SjW9OgHjiHm35y5k+Id0FuhflFIbSu1WtHlAVy1KApqs5U2pFU1Z1kcPDgoo70ikeUi5zfkNhyUl4OJh3gbypRUFTUuMUMeTQZoZHTH7Hudbj4aloWHiOE8Unsi0gH7M0wd+gzN+axH3UOqK28ob7Gf++qraK8U6bqpblw00VQqJM75GFyfI0r61yMM/+5MFadgVPwwc+xAvxorz0Jq7SdaITI8qM/e61S3/zqZsh8cvmQjigH9+S28zlRMK2i+2q7tWqKdmrW8rYrKIFtWr74/6M7Zwd1CwZdFcaztDBm4eJ1mqFA1Q4fn0jmU6yn+WQYlZESjtRrVpIdTTzxDi2OJBe9IWaaimleZR6ayOgQj29EfeTlNbOdT+j7H7gstzvcPZjgkaSqtIXEPUlVaC8JdR9rDqIo7Xf6VRVUlqMI2uCN9soQOXnDPcOyh4veJWOF0oDyyLj6PTkY7BmYGMXDs+p+IGu7/NPl45Fxa9S0ZEz1aHkdJf7aeBE77rAayReGoX0tEzjzQfxxePWdoP4JN6UAHyuVIKAyNFlCEeMSEnGUHzEPXY6vVbAXMX2ghkT6Ondc5QevFf3GZ05HnH9aHebe46DgibKBoqp5yzbKxt299Fb1rDFHOAku6pN2ZV/J/EnW9U0jlq115RRZamEYO0iL7o4CiDsHWmldgSu2+1y8ihBdvjQnzyMxuP+h9Ek/1Vcxt7xyCUWnjbgBRdWBwRWnqoVyTeq0uiyjkKgVq65Nmf8h9FzfzLss3+eRuPHvz+PR1cHkDgA0Np0AFm9rXH0XwnggP21Vglg/0lALfYv1s+vFpdY3NDbtHhT2XfNSXUi+LhYxP2ruCF3Qv47M8VRBQO9yvsOJYiyVLLm9773kO+E+VfPLpBulyzPHaSp7sRjXrqJRQFciMaQouWE2V70XGCKJtBxiBB8R9v4in+mPYk/0z6SGZ9w6nUMuTwvNqaGbnTKNUDXVaMa4KVh2MhjWJV86QRkGbZeB4ab/tWihjAsg1m31PvPBbpUPweBfoXtebAFkmCrOdjKvk+8wvYPhO3r9ucrsk9Att7mhpyMcUV2ceqC818+viueVF1xtwd3hqR2XRfu2G36XxzEJ8/p/yMBRv8D</diagram></mxfile>
|
BIN
tensorflow/contrib/verbs/verbs_with_0_copies_phase1_protocol.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 87 KiB |
@ -0,0 +1 @@
|
||||
<mxfile userAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" version="7.8.4" editor="www.draw.io" type="device"><diagram name="Page-1" id="74e2e168-ea6b-b213-b513-2b3c1d86103e">7Vxbc5s4FP41nuk+pMMd8ujYTrYzTZqNk9nmyaOAbNhgRIWc2P31K4G4yzZ1AHtaZzJjOBLS8TnnOzeRDNTRcn2DQejeIgf6A0Vy1gN1PFAUWVMM+sEom4RiGpcJYYE9h0/KCVPvJ+REiVNXngOj0kSCkE+8sEy0URBAm5RoAGP0Xp42R3551xAsYI0wtYFfp/7rOcTlVFmS8oG/obdw+daWzgdegP26wGgV8P0GijqPf5LhJUjX4vMjFzjovUBSJwN1hBEiydVyPYI+k20qtuS56y2jGd8YBqTJA1bywBvwVzDl2PDpo1eO98b4IxsuE+PHijF1ReCaXADfWwQDdUhn+HBO8lF6teCf8SpRCIKUNiUAk09/pUOUqeJogRxvXaa2z01Ke8ECDnpkDCxDehG8ROzjgs4c+j6yAYGPMIgQjkoC64WBKQycT4+Tu+m3h9nD5J+nyfSxax62a6K0m1LaRYlxBpklS3T43fUInIbAZqPv1C9RmkuWPr2Ts6ffIKZMbQWLnEGQujaIlpDgDZ3CH7A4aLlTkw1+/567CCX1EG7BO2RuA3C3tMiWzqFJLzg6xUhNPUbrUH2A9ltia7eQgDEgoHOTa6juNnZjBj2GoE9M1UHc/X4xZq+erq8nDLPT+29308nWTU8LRqrSJ4xkQwCjikCgQ5MBfoswcdECBcCf5NSrssgK4vkPErLh+QxYEURJ+QpfEQpLYmQb7RYi5QutsJ3O4rzSQLqA6TRNLGwMfUC8t/L6H5Kc0pEDivHiON/wU+hQyDzAKERBBLsHDfN8XylM/WG0Cezu9/syj/XyY+VhZjvDPgP7gKGsJRL7LiMUbm7unx7R6F6UXT2dUp7XqSCmEHuUsZ/wqN/45HMnQztq8qQfw8lT0eDN9+LNM1vss85u1x75Xrp75hsdGBq0emhIqvAPhAb+6D3y6M6ZKi+VsipNo6KhhAf+VK6kIcZgU5gWsglR831Us1LL7plvWLvnW9rO+fQi4Ti3sEyGzQKmVguY1x6OyKO3hMyZmCP2W/MyBbeQYDdNy0cuCBbQoX5GvW6KchctX1bRfoQ7NCTZxEPU2YypWTFEdoH6nnO9y/25XuSCkF3Ofbgess5RDFWHX45tH0SRZ5eFNfd8f4R8hOMl0gZPAe9SHe+HodoS5HuKWOIFieoCgaa0D2K/mrwrVewn3NewX1tI1aXPpiwZpiXLlqFrplFeV27mUw6AZWoEfVlFi/5cGhxR9bpx+VmxLlVFtixZ1XSlpDCtqrCmhrB7WbVhbDnEDtSaHTzDqGYKLA8rKzoiGL3CVNUBCmBF+5zEk7exTfUMKf2KuVKPlRt8YOk5TpxpiJxzOfvowherdV+sCcxHaSP/qofCO/T7itqSjihqUYOjq+KKFUD3KBSW7H2VQBfbUqgiAw/jW7bCO6bap5+fkr7cID5BIlSrv8r4iVVThsDAusurVH1/BO2zvOI1VJZwHRx0FVMQdLsposyqBrVmgW57EfWRUGjWFLqlIXdidq/12kVQ6xlDTSKnXU+kAaZk4KZY5P1klXKloNDMA/PI6kJ6VeMtdSVq+8jtdg3UBgcUnRiZoEl1oJEZdSNTj2pku2sMU+2kdEnbSR2ULmrdX7d9FjxK8qLf6ShY0Fo78FSmvyOWETtiubl/aqSHftgaw0h45LGbq/hJWqzte+TEEm3rmHl2muwIYO7KjYDA62ERziF1p7ggdbbiFqEfXpfTUsr2ggUl6PndY5zBXyjbNIgoY3M/jmQuLdth0E2JXlLsZUO9VrP0g9QqakC2olb2FsifrNRqedCrVkXFAQ9vVSc3R3EaYWfpWK5ImphJEuOxBtnx7XB2O5lOhzeTWfntvILCk5VrLvWlAzM4sZ5b1mNLD5gtgfJFOWYbTTet3t/sTvnZa15n5W9Tvqr1qXxRO6xz5Sfv+J21L9C+1iv0t/fbW9F+loLHK1T71tloVe1na8hydr1Pa+iqMm+54E4JX5bLZH4TE2UGyv6Spboq902/ZH27akDRAUzL3/vag74Tlb96kUGyCeFAGfHOAIzIzKNWuk5IAVjywYjAcEZ1z2cuEYEz4DiYE17hJrmidgRmDiBgb/F7wOHTPuRSzTnGi6EbA1EXULHtE8SwXMawInhvSBVl8nobGO76j6I6wrAIZlJt9p8LdKF8dgL9DNuPwVYVJGLdwVb0tt8Ztn8gbD8Un7e+kXvG/i9hX+8zZKdrnLFf3boCj9P3AA2PAs+424I7Q9D0bgt39Db/1wTJuXX+/x/Uyf8=</diagram></mxfile>
|
@ -279,6 +279,7 @@ cc_library(
|
||||
"platform/platform.h",
|
||||
"platform/protobuf.h",
|
||||
"platform/types.h",
|
||||
"platform/windows/cpu_info.h",
|
||||
"lib/bfloat16/bfloat16.h",
|
||||
] + tf_additional_proto_hdrs() + glob(tf_env_time_hdrs()),
|
||||
copts = tf_copts(),
|
||||
@ -865,6 +866,7 @@ cc_library(
|
||||
"//tensorflow/core/kernels:mkl_pooling_ops",
|
||||
"//tensorflow/core/kernels:mkl_relu_op",
|
||||
"//tensorflow/core/kernels:mkl_reshape_op",
|
||||
"//tensorflow/core/kernels:mkl_softmax_op",
|
||||
"//tensorflow/core/kernels:mkl_tfconv_op",
|
||||
"//tensorflow/core/kernels:mkl_aggregate_ops",
|
||||
]),
|
||||
@ -2831,6 +2833,7 @@ tf_cc_test_mkl(
|
||||
"//tensorflow/core/kernels:mkl_pooling_ops",
|
||||
"//tensorflow/core/kernels:mkl_relu_op",
|
||||
"//tensorflow/core/kernels:mkl_reshape_op",
|
||||
"//tensorflow/core/kernels:mkl_softmax_op",
|
||||
"//tensorflow/core/kernels:mkl_tfconv_op",
|
||||
]),
|
||||
)
|
||||
|
@ -453,6 +453,13 @@ inline bool DataTypeIsInteger(DataType dt) {
|
||||
return kDataTypeIsInteger.Contains(dt);
|
||||
}
|
||||
|
||||
// Is the dtype a signed integral type?
|
||||
constexpr DataTypeSet kDataTypeIsSigned =
|
||||
ToSet(DT_INT8) | ToSet(DT_INT16) | ToSet(DT_INT32) | ToSet(DT_INT64);
|
||||
inline bool DataTypeIsSigned(DataType dt) {
|
||||
return kDataTypeIsSigned.Contains(dt);
|
||||
}
|
||||
|
||||
// Is the dtype an unsigned integral type?
|
||||
constexpr DataTypeSet kDataTypeIsUnsigned =
|
||||
ToSet(DT_UINT8) | ToSet(DT_UINT16) | ToSet(DT_UINT32) | ToSet(DT_UINT64);
|
||||
|
@ -2456,9 +2456,9 @@ class MklLayoutRewritePass : public GraphOptimizationPass {
|
||||
// NOTE: names are alphabetically sorted.
|
||||
rinfo_.push_back({csinfo_.addn, mkl_op_registry::GetMklOpName(csinfo_.addn),
|
||||
CopyAttrsAddN, AddNRewrite});
|
||||
rinfo_.push_back({csinfo_.add,
|
||||
/* rinfo_.push_back({csinfo_.add,
|
||||
mkl_op_registry::GetMklOpName(csinfo_.add),
|
||||
CopyAttrsDataType, AlwaysRewrite});
|
||||
CopyAttrsDataType, AlwaysRewrite}); */
|
||||
rinfo_.push_back({csinfo_.avg_pool,
|
||||
mkl_op_registry::GetMklOpName(csinfo_.avg_pool),
|
||||
CopyAttrsPooling, AlwaysRewrite});
|
||||
@ -3117,7 +3117,9 @@ void MklLayoutRewritePass::GetDummyMklTensorNode(std::unique_ptr<Graph>* g,
|
||||
Node* orig_input0 = nullptr;
|
||||
TF_CHECK_OK(orig_node->input_node(0,
|
||||
const_cast<const Node**>(&orig_input0)));
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(orig_input0, *out));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(orig_input0, *out, true));
|
||||
}
|
||||
|
||||
(*out)->set_assigned_device_name(orig_node->assigned_device_name());
|
||||
@ -3382,8 +3384,8 @@ void MklLayoutRewritePass::GetDummyWorkspaceTensorNode(
|
||||
std::unique_ptr<Graph>* g, Node** out, Node* orig_node) {
|
||||
// We use a tensor of shape {1} and value 0 to represent
|
||||
// dummy float tensor. We need this as a dummy workspace tensor.
|
||||
// Workspace tensor has type float.
|
||||
const DataType dt = DataTypeToEnum<float>::v();
|
||||
// Workspace tensor has type uint8.
|
||||
const DataType dt = DataTypeToEnum<uint8>::v();
|
||||
TensorProto proto;
|
||||
proto.set_dtype(dt);
|
||||
float zero[1] = {0};
|
||||
@ -3413,7 +3415,9 @@ void MklLayoutRewritePass::GetDummyWorkspaceTensorNode(
|
||||
Node* orig_input0 = nullptr;
|
||||
TF_CHECK_OK(orig_node->input_node(0,
|
||||
const_cast<const Node**>(&orig_input0)));
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(orig_input0, *out));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(orig_input0, *out, true));
|
||||
}
|
||||
|
||||
(*out)->set_assigned_device_name(orig_node->assigned_device_name());
|
||||
@ -3863,12 +3867,16 @@ Status MklLayoutRewritePass::MergeConv2DWithBiasAdd(std::unique_ptr<Graph>* g,
|
||||
// node are already copied in BuildNode. We handle control edges now.
|
||||
for (const Edge* e : pred->in_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
|
||||
}
|
||||
}
|
||||
for (const Edge* e : succ->in_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
|
||||
}
|
||||
}
|
||||
|
||||
@ -3876,14 +3884,18 @@ Status MklLayoutRewritePass::MergeConv2DWithBiasAdd(std::unique_ptr<Graph>* g,
|
||||
// First, we will fix outgoing control edges from 'pred' node.
|
||||
for (const Edge* e : pred->out_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
|
||||
}
|
||||
}
|
||||
|
||||
// Second, we will fix outgoing control and data edges from 'succ' node.
|
||||
for (const Edge* e : succ->out_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
|
||||
} else {
|
||||
// BiasAdd has only 1 output (at slot 0) and merged node also has only 1
|
||||
// output (at slot 0).
|
||||
@ -3966,12 +3978,16 @@ Status MklLayoutRewritePass::MergeConv2DBackpropFilterWithBiasAddGrad(
|
||||
// edges now.
|
||||
for (const Edge* e : badd->in_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
|
||||
}
|
||||
}
|
||||
for (const Edge* e : fltr->in_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
|
||||
}
|
||||
}
|
||||
|
||||
@ -3987,7 +4003,9 @@ Status MklLayoutRewritePass::MergeConv2DBackpropFilterWithBiasAddGrad(
|
||||
|
||||
for (const Edge* e : badd->out_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
|
||||
} else {
|
||||
CHECK_NOTNULL((*g)->AddEdge(new_node, kMergedNodeBiasGradOutputIdx,
|
||||
e->dst(), e->dst_input()));
|
||||
@ -3997,7 +4015,11 @@ Status MklLayoutRewritePass::MergeConv2DBackpropFilterWithBiasAddGrad(
|
||||
// Second, we will fix outgoing control and data edges from 'fltr' node.
|
||||
for (const Edge* e : fltr->out_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
|
||||
// We allow duplicate edge for this case since we already add control
|
||||
// edge from new_node in line 3990. Line below could be adding same
|
||||
// edge to same destination again. In such case, if we do not allow
|
||||
// duplicate edge, then this call will fail.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
|
||||
} else {
|
||||
CHECK_NOTNULL((*g)->AddEdge(new_node, kMergedNodeFilterGradOutputIdx,
|
||||
e->dst(), e->dst_input()));
|
||||
@ -4091,7 +4113,9 @@ Status MklLayoutRewritePass::RewriteNode(std::unique_ptr<Graph>* g,
|
||||
// already copied in BuildNode. We need to handle control edges now.
|
||||
for (const Edge* e : orig_node->in_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(e->src(), new_node, true));
|
||||
}
|
||||
}
|
||||
|
||||
@ -4104,7 +4128,9 @@ Status MklLayoutRewritePass::RewriteNode(std::unique_ptr<Graph>* g,
|
||||
// GetTensorDataIndex provides this mapping function.
|
||||
for (const Edge* e : orig_node->out_edges()) {
|
||||
if (e->IsControlEdge()) {
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst()));
|
||||
// Allow duplicate while adding control edge as it would fail (return
|
||||
// NULL) if we try to add duplicate edge.
|
||||
CHECK_NOTNULL((*g)->AddControlEdge(new_node, e->dst(), true));
|
||||
} else {
|
||||
CHECK_NOTNULL((*g)->AddEdge(new_node, GetTensorDataIndex(e->src_output(),
|
||||
e->src()->num_outputs()),
|
||||
|
@@ -5846,6 +5846,23 @@ tf_mkl_kernel_library(
    ]),
)

tf_mkl_kernel_library(
    name = "mkl_softmax_op",
    prefix = "mkl_softmax",
    deps = [
        ":bounds_check",
        ":ops_util",
        "//tensorflow/core:core_cpu",
        "//tensorflow/core:framework",
        "//tensorflow/core:lib",
        "//tensorflow/core:lib_internal",
        "//tensorflow/core:nn_ops_op_lib",
    ] + if_mkl([
        "//third_party/mkl:intel_binary_blob",
        "@mkl_dnn//:mkl_dnn",
    ]),
)

tf_mkl_kernel_library(
    name = "mkl_fused_batch_norm_op",
    srcs = ["mkl_fused_batch_norm_op.cc"],

@@ -427,7 +427,7 @@ inline DeviceLapackInfo CudaSolver::GetDeviceLapackInfo(
    int64 size, const string& debug_info) {
  DeviceLapackInfo new_dev_info(context_, size, debug_info);
  scratch_tensor_refs_.emplace_back(new_dev_info.tensor());
  return std::move(new_dev_info);
  return new_dev_info;
}

}  // namespace tensorflow

@@ -16,8 +16,9 @@ limitations under the License.
#include "tensorflow/core/kernels/cwise_ops_common.h"

namespace tensorflow {
REGISTER7(BinaryOp, CPU, "Pow", functor::pow, float, Eigen::half, double, int32,
          int64, complex64, complex128);
REGISTER5(BinaryOp, CPU, "Pow", functor::pow, float, Eigen::half, double,
          complex64, complex128);
REGISTER2(BinaryOp, CPU, "Pow", functor::safe_pow, int32, int64);

#if GOOGLE_CUDA
REGISTER4(BinaryOp, GPU, "Pow", functor::pow, float, Eigen::half, double,
@@ -25,5 +26,5 @@ REGISTER4(BinaryOp, GPU, "Pow", functor::pow, float, Eigen::half, double,
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER2(BinaryOp, SYCL, "Pow", functor::pow, float, double);
#endif // TENSORFLOW_USE_SYCL
#endif  // TENSORFLOW_USE_SYCL
}  // namespace tensorflow

@@ -21,6 +21,7 @@ limitations under the License.
#include <type_traits>

#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

#include "tensorflow/core/framework/numeric_types.h"
#include "tensorflow/core/framework/tensor_types.h"
#include "tensorflow/core/kernels/bounds_check.h"

@@ -115,6 +116,35 @@ struct functor_traits<scalar_binary_pow_op_google<Scalar, Exponent>> {
  enum { Cost = 5 * NumTraits<Scalar>::MulCost, PacketAccess = false };
};

template <typename Scalar, typename Exponent>
struct safe_scalar_binary_pow_op {
  static_assert(std::is_integral<Scalar>::value, "Integer type expected");
  static_assert(std::is_integral<Exponent>::value &&
                    std::is_signed<Exponent>::value,
                "Signed integer type expected");

  bool* const error;

  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE safe_scalar_binary_pow_op(bool* error)
      : error(error) {}

  EIGEN_DEVICE_FUNC inline Scalar operator()(const Scalar& a,
                                             const Exponent& b) const {
    const Exponent safe_b = tensorflow::internal::SubtleMustCopy(b);
    if (TF_PREDICT_TRUE(safe_b >= 0)) {
      return numext::pow(a, safe_b);
    } else {
      *error = true;
      return 0;
    }
  }
};

template <typename Scalar, typename Exponent>
struct functor_traits<safe_scalar_binary_pow_op<Scalar, Exponent>> {
  enum { Cost = 5 * NumTraits<Scalar>::MulCost, PacketAccess = false };
};

template <typename T, typename DivOrMod>
struct safe_div_or_mod_op {
  static_assert(std::is_integral<T>::value, "Integer type expected");

@@ -741,6 +771,11 @@ struct floor_div_real : base<T, Eigen::internal::google_floor_div_real<T>> {};
template <typename T>
struct pow : base<T, Eigen::internal::scalar_binary_pow_op_google<T, T>> {};

template <typename T>
struct safe_pow : base<T, Eigen::internal::safe_scalar_binary_pow_op<T, T>> {
  static const bool has_errors = true;
};

template <typename T>
struct maximum : base<T, Eigen::internal::scalar_max_op<T>> {};

@@ -40,6 +40,11 @@ void BinaryOpShared::SetComputeError(OpKernelContext* ctx) {
  if ((op == "Div" || op == "Mod" || op == "FloorMod" || op == "FloorDiv") &&
      DataTypeIsInteger(ctx->op_kernel().input_type(0))) {
    ctx->CtxFailure(errors::InvalidArgument("Integer division by zero"));
  } else if ((op == "Pow") &&
             DataTypeIsInteger(ctx->op_kernel().input_type(0)) &&
             DataTypeIsSigned(ctx->op_kernel().input_type(1))) {
    ctx->CtxFailure(errors::InvalidArgument(
        "Integers to negative integer powers are not allowed"));
  } else {
    ctx->CtxFailure(
        errors::Internal("Unexpected error in binary operator "
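A minimal usage sketch of the safe-pow functor wired up above (assuming the cwise_ops.h header is on the include path; the standalone main() scaffolding is illustrative only):

#include <iostream>
#include "tensorflow/core/kernels/cwise_ops.h"

int main() {
  bool error = false;
  Eigen::internal::safe_scalar_binary_pow_op<int, int> pow_op(&error);
  std::cout << pow_op(2, 10) << std::endl;  // 1024, error stays false
  std::cout << pow_op(2, -1) << std::endl;  // 0, and error is set to true; the
                                            // kernel then reports "Integers to
                                            // negative integer powers are not
                                            // allowed" via SetComputeError.
  return 0;
}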
@ -87,11 +87,10 @@ class DecodeImageOp : public OpKernel {
|
||||
channels_ = 3;
|
||||
} else {
|
||||
OP_REQUIRES_OK(context, context->GetAttr("channels", &channels_));
|
||||
OP_REQUIRES(
|
||||
context,
|
||||
channels_ == 0 || channels_ == 1 || channels_ == 3 || channels_ == 4,
|
||||
errors::InvalidArgument("channels must be 0, 1, 3, or 4, got ",
|
||||
channels_));
|
||||
OP_REQUIRES(context, channels_ == 0 || channels_ == 1 || channels_ == 3 ||
|
||||
channels_ == 4,
|
||||
errors::InvalidArgument(
|
||||
"channels must be 0, 1, 3, or 4, got ", channels_));
|
||||
}
|
||||
flags_.components = channels_;
|
||||
|
||||
@ -115,9 +114,8 @@ class DecodeImageOp : public OpKernel {
|
||||
|
||||
if (format_ == kJpgFormat) {
|
||||
OP_REQUIRES_OK(context, context->GetAttr("ratio", &flags_.ratio));
|
||||
OP_REQUIRES(context,
|
||||
flags_.ratio == 1 || flags_.ratio == 2 || flags_.ratio == 4 ||
|
||||
flags_.ratio == 8,
|
||||
OP_REQUIRES(context, flags_.ratio == 1 || flags_.ratio == 2 ||
|
||||
flags_.ratio == 4 || flags_.ratio == 8,
|
||||
errors::InvalidArgument("ratio must be 1, 2, 4, or 8, got ",
|
||||
flags_.ratio));
|
||||
OP_REQUIRES_OK(context, context->GetAttr("fancy_upscaling",
|
||||
@ -132,9 +130,8 @@ class DecodeImageOp : public OpKernel {
|
||||
string dct_method;
|
||||
OP_REQUIRES_OK(context, context->GetAttr("dct_method", &dct_method));
|
||||
OP_REQUIRES(
|
||||
context,
|
||||
(dct_method.empty() || dct_method == "INTEGER_FAST" ||
|
||||
dct_method == "INTEGER_ACCURATE"),
|
||||
context, (dct_method.empty() || dct_method == "INTEGER_FAST" ||
|
||||
dct_method == "INTEGER_ACCURATE"),
|
||||
errors::InvalidArgument("dct_method must be one of "
|
||||
"{'', 'INTEGER_FAST', 'INTEGER_ACCURATE'}"));
|
||||
if (dct_method == "INTEGER_FAST") {
|
||||
@ -160,9 +157,9 @@ class DecodeImageOp : public OpKernel {
|
||||
errors::InvalidArgument("Expected image (JPEG, PNG, or GIF), got ",
|
||||
FileFormatString(magic, input)));
|
||||
OP_REQUIRES(context, input.size() <= std::numeric_limits<int>::max(),
|
||||
errors::InvalidArgument(
|
||||
FileFormatString(magic, input),
|
||||
" contents are too large for int: ", input.size()));
|
||||
errors::InvalidArgument(FileFormatString(magic, input),
|
||||
" contents are too large for int: ",
|
||||
input.size()));
|
||||
OP_REQUIRES(context, magic == kPngFormat || channel_bits_ == 8,
|
||||
errors::InvalidArgument(FileFormatString(magic, input),
|
||||
" does not support uint16 output"));
|
||||
@ -215,10 +212,9 @@ class DecodeImageOp : public OpKernel {
|
||||
input.data(), input.size(), flags, nullptr /* nwarn */,
|
||||
[=, &output](int width, int height, int channels) -> uint8* {
|
||||
Status status(context->allocate_output(
|
||||
0,
|
||||
format_ == kGifFormat
|
||||
? TensorShape({1, height, width, channels})
|
||||
: TensorShape({height, width, channels}),
|
||||
0, format_ == kGifFormat
|
||||
? TensorShape({1, height, width, channels})
|
||||
: TensorShape({height, width, channels}),
|
||||
&output));
|
||||
if (!status.ok()) {
|
||||
VLOG(1) << status;
|
||||
@ -294,6 +290,7 @@ class DecodeImageOp : public OpKernel {
|
||||
|
||||
// Decode GIF, allocating tensor once the size is known.
|
||||
Tensor* output = nullptr;
|
||||
string error_string;
|
||||
OP_REQUIRES(
|
||||
context,
|
||||
gif::Decode(input.data(), input.size(),
|
||||
@ -320,8 +317,10 @@ class DecodeImageOp : public OpKernel {
|
||||
return nullptr;
|
||||
}
|
||||
return output->flat<uint8>().data();
|
||||
}),
|
||||
errors::InvalidArgument("Invalid GIF data, size ", input.size()));
|
||||
},
|
||||
&error_string),
|
||||
errors::InvalidArgument("Invalid GIF data (size ", input.size(), "), ",
|
||||
error_string));
|
||||
}
|
||||
|
||||
private:
|
||||
|
@ -539,6 +539,7 @@ struct MatMulFunctor<SYCLDevice, T> {
|
||||
REGISTER_KERNEL_BUILDER( \
|
||||
Name("MatMul").Device(DEVICE_CPU).TypeConstraint<T>("T").Label("eigen"), \
|
||||
MatMulOp<CPUDevice, T, false /* cublas, ignored for CPU */>);
|
||||
|
||||
#define REGISTER_CPU(T) \
|
||||
REGISTER_KERNEL_BUILDER( \
|
||||
Name("MatMul").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
|
||||
|
@ -61,6 +61,18 @@ class MklAddNOp : public OpKernel {
|
||||
GetMklShape(ctx, src2_idx, &(mkl_context.input2_shape));
|
||||
bool input2_in_mkl_format = mkl_context.input2_shape.IsMklTensor();
|
||||
|
||||
// If the shapes of the two tensors are not the same, raise an op error.
|
||||
TensorShape src1_shape, src2_shape;
|
||||
src1_shape = input0.shape();
|
||||
src2_shape = input1.shape();
|
||||
if (!src1_shape.IsSameSize(src2_shape) ){
|
||||
ctx->SetStatus(
|
||||
errors::InvalidArgument(
|
||||
"Inputs to operation ", this->name(), " of type ", this->type_string(),
|
||||
" must have the same size and shape. Input 0: ",
|
||||
src1_shape.DebugString(), " != input 1: ",
|
||||
src2_shape.DebugString()));
|
||||
}
|
||||
// handle the case of a scalar
|
||||
if (!input1_in_mkl_format && input0.dims() == 0) {
|
||||
const TensorShape& o_shape = input0.shape();
|
||||
@ -307,6 +319,18 @@ class MklAddNOp : public OpKernel {
|
||||
src1_mkl_shape.GetDimension(): src1_tensor.dims();
|
||||
int src2_dims_size = input2_in_mkl_format?
|
||||
src2_mkl_shape.GetDimension(): src2_tensor.dims();
|
||||
// If the shapes of the two tensors are not the same, raise an op error.
|
||||
TensorShape src1_shape, src2_shape;
|
||||
src1_shape = src1_tensor.shape();
|
||||
src2_shape = src2_tensor.shape();
|
||||
if (!src1_shape.IsSameSize(src2_shape) ){
|
||||
ctx->SetStatus(
|
||||
errors::InvalidArgument(
|
||||
"Inputs to operation ", this->name(), " of type ", this->type_string(),
|
||||
" must have the same size and shape. Input 0: ",
|
||||
src1_shape.DebugString(), " != input 1: ",
|
||||
src2_shape.DebugString()));
|
||||
}
|
||||
|
||||
if (!input1_in_mkl_format && src1_dims_size == 0) {
|
||||
Tensor* dst_tensor = nullptr;
|
||||
|
@ -598,7 +598,6 @@ class MklConcatOp : public OpKernel {
|
||||
concat_dim_tensor.shape().DebugString()));
|
||||
int32 concat_dim = internal::SubtleMustCopy(
|
||||
concat_dim_tensor.scalar<int32>()());
|
||||
if (concat_dim < 0) concat_dim = N + concat_dim;
|
||||
|
||||
// check that ranks of all tensors match
|
||||
// and that their shapes match except for concat_dim.
|
||||
@ -609,6 +608,9 @@ class MklConcatOp : public OpKernel {
|
||||
input_shapes[0].GetTfShape() :
|
||||
input_tensors[0].shape();
|
||||
size_t expected_dims = expected_shape.dims();
|
||||
|
||||
if (concat_dim < 0) concat_dim = expected_dims + concat_dim;
|
||||
|
||||
for (auto& s : input_shapes) {
|
||||
if (s == expected_shape) {++i; continue;}
|
||||
|
||||
|
@ -467,6 +467,13 @@ class MklConv2DCustomBackpropFilterOp :
|
||||
return filter_tf_shape;
|
||||
}
|
||||
|
||||
TensorShape GetOutputTfShape(const TensorShape& input_shape,
|
||||
const TensorShape& filter_shape,
|
||||
const TensorShape& outbprop_shape) {
|
||||
// Shape of output of Conv2DBackpropFilter is same as shape of filter.
|
||||
return filter_shape;
|
||||
}
|
||||
|
||||
const memory::dims& GetOutputDims(const memory::dims& fwd_input_dims,
|
||||
const memory::dims& fwd_filter_dims) {
|
||||
// Shape of output of Conv2DBackpropFilter is same as shape of filter.
|
||||
|
@ -396,6 +396,13 @@ class MklConv2DCustomBackpropInputOp :
|
||||
return GetTfShape(context, kInputIndex_Filter);
|
||||
}
|
||||
|
||||
TensorShape GetOutputTfShape(const TensorShape& input_shape,
|
||||
const TensorShape& filter_shape,
|
||||
const TensorShape& outbprop_shape) {
|
||||
// Output Shape of Conv2DBackpropInput is same as shape of Conv2D 'input'.
|
||||
return input_shape;
|
||||
}
|
||||
|
||||
const memory::dims& GetOutputDims(const memory::dims& fwd_input_dims,
|
||||
const memory::dims& fwd_filter_dims) {
|
||||
// Output Shape of Conv2DBackpropInput is same as shape of Conv2D 'input'.
|
||||
|
@ -551,6 +551,13 @@ class MklConv2DOp : public OpKernel {
|
||||
output_mkl_shape.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, kOutputIndex_Dst, &output_tensor,
|
||||
src_tf_shape, output_mkl_shape);
|
||||
|
||||
// MklConv2D also outputs converted filter as 2nd output of Conv2D.
|
||||
filter_mkl_shape.SetMklTensor(false);
|
||||
Tensor* output_filter_tensor = nullptr;
|
||||
AllocateOutputSetMklShape(context, kOutputIndex_Filter,
|
||||
&output_filter_tensor,
|
||||
filter_tf_shape, filter_mkl_shape);
|
||||
return;
|
||||
}
|
||||
|
||||
|
@ -390,6 +390,29 @@ class MklConv2DBackpropCommonOp : public OpKernel {
|
||||
TensorShape filter_tf_shape = MakeFilterTfShape(context, filter_tensor);
|
||||
TensorShape outbprop_tf_shape = GetTfShape(context, kOutbpropIdx);
|
||||
|
||||
// Corner cases: output with 0 elements and 0 batch size.
|
||||
Tensor* output_tensor = nullptr;
|
||||
if (input_tf_shape.num_elements() == 0 ||
|
||||
filter_tf_shape.num_elements() == 0 ||
|
||||
outbprop_tf_shape.num_elements() == 0) {
|
||||
MklDnnShape output_mkl_shape;
|
||||
output_mkl_shape.SetMklTensor(false);
|
||||
TensorShape output_tf_shape = GetOutputTfShape(input_tf_shape,
|
||||
filter_tf_shape,
|
||||
outbprop_tf_shape);
|
||||
const int kOutputIdx = 0;
|
||||
AllocateOutputSetMklShape(context, kOutputIdx, &output_tensor,
|
||||
output_tf_shape, output_mkl_shape);
|
||||
CHECK_NOTNULL(output_tensor);
|
||||
|
||||
// if output tensor has more than 0 elements, we need to 0 them out.
|
||||
for (size_t i = 0; i < output_tf_shape.num_elements(); ++i) {
|
||||
output_tensor->flat<T>().data()[i] = 0;
|
||||
}
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
// By default, all dims are in MKL order. Only dims in TF order
|
||||
// are those with prefix tf_order.
|
||||
memory::dims outbprop_dims, fwd_input_dims, fwd_filter_dims;
|
||||
@ -471,7 +494,6 @@ class MklConv2DBackpropCommonOp : public OpKernel {
|
||||
output.SetOpMemDesc(bwd_output_dims, memory::format::any);
|
||||
|
||||
// Operator-specific call to create and execute primitive.
|
||||
Tensor* output_tensor = nullptr;
|
||||
CreatePrimitive(context, cpu_engine, fwd_pd, &input, &filter,
|
||||
&outbackprop, &output, &output_tensor,
|
||||
strides, padding_l, padding_r,
|
||||
@ -507,6 +529,11 @@ class MklConv2DBackpropCommonOp : public OpKernel {
|
||||
virtual TensorShape MakeFilterTfShape(OpKernelContext* context,
|
||||
const Tensor& filter_tensor) = 0;
|
||||
|
||||
/// Get the TensorFlow shape of output tensor.
|
||||
virtual TensorShape GetOutputTfShape(const TensorShape& input_shape,
|
||||
const TensorShape& filter_shape,
|
||||
const TensorShape& outbprop_shape) = 0;
|
||||
|
||||
/// Get shape of output in MKL-DNN order. Computes shape of output from
|
||||
/// input shape (fwd_input_dims) and filter shape (fwd_filter_dims).
|
||||
virtual
|
||||
|
@ -703,27 +703,31 @@ class MklFusedBatchNormOp : public OpKernel {
|
||||
void Compute(OpKernelContext* context) override {
|
||||
try {
|
||||
auto cpu_engine = engine(engine::cpu, 0);
|
||||
const size_t src_index = 0; // index of src input tensor
|
||||
const size_t scale_index = 1; // index of scale tensor
|
||||
const size_t shift_index = 2; // index of shift tensor
|
||||
const size_t mean_index = 3; // index of est_mean tensor
|
||||
const size_t var_index = 4; // index of est_variance tensor
|
||||
const size_t kSrcIndex = 0; // index of src input tensor
|
||||
const size_t kScaleIndex = 1; // index of scale tensor
|
||||
const size_t kShiftIndex = 2; // index of shift tensor
|
||||
const size_t kMeanIndex = 3; // index of est_mean tensor
|
||||
const size_t kVarianceIndex = 4; // index of est_variance tensor
|
||||
|
||||
const Tensor& src_tensor = MklGetInput(context, src_index);
|
||||
const Tensor& scale_tensor = MklGetInput(context, scale_index);
|
||||
const Tensor& shift_tensor = MklGetInput(context, shift_index);
|
||||
const Tensor& est_mean_tensor = MklGetInput(context, mean_index);
|
||||
const Tensor& est_variance_tensor = MklGetInput(context, var_index);
|
||||
const Tensor& src_tensor = MklGetInput(context, kSrcIndex);
|
||||
const Tensor& scale_tensor = MklGetInput(context, kScaleIndex);
|
||||
const Tensor& shift_tensor = MklGetInput(context, kShiftIndex);
|
||||
const Tensor& est_mean_tensor = MklGetInput(context, kMeanIndex);
|
||||
const Tensor& est_variance_tensor = MklGetInput(context,
|
||||
kVarianceIndex);
|
||||
|
||||
TensorShape tf_shape_src;
|
||||
MklDnnShape dnn_shape_src;
|
||||
GetMklShape(context, src_index, &dnn_shape_src);
|
||||
GetMklShape(context, kSrcIndex, &dnn_shape_src);
|
||||
|
||||
if (dnn_shape_src.IsMklTensor()) {
|
||||
tf_shape_src = dnn_shape_src.GetTfShape();
|
||||
OP_REQUIRES(context, dnn_shape_src.GetDimension() == 4,
|
||||
errors::InvalidArgument(
|
||||
"input must be 4-dimensional",
|
||||
src_tensor.shape().DebugString()));
|
||||
} else {
|
||||
tf_shape_src = src_tensor.shape();
|
||||
OP_REQUIRES(context, src_tensor.dims() == 4,
|
||||
errors::InvalidArgument(
|
||||
"input must be 4-dimensional",
|
||||
@ -756,39 +760,35 @@ class MklFusedBatchNormOp : public OpKernel {
|
||||
est_variance_tensor.shape().DebugString()));
|
||||
}
|
||||
|
||||
// special case: input with 0 element and 0 batch size
|
||||
Tensor* dst_tensor = nullptr;
|
||||
if (tf_shape_src.num_elements() == 0) {
|
||||
HandleEmptyInput(context,
|
||||
tf_shape_src,
|
||||
scale_tensor.shape(),
|
||||
&dst_tensor);
|
||||
return;
|
||||
}
|
||||
|
||||
if (dnn_shape_src.IsMklTensor())
|
||||
depth_ = dnn_shape_src.DimSize(MklDnnDims::Dim_C);
|
||||
else
|
||||
ExtractParams(context);
|
||||
|
||||
// Indices of output tensors
|
||||
const size_t dst_index = 0;
|
||||
const size_t batch_mean_index = 1;
|
||||
const size_t batch_variance_index = 2;
|
||||
const size_t saved_mean_index = 3;
|
||||
const size_t saved_variance_index = 4;
|
||||
const size_t kDstIndex = 0;
|
||||
|
||||
// allocate batch mean output tensor
|
||||
// allocate 4 output TF tensors
|
||||
Tensor* batch_mean_tensor = nullptr;
|
||||
MklDnnShape mkl_shape_batch_mean;
|
||||
mkl_shape_batch_mean.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context,
|
||||
batch_mean_index,
|
||||
&batch_mean_tensor,
|
||||
scale_tensor.shape(),
|
||||
mkl_shape_batch_mean);
|
||||
CHECK_NOTNULL(batch_mean_tensor);
|
||||
|
||||
// Batch variance
|
||||
Tensor* batch_variance_tensor = nullptr;
|
||||
MklDnnShape mkl_shape_batch_variance;
|
||||
mkl_shape_batch_variance.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context,
|
||||
batch_variance_index,
|
||||
&batch_variance_tensor,
|
||||
scale_tensor.shape(),
|
||||
mkl_shape_batch_variance);
|
||||
CHECK_NOTNULL(batch_variance_tensor);
|
||||
Tensor* saved_mean_tensor = nullptr;
|
||||
Tensor* saved_variance_tensor = nullptr;
|
||||
AllocateTFOutputs(context,
|
||||
scale_tensor.shape(),
|
||||
&batch_mean_tensor,
|
||||
&batch_variance_tensor,
|
||||
&saved_mean_tensor,
|
||||
&saved_variance_tensor);
|
||||
|
||||
if (is_training_)
|
||||
SetMeanVariance(*batch_mean_tensor, *batch_variance_tensor);
|
||||
@ -844,26 +844,6 @@ class MklFusedBatchNormOp : public OpKernel {
|
||||
weights_data[k + depth_] = shift_tf[k];
|
||||
}
|
||||
|
||||
// Mean and variance (without Bessel's correction) saved for backward
|
||||
// computation to serve as pre-computed mean and variance.
|
||||
Tensor* saved_mean_tensor = nullptr;
|
||||
MklDnnShape mkl_shape_saved_mean;
|
||||
mkl_shape_saved_mean.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, saved_mean_index,
|
||||
&saved_mean_tensor,
|
||||
scale_tensor.shape(),
|
||||
mkl_shape_saved_mean);
|
||||
CHECK_NOTNULL(saved_mean_tensor);
|
||||
|
||||
Tensor* saved_variance_tensor = nullptr;
|
||||
MklDnnShape mkl_shape_saved_variance;
|
||||
mkl_shape_saved_variance.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, saved_variance_index,
|
||||
&saved_variance_tensor,
|
||||
scale_tensor.shape(),
|
||||
mkl_shape_saved_variance);
|
||||
CHECK_NOTNULL(saved_variance_tensor);
|
||||
|
||||
// set mean primitive
|
||||
auto mean_desc = memory::desc({1, depth_},
|
||||
MklDnnType<T>(),
|
||||
@ -902,7 +882,6 @@ class MklFusedBatchNormOp : public OpKernel {
|
||||
// allocate dst tensor
|
||||
MklDnnShape dnn_shape_dst;
|
||||
TensorShape tf_shape_dst;
|
||||
Tensor* dst_tensor = nullptr;
|
||||
if (dnn_shape_src.IsMklTensor()) {
|
||||
dnn_shape_dst.SetMklTensor(true);
|
||||
auto dst_pd = bnrm_fwd_pd.dst_primitive_desc();
|
||||
@ -915,7 +894,7 @@ class MklFusedBatchNormOp : public OpKernel {
|
||||
dnn_shape_dst.SetMklTensor(false);
|
||||
tf_shape_dst = src_tensor.shape();
|
||||
}
|
||||
AllocateOutputSetMklShape(context, dst_index, &dst_tensor,
|
||||
AllocateOutputSetMklShape(context, kDstIndex, &dst_tensor,
|
||||
tf_shape_dst, dnn_shape_dst);
|
||||
|
||||
// Output of batchnorm has same shape as input.
|
||||
@ -958,10 +937,8 @@ class MklFusedBatchNormOp : public OpKernel {
|
||||
size_t adjust_size = orig_size - 1;
|
||||
adjust_factor = (static_cast<float>(orig_size)) / adjust_size;
|
||||
}
|
||||
T* batch_variance_data_tf = reinterpret_cast<T*>(
|
||||
batch_variance_tensor->flat<T>().data());
|
||||
for (int k=0; k < depth_; k++)
|
||||
batch_variance_data_tf[k] =
|
||||
batch_variance_tensor->flat<T>().data()[k] =
|
||||
(reinterpret_cast<T*>(variance_m.get_data_handle()))[k]
|
||||
* adjust_factor;
|
||||
} catch (mkldnn::error &e) {
|
||||
@ -994,8 +971,100 @@ class MklFusedBatchNormOp : public OpKernel {
|
||||
variance_values_ = reinterpret_cast<T*>(
|
||||
const_cast<T*>(variance.flat<T>().data()));
|
||||
}
|
||||
};
|
||||
|
||||
void HandleEmptyInput(OpKernelContext* context,
|
||||
TensorShape tf_shape_src,
|
||||
TensorShape tf_shape_scale,
|
||||
Tensor** dst_tensor) {
|
||||
CHECK_NOTNULL(dst_tensor);
|
||||
|
||||
const size_t kDstIndex = 0;
|
||||
MklDnnShape dnn_shape_dst;
|
||||
dnn_shape_dst.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, kDstIndex, dst_tensor,
|
||||
tf_shape_src, dnn_shape_dst);
|
||||
CHECK_NOTNULL(*dst_tensor);
|
||||
memset(const_cast<char*>((*dst_tensor)->tensor_data().data()), 0,
|
||||
(*dst_tensor)->tensor_data().size());
|
||||
|
||||
Tensor* batch_mean_tensor = nullptr;
|
||||
Tensor* batch_variance_tensor = nullptr;
|
||||
Tensor* saved_mean_tensor = nullptr;
|
||||
Tensor* saved_variance_tensor = nullptr;
|
||||
AllocateTFOutputs(context, tf_shape_scale,
|
||||
&batch_mean_tensor,
|
||||
&batch_variance_tensor,
|
||||
&saved_mean_tensor,
|
||||
&saved_variance_tensor);
|
||||
}
|
||||
|
||||
void AllocateTFOutputs(OpKernelContext* context,
|
||||
TensorShape tf_shape_scale,
|
||||
Tensor** batch_mean_tensor,
|
||||
Tensor** batch_variance_tensor,
|
||||
Tensor** saved_mean_tensor,
|
||||
Tensor** saved_variance_tensor) {
|
||||
CHECK_NOTNULL(batch_mean_tensor);
|
||||
CHECK_NOTNULL(batch_variance_tensor);
|
||||
CHECK_NOTNULL(saved_mean_tensor);
|
||||
CHECK_NOTNULL(saved_variance_tensor);
|
||||
|
||||
const size_t kBatchMeanIndex = 1;
|
||||
const size_t kBatchVarianceIndex = 2;
|
||||
const size_t kSavedMeanIndex = 3;
|
||||
const size_t kSavedVarianceIndex = 4;
|
||||
|
||||
// allocate batch mean output tensor
|
||||
MklDnnShape mkl_shape_batch_mean;
|
||||
mkl_shape_batch_mean.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context,
|
||||
kBatchMeanIndex,
|
||||
batch_mean_tensor,
|
||||
tf_shape_scale,
|
||||
mkl_shape_batch_mean);
|
||||
CHECK_NOTNULL(*batch_mean_tensor);
|
||||
// set NAN mean value in case of empty input tensor
|
||||
for (int k=0; k < tf_shape_scale.num_elements(); k++)
|
||||
(*batch_mean_tensor)->flat<T>().data()[k] = NAN;
|
||||
|
||||
// allocate batch variance output tensor
|
||||
MklDnnShape mkl_shape_batch_variance;
|
||||
mkl_shape_batch_variance.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context,
|
||||
kBatchVarianceIndex,
|
||||
batch_variance_tensor,
|
||||
tf_shape_scale,
|
||||
mkl_shape_batch_variance);
|
||||
CHECK_NOTNULL(*batch_variance_tensor);
|
||||
// set NAN variance value in case of empty input tensor
|
||||
for (int k=0; k < tf_shape_scale.num_elements(); k++)
|
||||
(*batch_variance_tensor)->flat<T>().data()[k] = NAN;
|
||||
|
||||
// Mean and variance (without Bessel's correction) saved for backward
|
||||
// computation to serve as pre-computed mean and variance.
|
||||
MklDnnShape mkl_shape_saved_mean;
|
||||
mkl_shape_saved_mean.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, kSavedMeanIndex,
|
||||
saved_mean_tensor,
|
||||
tf_shape_scale,
|
||||
mkl_shape_saved_mean);
|
||||
CHECK_NOTNULL(*saved_mean_tensor);
|
||||
// set NAN mean value in case of empty input tensor
|
||||
for (int k=0; k < tf_shape_scale.num_elements(); k++)
|
||||
(*saved_mean_tensor)->flat<T>().data()[k] = NAN;
|
||||
|
||||
MklDnnShape mkl_shape_saved_variance;
|
||||
mkl_shape_saved_variance.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, kSavedVarianceIndex,
|
||||
saved_variance_tensor,
|
||||
tf_shape_scale,
|
||||
mkl_shape_saved_variance);
|
||||
CHECK_NOTNULL(*saved_variance_tensor);
|
||||
// set NAN variance value in case of empty input tensor
|
||||
for (int k=0; k < tf_shape_scale.num_elements(); k++)
|
||||
(*saved_variance_tensor)->flat<T>().data()[k] = NAN;
|
||||
}
|
||||
};
|
||||
|
||||
template <typename Device, typename T>
|
||||
class MklFusedBatchNormGradOp : public OpKernel {
|
||||
@ -1009,34 +1078,37 @@ class MklFusedBatchNormGradOp : public OpKernel {
|
||||
OP_REQUIRES_OK(context, context->GetAttr("data_format", &tensor_format));
|
||||
OP_REQUIRES(context, FormatFromString(tensor_format, &tensor_format_),
|
||||
errors::InvalidArgument("Invalid data format"));
|
||||
OP_REQUIRES_OK(context, context->GetAttr("is_training", &is_training_));
|
||||
}
|
||||
|
||||
void Compute(OpKernelContext* context) override {
|
||||
try {
|
||||
auto cpu_engine = engine(engine::cpu, 0);
|
||||
|
||||
const size_t diff_dst_index = 0; // index of diff_dst tensor
|
||||
const size_t src_index = 1; // index of src input tensor
|
||||
const size_t scale_index = 2; // index of scale tensor
|
||||
const size_t mean_index = 3; // index of saved_mean tensor
|
||||
const size_t variance_index = 4; // index of saved_variance tensor
|
||||
const Tensor& diff_dst_tensor = MklGetInput(context, diff_dst_index);
|
||||
const Tensor& src_tensor = MklGetInput(context, src_index);
|
||||
const Tensor& scale_tensor = MklGetInput(context, scale_index);
|
||||
const Tensor& saved_mean_tensor = MklGetInput(context, mean_index);
|
||||
const size_t kDiffDstIndex = 0; // index of diff_dst tensor
|
||||
const size_t kSrcIndex = 1; // index of src input tensor
|
||||
const size_t kScaleIndex = 2; // index of scale tensor
|
||||
const size_t kMeanIndex = 3; // index of saved_mean tensor
|
||||
const size_t kVarianceIndex = 4; // index of saved_variance tensor
|
||||
const Tensor& diff_dst_tensor = MklGetInput(context, kDiffDstIndex);
|
||||
const Tensor& src_tensor = MklGetInput(context, kSrcIndex);
|
||||
const Tensor& scale_tensor = MklGetInput(context, kScaleIndex);
|
||||
const Tensor& saved_mean_tensor = MklGetInput(context, kMeanIndex);
|
||||
const Tensor& saved_variance_tensor = MklGetInput(context,
|
||||
variance_index);
|
||||
kVarianceIndex);
|
||||
|
||||
MklDnnShape dnn_shape_src, dnn_shape_diff_dst;
|
||||
GetMklShape(context, src_index, &dnn_shape_src);
|
||||
GetMklShape(context, diff_dst_index, &dnn_shape_diff_dst);
|
||||
GetMklShape(context, kSrcIndex, &dnn_shape_src);
|
||||
GetMklShape(context, kDiffDstIndex, &dnn_shape_diff_dst);
|
||||
TensorShape tf_shape_src, tf_shape_diff_dst;
|
||||
|
||||
if (dnn_shape_diff_dst.IsMklTensor()) {
|
||||
tf_shape_diff_dst = dnn_shape_diff_dst.GetTfShape();
|
||||
OP_REQUIRES(context, dnn_shape_diff_dst.GetDimension() == 4,
|
||||
errors::InvalidArgument(
|
||||
"input must be 4-dimensional",
|
||||
diff_dst_tensor.shape().DebugString()));
|
||||
} else {
|
||||
tf_shape_diff_dst = diff_dst_tensor.shape();
|
||||
OP_REQUIRES(context, diff_dst_tensor.dims() == 4,
|
||||
errors::InvalidArgument(
|
||||
"input must be 4-dimensional",
|
||||
@ -1044,11 +1116,13 @@ class MklFusedBatchNormGradOp : public OpKernel {
|
||||
}
|
||||
|
||||
if (dnn_shape_src.IsMklTensor()) {
|
||||
tf_shape_src = dnn_shape_src.GetTfShape();
|
||||
OP_REQUIRES(context, dnn_shape_src.GetDimension() == 4,
|
||||
errors::InvalidArgument(
|
||||
"input must be 4-dimensional",
|
||||
src_tensor.shape().DebugString()));
|
||||
} else {
|
||||
tf_shape_src = src_tensor.shape();
|
||||
OP_REQUIRES(context, src_tensor.dims() == 4,
|
||||
errors::InvalidArgument(
|
||||
"input must be 4-dimensional",
|
||||
@ -1069,6 +1143,15 @@ class MklFusedBatchNormGradOp : public OpKernel {
|
||||
"saved variance must be 1-dimensional",
|
||||
saved_variance_tensor.shape().DebugString()));
|
||||
|
||||
Tensor* diff_src_tensor = nullptr;
|
||||
if (tf_shape_src.num_elements() == 0 ||
|
||||
tf_shape_diff_dst.num_elements() == 0) {
|
||||
HandleEmptyInput(context, tf_shape_src,
|
||||
scale_tensor.shape(),
|
||||
&diff_src_tensor);
|
||||
return;
|
||||
}
|
||||
|
||||
if (dnn_shape_src.IsMklTensor())
|
||||
depth_ = dnn_shape_src.DimSize(MklDnnDims::Dim_C);
|
||||
else
|
||||
@ -1165,25 +1248,21 @@ class MklFusedBatchNormGradOp : public OpKernel {
|
||||
auto diff_weights_m = memory(diff_weights_pd);
|
||||
|
||||
auto bnrm_fwd_desc = batch_normalization_forward::desc(
|
||||
prop_kind::forward_training,
|
||||
src.GetUsrMemDesc(),
|
||||
epsilon_,
|
||||
use_scale_shift);
|
||||
prop_kind::forward_training,
|
||||
src.GetUsrMemDesc(),
|
||||
epsilon_,
|
||||
is_training_ ? use_scale_shift :
|
||||
(use_scale_shift | use_global_stats));
|
||||
auto bnrm_fwd_pd = batch_normalization_forward::primitive_desc(
|
||||
bnrm_fwd_desc,
|
||||
cpu_engine);
|
||||
|
||||
// Indices of output tensors
|
||||
const size_t diff_src_index = 0; // index of diff_src tensor
|
||||
const size_t diff_scale_index = 1; // index of diff_scale tensor
|
||||
const size_t diff_shift_index = 2; // index of diff_shift tensor
|
||||
const size_t p1_index = 3; // index of 1st placeholder tensor
|
||||
const size_t p2_index = 4; // index of 2nd placeholder tensor
|
||||
const size_t kDiffSrcIndex = 0; // index of diff_src tensor
|
||||
|
||||
// allocate diff_src tensor
|
||||
MklDnnShape dnn_shape_diff_src;
|
||||
TensorShape tf_shape_diff_src;
|
||||
Tensor* diff_src_tensor = nullptr;
|
||||
if (dnn_shape_src.IsMklTensor()) {
|
||||
dnn_shape_diff_src.SetMklTensor(true);
|
||||
auto diff_src_pd = bnrm_fwd_pd.dst_primitive_desc();
|
||||
@ -1201,7 +1280,7 @@ class MklFusedBatchNormGradOp : public OpKernel {
|
||||
dnn_shape_diff_src.SetMklTensor(false);
|
||||
tf_shape_diff_src = src_tensor.shape();
|
||||
}
|
||||
AllocateOutputSetMklShape(context, diff_src_index, &diff_src_tensor,
|
||||
AllocateOutputSetMklShape(context, kDiffSrcIndex, &diff_src_tensor,
|
||||
tf_shape_diff_src, dnn_shape_diff_src);
|
||||
|
||||
diff_src.SetUsrMem(src_md, diff_src_tensor);
|
||||
@ -1212,7 +1291,15 @@ class MklFusedBatchNormGradOp : public OpKernel {
|
||||
diff_src.GetUsrMemDesc(),
|
||||
src.GetUsrMemDesc(),
|
||||
epsilon_,
|
||||
use_scale_shift);
|
||||
/* for inference, specify use_global_stats
|
||||
1. on fwd prop, use mean and variance
|
||||
provided as inputs
|
||||
2. on bwd prop, mean and variance are
|
||||
considered as constants. Thus,
|
||||
reduce the amount of MKL computations
|
||||
*/
|
||||
is_training_ ? use_scale_shift :
|
||||
(use_scale_shift | use_global_stats));
|
||||
auto bnrm_bwd_pd = batch_normalization_backward::primitive_desc(
|
||||
bnrm_bwd_desc,
|
||||
cpu_engine,
|
||||
@ -1232,41 +1319,22 @@ class MklFusedBatchNormGradOp : public OpKernel {
|
||||
net.push_back(bnrm_bwd_op);
|
||||
stream(stream::kind::eager).submit(net).wait();
|
||||
|
||||
// separate out scale and shift grad and copy to individual tensors
|
||||
const TensorShape& tf_shape_scale_shift = scale_tensor.shape();
|
||||
// allocate 4 output TF tensors
|
||||
Tensor* diff_scale_tensor = nullptr;
|
||||
MklDnnShape mkl_shape_diff_scale;
|
||||
mkl_shape_diff_scale.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, diff_scale_index, &diff_scale_tensor,
|
||||
tf_shape_scale_shift, mkl_shape_diff_scale);
|
||||
|
||||
Tensor* diff_shift_tensor = nullptr;
|
||||
MklDnnShape mkl_shape_diff_shift;
|
||||
mkl_shape_diff_shift.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, diff_shift_index, &diff_shift_tensor,
|
||||
tf_shape_scale_shift, mkl_shape_diff_shift);
|
||||
AllocateTFOutputs(context, scale_tensor.shape(),
|
||||
&diff_scale_tensor,
|
||||
&diff_shift_tensor);
|
||||
|
||||
// copy data: diff_scale and diff_shift
|
||||
T* diff_weights_data_dnn = reinterpret_cast<T*>
|
||||
(diff_weights_m.get_data_handle());
|
||||
float* diff_scale_data_tf = const_cast<float*>(
|
||||
static_cast<const float*>(diff_scale_tensor->flat<T>().data()));
|
||||
float* diff_shift_data_tf = const_cast<float*>(
|
||||
static_cast<const float*>(diff_shift_tensor->flat<T>().data()));
|
||||
for (int i = 0; i < depth_; i++) {
|
||||
diff_scale_data_tf[i] = diff_weights_data_dnn[i];
|
||||
diff_shift_data_tf[i] = diff_weights_data_dnn[i + depth_];
|
||||
diff_scale_tensor->flat<T>().data()[i] =
|
||||
diff_weights_data_dnn[i];
|
||||
diff_shift_tensor->flat<T>().data()[i] =
|
||||
diff_weights_data_dnn[i + depth_];
|
||||
}
|
||||
|
||||
// Placeholders for estimated_mean and estimated_variance, which are
|
||||
// used for inference and thus not needed here for gradient computation.
|
||||
Tensor* p1_tensor = nullptr, *p2_tensor = nullptr;
|
||||
MklDnnShape mkl_shape_p;
|
||||
mkl_shape_p.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, p1_index, &p1_tensor,
|
||||
TensorShape({}), mkl_shape_p);
|
||||
AllocateOutputSetMklShape(context, p2_index, &p2_tensor,
|
||||
TensorShape({}), mkl_shape_p);
|
||||
} catch (mkldnn::error &e) {
|
||||
string error_msg = "Status: " + std::to_string(e.status) +
|
||||
", message: " + string(e.message) +
|
||||
@ -1282,12 +1350,74 @@ class MklFusedBatchNormGradOp : public OpKernel {
|
||||
T epsilon_;
|
||||
TensorFormat tensor_format_;
|
||||
int depth_; // batch normalization is done for per channel.
|
||||
bool is_training_;
|
||||
|
||||
void ExtractParams(OpKernelContext* context) {
|
||||
const Tensor& input = MklGetInput(context, 0);
|
||||
depth_ = static_cast<int>(GetTensorDim(input, tensor_format_, 'C'));
|
||||
}
|
||||
|
||||
void HandleEmptyInput(OpKernelContext* context,
|
||||
TensorShape tf_shape_src,
|
||||
TensorShape tf_shape_scale_shift,
|
||||
Tensor** diff_src_tensor) {
|
||||
const size_t kDiffSrcIndex = 0;
|
||||
|
||||
MklDnnShape dnn_shape_diff_src;
|
||||
dnn_shape_diff_src.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, kDiffSrcIndex, diff_src_tensor,
|
||||
tf_shape_src, dnn_shape_diff_src);
|
||||
for (size_t i=0; i < (*diff_src_tensor)->shape().num_elements(); i++)
|
||||
(*diff_src_tensor)->flat<T>().data()[i] = 0;
|
||||
|
||||
Tensor* diff_scale_tensor = nullptr;
|
||||
Tensor* diff_shift_tensor = nullptr;
|
||||
AllocateTFOutputs(context,
|
||||
tf_shape_scale_shift,
|
||||
&diff_scale_tensor,
|
||||
&diff_shift_tensor);
|
||||
}
|
||||
|
||||
void AllocateTFOutputs(OpKernelContext* context,
|
||||
TensorShape tf_shape_scale_shift,
|
||||
Tensor** diff_scale_tensor,
|
||||
Tensor** diff_shift_tensor) {
|
||||
CHECK_NOTNULL(diff_scale_tensor);
|
||||
CHECK_NOTNULL(diff_shift_tensor);
|
||||
|
||||
const size_t kDiffScaleIndex = 1;
|
||||
const size_t kDiffShiftIndex = 2;
|
||||
const size_t kP1Index = 3;
|
||||
const size_t kP2Index = 4;
|
||||
|
||||
// separate out scale and shift grad and copy to individual tensors
|
||||
MklDnnShape mkl_shape_diff_scale;
|
||||
mkl_shape_diff_scale.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, kDiffScaleIndex, diff_scale_tensor,
|
||||
tf_shape_scale_shift, mkl_shape_diff_scale);
|
||||
CHECK_NOTNULL(*diff_scale_tensor);
|
||||
for (size_t i=0; i < (*diff_scale_tensor)->shape().num_elements(); i++)
|
||||
(*diff_scale_tensor)->flat<T>().data()[i] = 0;
|
||||
|
||||
MklDnnShape mkl_shape_diff_shift;
|
||||
mkl_shape_diff_shift.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, kDiffShiftIndex, diff_shift_tensor,
|
||||
tf_shape_scale_shift, mkl_shape_diff_shift);
|
||||
CHECK_NOTNULL(*diff_shift_tensor);
|
||||
for (size_t i=0; i < (*diff_shift_tensor)->shape().num_elements(); i++)
|
||||
(*diff_shift_tensor)->flat<T>().data()[i] = 0;
|
||||
|
||||
// Placeholders for estimated_mean and estimated_variance, which are
|
||||
// used for inference and thus not needed here for gradient computation.
|
||||
Tensor* p1_tensor = nullptr, *p2_tensor = nullptr;
|
||||
MklDnnShape mkl_shape_p;
|
||||
mkl_shape_p.SetMklTensor(false);
|
||||
AllocateOutputSetMklShape(context, kP1Index, &p1_tensor,
|
||||
TensorShape({}), mkl_shape_p);
|
||||
AllocateOutputSetMklShape(context, kP2Index, &p2_tensor,
|
||||
TensorShape({}), mkl_shape_p);
|
||||
}
|
||||
|
||||
memory::dims GetMeanVarianceDims() {
|
||||
return memory::dims({1, depth_});
|
||||
}
|
||||
|
@ -396,7 +396,7 @@ class MklInputConversionOp : public OpKernel {
|
||||
auto cpu_engine = engine(engine::cpu, 0);
|
||||
MklDnnData<T> tf_input(&cpu_engine);
|
||||
auto input_tf_md = mkl_output_mkl_shape.GetTfLayout();
|
||||
tf_input.SetUsrMem(input_tf_md, &tf_tensor);
|
||||
tf_input.SetUsrMem(input_tf_md, tf_tensor);
|
||||
|
||||
// Create reorder between tensorflow layout and Mkl layout.
|
||||
std::vector<primitive> net;
|
||||
|
@ -43,7 +43,7 @@ limitations under the License.
|
||||
using mkldnn::lrn_forward;
|
||||
using mkldnn::lrn_backward;
|
||||
using mkldnn::prop_kind;
|
||||
using mkldnn::algorithm::lrn_across_channels;
|
||||
using mkldnn::lrn_across_channels;
|
||||
using mkldnn::stream;
|
||||
#endif
|
||||
|
||||
@ -910,17 +910,23 @@ class MklLRNOp : public OpKernel {
|
||||
Eigen::Tensor<T, 2, Eigen::RowMajor> multiplier(depth, depth);
|
||||
GetBandMatrix<T>(depth, depth_radius_, &multiplier);
|
||||
|
||||
Tensor *output_dnn_data, *workspace;
|
||||
MklDnnShape mkl_output_mkl_shape, mkl_workspace_mkl_shape;
|
||||
Tensor *output_dnn_data = nullptr;
|
||||
MklDnnShape mkl_output_mkl_shape;
|
||||
mkl_output_mkl_shape.SetMklTensor(false);
|
||||
mkl_output_mkl_shape.SetDimensions(4);
|
||||
AllocateOutputSetMklShape(context, kIdxOutput, &output_dnn_data,
|
||||
input.shape(), mkl_output_mkl_shape);
|
||||
CHECK_NOTNULL(output_dnn_data);
|
||||
|
||||
mkl_workspace_mkl_shape.SetMklTensor(false);
|
||||
mkl_workspace_mkl_shape.SetDimensions(4);
|
||||
AllocateOutputSetMklShape(context, kIdxWorkspace, &workspace,
|
||||
input.shape(), mkl_workspace_mkl_shape);
|
||||
Tensor* workspace_tensor = nullptr;
|
||||
MklDnnShape workspace_mkl_shape;
|
||||
workspace_mkl_shape.SetMklTensor(false);
|
||||
TensorShape workspace_tf_shape;
|
||||
workspace_tf_shape.AddDim(0);
|
||||
AllocateOutputSetMklShape(context, kIdxWorkspace,
|
||||
&workspace_tensor,
|
||||
workspace_tf_shape, workspace_mkl_shape);
|
||||
CHECK_NOTNULL(workspace_tensor);
|
||||
|
||||
auto out_shaped = output_dnn_data->shaped<T, 2>({nodes * batch, depth});
|
||||
Eigen::array<DimPair, 1> dims = {{DimPair(1, 0)}};
|
||||
@ -1344,12 +1350,14 @@ class MklLRNGradOp : public OpKernel {
|
||||
errors::InvalidArgument("Output image must be 4-dimensional"));
|
||||
}
|
||||
|
||||
if (workspace_dnn_shape.IsMklTensor()) {
|
||||
OP_REQUIRES(context, workspace_dnn_shape.IsMklTensor() == false,
|
||||
errors::InvalidArgument("Workspace should not be MKL Tensor."));
|
||||
} else {
|
||||
OP_REQUIRES(context, workspace_tensor.dims() == 1,
|
||||
errors::InvalidArgument("Workspace must be 1-dimensional"));
|
||||
if (workspace_enabled_) {
|
||||
if (workspace_dnn_shape.IsMklTensor()) {
|
||||
OP_REQUIRES(context, workspace_dnn_shape.IsMklTensor() == false,
|
||||
errors::InvalidArgument("Workspace should not be MKL Tensor."));
|
||||
} else {
|
||||
OP_REQUIRES(context, workspace_tensor.dims() == 1,
|
||||
errors::InvalidArgument("Workspace must be 1-dimensional"));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -517,7 +517,7 @@ class MklMaxPoolingOp : public MklPoolingForwardOpBase<T> {
|
||||
|
||||
MklDnnData<T> dnn_data_input(&cpu_engine);
|
||||
MklDnnData<T> dnn_data_output(&cpu_engine);
|
||||
MklDnnData<T> dnn_data_wksp(&cpu_engine);
|
||||
MklDnnData<uint8> dnn_data_wksp(&cpu_engine);
|
||||
|
||||
// initialize variables for the pooling op
|
||||
MklPoolParameters pool_params;
|
||||
@ -588,16 +588,16 @@ class MklMaxPoolingOp : public MklPoolingForwardOpBase<T> {
|
||||
|
||||
void AllocateWorkspaceTensor(OpKernelContext* context,
|
||||
const pooling_forward::primitive_desc& pool_fwd_prim_desc,
|
||||
MklDnnData<T>* dnn_data_wksp) {
|
||||
MklDnnData<uint8>* dnn_data_wksp) {
|
||||
CHECK_NOTNULL(dnn_data_wksp);
|
||||
Tensor* workspace_tensor = nullptr;
|
||||
memory::primitive_desc workspace_pd
|
||||
= pool_fwd_prim_desc.workspace_primitive_desc();
|
||||
size_t workspace_t_elems = this->GetNumTElements(workspace_pd);
|
||||
size_t workspace_bytes = workspace_pd.get_size();
|
||||
MklDnnShape workspace_mkl_shape;
|
||||
workspace_mkl_shape.SetMklTensor(false);
|
||||
TensorShape workspace_tf_shape;
|
||||
workspace_tf_shape.AddDim(workspace_t_elems);
|
||||
workspace_tf_shape.AddDim(workspace_bytes);
|
||||
AllocateOutputSetMklShape(context, kOutputTensorIndexWorkspace,
|
||||
&workspace_tensor,
|
||||
workspace_tf_shape, workspace_mkl_shape);
|
||||
@ -651,7 +651,7 @@ class MklMaxPoolingGradOp : public MklPoolingBackwardOpBase<T> {
|
||||
if (!context->status().ok()) return;
|
||||
|
||||
MklDnnData<T> grad_dnn_data(&cpu_engine);
|
||||
MklDnnData<T> workspace_dnn_data(&cpu_engine);
|
||||
MklDnnData<uint8> workspace_dnn_data(&cpu_engine);
|
||||
MklDnnData<T> output_dnn_data(&cpu_engine);
|
||||
Tensor* output_tensor = nullptr;
|
||||
MklPoolParameters pool_params;
|
||||
@ -770,7 +770,7 @@ class MklMaxPoolingGradOp : public MklPoolingBackwardOpBase<T> {
|
||||
|
||||
void ConfigureWorkspace(const Tensor& workspace_tensor,
|
||||
memory::primitive_desc workspace_pd,
|
||||
MklDnnData<T> *workspace_dnn_data) {
|
||||
MklDnnData<uint8> *workspace_dnn_data) {
|
||||
CHECK_NOTNULL(workspace_dnn_data);
|
||||
|
||||
workspace_dnn_data->SetUsrMem(workspace_pd, &workspace_tensor);
|
||||
@ -811,7 +811,7 @@ class MklMaxPoolingGradOp : public MklPoolingBackwardOpBase<T> {
|
||||
errors::InvalidArgument("Gradient must be "
|
||||
"4-dimensional"));
|
||||
}
|
||||
if (this->workspace_enabled_){
|
||||
if (this->workspace_enabled_) {
|
||||
// The workspace should not be an MKL tensor
|
||||
OP_REQUIRES(context, workspace_mkl_shape.IsMklTensor() == false,
|
||||
errors::InvalidArgument("Workspace tensor should not"
|
||||
|
@ -231,7 +231,7 @@ class MklPoolingForwardOpBase : public MklPoolingOpBase<T> {
|
||||
const pooling_forward::primitive_desc& pool_fwd_desc,
|
||||
const MklDnnData<T>* src,
|
||||
MklDnnData<T>* dst,
|
||||
MklDnnData<T>* wksp = nullptr) {
|
||||
MklDnnData<uint8>* wksp = nullptr) {
|
||||
std::vector<primitive> net;
|
||||
|
||||
// Create pooling primitive and add it to net
|
||||
@ -307,7 +307,7 @@ class MklPoolingBackwardOpBase : public MklPoolingOpBase<T> {
|
||||
MklDnnData<T>* input_gradient_diff_dst,
|
||||
MklDnnData<T>* output_diff_src,
|
||||
const memory::primitive_desc& target_diff_dst_pd,
|
||||
const MklDnnData<T>* workspace = nullptr) {
|
||||
const MklDnnData<uint8>* workspace = nullptr) {
|
||||
|
||||
std::vector<primitive> net;
|
||||
|
||||
|
@ -256,11 +256,18 @@ class MklReshapeOp : public OpKernel {
|
||||
AllocateOutputSetMklShape(context, kOutputSlotIdx, &output_tensor,
|
||||
shape_to, mkl_shape_output);
|
||||
|
||||
// Insert reorder between Mkl layout and TensorFlow layout.
|
||||
// Insert reorder between Mkl layout and TensorFlow layout if
|
||||
// needed. If reorder is not needed but reshape is needed (since
|
||||
// shape_from != shape_to), then we just copy input tensor to
|
||||
// output tensor with target shape (we cannot forward Mkl layout
|
||||
// in such case because shape has changed.)
|
||||
std::vector<primitive> net;
|
||||
CHECK_EQ(dnn_data_input.CheckReorderToOpMem(output_tf_pd,
|
||||
output_tensor, &net), true);
|
||||
stream(stream::kind::eager).submit(net).wait();
|
||||
if (dnn_data_input.CheckReorderToOpMem(output_tf_pd,
|
||||
output_tensor, &net)) {
|
||||
stream(stream::kind::eager).submit(net).wait();
|
||||
} else {
|
||||
output_tensor->CopyFrom(input_tensor, shape_to);
|
||||
}
|
||||
return;
|
||||
} else {
|
||||
// If dimensions that are being expanded or collapsed are
|
||||
|
163  tensorflow/core/kernels/mkl_softmax_op.cc  Normal file
@@ -0,0 +1,163 @@
/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

// See docs in ../ops/nn_ops.cc.
#ifdef INTEL_MKL
#ifdef INTEL_MKL_DNN

#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/util/tensor_format.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

#include "mkldnn.h"
#include "mkldnn_types.h"
#include "tensorflow/core/platform/default/logging.h"
#include "tensorflow/core/util/mkl_util.h"

#include "mkldnn.hpp"
using mkldnn::stream;
using mkldnn::prop_kind;
using mkldnn::softmax_forward;

namespace tensorflow {

typedef Eigen::ThreadPoolDevice CPUDevice;

template <typename Device, typename T>
class MklSoftmaxOp : public OpKernel {
 public:
  ~MklSoftmaxOp() {}

  explicit MklSoftmaxOp(OpKernelConstruction* context) : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
    try {
      auto cpu_engine = engine(engine::cpu, 0);

      // src_tensor now points to the 0-th input of global data struct "context"
      size_t src_idx = 0;
      const Tensor& src_tensor = MklGetInput(context, src_idx);

      // Add: get MklShape
      MklDnnShape src_mkl_shape;
      GetMklShape(context, src_idx, &src_mkl_shape);

      // src_dims is the dimension of src_tensor;
      // the dims of the dst will also be the same as src_dims
      auto src_tf_shape = src_mkl_shape.IsMklTensor() ?
          src_mkl_shape.GetTfShape() : src_tensor.shape();
      auto src_dims = TFShapeToMklDnnDims(src_tf_shape);
      auto output_dims = src_dims;

      // Create softmax memory for src, dst: both are wrappers defined in
      // mkl_util.h
      MklDnnData<T> src(&cpu_engine);
      MklDnnData<T> dst(&cpu_engine);

      // If input is in MKL layout, then simply grab input layout; otherwise,
      // construct input Tf layout. For TF layout, although input shape
      // (src_dims) required is in MKL-DNN order, the layout is Tensorflow's
      // layout
      auto src_md = src_mkl_shape.IsMklTensor()
                        ? src_mkl_shape.GetMklLayout()
                        : memory::desc(src_dims, MklDnnType<T>(),
                                       memory::format::nc);

      // src: setting memory descriptor and op memory descriptor
      // Basically the following two functions map the TF "src_tensor" to the
      // mkl tensor object "src" (both functions are in mkl_util.h).
      // Data format is "nc" for src and dst, since the src and dst buffers are
      // always in 2D shape.
      src.SetUsrMem(src_md, &src_tensor);
      src.SetOpMemDesc(src_dims, memory::format::nc);

      // creating a memory descriptor
      int axis = 1;  // axis to which softmax will be applied
      auto softmax_fwd_desc = softmax_forward::desc(prop_kind::forward_scoring,
                                                    src.GetOpMemDesc(), axis);
      auto softmax_fwd_pd = softmax_forward::primitive_desc(softmax_fwd_desc,
                                                            cpu_engine);

      // add: output
      Tensor* output_tensor = nullptr;
      MklDnnShape output_mkl_shape;
      TensorShape output_tf_shape;  // shape of output TF tensor.
      // Softmax MklDnn output layout is same as input layout.
      auto dst_pd = src.GetUsrMemPrimDesc();

      // if input is MKL shape, output is also MKL shape.
      // if input is TF shape, output is also TF shape
      if (src_mkl_shape.IsMklTensor()) {
        output_mkl_shape.SetMklTensor(true);
        output_mkl_shape.SetMklLayout(&dst_pd);
        output_mkl_shape.SetElemType(MklDnnType<T>());
        output_mkl_shape.SetTfLayout(output_dims.size(), output_dims,
                                     memory::format::nc);
        output_tf_shape.AddDim((dst_pd.get_size() / sizeof(T)));
      } else {  // then output is also TF shape
        output_mkl_shape.SetMklTensor(false);
        output_tf_shape = MklDnnDimsToTFShape(output_dims);
      }
      // Allocate output shape (MKL or TF based on the above)
      AllocateOutputSetMklShape(context, 0, &output_tensor, output_tf_shape,
                                output_mkl_shape);

      // Output_dims and input_dims are same
      dst.SetUsrMem(src_md, output_tensor);

      // finally creating the "softmax op" using the primitive descriptor, src
      // and dst
      auto softmax_fwd =
          softmax_forward(softmax_fwd_pd, src.GetOpMem(), dst.GetOpMem());

      // execute net (pushing to the stream)
      // following 3 are common for all mkl dnn ops
      std::vector<primitive> net;
      net.push_back(softmax_fwd);
      stream(stream::kind::eager).submit(net).wait();
    } catch (mkldnn::error& e) {
      string error_msg = "Status: " + std::to_string(e.status) + ", message: " +
                         string(e.message) + ", in file " + string(__FILE__) +
                         ":" + std::to_string(__LINE__);
      OP_REQUIRES_OK(
          context,
          errors::Aborted("Operation received an exception:", error_msg));
    }
  }
};

/* Register DNN kernels for supported operations and supported types - right now
 * it is only Softmax and f32 */
#define REGISTER_SOFTMAX_MKL_SUPPORTED_KERNELS_TYPES(type)          \
  REGISTER_KERNEL_BUILDER(Name("_MklSoftmax")                       \
                              .Device(DEVICE_CPU)                   \
                              .TypeConstraint<type>("T")            \
                              .Label(mkl_op_registry::kMklOpLabel), \
                          MklSoftmaxOp<CPUDevice, type>);
TF_CALL_float(REGISTER_SOFTMAX_MKL_SUPPORTED_KERNELS_TYPES);

}  // namespace tensorflow

#endif  // INTEL_MKL_DNN
#endif  // INTEL_MKL
@ -222,7 +222,7 @@ void DnnPoolingOp<T>::Compute(
                                     output_desc, &output_data)
                .ok();
  OP_REQUIRES(context, status,
              errors::Internal("cudnn PoolBackward launch failed"));
              errors::Internal("cudnn PoolForward launch failed"));

  if (data_format == FORMAT_NHWC) {
    /// Transform the output data from NCHW back to NHWC
@ -70,10 +70,24 @@ bool ReadRawFloatFileToComplexVector(
  int offset = 0;
  const int end = data_string.size();
  while (offset < end) {
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    char arr[4];
    for (int i = 0; i < kBytesPerValue; ++i) {
      arr[3 - i] = *(data_string.data() + offset + i);
    }
    memcpy(&real_out, arr, kBytesPerValue);
    offset += kBytesPerValue;
    for (int i = 0; i < kBytesPerValue; ++i) {
      arr[3 - i] = *(data_string.data() + offset + i);
    }
    memcpy(&imag_out, arr, kBytesPerValue);
    offset += kBytesPerValue;
#else
    memcpy(&real_out, data_string.data() + offset, kBytesPerValue);
    offset += kBytesPerValue;
    memcpy(&imag_out, data_string.data() + offset, kBytesPerValue);
    offset += kBytesPerValue;
#endif
    if (row_counter >= row_length) {
      data->push_back(data_row);
      data_row.clear();
@ -88,6 +88,18 @@ struct Transpose<CPUDevice, T, conjugate> {
        internal::TransposeUsingEigen<CPUDevice, T, 5>(d, in, perm, conjugate,
                                                       out);
        break;
      case 6:
        internal::TransposeUsingEigen<CPUDevice, T, 6>(d, in, perm, conjugate,
                                                       out);
        break;
      case 7:
        internal::TransposeUsingEigen<CPUDevice, T, 7>(d, in, perm, conjugate,
                                                       out);
        break;
      case 8:
        internal::TransposeUsingEigen<CPUDevice, T, 8>(d, in, perm, conjugate,
                                                       out);
        break;
      default:
        TransposeSimple<T, conjugate>(d, in, perm, out);
        break;
@ -201,6 +201,27 @@ struct Transpose<GPUDevice, T, conjugate> {
                                                         out);
        }
        break;
      case 6:
        if (!internal::TransposeUsingTile<T, conjugate>::run(d, in, perm,
                                                             out)) {
          internal::TransposeUsingEigen<GPUDevice, T, 6>(d, in, perm, conjugate,
                                                         out);
        }
        break;
      case 7:
        if (!internal::TransposeUsingTile<T, conjugate>::run(d, in, perm,
                                                             out)) {
          internal::TransposeUsingEigen<GPUDevice, T, 7>(d, in, perm, conjugate,
                                                         out);
        }
        break;
      case 8:
        if (!internal::TransposeUsingTile<T, conjugate>::run(d, in, perm,
                                                             out)) {
          internal::TransposeUsingEigen<GPUDevice, T, 8>(d, in, perm, conjugate,
                                                         out);
        }
        break;
      default:
        internal::TransposeSimple<T, conjugate>(d, in, perm, out);
        break;
@ -67,10 +67,12 @@ class SoftmaxXentWithLogitsOp : public OpKernel {
    // Try to reuse the logits_in buffer for the backprop output.
    OP_REQUIRES_OK(context, context->forward_input_or_allocate_output(
                                {0}, 1, logits_in.shape(), &back_out));
    functor::XentFunctor<Device, T> functor;
    functor(context->eigen_device<Device>(), logits_in.matrix<T>(),
            labels_in.matrix<T>(), scratch.matrix<T>(), loss_out->vec<T>(),
            back_out->matrix<T>());
    if (logits_in.dim_size(0) > 0) {
      functor::XentFunctor<Device, T> functor;
      functor(context->eigen_device<Device>(), logits_in.matrix<T>(),
              labels_in.matrix<T>(), scratch.matrix<T>(), loss_out->vec<T>(),
              back_out->matrix<T>());
    }
  }
};
@ -17,6 +17,7 @@ limitations under the License.

#include "tensorflow/core/lib/gif/gif_io.h"
#include "tensorflow/core/lib/gtl/cleanup.h"
#include "tensorflow/core/lib/strings/strcat.h"
#include "tensorflow/core/platform/gif.h"
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/mem.h"
@ -44,7 +45,8 @@ int input_callback(GifFileType* gif_file, GifByteType* buf, int size) {
}

uint8* Decode(const void* srcdata, int datasize,
              std::function<uint8*(int, int, int, int)> allocate_output) {
              const std::function<uint8*(int, int, int, int)>& allocate_output,
              string* error_string) {
  int error_code = D_GIF_SUCCEEDED;
  InputBufferInfo info = {reinterpret_cast<const uint8*>(srcdata), datasize};
  GifFileType* gif_file =
@ -57,17 +59,17 @@ uint8* Decode(const void* srcdata, int datasize,
    }
  });
  if (error_code != D_GIF_SUCCEEDED) {
    LOG(ERROR) << "Fail to open gif file, reason: "
               << GifErrorString(error_code);
    *error_string = strings::StrCat("failed to open gif file: ",
                                    GifErrorString(error_code));
    return nullptr;
  }
  if (DGifSlurp(gif_file) != GIF_OK) {
    LOG(ERROR) << "Fail to slurp gif file, reason: "
               << GifErrorString(gif_file->Error);
    *error_string = strings::StrCat("failed to slurp gif file: ",
                                    GifErrorString(gif_file->Error));
    return nullptr;
  }
  if (gif_file->ImageCount <= 0) {
    LOG(ERROR) << "Gif file does not contain any image";
    *error_string = strings::StrCat("gif file does not contain any image");
    return nullptr;
  }

@ -83,7 +85,7 @@ uint8* Decode(const void* srcdata, int datasize,
  GifImageDesc* img_desc = &this_image->ImageDesc;
  if (img_desc->Left != 0 || img_desc->Top != 0 || img_desc->Width != width ||
      img_desc->Height != height) {
    LOG(ERROR) << "Can't process optimized gif.";
    *error_string = strings::StrCat("can't process optimized gif");
    return nullptr;
  }
@ -43,7 +43,8 @@ namespace tensorflow {
namespace gif {

uint8* Decode(const void* srcdata, int datasize,
              std::function<uint8*(int, int, int, int)> allocate_output);
              const std::function<uint8*(int, int, int, int)>& allocate_output,
              string* error_string);

}  // namespace gif
}  // namespace tensorflow
@ -1818,7 +1818,11 @@ REGISTER_OP("_MklMaxPool")
    .Input("input: T")
    .Input("mkl_input: uint8")
    .Output("output: T")
#ifndef INTEL_MKL_DNN
    .Output("workspace: T")
#else
    .Output("workspace: uint8")
#endif
    .Output("mkl_output: uint8")
    .Output("mkl_workspace: uint8")
    .SetShapeFn(shape_inference::MaxPoolShape)
@ -1840,7 +1844,11 @@ REGISTER_OP("_MklMaxPoolGrad")
    .Input("orig_input: T")
    .Input("orig_output: T")
    .Input("grad: T")
#ifndef INTEL_MKL_DNN
    .Input("workspace: T")
#else
    .Input("workspace: uint8")
#endif
    .Input("mkl_orig_input: uint8")
    .Input("mkl_orig_output: uint8")
    .Input("mkl_grad: uint8")
@ -48,6 +48,7 @@ void AWSLogSystem::LogStream(Aws::Utils::Logging::LogLevel log_level,

void AWSLogSystem::LogMessage(Aws::Utils::Logging::LogLevel log_level,
                              const std::string& message) {
  if (message == "Initializing Curl library") return;
  switch (log_level) {
    case Aws::Utils::Logging::LogLevel::Info:
      LOG(INFO) << message;
@ -14,11 +14,13 @@ limitations under the License.
==============================================================================*/
#include "tensorflow/core/platform/s3/s3_file_system.h"
#include "tensorflow/core/lib/io/path.h"
#include "tensorflow/core/lib/strings/str_util.h"
#include "tensorflow/core/platform/mutex.h"
#include "tensorflow/core/platform/s3/aws_logging.h"
#include "tensorflow/core/platform/s3/s3_crypto.h"

#include <aws/core/Aws.h>
#include <aws/core/config/AWSProfileConfigLoader.h>
#include <aws/core/utils/FileSystemUtils.h>
#include <aws/core/utils/logging/AWSLogging.h>
#include <aws/core/utils/logging/LogSystemInterface.h>
@ -54,13 +56,37 @@ Aws::Client::ClientConfiguration& GetDefaultClientConfig() {
      cfg.endpointOverride = Aws::String(endpoint);
    }
    const char* region = getenv("AWS_REGION");
    if (!region) {
      // TODO (yongtang): `S3_REGION` should be deprecated after 2.0.
      region = getenv("S3_REGION");
    }
    if (region) {
      cfg.region = Aws::String(region);
    } else {
      // TODO (yongtang): `S3_REGION` should be deprecated after 2.0.
      const char* region = getenv("S3_REGION");
      if (region) {
        cfg.region = Aws::String(region);
      // Load config file (e.g., ~/.aws/config) only if AWS_SDK_LOAD_CONFIG
      // is set with a truthy value.
      const char* load_config_env = getenv("AWS_SDK_LOAD_CONFIG");
      string load_config =
          load_config_env ? str_util::Lowercase(load_config_env) : "";
      if (load_config == "true" || load_config == "1") {
        Aws::String config_file;
        // If AWS_CONFIG_FILE is set then use it, otherwise use ~/.aws/config.
        const char* config_file_env = getenv("AWS_CONFIG_FILE");
        if (config_file_env) {
          config_file = config_file_env;
        } else {
          const char* home_env = getenv("HOME");
          if (home_env) {
            config_file = home_env;
            config_file += "/.aws/config";
          }
        }
        Aws::Config::AWSConfigFileProfileConfigLoader loader(config_file);
        loader.Load();
        auto profiles = loader.GetProfiles();
        if (!profiles["default"].GetRegion().empty()) {
          cfg.region = profiles["default"].GetRegion();
        }
      }
    }
    const char* use_https = getenv("S3_USE_HTTPS");
@ -79,6 +105,22 @@ Aws::Client::ClientConfiguration& GetDefaultClientConfig() {
        cfg.verifySSL = true;
      }
    }
    const char* connect_timeout = getenv("S3_CONNECT_TIMEOUT_MSEC");
    if (connect_timeout) {
      int64 timeout;

      if (strings::safe_strto64(connect_timeout, &timeout)) {
        cfg.connectTimeoutMs = timeout;
      }
    }
    const char* request_timeout = getenv("S3_REQUEST_TIMEOUT_MSEC");
    if (request_timeout) {
      int64 timeout;

      if (strings::safe_strto64(request_timeout, &timeout)) {
        cfg.requestTimeoutMs = timeout;
      }
    }

    init = true;
  }
@ -24,7 +24,7 @@ limitations under the License.

// TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
// "-beta", "-rc", "-rc.1")
#define TF_VERSION_SUFFIX "-rc0"
#define TF_VERSION_SUFFIX "-rc1"

#define TF_STR_HELPER(x) #x
#define TF_STR(x) TF_STR_HELPER(x)
@ -14,16 +14,16 @@ suitable if fast sharding or other non-sequential access is desired.

## TFRecords Format Details

A TFRecords file contains a sequence of strings with CRC hashes. Each record
has the format
A TFRecords file contains a sequence of strings with CRC32C (32-bit CRC using
the Castagnoli polynomial) hashes. Each record has the format

    uint64 length
    uint32 masked_crc32_of_length
    byte   data[length]
    uint32 masked_crc32_of_data

and the records are concatenated together to produce the file. The CRC32s
are [described here](https://en.wikipedia.org/wiki/Cyclic_redundancy_check),
and the mask of a CRC is
and the records are concatenated together to produce the file. CRCs are
[described here](https://en.wikipedia.org/wiki/Cyclic_redundancy_check), and
the mask of a CRC is

    masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
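
To make the masking step concrete, here is a minimal Python sketch of how one record could be laid out. It is only an illustration, not TensorFlow's implementation: the `crc32c_fn` argument is a caller-supplied helper that returns the unmasked CRC32C of a byte string, and the little-endian field packing is an assumption about the on-disk layout.

```python
import struct

def mask_crc(crc):
    # masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8, kept within 32 bits.
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF

def write_record(f, data, crc32c_fn):
    # One record: uint64 length, masked CRC of the length bytes,
    # the payload itself, then the masked CRC of the payload.
    length_bytes = struct.pack('<Q', len(data))
    f.write(length_bytes)
    f.write(struct.pack('<I', mask_crc(crc32c_fn(length_bytes))))
    f.write(data)
    f.write(struct.pack('<I', mask_crc(crc32c_fn(data))))
```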
@ -38,7 +38,7 @@ enable TensorFlow for C:
     OS="linux" # Change to "darwin" for macOS
     TARGET_DIRECTORY="/usr/local"
     curl -L \
       "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.5.0-rc0.tar.gz" |
       "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.5.0-rc1.tar.gz" |
       sudo tar -C $TARGET_DIRECTORY -xz

   The `tar` command extracts the TensorFlow C library into the `lib`
@ -38,7 +38,7 @@ steps to install this library and enable TensorFlow for Go:
     TF_TYPE="cpu" # Change to "gpu" for GPU support
     TARGET_DIRECTORY='/usr/local'
     curl -L \
       "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.5.0-rc0.tar.gz" |
       "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.5.0-rc1.tar.gz" |
       sudo tar -C $TARGET_DIRECTORY -xz

   The `tar` command extracts the TensorFlow C library into the `lib`
@ -36,7 +36,7 @@ following to the project's `pom.xml` to use the TensorFlow Java APIs:
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>tensorflow</artifactId>
  <version>1.5.0-rc0</version>
  <version>1.5.0-rc1</version>
</dependency>
```
@ -65,7 +65,7 @@ As an example, these steps will create a Maven project that uses TensorFlow:
    <dependency>
      <groupId>org.tensorflow</groupId>
      <artifactId>tensorflow</artifactId>
      <version>1.5.0-rc0</version>
      <version>1.5.0-rc1</version>
    </dependency>
  </dependencies>
</project>
@ -123,12 +123,12 @@ instead:
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>libtensorflow</artifactId>
  <version>1.4.0</version>
  <version>1.5.0-rc1</version>
</dependency>
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>libtensorflow_jni_gpu</artifactId>
  <version>1.4.0</version>
  <version>1.5.0-rc1</version>
</dependency>
```
@ -147,7 +147,7 @@ refer to the simpler instructions above instead.
Take the following steps to install TensorFlow for Java on Linux or macOS:

1. Download
   [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.5.0-rc0.jar),
   [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.5.0-rc1.jar),
   which is the TensorFlow Java Archive (JAR).

2. Decide whether you will run TensorFlow for Java on CPU(s) only or with
@ -166,7 +166,7 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
   OS=$(uname -s | tr '[:upper:]' '[:lower:]')
   mkdir -p ./jni
   curl -L \
     "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.5.0-rc0.tar.gz" |
     "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.5.0-rc1.tar.gz" |
     tar -xz -C ./jni

### Install on Windows
@ -174,10 +174,10 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
Take the following steps to install TensorFlow for Java on Windows:

1. Download
   [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.5.0-rc0.jar),
   [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.5.0-rc1.jar),
   which is the TensorFlow Java Archive (JAR).
2. Download the following Java Native Interface (JNI) file appropriate for
   [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.5.0-rc0.zip).
   [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.5.0-rc1.zip).
3. Extract this .zip file.

@ -225,7 +225,7 @@ must be part of your `classpath`. For example, you can include the
downloaded `.jar` in your `classpath` by using the `-cp` compilation flag
as follows:

<pre><b>javac -cp libtensorflow-1.5.0-rc0.jar HelloTF.java</b></pre>
<pre><b>javac -cp libtensorflow-1.5.0-rc1.jar HelloTF.java</b></pre>

### Running
@ -239,11 +239,11 @@ two files are available to the JVM:
For example, the following command line executes the `HelloTF` program on Linux
and macOS X:

<pre><b>java -cp libtensorflow-1.5.0-rc0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
<pre><b>java -cp libtensorflow-1.5.0-rc1.jar:. -Djava.library.path=./jni HelloTF</b></pre>

And the following command line executes the `HelloTF` program on Windows:

<pre><b>java -cp libtensorflow-1.5.0-rc0.jar;. -Djava.library.path=jni HelloTF</b></pre>
<pre><b>java -cp libtensorflow-1.5.0-rc1.jar;. -Djava.library.path=jni HelloTF</b></pre>

If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
installed TensorFlow for Java and are ready to use the API. If the program
@ -188,7 +188,7 @@ Take the following steps to install TensorFlow with Virtualenv:
     Virtualenv environment:

     <pre>(tensorflow)$ <b>pip3 install --upgrade \
     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp34-cp34m-linux_x86_64.whl</b></pre>
     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>

     If you encounter installation problems, see
     [Common Installation Problems](#common_installation_problems).
@ -293,7 +293,7 @@ take the following steps:

     <pre>
     $ <b>sudo pip3 install --upgrade \
     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp34-cp34m-linux_x86_64.whl</b>
     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp34-cp34m-linux_x86_64.whl</b>
     </pre>

     If this step fails, see
@ -480,7 +480,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:

     <pre>
     (tensorflow)$ <b>pip install --ignore-installed --upgrade \
     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp34-cp34m-linux_x86_64.whl</b></pre>
     https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>


<a name="ValidateYourInstallation"></a>
@ -648,14 +648,14 @@ This section documents the relevant values for Linux installations.
CPU only:

<pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp27-none-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp27-none-linux_x86_64.whl
</pre>


GPU support:

<pre>
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc0-cp27-none-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc1-cp27-none-linux_x86_64.whl
</pre>

Note that GPU support requires the NVIDIA hardware and software described in
@ -667,14 +667,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:

<pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp34-cp34m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp34-cp34m-linux_x86_64.whl
</pre>


GPU support:

<pre>
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc0-cp34-cp34m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc1-cp34-cp34m-linux_x86_64.whl
</pre>

Note that GPU support requires the NVIDIA hardware and software described in
@ -686,14 +686,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:

<pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp35-cp35m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp35-cp35m-linux_x86_64.whl
</pre>


GPU support:

<pre>
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc0-cp35-cp35m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc1-cp35-cp35m-linux_x86_64.whl
</pre>


@ -705,14 +705,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:

<pre>
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc0-cp36-cp36m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0rc1-cp36-cp36m-linux_x86_64.whl
</pre>


GPU support:

<pre>
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc0-cp36-cp36m-linux_x86_64.whl
https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.5.0rc1-cp36-cp36m-linux_x86_64.whl
</pre>
@ -115,7 +115,7 @@ Take the following steps to install TensorFlow with Virtualenv:
     TensorFlow in the active Virtualenv is as follows:

     <pre> $ <b>pip3 install --upgrade \
     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-any.whl</b></pre>
     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py2-none-any.whl</b></pre>

     If you encounter installation problems, see
     [Common Installation Problems](#common-installation-problems).
@ -238,7 +238,7 @@ take the following steps:
     issue the following command:

     <pre> $ <b>sudo pip3 install --upgrade \
     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-any.whl</b> </pre>
     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py2-none-any.whl</b> </pre>

     If the preceding command fails, see
     [installation problems](#common-installation-problems).
@ -347,7 +347,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
     TensorFlow for Python 2.7:

     <pre> (<i>targetDirectory</i>)$ <b>pip install --ignore-installed --upgrade \
     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-any.whl</b></pre>
     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py2-none-any.whl</b></pre>


<a name="ValidateYourInstallation"></a>
@ -520,7 +520,7 @@ This section documents the relevant values for Mac OS installations.


<pre>
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-any.whl
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py2-none-any.whl
</pre>


@ -528,5 +528,5 @@ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py2-none-a


<pre>
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc0-py3-none-any.whl
https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.5.0rc1-py3-none-any.whl
</pre>
@ -361,10 +361,10 @@ Invoke `pip install` to install that pip package.
The filename of the `.whl` file depends on your platform.
For example, the following command will install the pip package

for TensorFlow 1.5.0rc0 on Linux:
for TensorFlow 1.5.0rc1 on Linux:

<pre>
$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.5.0rc0-py2-none-any.whl</b>
$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.5.0rc1-py2-none-any.whl</b>
</pre>

## Validate your installation
@ -462,9 +462,12 @@ Stack Overflow and specify the `tensorflow` tag.
**Linux**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>

<tr><td>tensorflow-1.5.0-rc1</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.8.0</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.5.0-rc1</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.8.0</td><td>7</td><td>9</td></tr>
<tr><td>tensorflow-1.4.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.5.4</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.4.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.5.4</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.3.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.2.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.2.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>5.1</td><td>8</td></tr>
@ -477,8 +480,9 @@ Stack Overflow and specify the `tensorflow` tag.
**Mac**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
<tr><td>tensorflow-1.5.0-rc1</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.8.1</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.4.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.5.4</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.2.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.1.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.2</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.1.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.2</td><td>5.1</td><td>8</td></tr>
@ -489,6 +493,8 @@ Stack Overflow and specify the `tensorflow` tag.
**Windows**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
<tr><td>tensorflow-1.5.0-rc1</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.5.0-rc1</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>7</td><td>9</td></tr>
<tr><td>tensorflow-1.4.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.4.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
@ -21,6 +21,8 @@ from __future__ import print_function
import collections
import math
import os
import sys
import argparse
import random
from tempfile import gettempdir
import zipfile
@ -30,6 +32,24 @@ from six.moves import urllib
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf

from tensorflow.contrib.tensorboard.plugins import projector

# Give a folder path as an argument with '--log_dir' to save
# TensorBoard summaries. Default is a log folder in current directory.
current_path = os.path.dirname(os.path.realpath(sys.argv[0]))

parser = argparse.ArgumentParser()
parser.add_argument(
    '--log_dir',
    type=str,
    default=os.path.join(current_path, 'log'),
    help='The log directory for TensorBoard summaries.')
FLAGS, unparsed = parser.parse_known_args()

# Create the directory for TensorBoard variables if there is not.
if not os.path.exists(FLAGS.log_dir):
  os.makedirs(FLAGS.log_dir)

# Step 1: Download the data.
url = 'http://mattmahoney.net/dc/'

@ -156,38 +176,47 @@ graph = tf.Graph()
with graph.as_default():

  # Input data.
  train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  with tf.name_scope('inputs'):
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Ops and variables pinned to the CPU because of missing GPU implementation
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    with tf.name_scope('embeddings'):
      embeddings = tf.Variable(
          tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
      embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
    with tf.name_scope('weights'):
      nce_weights = tf.Variable(
          tf.truncated_normal([vocabulary_size, embedding_size],
                              stddev=1.0 / math.sqrt(embedding_size)))
    with tf.name_scope('biases'):
      nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Compute the average NCE loss for the batch.
  # tf.nce_loss automatically draws a new sample of the negative labels each
  # time we evaluate the loss.
  # Explanation of the meaning of NCE loss:
  #   http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
  loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))
  with tf.name_scope('loss'):
    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       labels=train_labels,
                       inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))

  # Add the loss value as a scalar to summary.
  tf.summary.scalar('loss', loss)

  # Construct the SGD optimizer using a learning rate of 1.0.
  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
  with tf.name_scope('optimizer'):
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

  # Compute the cosine similarity between minibatch examples and all embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
@ -197,13 +226,22 @@ with graph.as_default():
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)

  # Merge all summaries.
  merged = tf.summary.merge_all()

  # Add variable initializer.
  init = tf.global_variables_initializer()

  # Create a saver.
  saver = tf.train.Saver()

# Step 5: Begin training.
num_steps = 100001

with tf.Session(graph=graph) as session:
  # Open a writer to write summaries.
  writer = tf.summary.FileWriter(FLAGS.log_dir, session.graph)

  # We must initialize all variables before we use them.
  init.run()
  print('Initialized')
@ -214,10 +252,21 @@ with tf.Session(graph=graph) as session:
                                                batch_size, num_skips, skip_window)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

    # Define metadata variable.
    run_metadata = tf.RunMetadata()

    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for session.run()
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    # Also, evaluate the merged op to get all summaries from the returned "summary" variable.
    # Feed metadata variable to session for visualizing the graph in TensorBoard.
    _, summary, loss_val = session.run([optimizer, merged, loss], feed_dict=feed_dict, run_metadata=run_metadata)
    average_loss += loss_val

    # Add returned summaries to writer in each step.
    writer.add_summary(summary, step)
    # Add metadata to visualize the graph for the last run.
    if step == (num_steps - 1):
      writer.add_run_metadata(run_metadata, 'step%d' % step)

    if step % 2000 == 0:
      if step > 0:
@ -240,6 +289,23 @@ with tf.Session(graph=graph) as session:
        print(log_str)
  final_embeddings = normalized_embeddings.eval()

  # Write corresponding labels for the embeddings.
  with open(FLAGS.log_dir + '/metadata.tsv', 'w') as f:
    for i in xrange(vocabulary_size):
      f.write(reverse_dictionary[i] + '\n')

  # Save the model for checkpoints.
  saver.save(session, os.path.join(FLAGS.log_dir, "model.ckpt"))

  # Create a configuration for visualizing embeddings with the labels in TensorBoard.
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = embeddings.name
  embedding_conf.metadata_path = os.path.join(FLAGS.log_dir, 'metadata.tsv')
  projector.visualize_embeddings(writer, config)

  writer.close()

# Step 6: Visualize the embeddings.
@ -1,3 +1,4 @@
# -*- coding: utf-8 -*-
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -301,6 +302,27 @@ class BatchDatasetTest(test.TestCase):
      with self.assertRaises(errors.OutOfRangeError):
        sess.run(get_next)

  def testPaddedBatchDatasetUnicode(self):
    # See GitHub issue 16149
    def generator():
      data = [
          [u'Простой', u'тест', u'юникода'],
          [u'никогда', u'не', u'бывает', u'простым']]

      for seq in data:
        yield seq, [0, 1, 2, 3]

    dataset = dataset_ops.Dataset.from_generator(
        generator,
        (dtypes.string, dtypes.int32),
        (tensor_shape.TensorShape([None]), tensor_shape.TensorShape([None])))
    padded_dataset = dataset.padded_batch(2, padded_shapes=([None], [None]),
                                          padding_values=('', 0))
    with self.test_session() as sess:
      next_element = padded_dataset.make_one_shot_iterator().get_next()
      sess.run(next_element)


  def testPaddedBatchDatasetShapeSpecifications(self):
    int_placeholder = array_ops.placeholder(dtypes.int32)
    float_placeholder = array_ops.placeholder(dtypes.float32)
@ -128,9 +128,10 @@ class Estimator(object):

      model_dir: Directory to save model parameters, graph and etc. This can
        also be used to load checkpoints from the directory into an estimator to
        continue training a previously saved model. If `None`, the model_dir in
        `config` will be used if set. If both are set, they must be the same. If
        both are `None`, a temporary directory will be used.
        continue training a previously saved model. If a `PathLike` object, the
        path will be resolved. If `None`, the model_dir in `config` will be used
        if set. If both are set, they must be the same. If both are `None`, a
        temporary directory will be used.
      config: Configuration object.
      params: `dict` of hyper parameters that will be passed into `model_fn`.
        Keys are names of parameters, values are basic python types.
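
A minimal usage sketch of the `PathLike` behavior described in the docstring above (the trivial `model_fn` is hypothetical and exists only so the constructor accepts it; it assumes a TensorFlow build that includes this change):

```python
import pathlib
import tensorflow as tf

def model_fn(features, labels, mode):
  # Placeholder model_fn; it is never invoked in this construction-only example.
  return tf.estimator.EstimatorSpec(
      mode, predictions={'y': tf.constant([0.0])},
      loss=tf.constant(0.0), train_op=tf.no_op())

# model_dir may now be an os.PathLike object; Estimator resolves it to a string.
est = tf.estimator.Estimator(
    model_fn=model_fn, model_dir=pathlib.Path('/tmp/example_model'))
print(est.model_dir)  # prints the resolved string path
```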
@ -158,6 +159,7 @@ class Estimator(object):
    self._config = config

    # Model directory.
    model_dir = compat.path_to_str(model_dir)
    if (model_dir is not None) and (self._config.model_dir is not None):
      if model_dir != self._config.model_dir:
        # TODO(alanyee): remove this suppression after it is no longer needed
Some files were not shown because too many files have changed in this diff.