The current depthwise_conv is very inefficient: it calls slice() on each
input channel of both the input and the filters, runs a separate conv() per
channel, and then concat()s the results.
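For illustration, a minimal Python sketch of the pattern being replaced
(the function name and shapes here are hypothetical, not the actual kernel):

  import tensorflow as tf

  def depthwise_conv_naive(inputs, filters):
    # inputs:  [batch, height, width, in_channels]
    # filters: [fh, fw, in_channels, channel_multiplier]
    outputs = []
    for c in range(inputs.shape[-1]):
      x_c = inputs[:, :, :, c:c + 1]        # slice() one input channel
      f_c = filters[:, :, c:c + 1, :]       # slice() the matching filter
      outputs.append(tf.nn.conv2d(x_c, f_c, strides=1, padding="SAME"))
    return tf.concat(outputs, axis=-1)      # concat() per-channel results

Each iteration launches its own conv(), so the per-channel slicing and the
final concat() dominate the cost; a fused kernel avoids that round trip.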
Change: 115601904
The current depthwise_conv is very inefficient: it calls slice() on each
input channel of both the input and the filters, runs a separate conv() per
channel, and then concat()s the results.
Change: 115583330
These tools allow recording structured benchmark and unit-test output to
pbtxt files in a directory, but only when the environment variable
TEST_REPORT_FILE_PREFIX is set. For now, only saving of C++ microbenchmark
output is supported.
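A minimal sketch of the gating, with a hypothetical recording helper (only
the TEST_REPORT_FILE_PREFIX variable comes from the actual change):

  import os

  def maybe_record_benchmark(name, metrics):
    prefix = os.environ.get("TEST_REPORT_FILE_PREFIX")
    if prefix is None:
      return  # recording stays off unless the variable is set
    path = "%s%s.pbtxt" % (prefix, name.replace("/", "__"))
    with open(path, "w") as f:
      for key, value in sorted(metrics.items()):
        # Illustrative pbtxt-style serialization, not the real schema.
        f.write('entry { name: "%s" value: %s }\n' % (key, value))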
Change: 115518303
Helps with: https://github.com/tensorflow/tensorflow/issues/917
Also fixes https://github.com/tensorflow/tensorflow/issues/1162
The main benefit is that the computation of the sufficient statistics is now decoupled from the aggregation of the moments. This means that if you want to perform the accumulation incrementally, you don't have to keep all the inputs around, and can instead keep the much more compact sum and sum-of-squares. Accumulation could also be performed locally if you aggregate across multiple devices.
Computing sum and sum-of-squares can also theoretically be performed in parallel now.
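A minimal numpy sketch of the idea (names are illustrative): accumulate the
compact sufficient statistics per batch, then derive the moments once at the
end.

  import numpy as np

  def accumulate(stats, batch):
    count, s, ss = stats
    # Only count, sum, and sum-of-squares are kept, never the inputs.
    return (count + batch.size, s + batch.sum(), ss + np.square(batch).sum())

  def moments_from_stats(stats):
    count, s, ss = stats
    mean = s / count
    variance = ss / count - mean * mean   # E[x^2] - E[x]^2
    return mean, variance

  stats = (0, 0.0, 0.0)
  for batch in (np.random.randn(32), np.random.randn(64)):
    stats = accumulate(stats, batch)      # incremental; also device-local
  mean, var = moments_from_stats(stats)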
Tested running inception: same performance, same step time.
The batch normalization benchmark is a bit faster on CPU and a bit slower on GPU:
Before:
cpu shape:4/3 #layers:10 mode:py scale:True train:False - 1.139310 secs
gpu shape:4/3 #layers:10 mode:py scale:True train:False - 0.021970 secs
cpu shape:4/3 #layers:10 mode:py scale:True train:True - 2.767147 secs
gpu shape:4/3 #layers:10 mode:py scale:True train:True - 0.074531 secs
cpu shape:4/3 #layers:10 mode:py scale:True train:False - 0.742835 secs
gpu shape:4/3 #layers:10 mode:py scale:True train:False - 0.013473 secs
cpu shape:4/3 #layers:10 mode:py scale:True train:True - 1.738806 secs
gpu shape:4/3 #layers:10 mode:py scale:True train:True - 0.052777 secs
cpu shape:2/1 #layers:10 mode:py scale:True train:False - 0.119180 secs
gpu shape:2/1 #layers:10 mode:py scale:True train:False - 0.011201 secs
cpu shape:2/1 #layers:10 mode:py scale:True train:True - 0.218297 secs
gpu shape:2/1 #layers:10 mode:py scale:True train:True - 0.048526 secs
After:
cpu shape:4/3 #layers:10 mode:py scale:True train:False - 0.998944 secs
gpu shape:4/3 #layers:10 mode:py scale:True train:False - 0.025828 secs
cpu shape:4/3 #layers:10 mode:py scale:True train:True - 2.657428 secs
gpu shape:4/3 #layers:10 mode:py scale:True train:True - 0.086614 secs
cpu shape:4/3 #layers:10 mode:py scale:True train:False - 0.603137 secs
gpu shape:4/3 #layers:10 mode:py scale:True train:False - 0.017668 secs
cpu shape:4/3 #layers:10 mode:py scale:True train:True - 1.519533 secs
gpu shape:4/3 #layers:10 mode:py scale:True train:True - 0.055214 secs
cpu shape:2/1 #layers:10 mode:py scale:True train:False - 0.071344 secs
gpu shape:2/1 #layers:10 mode:py scale:True train:False - 0.016440 secs
cpu shape:2/1 #layers:10 mode:py scale:True train:True - 0.222093 secs
gpu shape:2/1 #layers:10 mode:py scale:True train:True - 0.039967 secs
Change: 115507032
Both gather and scatter now unconditionally validate indices in the inner loop,
which prevents crashes if indices are changed asynchronously while the ops are
running.
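A small Python model of the fused-loop idea (illustrative only, not the C++
kernel): the bounds check happens at the point of use, so an index mutated
concurrently cannot slip past an earlier, separate validation pass.

  import numpy as np

  def gather_validated(params, indices):
    out = np.empty((len(indices),) + params.shape[1:], dtype=params.dtype)
    limit = params.shape[0]
    for i, idx in enumerate(indices):
      if not 0 <= idx < limit:            # validation fused into the copy loop
        raise ValueError("index %d out of range [0, %d)" % (idx, limit))
      out[i] = params[idx]
    return out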
For gather when validate_indices = true, the new code is within the noise of
the old code's speed, and possibly slightly faster (unsurprising, since the
new code fuses two loops). Specifically, the geometric mean of the int32
gather benchmarks goes from 4.05 GB/s to 4.04-4.07 GB/s.
For gather when validate_indices = false, the new code is about 1.5% slower
than both the old code and a variant of the old code that supported
validate_indices = false. Xiaoqiang and I deem this difference insufficient
to preserve the unsafe code path, so poof: it's gone.
For scatter (which always validates), the new code is slightly faster than the
old code: the geometric mean goes from 546-559M items/s to 573M items/s.
Change: 115467091
to be used to colocate based on attributes rather than either
names of ops or devices (op names and devices aren't portable).
A follow-up change will add an ops.colocate_with() to Python that adds
this attribute to nodes; it will be used to replace calls to 'with
tf.device(foo.device)' in TF library code, which assume that devices
have already been specified.
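A sketch of the intended usage once ops.colocate_with() lands, written
against today's tf.compat.v1 API (the exact call shape is an assumption):

  import tensorflow as tf

  g = tf.Graph()
  with g.as_default():
    v = tf.compat.v1.get_variable("v", initializer=[1.0, 2.0])
    # Old style: hard-codes a device string, and only works once a
    # device has actually been assigned to v.
    #   with tf.device(v.device):
    #     update = v.assign_add([1.0, 1.0])
    # New style: record a colocation constraint as a node attribute,
    # keeping the graph portable across devices.
    with tf.compat.v1.colocate_with(v):
      update = v.assign_add([1.0, 1.0])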
Change: 115463464