Merge changes from GitHub.

END_PUBLIC

---
Commit d0f53f77f authored by Penghao Cen<scorpiocph@gmail.com>
Committed by Shanqing Cai<cais@google.com>:
Fix minor typo (#11323)

---
Commit 02fcf564e authored by Chris Song<sjhshy@gmail.com>
Committed by Chris Song<sjhshy@gmail.com>:
Fix misspellings.

---
Commit 764c9b6b4 authored by Louis Tiao<ltiao@users.noreply.github.com>
Committed by GitHub<noreply@github.com>:
Fixed typo in docstring

---
Commit f8cd1283e authored by Shanqing Cai<cais@google.com>
Committed by Shanqing Cai<cais@google.com>:
Chaser

---
Commit 01383b946 authored by Shanqing Cai<cais@google.com>
Committed by Shanqing Cai<cais@google.com>:
Adapt TensorFlowTestCase.setUp() to new reset_default_graph() semantics

Avoid calling reset_default_graph() directly to prevent exceptions in
cases where test methods error out from within nested graph contexts,
which can leave _default_graph_stack non-empty in certain Python
versions.
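The failure mode above can be sketched with a toy model of the default-graph stack (hypothetical Python sketch; names and behavior are illustrative, not TensorFlow internals):

```python
# Toy model of a default-graph stack. reset() mirrors the new
# reset_default_graph() semantics: it raises when nested graph contexts
# are still open, e.g. after a test errored inside `with g.as_default():`.
class GraphStack:
    def __init__(self):
        self._stack = []

    def push(self, graph):
        # Models entering a `with g.as_default():` context.
        self._stack.append(graph)

    def reset(self):
        # Models reset_default_graph(): unsafe while contexts are open.
        if self._stack:
            raise AssertionError("Do not reset inside a nested graph context.")
        # ...a fresh default graph would be installed here...

    def safe_reset(self):
        # Models what an adapted setUp() effectively does: drop leftover
        # contexts first so the reset cannot raise.
        self._stack.clear()
        self.reset()
```

A test that errors out mid-context leaves the stack non-empty, so calling `reset()` directly raises, while `safe_reset()` does not.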

---
Commit 0ffc37890 authored by Amit Patankar<amitpatankar@google.com>
Committed by Amit Patankar<amitpatankar@google.com>:
Removing second declaration of functions.

---
Commit f9c9cacb0 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Refactor ElementalIrEmitter's slice index finding code into
IrArray::Index::SourceIndexOfSlice().

PiperOrigin-RevId: 161140653

---
Commit ba297aec9 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Update ops-related pbtxt files.

PiperOrigin-RevId: 161138258

---
Commit 68d666737 authored by Alexandre Passos<apassos@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fixes a reentrant lock issue with tensors using ndarray memory which uses tensor memory.

PiperOrigin-RevId: 161137788

---
Commit a2ee8bca3 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add support for int8 x int8 -> int32 matrix multiplication via cublasGemmEx to stream_executor.
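The semantics being added can be illustrated in plain Python (a sketch of int8 x int8 -> int32 accumulation only, not the stream_executor or cublasGemmEx API):

```python
# int8 x int8 -> int32 matmul semantics: operands are 8-bit, but each
# product is accumulated at 32-bit width, so the result cannot overflow
# the way an 8- or 16-bit accumulator would.
INT8_MIN, INT8_MAX = -128, 127

def int8_matmul(a, b):
    # a: m x k, b: k x n; all entries must fit in int8.
    assert all(INT8_MIN <= v <= INT8_MAX for row in a for v in row)
    assert all(INT8_MIN <= v <= INT8_MAX for row in b for v in row)
    k = len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(k))
             for j in range(len(b[0]))]
            for i in range(len(a))]

c = int8_matmul([[127, 127, 127]], [[127], [127], [127]])
# 3 * 127 * 127 = 48387: representable in int32, not in int8/int16.
```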

PiperOrigin-RevId: 161137741

---
Commit 755fa7b50 authored by Mark Daoust<markdaoust@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Block generate_test and doc generation from running in Python 3.

- Doc generation is currently unsupported in Python 3

- Both end in errors in Python 3.5.1+

PiperOrigin-RevId: 161137467

---
Commit 97cbcac45 authored by Peter Hawkins<phawkins@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[TF:XLA] Fix failure in functionalize_control_flow rewrite for Enter nodes that are unused. Make sure we ignore such nodes without producing an error.

PiperOrigin-RevId: 161136545

---
Commit dabcb60bc authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[XLA] Add reasonable error messages to Builder::Build for bad parameter numbers.

PiperOrigin-RevId: 161136262

---
Commit 0cbd249e8 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add complex tensors support to `matrix_determinant`.
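For intuition, the complex case is ordinary determinant arithmetic over complex entries; a 2x2 worked example in plain Python (an illustration only, not the op's implementation):

```python
# 2x2 determinant with complex entries: ad - bc, using Python's
# built-in complex type.
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

m = [[1 + 2j, 0 + 1j],
     [3 + 0j, 2 - 1j]]
# (1+2j)(2-1j) - (1j)(3+0j) = (4+3j) - 3j = 4
```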

PiperOrigin-RevId: 161132422

---
Commit 335f1f14d authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Extend static shape inference for SparseTensors with dense_shapes constructed using slicing.

PiperOrigin-RevId: 161132391

---
Commit 53604916e authored by Jianwei Xie<xiejw@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fixed the missing labels test in TPUEstimator.

PiperOrigin-RevId: 161131282

---
Commit 9f57dc8dd authored by Bruno Rosa<bruno.rosa@eldorado.org.br>
Committed by Bruno Rosa<bruno.rosa@eldorado.org.br>:
Use mcpu instead of march for ppc64le

march is not supported by gcc on ppc64le
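The flag selection amounts to the following (a Python sketch of the shell logic added to `configure`):

```python
import platform

# gcc on ppc64le does not support -march, so -mcpu is used there instead.
def default_cc_opt_flags(machine=None):
    machine = machine or platform.machine()
    if machine == "ppc64le":
        return "-mcpu=native"
    return "-march=native"
```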

---
Commit 7d5c74a9c authored by Skye Wanderman-Milne<skyewm@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Move duplicate detection logic from Graph to FunctionLibraryDefinition

Turns out this is more useful, since there are many function libraries
that don't belong to a graph. This will be used in a future
change. Note that this maintains the current behavior of Graph.

In addition, updates FunctionDefsEqual() to handle unset attr entries
(I ran into this when using this in said future change).
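The unset-attr handling could look like this (a hypothetical Python model of the attr comparison in `FunctionDefsEqual()`, not the actual C++ code):

```python
# Compare two attr maps, treating an unset entry on one side as equal
# to the other side when both resolve to the declared default value.
def attrs_equal(a, b, defaults):
    for k in set(a) | set(b) | set(defaults):
        if a.get(k, defaults.get(k)) != b.get(k, defaults.get(k)):
            return False
    return True
```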

PiperOrigin-RevId: 161126628

---
Commit 2caec3af1 authored by Shanqing Cai<cais@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Disable more timeseries py tests failing in OSS PIP GPU builds

PiperOrigin-RevId: 161124799

---
Commit 0b5cce367 authored by Eugene Brevdo<ebrevdo@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Get TopK op working on GPU again.  Extend using cub's radix sort.

1. Undo rollback of Andreas Kirsch's initial implementation.
2. Use cub segmented radix sort instead of Andreas' heap-based impl
   for large k and small num_cols (thresholds of k=100, n=1000
   determined empirically).
3. Use cub segmented radix sort if k == num_cols (this case is always faster).
4. Added benchmarks.

Benchmarks show that the GPU implementation is up to 3x slower for small k but
can be 10x faster for large num_cols and k.
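The dispatch rule described above can be sketched as follows (a Python sketch; the threshold directions are an assumption read off the commit text, not the kernel's actual selection code):

```python
# Choose between the cub segmented radix sort path and the heap-based
# TopK kernel, per the rules above: radix when k == num_cols (always
# faster), or when k is large and num_cols small (empirical thresholds
# k=100, n=1000 quoted in the text).
K_THRESHOLD, N_THRESHOLD = 100, 1000

def choose_topk_impl(k, num_cols):
    if k == num_cols:
        return "radix"  # rule 3: always faster in this case
    if k >= K_THRESHOLD and num_cols <= N_THRESHOLD:
        return "radix"  # rule 2: large k, small num_cols
    return "heap"       # otherwise keep the heap-based kernel
```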

Benchmarks:

Benchmark: m_128_n_10_k_5_use_gpu_False          wall_time: 0.000166 s   Throughput: 0.0077 GB/s
Benchmark: m_128_n_10_k_5_use_gpu_True   wall_time: 0.000796 s   Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_False          wall_time: 0.00017 s    Throughput: 0.00751 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_True   wall_time: 0.000796 s   Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_False         wall_time: 0.00017 s    Throughput: 0.00753 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_True          wall_time: 0.000775 s   Throughput: 0.00165 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_False         wall_time: 0.000155 s   Throughput: 0.0826 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_True          wall_time: 0.000796 s   Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_False        wall_time: 0.000247 s   Throughput: 0.0519 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_True         wall_time: 0.0008 s     Throughput: 0.016 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_False        wall_time: 0.000261 s   Throughput: 0.049 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_True         wall_time: 0.000794 s   Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_False       wall_time: 0.000239 s   Throughput: 0.0536 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_True        wall_time: 0.000777 s   Throughput: 0.0165 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_False        wall_time: 0.000324 s   Throughput: 0.395 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_True         wall_time: 0.000916 s   Throughput: 0.14 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_False       wall_time: 0.00042 s    Throughput: 0.305 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_True        wall_time: 0.000902 s   Throughput: 0.142 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_False      wall_time: 0.0011 s     Throughput: 0.116 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_True       wall_time: 0.00097 s    Throughput: 0.132 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_False      wall_time: 0.00133 s    Throughput: 0.0962 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_True       wall_time: 0.000993 s   Throughput: 0.129 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_False     wall_time: 0.00102 s    Throughput: 0.126 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_True      wall_time: 0.000964 s   Throughput: 0.133 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_False      wall_time: 0.002 s      Throughput: 0.64 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_True       wall_time: 0.00288 s    Throughput: 0.445 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_False     wall_time: 0.00233 s    Throughput: 0.549 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_True      wall_time: 0.00325 s    Throughput: 0.394 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_False    wall_time: 0.0127 s     Throughput: 0.101 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_True     wall_time: 0.00381 s    Throughput: 0.336 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_False    wall_time: 0.015 s      Throughput: 0.0853 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_True     wall_time: 0.00438 s    Throughput: 0.292 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_False   wall_time: 0.0104 s     Throughput: 0.123 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_True    wall_time: 0.00427 s    Throughput: 0.3 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_False    wall_time: 0.0148 s     Throughput: 0.865 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_True     wall_time: 0.0262 s     Throughput: 0.488 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_False   wall_time: 0.0201 s     Throughput: 0.636 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_True    wall_time: 0.0263 s     Throughput: 0.486 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_False          wall_time: 0.214 s      Throughput: 0.0599 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_True   wall_time: 0.0322 s     Throughput: 0.398 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_False          wall_time: 0.262 s      Throughput: 0.0489 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_True   wall_time: 0.0377 s     Throughput: 0.34 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_False         wall_time: 0.118 s      Throughput: 0.108 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_True          wall_time: 0.0365 s     Throughput: 0.351 GB/s

END_PUBLIC

BEGIN_PUBLIC
Automated g4 rollback of changelist 157169178

PiperOrigin-RevId: 161476569
This commit is contained in:
Shanqing Cai 2017-07-10 19:22:04 -07:00 committed by TensorFlower Gardener
parent aa239529c4
commit 90d6421c5e
243 changed files with 2707 additions and 814 deletions


@@ -17,6 +17,7 @@ If you open a GitHub issue, here is our policy:
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**:
- **TensorFlow installed from (source or binary)**:
- **TensorFlow version (use command below)**:
- **Python version**:
- **Bazel version (if compiling from source)**:
- **CUDA/cuDNN version**:
- **GPU model and memory**:


@@ -16,7 +16,7 @@ or more CPUs or GPUs in a desktop, server, or mobile device without rewriting
code. TensorFlow also includes TensorBoard, a data visualization toolkit.
TensorFlow was originally developed by researchers and engineers
working on the Google Brain team within Google's Machine Intelligence research
working on the Google Brain team within Google's Machine Intelligence Research
organization for the purposes of conducting machine learning and deep neural
networks research. The system is general enough to be applicable in a wide
variety of other domains, as well.
@@ -34,13 +34,13 @@ and discussion.**
People who are a little more adventurous can also try our nightly binaries:
* Linux CPU-only: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0-cp27-none-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave)) / [Python 3.4](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0-cp34-cp34m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/)) / [Python 3.5](https://ci.tensorflow.org/view/Nightly/job/nightly-python35-linux-cpu/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0-cp35-cp35m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-python35-linux-cpu/))
* Linux GPU: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0-cp27-none-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-linux/)) / [Python 3.4](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0-cp34-cp34m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-linux/)) / [Python 3.5](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3.5,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0-cp35-cp35m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3.5,label=gpu-linux/))
* Mac CPU-only: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0-py2-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/)) / [Python 3](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0-py3-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/))
* Mac GPU: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-mac/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0-py2-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-mac/)) / [Python 3](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-mac/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0-py3-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-mac/))
* Windows CPU-only: [Python 3.5 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=35/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow-1.2.0-cp35-cp35m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=35/)) / [Python 3.6 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=36/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow-1.2.0-cp36-cp36m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=36/))
* Windows GPU: [Python 3.5 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=35/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow_gpu-1.2.0-cp35-cp35m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=35/)) / [Python 3.6 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=36/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow_gpu-1.2.0-cp36-cp36m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=36/))
* Android: [demo APK](https://ci.tensorflow.org/view/Nightly/job/nightly-android/lastSuccessfulBuild/artifact/out/tensorflow_demo.apk), [native libs](https://ci.tensorflow.org/view/Nightly/job/nightly-android/lastSuccessfulBuild/artifact/out/native/)
* Linux CPU-only: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.1-cp27-none-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave)) / [Python 3.4](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.1-cp34-cp34m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/)) / [Python 3.5](https://ci.tensorflow.org/view/Nightly/job/nightly-python35-linux-cpu/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.1-cp35-cp35m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-python35-linux-cpu/))
* Linux GPU: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.1-cp27-none-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-linux/)) / [Python 3.4](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.1-cp34-cp34m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-linux/)) / [Python 3.5](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3.5,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.1-cp35-cp35m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3.5,label=gpu-linux/))
* Mac CPU-only: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.1-py2-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/)) / [Python 3](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.1-py3-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/))
* Mac GPU: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-mac/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.1-py2-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-mac/)) / [Python 3](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-mac/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.1-py3-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-mac/))
* Windows CPU-only: [Python 3.5 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=35/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow-1.2.1-cp35-cp35m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=35/)) / [Python 3.6 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=36/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow-1.2.1-cp36-cp36m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=36/))
* Windows GPU: [Python 3.5 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=35/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow_gpu-1.2.1-cp35-cp35m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=35/)) / [Python 3.6 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=36/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow_gpu-1.2.1-cp36-cp36m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=36/))
* Android: [demo APK](https://ci.tensorflow.org/view/Nightly/job/nightly-android/lastSuccessfulBuild/artifact/out/tensorflow_demo.apk), [native libs](http://ci.tensorflow.org/view/Nightly/job/nightly-android/lastSuccessfulBuild/artifact/out/native/)
([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-android/))
#### *Try your first TensorFlow program*


@@ -1,3 +1,9 @@
# Release 1.2.1
## Bug Fixes and Other Changes
* Updating markdown version required to >= 2.6.8.
* Support tensors as dropout rates again, by removing the min(max(..))
# Release 1.2.0
## Major Features and Improvements


@@ -25,6 +25,10 @@ function is_windows() {
[[ "${PLATFORM}" =~ msys_nt*|mingw*|cygwin*|uwin* ]]
}
function is_ppc64le() {
[[ "$(uname -m)" == "ppc64le" ]]
}
function sed_in_place() {
sed -e $1 $2 > "$2.bak"
mv "$2.bak" $2
@@ -294,7 +298,12 @@ fi # TF_NEED_MKL
## Set up architecture-dependent optimization flags.
if [ -z "$CC_OPT_FLAGS" ]; then
default_cc_opt_flags="-march=native"
if is_ppc64le; then
# gcc on ppc64le does not support -march, use mcpu instead
default_cc_opt_flags="-mcpu=native"
else
default_cc_opt_flags="-march=native"
fi
read -p "Please specify optimization flags to use during compilation when bazel option "\
"\"--config=opt\" is specified [Default is $default_cc_opt_flags]: " CC_OPT_FLAGS
if [ -z "$CC_OPT_FLAGS" ]; then


@@ -912,11 +912,13 @@ class CSession {
for (TF_Operation* o : outputs) {
outputs_.emplace_back(TF_Output{o, 0});
}
output_values_.resize(outputs_.size());
}
void SetOutputs(const std::vector<TF_Output>& outputs) {
ResetOutputValues();
outputs_ = outputs;
output_values_.resize(outputs_.size());
}
void SetTargets(std::initializer_list<TF_Operation*> targets) {


@@ -152,12 +152,12 @@ Status SymbolicGradientBuilder::Initialize() {
grad_outputs_->resize(inputs_.size());
// Populate `output_nodes_` from node ids in `outputs_`.
output_nodes_.reserve(outputs_.size());
for (int i = 0; i < outputs_.size(); ++i) {
for (size_t i = 0; i < outputs_.size(); ++i) {
output_nodes_.insert(outputs_[i].node()->id());
}
// Populate `input_nodes_` from Outputs in `inputs_`.
input_nodes_.reserve(inputs_.size());
for (int i = 0; i < inputs_.size(); ++i) {
for (size_t i = 0; i < inputs_.size(); ++i) {
input_nodes_.insert({inputs_[i], i});
}
@@ -341,7 +341,7 @@ Status SymbolicGradientBuilder::AddGradients() {
// gradient function to the src node/output to which it should be
// backproped. Maybe grad functions can return a vector of Output pairs to
// make this association explicit.
int dx_index = 0;
size_t dx_index = 0;
for (const Edge* e : n->in_edges()) {
if (e->IsControlEdge()) continue;
if (dx_index == dx.size()) {


@@ -203,6 +203,46 @@ Status TanhGrad(const Scope& scope, const Operation& op,
}
REGISTER_GRADIENT_OP("Tanh", TanhGrad);
Status AsinhGrad(const Scope& scope, const Operation& op,
const std::vector<Output>& grad_inputs,
std::vector<Output>* grad_outputs) {
// y = asinh(x)
// dy/dx = 1 / cosh(y)
auto dydx = Reciprocal(scope, Cosh(scope, op.output(0)));
// grad(x) = grad(y) * conj(dy/dx)
grad_outputs->push_back(
Mul(scope, grad_inputs[0], ConjugateHelper(scope, dydx)));
return scope.status();
}
REGISTER_GRADIENT_OP("Asinh", AsinhGrad);
Status AcoshGrad(const Scope& scope, const Operation& op,
const std::vector<Output>& grad_inputs,
std::vector<Output>* grad_outputs) {
// y = acosh(x)
// dy/dx = 1 / sinh(y)
auto dydx = Reciprocal(scope, Sinh(scope, op.output(0)));
// grad(x) = grad(y) * conj(dy/dx)
grad_outputs->push_back(
Mul(scope, grad_inputs[0], ConjugateHelper(scope, dydx)));
return scope.status();
}
REGISTER_GRADIENT_OP("Acosh", AcoshGrad);
Status AtanhGrad(const Scope& scope, const Operation& op,
const std::vector<Output>& grad_inputs,
std::vector<Output>* grad_outputs) {
// y = atanh(x)
// dy/dx = 1 / (1 - x^2)
auto one = Cast(scope, Const(scope, 1.0), op.input(0).type());
auto dydx = Reciprocal(scope, Sub(scope, one, Square(scope, op.input(0))));
// grad(x) = grad(y) * conj(dy/dx)
grad_outputs->push_back(
Mul(scope, grad_inputs[0], ConjugateHelper(scope, dydx)));
return scope.status();
}
REGISTER_GRADIENT_OP("Atanh", AtanhGrad);
Status SigmoidGrad(const Scope& scope, const Operation& op,
const std::vector<Output>& grad_inputs,
std::vector<Output>* grad_outputs) {


@@ -48,6 +48,9 @@ class CWiseUnaryGradTest : public ::testing::Test {
SINH,
COSH,
TANH,
ASINH,
ACOSH,
ATANH,
SIGMOID,
SIGN,
SIN,
@@ -122,6 +125,15 @@ class CWiseUnaryGradTest : public ::testing::Test {
case TANH:
y = Tanh(scope_, x);
break;
case ASINH:
y = Asinh(scope_, x);
break;
case ACOSH:
y = Acosh(scope_, x);
break;
case ATANH:
y = Atanh(scope_, x);
break;
case SIGMOID:
y = Sigmoid(scope_, x);
break;
@@ -413,6 +425,76 @@ TEST_F(CWiseUnaryGradTest, Tanh_Complex) {
TestCWiseGrad<complex64>(TANH, x_fn, dy_fn, dx_fn);
}
TEST_F(CWiseUnaryGradTest, Asinh) {
auto x_fn = [this](const int i) { return RV({0, -1, 1, -2, 2, -3, 3}); };
auto dy_fn = [this](const float x) { return x + RV({-2, 2, -3, 3, -4, 4}); };
auto dx_fn = [this](const float x, const float dy) {
auto y = std::asinh(x);
return dy / std::cosh(y);
};
TestCWiseGrad<float>(ASINH, x_fn, dy_fn, dx_fn);
}
TEST_F(CWiseUnaryGradTest, Asinh_Complex) {
auto x_fn = [this](const int i) {
return CRV({{1, 0}, {0, 1}, {2, -1}, {1, 2}, {3, 4}});
};
auto dy_fn = [this](const complex64& x) {
return x + CRV({{-2, 2}, {-3, 3}, {1, -4}});
};
auto dx_fn = [this](const complex64& x, const complex64& dy) {
auto y = std::asinh(x);
return dy / conjugate(std::cosh(y));
};
TestCWiseGrad<complex64>(ASINH, x_fn, dy_fn, dx_fn);
}
TEST_F(CWiseUnaryGradTest, Acosh) {
auto x_fn = [this](const int i) { return RV({1, 2, 3, 4, 5, 6, 7}); };
auto dy_fn = [this](const float x) { return x + RV({8, 9, 10, 11, 12, 13, 14}); };
auto dx_fn = [this](const float x, const float dy) {
auto y = std::acosh(x);
return dy / std::sinh(y);
};
TestCWiseGrad<float>(ACOSH, x_fn, dy_fn, dx_fn);
}
TEST_F(CWiseUnaryGradTest, Acosh_Complex) {
auto x_fn = [this](const int i) {
return CRV({{1, 1}, {2, 1}, {1, 4}, {1, 2}, {3, 4}});
};
auto dy_fn = [this](const complex64& x) {
return x + CRV({{2, 2}, {3, 3}, {1, 4}});
};
auto dx_fn = [this](const complex64& x, const complex64& dy) {
auto y = std::acosh(x);
return dy / conjugate(std::sinh(y));
};
TestCWiseGrad<complex64>(ACOSH, x_fn, dy_fn, dx_fn);
}
TEST_F(CWiseUnaryGradTest, Atanh) {
auto x_fn = [this](const int i) { return RV({0, -0.5, 0.5, -0.1, 0.1}); };
auto dy_fn = [this](const float x) { return x + RV({-2, 2, -3, 3, -4, 4}); };
auto dx_fn = [this](const float x, const float dy) {
return dy * (1. / (1. - x * x));
};
TestCWiseGrad<float>(ATANH, x_fn, dy_fn, dx_fn);
}
TEST_F(CWiseUnaryGradTest, Atanh_Complex) {
auto x_fn = [this](const int i) {
return CRV({{0.1, 0}, {0, 0.1}, {0.2, -0.1}, {0.1, 0.2}, {0.3, 0.4}});
};
auto dy_fn = [this](const complex64& x) {
return x + CRV({{-2, 2}, {-3, 3}, {1, -4}});
};
auto dx_fn = [this](const complex64& x, const complex64& dy) {
return dy / conjugate(one_ - x * x);
};
TestCWiseGrad<complex64>(ATANH, x_fn, dy_fn, dx_fn);
}
TEST_F(CWiseUnaryGradTest, Sigmoid) {
auto x_fn = [this](const int i) { return RV({0, -1, 1, -2, 2, -3, 3}); };
auto dy_fn = [this](const float x) { return x + RV({-2, 2, -3, 3, -4, 4}); };


@@ -383,7 +383,7 @@ Status FunctionalizeLoop(Graph* graph, Frame* frame,
}
}
if (arg.exit == nullptr) {
return errors::InvalidArgument("Mising Exit successor to ",
return errors::InvalidArgument("Missing Exit successor to ",
arg.switch_node->name());
}
}


@@ -63,7 +63,7 @@ class ParallelizationPreparation : public HloPassInterface {
// Outlines 'instruction' from entry computation, if it had
// been assigned parallel tasks in an earlier pass through the computation.
// Returns true if 'instruction' was succesfully outlined, false otherwise.
// Returns true if 'instruction' was successfully outlined, false otherwise.
bool OutlineParallelizableInstruction(HloInstruction* instruction);
// Returns true if 'instruction' can be outlined into the same sub-computation


@@ -21,6 +21,7 @@ limitations under the License.
#define TENSORFLOW_COMPILER_XLA_SERVICE_CPU_INFEED_MANAGER_H_
#include <deque>
#include <vector>
#include "tensorflow/compiler/xla/types.h"
#include "tensorflow/core/platform/mutex.h"


@@ -441,7 +441,7 @@ Status ParallelCpuExecutable::ExecuteComputeFunctions(
// TODO(b/27458679) Manage scheduling based on in-flight concurrency limits.
// For example, if we expect a library conv/matmul call to run at max
// concurrency, we should not dispatch runnable instructions until the
// libary call is finished (to avoid expensive cache invalidation).
// library call is finished (to avoid expensive cache invalidation).
Executor executor(functions, run_options, &pending, &results,
buffer_pointers.data(), profile_counters.data(),
assignment_.get());


@@ -20,7 +20,7 @@ namespace cpu {
std::vector<int64> ShapePartitionAssigner::Run(int64 target_partition_count) {
// Gather outer-most dims where dim_size >= 'target_partition_count'.
// Note: always leave inner-dim static for vectorization/optimzations.
// Note: always leave inner-dim static for vectorization/optimizations.
std::vector<int64> outer_dims;
int64 outer_dim_size = 1;
// TODO(b/27458679) Consider reserving enough minor dimensions (based on


@@ -38,7 +38,7 @@ namespace cpu {
//
// [0, 1), [1, 2), [2, 3), [3, 4), [4, 5) [5, 8)
//
// Note that the last parition has residule because the dimension size is
// Note that the last partition has residule because the dimension size is
// not a multiple of the partition count.
//
//


@@ -495,7 +495,7 @@ TEST_F(HloAliasAnalysisTest, NestedWhiles) {
};
// Build separate condition computations so the call graph is flat. The
// callgraph is always flattened in the compiler pipeline, and the flattened
// callgraph enables representive interference analysis.
// callgraph enables representative interference analysis.
HloComputation* condition1 =
module_->AddEmbeddedComputation(build_cond_computation());
HloComputation* condition2 =


@@ -527,7 +527,7 @@ xla_test(
)
# Tests the dot operation in some cases that can be performed via a
# runtime call on some backends - e.g. a runtime call to to Eigen.
# runtime call on some backends - e.g. a runtime call to Eigen.
xla_test(
name = "dot_operation_runtime_test",
srcs = ["dot_operation_test.cc"],

@@ -1,8 +1,9 @@
 """Build rules for XLA testing."""
 load("@local_config_cuda//cuda:build_defs.bzl", "cuda_is_configured")
+load("//tensorflow/compiler/xla/tests:plugin.bzl", "plugins")
-all_backends = ["cpu", "cpu_parallel", "gpu"]
+all_backends = ["cpu", "cpu_parallel", "gpu"] + plugins.keys()
 def filter_backends(backends):
 """Removes "gpu" from a backend list if CUDA is not enabled.
@@ -121,6 +122,11 @@ def xla_test(name,
 backend_deps = ["//tensorflow/compiler/xla/service:gpu_plugin"]
 backend_deps += ["//tensorflow/compiler/xla/tests:test_macros_gpu"]
 this_backend_tags += ["requires-gpu-sm35"]
+elif backend in plugins:
+backend_deps = plugins[backend]["deps"]
+this_backend_copts += plugins[backend]["copts"]
+this_backend_tags += plugins[backend]["tags"]
+this_backend_args += plugins[backend]["args"]
 else:
 fail("Unknown backend %s" % backend)

@@ -0,0 +1,32 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Additional XLA devices to be included in the unit test suite."""
+# Example:
+#
+# plugins = {
+# "foo": {
+# "deps": [
+# "//tensorflow/compiler/plugin/foo:foo_lib",
+# "//tensorflow/compiler/plugin/foo:test_macros",
+# ],
+# "copts": [],
+# "tags": [],
+# "args": []
+# },
+# }
+plugins = {}

@@ -51,7 +51,7 @@ Status Concat(OpKernelContext* context, const gtl::ArraySlice<Tensor>& inputs,
 std::vector<std::unique_ptr<typename TTypes<T, 2>::ConstMatrix>> inputs_flat;
 inputs_flat.reserve(inputs.size());
 int64 output_dim0 = 0;
-for (int i = 0; i < inputs.size(); ++i) {
+for (size_t i = 0; i < inputs.size(); ++i) {
 const Tensor& input = inputs[i];
 if (input.dims() != input_dims) {
 return errors::InvalidArgument(
@@ -548,7 +548,7 @@ class BatchKernel : public AsyncOpKernel {
 return Status::OK();
 }
 int32 last_size = 0;
-for (int i = 0; i < allowed_batch_sizes_.size(); ++i) {
+for (size_t i = 0; i < allowed_batch_sizes_.size(); ++i) {
 const int32 size = allowed_batch_sizes_.at(i);
 if (i > 0 && size <= last_size) {
 return errors::InvalidArgument(
@@ -675,7 +675,7 @@ class UnbatchResource : public ResourceBase {
 // If we have a non-empty tensor, finish the waitlisted runs,
 // and store any remaining pieces.
 if (nonempty_input) {
-for (int i = 0; i < batch_keys.size(); ++i) {
+for (size_t i = 0; i < batch_keys.size(); ++i) {
 auto runs_it = waiting_callbacks_.find(batch_keys[i]);
 if (runs_it != waiting_callbacks_.end()) {
 runs_it->second.context->set_output(0, split_inputs[i]);

@@ -106,7 +106,7 @@ void CategoricalFeatureColumnHandler::GenerateFeatureSplitCandidates(
 NodeStats left_node_stats(learner_config, left_gradient_stats);
 NodeStats right_node_stats(learner_config, right_gradient_stats);
-// Generate split candiate and update best split candidate for the
+// Generate split candidate and update best split candidate for the
 // current root if needed.
 FeatureSplitCandidate split_candidate(
 slot_id_,

@@ -93,7 +93,7 @@ void DenseQuantizedFeatureColumnHandler::GenerateFeatureSplitCandidates(
 NodeStats left_node_stats(learner_config, left_gradient_stats);
 NodeStats right_node_stats(learner_config, right_gradient_stats);
-// Generate split candiate.
+// Generate split candidate.
 const float threshold = dense_quantiles_(bucket_id);
 FeatureSplitCandidate split_candidate(
 slot_id_, CreateDenseSplitNode(dense_feature_column_, threshold),

@@ -109,7 +109,7 @@ void SparseQuantizedFeatureColumnHandler::GenerateFeatureSplitCandidates(
 NodeStats left_node_stats(learner_config, left_gradient_stats);
 NodeStats right_node_stats(learner_config, right_gradient_stats);
-// Generate split candiate.
+// Generate split candidate.
 const float threshold = sparse_quantiles_(bucket_id);
 FeatureSplitCandidate split_candidate(
 slot_id_,
@@ -124,7 +124,7 @@ void SparseQuantizedFeatureColumnHandler::GenerateFeatureSplitCandidates(
 // Determine if we need a backward pass by checking if the residual gradient
 // after forward aggregation is almost the same as the aggregated gradient.
-// for the current root. This helps avoid unecessary computation as well
+// for the current root. This helps avoid unnecessary computation as well
 // as consistency due to floating point precision.
 if (!right_gradient_stats.IsAlmostZero()) {
 // Backward pass with left default direction.
@@ -147,7 +147,7 @@ void SparseQuantizedFeatureColumnHandler::GenerateFeatureSplitCandidates(
 NodeStats left_node_stats(learner_config, left_gradient_stats);
 NodeStats right_node_stats(learner_config, right_gradient_stats);
-// Generate split candiate.
+// Generate split candidate.
 const float threshold = sparse_quantiles_(bucket_id - 1);
 FeatureSplitCandidate split_candidate(
 slot_id_,

@@ -1149,7 +1149,7 @@ class PredictionOpsTest(test_util.TensorFlowTestCase):
 adjusted_tree_ensemble_config = (
 tree_config_pb2.DecisionTreeEnsembleConfig())
 # When we say to average over more trees than possible, it is averaging
-# accross all trees.
+# across all trees.
 total_num = 100
 for i in range(0, total_num):
 tree = tree_ensemble_config.trees.add()

@@ -18,6 +18,7 @@ set(cub_URL https://github.com/NVlabs/cub/archive/1.6.4.zip)
 set(cub_HASH SHA256=966d0c4f41e2bdc81aebf9ccfbf0baffaac5a74f00b826b06f4dee79b2bb8cee)
 set(cub_BUILD ${CMAKE_CURRENT_BINARY_DIR}/cub/src/cub)
 set(cub_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/cub/src/cub)
+set(cub_ARCHIVE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/cub_archive)
 ExternalProject_Add(cub
 PREFIX cub
@@ -26,4 +27,4 @@ ExternalProject_Add(cub
 DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
 BUILD_IN_SOURCE 1
 PATCH_COMMAND ${CMAKE_COMMAND} -E copy_if_different ${CMAKE_CURRENT_SOURCE_DIR}/patches/cub/CMakeLists.txt ${cub_BUILD}
-INSTALL_COMMAND "")
+INSTALL_COMMAND ${CMAKE_COMMAND} -E copy_directory ${cub_INCLUDE_DIR}/cub ${cub_ARCHIVE_DIR}/cub)

@@ -126,13 +126,15 @@ if(WIN32)
 "${tensorflow_source_dir}/tensorflow/core/kernels/*quantiz*.h"
 "${tensorflow_source_dir}/tensorflow/core/kernels/*quantiz*.cc"
 "${tensorflow_source_dir}/tensorflow/core/kernels/neon/*"
-# no in tensorflow.dll - comes from .so
+# not in core - those are loaded dynamically as dll
 "${tensorflow_source_dir}/tensorflow/contrib/resampler/kernels/resampler_ops.cc"
 "${tensorflow_source_dir}/tensorflow/contrib/rnn/kernels/blas_gemm.cc"
 "${tensorflow_source_dir}/tensorflow/contrib/rnn/kernels/gru_ops.cc"
 "${tensorflow_source_dir}/tensorflow/contrib/rnn/kernels/lstm_ops.cc"
 "${tensorflow_source_dir}/tensorflow/contrib/rnn/ops/gru_ops.cc"
 "${tensorflow_source_dir}/tensorflow/contrib/rnn/ops/lstm_ops.cc"
+"${tensorflow_source_dir}/tensorflow/contrib/seq2seq/kernels/beam_search_ops.cc"
+"${tensorflow_source_dir}/tensorflow/contrib/seq2seq/ops/beam_search_ops.cc"
 # temporarily disable nccl (nccl itself needs to be ported to windows first)
 "${tensorflow_source_dir}/tensorflow/contrib/nccl/kernels/nccl_manager.cc"
 "${tensorflow_source_dir}/tensorflow/contrib/nccl/kernels/nccl_ops.cc"

@@ -148,6 +148,7 @@ if (tensorflow_BUILD_PYTHON_TESTS)
 "${tensorflow_source_dir}/tensorflow/contrib/data/*_test.py"
 "${tensorflow_source_dir}/tensorflow/contrib/factorization/*_test.py"
 "${tensorflow_source_dir}/tensorflow/contrib/keras/python/keras/integration_test.py"
+"${tensorflow_source_dir}/tensorflow/contrib/seq2seq/python/kernel_tests/*_test.py"
 "${tensorflow_source_dir}/tensorflow/contrib/stateless/python/kernel_tests/*_test.py"
 # NOTE: tensor_forest tests in tensor_forest/hybrid/... still don't pass.
 "${tensorflow_source_dir}/tensorflow/contrib/tensor_forest/client/*_test.py"
@@ -171,6 +172,10 @@ if (tensorflow_BUILD_PYTHON_TESTS)
 "${tensorflow_source_dir}/tensorflow/python/saved_model/saved_model_test.py"
 # requires scipy
 "${tensorflow_source_dir}/tensorflow/contrib/keras/python/keras/preprocessing/*_test.py"
+"${tensorflow_source_dir}/tensorflow/contrib/tfprof/python/tools/tfprof/pprof_profiler_test.py"
+# flaky tests
+"${tensorflow_source_dir}/tensorflow/python/kernel_tests/cwise_ops_test.py"
+"${tensorflow_source_dir}/tensorflow/contrib/tfprof/python/tools/tfprof/internal/run_metadata_test.py"
 )
 if (WIN32)
 set(tf_test_src_py_exclude

@@ -196,7 +196,7 @@ class Binomial(distribution.Distribution):
 @property
 def probs(self):
-"""Probability of of drawing a `1`."""
+"""Probability of drawing a `1`."""
 return self._probs
 def _batch_shape_tensor(self):

@@ -160,7 +160,7 @@ class _VectorStudentT(transformed_distribution.TransformedDistribution):
 #### Examples
 A single instance of a "Vector Student's t-distribution" is defined by a mean
-vector of of length `k` and a scale matrix of shape `k x k`.
+vector of length `k` and a scale matrix of shape `k x k`.
 Extra leading dimensions, if provided, allow for batches.

@@ -800,7 +800,7 @@ class WALSModel(object):
 regularization: A tensor (scalar) that contains the normalized
 regularization term for the minibatch loss corresponding to sp_input.
 sum_weights: The sum of the weights corresponding to sp_input. This
-can be used with unregularized loss to caluclate the root weighted
+can be used with unregularized loss to calculate the root weighted
 squared error.
 """
 assert isinstance(sp_input, sparse_tensor.SparseTensor)

@@ -32,11 +32,11 @@ namespace functor {
 // Explicit instantiation of the CPU functor.
 typedef Eigen::ThreadPoolDevice CPUDevice;
-template class FillProjectiveTransform<CPUDevice, uint8>;
-template class FillProjectiveTransform<CPUDevice, int32>;
-template class FillProjectiveTransform<CPUDevice, int64>;
-template class FillProjectiveTransform<CPUDevice, float>;
-template class FillProjectiveTransform<CPUDevice, double>;
+template struct FillProjectiveTransform<CPUDevice, uint8>;
+template struct FillProjectiveTransform<CPUDevice, int32>;
+template struct FillProjectiveTransform<CPUDevice, int64>;
+template struct FillProjectiveTransform<CPUDevice, float>;
+template struct FillProjectiveTransform<CPUDevice, double>;
 } // end namespace functor
@@ -116,7 +116,7 @@ namespace functor {
 void FillProjectiveTransform<GPUDevice, TYPE>::operator()( \
 const GPUDevice& device, OutputType* output, const InputType& images, \
 const TransformsType& transform) const; \
-extern template class FillProjectiveTransform<GPUDevice, TYPE>
+extern template struct FillProjectiveTransform<GPUDevice, TYPE>
 TF_CALL_uint8(DECLARE_FUNCTOR);
 TF_CALL_int32(DECLARE_FUNCTOR);

@@ -21,6 +21,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+import collections
 import json
 import os
@@ -263,6 +264,9 @@ def get_uid(prefix=''):
 ```
 """
 graph = ops.get_default_graph()
+if graph not in tf_base_layers.PER_GRAPH_LAYER_NAME_UIDS:
+tf_base_layers.PER_GRAPH_LAYER_NAME_UIDS[graph] = collections.defaultdict(
+int)
 layer_name_uids = tf_base_layers.PER_GRAPH_LAYER_NAME_UIDS[graph]
 layer_name_uids[prefix] += 1
 return layer_name_uids[prefix]

@@ -316,7 +316,7 @@ class Conv3D(tf_convolutional_layers.Conv3D, Layer):
 When using this layer as the first layer in a model,
 provide the keyword argument `input_shape`
 (tuple of integers, does not include the sample axis),
-e.g. `input_shape=(128, 128, 128, 3)` for 128x128x128 volumes
+e.g. `input_shape=(128, 128, 128, 1)` for 128x128x128 volumes
 with a single channel,
 in `data_format="channels_last"`.

@@ -78,7 +78,7 @@ def layer_test(layer_cls, kwargs=None, input_shape=None, input_dtype=None,
 if e is None:
 input_data_shape[i] = np.random.randint(1, 4)
 input_data = 10 * np.random.random(input_data_shape)
-if input_dtype[:4] == 'float':
+if input_dtype[:5] == 'float':
 input_data -= 0.5
 input_data = input_data.astype(input_dtype)
 elif input_shape is None:

@@ -177,7 +177,7 @@ class StringCrosser {
 static const auto k_feature_separator = "_X_";
 gtl::InlinedVector<InternalType, 6> cross_vec(columns_.size());
-for (int i = 0; i < permutation.size(); i++) {
+for (size_t i = 0; i < permutation.size(); i++) {
 cross_vec[i] = columns_[i]->Feature(batch_index, permutation[i]);
 }
 // TODO(zakaria): this will copy the string twice, might effect
@@ -267,7 +267,7 @@ class ProductIterator {
 next_permutation_.resize(columns_.size(), 0);
 // Sets has_next_ to false if any feature column has 0 features.
 has_next_ = true;
-for (int i = 0; i < columns_.size(); i++) {
+for (size_t i = 0; i < columns_.size(); i++) {
 if (columns_[i]->FeatureCount(batch_index_) == 0) {
 has_next_ = false;
 break;
@@ -581,7 +581,7 @@ class SparseFeatureCrossOp : public OpKernel {
 columns,
 int batch_index) {
 int64 cross_count = 1;
-for (int i = 0; i < columns.size(); i++) {
+for (size_t i = 0; i < columns.size(); i++) {
 const auto feature_count = columns[i]->FeatureCount(batch_index);
 // If one column is missing any feature, there won't be any cross.
 if (feature_count == 0) {

@@ -871,7 +871,7 @@ def _embedding_lookup_with_distributed_aggregation(params,
 p_segment_ids = array_ops.gather(segment_ids, pindices[p])
 # Number the p_segment_ids to meet segment_sum's requirements. Note
 # that unique_p_segment_ids contains unique segment ids of this
-# partiton and these ids' order is unchanged.
+# partition and these ids' order is unchanged.
 unique_p_segment_ids, unique_p_segment_idx = array_ops.unique(
 p_segment_ids)
 partitioned_segment_ids.append(unique_p_segment_ids)

@@ -165,7 +165,7 @@ class _LinearEmbeddingLookupArguments(
 "combiner"])):
 """Represents the information needed from a column for embedding lookup.
-Used to to compute DNN inputs and weighted sum.
+Used to compute DNN inputs and weighted sum.
 """
 pass
@@ -184,7 +184,7 @@ class _DeepEmbeddingLookupArguments(
 "trainable"])):
 """Represents the information needed from a column for embedding lookup.
-Used to to compute DNN inputs and weighted sum.
+Used to compute DNN inputs and weighted sum.
 """
 pass

@@ -938,7 +938,7 @@ def convolution(inputs,
 with "NC".
 num_outputs: Integer, the number of output filters.
 kernel_size: A sequence of N positive integers specifying the spatial
-dimensions of of the filters. Can be a single integer to specify the same
+dimensions of the filters. Can be a single integer to specify the same
 value for all spatial dimensions.
 stride: A sequence of N positive integers specifying the stride at which to
 compute output. Can be a single integer to specify the same value for all

@@ -1493,12 +1493,12 @@ class PartialFlattenTest(test.TestCase):
 def testSparsePartialFlatten(self):
 """Test `_inner_flatten` on `SparseTensor`s."""
-shape = [4, 3, 11, 6, 1, 3]
+shape = [4, 3, 11, 6]
 np.random.seed(10301)
 random_ = np.random.rand(*shape)
 indices, values, _ = _sparsify(random_)
-for new_rank in [1, 2, 3, 4, 5]:
+for new_rank in [1, 2, 3]:
 expected_shape = (shape[:new_rank - 1] + [np.prod(shape[new_rank - 1:])])
 reshaped_random_ = np.reshape(random_, expected_shape)
 expected_indices, expected_values, _ = _sparsify(reshaped_random_)

@@ -525,7 +525,7 @@ class Experiment(object):
 differences in resource control. First, the resources (e.g., memory) used
 by training will be released before evaluation (`train_and_evaluate` takes
 double resources). Second, more checkpoints will be saved as a checkpoint
-is generated at the end of each trainning iteration.
+is generated at the end of each training iteration.
 3. As the estimator.train starts from scratch (new graph, new states for
 input, etc) at each iteration, it is recommended to have the

@@ -79,7 +79,7 @@ cd $SCRIPT_DIR/gen/proto
 tar cf $FW_DIR_TFCORE_HDRS/tmp.tar tensorflow
 cd $FW_DIR_TFCORE_HDRS
 tar xf tmp.tar
-# Dont include the auto downloaded/generated to build this library
+# Don't include the auto downloaded/generated to build this library
 rm -rf tensorflow/contrib/makefile
 rm -f tmp.tar

@@ -283,7 +283,7 @@ def _luong_score(query, keys, scale):
 raise ValueError(
 "Incompatible or unknown inner dimensions between query and keys. "
 "Query (%s) has units: %s. Keys (%s) have units: %s. "
-"Perhaps you need to set num_units to the the keys' dimension (%s)?"
+"Perhaps you need to set num_units to the keys' dimension (%s)?"
 % (query, depth, keys, key_units, key_units))
 dtype = query.dtype

@@ -775,7 +775,7 @@ images, labels = LoadTestData(...)
 predictions = MyModel(images)
 mae_value_op, mae_update_op = slim.metrics.streaming_mean_absolute_error(predictions, labels)
-mre_value_op, mre_update_op = slim.metrics.streaming_mean_relative_error(predictions, labels, labels)
+mre_value_op, mre_update_op = slim.metrics.streaming_mean_relative_error(predictions, labels)
 pl_value_op, pl_update_op = slim.metrics.percentage_less(mean_relative_errors, 0.3)
 ```

@@ -315,7 +315,7 @@ class TensorForestEstimator(estimator.Estimator):
 though training might be distributed.
 version: String indicating TensorForest version to use, for backward
 compatibility. Either 'v2', 'v4', or None to let system pick.
-Overrides grpah_builder_class.
+Overrides graph_builder_class.
 Returns:
 A `TensorForestEstimator` instance.

@@ -109,7 +109,7 @@ class GrowStats {
 const TensorForestParams& params_;
-// We cache these beacuse they're used often.
+// We cache these because they're used often.
 const int split_after_samples_;
 const int num_splits_to_consider_;

@@ -689,11 +689,11 @@ class InputStatisticsFromMiniBatch(object):
 values = features[TrainEvalFeatures.VALUES]
 else:
 # times and values may not be available, for example during prediction. We
-# still need to retreive our variables so that they can be read from, even
+# still need to retrieve our variables so that they can be read from, even
 # if we're not going to update them.
 times = None
 values = None
-# Create/retreive variables representing input statistics, initialized
+# Create/retrieve variables representing input statistics, initialized
 # without data to avoid deadlocking if variables are initialized before
 # queue runners are started.
 with variable_scope.variable_scope("input_statistics", use_resource=True):

@@ -196,7 +196,7 @@ class ChainingStateManager(_OverridableStateManager):
 return time // self._state_saving_interval
 def _get_cached_states(self, times):
-"""Retreive cached states for a batch of times."""
+"""Retrieve cached states for a batch of times."""
 read_chunk_numbers = self._get_chunk_number(times)
 looked_up_state = list(self._cached_states.lookup(
 math_ops.cast(read_chunk_numbers, dtypes.int64)))
@@ -242,7 +242,7 @@ class ChainingStateManager(_OverridableStateManager):
 # written to the next bucket). This assumes fixed missing times (i.e. if we
 # were presented with times [10, 50] we will never see times [30, 50]).
 #
-# TODO(allenl): Retreive the highest time less than the current time rather
+# TODO(allenl): Retrieve the highest time less than the current time rather
 # than relying on fixed bucketing.
 write_chunk_numbers = math_ops.maximum(
 self._get_chunk_number(array_ops.concat(

@@ -150,7 +150,7 @@ class StateInterpolatingAnomalyDetector(FilteringStepPostprocessor):
 This is simply Bayes' theorem, where p(data | anomaly) is the
 alternative/anomaly distribution, p(data | not anomaly) is the model's
 predicted distribution, and anomaly_prior_probability is the prior probability
-of an anomaly occuring (user-specified, defaulting to 1%).
+of an anomaly occurring (user-specified, defaulting to 1%).
 Rather than computing p(anomaly | data) directly, we use the odds ratio:

@@ -70,7 +70,7 @@ class StructuralEnsemble(state_space_model.StateSpaceIndependentEnsemble):
 `observation_noise`, `level_noise`, `trend noise`, `seasonality_noise`, and
 `transient` are (typically scalar) Gaussian random variables whose variance is
-learned from data, and that variance is not time dependant in this
+learned from data, and that variance is not time dependent in this
 implementation. Level noise is optional due to its similarity with observation
 noise in some cases. Seasonality is enforced by constraining a full cycle of
 seasonal variables to have zero expectation, allowing seasonality to adapt

@@ -544,7 +544,7 @@ def _convert_model_fn_to_train_step(model_fn, dequeue_fn, mode, run_config,
 # TODO(xiejw): how to do we support hook and savers in the original
 # model_fn. Realistically, the original
-# model_fn will be excuted on TPU chips in a replica way. The hooks
+# model_fn will be executed on TPU chips in a replica way. The hooks
 # returned by the model_fn cannot be supported at all. If we have to,
 # the graph construction part in the model_fn should be separated from the
 # control part (such as hooks and savers). By that the graph construction

@@ -298,7 +298,7 @@ class InfeedQueue(object):
 input_tensors is a list of lists of Tensors whose types and shapes are used
 to set the queue configuration. The length of the outer list is the number
 of shards required, and each inner list is the tuple of Tensors to use to
-determine the types and shapes of the correponding shard. This method
+determine the types and shapes of the corresponding shard. This method
 depends on the shard dimension, and calling it freezes the shard policy.
 Args:

@@ -88,7 +88,7 @@ def bucket(tensors,
 This function is implemented using several queues. A `QueueRunner` for the
 queues is added to the current `Graph`'s `QUEUE_RUNNER` collection.
-As the returned tensors are the result of of a dequeue operation, evaluating
+As the returned tensors are the result of a dequeue operation, evaluating
 them will throw a `tf.errors.OutOfRangeError` when the input queue is
 exhausted. If these tensors are feeding another input queue, its queue runner
 will catch this exception, however, if they are used in your main thread

@@ -329,7 +329,7 @@ class EvaluateRepeatedlyTest(test.TestCase):
 if not gfile.Exists(checkpoint_dir):
 gfile.MakeDirs(checkpoint_dir)
-# We need a variable that that the saver will try to restore.
+# We need a variable that the saver will try to restore.
 variables.get_or_create_global_step()
 # Run with placeholders. If we actually try to evaluate this, we'd fail
@@ -394,7 +394,7 @@ class EvaluateRepeatedlyTest(test.TestCase):
 'evaluate_with_eval_feed_dict')
 self._train_model(checkpoint_dir, num_steps=1)
-# We need a variable that that the saver will try to restore.
+# We need a variable that the saver will try to restore.
 variables.get_or_create_global_step()
 # Create a variable and an eval op that increments it with a placeholder.

@@ -761,8 +761,8 @@ void RdmaTensorBuffer::SendNextItem() {
 (buffer_size > size_ && local_status_ == idle &&
 remote_status_ == idle)) {
 if ((local_status_ != none) && (buffer_size > size_)) {
-CHECK(rm.data_type_ == DT_STRING)
-<< "Only string tensor allows to change size";
+VLOG(2) << "Extend RDMA buffer from " << size_ << " to "
+<< buffer_size;
 }
 CreateCPUBuffer(buffer_size, false);
 mu_.unlock();
@@ -782,11 +782,13 @@ void RdmaTensorBuffer::SendNextItem() {
 // local/remote_status_ won't be set back to idle
 // unitl Write() is successful
 mu_.unlock();
-CHECK((buffer_size == size_ && rm.data_type_ != DT_STRING) ||
-(buffer_size <= size_ && rm.data_type_ == DT_STRING))
-<< "tensor and buffer size do not agree!"
-<< " buffer_size = " << size_
-<< " requested tensor size = " << buffer_size << in.DebugString();
+if (!((buffer_size == size_ && rm.data_type_ != DT_STRING) ||
+(buffer_size <= size_ && rm.data_type_ == DT_STRING))) {
+VLOG(2) << "Tensor and buffer size do not agree,"
+<< " buffer_size = " << size_
+<< " requested tensor size = "
+<< buffer_size << in.DebugString();
+}
 uint32_t imm_data = LookupBufferIndex(key);
 rm.type_ = RDMA_MESSAGE_TENSOR_WRITE;
 string message = RdmaMessage::CreateMessage(rm);

@@ -1221,6 +1221,9 @@ LIB_INTERNAL_WINDOWS_DEPS = glob(
 "platform/*.cc",
 "platform/profile_utils/**/*.h",
 "platform/profile_utils/**/*.cc",
+] + [
+"framework/resource_handle.h",
+"framework/resource_handle.cc",
 ],
 exclude = [
 "**/*test*",
@@ -1250,7 +1253,6 @@ cc_library(
 "platform/*.cc",
 "platform/profile_utils/**/*.h",
 "platform/profile_utils/**/*.cc",
-] + [
 "framework/resource_handle.h",
 "framework/resource_handle.cc",
 ],

@@ -77,9 +77,10 @@ Status OpRegistry::LookUp(const string& op_type_name,
 if (first_unregistered) {
 OpList op_list;
 Export(true, &op_list);
-VLOG(1) << "All registered Ops:";
-for (const auto& op : op_list.op()) {
-VLOG(1) << SummarizeOpDef(op);
+if (VLOG_IS_ON(3)) {
+LOG(INFO) << "All registered Ops:";
+for (const auto& op : op_list.op())
+LOG(INFO) << SummarizeOpDef(op);
+}
 first_unregistered = false;
 }

@@ -73,7 +73,7 @@ bool ConsumeEquals(StringPiece* description) {
 return false;
 }
-// Split `*orig` into two pieces at the first occurence of `split_ch`.
+// Split `*orig` into two pieces at the first occurrence of `split_ch`.
 // Returns whether `split_ch` was found. Afterwards, `*before_split`
 // contains the maximum prefix of the input `*orig` that doesn't
 // contain `split_ch`, and `*orig` contains everything after the

@@ -716,7 +716,7 @@ class OpKernelContext {
 StringPiece output_name, const TensorShape& output_shape,
 Tensor** output) TF_MUST_USE_RESULT;
-// Tries to reuse one of of the inputs given in input_indices as a temporary.
+// Tries to reuse one of the inputs given in input_indices as a temporary.
 // If none of the given inputs can be forwarded, calls
 // allocate_temp() to allocate a new temporary buffer.
 Status forward_input_or_allocate_temp(

@@ -77,6 +77,9 @@ class QueueInterface : public ResourceBase {
 virtual void Close(OpKernelContext* ctx, bool cancel_pending_enqueues,
 DoneCallback callback) = 0;
+// Returns true if a given queue is closed and false if it is open.
+virtual bool is_closed() const = 0;
 // Assuming *this represents a shared queue, verify that it matches
 // another instantiation indicated by node_def.
 virtual Status MatchesNodeDef(const NodeDef& node_def) = 0;

@@ -31,7 +31,7 @@ namespace tensorflow {
 const int Graph::kControlSlot = -1;
-struct NodeProperties {
+class NodeProperties {
+ public:
 NodeProperties(const OpDef* op_def, const NodeDef& node_def,
 const DataTypeSlice inputs, const DataTypeSlice outputs)

@@ -158,7 +158,7 @@ Status SingleMachine::Run(const GraphDef& graph_def,
 // Also clear the timeline to save memory
 init_metadata_.clear_step_stats();
 }
-for (int i = 0; i < queue_runner_defs_.size(); ++i) {
+for (size_t i = 0; i < queue_runner_defs_.size(); ++i) {
 std::unique_ptr<QueueRunner> queue_runner;
 TF_RETURN_IF_ERROR(QueueRunner::New(queue_runner_defs_[i],
 coordinator_.get(), &queue_runner));

@@ -141,7 +141,7 @@ Status GraphProperties::MergeEnqueueShapesAndTypes(
 "Enqueue nodes mixed number of tensors: ", shapes_and_types.size(),
 " vs ", queue_shapes_and_types->size());
 }
-for (int i = 0; i < shapes_and_types.size(); ++i) {
+for (size_t i = 0; i < shapes_and_types.size(); ++i) {
 const ShapeAndType& a = shapes_and_types[i];
 ShapeAndType& b = (*queue_shapes_and_types)[i];
 if (a.dtype != b.dtype) {
@@ -163,7 +163,7 @@ Status GraphProperties::RelaxEnqueueShapesAndMergeTypes(
 "Enqueue nodes mixed number of tensors: ", shapes_and_types.size(),
 " vs ", queue_shapes_and_types->size());
 }
-for (int i = 0; i < shapes_and_types.size(); ++i) {
+for (size_t i = 0; i < shapes_and_types.size(); ++i) {
 const ShapeAndType& a = shapes_and_types[i];
 ShapeAndType& b = (*queue_shapes_and_types)[i];
 if (a.dtype != b.dtype) {

@@ -365,7 +365,7 @@ NodeState& VirtualScheduler::GetNodeStateOrCreateIt(const NodeDef* node) {
 // Initialize output port related data:
 // Assume the size of OutputProperties represents the number of output ports
 // of this node.
-for (int i = 0; i < node_state.output_properties.size(); ++i) {
+for (size_t i = 0; i < node_state.output_properties.size(); ++i) {
 node_state.time_no_references[i] = Costs::Duration::max();
 node_state.num_outputs_executed[i] = 0;
 // Populate an empty vector for each port. The caller will add nodes

@@ -396,7 +396,7 @@ Status ConstantFolding::EvaluateOneFoldable(const NodeDef& node,
 if (output_tensors.empty()) {
 Status(error::INVALID_ARGUMENT, "Expected at least one output.");
 }
-for (int i = 0; i < output_tensors.size(); i++) {
+for (size_t i = 0; i < output_tensors.size(); i++) {
 string node_name = AddPrefixToNodeName(node.name(), kConstantFoldingConst);
 if (output_tensors.size() > 1) {
 node_name = strings::StrCat(node_name, "-", i);

@@ -32,6 +32,7 @@ load(
 "tf_kernel_library",
 "tf_mkl_kernel_library",
 "cc_header_only_library",
+"if_not_windows",
 )
 load("@local_config_sycl//sycl:build_defs.bzl", "if_sycl")
 load("//tensorflow:tensorflow.bzl", "tf_cuda_cc_test")
@@ -231,7 +232,7 @@ cc_library(
 name = "ops_util",
 srcs = ["ops_util.cc"],
 hdrs = ["ops_util.h"],
-copts = ["-Wno-sign-compare"],
+copts = if_not_windows(["-Wno-sign-compare"]),
 deps = [
 "//tensorflow/core:framework",
 "//tensorflow/core:lib",
@@ -344,7 +345,7 @@ cc_library(
 name = "save_restore_tensor",
 srcs = ["save_restore_tensor.cc"],
 hdrs = ["save_restore_tensor.h"],
-copts = ["-Wno-sign-compare"],
+copts = if_not_windows(["-Wno-sign-compare"]),
 deps = [
 ":bounds_check",
 "//tensorflow/core:framework",
@@ -1222,6 +1223,7 @@ tf_kernel_library(
 ],
 visibility = ["//visibility:private"],
 deps = [
+":ops_util",
 "//tensorflow/core:framework",
 "//tensorflow/core/kernels:conv_ops",
 "//third_party/eigen3",
@@ -3219,6 +3221,7 @@ cc_library(
 ":sparse_reduce_op",
 ":sparse_reorder_op",
 ":sparse_reshape_op",
+":sparse_slice_op",
 ":sparse_softmax",
 ":sparse_sparse_binary_op_shared",
 ":sparse_split_op",
@@ -3301,6 +3304,12 @@ tf_kernel_library(
 deps = SPARSE_DEPS,
 )
+tf_kernel_library(
+name = "sparse_slice_op",
+prefix = "sparse_slice_op",
+deps = SPARSE_DEPS,
+)
 tf_kernel_library(
 name = "sparse_softmax",
 prefix = "sparse_softmax",

@@ -413,7 +413,7 @@ class Barrier : public ResourceBase {
 }
 queue_closed_ = true;
 if (cancel_pending_enqueues) queue_cancelled_ = true;
-if (!ready_queue_->closed()) {
+if (!ready_queue_->is_closed()) {
 ready_queue_->Close(ctx, cancel_pending_enqueues, callback);
 }
 }

@ -45,20 +45,23 @@ struct CastFunctor<Eigen::SyclDevice, O, I> {
} // namespace functor
#define CURRY_TYPES3(FN, arg0, arg1) \
FN(arg0, arg1, bool); \
FN(arg0, arg1, uint8); \
FN(arg0, arg1, int8); \
FN(arg0, arg1, uint16); \
FN(arg0, arg1, int16); \
FN(arg0, arg1, int32); \
FN(arg0, arg1, int64); \
FN(arg0, arg1, Eigen::half); \
FN(arg0, arg1, float); \
FN(arg0, arg1, double); \
FN(arg0, arg1, std::complex<float>); \
#define CURRY_TYPES3_NO_HALF(FN, arg0, arg1) \
FN(arg0, arg1, bool); \
FN(arg0, arg1, uint8); \
FN(arg0, arg1, int8); \
FN(arg0, arg1, uint16); \
FN(arg0, arg1, int16); \
FN(arg0, arg1, int32); \
FN(arg0, arg1, int64); \
FN(arg0, arg1, float); \
FN(arg0, arg1, double); \
FN(arg0, arg1, std::complex<float>); \
FN(arg0, arg1, std::complex<double>)
#define CURRY_TYPES3(FN, arg0, arg1) \
CURRY_TYPES3_NO_HALF(FN, arg0, arg1) \
FN(arg0, arg1, Eigen::half);
#define CAST_CASE(DEVICE, IN, OUT) \
if (DataTypeToEnum<OUT>::value == dst_dtype) { \
return [](OpKernelContext* ctx, const Tensor& inp, Tensor* out) { \
@ -154,6 +157,15 @@ GetGpuCastFromBfloat(DataType dst_dtype);
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromBool(DataType dst_dtype);
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromUint8(DataType dst_dtype);
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromUint16(DataType dst_dtype);
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromInt16(DataType dst_dtype);
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromInt32(DataType dst_dtype);
@@ -165,10 +177,8 @@ GetSyclCastFromFloat(DataType dst_dtype);
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromDouble(DataType dst_dtype);
#endif // TENSORFLOW_USE_SYCL
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
#endif // THIRD_PARTY_TENSORFLOW_CORE_KERNELS_CAST_OP_IMPL_H_


@@ -38,10 +38,9 @@ GetGpuCastFromBool(DataType dst_dtype) {
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromBool(DataType dst_dtype) {
CURRY_TYPES3(CAST_CASE, SYCLDevice, bool);
CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, bool);
return nullptr;
}
#endif // TENSORFLOW_USE_SYCL
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -38,10 +38,9 @@ GetGpuCastFromDouble(DataType dst_dtype) {
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromDouble(DataType dst_dtype) {
CURRY_TYPES3(CAST_CASE, SYCLDevice, double);
CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, double);
return nullptr;
}
#endif // TENSORFLOW_USE_SYC
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -53,10 +53,9 @@ GetGpuCastFromFloat(DataType dst_dtype) {
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromFloat(DataType dst_dtype) {
CURRY_TYPES3(CAST_CASE, SYCLDevice, float);
CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, float);
return nullptr;
}
#endif // TENSORFLOW_USE_SYCL
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -34,4 +34,13 @@ GetGpuCastFromInt16(DataType dst_dtype) {
}
#endif // GOOGLE_CUDA
#ifdef TENSORFLOW_USE_SYCL
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromInt16(DataType dst_dtype) {
CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, int16);
return nullptr;
}
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -38,9 +38,9 @@ GetGpuCastFromInt32(DataType dst_dtype) {
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromInt32(DataType dst_dtype) {
CURRY_TYPES3(CAST_CASE, SYCLDevice, int32);
CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, int32);
return nullptr;
}
#endif // TENSORFLOW_USE_SYCL
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -38,9 +38,9 @@ GetGpuCastFromInt64(DataType dst_dtype) {
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromInt64(DataType dst_dtype) {
CURRY_TYPES3(CAST_CASE, SYCLDevice, int64);
CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, int64);
return nullptr;
}
#endif // TENSORFLOW_USE_SYCL
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -34,4 +34,13 @@ GetGpuCastFromInt8(DataType dst_dtype) {
}
#endif // GOOGLE_CUDA
#ifdef TENSORFLOW_USE_SYCL
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromInt8(DataType dst_dtype) {
CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, int8);
return nullptr;
}
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -34,4 +34,13 @@ GetGpuCastFromUint16(DataType dst_dtype) {
}
#endif // GOOGLE_CUDA
#ifdef TENSORFLOW_USE_SYCL
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromUint16(DataType dst_dtype) {
CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, uint16);
return nullptr;
}
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -34,4 +34,13 @@ GetGpuCastFromUint8(DataType dst_dtype) {
}
#endif // GOOGLE_CUDA
#ifdef TENSORFLOW_USE_SYCL
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromUint8(DataType dst_dtype) {
CURRY_TYPES3_NO_HALF(CAST_CASE, SYCLDevice, uint8);
return nullptr;
}
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -22,17 +22,6 @@ REGISTER5(UnaryOp, CPU, "Abs", functor::abs, float, Eigen::half, double, int32,
REGISTER2(UnaryOp, CPU, "ComplexAbs", functor::abs, complex64, complex128);
#endif
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Abs") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::abs<TYPE>>);
REGISTER_SYCL_KERNEL(float);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER4(UnaryOp, GPU, "Abs", functor::abs, float, Eigen::half, double, int64);
REGISTER2(UnaryOp, GPU, "ComplexAbs", functor::abs, complex64, complex128);
@@ -48,4 +37,13 @@ REGISTER_KERNEL_BUILDER(Name("Abs")
UnaryOp<CPUDevice, functor::abs<int32>>);
#endif
#if TENSORFLOW_USE_SYCL
REGISTER3(UnaryOp, SYCL, "Abs", functor::abs, float, double, int64);
REGISTER_KERNEL_BUILDER(Name("Abs")
.Device(DEVICE_SYCL)
.HostMemory("x")
.HostMemory("y")
.TypeConstraint<int32>("T"),
UnaryOp<CPUDevice, functor::abs<int32>>);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -18,19 +18,11 @@ limitations under the License.
namespace tensorflow {
REGISTER2(UnaryOp, CPU, "Acos", functor::acos, float, double);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Acos") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::acos<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER2(UnaryOp, GPU, "Acos", functor::acos, float, double);
#endif
#if TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "Acos", functor::acos, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -0,0 +1,38 @@
/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#include "tensorflow/core/kernels/cwise_ops_common.h"
#include "tensorflow/core/kernels/cwise_ops_gradients.h"
namespace tensorflow {
REGISTER4(UnaryOp, CPU, "Acosh", functor::acosh, float, double,
complex64, complex128);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Acosh") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::acosh<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER2(UnaryOp, GPU, "Acosh", functor::acosh, float, double);
#endif
} // namespace tensorflow


@@ -19,26 +19,6 @@ namespace tensorflow {
REGISTER5(BinaryOp, CPU, "Add", functor::add, float, Eigen::half, double, int32,
int64);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Add") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
BinaryOp<SYCLDevice, functor::add<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
REGISTER_KERNEL_BUILDER(Name("Add")
.Device(DEVICE_SYCL)
.HostMemory("x")
.HostMemory("y")
.HostMemory("z")
.TypeConstraint<int32>("T"),
BinaryOp<CPUDevice, functor::add<int32>>);
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER3(BinaryOp, GPU, "Add", functor::add, float, Eigen::half, double);
@@ -54,4 +34,15 @@ REGISTER_KERNEL_BUILDER(Name("Add")
BinaryOp<CPUDevice, functor::add<int32>>);
#endif
#if TENSORFLOW_USE_SYCL
REGISTER2(BinaryOp, SYCL, "Add", functor::add, float, double);
REGISTER_KERNEL_BUILDER(Name("Add")
.Device(DEVICE_SYCL)
.HostMemory("x")
.HostMemory("y")
.HostMemory("z")
.TypeConstraint<int32>("T"),
BinaryOp<CPUDevice, functor::add<int32>>);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -18,19 +18,11 @@ limitations under the License.
namespace tensorflow {
REGISTER2(UnaryOp, CPU, "Asin", functor::asin, float, double);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Asin") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::asin<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER2(UnaryOp, GPU, "Asin", functor::asin, float, double);
#endif
#if TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "Asin", functor::asin, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -0,0 +1,38 @@
/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#include "tensorflow/core/kernels/cwise_ops_common.h"
#include "tensorflow/core/kernels/cwise_ops_gradients.h"
namespace tensorflow {
REGISTER4(UnaryOp, CPU, "Asinh", functor::asinh, float, double,
complex64, complex128);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Asinh") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::asinh<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYC
#if GOOGLE_CUDA
REGISTER2(UnaryOp, GPU, "Asinh", functor::asinh, float, double);
#endif
} // namespace tensorflow


@@ -18,19 +18,11 @@ limitations under the License.
namespace tensorflow {
REGISTER2(UnaryOp, CPU, "Atan", functor::atan, float, double);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Atan") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::atan<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER2(UnaryOp, GPU, "Atan", functor::atan, float, double);
#endif
#if TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "Atan", functor::atan, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -0,0 +1,38 @@
/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#include "tensorflow/core/kernels/cwise_ops_common.h"
#include "tensorflow/core/kernels/cwise_ops_gradients.h"
namespace tensorflow {
REGISTER4(UnaryOp, CPU, "Atanh", functor::atanh, float, double,
complex64, complex128);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Atanh") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::atanh<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYC
#if GOOGLE_CUDA
REGISTER2(UnaryOp, GPU, "Atanh", functor::atanh, float, double);
#endif
} // namespace tensorflow


@@ -18,19 +18,11 @@ limitations under the License.
namespace tensorflow {
REGISTER3(UnaryOp, CPU, "Ceil", functor::ceil, float, Eigen::half, double);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Ceil") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::ceil<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER3(UnaryOp, GPU, "Ceil", functor::ceil, float, Eigen::half, double);
#endif
#if TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "Ceil", functor::ceil, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -19,19 +19,11 @@ namespace tensorflow {
REGISTER5(UnaryOp, CPU, "Cos", functor::cos, float, Eigen::half, double,
complex64, complex128);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Cos") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::cos<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER3(UnaryOp, GPU, "Cos", functor::cos, float, Eigen::half, double);
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "Cos", functor::cos, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -24,32 +24,6 @@ REGISTER5(BinaryOp, CPU, "TruncateDiv", functor::safe_div, uint8, uint16, int16,
int32, int64);
REGISTER5(BinaryOp, CPU, "RealDiv", functor::div, float, Eigen::half, double,
complex64, complex128);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Div") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
BinaryOp<SYCLDevice, functor::div<TYPE>>); \
REGISTER_KERNEL_BUILDER( \
Name("RealDiv") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
BinaryOp<SYCLDevice, functor::div<TYPE>>);
REGISTER_SYCL_KERNEL(float)
REGISTER_SYCL_KERNEL(double)
#undef REGISTER_SYCL_KERNEL
// A special GPU kernel for int32.
// TODO(b/25387198): Also enable int32 in device memory. This kernel
// registration requires all int32 inputs and outputs to be in host memory.
REGISTER_KERNEL_BUILDER(Name("Div")
.Device(DEVICE_SYCL)
.HostMemory("x")
.HostMemory("y")
.HostMemory("z")
.TypeConstraint<int32>("T"),
BinaryOp<CPUDevice, functor::safe_div<int32>>);
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER9(BinaryOp, GPU, "Div", functor::div, float, Eigen::half, double, uint8,
uint16, int16, int64, complex64, complex128);
@@ -70,4 +44,15 @@ REGISTER_KERNEL_BUILDER(Name("Div")
BinaryOp<CPUDevice, functor::safe_div<int32>>);
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER2(BinaryOp, SYCL, "Div", functor::div, float, double);
REGISTER2(BinaryOp, SYCL, "RealDiv", functor::div, float, double);
REGISTER_KERNEL_BUILDER(Name("Div")
.Device(DEVICE_SYCL)
.HostMemory("x")
.HostMemory("y")
.HostMemory("z")
.TypeConstraint<int32>("T"),
BinaryOp<CPUDevice, functor::safe_div<int32>>);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -47,8 +47,8 @@ REGISTER_KERNEL_BUILDER(Name("Equal")
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER2(BinaryOp, SYCL, "Equal", functor::equal_to, float, double);
REGISTER5(BinaryOp, SYCL, "Equal", functor::equal_to, float, double, uint8,
int8, int16);
REGISTER_KERNEL_BUILDER(Name("Equal")
.Device(DEVICE_SYCL)
.HostMemory("x")


@@ -19,19 +19,11 @@ namespace tensorflow {
REGISTER5(UnaryOp, CPU, "Exp", functor::exp, float, Eigen::half, double,
complex64, complex128);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Exp") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::exp<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER3(UnaryOp, GPU, "Exp", functor::exp, float, Eigen::half, double);
#endif
#if TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "Exp", functor::exp, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -22,6 +22,6 @@ REGISTER5(UnaryOp, CPU, "Expm1", functor::expm1, float, Eigen::half, double,
REGISTER3(UnaryOp, GPU, "Expm1", functor::expm1, float, Eigen::half, double);
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER(UnaryOp, SYCL, "Expm1", functor::expm1, float);
REGISTER2(UnaryOp, SYCL, "Expm1", functor::expm1, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -18,19 +18,10 @@ limitations under the License.
namespace tensorflow {
REGISTER3(UnaryOp, CPU, "Floor", functor::floor, float, Eigen::half, double);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("Floor") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::floor<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER3(UnaryOp, GPU, "Floor", functor::floor, float, Eigen::half, double);
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "Floor", functor::floor, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -0,0 +1,27 @@
/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#if GOOGLE_CUDA
#include "tensorflow/core/kernels/cwise_ops_gpu_common.cu.h"
#include "tensorflow/core/kernels/cwise_ops_gpu_gradients.cu.h"
namespace tensorflow {
namespace functor {
DEFINE_UNARY2(acosh, float, double);
} // namespace functor
} // namespace tensorflow
#endif // GOOGLE_CUDA


@@ -0,0 +1,27 @@
/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#if GOOGLE_CUDA
#include "tensorflow/core/kernels/cwise_ops_gpu_common.cu.h"
#include "tensorflow/core/kernels/cwise_ops_gpu_gradients.cu.h"
namespace tensorflow {
namespace functor {
DEFINE_UNARY2(asinh, float, double);
} // namespace functor
} // namespace tensorflow
#endif // GOOGLE_CUDA


@@ -0,0 +1,27 @@
/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#if GOOGLE_CUDA
#include "tensorflow/core/kernels/cwise_ops_gpu_common.cu.h"
#include "tensorflow/core/kernels/cwise_ops_gpu_gradients.cu.h"
namespace tensorflow {
namespace functor {
DEFINE_UNARY2(atanh, float, double);
} // namespace functor
} // namespace tensorflow
#endif // GOOGLE_CUDA


@@ -34,11 +34,8 @@ REGISTER_KERNEL_BUILDER(Name("Greater")
BinaryOp<CPUDevice, functor::greater<int32>>);
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER(BinaryOp, SYCL, "Greater", functor::greater, float);
REGISTER2(BinaryOp, SYCL, "Greater", functor::greater, float, double);
// A special GPU kernel for int32.
// TODO(b/25387198): Also enable int32 in device memory. This kernel
// registration requires all int32 inputs and outputs to be in host memory.
REGISTER_KERNEL_BUILDER(Name("Greater")
.Device(DEVICE_SYCL)
.HostMemory("x")
@@ -47,5 +44,4 @@ REGISTER_KERNEL_BUILDER(Name("Greater")
.TypeConstraint<int32>("T"),
BinaryOp<CPUDevice, functor::greater<int32>>);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -35,7 +35,7 @@ REGISTER_KERNEL_BUILDER(Name("GreaterEqual")
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER(BinaryOp, SYCL, "GreaterEqual", functor::greater_equal, float);
REGISTER2(BinaryOp, SYCL, "GreaterEqual", functor::greater_equal, float, double);
REGISTER_KERNEL_BUILDER(Name("GreaterEqual")
.Device(DEVICE_SYCL)


@@ -19,20 +19,12 @@ namespace tensorflow {
REGISTER3(UnaryOp, CPU, "IsFinite", functor::isfinite, float, Eigen::half,
double);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("IsFinite") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::isfinite<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER3(UnaryOp, GPU, "IsFinite", functor::isfinite, float, Eigen::half,
double);
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "IsFinite", functor::isfinite, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -18,19 +18,11 @@ limitations under the License.
namespace tensorflow {
REGISTER3(UnaryOp, CPU, "IsInf", functor::isinf, float, Eigen::half, double);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("IsInf") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::isinf<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER3(UnaryOp, GPU, "IsInf", functor::isinf, float, Eigen::half, double);
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "IsInf", functor::isinf, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -18,19 +18,11 @@ limitations under the License.
namespace tensorflow {
REGISTER3(UnaryOp, CPU, "IsNan", functor::isnan, float, Eigen::half, double);
#if TENSORFLOW_USE_SYCL
#define REGISTER_SYCL_KERNEL(TYPE) \
REGISTER_KERNEL_BUILDER( \
Name("IsNan") \
.Device(DEVICE_SYCL) \
.TypeConstraint<TYPE>("T"), \
UnaryOp<SYCLDevice, functor::isnan<TYPE>>);
REGISTER_SYCL_KERNEL(float);
REGISTER_SYCL_KERNEL(double);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER3(UnaryOp, GPU, "IsNan", functor::isnan, float, Eigen::half, double);
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER2(UnaryOp, SYCL, "IsNan", functor::isnan, float, double);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow


@@ -35,7 +35,6 @@ REGISTER_KERNEL_BUILDER(Name("Less")
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER3(BinaryOp, SYCL, "Less", functor::less, float, double, int64);
REGISTER_KERNEL_BUILDER(Name("Less")
.Device(DEVICE_SYCL)
.HostMemory("x")


@@ -35,8 +35,8 @@ REGISTER_KERNEL_BUILDER(Name("LessEqual")
#endif
#ifdef TENSORFLOW_USE_SYCL
REGISTER(BinaryOp, SYCL, "LessEqual", functor::less_equal, float);
REGISTER6(BinaryOp, SYCL, "LessEqual", functor::less_equal, float, double,
int64, uint8, int8, int16);
REGISTER_KERNEL_BUILDER(Name("LessEqual")
.Device(DEVICE_SYCL)
.HostMemory("x")
@@ -45,5 +45,4 @@ REGISTER_KERNEL_BUILDER(Name("LessEqual")
.TypeConstraint<int32>("T"),
BinaryOp<CPUDevice, functor::less_equal<int32>>);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow

Some files were not shown because too many files have changed in this diff.