Fix dependency bugs

Change: 116925769
Parent: 64dd5b58d5
Commit: 56f1d64998
Changed files:

* ISSUE_TEMPLATE.md
* README.md
* RELEASE.md
* configure
* tensorflow/BUILD
* tensorflow/contrib/cmake/: CMakeLists.txt, README.md, external/ (eigen.cmake, jpeg.cmake, png.cmake, re2.cmake), install.cmake, patches/jpeg/CMakeLists.txt, tests.cmake, tf_cc_ops.cmake, tf_core_cpu.cmake, tf_core_direct_session.cmake, tf_core_framework.cmake, tf_core_kernels.cmake, tf_models.cmake, tf_stream_executor.cmake, tf_tutorials.cmake
* tensorflow/contrib/layers/python/ops
* tensorflow/contrib/linear_optimizer
* tensorflow/core/: distributed_runtime, kernels (BUILD, bounds_check.h, conv_grad_ops.cc, conv_ops_gpu_3.cu.cc, diag_op.cc, matrix_solve_ls_op.cc, reduction_ops_gpu.cu.cc, reduction_ops_max.cc, reduction_ops_min.cc, reduction_ops_prod.cc, reduction_ops_sum.cc, resize_nearest_neighbor_op.cc, resize_nearest_neighbor_op_benchmark_test.cc, resize_nearest_neighbor_op_gpu.cu.cc, resize_nearest_neighbor_op_gpu.h, sparse_matmul_op.cc, tensor_array_ops.cc, transpose_functor.h), ops, public, util
* tensorflow/examples/: how_tos/reading_data, image_retraining, tutorials, udacity
* tensorflow/g3doc/: api_docs/python, get_started, how_tos, resources, tutorials
* tensorflow/models
* tensorflow/python/: framework, kernel_tests (constant_op_test.py, control_flow_ops_py_test.py, depthtospace_op_test.py, diag_op_test.py, init_ops_test.py, matmul_op_test.py, reduction_ops_test.py, rnn_test.py, seq2seq_test.py, trace_op_test.py), ops, platform/default, training
ISSUE_TEMPLATE.md

@@ -1,5 +1,11 @@
-For bugs/issues, please fill in the following. The more information you
-provide, the more likely we can help you.
+GitHub issues are for bugs / installation problems / feature requests.
+For general support from the community, see [StackOverflow](https://stackoverflow.com/questions/tagged/tensorflow).
+To make bugs and feature requests easier to find and organize, we close issues that are deemed
+out of scope for GitHub Issues and point people to StackOverflow.
+
+For bugs or installation issues, please provide the following information.
+The more information you provide, the more easily we will be able to offer
+help and advice.
 
 ### Environment info
 Operating System:
README.md

@@ -5,7 +5,7 @@
 
 | **`Linux CPU`** | **`Linux GPU PIP`** | **`Mac OS CPU`** | **`Android`** |
 |-------------------|----------------------|------------------|----------------|
-| [](http://ci.tensorflow.org/job/tensorflow-master) | [](http://ci.tensorflow.org/job/tensorflow-master-gpu_pip) | [](http://ci.tensorflow.org/job/tensorflow-master-mac) | [](http://ci.tensorflow.org/job/tensorflow-master-android) |
+| [](http://ci.tensorflow.org/job/tensorflow-master-cpu) | [](http://ci.tensorflow.org/job/tensorflow-master-gpu_pip) | [](http://ci.tensorflow.org/job/tensorflow-master-mac) | [](http://ci.tensorflow.org/job/tensorflow-master-android) |
 
 **TensorFlow** is an open source software library for numerical computation using
 data flow graphs. Nodes in the graph represent mathematical operations, while
@@ -27,7 +27,14 @@ tracking requests and bugs, but please see
 and discussion.**
 
 ## Installation
-*See [Download and Setup](tensorflow/g3doc/get_started/os_setup.md).*
+*See [Download and Setup](tensorflow/g3doc/get_started/os_setup.md) for instructions on how to install our release binaries or how to build from source.*
+
+People who are a little bit adventurous can also try our nightly binaries:
+
+* Linux CPU only: [Python 2](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-cp27-none-linux_x86_64.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/)) / [Python 3](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py3-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/))
+* Linux GPU: [Python 2](http://ci.tensorflow.org/view/Nightly/job/nigntly-matrix-linux-gpu/TF_BUILD_CONTAINER_TYPE=GPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py2-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nigntly-matrix-linux-gpu/TF_BUILD_CONTAINER_TYPE=GPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-slave/)) / [Python 3](http://ci.tensorflow.org/view/Nightly/job/nigntly-matrix-linux-gpu/TF_BUILD_CONTAINER_TYPE=GPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py3-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nigntly-matrix-linux-gpu/TF_BUILD_CONTAINER_TYPE=GPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-slave/))
+* Mac CPU only: [Python 2](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py2-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/)) / [Python 3](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py3-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/))
+* [Android](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-android/TF_BUILD_CONTAINER_TYPE=ANDROID,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=NO_PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=android-slave/lastSuccessfulBuild/artifact/bazel-out/local_linux/bin/tensorflow/examples/android/tensorflow_demo.apk) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-android/TF_BUILD_CONTAINER_TYPE=ANDROID,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=NO_PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=android-slave/))
 
 #### *Try your first TensorFlow program*
 ```python
@@ -46,6 +53,9 @@ Hello, TensorFlow!
 ```
 
 ##For more information
 
 * [TensorFlow website](http://tensorflow.org)
 * [TensorFlow whitepaper](http://download.tensorflow.org/paper/whitepaper2015.pdf)
-* [Tensorflow MOOC on Udacity] (https://www.udacity.com/course/deep-learning--ud730)
+* [TensorFlow MOOC on Udacity] (https://www.udacity.com/course/deep-learning--ud730)
 
 The TensorFlow community has created amazing things with TensorFlow, please see the [resources section of tensorflow.org](https://www.tensorflow.org/versions/master/resources#community) for an incomplete list.
RELEASE.md

@@ -1,3 +1,20 @@
+# Release 0.7.1
+
+## Bug Fixes and Other Changes
+
+* Added gfile.Open and gfile.Copy, used by input_data.py.
+* Fixed Saver bug when MakeDirs tried to create empty directory.
+* GPU Pip wheels are built with cuda 7.5 and cudnn-v4, making them
+  required for the binary releases. Lower versions of cuda/cudnn can
+  be supported by installing from sources and setting the options
+  during ./configure
+* Fix dataset encoding example for Python3 (@danijar)
+* Fix PIP installation by not packaging protobuf as part of wheel,
+  require protobuf 3.0.0b2.
+* Fix Mac pip installation of numpy by requiring pip >= 1.10.1.
+* Improvements and fixes to Docker image.
+
+
 # Release 0.7.0
 
 ## Major Features and Improvements
configure (vendored)

@@ -99,12 +99,18 @@ while true; do
     else
       TF_CUDNN_EXT=".$TF_CUDNN_VERSION"
     fi
-    if [ -e "$CUDNN_INSTALL_PATH/libcudnn.so${CUDNNEXT}" -o -e "$CUDNN_INSTALL_PATH/lib64/libcudnn.so${TF_CUDNN_EXT}" ]; then
+    if [ -e "$CUDNN_INSTALL_PATH/libcudnn.so${TF_CUDNN_EXT}" -o -e "$CUDNN_INSTALL_PATH/lib64/libcudnn.so${TF_CUDNN_EXT}" ]; then
       break
     fi
+    CUDNN_PATH_FROM_LDCONFIG="$(ldconfig -p | sed -n 's/.*libcudnn.so .* => \(.*\)/\1/p')"
+    if [ -e "${CUDNN_PATH_FROM_LDCONFIG}${TF_CUDNN_EXT}" ]; then
+      CUDNN_INSTALL_PATH="$(dirname ${CUDNN_PATH_FROM_LDCONFIG})"
+      break
+    fi
     echo "Invalid path to cuDNN ${TF_CUDNN_VERSION} toolkit. Neither of the following two files can be found:"
     echo "$CUDNN_INSTALL_PATH/lib64/libcudnn.so${TF_CUDNN_EXT}"
     echo "$CUDNN_INSTALL_PATH/libcudnn.so${TF_CUDNN_EXT}"
+    echo "${CUDNN_PATH_FROM_LDCONFIG}${TF_CUDNN_EXT}"
     if [ -z "$fromuser" ]; then
       exit 1
     fi
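The ldconfig fallback added in this hunk can be sanity-checked in isolation. A minimal sketch, feeding simulated `ldconfig -p` output (the library path below is illustrative, not from the commit) through the same sed expression:

```shell
# Extract the libcudnn.so target path from (simulated) `ldconfig -p` output,
# using the sed expression the configure script uses for its fallback lookup.
printf '\tlibcudnn.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudnn.so\n' \
  | sed -n 's/.*libcudnn.so .* => \(.*\)/\1/p'
# prints: /usr/lib/x86_64-linux-gnu/libcudnn.so
```

The `-n` flag plus the `p` command prints only lines that match, so non-cuDNN entries in the real `ldconfig -p` listing are silently skipped.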
tensorflow/BUILD

@@ -54,6 +54,15 @@ cc_binary(
     ],
 )
 
+cc_binary(
+    name = "libtensorflow_cc.so",
+    linkshared = 1,
+    deps = [
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/core:tensorflow",
+    ],
+)
+
 py_library(
     name = "tensorflow_py",
     srcs = ["__init__.py"],
tensorflow/contrib/cmake/CMakeLists.txt (new file, 62 lines)

# Minimum CMake required
cmake_minimum_required(VERSION 2.8)

# Project
project(tensorflow C CXX)

# Actual source is the ../../.. directory
get_filename_component(tf_contrib_source_dir ${tensorflow_SOURCE_DIR} PATH)
get_filename_component(tf_tf_source_dir ${tf_contrib_source_dir} PATH)
get_filename_component(tensorflow_source_dir ${tf_tf_source_dir} PATH)

# [CLEANUP] Not sure if this is needed (copied from Protobuf)
# CMake policies
cmake_policy(SET CMP0022 NEW)

# Options
option(tensorflow_VERBOSE "Enable for verbose output" OFF)
option(tensorflow_BUILD_TESTS "Build tests" ON)

# Threads: defines CMAKE_THREAD_LIBS_INIT and adds -pthread compile option for
# targets that link ${CMAKE_THREAD_LIBS_INIT}.
find_package (Threads)

# [CLEANUP] Remove when done
# For debugging
function(SHOW_VARIABLES)
  get_cmake_property(_variableNames VARIABLES)
  foreach (_variableName ${_variableNames})
    message(STATUS "${_variableName}=${${_variableName}}")
  endforeach()
endfunction()

# External dependencies
set(CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR}/external)

# Location where external projects will be downloaded
set (DOWNLOAD_LOCATION "${CMAKE_CURRENT_BINARY_DIR}/downloads"
     CACHE PATH "Location where external projects will be downloaded.")
mark_as_advanced(DOWNLOAD_LOCATION)

# External dependencies
include(png)
include(jpeg)
include(re2)
include(eigen)

# Let's get to work!
include(tf_core_framework.cmake)
include(tf_stream_executor.cmake)
include(tf_core_cpu.cmake)
include(tf_models.cmake)
include(tf_core_ops.cmake)
include(tf_core_direct_session.cmake)
include(tf_core_kernels.cmake)
include(tf_cc_ops.cmake)
include(tf_tutorials.cmake)

if (tensorflow_BUILD_TESTS)
  include(tests.cmake)
endif (tensorflow_BUILD_TESTS)

include(install.cmake)
tensorflow/contrib/cmake/README.md (new file, 257 lines)

This directory contains *CMake* files that can be used to build the TensorFlow
core library.

You need to have [CMake](http://www.cmake.org) and [Git](http://git-scm.com)
installed on your computer before proceeding.

Most of the instructions will be given using the *Command Prompt*, but the same
actions can be performed using appropriate GUI tools.

Environment Setup
=================

Open the appropriate *Command Prompt* from the *Start* menu.

For example *VS2013 x64 Native Tools Command Prompt*:

    C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64>

Change to your working directory:

    C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64>cd C:\Path\to
    C:\Path\to>

Where *C:\Path\to* is the path to your real working directory.

Create a folder where TensorFlow headers/libraries/binaries will be installed
after they are built:

    C:\Path\to>mkdir install

If the *cmake* command is not available from the *Command Prompt*, add it to the
system *PATH* variable:

    C:\Path\to>set PATH=%PATH%;C:\Program Files (x86)\CMake\bin

If the *git* command is not available from the *Command Prompt*, add it to the
system *PATH* variable:

    C:\Path\to>set PATH=%PATH%;C:\Program Files\Git\cmd

Good. Now you are ready to continue.

Getting Sources
===============

You can get the latest stable source packages from the
[releases](https://github.com/tensorflow/tensorflow/releases) page.
Or you can type:

    C:\Path\to> git clone --recursive -b [release_tag] https://github.com/tensorflow/tensorflow.git

Where *[release_tag]* is a git tag like *v0.6.0*, or a branch name like *master*
if you want to get the latest code.

Go to the project folder:

    C:\Path\to>cd tensorflow
    C:\Path\to\tensorflow>

Now go to the *tensorflow\contrib\cmake* folder in TensorFlow's contrib sources:

    C:\Path\to\tensorflow>cd tensorflow\contrib\cmake
    C:\Path\to\tensorflow\tensorflow\contrib\cmake>

Good. Now you are ready to configure *CMake*.

CMake Configuration
===================

*CMake* supports a lot of different
[generators](http://www.cmake.org/cmake/help/latest/manual/cmake-generators.7.html)
for various native build systems. We are only interested in the
[Makefile](http://www.cmake.org/cmake/help/latest/manual/cmake-generators.7.html#makefile-generators)
and
[Visual Studio](http://www.cmake.org/cmake/help/latest/manual/cmake-generators.7.html#visual-studio-generators)
generators.

We will use shadow building to separate the temporary files from the TensorFlow
source code.

Create a temporary *build* folder and change your working directory to it:

    C:\Path\to\tensorflow\tensorflow\contrib\cmake>mkdir build & cd build
    C:\Path\to\tensorflow\tensorflow\contrib\cmake\build>

The *Makefile* generator can build the project in only one configuration, so
you need to create a separate folder for each configuration.

To start using a *Release* configuration:

    [...]\contrib\cmake\build>mkdir release & cd release
    [...]\contrib\cmake\build\release>cmake -G "NMake Makefiles" ^
    -DCMAKE_BUILD_TYPE=Release ^
    -DCMAKE_INSTALL_PREFIX=../../../../../../install ^
    ../..

This will generate an *nmake* *Makefile* in the current directory.

To use the *Debug* configuration:

    [...]\contrib\cmake\build>mkdir debug & cd debug
    [...]\contrib\cmake\build\debug>cmake -G "NMake Makefiles" ^
    -DCMAKE_BUILD_TYPE=Debug ^
    -DCMAKE_INSTALL_PREFIX=../../../../../../install ^
    ../..

This will generate an *nmake* *Makefile* in the current directory.

To create a *Visual Studio* solution file:

    [...]\contrib\cmake\build>mkdir solution & cd solution
    [...]\contrib\cmake\build\solution>cmake -G "Visual Studio 12 2013 Win64" ^
    -DCMAKE_INSTALL_PREFIX=../../../../../../install ^
    ../..

This will generate the *Visual Studio* solution file *tensorflow.sln* in the
current directory.

If the *gmock* directory does not exist, and/or you do not want to build
TensorFlow unit tests, you need to add the *cmake* command argument
`-Dtensorflow_BUILD_TESTS=OFF` to disable testing.

Compiling
=========

To compile tensorflow:

    [...]\contrib\cmake\build\release>nmake

or

    [...]\contrib\cmake\build\debug>nmake

And wait for the compilation to finish.

If you prefer to use the IDE:

  * Open the generated tensorflow.sln file in Microsoft Visual Studio.
  * Choose "Debug" or "Release" configuration as desired.
  * From the Build menu, choose "Build Solution".

And wait for the compilation to finish.

Testing
=======

To run unit-tests:

    [...]\contrib\cmake\build\release>nmake check

or

    [...]\contrib\cmake\build\debug>nmake check

You can also build the project *check* from the Visual Studio solution.
Yes, it may sound strange, but it works.

You should see output similar to:

    Running main() from gmock_main.cc
    [==========] Running 1546 tests from 165 test cases.

    ...

    [==========] 1546 tests from 165 test cases ran. (2529 ms total)
    [  PASSED  ] 1546 tests.

To run specific tests:

    C:\Path\to\tensorflow>tensorflow\contrib\cmake\build\release\tests.exe ^
    --gtest_filter=AnyTest*
    Running main() from gmock_main.cc
    Note: Google Test filter = AnyTest*
    [==========] Running 3 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 3 tests from AnyTest
    [ RUN      ] AnyTest.TestPackAndUnpack
    [       OK ] AnyTest.TestPackAndUnpack (0 ms)
    [ RUN      ] AnyTest.TestPackAndUnpackAny
    [       OK ] AnyTest.TestPackAndUnpackAny (0 ms)
    [ RUN      ] AnyTest.TestIs
    [       OK ] AnyTest.TestIs (0 ms)
    [----------] 3 tests from AnyTest (1 ms total)

    [----------] Global test environment tear-down
    [==========] 3 tests from 1 test case ran. (2 ms total)
    [  PASSED  ] 3 tests.

Note that the tests must be run from the source folder.

If all tests pass, you can safely continue.

Installing
==========

To install TensorFlow to the specified *install* folder:

    [...]\contrib\cmake\build\release>nmake install

or

    [...]\contrib\cmake\build\debug>nmake install

You can also build the project *INSTALL* from the Visual Studio solution.
This one sounds less strange, and it works too.

This will create the following folders under the *install* location:

  * bin - contains the tensorflow binaries;
  * include - contains the C++ headers and TensorFlow *.proto files;
  * lib - contains the linking libraries and *CMake* configuration files for
    the *tensorflow* package.

Now, if needed, you can:

  * Copy the contents of the include directory to wherever you want to put
    headers.
  * Copy binaries to wherever you put build tools (probably somewhere in your
    PATH).
  * Copy the linking libraries libtensorflow[d].lib to wherever you put
    libraries.

To avoid conflicts between the MSVC debug and release runtime libraries, when
compiling a debug build of your application, you may need to link against a
debug build of libtensorflowd.lib with the "d" postfix. Similarly, release
builds should link against the release libtensorflow.lib library.

DLLs vs. static linking
=======================

Static linking is now the default for the TensorFlow libraries. Due to
issues with Win32's use of a separate heap for each DLL, as well as binary
compatibility issues between different versions of MSVC's STL library, it is
recommended that you use static linkage only. However, it is possible to
build libtensorflow as DLLs if you really want. To do this, do the following:

  * Add the additional flag `-Dtensorflow_BUILD_SHARED_LIBS=ON` when invoking
    cmake.
  * Follow the same steps as described in the above section.
  * When compiling your project, make sure to `#define TENSORFLOW_USE_DLLS`.

When distributing your software to end users, we strongly recommend that you
do NOT install libtensorflow.dll to any shared location.
Instead, keep these libraries next to your binaries, in your application's
own install directory. C++ makes it very difficult to maintain binary
compatibility between releases, so it is likely that future versions of these
libraries will *not* be usable as drop-in replacements.

If your project is itself a DLL intended for use by third-party software, we
recommend that you do NOT expose TensorFlow objects in your library's
public interface, and that you statically link them into your library.

Notes on Compiler Warnings
==========================

The following warnings have been disabled while building the tensorflow
libraries and binaries. You may have to disable some of them in your own
project as well, or live with them.

* [TODO]
tensorflow/contrib/cmake/external/eigen.cmake (new file, 34 lines)

#new_http_archive(
#  name = "eigen_archive",
#  url = "https://bitbucket.org/eigen/eigen/get/...",
#  sha256 = "...",
#  build_file = "eigen.BUILD",
#)

include (ExternalProject)

set(eigen_archive_hash "ed4c9730b545")

set(eigen_INCLUDE_DIRS
    ${CMAKE_CURRENT_BINARY_DIR}
    ${CMAKE_CURRENT_BINARY_DIR}/external/eigen_archive
    ${CMAKE_CURRENT_BINARY_DIR}/external/eigen_archive/eigen-eigen-${eigen_archive_hash}
    ${tensorflow_source_dir}/third_party/eigen3
)
set(eigen_URL https://bitbucket.org/eigen/eigen/get/${eigen_archive_hash}.tar.gz)
set(eigen_HASH SHA256=3d9eceb8a2add299e37b1f32759157cc2574f7684936c151552a5ae3f33aebd5)
set(eigen_BUILD ${CMAKE_CURRENT_BINARY_DIR}/eigen/src/eigen)
set(eigen_INSTALL ${CMAKE_CURRENT_BINARY_DIR}/eigen/install)

ExternalProject_Add(eigen
    PREFIX eigen
    URL ${eigen_URL}
    URL_HASH ${eigen_HASH}
    DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
    INSTALL_DIR "${eigen_INSTALL}"
    CMAKE_CACHE_ARGS
        -DCMAKE_BUILD_TYPE:STRING=Release
        -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
        -DCMAKE_INSTALL_PREFIX:STRING=${eigen_INSTALL}
        -DINCLUDE_INSTALL_DIR:STRING=${CMAKE_CURRENT_BINARY_DIR}/external/eigen_archive/eigen-eigen-${eigen_archive_hash}
)
tensorflow/contrib/cmake/external/jpeg.cmake (new file, 75 lines)

include (ExternalProject)

set(jpeg_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/jpeg_archive)
set(jpeg_URL http://www.ijg.org/files/jpegsrc.v9a.tar.gz)
set(jpeg_HASH SHA256=3a753ea48d917945dd54a2d97de388aa06ca2eb1066cbfdc6652036349fe05a7)
set(jpeg_BUILD ${CMAKE_BINARY_DIR}/jpeg/src/jpeg)
set(jpeg_INSTALL ${CMAKE_BINARY_DIR}/jpeg/install)
set(jpeg_STATIC_LIBRARIES ${jpeg_INSTALL}/lib/libjpeg.a)

set(jpeg_HEADERS
    "${jpeg_INSTALL}/include/jconfig.h"
    "${jpeg_INSTALL}/include/jerror.h"
    "${jpeg_INSTALL}/include/jmorecfg.h"
    "${jpeg_INSTALL}/include/jpeglib.h"
    "${jpeg_BUILD}/cderror.h"
    "${jpeg_BUILD}/cdjpeg.h"
    "${jpeg_BUILD}/jdct.h"
    "${jpeg_BUILD}/jinclude.h"
    "${jpeg_BUILD}/jmemsys.h"
    "${jpeg_BUILD}/jpegint.h"
    "${jpeg_BUILD}/jversion.h"
    "${jpeg_BUILD}/transupp.h"
)

if (WIN32)
    ExternalProject_Add(jpeg
        PREFIX jpeg
        URL ${jpeg_URL}
        URL_HASH ${jpeg_HASH}
        PATCH_COMMAND ${CMAKE_COMMAND} -E copy ${CMAKE_SOURCE_DIR}/patches/jpeg/CMakeLists.txt ${jpeg_BUILD}
        INSTALL_DIR ${jpeg_INSTALL}
        DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
        CMAKE_CACHE_ARGS
            -DCMAKE_BUILD_TYPE:STRING=Release
            -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
            -DCMAKE_INSTALL_PREFIX:STRING=${jpeg_INSTALL}
    )

    ExternalProject_Add_Step(jpeg copy_jconfig
        COMMAND ${CMAKE_COMMAND} -E copy
            ${jpeg_BUILD}/jconfig.vc ${jpeg_BUILD}/jconfig.h
        DEPENDEES patch
        DEPENDERS build
    )

else()

    ExternalProject_Add(jpeg
        PREFIX jpeg
        URL ${jpeg_URL}
        URL_HASH ${jpeg_HASH}
        INSTALL_DIR ${jpeg_INSTALL}
        DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
        BUILD_COMMAND $(MAKE)
        INSTALL_COMMAND $(MAKE) install
        CONFIGURE_COMMAND
            ${jpeg_BUILD}/configure
            --prefix=${jpeg_INSTALL}
            --enable-shared=yes
    )

endif()

# put jpeg includes in the directory where they are expected
add_custom_target(jpeg_create_destination_dir
    COMMAND ${CMAKE_COMMAND} -E make_directory ${jpeg_INCLUDE_DIR}/jpeg-9a
    DEPENDS jpeg)

add_custom_target(jpeg_copy_headers_to_destination
    DEPENDS jpeg_create_destination_dir)

foreach(header_file ${jpeg_HEADERS})
    add_custom_command(TARGET jpeg_copy_headers_to_destination PRE_BUILD
        COMMAND ${CMAKE_COMMAND} -E copy ${header_file} ${jpeg_INCLUDE_DIR}/jpeg-9a)
endforeach()
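The header-staging pattern used in this file (create a versioned destination directory, then copy each header into it so includes resolve at the path the build expects) can be mimicked with plain shell commands. A minimal sketch; the `/tmp/jpeg_demo` paths stand in for `${jpeg_BUILD}` and `${jpeg_INCLUDE_DIR}` and are purely illustrative:

```shell
# Stage a header into the include layout the build expects (illustrative paths).
mkdir -p /tmp/jpeg_demo/build               # stands in for ${jpeg_BUILD}
mkdir -p /tmp/jpeg_demo/include/jpeg-9a     # stands in for ${jpeg_INCLUDE_DIR}/jpeg-9a
printf '/* demo header */\n' > /tmp/jpeg_demo/build/jpeglib.h
cp /tmp/jpeg_demo/build/jpeglib.h /tmp/jpeg_demo/include/jpeg-9a/
ls /tmp/jpeg_demo/include/jpeg-9a
# prints: jpeglib.h
```

In the CMake file the same two steps run as custom targets (`jpeg_create_destination_dir`, then per-header `cmake -E copy` commands) so they execute portably on Windows as well.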
tensorflow/contrib/cmake/external/png.cmake (new file, 38 lines)

include (ExternalProject)

set(png_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/png_archive)
set(png_URL https://storage.googleapis.com/libpng-public-archive/libpng-1.2.53.tar.gz)
set(png_HASH SHA256=e05c9056d7f323088fd7824d8c6acc03a4a758c4b4916715924edc5dd3223a72)
set(png_BUILD ${CMAKE_BINARY_DIR}/png/src/png)
set(png_INSTALL ${CMAKE_BINARY_DIR}/png/install)
set(png_STATIC_LIBRARIES ${CMAKE_BINARY_DIR}/png/install/lib/libpng12.a)

set(png_HEADERS
    "${png_INSTALL}/include/libpng12/png.h"
    "${png_INSTALL}/include/libpng12/pngconf.h"
)

ExternalProject_Add(png
    PREFIX png
    URL ${png_URL}
    URL_HASH ${png_HASH}
    INSTALL_DIR ${png_INSTALL}
    DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
    CMAKE_CACHE_ARGS
        -DCMAKE_BUILD_TYPE:STRING=Release
        -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
        -DCMAKE_INSTALL_PREFIX:STRING=${png_INSTALL}
)

## put png includes in the directory where they are expected
add_custom_target(png_create_destination_dir
    COMMAND ${CMAKE_COMMAND} -E make_directory ${png_INCLUDE_DIR}/libpng-1.2.53
    DEPENDS png)

add_custom_target(png_copy_headers_to_destination
    DEPENDS png_create_destination_dir)

foreach(header_file ${png_HEADERS})
    add_custom_command(TARGET png_copy_headers_to_destination PRE_BUILD
        COMMAND ${CMAKE_COMMAND} -E copy ${header_file} ${png_INCLUDE_DIR}/libpng-1.2.53)
endforeach()
tensorflow/contrib/cmake/external/re2.cmake (new file, 46 lines)

include (ExternalProject)

set(re2_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/re2/re2)
set(re2_EXTRA_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/re2/src)
set(re2_URL https://github.com/google/re2.git)
set(re2_TAG 791beff)
set(re2_BUILD ${CMAKE_BINARY_DIR}/re2/src/re2)
set(re2_LIBRARIES ${re2_BUILD}/obj/so/libre2.so)
get_filename_component(re2_STATIC_LIBRARIES ${re2_BUILD}/libre2.a ABSOLUTE)
set(re2_INCLUDES ${re2_BUILD})

# We only need re2.h in external/re2/re2/re2.h
# For the rest, we'll just add the build dir as an include dir.
set(re2_HEADERS
    "${re2_BUILD}/re2/re2.h"
)

ExternalProject_Add(re2
    PREFIX re2
    GIT_REPOSITORY ${re2_URL}
    GIT_TAG ${re2_TAG}
    DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
    BUILD_IN_SOURCE 1
    INSTALL_COMMAND ""
    CMAKE_CACHE_ARGS
        -DCMAKE_BUILD_TYPE:STRING=Release
        -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
)

## put re2 includes in the directory where they are expected
add_custom_target(re2_create_destination_dir
    COMMAND ${CMAKE_COMMAND} -E make_directory ${re2_INCLUDE_DIR}
    DEPENDS re2)

add_custom_target(re2_copy_headers_to_destination
    DEPENDS re2_create_destination_dir)

foreach(header_file ${re2_HEADERS})
    add_custom_command(TARGET re2_copy_headers_to_destination PRE_BUILD
        COMMAND ${CMAKE_COMMAND} -E copy ${header_file} ${re2_INCLUDE_DIR})
endforeach()

ADD_LIBRARY(re2_lib STATIC IMPORTED
    DEPENDS re2)
SET_TARGET_PROPERTIES(re2_lib PROPERTIES
    IMPORTED_LOCATION ${re2_STATIC_LIBRARIES})
tensorflow/contrib/cmake/install.cmake (new file, 1 line)

# [TODO]
tensorflow/contrib/cmake/patches/jpeg/CMakeLists.txt (new file, 76 lines)

cmake_minimum_required(VERSION 2.8.3)

project(libjpeg)

set(LIBJPEG_SRCS
    "jaricom.c"
    "jcapimin.c"
    "jcapistd.c"
    "jcarith.c"
    "jccoefct.c"
    "jccolor.c"
    "jcdctmgr.c"
    "jchuff.c"
    "jcinit.c"
    "jcmainct.c"
    "jcmarker.c"
    "jcmaster.c"
    "jcomapi.c"
    "jcparam.c"
    "jcprepct.c"
    "jcsample.c"
    "jctrans.c"
    "jdapimin.c"
    "jdapistd.c"
    "jdarith.c"
    "jdatadst.c"
    "jdatasrc.c"
    "jdcoefct.c"
    "jdcolor.c"
    "jddctmgr.c"
    "jdhuff.c"
    "jdinput.c"
    "jdmainct.c"
    "jdmarker.c"
    "jdmaster.c"
    "jdmerge.c"
    "jdpostct.c"
    "jdsample.c"
    "jdtrans.c"
    "jerror.c"
    "jfdctflt.c"
    "jfdctfst.c"
    "jfdctint.c"
    "jidctflt.c"
    "jidctfst.c"
    "jidctint.c"
    "jmemmgr.c"
    "jmemnobs.c"
    "jquant1.c"
    "jquant2.c"
    "jutils.c"
)
set(LIBJPEG_INCLUDES
    "jconfig.h"
    "jdct.h"
    "jerror.h"
    "jinclude.h"
    "jmemsys.h"
    "jmorecfg.h"
    "jpegint.h"
    "jpeglib.h"
    "jversion.h"
)

include_directories("${CMAKE_CURRENT_SOURCE_DIR}")

add_library(libjpeg ${LIBJPEG_SRCS})

install(TARGETS libjpeg
    RUNTIME DESTINATION bin COMPONENT RuntimeLibraries
    LIBRARY DESTINATION lib COMPONENT RuntimeLibraries
    ARCHIVE DESTINATION lib COMPONENT Development)

foreach(LIBJPEG_INCLUDE ${LIBJPEG_INCLUDES})
    install(FILES ${LIBJPEG_INCLUDE} DESTINATION include COMPONENT Development)
endforeach()

tensorflow/contrib/cmake/tests.cmake (new file, 1 line)
@@ -0,0 +1 @@
# [TODO]

tensorflow/contrib/cmake/tf_cc_ops.cmake (new file, 204 lines)
@@ -0,0 +1,204 @@
########################################################
# tf_cc_op_gen_main library
########################################################
set(tf_cc_op_gen_main_srcs
    "${tensorflow_source_dir}/tensorflow/cc/ops/cc_op_gen.cc"
    "${tensorflow_source_dir}/tensorflow/cc/ops/cc_op_gen_main.cc"
    "${tensorflow_source_dir}/tensorflow/cc/ops/cc_op_gen.h"
)

add_library(tf_cc_op_gen_main OBJECT ${tf_cc_op_gen_main_srcs})

add_dependencies(tf_cc_op_gen_main tf_core_framework)

target_include_directories(tf_cc_op_gen_main PRIVATE
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
)

#target_link_libraries(tf_cc_op_gen_main
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_protos_cc
#    tf_core_lib
#    tf_core_framework
#)

target_compile_options(tf_cc_op_gen_main PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_cc_op_gen_main PRIVATE
    cxx_rvalue_references
)

########################################################
# tf_gen_op_wrapper_cc executables
########################################################

#    # Run the op generator.
#    if name == "sendrecv_ops":
#        include_internal = "1"
#    else:
#        include_internal = "0"
#    native.genrule(
#        name=name + "_genrule",
#        outs=[out_ops_file + ".h", out_ops_file + ".cc"],
#        tools=[":" + tool],
#        cmd=("$(location :" + tool + ") $(location :" + out_ops_file + ".h) " +
#             "$(location :" + out_ops_file + ".cc) " + include_internal))

#def tf_gen_op_wrappers_cc(name,
#                          op_lib_names=[],
#                          other_srcs=[],
#                          other_hdrs=[],
#                          pkg=""):
#    subsrcs = other_srcs
#    subhdrs = other_hdrs
#    for n in op_lib_names:
#        tf_gen_op_wrapper_cc(n, "ops/" + n, pkg=pkg)
#        subsrcs += ["ops/" + n + ".cc"]
#        subhdrs += ["ops/" + n + ".h"]
#
#    native.cc_library(name=name,
#                      srcs=subsrcs,
#                      hdrs=subhdrs,
#                      deps=["//tensorflow/core:core_cpu"],
#                      copts=tf_copts(),
#                      alwayslink=1,)

# create directory for ops generated files
set(cc_ops_target_dir ${CMAKE_CURRENT_BINARY_DIR}/tensorflow/cc/ops)

add_custom_target(create_cc_ops_header_dir
    COMMAND ${CMAKE_COMMAND} -E make_directory ${cc_ops_target_dir}
)

set(tf_cc_ops_generated_files)

set(tf_cc_op_lib_names
    ${tf_op_lib_names}
    "user_ops"
)
foreach(tf_cc_op_lib_name ${tf_cc_op_lib_names})
    #tf_gen_op_wrapper_cc(name, out_ops_file, pkg=""):
    #    # Construct an op generator binary for these ops.
    #    tool = out_ops_file + "_gen_cc"  # example: ops/array_ops_gen_cc
    #    native.cc_binary(
    #        name = tool,
    #        copts = tf_copts(),
    #        linkopts = ["-lm"],
    #        linkstatic = 1,  # Faster to link this one-time-use binary dynamically
    #        deps = (["//tensorflow/cc:cc_op_gen_main",
    #                 pkg + ":" + name + "_op_lib"])
    #    )

    # Using <TARGET_OBJECTS:...> to work around an issue where no ops were
    # registered (static initializers dropped by the linker because the ops
    # are not used explicitly in the *_gen_cc executables).
    add_executable(${tf_cc_op_lib_name}_gen_cc
        $<TARGET_OBJECTS:tf_cc_op_gen_main>
        $<TARGET_OBJECTS:tf_${tf_cc_op_lib_name}>
        $<TARGET_OBJECTS:tf_core_lib>
        $<TARGET_OBJECTS:tf_core_framework>
    )

    target_include_directories(${tf_cc_op_lib_name}_gen_cc PRIVATE
        ${tensorflow_source_dir}
        ${eigen_INCLUDE_DIRS}
    )

    find_package(ZLIB REQUIRED)

    target_link_libraries(${tf_cc_op_lib_name}_gen_cc PRIVATE
        ${CMAKE_THREAD_LIBS_INIT}
        ${PROTOBUF_LIBRARIES}
        tf_protos_cc
        re2_lib
        ${jpeg_STATIC_LIBRARIES}
        ${png_STATIC_LIBRARIES}
        ${ZLIB_LIBRARIES}
    )

    target_compile_options(${tf_cc_op_lib_name}_gen_cc PRIVATE
        -fno-exceptions
        -DEIGEN_AVOID_STL_ARRAY
        -lm
    )

    # C++11
    target_compile_features(${tf_cc_op_lib_name}_gen_cc PRIVATE
        cxx_rvalue_references
    )

    set(cc_ops_include_internal 0)
    if(${tf_cc_op_lib_name} STREQUAL "sendrecv_ops")
        set(cc_ops_include_internal 1)
    endif()

    add_custom_command(
        OUTPUT ${cc_ops_target_dir}/${tf_cc_op_lib_name}.h
               ${cc_ops_target_dir}/${tf_cc_op_lib_name}.cc
        COMMAND ${tf_cc_op_lib_name}_gen_cc ${cc_ops_target_dir}/${tf_cc_op_lib_name}.h ${cc_ops_target_dir}/${tf_cc_op_lib_name}.cc ${cc_ops_include_internal}
        DEPENDS ${tf_cc_op_lib_name}_gen_cc create_cc_ops_header_dir
    )

    list(APPEND tf_cc_ops_generated_files ${cc_ops_target_dir}/${tf_cc_op_lib_name}.h)
    list(APPEND tf_cc_ops_generated_files ${cc_ops_target_dir}/${tf_cc_op_lib_name}.cc)
endforeach()


########################################################
# tf_cc_ops library
########################################################
add_library(tf_cc_ops OBJECT
    ${tf_cc_ops_generated_files}
    "${tensorflow_source_dir}/tensorflow/cc/ops/const_op.h"
    "${tensorflow_source_dir}/tensorflow/cc/ops/const_op.cc"
    "${tensorflow_source_dir}/tensorflow/cc/ops/standard_ops.h"
)

target_include_directories(tf_cc_ops PRIVATE
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
)

#target_link_libraries(tf_cc_ops
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_protos_cc
#    tf_core_lib
#    tf_core_cpu
#    tf_models_word2vec_ops
#)

target_compile_options(tf_cc_ops PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_cc_ops PRIVATE
    cxx_rvalue_references
)


#tf_gen_op_wrappers_cc(
#    name = "cc_ops",
#    op_lib_names = [
#        ...
#    ],
#    other_hdrs = [
#        "ops/const_op.h",
#        "ops/standard_ops.h",
#    ],
#    other_srcs = [
#        "ops/const_op.cc",
#    ] + glob(["ops/*_grad.cc"]),
#    pkg = "//tensorflow/core",
#)

tensorflow/contrib/cmake/tf_core_cpu.cmake (new file, 53 lines)
@@ -0,0 +1,53 @@
########################################################
# tf_core_cpu library
########################################################
file(GLOB_RECURSE tf_core_cpu_srcs
    "${tensorflow_source_dir}/tensorflow/core/common_runtime/*.h"
    "${tensorflow_source_dir}/tensorflow/core/common_runtime/*.cc"
    "${tensorflow_source_dir}/tensorflow/core/client/*.cc"
    "${tensorflow_source_dir}/tensorflow/core/graph/*.h"
    "${tensorflow_source_dir}/tensorflow/core/graph/*.cc"
    "${tensorflow_source_dir}/tensorflow/core/public/*.h"
)

file(GLOB_RECURSE tf_core_cpu_exclude_srcs
    "${tensorflow_source_dir}/tensorflow/core/*test*.h"
    "${tensorflow_source_dir}/tensorflow/core/*test*.cc"
    "${tensorflow_source_dir}/tensorflow/core/*main.cc"
    "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu/*.cc"
    "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu_device_factory.cc"
    "${tensorflow_source_dir}/tensorflow/core/common_runtime/direct_session.cc"
    "${tensorflow_source_dir}/tensorflow/core/common_runtime/direct_session.h"
)

list(REMOVE_ITEM tf_core_cpu_srcs ${tf_core_cpu_exclude_srcs})

add_library(tf_core_cpu OBJECT ${tf_core_cpu_srcs})

target_include_directories(tf_core_cpu PRIVATE
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
    ${re2_INCLUDES}
)

add_dependencies(tf_core_cpu
    tf_core_framework
)
#target_link_libraries(tf_core_cpu
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_core_framework
#    tf_core_lib
#    tf_protos_cc
#)

target_compile_options(tf_core_cpu PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_core_cpu PRIVATE
    cxx_rvalue_references
)

tensorflow/contrib/cmake/tf_core_direct_session.cmake (new file, 35 lines)
@@ -0,0 +1,35 @@
########################################################
# tf_core_direct_session library
########################################################
file(GLOB tf_core_direct_session_srcs
    "${tensorflow_source_dir}/tensorflow/core/common_runtime/direct_session.cc"
    "${tensorflow_source_dir}/tensorflow/core/common_runtime/direct_session.h"
)

add_library(tf_core_direct_session OBJECT ${tf_core_direct_session_srcs})

add_dependencies(tf_core_direct_session tf_core_cpu)

target_include_directories(tf_core_direct_session PRIVATE
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
)

#target_link_libraries(tf_core_direct_session
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_core_cpu
#    tf_core_framework
#    tf_core_lib
#    tf_protos_cc
#)

target_compile_options(tf_core_direct_session PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_core_direct_session PRIVATE
    cxx_rvalue_references
)

tensorflow/contrib/cmake/tf_core_framework.cmake (new file, 165 lines)
@@ -0,0 +1,165 @@
########################################################
# RELATIVE_PROTOBUF_GENERATE_CPP function
########################################################
# A variant of PROTOBUF_GENERATE_CPP that keeps the directory hierarchy.
# ROOT_DIR must be absolute, and proto paths must be relative to ROOT_DIR.
function(RELATIVE_PROTOBUF_GENERATE_CPP SRCS HDRS ROOT_DIR)
    if(NOT ARGN)
        message(SEND_ERROR "Error: RELATIVE_PROTOBUF_GENERATE_CPP() called without any proto files")
        return()
    endif()

    set(${SRCS})
    set(${HDRS})
    foreach(FIL ${ARGN})
        set(ABS_FIL ${ROOT_DIR}/${FIL})
        get_filename_component(FIL_WE ${FIL} NAME_WE)
        get_filename_component(FIL_DIR ${ABS_FIL} PATH)
        file(RELATIVE_PATH REL_DIR ${ROOT_DIR} ${FIL_DIR})

        list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.cc")
        list(APPEND ${HDRS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.h")

        add_custom_command(
            OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.cc"
                   "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.h"
            COMMAND ${PROTOBUF_PROTOC_EXECUTABLE}
            ARGS --cpp_out ${CMAKE_CURRENT_BINARY_DIR} -I ${ROOT_DIR} ${ABS_FIL}
            DEPENDS ${ABS_FIL} ${PROTOBUF_PROTOC_EXECUTABLE}
            COMMENT "Running C++ protocol buffer compiler on ${FIL}"
            VERBATIM)
    endforeach()

    set_source_files_properties(${${SRCS}} ${${HDRS}} PROPERTIES GENERATED TRUE)
    set(${SRCS} ${${SRCS}} PARENT_SCOPE)
    set(${HDRS} ${${HDRS}} PARENT_SCOPE)
endfunction()


########################################################
# tf_protos_cc library
########################################################

# Build proto library
include(FindProtobuf)
find_package(Protobuf REQUIRED)
include_directories(${PROTOBUF_INCLUDE_DIRS})
include_directories(${CMAKE_CURRENT_BINARY_DIR})
file(GLOB_RECURSE tf_protos_cc_srcs RELATIVE ${tensorflow_source_dir}
    "${tensorflow_source_dir}/tensorflow/*.proto"
)
RELATIVE_PROTOBUF_GENERATE_CPP(PROTO_SRCS PROTO_HDRS
    ${tensorflow_source_dir} ${tf_protos_cc_srcs}
)

add_library(tf_protos_cc ${PROTO_SRCS} ${PROTO_HDRS})
target_include_directories(tf_protos_cc PUBLIC
    ${CMAKE_CURRENT_BINARY_DIR}
)
target_link_libraries(tf_protos_cc PUBLIC
    ${PROTOBUF_LIBRARIES}
)


########################################################
# tf_core_lib library
########################################################
file(GLOB_RECURSE tf_core_lib_srcs
    "${tensorflow_source_dir}/tensorflow/core/lib/*.h"
    "${tensorflow_source_dir}/tensorflow/core/lib/*.cc"
    "${tensorflow_source_dir}/tensorflow/core/platform/*.h"
    "${tensorflow_source_dir}/tensorflow/core/platform/*.cc"
    "${tensorflow_source_dir}/tensorflow/core/public/*.h"
)

file(GLOB_RECURSE tf_core_lib_test_srcs
    "${tensorflow_source_dir}/tensorflow/core/lib/*test*.h"
    "${tensorflow_source_dir}/tensorflow/core/lib/*test*.cc"
    "${tensorflow_source_dir}/tensorflow/core/platform/*test*.h"
    "${tensorflow_source_dir}/tensorflow/core/platform/*test*.cc"
    "${tensorflow_source_dir}/tensorflow/core/public/*test*.h"
)

list(REMOVE_ITEM tf_core_lib_srcs ${tf_core_lib_test_srcs})

add_library(tf_core_lib OBJECT ${tf_core_lib_srcs})
target_include_directories(tf_core_lib PUBLIC
    ${tensorflow_source_dir}
    ${jpeg_INCLUDE_DIR}
    ${png_INCLUDE_DIR}
)
#target_link_libraries(tf_core_lib
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_protos_cc
#)
target_compile_options(tf_core_lib PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_core_lib PRIVATE
    cxx_rvalue_references
)

add_dependencies(tf_core_lib
    jpeg_copy_headers_to_destination
    png_copy_headers_to_destination
    re2_copy_headers_to_destination
    eigen
    tf_protos_cc
)


########################################################
# tf_core_framework library
########################################################
file(GLOB_RECURSE tf_core_framework_srcs
    "${tensorflow_source_dir}/tensorflow/core/framework/*.h"
    "${tensorflow_source_dir}/tensorflow/core/framework/*.cc"
    "${tensorflow_source_dir}/tensorflow/core/util/*.h"
    "${tensorflow_source_dir}/tensorflow/core/util/*.cc"
    "${tensorflow_source_dir}/public/*.h"
)

file(GLOB_RECURSE tf_core_framework_test_srcs
    "${tensorflow_source_dir}/tensorflow/core/framework/*test*.h"
    "${tensorflow_source_dir}/tensorflow/core/framework/*test*.cc"
    "${tensorflow_source_dir}/tensorflow/core/framework/*testutil.h"
    "${tensorflow_source_dir}/tensorflow/core/framework/*testutil.cc"
    "${tensorflow_source_dir}/tensorflow/core/framework/*main.cc"
    "${tensorflow_source_dir}/tensorflow/core/util/*test*.h"
    "${tensorflow_source_dir}/tensorflow/core/util/*test*.cc"
    "${tensorflow_source_dir}/tensorflow/core/util/*main.cc"
)

list(REMOVE_ITEM tf_core_framework_srcs ${tf_core_framework_test_srcs})

add_library(tf_core_framework OBJECT ${tf_core_framework_srcs})
target_include_directories(tf_core_framework PUBLIC
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
    ${re2_INCLUDES}
)
#target_link_libraries(tf_core_framework
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    #${re2_STATIC_LIBRARIES}
#    re2_lib
#    ${jpeg_STATIC_LIBRARIES}
#    ${png_STATIC_LIBRARIES}
#    tf_protos_cc
#    tf_core_lib
#)
add_dependencies(tf_core_framework
    tf_core_lib
)
target_compile_options(tf_core_framework PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)
# C++11
target_compile_features(tf_core_framework PRIVATE
    cxx_rvalue_references
)

tensorflow/contrib/cmake/tf_core_kernels.cmake (new file, 53 lines)
@@ -0,0 +1,53 @@
########################################################
# tf_core_kernels library
########################################################
file(GLOB_RECURSE tf_core_kernels_srcs
    "${tensorflow_source_dir}/tensorflow/core/kernels/*.h"
    "${tensorflow_source_dir}/tensorflow/core/kernels/*.cc"
)

file(GLOB_RECURSE tf_core_kernels_exclude_srcs
    "${tensorflow_source_dir}/tensorflow/core/kernels/*test*.h"
    "${tensorflow_source_dir}/tensorflow/core/kernels/*test*.cc"
    "${tensorflow_source_dir}/tensorflow/core/kernels/*testutil.h"
    "${tensorflow_source_dir}/tensorflow/core/kernels/*testutil.cc"
    "${tensorflow_source_dir}/tensorflow/core/kernels/*main.cc"
    "${tensorflow_source_dir}/tensorflow/core/kernels/*.cu.cc"
)

list(REMOVE_ITEM tf_core_kernels_srcs ${tf_core_kernels_exclude_srcs})

add_library(tf_core_kernels OBJECT ${tf_core_kernels_srcs})

add_dependencies(tf_core_kernels tf_core_cpu)

target_include_directories(tf_core_kernels PRIVATE
    ${tensorflow_source_dir}
    ${png_INCLUDE_DIR}
    ${eigen_INCLUDE_DIRS}
)

#target_link_libraries(tf_core_kernels
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_core_cpu
#    tf_core_framework
#    tf_core_lib
#    tf_protos_cc
#    tf_models_word2vec_kernels
#    tf_stream_executor
#    tf_core_ops
#    tf_core_cpu
#)

# "@gemmlowp//:eight_bit_int_gemm",

target_compile_options(tf_core_kernels PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_core_kernels PRIVATE
    cxx_rvalue_references
)

tensorflow/contrib/cmake/tf_core_ops.cmake (new file, 181 lines)
@@ -0,0 +1,181 @@
#def tf_gen_op_libs(op_lib_names):
#    # Make library out of each op so it can also be used to generate wrappers
#    # for various languages.
#    for n in op_lib_names:
#        native.cc_library(name=n + "_op_lib"
#                          copts=tf_copts(),
#                          srcs=["ops/" + n + ".cc"],
#                          deps=(["//tensorflow/core:framework"]),
#                          visibility=["//visibility:public"],
#                          alwayslink=1,
#                          linkstatic=1,)


set(tf_op_lib_names
    "array_ops"
    "attention_ops"
    "candidate_sampling_ops"
    "control_flow_ops"
    "data_flow_ops"
    "image_ops"
    "io_ops"
    "linalg_ops"
    "logging_ops"
    "functional_ops"
    "math_ops"
    "nn_ops"
    "no_op"
    "parsing_ops"
    "random_ops"
    "script_ops"
    "sendrecv_ops"
    "sparse_ops"
    "state_ops"
    "string_ops"
    "summary_ops"
    "training_ops"
)

foreach(tf_op_lib_name ${tf_op_lib_names})
    ########################################################
    # tf_${tf_op_lib_name} library
    ########################################################
    file(GLOB tf_${tf_op_lib_name}_srcs
        "${tensorflow_source_dir}/tensorflow/core/ops/${tf_op_lib_name}.cc"
    )

    add_library(tf_${tf_op_lib_name} OBJECT ${tf_${tf_op_lib_name}_srcs})

    add_dependencies(tf_${tf_op_lib_name} tf_core_framework)

    target_include_directories(tf_${tf_op_lib_name} PRIVATE
        ${tensorflow_source_dir}
        ${eigen_INCLUDE_DIRS}
    )

    target_compile_options(tf_${tf_op_lib_name} PRIVATE
        -fno-exceptions
        -DEIGEN_AVOID_STL_ARRAY
    )

    # C++11
    target_compile_features(tf_${tf_op_lib_name} PRIVATE
        cxx_rvalue_references
    )
endforeach()

#cc_library(
#    name = "user_ops_op_lib"
#    srcs = glob(["user_ops/**/*.cc"]),
#    copts = tf_copts(),
#    linkstatic = 1,
#    visibility = ["//visibility:public"],
#    deps = [":framework"],
#    alwayslink = 1,
#)
########################################################
# tf_user_ops library
########################################################
file(GLOB_RECURSE tf_user_ops_srcs
    "${tensorflow_source_dir}/tensorflow/core/user_ops/*.cc"
)

add_library(tf_user_ops OBJECT ${tf_user_ops_srcs})

add_dependencies(tf_user_ops tf_core_framework)

target_include_directories(tf_user_ops PRIVATE
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
)

target_compile_options(tf_user_ops PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_user_ops PRIVATE
    cxx_rvalue_references
)


#tf_cuda_library(
#    name = "ops"
#    srcs = glob(
#        [
#            "ops/**/*.h"
#            "ops/**/*.cc"
#            "user_ops/**/*.h"
#            "user_ops/**/*.cc"
#        ],
#        exclude = [
#            "**/*test*"
#            "**/*main.cc"
#            "user_ops/**/*.cu.cc"
#        ],
#    ),
#    copts = tf_copts(),
#    linkstatic = 1,
#    visibility = ["//visibility:public"],
#    deps = [
#        ":core"
#        ":lib"
#        ":protos_cc"
#        "//tensorflow/models/embedding:word2vec_ops"
#        "//third_party/eigen3"
#    ],
#    alwayslink = 1,
#)

########################################################
# tf_core_ops library
########################################################
file(GLOB_RECURSE tf_core_ops_srcs
    "${tensorflow_source_dir}/tensorflow/core/ops/*.h"
    "${tensorflow_source_dir}/tensorflow/core/ops/*.cc"
    "${tensorflow_source_dir}/tensorflow/core/user_ops/*.h"
    "${tensorflow_source_dir}/tensorflow/core/user_ops/*.cc"
)

file(GLOB_RECURSE tf_core_ops_exclude_srcs
    "${tensorflow_source_dir}/tensorflow/core/ops/*test*.h"
    "${tensorflow_source_dir}/tensorflow/core/ops/*test*.cc"
    "${tensorflow_source_dir}/tensorflow/core/ops/*main.cc"
    "${tensorflow_source_dir}/tensorflow/core/user_ops/*test*.h"
    "${tensorflow_source_dir}/tensorflow/core/user_ops/*test*.cc"
    "${tensorflow_source_dir}/tensorflow/core/user_ops/*main.cc"
    "${tensorflow_source_dir}/tensorflow/core/user_ops/*.cu.cc"
)

list(REMOVE_ITEM tf_core_ops_srcs ${tf_core_ops_exclude_srcs})

add_library(tf_core_ops OBJECT ${tf_core_ops_srcs})

add_dependencies(tf_core_ops tf_core_cpu)

target_include_directories(tf_core_ops PRIVATE
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
)

#target_link_libraries(tf_core_ops
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_protos_cc
#    tf_core_lib
#    tf_core_cpu
#    tf_models_word2vec_ops
#)

target_compile_options(tf_core_ops PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_core_ops PRIVATE
    cxx_rvalue_references
)

tensorflow/contrib/cmake/tf_models.cmake (new file, 95 lines)
@@ -0,0 +1,95 @@
#cc_library(
#    name = "word2vec_ops",
#    srcs = [
#        "word2vec_ops.cc",
#    ],
#    visibility = ["//tensorflow:internal"],
#    deps = [
#        "//tensorflow/core:framework",
#    ],
#    alwayslink = 1,
#)

########################################################
# tf_models_word2vec_ops library
########################################################
file(GLOB tf_models_word2vec_ops_srcs
    "${tensorflow_source_dir}/tensorflow/models/embedding/word2vec_ops.cc"
)

add_library(tf_models_word2vec_ops OBJECT ${tf_models_word2vec_ops_srcs})

target_include_directories(tf_models_word2vec_ops PRIVATE
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
)

add_dependencies(tf_models_word2vec_ops
    tf_core_framework
)
#target_link_libraries(tf_models_word2vec_ops
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_core_framework
#    tf_core_lib
#    tf_protos_cc
#)

target_compile_options(tf_models_word2vec_ops PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_models_word2vec_ops PRIVATE
    cxx_rvalue_references
)

#cc_library(
#    name = "word2vec_kernels",
#    srcs = [
#        "word2vec_kernels.cc",
#    ],
#    visibility = ["//tensorflow:internal"],
#    deps = [
#        "//tensorflow/core",
#    ],
#    alwayslink = 1,
#)
########################################################
# tf_models_word2vec_kernels library
########################################################
file(GLOB tf_models_word2vec_kernels_srcs
    "${tensorflow_source_dir}/tensorflow/models/embedding/word2vec_kernels.cc"
)

add_library(tf_models_word2vec_kernels OBJECT ${tf_models_word2vec_kernels_srcs})

target_include_directories(tf_models_word2vec_kernels PRIVATE
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
    ${re2_INCLUDES}
)

add_dependencies(tf_models_word2vec_ops
    tf_core_cpu
)

#target_link_libraries(tf_models_word2vec_kernels
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_core_framework
#    tf_core_lib
#    tf_protos_cc
#    tf_core_cpu
#)

target_compile_options(tf_models_word2vec_kernels PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_models_word2vec_kernels PRIVATE
    cxx_rvalue_references
)

tensorflow/contrib/cmake/tf_stream_executor.cmake (new file, 81 lines)
@@ -0,0 +1,81 @@
#cc_library(
#    name = "stream_executor",
#    srcs = glob(
#        [
#XX         "*.cc",
#            "lib/*.cc",
#        ],
#        exclude = [
#            "**/*_test.cc",
#        ],
#    ) + if_cuda(
#        glob([
#            "cuda/*.cc",
#        ]),
#    ),
#    hdrs = glob([
#        "*.h",
#        "cuda/*.h",
#        "lib/*.h",
#        "platform/**/*.h",
#    ]),
#    data = [
#        "//tensorflow/core:cuda",
#        "//third_party/gpus/cuda:cublas",
#        "//third_party/gpus/cuda:cudnn",
#    ],
#    linkopts = [
#        "-ldl",
#    ],
#    visibility = ["//visibility:public"],
#    deps = [
#        "//tensorflow/core:lib",
#        "//third_party/gpus/cuda:cuda_headers",
#    ],
#    alwayslink = 1,
#)

########################################################
# tf_stream_executor library
########################################################
file(GLOB tf_stream_executor_srcs
    "${tensorflow_source_dir}/tensorflow/stream_executor/*.cc"
    "${tensorflow_source_dir}/tensorflow/stream_executor/*.h"
    "${tensorflow_source_dir}/tensorflow/stream_executor/lib/*.cc"
    "${tensorflow_source_dir}/tensorflow/stream_executor/lib/*.h"
    "${tensorflow_source_dir}/tensorflow/stream_executor/platform/*.h"
    "${tensorflow_source_dir}/tensorflow/stream_executor/platform/default/*.h"
)

#file(GLOB_RECURSE tf_stream_executor_test_srcs
#    "${tensorflow_source_dir}/tensorflow/stream_executor/*_test.cc"
#    "${tensorflow_source_dir}/tensorflow/stream_executor/*_test.h"
#)
#
#list(REMOVE_ITEM tf_stream_executor_srcs ${tf_stream_executor_test_srcs})

add_library(tf_stream_executor OBJECT ${tf_stream_executor_srcs})

target_include_directories(tf_stream_executor PRIVATE
    ${tensorflow_source_dir}
)
add_dependencies(tf_stream_executor
    tf_core_lib
)
#target_link_libraries(tf_stream_executor
#    ${CMAKE_THREAD_LIBS_INIT}
#    ${PROTOBUF_LIBRARIES}
#    tf_protos_cc
#    tf_core_lib
#)

target_compile_options(tf_stream_executor PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_stream_executor PRIVATE
    cxx_rvalue_references
)

tensorflow/contrib/cmake/tf_tutorials.cmake (new file, 54 lines)
@@ -0,0 +1,54 @@
#cc_binary(
#    name = "tutorials_example_trainer",
#    srcs = ["tutorials/example_trainer.cc"],
#    copts = tf_copts(),
#    linkopts = [
#        "-lpthread",
#        "-lm",
#    ],
#    deps = [
#        ":cc_ops",
#        "//tensorflow/core:kernels",
#        "//tensorflow/core:tensorflow",
#    ],
#)

set(tf_tutorials_example_trainer_srcs
    "${tensorflow_source_dir}/tensorflow/cc/tutorials/example_trainer.cc"
)

add_executable(tf_tutorials_example_trainer
    ${tf_tutorials_example_trainer_srcs}
    $<TARGET_OBJECTS:tf_core_lib>
    $<TARGET_OBJECTS:tf_core_cpu>
    $<TARGET_OBJECTS:tf_core_framework>
    $<TARGET_OBJECTS:tf_core_kernels>
    $<TARGET_OBJECTS:tf_cc_ops>
    $<TARGET_OBJECTS:tf_core_ops>
    $<TARGET_OBJECTS:tf_core_direct_session>
)

target_include_directories(tf_tutorials_example_trainer PUBLIC
    ${tensorflow_source_dir}
    ${eigen_INCLUDE_DIRS}
)

target_link_libraries(tf_tutorials_example_trainer PUBLIC
    ${CMAKE_THREAD_LIBS_INIT}
    ${PROTOBUF_LIBRARIES}
    tf_protos_cc
    re2_lib
    ${jpeg_STATIC_LIBRARIES}
    ${png_STATIC_LIBRARIES}
    ${ZLIB_LIBRARIES}
)

target_compile_options(tf_tutorials_example_trainer PRIVATE
    -fno-exceptions
    -DEIGEN_AVOID_STL_ARRAY
)

# C++11
target_compile_features(tf_tutorials_example_trainer PRIVATE
    cxx_rvalue_references
)

@@ -79,7 +79,7 @@ def _reduce_batch(x, reduce_fn, name=None):
   elif ndims == 1:
     return x  # Don't include a useless reduction.
   elif ndims:
-    reduction_indices = range(1, ndims)
+    reduction_indices = list(range(1, ndims))
     shape = [x.get_shape().dims[0]]
   else:
     reduction_indices = math_ops.range(1, array_ops.size(array_ops.shape(x)))
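The one-line change in the hunk above is the standard Python 3 compatibility fix: `range` returns a lazy range object rather than a list, so code that needs an actual list of reduction indices must materialize it. A minimal sketch of the language behavior (plain Python, not TensorFlow code):

```python
# In Python 3, range() returns a lazy range object, not a list.
ndims = 4
lazy = range(1, ndims)
print(isinstance(lazy, list))  # False

# Wrapping it in list() materializes the indices, restoring the
# Python 2 behavior the original code relied on.
reduction_indices = list(range(1, ndims))
print(reduction_indices)  # [1, 2, 3]
```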

@@ -73,11 +73,6 @@ struct Regularizations {
   float symmetric_l2 = 0;
 };
 
-struct RegularizationLoss {
-  double l1_loss = 0;
-  double l2_loss = 0;
-};
-
 struct PerExampleData {
   double wx = 0;
   double norm = 0;
@@ -102,7 +97,7 @@ using DenseFeaturesByGroup = std::vector<TTypes<const float>::Vec>;
// indicates that the contents of sparse_examples_by_group cannot be trusted or
// used.
Status FillSparseExamplesByGroup(
    const int64 num_sparse_features, const int64 num_examples,
    const int64 num_sparse_features, const int num_examples,
    const OpInputList& sparse_features_indices_inputs,
    const OpInputList& sparse_features_values_inputs,
    const WeightsByGroup& sparse_weights_by_group,
@@ -127,7 +122,10 @@ Status FillSparseExamplesByGroup(
    static const int64 kIndicesDims = 2;
    gtl::InlinedVector<int64, 8> order(kIndicesDims);
    std::iota(order.begin(), order.end(), 0);
    for (int64 i = begin; i < end; ++i) {

    // The static_cast here is safe since begin and end can be at most
    // num_examples which is an int.
    for (int i = static_cast<int>(begin); i < end; ++i) {
      if (sparse_features_indices_inputs[i].shape().dims() != kIndicesDims) {
        mutex_lock l(mu);
        result = errors::InvalidArgument(strings::Printf(
@@ -147,7 +145,7 @@ Status FillSparseExamplesByGroup(
      if (example_index < 0 || example_index >= num_examples) {
        mutex_lock l(mu);
        result = errors::Internal(strings::Printf(
            "Example indices should be in [0, %lld). Encountered: %lld",
            "Example indices should be in [0, %d). Encountered: %lld",
            num_examples, example_index));
        return;
      }
@@ -203,35 +201,6 @@ inline double Shrink(const double weight, const double shrink_by) {
  return 0.0;
}

// Compute L1 and L2 regularization loss.
inline RegularizationLoss ComputeRegularizationLoss(
    const WeightsByGroup& sparse_weights_by_group,
    const WeightsByGroup& dense_weights_by_group,
    const Regularizations& regularizations) {
  RegularizationLoss result;

  const double shrink_by = ShrinkageFactor(regularizations);
  auto accumulate_regularization_loss = [&](const double w) {
    const double sw = std::abs(Shrink(w, shrink_by));
    result.l1_loss += sw;
    result.l2_loss += sw * sw;
  };

  for (const TTypes<float>::Vec weights : sparse_weights_by_group) {
    for (int64 i = 0; i < weights.size(); ++i) {
      accumulate_regularization_loss(weights(i));
    }
  }

  for (const TTypes<float>::Vec weights : dense_weights_by_group) {
    accumulate_regularization_loss(weights(0));
  }

  result.l1_loss *= regularizations.symmetric_l1;
  result.l2_loss *= regularizations.symmetric_l2;
  return result;
}

// Compute PerExampleData which contains the logits, and weighted example norm
// for a given example_id. Norm is weighted by 1/(lambda*N).
inline PerExampleData ComputeWxAndWeightedExampleNorm(
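The removed `ComputeRegularizationLoss` above shrinks each weight toward zero and then accumulates `|w|` and `w^2` before scaling by the l1/l2 strengths. A minimal Python sketch of that arithmetic, assuming `Shrink` is a soft-threshold (the function names here are illustrative, not the kernel's actual API):

```python
def shrink(weight, shrink_by):
    # Soft-threshold: move weight toward zero by shrink_by, clamping at zero.
    if weight > shrink_by:
        return weight - shrink_by
    if weight < -shrink_by:
        return weight + shrink_by
    return 0.0


def regularization_loss(weight_groups, l1, l2, shrink_by):
    # Accumulate |shrunk w| and (shrunk w)^2 over every weight, then scale.
    l1_loss = 0.0
    l2_loss = 0.0
    for weights in weight_groups:
        for w in weights:
            sw = abs(shrink(w, shrink_by))
            l1_loss += sw
            l2_loss += sw * sw
    return l1 * l1_loss, l2 * l2_loss
```

For example, with a single group `[2.0, -0.5]` and `shrink_by=1.0`, only the first weight survives the threshold, contributing 1.0 to both losses.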
@@ -380,7 +349,7 @@ WeightsByGroup MakeDeltaWeightsFrom(std::vector<Tensor>* const tensors) {
}

Status RunTrainStepsForMiniBatch(
    const int64 num_examples, const TTypes<const string>::Vec example_ids,
    const int num_examples, const TTypes<const string>::Vec example_ids,
    const TTypes<const float>::Vec example_labels,
    const TTypes<const float>::Vec example_weights,
    const DeviceBase::CpuWorkerThreads& worker_threads,
@@ -459,6 +428,13 @@ Status RunTrainStepsForMiniBatch(
  return train_step_status;
}

Status FillRegularizations(OpKernelConstruction* const context,
                           Regularizations* const regularizations) {
  TF_RETURN_IF_ERROR(context->GetAttr("l1", &regularizations->symmetric_l1));
  TF_RETURN_IF_ERROR(context->GetAttr("l2", &regularizations->symmetric_l2));
  return Status::OK();
}

}  // namespace

class SdcaSolver : public OpKernel {
@@ -484,25 +460,9 @@ class SdcaSolver : public OpKernel {
    OP_REQUIRES(
        context, num_sparse_features_ + num_dense_features_ > 0,
        errors::InvalidArgument("Requires at least one feature to train."));

    OP_REQUIRES_OK(context,
                   context->GetAttr("l1", &regularizations_.symmetric_l1));
    OP_REQUIRES_OK(context,
                   context->GetAttr("l2", &regularizations_.symmetric_l2));
    // We enforce a minimal l2, required by the algorithm.
    regularizations_.symmetric_l2 =
        std::max(regularizations_.symmetric_l2, 1.0f);

    OP_REQUIRES_OK(context, FillRegularizations(context, &regularizations_));
    OP_REQUIRES_OK(context, context->GetAttr("num_inner_iterations",
                                             &num_inner_iterations_));

    // TODO(rohananil): Provide emperical evidence for this. It is better to run
    // more than one iteration on single mini-batch as we want to spend more
    // time in compute. SDCA works better with larger mini batches and there
    // is also recent work that shows its better to reuse old samples than train
    // on new samples. See: http://arxiv.org/abs/1602.02136.
    num_inner_iterations_ =
        std::max(num_inner_iterations_, static_cast<int64>(2));
    OP_REQUIRES_OK(context, context->GetAttr("container", &container_));
    OP_REQUIRES_OK(context, context->GetAttr("solver_uuid", &solver_uuid_));
  }
@@ -533,21 +493,16 @@ class SdcaSolver : public OpKernel {
    OP_REQUIRES(context, TensorShapeUtils::IsVector(example_weights_t->shape()),
                errors::InvalidArgument("example_weights should be a vector."));
    const auto example_weights = example_weights_t->vec<float>();

    Eigen::Tensor<float, 0, Eigen::RowMajor> example_weights_sum;
    example_weights_sum.device(context->eigen_cpu_device()) =
        example_weights.sum();
    const float weighted_examples = example_weights_sum();
    const int64 num_examples = example_weights.size();

    OP_REQUIRES(context, weighted_examples > 0,
                errors::InvalidArgument("No weighted examples in ",
                                        num_examples, " training examples"));
    OP_REQUIRES(context,
                example_weights.size() <= std::numeric_limits<int>::max(),
                errors::InvalidArgument(strings::Printf(
                    "Too many examples in a mini-batch: %ld > %d",
                    example_weights.size(), std::numeric_limits<int>::max())));
    const int num_examples = static_cast<int>(example_weights.size());

    OpInputList dense_features_inputs;
    OP_REQUIRES_OK(
        context, context->input_list("dense_features", &dense_features_inputs));

    DenseFeaturesByGroup dense_features_by_group;
    for (const auto& dense_feature : dense_features_inputs) {
      dense_features_by_group.emplace_back(dense_feature.vec<float>());
@@ -562,7 +517,7 @@ class SdcaSolver : public OpKernel {
    OP_REQUIRES(context, example_labels.size() == num_examples,
                errors::InvalidArgument(strings::Printf(
                    "The number of example labels (%ld) should match the "
                    "number of example weights (%lld).",
                    "number of example weights (%d).",
                    example_labels.size(), num_examples)));

    const Tensor* example_ids_t;
@@ -573,7 +528,7 @@ class SdcaSolver : public OpKernel {
    OP_REQUIRES(context, example_labels.size() == num_examples,
                errors::InvalidArgument(strings::Printf(
                    "The number of example ids (%ld) should match the number "
                    "of example weights (%lld).",
                    "of example weights (%d).",
                    example_ids.size(), num_examples)));
    const int64 num_duplicate_example_ids = [&] {
      // TODO(katsiapis): Benchmark and/or optimize.
@@ -632,12 +587,7 @@ class SdcaSolver : public OpKernel {
    SetZeroDeltaWeights(&sparse_delta_weights_by_group,
                        &dense_delta_weights_by_group);

    // TODO(rohananil): Provide emperical evidence for this. It is better to run
    // more than one iteration on single mini-batch as we want to spend more
    // time in compute. SDCA works better with larger mini batches and there
    // is also recent work that shows its better to reuse old samples than train
    // on new samples. See: http://arxiv.org/abs/1602.02136.
    for (int64 i = 0; i < num_inner_iterations_; ++i) {
    for (int i = 0; i < num_inner_iterations_; ++i) {
      OP_REQUIRES_OK(
          context,
          RunTrainStepsForMiniBatch(
@@ -669,7 +619,7 @@ class SdcaSolver : public OpKernel {
  int64 num_sparse_features_;
  int64 num_dense_features_;
  Regularizations regularizations_;
  int64 num_inner_iterations_;
  int num_inner_iterations_;
  string container_;
  string solver_uuid_;
};
@@ -678,13 +628,7 @@ REGISTER_KERNEL_BUILDER(Name("SdcaSolver").Device(DEVICE_CPU), SdcaSolver);
class SdcaShrinkL1 : public OpKernel {
 public:
  explicit SdcaShrinkL1(OpKernelConstruction* context) : OpKernel(context) {
    OP_REQUIRES_OK(context,
                   context->GetAttr("l1", &regularizations_.symmetric_l1));
    OP_REQUIRES_OK(context,
                   context->GetAttr("l2", &regularizations_.symmetric_l2));
    // We enforce a minimal l2, required by the algorithm.
    regularizations_.symmetric_l2 =
        std::max(regularizations_.symmetric_l2, 1.0f);
    OP_REQUIRES_OK(context, FillRegularizations(context, &regularizations_));
  }

  void Compute(OpKernelContext* context) override {
@@ -709,19 +653,10 @@ class SdcaShrinkL1 : public OpKernel {
};
REGISTER_KERNEL_BUILDER(Name("SdcaShrinkL1").Device(DEVICE_CPU), SdcaShrinkL1);

class ComputeDualityGap : public OpKernel {
class SdcaTrainingStats : public OpKernel {
 public:
  explicit ComputeDualityGap(OpKernelConstruction* context)
  explicit SdcaTrainingStats(OpKernelConstruction* context)
      : OpKernel(context) {
    // TODO(rohananil): Refactor grabbing common attributes across ops related
    // to sdca.
    OP_REQUIRES_OK(context,
                   context->GetAttr("l1", &regularizations_.symmetric_l1));
    OP_REQUIRES_OK(context,
                   context->GetAttr("l2", &regularizations_.symmetric_l2));
    // We enforce a minimal l2, required by the algorithm.
    regularizations_.symmetric_l2 =
        std::max(regularizations_.symmetric_l2, 1.0f);
    OP_REQUIRES_OK(context, context->GetAttr("container", &container_));
    OP_REQUIRES_OK(context, context->GetAttr("solver_uuid", &solver_uuid_));
  }
@@ -734,45 +669,56 @@ class ComputeDualityGap : public OpKernel {
        context, !data_by_example->RefCountIsOne(),
        errors::Internal("Expected shared-ownership of data_by_example."));

    OpMutableInputList sparse_weights_inputs;
    OP_REQUIRES_OK(context, context->mutable_input_list(
                                "sparse_weights", &sparse_weights_inputs));
    WeightsByGroup sparse_weights_by_group =
        MakeWeightsFrom(&sparse_weights_inputs);

    OpMutableInputList dense_weights_inputs;
    OP_REQUIRES_OK(context, context->mutable_input_list("dense_weights",
                                                        &dense_weights_inputs));
    WeightsByGroup dense_weights_by_group =
        MakeWeightsFrom(&dense_weights_inputs);

    double example_weight_sum = 0;
    double total_duality_gap = 0;
    double total_primal_loss = 0;
    double total_dual_loss = 0;
    double total_example_weight = 0;
    OP_REQUIRES_OK(context,
                   data_by_example->Visit([&](const DataByExample::Data& data) {
                     example_weight_sum += data.example_weight;
                     total_duality_gap += data.primal_loss + data.dual_loss;
                     total_primal_loss += data.primal_loss;
                     total_dual_loss += data.dual_loss;
                     total_example_weight += data.example_weight;
                   }));

    const RegularizationLoss regularization_loss = ComputeRegularizationLoss(
        sparse_weights_by_group, dense_weights_by_group, regularizations_);
    total_duality_gap +=
        regularization_loss.l2_loss + regularization_loss.l1_loss;
    // TODO(katsiapis): Think about the most arithmetically stable way of
    // computing (dual + primal) loss (if it matters).

    Tensor* duality_gap_t = nullptr;
    OP_REQUIRES_OK(context,
                   context->allocate_output("duality_gap", {}, &duality_gap_t));
    duality_gap_t->scalar<float>()() = total_duality_gap / example_weight_sum;
    {
      Tensor* tensor = nullptr;
      OP_REQUIRES_OK(context,
                     context->allocate_output("primal_loss", {}, &tensor));
      tensor->scalar<double>()() = total_primal_loss;
    }

    {
      Tensor* tensor = nullptr;
      OP_REQUIRES_OK(context,
                     context->allocate_output("dual_loss", {}, &tensor));
      tensor->scalar<double>()() = total_dual_loss;
    }

    {
      OP_REQUIRES(
          context, total_example_weight > 0,
          errors::FailedPrecondition(
              "No examples found or all examples have zero weight. Either the "
              "optimizer was trained with no instances or perhaps there is a "
              "bug in the training data."));

      Tensor* tensor = nullptr;
      OP_REQUIRES_OK(context,
                     context->allocate_output("example_weights", {}, &tensor));
      tensor->scalar<double>()() = total_example_weight;
    }

    // TODO(katsiapis): Use core::ScopedUnref once it's moved out of internal.
    data_by_example->Unref();
  }

 private:
  Regularizations regularizations_;
  string container_;
  string solver_uuid_;
};
REGISTER_KERNEL_BUILDER(Name("ComputeDualityGap").Device(DEVICE_CPU),
                        ComputeDualityGap);
REGISTER_KERNEL_BUILDER(Name("SdcaTrainingStats").Device(DEVICE_CPU),
                        SdcaTrainingStats);

}  // namespace tensorflow
@@ -24,7 +24,7 @@ REGISTER_OP("SdcaSolver")
    .Attr("num_dense_features: int >= 0")
    .Attr("l1: float >= 0")
    .Attr("l2: float >= 1")
    .Attr("num_inner_iterations: int >= 2")
    .Attr("num_inner_iterations: int >= 1")
    .Attr("container: string")
    .Attr("solver_uuid: string")
    .Input("sparse_features_indices: num_sparse_features * int64")
@@ -69,7 +69,7 @@ example_labels: a vector which contains the label/target associated with each
example_ids: a vector which contains the unique identifier associated with each
  example.
sparse_weights: a list of vectors where each value is the weight associated with
  a feature index.
  a feature group.
dense_weights: a list of vectors where the value is the weight associated with
  a dense feature group.
)doc");
@@ -89,38 +89,28 @@ num_dense_features: Number of dense feature groups to train on.
l1: Symmetric l1 regularization strength.
l2: Symmetric l2 regularization strength.
sparse_weights: a list of vectors where each value is the weight associated with
  a feature index.
  a feature group.
dense_weights: a list of vectors where the value is the weight associated with
  a dense feature group.
)doc");

// TODO(katsiapis): We should expand this scope of this op to compute other
// statistics about the data.
REGISTER_OP("ComputeDualityGap")
    .Attr("num_sparse_features: int >= 0")
    .Attr("num_dense_features: int >= 0")
    .Attr("l1: float >= 0")
    .Attr("l2: float >= 1")
REGISTER_OP("SdcaTrainingStats")
    .Attr("container: string")
    .Attr("solver_uuid: string")
    .Input("sparse_weights: Ref(num_sparse_features * float)")
    .Input("dense_weights: Ref(num_dense_features * float)")
    .Output("duality_gap: float")
    .Output("primal_loss: float64")
    .Output("dual_loss: float64")
    .Output("example_weights: float64")
    .Doc(R"doc(
Computes duality gap over all examples seen by the optimizer.
Computes statistics over all examples seen by the optimizer.

num_sparse_features: Number of sparse feature groups to train on.
num_dense_features: Number of dense feature groups to train on.
l1: Symmetric l1 regularization strength.
l2: Symmetric l2 regularization strength.
container: Name of the Container that stores data across invocations of this
  Kernel. Together with SolverUUID form an isolation unit for this solver.
solver_uuid: Universally Unique Identifier for this solver.
sparse_weights: a list of vectors where each value is the weight associated with
  a feature index.
dense_weights: a list of vectors where the value is the weight associated with
  a dense feature group.
duality_gap: duality gap over all examples seen by the optimizer.
primal_loss: total primal loss of all examples seen by the optimizer.
dual_loss: total dual loss of all examples seen by the optimizer.
example_weights: total example weights of all examples seen by the optimizer
  (guaranteed to be positive; otherwise returns FAILED_PRECONDITION as it
  probably indicates a bug in the training data).
)doc");

}  // namespace tensorflow
@@ -92,6 +92,7 @@ def make_variable_dict(max_age, max_gender):
  return dict(sparse_features_weights=[age_weights, gender_weights],
              dense_features_weights=[])


def make_dense_variable_dict(num_dense_features, num_examples):
  feature_weights = ([
      tf.Variable(tf.zeros([1],
@@ -130,6 +131,7 @@ def tearDown():
  pass


# TODO(katsiapis): Add tests that exercise L1 and Shrinking.
class SdcaOptimizerTest(TensorFlowTestCase):

  def _single_threaded_test_session(self):
@@ -180,6 +182,44 @@ class SdcaOptimizerTest(TensorFlowTestCase):
                          rtol=1e-2,
                          atol=1e-2)

  def testSimpleLogisticNoL2(self):
    # Same as test above (so comments from above apply) but without an L2.
    # The algorithm should behave as if we have an L2 of 1 in optimization but
    # 0 in regularized_loss.
    example_protos = [
        make_example_proto(
            {'age': [0],
             'gender': [0]}, 0),
        make_example_proto(
            {'age': [1],
             'gender': [1]}, 1),
    ]
    example_weights = [1.0, 1.0]
    with self._single_threaded_test_session():
      examples = make_example_dict(example_protos, example_weights)
      variables = make_variable_dict(1, 1)
      options = dict(symmetric_l2_regularization=0,
                     symmetric_l1_regularization=0,
                     loss_type='logistic_loss')

      lr = SdcaModel(CONTAINER, examples, variables, options)
      tf.initialize_all_variables().run()
      unregularized_loss = lr.unregularized_loss(examples)
      loss = lr.regularized_loss(examples)
      predictions = lr.predictions(examples)
      self.assertAllClose(0.693147, unregularized_loss.eval())
      self.assertAllClose(0.693147, loss.eval())
      for _ in xrange(5):
        lr.minimize().run()
      self.assertAllClose(0.411608, unregularized_loss.eval(), rtol=0.11)
      self.assertAllClose(0.371705, loss.eval(), atol=0.01)
      predicted_labels = get_binary_predictions_for_logistic(predictions)
      self.assertAllEqual([0, 1], predicted_labels.eval())
      self.assertAllClose(0.01,
                          lr.approximate_duality_gap().eval(),
                          rtol=1e-2,
                          atol=1e-2)

  def testSomeUnweightedExamples(self):
    # Setup test data with 4 examples, but should produce the same
    # results as testSimple.
@@ -272,10 +312,11 @@ class SdcaOptimizerTest(TensorFlowTestCase):
      lr = SdcaModel(CONTAINER, examples, variables, options)
      tf.initialize_all_variables().run()
      self.assertAllClose([0.5, 0.5], lr.predictions(examples).eval())
      with self.assertRaisesOpError(
          'No weighted examples in 2 training examples'):
        lr.minimize().run()
      lr.minimize().run()
      self.assertAllClose([0.5, 0.5], lr.predictions(examples).eval())
      with self.assertRaisesOpError(
          'No examples found or all examples have zero weight.'):
        lr.approximate_duality_gap().eval()

  def testDuplicateExampleIds(self):
    # Setup test data with 1 positive, and 1 negative example.
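The 0.693147 constants asserted in the new test are simply ln 2: with all weights initialized to zero, every logit is zero, so each example's logistic loss is log(2) regardless of its label. A quick standalone check of that arithmetic:

```python
import math

def logistic_loss(wx, y):
    # Binary logistic loss for logit wx and label y in {0, 1}.
    if y == 1:
        return math.log(1.0 + math.exp(-wx))
    return math.log(1.0 + math.exp(wx))

# At initialization all weights are zero, so wx == 0 for every example.
losses = [logistic_loss(0.0, y) for y in (0, 1)]
print(sum(losses) / len(losses))  # ~0.693147, i.e. ln 2
```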
@@ -28,7 +28,6 @@ from tensorflow.python.framework.ops import name_scope
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import state_ops
from tensorflow.python.ops import variables as var_ops
from tensorflow.python.ops.nn import sigmoid_cross_entropy_with_logits
from tensorflow.python.platform import resource_loader
@@ -139,30 +138,35 @@ class SdcaModel(object):
        ['loss_type', 'symmetric_l2_regularization',
         'symmetric_l1_regularization'], options)

    for name in ['symmetric_l1_regularization', 'symmetric_l2_regularization']:
      value = options[name]
      if value < 0.0:
        raise ValueError('%s should be non-negative. Found (%f)' %
                         (name, value))

    self._container = container
    self._examples = examples
    self._variables = variables
    self._options = options
    self._solver_uuid = uuid.uuid4().hex
    self._create_slots(variables)
    self._create_slots()

  # TODO(rohananil): Use optimizer interface to make use of slot creation
  # logic
  def _create_slots(self, variables):
    self._slots = {}
    # TODO(rohananil): Rename the slot keys to "unshrinked" weights.
    self._slots['sparse_features_weights'] = []
    self._slots['dense_features_weights'] = []
    self._assign_ops = []
    # Make an internal variable which has the updates before applying L1
  def _symmetric_l2_regularization(self):
    # Algorithmic requirement (for now) is to have minimal l2 of 1.0
    return max(self._options['symmetric_l2_regularization'], 1.0)

  # TODO(rohananil): Use optimizer interface to make use of slot creation logic.
  def _create_slots(self):
    # Make internal variables which have the updates before applying L1
    # regularization.
    for var_type in ['sparse_features_weights', 'dense_features_weights']:
      for var in variables[var_type]:
        if var is not None:
          self._slots[var_type].append(var_ops.Variable(array_ops.zeros_like(
              var.initialized_value(), dtypes.float32)))
          self._assign_ops.append(state_ops.assign(var, self._slots[var_type][
              -1]))
    self._slots = {
        'unshrinked_sparse_features_weights': [],
        'unshrinked_dense_features_weights': [],
    }
    for name in ['sparse_features_weights', 'dense_features_weights']:
      for var in self._variables[name]:
        self._slots['unshrinked_' + name].append(var_ops.Variable(
            array_ops.zeros_like(var.initialized_value(), dtypes.float32)))

  def _assertSpecified(self, items, check_in):
    for x in items:
@@ -177,33 +181,22 @@ class SdcaModel(object):
  def _l1_loss(self):
    """Computes the l1 loss of the model."""
    with name_scope('l1_loss'):
      sparse_weights = self._convert_n_to_tensor(self._variables[
          'sparse_features_weights'])
      dense_weights = self._convert_n_to_tensor(self._variables[
          'dense_features_weights'])
      l1 = self._options['symmetric_l1_regularization']
      loss = 0.0
      for w in sparse_weights:
        loss += l1 * math_ops.reduce_sum(abs(w))
      for w in dense_weights:
        loss += l1 * math_ops.reduce_sum(abs(w))
      return loss
      sum = 0.0
      for name in ['sparse_features_weights', 'dense_features_weights']:
        for weights in self._convert_n_to_tensor(self._variables[name]):
          sum += math_ops.reduce_sum(math_ops.abs(weights))
      # SDCA L1 regularization cost is: l1 * sum(|weights|)
      return self._options['symmetric_l1_regularization'] * sum

  def _l2_loss(self):
  def _l2_loss(self, l2):
    """Computes the l2 loss of the model."""
    with name_scope('l2_loss'):
      sparse_weights = self._convert_n_to_tensor(self._variables[
          'sparse_features_weights'])
      dense_weights = self._convert_n_to_tensor(self._variables[
          'dense_features_weights'])
      l2 = self._options['symmetric_l2_regularization']
      loss = 0.0
      for w in sparse_weights:
        loss += l2 * math_ops.reduce_sum(math_ops.square(w))
      for w in dense_weights:
        loss += l2 * math_ops.reduce_sum(math_ops.square(w))
      # SDCA L2 regularization cost is 1/2 * l2 * sum(weights^2)
      return loss / 2.0
      sum = 0.0
      for name in ['sparse_features_weights', 'dense_features_weights']:
        for weights in self._convert_n_to_tensor(self._variables[name]):
          sum += math_ops.reduce_sum(math_ops.square(weights))
      # SDCA L2 regularization cost is: l2 * sum(weights^2) / 2
      return l2 * sum / 2

  def _convert_n_to_tensor(self, input_list, as_ref=False):
    """Converts input list to a set of tensors."""
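The refactored `_l1_loss` and `_l2_loss` above reduce to simple sums over all weight groups. A minimal plain-Python sketch of the same formulas (operating on lists of floats rather than tensors, purely for illustration):

```python
def l1_loss(weight_groups, l1):
    # SDCA L1 regularization cost: l1 * sum(|w|) over all groups.
    return l1 * sum(abs(w) for weights in weight_groups for w in weights)


def l2_loss(weight_groups, l2):
    # SDCA L2 regularization cost: l2 * sum(w^2) / 2 over all groups.
    return l2 * sum(w * w for weights in weight_groups for w in weights) / 2.0


groups = [[1.0, -2.0], [3.0]]
print(l1_loss(groups, 0.5))  # 0.5 * (1 + 2 + 3) = 3.0
print(l2_loss(groups, 2.0))  # 2.0 * (1 + 4 + 9) / 2 = 14.0
```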
@@ -265,31 +258,44 @@ class SdcaModel(object):
    """
    with name_scope('sdca/minimize'):
      sparse_features_indices = []
      sparse_features_weights = []
      sparse_features_values = []
      for sf in self._examples['sparse_features']:
        sparse_features_indices.append(convert_to_tensor(sf.indices))
        sparse_features_weights.append(convert_to_tensor(sf.values))
        sparse_features_values.append(convert_to_tensor(sf.values))

      step_op = _sdca_ops.sdca_solver(
          sparse_features_indices,
          sparse_features_weights,
          sparse_features_values,
          self._convert_n_to_tensor(self._examples['dense_features']),
          convert_to_tensor(self._examples['example_weights']),
          convert_to_tensor(self._examples['example_labels']),
          convert_to_tensor(self._examples['example_ids']),
          self._convert_n_to_tensor(self._slots['sparse_features_weights'],
                                    as_ref=True),
          self._convert_n_to_tensor(self._slots['dense_features_weights'],
                                    as_ref=True),
          self._convert_n_to_tensor(
              self._slots['unshrinked_sparse_features_weights'],
              as_ref=True),
          self._convert_n_to_tensor(
              self._slots['unshrinked_dense_features_weights'],
              as_ref=True),
          l1=self._options['symmetric_l1_regularization'],
          l2=self._options['symmetric_l2_regularization'],
          l2=self._symmetric_l2_regularization(),
          # TODO(rohananil): Provide empirical evidence for this. It is better
          # to run more than one iteration on single mini-batch as we want to
          # spend more time in compute. SDCA works better with larger
          # mini-batches and there is also recent work that shows its better to
          # reuse old samples than train on new samples.
          # See: http://arxiv.org/abs/1602.02136.
          num_inner_iterations=2,
          loss_type=self._options['loss_type'],
          container=self._container,
          solver_uuid=self._solver_uuid)
      with ops.control_dependencies([step_op]):
        assign_ops = control_flow_ops.group(*self._assign_ops)
        with ops.control_dependencies([assign_ops]):
        assign_ops = []
        for name in ['sparse_features_weights', 'dense_features_weights']:
          for var, slot_var in zip(self._variables[name],
                                   self._slots['unshrinked_' + name]):
            assign_ops.append(var.assign(slot_var))
        assign_group = control_flow_ops.group(*assign_ops)
        with ops.control_dependencies([assign_group]):
          return _sdca_ops.sdca_shrink_l1(
              self._convert_n_to_tensor(
                  self._variables['sparse_features_weights'],
@@ -298,7 +304,7 @@ class SdcaModel(object):
              self._convert_n_to_tensor(
                  self._variables['dense_features_weights'],
                  as_ref=True),
              l1=self._options['symmetric_l1_regularization'],
              l2=self._options['symmetric_l2_regularization'])
              l2=self._symmetric_l2_regularization())

  def approximate_duality_gap(self):
    """Add operations to compute the approximate duality gap.
@@ -307,15 +313,14 @@ class SdcaModel(object):
      An Operation that computes the approximate duality gap over all
      examples.
    """
    return _sdca_ops.compute_duality_gap(
        self._convert_n_to_tensor(self._slots['sparse_features_weights'],
                                  as_ref=True),
        self._convert_n_to_tensor(self._slots['dense_features_weights'],
                                  as_ref=True),
        l1=self._options['symmetric_l1_regularization'],
        l2=self._options['symmetric_l2_regularization'],
    (primal_loss, dual_loss, example_weights) = _sdca_ops.sdca_training_stats(
        container=self._container,
        solver_uuid=self._solver_uuid)
    # Note that example_weights is guaranteed to be positive by
    # sdca_training_stats so dividing by it is safe.
    return (primal_loss + dual_loss + math_ops.to_double(self._l1_loss()) +
            (2.0 * math_ops.to_double(self._l2_loss(
                self._symmetric_l2_regularization())))) / example_weights

  def unregularized_loss(self, examples):
    """Add operations to compute the loss (without the regularization loss).
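After this refactor the gap is assembled in Python from the raw statistics the op returns. A plain-number sketch of the same arithmetic, with scalar floats standing in for the tensors (names here are illustrative only):

```python
def approximate_duality_gap(primal_loss, dual_loss, l1_loss, l2_loss,
                            example_weights):
    # Mirrors the expression built from sdca_training_stats outputs:
    # gap = (primal + dual + l1 + 2 * l2) / total example weight.
    # The factor of 2 compensates for _l2_loss already halving l2 * sum(w^2).
    return (primal_loss + dual_loss + l1_loss + 2.0 * l2_loss) / example_weights


print(approximate_duality_gap(3.0, 1.0, 0.5, 0.25, 2.0))  # 2.5
```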
@@ -384,6 +389,11 @@ class SdcaModel(object):
    self._assertList(['sparse_features', 'dense_features'], examples)
    with name_scope('sdca/regularized_loss'):
      weights = convert_to_tensor(examples['example_weights'])
      return ((
          (self._l1_loss() + self._l2_loss()) / math_ops.reduce_sum(weights)) +
      return (((
          self._l1_loss() +
          # Note that here we are using the raw regularization
          # (as specified by the user) and *not*
          # self._symmetric_l2_regularization().
          self._l2_loss(self._options['symmetric_l2_regularization'])) /
              math_ops.reduce_sum(weights)) +
              self.unregularized_loss(examples))
@@ -127,7 +127,7 @@ replicated model. Possible approaches include:

* As above, but where the gradients from all workers are averaged. See the
  [CIFAR-10 multi-GPU trainer](https://www.tensorflow.org/code/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py)
  for an example of this form of replication. The implements *synchronous* training
  for an example of this form of replication. This implements *synchronous* training

* The "distributed trainer" approach uses multiple graphs—one per
  worker—where each graph contains one set of parameters (pinned to
@@ -1089,6 +1089,7 @@ filegroup(
        "avgpooling_op.cc",
        "batch_norm_op.cc",
        "bcast_ops.cc",
        "check_numerics_op.cc",
        "control_flow_ops.cc",
        "conv_2d.h",
        "conv_ops.cc",
@@ -26,26 +26,15 @@ namespace tensorflow {
// Check that 0 <= index < limit using a single comparison, assuming
// that 0 <= limit if Index is signed. Intended for use in performance
// critical contexts where 0 <= index < limit is almost always true.
template <class Index>
EIGEN_ALWAYS_INLINE bool FastBoundsCheck(Index index, Index limit) {
  typedef typename std::make_unsigned<Index>::type UIndex;
template <typename Ta, typename Tb>
EIGEN_ALWAYS_INLINE bool FastBoundsCheck(const Ta index, const Tb limit) {
  static_assert(std::is_integral<Ta>::value && std::is_integral<Tb>::value,
                "FastBoundsCheck can only be used on integer types.");
  typedef typename std::make_unsigned<decltype(index + limit)>::type UIndex;
  return TF_PREDICT_TRUE(static_cast<UIndex>(index) <
                         static_cast<UIndex>(limit));
}

// Upcasting specializations when the index and bounds do not match;
// always move to the larger type.

EIGEN_ALWAYS_INLINE bool FastBoundsCheck(int64 index, int32 limit) {
  return TF_PREDICT_TRUE(static_cast<uint64>(index) <
                         static_cast<uint64>(limit));
}

EIGEN_ALWAYS_INLINE bool FastBoundsCheck(int32 index, int64 limit) {
  return TF_PREDICT_TRUE(static_cast<uint64>(index) <
                         static_cast<uint64>(limit));
}

namespace internal {
// Ensure that the compiler cannot elide a copy into a local, for
// bounds checking on source tensors that might be updated asynchronously.
@@ -1398,7 +1398,7 @@ class Conv2DSlowBackpropFilterOp : public OpKernel {
     // [filter_rows, filter_cols, in_depth, out_depth];
     // And we need to reverse the filter backprops
     // So we need to allocated (sigh) yet another piece of memory to hold the
-    // ouptut.
+    // output.
     TensorShape filter_shuffle_shape(
         {out_depth, filter_rows, filter_cols, in_depth});
     Tensor filter_shuffle;
@@ -246,7 +246,7 @@ __global__ void SwapDimension1And2InTensor3UsingTiles(const T* input,
   }
 }
 
-// A Cuda custom kernel that converst input to output, given proper padding on
+// A Cuda custom kernel that convert input to output, given proper padding on
 // the left and the top. The padded value is zero.
 template <typename T>
 __global__ void PadInputCustomKernelNHWC(int nthreads, const T* input,
@@ -45,6 +45,28 @@ class DiagonalGenerator {
  private:
   Tensor diagonal_;
 };
 
+template <typename T, size_t NumDims>
+class DiagonalExtractor {
+ public:
+  explicit DiagonalExtractor(const Tensor& tensor) : tensor_(tensor) {
+    CHECK_EQ(tensor.dims(), 2 * NumDims);
+  }
+  T operator()(const Eigen::array<Eigen::Index, NumDims>& coordinates) const {
+    Eigen::array<Eigen::Index, 2 * NumDims> index;
+    for (size_t j = 0; j < NumDims; ++j){
+      index[j] = coordinates[j];
+    }
+    for (size_t j = NumDims; j < 2 * NumDims; ++j){
+      index[j] = index[j - NumDims];
+    }
+    return tensor_.tensor<T, 2 * NumDims>()(index);
+  }
+
+ private:
+  Tensor tensor_;
+};
+
 }  // namespace
 
 // Generate the diagonal tensor with the diagonal set to the input tensor.
@@ -58,12 +80,9 @@ class DiagOp : public OpKernel {
   void Compute(OpKernelContext* context) override {
     const Tensor& diagonal = context->input(0);
     const int num_dims = diagonal.dims();
-    OP_REQUIRES(context, 1 <= num_dims,
-                errors::InvalidArgument(
-                    "The rank of the diagonal should be between 1 and 3."));
-    OP_REQUIRES(context, 3 >= num_dims,
-                errors::InvalidArgument(
-                    "The rank of the diagonal should be between 1 and 3."));
+    OP_REQUIRES(context, 1 <= num_dims && num_dims <= 3,
+                errors::InvalidArgument("Expected 1 <= dims <= 3, got shape ",
+                                        diagonal.shape().DebugString()));
     TensorShape out_shape;
     for (int i = 0; i < num_dims; ++i) {
       out_shape.AddDim(diagonal.dim_size(i));
@@ -105,4 +124,71 @@ REGISTER_DIAGOP(int32);
 REGISTER_DIAGOP(int64);
 
 #undef REGISTER_DIAGOP
+
+// Generate the diagonal tensor with the diagonal set to the input tensor.
+// It only allows rank 2, 4, or 6 input tensor, so the output tensor is
+// rank 1, 2, or 3.
+template <typename T>
+class DiagPartOp : public OpKernel {
+ public:
+  explicit DiagPartOp(OpKernelConstruction* context) : OpKernel(context) {}
+
+  void Compute(OpKernelContext* context) override {
+    const Tensor& tensor = context->input(0);
+    const int num_dims = tensor.dims();
+    const int out_dims = num_dims / 2;
+    OP_REQUIRES(context, 2 == num_dims || 4 == num_dims || 6 == num_dims,
+                errors::InvalidArgument("The rank of the tensor should be 2, \
+                                         4, or 6, got shape ",
+                                        tensor.shape().DebugString()));
+    for (int i = 0; i < out_dims; i++){
+      OP_REQUIRES(context, tensor.dim_size(i) == tensor.dim_size(i + out_dims),
+                  errors::InvalidArgument(
+                    "Invalid shape ", tensor.shape().DebugString(),
+                    ": dimensions ", i, " and ", i + out_dims, " do not match.")
+                  );
+    }
+
+    TensorShape out_shape;
+    for (int i = 0; i < out_dims; ++i) {
+      out_shape.AddDim(tensor.dim_size(i));
+    }
+
+    Tensor* output = nullptr;
+    OP_REQUIRES_OK(context,
+                   context->allocate_output(0, out_shape, &output));
+
+    switch (num_dims) {
+      case 2:
+        output->tensor<T, 1>() = output->tensor<T, 1>().generate(
+          DiagonalExtractor<T, 1>(tensor));
+        break;
+      case 4:
+        output->tensor<T, 2>() = output->tensor<T, 2>().generate(
+          DiagonalExtractor<T, 2>(tensor));
+        break;
+      case 6:
+        output->tensor<T, 3>() = output->tensor<T, 3>().generate(
+          DiagonalExtractor<T, 3>(tensor));
+        break;
+      default:
+        context->SetStatus(errors::Unimplemented(
+          "Diagonal of rank ", num_dims, " tensor is not supported yet."));
+        return;
+    }
+  }
+};
+
+#define REGISTER_DIAGPARTOP(T)                                                 \
+  REGISTER_KERNEL_BUILDER(                                                     \
+      Name("DiagPart").Device(DEVICE_CPU).TypeConstraint<T>("T"), DiagPartOp<T>)
+
+REGISTER_DIAGPARTOP(double);
+REGISTER_DIAGPARTOP(float);
+REGISTER_DIAGPARTOP(int32);
+REGISTER_DIAGPARTOP(int64);
+
+#undef REGISTER_DIAGPARTOP
+
 }  // namespace tensorflow
@@ -94,7 +94,7 @@ class MatrixSolveLsOp
     }
     if (fast_) {
       // The fast branch assumes that matrix is not rank deficient and
-      // not too ill-conditioned. Specifically, the reciprobal condition number
+      // not too ill-conditioned. Specifically, the reciprocal condition number
       // should be greater than the square root of the machine precision, i.e.
       // 1 / cond(matrix) > sqrt(std::numeric_limits<Scalar>::epsilon()).
       // This branch solves over- or underdetermined least-squares problems
@@ -84,6 +84,7 @@ struct ReduceFunctor<GPUDevice, Eigen::internal::MeanReducer<T> > {
   DEFINE_FOR_TYPE_AND_R(T, Eigen::internal::ProdReducer<T>)
 
 DEFINE_FOR_ALL_REDUCERS(float);
+DEFINE_FOR_ALL_REDUCERS(double);
 #undef DEFINE_FOR_ALL_REDUCERS
 
 DEFINE_FOR_TYPE_AND_R(complex64, Eigen::internal::SumReducer<complex64>);
@@ -34,6 +34,7 @@ TF_CALL_REAL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
           .HostMemory("reduction_indices"),                            \
       ReductionOp<GPUDevice, type, Eigen::internal::MaxReducer<type>>);
 REGISTER_GPU_KERNELS(float);
+REGISTER_GPU_KERNELS(double);
 #undef REGISTER_GPU_KERNELS
 
 #endif
@@ -34,6 +34,7 @@ TF_CALL_REAL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
           .HostMemory("reduction_indices"),                            \
       ReductionOp<GPUDevice, type, Eigen::internal::MinReducer<type>>);
 REGISTER_GPU_KERNELS(float);
+REGISTER_GPU_KERNELS(double);
 #undef REGISTER_GPU_KERNELS
 
 #endif
@@ -34,6 +34,7 @@ TF_CALL_REAL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
           .HostMemory("reduction_indices"),                             \
       ReductionOp<GPUDevice, type, Eigen::internal::ProdReducer<type>>);
 REGISTER_GPU_KERNELS(float);
+REGISTER_GPU_KERNELS(double);
 #undef REGISTER_GPU_KERNELS
 
 #endif
@@ -41,6 +41,7 @@ REGISTER_KERNEL_BUILDER(
           .HostMemory("reduction_indices"),                            \
       ReductionOp<GPUDevice, type, Eigen::internal::SumReducer<type>>);
 REGISTER_GPU_KERNELS(float);
+REGISTER_GPU_KERNELS(double);
 #undef REGISTER_GPU_KERNELS
 
 REGISTER_KERNEL_BUILDER(
@@ -26,6 +26,10 @@ limitations under the License.
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
 
+#if GOOGLE_CUDA
+#include "tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.h"
+#endif  // GOOGLE_CUDA
+
 namespace tensorflow {
 
 typedef Eigen::ThreadPoolDevice CPUDevice;
@@ -58,10 +62,10 @@ class ResizeNearestNeighborOp : public OpKernel {
     // Initialize shape to the batch size of the input, then add
     // the rest of the dimensions
     Tensor* output = nullptr;
-    OP_REQUIRES_OK(context, context->allocate_output(
-                                0, TensorShape({input.dim_size(0), sizes(0),
-                                                sizes(1), input.dim_size(3)}),
-                                &output));
+    OP_REQUIRES_OK(
+        context, context->allocate_output(0, TensorShape({input.dim_size(0), sizes(0),
+                                                          sizes(1), input.dim_size(3)}),
+                                          &output));
 
     const int64 batch_size = input.dim_size(0);
     const int64 in_height = input.dim_size(1);
@@ -132,10 +136,10 @@ class ResizeNearestNeighborOpGrad : public OpKernel {
     // Initialize shape to the batch size of the input, then add
     // the rest of the dimensions
     Tensor* output = nullptr;
-    OP_REQUIRES_OK(context, context->allocate_output(
-                                0, TensorShape({input.dim_size(0), sizes(0),
-                                                sizes(1), input.dim_size(3)}),
-                                &output));
+    OP_REQUIRES_OK(
+        context, context->allocate_output(0, TensorShape({input.dim_size(0), sizes(0),
+                                                          sizes(1), input.dim_size(3)}),
+                                          &output));
 
     const int64 batch_size = input.dim_size(0);
     const int64 in_height = input.dim_size(1);
@@ -204,4 +208,83 @@ TF_CALL_REAL_NUMBER_TYPES(REGISTER_KERNEL);
 
 #undef REGISTER_KERNEL
 
+#if GOOGLE_CUDA
+
+template <typename T>
+class ResizeNearestNeighborGPUOp : public OpKernel {
+ public:
+  explicit ResizeNearestNeighborGPUOp(OpKernelConstruction* context)
+      : OpKernel(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("align_corners", &align_corners_));
+  }
+
+  void Compute(OpKernelContext* context) override {
+    const Tensor& input = context->input(0);
+    OP_REQUIRES(context, input.dims() == 4,
+                errors::InvalidArgument("input must be 4-dimensional",
+                                        input.shape().DebugString()));
+    const Tensor& shape_t = context->input(1);
+    OP_REQUIRES(context, shape_t.dims() == 1,
+                errors::InvalidArgument("shape_t must be 1-dimensional",
+                                        shape_t.shape().DebugString()));
+    OP_REQUIRES(context, shape_t.NumElements() == 2,
+                errors::InvalidArgument("shape_t must have two elements",
+                                        shape_t.shape().DebugString()));
+
+    auto sizes = shape_t.vec<int32>();
+    OP_REQUIRES(context, sizes(0) > 0 && sizes(1) > 0,
+                errors::InvalidArgument("shape_t's elements must be positive"));
+
+    // Initialize shape to the batch size of the input, then add
+    // the rest of the dimensions
+    Tensor* output = nullptr;
+    OP_REQUIRES_OK(
+        context, context->allocate_output(0, TensorShape({input.dim_size(0), sizes(0),
+                                                          sizes(1), input.dim_size(3)}),
+                                          &output));
+
+    const int64 batch_size = input.dim_size(0);
+    const int64 in_height = input.dim_size(1);
+    const int64 in_width = input.dim_size(2);
+    const int64 channels = input.dim_size(3);
+    const int64 out_height = output->dim_size(1);
+    const int64 out_width = output->dim_size(2);
+
+    const float height_scale =
+        (align_corners_ && out_height > 1)
+            ? (in_height - 1) / static_cast<float>(out_height - 1)
+            : in_height / static_cast<float>(out_height);
+    const float width_scale =
+        (align_corners_ && out_width > 1)
+            ? (in_width - 1) / static_cast<float>(out_width - 1)
+            : in_width / static_cast<float>(out_width);
+
+    bool status = ResizeNearestNeighbor<T>(
+        input.flat<T>().data(), batch_size, in_height,
+        in_width, channels, out_height, out_width,
+        height_scale, width_scale, output->flat<T>().data(),
+        context->eigen_gpu_device());
+
+    if (!status) {
+      context->SetStatus(
+          errors::Internal("Failed launching ResizeNearestNeighbor"));
+    }
+  }
+
+ private:
+  bool align_corners_;
+};
+
+#define REGISTER_KERNEL(T)                              \
+  REGISTER_KERNEL_BUILDER(Name("ResizeNearestNeighbor") \
+                              .Device(DEVICE_GPU)       \
+                              .TypeConstraint<T>("T")   \
+                              .HostMemory("size"),      \
+                          ResizeNearestNeighborGPUOp<T>);
+
+TF_CALL_GPU_NUMBER_TYPES(REGISTER_KERNEL);
+
+#undef REGISTER_KERNEL
+
+#endif  // GOOGLE_CUDA
+
 }  // namespace tensorflow
@@ -0,0 +1,52 @@
+/* Copyright 2015 Google Inc. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/graph/node_builder.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/platform/test_benchmark.h"
+
+namespace tensorflow {
+
+static Graph* BM_ResizeNearestNeighbor(int batches, int width, int height) {
+  Graph* g = new Graph(OpRegistry::Global());
+  Tensor in(DT_FLOAT, TensorShape({batches, width, height, 3}));
+  in.flat<float>().setRandom();
+
+  Tensor out_size(DT_INT32, TensorShape({2}));
+  auto out_size_flat = out_size.flat<int32>();
+  out_size_flat(0) = width * 2;
+  out_size_flat(1) = height * 2;
+
+  Node* ret;
+  NodeBuilder(g->NewName("n"), "ResizeNearestNeighbor")
+      .Input(test::graph::Constant(g, in))
+      .Input(test::graph::Constant(g, out_size))
+      .Finalize(g, &ret);
+  return g;
+}
+
+#define BM_ResizeNearestNeighborDev(DEVICE, B, W, H)                           \
+  static void BM_ResizeNearestNeighbor_##DEVICE##_##B##_##W##_##H(int iters) { \
+    testing::ItemsProcessed(iters* B* W* H * 3);                               \
+    test::Benchmark(#DEVICE, BM_ResizeNearestNeighbor(B, W, H)).Run(iters);    \
+  }                                                                            \
+  BENCHMARK(BM_ResizeNearestNeighbor_##DEVICE##_##B##_##W##_##H)
+
+BM_ResizeNearestNeighborDev(cpu, 1, 499, 499);
+BM_ResizeNearestNeighborDev(gpu, 1, 499, 499);
+
+}  // namespace tensorflow
tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.cu.cc (new file, 86 lines)
@@ -0,0 +1,86 @@
+/* Copyright 2015 Google Inc. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA
+
+#define EIGEN_USE_GPU
+
+#include <stdio.h>
+
+#include "tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.h"
+
+#include "tensorflow/core/framework/register_types.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/util/cuda_kernel_helper.h"
+
+namespace tensorflow {
+namespace {
+
+template <typename T>
+__global__ void ResizeNearestNeighborNHWC(const int nthreads, const T* bottom_data,
+                                          const int in_height, const int in_width,
+                                          const int channels, const int out_height,
+                                          const int out_width, const float height_scale,
+                                          const float width_scale, T* top_data) {
+  CUDA_1D_KERNEL_LOOP(index, nthreads) {
+    int n = index;
+    int c = n % channels;
+    n /= channels;
+    int out_x = n % out_width;
+    n /= out_width;
+    int out_y = n % out_height;
+    n /= out_height;
+
+    const T* bottom_data_n = bottom_data + n * channels * in_height * in_width;
+    const int in_x = min(static_cast<int>(floorf(out_x * width_scale)), in_width - 1);
+    const int in_y = min(static_cast<int>(floorf(out_y * height_scale)), in_height - 1);
+    const int idx = (in_y * in_width + in_x) * channels + c;
+    top_data[index] = ldg(bottom_data_n + idx);
+  }
+}
+
+}  // namespace
+
+template <typename T>
+bool ResizeNearestNeighbor(const T* bottom_data, const int batch,
+                           const int in_height, const int in_width,
+                           const int channels, const int out_height,
+                           const int out_width, const float height_scale,
+                           const float width_scale, T* top_data,
+                           const Eigen::GpuDevice& d) {
+  const int output_size = batch * channels * out_height * out_width;
+  CudaLaunchConfig config = GetCudaLaunchConfig(output_size, d);
+
+  ResizeNearestNeighborNHWC<T>
+      <<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
+          output_size, bottom_data, in_height, in_width, channels, out_height,
+          out_width, height_scale, width_scale, top_data);
+  return d.ok();
+}
+
+#define DECLARE_GPU_SPEC(T)                                                          \
+  template bool ResizeNearestNeighbor(const T* bottom_data, const int batch,         \
+                                      const int in_height, const int in_width,       \
+                                      const int channels, const int out_height,      \
+                                      const int out_width, const float height_scale, \
+                                      const float width_scale, T* top_data,          \
+                                      const Eigen::GpuDevice& d);
+
+TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPEC);
+
+#undef DECLARE_GPU_SPEC
+}  // end namespace tensorflow
+
+#endif  // GOOGLE_CUDA
tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.h (new file, 37 lines)
@@ -0,0 +1,37 @@
+/* Copyright 2015 Google Inc. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if !GOOGLE_CUDA
+#error This file must only be included when building with Cuda support
+#endif
+
+#ifndef TENSORFLOW_CORE_KERNELS_RESIZE_NEAREST_NEIGHBOR_OP_GPU_H_
+#define TENSORFLOW_CORE_KERNELS_RESIZE_NEAREST_NEIGHBOR_OP_GPU_H_
+
+#include "third_party/eigen3/unsupported/Eigen/CXX11/NeuralNetworks"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+template <typename T>
+bool ResizeNearestNeighbor(const T* bottom_data, const int batch, const int in_height,
+                           const int in_width, const int channels, const int out_height,
+                           const int out_width, const float height_scale, const float width_scale,
+                           T* top_data, const Eigen::GpuDevice& d);
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_KERNELS_RESIZE_NEAREST_NEIGHBOR_OP_GPU_H_
@@ -524,7 +524,7 @@ class SparseMatMulOp : public OpKernel {
 
  private:
   // Perform matrix multiplication of "left" and "right", and store the result
-  // in *"ouptut".
+  // in *"output".
   static inline void SparseMatMul(
       const ConstMatrixMap& left, const ConstMatrixMap& right,
       bool transpose_left, const DeviceBase::CpuWorkerThreads* thread_pool,
@@ -858,7 +858,7 @@ inline void SparseMatMulOp::SparseMatMul(
   const int right_dim0 = right.dimension(0);
   const int right_dim1 = right.dimension(1);
   // Allocate buffer for storing slices of right matrix.
-  // Note buffer needs enough space to hold atmost a KR * NR matrix since that
+  // Note buffer needs enough space to hold at most a KR * NR matrix since that
   // is the block size per iteration.
   const int buffer_num_rows =
       std::min(KR, right_dim0) * (std::min(NR, right_dim1) + N - 1) / N;
@@ -577,7 +577,7 @@ class TensorArrayConcatOp : public OpKernel {
     ConstMatrixVector input_tensors_flat;
     input_tensors_flat.reserve(values.size());
 
-    for (int i = 0; i < values.size(); ++i) {
+    for (size_t i = 0; i < values.size(); ++i) {
       const Tensor* value_t = value_tensors[i];
       if (value_t->NumElements() > 0) {
         input_tensors_flat.emplace_back(new ConstMatrix(
@@ -47,7 +47,7 @@ void ComputeStride(const TensorShape& shape, Index* strides) {
   }
 }
 
-// Device-specific naive implementation for tranpose.
+// Device-specific naive implementation for transpose.
 template <typename Device, typename T>
 void TransposeSimple(const Device& d, const Tensor& in,
                      const gtl::ArraySlice<int32> perm, Tensor* out);
@@ -172,6 +172,38 @@ tf.diag(diagonal) ==> [[1, 0, 0, 0]
 diagonal: Rank k tensor where k is at most 3.
 )doc");
 
+// --------------------------------------------------------------------------
+REGISTER_OP("DiagPart")
+    .Input("input: T")
+    .Output("diagonal: T")
+    .Attr("T: {float, double, int32, int64}")
+    .Doc(R"doc(
+Returns the diagonal part of the tensor.
+
+This operation returns a tensor with the `diagonal` part
+of the `input`. The `diagonal` part is computed as follows:
+
+Assume `input` has dimensions `[D1,..., Dk, D1,..., Dk]`, then the output is a
+tensor of rank `k` with dimensions `[D1,..., Dk]` where:
+
+`diagonal[i1,..., ik] = input[i1, ..., ik, i1,..., ik]`.
+
+For example:
+
+```prettyprint
+# 'input' is [[1, 0, 0, 0]
+              [0, 2, 0, 0]
+              [0, 0, 3, 0]
+              [0, 0, 0, 4]]
+
+tf.diag_part(input) ==> [1, 2, 3, 4]
+```
+
+input: Rank k tensor where k is 2, 4, or 6.
+diagonal: The extracted diagonal.
+
+)doc");
+
 // --------------------------------------------------------------------------
 REGISTER_OP("Reverse")
     .Input("tensor: T")
@@ -3482,6 +3482,29 @@ op {
     }
   }
 }
+op {
+  name: "DiagPart"
+  input_arg {
+    name: "input"
+    type_attr: "T"
+  }
+  output_arg {
+    name: "diagonal"
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+}
 op {
   name: "Digamma"
   input_arg {
@@ -2858,6 +2858,33 @@ op {
   summary: "Returns a diagonal tensor with a given diagonal values."
   description: "Given a `diagonal`, this operation returns a tensor with the `diagonal` and\neverything else padded with zeros. The diagonal is computed as follows:\n\nAssume `diagonal` has dimensions [D1,..., Dk], then the output is a tensor of\nrank 2k with dimensions [D1,..., Dk, D1,..., Dk] where:\n\n`output[i1,..., ik, i1,..., ik] = diagonal[i1, ..., ik]` and 0 everywhere else.\n\nFor example:\n\n```prettyprint\n# \'diagonal\' is [1, 2, 3, 4]\ntf.diag(diagonal) ==> [[1, 0, 0, 0]\n                       [0, 2, 0, 0]\n                       [0, 0, 3, 0]\n                       [0, 0, 0, 4]]\n```"
 }
+op {
+  name: "DiagPart"
+  input_arg {
+    name: "input"
+    description: "Rank k tensor where k is 2, 4, or 6."
+    type_attr: "T"
+  }
+  output_arg {
+    name: "diagonal"
+    description: "The extracted diagonal."
+    type_attr: "T"
+  }
+  attr {
+    name: "T"
+    type: "type"
+    allowed_values {
+      list {
+        type: DT_FLOAT
+        type: DT_DOUBLE
+        type: DT_INT32
+        type: DT_INT64
+      }
+    }
+  }
+  summary: "Returns the diagonal part of the tensor."
+  description: "This operation returns a tensor with the `diagonal` part\nof the `input`. The `diagonal` part is computed as follows:\n\nAssume `input` has dimensions `[D1,..., Dk, D1,..., Dk]`, then the output is a\ntensor of rank `k` with dimensions `[D1,..., Dk]` where:\n\n`diagonal[i1,..., ik] = input[i1, ..., ik, i1,..., ik]`.\n\nFor example:\n\n```prettyprint\n# \'input\' is [[1, 0, 0, 0]\n              [0, 2, 0, 0]\n              [0, 0, 3, 0]\n              [0, 0, 0, 4]]\n\ntf.diag_part(input) ==> [1, 2, 3, 4]\n```"
+}
 op {
   name: "Digamma"
   input_arg {
@@ -20,7 +20,7 @@ limitations under the License.
 
 #define TF_MAJOR_VERSION 0
 #define TF_MINOR_VERSION 7
-#define TF_PATCH_VERSION 0
+#define TF_PATCH_VERSION 1
 
 // TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
 // "-beta", "-rc", "-rc.1")
@@ -63,34 +63,50 @@ message CommitId {
 };
 
 message CPUInfo {
+  int64 num_cores = 1;
+
+  int64 num_cores_allowed = 2;
+
   // How fast are these cpus?
-  double mhz_per_cpu = 1;
+  double mhz_per_cpu = 3;
 
   // Additional cpu information. For example,
   // Intel Ivybridge with HyperThreading (24 cores) dL1:32KB dL2:256KB dL3:30MB
-  string cpu_info = 2;
+  string cpu_info = 4;
 
   // What kind of cpu scaling is enabled on the host.
   // Examples include "performance", "ondemand", "conservative", "mixed".
-  string cpu_governor = 3;
+  string cpu_governor = 5;
 
   // Cache sizes (in bytes), e.g. "L2": 262144 (for 256KB)
-  map<string, int64> cache_size = 4;
+  map<string, int64> cache_size = 6;
 };
 
+message MemoryInfo {
+  int64 total = 1;      // Total virtual memory in bytes
+  int64 available = 2;  // Immediately available memory in bytes
+}
+
 message GPUInfo {
   string model = 1;   // e.g. "Tesla K40c"
   string uuid = 2;    // Final entry in output of "nvidia-smi -L"
   string bus_id = 3;  // e.g. "0000:04:00.0"
 };
 
 message PlatformInfo {
   string bits = 1;     // e.g. '64bit'
   string linkage = 2;  // e.g. 'ELF'
   string machine = 3;  // e.g. 'i386'
-  string processor = 4;  // e.g. 'amdk6' (the real processor name)
-  string release = 5;  // e.g. '3.13.0-76-generic'
-  string system = 6;   // e.g. 'Linux'
-  string version = 7;  // e.g. '#120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016'
+  string release = 4;  // e.g. '3.13.0-76-generic'
+  string system = 5;   // e.g. 'Linux'
+  string version = 6;  // e.g. '#120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016'
 };
 
 message AvailableDeviceInfo {  // Matches DeviceAttributes
   string name = 1;     // Device name.
   string type = 2;     // Device type, e.g. 'CPU' or 'GPU'.
   int64 memory_limit = 3;  // Memory capacity in bytes.
   string physical_description = 4;  // The physical description of this device.
 };
 
 message MachineConfiguration {
@@ -105,6 +121,11 @@ message MachineConfiguration {
 
   // Other devices that are attached and relevant (e.g. GPUInfo).
   repeated google.protobuf.Any device_info = 4;
+
+  // Devices accessible to the test (e.g. as given by list_local_devices).
+  repeated AvailableDeviceInfo available_device_info = 5;
+
+  MemoryInfo memory_info = 6;
 };
 
 // Run-specific items such as arguments to the test / benchmark.
@@ -68,6 +68,7 @@ def convert_to(images, labels, name):
         'label': _int64_feature(int(labels[index])),
         'image_raw': _bytes_feature(image_raw)}))
     writer.write(example.SerializeToString())
+  writer.close()
 
 
 def main(argv):
@ -219,8 +219,8 @@ def create_image_lists(image_dir, testing_percentage, validation_percentage):
|
||||
# To do that, we need a stable way of deciding based on just the file name
|
||||
# itself, so we do a hash of that and then use that to generate a
|
||||
# probability value that we use to assign it.
|
||||
percentage_hash = (int(
|
||||
hashlib.sha1(hash_name).hexdigest(), 16) % (65536)) * (100 / 65535.0)
|
||||
hash_name_hashed = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
|
||||
percentage_hash = (int(hash_name_hashed, 16) % (65536)) * (100 / 65535.0)
|
||||
if percentage_hash < validation_percentage:
|
||||
validation_images.append(base_name)
|
||||
elif percentage_hash < (testing_percentage + validation_percentage):
|
||||
@ -295,8 +295,9 @@ def create_inception_graph():
|
||||
Graph holding the trained Inception network.
|
||||
"""
|
||||
with tf.Session() as sess:
|
||||
with gfile.FastGFile(
|
||||
os.path.join(FLAGS.model_dir, 'classify_image_graph_def.pb'), 'r') as f:
|
||||
model_filename = os.path.join(
|
||||
FLAGS.model_dir, 'classify_image_graph_def.pb')
with gfile.FastGFile(model_filename, 'rb') as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
_ = tf.import_graph_def(graph_def, name='')

@ -395,7 +396,7 @@ def get_or_create_bottleneck(sess, image_lists, label_name, index, image_dir,
category)
if not gfile.Exists(image_path):
tf.logging.fatal('File does not exist %s', image_path)
image_data = gfile.FastGFile(image_path, 'r').read()
image_data = gfile.FastGFile(image_path, 'rb').read()
bottleneck_values = run_bottleneck_on_image(sess, image_data,
JPEG_DATA_TENSOR_NAME)
bottleneck_string = ','.join(str(x) for x in bottleneck_values)

@ -430,7 +431,7 @@ def cache_bottlenecks(sess, image_lists, image_dir, bottleneck_dir):
"""
how_many_bottlenecks = 0
ensure_dir_exists(bottleneck_dir)
for label_name, label_lists in image_lists.iteritems():
for label_name, label_lists in image_lists.items():
for category in ['training', 'testing', 'validation']:
category_list = label_lists[category]
for index, unused_base_name in enumerate(category_list):

@ -467,7 +468,7 @@ def get_random_cached_bottlenecks(sess, image_lists, how_many, category,
ground_truthes = []
for unused_i in range(how_many):
label_index = random.randrange(class_count)
label_name = image_lists.keys()[label_index]
label_name = list(image_lists.keys())[label_index]
image_index = random.randrange(65536)
bottleneck = get_or_create_bottleneck(sess, image_lists, label_name,
image_index, image_dir, category,

@ -818,7 +819,7 @@ def main(_):
# Write out the trained graph and labels with the weights stored as constants.
output_graph_def = graph_util.convert_variables_to_constants(
sess, graph.as_graph_def(), [FLAGS.final_tensor_name])
with gfile.FastGFile(FLAGS.output_graph, 'w') as f:
with gfile.FastGFile(FLAGS.output_graph, 'wb') as f:
f.write(output_graph_def.SerializeToString())
with gfile.FastGFile(FLAGS.output_labels, 'w') as f:
f.write('\n'.join(image_lists.keys()) + '\n')
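The `list(image_lists.keys())[label_index]` change above is a Python 3 compatibility fix: `dict.keys()` returns a non-indexable view object in Python 3. A minimal standalone sketch (the label map here is a made-up stand-in for the script's real `image_lists`):

```python
# Hypothetical stand-in for the retraining script's image_lists dict.
image_lists = {'daisy': [], 'roses': [], 'tulips': []}

label_index = 1
# In Python 2, dict.keys() returned a plain list and could be indexed.
# In Python 3 it is a view object, so subscripting it raises TypeError;
# wrapping it in list() works under both versions.
label_name = list(image_lists.keys())[label_index]
print(label_name)  # 'roses' (dicts preserve insertion order in Python 3.7+)
```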
@ -54,7 +54,7 @@ def _read32(bytestream):
def extract_images(filename):
"""Extract the images into a 4D uint8 numpy array [index, y, x, depth]."""
print('Extracting', filename)
with tf.gfile.Open(filename) as f, gzip.GzipFile(fileobj=f) as bytestream:
with tf.gfile.Open(filename, 'rb') as f, gzip.GzipFile(fileobj=f) as bytestream:
magic = _read32(bytestream)
if magic != 2051:
raise ValueError(

@ -81,7 +81,7 @@ def dense_to_one_hot(labels_dense, num_classes):
def extract_labels(filename, one_hot=False, num_classes=10):
"""Extract the labels into a 1D uint8 numpy array [index]."""
print('Extracting', filename)
with tf.gfile.Open(filename) as f, gzip.GzipFile(fileobj=f) as bytestream:
with tf.gfile.Open(filename, 'rb') as f, gzip.GzipFile(fileobj=f) as bytestream:
magic = _read32(bytestream)
if magic != 2049:
raise ValueError(

@ -143,7 +143,7 @@ def evaluation(logits, labels):
"""
# For a classifier model, we can use the in_top_k Op.
# It returns a bool tensor with shape [batch_size] that is true for
# the examples where the label's is was in the top k (here k=1)
# the examples where the label is in the top k (here k=1)
# of all logits for that example.
correct = tf.nn.in_top_k(logits, labels, 1)
# Return the number of true entries.
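For readers unfamiliar with `tf.nn.in_top_k`, the corrected comment can be checked against a small NumPy sketch of the same computation (a simplification that ignores how TensorFlow breaks ties between equal logits):

```python
import numpy as np

# Emulate tf.nn.in_top_k(logits, labels, k): for each row, is the true
# label's class among the k largest logits?
def in_top_k(logits, labels, k=1):
    # Indices of the k largest logits per row (sorted descending).
    topk = np.argsort(-logits, axis=1)[:, :k]
    return np.array([labels[i] in topk[i] for i in range(len(labels))])

logits = np.array([[0.1, 2.0, 0.3],
                   [5.0, 1.0, 0.2]])
labels = np.array([1, 2])           # second example's true class is not top-1
correct = in_top_k(logits, labels, k=1)
print(correct)       # first entry True, second False
print(correct.sum()) # number of correctly classified examples
```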
@ -54,23 +54,23 @@ def main(_):
# Create the model
x = tf.placeholder(tf.float32, [None, 784], name='x-input')
W = tf.Variable(tf.zeros([784, 10]), name='weights')
b = tf.Variable(tf.zeros([10], name='bias'))
b = tf.Variable(tf.zeros([10]), name='bias')

# Use a name scope to organize nodes in the graph visualizer
with tf.name_scope('Wx_b'):
y = tf.nn.softmax(tf.matmul(x, W) + b)

# Add summary ops to collect data
_ = tf.histogram_summary('weights', W)
_ = tf.histogram_summary('biases', b)
_ = tf.histogram_summary('y', y)
tf.histogram_summary('weights', W)
tf.histogram_summary('biases', b)
tf.histogram_summary('y', y)

# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10], name='y-input')
# More name scopes will clean up the graph representation
with tf.name_scope('xent'):
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
_ = tf.scalar_summary('cross entropy', cross_entropy)
tf.scalar_summary('cross entropy', cross_entropy)
with tf.name_scope('train'):
train_step = tf.train.GradientDescentOptimizer(
FLAGS.learning_rate).minimize(cross_entropy)

@ -78,7 +78,7 @@ def main(_):
with tf.name_scope('test'):
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
_ = tf.scalar_summary('accuracy', accuracy)
tf.scalar_summary('accuracy', accuracy)

# Merge all the summaries and write them out to /tmp/mnist_logs (by default)
merged = tf.merge_all_summaries()
@ -128,7 +128,7 @@ num_skips = 2 # How many times to reuse an input to generate a label.
# construction are also the most frequent.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(np.arange(valid_window), valid_size))
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()
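The replacement line swaps `random.sample` over a NumPy array for `np.random.choice(..., replace=False)`, which draws `valid_size` distinct integers below `valid_window` directly. A quick check of that behavior:

```python
import numpy as np

# np.random.choice with an integer first argument samples from range(n);
# replace=False guarantees the sampled word ids are all distinct.
valid_window, valid_size = 100, 16
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

print(valid_examples.shape)               # 16 sampled ids
print(len(set(valid_examples.tolist())))  # all distinct, each below 100
```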
@ -290,11 +290,11 @@
"Another one is to use learning rate decay:\n",
"\n",
" global_step = tf.Variable(0) # count the number of steps taken.\n",
" learning_rate = tf.train.exponential_decay(0.5, step, ...)\n",
" learning_rate = tf.train.exponential_decay(0.5, global_step, ...)\n",
" optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)\n",
" \n",
" ---\n"
]
}
]
}
}

@ -421,7 +421,7 @@
"\n",
"graph = tf.Graph()\n",
"\n",
"with graph.as_default():\n",
"with graph.as_default(), tf.device('/cpu:0'):\n",
"\n",
" # Input data.\n",
" train_dataset = tf.placeholder(tf.int32, shape=[batch_size])\n",
@ -1,6 +1,7 @@
FROM b.gcr.io/tensorflow/tensorflow:latest
MAINTAINER Vincent Vanhoucke <vanhoucke@google.com>
RUN pip install scikit-learn
RUN rm -rf /notebooks/*
ADD *.ipynb /notebooks/
WORKDIR /notebooks
CMD ["/run_jupyter.sh"]
@ -820,7 +820,7 @@ classes are mutually exclusive (each entry is in exactly one class). For
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.

**NOTE:**: While the classes are mutually exclusive, their probabilities
**NOTE:** While the classes are mutually exclusive, their probabilities
need not be. All that is required is that each row of `labels` is
a valid probability distribution. If using exclusive `labels`
(wherein one and only one class is true at a time), see

@ -857,7 +857,7 @@ classes are mutually exclusive (each entry is in exactly one class). For
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.

**NOTE:**: For this operation, the probability of a given label is considered
**NOTE:** For this operation, the probability of a given label is considered
exclusive. That is, soft classes are not allowed, and the `labels` vector
must provide a single specific index for the true class for each row of
`logits` (each minibatch entry). For soft softmax classification with
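The distinction drawn in the two notes above, rows of `labels` as probability distributions versus a single exclusive class index, can be illustrated with a plain-NumPy cross-entropy (a sketch, not TensorFlow's fused implementation):

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def softmax_xent(labels, logits):
    # Well-defined for any rows of `labels` that sum to 1.
    return -(labels * np.log(softmax(logits))).sum(axis=1)

logits = np.array([[2.0, 1.0, 0.1]])
hard = np.array([[1.0, 0.0, 0.0]])  # one-hot: the exclusive-label case
soft = np.array([[0.7, 0.2, 0.1]])  # soft labels: still a distribution

print(softmax_xent(hard, logits))  # reduces to -log p(true class)
print(softmax_xent(soft, logits))  # equally well-defined for soft labels
```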
@ -794,9 +794,11 @@ global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
100000, 0.96, staircase=True)
optimizer = tf.GradientDescentOptimizer(learning_rate)
# Passing global_step to minimize() will increment it at each step.
optimizer.minimize(...my loss..., global_step=global_step)
learning_step = (
tf.GradientDescentOptimizer(learning_rate)
.minimize(...my loss..., global_step=global_step)
)
```
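As an aside, the schedule in the snippet follows `decayed_rate = starter_rate * decay_rate ^ (global_step / decay_steps)`, with the exponent truncated to an integer when `staircase=True`. A small sketch of that arithmetic:

```python
# Plain-Python model of the exponential_decay schedule used above
# (0.1 starter rate, multiplied by 0.96 every 100000 steps).
def exponential_decay(starter, global_step, decay_steps, decay_rate,
                      staircase=False):
    p = global_step / decay_steps
    if staircase:
        p = global_step // decay_steps  # integer division: stepwise drops
    return starter * decay_rate ** p

print(exponential_decay(0.1, 100000, 100000, 0.96, staircase=True))
# one full decay period -> 0.1 * 0.96
print(exponential_decay(0.1, 150000, 100000, 0.96, staircase=True))
# staircase holds the rate until the next full period completes
```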
##### Args:

@ -2280,5 +2282,3 @@ device assignments have not changed.

##### Returns:

A saver constructed from `saver_def` in `MetaGraphDef`.
@ -53,28 +53,28 @@ Install TensorFlow:

```bash
# Ubuntu/Linux 64-bit, CPU only:
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl

# Ubuntu/Linux 64-bit, GPU enabled:
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl

# Mac OS X, CPU only:
$ sudo easy_install --upgrade six
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.0-py2-none-any.whl
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp27-none-any.whl
```

For python3:

```bash
# Ubuntu/Linux 64-bit, CPU only:
$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py3-none-linux_x86_64.whl
$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp34-none-linux_x86_64.whl

# Ubuntu/Linux 64-bit, GPU enabled:
$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.0-py3-none-linux_x86_64.whl
$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.1-cp34-none-linux_x86_64.whl

# Mac OS X, CPU only:
$ sudo easy_install --upgrade six
$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.0-py3-none-any.whl
$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp35-none-any.whl
```

NOTE: If you are upgrading from a previous installation of TensorFlow < 0.7.1,

@ -126,13 +126,13 @@ $ source ~/tensorflow/bin/activate.csh # If using csh
(tensorflow)$ # Your prompt should change

# Ubuntu/Linux 64-bit, CPU only:
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl

# Ubuntu/Linux 64-bit, GPU enabled:
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl

# Mac OS X, CPU only:
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.0-py2-none-any.whl
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp27-none-any.whl
```

and again for python3:

@ -143,13 +143,13 @@ $ source ~/tensorflow/bin/activate.csh # If using csh
(tensorflow)$ # Your prompt should change

# Ubuntu/Linux 64-bit, CPU only:
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py3-none-linux_x86_64.whl
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp34-none-linux_x86_64.whl

# Ubuntu/Linux 64-bit, GPU enabled:
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.0-py3-none-linux_x86_64.whl
(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.1-cp34-none-linux_x86_64.whl

# Mac OS X, CPU only:
(tensorflow)$ pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.0-py3-none-any.whl
(tensorflow)$ pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp35-none-any.whl
```

With the Virtualenv environment activated, you can now

@ -191,7 +191,7 @@ code.
* `b.gcr.io/tensorflow/tensorflow:latest-devel-gpu`: GPU Binary image plus source
code.

We also have tags with `latest` replaced by a released version (e.g., `0.7.0-gpu`).
We also have tags with `latest` replaced by a released version (e.g., `0.7.1-gpu`).

With Docker the installation is as follows:

@ -464,7 +464,7 @@ We recommend using [homebrew](http://brew.sh) to install the bazel and SWIG
dependencies, and installing python dependencies using easy_install or pip.

Of course you can also install Swig from source without using homebrew. In that
case, be sure to install its dependency [PCRE](from www.pcre.org) and not PCRE2.
case, be sure to install its dependency [PCRE](http://www.pcre.org) and not PCRE2.

#### Dependencies

@ -517,7 +517,7 @@ $ bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_pack
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

# The name of the .whl file will depend on your platform.
$ pip install /tmp/tensorflow_pkg/tensorflow-0.7.0-py2-none-linux_x86_64.whl
$ pip install /tmp/tensorflow_pkg/tensorflow-0.7.1-py2-none-linux_x86_64.whl
```

## Setting up TensorFlow for Development
@ -74,7 +74,7 @@ and compact summary of the images, since it has to contain enough information
for the classifier to make a good choice in a very small set of values. The
reason our final layer retraining can work on new classes is that it turns out
the kind of information needed to distinguish between all the 1,000 classes in
ImageNet is often also useful to chose between new kinds of objects.
ImageNet is often also useful to distinguish between new kinds of objects.

Because every image is reused multiple times during training and calculating
each bottleneck takes a significant amount of time, it speeds things up to

@ -88,20 +88,20 @@ part again.
Once the bottlenecks are complete, the actual training of the top layer of the
network begins. You'll see a series of step outputs, each one showing training
accuracy, validation accuracy, and the cross entropy. The training accuracy
shows how many of the images used in the current training batch were labeled
with the correct class. The validation accuracy is the precision on a
shows what percent of the images used in the current training batch were
labeled with the correct class. The validation accuracy is the precision on a
randomly-selected group of images from a different set. The key difference is
that the training accuracy is based on images that the network has been able
to learn from so the network can overfit to the noise in the training data. A
true measure of the performance of the network is to measure its performance on
a data set not contained in the training data -- this is measured by the
validation accuracy. If the training accuracy is high but the validation remains
low, that means the network is overfitting and memorizing particular features
in the training images that aren't helpful more generally. Cross entropy is a
loss function which gives a glimpse into how well the learning process is
progressing. The training's objective is to make the loss as small as possible,
so you can tell if the learning is working by keeping an eye on whether the loss
keeps trending downwards, ignoring the short-term noise.
validation accuracy. If the train accuracy is high but the validation accuracy
remains low, that means the network is overfitting and memorizing particular
features in the training images that aren't helpful more generally. Cross
entropy is a loss function which gives a glimpse into how well the learning
process is progressing. The training's objective is to make the loss as small as
possible, so you can tell if the learning is working by keeping an eye on
whether the loss keeps trending downwards, ignoring the short-term noise.

By default this script will run 4,000 training steps. Each step chooses ten
images at random from the training set, finds their bottlenecks from the cache,

@ -114,8 +114,8 @@ and validation pictures. This test evaluation is the best estimate of how the
trained model will perform on the classification task. You should see an
accuracy value of between 90% and 95%, though the exact value will vary from run
to run since there's randomness in the training process. This number is based on
how many of the images in the test set are given the correct label after the
model is fully trained.
the percent of the images in the test set that are given the correct label
after the model is fully trained.

## Using the Retrained Model

@ -266,7 +266,7 @@ memorized unimportant details of the training images.

This problem is known as overfitting, and to avoid it we keep some of our data
out of the training process, so that the model can't memorize them. We then use
those images as a check to make sure that overfitting isn't occuring, since if
those images as a check to make sure that overfitting isn't occurring, since if
we see good accuracy on them it's a good sign the network isn't overfitting. The
usual split is to put 80% of the images into the main training set, keep 10%
aside to run as validation frequently during training, and then have a final 10%
|
||||
y = tf.nn.softmax(tf.matmul(x,W) + b)
|
||||
|
||||
# Add summary ops to collect data
|
||||
w_hist = tf.histogram_summary("weights", W)
|
||||
b_hist = tf.histogram_summary("biases", b)
|
||||
y_hist = tf.histogram_summary("y", y)
|
||||
tf.histogram_summary("weights", W)
|
||||
tf.histogram_summary("biases", b)
|
||||
tf.histogram_summary("y", y)
|
||||
|
||||
# Define loss and optimizer
|
||||
y_ = tf.placeholder(tf.float32, [None,10], name="y-input")
|
||||
# More name scopes will clean up the graph representation
|
||||
with tf.name_scope("xent") as scope:
|
||||
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
|
||||
ce_summ = tf.scalar_summary("cross entropy", cross_entropy)
|
||||
tf.scalar_summary("cross entropy", cross_entropy)
|
||||
with tf.name_scope("train") as scope:
|
||||
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
|
||||
|
||||
with tf.name_scope("test") as scope:
|
||||
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
|
||||
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
|
||||
accuracy_summary = tf.scalar_summary("accuracy", accuracy)
|
||||
tf.scalar_summary("accuracy", accuracy)
|
||||
|
||||
# Merge all the summaries and write them out to /tmp/mnist_logs
|
||||
merged = tf.merge_all_summaries()
|
||||
|
@ -28,8 +28,7 @@ by calling `as_graph_def()`, which returns a `GraphDef` object.

The GraphDef class is an object created by the ProtoBuf library from the
definition in
[tensorflow/core/framework/graph.proto](https://github.com/tensorflow/tensorflow
/blob/master/tensorflow/core/framework/graph.proto). The protobuf tools parse
[tensorflow/core/framework/graph.proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/graph.proto). The protobuf tools parse
this text file, and generate the code to load, store, and manipulate graph
definitions. If you see a standalone TensorFlow file representing a model, it's
likely to contain a serialized version of one of these `GraphDef` objects

@ -37,8 +36,7 @@ saved out by the protobuf code.

This generated code is used to save and load the GraphDef files from disk. A
good example to look at as we dig into this is
[graph_metrics.py](https://github.com/tensorflow/tensorflow/blob/master/tensorfl
ow/python/tools/graph_metrics.py). This Python script takes a saved graph
[graph_metrics.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/graph_metrics.py). This Python script takes a saved graph
definition, and analyzes the model to estimate performance and resource
statistics. The code that actually loads the model looks like this:

@ -69,16 +67,14 @@ There are actually two different formats that a ProtoBuf can be saved in.
TextFormat is a human-readable form, which makes it nice for debugging and
editing, but can get large when there's numerical data like weights stored in
it. You can see a small example of that in
[poly5-graph.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorf
low/tensorboard/components/tf-tensorboard/demo/data/poly5-graph.pbtxt).
[poly5-graph.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorboard/components/tf-tensorboard/demo/data/poly5-graph.pbtxt).

Binary format files are a lot smaller than their text equivalents, even though
they're not as readable for us. In this script, we ask the user to supply a
flag indicating whether the input file is binary or text, so we know the right
function to call. You can find an example of a large binary file inside the
[inception_dec_2015.zip
archive](https://storage.googleapis.com/download.tensorflow.org/models/inception
_dec_2015.zip), as `tensorflow_inception_graph.pb`.
archive](https://storage.googleapis.com/download.tensorflow.org/models/inception_dec_2015.zip), as `tensorflow_inception_graph.pb`.

The API itself can be a bit confusing - the binary call is actually
`ParseFromString()`, whereas you use a utility function from the `text_format`

@ -104,7 +100,7 @@ single operation along with its input connections. Here are the members of a

Every node should have a unique identifier that's not used by any other nodes
in the graph. If you don't specify one as you're building a graph using the
Python API, an arbitrary one will be picked for you. The name is used when
Python API, one reflecting the name of operation, such as "MatMul",
concatenated with a monotonically increasing number, such as "5", will be
picked for you. The name is used when
defining the connections between nodes, and when setting inputs and outputs for
the whole graph when it's run.

@ -115,8 +111,7 @@ This defines what operation to run, for example `"Add"`, `"MatMul"`, or
`"Conv2D"`. When a graph is run, this op name is looked up in a registry to
find an implementation. The registry is populated by calls to the
`REGISTER_OP()` macro, like those in
[tensorflow/core/ops/nn_ops.cc](https://github.com/tensorflow/tensorflow/blob/ma
ster/tensorflow/core/ops/nn_ops.cc).
[tensorflow/core/ops/nn_ops.cc](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/nn_ops.cc).

### `input`

@ -142,8 +137,7 @@ size of filters for convolutions, or the values of constant ops. Because there
can be so many different types of attribute values, from strings, to ints, to
arrays of tensor values, there's a separate protobuf file defining the data
structure that holds them, in
[tensorflow/core/framework/attr_value.proto](https://github.com/tensorflow/tenso
rflow/blob/master/tensorflow/core/framework/attr_value.proto).
[tensorflow/core/framework/attr_value.proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/attr_value.proto).

Each attribute has a unique name string, and the expected attributes are listed
when the operation is defined. If an attribute isn't present in a node, but it

@ -161,8 +155,7 @@ the file format during training. Instead, they're held in separate checkpoint
files, and there are `Variable` ops in the graph that load the latest values
when they're initialized. It's often not very convenient to have separate files
when you're deploying to production, so there's the
[freeze_graph.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflo
w/python/tools/freeze_graph.py) script that takes a graph definition and a set
[freeze_graph.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py) script that takes a graph definition and a set
of checkpoints and freezes them together into a single file.

What this does is load the `GraphDef`, pull in the values for all the variables

@ -178,10 +171,9 @@ the most common problems is extracting and interpreting the weight values. A
common way to store them, for example in graphs created by the freeze_graph
script, is as `Const` ops containing the weights as `Tensors`. These are
defined in
[tensorflow/core/framework/tensor.proto](https://github.com/tensorflow/tensorflo
w/blob/master/tensorflow/core/framework/tensor.proto), and contain information
[tensorflow/core/framework/tensor.proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/tensor.proto), and contain information
about the size and type of the data, as well as the values themselves. In
Python, you get a `TensorProto` object from a `NodeDef` representing a `Const`
op by calling something like `some_node_def.attr['value'].tensor`.

This will give you an object representing the weights data. The data itself
@ -16,7 +16,7 @@ Python list) has a rank of 2:
t = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

A rank two tensor is what we typically think of as a matrix, a rank one tensor
is a vector. For a rank two tensor you can acccess any element with the syntax
is a vector. For a rank two tensor you can access any element with the syntax
`t[i, j]`. For a rank three tensor you would need to address an element with
`t[i, j, k]`.
@ -31,6 +31,11 @@ something amazing with TensorFlow, we'd like to hear about it!

## Community

The TensorFlow community has created many great projects around TensorFlow, including:

* [TensorFlow tutorials](https://github.com/pkmital/tensorflow_tutorials)
* [Scikit Flow - Simplified Interface for TensorFlow](https://github.com/tensorflow/skflow)

### Development

The source code for TensorFlow is hosted on GitHub:
|
||||
problem is to classify RGB 32x32 pixel images across 10 categories:
|
||||
```airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.```
|
||||
|
||||

|
||||
|
||||
For more details refer to the [CIFAR-10 page](http://www.cs.toronto.edu/~kriz/cifar.html)
|
||||
and a [Tech Report](http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf)
|
||||
by Alex Krizhevsky.
|
||||
@ -117,7 +115,7 @@ learn more about how the `Reader` class works.
|
||||
The images are processed as follows:
|
||||
|
||||
* They are cropped to 24 x 24 pixels, centrally for evaluation or
|
||||
[randomly](../../api_docs/python/image.md#random_crop) for training.
|
||||
[randomly](../../api_docs/python/constant_op.md#random_crop) for training.
|
||||
* They are [approximately whitened](../../api_docs/python/image.md#per_image_whitening)
|
||||
to make the model insensitive to dynamic range.
|
||||
|
||||
@ -168,7 +166,7 @@ Here is a graph generated from TensorBoard describing the inference operation:
|
||||
</div>
|
||||
|
||||
> **EXERCISE**: The output of `inference` are un-normalized logits. Try editing
|
||||
the network architecture to return normalized predictions using [`tf.softmax()`]
|
||||
the network architecture to return normalized predictions using [`tf.nn.softmax()`]
|
||||
(../../api_docs/python/nn.md#softmax).
|
||||
|
||||
The `inputs()` and `inference()` functions provide all the components
|
||||
|
@ -50,7 +50,7 @@ unpacked (following the instructions available at the website) by the

The image data is extracted into a 2d tensor of: `[image index, pixel index]`
where each entry is the intensity value of a specific pixel in a specific
image, rescaled from `[0, 255]` to `[-0.5, 0.5]`. The "image index" corresponds
image, rescaled from `[0, 255]` to `[0, 1]`. The "image index" corresponds
to an image in the dataset, counting up from zero to the size of the dataset.
And the "pixel index" corresponds to a specific pixel in that image, ranging
from zero to the number of pixels in the image.
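The corrected range is produced by a single division by 255; a one-line check with NumPy:

```python
import numpy as np

# Rescale uint8 pixel intensities from [0, 255] into [0, 1].
pixels = np.array([0, 128, 255], dtype=np.uint8)
scaled = pixels.astype(np.float32) / 255.0
print(scaled.min(), scaled.max())  # endpoints map to 0.0 and 1.0
```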
@ -92,7 +92,7 @@ lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
initial_state = state = tf.zeros([batch_size, lstm.state_size])

for i in range(len(num_steps)):
for i in range(num_steps):
# The value of state is updated after processing each batch of words.
output, state = lstm(words[:, i], state)

@ -159,7 +159,7 @@ lstm = rnn_cell.BasicLSTMCell(lstm_size)
stacked_lstm = rnn_cell.MultiRNNCell([lstm] * number_of_layers)

initial_state = state = stacked_lstm.zero_state(batch_size, tf.float32)
for i in range(len(num_steps)):
for i in range(num_steps):
# The value of state is updated after processing each batch of words.
output, state = stacked_lstm(words[:, i], state)
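Both corrected loops iterate `num_steps` times (`num_steps` is an int, so the old `range(len(num_steps))` would raise a TypeError). The shape bookkeeping can be sketched without TensorFlow by substituting a hypothetical plain-tanh cell for the LSTM:

```python
import numpy as np

# Hypothetical dimensions; the weights stand in for a trained RNN cell.
batch_size, num_steps, embed_dim, state_size = 4, 3, 8, 5
rng = np.random.default_rng(0)
W_x = rng.normal(size=(embed_dim, state_size))
W_h = rng.normal(size=(state_size, state_size))
words = rng.normal(size=(batch_size, num_steps, embed_dim))

state = np.zeros((batch_size, state_size))
for i in range(num_steps):
    # The value of state is updated after processing each batch of words.
    state = np.tanh(words[:, i] @ W_x + state @ W_h)

print(state.shape)  # one state vector per batch element
```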
@ -58,7 +58,7 @@ translation [Sutskever et al., 2014](http://arxiv.org/abs/1409.3215)
In the basic model depicted above, every input has to be encoded into
a fixed-size state vector, as that is the only thing passed to the decoder.
To allow the decoder more direct access to the input, an *attention* mechanism
was introduced in [Bahdanu et al., 2014](http://arxiv.org/abs/1409.0473)
was introduced in [Bahdanau et al., 2014](http://arxiv.org/abs/1409.0473)
([pdf](http://arxiv.org/pdf/1409.0473.pdf)).
We will not go into the details of the attention mechanism (see the paper),
suffice it to say that it allows the decoder to peek into the input at every

@ -176,8 +176,8 @@ projections are constructed by the following code in `seq2seq_model.py`.
```

First, note that we only construct a sampled softmax if the number of samples
(512 by default) is smaller that the target vocabulary size. For vocabularies
smaller than 512 it might be a better idea to just use a standard softmax loss.
(512 by default) is smaller than the target vocabulary size. For vocabularies
smaller than 512, it might be a better idea to just use a standard softmax loss.

Then, as you can see, we construct an output projection. It is a pair,
consisting of a weight matrix and a bias vector. If used, the rnn cell
@ -17,7 +17,7 @@

This should achieve a test error of 0.7%. Please keep this model as simple and
linear as possible, it is meant as a tutorial for simple convolutional models.
Run with --self_test on the command line to exectute a short self-test.
Run with --self_test on the command line to execute a short self-test.
"""
from __future__ import absolute_import
from __future__ import division
@ -276,7 +276,7 @@ def get_config():
raise ValueError("Invalid model: %s", FLAGS.model)


def main(unused_args):
def main(_):
if not FLAGS.data_path:
raise ValueError("Must set --data_path to PTB data directory")
@@ -66,7 +66,7 @@ def gunzip_file(gz_path, new_path):
  """Unzips from gz_path into new_path."""
  print("Unpacking %s to %s" % (gz_path, new_path))
  with gzip.open(gz_path, "rb") as gz_file:
    with open(new_path, "w") as new_file:
    with open(new_path, "wb") as new_file:
      for line in gz_file:
        new_file.write(line)

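The `"w"` to `"wb"` change above matters because `gzip.open(..., "rb")` yields `bytes`, and writing bytes to a text-mode file raises `TypeError` on Python 3. A self-contained sketch of the fixed pattern:

```python
import gzip
import os
import tempfile

# Create a small gzip file to unpack.
tmp = tempfile.mkdtemp()
gz_path = os.path.join(tmp, "data.gz")
new_path = os.path.join(tmp, "data.txt")

with gzip.open(gz_path, "wb") as f:
    f.write(b"hello\nworld\n")

# gzip yields bytes, so the destination must be opened in binary mode;
# open(new_path, "w") would fail here on Python 3.
with gzip.open(gz_path, "rb") as gz_file:
    with open(new_path, "wb") as new_file:
        for line in gz_file:
            new_file.write(line)

content = open(new_path, "rb").read()
print(content)  # b'hello\nworld\n'
```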
@@ -251,8 +251,8 @@ def import_graph_def(graph_def, input_map=None, return_elements=None,
      class_values = value.list
      new_class_values = []
      for class_value in class_values.s:
        if class_value.startswith('loc:@'):
          op_to_bind_to = class_value[5:]
        if class_value.startswith(b'loc:@'):
          op_to_bind_to = class_value[5:].decode()
          # Find the op by its original name.
          if op_to_bind_to not in name_to_op:
            raise ValueError('Specified colocation to an op that '
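The `b'loc:@'` change reflects that attribute values arrive from the serialized proto as `bytes` on Python 3, where a bytes object cannot be matched against a `str` prefix. A minimal sketch of the fixed logic:

```python
# Attr values from a GraphDef proto are bytes in Python 3; matching must use
# a bytes prefix, and the result must be decoded before use as an op name.
class_value = b"loc:@my_op"

if class_value.startswith(b"loc:@"):
    op_to_bind_to = class_value[5:].decode()

print(op_to_bind_to)  # my_op
```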
@@ -1041,7 +1041,7 @@ class Operation(object):
      raise TypeError("node_def needs to be a NodeDef: %s" % node_def)
    if node_def.ByteSize() >= (1 << 31) or node_def.ByteSize() < 0:
      raise ValueError(
          "Cannot create an Operation with a NodeDef larger than 2GB.")
          "Cannot create a tensor proto whose content is larger than 2GB.")
    if not _VALID_OP_NAME_REGEX.match(node_def.name):
      raise ValueError("'%s' is not a valid node name" % node_def.name)
    if not isinstance(g, Graph):
@@ -1228,8 +1228,8 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
      with ops.colocate_with(a.op):
        b = constant_op.constant(3.0)
      c = constant_op.constant(4.0)
      self.assertEqual(["loc:@a"], a.op.colocation_groups())
      self.assertEqual(["loc:@a"], b.op.colocation_groups())
      self.assertEqual([b"loc:@a"], a.op.colocation_groups())
      self.assertEqual([b"loc:@a"], b.op.colocation_groups())
      with self.assertRaises(ValueError):
        c.op.get_attr("_class")

@@ -1242,7 +1242,7 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
        # colocated with 'a', which is on '/gpu:0'. colocate_with
        # overrides devices because it is a stronger constraint.
        b = constant_op.constant(3.0)
    self.assertEqual(["loc:@a"], b.op.colocation_groups())
    self.assertEqual([b"loc:@a"], b.op.colocation_groups())
    self.assertEqual(a.op.device, b.op.device)

  def testLocationOverrides(self):
@@ -1258,7 +1258,7 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
        c = constant_op.constant(4.0)
      d = constant_op.constant(5.0)

    self.assertEqual(["loc:@a"], b.op.colocation_groups())
    self.assertEqual([b"loc:@a"], b.op.colocation_groups())
    self.assertEqual("/device:GPU:0", a.op.device)
    self.assertEqual(a.op.device, b.op.device)

@@ -1272,8 +1272,8 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
    b = constant_op.constant(3.0)
    with ops.colocate_with(b.op):
      c = constant_op.constant(4.0)
    self.assertEqual(["loc:@a"], b.op.colocation_groups())
    self.assertEqual(["loc:@a"], c.op.colocation_groups())
    self.assertEqual([b"loc:@a"], b.op.colocation_groups())
    self.assertEqual([b"loc:@a"], c.op.colocation_groups())

  def testMultiColocationGroups(self):
    a = constant_op.constant([2.0], name="a")
@@ -1281,7 +1281,7 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
    with ops.colocate_with(a.op):
      with ops.colocate_with(b.op):
        c = constant_op.constant(4.0)
    self.assertEqual(set(["loc:@a", "loc:@b"]), set(c.op.colocation_groups()))
    self.assertEqual(set([b"loc:@a", b"loc:@b"]), set(c.op.colocation_groups()))

  def testColocationIgnoreStack(self):
    a = constant_op.constant([2.0], name="a")
@@ -1295,7 +1295,7 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
    a = variables.Variable([2.0], name="a")
    with ops.colocate_with(a.op):
      b = variables.Variable([3.0], name="b")
    self.assertEqual(["loc:@a"], b.op.colocation_groups())
    self.assertEqual([b"loc:@a"], b.op.colocation_groups())

  def testInconsistentDeviceWithinColocate(self):
    with ops.device("/gpu:0"):
@@ -361,6 +361,9 @@ def make_tensor_proto(values, dtype=None, shape=None):
        tensor_shape=tensor_shape.as_shape(shape).as_proto())

  if is_same_size and numpy_dtype in _TENSOR_CONTENT_TYPES and shape_size > 1:
    if nparray.size * nparray.itemsize >= (1 << 31):
      raise ValueError(
          "Cannot create a tensor proto whose content is larger than 2GB.")
    tensor_proto.tensor_content = nparray.tostring()
    return tensor_proto

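The guard added above exists because a protobuf message cannot exceed 2GB, so the byte size (element count times bytes per element) is checked before the array is serialized into `tensor_content`. A sketch of the check in isolation:

```python
# Sketch of the 2GB guard: reject any array whose serialized content would
# reach 2GB (the protobuf message size limit).
def check_tensor_content_size(size, itemsize):
    if size * itemsize >= (1 << 31):
        raise ValueError(
            "Cannot create a tensor proto whose content is larger than 2GB.")

check_tensor_content_size(1024 * 1024, 4)  # 4 MB of float32: accepted

# 512 * 1024 * 1024 float32 elements is exactly 2GB and is rejected, the
# case exercised by constant_op_test.py below.
try:
    check_tensor_content_size(512 * 1024 * 1024, 4)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```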
@@ -155,7 +155,7 @@ class ConstantTest(tf.test.TestCase):
    large_array = np.zeros((512, 1024, 1024), dtype=np.float32)
    with self.assertRaisesRegexp(
        ValueError,
        "Cannot create an Operation with a NodeDef larger than 2GB."):
        "Cannot create a tensor proto whose content is larger than 2GB."):
      c = tf.constant(large_array)

  def testTooLargeGraph(self):
@@ -1397,7 +1397,7 @@ class ControlFlowTest(tf.test.TestCase):
                       vdef)
      # The device is empty, but the colocation constraint is set.
      self.assertDeviceEqual("", with_vdef_dep.device)
      self.assertEqual(["loc:@vdef"],
      self.assertEqual([b"loc:@vdef"],
                       with_vdef_dep.op.colocation_groups())

  def testGroup(self):
@@ -156,7 +156,7 @@ class DepthToSpaceTest(tf.test.TestCase):
        out_tf.eval()

  def testBlockSizeNotDivisibleDepth(self):
    # The the depth is not divisible by the square of the block size.
    # The depth is not divisible by the square of the block size.
    x_np = [[[[1, 1, 1, 1],
              [2, 2, 2, 2]],
             [[3, 3, 3, 3],
@@ -23,18 +23,21 @@ import tensorflow as tf

class GenerateIdentityTensorTest(tf.test.TestCase):

  def _testDiagOp(self, diag, dtype, expected_ans, use_gpu=False,
                  expected_err_re=None):
  def diagOp(self, diag, dtype, expected_ans, use_gpu=False):
    with self.test_session(use_gpu=use_gpu):
      tf_ans = tf.diag(tf.convert_to_tensor(diag.astype(dtype)))
      out = tf_ans.eval()
      tf_ans_inv = tf.diag_part(expected_ans)
      inv_out = tf_ans_inv.eval()
    self.assertAllClose(out, expected_ans)
    self.assertAllClose(inv_out, diag)
    self.assertShapeEqual(expected_ans, tf_ans)
    self.assertShapeEqual(diag, tf_ans_inv)

  def testEmptyTensor(self):
    x = numpy.array([])
    expected_ans = numpy.empty([0, 0])
    self._testDiagOp(x, numpy.int32, expected_ans)
    self.diagOp(x, numpy.int32, expected_ans)

  def testRankOneIntTensor(self):
    x = numpy.array([1, 2, 3])
@@ -42,8 +45,8 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
        [[1, 0, 0],
         [0, 2, 0],
         [0, 0, 3]])
    self._testDiagOp(x, numpy.int32, expected_ans)
    self._testDiagOp(x, numpy.int64, expected_ans)
    self.diagOp(x, numpy.int32, expected_ans)
    self.diagOp(x, numpy.int64, expected_ans)

  def testRankOneFloatTensor(self):
    x = numpy.array([1.1, 2.2, 3.3])
@@ -51,8 +54,8 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
        [[1.1, 0, 0],
         [0, 2.2, 0],
         [0, 0, 3.3]])
    self._testDiagOp(x, numpy.float32, expected_ans)
    self._testDiagOp(x, numpy.float64, expected_ans)
    self.diagOp(x, numpy.float32, expected_ans)
    self.diagOp(x, numpy.float64, expected_ans)

  def testRankTwoIntTensor(self):
    x = numpy.array([[1, 2, 3], [4, 5, 6]])
@@ -63,8 +66,8 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
        [[[0, 0, 0], [4, 0, 0]],
         [[0, 0, 0], [0, 5, 0]],
         [[0, 0, 0], [0, 0, 6]]]])
    self._testDiagOp(x, numpy.int32, expected_ans)
    self._testDiagOp(x, numpy.int64, expected_ans)
    self.diagOp(x, numpy.int32, expected_ans)
    self.diagOp(x, numpy.int64, expected_ans)

  def testRankTwoFloatTensor(self):
    x = numpy.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])
@@ -75,8 +78,8 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
        [[[0, 0, 0], [4.4, 0, 0]],
         [[0, 0, 0], [0, 5.5, 0]],
         [[0, 0, 0], [0, 0, 6.6]]]])
    self._testDiagOp(x, numpy.float32, expected_ans)
    self._testDiagOp(x, numpy.float64, expected_ans)
    self.diagOp(x, numpy.float32, expected_ans)
    self.diagOp(x, numpy.float64, expected_ans)

  def testRankThreeFloatTensor(self):
    x = numpy.array([[[1.1, 2.2], [3.3, 4.4]],
@@ -90,8 +93,64 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
          [[[0, 0], [0, 0]], [[0, 6.6], [0, 0]]]],
         [[[[0, 0], [0, 0]], [[0, 0], [7.7, 0]]],
          [[[0, 0], [0, 0]], [[0, 0], [0, 8.8]]]]]])
    self._testDiagOp(x, numpy.float32, expected_ans)
    self._testDiagOp(x, numpy.float64, expected_ans)
    self.diagOp(x, numpy.float32, expected_ans)
    self.diagOp(x, numpy.float64, expected_ans)

class DiagPartOpTest(tf.test.TestCase):

  def setUp(self):
    numpy.random.seed(0)

  def diagPartOp(self, tensor, dtype, expected_ans, use_gpu=False):
    with self.test_session(use_gpu=use_gpu):
      tf_ans_inv = tf.diag_part(tensor)
      inv_out = tf_ans_inv.eval()
    self.assertAllClose(inv_out, expected_ans)
    self.assertShapeEqual(expected_ans, tf_ans_inv)

  def testRankTwoFloatTensor(self):
    x = numpy.random.rand(3, 3)
    i = numpy.arange(3)
    expected_ans = x[i, i]
    self.diagPartOp(x, numpy.float32, expected_ans)
    self.diagPartOp(x, numpy.float64, expected_ans)

  def testRankFourFloatTensor(self):
    x = numpy.random.rand(2, 3, 2, 3)
    i = numpy.arange(2)[:, None]
    j = numpy.arange(3)
    expected_ans = x[i, j, i, j]
    self.diagPartOp(x, numpy.float32, expected_ans)
    self.diagPartOp(x, numpy.float64, expected_ans)

  def testRankSixFloatTensor(self):
    x = numpy.random.rand(2, 2, 2, 2, 2, 2)
    i = numpy.arange(2)[:, None, None]
    j = numpy.arange(2)[:, None]
    k = numpy.arange(2)
    expected_ans = x[i, j, k, i, j, k]
    self.diagPartOp(x, numpy.float32, expected_ans)
    self.diagPartOp(x, numpy.float64, expected_ans)

  def testOddRank(self):
    w = numpy.random.rand(2)
    x = numpy.random.rand(2, 2, 2)
    y = numpy.random.rand(2, 2, 2, 2, 2)
    z = numpy.random.rand(2, 2, 2, 2, 2, 2, 2)
    self.assertRaises(ValueError, self.diagPartOp, w, numpy.float32, 0)
    self.assertRaises(ValueError, self.diagPartOp, x, numpy.float32, 0)
    self.assertRaises(ValueError, self.diagPartOp, y, numpy.float32, 0)
    self.assertRaises(ValueError, self.diagPartOp, z, numpy.float32, 0)

  def testUnevenDimensions(self):
    w = numpy.random.rand(2, 5)
    x = numpy.random.rand(2, 1, 2, 3)
    y = numpy.random.rand(2, 1, 2, 1, 2, 5)
    z = numpy.random.rand(2, 2, 2, 2, 2, 2, 2, 2)
    self.assertRaises(ValueError, self.diagPartOp, w, numpy.float32, 0)
    self.assertRaises(ValueError, self.diagPartOp, x, numpy.float32, 0)
    self.assertRaises(ValueError, self.diagPartOp, y, numpy.float32, 0)
    self.assertRaises(ValueError, self.diagPartOp, z, numpy.float32, 0)

if __name__ == "__main__":
  tf.test.main()

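For intuition, the round trip these diag tests exercise can be sketched in plain NumPy: `tf.diag` embeds a rank-k tensor on the diagonal of a rank-2k tensor, and `tf.diag_part` recovers it via repeated indices:

```python
import numpy as np

# Rank-1 case: matches numpy.diag both ways.
x = np.array([1.0, 2.0, 3.0])
d = np.diag(x)                      # rank-2 result, shape (3, 3)
assert np.array_equal(np.diag(d), x)

# Higher ranks: diag_part picks entries at repeated indices, exactly what
# the DiagPartOpTest cases compute with fancy indexing.
y = np.arange(36.0).reshape(2, 3, 2, 3)
i = np.arange(2)[:, None]
j = np.arange(3)
part = y[i, j, i, j]                # shape (2, 3)
print(part.shape)  # (2, 3)
```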
@@ -25,7 +25,7 @@ from tensorflow.python.framework import random_seed
from tensorflow.python.ops import init_ops


# Returns true iff the two initalizers produce the same tensor to
# Returns true iff the two initializers produce the same tensor to
# within a tiny tolerance.
def identicaltest(tc, init1, init2, use_gpu):
  """Tests if two initializations are identical to within tiny tolerances.

@@ -120,7 +120,7 @@ class MatMulTest(tf.test.TestCase):
    self._testCpuMatmul(x, y, True, True)
    self._testGpuMatmul(x, y, True, True)

  def testDoubleRandomTranposeBoth(self):
  def testDoubleRandomTransposeBoth(self):
    for _ in range(10):
      n, k, m = np.random.randint(1, 100, size=3)
      x = self._randMatrix(k, n, np.float64)
@@ -116,8 +116,8 @@ class SumReductionTest(tf.test.TestCase):
  # Simple tests for various types.
  def testDoubleReduce1D(self):
    np_arr = np.arange(1, 6).reshape([5]).astype(np.float64)
    self._compare(np_arr, [], False)
    self._compare(np_arr, [0], False)
    self._compareAll(np_arr, [])
    self._compareAll(np_arr, [0])

  def testInt32Reduce1D(self):
    np_arr = np.arange(1, 6).reshape([5]).astype(np.int32)
@@ -230,6 +230,19 @@ class MeanReductionTest(tf.test.TestCase):
    self._compareAll(np_arr, [0, 2])
    self._compareAll(np_arr, [0, 1, 2])

  def testDoubleReduce3D(self):
    # Create a 3D array of doubles and reduce across all possible
    # dimensions
    np_arr = np.arange(0, 30).reshape([2, 3, 5]).astype(np.float64)
    self._compareAll(np_arr, [])
    self._compareAll(np_arr, [0])
    self._compareAll(np_arr, [1])
    self._compareAll(np_arr, [2])
    self._compareAll(np_arr, [0, 1])
    self._compareAll(np_arr, [1, 2])
    self._compareAll(np_arr, [0, 2])
    self._compareAll(np_arr, [0, 1, 2])

  def testGradient(self):
    s = [2, 3, 4, 2]
    x = np.arange(1.0, 49.0).reshape(s).astype(np.float32)
@@ -383,6 +396,19 @@ class MinReductionTest(tf.test.TestCase):
    self._compareAll(np_arr, [0, 2])
    self._compareAll(np_arr, [0, 1, 2])

  def testDoubleReduce3D(self):
    # Create a 3D array of doubles and reduce across all possible
    # dimensions
    np_arr = np.arange(0, 30).reshape([2, 3, 5]).astype(np.float64)
    self._compareAll(np_arr, [])
    self._compareAll(np_arr, [0])
    self._compareAll(np_arr, [1])
    self._compareAll(np_arr, [2])
    self._compareAll(np_arr, [0, 1])
    self._compareAll(np_arr, [1, 2])
    self._compareAll(np_arr, [0, 2])
    self._compareAll(np_arr, [0, 1, 2])

  def testGradient(self):
    s = [2, 3, 4, 2]
    x = np.arange(1.0, 49.0).reshape(s).astype(np.float64)
@@ -477,6 +503,20 @@ class MaxReductionTest(tf.test.TestCase):
    self._compareAll(np_arr, [0, 2])
    self._compareAll(np_arr, [0, 1, 2])

  def testDoubleReduce3D(self):
    # Create a 3D array of doubles and reduce across all possible
    # dimensions
    np_arr = np.arange(0, 30).reshape([2, 3, 5]).astype(np.float64)
    self._compareAll(np_arr, None)
    self._compareAll(np_arr, [])
    self._compareAll(np_arr, [0])
    self._compareAll(np_arr, [1])
    self._compareAll(np_arr, [2])
    self._compareAll(np_arr, [0, 1])
    self._compareAll(np_arr, [1, 2])
    self._compareAll(np_arr, [0, 2])
    self._compareAll(np_arr, [0, 1, 2])

  def testGradient(self):
    s = [2, 3, 4, 2]
    x = np.arange(1.0, 49.0).reshape(s).astype(np.float64)
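The NumPy side of the reduction comparisons these tests perform looks like the following sketch: a `[2, 3, 5]` array of doubles reduced over various axis subsets.

```python
import numpy as np

# Same array the new testDoubleReduce3D cases build.
np_arr = np.arange(0, 30).reshape([2, 3, 5]).astype(np.float64)

# Reducing over all axes collapses to a scalar; partial axis tuples keep
# the remaining dimensions.
assert np_arr.sum() == 435.0                      # 0 + 1 + ... + 29
assert np.sum(np_arr, axis=(0, 1)).shape == (5,)
assert np.max(np_arr, axis=(1, 2)).shape == (2,)
assert np.min(np_arr, axis=2).shape == (2, 3)
print(np_arr.sum())  # 435.0
```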
@@ -782,11 +782,11 @@ class BidirectionalRNNTest(tf.test.TestCase):
            tf.float32,
            shape=(batch_size, input_size) if use_shape else (None, input_size))
    ]
    outputs = tf.nn.bidirectional_rnn(cell_fw,
                                      cell_bw,
                                      inputs,
                                      dtype=tf.float32,
                                      sequence_length=sequence_length)
    outputs, state_fw, state_bw = tf.nn.bidirectional_rnn(cell_fw,
                                                          cell_bw,
                                                          inputs,
                                                          dtype=tf.float32,
                                                          sequence_length=sequence_length)
    self.assertEqual(len(outputs), len(inputs))
    for out in outputs:
      self.assertEqual(
@@ -794,17 +794,19 @@ class BidirectionalRNNTest(tf.test.TestCase):
          [batch_size if use_shape else None, 2 * num_units])

    input_value = np.random.randn(batch_size, input_size)
    outputs = tf.pack(outputs)

    return input_value, inputs, outputs, sequence_length
    return input_value, inputs, outputs, state_fw, state_bw, sequence_length

  def _testBidirectionalRNN(self, use_gpu, use_shape):
    with self.test_session(use_gpu=use_gpu, graph=tf.Graph()) as sess:
      input_value, inputs, outputs, sequence_length = (
      input_value, inputs, outputs, state_fw, state_bw, sequence_length = (
          self._createBidirectionalRNN(use_gpu, use_shape, True))
      tf.initialize_all_variables().run()
      # Run with pre-specified sequence length of 2, 3
      out = sess.run(outputs, feed_dict={inputs[0]: input_value,
                                         sequence_length: [2, 3]})
      out, s_fw, s_bw = sess.run([outputs, state_fw, state_bw],
                                 feed_dict={inputs[0]: input_value,
                                            sequence_length: [2, 3]})

    # Since the forward and backward LSTM cells were initialized with the
    # same parameters, the forward and backward output has to be the same,
@@ -836,13 +838,17 @@ class BidirectionalRNNTest(tf.test.TestCase):
    self.assertEqual(out[2][1][0], out[0][1][3])
    self.assertEqual(out[2][1][1], out[0][1][4])
    self.assertEqual(out[2][1][2], out[0][1][5])
    # Via the reasoning above, the forward and backward final state should be
    # exactly the same
    self.assertAllClose(s_fw, s_bw)

  def _testBidirectionalRNNWithoutSequenceLength(self, use_gpu, use_shape):
    with self.test_session(use_gpu=use_gpu, graph=tf.Graph()) as sess:
      input_value, inputs, outputs, _ = self._createBidirectionalRNN(
          use_gpu, use_shape, False)
      input_value, inputs, outputs, state_fw, state_bw, _ = self._createBidirectionalRNN(
          use_gpu, use_shape, False)
      tf.initialize_all_variables().run()
      out = sess.run(outputs, feed_dict={inputs[0]: input_value})
      out, s_fw, s_bw = sess.run([outputs, state_fw, state_bw],
                                 feed_dict={inputs[0]: input_value})

    # Since the forward and backward LSTM cells were initialized with the
    # same parameters, the forward and backward output has to be the same,
@@ -861,6 +867,9 @@ class BidirectionalRNNTest(tf.test.TestCase):
      self.assertEqual(out[i][1][0], out[8 - 1 - i][1][3])
      self.assertEqual(out[i][1][1], out[8 - 1 - i][1][4])
      self.assertEqual(out[i][1][2], out[8 - 1 - i][1][5])
    # Via the reasoning above, the forward and backward final state should be
    # exactly the same
    self.assertAllClose(s_fw, s_bw)

  def testBidirectionalRNN(self):
    self._testBidirectionalRNN(use_gpu=False, use_shape=False)
@@ -495,6 +495,105 @@ class Seq2SeqTest(tf.test.TestCase):
        if len(perplexities[bucket]) > 1:  # Assert that perplexity went down.
          self.assertLess(perplexities[bucket][-1], perplexities[bucket][0])

  def testModelWithBooleanFeedPrevious(self):
    """Test the model behavior when feed_previous is True.

    For example, the following two cases have the same effect:
      - Train `embedding_rnn_seq2seq` with `feed_previous=True`, which contains
        a `embedding_rnn_decoder` with `feed_previous=True` and
        `update_embedding_for_previous=True`. The decoder is fed with "<Go>"
        and outputs "A, B, C".
      - Train `embedding_rnn_seq2seq` with `feed_previous=False`. The decoder
        is fed with "<Go>, A, B".
    """
    num_encoder_symbols = 3
    num_decoder_symbols = 5
    batch_size = 2
    num_enc_timesteps = 2
    num_dec_timesteps = 3

    def TestModel(seq2seq):
      with self.test_session(graph=tf.Graph()) as sess:
        tf.set_random_seed(111)
        random.seed(111)
        np.random.seed(111)

        enc_inp = [tf.constant(i + 1, tf.int32, shape=[batch_size])
                   for i in range(num_enc_timesteps)]
        dec_inp_fp_true = [tf.constant(i, tf.int32, shape=[batch_size])
                           for i in range(num_dec_timesteps)]
        dec_inp_holder_fp_false = [tf.placeholder(tf.int32, shape=[batch_size])
                                   for _ in range(num_dec_timesteps)]
        targets = [tf.constant(i + 1, tf.int32, shape=[batch_size])
                   for i in range(num_dec_timesteps)]
        weights = [tf.constant(1.0, shape=[batch_size])
                   for i in range(num_dec_timesteps)]

        def ForwardBackward(enc_inp, dec_inp, feed_previous):
          scope_name = "fp_{}".format(feed_previous)
          with tf.variable_scope(scope_name):
            dec_op, _ = seq2seq(enc_inp, dec_inp, feed_previous=feed_previous)
            net_variables = tf.get_collection(tf.GraphKeys.VARIABLES,
                                              scope_name)
          optimizer = tf.train.AdamOptimizer(0.03, epsilon=1e-5)
          update_op = optimizer.minimize(
              tf.nn.seq2seq.sequence_loss(dec_op, targets, weights),
              var_list=net_variables)
          return dec_op, update_op, net_variables

        dec_op_fp_true, update_fp_true, variables_fp_true = ForwardBackward(
            enc_inp, dec_inp_fp_true, feed_previous=True)
        dec_op_fp_false, update_fp_false, variables_fp_false = ForwardBackward(
            enc_inp, dec_inp_holder_fp_false, feed_previous=False)

        sess.run(tf.initialize_all_variables())

        # We only check consistencies between the variables existing in both
        # the models with True and False feed_previous. Variables created by
        # the loop_function in the model with True feed_previous are ignored.
        v_false_name_dict = {v.name.split('/', 1)[-1]: v
                             for v in variables_fp_false}
        matched_variables = [(v, v_false_name_dict[v.name.split('/', 1)[-1]])
                             for v in variables_fp_true]
        for v_true, v_false in matched_variables:
          sess.run(tf.assign(v_false, v_true))

        # Take the symbols generated by the decoder with feed_previous=True as
        # the true input symbols for the decoder with feed_previous=False.
        dec_fp_true = sess.run(dec_op_fp_true)
        output_symbols_fp_true = np.argmax(dec_fp_true, axis=2)
        dec_inp_fp_false = np.vstack((dec_inp_fp_true[0].eval(),
                                      output_symbols_fp_true[:-1]))
        sess.run(update_fp_true)
        sess.run(update_fp_false,
                 {holder: inp for holder, inp in zip(dec_inp_holder_fp_false,
                                                     dec_inp_fp_false)})

        for v_true, v_false in matched_variables:
          self.assertAllClose(v_true.eval(), v_false.eval())

    def EmbeddingRNNSeq2SeqF(enc_inp, dec_inp, feed_previous):
      cell = tf.nn.rnn_cell.BasicLSTMCell(2)
      return tf.nn.seq2seq.embedding_rnn_seq2seq(
          enc_inp, dec_inp, cell, num_encoder_symbols,
          num_decoder_symbols, feed_previous=feed_previous)

    def EmbeddingTiedRNNSeq2Seq(enc_inp, dec_inp, feed_previous):
      cell = tf.nn.rnn_cell.BasicLSTMCell(2)
      return tf.nn.seq2seq.embedding_tied_rnn_seq2seq(
          enc_inp, dec_inp, cell, num_decoder_symbols,
          feed_previous=feed_previous)

    def EmbeddingAttentionSeq2Seq(enc_inp, dec_inp, feed_previous):
      cell = tf.nn.rnn_cell.BasicLSTMCell(2)
      return tf.nn.seq2seq.embedding_attention_seq2seq(
          enc_inp, dec_inp, cell, num_encoder_symbols,
          num_decoder_symbols, feed_previous=feed_previous)

    for model in (EmbeddingRNNSeq2SeqF, EmbeddingTiedRNNSeq2Seq,
                  EmbeddingAttentionSeq2Seq):
      TestModel(model)


if __name__ == "__main__":
  tf.test.main()

71  tensorflow/python/kernel_tests/trace_op_test.py  Normal file
@@ -0,0 +1,71 @@
# Copyright 2015 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy
import tensorflow as tf


class TraceTest(tf.test.TestCase):

  def setUp(self):
    numpy.random.seed(0)

  def traceOp(self, x, dtype, expected_ans, use_gpu=False):
    with self.test_session(use_gpu=use_gpu):
      tf_ans = tf.trace(x.astype(dtype))
      out = tf_ans.eval()
    self.assertAllClose(out, expected_ans)

  def testEmptyTensor(self):
    x = numpy.array([])
    self.assertRaises(ValueError, self.traceOp, x, numpy.float32, 0)

  def testRankOneTensor(self):
    x = numpy.array([1, 2, 3])
    self.assertRaises(ValueError, self.traceOp, x, numpy.float32, 0)

  def testRankTwoIntTensor(self):
    x = numpy.array(
        [[1, 0, 0],
         [0, 2, 0],
         [0, 0, 3]])
    expected_ans = 6
    self.traceOp(x, numpy.int32, expected_ans)
    self.traceOp(x, numpy.int64, expected_ans)

  def testRankTwoFloatTensor(self):
    x = numpy.array(
        [[1.1, 0, 0],
         [0, 2.2, 0],
         [0, 0, 3.3]])
    expected_ans = 6.6
    self.traceOp(x, numpy.float32, expected_ans)
    self.traceOp(x, numpy.float64, expected_ans)

  def testRankThreeFloatTensor(self):
    x = numpy.random.rand(2, 2, 2)
    self.assertRaises(ValueError, self.traceOp, x, numpy.float32, 0)

  def testRankFourFloatTensor(self):
    x = numpy.random.rand(2, 2, 2, 2)
    self.assertRaises(ValueError, self.traceOp, x, numpy.float32, 0)


if __name__ == "__main__":
  tf.test.main()

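The new `tf.trace` op tested above behaves like `numpy.trace`: it sums the diagonal of a rank-2 tensor and rejects non-matrix inputs. A minimal NumPy sketch of the same cases:

```python
import numpy as np

# Trace of a rank-2 tensor: the sum of its diagonal entries.
x = np.array([[1.1, 0.0, 0.0],
              [0.0, 2.2, 0.0],
              [0.0, 0.0, 3.3]])
t = np.trace(x)
print(abs(t - 6.6) < 1e-9)  # True (equal up to float rounding)
```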
@@ -846,6 +846,35 @@ def _DiagShape(op):
  input_shape = op.inputs[0].get_shape().with_rank_at_most(3)
  return [input_shape.concatenate(input_shape)]

@ops.RegisterShape("DiagPart")
def _DiagPartShape(op):
  """Shape function for array_ops.diag_part.

  This op has one input (of rank k = 2, 4, or 6), and one output (of rank k/2),
  where the shape of the output is the diagonal of the input shape.

  Args:
    op: A DiagPart Operation.

  Returns:
    A single-element list containing the shape of the output.

  Raises:
    ValueError: If input has odd rank or rank greater than 6.
  """
  shape = op.inputs[0].get_shape()
  rank = len(shape)
  mid = rank // 2
  if rank % 2 or rank > 6:
    raise ValueError("Input must have even rank <= 6, input rank is " +
                     str(rank) + ".")
  if shape[:mid] != shape[mid:]:
    raise ValueError("Invalid shape, shape[:mid] " + str(shape[:mid]) +
                     " and shape[mid:] " + str(shape[mid:]) +
                     " do not match.")
  input_shape = shape.with_rank_at_most(6)
  return [input_shape[:len(input_shape) // 2]]

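The shape rule implemented by `_DiagPartShape` above can be sketched in pure Python: the input rank must be even and at most 6, the two halves of the shape must match, and the output shape is the first half.

```python
# Pure-Python sketch of the DiagPart shape rule (a stand-in for the
# TensorShape-based implementation above).
def diag_part_shape(shape):
    rank = len(shape)
    mid = rank // 2
    if rank % 2 or rank > 6:
        raise ValueError("Input must have even rank <= 6, input rank is %d."
                         % rank)
    if shape[:mid] != shape[mid:]:
        raise ValueError("Invalid shape: halves %s and %s do not match."
                         % (shape[:mid], shape[mid:]))
    return shape[:mid]

print(diag_part_shape((2, 3, 2, 3)))  # (2, 3)
```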
@ops.RegisterShape("ExpandDims")
def _ExpandDimsShape(op):
@@ -1360,7 +1389,7 @@ def _SpaceToDepthShape(op):
  * input: a tensor of shape like that [B, H, W, D]
  * block_size: an int.

  Its output is the the same-rank tensor but with changed
  Its output is the same-rank tensor but with changed
  dimensions like that: [B, H/block_size, W/block_size, D*block_size*block_size]

  Args:
@@ -1408,7 +1437,7 @@ def _DepthToSpaceShape(op):
  * input: a tensor of shape like that [B, H, W, D]
  * block_size: an int.

  Its output is the the same-rank tensor but with changed
  Its output is the same-rank tensor but with changed
  dimensions like that:
  [B, H*block_size, W*block_size, D/(block_size*block_size)]

@@ -308,6 +308,7 @@ def flip_left_right(image):
  Raises:
    ValueError: if the shape of `image` is not supported.
  """
  image = ops.convert_to_tensor(image, name='image')
  _Check3DImage(image, require_static=False)
  return array_ops.reverse(image, [False, True, False])

@@ -329,6 +330,7 @@ def flip_up_down(image):
  Raises:
    ValueError: if the shape of `image` is not supported.
  """
  image = ops.convert_to_tensor(image, name='image')
  _Check3DImage(image, require_static=False)
  return array_ops.reverse(image, [True, False, False])

@ -741,7 +741,14 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
|
||||
image_ops.ResizeMethod.AREA]
|
||||
|
||||
TYPES = [np.uint8, np.int8, np.int16, np.int32, np.int64,
|
||||
np.float, np.double]
|
||||
np.float32, np.float64]
|
||||
|
||||
def availableGPUModes(self, opt, nptype):
|
||||
if opt == image_ops.ResizeMethod.NEAREST_NEIGHBOR \
|
||||
and nptype in [np.float32, np.float64]:
|
||||
return [True, False]
|
||||
else:
|
||||
return [False]
|
||||
|
||||
def testNoOp(self):
|
||||
img_shape = [1, 6, 4, 1]
|
||||
@ -761,13 +768,14 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
|
||||
img_np = np.array(data, dtype=nptype).reshape(img_shape)
|
||||
|
||||
for opt in self.OPTIONS:
|
||||
with self.test_session() as sess:
|
||||
image = constant_op.constant(img_np, shape=img_shape)
|
||||
y = image_ops.resize_images(image, target_height, target_width, opt)
|
||||
yshape = array_ops.shape(y)
|
||||
resized, newshape = sess.run([y, yshape])
|
||||
self.assertAllEqual(img_shape, newshape)
|
||||
self.assertAllClose(resized, img_np, atol=1e-5)
|
||||
for use_gpu in self.availableGPUModes(opt, nptype):
|
||||
with self.test_session(use_gpu=use_gpu) as sess:
|
||||
image = constant_op.constant(img_np, shape=img_shape)
|
||||
y = image_ops.resize_images(image, target_height, target_width, opt)
|
||||
yshape = array_ops.shape(y)
|
||||
resized, newshape = sess.run([y, yshape])
|
||||
self.assertAllEqual(img_shape, newshape)
self.assertAllClose(resized, img_np, atol=1e-5)

# Resizing with a single image must leave the shape unchanged also.

with self.test_session():
@@ -857,12 +865,13 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
img_np = np.array(data, dtype=nptype).reshape(img_shape)

for opt in self.OPTIONS:
with self.test_session():
image = constant_op.constant(img_np, shape=img_shape)
y = image_ops.resize_images(image, target_height, target_width, opt)
expected = np.array(expected_data).reshape(target_shape)
resized = y.eval()
self.assertAllClose(resized, expected, atol=1e-5)
for use_gpu in self.availableGPUModes(opt, nptype):
with self.test_session(use_gpu=use_gpu):
image = constant_op.constant(img_np, shape=img_shape)
y = image_ops.resize_images(image, target_height, target_width, opt)
expected = np.array(expected_data).reshape(target_shape)
resized = y.eval()
self.assertAllClose(resized, expected, atol=1e-5)

def testResizeUp(self):
img_shape = [1, 3, 2, 1]
@@ -899,14 +908,15 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
image_ops.ResizeMethod.BILINEAR,
image_ops.ResizeMethod.NEAREST_NEIGHBOR,
image_ops.ResizeMethod.AREA]:
with self.test_session():
img_np = np.array(data, dtype=nptype).reshape(img_shape)
image = constant_op.constant(img_np, shape=img_shape)
y = image_ops.resize_images(image, target_height, target_width, opt)
resized = y.eval()
expected = np.array(expected_data[opt]).reshape(
[1, target_height, target_width, 1])
self.assertAllClose(resized, expected, atol=1e-05)
for use_gpu in self.availableGPUModes(opt, nptype):
with self.test_session(use_gpu=use_gpu):
img_np = np.array(data, dtype=nptype).reshape(img_shape)
image = constant_op.constant(img_np, shape=img_shape)
y = image_ops.resize_images(image, target_height, target_width, opt)
resized = y.eval()
expected = np.array(expected_data[opt]).reshape(
[1, target_height, target_width, 1])
self.assertAllClose(resized, expected, atol=1e-05)

def testResizeUpBicubic(self):
img_shape = [1, 6, 6, 1]
@@ -964,6 +974,28 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
self.assertAllClose(resized, expected, atol=1)


def testCompareNearestNeighbor(self):
input_shape = [1, 5, 6, 3]
target_height = 8
target_width = 12
for nptype in [np.float32, np.float64]:
for align_corners in [True, False]:
img_np = np.arange(0, np.prod(input_shape), dtype=nptype).reshape(input_shape)
with self.test_session(use_gpu=True):
image = constant_op.constant(img_np, shape=input_shape)
out_op = image_ops.resize_images(image, target_height, target_width,
image_ops.ResizeMethod.NEAREST_NEIGHBOR,
align_corners=align_corners)
gpu_val = out_op.eval()
with self.test_session(use_gpu=False):
image = constant_op.constant(img_np, shape=input_shape)
out_op = image_ops.resize_images(image, target_height, target_width,
image_ops.ResizeMethod.NEAREST_NEIGHBOR,
align_corners=align_corners)
cpu_val = out_op.eval()
self.assertAllClose(cpu_val, gpu_val, rtol=1e-5, atol=1e-5)


class ResizeImageWithCropOrPadTest(test_util.TensorFlowTestCase):

def _ResizeImageWithCropOrPad(self, original, original_shape,
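The new `testCompareNearestNeighbor` above checks that the CPU and GPU nearest-neighbor kernels produce matching results. As a minimal NumPy sketch of nearest-neighbor sampling semantics, including an `align_corners`-style variant (the exact rounding rule of the TF kernels is an assumption here, and `resize_nearest_neighbor` is an illustrative helper, not the op itself):

```python
import numpy as np

def resize_nearest_neighbor(img, out_h, out_w, align_corners=False):
    """Nearest-neighbor resize of a [H, W] or [H, W, C] array (sketch)."""
    in_h, in_w = img.shape[:2]
    if align_corners and out_h > 1 and out_w > 1:
        # Map the corner pixels of input and output onto each other exactly.
        rows = np.rint(np.arange(out_h) * (in_h - 1) / (out_h - 1)).astype(int)
        cols = np.rint(np.arange(out_w) * (in_w - 1) / (out_w - 1)).astype(int)
    else:
        # Plain scale-and-floor sampling.
        rows = np.floor(np.arange(out_h) * in_h / out_h).astype(int)
        cols = np.floor(np.arange(out_w) * in_w / out_w).astype(int)
    rows = np.clip(rows, 0, in_h - 1)
    cols = np.clip(cols, 0, in_w - 1)
    return img[rows[:, None], cols[None, :]]
```

With this, the corner pixels of a 2x2 input survive an `align_corners=True` upscale to 4x4 unchanged, which is the property the GPU/CPU comparison relies on.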
@@ -63,6 +63,8 @@ TensorFlow provides several operations that you can use to add basic
mathematical functions for matrices to your graph.

@@diag
@@diag_part
@@trace
@@transpose

@@matmul
@@ -921,6 +923,39 @@ def reduce_any(input_tensor, reduction_indices=None, keep_dims=False,
keep_dims, name=name)


def trace(x, name=None):
"""Compute the trace of a tensor `x`.

`trace(x)` returns the sum of the elements along the main diagonal.

For example:

```python
# 'x' is [[1, 1],
#         [1, 1]]
tf.trace(x) ==> 2

# 'x' is [[1, 2, 3],
#         [4, 5, 6],
#         [7, 8, 9]]
tf.trace(x) ==> 15
```

Args:
x: 2-D tensor.
name: A name for the operation (optional).

Returns:
The trace of the input tensor.
"""
with ops.op_scope([x], name, "Trace") as name:
x = ops.convert_to_tensor(x, name="x")
if len(x.get_shape()) != 2:
raise ValueError("Expected a tensor with rank 2, rank %d tensor received"
% len(x.get_shape()))
return reduce_sum(array_ops.diag_part(x), name=name)


def matmul(a, b,
transpose_a=False, transpose_b=False,
a_is_sparse=False, b_is_sparse=False,
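The `trace` op added above is just a reduce-sum over the extracted diagonal. A NumPy sketch of the same semantics, including the rank-2 check:

```python
import numpy as np

def trace(x):
    """Sum of the main-diagonal entries of a rank-2 array (NumPy sketch)."""
    x = np.asarray(x)
    if x.ndim != 2:
        raise ValueError("Expected a tensor with rank 2, rank %d tensor received"
                         % x.ndim)
    # Equivalent to diag_part followed by reduce_sum in the diff above.
    return np.diagonal(x).sum()
```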
@@ -194,7 +194,7 @@ def softmax_cross_entropy_with_logits(logits, labels, name=None):
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.

**NOTE:**: While the classes are mutually exclusive, their probabilities
**NOTE:** While the classes are mutually exclusive, their probabilities
need not be. All that is required is that each row of `labels` is
a valid probability distribution. If using exclusive `labels`
(wherein one and only one class is true at a time), see
@@ -231,7 +231,7 @@ def sparse_softmax_cross_entropy_with_logits(logits, labels, name=None):
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.

**NOTE:**: For this operation, the probability of a given label is considered
**NOTE:** For this operation, the probability of a given label is considered
exclusive. That is, soft classes are not allowed, and the `labels` vector
must provide a single specific index for the true class for each row of
`logits` (each minibatch entry). For soft softmax classification with
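The two NOTE fixes above hinge on the distinction between soft (full-distribution) labels and exclusive index labels. A NumPy sketch of the two variants (illustrative helpers, not the fused TF kernels):

```python
import numpy as np

def softmax_xent(logits, labels):
    """Cross entropy per row against a full (possibly soft) label distribution."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

def sparse_softmax_xent(logits, label_ids):
    """Cross entropy when each row has exactly one true class index."""
    one_hot = np.eye(logits.shape[1])[label_ids]  # expand index to one-hot row
    return softmax_xent(logits, one_hot)
```

The sparse variant is just the dense one restricted to one-hot rows, which is why the docstrings point users with exclusive labels at the sparse op.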
@@ -312,9 +312,11 @@ def bidirectional_rnn(cell_fw, cell_bw, inputs,
scope: VariableScope for the created subgraph; defaults to "BiRNN"

Returns:
A set of output `Tensors` where:
A tuple (outputs, output_state_fw, output_state_bw) where:
outputs is a length T list of outputs (one for each input), which
are depth-concatenated forward and backward outputs
output_state_fw is the final state of the forward rnn
output_state_bw is the final state of the backward rnn

Raises:
TypeError: If "cell_fw" or "cell_bw" is not an instance of RNNCell.
@@ -333,19 +335,19 @@ def bidirectional_rnn(cell_fw, cell_bw, inputs,
name = scope or "BiRNN"
# Forward direction
with vs.variable_scope(name + "_FW") as fw_scope:
output_fw, _ = rnn(cell_fw, inputs, initial_state_fw, dtype,
output_fw, output_state_fw = rnn(cell_fw, inputs, initial_state_fw, dtype,
sequence_length, scope=fw_scope)

# Backward direction
with vs.variable_scope(name + "_BW") as bw_scope:
tmp, _ = rnn(cell_bw, _reverse_seq(inputs, sequence_length),
tmp, output_state_bw = rnn(cell_bw, _reverse_seq(inputs, sequence_length),
initial_state_bw, dtype, sequence_length, scope=bw_scope)
output_bw = _reverse_seq(tmp, sequence_length)
# Concat each of the forward/backward outputs
outputs = [array_ops.concat(1, [fw, bw])
for fw, bw in zip(output_fw, output_bw)]

return outputs
return (outputs, output_state_fw, output_state_bw)


def dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None,
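The change above makes `bidirectional_rnn` return the final states alongside the per-step outputs. The new (outputs, output_state_fw, output_state_bw) contract can be sketched with a NumPy stand-in RNN (the cumulative-sum "cell" is purely an assumption for shape demonstration; only the reverse/run/reverse-back and depth-concat structure mirrors the diff):

```python
import numpy as np

def fake_rnn(inputs):
    """Stand-in RNN: state is a running sum over time steps (demo only)."""
    outputs, state = [], np.zeros_like(inputs[0])
    for x in inputs:
        state = state + x
        outputs.append(state.copy())
    return outputs, state

def bidirectional(inputs):
    # Forward direction.
    output_fw, output_state_fw = fake_rnn(inputs)
    # Backward direction: reverse inputs, run, then reverse outputs back.
    tmp, output_state_bw = fake_rnn(inputs[::-1])
    output_bw = tmp[::-1]
    # Depth-concatenate each step's forward and backward outputs.
    outputs = [np.concatenate([fw, bw], axis=-1)
               for fw, bw in zip(output_fw, output_bw)]
    return outputs, output_state_fw, output_state_bw
```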
@@ -73,6 +73,34 @@ from tensorflow.python.ops import rnn_cell
from tensorflow.python.ops import variable_scope


def _extract_argmax_and_embed(embedding, output_projection=None,
update_embedding=True):
"""Get a loop_function that extracts the previous symbol and embeds it.

Args:
embedding: embedding tensor for symbols.
output_projection: None or a pair (W, B). If provided, each fed previous
output will first be multiplied by W and have B added.
update_embedding: Boolean; if False, the gradients will not propagate
through the embeddings.

Returns:
A loop function.
"""
def loop_function(prev, _):
if output_projection is not None:
prev = nn_ops.xw_plus_b(
prev, output_projection[0], output_projection[1])
prev_symbol = math_ops.argmax(prev, 1)
# Note that gradients will not propagate through the second parameter of
# embedding_lookup.
emb_prev = embedding_ops.embedding_lookup(embedding, prev_symbol)
if not update_embedding:
emb_prev = array_ops.stop_gradient(emb_prev)
return emb_prev
return loop_function


def rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None,
scope=None):
"""RNN decoder for the sequence-to-sequence model.
@@ -107,14 +135,13 @@ def rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None,
for i, inp in enumerate(decoder_inputs):
if loop_function is not None and prev is not None:
with variable_scope.variable_scope("loop_function", reuse=True):
# We do not propagate gradients over the loop function.
inp = array_ops.stop_gradient(loop_function(prev, i))
inp = loop_function(prev, i)
if i > 0:
variable_scope.get_variable_scope().reuse_variables()
output, state = cell(inp, state)
outputs.append(output)
if loop_function is not None:
prev = array_ops.stop_gradient(output)
prev = output
return outputs, state


@@ -182,7 +209,7 @@ def tied_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,

def embedding_rnn_decoder(decoder_inputs, initial_state, cell, num_symbols,
output_projection=None, feed_previous=False,
scope=None):
update_embedding_for_previous=True, scope=None):
"""RNN decoder with embedding and a pure-decoding option.

Args:
@@ -200,6 +227,11 @@ def embedding_rnn_decoder(decoder_inputs, initial_state, cell, num_symbols,
In effect, this implements a greedy decoder. It can also be used
during training to emulate http://arxiv.org/abs/1506.03099.
If False, decoder_inputs are used as given (the standard decoder case).
update_embedding_for_previous: Boolean; if False and feed_previous=True,
only the embedding for the first symbol of decoder_inputs (the "GO"
symbol) will be updated by back propagation. Embeddings for the symbols
generated from the decoder itself remain unchanged. This parameter has
no effect if feed_previous=False.
scope: VariableScope for the created subgraph; defaults to
"embedding_rnn_decoder".

@@ -227,16 +259,9 @@ def embedding_rnn_decoder(decoder_inputs, initial_state, cell, num_symbols,
with ops.device("/cpu:0"):
embedding = variable_scope.get_variable("embedding",
[num_symbols, cell.input_size])

def extract_argmax_and_embed(prev, _):
"""Loop_function that extracts the symbol from prev and embeds it."""
if output_projection is not None:
prev = nn_ops.xw_plus_b(
prev, output_projection[0], output_projection[1])
prev_symbol = array_ops.stop_gradient(math_ops.argmax(prev, 1))
return embedding_ops.embedding_lookup(embedding, prev_symbol)

loop_function = extract_argmax_and_embed if feed_previous else None
loop_function = _extract_argmax_and_embed(
embedding, output_projection,
update_embedding_for_previous) if feed_previous else None
emb_inp = (
embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs)
return rnn_decoder(emb_inp, initial_state, cell,
@@ -306,7 +331,8 @@ def embedding_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
outputs, state = embedding_rnn_decoder(
decoder_inputs, encoder_state, cell, num_decoder_symbols,
output_projection=output_projection,
feed_previous=feed_previous_bool)
feed_previous=feed_previous_bool,
update_embedding_for_previous=False)
return outputs + [state]

outputs_and_state = control_flow_ops.cond(feed_previous,
@@ -372,25 +398,19 @@ def embedding_tied_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
emb_decoder_inputs = [embedding_ops.embedding_lookup(embedding, x)
for x in decoder_inputs]

def extract_argmax_and_embed(prev, _):
"""Loop_function that extracts the symbol from prev and embeds it."""
if output_projection is not None:
prev = nn_ops.xw_plus_b(
prev, output_projection[0], output_projection[1])
prev_symbol = array_ops.stop_gradient(math_ops.argmax(prev, 1))
return embedding_ops.embedding_lookup(embedding, prev_symbol)

if output_projection is None:
cell = rnn_cell.OutputProjectionWrapper(cell, num_symbols)

if isinstance(feed_previous, bool):
loop_function = extract_argmax_and_embed if feed_previous else None
loop_function = _extract_argmax_and_embed(
embedding, output_projection, True) if feed_previous else None
return tied_rnn_seq2seq(emb_encoder_inputs, emb_decoder_inputs, cell,
loop_function=loop_function, dtype=dtype)

# If feed_previous is a Tensor, we construct 2 graphs and use cond.
def decoder(feed_previous_bool):
loop_function = extract_argmax_and_embed if feed_previous_bool else None
loop_function = _extract_argmax_and_embed(
embedding, output_projection, False) if feed_previous_bool else None
reuse = None if feed_previous_bool else True
with variable_scope.variable_scope(variable_scope.get_variable_scope(),
reuse=reuse):
@@ -523,7 +543,7 @@ def attention_decoder(decoder_inputs, initial_state, attention_states, cell,
# If loop_function is set, we use it instead of decoder_inputs.
if loop_function is not None and prev is not None:
with variable_scope.variable_scope("loop_function", reuse=True):
inp = array_ops.stop_gradient(loop_function(prev, i))
inp = loop_function(prev, i)
# Merge input and previous attentions into one vector of the right size.
x = rnn_cell.linear([inp] + attns, cell.input_size, True)
# Run the RNN.
@@ -539,8 +559,7 @@ def attention_decoder(decoder_inputs, initial_state, attention_states, cell,
with variable_scope.variable_scope("AttnOutputProjection"):
output = rnn_cell.linear([cell_output] + attns, output_size, True)
if loop_function is not None:
# We do not propagate gradients over the loop function.
prev = array_ops.stop_gradient(output)
prev = output
outputs.append(output)

return outputs, state
@@ -549,8 +568,10 @@ def attention_decoder(decoder_inputs, initial_state, attention_states, cell,

def embedding_attention_decoder(decoder_inputs, initial_state, attention_states,
cell, num_symbols, num_heads=1,
output_size=None, output_projection=None,
feed_previous=False, dtype=dtypes.float32,
scope=None, initial_state_attention=False):
feed_previous=False,
update_embedding_for_previous=True,
dtype=dtypes.float32, scope=None,
initial_state_attention=False):
"""RNN decoder with embedding and attention and a pure-decoding option.

Args:
@@ -571,6 +592,11 @@ def embedding_attention_decoder(decoder_inputs, initial_state, attention_states,
In effect, this implements a greedy decoder. It can also be used
during training to emulate http://arxiv.org/abs/1506.03099.
If False, decoder_inputs are used as given (the standard decoder case).
update_embedding_for_previous: Boolean; if False and feed_previous=True,
only the embedding for the first symbol of decoder_inputs (the "GO"
symbol) will be updated by back propagation. Embeddings for the symbols
generated from the decoder itself remain unchanged. This parameter has
no effect if feed_previous=False.
dtype: The dtype to use for the RNN initial states (default: tf.float32).
scope: VariableScope for the created subgraph; defaults to
"embedding_attention_decoder".
@@ -602,17 +628,9 @@ def embedding_attention_decoder(decoder_inputs, initial_state, attention_states,
with ops.device("/cpu:0"):
embedding = variable_scope.get_variable("embedding",
[num_symbols, cell.input_size])

def extract_argmax_and_embed(prev, _):
"""Loop_function that extracts the symbol from prev and embeds it."""
if output_projection is not None:
prev = nn_ops.xw_plus_b(
prev, output_projection[0], output_projection[1])
prev_symbol = array_ops.stop_gradient(math_ops.argmax(prev, 1))
emb_prev = embedding_ops.embedding_lookup(embedding, prev_symbol)
return emb_prev

loop_function = extract_argmax_and_embed if feed_previous else None
loop_function = _extract_argmax_and_embed(
embedding, output_projection,
update_embedding_for_previous) if feed_previous else None
emb_inp = [
embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs]
return attention_decoder(
@@ -700,6 +718,7 @@ def embedding_attention_seq2seq(encoder_inputs, decoder_inputs, cell,
num_decoder_symbols, num_heads=num_heads, output_size=output_size,
output_projection=output_projection,
feed_previous=feed_previous_bool,
update_embedding_for_previous=False,
initial_state_attention=initial_state_attention)
return outputs + [state]
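The refactor above replaces three local copies of `extract_argmax_and_embed` with one shared `_extract_argmax_and_embed`. A NumPy sketch of the greedy loop function it returns (the `stop_gradient` and variable-scope behavior are TF-specific and omitted; the names here are illustrative):

```python
import numpy as np

def extract_argmax_and_embed(embedding, output_projection=None):
    """Return a loop function: project prev output, take argmax, embed it."""
    def loop_function(prev, _i):
        if output_projection is not None:
            W, b = output_projection          # prev is multiplied by W, then b added
            prev = prev @ W + b
        prev_symbol = np.argmax(prev, axis=1)  # greedy symbol choice per row
        return embedding[prev_symbol]          # embedding lookup
    return loop_function
```

Feeding each step's embedded argmax back in as the next input is what makes this a greedy decoder when `feed_previous=True`.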
@@ -248,7 +248,7 @@ class _Nulllocker(object):


def Exists(path):  # pylint: disable=invalid-name
"""Retruns True iff "path" exists (as a dir, file, non-broken symlink)."""
"""Returns True iff "path" exists (as a dir, file, non-broken symlink)."""
return os.path.exists(path)
@@ -50,9 +50,11 @@ def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
100000, 0.96, staircase=True)
optimizer = tf.GradientDescentOptimizer(learning_rate)
# Passing global_step to minimize() will increment it at each step.
optimizer.minimize(...my loss..., global_step=global_step)
learning_step = (
tf.GradientDescentOptimizer(learning_rate)
.minimize(...my loss..., global_step=global_step)
)
```

Args:
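The docstring example above wires the decayed rate into the optimizer; the schedule itself is simple enough to state in plain Python. A sketch of the documented formula, where `staircase=True` means the exponent uses integer division so the rate decays in discrete intervals:

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """decayed_lr = learning_rate * decay_rate ** (global_step / decay_steps)."""
    p = global_step / decay_steps
    if staircase:
        p = int(p)  # truncate: the rate drops only at decay_steps boundaries
    return learning_rate * decay_rate ** p
```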
@@ -218,7 +218,7 @@ class ExponentialMovingAverageTest(tf.test.TestCase):
self.assertDeviceEqual("/job:dev_v0", ema.average(v0).device)
self.assertDeviceEqual("/job:dev_v1", ema.average(v1).device)
# However, the colocation property is maintained.
self.assertEqual(["loc:@v1"],
self.assertEqual([b"loc:@v1"],
ema.average(v1).op.colocation_groups())
self.assertDeviceEqual("/job:default", ema.average(tensor2).device)