Branch 163121296 (#11767)

* Update ops-related pbtxt files.

PiperOrigin-RevId: 163014080

* Go: Update generated wrapper functions for TensorFlow ops.

PiperOrigin-RevId: 163014834

* Removing session reset since destroying the session object would delete its variables as well. Resetting the session might unintentionally close other sessions in the same process.

PiperOrigin-RevId: 163019166

* [XLA] Teach CPU and GPU compilers to optionally invoke the HLO insert-reduce-precision-operations pass.

This also required a few additions and fixups.  We add pieces to ReducePrecisionInsertion to translate between the protocol-buffer representation of the pass options and the predicate-function actually used in the pass.  To facilitate this translation, we also add a function to HloOpcode to return the number of opcodes so that we can iterate over the whole set easily.

PiperOrigin-RevId: 163037250

* Refactor HLO graph dumping.

This also makes a few minor cosmetic changes, like moving the fusion
type out of the fusion node and into the out-of-line computation and
adjusting the arrow labels that we use to indicate operand numbers.

PiperOrigin-RevId: 163038795

* Use correct order of arguments in call of valid_bitcast_callback_.

There are platforms where bitcasts are not symmetric, i.e. there are shapes A and B such that A->B is a bitcast but B->A is not. So we have to consider the correct order when calling valid_bitcast_callback_.

PiperOrigin-RevId: 163058665

* Two improvements to pip.sh

1. Distinguish between passed and skipped tests.
2. Allow skipping the smoke test of the TensorFlow install in a clean virtualenv by setting NO_TEST_ON_INSTALL=1.

PiperOrigin-RevId: 163065599

* [XLA] Update StatusOr implementation to use more nuanced type traits.

Previously we would evaluate the is_copy_constructible trait before template
parameters were fully defined; e.g. StatusOr<ThingIAmDefiningRightNow>,
which could lead to surprising effects.

Also, previously it was not possible to provide an error status to a
StatusOr<T> where T was not default-constructible.

PiperOrigin-RevId: 163073057

* [TF:XLA] Register a no-op kernel for ControlTrigger, but forbid the JIT marking pass from compiling ControlTrigger nodes.

CL in preparation for compiling dynamic RNN gradients via XLA.

PiperOrigin-RevId: 163073212

* Improve the HLO graph dumper's output.

 - Truncate long shapes.  It's not uncommon to have giant tuples, and
   displaying the whole thing makes the graph unreadable.

 - Don't traverse into the users of a node with > 16 users.  These are
   probably not interesting, and traversing into them can quickly blow
   up the graph, making it un-renderable.

 - Allow nodes which have multiple trivial subcomputations (e.g.
   select-and-scatter) to have those computations inlined.

 - Match additional patterns in MatchTrivialComputation

PiperOrigin-RevId: 163079329

* If the value to be forwarded from a loop to its gradient is a constant, clone the constant instead of repeatedly pushing it onto a stack on each iteration. This should never consume more memory than the stack approach (notwithstanding swapping), and frequently should be much better.

This change is in preparation for enabling XLA compilation of RNN gradients.

PiperOrigin-RevId: 163082165
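
For intuition, here is a minimal plain-Python sketch (not the actual TensorFlow control-flow machinery; names are illustrative) contrasting the two forwarding strategies described above:

def forward_via_stack(value, num_iterations):
    """Old behavior: push the value once per iteration for the backward pass."""
    stack = []
    for _ in range(num_iterations):
        stack.append(value)  # one saved entry per iteration, even if `value` is constant
    return stack

def forward_constant(value, num_iterations):
    """New behavior for constants: keep a single clone, independent of iteration count."""
    return [value]

print(len(forward_via_stack(3.14, 1000)))  # 1000 saved copies
print(len(forward_constant(3.14, 1000)))   # 1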

* [TF:XLA] Make the shape of a TensorArray flow value a scalar.

Previously we used an f32[0] value, since the exact flow value does not matter; however, this causes problems when a TensorArray computation is placed in a loop, since the shape of the flow value is no longer loop-invariant.

PiperOrigin-RevId: 163082452

* Automated g4 rollback of changelist 163019166

PiperOrigin-RevId: 163083436

* Automated g4 rollback of changelist 162769374

PiperOrigin-RevId: 163086518

* internal change

PiperOrigin-RevId: 163088509

* Clarify docstring for tf.rank.

PiperOrigin-RevId: 163089480
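
For reference, tf.rank returns the number of dimensions of a tensor (not the matrix rank); a NumPy sketch of the same quantity:

import numpy as np

# tf.rank(t) is the number of dimensions of t, i.e. len(t.shape),
# not the linear-algebra rank of a matrix.
t = np.zeros((3, 4, 5))
print(t.ndim)  # 3, analogous to tf.rank(t) evaluating to 3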

* Reduce gather_op_test timeouts by reducing the size of testHigherRank.

PiperOrigin-RevId: 163090428

* Add PopulationCount op (popcnt): element-wise counts the number of "on" bits.

PiperOrigin-RevId: 163090921
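
A quick NumPy sketch of what element-wise popcount computes (a plain-Python stand-in, not the registered kernel):

import numpy as np

x = np.array([0, 1, 2, 255], dtype=np.uint8)
# Count the set bits in each element's binary representation.
popcnt = np.array([bin(v).count('1') for v in x], dtype=np.uint8)
print(popcnt)  # [0 1 1 8]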

* Show fusion nodes inline in HLO graph dumper.

To make this work sanely I had to change NodeFilter so that it says to
dump all nodes inside subcomputations.  Previously, we passed an
explicit NodeFilter down to DumpSubcomputation, and used that to control
whether or not we dumped nodes in there.  But this becomes unwieldy with
inline fusion nodes, as sometimes you want to look at 'filter', and
other times you want to look at 'filter_', and there's no good way to
tell why.

I also had to remove the heuristic whereby we'd pull in operands of
nodes with just some operands shown.  With the much bigger nodes that
are generated by this change, the graph was becoming illegible.  I think
most of the confusion that heuristic was attempting to avoid is
addressed by the fact that we "gray out" incomplete nodes.

PiperOrigin-RevId: 163091423

* errors: Avoid stripping error details when converting POSIX errors to Status

This change is made out of a desire to have additional information reported when there are filesystem errors (e.g., see #11628).

PiperOrigin-RevId: 163091773

* C API: Fix a bug with TF_OperationGetAttrTensor when TF_STRING tensors are
involved.

The TensorBuffer owned by a TF_Tensor object has a different memory layout than
the TensorBuffer owned by the corresponding tensorflow::Tensor object.
This change consolidates conversions between the runtime's tensorflow::Tensor
and the C API's TF_Tensor objects into a pair of helper functions.

The added test: CApiAttributesTest.StringTensor fails without corresponding
changes to c_api.cc

PiperOrigin-RevId: 163091789

* Speed up tf.contrib.signal spectral_ops_test.py by reducing the size of the gradient test.

PiperOrigin-RevId: 163092423

* Add new CompareAndBitpackOp.

PiperOrigin-RevId: 163093146
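
The commit message does not spell out the op's semantics; assuming it compares each element against a threshold and packs each group of 8 boolean results into one uint8 (as the name suggests), a NumPy sketch might look like this (bit ordering is illustrative):

import numpy as np

def compare_and_bitpack(x, threshold):
    # Compare, then pack 8 booleans per output byte (MSB-first here).
    bits = (x > threshold).astype(np.uint8).reshape(-1, 8)
    weights = (2 ** np.arange(7, -1, -1)).astype(np.uint8)
    return (bits * weights).sum(axis=1).astype(np.uint8)

x = np.arange(16, dtype=np.float32)
print(compare_and_bitpack(x, threshold=7.5))  # [  0 255]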

* Update ops-related pbtxt files.

PiperOrigin-RevId: 163094455

* Minor tweaks to avoid unnecessary copies

PiperOrigin-RevId: 163101160

* [BatchNormGrad] Add end-to-end test for BatchNormGrad

RELNOTES: n/a
PiperOrigin-RevId: 163101568

* Go: Update generated wrapper functions for TensorFlow ops.

PiperOrigin-RevId: 163102070

* [XLA] Add more unit tests for DynamicSlice and DynamicUpdateSlice.

PiperOrigin-RevId: 163102445
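
For context, DynamicSlice extracts a fixed-size slice whose start indices are runtime values (clamped to keep the slice in bounds), and DynamicUpdateSlice writes a slice back at runtime indices. A rough NumPy analogue of the semantics these tests exercise:

import numpy as np

def dynamic_slice(operand, start, sizes):
    # Clamp runtime start indices so the fixed-size slice stays in bounds.
    start = [min(max(s, 0), d - sz) for s, d, sz in zip(start, operand.shape, sizes)]
    return operand[tuple(slice(s, s + sz) for s, sz in zip(start, sizes))]

def dynamic_update_slice(operand, update, start):
    out = operand.copy()
    start = [min(max(s, 0), d - u) for s, d, u in zip(start, operand.shape, update.shape)]
    out[tuple(slice(s, s + u) for s, u in zip(start, update.shape))] = update
    return out

a = np.arange(16).reshape(4, 4)
print(dynamic_slice(a, start=[1, 2], sizes=[2, 2]))
print(dynamic_update_slice(a, np.zeros((2, 2), dtype=a.dtype), start=[3, 3]))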

* Adding missing deps to targets in llvm.BUILD. This was only working in non-sandboxed builds.

PiperOrigin-RevId: 163103908

* Pass batch_size in params when use_tpu=False.

PiperOrigin-RevId: 163105673
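
For context, TPUEstimator hands the batch size to the model_fn through the params dict; with this change that happens for use_tpu=False as well. A hypothetical model_fn sketch (names illustrative):

def model_fn(features, labels, mode, params):
    # After this change, populated whether or not use_tpu is set.
    batch_size = params['batch_size']
    print('building graph for batch size', batch_size)

model_fn(features=None, labels=None, mode='train', params={'batch_size': 64})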

* Remove duplicate import.

PiperOrigin-RevId: 163108237

* Implementation of UnsortedSegmentSum in tf2xla bridge.

PiperOrigin-RevId: 163109769
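
A NumPy sketch of what unsorted_segment_sum computes (the bridge lowers this to XLA; the sketch only restates the op's semantics):

import numpy as np

def unsorted_segment_sum(data, segment_ids, num_segments):
    # Sum entries that share a segment id; ids need not be sorted,
    # and segments with no entries sum to zero.
    out = np.zeros((num_segments,) + data.shape[1:], dtype=data.dtype)
    for row, seg in zip(data, segment_ids):
        out[seg] += row
    return out

data = np.array([1.0, 2.0, 3.0, 4.0])
print(unsorted_segment_sum(data, segment_ids=[0, 2, 0, 2], num_segments=3))  # [4. 0. 6.]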

* Add gradient checking tests for nn.moments().

PiperOrigin-RevId: 163110994

* Improved the speed of constant folding

PiperOrigin-RevId: 163113085

* Convert configure to python.

PiperOrigin-RevId: 163114551

* [TF:XLA] Ignore control edges from Enter nodes to the graph sink during loop functionalization.

PiperOrigin-RevId: 163115904

* Support customized residual function in the residual wrapper.

PiperOrigin-RevId: 163121296
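
For context, a residual RNN-cell wrapper combines the wrapped cell's input and output; with this change the combination function is pluggable instead of being fixed to addition. A toy plain-Python sketch of the idea (not the actual tf.contrib.rnn classes; the residual_fn name is illustrative):

class ResidualWrapper:
    def __init__(self, cell, residual_fn=None):
        self._cell = cell
        # Default residual connection: input + output.
        self._residual_fn = residual_fn or (lambda inp, out: inp + out)

    def __call__(self, inputs, state):
        outputs, new_state = self._cell(inputs, state)
        return self._residual_fn(inputs, outputs), new_state

double_cell = lambda inputs, state: (2 * inputs, state)
wrapper = ResidualWrapper(double_cell, residual_fn=lambda inp, out: 0.5 * (inp + out))
print(wrapper(4.0, state=None))  # (6.0, None)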
Vijay Vasudevan 2017-07-25 18:41:40 -07:00 committed by GitHub
parent 245a5eeafc
commit 62de0b9479
81 changed files with 4579 additions and 2018 deletions

configure (vendored)

@@ -3,816 +3,12 @@
set -e
set -o pipefail
MIN_BAZEL_VERSION=0.4.5
# Find out the absolute path to where ./configure resides
pushd `dirname $0` > /dev/null
SOURCE_BASE_DIR=`pwd -P`
popd > /dev/null
PLATFORM="$(uname -s | tr 'A-Z' 'a-z')"
function is_linux() {
[[ "${PLATFORM}" == "linux" ]]
}
function is_macos() {
[[ "${PLATFORM}" == "darwin" ]]
}
function is_windows() {
# On windows, the shell script is actually running in msys
[[ "${PLATFORM}" =~ msys_nt*|mingw*|cygwin*|uwin* ]]
}
function is_ppc64le() {
[[ "$(uname -m)" == "ppc64le" ]]
}
function sed_in_place() {
sed -e $1 $2 > "$2.bak"
mv "$2.bak" $2
}
function write_to_bazelrc() {
echo "$1" >> .tf_configure.bazelrc
}
function write_action_env_to_bazelrc() {
write_to_bazelrc "build --action_env $1=\"$2\""
}
function python_path {
"$PYTHON_BIN_PATH" - <<END
from __future__ import print_function
import site
import os
try:
input = raw_input
except NameError:
pass
python_paths = []
if os.getenv('PYTHONPATH') is not None:
python_paths = os.getenv('PYTHONPATH').split(':')
try:
library_paths = site.getsitepackages()
except AttributeError:
from distutils.sysconfig import get_python_lib
library_paths = [get_python_lib()]
all_paths = set(python_paths + library_paths)
paths = []
for path in all_paths:
if os.path.isdir(path):
paths.append(path)
print(",".join(paths))
END
}
function setup_python {
## Set up python-related environment settings:
while true; do
fromuser=""
if [ -z "$PYTHON_BIN_PATH" ]; then if [ -z "$PYTHON_BIN_PATH" ]; then
default_python_bin_path=$(which python || which python3 || true) PYTHON_BIN_PATH=$(which python || which python3 || true)
read -p "Please specify the location of python. [Default is $default_python_bin_path]: " PYTHON_BIN_PATH
fromuser="1"
if [ -z "$PYTHON_BIN_PATH" ]; then
PYTHON_BIN_PATH=$default_python_bin_path
fi
fi
if [ -e "$PYTHON_BIN_PATH" ]; then
break
fi
echo "Invalid python path. ${PYTHON_BIN_PATH} cannot be found" 1>&2
if [ -z "$fromuser" ]; then
exit 1
fi
PYTHON_BIN_PATH=""
# Retry
done
if [ -z "$PYTHON_LIB_PATH" ]; then
# Split python_path into an array of paths, this allows path containing spaces
IFS=',' read -r -a python_lib_path <<< "$(python_path)"
if [ 1 = "$USE_DEFAULT_PYTHON_LIB_PATH" ]; then
PYTHON_LIB_PATH=${python_lib_path[0]}
echo "Using python library path: $PYTHON_LIB_PATH"
else
echo "Found possible Python library paths:"
for x in "${python_lib_path[@]}"; do
echo " $x"
done
set -- "${python_lib_path[@]}"
echo "Please input the desired Python library path to use. Default is [$1]"
read b || true
if [ "$b" == "" ]; then
PYTHON_LIB_PATH=${python_lib_path[0]}
echo "Using python library path: $PYTHON_LIB_PATH"
else
PYTHON_LIB_PATH="$b"
fi
fi
fi
if [ ! -x "$PYTHON_BIN_PATH" ] || [ -d "$PYTHON_BIN_PATH" ]; then
echo "PYTHON_BIN_PATH is not executable. Is it the python binary?"
exit 1
fi
local python_major_version
python_major_version=$("${PYTHON_BIN_PATH}" -c 'from __future__ import print_function; import sys; print(sys.version_info[0]);' | head -c1)
if [ -z "$python_major_version" ]; then
echo -e "\n\nERROR: Problem getting python version. Is $PYTHON_BIN_PATH the correct python binary?"
exit 1
fi
# Convert python path to Windows style before writing into bazel.rc
if is_windows; then
PYTHON_BIN_PATH="$(cygpath -m "$PYTHON_BIN_PATH")"
PYTHON_LIB_PATH="$(cygpath -m "$PYTHON_LIB_PATH")"
fi
# Set-up env variables used by python_configure.bzl
write_action_env_to_bazelrc "PYTHON_BIN_PATH" "$PYTHON_BIN_PATH"
write_action_env_to_bazelrc "PYTHON_LIB_PATH" "$PYTHON_LIB_PATH"
write_to_bazelrc "build --define PYTHON_BIN_PATH=\"$PYTHON_BIN_PATH\""
write_to_bazelrc "build --define PYTHON_LIB_PATH=\"$PYTHON_LIB_PATH\""
write_to_bazelrc "build --force_python=py$python_major_version"
write_to_bazelrc "build --host_force_python=py$python_major_version"
write_to_bazelrc "build --python${python_major_version}_path=\"$PYTHON_BIN_PATH\""
write_to_bazelrc "test --force_python=py$python_major_version"
write_to_bazelrc "test --host_force_python=py$python_major_version"
write_to_bazelrc "test --define PYTHON_BIN_PATH=\"$PYTHON_BIN_PATH\""
write_to_bazelrc "test --define PYTHON_LIB_PATH=\"$PYTHON_LIB_PATH\""
write_to_bazelrc "run --define PYTHON_BIN_PATH=\"$PYTHON_BIN_PATH\""
write_to_bazelrc "run --define PYTHON_LIB_PATH=\"$PYTHON_LIB_PATH\""
# Write tools/python_bin_path.sh
echo "export PYTHON_BIN_PATH=\"$PYTHON_BIN_PATH\"" > tools/python_bin_path.sh
}
function version {
echo "$@" | awk -F. '{ printf("%03d%03d%03d\n", $1,$2,$3); }';
}
bazel version > bazel.version
set +e
curr_bazel_version=$(grep -m 1 'Build label:' bazel.version | cut -d ' ' -f3)
set -e
rm -f bazel.version
echo "You have bazel $curr_bazel_version installed."
if [ -z "$curr_bazel_version" ]; then
echo "WARNING: current bazel installation is not a release version."
echo "Make sure you are running at least bazel $MIN_BAZEL_VERSION."
elif [ "$(version "$MIN_BAZEL_VERSION")" -gt "$(version "$curr_bazel_version")" ]; then
echo "Please upgrade your bazel installation to version $MIN_BAZEL_VERSION or higher to build TensorFlow!"
echo "Exiting..."
exit 1
fi
# This file contains customized config settings.
rm -f .tf_configure.bazelrc
touch .tf_configure.bazelrc
if [[ ! -e .bazelrc ]]; then
if [[ -e "${HOME}/.bazelrc" ]]; then
echo "import ${HOME}/.bazelrc" >.bazelrc
else
touch .bazelrc
fi
fi
sed_in_place "/tf_configure/d" .bazelrc
echo "import %workspace%/.tf_configure.bazelrc" >> .bazelrc
# Delete any leftover BUILD files from the Makefile build, which would interfere
# with Bazel parsing.
MAKEFILE_DOWNLOAD_DIR=tensorflow/contrib/makefile/downloads
if [ -d "${MAKEFILE_DOWNLOAD_DIR}" ]; then
find ${MAKEFILE_DOWNLOAD_DIR} -type f -name '*BUILD' -delete
fi
setup_python
## Set up MKL related environment settings
write_to_bazelrc 'build:mkl --define with_mkl_support=true'
write_to_bazelrc 'build:mkl --define using_mkl=true'
write_to_bazelrc 'build:mkl -c opt'
write_to_bazelrc 'build:mkl --copt="-DEIGEN_USE_VML"'
echo ""
echo "Add \"--config=mkl\" to your bazel command to build with MKL support."
echo "Please note that MKL on MacOS or windows is still not supported."
echo "If you would like to use a local MKL instead of downloading, please "
echo " set the environment variable \"TF_MKL_ROOT\" every time before build."
echo ""
## End MKL setup
## Set up architecture-dependent optimization flags.
if [ -z "$CC_OPT_FLAGS" ]; then
if is_ppc64le; then
# gcc on ppc64le does not support -march, use mcpu instead
default_cc_opt_flags="-mcpu=native"
else
default_cc_opt_flags="-march=native"
fi
read -p "Please specify optimization flags to use during compilation when bazel option "\
"\"--config=opt\" is specified [Default is $default_cc_opt_flags]: " CC_OPT_FLAGS
if [ -z "$CC_OPT_FLAGS" ]; then
CC_OPT_FLAGS=$default_cc_opt_flags
fi
fi
if is_windows; then
TF_NEED_GCP=0
TF_NEED_HDFS=0
TF_NEED_JEMALLOC=0
TF_NEED_OPENCL=0
TF_CUDA_CLANG=0
fi
if is_linux; then
while [ "$TF_NEED_JEMALLOC" == "" ]; do
read -p "Do you wish to use jemalloc as the malloc implementation? [Y/n] "\
INPUT
case $INPUT in
[Yy]* ) echo "jemalloc enabled"; TF_NEED_JEMALLOC=1;;
[Nn]* ) echo "jemalloc disabled"; TF_NEED_JEMALLOC=0;;
"" ) echo "jemalloc enabled"; TF_NEED_JEMALLOC=1;;
* ) echo "Invalid selection: " $INPUT;;
esac
done
else
TF_NEED_JEMALLOC=0
fi
if [[ "$TF_NEED_JEMALLOC" == "1" ]]; then
write_to_bazelrc 'build --define with_jemalloc=true'
fi
while [[ "$TF_NEED_GCP" == "" ]]; do
read -p "Do you wish to build TensorFlow with "\
"Google Cloud Platform support? [y/N] " INPUT
case $INPUT in
[Yy]* ) echo "Google Cloud Platform support will be enabled for "\
"TensorFlow"; TF_NEED_GCP=1;;
[Nn]* ) echo "No Google Cloud Platform support will be enabled for "\
"TensorFlow"; TF_NEED_GCP=0;;
"" ) echo "No Google Cloud Platform support will be enabled for "\
"TensorFlow"; TF_NEED_GCP=0;;
* ) echo "Invalid selection: " $INPUT;;
esac
done
if [[ "$TF_NEED_GCP" == "1" ]]; then
write_to_bazelrc 'build --define with_gcp_support=true'
fi
while [[ "$TF_NEED_HDFS" == "" ]]; do
read -p "Do you wish to build TensorFlow with "\
"Hadoop File System support? [y/N] " INPUT
case $INPUT in
[Yy]* ) echo "Hadoop File System support will be enabled for "\
"TensorFlow"; TF_NEED_HDFS=1;;
[Nn]* ) echo "No Hadoop File System support will be enabled for "\
"TensorFlow"; TF_NEED_HDFS=0;;
"" ) echo "No Hadoop File System support will be enabled for "\
"TensorFlow"; TF_NEED_HDFS=0;;
* ) echo "Invalid selection: " $INPUT;;
esac
done
if [[ "$TF_NEED_HDFS" == "1" ]]; then
write_to_bazelrc 'build --define with_hdfs_support=true'
fi
## Enable XLA.
while [[ "$TF_ENABLE_XLA" == "" ]]; do
read -p "Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N] " INPUT
case $INPUT in
[Yy]* ) echo "XLA JIT support will be enabled for TensorFlow"; TF_ENABLE_XLA=1;;
[Nn]* ) echo "No XLA JIT support will be enabled for TensorFlow"; TF_ENABLE_XLA=0;;
"" ) echo "No XLA support will be enabled for TensorFlow"; TF_ENABLE_XLA=0;;
* ) echo "Invalid selection: " $INPUT;;
esac
done
if [[ "$TF_ENABLE_XLA" == "1" ]]; then
write_to_bazelrc 'build --define with_xla_support=true'
fi
# Verbs configuration
while [ "$TF_NEED_VERBS" == "" ]; do
read -p "Do you wish to build TensorFlow with "\
"VERBS support? [y/N] " INPUT
case $INPUT in
[Yy]* ) echo "VERBS support will be enabled for "\
"TensorFlow"; TF_NEED_VERBS=1;;
[Nn]* ) echo "No VERBS support will be enabled for "\
"TensorFlow"; TF_NEED_VERBS=0;;
"" ) echo "No VERBS support will be enabled for "\
"TensorFlow"; TF_NEED_VERBS=0;;
* ) echo "Invalid selection: " $INPUT;;
esac
done
if [[ "$TF_NEED_VERBS" == "1" ]]; then
write_to_bazelrc 'build --define with_verbs_support=true'
fi
# Append CC optimization flags to bazel.rc
for opt in $CC_OPT_FLAGS; do
write_to_bazelrc "build:opt --cxxopt=$opt --copt=$opt"
done
# Run the gen_git_source to create links where bazel can track dependencies for
# git hash propagation
GEN_GIT_SOURCE=tensorflow/tools/git/gen_git_source.py
chmod a+x ${GEN_GIT_SOURCE}
"${PYTHON_BIN_PATH}" ${GEN_GIT_SOURCE} --configure "${SOURCE_BASE_DIR}"
## Set up SYCL-related environment settings
while [ "$TF_NEED_OPENCL" == "" ]; do
read -p "Do you wish to build TensorFlow with OpenCL support? [y/N] " INPUT
case $INPUT in
[Yy]* ) echo "OpenCL support will be enabled for TensorFlow"; TF_NEED_OPENCL=1;;
[Nn]* ) echo "No OpenCL support will be enabled for TensorFlow"; TF_NEED_OPENCL=0;;
"" ) echo "No OpenCL support will be enabled for TensorFlow"; TF_NEED_OPENCL=0;;
* ) echo "Invalid selection: " $INPUT;;
esac
done
## Set up Cuda-related environment settings
while [ "$TF_NEED_CUDA" == "" ]; do
read -p "Do you wish to build TensorFlow with CUDA support? [y/N] " INPUT
case $INPUT in
[Yy]* ) echo "CUDA support will be enabled for TensorFlow"; TF_NEED_CUDA=1;;
[Nn]* ) echo "No CUDA support will be enabled for TensorFlow"; TF_NEED_CUDA=0;;
"" ) echo "No CUDA support will be enabled for TensorFlow"; TF_NEED_CUDA=0;;
* ) echo "Invalid selection: " $INPUT;;
esac
done
export TF_NEED_CUDA
write_action_env_to_bazelrc "TF_NEED_CUDA" "$TF_NEED_CUDA"
export TF_NEED_OPENCL
write_action_env_to_bazelrc "TF_NEED_OPENCL" "$TF_NEED_OPENCL"
if [ "$TF_NEED_CUDA" == "1" ]; then
while [[ "$TF_CUDA_CLANG" == "" ]]; do
read -p "Do you want to use clang as CUDA compiler? [y/N] " INPUT
case $INPUT in
[Yy]* ) echo "Clang will be used as CUDA compiler"; TF_CUDA_CLANG=1;;
[Nn]* ) echo "nvcc will be used as CUDA compiler"; TF_CUDA_CLANG=0;;
"" ) echo "nvcc will be used as CUDA compiler"; TF_CUDA_CLANG=0;;
* ) echo "Invalid selection: " $INPUT;;
esac
done
export TF_CUDA_CLANG
write_action_env_to_bazelrc "TF_CUDA_CLANG" "$TF_CUDA_CLANG"
# Set up which clang we should use as the cuda / host compiler.
while [[ "$TF_CUDA_CLANG" == "1" ]] && true; do
fromuser=""
if [ -z "$CLANG_CUDA_COMPILER_PATH" ]; then
default_clang_host_compiler_path=$(which clang || true)
read -p "Please specify which clang should be used as device and host compiler. [Default is $default_clang_host_compiler_path]: " CLANG_CUDA_COMPILER_PATH
fromuser="1"
if [ -z "$CLANG_CUDA_COMPILER_PATH" ]; then
CLANG_CUDA_COMPILER_PATH="$default_clang_host_compiler_path"
fi
fi
if [ -e "$CLANG_CUDA_COMPILER_PATH" ]; then
export CLANG_CUDA_COMPILER_PATH
write_action_env_to_bazelrc "CLANG_CUDA_COMPILER_PATH" "$CLANG_CUDA_COMPILER_PATH"
break
fi
echo "Invalid clang path. ${CLANG_CUDA_COMPILER_PATH} cannot be found" 1>&2
if [ -z "$fromuser" ]; then
exit 1
fi
CLANG_CUDA_COMPILER_PATH=""
# Retry
done
# Find out where the CUDA toolkit is installed
while true; do
# Configure the Cuda SDK version to use.
if [ -z "$TF_CUDA_VERSION" ]; then
read -p "Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 8.0]: " TF_CUDA_VERSION
fi
# Set default CUDA version if not set
TF_CUDA_VERSION=${TF_CUDA_VERSION:-8.0}
fromuser=""
if [ -z "$CUDA_TOOLKIT_PATH" ]; then
default_cuda_path=/usr/local/cuda
if is_windows; then
if [ -z "$CUDA_PATH" ]; then
default_cuda_path="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0"
else
default_cuda_path="$(cygpath -m "$CUDA_PATH")"
fi
elif is_linux; then
# If the default doesn't exist, try an alternative default.
if [ ! -d $default_cuda_path ] && [ -d /opt/cuda ]; then
default_cuda_path=/opt/cuda
fi
fi
read -p "Please specify the location where CUDA $TF_CUDA_VERSION toolkit is installed. Refer to README.md for more details. [Default is $default_cuda_path]: " CUDA_TOOLKIT_PATH
fromuser="1"
if [ -z "$CUDA_TOOLKIT_PATH" ]; then
CUDA_TOOLKIT_PATH="$default_cuda_path"
fi
fi
if [[ -z "$TF_CUDA_VERSION" ]]; then
TF_CUDA_EXT=""
else
TF_CUDA_EXT=".$TF_CUDA_VERSION"
fi
if is_windows; then
CUDA_RT_LIB_PATH="lib/x64/cudart.lib"
elif is_linux; then
CUDA_RT_LIB_PATH="lib64/libcudart.so${TF_CUDA_EXT}"
elif is_macos; then
CUDA_RT_LIB_PATH="lib/libcudart${TF_CUDA_EXT}.dylib"
fi
if [ -e "${CUDA_TOOLKIT_PATH}/${CUDA_RT_LIB_PATH}" ]; then
export CUDA_TOOLKIT_PATH
write_action_env_to_bazelrc "CUDA_TOOLKIT_PATH" "$CUDA_TOOLKIT_PATH"
export TF_CUDA_VERSION
break
fi
echo "Invalid path to CUDA $TF_CUDA_VERSION toolkit. ${CUDA_TOOLKIT_PATH}/${CUDA_RT_LIB_PATH} cannot be found"
if [ -z "$fromuser" ]; then
exit 1
fi
# Retry
TF_CUDA_VERSION=""
CUDA_TOOLKIT_PATH=""
done
export TF_CUDA_VERSION
write_action_env_to_bazelrc "TF_CUDA_VERSION" "$TF_CUDA_VERSION"
# Set up which gcc nvcc should use as the host compiler
# No need to set this on Windows
while [[ "$TF_CUDA_CLANG" != "1" ]] && ! is_windows && true; do
fromuser=""
if [ -z "$GCC_HOST_COMPILER_PATH" ]; then
default_gcc_host_compiler_path=$(which gcc || true)
cuda_bin_symlink="$CUDA_TOOLKIT_PATH/bin/gcc"
if [ -L "$cuda_bin_symlink" ]; then
default_gcc_host_compiler_path=$(readlink $cuda_bin_symlink)
fi
read -p "Please specify which gcc should be used by nvcc as the host compiler. [Default is $default_gcc_host_compiler_path]: " GCC_HOST_COMPILER_PATH
fromuser="1"
if [ -z "$GCC_HOST_COMPILER_PATH" ]; then
GCC_HOST_COMPILER_PATH="$default_gcc_host_compiler_path"
fi
fi
if [ -e "$GCC_HOST_COMPILER_PATH" ]; then
export GCC_HOST_COMPILER_PATH
write_action_env_to_bazelrc "GCC_HOST_COMPILER_PATH" "$GCC_HOST_COMPILER_PATH"
break
fi
echo "Invalid gcc path. ${GCC_HOST_COMPILER_PATH} cannot be found" 1>&2
if [ -z "$fromuser" ]; then
exit 1
fi
GCC_HOST_COMPILER_PATH=""
# Retry
done
# Find out where the cuDNN library is installed
while true; do
# Configure the cuDNN version to use.
if [ -z "$TF_CUDNN_VERSION" ]; then
read -p "Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 6.0]: " TF_CUDNN_VERSION
fi
# Set default CUDNN version if not set
TF_CUDNN_VERSION=${TF_CUDNN_VERSION:-6}
fromuser=""
if [ -z "$CUDNN_INSTALL_PATH" ]; then
default_cudnn_path=${CUDA_TOOLKIT_PATH}
read -p "Please specify the location where cuDNN $TF_CUDNN_VERSION library is installed. Refer to README.md for more details. [Default is $default_cudnn_path]: " CUDNN_INSTALL_PATH
fromuser="1"
if [ -z "$CUDNN_INSTALL_PATH" ]; then
CUDNN_INSTALL_PATH=$default_cudnn_path
fi
# Result returned from "read" will be used unexpanded. That make "~" unusable.
# Going through one more level of expansion to handle that.
CUDNN_INSTALL_PATH=`"${PYTHON_BIN_PATH}" -c "import os; print(os.path.realpath(os.path.expanduser('${CUDNN_INSTALL_PATH}')))"`
if is_windows; then
CUDNN_INSTALL_PATH="$(cygpath -m "$CUDNN_INSTALL_PATH")"
fi
fi
if [[ -z "$TF_CUDNN_VERSION" ]]; then
TF_CUDNN_EXT=""
else
TF_CUDNN_EXT=".$TF_CUDNN_VERSION"
fi
if is_windows; then
CUDA_DNN_LIB_PATH="lib/x64/cudnn.lib"
CUDA_DNN_LIB_ALT_PATH="lib/x64/cudnn.lib"
elif is_linux; then
CUDA_DNN_LIB_PATH="lib64/libcudnn.so${TF_CUDNN_EXT}"
CUDA_DNN_LIB_ALT_PATH="libcudnn.so${TF_CUDNN_EXT}"
elif is_macos; then
CUDA_DNN_LIB_PATH="lib/libcudnn${TF_CUDNN_EXT}.dylib"
CUDA_DNN_LIB_ALT_PATH="libcudnn${TF_CUDNN_EXT}.dylib"
fi
if [ -e "$CUDNN_INSTALL_PATH/${CUDA_DNN_LIB_ALT_PATH}" ] || [ -e "$CUDNN_INSTALL_PATH/${CUDA_DNN_LIB_PATH}" ]; then
export TF_CUDNN_VERSION
write_action_env_to_bazelrc "TF_CUDNN_VERSION" "$TF_CUDNN_VERSION"
export CUDNN_INSTALL_PATH
write_action_env_to_bazelrc "CUDNN_INSTALL_PATH" "$CUDNN_INSTALL_PATH"
break
fi
if is_linux; then
if ! type ldconfig > /dev/null 2>&1; then
LDCONFIG_BIN=/sbin/ldconfig
else
LDCONFIG_BIN=ldconfig
fi
CUDNN_PATH_FROM_LDCONFIG="$($LDCONFIG_BIN -p | sed -n 's/.*libcudnn.so .* => \(.*\)/\1/p')"
if [ -e "${CUDNN_PATH_FROM_LDCONFIG}${TF_CUDNN_EXT}" ]; then
export TF_CUDNN_VERSION
export CUDNN_INSTALL_PATH
CUDNN_INSTALL_PATH="$(dirname ${CUDNN_PATH_FROM_LDCONFIG})"
write_action_env_to_bazelrc "CUDNN_INSTALL_PATH" "$CUDNN_INSTALL_PATH"
break
fi
fi
echo "Invalid path to cuDNN ${CUDNN_VERSION} toolkit. Neither of the following two files can be found:"
echo "${CUDNN_INSTALL_PATH}/${CUDA_DNN_LIB_PATH}"
echo "${CUDNN_INSTALL_PATH}/${CUDA_DNN_LIB_ALT_PATH}"
if is_linux; then
echo "${CUDNN_PATH_FROM_LDCONFIG}${TF_CUDNN_EXT}"
fi
if [ -z "$fromuser" ]; then
exit 1
fi
# Retry
TF_CUDNN_VERSION=""
CUDNN_INSTALL_PATH=""
done
export TF_CUDNN_VERSION
write_action_env_to_bazelrc "TF_CUDNN_VERSION" "$TF_CUDNN_VERSION"
# Configure the compute capabilities that TensorFlow builds for.
# Since Cuda toolkit is not backward-compatible, this is not guaranteed to work.
function get_native_cuda_compute_capabilities {
device_query_bin="$CUDA_TOOLKIT_PATH/extras/demo_suite/deviceQuery" # Also works on Windows without .exe
"$device_query_bin" | grep 'Capability' | grep -o '[0-9]*\.[0-9]*' | sed ':a;{N;s/\n/,/};ba'
exit 0 # ensure that this function always exit success even if device detection fails, to prevent the whole configure from aborting
}
while true; do
fromuser=""
native_cuda_compute_capabilities=$(get_native_cuda_compute_capabilities)
default_cuda_compute_capabilities=${native_cuda_compute_capabilities:-"3.5,5.2"}
if [ -z "$TF_CUDA_COMPUTE_CAPABILITIES" ]; then
cat << EOF
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
EOF
read -p "[Default is: \"$default_cuda_compute_capabilities\"]: " TF_CUDA_COMPUTE_CAPABILITIES
fromuser=1
fi
if [ -z "$TF_CUDA_COMPUTE_CAPABILITIES" ]; then
TF_CUDA_COMPUTE_CAPABILITIES=$default_cuda_compute_capabilities
fi
# Check whether all capabilities from the input is valid
COMPUTE_CAPABILITIES=${TF_CUDA_COMPUTE_CAPABILITIES//,/ }
ALL_VALID=1
for CAPABILITY in $COMPUTE_CAPABILITIES; do
if [[ ! "$CAPABILITY" =~ [0-9]+.[0-9]+ ]]; then
echo "Invalid compute capability: " $CAPABILITY
ALL_VALID=0
break
fi
done
if [ "$ALL_VALID" == "0" ]; then
if [ -z "$fromuser" ]; then
exit 1
fi
else
export TF_CUDA_COMPUTE_CAPABILITIES
write_action_env_to_bazelrc "TF_CUDA_COMPUTE_CAPABILITIES" "$TF_CUDA_COMPUTE_CAPABILITIES"
break
fi
TF_CUDA_COMPUTE_CAPABILITIES=""
done
if is_windows; then
# The following three variables are needed for MSVC toolchain configuration in Bazel
export CUDA_PATH="$CUDA_TOOLKIT_PATH"
export CUDA_COMPUTE_CAPABILITIES="$TF_CUDA_COMPUTE_CAPABILITIES"
export NO_WHOLE_ARCHIVE_OPTION=1
write_action_env_to_bazelrc "CUDA_PATH" "$CUDA_PATH"
write_action_env_to_bazelrc "CUDA_COMPUTE_CAPABILITIES" "$CUDA_COMPUTE_CAPABILITIES"
write_action_env_to_bazelrc "NO_WHOLE_ARCHIVE_OPTION" "1"
write_to_bazelrc "build --config=win-cuda"
write_to_bazelrc "test --config=win-cuda"
else
# If CUDA is enabled, always use GPU during build and test.
if [ "$TF_CUDA_CLANG" == "1" ]; then
write_to_bazelrc "build --config=cuda_clang"
write_to_bazelrc "test --config=cuda_clang"
else
write_to_bazelrc "build --config=cuda"
write_to_bazelrc "test --config=cuda"
fi
fi
# end of if "$TF_NEED_CUDA" == "1"
fi
# OpenCL configuration
if [ "$TF_NEED_OPENCL" == "1" ]; then
# Determine which C++ compiler should be used as the host compiler
while true; do
fromuser=""
if [ -z "$HOST_CXX_COMPILER" ]; then
default_cxx_host_compiler=$(which g++ || true)
read -p "Please specify which C++ compiler should be used as the host C++ compiler. [Default is $default_cxx_host_compiler]: " HOST_CXX_COMPILER
fromuser="1"
if [ -z "$HOST_CXX_COMPILER" ]; then
HOST_CXX_COMPILER=$default_cxx_host_compiler
fi
fi
if [ -e "$HOST_CXX_COMPILER" ]; then
export HOST_CXX_COMPILER
write_action_env_to_bazelrc "HOST_CXX_COMPILER" "$HOST_CXX_COMPILER"
break
fi
echo "Invalid C++ compiler path. ${HOST_CXX_COMPILER} cannot be found" 1>&2
if [ -z "$fromuser" ]; then
exit 1
fi
HOST_CXX_COMPILER=""
# Retry
done
# Determine which C compiler should be used as the host compiler
while true; do
fromuser=""
if [ -z "$HOST_C_COMPILER" ]; then
default_c_host_compiler=$(which gcc || true)
read -p "Please specify which C compiler should be used as the host C compiler. [Default is $default_c_host_compiler]: " HOST_C_COMPILER
fromuser="1"
if [ -z "$HOST_C_COMPILER" ]; then
HOST_C_COMPILER=$default_c_host_compiler
fi
fi
if [ -e "$HOST_C_COMPILER" ]; then
export HOST_C_COMPILER
write_action_env_to_bazelrc "HOST_C_COMPILER" "$HOST_C_COMPILER"
break
fi
echo "Invalid C compiler path. ${HOST_C_COMPILER} cannot be found" 1>&2
if [ -z "$fromuser" ]; then
exit 1
fi
HOST_C_COMPILER=""
# Retry
done
while true; do
# Configure the OPENCL version to use.
TF_OPENCL_VERSION="1.2"
# Point to ComputeCpp root
if [ -z "$COMPUTECPP_TOOLKIT_PATH" ]; then
default_computecpp_toolkit_path=/usr/local/computecpp
read -p "Please specify the location where ComputeCpp for SYCL $TF_OPENCL_VERSION is installed. [Default is $default_computecpp_toolkit_path]: " COMPUTECPP_TOOLKIT_PATH
fromuser="1"
if [ -z "$COMPUTECPP_TOOLKIT_PATH" ]; then
COMPUTECPP_TOOLKIT_PATH=$default_computecpp_toolkit_path
fi
fi
if is_linux; then
SYCL_RT_LIB_PATH="lib/libComputeCpp.so"
fi
if [ -e "${COMPUTECPP_TOOLKIT_PATH}/${SYCL_RT_LIB_PATH}" ]; then
export COMPUTECPP_TOOLKIT_PATH
write_action_env_to_bazelrc "COMPUTECPP_TOOLKIT_PATH" "$COMPUTECPP_TOOLKIT_PATH"
break
fi
echo "Invalid SYCL $TF_OPENCL_VERSION library path. ${COMPUTECPP_TOOLKIT_PATH}/${SYCL_RT_LIB_PATH} cannot be found"
if [ -z "$fromuser" ]; then
exit 1
fi
# Retry
TF_OPENCL_VERSION=""
COMPUTECPP_TOOLKIT_PATH=""
done
# end of if "$TF_NEED_OPENCL" == "1"
fi
while [ "$TF_NEED_MPI" == "" ]; do
read -p "Do you wish to build TensorFlow with "\
"MPI support? [y/N] " INPUT
case $INPUT in
[Yy]* ) echo "MPI support will be enabled for "\
"TensorFlow"; TF_NEED_MPI=1;;
[Nn]* ) echo "MPI support will not be enabled for "\
"TensorFlow"; TF_NEED_MPI=0;;
"" ) echo "MPI support will not be enabled for "\
"TensorFlow"; TF_NEED_MPI=0;;
* ) echo "Invalid selection: " $INPUT;;
esac
done
# Find out where the MPI toolkit is installed
while true; do
if [ "$TF_NEED_MPI" == "0" ]; then
break;
fi
fromuser=""
if [ -z "$MPI_HOME" ]; then
#Get the base folder by removing the bin path
default_mpi_path=$(dirname $(dirname $(which mpirun)) || dirname $(dirname $(which mpiexec)) || true)
read -p "Please specify the MPI toolkit folder. [Default is $default_mpi_path]: " MPI_HOME
fromuser="1"
if [ -z "$MPI_HOME" ]; then
MPI_HOME=$default_mpi_path
fi
fi
#Check that the include and library folders are where we expect them to be
if [ -e "$MPI_HOME/include" ] && [ -e "$MPI_HOME/lib" ]; then
break
fi
echo "Invalid path to the MPI Toolkit. ${MPI_HOME}/include or ${MPI_HOME}/lib cannot be found."
if [ -z "$fromuser" ]; then
exit 1
fi
# Retry
MPI_HOME=""
done
if [ "$TF_NEED_MPI" == "1" ]; then
write_to_bazelrc 'build --define with_mpi_support=true'
#Link the MPI header files
ln -sf "${MPI_HOME}/include/mpi.h" third_party/mpi/mpi.h
#Determine if we use OpenMPI or MVAPICH, these require different header files
#to be included here to make bazel dependency checker happy
if [ -e "${MPI_HOME}/include/mpi_portable_platform.h" ]; then
#OpenMPI
ln -sf "${MPI_HOME}/include/mpi_portable_platform.h" third_party/mpi/
sed -i -e "s/MPI_LIB_IS_OPENMPI=False/MPI_LIB_IS_OPENMPI=True/" third_party/mpi/mpi.bzl
else
#MVAPICH / MPICH
ln -sf "${MPI_HOME}/include/mpio.h" third_party/mpi/
ln -sf "${MPI_HOME}/include/mpicxx.h" third_party/mpi/
sed -i -e "s/MPI_LIB_IS_OPENMPI=True/MPI_LIB_IS_OPENMPI=False/" third_party/mpi/mpi.bzl
fi
if [ -e "${MPI_HOME}/lib/libmpi.so" ]; then
ln -sf "${MPI_HOME}/lib/libmpi.so" third_party/mpi/
else
echo "Cannot find the MPI library file in ${MPI_HOME}/lib "
exit 1
fi
fi
echo "Configuration finished" echo "Configuration finished"

configure.py (new file)

@@ -0,0 +1,950 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""configure script to get build parameters from user."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import errno
import os
import platform
import re
import site
import subprocess
import sys
_TF_BAZELRC = '.tf_configure.bazelrc'
_DEFAULT_CUDA_VERSION = '8.0'
_DEFAULT_CUDNN_VERSION = '6'
_DEFAULT_CUDA_COMPUTE_CAPABILITIES = '3.5,5.2'
_DEFAULT_CUDA_PATH = '/usr/local/cuda'
_DEFAULT_CUDA_PATH_LINUX = '/opt/cuda'
_DEFAULT_CUDA_PATH_WIN = ('C:/Program Files/NVIDIA GPU Computing '
'Toolkit/CUDA/v%s' % _DEFAULT_CUDA_VERSION)
_TF_OPENCL_VERSION = '1.2'
_DEFAULT_COMPUTECPP_TOOLKIT_PATH = '/usr/local/computecpp'
def is_windows():
return platform.system() == 'Windows'
def is_linux():
return platform.system() == 'Linux'
def is_macos():
return platform.system() == 'Darwin'
def is_ppc64le():
return platform.machine() == 'ppc64le'
def get_input(question):
try:
try:
answer = raw_input(question)
except NameError:
answer = input(question) # pylint: disable=bad-builtin
except EOFError:
answer = ''
return answer
def symlink_force(target, link_name):
"""Force symlink, equivalent of 'ln -sf'.
Args:
target: items to link to.
link_name: name of the link.
"""
try:
os.symlink(target, link_name)
except OSError as e:
if e.errno == errno.EEXIST:
os.remove(link_name)
os.symlink(target, link_name)
else:
raise e
def sed_in_place(filename, old, new):
"""Replace old string with new string in file.
Args:
filename: string for filename.
old: string to replace.
new: new string to replace to.
"""
with open(filename, 'r') as f:
filedata = f.read()
newdata = filedata.replace(old, new)
with open(filename, 'w') as f:
f.write(newdata)
def remove_line_with(filename, token):
"""Remove lines that contain token from file.
Args:
filename: string for filename.
token: string token to check if to remove a line from file or not.
"""
with open(filename, 'r') as f:
filedata = f.read()
with open(filename, 'w') as f:
for line in filedata.strip().split('\n'):
if token not in line:
f.write(line + '\n')
def write_to_bazelrc(line):
with open(_TF_BAZELRC, 'a') as f:
f.write(line + '\n')
def write_action_env_to_bazelrc(var_name, var):
write_to_bazelrc('build --action_env %s="%s"' % (var_name, str(var)))
def run_shell(cmd):
return subprocess.check_output(cmd, shell=True).decode('UTF-8').strip()
def cygpath(path):
"""Convert path from posix to windows."""
return run_shell('cygpath -m "%s"' % path)
def get_python_path(environ_cp):
"""Get the python site package paths."""
python_paths = []
if environ_cp.get('PYTHONPATH'):
python_paths = environ_cp.get('PYTHONPATH').split(':')
try:
library_paths = site.getsitepackages()
except AttributeError:
from distutils.sysconfig import get_python_lib # pylint: disable=g-import-not-at-top
library_paths = [get_python_lib()]
all_paths = set(python_paths + library_paths)
paths = []
for path in all_paths:
if os.path.isdir(path):
paths.append(path)
return paths
def setup_python(environ_cp):
"""Setup python related env variables."""
# Get PYTHON_BIN_PATH, default is the current running python.
default_python_bin_path = sys.executable
ask_python_bin_path = ('Please specify the location of python. [Default is '
'%s]: ') % default_python_bin_path
while True:
python_bin_path = get_from_env_or_user_or_default(
environ_cp, 'PYTHON_BIN_PATH', ask_python_bin_path,
default_python_bin_path)
# Check if the path is valid
if (os.path.isfile(python_bin_path) and os.access(
python_bin_path, os.X_OK)) or (os.path.isdir(python_bin_path)):
break
elif not os.path.exists(python_bin_path):
print('Invalid python path: %s cannot be found.' % python_bin_path)
else:
print('%s is not executable. Is it the python binary?' % python_bin_path)
environ_cp['PYTHON_BIN_PATH'] = ''
# Get PYTHON_LIB_PATH
python_lib_path = environ_cp.get('PYTHON_LIB_PATH')
if not python_lib_path:
python_lib_paths = get_python_path(environ_cp)
if environ_cp.get('USE_DEFAULT_PYTHON_LIB_PATH') == '1':
environ_cp['PYTHON_LIB_PATH'] = python_lib_paths[0]
else:
print('Found possible Python library paths:\n%s' %
'\n'.join(python_lib_paths))
default_python_lib_path = python_lib_paths[0]
python_lib_path = get_input(
'Please input the desired Python library path to use. Default is %s'
% python_lib_paths[0])
if not python_lib_path:
python_lib_path = default_python_lib_path
environ_cp['PYTHON_LIB_PATH'] = python_lib_path
python_major_version = sys.version_info[0]
# Convert python path to Windows style before writing into bazel.rc
if is_windows():
python_bin_path = cygpath(python_bin_path)
python_lib_path = cygpath(python_lib_path)
# Set-up env variables used by python_configure.bzl
write_action_env_to_bazelrc('PYTHON_BIN_PATH', python_bin_path)
write_action_env_to_bazelrc('PYTHON_LIB_PATH', python_lib_path)
write_to_bazelrc('build --define PYTHON_BIN_PATH="%s"' % python_bin_path)
write_to_bazelrc('build --define PYTHON_LIB_PATH="%s"' % python_lib_path)
write_to_bazelrc('build --force_python=py%s' % python_major_version)
write_to_bazelrc('build --host_force_python=py%s' % python_major_version)
write_to_bazelrc('build --python%s_path=\"%s"' % (python_major_version,
python_bin_path))
write_to_bazelrc('test --force_python=py%s' % python_major_version)
write_to_bazelrc('test --host_force_python=py%s' % python_major_version)
write_to_bazelrc('test --define PYTHON_BIN_PATH="%s"' % python_bin_path)
write_to_bazelrc('test --define PYTHON_LIB_PATH="%s"' % python_lib_path)
write_to_bazelrc('run --define PYTHON_BIN_PATH="%s"' % python_bin_path)
write_to_bazelrc('run --define PYTHON_LIB_PATH="%s"' % python_lib_path)
environ_cp['PYTHON_BIN_PATH'] = python_bin_path
# Write tools/python_bin_path.sh
with open('tools/python_bin_path.sh', 'w') as f:
f.write('export PYTHON_BIN_PATH="%s"' % python_bin_path)
def reset_tf_configure_bazelrc():
"""Reset file that contains customized config settings."""
open(_TF_BAZELRC, 'w').close()
home = os.path.expanduser('~')
if not os.path.exists('.bazelrc'):
if os.path.exists(os.path.join(home, '.bazelrc')):
with open('.bazelrc', 'a') as f:
f.write('import %s/.bazelrc\n' % home)
else:
open('.bazelrc', 'w').close()
remove_line_with('.bazelrc', 'tf_configure')
with open('.bazelrc', 'a') as f:
f.write('import %workspace%/.tf_configure.bazelrc\n')
def run_gen_git_source(environ_cp):
"""Run the gen_git_source to create links.
The links are for bazel to track dependencies for git hash propagation.
Args:
environ_cp: copy of the os.environ.
"""
cmd = '%s tensorflow/tools/git/gen_git_source.py --configure %s' % (
environ_cp.get('PYTHON_BIN_PATH'), os.getcwd())
os.system(cmd)
def cleanup_makefile():
"""Delete any leftover BUILD files from the Makefile build.
These files could interfere with Bazel parsing.
"""
makefile_download_dir = 'tensorflow/contrib/makefile/downloads'
if os.path.isdir(makefile_download_dir):
for root, _, filenames in os.walk(makefile_download_dir):
for f in filenames:
if f.endswith('BUILD'):
os.remove(os.path.join(root, f))
def get_var(environ_cp,
var_name,
query_item,
enabled_by_default,
question=None,
yes_reply=None,
no_reply=None):
"""Get boolean input from user.
If var_name is not set in env, ask user to enable query_item or not. If the
response is empty, use the default.
Args:
environ_cp: copy of the os.environ.
var_name: string for name of environment variable, e.g. "TF_NEED_HDFS".
query_item: string for feature related to the variable, e.g. "Hadoop File
System".
enabled_by_default: boolean for default behavior.
question: optional string for how to ask for user input.
yes_reply: optional string for reply when feature is enabled.
no_reply: optional string for reply when feature is disabled.
Returns:
boolean value of the variable.
"""
if not question:
question = 'Do you wish to build TensorFlow with %s support?' % query_item
if not yes_reply:
yes_reply = '%s support will be enabled for TensorFlow.' % query_item
if not no_reply:
no_reply = 'No %s' % yes_reply
yes_reply += '\n'
no_reply += '\n'
if enabled_by_default:
question += ' [Y/n]: '
else:
question += ' [y/N]: '
var = environ_cp.get(var_name)
while var is None:
user_input_origin = get_input(question)
user_input = user_input_origin.strip().lower()
if user_input == 'y':
print(yes_reply)
var = True
elif user_input == 'n':
print(no_reply)
var = False
elif not user_input:
if enabled_by_default:
print(yes_reply)
var = True
else:
print(no_reply)
var = False
else:
print('Invalid selection: %s' % user_input_origin)
return var
def set_build_var(environ_cp, var_name, query_item, option_name,
enabled_by_default):
"""Set if query_item will be enabled for the build.
Ask user if query_item will be enabled. Default is used if no input is given.
Set subprocess environment variable and write to .bazelrc if enabled.
Args:
environ_cp: copy of the os.environ.
var_name: string for name of environment variable, e.g. "TF_NEED_HDFS".
query_item: string for feature related to the variable, e.g. "Hadoop File
System".
option_name: string for option to define in .bazelrc.
enabled_by_default: boolean for default behavior.
"""
var = str(int(get_var(environ_cp, var_name, query_item, enabled_by_default)))
environ_cp[var_name] = var
if var == '1':
write_to_bazelrc('build --define %s=true' % option_name)
def set_action_env_var(environ_cp,
var_name,
query_item,
enabled_by_default,
question=None,
yes_reply=None,
no_reply=None):
"""Set boolean action_env variable.
Ask user if query_item will be enabled. Default is used if no input is given.
Set environment variable and write to .bazelrc.
Args:
environ_cp: copy of the os.environ.
var_name: string for name of environment variable, e.g. "TF_NEED_HDFS".
query_item: string for feature related to the variable, e.g. "Hadoop File
System".
enabled_by_default: boolean for default behavior.
question: optional string for how to ask for user input.
yes_reply: optional string for reply when feature is enabled.
no_reply: optional string for reply when feature is disabled.
"""
var = int(
get_var(environ_cp, var_name, query_item, enabled_by_default, question,
yes_reply, no_reply))
write_action_env_to_bazelrc(var_name, var)
environ_cp[var_name] = str(var)
def check_bazel_version(min_version):
"""Check installed bezel version is at least min_version.
Args:
min_version: string for minimum bazel version.
"""
try:
curr_version = run_shell('bazel version')
except subprocess.CalledProcessError:
print('Cannot find bazel. Please install bazel.')
sys.exit(0)
for line in curr_version.split('\n'):
if 'Build label: ' in line:
curr_version = line.split('Build label: ')[1]
break
min_version_segments = min_version.split('.')
curr_version_segments = curr_version.split('.')
# Check if current bazel version can be detected properly.
for seg in curr_version_segments:
if not seg.isdigit():
print('WARNING: current bazel installation is not a release version.')
print('Make sure you are running at least bazel %s' % min_version)
return
min_version_str = ''.join(['%03d' % int(seg) for seg in min_version_segments])
curr_version_str = ''.join(
['%03d' % int(seg) for seg in curr_version_segments])
if int(curr_version_str) < int(min_version_str):
print('Please upgrade your bazel installation to version %s or higher to '
'build TensorFlow!' % min_version)
sys.exit(0)
def set_cc_opt_flags(environ_cp):
"""Set up architecture-dependent optimization flags.
Also append CC optimization flags to bazel.rc.
Args:
environ_cp: copy of the os.environ.
"""
if is_ppc64le():
# gcc on ppc64le does not support -march, use mcpu instead
default_cc_opt_flags = '-mcpu=native'
else:
default_cc_opt_flags = '-march=native'
question = ('Please specify optimization flags to use during compilation when'
' bazel option "--config=opt" is specified [Default is %s]: '
) % default_cc_opt_flags
cc_opt_flags = get_from_env_or_user_or_default(environ_cp, 'CC_OPT_FLAGS',
question, default_cc_opt_flags)
for opt in cc_opt_flags.split():
write_to_bazelrc('build:opt --cxxopt=%s --copt=%s' % (opt, opt))
def set_tf_cuda_clang(environ_cp):
"""set TF_CUDA_CLANG action_env.
Args:
environ_cp: copy of the os.environ.
"""
question = 'Do you want to use clang as CUDA compiler?'
yes_reply = 'Clang will be used as CUDA compiler.'
no_reply = 'nvcc will be used as CUDA compiler.'
set_action_env_var(
environ_cp,
'TF_CUDA_CLANG',
None,
False,
question=question,
yes_reply=yes_reply,
no_reply=no_reply)
def get_from_env_or_user_or_default(environ_cp, var_name, ask_for_var,
var_default):
"""Get var_name either from env, or user or default.
If var_name has been set as environment variable, use the preset value, else
ask for user input. If no input is provided, the default is used.
Args:
environ_cp: copy of the os.environ.
var_name: string for name of environment variable, e.g. "TF_NEED_HDFS".
ask_for_var: string for how to ask for user input.
var_default: default value string.
Returns:
string value for var_name
"""
var = environ_cp.get(var_name)
if not var:
var = get_input(ask_for_var)
if not var:
var = var_default
return var
def set_clang_cuda_compiler_path(environ_cp):
"""Set CLANG_CUDA_COMPILER_PATH."""
default_clang_path = run_shell('which clang || true')
ask_clang_path = ('Please specify which clang should be used as device and '
'host compiler. [Default is %s]: ') % default_clang_path
while True:
clang_cuda_compiler_path = get_from_env_or_user_or_default(
environ_cp, 'CLANG_CUDA_COMPILER_PATH', ask_clang_path,
default_clang_path)
if os.path.exists(clang_cuda_compiler_path):
break
# Reset and retry
print('Invalid clang path: %s cannot be found.' % clang_cuda_compiler_path)
environ_cp['CLANG_CUDA_COMPILER_PATH'] = ''
# Set CLANG_CUDA_COMPILER_PATH
environ_cp['CLANG_CUDA_COMPILER_PATH'] = clang_cuda_compiler_path
write_action_env_to_bazelrc('CLANG_CUDA_COMPILER_PATH',
clang_cuda_compiler_path)
def set_gcc_host_compiler_path(environ_cp):
"""Set GCC_HOST_COMPILER_PATH."""
default_gcc_host_compiler_path = run_shell('which gcc || true')
cuda_bin_symlink = '%s/bin/gcc' % environ_cp.get('CUDA_TOOLKIT_PATH')
if os.path.islink(cuda_bin_symlink):
# os.readlink is only available in linux
default_gcc_host_compiler_path = run_shell('readlink %s' % cuda_bin_symlink)
ask_gcc_path = (
'Please specify which gcc should be used by nvcc as the '
'host compiler. [Default is %s]: ') % default_gcc_host_compiler_path
while True:
gcc_host_compiler_path = get_from_env_or_user_or_default(
environ_cp, 'GCC_HOST_COMPILER_PATH', ask_gcc_path,
default_gcc_host_compiler_path)
if os.path.exists(gcc_host_compiler_path):
break
# Reset and retry
print('Invalid gcc path. %s cannot be found' % gcc_host_compiler_path)
environ_cp['GCC_HOST_COMPILER_PATH'] = ''
# Set GCC_HOST_COMPILER_PATH
environ_cp['GCC_HOST_COMPILER_PATH'] = gcc_host_compiler_path
write_action_env_to_bazelrc('GCC_HOST_COMPILER_PATH', gcc_host_compiler_path)
def set_tf_cuda_version(environ_cp):
"""Set CUDA_TOOLKIT_PATH and TF_CUDA_VERSION."""
ask_cuda_version = (
'Please specify the CUDA SDK version you want to use, '
'e.g. 7.0. [Leave empty to default to CUDA %s]: ') % _DEFAULT_CUDA_VERSION
while True:
# Configure the Cuda SDK version to use.
tf_cuda_version = get_from_env_or_user_or_default(
environ_cp, 'TF_CUDA_VERSION', ask_cuda_version, _DEFAULT_CUDA_VERSION)
# Find out where the CUDA toolkit is installed
default_cuda_path = _DEFAULT_CUDA_PATH
if is_windows():
default_cuda_path = cygpath(
environ_cp.get('CUDA_PATH', _DEFAULT_CUDA_PATH_WIN))
elif is_linux():
# If the default doesn't exist, try an alternative default.
if (not os.path.exists(default_cuda_path)
) and os.path.exists(_DEFAULT_CUDA_PATH_LINUX):
default_cuda_path = _DEFAULT_CUDA_PATH_LINUX
ask_cuda_path = ('Please specify the location where CUDA %s toolkit is'
' installed. Refer to README.md for more details. '
'[Default is %s]: ') % (tf_cuda_version, default_cuda_path)
cuda_toolkit_path = get_from_env_or_user_or_default(
environ_cp, 'CUDA_TOOLKIT_PATH', ask_cuda_path, default_cuda_path)
if is_windows():
cuda_rt_lib_path = 'lib/x64/cudart.lib'
elif is_linux():
cuda_rt_lib_path = 'lib64/libcudart.so.%s' % tf_cuda_version
elif is_macos():
cuda_rt_lib_path = 'lib/libcudart.%s.dylib' % tf_cuda_version
cuda_toolkit_path_full = os.path.join(cuda_toolkit_path, cuda_rt_lib_path)
if os.path.exists(cuda_toolkit_path_full):
break
# Reset and retry
print('Invalid path to CUDA %s toolkit. %s cannot be found' %
(tf_cuda_version, cuda_toolkit_path_full))
environ_cp['TF_CUDA_VERSION'] = ''
environ_cp['CUDA_TOOLKIT_PATH'] = ''
# Set CUDA_TOOLKIT_PATH and TF_CUDA_VERSION
environ_cp['CUDA_TOOLKIT_PATH'] = cuda_toolkit_path
write_action_env_to_bazelrc('CUDA_TOOLKIT_PATH', cuda_toolkit_path)
environ_cp['TF_CUDA_VERSION'] = tf_cuda_version
write_action_env_to_bazelrc('TF_CUDA_VERSION', tf_cuda_version)
def set_tf_cunn_version(environ_cp):
"""Set CUDNN_INSTALL_PATH and TF_CUDNN_VERSION."""
ask_cudnn_version = (
'"Please specify the cuDNN version you want to use. '
'[Leave empty to default to cuDNN %s.0]: ') % _DEFAULT_CUDNN_VERSION
while True:
tf_cudnn_version = get_from_env_or_user_or_default(
environ_cp, 'TF_CUDNN_VERSION', ask_cudnn_version,
_DEFAULT_CUDNN_VERSION)
default_cudnn_path = environ_cp.get('CUDA_TOOLKIT_PATH')
ask_cudnn_path = (r'Please specify the location where cuDNN %s library is '
'installed. Refer to README.md for more details. [Default'
' is %s]:') % (tf_cudnn_version, default_cudnn_path)
cudnn_install_path = get_from_env_or_user_or_default(
environ_cp, 'CUDNN_INSTALL_PATH', ask_cudnn_path, default_cudnn_path)
# Result returned from "read" will be used unexpanded. That make "~"
# unusable. Going through one more level of expansion to handle that.
cudnn_install_path = os.path.realpath(
os.path.expanduser(cudnn_install_path))
if is_windows():
cudnn_install_path = cygpath(cudnn_install_path)
if is_windows():
cuda_dnn_lib_path = 'lib/x64/cudnn.lib'
cuda_dnn_lib_alt_path = 'lib/x64/cudnn.lib'
elif is_linux():
cuda_dnn_lib_path = 'lib64/libcudnn.so.%s' % tf_cudnn_version
cuda_dnn_lib_alt_path = 'libcudnn.so.%s' % tf_cudnn_version
elif is_macos():
cuda_dnn_lib_path = 'lib/libcudnn.%s.dylib' % tf_cudnn_version
cuda_dnn_lib_alt_path = 'libcudnn.%s.dylib' % tf_cudnn_version
cuda_dnn_lib_path_full = os.path.join(cudnn_install_path, cuda_dnn_lib_path)
cuda_dnn_lib_alt_path_full = os.path.join(cudnn_install_path,
cuda_dnn_lib_alt_path)
if os.path.exists(cuda_dnn_lib_path_full) or os.path.exists(
cuda_dnn_lib_alt_path_full):
break
# Try another alternative for Linux
if is_linux():
if subprocess.call(['which', 'ldconfig']):
ldconfig_bin = '/sbin/ldconfig'
else:
ldconfig_bin = 'ldconfig'
cudnn_path_from_ldconfig = run_shell(
r'%s -p | sed -n "s/.*libcudnn.so .* => \(.*\)/\\1/p"' % ldconfig_bin)
if os.path.exists('%s.%s' % (cudnn_path_from_ldconfig, tf_cudnn_version)):
cudnn_install_path = os.path.dirname(cudnn_path_from_ldconfig)
break
# Reset and Retry
print(
'Invalid path to cuDNN %s toolkit. None of the following files can be '
'found:' % tf_cudnn_version)
print(cuda_dnn_lib_path_full)
print(cuda_dnn_lib_alt_path_full)
if is_linux():
print('%s.%s' % (cudnn_path_from_ldconfig, tf_cudnn_version))
environ_cp['TF_CUDNN_VERSION'] = ''
# Set CUDNN_INSTALL_PATH and TF_CUDNN_VERSION
environ_cp['CUDNN_INSTALL_PATH'] = cudnn_install_path
write_action_env_to_bazelrc('CUDNN_INSTALL_PATH', cudnn_install_path)
environ_cp['TF_CUDNN_VERSION'] = tf_cudnn_version
write_action_env_to_bazelrc('TF_CUDNN_VERSION', tf_cudnn_version)
def get_native_cuda_compute_capabilities(environ_cp):
"""Get native cuda compute capabilities.
Args:
environ_cp: copy of the os.environ.
Returns:
string of native cuda compute capabilities, separated by comma.
"""
device_query_bin = os.path.join(
environ_cp.get('CUDA_TOOLKIT_PATH'), 'extras/demo_suite/deviceQuery')
cmd = (r'"%s" | grep "Capability" | grep -o "[0-9]*\.[0-9]*" | sed '
'":a;{N;s/\\n/,/};ba"') % device_query_bin
try:
output = run_shell(cmd)
except subprocess.CalledProcessError:
output = ''
return output
def set_tf_cuda_compute_capabilities(environ_cp):
"""Set TF_CUDA_COMPUTE_CAPABILITIES."""
while True:
native_cuda_compute_capabilities = get_native_cuda_compute_capabilities(
environ_cp)
if not native_cuda_compute_capabilities:
default_cuda_compute_capabilities = _DEFAULT_CUDA_COMPUTE_CAPABILITIES
else:
default_cuda_compute_capabilities = native_cuda_compute_capabilities
ask_cuda_compute_capabilities = (
'Please specify a list of comma-separated '
'Cuda compute capabilities you want to '
'build with.\nYou can find the compute '
'capability of your device at: '
'https://developer.nvidia.com/cuda-gpus.\nPlease'
' note that each additional compute '
'capability significantly increases your '
'build time and binary size. [Default is: %s]' %
default_cuda_compute_capabilities)
tf_cuda_compute_capabilities = get_from_env_or_user_or_default(
environ_cp, 'TF_CUDA_COMPUTE_CAPABILITIES',
ask_cuda_compute_capabilities, default_cuda_compute_capabilities)
# Check whether all capabilities from the input is valid
all_valid = True
for compute_capability in tf_cuda_compute_capabilities.split(','):
if not re.match('[0-9]+.[0-9]+', compute_capability):
print('Invalid compute capability: %s' % compute_capability)
all_valid = False
if all_valid:
break
# Reset and Retry
environ_cp['TF_CUDA_COMPUTE_CAPABILITIES'] = ''
# Set TF_CUDA_COMPUTE_CAPABILITIES
environ_cp['TF_CUDA_COMPUTE_CAPABILITIES'] = tf_cuda_compute_capabilities
write_action_env_to_bazelrc('TF_CUDA_COMPUTE_CAPABILITIES',
tf_cuda_compute_capabilities)
def set_other_cuda_vars(environ_cp):
"""Set other CUDA related variables."""
if is_windows():
# The following three variables are needed for MSVC toolchain configuration
# in Bazel
environ_cp['CUDA_PATH'] = environ_cp.get('CUDA_TOOLKIT_PATH')
environ_cp['CUDA_COMPUTE_CAPABILITIES'] = environ_cp.get(
'TF_CUDA_COMPUTE_CAPABILITIES')
environ_cp['NO_WHOLE_ARCHIVE_OPTION'] = 1
write_action_env_to_bazelrc('CUDA_PATH', environ_cp.get('CUDA_PATH'))
write_action_env_to_bazelrc('CUDA_COMPUTE_CAPABILITIES',
environ_cp.get('CUDA_COMPUTE_CAPABILITIES'))
write_action_env_to_bazelrc('NO_WHOLE_ARCHIVE_OPTION',
environ_cp.get('NO_WHOLE_ARCHIVE_OPTION'))
write_to_bazelrc('build --config=win-cuda')
write_to_bazelrc('test --config=win-cuda')
else:
# If CUDA is enabled, always use GPU during build and test.
if environ_cp.get('TF_CUDA_CLANG') == '1':
write_to_bazelrc('build --config=cuda_clang')
write_to_bazelrc('test --config=cuda_clang')
else:
write_to_bazelrc('build --config=cuda')
write_to_bazelrc('test --config=cuda')
def set_host_cxx_compiler(environ_cp):
"""Set HOST_CXX_COMPILER."""
default_cxx_host_compiler = run_shell('which g++ || true')
ask_cxx_host_compiler = (
'Please specify which C++ compiler should be used as'
' the host C++ compiler. [Default is %s]: ') % default_cxx_host_compiler
while True:
host_cxx_compiler = get_from_env_or_user_or_default(
environ_cp, 'HOST_CXX_COMPILER', ask_cxx_host_compiler,
default_cxx_host_compiler)
if os.path.exists(host_cxx_compiler):
break
# Reset and retry
print('Invalid C++ compiler path. %s cannot be found' % host_cxx_compiler)
environ_cp['HOST_CXX_COMPILER'] = ''
# Set HOST_CXX_COMPILER
environ_cp['HOST_CXX_COMPILER'] = host_cxx_compiler
write_action_env_to_bazelrc('HOST_CXX_COMPILER', host_cxx_compiler)
def set_host_c_compiler(environ_cp):
"""Set HOST_C_COMPILER."""
default_c_host_compiler = run_shell('which gcc || true')
ask_c_host_compiler = (
'Please specify which C compiler should be used as the'
' host C compiler. [Default is %s]: ') % default_c_host_compiler
while True:
host_c_compiler = get_from_env_or_user_or_default(
environ_cp, 'HOST_C_COMPILER', ask_c_host_compiler,
default_c_host_compiler)
if os.path.exists(host_c_compiler):
break
# Reset and retry
print('Invalid C compiler path. %s cannot be found' % host_c_compiler)
environ_cp['HOST_C_COMPILER'] = ''
# Set HOST_C_COMPILER
environ_cp['HOST_C_COMPILER'] = host_c_compiler
write_action_env_to_bazelrc('HOST_C_COMPILER', host_c_compiler)
def set_computecpp_toolkit_path(environ_cp):
"""Set COMPUTECPP_TOOLKIT_PATH."""
ask_computecpp_toolkit_path = ('Please specify the location where ComputeCpp '
'for SYCL %s is installed. [Default is %s]: '
) % (_TF_OPENCL_VERSION,
_DEFAULT_COMPUTECPP_TOOLKIT_PATH)
while True:
computecpp_toolkit_path = get_from_env_or_user_or_default(
environ_cp, 'COMPUTECPP_TOOLKIT_PATH', ask_computecpp_toolkit_path,
_DEFAULT_COMPUTECPP_TOOLKIT_PATH)
if is_linux():
sycl_rt_lib_path = 'lib/libComputeCpp.so'
else:
sycl_rt_lib_path = ''
sycl_rt_lib_path_full = os.path.join(computecpp_toolkit_path,
sycl_rt_lib_path)
if os.path.exists(sycl_rt_lib_path_full):
break
print('Invalid SYCL %s library path. %s cannot be found' %
(_TF_OPENCL_VERSION, sycl_rt_lib_path_full))
environ_cp['COMPUTECPP_TOOLKIT_PATH'] = ''
# Set COMPUTECPP_TOOLKIT_PATH
environ_cp['COMPUTECPP_TOOLKIT_PATH'] = computecpp_toolkit_path
write_action_env_to_bazelrc('COMPUTECPP_TOOLKIT_PATH',
computecpp_toolkit_path)
def set_mpi_home(environ_cp):
"""Set MPI_HOME."""
cmd = ('dirname $(dirname $(which mpirun)) || dirname $(dirname $(which '
'mpiexec)) || true')
default_mpi_home = run_shell(cmd)
ask_mpi_home = ('Please specify the MPI toolkit folder. [Default is %s]: '
) % default_mpi_home
while True:
mpi_home = get_from_env_or_user_or_default(environ_cp, 'MPI_HOME',
ask_mpi_home, default_mpi_home)
if os.path.exists(os.path.join(mpi_home, 'include')) and os.path.exists(
os.path.join(mpi_home, 'lib')):
break
    print('Invalid path to the MPI Toolkit. %s or %s cannot be found' %
          (os.path.join(mpi_home, 'include'),
           os.path.join(mpi_home, 'lib')))
environ_cp['MPI_HOME'] = ''
# Set MPI_HOME
environ_cp['MPI_HOME'] = str(mpi_home)
def set_other_mpi_vars(environ_cp):
"""Set other MPI related variables."""
# Link the MPI header files
mpi_home = environ_cp.get('MPI_HOME')
symlink_force('%s/include/mpi.h' % mpi_home, 'third_party/mpi/mpi.h')
  # Determine whether we are using OpenMPI or MVAPICH; these require different
  # header files to be included here to keep the bazel dependency checker happy.
if os.path.exists(os.path.join(mpi_home, 'include/mpi_portable_platform.h')):
symlink_force(
os.path.join(mpi_home, 'include/mpi_portable_platform.h'),
'third_party/mpi/mpi_portable_platform.h')
# TODO(gunan): avoid editing files in configure
sed_in_place('third_party/mpi/mpi.bzl', 'MPI_LIB_IS_OPENMPI=False',
'MPI_LIB_IS_OPENMPI=True')
else:
# MVAPICH / MPICH
symlink_force(
os.path.join(mpi_home, 'include/mpio.h'), 'third_party/mpi/mpio.h')
symlink_force(
os.path.join(mpi_home, 'include/mpicxx.h'), 'third_party/mpi/mpicxx.h')
# TODO(gunan): avoid editing files in configure
sed_in_place('third_party/mpi/mpi.bzl', 'MPI_LIB_IS_OPENMPI=True',
'MPI_LIB_IS_OPENMPI=False')
if os.path.exists(os.path.join(mpi_home, 'lib/libmpi.so')):
symlink_force(
os.path.join(mpi_home, 'lib/libmpi.so'), 'third_party/mpi/libmpi.so')
else:
raise ValueError('Cannot find the MPI library file in %s/lib' % mpi_home)
def set_mkl():
write_to_bazelrc('build:mkl --define with_mkl_support=true')
write_to_bazelrc('build:mkl --define using_mkl=true')
write_to_bazelrc('build:mkl -c opt')
write_to_bazelrc('build:mkl --copt="-DEIGEN_USE_VML"')
  print(
      'Add "--config=mkl" to your bazel command to build with MKL '
      'support.\nPlease note that MKL on macOS or Windows is not yet '
      'supported.\nIf you would like to use a local MKL instead of '
      'downloading, please set the environment variable "TF_MKL_ROOT" before '
      'each build.')
def main():
  # Make a copy of os.environ so that it is explicit when functions get and set
  # environment variables.
environ_cp = dict(os.environ)
check_bazel_version('0.4.5')
reset_tf_configure_bazelrc()
cleanup_makefile()
setup_python(environ_cp)
run_gen_git_source(environ_cp)
if is_windows():
environ_cp['TF_NEED_GCP'] = '0'
environ_cp['TF_NEED_HDFS'] = '0'
environ_cp['TF_NEED_JEMALLOC'] = '0'
environ_cp['TF_NEED_OPENCL'] = '0'
environ_cp['TF_CUDA_CLANG'] = '0'
if is_macos():
environ_cp['TF_NEED_JEMALLOC'] = '0'
set_build_var(environ_cp, 'TF_NEED_JEMALLOC', 'jemalloc as malloc',
'with_jemalloc', True)
set_build_var(environ_cp, 'TF_NEED_GCP', 'Google Cloud Platform',
'with_gcp_support', False)
set_build_var(environ_cp, 'TF_NEED_HDFS', 'Hadoop File System',
'with_hdfs_support', False)
set_build_var(environ_cp, 'TF_ENABLE_XLA', 'XLA JIT', 'with_xla_support',
False)
set_build_var(environ_cp, 'TF_NEED_VERBS', 'VERBS', 'with_verbs_support',
False)
set_action_env_var(environ_cp, 'TF_NEED_OPENCL', 'OpenCL', False)
if environ_cp.get('TF_NEED_OPENCL') == '1':
set_host_cxx_compiler(environ_cp)
set_host_c_compiler(environ_cp)
set_computecpp_toolkit_path(environ_cp)
set_action_env_var(environ_cp, 'TF_NEED_CUDA', 'CUDA', False)
if environ_cp.get('TF_NEED_CUDA') == '1':
set_tf_cuda_version(environ_cp)
set_tf_cunn_version(environ_cp)
set_tf_cuda_compute_capabilities(environ_cp)
set_tf_cuda_clang(environ_cp)
if environ_cp.get('TF_CUDA_CLANG') == '1':
# Set up which clang we should use as the cuda / host compiler.
set_clang_cuda_compiler_path(environ_cp)
else:
# Set up which gcc nvcc should use as the host compiler
# No need to set this on Windows
if not is_windows():
set_gcc_host_compiler_path(environ_cp)
set_other_cuda_vars(environ_cp)
set_build_var(environ_cp, 'TF_NEED_MPI', 'MPI', 'with_mpi_support', False)
if environ_cp.get('TF_NEED_MPI') == '1':
set_mpi_home(environ_cp)
set_other_mpi_vars(environ_cp)
set_cc_opt_flags(environ_cp)
set_mkl()
if __name__ == '__main__':
main()

View File

@ -56,21 +56,16 @@ limitations under the License.
// The implementation below is at the top level instead of the // The implementation below is at the top level instead of the
// brain namespace because we are defining 'extern "C"' functions. // brain namespace because we are defining 'extern "C"' functions.
using tensorflow::error::Code;
using tensorflow::errors::InvalidArgument;
using tensorflow::gtl::ArraySlice;
using tensorflow::strings::StrCat;
using tensorflow::AllocationDescription; using tensorflow::AllocationDescription;
using tensorflow::DataType; using tensorflow::DataType;
using tensorflow::Graph; using tensorflow::Graph;
using tensorflow::GraphDef; using tensorflow::GraphDef;
using tensorflow::mutex_lock;
using tensorflow::NameRangeMap; using tensorflow::NameRangeMap;
using tensorflow::NameRangesForNode; using tensorflow::NameRangesForNode;
using tensorflow::NewSession; using tensorflow::NewSession;
using tensorflow::Node; using tensorflow::Node;
using tensorflow::NodeDef;
using tensorflow::NodeBuilder; using tensorflow::NodeBuilder;
using tensorflow::NodeDef;
using tensorflow::OpDef; using tensorflow::OpDef;
using tensorflow::OpRegistry; using tensorflow::OpRegistry;
using tensorflow::PartialTensorShape; using tensorflow::PartialTensorShape;
@ -83,6 +78,11 @@ using tensorflow::TensorBuffer;
using tensorflow::TensorId; using tensorflow::TensorId;
using tensorflow::TensorShape; using tensorflow::TensorShape;
using tensorflow::TensorShapeProto; using tensorflow::TensorShapeProto;
using tensorflow::error::Code;
using tensorflow::errors::InvalidArgument;
using tensorflow::gtl::ArraySlice;
using tensorflow::mutex_lock;
using tensorflow::strings::StrCat;
extern "C" { extern "C" {
@ -258,24 +258,27 @@ size_t TF_StringEncode(const char* src, size_t src_len, char* dst,
return sz; return sz;
} }
size_t TF_StringDecode(const char* src, size_t src_len, const char** dst, static Status TF_StringDecode_Impl(const char* src, size_t src_len,
size_t* dst_len, TF_Status* status) { const char** dst, size_t* dst_len) {
tensorflow::uint64 len64 = 0; tensorflow::uint64 len64 = 0;
const char* p = tensorflow::core::GetVarint64Ptr(src, src + src_len, &len64); const char* p = tensorflow::core::GetVarint64Ptr(src, src + src_len, &len64);
if (p == nullptr) { if (p == nullptr) {
status->status = return InvalidArgument("invalid string encoding or truncated src buffer");
InvalidArgument("invalid string encoding or truncated src buffer");
return 0;
} }
if (len64 > std::numeric_limits<size_t>::max()) { if (len64 > std::numeric_limits<size_t>::max()) {
status->status = return InvalidArgument("encoded string is ", len64,
InvalidArgument("encoded string is ", len64,
"-bytes, which is too large for this architecture"); "-bytes, which is too large for this architecture");
return 0;
} }
*dst = p; *dst = p;
*dst_len = static_cast<size_t>(len64); *dst_len = static_cast<size_t>(len64);
return static_cast<size_t>(p - src) + *dst_len; return Status::OK();
}
size_t TF_StringDecode(const char* src, size_t src_len, const char** dst,
size_t* dst_len, TF_Status* status) {
status->status = TF_StringDecode_Impl(src, src_len, dst, dst_len);
if (!status->status.ok()) return 0;
return static_cast<size_t>(*dst - src) + *dst_len;
} }
size_t TF_StringEncodedSize(size_t len) { size_t TF_StringEncodedSize(size_t len) {
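For context, a minimal sketch of how a client drives this encode/decode pair through the public C API. Error checks via TF_GetCode(status) are omitted for brevity, and the fixed buffer size is an assumption that comfortably fits the example string:

  #include <stdio.h>
  #include <string.h>

  #include "tensorflow/c/c_api.h"

  int main(void) {
    const char* src = "hello";
    size_t src_len = strlen(src);
    char buf[64];  // Comfortably larger than TF_StringEncodedSize(src_len).

    TF_Status* status = TF_NewStatus();
    // Writes a varint length prefix followed by the string bytes.
    size_t written = TF_StringEncode(src, src_len, buf, sizeof(buf), status);

    // TF_StringDecode does not copy: 'decoded' points into 'buf'.
    const char* decoded = NULL;
    size_t decoded_len = 0;
    TF_StringDecode(buf, written, &decoded, &decoded_len, status);
    printf("decoded: %.*s\n", (int)decoded_len, decoded);

    TF_DeleteStatus(status);
    return 0;
  }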
@ -391,16 +394,20 @@ void TF_Reset(const TF_SessionOptions* opt, const char** containers,
namespace tensorflow { namespace tensorflow {
// Non-static for testing. Status TF_TensorToTensor(const TF_Tensor* src, Tensor* dst) {
bool TF_Tensor_DecodeStrings(TF_Tensor* src, Tensor* dst, TF_Status* status) { if (src->dtype != TF_STRING) {
*dst = TensorCApi::MakeTensor(src->dtype, src->shape, src->buffer);
return Status::OK();
}
// TF_STRING tensors require copying since Tensor class expects a sequence of
// string objects.
const tensorflow::int64 num_elements = src->shape.num_elements(); const tensorflow::int64 num_elements = src->shape.num_elements();
const char* input = reinterpret_cast<const char*>(TF_TensorData(src)); const char* input = reinterpret_cast<const char*>(TF_TensorData(src));
const size_t src_size = TF_TensorByteSize(src); const size_t src_size = TF_TensorByteSize(src);
if (static_cast<tensorflow::int64>(src_size / sizeof(tensorflow::uint64)) < if (static_cast<tensorflow::int64>(src_size / sizeof(tensorflow::uint64)) <
num_elements) { num_elements) {
status->status = InvalidArgument( return InvalidArgument(
"Malformed TF_STRING tensor; too short to hold number of elements"); "Malformed TF_STRING tensor; too short to hold number of elements");
return false;
} }
const char* data_start = input + sizeof(tensorflow::uint64) * num_elements; const char* data_start = input + sizeof(tensorflow::uint64) * num_elements;
const char* limit = input + src_size; const char* limit = input + src_size;
@ -411,24 +418,30 @@ bool TF_Tensor_DecodeStrings(TF_Tensor* src, Tensor* dst, TF_Status* status) {
tensorflow::uint64 offset = tensorflow::uint64 offset =
reinterpret_cast<const tensorflow::uint64*>(input)[i]; reinterpret_cast<const tensorflow::uint64*>(input)[i];
if (static_cast<ptrdiff_t>(offset) >= (limit - data_start)) { if (static_cast<ptrdiff_t>(offset) >= (limit - data_start)) {
status->status = InvalidArgument("Malformed TF_STRING tensor; element ", return InvalidArgument("Malformed TF_STRING tensor; element ", i,
i, " out of range"); " out of range");
return false;
} }
size_t len; size_t len;
const char* p; const char* p;
const char* srcp = data_start + offset; const char* srcp = data_start + offset;
TF_StringDecode(srcp, limit - srcp, &p, &len, status); Status status = TF_StringDecode_Impl(srcp, limit - srcp, &p, &len);
if (!status->status.ok()) { if (!status.ok()) return status;
return false;
}
dstarray(i).assign(p, len); dstarray(i).assign(p, len);
} }
return true; return Status::OK();
} }
// Non-static for testing. // Non-static for testing.
TF_Tensor* TF_Tensor_EncodeStrings(const Tensor& src) { TF_Tensor* TF_TensorFromTensor(const tensorflow::Tensor& src) {
if (src.dtype() != DT_STRING) {
TensorBuffer* buf = TensorCApi::Buffer(src);
buf->Ref();
return new TF_Tensor{static_cast<TF_DataType>(src.dtype()), src.shape(),
buf};
}
// DT_STRING tensors require a copying since TF_Tensor.buffer expects a flatly
// encoded sequence of strings.
// Compute bytes needed for encoding. // Compute bytes needed for encoding.
size_t size = 0; size_t size = 0;
const auto& srcarray = src.flat<tensorflow::string>(); const auto& srcarray = src.flat<tensorflow::string>();
@ -507,16 +520,8 @@ static bool TF_Run_Inputs(
TF_Status* status) { TF_Status* status) {
const int ninputs = input_pairs->size(); const int ninputs = input_pairs->size();
for (int i = 0; i < ninputs; ++i) { for (int i = 0; i < ninputs; ++i) {
TF_Tensor* src = c_inputs[i]; status->status = TF_TensorToTensor(c_inputs[i], &(*input_pairs)[i].second);
if (c_inputs[i]->dtype != TF_STRING) { if (!status->status.ok()) return false;
(*input_pairs)[i].second = tensorflow::TensorCApi::MakeTensor(
src->dtype, src->shape, src->buffer);
} else if (!tensorflow::TF_Tensor_DecodeStrings(
src, &(*input_pairs)[i].second, status)) {
// TF_STRING tensors require copying since Tensor class expects
// a sequence of string objects.
return false;
}
} }
return true; return true;
} }
@ -574,15 +579,7 @@ static void TF_Run_Helper(
static_cast<TF_DataType>(src.dtype()), src.shape()); static_cast<TF_DataType>(src.dtype()), src.shape());
continue; continue;
} }
if (src.dtype() != tensorflow::DT_STRING) { c_outputs[i] = TF_TensorFromTensor(src);
// Share the underlying buffer.
TensorBuffer* buf = tensorflow::TensorCApi::Buffer(src);
buf->Ref();
c_outputs[i] = new TF_Tensor{static_cast<TF_DataType>(src.dtype()),
src.shape(), buf};
} else {
c_outputs[i] = tensorflow::TF_Tensor_EncodeStrings(src);
}
} }
} }
@ -1062,20 +1059,9 @@ void TF_SetAttrTensorShapeProtoList(TF_OperationDescription* desc,
void TF_SetAttrTensor(TF_OperationDescription* desc, const char* attr_name, void TF_SetAttrTensor(TF_OperationDescription* desc, const char* attr_name,
TF_Tensor* value, TF_Status* status) { TF_Tensor* value, TF_Status* status) {
status->status = Status::OK();
Tensor t; Tensor t;
bool ok = true; status->status = TF_TensorToTensor(value, &t);
if (status->status.ok()) desc->node_builder.Attr(attr_name, t);
if (value->dtype != TF_STRING) {
t = tensorflow::TensorCApi::MakeTensor(value->dtype, value->shape,
value->buffer);
} else {
// TF_STRING tensors require copying since Tensor class expects
// a sequence of string objects.
ok = tensorflow::TF_Tensor_DecodeStrings(value, &t, status);
}
if (ok) desc->node_builder.Attr(attr_name, t);
} }
void TF_SetAttrTensorList(TF_OperationDescription* desc, const char* attr_name, void TF_SetAttrTensorList(TF_OperationDescription* desc, const char* attr_name,
@ -1084,21 +1070,14 @@ void TF_SetAttrTensorList(TF_OperationDescription* desc, const char* attr_name,
status->status = Status::OK(); status->status = Status::OK();
std::vector<Tensor> t; std::vector<Tensor> t;
t.reserve(num_values); t.reserve(num_values);
bool ok = true;
for (int i = 0; i < num_values && ok; ++i) { for (int i = 0; i < num_values && status->status.ok(); ++i) {
if (values[i]->dtype != TF_STRING) { Tensor v;
t.emplace_back(tensorflow::TensorCApi::MakeTensor( status->status = TF_TensorToTensor(values[i], &v);
values[i]->dtype, values[i]->shape, values[i]->buffer)); t.emplace_back(v);
} else {
t.emplace_back(::tensorflow::DT_STRING);
// TF_STRING tensors require copying since Tensor class expects
// a sequence of string objects.
ok = tensorflow::TF_Tensor_DecodeStrings(values[i], &t.back(), status);
}
} }
if (ok) desc->node_builder.Attr(attr_name, t); if (status->status.ok()) desc->node_builder.Attr(attr_name, t);
} }
void TF_SetAttrValueProto(TF_OperationDescription* desc, const char* attr_name, void TF_SetAttrValueProto(TF_OperationDescription* desc, const char* attr_name,
@ -1555,9 +1534,7 @@ void TF_OperationGetAttrTensor(TF_Operation* oper, const char* attr_name,
Tensor t; Tensor t;
status->status = tensorflow::GetNodeAttr(oper->node.attrs(), attr_name, &t); status->status = tensorflow::GetNodeAttr(oper->node.attrs(), attr_name, &t);
if (!status->status.ok()) return; if (!status->status.ok()) return;
*value = new TF_Tensor{static_cast<TF_DataType>(t.dtype()), t.shape(), *value = TF_TensorFromTensor(t);
tensorflow::TensorCApi::Buffer(t)};
(*value)->buffer->Ref();
} }
void TF_OperationGetAttrTensorList(TF_Operation* oper, const char* attr_name, void TF_OperationGetAttrTensorList(TF_Operation* oper, const char* attr_name,
@ -1568,10 +1545,7 @@ void TF_OperationGetAttrTensorList(TF_Operation* oper, const char* attr_name,
if (!status->status.ok()) return; if (!status->status.ok()) return;
const auto len = std::min(max_values, static_cast<int>(ts.size())); const auto len = std::min(max_values, static_cast<int>(ts.size()));
for (int i = 0; i < len; ++i) { for (int i = 0; i < len; ++i) {
const Tensor& t = ts[i]; values[i] = TF_TensorFromTensor(ts[i]);
values[i] = new TF_Tensor{static_cast<TF_DataType>(t.dtype()), t.shape(),
tensorflow::TensorCApi::Buffer(t)};
values[i]->buffer->Ref();
} }
} }

View File

@ -45,9 +45,8 @@ limitations under the License.
#include "tensorflow/core/util/equal_graph_def.h" #include "tensorflow/core/util/equal_graph_def.h"
namespace tensorflow { namespace tensorflow {
TF_Tensor* TF_TensorFromTensor(const Tensor& src);
bool TF_Tensor_DecodeStrings(TF_Tensor* src, Tensor* dst, TF_Status* status); Status TF_TensorToTensor(const TF_Tensor* src, Tensor* dst);
TF_Tensor* TF_Tensor_EncodeStrings(const Tensor& src);
namespace { namespace {
@ -146,19 +145,16 @@ void TestEncodeDecode(int line, const std::vector<string>& data) {
for (tensorflow::int64 i = 0; i < src.NumElements(); ++i) { for (tensorflow::int64 i = 0; i < src.NumElements(); ++i) {
src.flat<string>()(i) = data[i]; src.flat<string>()(i) = data[i];
} }
TF_Tensor* dst = TF_Tensor_EncodeStrings(src); TF_Tensor* dst = TF_TensorFromTensor(src);
// Convert back to a C++ Tensor and ensure we get expected output. // Convert back to a C++ Tensor and ensure we get expected output.
TF_Status* status = TF_NewStatus();
Tensor output; Tensor output;
ASSERT_TRUE(TF_Tensor_DecodeStrings(dst, &output, status)) << line; ASSERT_EQ(Status::OK(), TF_TensorToTensor(dst, &output)) << line;
ASSERT_EQ(TF_OK, TF_GetCode(status)) << line;
ASSERT_EQ(src.NumElements(), output.NumElements()) << line; ASSERT_EQ(src.NumElements(), output.NumElements()) << line;
for (tensorflow::int64 i = 0; i < src.NumElements(); ++i) { for (tensorflow::int64 i = 0; i < src.NumElements(); ++i) {
ASSERT_EQ(data[i], output.flat<string>()(i)) << line; ASSERT_EQ(data[i], output.flat<string>()(i)) << line;
} }
TF_DeleteStatus(status);
TF_DeleteTensor(dst); TF_DeleteTensor(dst);
} }
} }
@ -918,7 +914,7 @@ TEST(CAPI, SavedModel) {
TF_Operation* input_op = TF_Operation* input_op =
TF_GraphOperationByName(graph, input_op_name.c_str()); TF_GraphOperationByName(graph, input_op_name.c_str());
ASSERT_TRUE(input_op != nullptr); ASSERT_TRUE(input_op != nullptr);
csession.SetInputs({{input_op, TF_Tensor_EncodeStrings(input)}}); csession.SetInputs({{input_op, TF_TensorFromTensor(input)}});
const tensorflow::string output_op_name = const tensorflow::string output_op_name =
tensorflow::ParseTensorName(output_name).first.ToString(); tensorflow::ParseTensorName(output_name).first.ToString();
@ -1636,6 +1632,39 @@ TEST_F(CApiAttributesTest, Tensor) {
TF_DeleteTensor(value); TF_DeleteTensor(value);
} }
TEST_F(CApiAttributesTest, StringTensor) {
// Create the string-Tensor "attribute" value.
char encoded[] = {
0, 0, 0, 0, 0, 0, 0, 0, // array[uint64] offsets
1, // varint encoded string length
'A',
};
auto deallocator = [](void* data, size_t len, void* arg) {};
unique_tensor_ptr t_in(TF_NewTensor(TF_STRING, nullptr, 0, &encoded[0],
sizeof(encoded), deallocator, nullptr),
TF_DeleteTensor);
// Create a TF_Operation with the attribute t_in
auto desc = init("tensor");
TF_SetAttrTensor(desc, "v", t_in.get(), s_);
ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
auto oper = TF_FinishOperation(desc, s_);
ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
// Fetch the attribute back.
EXPECT_TF_META("v", -1, TF_ATTR_TENSOR, -1);
TF_Tensor* t_out = nullptr;
TF_OperationGetAttrTensor(oper, "v", &t_out, s_);
ASSERT_EQ(TF_OK, TF_GetCode(s_)) << TF_Message(s_);
EXPECT_EQ(TF_STRING, TF_TensorType(t_out));
EXPECT_EQ(0, TF_NumDims(t_out));
ASSERT_EQ(TF_TensorByteSize(t_in.get()), TF_TensorByteSize(t_out));
EXPECT_EQ(0, memcmp(TF_TensorData(t_in.get()), TF_TensorData(t_out),
TF_TensorByteSize(t_out)));
TF_DeleteTensor(t_out);
}
TEST_F(CApiAttributesTest, TensorList) { TEST_F(CApiAttributesTest, TensorList) {
const char tensor1[] = {5, 7}; const char tensor1[] = {5, 7};
const int64_t dims1[] = {1, 2}; const int64_t dims1[] = {1, 2};
@ -1647,7 +1676,8 @@ TEST_F(CApiAttributesTest, TensorList) {
auto desc = init("list(tensor)"); auto desc = init("list(tensor)");
TF_Tensor* tmp[] = { TF_Tensor* tmp[] = {
Int8Tensor(dims1, ndims1, tensor1), Int8Tensor(dims2, ndims2, tensor2), Int8Tensor(dims1, ndims1, tensor1),
Int8Tensor(dims2, ndims2, tensor2),
}; };
TF_SetAttrTensorList(desc, "v", tmp, TF_ARRAYSIZE(tmp), s_); TF_SetAttrTensorList(desc, "v", tmp, TF_ARRAYSIZE(tmp), s_);
for (int i = 0; i < TF_ARRAYSIZE(tmp); ++i) { for (int i = 0; i < TF_ARRAYSIZE(tmp); ++i) {

View File

@ -356,7 +356,7 @@ Status SymbolicGradientBuilder::AddGradients() {
// Check if any input nodes still have pending gradients and have not been // Check if any input nodes still have pending gradients and have not been
// processed yet. This happens if not all outputs of a node are in 'inputs_'. // processed yet. This happens if not all outputs of a node are in 'inputs_'.
std::unordered_map<Node*, int> requested_grads; std::unordered_map<Node*, int> requested_grads;
for (Output nout : inputs_) { for (const Output& nout : inputs_) {
if (pending_[nout.node()->id()] > 0) { if (pending_[nout.node()->id()] > 0) {
DCHECK_GT(nout.node()->num_outputs(), 1); DCHECK_GT(nout.node()->num_outputs(), 1);
int idx = input_nodes_[nout]; int idx = input_nodes_[nout];
@ -365,7 +365,7 @@ Status SymbolicGradientBuilder::AddGradients() {
++requested_grads[nout.node()]; ++requested_grads[nout.node()];
} }
} }
for (auto& p : requested_grads) { for (const auto& p : requested_grads) {
int num_requested_inputs = p.first->num_outputs() - pending_[p.first->id()]; int num_requested_inputs = p.first->num_outputs() - pending_[p.first->id()];
CHECK_EQ(num_requested_inputs, p.second); CHECK_EQ(num_requested_inputs, p.second);
} }

View File

@@ -257,6 +257,11 @@ Status MarkForCompilationPass::Run(
                              &registration)) {
       return false;
     }
+
+    // Don't compile control trigger nodes. We won't preserve their deadness
+    // semantics correctly, so it's safest not to compile them.
+    if (node->IsControlTrigger()) return false;
+
     // If this device requires a JIT, we must say yes.
     if (registration->requires_compilation) return true;

View File

@ -353,6 +353,20 @@ tf_xla_py_test(
], ],
) )
tf_xla_py_test(
name = "segment_reduction_ops_test",
size = "small",
srcs = ["segment_reduction_ops_test.py"],
deps = [
":xla_test",
"//tensorflow/python:array_ops",
"//tensorflow/python:framework_for_generated_wrappers",
"//tensorflow/python:math_ops",
"//tensorflow/python:math_ops_gen",
"//tensorflow/python:platform_test",
],
)
tf_xla_py_test( tf_xla_py_test(
name = "spacetobatch_op_test", name = "spacetobatch_op_test",
size = "medium", size = "medium",

View File

@ -0,0 +1,139 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Test cases for segment reduction ops."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import functools
import numpy as np
from tensorflow.compiler.tests.xla_test import XLATestCase
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.platform import googletest
class SegmentReductionOpsTest(XLATestCase):
"""Test cases for segment reduction ops."""
def UnsortedSegmentSum(self, data, indices, num_segments):
with self.test_session() as sess, self.test_scope():
d = array_ops.placeholder(data.dtype, shape=data.shape)
if isinstance(indices, int):
i = array_ops.placeholder(np.int32, shape=[])
else:
i = array_ops.placeholder(indices.dtype, shape=indices.shape)
return sess.run(
math_ops.unsorted_segment_sum(d, i, num_segments),
{d: data,
i: indices})
def testUnsortedSegmentSum0DIndices1DData(self):
for dtype in self.numeric_types:
self.assertAllClose(
np.array(
[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5],
[0, 0, 0, 0, 0, 0]],
dtype=dtype),
self.UnsortedSegmentSum(
np.array([0, 1, 2, 3, 4, 5], dtype=dtype), 2, 4))
def testUnsortedSegmentSum1DIndices1DData(self):
for dtype in self.numeric_types:
self.assertAllClose(
np.array([1, 3, 2, 9], dtype=dtype),
self.UnsortedSegmentSum(
np.array([0, 1, 2, 3, 4, 5], dtype=dtype),
np.array([3, 0, 2, 1, 3, 3], dtype=np.int32), 4))
def testUnsortedSegmentSum1DIndices2DDataDisjoint(self):
for dtype in self.numeric_types:
data = np.array(
[[0, 1, 2, 3], [20, 21, 22, 23], [30, 31, 32, 33], [40, 41, 42, 43],
[50, 51, 52, 53]],
dtype=dtype)
indices = np.array([8, 1, 0, 3, 7], dtype=np.int32)
num_segments = 10
y = self.UnsortedSegmentSum(data, indices, num_segments)
self.assertAllClose(
np.array(
[[30, 31, 32, 33], [20, 21, 22, 23], [0, 0, 0, 0],
[40, 41, 42, 43], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
[50, 51, 52, 53], [0, 1, 2, 3], [0, 0, 0, 0]],
dtype=dtype), y)
def testUnsortedSegmentSum1DIndices2DDataNonDisjoint(self):
for dtype in self.numeric_types:
data = np.array(
[[0, 1, 2, 3], [20, 21, 22, 23], [30, 31, 32, 33], [40, 41, 42, 43],
[50, 51, 52, 53]],
dtype=dtype)
indices = np.array([0, 1, 2, 0, 1], dtype=np.int32)
num_segments = 4
y = self.UnsortedSegmentSum(data, indices, num_segments)
self.assertAllClose(
np.array(
[[40, 42, 44, 46], [70, 72, 74, 76], [30, 31, 32, 33],
[0, 0, 0, 0]],
dtype=dtype), y)
def testUnsortedSegmentSum2DIndices3DData(self):
for dtype in self.numeric_types:
data = np.array(
[[[0, 1, 2], [10, 11, 12]], [[100, 101, 102], [110, 111, 112]],
[[200, 201, 202], [210, 211, 212]], [[300, 301, 302],
[310, 311, 312]]],
dtype=dtype)
indices = np.array([[3, 5], [3, 1], [5, 0], [6, 2]], dtype=np.int32)
num_segments = 8
y = self.UnsortedSegmentSum(data, indices, num_segments)
self.assertAllClose(
np.array(
[[210, 211, 212], [110, 111, 112], [310, 311, 312],
[100, 102, 104], [0, 0, 0.], [210, 212, 214], [300, 301,
302], [0, 0, 0]],
dtype=dtype), y)
def testUnsortedSegmentSum1DIndices3DData(self):
for dtype in self.numeric_types:
data = np.array(
[[[0, 1, 2], [10, 11, 12]], [[100, 101, 102], [110, 111, 112]],
[[200, 201, 202], [210, 211, 212]], [[300, 301, 302],
[310, 311, 312]]],
dtype=dtype)
indices = np.array([3, 0, 2, 5], dtype=np.int32)
num_segments = 6
y = self.UnsortedSegmentSum(data, indices, num_segments)
self.assertAllClose(
np.array(
[[[100, 101, 102.], [110, 111, 112]], [[0, 0, 0], [0, 0, 0]],
[[200, 201, 202], [210, 211, 212]], [[0, 1, 2.], [10, 11, 12]],
[[0, 0, 0], [0, 0, 0]], [[300, 301, 302], [310, 311, 312]]],
dtype=dtype), y)
def testUnsortedSegmentSumShapeError(self):
for dtype in self.numeric_types:
data = np.ones((4, 8, 7), dtype=dtype)
indices = np.ones((3, 2), dtype=np.int32)
num_segments = 4
self.assertRaises(ValueError,
functools.partial(self.UnsortedSegmentSum, data,
indices, num_segments))
if __name__ == '__main__':
googletest.main()

View File

@ -57,11 +57,13 @@ class TensorArrayTest(xla_test.XLATestCase):
r0 = w2.read(0) r0 = w2.read(0)
r1 = w2.read(1) r1 = w2.read(1)
r2 = w2.read(2) r2 = w2.read(2)
flow = w2.flow
d0, d1, d2 = session.run([r0, r1, r2]) d0, d1, d2, flow_val = session.run([r0, r1, r2, flow])
self.assertAllEqual([[4.0, 5.0]], d0) self.assertAllEqual([[4.0, 5.0]], d0)
self.assertAllEqual([[1.0, 3.0]], d1) self.assertAllEqual([[1.0, 3.0]], d1)
self.assertAllEqual([[7.0, -8.5]], d2) self.assertAllEqual([[7.0, -8.5]], d2)
self.assertAllEqual([], flow_val.shape)
def _testTensorArrayWritePack(self, tf_dtype): def _testTensorArrayWritePack(self, tf_dtype):
with self.test_session(), self.test_scope(): with self.test_session(), self.test_scope():

View File

@@ -323,12 +323,26 @@ Status FunctionalizeLoop(Graph* graph, Frame* frame,
   for (Arg& arg : frame->args) {
     if (!arg.is_loop_invariant) {
       // Follow the edge from the Enter to Merge.
-      if (arg.enter->out_edges().size() != 1) {
-        return errors::Internal("Enter node for loop-varying argument ",
-                                arg.enter->name(),
-                                " does not have exactly one successor");
+      const Edge* enter_merge = nullptr;
+      for (const Edge* e : arg.enter->out_edges()) {
+        // Ignore control-edges to the sink node. These are allowed by the
+        // graph invariants, although probably they should have been stripped
+        // off earlier.
+        if (e->IsControlEdge() && e->dst()->IsSink()) {
+          continue;
+        }
+        if (enter_merge != nullptr) {
+          return errors::Internal(
+              "Enter node for loop-varying argument ", arg.enter->name(),
+              " has multiple successors: ", enter_merge->dst()->name(), " and ",
+              e->dst()->name());
+        }
+        enter_merge = e;
+      }
+      if (enter_merge == nullptr) {
+        return errors::Internal("Enter node for loop-varying argument ",
+                                arg.enter->name(), " has zero successors");
       }
-      const Edge* enter_merge = *arg.enter->out_edges().begin();
       arg.merge = enter_merge->dst();
       if (!IsMerge(arg.merge)) {
         return errors::InvalidArgument(

View File

@ -96,6 +96,14 @@ TEST(FunctionalizeControlFlow, OneLoopVar) {
TF_EXPECT_OK(scope.ToGraph(&graph)); TF_EXPECT_OK(scope.ToGraph(&graph));
} }
// Regression test: control edges from an Enter node to the graph sink should
// be ignored.
for (Node* n : graph.nodes()) {
if (n->name() == "while/Enter") {
graph.AddControlEdge(n, graph.sink_node());
}
}
FunctionLibraryDefinition library(OpRegistry::Global(), {}); FunctionLibraryDefinition library(OpRegistry::Global(), {});
TF_ASSERT_OK(FunctionalizeControlFlow(&graph, &library)); TF_ASSERT_OK(FunctionalizeControlFlow(&graph, &library));

View File

@ -47,6 +47,7 @@ tf_kernel_library(
"reshape_op.cc", "reshape_op.cc",
"retval_op.cc", "retval_op.cc",
"reverse_op.cc", "reverse_op.cc",
"segment_reduction_ops.cc",
"select_op.cc", "select_op.cc",
"sequence_ops.cc", "sequence_ops.cc",
"shape_op.cc", "shape_op.cc",

View File

@@ -23,4 +23,9 @@ namespace tensorflow {
 // dummy operator using CompilationOnly().
 REGISTER_XLA_OP(Name("NoOp").CompilationOnly(), NoOp);
 
+// We register ControlTrigger as a no-op. This is correct since nodes seen
+// by the XLA compiler are never dead. This may need rethinking when we add
+// support for conditionals to XLA.
+REGISTER_XLA_OP(Name("ControlTrigger"), NoOp);
+
 }  // namespace tensorflow

View File

@ -0,0 +1,155 @@
/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#include <sstream>
#include "tensorflow/compiler/tf2xla/kernels/cwise_ops.h"
#include "tensorflow/compiler/tf2xla/shape_util.h"
#include "tensorflow/compiler/tf2xla/xla_helpers.h"
#include "tensorflow/compiler/tf2xla/xla_op_registry.h"
#include "tensorflow/compiler/xla/client/computation_builder.h"
#include "tensorflow/compiler/xla/literal_util.h"
#include "tensorflow/core/framework/kernel_def_builder.h"
#include "tensorflow/core/framework/types.h"
namespace tensorflow {
namespace {
class UnsortedSegmentSum : public XlaOpKernel {
public:
explicit UnsortedSegmentSum(OpKernelConstruction* ctx) : XlaOpKernel(ctx) {
OP_REQUIRES_OK(ctx, ctx->GetAttr("T", &dtype_));
}
void Compile(XlaOpKernelContext* ctx) override {
// output = unsorted_segment_sum(data, indices, num_segments)
// Compute a tensor such that:
// output[i] = sum over {j where indices[j] == i} of data[j]
// output[i] == 0 if i does not appear in indices
//
// Contrast with segment_sum(), which assumes indices are sorted and that
// max(indices)+1 is the desired size of the output.
//
// The returned output tensor has the same type as data, and the same shape
// as data with the first indices.rank dimensions are replaced
// by a single dimension with size num_segments.
xla::ComputationBuilder* builder = ctx->builder();
auto data = ctx->Input(0);
auto data_shape = ctx->InputShape(0);
auto indices = ctx->Input(1);
auto indices_shape = ctx->InputShape(1);
OP_REQUIRES(ctx, data_shape.dims() >= indices_shape.dims(),
errors::InvalidArgument(
"UnsortedSegmentSum requires that indices' rank be"
" less than or equal to data's rank."));
// Validate that indices.shape is a prefix of data.shape.
for (int d = 0; d < indices_shape.dims(); ++d) {
OP_REQUIRES(ctx, (data_shape.dim_size(d) == indices_shape.dim_size(d)),
errors::InvalidArgument(
"UnsortedSegmentSum requires indices shape to be prefix"
" of data_shape, but dimension ",
d, " differs ", data_shape.dim_size(d), " vs. ",
indices_shape.dim_size(d)));
}
int64 num_segments;
OP_REQUIRES_OK(ctx, ctx->ConstantInputAsIntScalar(2, &num_segments));
// Flatten the indices into 1-D.
auto indices_1d = builder->Reshape(indices, {indices_shape.num_elements()});
// flatten data for dynamic indexing.
int64 out_tensor_dims = data_shape.dims() - indices_shape.dims();
std::vector<int64> flat_shape(1 + out_tensor_dims);
flat_shape[0] = indices_shape.num_elements();
for (int64 k = 0; k < out_tensor_dims; ++k) {
flat_shape[1 + k] = data_shape.dim_size(indices_shape.dims() + k);
}
auto data_flat = builder->Reshape(data, flat_shape);
// output shape; same as data_shape, but dimension 0 is num_segments.
std::vector<int64> out_shape(flat_shape);
out_shape[0] = num_segments;
// Pad the output array dims to rank >= 3 to work around lowering issues.
// TODO(b/37575001) This is awkward, and could be improved.
int64 extra_dims = 0;
if (out_shape.size() < 3) {
extra_dims = 3u - out_shape.size();
}
std::vector<int64> rshape(extra_dims + out_shape.size(), 1);
for (unsigned k = 0; k < out_shape.size(); ++k) {
rshape[extra_dims + k] = out_shape[k];
}
auto output = builder->Broadcast(XlaHelpers::Zero(builder, dtype_), rshape);
auto zero = builder->ConstantR1<int32>({0});
for (int64 i = 0; i < indices_shape.num_elements(); ++i) {
// output[indices[i]] += data[i]
std::vector<int64> data_start_indices(flat_shape.size());
data_start_indices[0] = i;
for (unsigned d = 1; d < flat_shape.size(); ++d) {
data_start_indices[d] = 0;
}
std::vector<int64> data_limit_indices(flat_shape);
data_limit_indices[0] = i + 1;
std::vector<int64> stride(flat_shape.size(), 1);
auto data_slice = builder->Slice(data_flat, data_start_indices,
data_limit_indices, stride);
// Reshape the sliced data into the R3+ shape to match output array.
std::vector<int64> rdata_shape(extra_dims + flat_shape.size());
for (int64 k = 0; k <= extra_dims; ++k) {
rdata_shape[k] = 1;
}
for (unsigned k = 1; k < data_limit_indices.size(); ++k) {
rdata_shape[extra_dims + k] = data_limit_indices[k];
}
auto rdata_slice = builder->Reshape(data_slice, rdata_shape);
auto index = builder->Slice(indices_1d, {i}, {i + 1}, {1});
// Construct the index into the R3+ output array 0, ..., <index>, 0, ...
std::vector<xla::ComputationDataHandle> out_start_index_parts(
extra_dims + flat_shape.size(), zero);
out_start_index_parts[extra_dims] = builder->Reshape(index, {1});
auto out_start_indices = builder->ConcatInDim(out_start_index_parts, 0);
std::vector<int64> slice_size(rshape);
slice_size[extra_dims] = 1;
auto out_slice =
builder->DynamicSlice(output, out_start_indices, slice_size);
auto sumval = builder->Add(out_slice, rdata_slice);
output = builder->DynamicUpdateSlice(output, sumval, out_start_indices);
}
auto reshaped_output = builder->Reshape(output, out_shape);
ctx->SetOutput(0, reshaped_output);
}
private:
DataType dtype_;
};
REGISTER_XLA_OP(Name("UnsortedSegmentSum"), UnsortedSegmentSum);
} // namespace
} // namespace tensorflow

View File

@ -182,7 +182,10 @@ class TensorArrayOp : public XlaOpKernel {
dtype_, value, &var)); dtype_, value, &var));
var->tensor_array_size = size; var->tensor_array_size = size;
ctx->SetResourceOutput(0, var); ctx->SetResourceOutput(0, var);
ctx->SetConstantOutput(1, Tensor(DT_FLOAT));
Tensor flow(DT_FLOAT, TensorShape({}));
flow.scalar<float>()() = 0.0f;
ctx->SetConstantOutput(1, flow);
} }
private: private:
@ -216,6 +219,7 @@ class TensorArrayWriteOp : public XlaOpKernel {
xla::ComputationDataHandle ta = resource->value; xla::ComputationDataHandle ta = resource->value;
xla::ComputationDataHandle index = ctx->Input(1); xla::ComputationDataHandle index = ctx->Input(1);
xla::ComputationDataHandle value = ctx->Input(2); xla::ComputationDataHandle value = ctx->Input(2);
xla::ComputationDataHandle flow = ctx->Input(3);
// start_indices of the DynamicUpdateSlice are [index, 0, 0, ..., 0]. // start_indices of the DynamicUpdateSlice are [index, 0, 0, ..., 0].
auto start_indices = XlaHelpers::PadWithZeros(b, index, elem_shape.dims()); auto start_indices = XlaHelpers::PadWithZeros(b, index, elem_shape.dims());
@ -228,7 +232,7 @@ class TensorArrayWriteOp : public XlaOpKernel {
DynamicAddSlice(b, ta, update, slice_shape.dim_sizes(), start_indices); DynamicAddSlice(b, ta, update, slice_shape.dim_sizes(), start_indices);
resource->value = written; resource->value = written;
ctx->SetConstantOutput(0, Tensor(DT_FLOAT)); ctx->SetOutput(0, flow);
} }
private: private:
@ -369,6 +373,7 @@ class TensorArrayScatterOp : public XlaOpKernel {
xla::ComputationDataHandle ta = resource->value; xla::ComputationDataHandle ta = resource->value;
const xla::ComputationDataHandle value = ctx->Input(2); const xla::ComputationDataHandle value = ctx->Input(2);
const xla::ComputationDataHandle flow = ctx->Input(3);
auto slice_dims = value_shape.dim_sizes(); auto slice_dims = value_shape.dim_sizes();
slice_dims[0] = 1LL; slice_dims[0] = 1LL;
@ -394,7 +399,7 @@ class TensorArrayScatterOp : public XlaOpKernel {
} }
resource->value = ta; resource->value = ta;
ctx->SetConstantOutput(0, Tensor(DT_FLOAT)); ctx->SetOutput(0, flow);
} }
private: private:
@ -489,6 +494,7 @@ class TensorArraySplitOp : public XlaOpKernel {
lengths.size(), " vs. ", resource->tensor_array_size, ")")); lengths.size(), " vs. ", resource->tensor_array_size, ")"));
const xla::ComputationDataHandle value = ctx->Input(1); const xla::ComputationDataHandle value = ctx->Input(1);
const xla::ComputationDataHandle flow = ctx->Input(3);
OP_REQUIRES(ctx, value_shape.num_elements() == ta_shape.num_elements(), OP_REQUIRES(ctx, value_shape.num_elements() == ta_shape.num_elements(),
errors::InvalidArgument("mismatched element count ", errors::InvalidArgument("mismatched element count ",
@ -497,7 +503,7 @@ class TensorArraySplitOp : public XlaOpKernel {
resource->value = b->Add(ta, b->Reshape(value, ta_shape.dim_sizes())); resource->value = b->Add(ta, b->Reshape(value, ta_shape.dim_sizes()));
ctx->SetConstantOutput(0, Tensor(DT_FLOAT)); ctx->SetOutput(0, flow);
} }
private: private:

View File

@ -132,7 +132,10 @@ cc_library(
cc_library( cc_library(
name = "statusor", name = "statusor",
srcs = ["statusor.cc"], srcs = ["statusor.cc"],
hdrs = ["statusor.h"], hdrs = [
"statusor.h",
"statusor_internals.h",
],
visibility = ["//visibility:public"], visibility = ["//visibility:public"],
deps = [ deps = [
":status", ":status",

View File

@ -1943,6 +1943,7 @@ cc_library(
":buffer_liveness", ":buffer_liveness",
":hlo", ":hlo",
":hlo_pass", ":hlo_pass",
"//tensorflow/compiler/xla:shape_util",
"//tensorflow/core:lib", "//tensorflow/core:lib",
], ],
) )

View File

@@ -1488,9 +1488,9 @@ Status AlgebraicSimplifierVisitor::HandleConvolution(
   // We cannot insert bitcasts if the layouts will not be compatible.
   // TODO(b/33178038): Consider inserting a transpose if a bitcast would be
   // invalid.
-  if (!valid_bitcast_callback_(lhs->shape(), input_shape) ||
-      !valid_bitcast_callback_(rhs->shape(), new_filter_shape) ||
-      !valid_bitcast_callback_(dot_output_shape, convolution_shape)) {
+  if (!valid_bitcast_callback_(input_shape, lhs->shape()) ||
+      !valid_bitcast_callback_(new_filter_shape, rhs->shape()) ||
+      !valid_bitcast_callback_(convolution_shape, dot_output_shape)) {
     return Status::OK();
   }

View File

@@ -26,12 +26,13 @@ namespace xla {
 // A pass which performs AlgebraicSimplications.
 class AlgebraicSimplifier : public HloPassInterface {
  public:
-  // Given two shapes, determines if it is valid to bitcast between them after
-  // considering platform dependent effects on layout like alignment
-  // restrictions.
-  // Precondition: the two shapes have layouts, the same number of
-  // elements and ShapeUtil::ReshapeIsBitcast returns true.
-  using ValidBitcastCallback = std::function<bool(const Shape&, const Shape&)>;
+  // Given shapes 'from_shape' and 'to_shape', determines if it is valid to
+  // bitcast from 'from_shape' to 'to_shape' after considering platform
+  // dependent effects on layout like alignment restrictions. Precondition: the
+  // two shapes have layouts, the same number of elements and
+  // ShapeUtil::ReshapeIsBitcast returns true.
+  using ValidBitcastCallback =
+      std::function<bool(const Shape& from_shape, const Shape& to_shape)>;
 
   // If is_layout_sensitive is true, then the simplifier preserves layout during
   // transformation. Otherwise, layout is ignored. If valid_bitcast_callback
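A rough sketch of how a backend might supply such a callback. The acceptance policy below is hypothetical (stricter than any real backend needs), and the (is_layout_sensitive, callback) construction follows the comment above:

  // Hypothetical policy: only accept bitcasts between shapes whose layouts
  // already agree, so no data movement would ever be required.
  AlgebraicSimplifier::ValidBitcastCallback valid_bitcast =
      [](const Shape& from_shape, const Shape& to_shape) {
        return LayoutUtil::LayoutsInShapesEqual(from_shape, to_shape);
      };
  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/true, valid_bitcast);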

View File

@ -72,6 +72,7 @@ cc_library(
"//tensorflow/compiler/xla/service:hlo_subcomputation_unification", "//tensorflow/compiler/xla/service:hlo_subcomputation_unification",
"//tensorflow/compiler/xla/service:hlo_verifier", "//tensorflow/compiler/xla/service:hlo_verifier",
"//tensorflow/compiler/xla/service:inliner", "//tensorflow/compiler/xla/service:inliner",
"//tensorflow/compiler/xla/service:reduce_precision_insertion",
"//tensorflow/compiler/xla/service:reshape_mover", "//tensorflow/compiler/xla/service:reshape_mover",
"//tensorflow/compiler/xla/service:transpose_folding", "//tensorflow/compiler/xla/service:transpose_folding",
"//tensorflow/compiler/xla/service/llvm_ir:llvm_util", # fixdeps: keep "//tensorflow/compiler/xla/service/llvm_ir:llvm_util", # fixdeps: keep

View File

@ -74,6 +74,7 @@ limitations under the License.
#include "tensorflow/compiler/xla/service/hlo_verifier.h" #include "tensorflow/compiler/xla/service/hlo_verifier.h"
#include "tensorflow/compiler/xla/service/inliner.h" #include "tensorflow/compiler/xla/service/inliner.h"
#include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h" #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
#include "tensorflow/compiler/xla/service/reduce_precision_insertion.h"
#include "tensorflow/compiler/xla/service/reshape_mover.h" #include "tensorflow/compiler/xla/service/reshape_mover.h"
#include "tensorflow/compiler/xla/service/transpose_folding.h" #include "tensorflow/compiler/xla/service/transpose_folding.h"
#include "tensorflow/compiler/xla/status_macros.h" #include "tensorflow/compiler/xla/status_macros.h"
@ -253,6 +254,14 @@ Status CpuCompiler::RunHloPasses(HloModule* module) {
HloPassPipeline pipeline("CPU"); HloPassPipeline pipeline("CPU");
pipeline.AddInvariantChecker<HloVerifier>(); pipeline.AddInvariantChecker<HloVerifier>();
for (const auto& reduce_precision_options :
module->config().debug_options().hlo_reduce_precision_options()) {
if (reduce_precision_options.pass_timing() ==
HloReducePrecisionOptions::BEFORE_OP_FUSION) {
pipeline.AddPass<ReducePrecisionInsertion>(reduce_precision_options);
}
}
// TODO(b/35786417): Re-enable inliner pass after fixing the bug and deciding // TODO(b/35786417): Re-enable inliner pass after fixing the bug and deciding
// where we will take this pass in future. // where we will take this pass in future.
// pipeline.AddPass<Inliner>(); // pipeline.AddPass<Inliner>();
@ -278,6 +287,15 @@ Status CpuCompiler::RunHloPasses(HloModule* module) {
TransposeFolding::NeverFoldTranspose); TransposeFolding::NeverFoldTranspose);
pipeline.AddPass<HloCSE>(/*is_layout_sensitive=*/false); pipeline.AddPass<HloCSE>(/*is_layout_sensitive=*/false);
pipeline.AddPass<CpuInstructionFusion>(); pipeline.AddPass<CpuInstructionFusion>();
for (const auto& reduce_precision_options :
module->config().debug_options().hlo_reduce_precision_options()) {
if (reduce_precision_options.pass_timing() ==
HloReducePrecisionOptions::AFTER_OP_FUSION) {
pipeline.AddPass<ReducePrecisionInsertion>(reduce_precision_options);
}
}
pipeline.AddPass<CpuLayoutAssignment>( pipeline.AddPass<CpuLayoutAssignment>(
module->mutable_entry_computation_layout()); module->mutable_entry_computation_layout());
// The LayoutAssignment pass may leave behind kCopy instructions which are // The LayoutAssignment pass may leave behind kCopy instructions which are

View File

@ -432,6 +432,7 @@ cc_library(
"//tensorflow/compiler/xla/service:hlo_proto_util", "//tensorflow/compiler/xla/service:hlo_proto_util",
"//tensorflow/compiler/xla/service:hlo_subcomputation_unification", "//tensorflow/compiler/xla/service:hlo_subcomputation_unification",
"//tensorflow/compiler/xla/service:hlo_verifier", "//tensorflow/compiler/xla/service:hlo_verifier",
"//tensorflow/compiler/xla/service:reduce_precision_insertion",
"//tensorflow/compiler/xla/service:reshape_mover", "//tensorflow/compiler/xla/service:reshape_mover",
"//tensorflow/compiler/xla/service:transpose_folding", "//tensorflow/compiler/xla/service:transpose_folding",
"//tensorflow/compiler/xla/service/gpu/llvm_gpu_backend", "//tensorflow/compiler/xla/service/gpu/llvm_gpu_backend",

View File

@ -56,6 +56,7 @@ limitations under the License.
#include "tensorflow/compiler/xla/service/hlo_subcomputation_unification.h" #include "tensorflow/compiler/xla/service/hlo_subcomputation_unification.h"
#include "tensorflow/compiler/xla/service/hlo_verifier.h" #include "tensorflow/compiler/xla/service/hlo_verifier.h"
#include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h" #include "tensorflow/compiler/xla/service/llvm_ir/llvm_util.h"
#include "tensorflow/compiler/xla/service/reduce_precision_insertion.h"
#include "tensorflow/compiler/xla/service/reshape_mover.h" #include "tensorflow/compiler/xla/service/reshape_mover.h"
#include "tensorflow/compiler/xla/service/transpose_folding.h" #include "tensorflow/compiler/xla/service/transpose_folding.h"
#include "tensorflow/compiler/xla/status_macros.h" #include "tensorflow/compiler/xla/status_macros.h"
@ -123,6 +124,15 @@ tensorflow::Status OptimizeHloModule(HloModule* hlo_module,
{ {
HloPassPipeline pipeline("optimization"); HloPassPipeline pipeline("optimization");
pipeline.AddInvariantChecker<HloVerifier>(); pipeline.AddInvariantChecker<HloVerifier>();
for (const auto& reduce_precision_options :
hlo_module->config().debug_options().hlo_reduce_precision_options()) {
if (reduce_precision_options.pass_timing() ==
HloReducePrecisionOptions::BEFORE_OP_FUSION) {
pipeline.AddPass<ReducePrecisionInsertion>(reduce_precision_options);
}
}
{ {
auto& pass = auto& pass =
pipeline.AddPass<HloPassFix<HloPassPipeline>>("simplification"); pipeline.AddPass<HloPassFix<HloPassPipeline>>("simplification");
@ -149,9 +159,28 @@ tensorflow::Status OptimizeHloModule(HloModule* hlo_module,
fusion.AddPass<GpuInstructionFusion>(/*may_duplicate=*/false); fusion.AddPass<GpuInstructionFusion>(/*may_duplicate=*/false);
fusion.AddPass<GpuInstructionFusion>(/*may_duplicate=*/true); fusion.AddPass<GpuInstructionFusion>(/*may_duplicate=*/true);
fusion.AddPass<FusionMerger>(); fusion.AddPass<FusionMerger>();
return fusion.Run(hlo_module).status(); TF_RETURN_IF_ERROR(fusion.Run(hlo_module).status());
HloPassPipeline reduce_pipeline("reduce-precision");
for (const auto& reduce_precision_options :
hlo_module->config().debug_options().hlo_reduce_precision_options()) {
if (reduce_precision_options.pass_timing() ==
HloReducePrecisionOptions::AFTER_OP_FUSION) {
reduce_pipeline.AddPass<ReducePrecisionInsertion>(
reduce_precision_options);
} }
} }
StatusOr<bool> reduce_result = reduce_pipeline.Run(hlo_module);
TF_RETURN_IF_ERROR(reduce_result.status());
if (reduce_result.ValueOrDie()) {
// Do another fusion pass, with the expectation that we may be able to
// fuse the new ReducePrecision operations.
TF_RETURN_IF_ERROR(fusion.Run(hlo_module).status());
}
}
return tensorflow::Status::OK();
}
// Modifies the given HLO module so that it will be accepted by IrEmitter. // Modifies the given HLO module so that it will be accepted by IrEmitter.
// Unlike optimization passes, the passes are necessary for correctness. // Unlike optimization passes, the passes are necessary for correctness.

File diff suppressed because it is too large.

View File

@@ -112,6 +112,11 @@ bool HloOpcodeIsComparison(HloOpcode opcode);
 // Returns true iff the given opcode has variadic operands.
 bool HloOpcodeIsVariadic(HloOpcode opcode);
 
+// Returns the number of HloOpcode values.
+inline const uint32_t HloOpcodeCount() {
+  return static_cast<uint32_t>(HloOpcode::kWhile) + 1;
+}
+
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_SERVICE_HLO_OPCODE_H_
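A small illustration of the kind of loop this count enables, assuming the HloOpcodeString helper declared in the same header and <iostream>:

  // Walk every opcode value, e.g. to build a per-opcode lookup table.
  for (uint32_t i = 0; i < HloOpcodeCount(); ++i) {
    const HloOpcode opcode = static_cast<HloOpcode>(i);
    std::cout << HloOpcodeString(opcode) << "\n";
  }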

View File

@ -16,6 +16,7 @@ limitations under the License.
#include "tensorflow/compiler/xla/service/reduce_precision_insertion.h" #include "tensorflow/compiler/xla/service/reduce_precision_insertion.h"
#include "tensorflow/compiler/xla/service/hlo_module.h" #include "tensorflow/compiler/xla/service/hlo_module.h"
#include "tensorflow/compiler/xla/shape_util.h"
#include "tensorflow/core/platform/logging.h" #include "tensorflow/core/platform/logging.h"
namespace xla { namespace xla {
@ -30,14 +31,15 @@ StatusOr<bool> ReducePrecisionInsertion::Run(HloModule* module) {
for (auto& instruction : computation->instructions()) { for (auto& instruction : computation->instructions()) {
VLOG(3) << "Visited instruction: " << instruction->ToString(); VLOG(3) << "Visited instruction: " << instruction->ToString();
// For now, ReducePrecision is only implemented for F32 data, so this // For now, ReducePrecision is only implemented for F32 arrays, so this
// ignore instructions that produce other data. In particular, this // ignore instructions that produce other data. In particular, this
// currently ignores instructions producing tuples, even if those tuples // currently ignores instructions producing tuples, even if those tuples
// contain F32 data inside them. The assumption is that in most cases // contain F32 arrays inside them. The assumption is that in most cases
// equivalent behavior can be obtained by adding ReducePrecision // equivalent behavior can be obtained by adding ReducePrecision
// instructions after the instructions that pull the F32 data out of the // instructions after the instructions that pull the F32 arrays out of
// tuples. // the tuples.
if (instruction->shape().element_type() == PrimitiveType::F32 && if (instruction->shape().element_type() == PrimitiveType::F32 &&
!ShapeUtil::IsScalar(instruction->shape()) &&
should_reduce_output_precision_(instruction->opcode())) { should_reduce_output_precision_(instruction->opcode())) {
instructions_to_suffix.push_back(instruction.get()); instructions_to_suffix.push_back(instruction.get());
} }
@ -58,4 +60,33 @@ StatusOr<bool> ReducePrecisionInsertion::Run(HloModule* module) {
return changed; return changed;
} }
ReducePrecisionInsertion::OpcodeFilterFunction
ReducePrecisionInsertion::make_filter_function(
const HloReducePrecisionOptions& reduce_precision_options) {
// Implement the filter function with a lookup table.
std::vector<bool> filter(HloOpcodeCount(), false);
for (const auto& opcode : reduce_precision_options.opcodes_to_suffix()) {
filter[opcode] = true;
}
return [filter](const HloOpcode opcode) {
return filter[static_cast<unsigned int>(opcode)];
};
}
HloReducePrecisionOptions ReducePrecisionInsertion::make_options_proto(
const HloReducePrecisionOptions::PassTiming pass_timing,
const int exponent_bits, const int mantissa_bits,
const OpcodeFilterFunction& should_reduce_output_precision) {
HloReducePrecisionOptions options;
options.set_pass_timing(pass_timing);
options.set_exponent_bits(exponent_bits);
options.set_mantissa_bits(mantissa_bits);
for (uint32_t opcode = 0; opcode < HloOpcodeCount(); opcode++) {
if (should_reduce_output_precision(static_cast<HloOpcode>(opcode))) {
options.add_opcodes_to_suffix(opcode);
}
}
return options;
}
} // namespace xla } // namespace xla

View File

@ -42,6 +42,17 @@ class ReducePrecisionInsertion : public HloPassInterface {
: exponent_bits_(exponent_bits), : exponent_bits_(exponent_bits),
mantissa_bits_(mantissa_bits), mantissa_bits_(mantissa_bits),
should_reduce_output_precision_(should_reduce_output_precision) {} should_reduce_output_precision_(should_reduce_output_precision) {}
// Version of the constructor that takes an HloReducePrecisionOptions proto
// rather than explicitly-enumerated parameters, for convenience when
// creating passes based on DebugOptions.
explicit ReducePrecisionInsertion(
const HloReducePrecisionOptions& reduce_precision_options)
: exponent_bits_(reduce_precision_options.exponent_bits()),
mantissa_bits_(reduce_precision_options.mantissa_bits()),
should_reduce_output_precision_(
make_filter_function(reduce_precision_options)) {}
~ReducePrecisionInsertion() override{}; ~ReducePrecisionInsertion() override{};
tensorflow::StringPiece name() const override { tensorflow::StringPiece name() const override {
@ -52,6 +63,15 @@ class ReducePrecisionInsertion : public HloPassInterface {
// (reduce-precision instructions were inserted). // (reduce-precision instructions were inserted).
StatusOr<bool> Run(HloModule* module) override; StatusOr<bool> Run(HloModule* module) override;
// Convert between the (inconvenient) xla.proto HloReducePrecisionOptions
// representation and OpcodeFilterFunction functions.
static OpcodeFilterFunction make_filter_function(
const HloReducePrecisionOptions& reduce_precision_options);
static HloReducePrecisionOptions make_options_proto(
const HloReducePrecisionOptions::PassTiming pass_timing,
const int exponent_bits, const int mantissa_bits,
const OpcodeFilterFunction& should_reduce_output_precision);
private: private:
// Parameters for the precision reduction to be added. // Parameters for the precision reduction to be added.
const int exponent_bits_; const int exponent_bits_;
@ -59,7 +79,7 @@ class ReducePrecisionInsertion : public HloPassInterface {
// Function to determine (from the opcode) whether a given instruction should // Function to determine (from the opcode) whether a given instruction should
// have a reduce-precision instruction inserted in its output stream. // have a reduce-precision instruction inserted in its output stream.
const OpcodeFilterFunction& should_reduce_output_precision_; const OpcodeFilterFunction should_reduce_output_precision_;
}; };
} // namespace xla } // namespace xla
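A short sketch of how these helpers compose; the pipeline name, opcode choice, and bit widths below are illustrative only:

  // Build an options proto that requests reduced precision on the outputs of
  // all kAdd instructions, with a 5-bit exponent and 10-bit mantissa.
  HloReducePrecisionOptions options =
      ReducePrecisionInsertion::make_options_proto(
          HloReducePrecisionOptions::AFTER_OP_FUSION, /*exponent_bits=*/5,
          /*mantissa_bits=*/10,
          [](HloOpcode opcode) { return opcode == HloOpcode::kAdd; });

  // The same proto can later be turned back into a pass instance.
  HloPassPipeline pipeline("example-reduce-precision");
  pipeline.AddPass<ReducePrecisionInsertion>(options);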

View File

@@ -19,28 +19,20 @@ limitations under the License.
#include "tensorflow/core/platform/logging.h"

namespace xla {
namespace internal_statusor {

void Helper::HandleInvalidStatusCtorArg(Status* status) {
const char* kMessage =
"An OK status is not a valid constructor argument to StatusOr<T>";
LOG(ERROR) << kMessage;
// Fall back to tensorflow::error::INTERNAL.
*status = ::tensorflow::errors::Internal(kMessage);
}

void Helper::Crash(const Status& status) {
LOG(FATAL) << "Attempting to fetch value instead of handling error "
<< status;
}

}  // namespace internal_statusor
}  // namespace xla


@@ -72,216 +72,233 @@ limitations under the License.
#define TENSORFLOW_COMPILER_XLA_STATUSOR_H_

#include "tensorflow/compiler/xla/status.h"
#include "tensorflow/compiler/xla/statusor_internals.h"
#include "tensorflow/core/platform/macros.h"

namespace xla {

#if defined(__clang__)
// Only clang supports warn_unused_result as a type annotation.
template <typename T>
class TF_MUST_USE_RESULT StatusOr;
#endif

template <typename T>
class StatusOr : private internal_statusor::StatusOrData<T>,
private internal_statusor::TraitsBase<
std::is_copy_constructible<T>::value,
std::is_move_constructible<T>::value> {
template <typename U>
friend class StatusOr;

typedef internal_statusor::StatusOrData<T> Base;

public:
typedef T element_type;

// Constructs a new StatusOr with Status::UNKNOWN status. This is marked
// 'explicit' to try to catch cases like 'return {};', where people think
// StatusOr<std::vector<int>> will be initialized with an empty vector,
// instead of a Status::UNKNOWN status.
explicit StatusOr();

// StatusOr<T> will be copy constructuble/assignable if T is copy
// constructible.
StatusOr(const StatusOr&) = default;
StatusOr& operator=(const StatusOr&) = default;

// StatusOr<T> will be move constructuble/assignable if T is move
// constructible.
StatusOr(StatusOr&&) = default;
StatusOr& operator=(StatusOr&&) = default;

// Conversion copy/move constructor, T must be convertible from U.
// TODO(b/62186717): These should not participate in overload resolution if U
// is not convertible to T.
template <typename U>
StatusOr(const StatusOr<U>& other);
template <typename U>
StatusOr(StatusOr<U>&& other);

// Conversion copy/move assignment operator, T must be convertible from U.
template <typename U>
StatusOr& operator=(const StatusOr<U>& other);
template <typename U>
StatusOr& operator=(StatusOr<U>&& other);

// Constructs a new StatusOr with the given value. After calling this
// constructor, calls to ValueOrDie() will succeed, and calls to status() will
// return OK.
//
// NOTE: Not explicit - we want to use StatusOr<T> as a return type
// so it is convenient and sensible to be able to do 'return T()'
// when the return type is StatusOr<T>.
//
// REQUIRES: T is copy constructible.
StatusOr(const T& value);

// Constructs a new StatusOr with the given non-ok status. After calling
// this constructor, calls to ValueOrDie() will CHECK-fail.
//
// NOTE: Not explicit - we want to use StatusOr<T> as a return
// value, so it is convenient and sensible to be able to do 'return
// Status()' when the return type is StatusOr<T>.
//
// REQUIRES: !status.ok(). This requirement is DCHECKed.
// In optimized builds, passing Status::OK() here will have the effect
// of passing tensorflow::error::INTERNAL as a fallback.
StatusOr(const Status& status);
StatusOr& operator=(const Status& status);

// TODO(b/62186997): Add operator=(T) overloads.

// Similar to the `const T&` overload.
//
// REQUIRES: T is move constructible.
StatusOr(T&& value);

// RValue versions of the operations declared above.
StatusOr(Status&& status);
StatusOr& operator=(Status&& status);

// Returns this->status().ok()
bool ok() const { return this->status_.ok(); }

// Returns a reference to our status. If this contains a T, then
// returns Status::OK().
const Status& status() const &;
Status status() &&;

// Returns a reference to our current value, or CHECK-fails if !this->ok().
//
// Note: for value types that are cheap to copy, prefer simple code:
//
//   T value = statusor.ValueOrDie();
//
// Otherwise, if the value type is expensive to copy, but can be left
// in the StatusOr, simply assign to a reference:
//
//   T& value = statusor.ValueOrDie();  // or `const T&`
//
// Otherwise, if the value type supports an efficient move, it can be
// used as follows:
//
//   T value = std::move(statusor).ValueOrDie();
//
// The std::move on statusor instead of on the whole expression enables
// warnings about possible uses of the statusor object after the move.
// C++ style guide waiver for ref-qualified overloads granted in cl/143176389
// See go/ref-qualifiers for more details on such overloads.
const T& ValueOrDie() const &;
T& ValueOrDie() &;
const T&& ValueOrDie() const &&;
T&& ValueOrDie() &&;

// Moves our current value out of this object and returns it, or CHECK-fails
// if !this->ok().
// Use of this method is discouraged; prefer std::move(statusor.ValueOrDie())
// instead.
T ConsumeValueOrDie() { return std::move(ValueOrDie()); }

// Ignores any errors. This method does nothing except potentially suppress
// complaints from any tools that are checking that errors are not dropped on
// the floor.
void IgnoreError() const;
};

////////////////////////////////////////////////////////////////////////////////
// Implementation details for StatusOr<T>

template <typename T>
StatusOr<T>::StatusOr() : Base(Status(tensorflow::error::UNKNOWN, "")) {}

template <typename T>
StatusOr<T>::StatusOr(const T& value) : Base(value) {}

template <typename T>
StatusOr<T>::StatusOr(const Status& status) : Base(status) {}

template <typename T>
StatusOr<T>& StatusOr<T>::operator=(const Status& status) {
this->Assign(status);
return *this;
}

template <typename T>
StatusOr<T>::StatusOr(T&& value) : Base(std::move(value)) {}

template <typename T>
StatusOr<T>::StatusOr(Status&& status) : Base(std::move(status)) {}

template <typename T>
StatusOr<T>& StatusOr<T>::operator=(Status&& status) {
this->Assign(std::move(status));
return *this;
}

template <typename T>
template <typename U>
inline StatusOr<T>::StatusOr(const StatusOr<U>& other)
: Base(static_cast<const typename StatusOr<U>::Base&>(other)) {}

template <typename T>
template <typename U>
inline StatusOr<T>& StatusOr<T>::operator=(const StatusOr<U>& other) {
if (other.ok())
this->Assign(other.ValueOrDie());
else
this->Assign(other.status());
return *this;
}

template <typename T>
template <typename U>
inline StatusOr<T>::StatusOr(StatusOr<U>&& other)
: Base(static_cast<typename StatusOr<U>::Base&&>(other)) {}

template <typename T>
template <typename U>
inline StatusOr<T>& StatusOr<T>::operator=(StatusOr<U>&& other) {
if (other.ok()) {
this->Assign(std::move(other).ValueOrDie());
} else {
this->Assign(std::move(other).status());
}
return *this;
}

template <typename T>
const Status& StatusOr<T>::status() const & {
return this->status_;
}

template <typename T>
Status StatusOr<T>::status() && {
return ok() ? Status::OK() : std::move(this->status_);
}

template <typename T>
const T& StatusOr<T>::ValueOrDie() const & {
this->EnsureOk();
return this->data_;
}

template <typename T>
T& StatusOr<T>::ValueOrDie() & {
this->EnsureOk();
return this->data_;
}

template <typename T>
const T&& StatusOr<T>::ValueOrDie() const && {
this->EnsureOk();
return std::move(this->data_);
}

template <typename T>
T&& StatusOr<T>::ValueOrDie() && {
this->EnsureOk();
return std::move(this->data_);
}

template <typename T>
void StatusOr<T>::IgnoreError() const {
// no-op
}

}  // namespace xla
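The patterns recommended in the ValueOrDie() comment above look like this in a caller. This is a minimal sketch under stated assumptions: MakeBuffer() and UseBuffer() are made up for illustration and are not part of this change.

// Hypothetical example of producing and consuming a StatusOr<T>.
#include <utility>
#include <vector>

#include "tensorflow/compiler/xla/statusor.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/platform/logging.h"

xla::StatusOr<std::vector<float>> MakeBuffer(int n) {
  if (n < 0) {
    // The error path needs no value; this uses the implicit Status constructor.
    return tensorflow::errors::InvalidArgument("negative buffer size");
  }
  // Uses the implicit StatusOr(T&&) constructor.
  return std::vector<float>(n, 0.0f);
}

void UseBuffer() {
  xla::StatusOr<std::vector<float>> result = MakeBuffer(16);
  if (!result.ok()) {
    LOG(ERROR) << result.status();
    return;
  }
  // Cheap alias while the value stays inside the StatusOr.
  const std::vector<float>& ref = result.ValueOrDie();
  LOG(INFO) << "size: " << ref.size();
  // The rvalue-qualified overload moves the value out; `result` is not used
  // after this point.
  std::vector<float> owned = std::move(result).ValueOrDie();
  LOG(INFO) << "owned size: " << owned.size();
}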


@@ -0,0 +1,245 @@
/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#ifndef THIRD_PARTY_TENSORFLOW_COMPILER_XLA_STATUSOR_INTERNALS_H_
#define THIRD_PARTY_TENSORFLOW_COMPILER_XLA_STATUSOR_INTERNALS_H_
#include "tensorflow/compiler/xla/status.h"
#include "tensorflow/core/platform/macros.h"
namespace xla {
namespace internal_statusor {
class Helper {
public:
// Move type-agnostic error handling to the .cc.
static void HandleInvalidStatusCtorArg(Status*);
TF_ATTRIBUTE_NORETURN static void Crash(const Status& status);
};
// Construct an instance of T in `p` through placement new, passing Args... to
// the constructor.
// This abstraction is here mostly for the gcc performance fix.
template <typename T, typename... Args>
void PlacementNew(void* p, Args&&... args) {
#if defined(__GNUC__) && !defined(__clang__)
// Teach gcc that 'p' cannot be null, fixing code size issues.
if (p == nullptr) __builtin_unreachable();
#endif
new (p) T(std::forward<Args>(args)...);
}
// Helper base class to hold the data and all operations.
// We move all this to a base class to allow mixing with the appropriate
// TraitsBase specialization.
template <typename T>
class StatusOrData {
template <typename U>
friend class StatusOrData;
public:
StatusOrData() = delete;
StatusOrData(const StatusOrData& other) {
if (other.ok()) {
MakeValue(other.data_);
MakeStatus();
} else {
MakeStatus(other.status_);
}
}
StatusOrData(StatusOrData&& other) noexcept {
if (other.ok()) {
MakeValue(std::move(other.data_));
MakeStatus();
} else {
MakeStatus(std::move(other.status_));
}
}
template <typename U>
StatusOrData(const StatusOrData<U>& other) {
if (other.ok()) {
MakeValue(other.data_);
MakeStatus();
} else {
MakeStatus(other.status_);
}
}
template <typename U>
StatusOrData(StatusOrData<U>&& other) {
if (other.ok()) {
MakeValue(std::move(other.data_));
MakeStatus();
} else {
MakeStatus(std::move(other.status_));
}
}
explicit StatusOrData(const T& value) : data_(value) { MakeStatus(); }
explicit StatusOrData(T&& value) : data_(std::move(value)) { MakeStatus(); }
explicit StatusOrData(const Status& status) : status_(status) {
EnsureNotOk();
}
explicit StatusOrData(Status&& status) : status_(std::move(status)) {
EnsureNotOk();
}
StatusOrData& operator=(const StatusOrData& other) {
if (this == &other) return *this;
if (other.ok())
Assign(other.data_);
else
Assign(other.status_);
return *this;
}
StatusOrData& operator=(StatusOrData&& other) {
if (this == &other) return *this;
if (other.ok())
Assign(std::move(other.data_));
else
Assign(std::move(other.status_));
return *this;
}
~StatusOrData() {
if (ok()) {
status_.~Status();
data_.~T();
} else {
status_.~Status();
}
}
void Assign(const T& value) {
if (ok()) {
data_.~T();
MakeValue(value);
} else {
MakeValue(value);
status_ = Status::OK();
}
}
void Assign(T&& value) {
if (ok()) {
data_.~T();
MakeValue(std::move(value));
} else {
MakeValue(std::move(value));
status_ = Status::OK();
}
}
void Assign(const Status& status) {
Clear();
status_ = status;
EnsureNotOk();
}
void Assign(Status&& status) {
Clear();
status_ = std::move(status);
EnsureNotOk();
}
bool ok() const { return status_.ok(); }
protected:
// status_ will always be active after the constructor.
// We make it a union to be able to initialize exactly how we need without
// waste.
// Eg. in the copy constructor we use the default constructor of Status in
// the ok() path to avoid an extra Ref call.
union {
Status status_;
};
// data_ is active iff status_.ok()==true
struct Dummy {};
union {
// When T is const, we need some non-const object we can cast to void* for
// the placement new. dummy_ is that object.
Dummy dummy_;
T data_;
};
void Clear() {
if (ok()) data_.~T();
}
void EnsureOk() const {
if (!ok()) Helper::Crash(status_);
}
void EnsureNotOk() {
if (ok()) Helper::HandleInvalidStatusCtorArg(&status_);
}
// Construct the value (ie. data_) through placement new with the passed
// argument.
template <typename Arg>
void MakeValue(Arg&& arg) {
internal_statusor::PlacementNew<T>(&dummy_, std::forward<Arg>(arg));
}
// Construct the status (ie. status_) through placement new with the passed
// argument.
template <typename... Args>
void MakeStatus(Args&&... args) {
internal_statusor::PlacementNew<Status>(&status_,
std::forward<Args>(args)...);
}
};
// Helper base class to allow implicitly deleted constructors and assignment
// operations in StatusOr.
// TraitsBase will explicitly delete what it can't support and StatusOr will
// inherit that behavior implicitly.
template <bool Copy, bool Move>
struct TraitsBase {
TraitsBase() = default;
TraitsBase(const TraitsBase&) = default;
TraitsBase(TraitsBase&&) = default;
TraitsBase& operator=(const TraitsBase&) = default;
TraitsBase& operator=(TraitsBase&&) = default;
};
template <>
struct TraitsBase<false, true> {
TraitsBase() = default;
TraitsBase(const TraitsBase&) = delete;
TraitsBase(TraitsBase&&) = default;
TraitsBase& operator=(const TraitsBase&) = delete;
TraitsBase& operator=(TraitsBase&&) = default;
};
template <>
struct TraitsBase<false, false> {
TraitsBase() = default;
TraitsBase(const TraitsBase&) = delete;
TraitsBase(TraitsBase&&) = delete;
TraitsBase& operator=(const TraitsBase&) = delete;
TraitsBase& operator=(TraitsBase&&) = delete;
};
} // namespace internal_statusor
} // namespace xla
#endif // THIRD_PARTY_TENSORFLOW_COMPILER_XLA_STATUSOR_INTERNALS_H_
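The role of TraitsBase can be isolated in a few lines: because StatusOr declares its copy and move operations as "= default", deleting them in the mixed-in base class is enough to delete the corresponding defaulted operations of StatusOr itself. Below is a standalone sketch of just that mechanism; all names here are illustrative and not part of the XLA sources.

// Minimal demonstration of the "traits base" deletion technique.
#include <type_traits>

namespace demo {

template <bool Copy, bool Move>
struct TraitsBase {
  TraitsBase() = default;
  TraitsBase(const TraitsBase&) = default;
  TraitsBase(TraitsBase&&) = default;
  TraitsBase& operator=(const TraitsBase&) = default;
  TraitsBase& operator=(TraitsBase&&) = default;
};

template <>
struct TraitsBase<false, true> {
  TraitsBase() = default;
  TraitsBase(const TraitsBase&) = delete;
  TraitsBase(TraitsBase&&) = default;
  TraitsBase& operator=(const TraitsBase&) = delete;
  TraitsBase& operator=(TraitsBase&&) = default;
};

// Stand-in for StatusOr<T>: its copy/move constructors are declared
// "= default" and therefore track whatever the TraitsBase mixin allows.
template <bool Copy>
class MoveOnlyIfNotCopy : private TraitsBase<Copy, true> {
 public:
  MoveOnlyIfNotCopy() = default;
  MoveOnlyIfNotCopy(const MoveOnlyIfNotCopy&) = default;
  MoveOnlyIfNotCopy(MoveOnlyIfNotCopy&&) = default;
};

static_assert(std::is_copy_constructible<MoveOnlyIfNotCopy<true>>::value,
              "copyable when the traits base allows copying");
static_assert(!std::is_copy_constructible<MoveOnlyIfNotCopy<false>>::value,
              "defaulted copy constructor becomes deleted via the traits base");
static_assert(std::is_move_constructible<MoveOnlyIfNotCopy<false>>::value,
              "move constructor is still available");

}  // namespace demo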


@@ -29,8 +29,6 @@ limitations under the License.
namespace xla {
namespace {

class Base1 {
public:
virtual ~Base1() {}
@@ -59,6 +57,14 @@ class CopyNoAssign {
const CopyNoAssign& operator=(const CopyNoAssign&);
};
class NoDefaultConstructor {
public:
explicit NoDefaultConstructor(int foo);
};
static_assert(!std::is_default_constructible<NoDefaultConstructor>(),
"Should not be default-constructible.");
StatusOr<std::unique_ptr<int>> ReturnUniquePtr() {
// Uses implicit constructor from T&&
return std::unique_ptr<int>(new int(0));
@@ -69,6 +75,18 @@ TEST(StatusOr, ElementType) {
static_assert(std::is_same<StatusOr<char>::element_type, char>(), "");
}
TEST(StatusOr, TestNoDefaultConstructorInitialization) {
// Explicitly initialize it with an error code.
StatusOr<NoDefaultConstructor> statusor(tensorflow::errors::Cancelled(""));
EXPECT_FALSE(statusor.ok());
EXPECT_EQ(statusor.status().code(), tensorflow::error::CANCELLED);
// Default construction of StatusOr initializes it with an UNKNOWN error code.
StatusOr<NoDefaultConstructor> statusor2;
EXPECT_FALSE(statusor2.ok());
EXPECT_EQ(statusor2.status().code(), tensorflow::error::UNKNOWN);
}
TEST(StatusOr, TestMoveOnlyInitialization) {
StatusOr<std::unique_ptr<int>> thing(ReturnUniquePtr());
ASSERT_TRUE(thing.ok());


@@ -308,6 +308,137 @@ XLA_TEST_P(BatchNormTest, DISABLED_ON_GPU(RandomizedTests)) {
ErrorSpec(0.01, 1));
}
// TODO(b/62764704): Implement on GPU. Disabled on 2017-06-20.
XLA_TEST_P(BatchNormTest, DISABLED_ON_CPU_PARALLEL(DISABLED_ON_CPU(
DISABLED_ON_GPU(RandomizedGradTests)))) {
float epsilon = 0.001;
ComputationBuilder builder(client_, TestName());
const std::vector<int64>& bounds = GetParam().bounds;
Array4D<float> input_array(bounds[0], bounds[1], bounds[2], bounds[3]);
input_array.FillRandom(GetParam().random_value_var,
GetParam().random_value_mean);
Array4D<float> grad_output_array(bounds[0], bounds[1], bounds[2], bounds[3]);
grad_output_array.FillRandom(GetParam().random_value_var,
GetParam().random_value_mean);
const int64 feature_index = GetParam().feature_index;
const int64 num_elements_per_feature =
Product(bounds) / bounds[feature_index];
const int64 feature_bound = bounds[feature_index];
std::vector<float> scale(feature_bound, 2);
auto input_squared =
ReferenceUtil::MapArray4D(input_array, [](float a) { return a * a; });
std::vector<int64> reduce_dims;
for (int64 i = 0; i < bounds.size(); ++i) {
if (i != feature_index) {
reduce_dims.push_back(i);
}
}
auto sum =
ReferenceUtil::Reduce4DTo1D(input_array, /*init=*/0.0f, reduce_dims,
[](float a, float b) { return a + b; });
auto sum_squared =
ReferenceUtil::Reduce4DTo1D(*input_squared, /*init=*/0.0f, reduce_dims,
[](float a, float b) { return a + b; });
std::vector<float> mean(feature_bound);
for (int64 i = 0; i < feature_bound; ++i) {
mean[i] = sum[i] / num_elements_per_feature;
}
std::vector<float> mean_square(feature_bound);
for (int64 i = 0; i < feature_bound; ++i) {
mean_square[i] = mean[i] * mean[i];
}
std::vector<float> square_mean(feature_bound);
for (int64 i = 0; i < feature_bound; ++i) {
square_mean[i] = sum_squared[i] / num_elements_per_feature;
}
std::vector<float> var(feature_bound);
for (int64 i = 0; i < feature_bound; ++i) {
var[i] = square_mean[i] - mean_square[i];
}
Array4D<float> mean_4D =
*ReferenceUtil::Broadcast1DTo4D(mean, bounds, feature_index);
auto var_4D = *ReferenceUtil::Broadcast1DTo4D(var, bounds, feature_index);
auto scale_4D = *ReferenceUtil::Broadcast1DTo4D(scale, bounds, feature_index);
auto var_add_epsilon = *ReferenceUtil::MapArray4D(
var_4D, [epsilon](float a) { return std::sqrt(a + epsilon); });
auto grad_output_times_var =
*ReferenceUtil::MapArray4D(grad_output_array, var_add_epsilon,
[](float a, float b) { return a * b; });
auto grad_activation = *ReferenceUtil::MapArray4D(
grad_output_times_var, scale_4D, [](float a, float b) { return a * b; });
auto activation_shifted = *ReferenceUtil::MapArray4D(
input_array, mean_4D, [](float a, float b) { return a - b; });
auto grad_scale_before_reduction =
*ReferenceUtil::MapArray4D(grad_output_times_var, activation_shifted,
[](float a, float b) { return a * b; });
auto grad_scale = ReferenceUtil::Reduce4DTo1D(
grad_scale_before_reduction, /*init=*/0.0f, reduce_dims,
[](float a, float b) { return a + b; });
auto grad_offset =
ReferenceUtil::Reduce4DTo1D(grad_output_array, /*init=*/0.0f, reduce_dims,
[](float a, float b) { return a + b; });
auto expected_grad_activation =
Literal::CreateR4FromArray4D<float>(grad_activation);
auto input_literal = Literal::CreateR4FromArray4D<float>(input_array);
auto scale_literal = Literal::CreateR1<float>(scale);
auto mean_literal = Literal::CreateR1<float>(mean);
auto var_literal = Literal::CreateR1<float>(var);
auto grad_output_literal =
Literal::CreateR4FromArray4D<float>(grad_output_array);
auto input_parameter = builder.Parameter(0, input_literal->shape(), "input");
auto scale_parameter = builder.Parameter(1, scale_literal->shape(), "scale");
auto mean_parameter = builder.Parameter(2, mean_literal->shape(), "mean");
auto var_parameter = builder.Parameter(3, var_literal->shape(), "variance");
auto grad_output_parameter =
builder.Parameter(4, grad_output_literal->shape(), "grad_output");
std::unique_ptr<GlobalData> input_data =
client_->TransferToServer(*input_literal).ConsumeValueOrDie();
std::unique_ptr<GlobalData> scale_data =
client_->TransferToServer(*scale_literal).ConsumeValueOrDie();
std::unique_ptr<GlobalData> mean_data =
client_->TransferToServer(*mean_literal).ConsumeValueOrDie();
std::unique_ptr<GlobalData> var_data =
client_->TransferToServer(*var_literal).ConsumeValueOrDie();
std::unique_ptr<GlobalData> grad_output_data =
client_->TransferToServer(*grad_output_literal).ConsumeValueOrDie();
auto t = builder.BatchNormGrad(input_parameter, scale_parameter,
mean_parameter, var_parameter,
grad_output_parameter, epsilon, feature_index);
auto expected =
*Literal::MakeTuple({expected_grad_activation.get(),
Literal::CreateR1<float>(grad_scale).get(),
Literal::CreateR1<float>(grad_offset).get()});
ComputeAndCompareTuple(&builder, expected,
{input_data.get(), scale_data.get(), mean_data.get(),
var_data.get(), grad_output_data.get()},
ErrorSpec(0.01, 1));
}
INSTANTIATE_TEST_CASE_P(
BatchNormTest_Instantiation, BatchNormTest,
::testing::Values(BatchNormTestParam{{2, 2, 2, 2}, 0, 100.2f, 200.0f},
@@ -319,6 +450,7 @@ INSTANTIATE_TEST_CASE_P(
BatchNormTestParam{{10, 10, 10, 10}, 1, -666.6f, 777.7f},
BatchNormTestParam{{10, 10, 10, 10}, 2, 0.f, 777.7f},
BatchNormTestParam{{1, 1, 10, 130}, 2, 0.f, 777.7f},
BatchNormTestParam{{1, 1, 130, 11}, 2, 0.f, 777.7f},
BatchNormTestParam{{1, 1, 10, 1}, 3, 888.8f, 9.9f},
BatchNormTestParam{{24, 129, 1, 2}, 2, 10000, 10000},
@@ -446,6 +578,37 @@ XLA_TEST_F(BatchNormTest, DISABLED_ON_GPU(LargeEpsilonTest)) {
ErrorSpec(0.1));
}
// TODO(b/62764704): Implement on CPU and GPU. Disabled on 2017-07-11.
XLA_TEST_F(BatchNormTest, DISABLED_ON_CPU_PARALLEL(DISABLED_ON_CPU(
DISABLED_ON_GPU(BatchNormGradBasic)))) {
const int kFeatureIndex = 2;
ComputationBuilder builder(client_, TestName());
auto operand =
builder.ConstantR4FromArray4D<float>(Array4D<float>(2, 2, 2, 1, 0.0f));
auto scale = builder.ConstantR1<float>({1.0f, 1.0f});
auto mean = builder.ConstantR1<float>({0.0f, 0.0f});
auto var = builder.ConstantR1<float>({1.0f, 1.0f});
auto grad_output = builder.ConstantR4FromArray4D<float>(
{{{{1.f}, {2.f}}, {{3.f}, {4.f}}}, {{{5.f}, {6.f}}, {{7.f}, {8.f}}}});
builder.BatchNormGrad(operand, scale, mean, var, grad_output,
/*epsilon=*/0.0, kFeatureIndex);
auto expected = *Literal::MakeTuple(
{Literal::CreateR4<float>(
{{{{1.f}, {2.f}}, {{3.f}, {4.f}}}, {{{5.f}, {6.f}}, {{7.f}, {8.f}}}})
.get(),
Literal::CreateR1<float>({0, 0}).get(),
Literal::CreateR1<float>({16, 20}).get()});
ComputeAndCompareTuple(&builder, expected, {}, ErrorSpec(0.1));
}
}  // namespace
}  // namespace xla
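An informal cross-check of the expected tuple in BatchNormGradBasic above, derived only from the values in the test: with scale = 1, mean = 0, var = 1, epsilon = 0 and feature_index = 2, the reference formulas used in RandomizedGradTests reduce to

\[
\begin{aligned}
\text{grad\_activation} &= \text{grad\_output} \cdot \text{scale} \cdot \sqrt{\text{var} + \epsilon} = \text{grad\_output},\\
\text{grad\_scale}[c] &= \sum_{\text{non-feature dims}} \text{grad\_output} \cdot (x - \text{mean}) \cdot \sqrt{\text{var} + \epsilon} = 0 \quad (\text{since } x = 0),\\
\text{grad\_offset}[c] &= \sum_{\text{non-feature dims}} \text{grad\_output} = \{1{+}3{+}5{+}7,\; 2{+}4{+}6{+}8\} = \{16, 20\},
\end{aligned}
\]

which matches the expected tuple of (grad_output, {0, 0}, {16, 20}).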


@@ -44,295 +44,310 @@ namespace {
class DynamicSliceTest : public ClientLibraryTestBase {
protected:
template <typename IndexT, typename DataT>
void TestR1() {
// Slice at dimension start.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {0}, {5}, {0, 1, 2, 3, 4});
// Slice in the middle.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {2}, {3}, {2, 3, 4});
// Slice at dimension boundaries.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {5}, {3}, {5, 6, 7});
// Slice at dimension boundaries, but with sizes that cause indices to wrap.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {6}, {4}, {6, 7, 0, 1});
// Zero element slice.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {2}, {0}, {});
}

template <typename IndexT, typename DataT>
void TestR2() {
// Slice at dimension start.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {0, 0}, {2, 2},
{{1, 2}, {4, 5}});
// Slice in the middle.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {1, 1}, {2, 1},
{{5}, {8}});
// Slice at dimension boundaries.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {1, 1}, {2, 1},
{{5}, {8}});
// Slice at dimension boundaries, but with sizes that cause indices to wrap.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {1, 1}, {3, 3},
{{5, 6, 4}, {8, 9, 7}, {2, 3, 1}});
// Zero element slice: 2x0.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {0, 0}, {2, 0},
{{}, {}});
// Zero element slice: 0x2.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {0, 0}, {0, 2},
Array2D<DataT>(0, 2));
}

template <typename IndexT, typename DataT>
void TestR3() {
// R3 Shape: [2, 3, 2]
// clang-format off
// Slice at dimension start.
RunR3<IndexT, DataT>(
{{{1, 2}, {3, 4}, {5, 6}},
{{7, 8}, {9, 10}, {11, 12}}},
{0, 0, 0}, {2, 1, 2},
{{{1, 2}}, {{7, 8}}});
// Slice in the middle.
RunR3<IndexT, DataT>(
{{{1, 2}, {3, 4}, {5, 6}},
{{7, 8}, {9, 10}, {11, 12}}},
{0, 1, 1}, {2, 2, 1},
{{{4}, {6}}, {{10}, {12}}});
// Slice at dimension boundaries, but with sizes that cause indices to wrap.
RunR3<IndexT, DataT>(
{{{1, 2}, {3, 4}, {5, 6}},
{{7, 8}, {9, 10}, {11, 12}}},
{0, 2, 1}, {2, 1, 2},
{{{6, 5}}, {{12, 11}}});
// clang-format on
}

template <typename IndexT, typename DataT>
void RunR1(tensorflow::gtl::ArraySlice<DataT> input_values,
const std::vector<IndexT> slice_starts,
const std::vector<int64>& slice_sizes,
tensorflow::gtl::ArraySlice<DataT> expected_values) {
ComputationBuilder builder(client_, TestName());
// Initialize and transfer dynamic slice start indices parameter.
ComputationDataHandle starts;
std::unique_ptr<GlobalData> start_data = CreateR1Parameter<IndexT>(
slice_starts, 0, "slice_starts", &builder, &starts);
// Build dynamic slice computation.
auto input = builder.ConstantR1<DataT>(input_values);
builder.DynamicSlice(input, starts, slice_sizes);
// Run computation and compare against expected values.
ComputeAndCompareR1<DataT>(&builder, expected_values, {start_data.get()});
}

template <typename IndexT, typename DataT>
void RunR2(const Array2D<DataT>& input_values,
const std::vector<IndexT> slice_starts,
const std::vector<int64>& slice_sizes,
const Array2D<DataT>& expected_values) {
ComputationBuilder builder(client_, TestName());
// Initialize and transfer dynamic slice start indices parameter.
ComputationDataHandle starts;
std::unique_ptr<GlobalData> start_data = CreateR1Parameter<IndexT>(
slice_starts, 0, "slice_starts", &builder, &starts);
// Build dynamic slice computation.
auto input = builder.ConstantR2FromArray2D<DataT>(input_values);
builder.DynamicSlice(input, starts, slice_sizes);
// Run computation and compare against expected values.
ComputeAndCompareR2<DataT>(&builder, expected_values, {start_data.get()});
}

template <typename IndexT, typename DataT>
void RunR3(const Array3D<DataT>& input_values,
const std::vector<IndexT> slice_starts,
const std::vector<int64>& slice_sizes,
const Array3D<DataT>& expected_values) {
ComputationBuilder builder(client_, TestName());
// Initialize and transfer dynamic slice start indices parameter.
ComputationDataHandle starts;
std::unique_ptr<GlobalData> start_data = CreateR1Parameter<IndexT>(
slice_starts, 0, "slice_starts", &builder, &starts);
// Build dynamic slice computation.
auto input = builder.ConstantR3FromArray3D<DataT>(input_values);
builder.DynamicSlice(input, starts, slice_sizes);
// Run computation and compare against expected values.
ComputeAndCompareR3<DataT>(&builder, expected_values, {start_data.get()});
}
};

XLA_TEST_F(DynamicSliceTest, Int32R1) { TestR1<int32, int32>(); }
XLA_TEST_F(DynamicSliceTest, Int64R1) { TestR1<int64, float>(); }
XLA_TEST_F(DynamicSliceTest, UInt64R1) { TestR1<uint64, double>(); }

XLA_TEST_F(DynamicSliceTest, Int32R2) { TestR2<int32, float>(); }
XLA_TEST_F(DynamicSliceTest, Int64R2) { TestR2<int64, double>(); }
XLA_TEST_F(DynamicSliceTest, UInt64R2) { TestR2<uint64, int32>(); }

XLA_TEST_F(DynamicSliceTest, Int32R3) { TestR3<int32, int32>(); }
XLA_TEST_F(DynamicSliceTest, Int64R3) { TestR3<int64, float>(); }
XLA_TEST_F(DynamicSliceTest, UInt64R3) { TestR3<uint64, double>(); }
XLA_TEST_F(DynamicSliceTest, Int32R1Pred) {
// Slice at dimension start.
RunR1<int32, bool>({true, false, false, true, false, true, true, false}, {0},
{5}, {true, false, false, true, false});
// Slice in the middle.
RunR1<int32, bool>({true, false, false, true, false, true, true, false}, {2},
{3}, {false, true, false});
// Slice at dimension boundaries.
RunR1<int32, bool>({true, false, false, true, false, true, true, false}, {5},
{3}, {true, true, false});
// Zero element slice.
RunR1<int32, bool>({true, false, false, true, false, true, true, false}, {2},
{0}, {});
}
XLA_TEST_F(DynamicSliceTest, Int32R2Pred) {
// Slice at dimension start.
RunR2<int32, bool>(
{{true, false, true}, {false, false, true}, {true, true, false}}, {0, 0},
{2, 2}, {{true, false}, {false, false}});
// Slice in the middle.
RunR2<int32, bool>(
{{true, false, true}, {false, false, true}, {true, true, false}}, {1, 1},
{2, 1}, {{false}, {true}});
// Slice at dimension boundaries.
RunR2<int32, bool>(
{{true, false, true}, {false, false, true}, {true, true, false}}, {1, 1},
{2, 1}, {{false}, {true}});
// Zero element slice: 2x0.
RunR2<int32, bool>(
{{true, false, true}, {false, false, true}, {true, true, false}}, {0, 0},
{2, 0}, {{}, {}});
// Zero element slice: 0x2.
RunR2<int32, bool>(
{{true, false, true}, {false, false, true}, {true, true, false}}, {0, 0},
{0, 2}, Array2D<bool>(0, 2));
}
XLA_TEST_F(DynamicSliceTest, Int32R3Pred) {
// R3 Shape: [2, 3, 2]
// clang-format off
// Slice at dimension start.
RunR3<int32, bool>(
{{{true, false}, {false, true}, {true, true}},
{{false, true}, {true, false}, {false, false}}},
{0, 0, 0}, {2, 1, 2},
{{{true, false}}, {{false, true}}});
// Slice in the middle.
RunR3<int32, bool>(
{{{true, false}, {false, true}, {true, true}},
{{false, true}, {true, false}, {false, false}}},
{0, 1, 1}, {2, 2, 1},
{{{true}, {true}}, {{false}, {false}}});
// clang-format on
}
class DynamicUpdateSliceTest : public ClientLibraryTestBase {
protected:
template <typename IndexT, typename DataT>
void TestR1() {
// Slice at dimension start.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {8, 9, 10}, {0},
{8, 9, 10, 3, 4, 5, 6, 7});
// Slice in the middle.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {8, 9, 10}, {2},
{0, 1, 8, 9, 10, 5, 6, 7});
// Slice at dimension boundaries.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {8, 9, 10}, {5},
{0, 1, 2, 3, 4, 8, 9, 10});
// Slice at dimension boundaries, but with sizes that cause indices to wrap.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {8, 9, 10}, {6},
{0, 1, 2, 3, 4, 5, 8, 9});
// Zero-sized update.
RunR1<IndexT, DataT>({0, 1, 2, 3, 4, 5, 6, 7}, {}, {2},
{0, 1, 2, 3, 4, 5, 6, 7});
}

template <typename IndexT, typename DataT>
void TestR2() {
// Slice at dimension start.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {{10, 11}}, {0, 0},
{{10, 11, 3}, {4, 5, 6}, {7, 8, 9}});
// Slice in the middle.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {{10, 11}}, {1, 1},
{{1, 2, 3}, {4, 10, 11}, {7, 8, 9}});
// Slice at dimension boundaries.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {{10, 11}}, {2, 1},
{{1, 2, 3}, {4, 5, 6}, {7, 10, 11}});
// Slice at dimension boundaries, but with sizes that cause indices to wrap.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {{10, 11}}, {2, 2},
{{1, 2, 3}, {4, 5, 6}, {7, 8, 10}});
// Zero-sized update.
RunR2<IndexT, DataT>({{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, {{}}, {2, 1},
{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
}

template <typename IndexT, typename DataT>
void TestR3() {
// R3 Shape: [2, 3, 2]
// Slice at dimension start.
RunR3<IndexT, DataT>(
{{{1, 2}, {3, 4}, {5, 6}}, {{7, 8}, {9, 10}, {11, 12}}},
{{{13, 14}, {15, 16}}, {{17, 18}, {19, 20}}}, {0, 0, 0},
{{{13, 14}, {15, 16}, {5, 6}}, {{17, 18}, {19, 20}, {11, 12}}});
// Slice in the middle.
RunR3<IndexT, DataT>(
{{{1, 2}, {3, 4}, {5, 6}}, {{7, 8}, {9, 10}, {11, 12}}}, {{{13}, {15}}},
{1, 1, 1}, {{{1, 2}, {3, 4}, {5, 6}}, {{7, 8}, {9, 13}, {11, 15}}});
// Slice at dimension boundaries, but with sizes that cause indices to wrap.
RunR3<IndexT, DataT>(
{{{1, 2}, {3, 4}, {5, 6}}, {{7, 8}, {9, 10}, {11, 12}}}, {{{13}, {15}}},
{1, 2, 1}, {{{1, 2}, {3, 4}, {5, 6}}, {{7, 8}, {9, 10}, {11, 13}}});
}

template <typename IndexT, typename DataT>
void RunR1(tensorflow::gtl::ArraySlice<DataT> input_values,
tensorflow::gtl::ArraySlice<DataT> update_values,
const std::vector<IndexT> slice_starts,
tensorflow::gtl::ArraySlice<DataT> expected_values) {
ComputationBuilder builder(client_, TestName());
// Initialize and transfer dynamic slice start indices parameter.
ComputationDataHandle starts;
std::unique_ptr<GlobalData> start_data = CreateR1Parameter<IndexT>(
slice_starts, 0, "slice_starts", &builder, &starts);
// Build dynamic slice computation.
auto input = builder.ConstantR1<DataT>(input_values);
auto update = builder.ConstantR1<DataT>(update_values);
builder.DynamicUpdateSlice(input, update, starts);
// Run computation and compare against expected values.
ComputeAndCompareR1<DataT>(&builder, expected_values, {start_data.get()});
}

template <typename IndexT, typename DataT>
void RunR2(const Array2D<DataT>& input_values,
const Array2D<DataT>& update_values,
const std::vector<IndexT> slice_starts,
const Array2D<DataT>& expected_values) {
ComputationBuilder builder(client_, TestName());
// Initialize and transfer dynamic slice start indices parameter.
ComputationDataHandle starts;
std::unique_ptr<GlobalData> start_data = CreateR1Parameter<IndexT>(
slice_starts, 0, "slice_starts", &builder, &starts);
// Build dynamic slice computation.
auto input = builder.ConstantR2FromArray2D<DataT>(input_values);
auto update = builder.ConstantR2FromArray2D<DataT>(update_values);
builder.DynamicUpdateSlice(input, update, starts);
// Run computation and compare against expected values.
ComputeAndCompareR2<DataT>(&builder, expected_values, {start_data.get()});
}

template <typename IndexT, typename DataT>
void RunR3(const Array3D<DataT>& input_values,
const Array3D<DataT>& update_values,
const std::vector<IndexT> slice_starts,
const Array3D<DataT>& expected_values) {
ComputationBuilder builder(client_, TestName());
// Initialize and transfer dynamic slice start indices parameter.
ComputationDataHandle starts;
std::unique_ptr<GlobalData> start_data = CreateR1Parameter<IndexT>(
slice_starts, 0, "slice_starts", &builder, &starts);
// Build dynamic slice computation.
auto input = builder.ConstantR3FromArray3D<DataT>(input_values);
auto update = builder.ConstantR3FromArray3D<DataT>(update_values);
builder.DynamicUpdateSlice(input, update, starts);
// Run computation and compare against expected values.
ComputeAndCompareR3<DataT>(&builder, expected_values, {start_data.get()});
}

void RunR3Contiguous(std::vector<int32> operand_shape, int32 index,
@@ -393,23 +408,81 @@ class DynamicUpdateSliceTest : public ClientLibraryTestBase {
}
};

XLA_TEST_F(DynamicUpdateSliceTest, Int32R1) { TestR1<int32, float>(); }
XLA_TEST_F(DynamicUpdateSliceTest, Int64R1) { TestR1<int64, float>(); }
XLA_TEST_F(DynamicUpdateSliceTest, UInt64R1) { TestR1<uint64, double>(); }

XLA_TEST_F(DynamicUpdateSliceTest, Int32R2) { TestR2<int32, float>(); }
XLA_TEST_F(DynamicUpdateSliceTest, Int64R2) { TestR2<int64, int64>(); }
XLA_TEST_F(DynamicUpdateSliceTest, UInt64R2) { TestR2<uint64, int32>(); }

XLA_TEST_F(DynamicUpdateSliceTest, Int32R3) { TestR3<int32, float>(); }
XLA_TEST_F(DynamicUpdateSliceTest, Int64R3) { TestR3<int64, int64>(); }
XLA_TEST_F(DynamicUpdateSliceTest, UInt64R3) { TestR3<uint64, uint64>(); }
XLA_TEST_F(DynamicUpdateSliceTest, Int32R1Pred) {
// Slice at dimension start.
RunR1<int32, bool>({false, false, true, true, false, true, true, false},
{true, true, false}, {0},
{true, true, false, true, false, true, true, false});
// Slice in the middle.
RunR1<int32, bool>({false, false, true, true, false, true, true, false},
{false, true, true}, {2},
{false, false, false, true, true, true, true, false});
// Slice at dimension boundaries.
RunR1<int32, bool>({false, false, true, true, false, true, true, false},
{false, true, true}, {5},
{false, false, true, true, false, false, true, true});
// Zero-sized update.
RunR1<int32, bool>({false, false, true, true, false, true, true, false}, {},
{2}, {false, false, true, true, false, true, true, false});
}
XLA_TEST_F(DynamicUpdateSliceTest, Int32R2Pred) {
// Slice at dimension start.
RunR2<int32, bool>(
{{false, true, false}, {true, false, true}, {false, true, true}},
{{true, false}}, {0, 0},
{{true, false, false}, {true, false, true}, {false, true, true}});
// Slice in the middle.
RunR2<int32, bool>(
{{false, true, false}, {true, false, true}, {false, true, true}},
{{true, false}}, {1, 1},
{{false, true, false}, {true, true, false}, {false, true, true}});
// Slice at dimension boundaries.
RunR2<int32, bool>(
{{false, true, false}, {true, false, true}, {false, true, true}},
{{true, false}}, {2, 1},
{{false, true, false}, {true, false, true}, {false, true, false}});
// Zero-sized update.
RunR2<int32, bool>(
{{false, true, false}, {true, false, true}, {false, true, true}}, {{}},
{2, 1}, {{false, true, false}, {true, false, true}, {false, true, true}});
}
XLA_TEST_F(DynamicUpdateSliceTest, Int32R3Pred) {
// R3 Shape: [2, 3, 2]
// Slice at dimension start.
RunR3<int32, bool>(
{{{true, false}, {false, true}, {true, true}},
{{false, false}, {false, true}, {true, false}}},
{{{false, true}, {true, false}}, {{true, true}, {false, true}}},
{0, 0, 0},
{{{false, true}, {true, false}, {true, true}},
{{true, true}, {false, true}, {true, false}}});
// Slice in the middle.
RunR3<int32, bool>({{{true, false}, {false, true}, {true, true}},
{{false, false}, {false, true}, {true, false}}},
{{{false}, {true}}}, {1, 1, 1},
{{{true, false}, {false, true}, {true, true}},
{{false, false}, {false, false}, {true, true}}});
}
// Tests for simple R3 case where the update is contiguous (i.e. the minor
// two dimensions are not sliced).


@ -26,6 +26,7 @@ limitations under the License.
#include "tensorflow/compiler/xla/layout_util.h" #include "tensorflow/compiler/xla/layout_util.h"
#include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h" #include "tensorflow/compiler/xla/legacy_flags/debug_options_flags.h"
#include "tensorflow/compiler/xla/literal_util.h" #include "tensorflow/compiler/xla/literal_util.h"
#include "tensorflow/compiler/xla/service/reduce_precision_insertion.h"
#include "tensorflow/compiler/xla/statusor.h" #include "tensorflow/compiler/xla/statusor.h"
#include "tensorflow/compiler/xla/test.h" #include "tensorflow/compiler/xla/test.h"
#include "tensorflow/compiler/xla/tests/client_library_test_base.h" #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
@ -39,8 +40,11 @@ limitations under the License.
namespace xla {
namespace {

// Tests to confirm that the ReducePrecision operation produces the expected
// numerical values.
class ReducePrecisionAccuracyTest : public ClientLibraryTestBase,
                                    public ::testing::WithParamInterface<int> {
};

// For reduction to IEEE-f16, we want to test the following cases, in both
// positive and negative variants. (Note: IEEE-f16 is 5 exponent bits and 10
@ -201,7 +205,7 @@ static const uint32_t test_values[][4] = {
FPVAL(11111111, 1111111111, 1111111111111) // NaN
}};

XLA_TEST_P(ReducePrecisionAccuracyTest, ReducePrecisionF32) {
int index = GetParam();
int exponent_bits = exponent_sizes[index];
int mantissa_bits = mantissa_sizes[index];
@ -238,9 +242,87 @@ XLA_TEST_P(ReducePrecisionTest, ReducePrecisionF32) {
ComputeAndCompareR1<float>(&builder, expected_values, {a_data.get()});
}

INSTANTIATE_TEST_CASE_P(ReducePrecisionAccuracyTest,
ReducePrecisionAccuracyTest,
::testing::Values(0, 1, 2, 3), TestDataToString);
// Tests to confirm that the compiler optimization functions add the expected
// ReducePrecisionInsertion passes.
class ReducePrecisionInsertionTest : public ClientLibraryTestBase {};
XLA_TEST_F(ReducePrecisionInsertionTest, ReducePrecisionBeforeFusion) {
ComputationBuilder builder(client_, TestName());
std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({1.00001});
std::unique_ptr<GlobalData> a_data =
client_->TransferToServer(*a_literal).ConsumeValueOrDie();
auto a = builder.Parameter(0, a_literal->shape(), "a");
// Abs doesn't affect resolution.
auto abs = builder.Abs(a);
// Near 1.0, Log(x) approximates x - 1; this lets us confirm that the
// reduce-precision operation showed up in the correct place in the
// graph.
auto log = builder.Log(abs);
// Insert precision-reduction after the Abs(x) operation, rounding that
// result to exactly 1.0f.
auto reduce_precision_pass = execution_options_.mutable_debug_options()
->add_hlo_reduce_precision_options();
*reduce_precision_pass = ReducePrecisionInsertion::make_options_proto(
HloReducePrecisionOptions::BEFORE_OP_FUSION, 5, 10,
[](const HloOpcode opcode) { return opcode == HloOpcode::kAbs; });
ComputeAndCompareR1<float>(&builder, {0.0f}, {a_data.get()});
}
XLA_TEST_F(ReducePrecisionInsertionTest, ReducePrecisionSkippedAfterFusion) {
ComputationBuilder builder(client_, TestName());
std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({1.00001});
std::unique_ptr<GlobalData> a_data =
client_->TransferToServer(*a_literal).ConsumeValueOrDie();
auto a = builder.Parameter(0, a_literal->shape(), "a");
// These two operations should be fused by any reasonable backend.
auto abs = builder.Abs(a);
auto neg = builder.Neg(abs);
// Add a pass after operation fusion, suffixing kAbs operations. This
// should not see into the fusion nodes and thus should not affect the
// result.
auto reduce_precision_pass = execution_options_.mutable_debug_options()
->add_hlo_reduce_precision_options();
*reduce_precision_pass = ReducePrecisionInsertion::make_options_proto(
HloReducePrecisionOptions::AFTER_OP_FUSION, 5, 10,
[](const HloOpcode opcode) { return opcode == HloOpcode::kAbs; });
ComputeAndCompareR1<float>(&builder, {-1.00001f}, {a_data.get()});
}
XLA_TEST_F(ReducePrecisionInsertionTest, ReducePrecisionAddedAfterFusion) {
ComputationBuilder builder(client_, TestName());
std::unique_ptr<Literal> a_literal = Literal::CreateR1<float>({1.00001});
std::unique_ptr<GlobalData> a_data =
client_->TransferToServer(*a_literal).ConsumeValueOrDie();
auto a = builder.Parameter(0, a_literal->shape(), "a");
// These two operations should be fused by any reasonable backend.
auto abs = builder.Abs(a);
auto neg = builder.Neg(abs);
// Add a pass after operation fusion, suffixing kFusion operations.
auto reduce_precision_pass = execution_options_.mutable_debug_options()
->add_hlo_reduce_precision_options();
*reduce_precision_pass = ReducePrecisionInsertion::make_options_proto(
HloReducePrecisionOptions::AFTER_OP_FUSION, 5, 10,
[](const HloOpcode opcode) { return opcode == HloOpcode::kFusion; });
ComputeAndCompareR1<float>(&builder, {-1.0f}, {a_data.get()});
}
} // namespace
} // namespace xla


@ -20,6 +20,24 @@ import "tensorflow/compiler/xla/service/session.proto";
package xla;
// Options for the HLO insert-reduce-precision-operations pass.
message HloReducePrecisionOptions {
// When to run the pass.
enum PassTiming {
BEFORE_OP_FUSION = 0;
AFTER_OP_FUSION = 1;
}
PassTiming pass_timing = 1;
// Exponent and mantissa bit counts for the reduced precision.
uint32 exponent_bits = 2;
uint32 mantissa_bits = 3;
// Opcodes for operations that should be suffixed with reduced-precision
// operations.
repeated uint32 opcodes_to_suffix = 4;
}
// Debugging options for XLA. These options may change at any time - there are
// no guarantees about backward or forward compatibility for these fields.
message DebugOptions {
@ -112,6 +130,11 @@ message DebugOptions {
// the generated IR.
bool xla_llvm_enable_invariant_load_metadata = 72;
// Options for inserting reduce-precision operations for numerical
// experimentation. This is a repeated field, as we may want to have
// multiple passes with different parameters.
repeated HloReducePrecisionOptions hlo_reduce_precision_options = 80;
// This is used by ClientLibraryTestBase::ComputeAndCompare*. If true, the
// computation will run n! times with all permutations of layouts for the
// output shape in rank n. For example, with a 3D shape, all permutations of


@ -199,6 +199,7 @@ tensorflow/core/kernels/aggregate_ops.cc
tensorflow/core/kernels/depthwise_conv_op.cc
tensorflow/core/kernels/dequantize_op.cc
tensorflow/core/kernels/meta_support.cc
tensorflow/core/kernels/population_count_op.cc
tensorflow/core/kernels/quantization_utils.cc
tensorflow/core/kernels/quantize_down_and_shrink_range.cc
tensorflow/core/kernels/quantize_op.cc


@ -406,6 +406,31 @@ class RNNCellTest(test.TestCase):
# States are left untouched
self.assertAllClose(res[2], res[3])
def testResidualWrapperWithSlice(self):
with self.test_session() as sess:
with variable_scope.variable_scope(
"root", initializer=init_ops.constant_initializer(0.5)):
x = array_ops.zeros([1, 5])
m = array_ops.zeros([1, 3])
base_cell = rnn_cell_impl.GRUCell(3)
g, m_new = base_cell(x, m)
variable_scope.get_variable_scope().reuse_variables()
def residual_with_slice_fn(inp, out):
inp_sliced = array_ops.slice(inp, [0, 0], [-1, 3])
return inp_sliced + out
g_res, m_new_res = rnn_cell_impl.ResidualWrapper(
base_cell, residual_with_slice_fn)(x, m)
sess.run([variables_lib.global_variables_initializer()])
res_g, res_g_res, res_m_new, res_m_new_res = sess.run(
[g, g_res, m_new, m_new_res], {
x: np.array([[1., 1., 1., 1., 1.]]),
m: np.array([[0.1, 0.1, 0.1]])
})
# Residual connections
self.assertAllClose(res_g_res, res_g + [1., 1., 1.])
# States are left untouched
self.assertAllClose(res_m_new, res_m_new_res)
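A hedged NumPy sketch of the residual connection checked above: the wrapper runs the base cell, then combines the cell's input and output with the user-supplied residual function (here, slicing the 5-wide input down to the 3-wide output before adding). The stand-in cell output below is illustrative only.

```python
import numpy as np

def residual_with_slice_fn(inp, out):
    # Slice the wide input down to the cell-output width, then add.
    inp_sliced = inp[:, :3]
    return inp_sliced + out

x = np.ones((1, 5))
cell_output = np.array([[0.2, 0.3, 0.4]])  # stand-in for the GRU cell output
print(residual_with_slice_fn(x, cell_output))  # [[1.2 1.3 1.4]]
```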
def testDeviceWrapper(self):
with variable_scope.variable_scope(
"root", initializer=init_ops.constant_initializer(0.5)):


@ -220,15 +220,14 @@ class SpectralOpsTest(test.TestCase):
# stft_bound, inverse_stft_bound).
# TODO(rjryan): Investigate why STFT gradient error is so high.
test_configs = [
(64, 16, 8, 16),
(64, 16, 16, 16),
(64, 16, 7, 16),
(64, 7, 4, 9),
(29, 5, 1, 10),
]
for (signal_length, frame_length, frame_step, fft_length) in test_configs:
signal_shape = [signal_length]
signal = random_ops.random_uniform(signal_shape)
stft_shape = [max(0, 1 + (signal_length - frame_length) // frame_step),
@ -242,8 +241,8 @@ class SpectralOpsTest(test.TestCase):
stft, stft_shape)
inverse_stft_error = test.compute_gradient_error(
stft, stft_shape, inverse_stft, inverse_stft_shape)
self.assertLess(stft_error, 2e-3)
self.assertLess(inverse_stft_error, 4e-5)

if __name__ == "__main__":


@ -54,9 +54,12 @@ def _tpu_job(run_config):
return None if run_config.master in ['', 'local'] else 'tpu_worker'

def _per_shard_batch_size(global_batch_size, run_config, use_tpu):
"""Returns the batch size for each shard."""
if use_tpu:
return global_batch_size // run_config.tpu_config.num_shards
else:
return global_batch_size

class _SIGNAL(object):
@ -470,7 +473,7 @@ class _ModelFnWrapper(object):
self._train_batch_size = train_batch_size

def call_without_tpu(self, features, labels):
return self._call_model_fn(features, labels, False)
def convert_to_single_tpu_train_step(self, dequeue_fn):
"""Converts the `model_fn` as a single train step on TPU."""
@ -481,8 +484,8 @@ class _ModelFnWrapper(object):
features, labels = dequeue_fn()
# Makes deep copy with `config` and params` in case user mutates them.
estimator_spec = self._verify_estimator_spec(
self._call_model_fn(features, labels, True))
loss, train_op = estimator_spec.loss, estimator_spec.train_op
with ops.control_dependencies([train_op]):
return array_ops.identity(loss)
@ -492,7 +495,7 @@ class _ModelFnWrapper(object):
def config(self):
return self._config

def _call_model_fn(self, features, labels, use_tpu):
"""Calls the model_fn with required parameters."""
model_fn_args = util.fn_args(self._model_fn)
kwargs = {}
@ -513,7 +516,6 @@ class _ModelFnWrapper(object):
if 'params' in model_fn_args:
kwargs['params'] = params

if 'params' not in model_fn_args:
raise ValueError(
'model_fn ({}) does not include params argument, '
@ -521,8 +523,8 @@ class _ModelFnWrapper(object):
'params[\'batch_size\']'.format(self._model_fn))

if self._mode == model_fn_lib.ModeKeys.TRAIN:
# For TPU training. `params` is never `None`.
params[_BATCH_SIZE_KEY] = _per_shard_batch_size(
self._train_batch_size, config, use_tpu)

return self._model_fn(features=features, **kwargs)
@ -609,16 +611,12 @@ class TPUEstimator(estimator_lib.Estimator):
'batch size {} must be divisible by number of shards {}'
.format(train_batch_size, config.tpu_config.num_shards))

# Verifies the model_fn signature according to Estimator framework.
estimator_lib._verify_model_fn_args(model_fn, params)  # pylint: disable=protected-access
# We cannot store config and params in this constructor as parent
# constructor might change them, such as assigning a temp dir for
# config.model_dir.
model_function = _augment_model_fn(model_fn, train_batch_size, use_tpu)

super(TPUEstimator, self).__init__(
model_fn=model_function,
@ -670,9 +668,6 @@ class TPUEstimator(estimator_lib.Estimator):
Raises:
ValueError: if input_fn takes invalid arguments or does not have `params`.
"""
input_fn_args = util.fn_args(input_fn)
config = self.config  # a deep copy.
kwargs = {}
@ -686,8 +681,13 @@ class TPUEstimator(estimator_lib.Estimator):
kwargs['config'] = config

# Now for TPU training.
if mode == model_fn_lib.ModeKeys.TRAIN:
kwargs['params'][_BATCH_SIZE_KEY] = (
_per_shard_batch_size(self._train_batch_size, config, self._use_tpu))

if not self._use_tpu or mode != model_fn_lib.ModeKeys.TRAIN:
with ops.device('/cpu:0'):
return input_fn(**kwargs)

job = _tpu_job(config)
def placement_function(index): def placement_function(index):
@ -746,7 +746,7 @@ def _create_infeed_enqueue_ops_and_dequeue_fn(inputs_holder):
return (dequeue_fn, enqueue_fn) return (dequeue_fn, enqueue_fn)
def _augment_model_fn(model_fn, train_batch_size, use_tpu):
"""Returns a new model_fn, which wraps the TPU support."""

def _model_fn(features, labels, mode, config, params):
@ -755,7 +755,7 @@ def augment_model_fn_with_tpu_support(model_fn, train_batch_size):
train_batch_size)
# TODO(jhseu): Move to EVAL and PREDICT to TPU.
if not use_tpu or mode != model_fn_lib.ModeKeys.TRAIN:
return model_fn_wrapper.call_without_tpu(features, labels)

inputs = _InputsHolder(sharded_features=features, sharded_labels=labels)


@ -15,7 +15,9 @@ limitations under the License.
#include "tensorflow/core/framework/attr_value_util.h" #include "tensorflow/core/framework/attr_value_util.h"
#include <string>
#include <vector> #include <vector>
#include "tensorflow/core/framework/attr_value.pb_text.h" #include "tensorflow/core/framework/attr_value.pb_text.h"
#include "tensorflow/core/framework/tensor.pb_text.h" #include "tensorflow/core/framework/tensor.pb_text.h"
#include "tensorflow/core/framework/tensor_shape.pb.h" #include "tensorflow/core/framework/tensor_shape.pb.h"
@ -27,7 +29,6 @@ limitations under the License.
#include "tensorflow/core/platform/protobuf.h" #include "tensorflow/core/platform/protobuf.h"
namespace tensorflow { namespace tensorflow {
namespace { namespace {
string SummarizeString(const string& str) { string SummarizeString(const string& str) {
@ -460,7 +461,8 @@ bool HasPlaceHolder(const AttrValue& val) {
return false;
}

bool SubstitutePlaceholders(const SubstituteFunc& substitute,
AttrValue* value) {
switch (value->value_case()) {
case AttrValue::kList: {
for (NameAttrList& func : *value->mutable_list()->mutable_func()) {


@ -16,8 +16,10 @@ limitations under the License.
#ifndef TENSORFLOW_FRAMEWORK_ATTR_VALUE_UTIL_H_
#define TENSORFLOW_FRAMEWORK_ATTR_VALUE_UTIL_H_

#include <functional>
#include <string>
#include <vector>

#include "tensorflow/core/framework/attr_value.pb.h"  // TODO(62899350): Remove
#include "tensorflow/core/framework/partial_tensor_shape.h"
#include "tensorflow/core/framework/tensor.h"
@ -100,8 +102,8 @@ bool HasPlaceHolder(const AttrValue& val);
// SubstituteFunc is given a placeholder string. If the placeholder is
// unknown, SubstituteFunc returns false. Otherwise, overwrites the
// attr value and returns true.
using SubstituteFunc = std::function<bool(const string&, AttrValue*)>;
bool SubstitutePlaceholders(const SubstituteFunc& substitute, AttrValue* value);

} // namespace tensorflow


@ -104,7 +104,8 @@ ConstantFolding::ConstantFolding() {
ops_to_preserve_ = std::regex(
"Placeholder.*|Const|.*Save.*|.*Restore.*|.*Reader|"
"Enter|RefEnter|Exit|RefExit|NextIteration|RefNextIteration|"
".*Quantized.*",
std::regex_constants::optimize);
}

string ConstantFolding::AddControlDependency(const string& input_name) {
@ -240,13 +241,18 @@ Status ConstantFolding::MaterializeShapes(const GrapplerItem& item,
}

bool ConstantFolding::IsFoldable(const NodeDef& node) const {
// Folding not applicable to ops with no inputs.
if (node.input().empty()) {
return false;
}
// Skips nodes that must be preserved, and op_types that don't benefit from
// folding
if (nodes_to_preserve_.find(node.name()) != nodes_to_preserve_.end()) {
return false;
}
if (std::regex_match(node.op().c_str(), ops_to_preserve_,
std::regex_constants::match_any)) {
return false;
}
@ -264,23 +270,6 @@ bool ConstantFolding::IsFoldable(const NodeDef& node) const {
return false;
}

// No need to (and don't) fold nodes that have no outgoing edges. Such nodes
// could be introduced by an earlier constant folding pass and are preserved
// in case users want to fetch their values; re-processing them would
@ -391,12 +380,15 @@ Status ConstantFolding::EvaluateOneFoldable(const NodeDef& node,
// Control dependency
break;
}
const NodeDef* input_node = node_map_->GetNode(input);
if (!IsConstant(*input_node)) {
return Status(error::INVALID_ARGUMENT,
strings::StrCat("Can't fold ", node.name(), ", its ", input,
" isn't constant"));
}
Tensor* value = new Tensor(input_node->attr().at("dtype").type());
CHECK(value->FromProto(input_node->attr().at("value").tensor()));
inputs.emplace_back(value);
}

TensorVector output_tensors;
@ -583,24 +575,31 @@ Status ConstantFolding::FoldNode(const NodeDef& node, GraphDef* output) {
Status ConstantFolding::FoldGraph(GraphDef* output) {
std::unordered_set<string> processed_nodes;
std::deque<const NodeDef*> queue;
for (const auto& node : graph_.node()) {
if (IsFoldable(node)) {
queue.push_back(&node);
}
}
while (!queue.empty()) {
const NodeDef* node = queue.front();
queue.pop_front();
if (processed_nodes.count(node->name())) {
continue;
}
Status s = FoldNode(*node, output);
processed_nodes.insert(node->name());
if (!s.ok()) {
VLOG(1) << "Failed to fold node " << node->name() << ": " << s;
} else {
auto outputs = node_map_->GetOutputs(node->name());
for (auto& output : outputs) {
if (IsFoldable(*output)) {
queue.push_back(output);
}
}
}
}

// Build the graph after constant folding. Note that we keep all processed
// nodes in the graph in case users need to fetch their values.
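A hedged Python sketch of the worklist strategy used in FoldGraph above: seed a queue with the nodes that are foldable right away and, after folding a node, re-examine only its consumers instead of rescanning the whole graph until a fixed point. The helper names (`is_foldable`, `consumers_of`, `fold`) are hypothetical stand-ins for the C++ methods.

```python
from collections import deque

def fold_graph(nodes, is_foldable, consumers_of, fold):
    # Seed the worklist with immediately foldable nodes.
    queue = deque(n for n in nodes if is_foldable(n))
    processed = set()
    while queue:
        node = queue.popleft()
        if node in processed:
            continue
        processed.add(node)
        if fold(node):  # returns True when the node was folded successfully
            # Only the folded node's consumers can become newly foldable.
            for consumer in consumers_of(node):
                if is_foldable(consumer):
                    queue.append(consumer)
    return processed
```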
@ -740,7 +739,6 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
Status ConstantFolding::Optimize(Cluster* cluster, const GrapplerItem& item,
GraphDef* output) {
graph_ = item.graph;
node_map_.reset(new NodeMap(&graph_));
for (const auto& node : item.fetch) {
nodes_to_preserve_.insert(NodeName(node));
@ -761,7 +759,6 @@ Status ConstantFolding::Optimize(Cluster* cluster, const GrapplerItem& item,
TF_RETURN_IF_ERROR(FoldGraph(output));
TF_RETURN_IF_ERROR(SimplifyGraph(output, properties));

*output->mutable_library() = item.graph.library();
*output->mutable_versions() = item.graph.versions();


@ -701,6 +701,39 @@ tf_kernel_library(
deps = ARRAY_DEPS,
)
tf_kernel_library(
name = "compare_and_bitpack_op",
srcs = ["compare_and_bitpack_op.cc"],
hdrs = ["compare_and_bitpack_op.h"],
gpu_srcs = [
"compare_and_bitpack_op.h",
"compare_and_bitpack_op_gpu.cu.cc",
],
deps = ARRAY_DEPS,
)
# TODO(ebrevdo): Add benchmarks once the op is in the autogen array namespace.
# tf_cuda_cc_test(
# name = "compare_and_bitpack_op_test",
# srcs = ["compare_and_bitpack_op_test.cc"],
# deps = [
# ":array",
# ":ops_testutil",
# ":ops_util",
# "//third_party/eigen3",
# "//tensorflow/cc:cc_ops",
# "//tensorflow/cc:cc_ops_internal",
# "//tensorflow/core:core_cpu",
# "//tensorflow/core:core_cpu_internal",
# "//tensorflow/core:framework",
# "//tensorflow/core:lib",
# "//tensorflow/core:protos_all_cc",
# "//tensorflow/core:test",
# "//tensorflow/core:test_main",
# "//tensorflow/core:testlib",
# ],
# )
tf_kernel_library(
name = "reshape_op",
prefix = "reshape_op",
@ -2344,10 +2377,12 @@ cc_library(
":bucketize_op", ":bucketize_op",
":cast_op", ":cast_op",
":check_numerics_op", ":check_numerics_op",
":compare_and_bitpack_op",
":cross_op", ":cross_op",
":cwise_op", ":cwise_op",
":fft_ops", ":fft_ops",
":matmul_op", ":matmul_op",
":population_count_op",
":reduction_ops", ":reduction_ops",
":scan_ops", ":scan_ops",
":segment_reduction_ops", ":segment_reduction_ops",
@ -2409,6 +2444,12 @@ tf_kernel_library(
deps = MATH_DEPS + ["//tensorflow/core:bitwise_ops_op_lib"],
)
tf_kernel_library(
name = "population_count_op",
prefix = "population_count_op",
deps = MATH_DEPS,
)
tf_kernel_library(
name = "fft_ops",
prefix = "fft_ops",
@ -4292,6 +4333,8 @@ filegroup(
"fake_quant_ops.cc", "fake_quant_ops.cc",
"fifo_queue.cc", "fifo_queue.cc",
"fused_batch_norm_op.cc", "fused_batch_norm_op.cc",
"population_count_op.cc",
"population_count_op.h",
"winograd_transform.h", "winograd_transform.h",
":android_extended_ops_headers", ":android_extended_ops_headers",
] + select({ ] + select({


@ -0,0 +1,185 @@
/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
// See docs in ../ops/math_ops.cc
#define EIGEN_USE_THREADS
#include "tensorflow/core/kernels/compare_and_bitpack_op.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/types.h"
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/util/work_sharder.h"
namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
template <typename Device, typename T>
class CompareAndBitpackOp : public OpKernel {
public:
explicit CompareAndBitpackOp(OpKernelConstruction* context)
: OpKernel(context) {}
void Compute(OpKernelContext* c) override {
const Tensor& input_t = c->input(0);
const Tensor& threshold_t = c->input(1);
OP_REQUIRES(
c, TensorShapeUtils::IsScalar(threshold_t.shape()),
errors::InvalidArgument("Compare must be a scalar, but saw shape: ",
threshold_t.shape().DebugString()));
const TensorShape& input_shape = input_t.shape();
OP_REQUIRES(c, TensorShapeUtils::IsVectorOrHigher(input_shape),
errors::InvalidArgument(
"Input should be at least a vector, but saw a scalar."));
OP_REQUIRES(c, input_shape.dim_size(input_shape.dims() - 1) % 8 == 0,
errors::InvalidArgument(
"Inner dimension of input should be "
"divisible by ",
8, ", but saw shape: ", input_shape.DebugString()));
TensorShape output_shape = input_shape;
int rank = input_shape.dims();
output_shape.set_dim(rank - 1, input_shape.dim_size(rank - 1) / 8);
Tensor* output_t;
OP_REQUIRES_OK(c, c->allocate_output(0, output_shape, &output_t));
auto input = input_t.flat_inner_dims<T>();
auto threshold = threshold_t.scalar<T>();
auto output = output_t->flat_inner_dims<uint8>();
functor::CompareAndBitpack<Device, T> func;
func(c, input, threshold, output);
}
};
#define REGISTER_COMPARE_AND_BITPACK(type) \
REGISTER_KERNEL_BUILDER( \
Name("CompareAndBitpack").Device(DEVICE_CPU).TypeConstraint<type>("T"), \
CompareAndBitpackOp<CPUDevice, type>);
TF_CALL_REAL_NUMBER_TYPES(REGISTER_COMPARE_AND_BITPACK);
TF_CALL_bool(REGISTER_COMPARE_AND_BITPACK);
#undef REGISTER_COMPARE_AND_BITPACK
namespace functor {
template <typename T, class = void, class = void>
struct ComputeShard {
static EIGEN_STRONG_INLINE void Compute(typename TTypes<T>::ConstMatrix input,
typename TTypes<uint8>::Matrix output,
const T& thresh, int64 start,
int64 limit) {
for (int64 i = start; i < limit; ++i) {
uint8* out = output.data() + i;
const T* block = input.data() + 8 * i;
*out = ((((block[0] > thresh) << 7)) | (((block[1] > thresh) << 6)) |
(((block[2] > thresh) << 5)) | (((block[3] > thresh) << 4)) |
(((block[4] > thresh) << 3)) | (((block[5] > thresh) << 2)) |
(((block[6] > thresh) << 1)) | (((block[7] > thresh))));
}
}
};
// Specialization for bool on systems where sizeof(bool) == 1.
template <typename T>
struct ComputeShard<T,
typename std::enable_if<std::is_same<T, bool>::value>::type,
typename std::enable_if<sizeof(T) == 1>::type> {
static EIGEN_STRONG_INLINE void Compute(
typename TTypes<bool>::ConstMatrix input,
typename TTypes<uint8>::Matrix output, bool /*thresh*/, int64 start,
int64 limit) {
// NOTE(ebrevdo): This assumes memory is little-endian.
for (int64 i = start; i < limit; ++i) {
uint8* out = output.data() + i;
const int64 block = *reinterpret_cast<const int64*>(input.data() + 8 * i);
*out =
((((block & (1LL << (7 * 8))) >> (7 * 8 - 0))) |
(((block & (1LL << (6 * 8))) >> (6 * 8 - 1))) |
(((block & (1LL << (5 * 8))) >> (5 * 8 - 2))) |
(((block & (1LL << (4 * 8))) >> (4 * 8 - 3))) |
(((block & (1LL << (3 * 8))) >> (3 * 8 - 4))) |
(((block & (1LL << (2 * 8))) >> (2 * 8 - 5))) |
(((block & (1LL << 8)) >> (1 * 8 - 6))) | (((block & (1LL)) << 7)));
}
}
};
template <typename T>
struct CompareAndBitpack<CPUDevice, T> {
void operator()(OpKernelContext* c, typename TTypes<T>::ConstMatrix input,
typename TTypes<T>::ConstScalar threshold,
TTypes<uint8>::Matrix output) {
const T thresh = threshold();
auto shard = [&, thresh](int64 start, int64 limit) {
ComputeShard<T>::Compute(input, output, thresh, start, limit);
};
int64 total_shards = output.size(); // Approximate cmp as an add and
// bitwise-or + shift as an add.
const double total_cost = 8 * (Eigen::TensorOpCost::AddCost<T>() +
Eigen::TensorOpCost::AddCost<uint8>());
const int64 shard_cost = (total_cost >= static_cast<double>(kint64max))
? kint64max
: static_cast<int64>(total_cost);
auto worker_threads = *(c->device()->tensorflow_cpu_worker_threads());
Shard(worker_threads.num_threads, worker_threads.workers, total_shards,
shard_cost, shard);
}
};
} // namespace functor
#if GOOGLE_CUDA
#define REGISTER_COMPARE_AND_BITPACK(type) \
REGISTER_KERNEL_BUILDER( \
Name("CompareAndBitpack").Device(DEVICE_GPU).TypeConstraint<type>("T"), \
CompareAndBitpackOp<GPUDevice, type>);
TF_CALL_GPU_NUMBER_TYPES(REGISTER_COMPARE_AND_BITPACK);
TF_CALL_bool(REGISTER_COMPARE_AND_BITPACK);
#undef REGISTER_COMPARE_AND_BITPACK
namespace functor {
#define DECLARE_GPU_SPEC(T) \
template <> \
void CompareAndBitpack<GPUDevice, T>::operator()( \
OpKernelContext* c, typename TTypes<T>::ConstMatrix input, \
typename TTypes<T>::ConstScalar threshold, \
TTypes<uint8>::Matrix output); \
extern template struct CompareAndBitpack<GPUDevice, T>;
TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPEC)
TF_CALL_bool(DECLARE_GPU_SPEC)
#undef DECLARE_GPU_SPEC
} // namespace functor
#endif // GOOGLE_CUDA
} // namespace tensorflow


@ -0,0 +1,42 @@
/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#ifndef THIRD_PARTY_TENSORFLOW_CORE_KERNELS_COMPARE_AND_BITPACK_OP_H_
#define THIRD_PARTY_TENSORFLOW_CORE_KERNELS_COMPARE_AND_BITPACK_OP_H_
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/tensor_types.h"
#include "tensorflow/core/platform/types.h"
namespace tensorflow {
namespace functor {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
template <typename Device, typename T>
struct CompareAndBitpack {
void operator()(OpKernelContext* c, typename TTypes<T>::ConstMatrix input,
typename TTypes<T>::ConstScalar threshold,
TTypes<uint8>::Matrix output);
};
} // namespace functor
} // namespace tensorflow
#endif // THIRD_PARTY_TENSORFLOW_CORE_KERNELS_COMPARE_AND_BITPACK_OP_H_


@ -0,0 +1,141 @@
/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#if GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "tensorflow/core/kernels/compare_and_bitpack_op.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor_types.h"
#include "tensorflow/core/platform/types.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"
namespace tensorflow {
typedef Eigen::GpuDevice GPUDevice;
namespace functor {
template <typename T>
__global__ void CompareAndBitpackKernel(const int size, const T* threshold,
const T* input, uint8* output) {
// TODO(ebrevdo): Erich said: to get a better memory access pattern
// you could have 8 threads load this data and do a comparison, then
// use the ballot instruction to combine the values from each thread
// in the warp in one instruction (so each thread will have the
// result for 4 blocks) followed by an appropriate shift and mask to
// get the 8-bits of interest.
const T thresh = ldg(threshold);
CUDA_1D_KERNEL_LOOP(i, size) {
const T* block = input + 8 * i;
output[i] =
((((ldg(block) > thresh) << 7)) | (((ldg(block + 1) > thresh) << 6)) |
(((ldg(block + 2) > thresh) << 5)) |
(((ldg(block + 3) > thresh) << 4)) |
(((ldg(block + 4) > thresh) << 3)) |
(((ldg(block + 5) > thresh) << 2)) |
(((ldg(block + 6) > thresh) << 1)) | (((ldg(block + 7) > thresh))));
}
}
template <>
__global__ void CompareAndBitpackKernel<bool>(const int size,
const bool* threshold,
const bool* input,
uint8* output) {
// TODO(ebrevdo): Erich said: I think you could again have multiple
// threads work on one block and use the ballot instruction to the
// bit packing in one instruction.
CUDA_1D_KERNEL_LOOP(i, size) {
const int64 block = ldg(reinterpret_cast<const int64*>(input + 8 * i));
// NOTE(ebrevdo): This assumes memory is little-endian.
output[i] =
((((block & (1LL << (7 * 8))) >> (7 * 8 - 0))) |
(((block & (1LL << (6 * 8))) >> (6 * 8 - 1))) |
(((block & (1LL << (5 * 8))) >> (5 * 8 - 2))) |
(((block & (1LL << (4 * 8))) >> (4 * 8 - 3))) |
(((block & (1LL << (3 * 8))) >> (3 * 8 - 4))) |
(((block & (1LL << (2 * 8))) >> (2 * 8 - 5))) |
(((block & (1LL << 8)) >> (1 * 8 - 6))) | (((block & (1LL)) << 7)));
}
}
template <>
__global__ void CompareAndBitpackKernel<float>(const int size,
const float* threshold,
const float* input,
uint8* output) {
const float thresh = ldg(threshold);
CUDA_1D_KERNEL_LOOP(i, size) {
const float4 block0 = ldg(reinterpret_cast<const float4*>(input + 8 * i));
const float4 block1 =
ldg(reinterpret_cast<const float4*>(input + 8 * i + 4));
output[i] = ((((block0.x > thresh) << 7)) | (((block0.y > thresh) << 6)) |
(((block0.z > thresh) << 5)) | (((block0.w > thresh) << 4)) |
(((block1.x > thresh) << 3)) | (((block1.y > thresh) << 2)) |
(((block1.z > thresh) << 1)) | (((block1.w > thresh))));
}
}
template <>
__global__ void CompareAndBitpackKernel<double>(const int size,
const double* threshold,
const double* input,
uint8* output) {
const double thresh = ldg(threshold);
CUDA_1D_KERNEL_LOOP(i, size) {
const double2 block0 = ldg(reinterpret_cast<const double2*>(input + 8 * i));
const double2 block1 =
ldg(reinterpret_cast<const double2*>(input + 8 * i + 2));
const double2 block2 =
ldg(reinterpret_cast<const double2*>(input + 8 * i + 4));
const double2 block3 =
ldg(reinterpret_cast<const double2*>(input + 8 * i + 6));
output[i] = ((((block0.x > thresh) << 7)) | (((block0.y > thresh) << 6)) |
(((block1.x > thresh) << 5)) | (((block1.y > thresh) << 4)) |
(((block2.x > thresh) << 3)) | (((block2.y > thresh) << 2)) |
(((block3.x > thresh) << 1)) | (((block3.y > thresh))));
}
}
#define DEFINE_GPU_SPECS(T) \
template <> \
void CompareAndBitpack<GPUDevice, T>::operator()( \
OpKernelContext* c, typename TTypes<T>::ConstMatrix input, \
typename TTypes<T>::ConstScalar threshold, \
TTypes<uint8>::Matrix output) { \
const GPUDevice& d = c->eigen_device<GPUDevice>(); \
int64 total_count = output.size(); \
CudaLaunchConfig config = GetCudaLaunchConfig(total_count, d); \
\
CompareAndBitpackKernel<T> \
<<<config.block_count, config.thread_per_block, 0, d.stream()>>>( \
total_count, threshold.data(), input.data(), output.data()); \
}
TF_CALL_GPU_NUMBER_TYPES(DEFINE_GPU_SPECS)
TF_CALL_bool(DEFINE_GPU_SPECS)
#undef DEFINE_GPU_SPECS
} // namespace functor
} // namespace tensorflow
#endif // GOOGLE_CUDA


@ -0,0 +1,163 @@
/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
// See docs in ../ops/math_ops.cc
#define EIGEN_USE_THREADS
#include <bitset>
#include "tensorflow/core/kernels/population_count_op.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/types.h"
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/util/work_sharder.h"
namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
template <typename Device, typename T>
class PopulationCountOp : public OpKernel {
public:
explicit PopulationCountOp(OpKernelConstruction* context)
: OpKernel(context) {}
void Compute(OpKernelContext* c) override {
const Tensor& input_t = c->input(0);
Tensor* output_t;
OP_REQUIRES_OK(c, c->allocate_output(0, input_t.shape(), &output_t));
auto input = input_t.flat<T>();
auto output = output_t->flat<uint8>();
functor::PopulationCount<Device, T> popcnt;
popcnt(c, input, output);
}
};
#define REGISTER_POPULATION_COUNT(type) \
REGISTER_KERNEL_BUILDER( \
Name("PopulationCount").Device(DEVICE_CPU).TypeConstraint<type>("T"), \
PopulationCountOp<CPUDevice, type>);
TF_CALL_uint8(REGISTER_POPULATION_COUNT);
TF_CALL_int8(REGISTER_POPULATION_COUNT);
TF_CALL_uint16(REGISTER_POPULATION_COUNT);
TF_CALL_int16(REGISTER_POPULATION_COUNT);
TF_CALL_int32(REGISTER_POPULATION_COUNT);
TF_CALL_int64(REGISTER_POPULATION_COUNT);
#undef REGISTER_POPULATION_COUNT
namespace functor {
namespace {
template <typename T>
inline uint8 PopCnt(const T v);
#define POPCNT(T, N) \
template <> \
uint8 PopCnt<T>(const T v) { \
return std::bitset<N>(v).count(); \
}
POPCNT(int8, 8);
POPCNT(uint8, 8);
POPCNT(int16, 16);
POPCNT(uint16, 16);
POPCNT(int32, 32);
POPCNT(int64, 64);
#undef POPCNT
} // namespace
template <typename T>
struct PopulationCount<CPUDevice, T> {
void operator()(OpKernelContext* c, typename TTypes<T>::ConstFlat input,
TTypes<uint8>::Flat output) {
const T* input_ptr = input.data();
uint8* output_ptr = output.data();
auto shard = [input_ptr, output_ptr](int64 start, int64 limit) {
for (int64 i = start; i < limit; ++i) {
output_ptr[i] = PopCnt<T>(input_ptr[i]);
}
};
int64 total_shards = input.size();
// Approximating cost of popcnt: convert T to int64
// (std::bitset constructor) and convert int64 to uint8
// (bitset.count() -> output). The .count() itself is relatively cheap.
const double total_cost = (Eigen::TensorOpCost::CastCost<T, uint8>() +
Eigen::TensorOpCost::CastCost<int64, uint8>());
const int64 shard_cost = (total_cost >= static_cast<double>(kint64max))
? kint64max
: static_cast<int64>(total_cost);
auto worker_threads = *(c->device()->tensorflow_cpu_worker_threads());
Shard(worker_threads.num_threads, worker_threads.workers, total_shards,
shard_cost, shard);
}
};
} // namespace functor
#if GOOGLE_CUDA
#define REGISTER_POPULATION_COUNT(type) \
REGISTER_KERNEL_BUILDER( \
Name("PopulationCount").Device(DEVICE_GPU).TypeConstraint<type>("T"), \
PopulationCountOp<GPUDevice, type>)
TF_CALL_uint8(REGISTER_POPULATION_COUNT);
TF_CALL_int8(REGISTER_POPULATION_COUNT);
TF_CALL_uint16(REGISTER_POPULATION_COUNT);
TF_CALL_int16(REGISTER_POPULATION_COUNT);
TF_CALL_int32(REGISTER_POPULATION_COUNT);
TF_CALL_int64(REGISTER_POPULATION_COUNT);
#undef REGISTER_POPULATION_COUNT
namespace functor {
#define DECLARE_GPU_SPEC(T) \
template <> \
void PopulationCount<GPUDevice, T>::operator()( \
OpKernelContext* c, typename TTypes<T>::ConstFlat input, \
TTypes<uint8>::Flat output); \
extern template struct PopulationCount<GPUDevice, T>
TF_CALL_uint8(DECLARE_GPU_SPEC);
TF_CALL_int8(DECLARE_GPU_SPEC);
TF_CALL_uint16(DECLARE_GPU_SPEC);
TF_CALL_int16(DECLARE_GPU_SPEC);
TF_CALL_int32(DECLARE_GPU_SPEC);
TF_CALL_int64(DECLARE_GPU_SPEC);
#undef DECLARE_GPU_SPEC
} // namespace functor
#endif // GOOGLE_CUDA
} // namespace tensorflow


@ -0,0 +1,38 @@
/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#ifndef THIRD_PARTY_TENSORFLOW_CORE_KERNELS_POPULATION_COUNT_OP_H_
#define THIRD_PARTY_TENSORFLOW_CORE_KERNELS_POPULATION_COUNT_OP_H_
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/tensor_types.h"
#include "tensorflow/core/platform/types.h"
namespace tensorflow {
namespace functor {
template <typename Device, typename T>
struct PopulationCount {
void operator()(OpKernelContext* c, typename TTypes<T>::ConstFlat input,
TTypes<uint8>::Flat output);
};
} // namespace functor
} // namespace tensorflow
#endif // THIRD_PARTY_TENSORFLOW_CORE_KERNELS_POPULATION_COUNT_OP_H_


@ -0,0 +1,92 @@
/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
#if GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "tensorflow/core/kernels/population_count_op.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor_types.h"
#include "tensorflow/core/platform/types.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"
namespace tensorflow {
typedef Eigen::GpuDevice GPUDevice;
namespace functor {
template <typename T>
__global__ void PopulationCountKernel(const int size, const T* input,
uint8* output) {
CUDA_1D_KERNEL_LOOP(i, size) { output[i] = __popc(ldg(input + i)); }
}
template <>
__global__ void PopulationCountKernel(const int size, const int8* input,
uint8* output) {
// For some reason, __popc on a negative int8 gets confused.
CUDA_1D_KERNEL_LOOP(i, size) {
output[i] = __popc(ldg(reinterpret_cast<const uint8*>(input + i)));
}
}
template <>
__global__ void PopulationCountKernel(const int size, const int16* input,
uint8* output) {
// For some reason, __popc on a negative int16 gets confused.
CUDA_1D_KERNEL_LOOP(i, size) {
output[i] = __popc(ldg(reinterpret_cast<const uint16*>(input + i)));
}
}
template <>
__global__ void PopulationCountKernel<int64>(const int size, const int64* input,
uint8* output) {
CUDA_1D_KERNEL_LOOP(i, size) { output[i] = __popcll(ldg(input + i)); }
}
#define DEFINE_GPU_SPECS(T) \
template <> \
void PopulationCount<GPUDevice, T>::operator()( \
OpKernelContext* c, typename TTypes<T>::ConstFlat input, \
TTypes<uint8>::Flat output) { \
const GPUDevice& d = c->eigen_device<GPUDevice>(); \
int64 total_count = input.size(); \
CudaLaunchConfig config = GetCudaLaunchConfig(total_count, d); \
PopulationCountKernel<T> \
<<<config.block_count, config.thread_per_block, 0, d.stream()>>>( \
total_count, input.data(), output.data()); \
}
TF_CALL_uint8(DEFINE_GPU_SPECS);
TF_CALL_int8(DEFINE_GPU_SPECS);
TF_CALL_uint16(DEFINE_GPU_SPECS);
TF_CALL_int16(DEFINE_GPU_SPECS);
TF_CALL_int32(DEFINE_GPU_SPECS);
TF_CALL_int64(DEFINE_GPU_SPECS);
#undef DEFINE_GPU_SPECS
} // namespace functor
} // namespace tensorflow
#endif // GOOGLE_CUDA


@ -40,6 +40,22 @@ computation is performed on the underlying representation of x.
.Attr("T: {int8, int16, int32, int64, uint8, uint16}") \ .Attr("T: {int8, int16, int32, int64, uint8, uint16}") \
.SetShapeFn(shape_inference::UnchangedShape) .SetShapeFn(shape_inference::UnchangedShape)
REGISTER_OP("PopulationCount")
.Input("x: T")
.Output("y: uint8")
.Attr("T: {int8, int16, int32, int64, uint8, uint16}")
.SetShapeFn(shape_inference::UnchangedShape)
.Doc(R"doc(
Computes element-wise population count (a.k.a. popcount, bitsum, bitcount).
For each entry in `x`, calculates the number of `1` (on) bits in the binary
representation of that entry.
**NOTE**: It is more efficient to first `tf.bitcast` your tensors into
`int32` or `int64` and perform the bitcount on the result, than to feed in
8- or 16-bit inputs and then aggregate the resulting counts.
)doc");
REGISTER_OP("BitwiseAnd").BINARY_BITWISE().Doc(R"doc( REGISTER_OP("BitwiseAnd").BINARY_BITWISE().Doc(R"doc(
Elementwise computes the bitwise AND of `x` and `y`. Elementwise computes the bitwise AND of `x` and `y`.


@ -4597,6 +4597,37 @@ op {
}
}
}
op {
name: "CompareAndBitpack"
input_arg {
name: "input"
type_attr: "T"
}
input_arg {
name: "threshold"
type_attr: "T"
}
output_arg {
name: "output"
type: DT_UINT8
}
attr {
name: "T"
type: "type"
allowed_values {
list {
type: DT_BOOL
type: DT_HALF
type: DT_FLOAT
type: DT_DOUBLE
type: DT_INT8
type: DT_INT16
type: DT_INT32
type: DT_INT64
}
}
}
}
op {
name: "Complex"
input_arg {
@ -16267,6 +16298,31 @@ op {
}
}
}
op {
name: "PopulationCount"
input_arg {
name: "x"
type_attr: "T"
}
output_arg {
name: "y"
type: DT_UINT8
}
attr {
name: "T"
type: "type"
allowed_values {
list {
type: DT_INT8
type: DT_INT16
type: DT_INT32
type: DT_INT64
type: DT_UINT8
type: DT_UINT16
}
}
}
}
op {
name: "Pow"
input_arg {


@ -2458,6 +2458,64 @@ out_type: The type of the output. Should be a lower bit depth than Tinput.
)doc"); )doc");
REGISTER_OP("CompareAndBitpack")
.Input("input: T")
.Input("threshold: T")
.Output("output: uint8")
.Attr("T: {bool, float16, float32, float64, int8, int16, int32, int64}")
.SetShapeFn([](InferenceContext* c) {
ShapeHandle input;
TF_RETURN_IF_ERROR(c->WithRankAtLeast(c->input(0), 1, &input));
ShapeHandle unused;
TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
ShapeHandle output = input;
if (c->RankKnown(input)) {
int rank = c->Rank(input);
auto inner_dim = c->Dim(input, rank - 1);
DimensionHandle inferred_dim;
TF_RETURN_IF_ERROR(c->Divide(inner_dim, 8,
/* evenly_divisible */ true,
&inferred_dim));
TF_RETURN_IF_ERROR(
c->ReplaceDim(output, rank - 1, inferred_dim, &output));
}
c->set_output(0, output);
return Status::OK();
})
.Doc(R"doc(
Compare values of `input` to `threshold` and pack resulting bits into a `uint8`.
Each comparison returns a boolean `true` (if `input_value > threshold`)
and `false` otherwise.
This operation is useful for Locality-Sensitive-Hashing (LSH) and other
algorithms that use hashing approximations of cosine and `L2` distances;
codes can be generated from an input via:
```python
codebook_size = 50
codebook_bits = codebook_size * 32
codebook = tf.get_variable('codebook', [x.shape[-1].value, codebook_bits],
dtype=x.dtype,
initializer=tf.orthogonal_initializer())
codes = compare_and_threshold(tf.matmul(x, codebook), threshold=0.)
codes = tf.bitcast(codes, tf.int32) # go from uint8 to int32
# now codes has shape x.shape[:-1] + [codebook_size]
```
**NOTE**: Currently, the innermost dimension of the tensor must be divisible
by 8.
Given an `input` shaped `[s0, s1, ..., s_n]`, the output is
a `uint8` tensor shaped `[s0, s1, ..., s_n / 8]`.
input: Values to compare against `threshold` and bitpack.
threshold: Threshold to compare against.
T: The type of the input and threshold.
output: The bitpacked comparisons.
)doc");
REGISTER_OP("RequantizationRange") REGISTER_OP("RequantizationRange")
.Input("input: Tinput") .Input("input: Tinput")
.Input("input_min: float") .Input("input_min: float")


@ -499,7 +499,7 @@ op {
}
input_arg {
name: "reduction_indices"
description: "The dimensions to reduce. Must be in the range\n`[-rank(input), rank(input))`."
type_attr: "Tidx"
}
output_arg {
@ -601,7 +601,7 @@ op {
}
input_arg {
name: "reduction_indices"
description: "The dimensions to reduce. Must be in the range\n`[-rank(input), rank(input))`."
type_attr: "Tidx"
}
output_arg {
@ -1691,7 +1691,7 @@ op {
}
input_arg {
name: "dimension"
description: "int32 or int64, must be in the range `[-rank(input), rank(input))`.\nDescribes which dimension of the input Tensor to reduce across. For vectors,\nuse dimension = 0."
type_attr: "Tidx"
}
output_arg {
@ -1757,7 +1757,7 @@ op {
}
input_arg {
name: "dimension"
description: "int32 or int64, must be in the range `[-rank(input), rank(input))`.\nDescribes which dimension of the input Tensor to reduce across. For vectors,\nuse dimension = 0."
type_attr: "Tidx"
}
output_arg {
@ -4406,6 +4406,43 @@ op {
summary: "Computes the reverse mode backpropagated gradient of the Cholesky algorithm." summary: "Computes the reverse mode backpropagated gradient of the Cholesky algorithm."
description: "For an explanation see \"Differentiation of the Cholesky algorithm\" by\nIain Murray http://arxiv.org/abs/1602.07527." description: "For an explanation see \"Differentiation of the Cholesky algorithm\" by\nIain Murray http://arxiv.org/abs/1602.07527."
} }
op {
name: "CompareAndBitpack"
input_arg {
name: "input"
description: "Values to compare against `threshold` and bitpack."
type_attr: "T"
}
input_arg {
name: "threshold"
description: "Threshold to compare against."
type_attr: "T"
}
output_arg {
name: "output"
description: "The bitpacked comparisons."
type: DT_UINT8
}
attr {
name: "T"
type: "type"
description: "The type of the input and threshold."
allowed_values {
list {
type: DT_BOOL
type: DT_HALF
type: DT_FLOAT
type: DT_DOUBLE
type: DT_INT8
type: DT_INT16
type: DT_INT32
type: DT_INT64
}
}
}
summary: "Compare values of `input` to `threshold` and pack resulting bits into a `uint8`."
description: "Each comparison returns a boolean `true` (if `input_value > threshold`)\nor and `false` otherwise.\n\nThis operation is useful for Locality-Sensitive-Hashing (LSH) and other\nalgorithms that use hashing approximations of cosine and `L2` distances;\ncodes can be generated from an input via:\n\n```python\ncodebook_size = 50\ncodebook_bits = codebook_size * 32\ncodebook = tf.get_variable(\'codebook\', [x.shape[-1].value, codebook_bits],\n dtype=x.dtype,\n initializer=tf.orthogonal_initializer())\ncodes = compare_and_threshold(tf.matmul(x, codebook), threshold=0.)\ncodes = tf.bitcast(codes, tf.int32) # go from uint8 to int32\n# now codes has shape x.shape[:-1] + [codebook_size]\n```\n\n**NOTE**: Currently, the innermost dimension of the tensor must be divisible\nby 8.\n\nGiven an `input` shaped `[s0, s1, ..., s_n]`, the output is\na `uint8` tensor shaped `[s0, s1, ..., s_n / 8]`."
}
op { op {
name: "Complex" name: "Complex"
input_arg { input_arg {
@ -5656,10 +5693,12 @@ op {
name: "Cumprod" name: "Cumprod"
input_arg { input_arg {
name: "x" name: "x"
description: "A `Tensor`. Must be one of the following types: `float32`, `float64`,\n`int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,\n`complex128`, `qint8`, `quint8`, `qint32`, `half`."
type_attr: "T" type_attr: "T"
} }
input_arg { input_arg {
name: "axis" name: "axis"
description: "A `Tensor` of type `int32` (default: 0). Must be in the range\n`[-rank(x), rank(x))`."
type_attr: "Tidx" type_attr: "Tidx"
} }
output_arg { output_arg {
@ -5672,6 +5711,7 @@ op {
default_value { default_value {
b: false b: false
} }
description: "If `True`, perform exclusive cumprod."
} }
attr { attr {
name: "reverse" name: "reverse"
@ -5679,6 +5719,7 @@ op {
default_value { default_value {
b: false b: false
} }
description: "A `bool` (default: False)."
} }
attr { attr {
name: "T" name: "T"
@ -5722,10 +5763,12 @@ op {
name: "Cumsum" name: "Cumsum"
input_arg { input_arg {
name: "x" name: "x"
description: "A `Tensor`. Must be one of the following types: `float32`, `float64`,\n`int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,\n`complex128`, `qint8`, `quint8`, `qint32`, `half`."
type_attr: "T" type_attr: "T"
} }
input_arg { input_arg {
name: "axis" name: "axis"
description: "A `Tensor` of type `int32` (default: 0). Must be in the range\n`[-rank(x), rank(x))`."
type_attr: "Tidx" type_attr: "Tidx"
} }
output_arg { output_arg {
@ -5738,6 +5781,7 @@ op {
default_value { default_value {
b: false b: false
} }
description: "If `True`, perform exclusive cumsum."
} }
attr { attr {
name: "reverse" name: "reverse"
@ -5745,6 +5789,7 @@ op {
default_value { default_value {
b: false b: false
} }
description: "A `bool` (default: False)."
} }
attr { attr {
name: "T" name: "T"
@ -7615,7 +7660,7 @@ op {
} }
input_arg { input_arg {
name: "dim" name: "dim"
description: "0-D (scalar). Specifies the dimension index at which to\nexpand the shape of `input`." description: "0-D (scalar). Specifies the dimension index at which to\nexpand the shape of `input`. Must be in the range\n`[-rank(input) - 1, rank(input)]`."
type_attr: "Tdim" type_attr: "Tdim"
} }
output_arg { output_arg {
@ -12325,7 +12370,7 @@ op {
} }
input_arg { input_arg {
name: "reduction_indices" name: "reduction_indices"
description: "The dimensions to reduce." description: "The dimensions to reduce. Must be in the range\n`[-rank(input), rank(input))`."
type_attr: "Tidx" type_attr: "Tidx"
} }
output_arg { output_arg {
@ -13102,7 +13147,7 @@ op {
} }
input_arg { input_arg {
name: "reduction_indices" name: "reduction_indices"
description: "The dimensions to reduce." description: "The dimensions to reduce. Must be in the range\n`[-rank(input), rank(input))`."
type_attr: "Tidx" type_attr: "Tidx"
} }
output_arg { output_arg {
@ -13293,7 +13338,7 @@ op {
} }
input_arg { input_arg {
name: "reduction_indices" name: "reduction_indices"
description: "The dimensions to reduce." description: "The dimensions to reduce. Must be in the range\n`[-rank(input), rank(input))`."
type_attr: "Tidx" type_attr: "Tidx"
} }
output_arg { output_arg {
@ -15436,6 +15481,33 @@ op {
summary: "Compute the polygamma function \\\\(\\psi^{(n)}(x)\\\\)." summary: "Compute the polygamma function \\\\(\\psi^{(n)}(x)\\\\)."
description: "The polygamma function is defined as:\n\n\n\\\\(\\psi^{(n)}(x) = \\frac{d^n}{dx^n} \\psi(x)\\\\)\n\nwhere \\\\(\\psi(x)\\\\) is the digamma function." description: "The polygamma function is defined as:\n\n\n\\\\(\\psi^{(n)}(x) = \\frac{d^n}{dx^n} \\psi(x)\\\\)\n\nwhere \\\\(\\psi(x)\\\\) is the digamma function."
} }
op {
name: "PopulationCount"
input_arg {
name: "x"
type_attr: "T"
}
output_arg {
name: "y"
type: DT_UINT8
}
attr {
name: "T"
type: "type"
allowed_values {
list {
type: DT_INT8
type: DT_INT16
type: DT_INT32
type: DT_INT64
type: DT_UINT8
type: DT_UINT16
}
}
}
summary: "Computes element-wise population count (a.k.a. popcount, bitsum, bitcount)."
description: "For each entry in `x`, calculates the number of `1` (on) bits in the binary\nrepresentation of that entry.\n\n**NOTE**: It is more efficient to first `tf.bitcast` your tensors into\n`int32` or `int64` and perform the bitcount on the result, than to feed in\n8- or 16-bit inputs and then aggregate the resulting counts."
}
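
As a rough, non-authoritative illustration of what this op computes (not the TensorFlow kernel itself), the per-element bit count can be reproduced in NumPy; the helper name below is made up for this sketch:

```python
import numpy as np

# Non-authoritative NumPy illustration of PopulationCount: reinterpret each
# element as raw bytes, expand to individual bits, and sum the set bits per
# original element.
def population_count_reference(x):
    x = np.asarray(x)
    raw = x.reshape(-1, 1).view(np.uint8)                  # [n, itemsize] bytes
    bits = np.unpackbits(raw.ravel()).reshape(x.size, -1)  # [n, itemsize * 8]
    return bits.sum(axis=1).astype(np.uint8).reshape(x.shape)

print(population_count_reference(np.array([0, 1, 3, 255, -1], dtype=np.int32)))
# [ 0  1  2  8 32]
```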
op { op {
name: "Pow" name: "Pow"
input_arg { input_arg {
@ -15662,7 +15734,7 @@ op {
} }
input_arg { input_arg {
name: "reduction_indices" name: "reduction_indices"
description: "The dimensions to reduce." description: "The dimensions to reduce. Must be in the range\n`[-rank(input), rank(input))`."
type_attr: "Tidx" type_attr: "Tidx"
} }
output_arg { output_arg {
@ -21708,7 +21780,7 @@ op {
} }
input_arg { input_arg {
name: "axis" name: "axis"
description: "1-D. The indices of the dimensions to reverse." description: "1-D. The indices of the dimensions to reverse. Must be in the range\n`[-rank(tensor), rank(tensor))`."
type_attr: "Tidx" type_attr: "Tidx"
} }
output_arg { output_arg {
@ -27260,7 +27332,7 @@ op {
list { list {
} }
} }
description: "If specified, only squeezes the dimensions listed. The dimension\nindex starts at 0. It is an error to squeeze a dimension that is not 1." description: "If specified, only squeezes the dimensions listed. The dimension\nindex starts at 0. It is an error to squeeze a dimension that is not 1. Must\nbe in the range `[-rank(input), rank(input))`."
has_minimum: true has_minimum: true
} }
summary: "Removes dimensions of size 1 from the shape of a tensor." summary: "Removes dimensions of size 1 from the shape of a tensor."
@ -28250,7 +28322,7 @@ op {
} }
input_arg { input_arg {
name: "reduction_indices" name: "reduction_indices"
description: "The dimensions to reduce." description: "The dimensions to reduce. Must be in the range\n`[-rank(input), rank(input))`."
type_attr: "Tidx" type_attr: "Tidx"
} }
output_arg { output_arg {

View File

@ -20,6 +20,7 @@ limitations under the License.
#if (defined(__GNUC__) || defined(__APPLE__)) && !defined(SWIG) #if (defined(__GNUC__) || defined(__APPLE__)) && !defined(SWIG)
// Compiler supports GCC-style attributes // Compiler supports GCC-style attributes
#define TF_ATTRIBUTE_NORETURN __attribute__((noreturn)) #define TF_ATTRIBUTE_NORETURN __attribute__((noreturn))
#define TF_ATTRIBUTE_ALWAYS_INLINE __attribute__((always_inline))
#define TF_ATTRIBUTE_NOINLINE __attribute__((noinline)) #define TF_ATTRIBUTE_NOINLINE __attribute__((noinline))
#define TF_ATTRIBUTE_UNUSED __attribute__((unused)) #define TF_ATTRIBUTE_UNUSED __attribute__((unused))
#define TF_ATTRIBUTE_COLD __attribute__((cold)) #define TF_ATTRIBUTE_COLD __attribute__((cold))
@ -33,6 +34,7 @@ limitations under the License.
#elif defined(COMPILER_MSVC) #elif defined(COMPILER_MSVC)
// Non-GCC equivalents // Non-GCC equivalents
#define TF_ATTRIBUTE_NORETURN __declspec(noreturn) #define TF_ATTRIBUTE_NORETURN __declspec(noreturn)
#define TF_ATTRIBUTE_ALWAYS_INLINE
#define TF_ATTRIBUTE_NOINLINE #define TF_ATTRIBUTE_NOINLINE
#define TF_ATTRIBUTE_UNUSED #define TF_ATTRIBUTE_UNUSED
#define TF_ATTRIBUTE_COLD #define TF_ATTRIBUTE_COLD
@ -43,6 +45,7 @@ limitations under the License.
#else #else
// Non-GCC equivalents // Non-GCC equivalents
#define TF_ATTRIBUTE_NORETURN #define TF_ATTRIBUTE_NORETURN
#define TF_ATTRIBUTE_ALWAYS_INLINE
#define TF_ATTRIBUTE_NOINLINE #define TF_ATTRIBUTE_NOINLINE
#define TF_ATTRIBUTE_UNUSED #define TF_ATTRIBUTE_UNUSED
#define TF_ATTRIBUTE_COLD #define TF_ATTRIBUTE_COLD

View File

@ -171,11 +171,7 @@ error::Code ErrnoToCode(int err_number) {
Status IOError(const string& context, int err_number) { Status IOError(const string& context, int err_number) {
auto code = ErrnoToCode(err_number); auto code = ErrnoToCode(err_number);
if (code == error::UNKNOWN) {
return Status(code, strings::StrCat(context, "; ", strerror(err_number))); return Status(code, strings::StrCat(context, "; ", strerror(err_number)));
} else {
return Status(code, context);
}
} }
} // namespace tensorflow } // namespace tensorflow

View File

@ -1130,7 +1130,8 @@ type SqueezeAttr func(optionalAttr)
// SqueezeSqueezeDims sets the optional squeeze_dims attribute to value. // SqueezeSqueezeDims sets the optional squeeze_dims attribute to value.
// //
// value: If specified, only squeezes the dimensions listed. The dimension // value: If specified, only squeezes the dimensions listed. The dimension
// index starts at 0. It is an error to squeeze a dimension that is not 1. // index starts at 0. It is an error to squeeze a dimension that is not 1. Must
// be in the range `[-rank(input), rank(input))`.
// If not specified, defaults to <> // If not specified, defaults to <>
// //
// REQUIRES: len(value) >= 0 // REQUIRES: len(value) >= 0
@ -7069,6 +7070,61 @@ func TFRecordReaderV2(scope *Scope, optional ...TFRecordReaderV2Attr) (reader_ha
return op.Output(0) return op.Output(0)
} }
// TextLineReaderV2Attr is an optional argument to TextLineReaderV2.
type TextLineReaderV2Attr func(optionalAttr)
// TextLineReaderV2SkipHeaderLines sets the optional skip_header_lines attribute to value.
//
// value: Number of lines to skip from the beginning of every file.
// If not specified, defaults to 0
func TextLineReaderV2SkipHeaderLines(value int64) TextLineReaderV2Attr {
return func(m optionalAttr) {
m["skip_header_lines"] = value
}
}
// TextLineReaderV2Container sets the optional container attribute to value.
//
// value: If non-empty, this reader is placed in the given container.
// Otherwise, a default container is used.
// If not specified, defaults to ""
func TextLineReaderV2Container(value string) TextLineReaderV2Attr {
return func(m optionalAttr) {
m["container"] = value
}
}
// TextLineReaderV2SharedName sets the optional shared_name attribute to value.
//
// value: If non-empty, this reader is named in the given bucket
// with this shared_name. Otherwise, the node name is used instead.
// If not specified, defaults to ""
func TextLineReaderV2SharedName(value string) TextLineReaderV2Attr {
return func(m optionalAttr) {
m["shared_name"] = value
}
}
// A Reader that outputs the lines of a file delimited by '\n'.
//
// Returns The handle to reference the Reader.
func TextLineReaderV2(scope *Scope, optional ...TextLineReaderV2Attr) (reader_handle tf.Output) {
if scope.Err() != nil {
return
}
attrs := map[string]interface{}{}
for _, a := range optional {
a(attrs)
}
opspec := tf.OpSpec{
Type: "TextLineReaderV2",
Attrs: attrs,
}
op := scope.AddOperation(opspec)
return op.Output(0)
}
// Computes rectified linear 6: `min(max(features, 0), 6)`. // Computes rectified linear 6: `min(max(features, 0), 6)`.
func Relu6(scope *Scope, features tf.Output) (activations tf.Output) { func Relu6(scope *Scope, features tf.Output) (activations tf.Output) {
if scope.Err() != nil { if scope.Err() != nil {
@ -12819,7 +12875,8 @@ func ReciprocalGrad(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
// //
// Arguments: // Arguments:
// tensor: Up to 8-D. // tensor: Up to 8-D.
// axis: 1-D. The indices of the dimensions to reverse. // axis: 1-D. The indices of the dimensions to reverse. Must be in the range
// `[-rank(tensor), rank(tensor))`.
// //
// Returns The same shape as `tensor`. // Returns The same shape as `tensor`.
func ReverseV2(scope *Scope, tensor tf.Output, axis tf.Output) (output tf.Output) { func ReverseV2(scope *Scope, tensor tf.Output, axis tf.Output) (output tf.Output) {
@ -14493,61 +14550,6 @@ func Tanh(scope *Scope, x tf.Output) (y tf.Output) {
return op.Output(0) return op.Output(0)
} }
// TextLineReaderV2Attr is an optional argument to TextLineReaderV2.
type TextLineReaderV2Attr func(optionalAttr)
// TextLineReaderV2SkipHeaderLines sets the optional skip_header_lines attribute to value.
//
// value: Number of lines to skip from the beginning of every file.
// If not specified, defaults to 0
func TextLineReaderV2SkipHeaderLines(value int64) TextLineReaderV2Attr {
return func(m optionalAttr) {
m["skip_header_lines"] = value
}
}
// TextLineReaderV2Container sets the optional container attribute to value.
//
// value: If non-empty, this reader is placed in the given container.
// Otherwise, a default container is used.
// If not specified, defaults to ""
func TextLineReaderV2Container(value string) TextLineReaderV2Attr {
return func(m optionalAttr) {
m["container"] = value
}
}
// TextLineReaderV2SharedName sets the optional shared_name attribute to value.
//
// value: If non-empty, this reader is named in the given bucket
// with this shared_name. Otherwise, the node name is used instead.
// If not specified, defaults to ""
func TextLineReaderV2SharedName(value string) TextLineReaderV2Attr {
return func(m optionalAttr) {
m["shared_name"] = value
}
}
// A Reader that outputs the lines of a file delimited by '\n'.
//
// Returns The handle to reference the Reader.
func TextLineReaderV2(scope *Scope, optional ...TextLineReaderV2Attr) (reader_handle tf.Output) {
if scope.Err() != nil {
return
}
attrs := map[string]interface{}{}
for _, a := range optional {
a(attrs)
}
opspec := tf.OpSpec{
Type: "TextLineReaderV2",
Attrs: attrs,
}
op := scope.AddOperation(opspec)
return op.Output(0)
}
// Component-wise multiplies a SparseTensor by a dense Tensor. // Component-wise multiplies a SparseTensor by a dense Tensor.
// //
// The output locations corresponding to the implicitly zero elements in the sparse // The output locations corresponding to the implicitly zero elements in the sparse
@ -16147,6 +16149,8 @@ func SegmentMean(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf
type CumprodAttr func(optionalAttr) type CumprodAttr func(optionalAttr)
// CumprodExclusive sets the optional exclusive attribute to value. // CumprodExclusive sets the optional exclusive attribute to value.
//
// value: If `True`, perform exclusive cumprod.
// If not specified, defaults to false // If not specified, defaults to false
func CumprodExclusive(value bool) CumprodAttr { func CumprodExclusive(value bool) CumprodAttr {
return func(m optionalAttr) { return func(m optionalAttr) {
@ -16155,6 +16159,8 @@ func CumprodExclusive(value bool) CumprodAttr {
} }
// CumprodReverse sets the optional reverse attribute to value. // CumprodReverse sets the optional reverse attribute to value.
//
// value: A `bool` (default: False).
// If not specified, defaults to false // If not specified, defaults to false
func CumprodReverse(value bool) CumprodAttr { func CumprodReverse(value bool) CumprodAttr {
return func(m optionalAttr) { return func(m optionalAttr) {
@ -16192,6 +16198,13 @@ func CumprodReverse(value bool) CumprodAttr {
// ```python // ```python
// tf.cumprod([a, b, c], exclusive=True, reverse=True) # => [b * c, c, 1] // tf.cumprod([a, b, c], exclusive=True, reverse=True) # => [b * c, c, 1]
// ``` // ```
//
// Arguments:
// x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
// axis: A `Tensor` of type `int32` (default: 0). Must be in the range
// `[-rank(x), rank(x))`.
func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr) (out tf.Output) { func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr) (out tf.Output) {
if scope.Err() != nil { if scope.Err() != nil {
return return
@ -16420,6 +16433,8 @@ func QuantizedRelu6(scope *Scope, features tf.Output, min_features tf.Output, ma
type CumsumAttr func(optionalAttr) type CumsumAttr func(optionalAttr)
// CumsumExclusive sets the optional exclusive attribute to value. // CumsumExclusive sets the optional exclusive attribute to value.
//
// value: If `True`, perform exclusive cumsum.
// If not specified, defaults to false // If not specified, defaults to false
func CumsumExclusive(value bool) CumsumAttr { func CumsumExclusive(value bool) CumsumAttr {
return func(m optionalAttr) { return func(m optionalAttr) {
@ -16428,6 +16443,8 @@ func CumsumExclusive(value bool) CumsumAttr {
} }
// CumsumReverse sets the optional reverse attribute to value. // CumsumReverse sets the optional reverse attribute to value.
//
// value: A `bool` (default: False).
// If not specified, defaults to false // If not specified, defaults to false
func CumsumReverse(value bool) CumsumAttr { func CumsumReverse(value bool) CumsumAttr {
return func(m optionalAttr) { return func(m optionalAttr) {
@ -16465,6 +16482,13 @@ func CumsumReverse(value bool) CumsumAttr {
// ```python // ```python
// tf.cumsum([a, b, c], exclusive=True, reverse=True) # => [b + c, c, 0] // tf.cumsum([a, b, c], exclusive=True, reverse=True) # => [b + c, c, 0]
// ``` // ```
//
// Arguments:
// x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
// axis: A `Tensor` of type `int32` (default: 0). Must be in the range
// `[-rank(x), rank(x))`.
func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (out tf.Output) { func Cumsum(scope *Scope, x tf.Output, axis tf.Output, optional ...CumsumAttr) (out tf.Output) {
if scope.Err() != nil { if scope.Err() != nil {
return return
@ -17894,6 +17918,28 @@ func Svd(scope *Scope, input tf.Output, optional ...SvdAttr) (s tf.Output, u tf.
return op.Output(0), op.Output(1), op.Output(2) return op.Output(0), op.Output(1), op.Output(2)
} }
// Computes element-wise population count (a.k.a. popcount, bitsum, bitcount).
//
// For each entry in `x`, calculates the number of `1` (on) bits in the binary
// representation of that entry.
//
// **NOTE**: It is more efficient to first `tf.bitcast` your tensors into
// `int32` or `int64` and perform the bitcount on the result, than to feed in
// 8- or 16-bit inputs and then aggregate the resulting counts.
func PopulationCount(scope *Scope, x tf.Output) (y tf.Output) {
if scope.Err() != nil {
return
}
opspec := tf.OpSpec{
Type: "PopulationCount",
Input: []tf.Input{
x,
},
}
op := scope.AddOperation(opspec)
return op.Output(0)
}
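
A small, hypothetical sketch of the efficiency note above, assuming the generated `gen_bitwise_ops.population_count` wrapper used by the new unit test: pack four `int8` values into one `int32` with `tf.bitcast` before counting, rather than counting 8-bit elements directly.

```python
import tensorflow as tf
from tensorflow.python.ops import gen_bitwise_ops

# Hypothetical sketch only: bitcast groups of four int8 values into int32
# before counting bits, then aggregate, as the NOTE above suggests.
x8 = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=tf.int8)   # shape [2, 4]
x32 = tf.bitcast(x8, tf.int32)                                  # shape [2]
counts = gen_bitwise_ops.population_count(x32)                  # uint8 per int32
total_bits = tf.reduce_sum(tf.cast(counts, tf.int32))

with tf.Session() as sess:
    print(sess.run(total_bits))  # total number of set bits across x8
```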
// AssertAttr is an optional argument to Assert. // AssertAttr is an optional argument to Assert.
type AssertAttr func(optionalAttr) type AssertAttr func(optionalAttr)
@ -18063,7 +18109,8 @@ func AnyKeepDims(value bool) AnyAttr {
// //
// Arguments: // Arguments:
// input: The tensor to reduce. // input: The tensor to reduce.
// reduction_indices: The dimensions to reduce. // reduction_indices: The dimensions to reduce. Must be in the range
// `[-rank(input), rank(input))`.
// //
// Returns The reduced tensor. // Returns The reduced tensor.
func Any(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...AnyAttr) (output tf.Output) { func Any(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...AnyAttr) (output tf.Output) {
@ -19213,7 +19260,8 @@ func ProdKeepDims(value bool) ProdAttr {
// //
// Arguments: // Arguments:
// input: The tensor to reduce. // input: The tensor to reduce.
// reduction_indices: The dimensions to reduce. // reduction_indices: The dimensions to reduce. Must be in the range
// `[-rank(input), rank(input))`.
// //
// Returns The reduced tensor. // Returns The reduced tensor.
func Prod(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...ProdAttr) (output tf.Output) { func Prod(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...ProdAttr) (output tf.Output) {
@ -20258,7 +20306,8 @@ func MaxKeepDims(value bool) MaxAttr {
// //
// Arguments: // Arguments:
// input: The tensor to reduce. // input: The tensor to reduce.
// reduction_indices: The dimensions to reduce. // reduction_indices: The dimensions to reduce. Must be in the range
// `[-rank(input), rank(input))`.
// //
// Returns The reduced tensor. // Returns The reduced tensor.
func Max(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...MaxAttr) (output tf.Output) { func Max(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...MaxAttr) (output tf.Output) {
@ -20583,7 +20632,8 @@ func Sqrt(scope *Scope, x tf.Output) (y tf.Output) {
// Arguments: // Arguments:
// //
// dim: 0-D (scalar). Specifies the dimension index at which to // dim: 0-D (scalar). Specifies the dimension index at which to
// expand the shape of `input`. // expand the shape of `input`. Must be in the range
// `[-rank(input) - 1, rank(input)]`.
// //
// Returns Contains the same data as `input`, but its shape has an additional // Returns Contains the same data as `input`, but its shape has an additional
// dimension of size 1 added. // dimension of size 1 added.
@ -20623,7 +20673,8 @@ func AllKeepDims(value bool) AllAttr {
// //
// Arguments: // Arguments:
// input: The tensor to reduce. // input: The tensor to reduce.
// reduction_indices: The dimensions to reduce. // reduction_indices: The dimensions to reduce. Must be in the range
// `[-rank(input), rank(input))`.
// //
// Returns The reduced tensor. // Returns The reduced tensor.
func All(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...AllAttr) (output tf.Output) { func All(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...AllAttr) (output tf.Output) {
@ -21665,8 +21716,8 @@ func ArgMinOutputType(value tf.DataType) ArgMinAttr {
// //
// Arguments: // Arguments:
// //
// dimension: int32 or int64, 0 <= dimension < rank(input). Describes // dimension: int32 or int64, must be in the range `[-rank(input), rank(input))`.
// which dimension of the input Tensor to reduce across. For vectors, // Describes which dimension of the input Tensor to reduce across. For vectors,
// use dimension = 0. // use dimension = 0.
func ArgMin(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgMinAttr) (output tf.Output) { func ArgMin(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgMinAttr) (output tf.Output) {
if scope.Err() != nil { if scope.Err() != nil {
@ -22716,7 +22767,8 @@ func MeanKeepDims(value bool) MeanAttr {
// //
// Arguments: // Arguments:
// input: The tensor to reduce. // input: The tensor to reduce.
// reduction_indices: The dimensions to reduce. // reduction_indices: The dimensions to reduce. Must be in the range
// `[-rank(input), rank(input))`.
// //
// Returns The reduced tensor. // Returns The reduced tensor.
func Mean(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...MeanAttr) (output tf.Output) { func Mean(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...MeanAttr) (output tf.Output) {
@ -22856,7 +22908,8 @@ func MinKeepDims(value bool) MinAttr {
// //
// Arguments: // Arguments:
// input: The tensor to reduce. // input: The tensor to reduce.
// reduction_indices: The dimensions to reduce. // reduction_indices: The dimensions to reduce. Must be in the range
// `[-rank(input), rank(input))`.
// //
// Returns The reduced tensor. // Returns The reduced tensor.
func Min(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...MinAttr) (output tf.Output) { func Min(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...MinAttr) (output tf.Output) {
@ -22914,8 +22967,8 @@ func ArgMaxOutputType(value tf.DataType) ArgMaxAttr {
// //
// Arguments: // Arguments:
// //
// dimension: int32 or int64, 0 <= dimension < rank(input). Describes // dimension: int32 or int64, must be in the range `[-rank(input), rank(input))`.
// which dimension of the input Tensor to reduce across. For vectors, // Describes which dimension of the input Tensor to reduce across. For vectors,
// use dimension = 0. // use dimension = 0.
func ArgMax(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgMaxAttr) (output tf.Output) { func ArgMax(scope *Scope, input tf.Output, dimension tf.Output, optional ...ArgMaxAttr) (output tf.Output) {
if scope.Err() != nil { if scope.Err() != nil {
@ -23888,6 +23941,51 @@ func QuantizeDownAndShrinkRange(scope *Scope, input tf.Output, input_min tf.Outp
return op.Output(0), op.Output(1), op.Output(2) return op.Output(0), op.Output(1), op.Output(2)
} }
// Compare values of `input` to `threshold` and pack resulting bits into a `uint8`.
//
// Each comparison returns a boolean `true` (if `input_value > threshold`)
// or `false` otherwise.
//
// This operation is useful for Locality-Sensitive-Hashing (LSH) and other
// algorithms that use hashing approximations of cosine and `L2` distances;
// codes can be generated from an input via:
//
// ```python
// codebook_size = 50
// codebook_bits = codebook_size * 32
// codebook = tf.get_variable('codebook', [x.shape[-1].value, codebook_bits],
//                            dtype=x.dtype,
//                            initializer=tf.orthogonal_initializer())
// codes = compare_and_threshold(tf.matmul(x, codebook), threshold=0.)
// codes = tf.bitcast(codes, tf.int32) # go from uint8 to int32
// # now codes has shape x.shape[:-1] + [codebook_size]
// ```
//
// **NOTE**: Currently, the innermost dimension of the tensor must be divisible
// by 8.
//
// Given an `input` shaped `[s0, s1, ..., s_n]`, the output is
// a `uint8` tensor shaped `[s0, s1, ..., s_n / 8]`.
//
// Arguments:
// input: Values to compare against `threshold` and bitpack.
// threshold: Threshold to compare against.
//
// Returns The bitpacked comparisons.
func CompareAndBitpack(scope *Scope, input tf.Output, threshold tf.Output) (output tf.Output) {
if scope.Err() != nil {
return
}
opspec := tf.OpSpec{
Type: "CompareAndBitpack",
Input: []tf.Input{
input, threshold,
},
}
op := scope.AddOperation(opspec)
return op.Output(0)
}
// Outputs a `Summary` protocol buffer with a tensor and per-plugin data. // Outputs a `Summary` protocol buffer with a tensor and per-plugin data.
// //
// Arguments: // Arguments:
@ -24724,7 +24822,8 @@ func SumKeepDims(value bool) SumAttr {
// //
// Arguments: // Arguments:
// input: The tensor to reduce. // input: The tensor to reduce.
// reduction_indices: The dimensions to reduce. // reduction_indices: The dimensions to reduce. Must be in the range
// `[-rank(input), rank(input))`.
// //
// Returns The reduced tensor. // Returns The reduced tensor.
func Sum(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...SumAttr) (output tf.Output) { func Sum(scope *Scope, input tf.Output, reduction_indices tf.Output, optional ...SumAttr) (output tf.Output) {

View File

@ -1671,6 +1671,18 @@ cuda_py_test(
], ],
) )
cuda_py_test(
name = "compare_and_bitpack_op_test",
size = "small",
srcs = ["compare_and_bitpack_op_test.py"],
additional_deps = [
"//third_party/py/numpy",
"//tensorflow/python:math_ops",
"//tensorflow/python:client_testlib",
"//tensorflow/python:framework_for_generated_wrappers",
],
)
cuda_py_test( cuda_py_test(
name = "scalar_test", name = "scalar_test",
size = "small", size = "small",

View File

@ -0,0 +1,83 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for tensorflow.ops.compare_and_bitpack_op."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
from tensorflow.python.ops import math_ops
from tensorflow.python.platform import test
class CompareAndBitpackTest(test.TestCase):
def _testCompareAndBitpack(self,
x, threshold,
truth,
expected_err_re=None):
with self.test_session(use_gpu=True):
ans = math_ops.compare_and_bitpack(x, threshold)
if expected_err_re is None:
tf_ans = ans.eval()
self.assertShapeEqual(truth, ans)
self.assertAllEqual(tf_ans, truth)
else:
with self.assertRaisesOpError(expected_err_re):
ans.eval()
def _testBasic(self, dtype):
rows = 371
cols = 294
x = np.random.randn(rows, cols * 8)
if dtype == np.bool:
x = x > 0
else:
x = x.astype(dtype)
threshold = dtype(0)
# np.packbits flattens the tensor, so we reshape it back to the
# expected dimensions.
truth = np.packbits(x > threshold).reshape(rows, cols)
self._testCompareAndBitpack(x, threshold, truth)
def testBasicFloat32(self):
self._testBasic(np.float32)
def testBasicFloat64(self):
self._testBasic(np.float64)
def testBasicFloat16(self):
self._testBasic(np.float16)
def testBasicBool(self):
self._testBasic(np.bool)
def testBasicInt8(self):
self._testBasic(np.int8)
def testBasicInt16(self):
self._testBasic(np.int16)
def testBasicInt32(self):
self._testBasic(np.int32)
def testBasicInt64(self):
self._testBasic(np.int64)
if __name__ == "__main__":
test.main()

View File

@ -88,8 +88,8 @@ class GatherTest(test.TestCase):
def testHigherRank(self): def testHigherRank(self):
# We check that scalar and empty indices shapes work as well # We check that scalar and empty indices shapes work as well
for shape in (4, 3, 2), (2, 1, 3, 2): shape = (2, 1, 3, 2)
for indices_shape in (), (0,), (3, 0), (3, 5), (5, 2, 3): for indices_shape in (), (0,), (2, 0), (2, 3):
for dtype in _TEST_TYPES: for dtype in _TEST_TYPES:
for axis in range(len(shape)): for axis in range(len(shape)):
params = self._buildParams(np.random.randn(*shape), dtype) params = self._buildParams(np.random.randn(*shape), dtype)

View File

@ -330,7 +330,7 @@ def rank(input, name=None):
# pylint: disable=redefined-builtin # pylint: disable=redefined-builtin
"""Returns the rank of a tensor. """Returns the rank of a tensor.
This operation returns an integer representing the rank of `input`. Returns a 0-D `int32` `Tensor` representing the rank of `input`.
For example: For example:

View File

@ -36,5 +36,6 @@ ops.NotDifferentiable("BitwiseAnd")
ops.NotDifferentiable("BitwiseOr") ops.NotDifferentiable("BitwiseOr")
ops.NotDifferentiable("BitwiseXor") ops.NotDifferentiable("BitwiseXor")
ops.NotDifferentiable("Invert") ops.NotDifferentiable("Invert")
ops.NotDifferentiable("PopulationCount")
remove_undocumented(__name__) remove_undocumented(__name__)

View File

@ -18,10 +18,14 @@ from __future__ import absolute_import
from __future__ import division from __future__ import division
from __future__ import print_function from __future__ import print_function
import numpy as np
import six
from tensorflow.python.framework import constant_op from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes from tensorflow.python.framework import dtypes
from tensorflow.python.framework import test_util from tensorflow.python.framework import test_util
from tensorflow.python.ops import bitwise_ops from tensorflow.python.ops import bitwise_ops
from tensorflow.python.ops import gen_bitwise_ops
from tensorflow.python.platform import googletest from tensorflow.python.platform import googletest
@ -46,6 +50,25 @@ class BitwiseOpTest(test_util.TensorFlowTestCase):
self.assertAllEqual(or_result, [5, 5, 7, 15]) self.assertAllEqual(or_result, [5, 5, 7, 15])
self.assertAllEqual(xor_result, [5, 5, 4, 5]) self.assertAllEqual(xor_result, [5, 5, 4, 5])
def testPopulationCountOp(self):
dtype_list = [dtypes.int8, dtypes.int16,
dtypes.int32, dtypes.int64,
dtypes.uint8, dtypes.uint16]
raw_inputs = [0, 1, -1, 3, -3, 5, -5, 14, -14,
127, 128, 255, 256, 65535, 65536,
2**31 - 1, 2**31, 2**32 - 1, 2**32, -2**32 + 1, -2**32,
-2**63 + 1, 2**63 - 1]
def count_bits(x):
return sum([bin(z).count("1") for z in six.iterbytes(x.tobytes())])
for dtype in dtype_list:
with self.test_session(use_gpu=True) as sess:
print("PopulationCount test: ", dtype)
inputs = np.array(raw_inputs, dtype=dtype.as_numpy_dtype)
truth = [count_bits(x) for x in inputs]
input_tensor = constant_op.constant(inputs, dtype=dtype)
popcnt_result = sess.run(gen_bitwise_ops.population_count(input_tensor))
self.assertAllEqual(truth, popcnt_result)
def testInvertOp(self): def testInvertOp(self):
dtype_list = [dtypes.int8, dtypes.int16, dtypes.int32, dtypes.int64, dtype_list = [dtypes.int8, dtypes.int16, dtypes.int32, dtypes.int64,
dtypes.uint8, dtypes.uint16] dtypes.uint8, dtypes.uint16]

View File

@ -61,6 +61,7 @@ from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops from tensorflow.python.framework import ops
from tensorflow.python.framework import sparse_tensor from tensorflow.python.framework import sparse_tensor
from tensorflow.python.framework import tensor_shape from tensorflow.python.framework import tensor_shape
from tensorflow.python.framework import tensor_util
from tensorflow.python.ops import array_ops from tensorflow.python.ops import array_ops
from tensorflow.python.ops import gen_array_ops from tensorflow.python.ops import gen_array_ops
from tensorflow.python.ops import gen_control_flow_ops from tensorflow.python.ops import gen_control_flow_ops
@ -983,9 +984,16 @@ class GradLoopState(object):
# the right control flow context. # the right control flow context.
real_value = self._grad_context.AddValue(cur_value) real_value = self._grad_context.AddValue(cur_value)
break break
elif constant_op.is_constant(cur_value):
# If the value to be forwarded is a constant, clone the constant in
# the gradient loop rather than using a stack.
# TODO(phawkins): consider hoisting the constant out of the loop
# instead.
real_value = constant_op.constant(
tensor_util.constant_value(cur_value), dtype=cur_value.dtype)
break
else: else:
# Record the history of this value in forward_ctxt. # Record the history of this value in forward_ctxt.
# TODO(yuanbyu): Avoid recording constants.
self._grad_context.Exit() self._grad_context.Exit()
history_value = cur_grad_state.AddForwardAccumulator(cur_value) history_value = cur_grad_state.AddForwardAccumulator(cur_value)
self._grad_context.Enter() self._grad_context.Enter()
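For context, `tensor_util.constant_value` is the folding primitive this change relies on; a minimal stand-alone sketch (TF 1.x graph mode assumed, illustrative only):

```python
import tensorflow as tf
from tensorflow.python.framework import tensor_util

# constant_value statically folds a constant tensor back into a NumPy array,
# which is what lets the gradient loop clone the constant instead of pushing
# the same value onto a stack on every forward iteration.
c = tf.constant([1.0, 2.0, 3.0])
value = tensor_util.constant_value(c)        # array([1., 2., 3.], dtype=float32)
clone = tf.constant(value, dtype=c.dtype)    # a fresh constant with equal value
```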

View File

@ -41,7 +41,6 @@ from tensorflow.python.ops import array_ops
from tensorflow.python.ops import linalg_ops from tensorflow.python.ops import linalg_ops
from tensorflow.python.ops import math_ops from tensorflow.python.ops import math_ops
from tensorflow.python.ops import random_ops from tensorflow.python.ops import random_ops
from tensorflow.python.ops import math_ops
class Initializer(object): class Initializer(object):

View File

@ -830,7 +830,8 @@ class ReluTest(test_lib.TestCase):
class MomentsTest(test_lib.TestCase): class MomentsTest(test_lib.TestCase):
def doOutputTest(self, input_shape, moments_axes, tol=1e-4): def doOutputTest(self, input_shape, moments_axes, tol=1e-4,
check_gradients=False):
for mu in [0.0, 1.0, 1e3]: for mu in [0.0, 1.0, 1e3]:
for sigma in [1.0, 0.1]: for sigma in [1.0, 0.1]:
for keep_dims in [True, False]: for keep_dims in [True, False]:
@ -846,6 +847,15 @@ class MomentsTest(test_lib.TestCase):
mean, variance = nn_impl.moments( mean, variance = nn_impl.moments(
inputs, moments_axes, keep_dims=keep_dims) inputs, moments_axes, keep_dims=keep_dims)
if check_gradients:
err = gradient_checker.compute_gradient_error(
inputs, input_shape, mean, mean.shape.as_list())
self.assertLess(err, 1e-3)
err = gradient_checker.compute_gradient_error(
inputs, input_shape, variance, variance.shape.as_list())
self.assertLess(err, 1e-3)
# Evaluate.
[mean, variance] = sess.run([mean, variance]) [mean, variance] = sess.run([mean, variance])
# Make sure that there are no NaNs # Make sure that there are no NaNs
self.assertFalse(np.isnan(mean).any()) self.assertFalse(np.isnan(mean).any())
@ -853,6 +863,12 @@ class MomentsTest(test_lib.TestCase):
self.assertAllClose(mean, expected_mean, rtol=tol, atol=tol) self.assertAllClose(mean, expected_mean, rtol=tol, atol=tol)
self.assertAllClose(variance, expected_var, rtol=tol, atol=tol) self.assertAllClose(variance, expected_var, rtol=tol, atol=tol)
def testOutputAndGradient2DInput0(self):
self.doOutputTest((10, 10), (0,), check_gradients=True)
def testOutputAndGradient2DInput01(self):
self.doOutputTest((10, 10), (0, 1), check_gradients=True)
def testOutput2DInput0(self): def testOutput2DInput0(self):
self.doOutputTest((10, 300), (0,)) self.doOutputTest((10, 300), (0,))

View File

@ -786,13 +786,18 @@ class DropoutWrapper(RNNCell):
class ResidualWrapper(RNNCell): class ResidualWrapper(RNNCell):
"""RNNCell wrapper that ensures cell inputs are added to the outputs.""" """RNNCell wrapper that ensures cell inputs are added to the outputs."""
def __init__(self, cell): def __init__(self, cell, residual_fn=None):
"""Constructs a `ResidualWrapper` for `cell`. """Constructs a `ResidualWrapper` for `cell`.
Args: Args:
cell: An instance of `RNNCell`. cell: An instance of `RNNCell`.
residual_fn: (Optional) The function to map raw cell inputs and raw cell
outputs to the actual cell outputs of the residual network.
Defaults to calling nest.map_structure on (lambda i, o: i + o), inputs
and outputs.
""" """
self._cell = cell self._cell = cell
self._residual_fn = residual_fn
@property @property
def state_size(self): def state_size(self):
@ -807,7 +812,7 @@ class ResidualWrapper(RNNCell):
return self._cell.zero_state(batch_size, dtype) return self._cell.zero_state(batch_size, dtype)
def __call__(self, inputs, state, scope=None): def __call__(self, inputs, state, scope=None):
"""Run the cell and add its inputs to its outputs. """Run the cell and then apply the residual_fn on its inputs to its outputs.
Args: Args:
inputs: cell inputs. inputs: cell inputs.
@ -822,13 +827,14 @@ class ResidualWrapper(RNNCell):
ValueError: If cell inputs and outputs have different structure (value). ValueError: If cell inputs and outputs have different structure (value).
""" """
outputs, new_state = self._cell(inputs, state, scope=scope) outputs, new_state = self._cell(inputs, state, scope=scope)
nest.assert_same_structure(inputs, outputs)
# Ensure shapes match # Ensure shapes match
def assert_shape_match(inp, out): def assert_shape_match(inp, out):
inp.get_shape().assert_is_compatible_with(out.get_shape()) inp.get_shape().assert_is_compatible_with(out.get_shape())
def default_residual_fn(inputs, outputs):
nest.assert_same_structure(inputs, outputs)
nest.map_structure(assert_shape_match, inputs, outputs) nest.map_structure(assert_shape_match, inputs, outputs)
res_outputs = nest.map_structure( return nest.map_structure(lambda inp, out: inp + out, inputs, outputs)
lambda inp, out: inp + out, inputs, outputs) res_outputs = (self._residual_fn or default_residual_fn)(inputs, outputs)
return (res_outputs, new_state) return (res_outputs, new_state)
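A hedged usage sketch of the new `residual_fn` hook, assuming the TF 1.x `tf.nn.rnn_cell` export of `ResidualWrapper` (it is also available via `tf.contrib.rnn`); the averaging lambda is just an illustration, not part of this change:

```python
import tensorflow as tf

# Hypothetical usage of the new residual_fn argument: combine cell inputs and
# outputs with a custom function instead of the default elementwise addition.
cell = tf.nn.rnn_cell.ResidualWrapper(
    tf.nn.rnn_cell.GRUCell(64),
    residual_fn=lambda inputs, outputs: 0.5 * (inputs + outputs))

x = tf.placeholder(tf.float32, [None, 64])
state = cell.zero_state(tf.shape(x)[0], tf.float32)
outputs, new_state = cell(x, state)
```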

View File

@ -54,7 +54,7 @@ tf_class {
} }
member_method { member_method {
name: "__init__" name: "__init__"
argspec: "args=[\'self\', \'cell\'], varargs=None, keywords=None, defaults=None" argspec: "args=[\'self\', \'cell\', \'residual_fn\'], varargs=None, keywords=None, defaults=[\'None\'], "
} }
member_method { member_method {
name: "add_loss" name: "add_loss"

View File

@ -17,6 +17,7 @@
# Common Bash functions used by build scripts # Common Bash functions used by build scripts
COLOR_NC='\033[0m' COLOR_NC='\033[0m'
COLOR_LIGHT_GRAY='\033[0;37m'
COLOR_GREEN='\033[0;32m' COLOR_GREEN='\033[0;32m'
COLOR_RED='\033[0;31m' COLOR_RED='\033[0;31m'

View File

@ -56,7 +56,7 @@ else
fi fi
pushd "${CI_TENSORFLOW_SUBMODULE_PATH:-.}" pushd "${CI_TENSORFLOW_SUBMODULE_PATH:-.}"
yes "" | ./configure $PYTHON_BIN_PATH configure.py
popd popd
# Gather and print build information # Gather and print build information

View File

@ -73,6 +73,9 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/builds_common.sh" source "${SCRIPT_DIR}/builds_common.sh"
SKIP_RETURN_CODE=112
# Get the command line arguments # Get the command line arguments
CONTAINER_TYPE=$( echo "$1" | tr '[:upper:]' '[:lower:]' ) CONTAINER_TYPE=$( echo "$1" | tr '[:upper:]' '[:lower:]' )
shift shift
@ -310,6 +313,13 @@ create_activate_virtualenv_and_install_tensorflow() {
# Smoke test of tensorflow install in clean virtualenv # Smoke test of tensorflow install in clean virtualenv
################################################################################ ################################################################################
do_clean_virtualenv_smoke_test() { do_clean_virtualenv_smoke_test() {
if [[ -n "${NO_TEST_ON_INSTALL}" ]] &&
[[ "${NO_TEST_ON_INSTALL}" != "0" ]]; then
echo "NO_TEST_ON_INSTALL=${NO_TEST_ON_INSTALL}:"
echo " Skipping smoke test of tensorflow install in clean virtualenv"
return ${SKIP_RETURN_CODE}
fi
CLEAN_VENV_DIR="${PIP_TEST_ROOT}/venv_clean" CLEAN_VENV_DIR="${PIP_TEST_ROOT}/venv_clean"
create_activate_virtualenv_and_install_tensorflow --clean \ create_activate_virtualenv_and_install_tensorflow --clean \
"${CLEAN_VENV_DIR}" "${WHL_PATH}" "${CLEAN_VENV_DIR}" "${WHL_PATH}"
@ -361,6 +371,7 @@ do_virtualenv_pip_test() {
[[ "${NO_TEST_ON_INSTALL}" != "0" ]]; then [[ "${NO_TEST_ON_INSTALL}" != "0" ]]; then
echo "NO_TEST_ON_INSTALL=${NO_TEST_ON_INSTALL}:" echo "NO_TEST_ON_INSTALL=${NO_TEST_ON_INSTALL}:"
echo " Skipping ALL Python unit tests on install" echo " Skipping ALL Python unit tests on install"
return ${SKIP_RETURN_CODE}
else else
# Call run_pip_tests.sh to perform test-on-install # Call run_pip_tests.sh to perform test-on-install
"${SCRIPT_DIR}/run_pip_tests.sh" --virtualenv ${GPU_FLAG} ${MAC_FLAG} "${SCRIPT_DIR}/run_pip_tests.sh" --virtualenv ${GPU_FLAG} ${MAC_FLAG}
@ -379,6 +390,7 @@ do_virtualenv_oss_serial_pip_test() {
[[ "${NO_TEST_ON_INSTALL}" != "0" ]]; then [[ "${NO_TEST_ON_INSTALL}" != "0" ]]; then
echo "NO_TEST_ON_INSTALL=${NO_TEST_ON_INSTALL}:" echo "NO_TEST_ON_INSTALL=${NO_TEST_ON_INSTALL}:"
echo " Skipping Python unit tests on install tagged with oss_serial" echo " Skipping Python unit tests on install tagged with oss_serial"
return ${SKIP_RETURN_CODE}
else else
# Call run_pip_tests.sh to perform test-on-install # Call run_pip_tests.sh to perform test-on-install
"${SCRIPT_DIR}/run_pip_tests.sh" \ "${SCRIPT_DIR}/run_pip_tests.sh" \
@ -402,6 +414,7 @@ do_test_user_ops() {
fi fi
else else
echo "Skipping user-op test-on-install due to DO_TEST_USER_OPS = ${DO_TEST_USER_OPS}" echo "Skipping user-op test-on-install due to DO_TEST_USER_OPS = ${DO_TEST_USER_OPS}"
return ${SKIP_RETURN_CODE}
fi fi
} }
@ -424,6 +437,7 @@ do_test_tfdbg_binaries() {
popd popd
else else
echo "Skipping test of tfdbg binaries due to DO_TEST_TFDBG_BINARIES = ${DO_TEST_TFDBG_BINARIES}" echo "Skipping test of tfdbg binaries due to DO_TEST_TFDBG_BINARIES = ${DO_TEST_TFDBG_BINARIES}"
return ${SKIP_RETURN_CODE}
fi fi
} }
@ -439,6 +453,7 @@ do_test_tutorials() {
fi fi
else else
echo "Skipping tutorial tests-on-install due to DO_TEST_TUTORIALS = ${DO_TEST_TUTORIALS}" echo "Skipping tutorial tests-on-install due to DO_TEST_TUTORIALS = ${DO_TEST_TUTORIALS}"
return ${SKIP_RETURN_CODE}
fi fi
} }
@ -455,6 +470,7 @@ do_ffmpeg_integration_test() {
fi fi
else else
echo "Skipping ffmpeg integration due to DO_INTEGRATION_TESTS = ${DO_INTEGRATION_TESTS}" echo "Skipping ffmpeg integration due to DO_INTEGRATION_TESTS = ${DO_INTEGRATION_TESTS}"
return ${SKIP_RETURN_CODE}
fi fi
} }
@ -468,6 +484,7 @@ PIP_TASKS_DESC=("Smoke test of pip install in clean virtualenv" "PIP tests in vi
COUNTER=0 COUNTER=0
FAIL_COUNTER=0 FAIL_COUNTER=0
PASS_COUNTER=0 PASS_COUNTER=0
SKIP_COUNTER=0
while [[ ${COUNTER} -lt "${#PIP_TASKS[@]}" ]]; do while [[ ${COUNTER} -lt "${#PIP_TASKS[@]}" ]]; do
INDEX=COUNTER INDEX=COUNTER
((INDEX++)) ((INDEX++))
@ -480,7 +497,9 @@ while [[ ${COUNTER} -lt "${#PIP_TASKS[@]}" ]]; do
${PIP_TASKS[COUNTER]} ${PIP_TASKS[COUNTER]}
RESULT=$? RESULT=$?
if [[ ${RESULT} != "0" ]]; then if [[ ${RESULT} == ${SKIP_RETURN_CODE} ]]; then
((SKIP_COUNTER++))
elif [[ ${RESULT} != "0" ]]; then
((FAIL_COUNTER++)) ((FAIL_COUNTER++))
else else
((PASS_COUNTER++)) ((PASS_COUNTER++))
@ -503,7 +522,9 @@ while [[ ${COUNTER} -lt "${#PIP_TASKS[@]}" ]]; do
((INDEX++)) ((INDEX++))
echo "${INDEX}. ${PIP_TASKS[COUNTER]}: ${PIP_TASKS_DESC[COUNTER]}" echo "${INDEX}. ${PIP_TASKS[COUNTER]}: ${PIP_TASKS_DESC[COUNTER]}"
if [[ ${STEP_EXIT_CODES[COUNTER]} == "0" ]]; then if [[ ${STEP_EXIT_CODES[COUNTER]} == ${SKIP_RETURN_CODE} ]]; then
printf " ${COLOR_LIGHT_GRAY}SKIP${COLOR_NC}\n"
elif [[ ${STEP_EXIT_CODES[COUNTER]} == "0" ]]; then
printf " ${COLOR_GREEN}PASS${COLOR_NC}\n" printf " ${COLOR_GREEN}PASS${COLOR_NC}\n"
else else
printf " ${COLOR_RED}FAIL${COLOR_NC}\n" printf " ${COLOR_RED}FAIL${COLOR_NC}\n"
@ -513,7 +534,7 @@ while [[ ${COUNTER} -lt "${#PIP_TASKS[@]}" ]]; do
done done
echo echo
echo "${FAIL_COUNTER} failed; ${PASS_COUNTER} passed." echo "${SKIP_COUNTER} skipped; ${FAIL_COUNTER} failed; ${PASS_COUNTER} passed."
echo echo
if [[ ${FAIL_COUNTER} == "0" ]]; then if [[ ${FAIL_COUNTER} == "0" ]]; then

View File

@ -120,7 +120,7 @@ else
fi fi
export TF_NEED_CUDA=$IS_GPU export TF_NEED_CUDA=$IS_GPU
yes "" | ./configure ${PYTHON_BIN_PATH} configure.py
# Figure out how many concurrent tests we can run and do run the tests. # Figure out how many concurrent tests we can run and do run the tests.
BAZEL_PARALLEL_TEST_FLAGS="" BAZEL_PARALLEL_TEST_FLAGS=""

View File

@ -30,7 +30,7 @@ export TF_NEED_HDFS=0
export TF_NEED_CUDA=0 export TF_NEED_CUDA=0
# Only running cc tests, python version does not matter. # Only running cc tests, python version does not matter.
export PYTHON_BIN_PATH=`which python` export PYTHON_BIN_PATH=`which python`
yes "" | ./configure $PYTHON_BIN_PATH configure.py
# Run bazel test command. Double test timeouts to avoid flakes. # Run bazel test command. Double test timeouts to avoid flakes.
bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test --test_lang_filters=cc -k \ bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test --test_lang_filters=cc -k \

View File

@ -29,7 +29,7 @@ export TF_NEED_GCP=0
export TF_NEED_HDFS=0 export TF_NEED_HDFS=0
export TF_NEED_CUDA=0 export TF_NEED_CUDA=0
export PYTHON_BIN_PATH=`which python2` export PYTHON_BIN_PATH=`which python2`
yes "" | ./configure $PYTHON_BIN_PATH configure.py
# Run bazel test command. Double test timeouts to avoid flakes. # Run bazel test command. Double test timeouts to avoid flakes.
bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test --test_lang_filters=py -k \ bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test --test_lang_filters=py -k \

View File

@ -29,7 +29,7 @@ export TF_NEED_GCP=0
export TF_NEED_HDFS=0 export TF_NEED_HDFS=0
export TF_NEED_CUDA=0 export TF_NEED_CUDA=0
export PYTHON_BIN_PATH=`which python3` export PYTHON_BIN_PATH=`which python3`
yes "" | ./configure $PYTHON_BIN_PATH configure.py
# Run bazel test command. Double test timeouts to avoid flakes. # Run bazel test command. Double test timeouts to avoid flakes.
bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test -k \ bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test -k \

View File

@ -29,7 +29,7 @@ export TF_NEED_GCP=0
export TF_NEED_HDFS=0 export TF_NEED_HDFS=0
export TF_NEED_CUDA=0 export TF_NEED_CUDA=0
export PYTHON_BIN_PATH=`which python3` export PYTHON_BIN_PATH=`which python3`
yes "" | ./configure $PYTHON_BIN_PATH configure.py
# Run bazel test command. Double test timeouts to avoid flakes. # Run bazel test command. Double test timeouts to avoid flakes.
bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test --test_lang_filters=py -k \ bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test --test_lang_filters=py -k \

View File

@ -32,7 +32,7 @@ export PYTHON_BIN_PATH=`which python3`
export TF_NEED_CUDA=1 export TF_NEED_CUDA=1
export TF_CUDA_COMPUTE_CAPABILITIES=3.7 export TF_CUDA_COMPUTE_CAPABILITIES=3.7
yes "" | ./configure $PYTHON_BIN_PATH configure.py
# Run bazel test command. Double test timeouts to avoid flakes. # Run bazel test command. Double test timeouts to avoid flakes.
bazel test --config=cuda --test_tag_filters=-no_oss,-oss_serial,-no_gpu,-benchmark-test -k \ bazel test --config=cuda --test_tag_filters=-no_oss,-oss_serial,-no_gpu,-benchmark-test -k \

View File

@ -32,7 +32,7 @@ export PYTHON_BIN_PATH=`which python3`
export TF_NEED_CUDA=1 export TF_NEED_CUDA=1
export TF_CUDA_COMPUTE_CAPABILITIES=3.7 export TF_CUDA_COMPUTE_CAPABILITIES=3.7
yes "" | ./configure $PYTHON_BIN_PATH configure.py
# Run bazel test command. Double test timeouts to avoid flakes. # Run bazel test command. Double test timeouts to avoid flakes.
bazel test --config=cuda --test_tag_filters=-no_oss,-oss_serial,-no_gpu,-benchmark-test -k \ bazel test --config=cuda --test_tag_filters=-no_oss,-oss_serial,-no_gpu,-benchmark-test -k \

View File

@ -30,7 +30,7 @@ export TF_NEED_GCP=0
export TF_NEED_HDFS=0 export TF_NEED_HDFS=0
export TF_NEED_CUDA=0 export TF_NEED_CUDA=0
export PYTHON_BIN_PATH=$(which python2) export PYTHON_BIN_PATH=$(which python2)
yes "" | ./configure $PYTHON_BIN_PATH configure.py
which bazel which bazel
bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test,-nomac \ bazel test --test_tag_filters=-no_oss,-gpu,-benchmark-test,-nomac \
--test_timeout 300,450,1200,3600 \ --test_timeout 300,450,1200,3600 \

View File

@ -33,7 +33,7 @@ export TF_NEED_CUDA=1
export TF_ENABLE_XLA=1 export TF_ENABLE_XLA=1
export TF_CUDA_COMPUTE_CAPABILITIES=3.7 export TF_CUDA_COMPUTE_CAPABILITIES=3.7
yes "" | ./configure $PYTHON_BIN_PATH configure.py
# Run bazel test command. Double test timeouts to avoid flakes. # Run bazel test command. Double test timeouts to avoid flakes.
bazel test --config=cuda --test_tag_filters=-no_gpu,-benchmark-test -k \ bazel test --config=cuda --test_tag_filters=-no_gpu,-benchmark-test -k \

View File

@ -899,6 +899,7 @@ cc_library(
"include/llvm/Target/ARM/InstPrinter/*.h", "include/llvm/Target/ARM/InstPrinter/*.h",
"include/llvm/Target/ARM/InstPrinter/*.def", "include/llvm/Target/ARM/InstPrinter/*.def",
"include/llvm/Target/ARM/InstPrinter/*.inc", "include/llvm/Target/ARM/InstPrinter/*.inc",
"lib/Target/ARM/*.h",
"lib/Target/ARM/InstPrinter/*.h", "lib/Target/ARM/InstPrinter/*.h",
]), ]),
copts = ["-Iexternal/llvm/lib/Target/ARM"], copts = ["-Iexternal/llvm/lib/Target/ARM"],
@ -1206,6 +1207,7 @@ cc_library(
"lib/IR/*.h", "lib/IR/*.h",
]), ]),
hdrs = glob([ hdrs = glob([
"include/llvm/Analysis/*.def",
"include/llvm/IR/*.h", "include/llvm/IR/*.h",
"include/llvm/IR/*.def", "include/llvm/IR/*.def",
"include/llvm/IR/*.inc", "include/llvm/IR/*.inc",
@ -2022,6 +2024,8 @@ cc_library(
"lib/Target/*.h", "lib/Target/*.h",
]), ]),
hdrs = glob([ hdrs = glob([
"include/llvm/CodeGen/*.h",
"include/llvm/CodeGen/*.def",
"include/llvm/Target/*.h", "include/llvm/Target/*.h",
"include/llvm/Target/*.def", "include/llvm/Target/*.def",
"include/llvm/Target/*.inc", "include/llvm/Target/*.inc",