Merge branch 'r2.4' into cherrypick_350241208

2021-01-19 09:25:57 -08:00
commit ed76f960cb
parent 7605e1ebc6 eef95214c8
31 changed files with 434 additions and 117 deletions
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -4,13 +4,13 @@

 * `tf.distribute` introduces experimental support for asynchronous training of models via the [`tf.distribute.experimental.ParameterServerStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/ParameterServerStrategy) API. Please see the [tutorial](https://www.tensorflow.org/tutorials/distribute/parameter_server_training) to learn more.

-* [`MultiWorkerMirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy) is now a stable API and is no longer considered experimental. Some of the major improvements involve handling peer failure and many bug fixes. Please check out the detailed tutorial on [Multi-worker training with Keras](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras). 
+* [`MultiWorkerMirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy) is now a stable API and is no longer considered experimental. Some of the major improvements involve handling peer failure and many bug fixes. Please check out the detailed tutorial on [Multi-worker training with Keras](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).

-* Introduces experimental support for a new module named [`tf.experimental.numpy`](https://www.tensorflow.org/api_docs/python/tf/experimental/numpy) which is a NumPy-compatible API for writing TF programs. See the [detailed guide](https://www.tensorflow.org/guide/tf_numpy) to learn more. Additional details below. 
+* Introduces experimental support for a new module named [`tf.experimental.numpy`](https://www.tensorflow.org/api_docs/python/tf/experimental/numpy) which is a NumPy-compatible API for writing TF programs. See the [detailed guide](https://www.tensorflow.org/guide/tf_numpy) to learn more. Additional details below.

 * Adds Support for
-  [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) on Ampere based GPUs. TensorFloat-32, or TF32 for short, is a math mode for NVIDIA Ampere based GPUs and is enabled by default. 
-  
+  [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) on Ampere based GPUs. TensorFloat-32, or TF32 for short, is a math mode for NVIDIA Ampere based GPUs and is enabled by default.
+
 * A major refactoring of the internals of the Keras Functional API has been completed, that should improve the reliability, stability, and performance of constructing Functional models.

 * Keras mixed precision API [`tf.keras.mixed_precision`](https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision?version=nightly) is no longer experimental and allows the use of 16-bit floating point formats during training, improving performance by up to 3x on GPUs and 60% on TPUs. Please see below for additional details.
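
The TensorFloat-32 entry above can be made concrete with a small pure-Python sketch. It simulates reducing a float32 value from 23 explicit mantissa bits to the 10 bits that TF32 keeps. The helper name `round_to_tf32` is hypothetical, and round-to-nearest-even is an assumption made for illustration; the rounding actually performed by the hardware may differ. The sketch assumes finite inputs within float32 range.

```python
import struct


def round_to_tf32(x: float) -> float:
    """Simulate rounding a float32 value to 10 explicit mantissa bits.

    Hypothetical helper for illustration only; assumes round-to-nearest-even
    and a finite input that fits in float32.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - 10  # number of low mantissa bits that TF32 does not keep
    # Round to nearest, ties to even, on the bits being dropped.
    lsb = (bits >> drop) & 1
    bits = (bits + (1 << (drop - 1)) - 1 + lsb) & 0xFFFFFFFF
    # Clear the dropped bits.
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1.0, 0.1, 3.14159265):
        rounded = round_to_tf32(value)
        print(value, rounded, abs(rounded - value) / abs(value))
```

Under these assumptions the relative error introduced per input stays near 2^-11, which gives a feel for how much input precision is traded away in matmuls and convolutions.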
@@ -24,71 +24,71 @@
 ## Breaking Changes

 * TF Core:
-  * Certain float32 ops run in lower precsion on Ampere based GPUs, including  matmuls and convolutions, due to the use of [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/). Specifically, inputs to such ops are rounded from 23 bits of precision to 10
+  * Certain float32 ops run in lower precision on Ampere based GPUs, including matmuls and convolutions, due to the use of [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/). Specifically, inputs to such ops are rounded from 23 bits of precision to 10
  bits of precision. This is unlikely to cause issues in practice for deep learning models. In some cases, TensorFloat-32 is also used for complex64 ops.
-  TensorFloat-32 can be disabled by running `tf.config.experimental.enable_tensor_float_32_execution(False)`. 
+  TensorFloat-32 can be disabled by running `tf.config.experimental.enable_tensor_float_32_execution(False)`.
  * The byte layout for string tensors across the C-API has been updated to match TF Core/C++; i.e., a contiguous array of `tensorflow::tstring`/`TF_TString`s.
-  * C-API functions `TF_StringDecode`, `TF_StringEncode`, and `TF_StringEncodedSize` are no longer relevant and have been removed; see `core/platform/ctstring.h` for  string access/modification in C.
+  * C-API functions `TF_StringDecode`, `TF_StringEncode`, and `TF_StringEncodedSize` are no longer relevant and have been removed; see `core/platform/ctstring.h` for string access/modification in C.
  * `tensorflow.python`, `tensorflow.core` and `tensorflow.compiler` modules are now hidden. These modules are not part of TensorFlow public API.
  * `tf.raw_ops.Max` and `tf.raw_ops.Min` no longer accept inputs of type `tf.complex64` or `tf.complex128`, because the behavior of these ops is not well defined for complex types.
  * XLA:CPU and XLA:GPU devices are no longer registered by default. Use `TF_XLA_FLAGS=--tf_xla_enable_xla_devices` if you really need them, but this flag will eventually be removed in subsequent releases.

-* `tf.keras`:  
+* `tf.keras`:
  * The `steps_per_execution` argument in `model.compile()` is no longer experimental; if you were passing `experimental_steps_per_execution`, rename it to `steps_per_execution` in your code. This argument controls the number of batches to run during each `tf.function` call when calling `model.fit()`. Running multiple batches inside a single `tf.function` call can greatly improve performance on TPUs or small models with a large Python overhead.
  * A **major refactoring** of the internals of the Keras Functional API may affect code that
  is relying on certain internal details:
    * Code that uses `isinstance(x, tf.Tensor)` instead of `tf.is_tensor` when checking Keras symbolic inputs/outputs should switch to using `tf.is_tensor`.
-    * Code that is overly dependent on the exact names attached to symbolic tensors (e.g. assumes there will be ":0" at the end of the inputs, treats names as unique identifiers instead of using `tensor.ref()`, etc.) may break. 
+    * Code that is overly dependent on the exact names attached to symbolic tensors (e.g. assumes there will be ":0" at the end of the inputs, treats names as unique identifiers instead of using `tensor.ref()`, etc.) may break.
    * Code that uses full path for `get_concrete_function` to trace Keras symbolic inputs directly should switch to building matching `tf.TensorSpec`s directly and tracing the `TensorSpec` objects.
-    * Code that relies on the exact number and names of the op layers that TensorFlow operations  were converted into may have changed.
-    * Code that uses `tf.map_fn`/`tf.cond`/`tf.while_loop`/control flow as op layers and  happens to work before TF 2.4. These will explicitly be unsupported now. Converting these ops to Functional API op layers was unreliable before TF 2.4, and prone to erroring incomprehensibly  or being silently buggy.
-    * Code that directly asserts on a Keras symbolic value in cases where ops like `tf.rank` used to  return a static or symbolic value depending on if the input had a fully static shape or not. Now these ops always return symbolic values.
+    * Code that relies on the exact number and names of the op layers that TensorFlow operations were converted into may have changed.
+    * Code that uses `tf.map_fn`/`tf.cond`/`tf.while_loop`/control flow as op layers and happens to work before TF 2.4. These will explicitly be unsupported now. Converting these ops to Functional API op layers was unreliable before TF 2.4, and prone to erroring incomprehensibly or being silently buggy.
+    * Code that directly asserts on a Keras symbolic value in cases where ops like `tf.rank` used to return a static or symbolic value depending on if the input had a fully static shape or not. Now these ops always return symbolic values.
    * Code already susceptible to leaking tensors outside of graphs becomes slightly more likely to do so now.
    * Code that tries directly getting gradients with respect to symbolic Keras inputs/outputs. Use `GradientTape` on the actual Tensors passed to the already-constructed model instead.
    * Code that requires very tricky shape manipulation via converted op layers in order to work, where the Keras symbolic shape inference proves insufficient.
-    * Code that tries manually walking a `tf.keras.Model` layer by layer and assumes layers only ever have one positional argument. This assumption doesn't hold       true before TF 2.4 either, but is more likely to cause issues now.
+    * Code that tries manually walking a `tf.keras.Model` layer by layer and assumes layers only ever have one positional argument. This assumption doesn't hold true before TF 2.4 either, but is more likely to cause issues now.
    * Code that manually enters `keras.backend.get_graph()` before building a functional model is no longer needed.
-    * Start enforcing input shape assumptions when calling Functional API Keras models. This may potentially break some users, in case there is a mismatch             between the shape used when creating `Input` objects in a Functional model, and the shape of the data passed to that model. You can fix this mismatch by         either calling the model with correctly-shaped data, or by relaxing `Input` shape assumptions (note that you can pass shapes with `None` entries for axes
+    * Start enforcing input shape assumptions when calling Functional API Keras models. This may potentially break some users, in case there is a mismatch between the shape used when creating `Input` objects in a Functional model, and the shape of the data passed to that model. You can fix this mismatch by either calling the model with correctly-shaped data, or by relaxing `Input` shape assumptions (note that you can pass shapes with `None` entries for axes
      that are meant to be dynamic). You can also disable the input checking entirely by setting `model.input_spec = None`.
-  * Several changes have been made to `tf.keras.mixed_precision.experimental`. Note that it is now recommended to use the non-experimental          `tf.keras.mixed_precision` API.
+  * Several changes have been made to `tf.keras.mixed_precision.experimental`. Note that it is now recommended to use the non-experimental `tf.keras.mixed_precision` API.
   * `AutoCastVariable.dtype` now refers to the actual variable dtype, not the dtype it will be casted to.
   * When mixed precision is enabled, `tf.keras.layers.Embedding` now outputs a float16 or bfloat16 tensor instead of a float32 tensor.
-   * The property `tf.keras.mixed_precision.experimental.LossScaleOptimizer.loss_scale` is now a tensor, not a `LossScale` object. This means to get a loss scale      of a `LossScaleOptimizer` as a tensor, you must now call `opt.loss_scale`instead of `opt.loss_scale()`.
+   * The property `tf.keras.mixed_precision.experimental.LossScaleOptimizer.loss_scale` is now a tensor, not a `LossScale` object. This means to get a loss scale of a `LossScaleOptimizer` as a tensor, you must now call `opt.loss_scale`instead of `opt.loss_scale()`.
   * The property `should_cast_variables` has been removed from `tf.keras.mixed_precision.experimental.Policy`
-   * When passing a `tf.mixed_precision.experimental.DynamicLossScale` to `tf.keras.mixed_precision.experimental.LossScaleOptimizer`, the `DynamicLossScale`'s        multiplier must be 2.
+   * When passing a `tf.mixed_precision.experimental.DynamicLossScale` to `tf.keras.mixed_precision.experimental.LossScaleOptimizer`, the `DynamicLossScale`'s multiplier must be 2.
   * When passing a `tf.mixed_precision.experimental.DynamicLossScale` to `tf.keras.mixed_precision.experimental.LossScaleOptimizer`, the weights of
-     the `DynanmicLossScale` are copied into the `LossScaleOptimizer` instead of being reused. This means modifying the weights of the `DynamicLossScale` will no      longer affect the weights of the LossScaleOptimizer, and vice versa.
+     the `DynanmicLossScale` are copied into the `LossScaleOptimizer` instead of being reused. This means modifying the weights of the `DynamicLossScale` will no longer affect the weights of the LossScaleOptimizer, and vice versa.
   * The global policy can no longer be set to a non-floating point policy in `tf.keras.mixed_precision.experimental.set_policy`
-   * In `Layer.call`, `AutoCastVariable`s will no longer be casted within `MirroredStrategy.run` or `ReplicaContext.merge_call`. This is because a thread local        variable is used to determine whether `AutoCastVariable`s are casted, and those two functions run with a different thread. Note this only applies if one of      these two functions is called within `Layer.call`; if one of those two functions calls `Layer.call`, `AutoCastVariable`s will still be casted.
+   * In `Layer.call`, `AutoCastVariable`s will no longer be casted within `MirroredStrategy.run` or `ReplicaContext.merge_call`. This is because a thread local variable is used to determine whether `AutoCastVariable`s are casted, and those two functions run with a different thread. Note this only applies if one of these two functions is called within `Layer.call`; if one of those two functions calls `Layer.call`, `AutoCastVariable`s will still be casted.

 * `tf.data`:
-  * `tf.data.experimental.service.DispatchServer` now takes a config tuple instead of individual arguments. Usages should be updated to      `tf.data.experimental.service.DispatchServer(dispatcher_config)`.
-  * `tf.data.experimental.service.WorkerServer` now takes a config tuple instead of individual arguments. Usages should be updated to  `tf.data.experimental.service.WorkerServer(worker_config)`.
-  
+  * `tf.data.experimental.service.DispatchServer` now takes a config tuple instead of individual arguments. Usages should be updated to `tf.data.experimental.service.DispatchServer(dispatcher_config)`.
+  * `tf.data.experimental.service.WorkerServer` now takes a config tuple instead of individual arguments. Usages should be updated to `tf.data.experimental.service.WorkerServer(worker_config)`.
+
 * `tf.distribute`:
  * Removes `tf.distribute.Strategy.experimental_make_numpy_dataset`. Please use `tf.data.Dataset.from_tensor_slices` instead.
-  * Renames `experimental_hints` in `tf.distribute.StrategyExtended.reduce_to`, `tf.distribute.StrategyExtended.batch_reduce_to`,      `tf.distribute.ReplicaContext.all_reduce` to `options`.
+  * Renames `experimental_hints` in `tf.distribute.StrategyExtended.reduce_to`, `tf.distribute.StrategyExtended.batch_reduce_to`, `tf.distribute.ReplicaContext.all_reduce` to `options`.
  * Renames `tf.distribute.experimental.CollectiveHints` to `tf.distribute.experimental.CommunicationOptions`.
  * Renames `tf.distribute.experimental.CollectiveCommunication` to `tf.distribute.experimental.CommunicationImplementation`.
-  * Renames `tf.distribute.Strategy.experimental_distribute_datasets_from_function` to `distribute_datasets_from_function` as it is no longer experimental. 
+  * Renames `tf.distribute.Strategy.experimental_distribute_datasets_from_function` to `distribute_datasets_from_function` as it is no longer experimental.
  * Removes `tf.distribute.Strategy.experimental_run_v2` method, which was deprecated in TF 2.2.

 * `tf.lite`:
  * `tf.quantization.quantize_and_dequantize_v2` has been introduced, which updates the gradient definition for quantization which is outside the range
-     to be 0. To simulate the V1 the behavior of `tf.quantization.quantize_and_dequantize(...)` use             `tf.grad_pass_through(tf.quantization.quantize_and_dequantize_v2)(...)`.
+     to be 0. To simulate the V1 the behavior of `tf.quantization.quantize_and_dequantize(...)` use `tf.grad_pass_through(tf.quantization.quantize_and_dequantize_v2)(...)`.

 * Building TensorFlow:
  * Windows platform builds: TensorFlow on Windows under MSVC is now built with `--copt=/experimental:preprocessor --host_copt=/experimental:preprocessor` (see `.bazelrc` for more details). Builds including TensorFlow may fail with unexpected syntax errors if these flags are absent. See also [this thread on SIG Build](https://groups.google.com/a/tensorflow.org/g/build/c/LbAw8RILvTg/m/ttnuhYU2BgAJ).

 ## Known Caveats
  * `tf.keras.mixed_precision`
-    * When using mixed precision, calling `RMSprop.apply_gradients` or `Nadam.apply_gradients` outside a `tf.function` does not work and will raise the                 AttributeError "Tensor.op is meaningless when eager execution is enabled". See this [issue](https://github.com/tensorflow/tensorflow/issues/45536) for           details and a workaround.
+    * When using mixed precision, calling `RMSprop.apply_gradients` or `Nadam.apply_gradients` outside a `tf.function` does not work and will raise the AttributeError "Tensor.op is meaningless when eager execution is enabled". See this [issue](https://github.com/tensorflow/tensorflow/issues/45536) for details and a workaround.

 ## Bug Fixes and Other Changes

 ### TF Core:
  * Introduces experimental support for a new module named [`tf.experimental.numpy`](https://www.tensorflow.org/api_docs/python/tf/experimental/numpy), which
-    is a NumPy-compatible API for writing TF programs. This module provides class `ndarray`, which mimics the `ndarray` class in NumPy, and wraps an immutable       `tf.Tensor` under the hood. A subset of NumPy functions (e.g. `numpy.add`) are provided. Their inter-operation with TF facilities is seamless in most cases. 
+    is a NumPy-compatible API for writing TF programs. This module provides class `ndarray`, which mimics the `ndarray` class in NumPy, and wraps an immutable `tf.Tensor` under the hood. A subset of NumPy functions (e.g. `numpy.add`) are provided. Their inter-operation with TF facilities is seamless in most cases.
    See [tensorflow/python/ops/numpy_ops/README.md](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/numpy_ops/README.md)
    for details of what operations are supported and what are the differences from NumPy.
  * `tf.types.experimental.TensorLike` is a new `Union` type that can be used as type annotation for variables representing a Tensor or a value
@@ -97,35 +97,35 @@
    tf.reshape truncating inputs such as from int64 to int32.
  * Adds `tf.sparse.map_values` to apply a function to the `.value`s of `SparseTensor` arguments.
  * The Python bitwise operators for `Tensor` (`__and__`, `__or__`, `__xor__` and `__invert__` now support non-`bool` arguments and apply
-    the corresponding bitwise ops. `bool` arguments continue to be supported and dispatch to logical ops. This brings them more in line with 
+    the corresponding bitwise ops. `bool` arguments continue to be supported and dispatch to logical ops. This brings them more in line with
    Python and NumPy behavior.
  * Adds `tf.SparseTensor.with_values`. This returns a new SparseTensor with the same sparsity pattern, but with new provided values. It is
    similar to the `with_values` function of `RaggedTensor`.
  * Adds `StatelessCase` op, and uses it if none of case branches has stateful ops.
  * Adds `tf.config.experimental.get_memory_usage` to return total memory usage of the device.
  * Adds gradients for `RaggedTensorToVariant` and `RaggedTensorFromVariant`.
-  * Improve shape inference of nested function calls by supporting constant folding across Arg nodes which makes more static values available to shape               inference functions.
+  * Improve shape inference of nested function calls by supporting constant folding across Arg nodes which makes more static values available to shape inference functions.
 * `tf.debugging`:
  * `tf.debugging.assert_shapes()` now works on `SparseTensor`s (Fixes [#36268](https://github.com/tensorflow/tensorflow/issues/36268)).
 * GPU
-  * Adds Support for [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) on Ampere based GPUs. 
-    TensorFloat-32, or TF32 for short, is a math mode for NVIDIA Ampere based GPUs which causes certain float32 ops, such as matrix 
-    multiplications and convolutions, to run much faster on Ampere GPUs but with reduced precision. This reduced precision has not been found 
-    to effect convergence quality of deep learning models in practice. TensorFloat-32 is enabled by default, but can be disabled with                 `tf.config.experimental.enable_tensor_float_32_execution`.
+  * Adds Support for [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) on Ampere based GPUs.
+    TensorFloat-32, or TF32 for short, is a math mode for NVIDIA Ampere based GPUs which causes certain float32 ops, such as matrix
+    multiplications and convolutions, to run much faster on Ampere GPUs but with reduced precision. This reduced precision has not been found
+    to effect convergence quality of deep learning models in practice. TensorFloat-32 is enabled by default, but can be disabled with `tf.config.experimental.enable_tensor_float_32_execution`.
 * `tf.math`:
  * Adds `tf.math.erfcinv`, the inverse to `tf.math.erfc`.
 * `tf.nn`:
  *   `tf.nn.max_pool2d` now supports explicit padding.
 * `tf.image`:
-  * Adds deterministic `tf.image.stateless_random_*` functions for each `tf.image.random_*` function. Added a new op `stateless_sample_distorted_bounding_box`       which is a deterministic version of `sample_distorted_bounding_box` op. Given the same seed, these stateless functions/ops produce the same results               independent of how many times the function is called, and independent of global seed settings.
-  * Adds deterministic `tf.image.resize` backprop CUDA kernels for `method=ResizeMethod.BILINEAR` (the default method). Enable by setting the environment             variable `TF_DETERMINISTIC_OPS` to `"true"` or `"1"`.
+  * Adds deterministic `tf.image.stateless_random_*` functions for each `tf.image.random_*` function. Added a new op `stateless_sample_distorted_bounding_box` which is a deterministic version of `sample_distorted_bounding_box` op. Given the same seed, these stateless functions/ops produce the same results independent of how many times the function is called, and independent of global seed settings.
+  * Adds deterministic `tf.image.resize` backprop CUDA kernels for `method=ResizeMethod.BILINEAR` (the default method). Enable by setting the environment variable `TF_DETERMINISTIC_OPS` to `"true"` or `"1"`.
 * `tf.print`:
  * Bug fix in `tf.print()` with `OrderedDict` where if an `OrderedDict` didn't have the keys sorted, the keys and values were not being printed
    in accordance with their correct mapping.
 * `tf.train.Checkpoint`:
-  * Now accepts a `root` argument in the initialization, which generates a checkpoint with a root object. This allows users to create a `Checkpoint` object that     is compatible with Keras `model.save_weights()` and `model.load_weights`. The checkpoint is also compatible with the checkpoint saved in the `variables/`         folder in the SavedModel.
+  * Now accepts a `root` argument in the initialization, which generates a checkpoint with a root object. This allows users to create a `Checkpoint` object that is compatible with Keras `model.save_weights()` and `model.load_weights`. The checkpoint is also compatible with the checkpoint saved in the `variables/` folder in the SavedModel.
  * When restoring, `save_path` can be a path to a SavedModel. The function will automatically find the checkpoint in the SavedModel.
-  
+
 ### `tf.data`:
  * Adds new `tf.data.experimental.service.register_dataset` and `tf.data.experimental.service.from_dataset_id` APIs to enable one
    process to register a dataset with the tf.data service, and another process to consume data from the dataset.
@@ -138,7 +138,7 @@
  * Adds support for a new "distributed_epoch" processing mode. This processing mode distributes a dataset across all tf.data workers,
    instead of having each worker process the full dataset. See [the tf.data service docs](https://www.tensorflow.org/api_docs/python/tf/data/experimental/service#understand_processing_mode) to learn more.
  * Adds optional `exclude_cols` parameter to CsvDataset. This parameter is the complement of `select_cols`; at most one of these should be specified.
-  * We have implemented an optimization which reorders data-discarding transformations such as `take` and `shard` to happen earlier in the dataset when it is           safe to do so. The optimization can be disabled via the `experimental_optimization.reorder_data_discarding_ops` dataset option.
+  * We have implemented an optimization which reorders data-discarding transformations such as `take` and `shard` to happen earlier in the dataset when it is safe to do so. The optimization can be disabled via the `experimental_optimization.reorder_data_discarding_ops` dataset option.
  * `tf.data.Options` were previously immutable and can now be overridden.
  * `tf.data.Dataset.from_generator` now supports Ragged and Sparse tensors with a new `output_signature` argument, which allows `from_generator` to
    produce any type describable by a `tf.TypeSpec`.
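
The reordering of data-discarding transformations described in this hunk can be illustrated without TensorFlow. The sketch below uses plain Python iterators as a stand-in for a dataset pipeline; `shard` and `run` are hypothetical helpers, not `tf.data` APIs. It counts how often the mapped function runs when sharding happens after versus before the map, assuming the map is a stateless one-to-one function (the kind of case where moving the shard earlier is safe).

```python
from itertools import islice


def shard(items, num_shards, index):
    """Keep every num_shards-th element starting at index (illustrative helper)."""
    return islice(items, index, None, num_shards)


def run(order, data, fn, num_shards, index):
    """Run a tiny map/shard pipeline and count how often fn is called."""
    calls = 0

    def counted(x):
        nonlocal calls
        calls += 1
        return fn(x)

    if order == "map_then_shard":
        # Every element is mapped, then most results are discarded.
        out = list(shard(map(counted, data), num_shards, index))
    else:
        # Elements are discarded first, so only the kept ones are mapped.
        out = list(map(counted, shard(data, num_shards, index)))
    return out, calls


if __name__ == "__main__":
    square = lambda x: x * x
    print(run("map_then_shard", range(12), square, 4, 1))
    print(run("shard_then_map", range(12), square, 4, 1))
```

Both orders yield the same elements, but the second does a fraction of the work, which is the motivation for applying the reordering only when it does not change results.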
@@ -149,8 +149,8 @@
    * Replaces the existing `tf.distribute.experimental.ParameterServerStrategy` symbol with a new class that is for parameter server training in TF2. Usage of
      the old symbol, usually with Estimator API, should be **replaced** with [`tf.compat.v1.distribute.experimental.ParameterServerStrategy`].
    * Added `tf.distribute.experimental.coordinator.*` namespace, including the main API `ClusterCoordinator` for coordinating the training cluster, the related data structure `RemoteValue` and `PerWorkerValue`.
-  * `MultiWorkerMirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy) is now a stable API and is no longer considered experimental. Some of the major improvements involve handling peer failure and many bug fixes. Please check out the detailed tutorial on 
-     [Multi-worer training with Keras](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).    
+  * `MultiWorkerMirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy) is now a stable API and is no longer considered experimental. Some of the major improvements involve handling peer failure and many bug fixes. Please check out the detailed tutorial on
+     [Multi-worer training with Keras](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
  * Adds `tf.distribute.Strategy.gather` and `tf.distribute.ReplicaContext.all_gather` APIs to support gathering dense distributed values.
  * Fixes various issues with saving a distributed model.

@@ -165,7 +165,7 @@
    * Error messages when Functional API construction goes wrong (and when ops cannot be converted to Keras layers automatically) should be
      clearer and easier to understand.
  * `Optimizer.minimize` can now accept a loss `Tensor` and a `GradientTape` as an alternative to accepting a `callable` loss.
-  * Adds `beta` hyperparameter to [FTRL](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Ftrl) optimizer classes (Keras and others) to match [FTRL paper](https://research.google.com/pubs/archive/41159.pdf).  
+  * Adds `beta` hyperparameter to [FTRL](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Ftrl) optimizer classes (Keras and others) to match [FTRL paper](https://research.google.com/pubs/archive/41159.pdf).
  * `Optimizer.__init__` now accepts a `gradient_aggregator` to allow for customization of how gradients are aggregated across devices, as well as `gradients_transformers` to allow for custom gradient transformations (such as gradient clipping).
  * Improvements to Keras preprocessing layers:
    * TextVectorization can now accept a vocabulary list or file as an init arg.
@@ -174,15 +174,15 @@
    True, the layer returns the attention scores as an additional output argument.
  * Adds `tf.metrics.log_cosh` and `tf.metrics.logcosh` API entrypoints with the same implementation as their `tf.losses` equivalent.
  * For Keras model, the individual call of `Model.evaluate` uses no cached data for evaluation, while `Model.fit` uses cached data when `validation_data` arg is provided for better performance.
-  * Adds a `save_traces` argument to `model.save`/ `tf.keras.models.save_model` which determines whether the SavedModel format stores the Keras model/layer call     functions. The traced functions allow Keras to revive custom models and layers without the original class definition, but if this isn't required the tracing     can be disabled with the added option.
+  * Adds a `save_traces` argument to `model.save`/ `tf.keras.models.save_model` which determines whether the SavedModel format stores the Keras model/layer call functions. The traced functions allow Keras to revive custom models and layers without the original class definition, but if this isn't required the tracing can be disabled with the added option.
  * The `tf.keras.mixed_precision` API is now non-experimental. The non-experimental API differs from the experimental API in several ways.
-    * `tf.keras.mixed_precision.Policy` no longer takes in a `tf.mixed_precision.experimental.LossScale` in the constructor, and no longer has a `LossScale`           associated with it. Instead, `Model.compile` will automatically wrap the optimizer with a `LossScaleOptimizer` using dynamic loss scaling if `Policy.name`       is "mixed_float16".
-    * `tf.keras.mixed_precision.LossScaleOptimizer`'s constructor takes in different arguments. In particular, it no longer takes in a `LossScale`, and there is       no longer a `LossScale` associated with the `LossScaleOptimizer`. Instead, `LossScaleOptimizer` directly implements fixed or dynamic loss scaling. See the       documentation of [`tf.keras.mixed_precision.experimental.LossScaleOptimizer`](https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision/experimental/LossScaleOptimizer?version=nightly) for details on the differences between the       experimental `LossScaleOptimizer` and the new non-experimental `LossScaleOptimizer`.
-    * `tf.mixed_precision.experimental.LossScale` and its subclasses are deprecated, as all of its functionality now exists within                      `tf.keras.mixed_precision.LossScaleOptimizer`
+    * `tf.keras.mixed_precision.Policy` no longer takes in a `tf.mixed_precision.experimental.LossScale` in the constructor, and no longer has a `LossScale` associated with it. Instead, `Model.compile` will automatically wrap the optimizer with a `LossScaleOptimizer` using dynamic loss scaling if `Policy.name` is `mixed_float16`.
+    * `tf.keras.mixed_precision.LossScaleOptimizer`'s constructor takes in different arguments. In particular, it no longer takes in a `LossScale`, and there is no longer a `LossScale` associated with the `LossScaleOptimizer`. Instead, `LossScaleOptimizer` directly implements fixed or dynamic loss scaling. See the documentation of [`tf.keras.mixed_precision.experimental.LossScaleOptimizer`](https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision/experimental/LossScaleOptimizer?version=nightly) for details on the differences between the experimental `LossScaleOptimizer` and the new non-experimental `LossScaleOptimizer`.
+    * `tf.mixed_precision.experimental.LossScale` and its subclasses are deprecated, as all of its functionality now exists within `tf.keras.mixed_precision.LossScaleOptimizer`
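
The dynamic loss scaling that these entries refer to can be sketched in plain Python. This is a rough illustration of the general technique, not the TensorFlow implementation: the class `DynamicLossScaleSketch`, its parameter names, the default values, and the floor of 1.0 on the scale are all assumptions chosen for the example. The multiplier of 2 mirrors the requirement stated in the breaking changes above.

```python
import math


class DynamicLossScaleSketch:
    """Illustrative dynamic loss scaling; hypothetical, not a TensorFlow class."""

    def __init__(self, initial_scale=2.0**15, growth_steps=2000, multiplier=2.0):
        self.scale = initial_scale        # factor applied to the loss before backprop
        self.growth_steps = growth_steps  # finite steps required before growing
        self.multiplier = multiplier      # growth and shrink factor
        self.good_steps = 0               # consecutive steps with finite gradients

    def update(self, grads):
        """Adjust the scale; return True if the optimizer step should be applied."""
        if all(math.isfinite(g) for g in grads):
            self.good_steps += 1
            if self.good_steps >= self.growth_steps:
                # Gradients have been stable for a while: try a larger scale.
                self.scale *= self.multiplier
                self.good_steps = 0
            return True
        # Overflow detected: skip this step and shrink the scale (assumed floor of 1).
        self.scale = max(self.scale / self.multiplier, 1.0)
        self.good_steps = 0
        return False


if __name__ == "__main__":
    scaler = DynamicLossScaleSketch(initial_scale=8.0, growth_steps=2)
    print(scaler.update([1.0, float("inf")]), scaler.scale)
    print(scaler.update([1.0]), scaler.scale)
    print(scaler.update([0.5]), scaler.scale)
```

Folding this state directly into the optimizer wrapper, rather than keeping it in a separate loss scale object, is one way to read the move from `LossScale` to `LossScaleOptimizer` described here.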

 ### `tf.lite`:
  * `TFLiteConverter`:
-    * Support optional flags `inference_input_type` and `inference_output_type` for full integer quantized models. This allows users to modify the model input         and output type to integer types (`tf.int8`, `tf.uint8`) instead of defaulting to float type (`tf.float32`). 
+    * Support optional flags `inference_input_type` and `inference_output_type` for full integer quantized models. This allows users to modify the model input and output type to integer types (`tf.int8`, `tf.uint8`) instead of defaulting to float type (`tf.float32`).
  * NNAPI
    * Adds NNAPI Delegation support for requantization use cases by converting the operation into a dequantize-quantize pair.
    * Removes deprecated `Interpreter.setUseNNAPI(boolean)` Java API. Use `Interpreter.Options.setUseNNAPI` instead.
@ -192,19 +192,19 @@
    * GPU acceleration now supports quantized models by default
  * `DynamicBuffer::AddJoinedString()` will now add a separator if the first string to be joined is empty.
  *  Adds support for cumulative sum (cumsum), both as builtin op and MLIR conversion.
-  
+
 ### `TensorRT`
  * Issues a warning when the `session_config` parameter for the TF1 converter is used or the `rewrite_config_template` field in the TF2
    converter parameter object is used.
-    
+
 ### TPU Enhancements:
  * Adds support for the `beta` parameter of the FTRL optimizer for TPU embeddings. Users of other TensorFlow platforms can implement equivalent
    behavior by adjusting the `l2` parameter.

 ### XLA Support:
-  * xla.experimental.compile is deprecated, use `tf.function(experimental_compile=True)` instead. 
+  * xla.experimental.compile is deprecated, use `tf.function(experimental_compile=True)` instead.
  * Adds `tf.function.experimental_get_compiler_ir` which returns compiler IR (currently 'hlo' and 'optimized_hlo') for given input for given function.
-  
+
 ### Security:
  * Fixes an undefined behavior causing a segfault in `tf.raw_ops.Switch`, ([CVE-2020-15190](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-15190))
  * Fixes three vulnerabilities in conversion to DLPack format
@ -241,7 +241,7 @@
  * Fixes a lack of validation in `tf.raw_ops.DataFormatVecPermute` and `tf.raw_ops.DataFormatDimMap` which can cause uninitialized memory access, read outside bounds of arrays, data corruption and segmentation faults ([CVE-2020-26267](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26267))
  * Fixes a crash caused by writing to read only memory region ([CVE-2020-26268](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26268))
  * Fixes a heap out of bounds access in filesystem globbing implementation ([CVE-2020-26269](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26269))
-    
+
 ### Other:
  * We have replaced uses of "whitelist" and "blacklist" with "allowlist" and "denylist" where possible. Please see [this list](https://developers.google.com/style/word-list#blacklist) for more context.
  * Adds `tf.config.experimental.mlir_bridge_rollout` which will help us rollout the new MLIR TPU bridge.
--- a/tensorflow/api_template.init.py
+++ b/tensorflow/api_template.init.py
@ -116,7 +116,8 @@ from tensorflow.python.lib.io import file_io as _fi

 # Get sitepackages directories for the python installation.
 _site_packages_dirs = []
-_site_packages_dirs += [] if _site.USER_SITE is None else [_site.USER_SITE]
+if _site.ENABLE_USER_SITE and _site.USER_SITE is not None:
+  _site_packages_dirs += [_site.USER_SITE]
 _site_packages_dirs += [_p for _p in _sys.path if 'site-packages' in _p]
 if 'getsitepackages' in dir(_site):
  _site_packages_dirs += _site.getsitepackages()
--- a/tensorflow/c/experimental/filesystem/modular_filesystem.cc
+++ b/tensorflow/c/experimental/filesystem/modular_filesystem.cc
@ -133,7 +133,7 @@ bool ModularFileSystem::FilesExist(const std::vector<std::string>& files,
                                   TransactionToken* token,
                                   std::vector<Status>* status) {
  if (ops_->paths_exist == nullptr)
-    return FileSystem::FilesExist(files, status);
+    return FileSystem::FilesExist(files, token, status);

  std::vector<char*> translated_names;
  translated_names.reserve(files.size());
@ -234,7 +234,7 @@ Status ModularFileSystem::DeleteRecursively(const std::string& dirname,
        "`undeleted_dirs` set to NULL");

  if (ops_->delete_recursively == nullptr)
-    return FileSystem::DeleteRecursively(dirname, undeleted_files,
+    return FileSystem::DeleteRecursively(dirname, token, undeleted_files,
                                         undeleted_dirs);

  UniquePtrTo_TF_Status plugin_status(TF_NewStatus(), TF_DeleteStatus);
@ -264,7 +264,7 @@ Status ModularFileSystem::DeleteDir(const std::string& dirname,
 Status ModularFileSystem::RecursivelyCreateDir(const std::string& dirname,
                                               TransactionToken* token) {
  if (ops_->recursively_create_dir == nullptr)
-    return FileSystem::RecursivelyCreateDir(dirname);
+    return FileSystem::RecursivelyCreateDir(dirname, token);

  UniquePtrTo_TF_Status plugin_status(TF_NewStatus(), TF_DeleteStatus);
  std::string translated_name = TranslateName(dirname);
@ -312,7 +312,8 @@ Status ModularFileSystem::Stat(const std::string& fname,

 Status ModularFileSystem::IsDirectory(const std::string& name,
                                      TransactionToken* token) {
-  if (ops_->is_directory == nullptr) return FileSystem::IsDirectory(name);
+  if (ops_->is_directory == nullptr)
+    return FileSystem::IsDirectory(name, token);

  UniquePtrTo_TF_Status plugin_status(TF_NewStatus(), TF_DeleteStatus);
  std::string translated_name = TranslateName(name);
@ -362,7 +363,8 @@ Status ModularFileSystem::RenameFile(const std::string& src,
 Status ModularFileSystem::CopyFile(const std::string& src,
                                   const std::string& target,
                                   TransactionToken* token) {
-  if (ops_->copy_file == nullptr) return FileSystem::CopyFile(src, target);
+  if (ops_->copy_file == nullptr)
+    return FileSystem::CopyFile(src, target, token);

  UniquePtrTo_TF_Status plugin_status(TF_NewStatus(), TF_DeleteStatus);
  std::string translated_src = TranslateName(src);
--- a/tensorflow/core/framework/lookup_interface.cc
+++ b/tensorflow/core/framework/lookup_interface.cc
@ -83,10 +83,17 @@ Status LookupInterface::CheckFindArguments(const Tensor& key,
                                           const Tensor& default_value) {
  TF_RETURN_IF_ERROR(CheckKeyAndValueTypes(key, default_value));
  TF_RETURN_IF_ERROR(CheckKeyShape(key.shape()));
-  if (default_value.shape() != value_shape()) {
+  TensorShape fullsize_value_shape = key.shape();
+  for (int i = 0; i < key_shape().dims(); ++i) {
+    fullsize_value_shape.RemoveDim(fullsize_value_shape.dims() - 1);
+  }
+  fullsize_value_shape.AppendShape(value_shape());
+  if (default_value.shape() != value_shape() &&
+      default_value.shape() != fullsize_value_shape) {
    return errors::InvalidArgument(
-        "Expected shape ", value_shape().DebugString(),
-        " for default value, got ", default_value.shape().DebugString());
+        "Expected shape ", value_shape().DebugString(), " or ",
+        fullsize_value_shape.DebugString(), " for default value, got ",
+        default_value.shape().DebugString());
  }
  return Status::OK();
 }
--- a/tensorflow/core/framework/lookup_interface.h
+++ b/tensorflow/core/framework/lookup_interface.h
@ -128,7 +128,8 @@ class LookupInterface : public ResourceBase {
  // requirements are satisfied, otherwise it returns InvalidArgument:
  // - DataType of the tensor keys equals to the table key_dtype
  // - DataType of the tensor default_value equals to the table value_dtype
-  // - the default_value tensor shape matches the table's value shape.
+  // - the default_value tensor has the required shape given keys and the
+  //   table's value shape.
  Status CheckFindArguments(const Tensor& keys, const Tensor& default_value);

  string DebugString() const override {
--- a/tensorflow/core/kernels/lookup_table_op.cc
+++ b/tensorflow/core/kernels/lookup_table_op.cc
@ -56,14 +56,25 @@ class MutableHashTableOfScalars final : public LookupInterface {

  Status Find(OpKernelContext* ctx, const Tensor& key, Tensor* value,
              const Tensor& default_value) override {
-    const V default_val = default_value.flat<V>()(0);
    const auto key_values = key.flat<K>();
    auto value_values = value->flat<V>();
+    const auto default_flat = default_value.flat<V>();
+
+    int64 total = value_values.size();
+    int64 default_total = default_flat.size();
+    bool is_full_size_default = (total == default_total);

    tf_shared_lock l(mu_);
    for (int64 i = 0; i < key_values.size(); ++i) {
+      // is_full_size_default is true:
+      //   Each key has its own default value: key_values(i) uses
+      //   default_flat(i) as its default value.
+      //
+      // is_full_size_default is false:
+      //   All keys share default_flat(0) as their default value.
      value_values(i) = gtl::FindWithDefault(
-          table_, SubtleMustCopyIfIntegral(key_values(i)), default_val);
+          table_, SubtleMustCopyIfIntegral(key_values(i)),
+          is_full_size_default ? default_flat(i) : default_flat(0));
    }

    return Status::OK();
@ -173,11 +184,15 @@ class MutableHashTableOfTensors final : public LookupInterface {

  Status Find(OpKernelContext* ctx, const Tensor& key, Tensor* value,
              const Tensor& default_value) override {
-    const auto default_flat = default_value.flat<V>();
+    const auto default_flat = default_value.flat_inner_dims<V, 2>();
    const auto key_values = key.flat<K>();
    auto value_values = value->flat_inner_dims<V, 2>();
    int64 value_dim = value_shape_.dim_size(0);

+    int64 total = value_values.size();
+    int64 default_total = default_flat.size();
+    bool is_full_size_default = (total == default_total);
+
    tf_shared_lock l(mu_);
    for (int64 i = 0; i < key_values.size(); ++i) {
      ValueArray* value_vec =
@ -187,8 +202,15 @@ class MutableHashTableOfTensors final : public LookupInterface {
          value_values(i, j) = value_vec->at(j);
        }
      } else {
+        // is_full_size_default is true:
+        //   Each key has its own default value: key_values(i) uses row i of
+        //   default_flat as its default value.
+        //
+        // is_full_size_default is false:
+        //   All keys share row 0 of default_flat as their default value.
        for (int64 j = 0; j < value_dim; j++) {
-          value_values(i, j) = default_flat(j);
+          value_values(i, j) =
+              is_full_size_default ? default_flat(i, j) : default_flat(0, j);
        }
      }
    }
--- a/tensorflow/core/ops/lookup_ops.cc
+++ b/tensorflow/core/ops/lookup_ops.cc
@ -169,10 +169,6 @@ REGISTER_OP("LookupTableFindV2")
      ShapeHandle handle;
      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 0, &handle));

-      // Default value must be scalar or vector.
-      ShapeHandle keys;
-      TF_RETURN_IF_ERROR(c->WithRankAtMost(c->input(2), 1, &keys));
-
      ShapeAndType value_shape_and_type;
      TF_RETURN_IF_ERROR(ValidateTableResourceHandle(
          c,
--- a/tensorflow/core/ops/lookup_ops_test.cc
+++ b/tensorflow/core/ops/lookup_ops_test.cc
@ -25,7 +25,6 @@ namespace {
 TEST(LookupOpsTest, LookupTableFindV2_ShapeFn) {
  ShapeInferenceTestOp op("LookupTableFindV2");
  INFER_ERROR("Shape must be rank 0 but is rank 1", op, "[?];?;?");
-  INFER_ERROR("Shape must be at most rank 1 but is rank 2", op, "[];?;[1,1]");
  TF_ASSERT_OK(NodeDefBuilder("test", "LookupTableFindV2")
                   .Input({"table_handle", 0, DT_RESOURCE})
                   .Input({"keys", 0, DT_INT64})
--- a/tensorflow/python/keras/mixed_precision/autocast_variable.py
+++ b/tensorflow/python/keras/mixed_precision/autocast_variable.py
@ -57,12 +57,11 @@ class AutoCastVariable(variables.Variable, core.Tensor):
  called.
  """

-  def __init__(self, variable, op=None):
+  def __init__(self, variable):
    """Creates an AutoCastVariable instance.

    Args:
      variable: A floating-point resource variable to wrap.
-      op: Optional operation of this variable.

    Raises:
      ValueError: If `variable` is not a floating-point resource variable
@ -74,7 +73,11 @@ class AutoCastVariable(variables.Variable, core.Tensor):
      raise ValueError('variable must be a floating point variable but has '
                       'type: %s' % variable.dtype.name)
    self._variable = variable
-    self._op = op
+    # 'delegate' means AutoCastVariable.op returns self._variable.op, which will
+    # raise an AttributeError in Eager (as intended). If set to any other value,
+    # AutoCastVariable.op returns that value instead, which is used to set the
+    # op attribute in AutoCastVariable.assign().
+    self._op = 'delegate'

  def _should_cast(self):
    """Returns True if this variable should be casted when accessed."""
@ -199,10 +202,18 @@ class AutoCastVariable(variables.Variable, core.Tensor):
                           use_locking=None,
                           name=None,
                           read_value=True):
+    # TODO(b/146181571): This logic can be simplified once
+    # DistributedVariable.assign returns a DistributedVariable. Currently for
+    # MirroredStrategy, it returns a Mirrored value.
    if ops.executing_eagerly_outside_functions():
      assign_op = update_fn(value, use_locking, name, False)
      if read_value:
-        return create_autocast_variable(self._variable, op=assign_op)
+        # We create a new AutoCastVariable with the same underlying tf.Variable.
+        # The new AutoCastVariable is identical except the 'op' attribute is
+        # defined. This matches the behavior of tf.Variable.assign.
+        var = create_autocast_variable(self._variable)
+        var._op = assign_op  # pylint:disable=protected-access
+        return var
      return assign_op

    # Fallback to wrapping the returned variable in graph mode if possible
@ -298,9 +309,9 @@ class AutoCastVariable(variables.Variable, core.Tensor):

  @property
  def op(self):
-    if self._op is not None:
-      return self._op
-    return self._variable.op
+    if self._op == 'delegate':
+      return self._variable.op
+    return self._op

  def _as_graph_element(self):
    graph_element = self._variable._as_graph_element()  # pylint:disable=protected-access
@ -469,7 +480,7 @@ ops.register_tensor_conversion_function(AutoCastVariable,
                                        AutoCastVariable._dense_var_to_tensor)  # pylint:disable=protected-access


-def create_autocast_variable(variable, op=None):
+def create_autocast_variable(variable):
  """Creates an AutoCastVariable that wraps another variable.

  This typically just returns `AutoCastVariable(variable)`. But, if the variable
@ -481,14 +492,13 @@ def create_autocast_variable(variable, op=None):

  Args:
    variable: A floating-point resource variable to wrap.
-    op: Optional operation of this variable.

  Returns:
    An AutoCastVariable that wraps the variable.
  """
  if not isinstance(variable, (distribute_values.DistributedVariable,
                               ps_distribute_values.AggregatingVariable)):
-    return AutoCastVariable(variable, op=op)
+    return AutoCastVariable(variable)

  class AutoCastDistributedVariable(AutoCastVariable, variable.__class__):
    """An AutoCastVariable that also subclasses from variable.__class__.
@ -511,7 +521,7 @@ def create_autocast_variable(variable, op=None):
             ).format(v=self)
      # pylint: enable=missing-format-attribute

-  return AutoCastDistributedVariable(variable, op=op)
+  return AutoCastDistributedVariable(variable)


 class enable_auto_cast_variables(object):  # pylint:disable=invalid-name
--- a/tensorflow/python/keras/mixed_precision/autocast_variable_test.py
+++ b/tensorflow/python/keras/mixed_precision/autocast_variable_test.py
@ -37,7 +37,14 @@ from tensorflow.python.framework import indexed_slices
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_combinations as combinations
 from tensorflow.python.keras.mixed_precision import autocast_variable
+from tensorflow.python.keras.optimizer_v2 import adadelta
+from tensorflow.python.keras.optimizer_v2 import adagrad
+from tensorflow.python.keras.optimizer_v2 import adam
+from tensorflow.python.keras.optimizer_v2 import adamax
+from tensorflow.python.keras.optimizer_v2 import ftrl
 from tensorflow.python.keras.optimizer_v2 import gradient_descent as gradient_descent_v2
+from tensorflow.python.keras.optimizer_v2 import nadam
+from tensorflow.python.keras.optimizer_v2 import rmsprop
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variables
@ -352,11 +359,28 @@ class AutoCastVariableTest(test.TestCase, parameterized.TestCase):
        self.assertAllClose(5., self.evaluate(run_assign()))

  @ds_combinations.generate(maybe_distribute)
-  def test_assign_op(self, distribution):
+  def test_op_attribute(self, distribution):
    with distribution.scope():
      x = get_var(0., dtypes.float32)
      x = autocast_variable.create_autocast_variable(x)

+      # Variable.op raises an AttributeError in Eager mode and is an op in graph
+      # mode. Variable.assign(...).op is None in Eager mode and an op in Graph
+      # mode or a tf.function. We test this is also true of AutoCastVariable.
+      if context.executing_eagerly():
+        with self.assertRaisesRegex(
+            AttributeError,
+            'Tensor.op is meaningless when eager execution is enabled'):
+          x.op  # pylint: disable=pointless-statement
+        self.assertIsNone(x.assign(1.0).op)
+        self.assertIsNone(x.assign_add(1.0).op)
+        self.assertIsNone(x.assign_sub(1.0).op)
+      else:
+        self.assertIsNotNone(x.op)
+        self.assertIsNotNone(x.assign(1.0).op)
+        self.assertIsNotNone(x.assign_add(1.0).op)
+        self.assertIsNotNone(x.assign_sub(1.0).op)
+
      @def_function.function
      def func():
        self.assertIsNotNone(x.assign(1.0).op)
@ -503,25 +527,51 @@ class AutoCastVariableTest(test.TestCase, parameterized.TestCase):
            'dtype_to_cast_to=float32 '
            'inner_variable=MirroredVariable.*>')

-  @parameterized.named_parameters(
-      ('v1', gradient_descent_v1.GradientDescentOptimizer),
-      ('v2', gradient_descent_v2.SGD))
-  def test_optimizer(self, optimizer_class):
+  @ds_combinations.generate(combinations.combine(
+      optimizer_class=[
+          adadelta.Adadelta,
+          adagrad.Adagrad,
+          adam.Adam,
+          adamax.Adamax,
+          ftrl.Ftrl,
+          gradient_descent_v2.SGD,
+          nadam.Nadam,
+          rmsprop.RMSprop,
+          gradient_descent_v1.GradientDescentOptimizer
+      ],
+      use_tf_function=[False, True]))
+  def test_optimizer(self, optimizer_class, use_tf_function):
+    if use_tf_function and not context.executing_eagerly():
+      self.skipTest('Test does not support graph mode with tf.function')
    x = get_var(1., dtypes.float32)
    x = autocast_variable.create_autocast_variable(x)
-    opt = optimizer_class(1.)
+    y = get_var(1., dtypes.float32)
+    opt = optimizer_class(learning_rate=1.)

-    @def_function.function
    def f():
-      opt.minimize(lambda: x + 1., var_list=[x])
+      # Minimize both the AutoCastVariable and the normal tf.Variable. Both
+      # variables should be updated to the same value.
+      op = opt.minimize(lambda: x + y, var_list=[x, y])
+      return None if ops.executing_eagerly_outside_functions() else op
+
+    if use_tf_function:
+      f = def_function.function(f)

    if context.executing_eagerly():
      f()
    else:
-      op = f()  # pylint: disable=assignment-from-no-return
+      op = f()
      self.evaluate(variables.global_variables_initializer())
      self.evaluate(op)
-    self.assertEqual(self.evaluate(x), 0)
+    # Assert the AutoCastVariable has changed from its initial value
+    self.assertNotEqual(self.evaluate(x), 1.)
+    # Assert AutoCastVariable is updated correctly by comparing it to the normal
+    # variable
+    self.assertAlmostEqual(self.evaluate(x), self.evaluate(y))
+    if optimizer_class in (gradient_descent_v2.SGD,
+                           gradient_descent_v1.GradientDescentOptimizer):
+      # With SGD, the variables decrease by exactly 1
+      self.assertEqual(self.evaluate(x), 0)


 if __name__ == '__main__':
--- a/tensorflow/python/keras/saving/saved_model/load.py
+++ b/tensorflow/python/keras/saving/saved_model/load.py
@ -135,7 +135,7 @@ def load(path, compile=True, options=None):  # pylint: disable=redefined-builtin

  # Recreate layers and metrics using the info stored in the metadata.
  keras_loader = KerasObjectLoader(metadata, object_graph_def)
-  keras_loader.load_layers()
+  keras_loader.load_layers(compile=compile)

  # Generate a dictionary of all loaded nodes.
  nodes_to_load = {'root': None}
@ -360,7 +360,7 @@ class KerasObjectLoader(object):
          obj_child, child_proto, child_id)
      self.loaded_nodes[child_id] = obj_child, setter

-  def load_layers(self):
+  def load_layers(self, compile=True):  # pylint: disable=redefined-builtin
    """Load all layer nodes from the metadata."""
    # Load metrics after models and layers, since it's likely that models
    # and layers will create the metric when initialized (this avoids wasting
@ -376,9 +376,21 @@ class KerasObjectLoader(object):
          node_metadata.metadata)

    for node_metadata in metric_list:
-      self.loaded_nodes[node_metadata.node_id] = self._load_layer(
-          node_metadata.node_id, node_metadata.identifier,
-          node_metadata.metadata)
+      try:
+        self.loaded_nodes[node_metadata.node_id] = self._load_layer(
+            node_metadata.node_id, node_metadata.identifier,
+            node_metadata.metadata)
+      except ValueError:
+        # Metrics are only needed when the model is compiled later. We ignore
+        # errors when trying to load custom metrics when `compile=False` until
+        # custom metrics are serialized properly (b/135550038).
+        if compile:
+          raise
+        logging.warning('Unable to restore custom metric. Please ensure that '
+                        'the layer implements `get_config` and `from_config` '
+                        'when saving. In addition, please use the '
+                        '`custom_objects` arg when calling `load_model()`.')
+

  def _load_layer(self, node_id, identifier, metadata):
    """Load a single layer from a SavedUserObject proto."""
--- a/tensorflow/python/keras/saving/saved_model/saved_model_test.py
+++ b/tensorflow/python/keras/saving/saved_model/saved_model_test.py
@ -1147,6 +1147,26 @@ class MetricTest(test.TestCase, parameterized.TestCase):
      self._test_metric_save_and_load(
          metric, self._save_model_dir(), 1, test_sample_weight=False)

+  @keras_parameterized.run_with_all_model_types
+  def test_custom_metric_model(self):
+
+    class CustomMetric(keras.metrics.MeanSquaredError):
+      pass
+
+    model = testing_utils.get_small_mlp(1, 4, input_dim=3)
+    model.compile(
+        loss='mse',
+        optimizer='rmsprop',
+        metrics=[CustomMetric()])
+
+    saved_model_dir = self._save_model_dir()
+    tf_save.save(model, saved_model_dir)
+    with self.assertRaisesRegex(ValueError, 'metric'):
+      keras_load.load(saved_model_dir)
+
+    keras_load.load(saved_model_dir, compile=False)
+
+

 if __name__ == '__main__':
  test.main()
--- a/tensorflow/python/kernel_tests/lookup_ops_test.py
+++ b/tensorflow/python/kernel_tests/lookup_ops_test.py
@ -3375,6 +3375,71 @@ class MutableHashTableOpTest(test.TestCase):
    result = self.evaluate(output)
    self.assertAllEqual([[0, 1], [-1, -1]], result)

+  def testMutableHashTableFindWithInvalidShapeDefaultValue(self):
+    default_val = [-1, -1]
+    table = lookup_ops.MutableHashTable(dtypes.string, dtypes.int64,
+                                        default_val)
+
+    input_string = constant_op.constant([["brain", "salad"],
+                                         ["tank", "tarkus"]])
+
+    invalid_default_val = constant_op.constant(
+        [[-2, -3], [-4, -5], [-6, -7], [-8, -9]], dtypes.int64)
+
+    with self.assertRaisesRegex(
+        (ValueError, errors_impl.InvalidArgumentError),
+        "Expected shape \[2\] or \[2,2,2\] for default value, got \[4,2]"):
+      self.evaluate(table.lookup(input_string, invalid_default_val))
+
+    invalid_default_val = constant_op.constant([[[-2, -3], [-4, -5]]],
+                                               dtypes.int64)
+    with self.assertRaisesRegex(
+        (ValueError, errors_impl.InvalidArgumentError),
+        "Expected shape \[2\] or \[2,2,2\] for default value, got \[1,2,2\]"):
+      self.evaluate(table.lookup(input_string, invalid_default_val))
+
+  def testMutableHashTableFindHighRankScalarWithDynamicDefaultValue(self):
+    default_val = -1
+    keys = constant_op.constant(["brain", "salad", "surgery"])
+    values = constant_op.constant([0, 1, 2], dtypes.int64)
+    table = lookup_ops.MutableHashTable(dtypes.string, dtypes.int64,
+                                        default_val)
+
+    self.evaluate(table.insert(keys, values))
+    self.assertAllEqual(3, self.evaluate(table.size()))
+
+    input_string = constant_op.constant([["brain", "salad"],
+                                         ["tank", "tarkus"]])
+
+    dynamic_default_val = constant_op.constant([[-2, -3], [-4, -5]],
+                                               dtypes.int64)
+    output = table.lookup(input_string, dynamic_default_val)
+    self.assertAllEqual([2, 2], output.get_shape())
+
+    result = self.evaluate(output)
+    self.assertAllEqual([[0, 1], [-4, -5]], result)
+
+  def testMutableHashTableFindHighRankVectorWithDynamicDefaultValue(self):
+    default_val = [-1, -1]
+    keys = constant_op.constant(["brain", "salad", "surgery"])
+    values = constant_op.constant([[0, 1], [2, 3], [4, 5]], dtypes.int64)
+    table = lookup_ops.MutableHashTable(dtypes.string, dtypes.int64,
+                                        default_val)
+
+    self.evaluate(table.insert(keys, values))
+    self.assertAllEqual(3, self.evaluate(table.size()))
+
+    input_string = constant_op.constant([["brain", "salad"],
+                                         ["tank", "tarkus"]])
+
+    dynamic_default_val = constant_op.constant(
+        [[[-2, -3], [-4, -5]], [[-6, -7], [-8, -9]]], dtypes.int64)
+    output = table.lookup(input_string, dynamic_default_val)
+    self.assertAllEqual([2, 2, 2], output.get_shape())
+
+    result = self.evaluate(output)
+    self.assertAllEqual([[[0, 1], [2, 3]], [[-6, -7], [-8, -9]]], result)
+
  def testMutableHashTableInsertHighRank(self):
    default_val = -1
    keys = constant_op.constant([["brain", "salad"], ["surgery", "tank"]])
--- a/tensorflow/python/kernel_tests/summary_ops_test.py
+++ b/tensorflow/python/kernel_tests/summary_ops_test.py
@ -1206,6 +1206,26 @@ class SummaryOpsTest(test_util.TensorFlowTestCase):
      # Reset to default state for other tests.
      summary_ops.set_step(None)

+  @test_util.run_v2_only
+  def testTrace_withProfiler(self):
+
+    @def_function.function
+    def f():
+      x = constant_op.constant(2)
+      y = constant_op.constant(3)
+      return x**y
+
+    assert context.executing_eagerly()
+    logdir = self.get_temp_dir()
+    writer = summary_ops.create_file_writer(logdir)
+    summary_ops.trace_on(graph=True, profiler=True)
+    profiler_outdir = self.get_temp_dir()
+    with writer.as_default():
+      f()
+      summary_ops.trace_export(
+          name='foo', step=1, profiler_outdir=profiler_outdir)
+    writer.close()
+

 def events_from_file(filepath):
  """Returns all events in a single event file.
--- a/tensorflow/python/ops/lookup_ops.py
+++ b/tensorflow/python/ops/lookup_ops.py
@ -1849,7 +1849,7 @@ class MutableHashTable(LookupInterface):

    return op

-  def lookup(self, keys, name=None):
+  def lookup(self, keys, dynamic_default_values=None, name=None):
    """Looks up `keys` in a table, outputs the corresponding values.

    The `default_value` is used for keys not present in the table.
@ -1857,6 +1857,23 @@ class MutableHashTable(LookupInterface):
    Args:
      keys: Keys to look up. Can be a tensor of any shape. Must match the
        table's key_dtype.
+      dynamic_default_values: The values to use if a key is missing in the
+        table. If None (by default), the `table.default_value` will be used.
+        The shape of `dynamic_default_values` must match that of either
+        `table.default_value` or the lookup result tensor.
+        In the latter case, each key will have a different default value.
+
+        For example:
+
+          ```python
+          keys = [0, 1, 3]
+          dynamic_default_values = [[1, 3, 4], [2, 3, 9], [8, 3, 0]]
+
+          # The key '0' will use [1, 3, 4] as default value.
+          # The key '1' will use [2, 3, 9] as default value.
+          # The key '3' will use [8, 3, 0] as default value.
+          ```
+
      name: A name for the operation (optional).

    Returns:
@ -1870,8 +1887,9 @@ class MutableHashTable(LookupInterface):
                        (self.resource_handle, keys, self._default_value)):
      keys = ops.convert_to_tensor(keys, dtype=self._key_dtype, name="keys")
      with ops.colocate_with(self.resource_handle):
-        values = gen_lookup_ops.lookup_table_find_v2(self.resource_handle, keys,
-                                                     self._default_value)
+        values = gen_lookup_ops.lookup_table_find_v2(
+            self.resource_handle, keys, dynamic_default_values
+            if dynamic_default_values is not None else self._default_value)
    return values

  def insert(self, keys, values, name=None):
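The per-key default behaviour introduced by the lookup changes above can be sketched in plain Python. This is only an illustration of the semantics, not TensorFlow code: `expected_default_shapes` and `find_with_defaults` are hypothetical helper names, the table is modelled as a `dict`, and keys and defaults are assumed to be already flattened the way the kernel flattens them.

```python
def expected_default_shapes(key_shape, table_key_shape, value_shape):
    """Return the two default-value shapes the argument check accepts."""
    # Drop the table's own key dimensions from the end of the lookup keys'
    # shape, then append the table's value shape (the "full-size" shape).
    batch_rank = len(key_shape) - len(table_key_shape)
    batch = list(key_shape[:batch_rank])
    return list(value_shape), batch + list(value_shape)


def find_with_defaults(table, keys, defaults):
    """Look up flat `keys`; `defaults` holds one row, or one row per key."""
    is_full_size_default = len(defaults) == len(keys)
    result = []
    for i, key in enumerate(keys):
        # Full-size defaults: key i falls back to row i.
        # Otherwise every missing key falls back to row 0.
        fallback = defaults[i] if is_full_size_default else defaults[0]
        result.append(table.get(key, fallback))
    return result


table = {"brain": [0, 1], "salad": [2, 3], "surgery": [4, 5]}
keys = ["brain", "salad", "tank", "tarkus"]

per_key = find_with_defaults(
    table, keys, [[-2, -3], [-4, -5], [-6, -7], [-8, -9]])
shared = find_with_defaults(table, keys, [[-1, -1]])
shapes = expected_default_shapes((2, 2), (), (2,))
```

With a `[2, 2]` batch of scalar keys and a value shape of `[2]`, the accepted default shapes are `[2]` and `[2, 2, 2]`, which matches the error message exercised in `testMutableHashTableFindWithInvalidShapeDefaultValue`.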
--- a/tensorflow/python/ops/summary_ops_v2.py
+++ b/tensorflow/python/ops/summary_ops_v2.py
@ -1370,4 +1370,7 @@ def trace_off():
    context.context().disable_run_metadata()

  if profiler:
-    _profiler.stop()
+    try:
+      _profiler.stop()
+    except _profiler.ProfilerNotRunningError:
+      pass
--- a/tensorflow/tools/ci_build/Dockerfile.custom_op_ubuntu_16_cuda11.0
+++ b/tensorflow/tools/ci_build/Dockerfile.custom_op_ubuntu_16_cuda11.0
@ -0,0 +1,80 @@
+# Dockerfile for Ubuntu 16.04 manylinux2010 custom ops with GPU.
+
+FROM nvidia/cuda:11.0-cudnn8-devel-ubuntu16.04 as devtoolset
+
+LABEL maintainer="Amit Patankar <amitpatankar@google.com>"
+
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && apt-get install -y \
+      cpio \
+      file \
+      flex \
+      g++ \
+      make \
+      rpm2cpio \
+      unar \
+      wget \
+      && \
+    rm -rf /var/lib/apt/lists/*
+
+ADD devtoolset/fixlinks.sh fixlinks.sh
+ADD devtoolset/build_devtoolset.sh build_devtoolset.sh
+ADD devtoolset/rpm-patch.sh rpm-patch.sh
+
+# Set up a sysroot for glibc 2.12 / libstdc++ 4.4 / devtoolset-7 in /dt7.
+RUN /build_devtoolset.sh devtoolset-7 /dt7
+# Set up a sysroot for glibc 2.12 / libstdc++ 4.4 / devtoolset-8 in /dt8.
+RUN /build_devtoolset.sh devtoolset-8 /dt8
+
+# TODO(klimek): Split up into two different docker images.
+FROM nvidia/cuda:11.0-cudnn8-devel-ubuntu16.04
+
+LABEL maintainer="Amit Patankar <amitpatankar@google.com>"
+
+COPY --from=devtoolset /dt7 /dt7
+COPY --from=devtoolset /dt8 /dt8
+
+# Install TensorRT.
+RUN apt-get update && apt-get install -y \
+    libnvinfer-dev=7.1.3-1+cuda11.0 \
+    libnvinfer7=7.1.3-1+cuda11.0 \
+    libnvinfer-plugin-dev=7.1.3-1+cuda11.0 \
+    libnvinfer-plugin7=7.1.3-1+cuda11.0 \
+      && \
+    rm -rf /var/lib/apt/lists/*
+
+# Copy and run the install scripts.
+COPY install/*.sh /install/
+ARG DEBIAN_FRONTEND=noninteractive
+RUN /install/install_bootstrap_deb_packages.sh
+RUN /install/install_deb_packages.sh
+RUN /install/install_clang.sh
+RUN /install/install_bazel.sh
+RUN /install/install_buildifier.sh
+
+ENV TF_NEED_CUDA=1
+
+# Install python 3.6.
+RUN add-apt-repository ppa:deadsnakes/ppa && \
+    apt-get update && apt-get install -y \
+    python3.6 python3.6-dev python3-pip python3.6-venv && \
+    rm -rf /var/lib/apt/lists/* && \
+    python3.6 -m pip install pip --upgrade && \
+    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 0
+
+# Install python 3.7
+RUN /install/install_python37.sh
+
+# Install pip3.5
+RUN wget https://bootstrap.pypa.io/get-pip.py && python3.5 get-pip.py && rm get-pip.py
+
+RUN /install/install_pip_packages.sh
+RUN /install/install_auditwheel.sh
+
+# Make python3.6 the default python version
+RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.6 0
+
+# Install the given tensorflow or tf-nightly version; if not specified, install the latest official release.
+ARG TF_PACKAGE=tensorflow
+ARG TF_PACKAGE_VERSION=
+RUN pip3 install ${TF_PACKAGE}${TF_PACKAGE_VERSION:+==${TF_PACKAGE_VERSION}}
--- a/tensorflow/tools/ci_build/builds/docker_cpu_pip.sh
+++ b/tensorflow/tools/ci_build/builds/docker_cpu_pip.sh
@ -22,16 +22,16 @@ pip --version
 pip install portpicker
 pip install *.whl

-# Make bazel version the same as the env that invokes this script
-rm -rf ~/bazel
-mkdir ~/bazel
-pushd ~/bazel
-wget https://github.com/bazelbuild/bazel/releases/download/"${BAZEL_VERSION}"/bazel-"${BAZEL_VERSION}"-installer-linux-x86_64.sh
-chmod +x bazel-*.sh
-./bazel-"${BAZEL_VERSION}"-installer-linux-x86_64.sh --user
-rm bazel-"${BAZEL_VERSION}"-installer-linux-x86_64.sh
-PATH="/bazel_pip/bin:$PATH"
-popd
+# Install bazelisk
+rm -rf ~/bin/bazel
+mkdir -p ~/bin
+wget --no-verbose -O "${HOME}/bin/bazel" \
+    "https://github.com/bazelbuild/bazelisk/releases/download/v1.3.0/bazelisk-linux-amd64"
+chmod u+x "${HOME}/bin/bazel"
+if [[ ! ":$PATH:" =~ :"${HOME}"/bin/?: ]]; then
+  PATH="${HOME}/bin:$PATH"
+fi
+which bazel
 bazel version

 # Use default configuration
--- a/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-18.04-mpi-horovod-jupyter.Dockerfile
+++ b/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-18.04-mpi-horovod-jupyter.Dockerfile
@ -85,6 +85,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    python3-dev
--- a/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-18.04-mpi-horovod.Dockerfile
+++ b/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-18.04-mpi-horovod.Dockerfile
@ -85,6 +85,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    python3-dev
--- a/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-18.04-mpich-horovod-jupyter.Dockerfile
+++ b/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-18.04-mpich-horovod-jupyter.Dockerfile
@ -81,6 +81,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    python3-dev
--- a/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-18.04-mpich-horovod.Dockerfile
+++ b/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-18.04-mpich-horovod.Dockerfile
@ -81,6 +81,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    python3-dev
--- a/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-20.04-mpi-horovod-jupyter.Dockerfile
+++ b/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-20.04-mpi-horovod-jupyter.Dockerfile
@ -95,6 +95,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    ${PYTHON}-dev
--- a/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-20.04-mpi-horovod.Dockerfile
+++ b/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-20.04-mpi-horovod.Dockerfile
@ -95,6 +95,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    ${PYTHON}-dev
--- a/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-20.04-mpich-horovod-jupyter.Dockerfile
+++ b/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-20.04-mpich-horovod-jupyter.Dockerfile
@ -91,6 +91,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    ${PYTHON}-dev
--- a/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-20.04-mpich-horovod.Dockerfile
+++ b/tensorflow/tools/dockerfiles/dockerfiles/onednn/ubuntu-20.04-mpich-horovod.Dockerfile
@ -91,6 +91,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    ${PYTHON}-dev
--- a/tensorflow/tools/dockerfiles/partials/onednn/ubuntu/2004-horovod.partial.Dockerfile
+++ b/tensorflow/tools/dockerfiles/partials/onednn/ubuntu/2004-horovod.partial.Dockerfile
@ -6,6 +6,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    ${PYTHON}-dev
--- a/tensorflow/tools/dockerfiles/partials/onednn/ubuntu/horovod.partial.Dockerfile
+++ b/tensorflow/tools/dockerfiles/partials/onednn/ubuntu/horovod.partial.Dockerfile
@ -6,6 +6,7 @@ ARG HOROVOD_VERSION=

 RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
    build-essential \
+    cmake \
    g++-8 \
    gcc-8 \
    python3-dev
--- a/tensorflow/tools/dockerfiles/spec.yml
+++ b/tensorflow/tools/dockerfiles/spec.yml
@ -421,6 +421,7 @@ slice_sets:
              - TF_PACKAGE=intel-tensorflow
              - UBUNTU_VERSION=20.04
              - PYTHON=python3.7
+              - DEBIAN_FRONTEND="noninteractive"

    ubuntu-devel-onednn:
        - add_to_name: "-16.04-devel"
--- a/tensorflow/tools/dockerfiles/tools.Dockerfile
+++ b/tensorflow/tools/dockerfiles/tools.Dockerfile
@ -22,7 +22,7 @@ LABEL maintainer="Austin Anderson <angerson@google.com>"

 RUN apt-get update && apt-get install -y python3 python3-pip bash curl
 RUN curl -sSL https://get.docker.com/ | sh
-RUN pip3 install --upgrade pip setuptools pyyaml absl-py cerberus docker
+RUN pip3 install --upgrade pip setuptools pyyaml absl-py cerberus 'docker<=4.3.0'

 WORKDIR /tf
 VOLUME ["/tf"]
--- a/tensorflow/tools/pip_package/build_pip_package.sh
+++ b/tensorflow/tools/pip_package/build_pip_package.sh
@ -132,10 +132,10 @@ function prepare_src() {
    unzip -o -q ./bazel-bin/tensorflow/tools/pip_package/simple_console_for_windows.zip -d ./bazel-bin/tensorflow/tools/pip_package/simple_console_for_window_unzip
    echo "Unzip finished."
    # runfiles structure after unzip the python binary
-    cp \
+    cp -L \
      bazel-bin/tensorflow/tools/pip_package/simple_console_for_window_unzip/runfiles/org_tensorflow/LICENSE \
      "${TMPDIR}"
-    cp -R \
+    cp -LR \
      bazel-bin/tensorflow/tools/pip_package/simple_console_for_window_unzip/runfiles/org_tensorflow/tensorflow \
      "${TMPDIR}"
    cp_external \
@ -149,10 +149,10 @@ function prepare_src() {
    RUNFILES=bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/org_tensorflow
    if [ -d bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/org_tensorflow/external ]; then
      # Old-style runfiles structure (--legacy_external_runfiles).
-      cp \
+      cp -L \
        bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/org_tensorflow/LICENSE \
        "${TMPDIR}"
-      cp -R \
+      cp -LR \
        bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/org_tensorflow/tensorflow \
        "${TMPDIR}"
      cp_external \
@ -172,10 +172,10 @@ function prepare_src() {
      fi
    else
      # New-style runfiles structure (--nolegacy_external_runfiles).
-      cp \
+      cp -L \
        bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/org_tensorflow/LICENSE \
        "${TMPDIR}"
-      cp -R \
+      cp -LR \
        bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/org_tensorflow/tensorflow \
        "${TMPDIR}"
      cp_external \