Adadelta use case clarification.

Contributed from https://github.com/tensorflow/tensorflow/pull/36849 by abhilash1910

PiperOrigin-RevId: 305288812
Change-Id: I710013945f0cd38db0f19247584a74b2fd35e0e5
Zhenyu Tan 2020-04-07 10:33:09 -07:00 committed by TensorFlower Gardener
parent bfc8b7cab0
commit 820368cd94


@@ -45,6 +45,19 @@ class Adadelta(optimizer_v2.OptimizerV2):
don't have to set an initial learning rate. In this version, initial
learning rate can be set, as in most other Keras optimizers.
According to section 4.3 ("Effective Learning Rates"), near the end of
training step sizes converge to 1, which is effectively a high learning
rate that would cause divergence. This occurs only near the end of
training, because gradients and step sizes are small and the epsilon
constant in the numerator and denominator dominates past gradients and
parameter updates, driving the effective learning rate toward 1.
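For intuition, here is a minimal numeric sketch of that effect
(illustrative only, not the TensorFlow implementation; the accumulator
values are made up to represent the end of training):

import math

epsilon = 1e-6
accum_grad = 1e-9    # accumulated squared gradients, tiny late in training
accum_delta = 1e-9   # accumulated squared parameter updates, also tiny

# Adadelta scales each step by RMS[delta_x] / RMS[g]; with both
# accumulators near zero, epsilon dominates and the ratio approaches 1.
effective_rate = math.sqrt(accum_delta + epsilon) / math.sqrt(accum_grad + epsilon)
print(effective_rate)  # ~1.0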
According to section 4.4 ("Speech Data"), where a large neural network
with 4 hidden layers was trained on a corpus of US English data, ADADELTA
was used with 100 network replicas. The epsilon used was 1e-6 with
rho=0.95, which converged faster than ADAGRAD, using the following
construction:
def __init__(self, lr=1.0, rho=0.95, epsilon=1e-6, decay=0., **kwargs):
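A rough usage sketch of the same hyperparameters with the current Keras
API (the Dense model and mse loss are placeholders, not from the paper;
learning_rate=1.0 mirrors the old lr default shown above):

import tensorflow as tf

# rho and epsilon follow section 4.4 of the paper.
optimizer = tf.keras.optimizers.Adadelta(
    learning_rate=1.0, rho=0.95, epsilon=1e-6)

# Hypothetical single-layer model, only to show how the optimizer is wired in.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer=optimizer, loss='mse')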
Args:
learning_rate: A `Tensor`, floating point value, or a schedule that is a
`tf.keras.optimizers.schedules.LearningRateSchedule`. The learning rate.