Category: 05. Optimizers

  • Loss Scale Optimizer

    LossScaleOptimizer class An optimizer that dynamically scales the loss to prevent underflow. Loss scaling is a technique to prevent numeric underflow in intermediate gradients when float16 is used. To prevent underflow, the loss is multiplied (or “scaled”) by a certain factor called the “loss scale”, which causes intermediate gradients to be scaled by the loss scale…
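
    A rough usage sketch (assuming the Keras 3 API, where LossScaleOptimizer wraps an inner optimizer and is typically paired with a float16 mixed-precision policy; the model and hyperparameters below are illustrative):

      import keras

      # Run compute in float16 where safe; loss scaling protects the
      # small intermediate gradients from underflowing.
      keras.mixed_precision.set_global_policy("mixed_float16")

      # Wrap a regular optimizer; the wrapper scales the loss up before
      # backprop and unscales the gradients before applying them,
      # adjusting the scale factor dynamically during training.
      inner = keras.optimizers.SGD(learning_rate=0.01)
      optimizer = keras.optimizers.LossScaleOptimizer(inner)

      model = keras.Sequential([keras.layers.Dense(1)])
      model.compile(optimizer=optimizer, loss="mse")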

  • Lion

    Lion class Optimizer that implements the Lion algorithm. The Lion optimizer is a stochastic-gradient-descent method that uses the sign operator to control the magnitude of the update, unlike other adaptive optimizers such as Adam that rely on second-order moments. This makes Lion more memory-efficient, as it only keeps track of the momentum. According to the authors…
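
    A schematic NumPy sketch of the update just described (sign of an interpolated momentum, a single state variable; decoupled weight decay omitted, and all names and values are illustrative rather than the library implementation):

      import numpy as np

      def lion_step(param, grad, m, lr=1e-4, beta_1=0.9, beta_2=0.99):
          # The update direction is only the sign of an interpolation
          # between the stored momentum and the current gradient, so the
          # step magnitude is controlled by the learning rate alone.
          param = param - lr * np.sign(beta_1 * m + (1 - beta_1) * grad)
          # Lion keeps a single state variable: the momentum.
          m = beta_2 * m + (1 - beta_2) * grad
          return param, m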

  • Ftrl

    Ftrl class Optimizer that implements the FTRL algorithm. “Follow The Regularized Leader” (FTRL) is an optimization algorithm developed at Google for click-through rate prediction in the early 2010s. It is most suitable for shallow models with large and sparse feature spaces. The algorithm is described by McMahan et al., 2013. The Keras version has support for both…
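
    A hedged usage sketch for a wide, sparse setting (the regularization strengths are illustrative values):

      import keras

      # L1 regularization encourages sparse weights, which suits the
      # large, sparse feature spaces FTRL was designed for.
      optimizer = keras.optimizers.Ftrl(
          learning_rate=0.001,
          l1_regularization_strength=0.01,
          l2_regularization_strength=0.001,
      )

      model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid")])
      model.compile(optimizer=optimizer, loss="binary_crossentropy")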

  • Nadam

    Nadam class Optimizer that implements the Nadam algorithm. Much like Adam is essentially RMSprop with momentum, Nadam is Adam with Nesterov momentum.
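
    A minimal usage sketch (Nadam accepts the same core hyperparameters as Adam; the values shown are the usual Adam-style defaults):

      import keras

      # Nesterov-accelerated Adam as a drop-in replacement for Adam.
      optimizer = keras.optimizers.Nadam(
          learning_rate=0.001, beta_1=0.9, beta_2=0.999
      )

      model = keras.Sequential([keras.layers.Dense(10, activation="softmax")])
      model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")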

  • Adafactor

    Adafactor class Optimizer that implements the Adafactor algorithm. Adafactor is commonly used in NLP tasks, and has the advantage of requiring less memory because it only stores partial information about previous gradients. The default argument setup is based on the original paper (see reference). When gradients are of dimension > 2, the Adafactor optimizer will delete the…
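
    A minimal usage sketch (since the defaults follow the original paper, Adafactor is commonly constructed with few or no arguments; the model and values here are illustrative):

      import keras

      # Adafactor stores factored second-moment statistics instead of a
      # full per-parameter accumulator, which saves memory on the large
      # embedding and projection matrices common in NLP models.
      optimizer = keras.optimizers.Adafactor(learning_rate=0.001)

      model = keras.Sequential([
          keras.layers.Embedding(input_dim=30000, output_dim=128),
          keras.layers.GlobalAveragePooling1D(),
          keras.layers.Dense(2, activation="softmax"),
      ])
      model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")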

  • Adamax

    Adamax class Optimizer that implements the Adamax algorithm. Adamax, a variant of Adam based on the infinity norm, is a first-order gradient-based optimization method. Due to its ability to adjust the learning rate based on data characteristics, it is well suited to learning time-variant processes, e.g., speech data with dynamically changing noise conditions. Default parameters follow those…
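
    A schematic NumPy sketch of the Adamax update (a first moment plus an infinity-norm accumulator, following Kingma & Ba's paper; names and default values are illustrative, not the library implementation):

      import numpy as np

      def adamax_step(param, grad, m, u, t, lr=0.001,
                      beta_1=0.9, beta_2=0.999, eps=1e-7):
          # Usual exponential moving average of the gradient.
          m = beta_1 * m + (1 - beta_1) * grad
          # Infinity-norm accumulator: a decayed running max of |grad|.
          u = np.maximum(beta_2 * u, np.abs(grad))
          # Bias-correct the first moment and normalize by the max norm.
          param = param - lr / (1 - beta_1 ** t) * m / (u + eps)
          return param, m, u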

  • Adagrad

    Adagrad class Optimizer that implements the Adagrad algorithm. Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller the updates.
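
    A schematic NumPy sketch of the idea (a per-parameter accumulator of squared gradients; names and the epsilon placement are illustrative, not the library implementation):

      import numpy as np

      def adagrad_step(param, grad, accumulator, lr=0.001, eps=1e-7):
          # Each parameter accumulates its own squared gradients, so
          # frequently updated parameters get progressively smaller steps.
          accumulator = accumulator + grad ** 2
          param = param - lr * grad / (np.sqrt(accumulator) + eps)
          return param, accumulator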

  • Adadelta

    Adadelta class Optimizer that implements the Adadelta algorithm. Adadelta optimization is a stochastic gradient descent method with an adaptive learning rate per dimension, designed to address two drawbacks of Adagrad: the continual decay of learning rates throughout training, and the need for a manually selected global learning rate. Adadelta is a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. This…
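
    A minimal usage sketch (rho is the decay rate of the moving window over squared gradients; setting learning_rate to 1.0 recovers the original paper's formulation, which needs no hand-tuned global rate; values are illustrative):

      import keras

      optimizer = keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95)

      model = keras.Sequential([keras.layers.Dense(1)])
      model.compile(optimizer=optimizer, loss="mse")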

  • AdamW

    AdamW class Optimizer that implements the AdamW algorithm. AdamW optimization is a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments, with an added method to decay weights per the techniques discussed in the paper ‘Decoupled Weight Decay Regularization’ by Loshchilov & Hutter, 2019. According to Kingma et al., 2014, the…
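
    A minimal usage sketch (the decay strength is passed to the optimizer and applied directly to the weights, rather than being added to the loss as an L2 penalty; the values are illustrative):

      import keras

      optimizer = keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.004)

      model = keras.Sequential([keras.layers.Dense(1)])
      model.compile(optimizer=optimizer, loss="mse")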

  • Adam

    Adam class Optimizer that implements the Adam algorithm. Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. According to Kingma et al., 2014, the method is “computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large…
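
    A minimal usage sketch (Adam can be passed by name to use its defaults, or as an instance when the hyperparameters need tuning; the values shown are illustrative):

      import keras

      model = keras.Sequential([keras.layers.Dense(1)])

      # By name, with default hyperparameters.
      model.compile(optimizer="adam", loss="mse")

      # As an instance, with explicit first/second-moment decay rates.
      model.compile(
          optimizer=keras.optimizers.Adam(
              learning_rate=3e-4, beta_1=0.9, beta_2=0.999
          ),
          loss="mse",
      )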