Category: 12. Performance recipes

  • Learning to Resize in Computer Vision

    It is a common belief that if we constrain vision models to perceive things as humans do, their performance can be improved. For example, in this work, Geirhos et al. showed that vision models pre-trained on the ImageNet-1k dataset are biased towards texture, whereas human beings mostly use the shape descriptor to develop a common…

  • Augmenting convnets with aggregated attention

    Introduction Vision transformers (Dosovitskiy et al.) have emerged as a powerful alternative to Convolutional Neural Networks. ViTs process images in a patch-based manner. The image information is then aggregated into a CLASS token. This token correlates with the most important patches of the image for a particular classification decision. The interaction between the CLASS token and the patches…
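    As a quick illustration of the patch/CLASS-token interplay described above, here is a minimal Keras/TensorFlow sketch that prepends a learnable CLASS token to a batch of patch embeddings; the batch size, patch count, and embedding dimension are hypothetical, not taken from the tutorial.

    ```python
    import tensorflow as tf

    # Hypothetical sizes, for illustration only.
    batch, num_patches, dim = 8, 196, 128
    patch_embeddings = tf.random.normal((batch, num_patches, dim))

    # Learnable CLASS token, broadcast over the batch and prepended to the
    # patch embeddings; attention layers then let it aggregate information
    # from every patch.
    class_token = tf.Variable(tf.zeros((1, 1, dim)))
    tokens = tf.concat(
        [tf.tile(class_token, [batch, 1, 1]), patch_embeddings], axis=1
    )
    print(tokens.shape)  # (8, 197, 128)
    ```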

  • Class Attention Image Transformers with LayerScale

    Introduction In this tutorial, we implement the CaiT (Class-Attention in Image Transformers) model proposed in Going deeper with Image Transformers by Touvron et al. Depth scaling, i.e., increasing the model depth for obtaining better performance and generalization, has been quite successful for convolutional neural networks (Tan et al. and Dollár et al., for example). But applying the same model scaling…
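    LayerScale itself is a small idea: the output of each residual branch is multiplied by a learnable per-channel scale initialized near zero, so very deep networks start close to identity mappings. A minimal Keras sketch (the layer name and init value here are illustrative, not the tutorial's exact code):

    ```python
    import tensorflow as tf
    from tensorflow import keras

    class LayerScale(keras.layers.Layer):
        """Learnable per-channel scaling applied to a residual branch."""

        def __init__(self, init_value=1e-4, **kwargs):
            super().__init__(**kwargs)
            self.init_value = init_value

        def build(self, input_shape):
            # One small learnable scale per channel, initialized near zero.
            self.gamma = self.add_weight(
                name="gamma",
                shape=(input_shape[-1],),
                initializer=keras.initializers.Constant(self.init_value),
            )

        def call(self, x):
            return x * self.gamma

    # Typical use inside a block: x = x + LayerScale()(branch(x))
    ```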

  • FixRes: Fixing train-test resolution discrepancy

    Introduction It is a common practice to use the same input image resolution while training and testing vision models. However, as investigated in Fixing the train-test resolution discrepancy (Touvron et al.), this practice leads to suboptimal performance. Data augmentation is an indispensable part of the training process of deep neural networks. For vision models, we typically use…
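    The FixRes recipe, roughly, is to train at a lower resolution than the one used at test time and then briefly fine-tune at the test resolution. A minimal sketch of the two preprocessing pipelines, with assumed (not prescribed) resolutions:

    ```python
    import tensorflow as tf

    # Assumed resolutions for illustration: train smaller, evaluate larger.
    TRAIN_RES, TEST_RES = 128, 224

    def preprocess(image, label, resolution, training):
        image = tf.image.resize(image, (resolution, resolution))
        if training:
            image = tf.image.random_flip_left_right(image)
        return image, label

    # train_ds = train_ds.map(lambda x, y: preprocess(x, y, TRAIN_RES, True))
    # test_ds = test_ds.map(lambda x, y: preprocess(x, y, TEST_RES, False))
    # After initial training, a short fine-tuning pass at TEST_RES closes the gap.
    ```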

  • Knowledge Distillation

    Introduction to Knowledge Distillation Knowledge Distillation is a procedure for model compression, in which a small (student) model is trained to match a large pre-trained (teacher) model. Knowledge is transferred from the teacher model to the student by minimizing a loss function aimed at matching softened teacher logits as well as ground-truth labels. The logits…
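    A minimal sketch of such a loss, assuming logits are available from both models; the temperature and weighting hyperparameters below are illustrative, not the tutorial's exact values:

    ```python
    import tensorflow as tf
    from tensorflow import keras

    def distillation_loss(labels, student_logits, teacher_logits,
                          temperature=5.0, alpha=0.1):
        # Hard-label term: ordinary cross-entropy with the ground truth.
        hard = keras.losses.sparse_categorical_crossentropy(
            labels, student_logits, from_logits=True
        )
        # Soft-label term: KL divergence between the softened teacher and
        # student distributions, scaled by T^2.
        soft = keras.losses.kl_divergence(
            tf.nn.softmax(teacher_logits / temperature),
            tf.nn.softmax(student_logits / temperature),
        ) * temperature ** 2
        return alpha * hard + (1.0 - alpha) * soft
    ```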

  • Learning to tokenize in Vision Transformers

    Introduction Vision Transformers (Dosovitskiy et al.) and many other Transformer-based architectures (Liu et al., Yuan et al., etc.) have shown strong results in image recognition. The following provides a brief overview of the components involved in the Vision Transformer architecture for image classification: If we take 224×224 images and extract 16×16 patches, we get a total…
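    For the 224×224 image and 16×16 patch setup mentioned above, the patch count is easy to verify:

    ```python
    image_size, patch_size = 224, 16
    patches_per_side = image_size // patch_size   # 14 non-overlapping patches per side
    num_patches = patches_per_side ** 2           # 14 * 14 = 196 patches in total
    ```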

  • Gradient Centralization for Better Training Performance

    Introduction This example implements Gradient Centralization, a new optimization technique for Deep Neural Networks by Yong et al., and demonstrates it on Laurence Moroney’s Horses or Humans Dataset. Gradient Centralization can both speed up the training process and improve the final generalization performance of DNNs. It operates directly on gradients by centralizing the gradient vectors to have zero mean.…
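    The core operation is simple: before the optimizer step, each multi-dimensional gradient is shifted so that it has zero mean. A minimal sketch of that centralization step (the axis convention follows the usual TensorFlow kernel layout; this is an illustration, not the tutorial's exact optimizer subclass):

    ```python
    import tensorflow as tf

    def centralize_gradient(grad):
        # Only gradients of multi-dimensional weights (conv/dense kernels)
        # are centralized; bias gradients are left untouched.
        if len(grad.shape) > 1:
            axes = list(range(len(grad.shape) - 1))
            grad -= tf.reduce_mean(grad, axis=axes, keepdims=True)
        return grad

    # Applied to every gradient before optimizer.apply_gradients(...).
    ```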