Author: Awais Farooq
-
Learning to Resize in Computer Vision
It is a common belief that if we constrain vision models to perceive things the way humans do, their performance can be improved. For example, in this work, Geirhos et al. showed that vision models pre-trained on the ImageNet-1k dataset are biased towards texture, whereas human beings mostly rely on shape to develop a common…
-
Augmenting convnets with aggregated attention
Introduction Vision transformers (Dosovitskiy et al.) have emerged as a powerful alternative to Convolutional Neural Networks. ViTs process images in a patch-based manner. The image information is then aggregated into a CLASS token. This token correlates to the most important patches of the image for a particular classification decision. The interaction between the CLASS token and the patches…
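The patch-plus-CLASS-token setup described above can be sketched as follows. This is a hypothetical, NumPy-only illustration (the function names and shapes are illustrative, not the actual ViT implementation): an image is split into non-overlapping patches, each patch is linearly embedded, and a learnable CLASS token is prepended to the token sequence.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
patches = patchify(image, patch_size=8)      # (16, 192) patch vectors

embed_dim = 64
proj = rng.normal(size=(patches.shape[1], embed_dim))
tokens = patches @ proj                      # linear patch embedding

# In a real ViT the CLASS token is a learnable parameter; here it is random.
class_token = rng.normal(size=(1, embed_dim))
sequence = np.concatenate([class_token, tokens], axis=0)  # (17, 64)
```

In the transformer layers that follow, self-attention lets the CLASS token attend to (and aggregate information from) every patch token, which is the interaction the article examines.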
-
Class Attention Image Transformers with LayerScale
Introduction In this tutorial, we implement CaiT (Class-Attention in Image Transformers), proposed in Going deeper with Image Transformers by Touvron et al. Depth scaling, i.e., increasing model depth to obtain better performance and generalization, has been quite successful for convolutional neural networks (Tan et al. and Dollár et al., for example). But applying the same model scaling…
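One of the two ingredients named in the title, LayerScale, can be sketched in a few lines. This is a minimal, hypothetical NumPy illustration (not the tutorial's actual Keras code): each residual branch output is multiplied element-wise by a learnable per-channel vector initialized to a small constant, which stabilizes training of very deep transformers.

```python
import numpy as np

def layer_scale_residual(x, branch_output, gamma):
    """Residual connection with per-channel scaling of the branch output."""
    return x + gamma * branch_output

dim = 8
eps = 1e-4                    # small initial value, per the CaiT paper
gamma = np.full(dim, eps)     # learnable diagonal scaling in a real model

x = np.ones((4, dim))         # 4 tokens of width `dim`
branch = np.ones((4, dim))    # e.g. output of an attention or MLP block
out = layer_scale_residual(x, branch, gamma)
```

Because gamma starts near zero, each new block initially contributes almost nothing to the residual stream, and the network learns how much of each layer's output to admit as training progresses.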