
Will Transformers Replace CNNs in Computer Vision?

by Louis Bouchard, April 6th, 2021

Too Long; Didn't Read

The Swin Transformer may well be the next generation of neural networks for computer vision. The paper shows how the transformer architecture, already dominant in NLP, can be adapted into a general-purpose vision backbone by computing self-attention within shifted local windows of image patches, surpassing state-of-the-art convolutional neural networks. The code is available for you to implement it yourself (links below). Make sure you stay until the end of the video for a giveaway sponsored by NVIDIA GTC: subscribe to my newsletter for a chance to win, and register for the free NVIDIA GTC event happening in two weeks.

In a couple of minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.

As a bonus, make sure you stay until the end of the video for a giveaway sponsored by NVIDIA GTC!

References

►My Newsletter (subscribe here to have a chance to win!): http://eepurl.com/huGLT5

►Register to the GTC event: https://www.nvidia.com/en-us/gtc/?ncid=ref-crea-331503

►DLI courses: https://www.nvidia.com/en-us/training/

►Paper: Liu, Z. et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", 2021, https://arxiv.org/abs/2103.14030v1

►Code: https://github.com/microsoft/Swin-Transformer

Video transcript

This video is about what is most probably the next generation of neural networks for all computer vision applications: the transformer architecture. You've certainly already heard about this architecture in the field of natural language processing, or NLP, mainly with GPT-3, which made a lot of noise in 2020. Transformers can be used as a general-purpose backbone for many different applications, not only NLP.

In a couple of minutes, you will know how this transformer architecture can be applied to computer vision with a new paper called the Swin Transformer, by Ze Liu et al. from Microsoft Research.

Before diving into the paper, I just wanted to tell you to stay until the end of the video, where I will talk about the newsletter I just created and the next free NVIDIA GTC event happening in two weeks. You should definitely stay, or skip right to it using the timeline as usual, because I will be hosting a giveaway in collaboration with NVIDIA GTC!

This video may be less flashy than usual, as it doesn't really show the results of a specific application. Instead, the researchers showed how to adapt the transformer architecture from text inputs to images, surpassing state-of-the-art convolutional neural networks in computer vision, which is much more exciting than a small accuracy improvement, in my opinion! And of course, they are providing the code for you to implement it yourself; the link is in the description.

But why are we trying to replace convolutional neural networks for computer vision applications? Because transformers can make efficient use of much more memory and are much more powerful when it comes to complex tasks, provided, of course, that you have enough data to train them.

Transformers also use the attention mechanism introduced in the 2017 paper "Attention Is All You Need." Attention allows the transformer architecture to compute in a parallelized manner: it can simultaneously extract all the information we need from the input and the inter-relations within it. CNNs, by comparison, are much more localized, using small filters to compress the information towards a general answer. While that architecture is powerful for general classification tasks, it lacks the spatial information necessary for many tasks like instance recognition, because convolutions don't consider relations between distant pixels.

In NLP, a classical type of input is a sentence; in computer vision, it is an image. To quickly introduce the concept of attention, let's take a simple NLP example: sending a sentence into a transformer network to translate it. In this case, attention basically measures how each word in the input sentence is associated with each word of the output, translated sentence. Similarly, there is also what we call self-attention, which can be seen as a measurement of a specific word's effect on all the other words of the same sentence. This same process can be applied to images by calculating the attention of patches of the image and their relations to each other, as we will discuss further in the video.
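
To make this concrete, here is a minimal self-attention sketch in PyTorch. It is an illustration of the general mechanism, not the exact module from the paper, and the token count and feature dimension are arbitrary choices. Each token, whether a word or an image patch, is compared against every other token, and the resulting weights decide how much of each token's information flows into the output.

```python
# Minimal scaled dot-product self-attention sketch (illustrative, not the paper's exact module).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (num_tokens, dim) -- one token per word (NLP) or per image patch (vision)."""
    q = x @ w_q                                 # queries
    k = x @ w_k                                 # keys
    v = x @ w_v                                 # values
    scores = q @ k.T / (k.shape[-1] ** 0.5)     # (num_tokens, num_tokens): every token vs. every other
    weights = F.softmax(scores, dim=-1)         # attention weights sum to 1 over the input tokens
    return weights @ v                          # each output is a weighted mix of all tokens

# Toy example: 6 tokens with 16-dimensional features (arbitrary sizes).
dim = 16
x = torch.randn(6, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([6, 16])
```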

Now that we know transformers are very interesting, there is still a problem when it comes to computer vision applications. Indeed, just like the popular saying "a picture is worth a thousand words," pictures contain much more information than sentences, so we have to adapt the basic transformer architecture to process images effectively. This is what this paper is all about: the computational complexity of the transformer's self-attention is quadratic in image size, exploding the computation time and memory needs. Instead, the researchers replaced this quadratic computational complexity with a complexity that is linear in image size.
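
A quick back-of-the-envelope sketch shows why this matters. Using illustrative sizes (a 224x224 image, 4x4 patches, and 7x7 windows, in the spirit of the smallest configuration described in the paper), we can count how many token-to-token scores global attention versus windowed attention has to compute:

```python
# Rough count of attention score pairs: global self-attention vs. windowed self-attention.
# The image/patch/window sizes below are illustrative values, not the only configuration.
image_size = 224            # pixels per side
patch_size = 4              # each 4x4 pixel patch becomes one token
window_size = 7             # tokens per window side

tokens_per_side = image_size // patch_size          # 56
num_tokens = tokens_per_side ** 2                   # 3136 tokens for the whole image

# Global attention: every token attends to every token -> quadratic in the number of tokens.
global_pairs = num_tokens ** 2

# Windowed attention: tokens only attend within their own window -> linear in the number of tokens.
num_windows = (tokens_per_side // window_size) ** 2  # 64 windows
pairs_per_window = (window_size ** 2) ** 2           # 49 * 49
windowed_pairs = num_windows * pairs_per_window

print(f"global:   {global_pairs:,} pairs")    # ~9.8 million
print(f"windowed: {windowed_pairs:,} pairs")  # ~154 thousand
```

The windowed count only grows linearly with the number of tokens, since each token is compared to a fixed number of neighbors.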

The process to achieve this is quite simple. At first, as in most computer vision tasks, an RGB image is sent to the network. This image is split into patches, and each patch is treated as a token whose features are the RGB values of its pixels. To compare with NLP, you can see the overall image as the sentence and each patch as a word of that sentence.
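
As a sketch of this splitting step (with assumed sizes; the actual patch-embedding layer in the linked repository also projects these raw features to an embedding dimension), an image can be cut into non-overlapping patches and each patch flattened into a token:

```python
# Sketch: split an RGB image into non-overlapping patches and treat each patch as a token.
import torch

def image_to_patch_tokens(img, patch_size=4):
    """img: (3, H, W) RGB image; returns (num_patches, patch_size*patch_size*3) tokens."""
    c, h, w = img.shape
    # cut the height and width into patch_size chunks
    patches = img.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (3, H/ps, W/ps, ps, ps) -> (H/ps * W/ps, ps*ps*3): one flat feature vector per patch
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)
    return patches

img = torch.rand(3, 224, 224)          # a dummy RGB image
tokens = image_to_patch_tokens(img)    # the "words" of the image "sentence"
print(tokens.shape)                    # torch.Size([3136, 48]) -- 56*56 patches, 4*4*3 features each
```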

Self-attention is applied within local groups of patches, referred to as windows. Then, the windows are shifted, resulting in a new window configuration on which self-attention is applied again. This creates connections between windows while maintaining the computational efficiency of this windowed architecture, which is very interesting when compared with convolutional neural networks, as it allows long-range pixel relations to appear.
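
Here is a simplified sketch of that windowing idea. Self-attention (such as the function sketched earlier) would be run inside each window independently; between blocks, the token grid is shifted, one way being a cyclic roll, so that the new windows straddle the old window borders. The real implementation also handles an attention mask for the wrapped-around regions and a relative position bias, which are omitted here.

```python
# Sketch: windowed self-attention with shifted windows (simplified; no masking / position bias).
import torch

def window_partition(x, window_size):
    """x: (H, W, C) token grid -> (num_windows, window_size*window_size, C)."""
    h, w, c = x.shape
    x = x.view(h // window_size, window_size, w // window_size, window_size, c)
    return x.permute(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, c)

tokens = torch.randn(56, 56, 96)      # a 56x56 grid of tokens with 96 channels (illustrative sizes)
window_size = 7

# First block: self-attention within regular, non-overlapping windows.
windows = window_partition(tokens, window_size)            # (64, 49, 96)

# Next block: cyclically shift the grid by half a window, then partition again.
shift = window_size // 2
shifted = torch.roll(tokens, shifts=(-shift, -shift), dims=(0, 1))
shifted_windows = window_partition(shifted, window_size)   # new windows cross old window borders

print(windows.shape, shifted_windows.shape)
```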

This was only the first stage. The second stage is very similar but concatenates the features of each group of two-by-two neighboring patches, downsampling the resolution by a factor of two. This procedure is repeated in Stages 3 and 4, producing feature map resolutions similar to those of typical convolutional networks like ResNet and VGG.
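
A sketch of this patch-merging step, with illustrative sizes (the repository's version also applies a normalization and a linear projection on the concatenated features, omitted here):

```python
# Sketch: merge each 2x2 group of neighboring patch tokens (simplified; no linear projection).
import torch

def patch_merging(x):
    """x: (H, W, C) -> (H/2, W/2, 4*C) by concatenating 2x2 neighboring tokens."""
    x0 = x[0::2, 0::2, :]   # top-left of each 2x2 group
    x1 = x[1::2, 0::2, :]   # bottom-left
    x2 = x[0::2, 1::2, :]   # top-right
    x3 = x[1::2, 1::2, :]   # bottom-right
    return torch.cat([x0, x1, x2, x3], dim=-1)

stage1 = torch.randn(56, 56, 96)   # illustrative stage-1 resolution and channel count
stage2 = patch_merging(stage1)     # (28, 28, 384): resolution halved, features concatenated
print(stage2.shape)
```

This repeated merging is what produces the hierarchical, pyramid-like feature maps reminiscent of CNN backbones.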

You may say that this is highly similar to a convolutional architecture, with filters applied through dot products. Well, yes and no. The power of convolutions is that the filters use fixed weights everywhere, enabling the translation-invariance property of convolution and making it a powerful generalizer. In self-attention, the weights are not fixed globally; instead, they depend on the local context itself. Thus, self-attention takes into account each pixel, but also its relation to the other pixels. Also, the shifted window technique allows long-range pixel relations to appear.
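
A toy 1-D contrast (not from the paper) may help picture the difference: the convolution below slides one fixed set of weights over every position, while the attention weights are recomputed from the input content itself.

```python
# Toy contrast: fixed convolution weights vs. content-dependent attention weights (1-D example).
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 10)                       # a 1-D "image" with 10 positions

# Convolution: one fixed 3-tap filter is slid over every position.
conv_filter = torch.randn(1, 1, 3)
conv_out = F.conv1d(x, conv_filter, padding=1)  # the same weights are reused everywhere

# Self-attention: the mixing weights are computed from the input itself.
tokens = x.squeeze(0).T                         # (10, 1): one feature per position
scores = tokens @ tokens.T                      # similarity between every pair of positions
attn_weights = F.softmax(scores, dim=-1)        # different weights for every input
attn_out = attn_weights @ tokens

print(conv_out.shape, attn_out.shape)
```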

Unfortunately, these long-range relations only appear between neighboring windows, losing the very longest-range relations and showing that there is still room for improvement of the transformer architecture when it comes to computer vision. As they state in the paper, "It is our belief that a unified architecture across computer vision and natural language processing could benefit both fields, since it would facilitate joint modeling of visual and textual signals and the modeling knowledge from both domains can be more deeply shared."

And I completely agree. I think using a similar architecture for both NLP and computer vision could significantly accelerate the research process. Of course, transformers are still highly data-dependent, and nobody can say whether or not they will be the future of either NLP or computer vision. Still, this is undoubtedly a significant step forward for both fields!

06:31

Now that you've stayed this far let's talk about an awesome upcoming event for our field:

06:36

GTC.

06:37

So what is GTC2021?

06:38

It is a weeklong event offering over 1,500 talks from AI leaders like Yoshua Bengio,

06:45

Yann Lecun, Geoffrey Hinton, and much more!

06:48

The conference will start on April 12 with a keynote from the CEO of NVIDIA, where he

06:53

will be hosting the three AI pioneers I just mentioned.

06:57

This will be amazing!

06:58

It is an official NVIDIA conference for AI innovators, technologists, and creatives.

07:04

The conferences are covering many exciting topics.

07:07

Such as automotive, healthcare, data science, energy, deep learning, education, and much

07:11

more.

07:12

You don't want to miss it out!

07:14

Oh, and did I forget to mention that the registration is completely free this year?

07:18

So sign-up right now and watch it with me.

07:21

The link is in the description!

07:23

What's even cooler is that NVIDIA provided me 5 Deep Learning Institute credits that

07:28

you can use for an online, self-paced course of your choice worth around 30$ each!

07:34

The deep learning institute offers hands-on training in AI for developers, data scientists,

07:40

students, and researchers to get practical experience powered by GPUs in the cloud!

07:45

I think it's an awesome platform to learn, and it is super cool that they are offering

07:49

credits to give away, don't miss out on this opportunity!

07:52

To participate in this giveaway, you need to mention your favorite moment from the GTC

07:57

keynote on April 12 at 8:30 am pacific time using the hashtag #GTCWithMe and tagging me

08:05

(@whats_ai) on LinkedIn or Twitter!

08:09

I will also be live-streaming the event on my channel to watch it together and discuss

08:13

it in the chat.

08:14

Stay tuned for that, and please let me know what you think of the conference afterward!

NVIDIA also provided me with two extra codes to give away to people subscribing to my newsletter! This newsletter is about sharing only ONE paper each week: there will be a video, an article, the code, and the paper itself. I will also add some of the projects I am working on, guides to learning machine learning, and other exciting news! It's the first link in the description, and I will draw the winners just after the GTC event!

Finally, just a final word, as I wanted to personally thank the four recent YouTube members! Huge thanks to you, ebykova, Tonia Spight-Sokoya, Hello Paperspace, and Martin Petrovski, for your support, and to everyone watching the videos! See you in the next one!