
Why Are Vision Transformers Focusing on Boring Backgrounds?

by Mike Young, October 2nd, 2023

Too Long; Didn't Read

Vision Transformers (ViTs) have gained popularity for image-related tasks but exhibit a strange behavior: they focus attention on unimportant background patches instead of the main subjects in images. Researchers found that a small fraction of patch tokens with abnormally high L2 norms account for these attention spikes. They hypothesize that ViTs recycle low-information patches as storage for global image information, producing the artifact. To fix it, they propose adding dedicated "register" tokens to the input sequence, which yields smoother attention maps, better downstream performance, and improved object-discovery ability. The study highlights the need for ongoing research into model artifacts to advance transformer capabilities.
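
To make the fix concrete, here is a minimal PyTorch sketch of what register tokens could look like in practice: a few extra learnable embeddings appended to the token sequence before the transformer blocks and discarded at the output. This is an illustrative assumption of the mechanism described above, not the researchers' actual implementation; the class name `ViTWithRegisters` and parameters such as `num_registers` are invented for the example.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Illustrative sketch of a ViT encoder with learnable "register" tokens.

    Registers are extra learnable embeddings appended to the input sequence.
    They give the model dedicated slots for global image information, so it
    no longer needs to hijack low-information patch tokens for that purpose.
    Names and structure here are assumptions, not the paper's exact code.
    """

    def __init__(self, embed_dim=768, depth=12, num_heads=12,
                 num_patches=196, num_registers=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # The key addition: a small set of learnable register tokens.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim), e.g. from the usual
        # patch-embedding projection step.
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, patch_tokens], dim=1) + self.pos_embed
        # Append registers after positional embeddings; they carry no position.
        regs = self.registers.expand(b, -1, -1)
        x = torch.cat([x, regs], dim=1)
        x = self.blocks(x)
        # Discard the register outputs; they exist only as scratch space.
        return x[:, : x.shape[1] - self.num_registers]
```

Because the register outputs are dropped before any task head, the tokens act purely as scratch space: the model gains somewhere to stash global information without repurposing background patches, and the encoder's external interface is unchanged.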