Modeling Transformer Layers: Majorization Minimization & Hopfield Networks

Too Long; Didn't Read

Explore how the majorization minimization (MM) technique is used to adapt Hopfield network models to the multi-layered structure of Transformers, especially in over-parameterized scenarios.

Abstract and 1 Introduction

2 Related Work

3 Model and 3.1 Associative memories

3.2 Transformer blocks

4 A New Energy Function

4.1 The layered structure

5 Cross-Entropy Loss

6 Empirical Results and 6.1 Empirical evaluation of the radius

6.2 Training GPT-2

6.3 Training Vanilla Transformers

7 Conclusion and Acknowledgments


Appendix A. Deferred Tables

Appendix B. Some Properties of the Energy Functions

Appendix C. Deferred Proofs from Section 5

Appendix D. Transformer Details: Using GPT-2 as an Example


References

4.1 The layered structure

Previous Hopfield models could handle only a single hidden layer, whereas Transformers typically consist of a stack of homogeneous blocks of attention and feed-forward (FF) layers. To model the multi-layered structure of Transformers, we employ a technique known as majorization minimization (MM) (Ortega and Rheinboldt, 1970; Sun et al., 2016), which accelerates optimization by repeatedly minimizing surrogate convex functions that upper-bound the objective. We argue that the layered structure serves the same purpose when the patterns memorized by all layers encompass the set of training samples.
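To make the MM idea concrete, here is a minimal sketch of one standard MM instance: minimizing an L-smooth scalar objective via the quadratic majorizer g(x | x_t) = f(x_t) + f'(x_t)(x − x_t) + (L/2)(x − x_t)². The names `mm_minimize`, `f`, `grad_f`, and the smoothness constant `L` are illustrative choices, not quantities defined in the paper, and this is not claimed to be the paper's construction.

```python
import numpy as np

def mm_minimize(f, grad_f, L, x0, num_iters=100):
    """Majorization minimization with a quadratic majorizer.

    At each step, the surrogate
        g(x | x_t) = f(x_t) + grad_f(x_t) * (x - x_t) + (L / 2) * (x - x_t)**2
    upper-bounds f (when f is L-smooth) and touches it at x_t.
    Its closed-form minimizer is x_t - grad_f(x_t) / L, so each step
    decreases f monotonically: f(x_{t+1}) <= g(x_{t+1} | x_t) <= f(x_t).
    """
    x = x0
    for _ in range(num_iters):
        x = x - grad_f(x) / L  # exact minimizer of the convex surrogate
    return x

# Illustrative objective: f(x) = log(1 + exp(x)) + (x - 2)**2 / 2
f = lambda x: np.log1p(np.exp(x)) + 0.5 * (x - 2.0) ** 2
grad_f = lambda x: 1.0 / (1.0 + np.exp(-x)) + (x - 2.0)
L = 1.25  # smoothness bound: sup f'' = 1/4 (logistic term) + 1 (quadratic term)

x_star = mm_minimize(f, grad_f, L, x0=0.0)
print(x_star, f(x_star))
```

Each surrogate is convex and minimized in closed form, which mirrors the role the paragraph above assigns to the stacked layers: every layer performs one easy descent step toward the same set of memorized patterns.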



Remark 1. If the model is severely over-parameterized, the energy function can closely approximate the energy of the sample distribution and is not confined to the form expressed in Eq. (9).
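Eq. (9) itself is not reproduced in this excerpt. As a point of reference only, below is a minimal sketch of the log-sum-exp energy used in modern Hopfield networks (Ramsauer et al., 2020), a common choice in this literature; the names `lse_energy`, `patterns`, and `beta` are illustrative, and this form is an assumption, not necessarily the paper's Eq. (9).

```python
import numpy as np

def lse_energy(x, patterns, beta=1.0):
    """Log-sum-exp Hopfield energy (in the style of Ramsauer et al., 2020).

    E(x) = -(1 / beta) * log sum_i exp(beta * <x, xi_i>) + 0.5 * ||x||^2
    (constant terms omitted). Lower energy means x lies closer to one of
    the stored patterns. Reference form only; Eq. (9) in the paper may differ.
    """
    scores = beta * patterns @ x  # similarities to each stored pattern
    m = scores.max()
    lse = np.log(np.sum(np.exp(scores - m))) + m  # numerically stable logsumexp
    return -lse / beta + 0.5 * np.dot(x, x)

# Toy example with hypothetical values: three stored patterns in R^4.
rng = np.random.default_rng(0)
patterns = rng.standard_normal((3, 4))
query = patterns[0] + 0.1 * rng.standard_normal(4)  # noisy copy of pattern 0
print(lse_energy(query, patterns))                  # typically lower than below
print(lse_energy(rng.standard_normal(4), patterns))
```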


Authors:

(1) Xueyan Niu, Theory Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd.;

(2) Bo Bai (baibo.8@huawei.com);

(3) Lei Deng (deng.lei2@huawei.com);

(4) Wei Han (harvey.hanwei@huawei.com).


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

