Overcoming HBM-VMEM Bottlenecks in TPU-v3 Recurrent Workloads

by Gating Technology, May 6th, 2025

Too Long; Didn't Read

Novel recurrence gates and complex-valued units boost stability and efficiency in linear recurrent models, optimized for TPU-v3 hardware.

Authors:

(1) Soham De, Google DeepMind, with equal contributions;

(2) Samuel L. Smith, Google DeepMind, with equal contributions;

(3) Anushan Fernando, Google DeepMind, with equal contributions;

(4) Aleksandar Botev, Google DeepMind, with equal contributions;

(5) George-Cristian Muraru, Google DeepMind, with equal contributions;

(6) Albert Gu, Work done while at Google DeepMind;

(7) Ruba Haroun, Google DeepMind;

(8) Leonard Berrada, Google DeepMind;

(9) Yutian Chen, Google DeepMind;

(10) Srivatsan Srinivasan, Google DeepMind;

(11) Guillaume Desjardins, Google DeepMind;

(12) Arnaud Doucet, Google DeepMind;

(13) David Budden, Google DeepMind;

(14) Yee Whye Teh, Google DeepMind;

(15) Razvan Pascanu, Google DeepMind;

(16) Nando De Freitas, Google DeepMind;

(17) Caglar Gulcehre, Google DeepMind.

1 Introduction

2 Model Architecture

3 Recurrent Models Scale as Efficiently as Transformers

3.1. Scaling curves

3.2. Evaluation on downstream tasks

4 Training Recurrent Models Efficiently on Device and 4.1. Model parallelism for large scale training

4.2. Efficient linear recurrences on device

4.3. Training speed on longer sequences

5. Inference Speed

5.1. A simple model of the decode step

5.2. Results

6. Long Context Modeling and 6.1. Improving next token prediction with longer contexts

6.2. Copy and retrieval capabilities

7. Related Works

8. Conclusion, Acknowledgements, and References


A. RG-LRU Recurrence Gate

B. Complex-Gated Linear Recurrent Unit (CG-LRU)

C. Model Scale Hyper-Parameters

D. Efficient Linear Recurrences on Device

E. The Local Attention Window Size of Griffin

F. Inference Speeds

G. Improving Next Token Prediction with Longer Contexts: Additional Results

H. Additional Details of the Copy and Retrieval Tasks

A. RG-LRU Recurrence Gate

In Figure 7, we illustrate the behaviour of different gating mechanisms applied to the recurrent weight 𝑎.


Figure 7 | The behaviour of different gating mechanisms applied to the recurrent weight 𝑎 (note that in Mamba's notation this is −𝐴).


Implementation We implement our recurrence gate, as defined in Section 2.4, in a slightly different but mathematically equivalent form for numerical stability. In particular, we compute the logarithm of 𝑎_𝑡 and then exponentiate it, instead of computing a sigmoid and then taking a power:
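As a concrete illustration, here is a minimal JAX sketch of the stable form, under the working assumption that 𝑎 = σ(Λ) for a learnable parameter Λ and 𝑎_𝑡 = 𝑎^(𝑐·𝑟_𝑡), so that log 𝑎_𝑡 = 𝑐·𝑟_𝑡·log σ(Λ) = −𝑐·𝑟_𝑡·softplus(−Λ). The constant 𝑐 and all names below are illustrative, not taken from the paper's code:

```python
import jax
import jax.numpy as jnp

def gate_naive(Lambda, r, c=8.0):
    # Naive form: sigmoid followed by a power. Less robust in low precision.
    a = jax.nn.sigmoid(Lambda)
    return a ** (c * r)

def gate_stable(Lambda, r, c=8.0):
    # Equivalent form: compute log(a_t) and then exponentiate.
    # log sigmoid(Lambda) = -softplus(-Lambda), so
    # log a_t = c * r * log sigmoid(Lambda) = -c * r * softplus(-Lambda).
    return jnp.exp(-c * r * jax.nn.softplus(-Lambda))

# Sanity check that the two forms agree (hypothetical values).
Lambda = jnp.linspace(-5.0, 5.0, 7)
r = jnp.full_like(Lambda, 0.5)
assert jnp.allclose(gate_naive(Lambda, r), gate_stable(Lambda, r), atol=1e-6)
```

Working in log-space avoids raising a saturated sigmoid output to a large power, which is the step most prone to underflow in low precision.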



B. Complex-Gated Linear Recurrent Unit (CG-LRU)




Assuming the dimension of the input vector is even, we interpret its first half as the real part of a complex vector, and the second half as the imaginary part of the same complex vector:
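As a concrete illustration of this packing, here is a minimal JAX sketch (the helper names are ours, not from the paper):

```python
import jax.numpy as jnp

def to_complex(x):
    # First half of x -> real part, second half -> imaginary part.
    d = x.shape[-1] // 2
    return x[..., :d] + 1j * x[..., d:]

def to_real(z):
    # Inverse packing: concatenate real and imaginary parts.
    return jnp.concatenate([z.real, z.imag], axis=-1)

x = jnp.arange(8.0)         # real vector of even dimension 2d = 8
z = to_complex(x)           # complex vector of dimension d = 4
assert jnp.allclose(to_real(z), x)
```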





With this we rewrite the equations for the LRU (see eq. 4) as:




C. Model Scale Hyper-Parameters

In Table 2, we present the hyper-parameters of the models at different scales. These hyper-parameters are shared across all the model families explored in this paper.



Table 2 | Key model hyper-parameters considered for different model sizes. These hyper-parameters are shared across the different architectures we tested.


D. Efficient Linear Recurrences on Device

The first step in optimizing a computation is identifying the primary performance bottleneck on the target hardware. For most accelerators, the two key limiting factors are computational throughput (FLOPs/s) and the memory bandwidth between the high-bandwidth memory (HBM) and the fast vector memory (VMEM). Other factors, such as HBM capacity and host-device communication, also matter, but techniques such as ZeRO sharding and pipelined data transfer offer practical mitigations for them. Modern accelerators are typically designed with a high FLOPs-to-byte ratio, targeting workloads in which computation significantly outnumbers memory transfers. Table 3 shows the key specifications of the TPU-v3 (each chip contains two cores), which we use for all our experiments.
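To make the FLOPs-to-byte argument concrete, consider a diagonal linear recurrence of the form ℎ_𝑡 = 𝑎_𝑡 ⊙ ℎ_{𝑡−1} + 𝑏_𝑡: each element performs roughly two FLOPs per step while moving several bytes between HBM and VMEM, so its arithmetic intensity sits far below what accelerators are built for. The back-of-the-envelope sketch below uses illustrative hardware numbers (not the Table 3 values) and hypothetical variable names:

```python
# Back-of-the-envelope roofline estimate for a diagonal linear recurrence
# h_t = a_t * h_{t-1} + b_t in bfloat16 (2 bytes per element).
# The hardware numbers are illustrative assumptions, not the Table 3 values.
peak_flops_per_s = 1.0e14   # ~100 TFLOPs/s of compute on a single chip
hbm_bandwidth = 0.9e12      # ~0.9 TB/s of HBM <-> VMEM bandwidth

flops_per_element = 2       # one multiply and one add per element per step
bytes_per_element = 3 * 2   # load a_t and b_t, store h_t (h stays on-chip)

intensity = flops_per_element / bytes_per_element     # ~0.33 FLOPs/byte
hardware_ratio = peak_flops_per_s / hbm_bandwidth     # ~110 FLOPs/byte

print(f"recurrence intensity: {intensity:.2f} FLOPs/byte")
print(f"hardware ratio:       {hardware_ratio:.0f} FLOPs/byte")
# Since intensity << hardware_ratio, the scan is memory-bound: its runtime
# is dominated by HBM traffic rather than by the FLOPs it performs.
```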



Table 3 | Hardware specifications for a TPU-v3 pod.




Figure 8 | a) Runtimes of different implementations of the scan operation on a TPU-v3 at different sequence lengths. The batch size of the input is fixed at 8 and the dimension of each token is 1024. b) Relative runtimes of the Hawk model when using different implementations.


D.2. Scan runtimes
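Figure 8a compares several implementations of the scan operation. As a rough illustration of the kinds of implementations involved (not the custom kernel used in the paper), the JAX sketch below computes the same diagonal recurrence in two ways: a sequential scan with jax.lax.scan and a parallel associative scan with jax.lax.associative_scan:

```python
import jax
import jax.numpy as jnp

def sequential_scan(a, b):
    # h_t = a_t * h_{t-1} + b_t, computed one step at a time.
    def step(h, ab):
        a_t, b_t = ab
        h = a_t * h + b_t
        return h, h
    h0 = jnp.zeros_like(b[0])
    _, hs = jax.lax.scan(step, h0, (a, b))
    return hs

def parallel_scan(a, b):
    # The same recurrence via an associative combine of affine maps:
    # applying (a_l, b_l) then (a_r, b_r) gives (a_l * a_r, a_r * b_l + b_r).
    def combine(left, right):
        a_l, b_l = left
        a_r, b_r = right
        return a_l * a_r, a_r * b_l + b_r
    _, hs = jax.lax.associative_scan(combine, (a, b))
    return hs

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.uniform(key_a, (16, 4))    # decay factors in [0, 1)
b = jax.random.normal(key_b, (16, 4))     # inputs
assert jnp.allclose(sequential_scan(a, b), parallel_scan(a, b), atol=1e-5)
```

Which variant is faster depends on sequence length and hardware; Figure 8a reports the measured runtimes.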



This paper is available on arXiv under a CC BY 4.0 DEED license.

