Empirical Analysis of CLLM Acceleration Mechanisms and Hyperparameter Sensitivity

by Large Models (dot tech), May 26th, 2025

Too Long; Didn't Read

CLLMs show 2.0-6.8x improvements in fast-forwarded and stationary token counts. Ablations reveal that dataset size, n-token sequence length, and loss design all impact performance.

Authors:

(1) Siqi Kou, Shanghai Jiao Tong University (equal contribution);

(2) Lanxiang Hu, University of California, San Diego (equal contribution);

(3) Zhezhi He, Shanghai Jiao Tong University;

(4) Zhijie Deng, Shanghai Jiao Tong University;

(5) Hao Zhang, University of California, San Diego.

Abstract and 1 Introduction

2. Related Work

3. Methodology and 3.1. Preliminary: Jacobi Decoding

3.2. Consistency Large Language Models (CLLMs)

3.3. Acceleration Mechanisms in CLLMs

4. Experiments

4.1. Evaluations

4.2. Acceleration Mechanisms in CLLMs

4.3. Ablation Studies

4.4. Limitations and Discussion

5. Conclusion, Impact Statement, and References

A. Illustration of Consistency Loss Learning Objectives

B. Comparison with Baseline Algorithms

C. Pseudo Code for Jacobi Decoding with KV Cache

4.2. Acceleration Mechanisms in CLLMs

Building on the insights from Section 3.3, we investigate the fast-forwarding phenomenon and the emergence of stationary tokens in Jacobi decoding to provide further empirical evidence for our hypothesis. We compare fast-forwarded and stationary token counts between target LLMs and CLLMs across the four datasets in Table 3.


From the table, there is a consistent 2.0x to 6.8x improvement in both fast-forwarded and stationary token counts across all four datasets. In particular, the improvement is much more significant on domain-specific datasets than on the open-domain dataset profiled on MT-bench. The results align with the observations from Section 3.3: specialized domains like coding exhibit more distinctive collocations and easy syntactical structures such as blank spaces, newline tokens, and repetitive special characters (as demonstrated in Figure 2), whereas open-domain conversations in ShareGPT and MT-bench contain a significantly more diverse set of collocations.


Table 3. Profiling results for fast-forwarded and stationary token counts in fine-tuned models and CLLMs. The numbers are reported per n-token sequence, with the best-performing model and its accompanying n-gram size. The fast-forwarded token count reported in the table includes the one token that would be predicted correctly even without fast-forwarding.
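To make the quantities profiled in Table 3 concrete, the sketch below shows one way such counts can be obtained by comparing a single Jacobi iterate against the converged fixed point of the same n-token window. This is a minimal illustration, not the authors' profiling code; `count_fast_forwarded` and `find_stationary_candidates` are hypothetical helpers operating on plain token-ID lists.

```python
# Minimal sketch (assumption: tokens are plain integer IDs) of counting
# fast-forwarded tokens and stationary-token candidates for one Jacobi iteration.

from typing import List

def count_fast_forwarded(new_iterate: List[int], fixed_point: List[int]) -> int:
    """Length of the accepted prefix after this iteration, i.e. leading tokens
    that already agree with the fixed point. Includes the single token that
    greedy AR decoding would have produced correctly anyway."""
    count = 0
    for predicted, target in zip(new_iterate, fixed_point):
        if predicted == target:
            count += 1
        else:
            break
    return count

def find_stationary_candidates(new_iterate: List[int], fixed_point: List[int]) -> List[int]:
    """Positions beyond the accepted prefix whose tokens already equal the fixed
    point. They count as stationary only if they keep that value in all later
    iterations, which a profiler would verify across the full trajectory."""
    prefix = count_fast_forwarded(new_iterate, fixed_point)
    return [i for i in range(prefix, len(new_iterate))
            if new_iterate[i] == fixed_point[i]]

# Toy example: a 6-token window where two leading tokens are accepted and
# position 4 already holds its final value.
fixed_point = [11, 12, 13, 14, 15, 16]
new_iterate = [11, 12, 99, 98, 15, 97]
print(count_fast_forwarded(new_iterate, fixed_point))       # 2
print(find_stationary_candidates(new_iterate, fixed_point))  # [4]
```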


4.3. Ablation Studies

Dataset sizes and generalizability. In Section 3.2.1, Jacobi trajectory datasets are collected to train models for efficient Jacobi decoding. Table 4 demonstrates that larger Jacobi trajectory datasets bring more significant speedup, and that the speedup gradually saturates as the dataset size scales. Moreover, CLLMs trained with more data perform well even at n-token sequence lengths they were not trained on, which adds deployment-time robustness.


Different lengths of n-token sequence. We investigate how different n-token sequence lengths in the Jacobi trajectory dataset affect CLLMs' performance on GSM8K. We employ varying lengths to generate the Jacobi trajectory dataset and train the CLLMs accordingly. Figure 3 illustrates that CLLMs consistently maintain generation quality when trained with different lengths. In practice, however, longer sequence lengths come at the cost of increased computational overhead during inference. As Figure 3 shows, a significant degradation in inference speed can be observed when the n-token sequence length exceeds 64.
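The speedup reported in Figure 3 is a wall-clock throughput ratio, so it can be reproduced with a very small harness. Below is a minimal sketch assuming hypothetical `generate_jacobi` and `generate_ar` callables that return the generated token IDs for a prompt; it illustrates the measurement, not the authors' benchmarking code.

```python
# Sketch of measuring speedup as the ratio of wall-clock generation throughput
# under Jacobi decoding to that under baseline autoregressive (AR) decoding.

import time
from typing import Callable, List

def throughput(generate: Callable[[str], List[int]], prompts: List[str]) -> float:
    """Tokens generated per second of wall-clock time over a batch of prompts."""
    start = time.perf_counter()
    total_tokens = sum(len(generate(p)) for p in prompts)
    return total_tokens / (time.perf_counter() - start)

def speedup(generate_jacobi: Callable[[str], List[int]],
            generate_ar: Callable[[str], List[int]],
            prompts: List[str]) -> float:
    """Jacobi-decoding throughput divided by baseline AR throughput."""
    return throughput(generate_jacobi, prompts) / throughput(generate_ar, prompts)

# Toy stand-ins so the sketch runs end to end; real runs would call the CLLM.
dummy_jacobi = lambda p: list(range(256))
dummy_ar = lambda p: list(range(256))
print(f"speedup ~ {speedup(dummy_jacobi, dummy_ar, ['prompt'] * 4):.2f}x")
```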


Loss design. We adjust the ratio of consistency loss to autoregressive loss described in Section 3.2.2 and evaluate the different loss ratios on GSM8K. As illustrated in Table 6, placing more emphasis on the autoregressive loss does enhance accuracy, though it slightly compromises the speedup gains. Additionally, we compare the efficacy of CLLMs trained with the global consistency loss versus the local consistency loss. Table 6 demonstrates that the global loss is more effective for training CLLMs.
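As a rough illustration of this loss mixing, the PyTorch sketch below combines a consistency term, which pulls the model's predictions on an intermediate Jacobi iterate toward its detached predictions on a target state, with a weighted autoregressive cross-entropy term. The `ar_weight` knob and the choice of target state (fixed point for the global variant, next iterate for the local variant) are illustrative assumptions; see Section 3.2.2 for the authors' exact formulation.

```python
# Sketch of consistency + AR loss mixing, assuming precomputed logits tensors.

import torch
import torch.nn.functional as F

def consistency_loss(logits_intermediate: torch.Tensor,
                     logits_target_state: torch.Tensor) -> torch.Tensor:
    """KL divergence from the (detached) target-state distribution to the
    intermediate-state distribution. For the global variant the target logits
    come from the fixed point; for the local variant, from the next iterate."""
    log_p = F.log_softmax(logits_intermediate, dim=-1)
    q = F.softmax(logits_target_state.detach(), dim=-1)  # stop-gradient on target
    return F.kl_div(log_p, q, reduction="batchmean")

def total_loss(logits_intermediate, logits_fixed_point, ar_logits, ar_labels,
               ar_weight: float = 1.0) -> torch.Tensor:
    """Consistency loss plus weighted AR loss; raising `ar_weight` trades some
    speedup for accuracy, matching the trend reported in Table 6."""
    l_consistency = consistency_loss(logits_intermediate, logits_fixed_point)
    l_ar = F.cross_entropy(ar_logits.flatten(0, 1), ar_labels.flatten())
    return l_consistency + ar_weight * l_ar

# Toy shapes (batch 2, window 16, vocab 32) with random tensors, just to run.
B, N, V = 2, 16, 32
loss = total_loss(torch.randn(B, N, V), torch.randn(B, N, V),
                  torch.randn(B, N, V), torch.randint(0, V, (B, N)))
print(loss.item())
```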


4.4. Limitations and Discussion

In our experiments, we observe that achieving significant speedup while maintaining good generation quality with a CLLM relies strongly on having a high-quality Jacobi trajectory dataset. Data cleaning is therefore crucial, as discussed in Section 3.2.1. Dataset size also plays a role, as described in Section 4.3 and shown in Table 4, although to a lesser extent. For instance, Jacobi trajectories generated with only 10% of the Code-Search-Net Python dataset are able to yield a 2.9x speedup, as demonstrated in Table 2. However, for open-domain datasets like ShareGPT, more data is necessary for improved efficiency.


Figure 3. Accuracy and speedup of models trained with different n-token sequence lengths on the GSM8K dataset. The sequence length used for generation matches the training setting. Speedup is measured as the ratio of the wall-clock generation throughput with Jacobi decoding to that of baseline AR decoding.


In our proposed method and experiments, we primarily use output sequences from the teacher (Kim & Rush, 2016) to collect Jacobi trajectories and train a CLLM. This introduces some additional overhead compared with conventional model training. On-policy GKD, proposed in Agarwal et al. (2023), suggests that LLM distillation using a mixture of teacher and student samples, or even student samples alone, can yield high-performance models. One mitigation is therefore to use n-token sequences generated by the trained model itself as the training samples. This removes the Jacobi trajectory collection overhead, making our proposed method potentially feasible for pre-training.


Table 4. Comparison of the performance of CLLMs trained with different sizes of Jacobi trajectory datasets on ShareGPT.


Table 5. CLLMs' performance versus the fine-tuned baseline on language modeling tasks.


Results from our language modeling experiments, as detailed in Table 5, demonstrate that the CLLM remains robust when trained on pre-training jobs while delivering a notable speedup. By incorporating on-policy GKD, it is conceivable that a modified version of our proposed method could be employed for LLM pre-training. Such a modification would equip the pre-trained model with both the strong language modeling capability that existing models possess and a high generation speed when employing Jacobi decoding for inference. We leave adapting CLLMs to pre-training jobs for future work.
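To illustrate the on-policy mitigation discussed above, the sketch below collects Jacobi trajectories from the student model itself rather than from a fixed teacher, in the spirit of on-policy GKD (Agarwal et al., 2023). `student_jacobi` is a hypothetical callable that runs fixed-point iteration on one n-token window and returns all intermediate iterates plus the converged fixed point; this is a sketch of the idea, not the authors' pipeline.

```python
# Sketch of on-policy Jacobi trajectory collection, under the assumption that the
# partially trained student exposes its own Jacobi decoding routine.

from typing import Callable, Dict, List, Tuple

def collect_on_policy_trajectories(
    student_jacobi: Callable[[List[int], int], Tuple[List[List[int]], List[int]]],
    prompts: List[List[int]],
    n: int = 32,
) -> List[Dict]:
    """For each prompt, run the student's own Jacobi decoding and record
    (intermediate iterates, fixed point) pairs as consistency-training examples,
    removing the need for a separate teacher-generation pass."""
    dataset = []
    for prompt in prompts:
        iterates, fixed_point = student_jacobi(prompt, n)
        dataset.append({"prompt": prompt,
                        "iterates": iterates,
                        "fixed_point": fixed_point})
    return dataset

# Toy stand-in so the sketch runs; a real `student_jacobi` would call the CLLM.
toy = lambda prompt, n: ([[0] * n, list(range(n))], list(range(n)))
print(len(collect_on_policy_trajectories(toy, [[1, 2, 3]], n=4)))
```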

This paper is available on arxiv under CC0 1.0 Universal license.

