Authors:
(1) Siqi Kou, Shanghai Jiao Tong University (Equal contribution);
(2) Lanxiang Hu, University of California, San Diego (Equal contribution);
(3) Zhezhi He, Shanghai Jiao Tong University;
(4) Zhijie Deng, Shanghai Jiao Tong University;
(5) Hao Zhang, University of California, San Diego.
Table of Links
3. Methodology and 3.1. Preliminary: Jacobi Decoding
3.2. Consistency Large Language Models (CLLMs)
3.3. Acceleration Mechanisms in CLLMs
4. Experiments
4.2. Acceleration Mechanisms in CLLMs
4.4. Limitations and Discussion
5. Conclusion, Impact Statement, and References
A. Illustration of Consistency Loss Learning Objectives
B. Comparison with Baseline Algorithms
C. Pseudo Code for Jacobi Decoding with KV Cache
4.2. Acceleration Mechanisms in CLLMs
Building on the insights from Section 3.3, we investigate the fast-forwarding phenomenon and the emergence of stationary tokens in Jacobi decoding to provide further empirical evidence for our hypothesis. We compare fast-forwarded and stationary token counts in target LLMs and CLLMs across the four datasets in Table 3.
From the table, there is a consistent 2.0x to 6.8x improvement in both fast-forwarded and stationary token counts across all four datasets. In particular, the improvement on domain-specific datasets is much more significant than on the open-domain conversations profiled on MT-bench. These results align with the observations from Section 3.3: specialized domains like coding exhibit more distinctive collocations and easy syntactical structures, such as blank spaces, newline tokens, and repetitive special characters, as demonstrated in Figure 2, whereas open-domain conversations in ShareGPT and MT-bench contain a significantly more diverse set of collocations.
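For concreteness, the sketch below shows one way to tally fast-forwarded and stationary tokens over a single Jacobi trajectory. It is a minimal illustration, assuming the trajectory is stored as a list of n-token tensor states ending at the converged fixed point; the exact counting rules behind Table 3 may differ, and the function name is ours.

```python
import torch

def count_fast_forwarded_and_stationary(trajectory, fixed_point):
    """Tally fast-forwarded and stationary tokens for one Jacobi trajectory.

    trajectory: list of 1-D LongTensors of length n, one per Jacobi iteration,
                with trajectory[-1] equal to `fixed_point`.
    fixed_point: the converged n-token sequence (1-D LongTensor of length n).
    """
    n = fixed_point.shape[0]
    fast_forwarded = 0
    stationary = 0
    prefix_len = 0  # length of the prefix already matching the fixed point

    for prev_state, state in zip(trajectory[:-1], trajectory[1:]):
        # Fast-forwarding: how many extra tokens the correct prefix advances by
        # in this single iteration (AR decoding would advance by exactly one).
        new_prefix_len = prefix_len
        while new_prefix_len < n and bool(state[new_prefix_len] == fixed_point[new_prefix_len]):
            new_prefix_len += 1
        if new_prefix_len - prefix_len > 1:
            fast_forwarded += new_prefix_len - prefix_len - 1
        prefix_len = new_prefix_len

        # Stationary tokens: positions beyond the correct prefix that were
        # already correct before this iteration and remain unchanged after it.
        tail = slice(prefix_len, n)
        stayed = (prev_state[tail] == fixed_point[tail]) & (state[tail] == fixed_point[tail])
        stationary += int(stayed.sum())

    return fast_forwarded, stationary
```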
4.3. Ablation Studies
Dataset sizes and generalizability. In Section 3.2.1, Jacobi trajectory datasets are collected to train for efficient Jacobi decoding. Table 4 demonstrates that larger Jacobi trajectory datasets bring more significant speedup, and that the speedup gradually saturates as the dataset size scales. Moreover, CLLMs trained with more data perform well even at n-token sequence lengths they were not trained on, which yields more deployment-time robustness.
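As a reference point for how such a dataset is produced, below is a minimal sketch of greedy Jacobi trajectory collection for a single prompt. It assumes a Hugging Face-style causal LM whose forward pass returns `.logits`, and it omits the KV cache and the data-cleaning steps used in the actual pipeline (Section 3.2.1); the function and argument names are illustrative.

```python
import torch

@torch.no_grad()
def collect_jacobi_trajectory(model, prompt_ids, n, pad_id, max_iters=100):
    """Run greedy Jacobi iterations on one n-token block and record every state.

    prompt_ids: LongTensor of shape [1, prompt_len] on the model's device.
    Returns a list of [1, n] states; the last one is the fixed point, which
    matches greedy autoregressive decoding of the same n tokens.
    """
    block = torch.full((1, n), pad_id, dtype=torch.long, device=prompt_ids.device)
    trajectory = [block.clone()]

    for _ in range(max_iters):
        inputs = torch.cat([prompt_ids, block], dim=1)
        logits = model(inputs).logits
        # Parallel (Jacobi) update: refresh every block position at once,
        # each conditioned on the current, possibly incorrect, previous state.
        next_block = logits[:, prompt_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        trajectory.append(next_block.clone())
        if torch.equal(next_block, block):  # fixed point reached
            break
        block = next_block

    return trajectory
```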
Different lengths of n-token sequence. We investigate how different n-token sequence lengths in the Jacobi trajectory dataset affect CLLMs' performance on GSM8K. We employ varying lengths to generate the Jacobi trajectory dataset and train the CLLMs accordingly. Figure 3 illustrates that CLLMs consistently maintain generation quality when trained with different lengths. In practice, however, longer sequence lengths come at the cost of increased computational overhead during inference: in Figure 3, a significant degradation in inference speed can be observed when the n-token sequence length exceeds 64.
Loss design. We adjust the ratio of consistency loss to autoregressive loss described in Section 3.2.2 and evaluate the performance of different loss ratios on GSM8K. As illustrated in Table 6, increasing the emphasis on the autoregressive loss does enhance accuracy, though it slightly compromises the speedup gains. Additionally, we compare the efficacy of CLLMs trained with the global consistency loss versus the local consistency loss. Table 6 demonstrates that the global loss is more effective for training CLLMs.
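For illustration, the sketch below shows one way such a weighted combination of the two terms might be written. The argument names, the weighting scheme, and the KL-based distance are assumptions for exposition; the actual global and local consistency objectives are defined in Section 3.2.2.

```python
import torch
import torch.nn.functional as F

def cllm_loss(logits_on_intermediate, probs_on_fixed_point, logits_ar, labels_ar,
              w_consistency=1.0, w_ar=1.0):
    """Weighted sum of a (global-style) consistency term and the AR term.

    logits_on_intermediate: model logits when fed an intermediate Jacobi state.
    probs_on_fixed_point:   model probabilities on the fixed point, treated as
                            the target and therefore detached from the graph.
    logits_ar, labels_ar:   logits and next-token labels for the AR loss.
    """
    # Consistency term: pull the prediction on an intermediate state toward
    # the prediction on the trajectory's fixed point.
    consistency = F.kl_div(
        F.log_softmax(logits_on_intermediate, dim=-1),
        probs_on_fixed_point.detach(),
        reduction="batchmean",
    )
    # AR term: ordinary next-token cross-entropy.
    ar = F.cross_entropy(
        logits_ar.reshape(-1, logits_ar.size(-1)),
        labels_ar.reshape(-1),
        ignore_index=-100,
    )
    return w_consistency * consistency + w_ar * ar
```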
4.4. Limitations and Discussion
In our experiments, we observe that achieving significant speedup while maintaining good generation quality with a CLLM relies strongly on having a high-quality Jacobi trajectory dataset. Therefore, data cleaning is crucial, as discussed in Section 3.2.1. Dataset size also plays a role, as described in Section 4.3 and shown in Table 4, although to a lesser extent. For instance, Jacobi trajectories generated with only 10% of the Code-Search-Net Python dataset are able to yield a 2.9× speedup, as demonstrated in Table 2. However, for open-domain datasets like ShareGPT, more data is necessary for improved efficiency.
In our proposed method and experiments, we primarily use output sequences from the teacher (Kim & Rush, 2016) to collect Jacobi trajectories and train a CLLM. This introduces some additional overhead compared with conventional model training. On-policy GKD (Agarwal et al., 2023) suggests that LLM distillation using a mixture of teacher and student samples, or even student samples alone, can yield high-performance models. One mitigation is therefore to use n-token sequences generated by the trained model itself as the training samples, as sketched below. This removes the Jacobi trajectory collection overhead, making our proposed method potentially feasible for pre-training.
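A minimal sketch of this on-policy variant, reusing the hypothetical `collect_jacobi_trajectory` helper from the sketch in Section 4.3 above, might look as follows; all names here are illustrative rather than part of our released code.

```python
def collect_on_policy_trajectories(student, prompts, n, pad_id):
    """Let the CLLM being trained generate its own Jacobi trajectories,
    in the spirit of on-policy GKD (Agarwal et al., 2023)."""
    student.eval()
    batch = [collect_jacobi_trajectory(student, p, n, pad_id) for p in prompts]
    student.train()
    return batch
```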
Results from our language modeling experiments, detailed in Table 5, demonstrate that a CLLM trained on a pre-training-style job remains robust while delivering a notable speedup. By incorporating on-policy GKD, it is conceivable that a modified version of our proposed method could be employed for LLM pre-training. Such a modification would equip the pre-trained model with both strong language modeling capability, as existing models possess, and high generation speed when employing Jacobi decoding for inference. We leave adapting CLLMs to pre-training jobs for future work.
This paper is available on arxiv under CC0 1.0 Universal license.