Table of Links
2. Background
2.1 Effective Tutoring Practice
2.2 Feedback for Tutor Training
2.3 Sequence Labeling for Feedback Generation
2.4 Large Language Models in Education
3. Method
3.1 Dataset and 3.2 Sequence Labeling
3.3 GPT Facilitated Sequence Labeling
4. Results
6. Limitations and Future Works
APPENDIX
B. Input for Fine-Tuning GPT-3.5
C. Scatter Matrix of the Correlation on the Outcome-based Praise
D. Detailed Results of Fine-Tuned GPT-3.5 Model's Performance
ABSTRACT
Automated explanatory feedback systems play a crucial role in facilitating learning for a large cohort of learners by offering feedback that incorporates explanations, significantly enhancing the learning process. However, delivering such explanatory feedback in real-time poses challenges, particularly when high classification accuracy for domain-specific, nuanced responses is essential. Our study leverages the capabilities of large language models, specifically Generative Pre-Trained Transformers (GPT), to explore a sequence labeling approach focused on identifying components of desired and less desired praise for providing explanatory feedback within a tutor training dataset. Our aim is to equip tutors with actionable, explanatory feedback during online training lessons. To investigate the potential of GPT models for providing explanatory feedback, we employed two commonly used approaches: prompting and fine-tuning. To quantify the quality of highlighted praise components identified by GPT models, we introduced a Modified Intersection over Union (M-IoU) score. Our findings demonstrate that: (1) the M-IoU score effectively correlates with human judgment in evaluating sequence quality; (2) two-shot prompting on GPT-3.5 yielded decent performance in recognizing effort-based (M-IoU of 0.46) and outcome-based praise (M-IoU of 0.68); and (3) our optimally fine-tuned GPT-3.5 model achieved M-IoU scores of 0.64 for effort-based praise and 0.84 for outcome-based praise, aligning with the satisfaction levels evaluated by human coders. Our results show promise for using GPT models to provide feedback that highlights the specific elements of tutors' open-ended responses that are desirable or could use improvement.
1. INTRODUCTION
Tutoring is an important instructional method that can be highly effective in supporting students. Tutors utilize various tutoring strategies to effectively facilitate learning opportunities [32, 48, 39]. While the effectiveness of tutoring is widely recognized, various logistical challenges have restricted its widespread implementation. Specifically, the recruitment, training, and retention of tutors have presented major hurdles [55]. Training tutors can be highly resource-intensive and often requires hands-on training from experienced tutors. A key component of effective tutor training involves helping novice tutors learn effective tutoring strategies [40, 55]. For instance, instead of simply acknowledging an incorrect answer, effective tutors often engage with the student to identify the underlying misconceptions or gaps in knowledge that provide additional context for the incorrect answer. This contextual insight can then assist the tutor in providing more effective support. Traditionally, these types of nuanced insights have been facilitated through hands-on training from more experienced tutors. However, the scalability of this hands-on approach remains a well-recognized limitation [40, 24, 27, 38], necessitating innovative solutions that extend this model of tutor training without compromising the quality of feedback.
In response to the growing need for scalable hands-on support in tutor training, researchers have increasingly turned to automated feedback systems. The integration of such systems to enhance feedback is well-established within Educational Data Mining (EDM), with numerous studies demonstrating their efficacy [37, 2, 17, 50]. While many implementations have employed AI algorithms to generate automated feedback [5], the specific application to tutor training remains underexplored. In this emerging field, the development of automated explanatory feedback systems designed for tutors presents a promising avenue. An illustrative example is the work of [40], which utilized the BERT language model [12] to enhance tutor training. Although the results showed potential, a significant challenge emerged: the BERT model was hampered by a lack of access to extensive datasets, limiting its ability to offer precise, context-specific feedback. This challenge is similarly problematic for other traditional models such as Conditional Random Fields (CRF) and Hidden Markov Models (HMM), which also require adequate domain-specific training data [47, 43, 46]. Recent advances in large language models (LLMs) present a viable solution to these challenges. LLMs, such as the Generative Pre-trained Transformers (GPT) developed by OpenAI, are pre-trained on vast and diverse datasets, enabling them to generalize effectively across different domains without extensive task-specific data. The inherent adaptability of GPT models to dynamically adjust to specific contextual scenarios makes them well-suited for developing real-time, tailored feedback systems for tutor training, offering the adaptive, hands-on support that models like BERT could not.
By referencing recent LLM literature [26, 51, 28], we explore two approaches to leverage the potential of GPT models in educational contexts: prompting and fine-tuning. Prompting involves designing input queries that guide the GPT model to generate desired outputs by leveraging its pre-existing knowledge and capabilities [26, 28]. This approach is particularly useful for tasks requiring immediate, context-specific responses without the need for extensive model retraining. In comparison, fine-tuning adjusts the model's parameters on a targeted dataset, thereby optimizing its performance for specific tasks or domains [26, 28] (a minimal sketch of a fine-tuning training example follows the research questions below). The fine-tuning approach allows for more tailored response generation, closely aligned with the nuances of the given context. Both approaches exhibit significant promise in text comprehension and generation, suggesting their potential effectiveness in producing nuanced, explanatory feedback. Thus, our study aims to harness these approaches to unveil the full capacity of GPT models in automating the generation of high-quality explanatory feedback, thereby addressing a critical need in educational feedback systems. Driven by this, our study proposed two Research Questions (RQs):
RQ1: To what extent can we prompt the GPT models to enhance the prediction accuracy of providing explanatory feedback?
RQ2: To what extent can the fine-tuned GPT models enhance the prediction accuracy of providing explanatory feedback?
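To ground the fine-tuning approach referenced in RQ2, the following is a minimal sketch of what a single training example could look like in the JSONL chat format accepted by OpenAI's fine-tuning API for gpt-3.5-turbo. The highlight tags, instruction wording, and sample response here are illustrative assumptions, not the paper's actual annotation scheme (Appendix B documents the real fine-tuning input).

```python
import json

# One hypothetical training example in the JSONL chat format that OpenAI's
# fine-tuning API expects for gpt-3.5-turbo. The <effort>/<outcome> tag
# scheme and the instruction text are illustrative assumptions only.
example = {
    "messages": [
        {"role": "system",
         "content": "Mark effort-based praise with <effort>...</effort> and "
                    "outcome-based praise with <outcome>...</outcome>."},
        {"role": "user",
         "content": "Nice work, you kept trying until you solved it!"},
        {"role": "assistant",
         "content": "Nice work, <effort>you kept trying until you solved it</effort>!"},
    ]
}

# Each training example occupies one line of the JSONL file; the file is then
# uploaded and referenced when creating a job via
# client.fine_tuning.jobs.create(model="gpt-3.5-turbo", ...).
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```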
Through this work, we aim to offer a scalable solution that enhances tutor training programs and, ultimately, the learning experience for students. Our study developed an automated explanatory feedback system to highlight the correct and incorrect components of praise in novice tutor attempts, as illustrated in Figure 1. We implemented a sequence labeling method for highlighting the correct and incorrect components by prompting and fine-tuning GPT models. To evaluate the quality of the praise components highlighted by the GPT models in tutor responses, we introduced the Modified Intersection over Union (M-IoU) score, a metric designed for our task; a minimal sketch of how such a score can be computed follows below. Our results indicate a strong correlation between the M-IoU score and human evaluators' judgments regarding the quality of highlights, affirming the metric's reliability.
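The exact modification applied by the M-IoU score is specified in the paper's Method section; as a rough illustration only, the sketch below computes a token-level IoU between the model's highlighted tokens and the human coder's, with an assumed edge-case rule for empty highlights.

```python
def m_iou(pred_tokens: set[int], gold_tokens: set[int]) -> float:
    """Token-level IoU between predicted and human-annotated highlight spans.

    The empty-highlight handling is our assumption, not necessarily the
    paper's exact modification: if neither side highlights anything, the
    spans agree perfectly (1.0); if exactly one side is empty, they
    disagree entirely (0.0).
    """
    if not pred_tokens and not gold_tokens:
        return 1.0
    if not pred_tokens or not gold_tokens:
        return 0.0
    return len(pred_tokens & gold_tokens) / len(pred_tokens | gold_tokens)


# Example: the model highlights tokens 3-7 as effort-based praise,
# while the human coder highlighted tokens 4-8.
pred = set(range(3, 8))   # {3, 4, 5, 6, 7}
gold = set(range(4, 9))   # {4, 5, 6, 7, 8}
print(f"M-IoU = {m_iou(pred, gold):.2f}")  # 4 shared / 6 in union = 0.67
```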
In addressing RQ1, we employed a two-shot prompting method to prompt the GPT-3.5 and GPT-4 models to highlight desired and undesired components of praise in tutor responses. Notably, the GPT-3.5 model demonstrated performance on par with that of the GPT-4 model, exhibiting commendable accuracy in identifying effort-based (M-IoU of 0.46) and outcome-based praise (M-IoU of 0.68). These levels of accuracy are considered decent by human coders, highlighting the effectiveness of our prompting strategies (a minimal sketch of the two-shot setup follows below). For RQ2, we delved into fine-tuning the GPT models across a range of training sample sizes, from 10% of our dataset (13 samples) to 50% (65 samples), to gauge how fine-tuning influences the model's ability to enhance the precision of explanatory feedback. Because fine-tuning access was not available for GPT-4, our investigation focused on fine-tuning the GPT-3.5 model. The optimal fine-tuned GPT-3.5 model achieved M-IoU scores of 0.64 for effort-based praise and 0.84 for outcome-based praise, aligning with the satisfaction levels observed by human coders. Motivated by the effectiveness of our fine-tuned GPT model, we have built a demo of our automated explanatory feedback system.
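For readers unfamiliar with two-shot prompting, the sketch below shows one way to set it up with the OpenAI Python SDK: two worked examples precede the new tutor response in the message list. The system instruction, tag scheme, and utterances are hypothetical placeholders, not the paper's actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instruction and tag scheme; the paper's actual prompt differs.
SYSTEM = (
    "You annotate tutor praise. Wrap effort-based praise in <effort>...</effort> "
    "and outcome-based praise in <outcome>...</outcome>; leave other text as-is."
)

# The two "shots": worked examples for the model to imitate.
SHOTS = [
    {"role": "user", "content": "You worked really hard on this problem."},
    {"role": "assistant",
     "content": "<effort>You worked really hard on this problem.</effort>"},
    {"role": "user", "content": "Great job, you got the right answer!"},
    {"role": "assistant",
     "content": "Great job, <outcome>you got the right answer</outcome>!"},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # deterministic labeling
    messages=[{"role": "system", "content": SYSTEM}, *SHOTS,
              {"role": "user", "content": "I love how much effort you put in!"}],
)
print(response.choices[0].message.content)
```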
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Jionghao Lin, Carnegie Mellon University (jionghal@cs.cmu.edu);
(2) Eason Chen, Carnegie Mellon University (easonc13@cmu.edu);
(3) Zifei Han, University of Toronto (feifei.han@mail.utoronto.ca);
(4) Ashish Gurung, Carnegie Mellon University (agurung@andrew.cmu.edu);
(5) Danielle R. Thomas, Carnegie Mellon University (drthomas@cmu.edu);
(6) Wei Tan, Monash University (wei.tan2@monash.edu);
(7) Ngoc Dang Nguyen, Monash University (dan.nguyen2@monash.edu);
(8) Kenneth R. Koedinger, Carnegie Mellon University (koedinger@cmu.edu).