Table of Links
2. Background
2.1 Effective Tutoring Practice
2.2 Feedback for Tutor Training
2.3 Sequence Labeling for Feedback Generation
2.4 Large Language Models in Education
3. Method
3.1 Dataset and 3.2 Sequence Labeling
3.3 GPT Facilitated Sequence Labeling
4. Results
6. Limitations and Future Work
APPENDIX
B. Input for Fine-Tuning GPT-3.5
C. Scatter Matrix of the Correlation on the Outcome-based Praise
D. Detailed Results of Fine-Tuned GPT-3.5 Model's Performance
3.4 Metrics
3.4.1 Modified Intersection over Union Scores
In sequence labeling tasks, traditional metrics such as the F1 score, as depicted in Equation 1, are commonly used to assess model performance [4]. In the context of our study, True Positives (TP) are the tokens correctly identified as praise by the model; False Positives (FP) are tokens incorrectly classified as praise, often resulting from the model predicting additional words as part of the praise; and False Negatives (FN) are tokens that belong to praise but were overlooked by the model, indicative of missed praise components. Previous research [40] has highlighted that certain additional entities identified by the model can still contribute meaningfully to human tutors’ understanding of response correctness. For instance, as illustrated in Table 3, the first row shows expert annotations of effort-based praise, while the subsequent rows (2-5) illustrate possible model-generated outputs. Notably, rows 2 to 4, despite including additional words for effort-based praise (i.e., FP), offer valuable insights that could assist tutors; in contrast, row 5, where the model highlights merely “great” and misses the remaining praise tokens (i.e., FN), fails to clearly convey the type of praise intended. This observation suggests the need for a metric that evaluates additionally identified praise tokens more flexibly. However, the F1 score, as shown in Equation 1, applies the same weight to both FP and FN, which diverges from our requirement for a more nuanced metric. Consequently, we adopt the Intersection over Union (IoU) concept, commonly used in the computer vision domain, to better suit the evaluation needs of our task.
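For reference, a standard formulation of the F1 score consistent with this description (a reconstruction in terms of TP, FP, and FN; the paper’s Equation 1 is not reproduced in this excerpt):

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```

As the combined form makes explicit, FP and FN enter the denominator with equal weight, which is the symmetry the M-IoU metric below is designed to relax.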
The Intersection over Union (IoU) metric, frequently applied in object detection and segmentation tasks as depicted in Equation 2, quantifies the extent of overlap between predicted and actual annotations, offering a balanced approach to assessing model performance [41, 6]. In the context of sequence labeling, the ‘Area of Overlap’ (i.e., TP) corresponds to the tokens the model accurately identifies as praise, whereas the ‘Area of Union’ includes all tokens labeled as praise by the model (TP and FP) along with all actual praise tokens in the ground truth (TP and FN). Since we recognize the significance of additionally detected words in our study, we propose a Modified Intersection over Union (M-IoU) metric (shown in Equation 3) to refine the IoU metric further. This modification incorporates a weight coefficient, α, which reduces the influence of FPs on the overall performance score, thus introducing flexibility toward additionally identified words without neglecting the potential for inaccuracies. The coefficient α is a non-negative real number that lets users adjust the tolerance for additionally identified words: a higher α enforces a stricter penalty on FPs, while a lower value is more lenient. In our analysis, α is set to 0.2 based on our observation of expert annotations. Notably, when a response lacks praise and the model’s prediction concurs (i.e., TP + FP + FN equals 0), the model and the ground truth agree perfectly that no relevant praise tokens are present; this scenario typically arises when novice tutors provide irrelevant responses (e.g., “Can I see any of your writing”). Such irrelevant responses underscore the necessity of our explanatory feedback system in guiding tutors to give effective praise. For this case, we adjust the M-IoU formula to directly assign a score of 1, reflecting perfect agreement and demonstrating the adaptability of M-IoU in evaluating model predictions even in the absence of praise.
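As a concrete illustration, the following Python sketch computes a token-level M-IoU from TP, FP, and FN counts. Since Equation 3 is not reproduced in this excerpt, the denominator TP + α·FP + FN is an assumption consistent with the description above (it reduces to standard IoU when α = 1), and the zero-count case returns 1 as stated in the text.

```python
def m_iou(tp: int, fp: int, fn: int, alpha: float = 0.2) -> float:
    """Modified Intersection over Union (M-IoU) for token-level praise labeling.

    Sketch based on the paper's description of Equation 3: the non-negative
    weight alpha down-weights false positives so that extra highlighted tokens
    are penalized less than missed praise tokens. Standard IoU corresponds to
    tp / (tp + fp + fn); the exact published formula may differ.
    """
    if tp + fp + fn == 0:
        # No praise in the ground truth and none predicted:
        # treated as perfect agreement, so the score is 1.
        return 1.0
    return tp / (tp + alpha * fp + fn)


# Example: 4 praise tokens correctly highlighted, 2 extra tokens highlighted,
# 1 praise token missed, with alpha = 0.2 as used in the paper.
print(m_iou(tp=4, fp=2, fn=1))  # 4 / (4 + 0.4 + 1) ≈ 0.74
```

With α = 0.2, the two extra tokens cost far less than the one missed token, which matches the intended tolerance toward additionally highlighted words.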
3.4.2 Human annotation and correlation analysis
To assess the efficacy of our proposed M-IoU score, we undertook a rigorous human annotation process to rate the quality of the identified praise components within tutor responses. The human rating scores were then compared with the M-IoU scores to ensure that M-IoU not only holds computational validity but also aligns with human judgments regarding the praise components in tutoring responses. Recognizing the importance of human judgment in our study, we hired two additional human coders to scrutinize the highlighted praise components in tutor responses. These coders attended a detailed annotation training session and completed the Giving Effective Praise lesson, equipping them with the necessary background to perform their evaluations effectively.
Before the coders began their rating tasks, we randomized the presentation order of the highlighted texts generated by the GPT models and the expert annotations for each tutor response. This randomization made the position of the expert annotations unpredictable, mitigating potential bias in the coders’ evaluations. Inspired by [54], we instructed the coders to assess each highlighted response using two questions: Question 1: “Do you think the highlighted text provides enough information to identify praise on effort?” and Question 2: “Do you think the highlighted text provides enough information to identify praise on the outcome?”. These questions captured the coders’ assessments of whether the highlighted texts adequately conveyed praise for either the student’s effort or the outcome of their work. The coders rated each item on a five-point Likert scale: 1 - Strongly Disagree, 2 - Disagree, 3 - Neutral, 4 - Agree, 5 - Strongly Agree.
Upon completing the annotations, we calculated the average score for each response, providing a quantitative measure of the consensus between the coders on the effectiveness of the highlighted praise text. To determine the effectiveness of the M-IoU score as a metric for evaluating model predictions, we conducted a correlation analysis using Pearson’s r to measure the strength and direction of the linear relationship between the M-IoU scores and the human coders’ ratings. This correlation analysis helps us understand how well the M-IoU score aligns with human judgment and its potential as a surrogate metric for evaluating model performance in identifying praise components.
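A minimal sketch of this correlation step, assuming the per-response M-IoU scores and the coders’ averaged Likert ratings are stored as parallel lists (variable names and the numeric values below are illustrative placeholders, not the paper’s data or code):

```python
from scipy.stats import pearsonr


def correlate_miou_with_ratings(m_iou_scores, avg_human_ratings):
    """Pearson's r between per-response M-IoU scores and the coders'
    averaged five-point Likert ratings for the same responses."""
    r, p_value = pearsonr(m_iou_scores, avg_human_ratings)
    return r, p_value


# Illustrative placeholder values only.
m_iou_scores = [1.0, 0.74, 0.55, 0.90, 0.30]
avg_human_ratings = [5.0, 4.5, 3.5, 4.5, 2.0]
r, p = correlate_miou_with_ratings(m_iou_scores, avg_human_ratings)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```

A strong positive r would indicate that higher M-IoU scores track higher human judgments of the highlighted praise, supporting M-IoU as a surrogate for human rating.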
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Jionghao Lin, Carnegie Mellon University (jionghal@cs.cmu.edu);
(2) Eason Chen, Carnegie Mellon University (easonc13@cmu.edu);
(3) Zeifei Han, University of Toronto (feifei.han@mail.utoronto.ca);
(4) Ashish Gurung, Carnegie Mellon University (agurung@andrew.cmu.edu);
(5) Danielle R. Thomas, Carnegie Mellon University (drthomas@cmu.edu);
(6) Wei Tan, Monash University (wei.tan2@monash.edu);
(7) Ngoc Dang Nguyen, Monash University (dan.nguyen2@monash.edu);
(8) Kenneth R. Koedinger, Carnegie Mellon University (koedinger@cmu.edu).