LLM Probabilities, Training Size, and Perturbation Thresholds in Entity Recognition

Authors:

(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway (corresponding author: anthip@ifi.uio.no);

(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;

(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;

(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;

(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.

Table of Links

Abstract and 1 Introduction

2 Background

2.1 Definitions

2.2 NLP Approaches

2.3 Privacy-Preserving Data Publishing

2.4 Differential Privacy

3 Datasets and 3.1 Text Anonymization Benchmark (TAB)

3.2 Wikipedia Biographies

4 Privacy-oriented Entity Recognizer

4.1 Wikidata Properties

4.2 Silver Corpus and Model Fine-tuning

4.3 Evaluation

4.4 Label Disagreement

4.5 MISC Semantic Type

5 Privacy Risk Indicators

5.1 LLM Probabilities

5.2 Span Classification

5.3 Perturbations

5.4 Sequence Labelling and 5.5 Web Search

6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics

6.2 Experimental Results and 6.3 Discussion

6.4 Combination of Risk Indicators

7 Conclusions and Future Work

Declarations

References

Appendices

A. Human properties from Wikidata

B. Training parameters of entity recognizer

C. Label Agreement

D. LLM probabilities: base models

E. Training size and performance

F. Perturbation thresholds

A Human properties from Wikidata

The two tables below show the selected Wikidata properties mentioned in Section 4.1 that constitute the DEM and MISC gazetteers.

DEM-related properties

MISC-related properties

B Training parameters of entity recognizer

C Label Agreement

Frequently confused label pairs (see Section 4.4) are shown in Figure 4.

D LLM probabilities: base models

Table 11 lists the (ordered) base models that the AutoGluon tabular predictor employs for the LLM-probability-based approach of Section 5.1.

E Training size and performance

Figure 5 shows the F1 score of both the Tabular and the Multimodal AutoGluon predictors (LLM probabilities, Section 5.1, and span classification, Section 5.2, respectively) at different training sizes for both datasets. For each dataset, we train on random samples ranging from 1% to 100% of the training split.
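The training-size experiment can be illustrated with a minimal sketch: draw a random fraction of the training split, fit a model on it, and record F1 on a fixed evaluation set. The code below is not the authors' pipeline. The function names, the fixed seed, and the toy `midpoint_fit_predict` learner are all hypothetical stand-ins; in the paper's setting the learner would be an AutoGluon predictor.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[float, int]  # (feature value, binary label); a toy stand-in


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def subsample(train: Sequence, fraction: float, seed: int = 0) -> list:
    """Random sample of `fraction` of the training examples (at least one)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(train) * fraction))
    return random.Random(seed).sample(list(train), k)


def midpoint_fit_predict(train: Sequence[Example], test_x: Sequence[float]) -> List[int]:
    """Hypothetical toy learner: cut at the midpoint between the class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:
        # Degenerate sample with a single class: predict that class everywhere.
        return [1 if pos else 0] * len(test_x)
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return [1 if x >= cut else 0 for x in test_x]


def learning_curve(
    train: Sequence[Example],
    test_x: Sequence[float],
    test_y: Sequence[int],
    fit_predict: Callable[[Sequence[Example], Sequence[float]], List[int]],
    fractions: Sequence[float],
    seed: int = 0,
) -> Dict[float, float]:
    """Map each training fraction to the F1 obtained on the evaluation set."""
    curve = {}
    for fraction in fractions:
        subset = subsample(train, fraction, seed)
        curve[fraction] = f1_score(test_y, fit_predict(subset, test_x))
    return curve
```

Calling `learning_curve(train, test_x, test_y, midpoint_fit_predict, [0.01, 0.1, 0.5, 1.0])` returns one F1 value per fraction, which is the kind of series plotted against training size. Averaging over several seeds would give a smoother curve; the source does not say whether the authors did so.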

F Perturbation thresholds

Figure 6 shows the performance of different perturbation thresholds on the training split of both datasets, with the black line indicating the threshold used for evaluation in Section 5.3.
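A threshold sweep of this kind can be sketched as follows. The sketch assumes that each text span has a numeric perturbation score and a gold masking decision, and that a span is predicted as high-risk when its score is at or above the threshold. The scoring rule, the use of F1 as the selection metric, and all function names are assumptions made for illustration, not details taken from Section 5.3.

```python
from typing import List, Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], gold: Sequence[bool], threshold: float) -> float:
    """F1 of the rule `score >= threshold` against gold masking decisions."""
    pred = [s >= threshold for s in scores]
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    if tp == 0:
        return 0.0
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)


def sweep_thresholds(
    scores: Sequence[float], gold: Sequence[bool], thresholds: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (threshold, F1) pairs in the order the thresholds were given."""
    return [(t, f1_at_threshold(scores, gold, t)) for t in thresholds]


def best_threshold(
    scores: Sequence[float], gold: Sequence[bool], thresholds: Sequence[float]
) -> float:
    """Pick the threshold with the highest F1 (first one wins on ties)."""
    return max(sweep_thresholds(scores, gold, thresholds), key=lambda pair: pair[1])[0]
```

The threshold chosen on the training split would then be held fixed when scoring the evaluation data, so that the evaluation set plays no part in the selection.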

This paper is available on arXiv under the CC BY 4.0 DEED license.