Authors:
(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway (corresponding author: anthip@ifi.uio.no);
(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;
(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.
Table of Links
2 Background
2.3 Privacy-Preserving Data Publishing
3 Datasets and 3.1 Text Anonymization Benchmark (TAB)
4 Privacy-oriented Entity Recognizer
4.2 Silver Corpus and Model Fine-tuning
5.4 Sequence Labelling and 5.5 Web Search
6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics
6.2 Experimental Results and 6.3 Discussion
6.4 Combination of Risk Indicators
Appendices
A. Human properties from Wikidata
B. Training parameters of entity recognizer
D. LLM probabilities: base models
E. Training size and performance
A Human properties from Wikidata
The two tables below show the selected Wikidata properties mentioned in Section 4.1 that constitute the DEM and MISC gazetteers.
DEM-related properties
MISC-related properties
B Training parameters of entity recognizer
C Label Agreement
Frequently confused label pairs (see Section 4.4) are shown in Figure 4.
D LLM probabilities: base models
Table 11 lists, in order, the base models that the AutoGluon tabular predictor employs for the LLM-probability-based approach of Section 5.1.
E Training size and performance
Figure 5 shows the F1 score of the Tabular and Multimodal AutoGluon predictors (LLM probabilities and span classification, respectively; see Section 6.3) at different training sizes for both datasets. For each dataset, we draw random samples ranging from 1% to 100% of the training split.
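The sampling procedure behind such a training-size curve can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `sample_fraction` and `learning_curve`, the fixed seed, and the `train_and_score` callback (which would wrap predictor training and F1 evaluation) are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_fraction(examples: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Return a random sample holding `fraction` of a non-empty `examples` (at least one item)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    size = max(1, round(len(examples) * fraction))
    # A seeded generator keeps the sample reproducible across runs.
    return random.Random(seed).sample(list(examples), size)


def learning_curve(
    train: Sequence[T],
    fractions: Sequence[float],
    train_and_score: Callable[[List[T]], float],
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Score a model trained on each sampled fraction of the training split.

    `train_and_score` is a hypothetical callback: it should fit a model on the
    given subset and return its score (e.g. F1) on held-out data.
    """
    return [(f, train_and_score(sample_fraction(train, f, seed))) for f in fractions]
```

With a real predictor, `train_and_score` would fit on the subset and return the F1 score on a fixed evaluation split, so that only the training size varies between points on the curve.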
F Perturbation thresholds
Figure 6 shows the performance obtained with different perturbation thresholds on the training split of both datasets, with the black line indicating the threshold used for evaluation in Section 5.3.
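A threshold sweep of this kind can be sketched as below. This is a generic illustration under stated assumptions rather than the procedure used in the paper: it assumes each text span has a single numeric perturbation score, that spans scoring at or above the threshold are flagged as risky, and that F1 on the training split is the selection criterion. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def binary_f1(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """F1 score of the positive class; 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_thresholds(
    scores: Sequence[float], gold: Sequence[bool], thresholds: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (threshold, F1) pairs, flagging a span when its score >= threshold.

    The >= direction is an assumption made for this sketch.
    """
    results = []
    for t in thresholds:
        pred = [s >= t for s in scores]
        results.append((t, binary_f1(gold, pred)))
    return results


def best_threshold(results: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Pick the pair with the highest F1 (the first one wins on ties)."""
    return max(results, key=lambda pair: pair[1])
```

Selecting the threshold on the training split and then fixing it for evaluation, as the text describes, avoids tuning on the data used to report results.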
This paper is available on arXiv under a CC BY 4.0 DEED license.