Authors:
(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway and Corresponding author (anthip@ifi.uio.no);
(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;
(5) Ildikó Pilán, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.
Table of Links
2 Background
2.3 Privacy-Preserving Data Publishing
3 Datasets and 3.1 Text Anonymization Benchmark (TAB)
4 Privacy-oriented Entity Recognizer
4.2 Silver Corpus and Model Fine-tuning
5.4 Sequence Labelling and 5.5 Web Search
6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics
6.2 Experimental Results and 6.3 Discussion
6.4 Combination of Risk Indicators
Appendices
A. Human properties from Wikidata
B. Training parameters of entity recognizer
D. LLM probabilities: base models
E. Training size and performance
6 Analysis of Privacy Risk Indicators
We now evaluate the privacy risk indicators detailed in the previous section, both in isolation (Section 6.2) and in combination (Section 6.4), and discuss the main findings.
6.1 Evaluation Metrics
The evaluation of text sanitization must strike a balance between two competing considerations:
• Privacy risk: text spans that have a high risk of leading, directly or indirectly, to the re-identification of the individual should be masked.
• Data utility: the sanitized text should preserve as much semantic content as possible.
The traditional approach to evaluating text sanitization is to rely on a manually labeled set of documents and use metrics such as precision, recall and F1 score to measure the overlap between the human decisions and the system outputs. Recall is associated with the privacy risk, as high recall indicates that most of the PII spans that should have been masked are indeed masked. Similarly, precision is correlated with data utility, as high precision indicates that the sanitization did not mask unnecessary text spans.
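This standard evaluation can be made concrete with a minimal Python sketch. It computes exact-match span-level precision, recall and F1; the spans and helper name are illustrative assumptions, not the paper's evaluation code:

```python
def span_prf(system, gold):
    """Exact-match span-level precision/recall/F1 between the
    spans masked by a system and the human-annotated PII spans.
    (Hypothetical baseline evaluation, for illustration only.)"""
    system, gold = set(system), set(gold)
    tp = len(system & gold)                      # correctly masked spans
    p = tp / len(system) if system else 1.0      # precision ~ data utility
    r = tp / len(gold) if gold else 1.0          # recall ~ privacy risk
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Gold has 4 PII spans; the system masked 3 of them plus one extra span.
gold = {(0, 10), (20, 24), (50, 60), (80, 85)}
system = {(0, 10), (20, 24), (50, 60), (90, 95)}
print(span_prf(system, gold))  # (0.75, 0.75, 0.75)
```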
As argued in Pilán et al. (2022), the use of standard precision and recall measures has, however, a number of important shortcomings. They do not capture the fact that some identifiers influence the privacy risk more than others (for instance, failing to mask a full person name is a more serious threat than failing to mask a vague quasi-identifier). They also ignore the fact that a personal identifier is only protected if all of its occurrences are masked: if a person name is mentioned 4 times in a document and only 3 of those occurrences are masked, the re-identification risk remains unchanged, as the identifier is still available in clear text. Finally, text sanitization may admit several, equally correct solutions.
To provide a more fine-grained assessment of the text sanitization quality, Pilán et al. (2022) present a set of three privacy-oriented metrics, micro-averaged over multiple annotators:
Entity recall on direct identifiers Micro-averaged, entity-level recall score calculated only for direct identifiers. If an entity is mentioned multiple times in the text, all mentions must be masked for it to be considered correct.
Entity recall on quasi-identifiers Micro-averaged, entity-level recall score calculated only for quasi-identifiers. In case of multiple mentions of the same entity, all mentions must again be masked.
Weighted Precision Micro-averaged, token-level precision where each token is weighted by its information content (measured using a language model such as BERT).
The evaluation results below are computed according to those three metrics.
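The two key ideas behind these metrics, that an entity is protected only when every one of its mentions is masked, and that precision should weight tokens by their information content, can be sketched as follows. This is a simplified illustration: the entity and span names are invented, and information content is approximated here by negative log corpus frequency rather than the BERT language model the metric actually uses:

```python
import math
from collections import Counter

def entity_recall(entities, masked_spans):
    """Entity-level recall: an entity counts as protected only if
    every one of its mentions falls inside some masked span."""
    def covered(start, end):
        return any(ms <= start and end <= me for ms, me in masked_spans)
    protected = sum(1 for mentions in entities.values()
                    if all(covered(s, e) for s, e in mentions))
    return protected / len(entities) if entities else 1.0

def weighted_precision(masked_tokens, gold_tokens, corpus_tokens):
    """Token-level precision where each token is weighted by its
    information content, approximated as -log relative corpus
    frequency (a stand-in for the BERT-based weighting)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    ic = lambda t: -math.log(counts.get(t, 1) / total)
    gold = set(gold_tokens)
    denom = sum(ic(t) for t in masked_tokens)
    num = sum(ic(t) for t in masked_tokens if t in gold)
    return num / denom if denom else 1.0

# "john_smith" is mentioned twice but only one mention is masked,
# so that entity stays unprotected; "oslo" is fully masked.
entities = {"john_smith": [(0, 10), (50, 60)], "oslo": [(25, 29)]}
masked = [(0, 10), (25, 29)]
print(entity_recall(entities, masked))  # 0.5
```

Under the all-mentions requirement, masking one of two mentions yields the same recall contribution as masking neither, which is exactly the behavior standard token-level recall fails to capture.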
This paper is available on arxiv under CC BY 4.0 DEED license.