Authors:
(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway and Corresponding author (anthip@ifi.uio.no);
(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;
(5) Ildikó Pilán, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.
Table of Links
2 Background
2.3 Privacy-Preserving Data Publishing
3 Datasets and 3.1 Text Anonymization Benchmark (TAB)
4 Privacy-oriented Entity Recognizer
4.2 Silver Corpus and Model Fine-tuning
5.4 Sequence Labelling and 5.5 Web Search
6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics
6.2 Experimental Results and 6.3 Discussion
6.4 Combination of Risk Indicators
Appendices
A. Human properties from Wikidata
B. Training parameters of entity recognizer
D. LLM probabilities: base models
E. Training size and performance
6 Analysis of Privacy Risk Indicators
We now evaluate the privacy risk indicators detailed in the previous section, both in isolation (Section 6.2) and in combination (Section 6.4), and discuss the main findings.
6.1 Evaluation Metrics
The evaluation of text sanitization must strike a balance between two competing considerations:
• Privacy risk: text spans that have a high risk of leading, directly or indirectly, to the re-identification of the individual should be masked.
• Data utility: the sanitized text should preserve as much semantic content as possible.
The traditional approach to evaluating text sanitization is to rely on a manually labeled set of documents and use metrics such as precision, recall and F1 score to measure the overlap between the human decisions and the system outputs. Recall is associated with the privacy risk, as high recall indicates that most of the PII spans that should have been masked are indeed masked. Similarly, precision is correlated with data utility, as high precision indicates that the sanitization did not mask unnecessary text spans.
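This standard evaluation can be made concrete with a minimal Python sketch. It computes exact-match span-level precision, recall and F1; the spans and helper name are illustrative assumptions, not the paper's evaluation code:

```python
def span_prf(system, gold):
    """Exact-match span-level precision/recall/F1 between the
    spans masked by a system and the human-annotated PII spans.
    (Hypothetical baseline evaluation, for illustration only.)"""
    system, gold = set(system), set(gold)
    tp = len(system & gold)                      # correctly masked spans
    p = tp / len(system) if system else 1.0      # precision ~ data utility
    r = tp / len(gold) if gold else 1.0          # recall ~ privacy risk
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Gold has 4 PII spans; the system masked 3 of them plus one extra span.
gold = {(0, 10), (20, 24), (50, 60), (80, 85)}
system = {(0, 10), (20, 24), (50, 60), (90, 95)}
print(span_prf(system, gold))  # (0.75, 0.75, 0.75)
```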
As argued in Pilán et al. (2022), the use of standard precision and recall measures has, however, a number of important shortcomings. They do not capture the fact that some identifiers influence the privacy risk more than others (for instance, failing to mask a full person name is a more serious threat than failing to mask a vague quasi-identifier). They also ignore the fact that a personal identifier is only protected if all of its occurrences are masked: if a person name is mentioned 4 times in a document and only 3 of those occurrences are masked, the re-identification risk remains unchanged, as the identifier is still available in clear text. Finally, text sanitization may admit several, equally correct solutions.
To provide a more fine-grained assessment of the text sanitization quality, Pilán et al. (2022) present a set of three privacy-oriented metrics, micro-averaged over multiple annotators:
Entity recall on direct identifiers Micro-averaged, entity-level recall score calculated only for direct identifiers. If an entity is mentioned multiple times in the text, all mentions must be masked for it to be considered correct.
Entity recall on quasi-identifiers Micro-averaged, entity-level recall score calculated only for quasi-identifiers. In case of multiple mentions of the same entity, all mentions must again be masked.
Weighted Precision Micro-averaged, token-level precision where each token is weighted by its information content (measured using a language model such as BERT).
The evaluation results below are computed according to those three metrics.
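The two key ideas behind these metrics, that an entity is protected only when every one of its mentions is masked, and that precision should weight tokens by their information content, can be sketched as follows. This is a simplified illustration: the entity and span names are invented, and information content is approximated here by negative log corpus frequency rather than the BERT language model the metric actually uses:

```python
import math
from collections import Counter

def entity_recall(entities, masked_spans):
    """Entity-level recall: an entity counts as protected only if
    every one of its mentions falls inside some masked span."""
    def covered(start, end):
        return any(ms <= start and end <= me for ms, me in masked_spans)
    protected = sum(1 for mentions in entities.values()
                    if all(covered(s, e) for s, e in mentions))
    return protected / len(entities) if entities else 1.0

def weighted_precision(masked_tokens, gold_tokens, corpus_tokens):
    """Token-level precision where each token is weighted by its
    information content, approximated as -log relative corpus
    frequency (a stand-in for the BERT-based weighting)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    ic = lambda t: -math.log(counts.get(t, 1) / total)
    gold = set(gold_tokens)
    denom = sum(ic(t) for t in masked_tokens)
    num = sum(ic(t) for t in masked_tokens if t in gold)
    return num / denom if denom else 1.0

# "john_smith" is mentioned twice but only one mention is masked,
# so that entity stays unprotected; "oslo" is fully masked.
entities = {"john_smith": [(0, 10), (50, 60)], "oslo": [(25, 29)]}
masked = [(0, 10), (25, 29)]
print(entity_recall(entities, masked))  # 0.5
```

Under the all-mentions requirement, masking one of two mentions yields the same recall contribution as masking neither, which is exactly the behavior standard token-level recall fails to capture.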
This paper is available on arxiv under CC BY 4.0 DEED license.