Fair Data Pruning Implementation: Datasets, Methods, and Augmentation

A Implementation Details

Our empirical work encompasses three standard computer vision benchmarks (Table 1). All code is implemented in PyTorch [Paszke et al., 2017] and run on an internal cluster equipped with NVIDIA RTX8000 GPUs. We make our code available at https://github.com/avysogorets/fair-data-pruning.

Data Pruning. Data pruning methods require different procedures for training the query model and extracting scores for the training data. For EL2N and GraNd, we train the query model for 10% of the full training length reported in Table 1 before calculating the importance scores, which is more than the minimum of 10 epochs recommended by Paul et al. [2021]. To improve the score estimates, we repeat the procedure across 5 random seeds and average the scores before pruning. Forgetting and Dynamic Uncertainty operate during training, so we execute a full optimization cycle of the query model, but only once. Likewise, CoreSet is applied once, to the embeddings of the fully trained query model. We use the greedy k-center variant of CoreSet. Since some of the methods require a hold-out validation set (e.g., MetriQ, CDB-W), we reserve 50% of the test set for this purpose. This split is never used when reporting the final model performance.
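
Two of the steps above can be illustrated with a minimal NumPy sketch: averaging EL2N scores over several query-model seeds, and greedy k-center selection on embeddings. This is not the released implementation. The function names, the array layouts (softmax outputs of shape (N, num_classes), embeddings of shape (N, D)), the Euclidean distance, and the choice of the first center are all assumptions made for illustration.

```python
import numpy as np


def el2n_scores(probs, labels):
    """EL2N score per example: L2 norm of (softmax output - one-hot label)."""
    one_hot = np.eye(probs.shape[1])[labels]
    return np.linalg.norm(probs - one_hot, axis=1)


def seed_averaged_scores(probs_per_seed, labels):
    """Average EL2N scores over query models trained with different seeds.

    probs_per_seed is a hypothetical list with one (N, num_classes) array
    of softmax outputs per seed.
    """
    return np.mean([el2n_scores(p, labels) for p in probs_per_seed], axis=0)


def greedy_k_center(embeddings, k, first=0):
    """Greedy k-center: repeatedly add the point farthest from the selected set.

    The index of the first center is an assumption; implementations differ
    in how they initialize the selection.
    """
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected
```

In this sketch, seed_averaged_scores would be called with five arrays of softmax outputs, one per seed, and greedy_k_center with the embeddings of a single fully trained query model and the number of examples to retain.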

Data Augmentation. We employ data augmentation only when optimizing the final model. The same augmentation strategies are used for all three datasets. In particular, we normalize examples per-channel and randomly apply shifts by at most 4 pixels in any direction and horizontal flips.
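
The augmentation described above can be sketched in NumPy for a single image of shape (C, H, W). The details the text leaves open are assumptions here: the shift is realized by zero-padding the normalized image and cropping a window of the original size, the flip probability is 0.5, and the per-channel mean and standard deviation are precomputed and passed in. The function name augment and its parameters are hypothetical.

```python
import numpy as np


def augment(image, mean, std, rng, max_shift=4):
    """Normalize per channel, then randomly shift and flip a (C, H, W) image."""
    _, h, w = image.shape
    # Per-channel normalization with precomputed dataset statistics.
    out = (image - mean[:, None, None]) / std[:, None, None]
    # Random shift: zero-pad by max_shift on each side (assumed padding mode),
    # then crop an H x W window at a random offset in [0, 2 * max_shift].
    pad = ((0, 0), (max_shift, max_shift), (max_shift, max_shift))
    padded = np.pad(out, pad)
    top = rng.integers(0, 2 * max_shift + 1)
    left = rng.integers(0, 2 * max_shift + 1)
    out = padded[:, top:top + h, left:left + w]
    # Horizontal flip with probability 0.5 (assumed).
    if rng.random() < 0.5:
        out = out[:, :, ::-1]
    return out
```

A generator created with np.random.default_rng(seed) can be passed as rng to make the augmentation reproducible. Consistent with the text, such a function would be applied only to the training data of the final model, not when training the query model.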

This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Artem Vysogorets, Center for Data Science, New York University;

(2) Kartik Ahuja, Meta FAIR;

(3) Julia Kempe, New York University, Meta FAIR.