B Theoretical Analysis for a Mixture of Gaussians
Consider the Gaussian mixture model and the hypothesis class of linear decision rules introduced in Section 4. Here we give a more formal treatment of the assumptions and claims made in that section.
B.1 Average risk minimization
Theorem B.1. If Equation 6 holds, define t∗(ϕ0/ϕ1) as in Equation 3. Then, t∗(ϕ0/ϕ1) is the statistical risk minimizer for the Gaussian mixture model if
Proof. For a decision rule t ∈ ℝ ∪ {±∞}, the statistical risk in the given Gaussian mixture model is given in Equation 1. The global minimum is achieved either at ±∞ or on ℝ. In the latter case, the minimizer is a solution to ∂R(t)/∂t = 0:
Rearranging and taking the logarithm on both sides yields
which is a quadratic equation in t with solutions
By repeating the same steps for an inequality rather than equality, we conclude that 0 < ∂R(t)/∂t if and only if
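To make the algebra explicit, the following is a sketch under the (assumed) convention that x ≥ t is assigned to class 1, so that R(t) = ϕ0 P[X0 ≥ t] + ϕ1 P[X1 < t] with Xy ∼ N(µy, σy²); the exact parameterization of Equations 1 and 3 may differ.

```latex
% Stationarity condition for R(t) = \phi_0\Pr[X_0 \ge t] + \phi_1\Pr[X_1 < t],
% where X_y \sim \mathcal{N}(\mu_y, \sigma_y^2) and \varphi is the standard
% normal density:
\begin{align*}
  \frac{\partial R(t)}{\partial t}
    = -\frac{\phi_0}{\sigma_0}\,\varphi\!\left(\frac{t-\mu_0}{\sigma_0}\right)
      + \frac{\phi_1}{\sigma_1}\,\varphi\!\left(\frac{t-\mu_1}{\sigma_1}\right) = 0
  \;\Longleftrightarrow\;
  \log\frac{\phi_0}{\sigma_0} - \frac{(t-\mu_0)^2}{2\sigma_0^2}
    = \log\frac{\phi_1}{\sigma_1} - \frac{(t-\mu_1)^2}{2\sigma_1^2}.
\end{align*}
% For \sigma_0 \ne \sigma_1 this is quadratic in t; for \sigma_0 = \sigma_1
% = \sigma it degenerates to the single root
\begin{align*}
  t^* = \frac{\mu_0 + \mu_1}{2}
      + \frac{\sigma^2}{\mu_1 - \mu_0}\,\log\frac{\phi_0}{\phi_1}.
\end{align*}
```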
B.2 Worst-class optimal priors
For the second part, we start by proving the following lemma.
B.3 The Effect of SSP
Our setting allows us to adapt and analyze a state-of-the-art baseline pruning algorithm, Self-Supervised Pruning (SSP) [Sorscher et al., 2022]. SSP performs k-means clustering in the embedding space of an ImageNet pre-trained self-supervised model and defines the difficulty of each data point as the cosine distance to its nearest cluster centroid, or prototype; in the case of two 1D Gaussians, this score corresponds to the distance to the closest mean. Recall that, in Section 4, we adopted a variant of SSP that removes samples within a margin M > 0 of their class means, as sketched below.
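The following is a minimal numerical sketch of this variant; the means, variances, and margin are illustrative placeholders, not values from the paper.

```python
import numpy as np
from math import erf, sqrt

# Sketch of the SSP variant from Section 4: two 1D Gaussian classes, and
# "easy" samples within margin M of their own class mean are removed. In
# SSP proper, difficulty is the cosine distance to the nearest k-means
# prototype, which for two 1D Gaussians reduces to |x - nearest mean|.
rng = np.random.default_rng(0)
mu0, sigma0 = -1.0, 1.0   # illustrative class-0 parameters
mu1, sigma1 = 1.0, 1.5    # illustrative class-1 parameters
n = 100_000
x0 = rng.normal(mu0, sigma0, n)  # class-0 samples
x1 = rng.normal(mu1, sigma1, n)  # class-1 samples

M = 0.3
keep0 = np.abs(x0 - mu0) > M  # prune class-0 points within M of mu0
keep1 = np.abs(x1 - mu1) > M  # prune class-1 points within M of mu1

# Empirical removed mass per class vs. the closed form 2*Phi(M/sigma) - 1,
# using the identity 2*Phi(z) - 1 = erf(z / sqrt(2)).
for sigma, keep in [(sigma0, keep0), (sigma1, keep1)]:
    removed_theory = erf(M / (sigma * sqrt(2)))
    print(f"removed: {1 - keep.mean():.4f}   theory: {removed_theory:.4f}")
```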
We claimed and demonstrated in Figure 8 that the optimal risk minimizer remains around t∗(ϕ0/ϕ1) even after pruning. In this section, we make these claims more precise. To this end, note that, for a Gaussian variable with variance σ², the probability mass removed within margin M of the mean is 2Φ(M/σ) − 1, where Φ denotes the standard normal CDF.
Now, for sufficiently small M, we can assume that the average risk minimizer lies within the interval (µ0 + M, µ1 − M) (recall that the average risk minimizer of the original mixture lies between µ0 and µ1 owing to the assumption in Equation 7). In this case, the right tail of the easier class and the left tail of the more difficult class (the tails that each class misclassifies) are unaffected by pruning, and hence the average risk after SSP, R′(t), should remain proportional to the original risk R(t). More formally, consider the class-wise risks after SSP:
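As a sketch of this proportionality claim, under the same threshold convention as above and with pruning removing the central mass 2Φ(M/σy) − 1 of each class and renormalizing (our assumptions; the paper's display may be parameterized differently):

```latex
% Class-wise risks after SSP for a threshold t in (mu_0 + M, mu_1 - M):
% the misclassified tails are untouched, and the kept mass of class y is
% 1 - (2\Phi(M/\sigma_y) - 1) = 2\Phi(-M/\sigma_y).
\begin{align*}
  R'_0(t) = \frac{1 - \Phi\!\left(\frac{t-\mu_0}{\sigma_0}\right)}
                 {2\,\Phi(-M/\sigma_0)},
  \qquad
  R'_1(t) = \frac{\Phi\!\left(\frac{t-\mu_1}{\sigma_1}\right)}
                 {2\,\Phi(-M/\sigma_1)}.
\end{align*}
% When \sigma_0 = \sigma_1, the two normalizers coincide, so
% R'(t) = \phi_0 R'_0(t) + \phi_1 R'_1(t) \propto R(t) on this interval and
% the minimizer t^*(\phi_0/\phi_1) is unchanged by pruning.
```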
B.4 Multivariate Isotropic Gaussians
The average risk given the class priors is thus
Therefore, this problem also reduces to the minimization in the univariate case from Section B.2.
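The reduction rests on a one-line projection fact; the following sketch assumes the linear rule thresholds w⊤x at t with ∥w∥ = 1 and class-conditional laws N(µy, σy² Id), which may differ from the paper's exact parameterization.

```latex
% Projecting an isotropic Gaussian onto a unit vector w gives a univariate
% Gaussian, so the risk of the linear rule "assign class 1 iff w^T x >= t"
% is a univariate risk in the projected coordinate:
\begin{align*}
  w^\top X_y \sim \mathcal{N}\!\left(w^\top \mu_y,\; \sigma_y^2\right)
  \quad\Longrightarrow\quad
  R(w, t) = \phi_0\,\Pr\!\left[w^\top X_0 \ge t\right]
          + \phi_1\,\Pr\!\left[w^\top X_1 < t\right],
\end{align*}
% which, for any fixed w, has exactly the form of the univariate risk R(t)
% with means w^\top \mu_0 and w^\top \mu_1 and the same variances.
```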
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Artem Vysogorets, Center for Data Science, New York University (amv458@nyu.edu);
(2) Kartik Ahuja, Meta FAIR;
(3) Julia Kempe, New York University, Meta FAIR.