Abstract
In a Data-Generating Experiment (DGE), the data, X, is often obtained from a Black-Box and is approximated with a learning machine/sampler, f(θ, Y); θ is unknown, Y is random, f is known. When X has unknown cdf, F_θ, nonidentifiability of θ cannot be confirmed and may limit the predictive accuracy of the learned model, f(θ̂, Y), estimate of X. Using properties of the Expected p-value for the Kolmogorov-Smirnov test, the Empirical Discrimination Index (EDI) and the Proportion of p-Values Index (PPVI) are introduced: (i) to confirm, almost surely, discrimination of θ from θ′; (ii) to confirm with EDI-graphics identifiability of θ, by repeating (i) for θ′ in a fine sieve of Θ; and (iii) to compare EDI-graphics and PPVIs of DGEs and select to use the DGE with the greater parameter discrimination and the smaller number of θ′ violating identifiability of θ. Among the applications, EDI and PPVI explain why the g-estimate in Tukey’s g-and-h model is better than that for the g-and-k model, unless the sample size is extremely large; EDI-graphics indicate that Normal learning machines have better parameter discrimination than Sigmoid learning machines and that their parameters are nonidentifiable. Supplementary materials for this article are available online.
1 Introduction
Statistical modeling is used in numerous fields, from Economics and Psychology to Biology, Engineering, and Machine Learning, among others. Often, it is assumed that the sample, X = (X_1, …, X_n), consists of observations with known cdf, F_θ, we can do calculations with, that is, F_θ is tractable; θ is unknown, an element of a metric space (Θ, ρ). Recently, X_i is obtained either from a Sampler/Quantile function, or from a Black-Box, and is approximated with a learning machine, f(θ, Y_i); f is known. The cdf of X_i is either unknown or intractable, and we use in the sequel “unknown” to denote both, and “sampler f(θ)” assuming the Y-input is in the background.
Breiman (2001) called the Black-Box model algorithmic, observed that statisticians rarely adopt it, and commented: “If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.” Earlier, Tukey (1962, p. 60) wrote: “Procedures of diagnosis, and procedures to extract indications rather than extract conclusions, will have to play a large part in the future of data analyses and graphical techniques offer great possibilities in both areas.” Both suggestions have been widely adopted nowadays in Data Science. Such tools and their theoretical justifications are presented for Black-Box models in this work and in Yatracos (2020, 2021, 2022). Among the discussants in Breiman (2001), Cox and Efron looked at the problem as prediction of X, ignoring the statistical inference aspect for θ in the X-model. Consequently, mathematical statisticians did not follow Breiman’s suggestion. Computer scientists study the same problem with f as learning machine.
Modeling goals include estimation of θ and F_θ for the Black-Box, in addition to the accurate and reproducible prediction of future outputs of f using the estimate, θ̂, of θ. These goals require identifiability of θ, which depends on F_θ. Recall that the parameters in Θ are identifiable when, for each θ and θ′ in Θ, F_θ = F_θ′ implies θ = θ′. Identifiability has been confirmed so far only when the cdfs F_θ are known and tractable, usually via the Fisher Information Matrix, which is not available for unknown models.
Another problem, not frequently studied for unknown models, is that often, due to the shapes of the F_θ, the sample size, n, needed for small estimation error may be excessively large, and the statistician may not be aware of it. There is no direct data-tool measuring discrimination of θ from θ′ by evaluating the distance between F_θ and F_θ′, and also confirming θ-identifiability. For example, for Tukey’s g-and-h model (see (2)) and the g-and-k model (see (3)), the difficulty in the discrimination (and estimation) of parameters has been studied via Maximum Likelihood estimates (Rayner and MacGillivray, 2002).
In Machine Learning, nonidentifiability of θ is ubiquitous. The extent of nonidentifiability, that is, the number of θ′ making θ nonidentifiable, and the level of discrimination of parameters in Θ are criteria for the choice of the data-generating/learning machine, f. For tractable models and the Kolmogorov distance, d_K, the larger d_K(F_θ, F_θ′) is, the greater the discrimination between θ and θ′ is; θ is identifiable when d_K(F_θ, F_θ′) > 0 for any θ′ ≠ θ, which implies also F_θ ≠ F_θ′. These facts suggest, for unknown models, estimating their Kolmogorov distance, and lead to the use of tools from the Kolmogorov-Smirnov test to study discrimination and identifiability of θ, as described in Section 3.
Answers to the arising questions are provided by introducing:
(a) the Empirical Discrimination Index (EDI) of θ and θ′, used to study discrimination and identifiability; EDI is based on repeated samples and estimates the expected p-value for the Kolmogorov-Smirnov test between F_θ and F_θ′, under both hypotheses, F_θ′ = F_θ and F_θ′ ≠ F_θ; and (b) the Proportion of p-Values Index (PPVI), complementing EDI and used to compare “locally” different samplers f1, f2 with the same Y-input, regarding the estimation difficulty of a parameter-value, θ, with respect to a parameter θ′ at distance ρ(θ, θ′).
The motivation for this work was that Matching Estimates of g were satisfactory at g0 for Tukey’s g-and-h model/sampler, but not for the g-and-k model/sampler for the same values of h, k (Yatracos 2020, 2021). The questions were “Why?” and “How could this problem be predicted in an Algorithmic model?” EDI and PPVI are used to answer these questions, confirming in particular that Tukey’s g-and-h model has greater g-discrimination than the g-and-k model, unless the sample size is extremely large.
In Example 4.1, EDI-graphics are presented for the normal and Cauchy models to compare the discrimination of their parameters, and to observe the form of EDI-graphics for identifiable parameters. In Examples 4.2 and 4.3, nonidentifiability is confirmed with EDI-graphics, respectively, for the parameters a, b in a normal model with a + b the mean and known variance, and for the mixture of two normal distributions with known variance. In Examples 4.4 and 4.5, Tukey’s g-and-h model and the g-and-k model are compared using EDI and PPVI, to confirm that the former has better discrimination and is to be preferred as data-generating or learning machine if estimation is the goal, unless the sample size, n, is extremely large. In Example 4.6, EDI-graphics are depicted for Normal and Sigmoid data-generating machines, showing that both have unidentifiable parameters but the former has better parameter discrimination.
EDI is not used herein as a statistical test, even though tools from hypotheses testing are used to motivate it, without reference to a significance level. EDI-graphics are simply observed for various sample sizes. There is no other method available to confirm identifiability and discrimination of θ when the underlying data model is unknown; thus, EDI is not compared with another method in the applications. A referee suggested to use EDI with the fashionable Wasserstein distance, instead of the Kolmogorov distance, but this is not supported for Black-Box models, which may have heavy tails (Yatracos 2022), and also because the Wasserstein distance between the empirical distribution/measure and the underlying distribution/probability measure of the data does not necessarily converge to zero in probability as n increases, which is needed in Proposition 3.1 (b) for independent observations. The same referee had the concern: “EDI is used with Data Generators, or when data models are known and tractable. However, for real data, the underlying generating data models are unknown and intractable. This means EDI is not applicable in many real applications.” To study identifiability and discrimination, either the cdf, F_θ, has to be known and tractable, or F_θ has to be “learned from samples obtained for various θ,” that is, either from a sampler, or a quantile function as with Tukey’s g-and-h model, or simply via the inverse of F_θ and uniform r.v.’s on [0,1]. If none of this is possible, as in the referee’s example, one cannot study identifiability and discrimination. Note that in several statistics papers with intractable cdfs it is assumed that data can be generated for various θ-values. Thus, with real data, rather than proposing an intractable cdf indexed by θ, one can propose instead a function, f(θ, Y), generating the data; Y random.
Rothenberg (1971) established conditions for parameter identifiability using the Fisher Information Matrix of tractable F_θ. Nonidentifiable statistical models include mixture models (Hartigan 1985), autoregressive moving averages (Veres 1987), and the change point problem (Csörgo and Horvath 1997). In Statistical Machine Learning, nonidentifiability of θ is ubiquitous and has deep influence, in particular, on the output, f(θ̂, Y), of the Learning Machine, f. The estimate, θ̂, affects the capability of the learned model (or representation) to predict future data (Ran and Hu 2017). For example, in Deep Neural Networks it is preferable that, when a network relearns with data from the same model, the obtained learned representation is nearly similar to the previous one. This wish motivated studying linear identifiability in function space (Roeder, Metz, and Kingma 2021) for tractable models with general exponential form, extending results on distinct models (Hyvärinen and Morioka 2016; Hyvärinen, Sasaki, and Turner 2018). Other tools include the Fisher Information matrix, the asymptotic order of the Likelihood Ratio test statistic of the MLE, and the Kullback-Leibler divergence of tractable models, used among others by Fukumizu (2003), Watanabe (2001), Fukumizu and Amari (2000), and Ran and Hu (2014). A detailed review is presented in Ran and Hu (2017), endorsing in the “Summary and Perspective” (p. 1196) the view in Breiman (2001) for algorithmic models, and especially the tools needed to address identifiability, by adding: “This will become one of the most important issues for machines in the future.” EDI and PPVI clarify these issues, and their advantage over the previously described methods is that there is no need to know the underlying model of the data to check for parameter identifiability and parameter discrimination.
Dempster and Schatzoff (1965) treated the p-value as random and used its expected value, the Expected Significance Level, as a sensibility index for comparing several multivariate tests. The Expected Significance Level was viewed as a “reasonable compromise” to the Neyman-Pearson theory that uses the power of the test, which depends on the α-level. Sackrowitz and Samuel-Cahn (1999) used Expected p-Value (EPV) instead of Expected Significance Level, to stress that the p-value is random, suggested EPV as a test’s performance measure when it is difficult to evaluate the power function, and examined the p-values under the alternative for location and scale models. Since then, EPV and its estimates have been used to study tests for parametric tractable models, but not for studying identifiability or discrimination of parameters in unknown models using the Kolmogorov-Smirnov test.
Empirical discrimination indices for θ based on tests’ comparisons provide useful information in estimation problems, since the elementary estimation problem, in which Θ consists of two parameters, θ and θ′, can be solved by testing. Stein (1964) showed inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean, using a test of hypotheses to obtain the improved estimate. LeCam (1973) and Birgé (2006) used successfully multiple, simultaneous tests of simple hypotheses for estimation with infinite dimensional parameter space, Θ.
In Section 2, previous results related indirectly to parameter discrimination are presented. In Section 3, EPV properties are presented, and EDI and PPVI are introduced along with the use of EDI-graphics. Applications follow in Section 4. Proofs are included in the supplementary material, along with the R-programs used. The reader may proceed directly to the figures that depict EDI-graphics with informative captions.
2 Model Shapes and Parameter Discrimination
Tukey (1962, 1977) used the g-and-h model to better fit samples with heavy tails and skewness, and the λ-distribution to fit both symmetric and asymmetric data. The idea is to model the quantiles of X directly and not via a density (Yan and Genton 2019). Thus, X is seen as a modification of data, Y, using a known data-generator f, with unknown parameter θ; for example, X = f(θ, Y). (1)
Y is observed from a known model with known parameter.
Tukey’s quantile-modeling approach has been used since then in several fields; (1) evolved and has led to an abundance of model shapes that would fit X better. However, this abundance of model shapes near the underlying distribution increases the difficulty in estimating θ, and f in (1) should be examined before its use as a potential data-generator.
Tukey’s (1977) g-and-h model accommodates data from non-Gaussian distributions, with real-valued g controlling skewness, nonnegative h controlling tail heaviness, and with location, a, and scale, b. A vector Z of independent standard normal random variables and parameter values θ = (a, b, g, h) are used to obtain X = a + b · ((e^{gZ} − 1)/g) · e^{hZ²/2}. (2)
The g-and-k model (Haynes, MacGillivray, and Mengersen 1997) includes distributions with more negative kurtosis than the normal distribution, and some bimodal distributions (Rayner and MacGillivray 2002, p. 58). Standard normal Z and parameter values θ = (a, b, g, k) are used to obtain X = a + b · (1 + c · (1 − e^{−gZ})/(1 + e^{−gZ})) · (1 + Z²)^k · Z. (3)
c is a parameter used to make the sample correspond to a density, and usually c = 0.8.
With abuse of notation, we use Z in (2) and (3) instead of Y, and the same for their coordinates. Since the g-and-k and the g-and-h models both use the same Z, EDI and PPVI are used in Examples 4.4 and 4.5.
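To make the two samplers concrete, the code below is an illustrative Python transcription of the standard parameterizations behind (2) and (3) (the paper’s implementation is in R); the function names are ours, and the usual c = 0.8 is used for the g-and-k model.

```python
import math
import random

def gh_quantile(z, a=0.0, b=1.0, g=0.5, h=0.1):
    """Tukey g-and-h transform of a standard normal value z, as in (2)."""
    gz = (math.exp(g * z) - 1.0) / g if g != 0 else z  # limit g -> 0 is z
    return a + b * gz * math.exp(h * z * z / 2.0)

def gk_quantile(z, a=0.0, b=1.0, g=0.5, k=0.1, c=0.8):
    """g-and-k transform as in (3); c = 0.8 is the conventional choice."""
    skew = 1.0 + c * (1.0 - math.exp(-g * z)) / (1.0 + math.exp(-g * z))
    return a + b * skew * (1.0 + z * z) ** k * z

# Draw a small g-and-h sample by transforming standard normal draws.
rng = random.Random(0)
sample_gh = [gh_quantile(rng.gauss(0.0, 1.0)) for _ in range(5)]
```

Both transforms reduce to the location a at z = 0, and are increasing in z for the parameter ranges used here, so they are valid quantile functions.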
For the g-and-k model (Haynes, MacGillivray, and Mengersen 1997) and the generalized g-and-h models, Rayner and MacGillivray (2002) confirmed the difficulty of the MLE to discriminate distributional shapes and parameters’ values with small and moderate sample sizes: “… computational Maximum Likelihood procedures are very good for very large sample sizes, but they should not necessarily be assumed to be safe for even moderately large sample sizes” (p. 58); also, “… with moderately large positive (i.e., to the right) skewness, the MLE method fitting to the g-and-k distribution cannot efficiently discriminate between moderate positive values and small negative values of the kurtosis parameter.” (p. 64).
For Tukey’s asymmetric λ-distributions, with a wider variety of distributional shapes than g-and-h, and using the Moments estimation method, it is observed: “An additional difficulty with the use of this distribution when fitting through moments, is that of nonuniqueness, where more than one member of the family may be realized when matching the first four moments …” (Ramberg et al. 1979; Rayner and MacGillivray 2002, p. 58).
These findings suggest the search for a data dependent tool to evaluate the discrimination and identifiability of θ-values.
3 Discrimination and Identifiability of Parameters in Algorithmic Models with EDI and PPVI
3.1 Data Generating Experiment-Definitions- Assumptions on f
The elements of a statistical problem with data obtained from known models are included in Le Cam’s Statistical Experiment, with sample space, its σ-field, and probabilities (or cdfs) F_s indexed by s in a parameter space; see, for example, LeCam and Yang (1990). The parallel notion of the Data-Generating Experiment is introduced for data obtained from Breiman’s Algorithmic models and Samplers.
Definition 3.1.
A Data-Generating Experiment (DGE) consists of the sample space and the data-generating mechanism (or sampler), f, with inputs the parameter θ (θ ∈ Θ) and the sample, Y, that may be latent or not.
The findings in Section 2 indicate, for DGEs with unknown models, the need to study the discrimination of parameters using tools independent from estimation methods. Examples with DGEs appear in the Applications section and include in particular Tukey’s g-and-h model and the g-and-k model.
In a DGE, a sample X with size n is drawn using an unknown parameter, θ, in the sampler, f. The observations in X have unknown cdf, F_θ, and the aim is to estimate θ. Herein, Θ is a subset of R^p equipped with distance, ρ. For a statistical inference problem to be meaningful, every θ in Θ should be identifiable. DGE includes any underlying structure needed, for example, a distance for cdfs, a prior for θ and σ-field(s). The user can select to draw one or more samples X via f. With abuse of notation, we use F_θ instead of the cdf of f(θ, Y).
The sampler, f, in a DGE successfully replaces the unknown underlying models in the Statistical Experiment, when repeated samples are drawn for selected θ in Θ, providing adequate information for F_θ (Yatracos 2020).
Definition 3.2.
For any two distribution functions F, G, their Kolmogorov distance is d_K(F, G) = sup_x |F(x) − G(x)|. (4)
Definition 3.3.
For any sample X = (X_1, …, X_n) of random vectors, n · F̂_X(x) denotes the number of X_i’s with all their components smaller than or equal to the corresponding components of x; F̂_X is the empirical cdf of X.
Definition 3.4.
θ is discriminated from θ′ when F_θ ≠ F_θ′, or equivalently if d_K(F_θ, F_θ′) > 0.
Definition 3.5.
θ is identifiable when, for any θ′ ≠ θ, it holds that d_K(F_θ, F_θ′) > 0.
The larger the distance d_K(F_θ, F_θ′) is, the greater the discrimination between θ and θ′ is. If θ is discriminated from all θ′ ≠ θ, then θ is identifiable. However, in a DGE, d_K(F_θ, F_θ′) can only be estimated using f, due to the unknown cdfs F_θ and F_θ′.
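Definitions 3.2–3.5 can be made concrete with a short routine. The following is an illustrative Python sketch (name ours) of the Kolmogorov distance between the empirical cdfs of two univariate samples, computed in a single merge pass over the sorted values.

```python
def kolmogorov_distance(x, y):
    """sup-distance d_K between the empirical cdfs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # advance past ties so both empirical cdfs are evaluated just after v
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d
```

For identical samples the distance is 0; for samples with disjoint supports it is 1, the two extremes of discrimination discussed above.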
The assumption from “Looking Inside the Black Box” (Breiman 2002, p. 3) is used in A1 for the DGE.
A1: In DGE, the data, X, are generated by independent draws via f.
Assumption A1 is used in Proposition 3.1, (b), (c). However, (b) can be obtained with dependent draws via f, under the weaker assumption A2.
A2: In DGE, for the data, X, with size n and underlying cdf of its components, F, d_K(F̂_X, F) converges to zero in probability.
3.2 Motivation and Description of the EDI-Approach
In practice, θ is not distinguished locally from θ′-values in an open ρ-ball, B(θ, ϵ), centered at θ with small radius ϵ. Violation of θ-identifiability due to θ′ in B(θ, ϵ), for ϵ small, is not expected to have significant effect in learning machines, nor in probability calculations. Thus, we restrict attention to identifiability of θ with respect to θ′ in B^c(θ, ϵ); A^c denotes the complement of set A.
For a DGE, the goals herein are: (i) to study the discrimination of θ and θ′ at ρ-distance greater than or equal to a small ϵ > 0; (ii) to confirm identifiability of θ using (i) for all θ′ in an ϵ-dense subset/sieve, Θ_ϵ, of B^c(θ, ϵ); and (iii) to select among several data-generating machines which one to use for better estimation of parameters, taking into consideration the sample size, and preferring less nonidentifiability, that is, a smaller number of θ′ making θ nonidentifiable, and more/greater discrimination, namely larger d_K-values.
For (i), the main tool to discriminate θ from θ′ is the empirical discrimination index, EDI_n(θ, θ′), which estimates the expected p-value, EPV, for the Kolmogorov-Smirnov test of the hypotheses F_θ′ = F_θ against F_θ′ ≠ F_θ, and has special properties (Proposition 3.1). Samples X and X* of size n are drawn using θ and θ′, respectively, and d_K(F_θ, F_θ′) is estimated by d_K(F̂_X, F̂_X*). To evaluate how large this estimate is, since it is random, one could rely on its p-value. Having the luxury of the f-availability, M such pairs of samples with p-values PV_1, …, PV_M are obtained, and the average of these p-values is used. A smaller value of EDI_n(θ, θ′) (or, simply EDI) indicates greater discrimination, as a smaller p-value does.
For (ii), Θ is assumed to be compact, which holds after a preliminary estimation of θ; see, for example, Yatracos (2020). Let Θ_ϵ denote the sieve of B^c(θ, ϵ). (5)¹ ϵ-identifiability of θ is confirmed almost surely by confirming with EDI its discrimination from each θ′ in Θ_ϵ.
For (iii), EDI-graphics can be compared for several samplers, as in the figures.
For practical purposes, ϵ-identifiability of θ will imply identifiability for all the parameters in B(θ, ϵ). Global identifiability in Θ can be confirmed by establishing ϵ-identifiability for every θ in Θ_ϵ. Alternatively, EDI-graphics for one θ only, with θ′ in the sieve, and various sample sizes, n, may be enough, as observed in Example 4.1, for which θ-identifiability implies θ′-identifiability for every θ′ in Θ. This is expected to hold for location models.
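The sieve used in (i)–(ii) can be sketched as a grid over a compact rectangle Θ that excludes the ϵ-ball around θ. The snippet below is an illustrative Python version (the paper’s R construction uses interval end-points; the function name and the grid choice are ours).

```python
import itertools
import math

def make_sieve(theta, lows, highs, eps):
    """Grid points of the rectangle [lows, highs] at L2 distance >= eps from theta."""
    axes = []
    for lo, hi in zip(lows, highs):
        k = int(round((hi - lo) / eps))           # number of eps-steps per coordinate
        axes.append([lo + t * eps for t in range(k + 1)])
    # keep only grid points outside the open eps-ball B(theta, eps)
    return [pt for pt in itertools.product(*axes)
            if math.dist(pt, theta) >= eps]
```

For example, on Θ = [0, 1] with θ = 0 and ϵ = 0.5, the sieve consists of the points 0.5 and 1.0; only θ itself is excluded.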
3.3 The EDI-Details
In a DGE with underlying unknown cdfs, let F_θ and F_θ′ be the cdfs of samples drawn with θ and θ′ (6) and consider the hypotheses H: F_θ′ = F_θ against K: F_θ′ ≠ F_θ. (7)
For the obtained samples X and X* of size n, with unknown cdfs F_θ and F_θ′, respectively, let T_n = d_K(F̂_X, F̂_X*). (8)
Under H, let G_0 denote T_n’s cdf, and P_0 the corresponding probability. H is rejected if T_n is large, or instead if the p-value PV = 1 − G_0(T_n) (9) is small. The p-value (9) is calculated under H, and is random since it depends on the observed T_n-value. Its expected value, EPV, is going to be calculated under both H and K, and we write EPV(θ, θ′) = E[1 − G_0(T*_n)]. (10)²
On the right side of (10), T*_n is the random variable with observed value obtained when X* is drawn using θ′.
Definition 3.6
(EDI). In a DGE, let X_1, …, X_M be samples of size n drawn with unknown cdf F_θ, and let X*_1, …, X*_M be samples of size n from F_θ′. Let PV_i be the p-value (9) for the two-sided Kolmogorov-Smirnov test (7) using X_i and X*_i. The Empirical Discrimination Index (EDI) of θ and θ′ is EDI_n(θ, θ′) = M^{−1} Σ_{i=1}^{M} PV_i. (11)
The smaller the EDI_n(θ, θ′)-value in (11) is, the greater/better the discrimination of θ and θ′ is. EDI is also used instead of EDI_n(θ, θ′) in the sequel.
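Definition 3.6 can be transcribed almost directly. The sketch below is an illustrative pure-Python version (names ours); the p-value from the paper’s R ks.test is replaced here by the asymptotic Kolmogorov-distribution approximation, which is an assumption of this sketch, not the paper’s implementation.

```python
import math
import random

def ks_distance(x, y):
    """Kolmogorov distance between the empirical cdfs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue(x, y):
    """Two-sample KS p-value via the asymptotic Kolmogorov distribution."""
    n, m = len(x), len(y)
    lam = ks_distance(x, y) * math.sqrt(n * m / (n + m))
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return max(0.0, min(1.0, p))

def edi(sampler, theta, theta_prime, n=200, M=100, seed=0):
    """EDI_n(theta, theta'): average KS p-value over M pairs of size-n samples."""
    rng = random.Random(seed)
    pv = [ks_pvalue([sampler(theta, rng) for _ in range(n)],
                    [sampler(theta_prime, rng) for _ in range(n)])
          for _ in range(M)]
    return sum(pv) / M

# Illustrative sampler: normal location model, theta is the mean.
normal = lambda mu, rng: rng.gauss(mu, 1.0)
```

In line with Proposition 3.1, edi(normal, 0.0, 0.0) stays near 0.5, while edi(normal, 0.0, 1.0) is near zero already for moderate n.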
Proposition 3.1.
In a DGE, let T_n and PV be as previously described.
(a) For every θ in Θ and for every sample size n, EPV(θ, θ) = 0.5. (12)
(b) Under either A1 or A2, for every θ′ with F_θ′ ≠ F_θ, lim_{n→∞} EPV(θ, θ′) = 0. (13)
(c) Under A1, if F_θ′ = F_θ and T*_n is defined as T_n in (8), using X* from F_θ′, then for large n, EPV(θ, θ′) ≈ 0.5. (14)
From Proposition 3.1 (a), (b), discrimination of θ and θ′, and identifiability of θ, are confirmed when, for large n-values, EDI_n(θ, θ′) is small, since, as M increases, EDI_n(θ, θ′) converges almost surely to its expected value. This always holds under A1 and may hold also under A2 for dependent data. Thus, when EDI_n(θ, θ′) stays near 0.5 as n increases, θ is not identifiable. Also, if EDI_n(θ, θ′) < EDI_n(θ, θ′′), then θ has greater discrimination from θ′ than from θ′′ for sample size n.
Remark 3.1.
In Dempster and Schatzoff (1965) and Sackrowitz and Samuel-Cahn (1999), property (a) in Proposition 3.1 has been used, as well as that, when T_n and T*_n are independent, from (10), EPV(θ, θ′) = P(T_n ≥ T*_n), (15) with T_n and T*_n obtained, respectively, under H and K, or under H and H.
3.4 DGE Selection with EDI
EDI in (11) can be used to compare two or more DGEs and choose one, if estimation is the goal. Assume that DGE1 and DGE2 are indexed by a parameter θ (usually of similar nature, for example, location), and that samples as in Definition 3.6 are obtained. Then, the discrimination of θ from θ′ is greater in DGE1 than in DGE2 if EDI_n,DGE1(θ, θ′) < EDI_n,DGE2(θ, θ′). (16)
When (16) holds for θ and θ′ in Θ, then DGE1 is preferred if both DGE1 and DGE2 could be used for modeling. The reason is that what holds for a choice of θ and θ′ is expected to hold for all the parameters in Θ. The EDI comparison can also be used for DGE1 and DGE2 with parameters, respectively, in Θ_1 and in Θ_2, with ρ_i the distance in Θ_i.
3.5 The Proportion of p-values Index, PPVI
Another tool, complementing EDI, is introduced for a particular type of DGEs.
Definition 3.7.
When DGE1 and DGE2 each use the quantile model (1) with Y-variables from the same model, for each of the M tests providing (11), the same Y can be used to generate the data, and the p-values PV_i,DGE1, PV_i,DGE2 are compared. The proportion of p-values index is PPVI_n(θ, θ′) = M^{−1} Σ_{i=1}^{M} I(PV_i,DGE1 < PV_i,DGE2). (17)
PPVI_n(θ, θ′) is an additional tool to extract indications for discrimination. PPVI is also used instead of PPVI_n(θ, θ′).
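Definition 3.7 can likewise be sketched in code. The two DGEs below (normal location samplers with scales 1 and 3, sharing the Z-input) are an illustrative stand-in chosen by us, not the paper’s g-and-h/g-and-k pair, and the asymptotic KS p-value replaces R’s ks.test.

```python
import math
import random

def ks_distance(x, y):
    """Kolmogorov distance between the empirical cdfs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue(x, y):
    """Two-sample KS p-value via the asymptotic Kolmogorov distribution."""
    n, m = len(x), len(y)
    lam = ks_distance(x, y) * math.sqrt(n * m / (n + m))
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return max(0.0, min(1.0, p))

def ppvi(q1, q2, theta, theta_p, n=200, M=50, seed=1):
    """Proportion of M paired trials in which DGE1's KS p-value is below DGE2's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(M):
        # shared latent normal inputs, as in Definition 3.7
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        zp = [rng.gauss(0.0, 1.0) for _ in range(n)]
        p1 = ks_pvalue([q1(theta, v) for v in z], [q1(theta_p, v) for v in zp])
        p2 = ks_pvalue([q2(theta, v) for v in z], [q2(theta_p, v) for v in zp])
        wins += p1 < p2
    return wins / M

# Illustrative quantile-style samplers sharing the Z-input:
dge1 = lambda mu, z: mu + z        # location shift, scale 1
dge2 = lambda mu, z: mu + 3.0 * z  # same shift, diluted by scale 3
```

Here the larger scale of dge2 dilutes the same location shift, so dge1’s p-values are systematically smaller and PPVI is well above 0.5, mirroring criterion (18).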
DGE Selection with the PPVI Discrimination Criterion
For DGE1 and DGE2 as in Definition 3.7, the discrimination of θ from θ′ for sample size n is greater/easier in DGE1 than in DGE2 if PPVI_n(θ, θ′) ≥ 0.5. (18)
Because of the strict inequality, “<”, used in (17), the value 0.5 is also included in (18). In some cases, when every θ is identifiable and (18) holds, a better estimate is obtained for DGE1, unless the sample size is very large; compare the PPVIs for Tukey’s g-and-h model (DGE1) and the g-and-k model (DGE2). Similar are the findings in comparing EDI (g-and-h) with EDI (g-and-k).
3.6 Implementation
In DGE, Θ is a compact subset (rectangle) of R^p with distance, ρ, the L2-distance. For ϵ small enough, such that parameters in an ϵ-neighborhood are “indistinguishable” for each coordinate, let Θ_ϵ be the sieve in (5), with mesh determined by, but not necessarily equal to, ϵ. The construction of the sieve is described in the examples. The reader may prefer to use a different sieve than the one used, for example, in R, based on midpoints of intervals instead of the end-points of intervals for each θ-coordinate. In the EDI-graphic for θ, the y-axis is used for EDI-values and the x-axis for the Euclidean L2-distance between θ and the elements of the sieve, Θ_ϵ.
For univariate data, the R-function ks.test provides the p-value for the Kolmogorov-Smirnov two-sample test of equality for θ and θ′, with samples X and X*, respectively, from F_θ and F_θ′. For multivariate data, the approaches in Peacock (1983) and Polonik (1999) can be used to obtain p-values. The theory in the latter was implemented by Glazer, Lindenbaum, and Markovitch (2012), who estimated high-density regions directly instead of using density estimates.
Use of EDI-graphics
EDI-graphics of EDI_n(θ, θ′) against ρ(θ, θ′), for θ′ in the sieve and various sample sizes, n:
(A) show nonidentifiability of θ (and other parameters) when, as n increases, at least one EDI_n(θ, θ′) with θ′ ≠ θ has values near 0.5;
(B) indicate greater discrimination of θ from θ′ than from θ′′ when EDI_n(θ, θ′) < EDI_n(θ, θ′′); and
(C) allow comparing data-generating machines with their EDI-graphics using (A) and (B), preferring greater discrimination and less nonidentifiability when estimation is the goal.
4 Applications
EDI-graphics for EDI_n(θ, θ′), with θ′ in the sieve, are presented for various sample sizes, for known models and also for DGEs with identifiable and nonidentifiable parameters, to see their differences when using the graphics. For the interpretation of EDI-graphics, follow (A)–(C) at the end of the previous section and/or read the figures’ captions. The examples indicate that EDI-graphics for one θ, with θ′ in the sieve and for moderate and large sample size, n, are sufficient for checking identifiability in Θ and parameter discrimination, thus confirming EDI’s usefulness.
EDI-graphics for statistical models are first presented.
Example 4.1.
EDI-graphics for the Normal (N) and Cauchy (C) models are depicted for studying discrimination and confirming identifiability of θ = (μ, σ). Sieves for the assumed parameter spaces of μ and σ provide the Θ-sieve, with elements that include θ.
M = 100 independent samples X and X*, each of size n, are obtained with parameters, respectively, θ and θ′ in both models, and the corresponding EDIs are calculated. Identifiability of θ is indicated since EDI_n(θ, θ) takes a value near 0.5, and the other EDI values decrease as the Euclidean distance of θ and θ′ increases. All the EDI-values eventually decrease to zero as n increases, except for EDI_n(θ, θ). The simulation results are expected from Proposition 3.1, since EDI converges to the EPV.
Indicative EDI-values are provided for both models, to observe the better parameter discrimination for the Normal model.
Example 4.2.
EDI-graphics for Normal distributions of the form N(a + b, σ²) are depicted to confirm nonidentifiability of the parameter θ = (a, b). Sieves for the assumed parameter spaces of a and b provide the sieve for θ.
M = 100 independent samples X and X*, each of size n, are obtained with parameters, respectively, θ and θ′, and the corresponding EDIs are calculated. For several θ′, circles in the EDI-graphic “jump” and have y-values near 0.5 as the distance on the x-axis from θ increases. The θ′-values that indicate nonidentifiability of θ have EDI values near 0.5, and the sum of their coordinates equals the sum a + b for θ. The θ′ with EDI-values near 0.2 indicate additional nearly nonidentifiable parameters and have sum of coordinates 2.7.
Example 4.3.
Let Φ_{μ,σ²} denote the Normal cdf with mean, μ, and variance, σ². Consider the normal mixture model, pΦ_{μ1,σ²} + (1 − p)Φ_{μ2,σ²}, with known variance, σ², and parameter θ = (p, μ1, μ2), which is nonidentifiable: (p, μ1, μ2) and (1 − p, μ2, μ1) determine the same mixture.
EDI is used to confirm algorithmically the nonidentifiability of θ. Sieves are formed for p and for the means μ1, μ2, and their product (19) is the sieve for θ, which contains θ′-values making θ nonidentifiable. EDI-graphics are depicted for various n. For small n the EDI-values are spread out, but as n increases several EDI-values are near zero. For larger n there are 6 EDI-values far from 0; those with similar values correspond to nonidentifiable parameters. When the EDI-values of the last two are nearly zero, those corresponding to indices 63 and 125 decrease, while those for indices 46 and 128 remain near 0.5, as expected.
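The label-swapping nonidentifiability in Example 4.3 can also be checked numerically: the two parameterizations below give the same mixture cdf at every point, up to machine precision (a small Python check; the function names are ours).

```python
import math

def norm_cdf(x, mu, sigma):
    """Normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, p, mu1, mu2, sigma=1.0):
    """Two-component normal mixture cdf with weight p on the first component."""
    return p * norm_cdf(x, mu1, sigma) + (1.0 - p) * norm_cdf(x, mu2, sigma)

theta = (0.3, -1.0, 2.0)          # (p, mu1, mu2)
theta_swapped = (0.7, 2.0, -1.0)  # component labels exchanged
diffs = [abs(mixture_cdf(x, *theta) - mixture_cdf(x, *theta_swapped))
         for x in range(-5, 6)]
```

Since the two cdfs coincide, no sample-based index can separate θ from its label-swapped version: the corresponding EDI-values stay near 0.5 for every n.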
Simulations are repeated when the means’ coordinates in θ are not in the sieve; θ’s closest element in the sieve is then used. The same pattern is observed, except for the two larger EDI-values, which decrease more slowly than the other EDI-values towards zero as n increases. Nonidentifiable parameters have similar EDI-values, with those of the last two vanishing for large n.
EDI-graphics are used also for studying discrimination and identifiability of the parameters in data-generating machines which are compared. Such examples follow.
Example 4.4.
Tukey’s g-and-h model (DGE1) and the g-and-k model (DGE2) are compared g-locally with EDI and PPVI. A Normal sample, Z, is used to obtain samples, respectively, from (2) and (3); another Normal sample, also of size n, is used similarly with a different g-value. The p-value for the Kolmogorov-Smirnov test is obtained for both the DGE1 and DGE2 models. Both experiments are repeated M = 1000 times for several n. For each n, EDIs for DGE1 and DGE2 are computed, and counters measure the number of times out of M the p-value for DGE2 is smaller than, or larger than, that of DGE1. Comparison of the EDIs indicates that, for the g-values used, Tukey’s g-and-h model has greater g-discrimination than the g-and-k model. For example, for a 5% significant difference, n = 1500 is needed for the g-and-h model, and n = 2500 is needed for the g-and-k model. Comparison of the PPVIs indicates that Tukey’s g-and-h model has greater g-discrimination than the g-and-k model, with EDI decreasing to zero and PPVI increasing to one as n increases. A much larger sample size is needed for the parameters g = 5 and g′ of the g-and-k model to be discriminated as in Tukey’s g-and-h model.
Remark 4.1.
The results in Example 4.4 for the g-and-k model suggested comparing also smooth histograms for this model. Visually, there is no discrimination between overlaid g-and-k smooth histograms with parameters, respectively, g = 5 and g′, for various sample sizes.
From the results in Example 4.4, Tukey’s g-and-h model has better local discrimination than the g-and-k model, unless the sample size is very large. The findings extend those in Rayner and MacGillivray (2002), being based on the data and not on a particular estimate, for example, the MLE. The results are confirmed g-globally below, with EDI-graphics.
Example 4.5.
Tukey’s g-and-h and the g-and-k models are examined as Learning Machines with the same parameters. It is assumed g is unknown, but the remaining parameters are known. To use the same R programs, we considered parameter spaces for g and the remaining parameters; the sieve for the first parameter space determines the Θ-sieve.
M = 100 independent samples X and X*, each of size n, are obtained with parameters, respectively, θ and θ′ in both models, and the corresponding EDIs are calculated. EDI-graphics for both models indicate identifiability. Comparison of the EDI-graphics for the same n-value indicates that Tukey’s g-and-h model has better parameter discrimination than the g-and-k model. The findings confirm graphically the results in Example 4.4.
Example 4.6.
A similar set-up as in Example 4.3 is used, with the only difference that the data, X, are obtained from the Normal learning machine that is a convex combination of normal densities. (20)
θ is an element of Θ, with the same sieve. Z is a standard Normal random variable with density φ. M = 100 learning samples are used for EDI. A similar experiment is examined for a Sigmoid learning machine, with the normal density in (19) replaced by a sigmoid.
The EDI-graphics indicate nonidentifiability, and that the parameters are better discriminated with the data from the Normal learning machine.
5 Conclusion
For Black-Box models, parameter identifiability cannot be confirmed. For learning machines, nonidentifiability is ubiquitous, and the resulting difficulty in the estimation of the parameters and in the reproducibility of the learned models is not yet quantified. The Empirical Discrimination Index, EDI_n(θ, θ′), is used for θ′ in a sieve of Θ to confirm with EDI-graphics, almost surely, identifiability of all parameters θ in Θ, or nonidentifiability. EDI and the Proportion of p-values Index (PPVI) are also useful tools for identifying, among samplers and learning machines, those that have greater parameter discrimination for a given sample size, n, thus leading to more efficient estimates.
Supplementary Materials
Proofs and R-functions used in Examples 4.1–4.6.
Acknowledgments
Many thanks are due to Professor Faming Liang and Professor Galin Jones, Editors, who have handled, respectively, the original submission and the revisions. Thanks are due to the referees, for their comments that improved the presentation of the paper, and to Mr. Yongzhen Feng, Tsinghua University, for the suggestions to improve readability.
Disclosure Statement
The authors report there are no competing interests to declare.
Notes
1 When without preliminary estimation, has countably infinite elements.
2 Since one DGE is studied.
References
- Birgé, L. (2006), “Model Selection via Testing: An Alternative to Penalized Maximum Likelihood Estimators,” Annales de l’Institut Henri Poincaré, 42, 273–325.
- Breiman, L. (2001), “Statistical Modeling: The Two Cultures,” Statistical Science, 16, 199–231. DOI: 10.1214/ss/1009213726.
- Breiman, L. (2002), “Looking Inside the Black Box,” available at https://www.stat.berkeley.edu/users/breiman/wald2002-2.pdf
- Csörgő, M., and Horváth, L. (1997), Limit Theorems in Change-Point Analysis, New York: Wiley.
- Dempster, A. P., and Schatzoff, M. (1965), “Expected Significance Level as a Sensibility Index for Test Statistics,” Journal of the American Statistical Association, 60, 420–436. DOI: 10.1080/01621459.1965.10480802.
- Fukumizu, K. (2003), “Likelihood Ratio of Non-identifiable Models and Multilayer Neural Networks,” Annals of Statistics, 31, 833–851.
- Fukumizu, K., and Amari, S. (2000), “Local Minima and Plateaus in Hierarchical Structures of Multilayer Perceptrons,” Neural Networks, 13, 317–327. DOI: 10.1016/s0893-6080(00)00009-5.
- Glazer, A., Lindenbaum, M., and Markovitch, S. (2012), “Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data,” Advances in Neural Information Processing Systems, 1, 728–736.
- Hartigan, J. A. (1985), “A Failure of Likelihood Asymptotics for Normal Mixtures,” in Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (Vol. 2), eds. L. M. Le Cam and R. A. Olshen, pp. 807–810, Belmont, CA: Wadsworth.
- Haynes, M. A., MacGillivray, H. L., and Mengersen, K. L. (1997), “Robustness of Ranking and Selection Rules using Generalized g-and-k Distributions,” Journal of Statistical Planning and Inference, 65, 45–66. DOI: 10.1016/S0378-3758(97)00050-5.
- Hyvärinen, A., and Morioka, H. (2016), “Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA,” in Advances in Neural Information Processing Systems, pp. 3765–3773.
- Hyvärinen, A., Sasaki, H., and Turner, R. E. (2018), “Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning,” arXiv preprint arXiv:1805.08651.
- Le Cam, L. M. (1973), “Convergence of Estimates Under Dimensionality Restrictions,” Annals of Statistics, 1, 38–53.
- Le Cam, L. M., and Yang, G. L. (1990), Asymptotics in Statistics. Some Basic Concepts, New York: Springer.
- Peacock, J. A. (1983), “Two-Dimensional Goodness-of-Fit Testing in Astronomy,” Monthly Notices of the Royal Astronomical Society, 202, 615–627. DOI: 10.1093/mnras/202.3.615.
- Polonik, W. (1999), “Concentration and Goodness-of-Fit in Higher Dimensions: (Asymptotically) Distribution-Free Methods,” Annals of Statistics, 27, 1210–1229.
- Ramberg, J. S., Tadikamalla, P. R., Dudewicz, E. J., and Mykytka, E. F. (1979), “A Probability Distribution and Its Uses in Fitting Data,” Technometrics, 21, 201–214. DOI: 10.1080/00401706.1979.10489750.
- Ran, Z.-Y., and Hu, B.-G. (2014), “Determining Parameter Identifiability from the Optimization Theory Framework: A Kullback-Leibler Divergence Approach,” Neurocomputing, 142, 307–317. DOI: 10.1016/j.neucom.2014.03.055.
- Ran, Z.-Y., and Hu, B.-G. (2017), “Parameter Identifiability in Statistical Machine Learning: A Review,” Neural Computation, 29, 1151–1203.
- Rayner, G. D., and MacGillivray, H. L. (2002), “Numerical Maximum Likelihood Estimation for the g-and-k and Generalized g-and-h Distributions,” Statistics and Computing, 12, 57–75.
- Roeder, G., Metz, L., and Kingma, D. P. (2021), “On Linear Identifiability of Learned Representations,” in Proceedings of the 38th International Conference on Machine Learning, PMLR 139, arXiv:2007.00810.
- Rothenberg, T. J. (1971), “Identification in Parametric Models,” Econometrica, 39, 577–591. DOI: 10.2307/1913267.
- Sackrowitz, H., and Samuel-Cahn, E. (1999), “p-Values as Random Variables-Expected p-Values,” American Statistician, 53, 326–331. DOI: 10.2307/2686051.
- Stein, C. (1964), “Inadmissibility of the Usual Estimator for the Variance of a Normal Distribution with Unknown Mean,” Annals of the Institute of Statistical Mathematics, 16, 155–160. DOI: 10.1007/BF02868569.
- Tukey, J. W. (1962), “The Future of Data Analysis,” Annals of Mathematical Statistics, 33, 1–67. DOI: 10.1214/aoms/1177704711.
- Tukey, J. W. (1977), “Modern Techniques in Data Analysis,” NSF-sponsored Regional Research Conference at Southeastern Massachusetts University, North Dartmouth, MA.
- Veres, S. (1987), “Asymptotic Distributions of Likelihood Ratios for Overparameterized ARMA Processes,” Journal of Time Series Analysis, 8, 345–357. DOI: 10.1111/j.1467-9892.1987.tb00446.x.
- Watanabe, S. (2001), “Algebraic Analysis of Nonidentifiable Learning Machines,” Neural Computation, 13, 899–933. DOI: 10.1162/089976601300014402.
- Yan, Y., and Genton, M. G. (2019), “The Tukey g-and-h Distribution,” Significance, 2019, 12–13. DOI: 10.1111/j.1740-9713.2019.01273.x.
- Yatracos, Y. G. (2020), “Learning with Matching in Data-Generating Experiments,” DOI: 10.13140/RG.2.2.30964.58245.
- Yatracos, Y. G. (2021), “Fiducial Matching for the Approximate Posterior: F-ABC.” DOI: 10.13140/RG.2.2.20775.06568.
- Yatracos, Y. G. (2022), “Limitations of the Wasserstein MDE for Univariate Data,” Statistics and Computing, 32, 95. DOI: 10.1007/s11222-022-10146-7.