Abstract
In a Data-Generating Experiment (DGE), the data, X, is often obtained from a Black-Box and is approximated with a learning machine/sampler, f(θ, Y); θ is unknown, Y is random, f is known. When X has unknown cdf, F_θ, nonidentifiability of θ cannot be confirmed and may limit the predictive accuracy of the learned model, f(θ̂, Y), estimate of X. Using properties of the Expected p-value for the Kolmogorov-Smirnov test, the Empirical Discrimination Index (EDI) and the Proportion of p-Values Index (PPVI) are introduced: (i) to confirm, almost surely, discrimination of θ from θ′; (ii) to confirm with EDI-graphics identifiability of θ, by repeating (i) for θ′ in a fine sieve of Θ; and (iii) to compare EDI-graphics and PPVIs of DGEs and select to use the DGE with the greater parameter discrimination and the smaller number of θ′ violating identifiability of θ. Among the applications, EDI and PPVI explain why the g-estimate in Tukey’s g-and-h model is better than that for the g-and-k model, unless the sample size is extremely large; EDI-graphics indicate that Normal learning machines have better parameter discrimination than Sigmoid learning machines and that their parameters are nonidentifiable. Supplementary materials for this article are available online.
1 Introduction
Statistical modeling is used in numerous fields, from Economics and Psychology to Biology, Engineering, and Machine Learning, among others. Often, it is assumed that the sample, X = (X_1, …, X_n), consists of observations with known cdf, F_θ, we can do calculations with, that is, F_θ is tractable; θ is unknown, an element of a metric space (Θ, ρ). Recently, X_i is obtained either from a Sampler/Quantile function, or from a Black-Box, and is approximated with a learning machine, f(θ, Y_i); f is known. The cdf of X_i is either unknown or intractable, and we use in the sequel “unknown” to denote both, and “sampler f(θ)” assuming the Y-input is in the background.
Breiman (2001) called the Black-Box model algorithmic, observed that statisticians rarely adopt it, and commented: “If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.” Earlier, Tukey (1962, p. 60) wrote: “Procedures of diagnosis, and procedures to extract indications rather than extract conclusions, will have to play a large part in the future of data analyses and graphical techniques offer great possibilities in both areas.” Both suggestions have been widely adopted nowadays in Data Science. Such tools and their theoretical justifications are presented for Black-Box models in this work and in Yatracos (2020, 2021, 2022). Among the discussants in Breiman (2001), Cox and Efron looked at the problem as prediction of X, ignoring the statistical inference aspect for θ in the X-model. Consequently, mathematical statisticians did not follow Breiman’s suggestion. Computer scientists study the same problem with f as learning machine.
Modeling goals include estimation of θ and F_θ for the Black-Box, in addition to the accurate and reproducible prediction of future outputs of f using the estimate, θ̂, of θ. These goals require identifiability of θ, which depends on F_θ. Recall that the parameters in Θ are identifiable when, for each θ and θ′ in Θ, F_θ = F_θ′ implies θ = θ′. Identifiability has been confirmed so far only when the cdfs F_θ are known and tractable, usually via the Fisher Information Matrix, which is not available for unknown models.
Another problem, not frequently studied for unknown models, is that often, due to the shapes of the F_θ, the sample size, n, needed for small estimation error may be excessively large, and the statistician may not be aware of it. There is no direct data-tool measuring discrimination of θ from θ′ by evaluating the distance between F_θ and F_θ′, and also confirming θ-identifiability. For example, for Tukey’s g-and-h model (see (2)) and the g-and-k model (see (3)), the difficulty in the discrimination (and estimation) of parameters has been studied via Maximum Likelihood estimates (Rayner and MacGillivray, 2002).
In Machine Learning, nonidentifiability of θ is ubiquitous. The extent of nonidentifiability, that is, the number of θ′ making θ nonidentifiable, and the level of discrimination of parameters in Θ are criteria for the choice of the data-generating/learning machine, f. For tractable models and the Kolmogorov distance, d_K, the larger d_K(F_θ, F_θ′) is, the greater the discrimination between θ and θ′ is; θ is identifiable when d_K(F_θ, F_θ′) > 0 for any θ′ ≠ θ, which implies also F_θ ≠ F_θ′. These facts suggest, for unknown models, estimating their Kolmogorov distance, and lead to the use of tools from the Kolmogorov-Smirnov test to study discrimination and identifiability of θ, as described in Section 3.
Answers to the arising questions are provided by introducing:
(a) the Empirical Discrimination Index (EDI) of θ and θ′, used to study discrimination and identifiability; EDI is based on repeated samples and estimates the expected p-value for the Kolmogorov-Smirnov test between F_θ and F_θ′, under both hypotheses, F_θ′ = F_θ and F_θ′ ≠ F_θ; and (b) the Proportion of p-Values Index (PPVI), complementing EDI and used to compare “locally” different samplers f1, f2 with the same Y-input, regarding the estimation difficulty of a parameter-value, θ, with respect to a parameter θ′ at distance ρ(θ, θ′).
The motivation for this work was that Matching Estimates of g were satisfactory at g0 for Tukey’s g-and-h model/sampler, but not for the g-and-k model/sampler for the same values of h, k (Yatracos 2020, 2021). The questions were “Why?” and “How could this problem be predicted in an Algorithmic model?” EDI and PPVI are used to answer these questions, confirming in particular that Tukey’s g-and-h model has greater g-discrimination than the g-and-k model, unless the sample size is extremely large.
In Example 4.1, EDI-graphics are presented for the normal and Cauchy models to compare the discrimination of their parameters, and to observe the form of EDI-graphics for identifiable parameters. In Examples 4.2 and 4.3, nonidentifiability is confirmed with EDI-graphics, respectively, for the parameters a, b in a normal model with a + b the mean and known variance, and for the mixture of two normal distributions with known variance. In Examples 4.4 and 4.5, Tukey’s g-and-h model and the g-and-k model are compared using EDI and PPVI, to confirm that the former has better discrimination and is to be preferred as data-generating or learning machine if estimation is the goal, unless the sample size, n, is extremely large. In Example 4.6, EDI-graphics are depicted for Normal and Sigmoid data-generating machines, showing that both have unidentifiable parameters but the former has better parameter discrimination.
EDI is not used herein as a statistical test, even though tools from hypotheses testing are used to motivate it, without reference to a significance level. EDI-graphics are simply observed for various sample sizes. There is no other method available to confirm identifiability and discrimination of θ when the underlying data model is unknown; thus, EDI is not compared with another method in the applications. A referee suggested to use EDI with the fashionable Wasserstein distance, instead of the Kolmogorov distance, but this is not supported for Black-Box models, which may have heavy tails (Yatracos 2022), and also because the Wasserstein distance between the empirical distribution/measure and the underlying distribution/probability measure of the data does not necessarily converge to zero in probability as n increases, which is needed in Proposition 3.1 (b) for independent observations. The same referee had the concern: “EDI is used with Data Generators, or when data models are known and tractable. However, for real data, the underlying generating data models are unknown and intractable. This means EDI is not applicable in many real applications.” To study identifiability and discrimination, either the cdf, F_θ, has to be known and tractable, or F_θ has to be “learned from samples obtained for various θ,” that is, either from a sampler, or a quantile function as with Tukey’s g-and-h model, or simply via the inverse of F_θ and uniform r.v.’s on [0,1]. If none of this is possible, as in the referee’s example, one cannot study identifiability and discrimination. Note that in several statistics papers with intractable cdfs it is assumed that data can be generated for various θ-values. Thus, with real data, rather than proposing an intractable cdf indexed by θ, one can propose instead a function, f(θ, Y), generating the data; Y random.
Rothenberg (1971) established conditions for parameter identifiability using the Fisher Information Matrix of tractable F_θ. Nonidentifiable statistical models include mixture models (Hartigan 1985), autoregressive moving averages (Veres 1987), and the change point problem (Csörgo and Horvath 1997). In Statistical Machine Learning, nonidentifiability of θ is ubiquitous and has deep influence, in particular, on the output, f(θ̂, Y), of the Learning Machine, f. The estimate, θ̂, affects the capability of the learned model (or representation) to predict future data (Ran and Hu 2017). For example, in Deep Neural Networks it is preferable that, when a network relearns with data from the same model, the obtained learned representation is nearly similar to the previous one. This wish motivated studying linear identifiability in function space (Roeder, Metz, and Kingma 2021) for tractable models with general exponential form, extending results on distinct models (Hyvärinen and Morioka 2016; Hyvärinen, Sasaki, and Turner 2018). Other tools include the Fisher Information matrix, the asymptotic order of the Likelihood Ratio test statistic of the MLE, and the Kullback-Leibler divergence of tractable models, used among others by Fukumizu (2003), Watanabe (2001), Fukumizu and Amari (2000), and Ran and Hu (2014). A detailed review is presented in Ran and Hu (2017), endorsing in the “Summary and Perspective” (p. 1196) the view in Breiman (2001) for algorithmic models, and especially the tools needed to address identifiability, by adding: “This will become one of the most important issues for machines in the future.” EDI and PPVI clarify these issues, and their advantage over the previously described methods is that there is no need to know the underlying model of the data to check for parameter identifiability and parameter discrimination.
Dempster and Schatzoff (1965) treated the p-value as random and used its expected value, the Expected Significance Level, as a sensibility index for comparing several multivariate tests. The Expected Significance Level was viewed as a “reasonable compromise” to the Neyman-Pearson theory that uses the power of the test, which depends on the α-level. Sackrowitz and Samuel-Cahn (1999) used Expected p-Value (EPV) instead of Expected Significance Level, to stress that the p-value is random, suggested EPV as a test’s performance measure when it is difficult to evaluate the power function, and examined the p-values under the alternative for location and scale models. Since then, EPV and its estimates have been used to study tests for parametric tractable models, but not for studying identifiability or discrimination of parameters in unknown models using the Kolmogorov-Smirnov test.
Empirical discrimination indices for θ based on tests’ comparisons provide useful information in estimation problems, since the elementary estimation problem, in which Θ consists of two parameters, θ and θ′, can be solved by testing. Stein (1964) showed inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean, using a test of hypotheses to obtain the improved estimate. LeCam (1973) and Birgé (2006) used successfully multiple, simultaneous tests of simple hypotheses for estimation with infinite dimensional parameter space, Θ.
In Section 2, previous results related indirectly to parameter discrimination are presented. In Section 3, EPV properties are presented, and EDI and PPVI are introduced along with the use of EDI-graphics. Applications follow in Section 4. Proofs are included in the supplementary material, along with the R-programs used. The reader may proceed directly to the figures that depict EDI-graphics with informative captions.
2 Model Shapes and Parameter Discrimination
Tukey (1962, 1977) used the g-and-h model to better fit samples with heavy tails and skewness, and the λ-distribution to fit both symmetric and asymmetric data. The idea is to model the quantiles of X directly and not via a density (Yan and Genton 2019). Thus, X is seen as a modification of data, Y, using a known data-generator f, with unknown parameter θ; for example, X = f(θ, Y). (1)
Y is observed from a known model with known parameter.
Tukey’s quantile-modeling approach has been used since then in several fields; (1) evolved and has led to an abundance of model shapes that would fit X better. However, this abundance of model shapes near the underlying distribution increases the difficulty in estimating θ, and f in (1) should be examined before its use as a potential data-generator.
Tukey’s (1977) g-and-h model accommodates data from non-Gaussian distributions, with real-valued g controlling skewness, nonnegative h controlling tail heaviness, and with location, a, and scale, b. A vector Z of independent standard normal random variables and parameter values θ = (a, b, g, h) are used to obtain X = a + b · ((e^{gZ} − 1)/g) · e^{hZ²/2}. (2)
The g-and-k model (Haynes, MacGillivray, and Mengersen 1997) includes distributions with more negative kurtosis than the normal distribution, and some bimodal distributions (Rayner and MacGillivray 2002, p. 58). Standard normal Z and parameter values θ = (a, b, g, k) are used to obtain X = a + b · (1 + c · (1 − e^{−gZ})/(1 + e^{−gZ})) · (1 + Z²)^k · Z. (3)
c is a parameter used to make the sample correspond to a density, and usually c = 0.8.
With abuse of notation, we use Z in (2) and (3) instead of Y, and the same for their coordinates. Since the g-and-k and the g-and-h models both use the same Z, EDI and PPVI are used in Examples 4.4 and 4.5.
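To make the two samplers concrete, the code below is an illustrative Python transcription of the standard parameterizations behind (2) and (3) (the paper’s implementation is in R); the function names are ours, and the usual c = 0.8 is used for the g-and-k model.

```python
import math
import random

def gh_quantile(z, a=0.0, b=1.0, g=0.5, h=0.1):
    """Tukey g-and-h transform of a standard normal value z, as in (2)."""
    gz = (math.exp(g * z) - 1.0) / g if g != 0 else z  # limit g -> 0 is z
    return a + b * gz * math.exp(h * z * z / 2.0)

def gk_quantile(z, a=0.0, b=1.0, g=0.5, k=0.1, c=0.8):
    """g-and-k transform as in (3); c = 0.8 is the conventional choice."""
    skew = 1.0 + c * (1.0 - math.exp(-g * z)) / (1.0 + math.exp(-g * z))
    return a + b * skew * (1.0 + z * z) ** k * z

# Draw a small g-and-h sample by transforming standard normal draws.
rng = random.Random(0)
sample_gh = [gh_quantile(rng.gauss(0.0, 1.0)) for _ in range(5)]
```

Both transforms reduce to the location a at z = 0, and are increasing in z for the parameter ranges used here, so they are valid quantile functions.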
For the g-and-k model (Haynes, MacGillivray, and Mengersen 1997) and the generalized g-and-h models, Rayner and MacGillivray (2002) confirmed the difficulty of the MLE to discriminate distributional shapes and parameters’ values with small and moderate sample sizes: “… computational Maximum Likelihood procedures are very good for very large sample sizes, but they should not necessarily be assumed to be safe for even moderately large sample sizes” (p. 58); also, “… with moderately large positive (i.e., to the right) skewness, the MLE method fitting to the g-and-k distribution cannot efficiently discriminate between moderate positive values and small negative values of the kurtosis parameter.” (p. 64).
For Tukey’s asymmetric λ-distributions, with a wider variety of distributional shapes than g-and-h, and using the Moments estimation method, it is observed: “An additional difficulty with the use of this distribution when fitting through moments, is that of nonuniqueness, where more than one member of the family may be realized when matching the first four moments …” (Ramberg et al. 1979; Rayner and MacGillivray 2002, p. 58).
These findings suggest the search for a data dependent tool to evaluate the discrimination and identifiability of θ-values.
3 Discrimination and Identifiability of Parameters in Algorithmic Models with EDI and PPVI
3.1 Data Generating Experiment-Definitions- Assumptions on f
The elements of a statistical problem with data obtained from known models are included in Le Cam’s Statistical Experiment, with sample space, its σ-field, and probabilities (or cdfs) F_s indexed by s in a parameter space; see, for example, LeCam and Yang (1990). The parallel notion of the Data-Generating Experiment is introduced for data obtained from Breiman’s Algorithmic models and Samplers.
Definition 3.1.
A Data-Generating Experiment (DGE) consists of the sample space and the data-generating mechanism (or sampler), f, with inputs the parameter θ (θ ∈ Θ) and the sample, Y, that may be latent or not.
The findings in Section 2 indicate, for DGEs with unknown models, the need to study the discrimination of parameters using tools independent from estimation methods. Examples with DGEs appear in the Applications section and include in particular Tukey’s g-and-h model and the g-and-k model.
In a DGE, a sample X with size n is drawn using an unknown parameter, θ, in the sampler, f. The observations in X have unknown cdf, F_θ, and the aim is to estimate θ. Herein, Θ is a subset of R^p equipped with distance, ρ. For a statistical inference problem to be meaningful, every θ in Θ should be identifiable. DGE includes any underlying structure needed, for example, a distance for cdfs, a prior for θ and σ-field(s). The user can select to draw one or more samples X via f. With abuse of notation, we use F_θ instead of the cdf of f(θ, Y).
The sampler, f, in a DGE successfully replaces the unknown underlying models in the Statistical Experiment, when repeated samples are drawn for selected θ in Θ, providing adequate information for F_θ (Yatracos 2020).
Definition 3.2.
For any two distribution functions F, G, their Kolmogorov distance is d_K(F, G) = sup_x |F(x) − G(x)|. (4)
Definition 3.3.
For any sample X = (X_1, …, X_n) of random vectors, n · F̂_X(x) denotes the number of X_i’s with all their components smaller than or equal to the corresponding components of x; F̂_X is the empirical cdf of X.
Definition 3.4.
θ is discriminated from θ′ when F_θ ≠ F_θ′, or equivalently if d_K(F_θ, F_θ′) > 0.
Definition 3.5.
θ is identifiable when, for any θ′ ≠ θ, it holds that d_K(F_θ, F_θ′) > 0.
The larger the distance d_K(F_θ, F_θ′) is, the greater the discrimination between θ and θ′ is. If θ is discriminated from all θ′ ≠ θ, then θ is identifiable. However, in a DGE, d_K(F_θ, F_θ′) can only be estimated using f, due to the unknown cdfs F_θ and F_θ′.
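Definitions 3.2–3.5 can be made concrete with a short routine. The following is an illustrative Python sketch (name ours) of the Kolmogorov distance between the empirical cdfs of two univariate samples, computed in a single merge pass over the sorted values.

```python
def kolmogorov_distance(x, y):
    """sup-distance d_K between the empirical cdfs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # advance past ties so both empirical cdfs are evaluated just after v
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d
```

For identical samples the distance is 0; for samples with disjoint supports it is 1, the two extremes of discrimination discussed above.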
The assumption from “Looking Inside the Black Box” (Breiman 2002, p. 3) is used in A1 for the DGE.
A1: In DGE, the data, X, are generated by independent draws via f.
Assumption A1 is used in Proposition 3.1, (b), (c). However, (b) can be obtained with dependent draws via f, under the weaker assumption A2.
A2: In DGE, for the data, X, with size n and underlying cdf of its components, F, d_K(F̂_X, F) converges to zero in probability.
3.2 Motivation and Description of the EDI-Approach
In practice, θ is not distinguished locally from θ′-values in an open ρ-ball, B(θ, ϵ), centered at θ with small radius ϵ. Violation of θ-identifiability due to θ′ in B(θ, ϵ), for ϵ small, is not expected to have significant effect in learning machines, nor in probability calculations. Thus, we restrict attention to identifiability of θ with respect to θ′ in B^c(θ, ϵ); A^c denotes the complement of set A.
For a DGE, the goals herein are: (i) to study the discrimination of θ and θ′ at ρ-distance greater than or equal to a small ϵ > 0; (ii) to confirm identifiability of θ using (i) for all θ′ in an ϵ-dense subset/sieve, Θ_ϵ, of B^c(θ, ϵ); and (iii) to select among several data-generating machines which one to use for better estimation of parameters, taking into consideration the sample size, and preferring less nonidentifiability, that is, a smaller number of θ′ making θ nonidentifiable, and more/greater discrimination, namely larger d_K-values.
For (i), the main tool to discriminate θ from θ′ is the empirical discrimination index, EDI_n(θ, θ′), which estimates the expected p-value, EPV, for the Kolmogorov-Smirnov test of the hypotheses F_θ′ = F_θ against F_θ′ ≠ F_θ, and has special properties (Proposition 3.1). Samples X and X* of size n are drawn using θ and θ′, respectively, and d_K(F_θ, F_θ′) is estimated by d_K(F̂_X, F̂_X*). To evaluate how large this estimate is, since it is random, one could rely on its p-value. Having the luxury of the f-availability, M such pairs of samples with p-values PV_1, …, PV_M are obtained, and the average of these p-values is used. A smaller value of EDI_n(θ, θ′) (or, simply EDI) indicates greater discrimination, as a smaller p-value does.
For (ii), Θ is assumed to be compact, which holds after a preliminary estimation of θ; see, for example, Yatracos (2020). Let Θ_ϵ denote the sieve of B^c(θ, ϵ). (5)¹ ϵ-identifiability of θ is confirmed almost surely by confirming with EDI its discrimination from each θ′ in Θ_ϵ.
For (iii), EDI-graphics can be compared for several samplers, as in the figures.
For practical purposes, ϵ-identifiability of θ will imply identifiability for all the parameters in B(θ, ϵ). Global identifiability in Θ can be confirmed by establishing ϵ-identifiability for every θ in Θ_ϵ. Alternatively, EDI-graphics for one θ only, with θ′ in the sieve, and various sample sizes, n, may be enough, as observed in Example 4.1, for which θ-identifiability implies θ′-identifiability for every θ′ in Θ. This is expected to hold for location models.
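The sieve used in (i)–(ii) can be sketched as a grid over a compact rectangle Θ that excludes the ϵ-ball around θ. The snippet below is an illustrative Python version (the paper’s R construction uses interval end-points; the function name and the grid choice are ours).

```python
import itertools
import math

def make_sieve(theta, lows, highs, eps):
    """Grid points of the rectangle [lows, highs] at L2 distance >= eps from theta."""
    axes = []
    for lo, hi in zip(lows, highs):
        k = int(round((hi - lo) / eps))           # number of eps-steps per coordinate
        axes.append([lo + t * eps for t in range(k + 1)])
    # keep only grid points outside the open eps-ball B(theta, eps)
    return [pt for pt in itertools.product(*axes)
            if math.dist(pt, theta) >= eps]
```

For example, on Θ = [0, 1] with θ = 0 and ϵ = 0.5, the sieve consists of the points 0.5 and 1.0; only θ itself is excluded.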
3.3 The EDI-Details
In a DGE with underlying unknown cdfs, let F_θ and F_θ′ be the cdfs of samples drawn with θ and θ′ (6) and consider the hypotheses H: F_θ′ = F_θ against K: F_θ′ ≠ F_θ. (7)
For the obtained samples X and X* of size n, with unknown cdfs F_θ and F_θ′, respectively, let T_n = d_K(F̂_X, F̂_X*). (8)
Under H, let G_0 denote T_n’s cdf, and P_0 the corresponding probability. H is rejected if T_n is large, or instead if the p-value PV = 1 − G_0(T_n) (9) is small. The p-value (9) is calculated under H, and is random since it depends on the observed T_n-value. Its expected value, EPV, is going to be calculated under both H and K, and we write EPV(θ, θ′) = E[1 − G_0(T*_n)]. (10)²
On the right side of (10), T*_n is the random variable with observed value obtained when X* is drawn using θ′.
Definition 3.6
(EDI). In a DGE, let X_1, …, X_M be samples of size n drawn with unknown cdf F_θ, and let X*_1, …, X*_M be samples of size n from F_θ′. Let PV_i be the p-value (9) for the two-sided Kolmogorov-Smirnov test (7) using X_i and X*_i. The Empirical Discrimination Index (EDI) of θ and θ′ is EDI_n(θ, θ′) = M^{−1} Σ_{i=1}^{M} PV_i. (11)
The smaller the EDI_n(θ, θ′)-value in (11) is, the greater/better the discrimination of θ and θ′ is. EDI is also used instead of EDI_n(θ, θ′) in the sequel.
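Definition 3.6 can be transcribed almost directly. The sketch below is an illustrative pure-Python version (names ours); the p-value from the paper’s R ks.test is replaced here by the asymptotic Kolmogorov-distribution approximation, which is an assumption of this sketch, not the paper’s implementation.

```python
import math
import random

def ks_distance(x, y):
    """Kolmogorov distance between the empirical cdfs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue(x, y):
    """Two-sample KS p-value via the asymptotic Kolmogorov distribution."""
    n, m = len(x), len(y)
    lam = ks_distance(x, y) * math.sqrt(n * m / (n + m))
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return max(0.0, min(1.0, p))

def edi(sampler, theta, theta_prime, n=200, M=100, seed=0):
    """EDI_n(theta, theta'): average KS p-value over M pairs of size-n samples."""
    rng = random.Random(seed)
    pv = [ks_pvalue([sampler(theta, rng) for _ in range(n)],
                    [sampler(theta_prime, rng) for _ in range(n)])
          for _ in range(M)]
    return sum(pv) / M

# Illustrative sampler: normal location model, theta is the mean.
normal = lambda mu, rng: rng.gauss(mu, 1.0)
```

In line with Proposition 3.1, edi(normal, 0.0, 0.0) stays near 0.5, while edi(normal, 0.0, 1.0) is near zero already for moderate n.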
Proposition 3.1.
In a DGE, let T_n and PV be as previously described.
(a) For every θ in Θ and for every sample size n, EPV(θ, θ) = 0.5. (12)
(b) Under either A1 or A2, for every θ′ with F_θ′ ≠ F_θ, lim_{n→∞} EPV(θ, θ′) = 0. (13)
(c) Under A1, if F_θ′ = F_θ and T*_n is defined as T_n in (8), using X* from F_θ′, then for large n, EPV(θ, θ′) ≈ 0.5. (14)
From Proposition 3.1 (a), (b), discrimination of θ and θ′, and identifiability of θ, are confirmed when, for large n-values, EDI_n(θ, θ′) is small, since, as M increases, EDI_n(θ, θ′) converges almost surely to its expected value. This always holds under A1 and may hold also under A2 for dependent data. Thus, when EDI_n(θ, θ′) stays near 0.5 as n increases, θ is not identifiable. Also, if EDI_n(θ, θ′) < EDI_n(θ, θ′′), then θ has greater discrimination from θ′ than from θ′′ for sample size n.
Remark 3.1.
In Dempster and Schatzoff (1965) and Sackrowitz and Samuel-Cahn (1999), property (a) in Proposition 3.1 has been used, as well as that, when T_n and T*_n are independent, from (10), EPV(θ, θ′) = P(T_n ≥ T*_n), (15) with T_n and T*_n obtained, respectively, under H and K, or under H and H.
3.4 DGE Selection with EDI
EDI in (11) can be used to compare two or more DGEs and choose one, if estimation is the goal. Assume that DGE1 and DGE2 are indexed by a parameter θ (usually of similar nature, for example, location), and that samples as in Definition 3.6 are obtained. Then, the discrimination of θ from θ′ is greater in DGE1 than in DGE2 if EDI_n,DGE1(θ, θ′) < EDI_n,DGE2(θ, θ′). (16)
When (16) holds for θ and θ′ in Θ, then DGE1 is preferred if both DGE1 and DGE2 could be used for modeling. The reason is that what holds for a choice of θ and θ′ is expected to hold for all the parameters in Θ. The EDI comparison can also be used for DGE1 and DGE2 with parameters, respectively, in Θ_1 and in Θ_2, with ρ_i the distance in Θ_i.
3.5 The Proportion of p-values Index, PPVI
Another tool, complementing EDI, is introduced for a particular type of DGEs.
Definition 3.7.
When DGE1 and DGE2 each use the quantile model (1) with Y-variables from the same model, for each of the M tests providing (11), the same Y can be used to generate the data, and the p-values PV_i,DGE1, PV_i,DGE2 are compared. The proportion of p-values index is PPVI_n(θ, θ′) = M^{−1} Σ_{i=1}^{M} I(PV_i,DGE1 < PV_i,DGE2). (17)
PPVI_n(θ, θ′) is an additional tool to extract indications for discrimination. PPVI is also used instead of PPVI_n(θ, θ′).
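Definition 3.7 can likewise be sketched in code. The two DGEs below (normal location samplers with scales 1 and 3, sharing the Z-input) are an illustrative stand-in chosen by us, not the paper’s g-and-h/g-and-k pair, and the asymptotic KS p-value replaces R’s ks.test.

```python
import math
import random

def ks_distance(x, y):
    """Kolmogorov distance between the empirical cdfs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue(x, y):
    """Two-sample KS p-value via the asymptotic Kolmogorov distribution."""
    n, m = len(x), len(y)
    lam = ks_distance(x, y) * math.sqrt(n * m / (n + m))
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return max(0.0, min(1.0, p))

def ppvi(q1, q2, theta, theta_p, n=200, M=50, seed=1):
    """Proportion of M paired trials in which DGE1's KS p-value is below DGE2's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(M):
        # shared latent normal inputs, as in Definition 3.7
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        zp = [rng.gauss(0.0, 1.0) for _ in range(n)]
        p1 = ks_pvalue([q1(theta, v) for v in z], [q1(theta_p, v) for v in zp])
        p2 = ks_pvalue([q2(theta, v) for v in z], [q2(theta_p, v) for v in zp])
        wins += p1 < p2
    return wins / M

# Illustrative quantile-style samplers sharing the Z-input:
dge1 = lambda mu, z: mu + z        # location shift, scale 1
dge2 = lambda mu, z: mu + 3.0 * z  # same shift, diluted by scale 3
```

Here the larger scale of dge2 dilutes the same location shift, so dge1’s p-values are systematically smaller and PPVI is well above 0.5, mirroring criterion (18).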
DGE Selection with the PPVI Discrimination Criterion
For DGE1 and DGE2 as in Definition 3.7, the discrimination of θ from θ′ for sample size n is greater/easier in DGE1 than in DGE2 if PPVI_n(θ, θ′) ≥ 0.5. (18)
Because of the strict inequality, “<”, used in (17), the value 0.5 is also included in (18). In some cases, when every θ is identifiable and (18) holds, a better estimate is obtained for DGE1, unless the sample size is very large; compare the PPVIs for Tukey’s g-and-h model (DGE1) and the g-and-k model (DGE2). Similar are the findings in comparing EDI (g-and-h) with EDI (g-and-k).
3.6 Implementation
In DGE, Θ is a compact subset (rectangle) of R^p with distance, ρ, the L2-distance. For ϵ small enough, such that parameters in an ϵ-neighborhood are “indistinguishable” for each coordinate, let Θ_ϵ be the sieve in (5), with mesh determined by, but not necessarily equal to, ϵ. The construction of the sieve is described in the examples. The reader may prefer to use a different sieve than the one used, for example, in R, based on midpoints of intervals instead of the end-points of intervals for each θ-coordinate. In the EDI-graphic for θ, the y-axis is used for EDI-values and the x-axis for the Euclidean L2-distance between θ and the elements of the sieve, Θ_ϵ.
For univariate data, the R-function ks.test provides the p-value for the Kolmogorov-Smirnov two-sample test of equality for θ and θ′, with samples X and X*, respectively, from F_θ and F_θ′. For multivariate data, the approaches in Peacock (1983) and Polonik (1999) can be used to obtain p-values. The theory in the latter was implemented by Glazer, Lindenbaum, and Markovitch (2012), who estimated high-density regions directly instead of using density estimates.
Use of EDI-graphics
EDI-graphics of EDI_n(θ, θ′) against ρ(θ, θ′), for θ′ in the sieve and various sample sizes, n:
(A) show nonidentifiability of θ (and other parameters) when, as n increases, at least one EDI_n(θ, θ′) with θ′ ≠ θ has values near 0.5;
(B) indicate greater discrimination of θ from θ′ than from θ′′ when EDI_n(θ, θ′) < EDI_n(θ, θ′′); and
(C) allow comparing data-generating machines with their EDI-graphics using (A) and (B), preferring greater discrimination and less nonidentifiability when estimation is the goal.
4 Applications
EDI-graphics for EDI_n(θ, θ′), with θ′ in the sieve, are presented for various sample sizes, for known models and also for DGEs with identifiable and nonidentifiable parameters, to see their differences when using the graphics. For the interpretation of EDI-graphics, follow (A)–(C) at the end of the previous section and/or read the figures’ captions. The examples indicate that EDI-graphics for one θ, with θ′ in the sieve and for moderate and large sample size, n, are sufficient for checking identifiability in Θ and parameter discrimination, thus confirming EDI’s usefulness.
EDI-graphics for statistical models are first presented.
Example 4.1.
EDI-graphics for the Normal (N) and Cauchy (C) models are depicted for studying discrimination and confirming identifiability of θ = (μ, σ). Sieves for the assumed parameter spaces of μ and σ provide the Θ-sieve, with elements that include θ.
M = 100 independent samples X and X*, each of size n, are obtained with parameters, respectively, θ and θ′ in both models, and the corresponding EDIs are calculated. Identifiability of θ is indicated since EDI_n(θ, θ) takes a value near 0.5, and the other EDI values decrease as the Euclidean distance of θ and θ′ increases. All the EDI-values eventually decrease to zero as n increases, except for EDI_n(θ, θ). The simulation results are expected from Proposition 3.1, since EDI converges to the EPV.
Indicative EDI-values are provided for both models, to observe the better parameter discrimination for the Normal model.
Example 4.2.
EDI-graphics for Normal distributions of the form N(a + b, σ²) are depicted to confirm nonidentifiability of the parameter θ = (a, b). Sieves for the assumed parameter spaces of a and b provide the sieve for θ.
M = 100 independent samples X and X*, each of size n, are obtained with parameters, respectively, θ and θ′, and the corresponding EDIs are calculated. For several θ′, circles in the EDI-graphic “jump” and have y-values near 0.5 as the distance on the x-axis from θ increases. The θ′-values that indicate nonidentifiability of θ have EDI values near 0.5, and the sum of their coordinates equals the sum a + b for θ. The θ′ with EDI-values near 0.2 indicate additional nearly nonidentifiable parameters and have sum of coordinates 2.7.
Example 4.3.
Let Φ_{μ,σ²} denote the Normal cdf with mean, μ, and variance, σ². Consider the normal mixture model, pΦ_{μ1,σ²} + (1 − p)Φ_{μ2,σ²}, with known variance, σ², and parameter θ = (p, μ1, μ2), which is nonidentifiable: (p, μ1, μ2) and (1 − p, μ2, μ1) determine the same mixture.
EDI is used to confirm algorithmically the nonidentifiability of θ. Sieves are formed for p and for the means μ1, μ2, and their product (19) is the sieve for θ, which contains θ′-values making θ nonidentifiable. EDI-graphics are depicted for various n. For small n the EDI-values are spread out, but as n increases several EDI-values are near zero. For larger n there are 6 EDI-values far from 0; those with similar values correspond to nonidentifiable parameters. When the EDI-values of the last two are nearly zero, those corresponding to indices 63 and 125 decrease, while those for indices 46 and 128 remain near 0.5, as expected.
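The label-swapping nonidentifiability in Example 4.3 can also be checked numerically: the two parameterizations below give the same mixture cdf at every point, up to machine precision (a small Python check; the function names are ours).

```python
import math

def norm_cdf(x, mu, sigma):
    """Normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, p, mu1, mu2, sigma=1.0):
    """Two-component normal mixture cdf with weight p on the first component."""
    return p * norm_cdf(x, mu1, sigma) + (1.0 - p) * norm_cdf(x, mu2, sigma)

theta = (0.3, -1.0, 2.0)          # (p, mu1, mu2)
theta_swapped = (0.7, 2.0, -1.0)  # component labels exchanged
diffs = [abs(mixture_cdf(x, *theta) - mixture_cdf(x, *theta_swapped))
         for x in range(-5, 6)]
```

Since the two cdfs coincide, no sample-based index can separate θ from its label-swapped version: the corresponding EDI-values stay near 0.5 for every n.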
Simulations are repeated when the means’ coordinates in θ are not in the sieve; θ’s closest element in the sieve is then used. The same pattern is observed, except for the two larger EDI-values, which decrease more slowly than the other EDI-values towards zero as n increases. Nonidentifiable parameters have similar EDI-values, with those of the last two vanishing for large n.
EDI-graphics are used also for studying discrimination and identifiability of the parameters in data-generating machines which are compared. Such examples follow.
Example 4.4.
Tukey’s g-and-h model (DGE1) and the g-and-k model (DGE2) are compared g-locally with EDI and PPVI. A Normal sample, Z, is used to obtain samples, respectively, from (2) and (3); another Normal sample, also of size n, is used similarly with a different g-value. The p-value for the Kolmogorov-Smirnov test is obtained for both the DGE1 and DGE2 models. Both experiments are repeated M = 1000 times for several n. For each n, EDIs for DGE1 and DGE2 are computed, and counters measure the number of times out of M the p-value for DGE2 is smaller than, or larger than, that of DGE1. Comparison of the EDIs indicates that, for the g-values used, Tukey’s g-and-h model has greater g-discrimination than the g-and-k model. For example, for a 5% significant difference, n = 1500 is needed for the g-and-h model, and n = 2500 is needed for the g-and-k model. Comparison of the PPVIs indicates that Tukey’s g-and-h model has greater g-discrimination than the g-and-k model, with EDI decreasing to zero and PPVI increasing to one as n increases. A much larger sample size is needed for the parameters g = 5 and g′ of the g-and-k model to be discriminated as in Tukey’s g-and-h model.
Remark 4.1.
The results in Example 4.4 for the g-and-k model suggested comparing also smooth histograms for this model. Visually, there is no discrimination between overlaid g-and-k smooth histograms with parameters, respectively, g = 5 and g′, for various sample sizes.
From the results in Example 4.4, Tukey’s g-and-h model has better local discrimination than the g-and-k model, unless the sample size is very large. The findings extend those in Rayner and MacGillivray (2002), being based on the data and not on a particular estimate, for example, the MLE. The results are confirmed g-globally below, with EDI-graphics.
Example 4.5.
Tukey’s g-and-h and the g-and-k models are examined as Learning Machines with the same parameters. It is assumed g is unknown, but the remaining parameters are known. To use the same R programs, we considered parameter spaces for g and the remaining parameters; the sieve for the first parameter space determines the Θ-sieve.
M = 100 independent samples X and X*, each of size n, are obtained with parameters, respectively, θ and θ′ in both models, and the corresponding EDIs are calculated. EDI-graphics for both models indicate identifiability. Comparison of the EDI-graphics for the same n-value indicates that Tukey’s g-and-h model has better parameter discrimination than the g-and-k model. The findings confirm graphically the results in Example 4.4.
Example 4.6.
A similar set-up as in Example 4.3 is used, with the only difference that the data, X, are obtained from the Normal learning machine that is a convex combination of normal densities. (20)
θ is an element of Θ, with the same sieve. Z is a standard Normal random variable with density φ. M = 100 learning samples are used for EDI. A similar experiment is examined for a Sigmoid learning machine, with the normal density in (19) replaced by a sigmoid.
The EDI-graphics indicate nonidentifiability, and that the parameters are better discriminated with the data from the Normal learning machine.
5 Conclusion
For Black-Box models, parameter identifiability cannot be confirmed. For learning machines, nonidentifiability is ubiquitous, and the resulting difficulty in the estimation of the parameters and in the reproducibility of the learned models is not yet quantified. The Empirical Discrimination Index, EDI_n(θ, θ′), is used for θ′ in a sieve of Θ to confirm with EDI-graphics, almost surely, identifiability of all parameters θ in Θ, or nonidentifiability. EDI and the Proportion of p-values Index (PPVI) are also useful tools for identifying, among samplers and learning machines, those that have greater parameter discrimination for a given sample size, n, thus leading to more efficient estimates.
Supplementary Materials
Proofs and R-functions used in Examples 4.1–4.6.
Acknowledgments
Many thanks are due to Professor Faming Liang and Professor Galin Jones, Editors, who have handled, respectively, the original submission and the revisions. Thanks are due to the referees, for their comments that improved the presentation of the paper, and to Mr. Yongzhen Feng, Tsinghua University, for the suggestions to improve readability.
Disclosure Statement
The authors report there are no competing interests to declare.
Notes
1 When without preliminary estimation, has countably infinite elements.
2 Since one DGE is studied.
References
- Birgé, L. (2006), “Model Selection via Testing: An Alternative to Penalized Maximum Likelihood Estimators,” Annales de l’Institut Henri Poincaré, 42, 273–325.
- Breiman, L. (2001), “Statistical Modeling: The Two Cultures,” Statistical Science, 16, 199–231. DOI: 10.1214/ss/1009213726.
- Breiman, L. (2002), “Looking Inside the Black Box,” available at https://www.stat.berkeley.edu/users/breiman/wald2002-2.pdf
- Csörgő, M., and Horváth, L. (1997), Limit Theorems in Change-Point Analysis, New York: Wiley.
- Dempster, A. P., and Schatzoff, M. (1965), “Expected Significance Level as a Sensibility Index for Test Statistics,” Journal of the American Statistical Association, 60, 420–436. DOI: 10.1080/01621459.1965.10480802.
- Fukumizu, K. (2003), “Likelihood Ratio of Non-identifiable Models and Multilayer Neural Networks,” Annals of Statistics, 31, 833–851.
- Fukumizu, K., and Amari, S. (2000), “Local Minima and Plateaus in Hierarchical Structures of Multilayer Perceptrons,” Neural Networks, 13, 317–327. DOI: 10.1016/s0893-6080(00)00009-5.
- Glazer, A., Lindenbaum, M., and Markovitch, S. (2012), “Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data,” Advances in Neural Information Processing Systems, 1, 728–736.
- Hartigan, J. A. (1985), “A Failure of Likelihood Asymptotics for Normal Mixtures,” in Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (Vol. 2), eds. L. M. Le Cam and R. A. Olshen, pp. 807–810, Belmont, CA: Wadsworth.
- Haynes, M. A., MacGillivray, H. L., and Mengersen, K. L. (1997), “Robustness of Ranking and Selection Rules using Generalized g-and-k Distributions,” Journal of Statistical Planning and Inference, 65, 45–66. DOI: 10.1016/S0378-3758(97)00050-5.
- Hyvärinen, A., and Morioka, H. (2016), “Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA,” in Advances in Neural Information Processing Systems, pp. 3765–3773.
- Hyvärinen, A., Sasaki, H., and Turner, R. E. (2018), “Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning,” arXiv preprint arXiv:1805.08651.
- Le Cam, L. M. (1973), “Convergence of Estimates Under Dimensionality Restrictions,” Annals of Statistics, 1, 38–53.
- Le Cam, L. M., and Yang, G. L. (1990), Asymptotics in Statistics. Some Basic Concepts, New York: Springer.
- Peacock, J. A. (1983), “Two-Dimensional Goodness-of-Fit Testing in Astronomy,” Monthly Notices of the Royal Astronomical Society, 202, 615–627. DOI: 10.1093/mnras/202.3.615.
- Polonik, W. (1999), “Concentration and Goodness-of-Fit in Higher Dimensions: (Asymptotically) Distribution-Free Methods,” Annals of Statistics, 27, 1210–1229.
- Ramberg, J. S., Tadikamalla, P. R., Dudewicz, E. J., and Mykytka, E. F. (1979), “A Probability Distribution and Its Uses in Fitting Data,” Technometrics, 21, 201–214. DOI: 10.1080/00401706.1979.10489750.
- Ran, Z.-Y., and Hu, B.-G. (2014), “Determining Parameter Identifiability from the Optimization Theory Framework: A Kullback-Leibler Divergence Approach,” Neurocomputing, 142, 307–317. DOI: 10.1016/j.neucom.2014.03.055.
- Ran, Z.-Y., and Hu, B.-G. (2017), “Parameter Identifiability in Statistical Machine Learning: A Review,” Neural Computation, 29, 1151–1203.
- Rayner, G. D., and MacGillivray, H. L. (2002), “Numerical Maximum Likelihood Estimation for the g-and-k and Generalized g-and-h Distributions,” Statistics and Computing, 12, 57–75.
- Roeder, G., Metz, L., and Kingma, D. P. (2021), “On Linear Identifiability of Learned Representations,” in Proceedings of the 38th International Conference on Machine Learning, PMLR 139, arXiv:2007.00810.
- Rothenberg, T. J. (1971), “Identification in Parametric Models,” Econometrica, 39, 577–591. DOI: 10.2307/1913267.
- Sackrowitz, H., and Samuel-Cahn, E. (1999), “p-Values as Random Variables-Expected p-Values,” American Statistician, 53, 326–331. DOI: 10.2307/2686051.
- Stein, C. (1964), “Inadmissibility of the Usual Estimator for the Variance of a Normal Distribution with Unknown Mean,” Annals of the Institute of Statistical Mathematics, 16, 155–160. DOI: 10.1007/BF02868569.
- Tukey, J. W. (1962), “The Future of Data Analysis,” Annals of Mathematical Statistics, 33, 1–67. DOI: 10.1214/aoms/1177704711.
- Tukey, J. W. (1977), “Modern Techniques in Data Analysis,” NSF-sponsored Regional Research Conference at Southeastern Massachusetts University, North Dartmouth, MA.
- Veres, S. (1987), “Asymptotic Distributions of Likelihood Ratios for Overparameterized ARMA Processes,” Journal of Time Series Analysis, 8, 345–357. DOI: 10.1111/j.1467-9892.1987.tb00446.x.
- Watanabe, S. (2001), “Algebraic Analysis of Nonidentifiable Learning Machines,” Neural Computation, 13, 899–933. DOI: 10.1162/089976601300014402.
- Yan, Y., and Genton, M. G. (2019), “The Tukey g-and-h Distribution,” Significance, 2019, 12–13. DOI: 10.1111/j.1740-9713.2019.01273.x.
- Yatracos, Y. G. (2020), “Learning with Matching in Data-Generating Experiments,” DOI: 10.13140/RG.2.2.30964.58245.
- Yatracos, Y. G. (2021), “Fiducial Matching for the Approximate Posterior: F-ABC.” DOI: 10.13140/RG.2.2.20775.06568.
- Yatracos, Y. G. (2022), “Limitations of the Wasserstein MDE for Univariate Data,” Statistics and Computing, 32, 95. DOI: 10.1007/s11222-022-10146-7.