Abstract
Sample compression schemes were first proposed by Littlestone and Warmuth in 1986. The undirected graphical model is a powerful tool for classification in statistical learning. In this paper, we consider labelled compression schemes for concept classes induced by discrete undirected graphical models. For the undirected graph on two vertices with no edge, where one vertex takes two values and the other takes any finite number of values, we propose an algorithm that establishes a labelled compression scheme whose size equals the VC dimension of the associated concept class. We then extend the result to two other types of undirected graphical models and show the existence of labelled compression schemes of size equal to the VC dimension for the induced concept classes. This work is a step toward solving the sample compression problem for concept classes induced by general discrete undirected graphical models.
1. Introduction
Compression is an important facet of learning. In statistical learning, suitable compression guarantees a small generalization error (Blumer et al., 1987). A significant notion of compression is the sample compression scheme, introduced by Littlestone and Warmuth (1986). There are two types of sample compression schemes, labelled and unlabelled; both are approaches for encoding a finite sample by a small subsample.
A labelled sample compression scheme of size k for a concept class consists of a compression function g and a reconstruction function h. The compression function g compresses a finite sample to a small labelled subsample of size at most k, called the compression set, and the reconstruction function h uses the compression set to construct a concept consistent with the original sample. A crucial fact, the compactness lemma, was given by Ben-David and Litman (1998): if every finite subclass of a class admits a compression scheme of size k, then so does the class itself. The Vapnik–Chervonenkis (VC) dimension is a measure of the complexity of function classes. In statistical learning, the VC dimension is deeply related to probably approximately correct (PAC) learning and to sample compression schemes (Blumer et al., 1989; Floyd & Warmuth, 1995). In particular, for any concept class of VC dimension d, the size of its sample compression scheme must be at least d (Bollobás & Radcliffe, 1995). Furthermore, Floyd and Warmuth (1995) and Warmuth (2003) proposed the sample compression conjecture: every class of VC dimension d admits a sample compression scheme of size O(d). To date, this conjecture remains open. However, it has been verified for many families of concept classes that are natural and important in statistical learning. Floyd and Warmuth (1995) showed that there is a labelled sample compression scheme of size d for any maximum class of VC dimension d. Moran and Warmuth (2016) presented labelled sample compression schemes of size d for ample classes, a natural generalization of maximum classes, based on the Sandwich Lemma (Anstee et al., 2002; Bollobás & Radcliffe, 1995). One noteworthy result is that Moran and Yehudayoff (2016) proved the existence of a labelled compression scheme of size exponential in d for an arbitrary class of VC dimension d. In a recent study, Chepoi et al. (2021) showed that the topes of a complex of oriented matroids (COM) admit a labelled sample compression scheme of size equal to the VC dimension d.
Another type of compression scheme is unlabelled. In an unlabelled sample compression scheme, the compression set carries no labels; in other words, the compression function compresses each given sample to a subset of the sample's domain points. For several specific concept classes, unlabelled sample compression schemes have been explored and a series of results obtained (Chalopin et al., 2019; Helmbold et al., 1990; Kuzmin & Warmuth, 2007; Marc, 2022; Pálvölgyi & Tardos, 2020; Rubinstein & Rubinstein, 2012).
The undirected graphical model (UGM), also known as the Markov network, is an active research topic in statistical learning. Over the past few decades, it has been successfully applied to many real-world problems (Friedman, 2004; Lauritzen, 1996). A statistical model is a family of probability distributions, which, from the viewpoint of algebraic geometry, is (part of) a real algebraic variety (Pistone et al., 2001). In particular, for discrete data, a statistical model is the set of all solutions of a system of polynomials within the probability simplex (Geiger et al., 2001; Settimi & Smith, 2000).
Given an undirected graph (UG) and a sample space, one has an algebraic geometric characterization of the corresponding discrete UGM (Geiger et al., 2006). Building on that work, Li and Yang (2018) developed an algebraic method for computing the VC dimensions of the concept classes induced by general discrete UGMs. Their work lays the theoretical foundation for this paper.
A natural question is whether the sample compression conjecture holds for the concept classes induced by discrete UGMs. In this paper, we consider concept classes induced by three types of discrete UGMs. Using the quadratic binomials associated with a positive UGM, we construct a labelled compression scheme of size equal to the VC dimension d; that is, the compression conjecture holds for these three concept classes. Whether the concept class induced by a general discrete UGM has a sample compression scheme of size equal to the VC dimension remains an open question.
The remainder of this paper is organized as follows. Section 2 introduces the main definitions and notations. Section 3 discusses a property of the one-inclusion graphs of concept classes induced by general UGMs and establishes the existence of labelled sample compression schemes of size equal to the VC dimension for classes induced by three kinds of discrete UGMs. Section 4 summarizes the work of this paper.
2. Preliminaries
In this section, we introduce some formal definitions and notations that appear in this paper.
2.1. Discrete UGMs
We consider UGMs whose underlying UGs are simple. A UG consists of a set V of vertices and a set E of unordered pairs of vertices (undirected edges). A graph is complete if there is an edge between every pair of vertices. Given a UG G, a clique is a maximal complete subgraph of G, and we write C(G) for the set of all cliques of G. Let the vertices of G be discrete random variables, each taking finitely many values. Let x be an n-dimensional vector of their values; the sample space is the product of the value ranges. A UGM is a family of probability distributions, and each element (a probability distribution on the sample space) has the form (1) p(x) = (1/Z) ∏_{K ∈ C(G)} ψ_K(x), where Z is the normalizing constant and ψ_K is a potential function that depends on x only through the values of the variables in K. We focus on the collection of positive probability distributions in the model.
Example 2.1
The graph in Figure is a simple UG on two vertices with no edge, so its cliques are the two singleton vertices. If both variables are binary, the sample space has four values: (1,1), (1,2), (2,1), (2,2).
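To make the factorization in (1) concrete, the sample space and a positive factorized distribution can be enumerated directly. The following is a minimal Python sketch; the function names and the potential tables are ours, chosen for illustration, not the paper's notation:

```python
import itertools

def sample_space(cards):
    # Enumerate the sample space {1..r_1} x ... x {1..r_n}
    # for the list of vertex cardinalities cards.
    return list(itertools.product(*[range(1, r + 1) for r in cards]))

def factorized_dist(cards, potentials):
    # potentials: list of (clique_indices, psi) pairs, where psi maps the
    # tuple of values on the clique to a positive real.  p(x) is the
    # normalized product of clique potentials, as in formula (1).
    space = sample_space(cards)
    unnorm = {}
    for x in space:
        v = 1.0
        for clique, psi in potentials:
            v *= psi(tuple(x[i] for i in clique))
        unnorm[x] = v
    z = sum(unnorm.values())  # normalizing constant Z
    return {x: v / z for x, v in unnorm.items()}
```

For the edgeless two-vertex graph, the cliques are the singletons, so every distribution built this way factors as a product of marginals.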
2.2. Algebraic geometry of discrete UGMs
We work in the real polynomial ring whose indeterminates are the elementary probabilities. A subset I of the ring is an ideal if it satisfies the following conditions: (1) 0 ∈ I; (2) if f, g ∈ I, then f + g ∈ I; and (3) if f ∈ I and h is any element of the ring, then hf ∈ I. Every ideal I is associated with the variety of its positive zeros, namely the set of m-dimensional vectors with positive real components on which every polynomial in I vanishes.
Given a UG G, we say one variable is independent of another given the remaining variables if the corresponding vertices are not adjacent in G; we call this a pairwise conditional independence statement. Each such statement corresponds to a set of quadratic binomials in the elementary probabilities. Consider the ideal generated by the quadratic binomials corresponding to all the pairwise conditional independence statements. The well-known Hammersley–Clifford theorem shows that the positive distributions satisfying these statements are exactly the positive distributions in the model (Geiger et al., 2006; Lauritzen, 1996).
Example 2.2
Consider the same UG and sample space as in Example 2.1. Since the two variables are independent, the unique quadratic binomial associated with this discrete UGM is (2) p11 p22 − p12 p21.
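One can check numerically whether a given positive distribution lies on the variety cut out by such quadratic binomials. This is an illustrative sketch; the tuple-indexed encoding of elementary probabilities is our own convention:

```python
def binomial_residual(p, quad):
    # quad = ((a, b), (c, d)) encodes the binomial p_a * p_b - p_c * p_d,
    # where a, b, c, d index points of the sample space.
    (a, b), (c, d) = quad
    return p[a] * p[b] - p[c] * p[d]

def on_variety(p, binomials, tol=1e-12):
    # A positive distribution lies on the model's variety iff every
    # quadratic binomial vanishes on it (up to floating-point tolerance).
    return all(abs(binomial_residual(p, q)) <= tol for q in binomials)
```

For the binomial (2), an independent distribution satisfies p11 p22 = p12 p21, while a dependent one does not.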
2.3. Concept classes, VC dimension, sample compression schemes
A concept class over a finite domain is a family of functions from the domain to {0, 1}. Alternatively, a concept class can be represented by its one-inclusion graph (Haussler et al., 1994): the vertices of the graph are the concepts of the class, and two concepts are joined by an edge if they differ at a single point. Each function in the class is called a concept, and it can also be viewed as the subset of the domain on which it takes the value 1. For a subset A of the domain, the restriction of the class to A is the class of restrictions of its concepts to A. If for every binary vector over A there is a concept c agreeing with it on A, then A is said to be shattered by the class. The VC dimension of the class is the maximum size of a shattered subset of the domain. The sign function is defined as (3) sgn(y) = 1 if y ≥ 0 and sgn(y) = 0 if y < 0. The concept class induced by a discrete UGM is the set of Boolean functions over the sample space of the form sgn(p − q) for pairs of distributions p, q in the distribution class; hence a concept labels x with 1 if p(x) ≥ q(x) and with 0 otherwise. A concept class is called maximum if its size equals the Sauer bound, the sum of the binomial coefficients (m choose i) for i = 0, ..., d, where m is the size of the domain and d is the VC dimension.
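For small classes such as those considered in this paper, the VC dimension can be computed by brute force directly from the definition of shattering. A minimal sketch, feasible only for small domains; the function name is ours:

```python
from itertools import combinations

def vc_dimension(concepts, domain):
    # concepts: iterable of dicts mapping each domain point to 0 or 1.
    def shattered(A):
        # A is shattered iff the concepts realize all 2^|A| label patterns on A.
        patterns = {tuple(c[x] for x in A) for c in concepts}
        return len(patterns) == 2 ** len(A)

    d = 0
    for k in range(1, len(domain) + 1):
        if any(shattered(A) for A in combinations(domain, k)):
            d = k
        else:
            # Shattering is monotone: if no k-set is shattered,
            # no larger set can be shattered either.
            break
    return d
```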
Example 2.3
According to Example 2.2, clearly there is no pair of distributions whose induced concept takes the excluded values; since the binomial (2) must vanish, that value cannot appear. The concept class is shown in Figure (a). Note that it is not a maximum class. The one-inclusion graph of the class is given in Figure (b).
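The one-inclusion graph of a small class can likewise be built directly from its definition, joining concepts at Hamming distance one. An illustrative sketch; concepts are encoded as 0–1 tuples:

```python
def one_inclusion_edges(concepts):
    # concepts: list of equal-length 0/1 tuples.  An edge joins two
    # concepts that differ in exactly one coordinate.
    edges = []
    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            diff = sum(a != b for a, b in zip(concepts[i], concepts[j]))
            if diff == 1:
                edges.append((concepts[i], concepts[j]))
    return edges
```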
A labelled sample is a finite set of labelled domain points. A labelled sample compression scheme for a concept class is a pair of functions (g, h). The compression function g maps a finite labelled sample that is realizable by some concept of the class to a labelled subsample, the compression set. The reconstruction function h maps this subsample to a hypothesis on the domain that is consistent with the entire sample. The size of the compression scheme is the size of its largest compression set. A sample compression scheme is proper if the reconstructed hypothesis h(g(s)) belongs to the class for every realizable sample s; otherwise, it is called improper.
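The defining property of a labelled scheme of size k can be checked exhaustively for small finite classes. This is a sketch; the threshold class used in the test below is our own toy example of a size-1 scheme, not one of the paper's UGM classes:

```python
from itertools import combinations

def verify_scheme(concepts, domain, compress, reconstruct, k):
    # Defining property of a size-k labelled scheme: for every concept c and
    # every nonempty sample S realizable by c, g(S) is a labelled subsample
    # of S of size <= k, and h(g(S)) agrees with c on all of S.
    for c in concepts:
        for m in range(1, len(domain) + 1):
            for S in combinations(domain, m):
                sample = {x: c(x) for x in S}
                comp = compress(sample)
                assert len(comp) <= k
                assert all(sample[x] == y for x, y in comp.items())
                h = reconstruct(comp)
                assert all(h(x) == y for x, y in sample.items())
    return True
```

As a toy usage, thresholds on a line (VC dimension 1) admit a size-1 scheme: keep the minimal positive example if one exists, otherwise the maximal negative example.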
3. Labelled sample compression schemes
In this section, we point out that the one-inclusion graph of the concept class induced by a discrete UGM is not the tope graph of a COM when the underlying graph is incomplete, and we construct labelled sample compression schemes of size equal to the VC dimension for classes corresponding to three types of discrete UGMs.
3.1. One-inclusion graphs of concept classes induced by general discrete UGMs
For a discrete UGM, the vertices of its underlying UG represent random variables. In this section, by contrast, each vertex of the graphs under consideration represents an object that is not a random variable: in a one-inclusion graph, each vertex is a concept.
Consider a finite, connected and simple graph. The distance between vertices u and v is defined as the length of a shortest u–v path. We say that a graph is isometrically embeddable into another graph if there is a distance-preserving map between their vertex sets. A graph is a partial cube if it admits an isometric embedding into a hypercube graph.
It is useful to introduce two types of metric subgraphs. A subgraph H is even if for every vertex v of H there exists a unique vertex of H at distance equal to the diameter of H. An even subgraph is symmetric if the antipode map preserves distances, i.e. d(u, v) equals the distance between the antipodes of u and v for all u, v in H. A symmetric even subgraph H is also called an antipodal subgraph. Two antipodal subgraphs of the graph in Figure (b) are shown in Figure . A subgraph H is gated if for each vertex v of the ambient graph there exists a gate x in H such that, for every vertex y of H, some shortest path from v to y passes through x. In a partial cube this means that there exists a path from v to H that does not use any colour appearing on the edges of H. We say that a graph is antipodally gated if all of its antipodal subgraphs are gated.
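Both conditions defining an antipodal subgraph can be verified mechanically from BFS distances. An illustrative sketch; the adjacency-list encoding and function names are our own:

```python
from collections import deque

def bfs_dist(adj, s):
    # Single-source distances in an unweighted graph {v: [neighbours]}.
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def is_antipodal(adj):
    # Even: every vertex has a unique vertex at distance diam(H).
    # Symmetric: the antipode map u -> u* preserves distances.
    verts = list(adj)
    dist = {v: bfs_dist(adj, v) for v in verts}
    diam = max(max(d.values()) for d in dist.values())
    anti = {}
    for v in verts:
        far = [u for u in verts if dist[v][u] == diam]
        if len(far) != 1:
            return False  # not even
        anti[v] = far[0]
    return all(dist[u][v] == dist[anti[u]][anti[v]]
               for u in verts for v in verts)
```

For instance, a 4-cycle is antipodal, while a path on three vertices is not even (its middle vertex has no vertex at diameter distance).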
The subgraph in Figure (b) is an antipodal subgraph, yet the vertex '0010' has no gate in it: there is no path from '0010' to the subgraph in Figure (b) that avoids both the red and the blue edges. Similarly, the subgraph induced by the vertex set {'0010', '0110', '1110', '1010'} is a non-gated antipodal subgraph. Therefore, the graph in Figure (b) is not antipodally gated, and by Theorem 1.1 in Knauer and Marc (2020) it is not the tope graph of a COM. More generally, the one-inclusion graph of the concept class induced by any UGM whose underlying graph is incomplete is not the tope graph of a COM.
3.2. Sample compression schemes for the concept classes induced by discrete UGMs
Given a sample space, if a UG is complete, the corresponding concept class is a maximum class (Li & Yang, 2018), and its one-inclusion graph is the tope graph of a COM (Knauer & Marc, 2020). We now consider a labelled compression scheme for the induced class.
Proposition 3.1
The induced concept class has a labelled compression scheme of size 3.
Proof.
Let , and consider the UG in Figure ; and are dependent. Note that the induced concept class contains 15 concepts, and it is a maximum class. Then we can obtain a labelled compression scheme for it (Floyd & Warmuth, 1995; Moran & Warmuth, 2016), and a labelled compression scheme of size 3 for as shown in Table , since and .
Remark 3.1
For in Figure , let , , where , excluding the case . Then we can give the algebraic characterization of this discrete UGM: (4) where , i<l, , j<k. It is easy to see that only and cannot occur on , ; that is, there are 14 functions, denoted by .
Note that . A natural idea would be to embed the class into a maximum class of the same VC dimension that contains . There are 16 functions on 4 domain points in total, so we can delete to get a maximum class of VC dimension 3 which contains . The labelled compression scheme for this maximum class is shown in Table . The concept , and it has four compression sets. If we know , this evidence cannot be used to predict the label of in , though the fact that it comes from is obvious. The key point is that the compression sets corresponding to do not contain in the maximum class . Thus we can assign the labelled compression set to , to , to , and to . Then one can obtain a new compression scheme for as shown in Table . This compression scheme is crucial in this paper.
Remark 3.2
Consider formula (4). If we know any three of the four values , , , , then the value of the unknown probability can be obtained. However, according to Table , we must abide by the following rules: if we want to reconstruct , , , , , we must know , , , , , , respectively. Similarly, if we want to reconstruct concepts , , , , we need to know the specified three points , , , , respectively. The remaining four concepts in the class do not require specific 0–1 values.
Next we will show that when one vertex is binary and the other takes any finite number of values, each concept in the induced class can be compressed to a labelled set of size equal to the VC dimension.
Lemma 3.2
For in Figure , let , , and , let denote the class induced by this discrete UGM, and then for each concept , there exists a compression set of size which can reconstruct c exactly.
Proof.
For , we have , and then the lemma holds. For , has domain points, denoted as . The quadratic binomials associated with this UGM are as follows: (5) where , . We know that (Li & Yang, 2018). There are four cases to discuss.
Case (1): .
Note that , so by Remark 3.2, we must retain and to reconstruct ‘1111’. Therefore, for any m we first retain and . Then for any , if n<m retain ; if n>m retain . Then we can get the compression set of size .
Case (2): . There are two subcases as follows.
Case (2a): and for each n<m. We can first retain and . Then according to reconstruction rules, for n<m we retain . For n>m, if then retain ; if then retain .
Case (2b): and for each n<m. We can first retain , and for n<m. For n>m, if then retain ; if then retain .
Case (3): , and there exists at least one such that , .
Let . We first retain and . Then by the rule shown in Table , for any n<k if , we retain or ; if , we retain or . For any n>k if , we retain ; if , retain or ; if , retain . This means that points are retained.
Case (4): , and there exists at least one such that , .
Let . First, we retain and . Then using similar techniques as in Case (3), it follows that for any n<l if , we retain ; if , we retain ; if , we retain or . For any n>l if , we retain or ; if , we retain or .
In summary, we can get a compression set A of size for any concept c. Clearly, the compression process ensures that we can reconstruct c correctly using only the points in the compression set A.
In fact, in the proof of Lemma 3.2, Case (3) contains the following four subcases: , , , . Case (4) also has four subcases: , , , .
At a high level, our goal is to construct a sample compression scheme for which compresses any sample s of to a subsample of size at most , such that this subsample can be used to reconstruct a concept consistent with s. Next we give an algorithm which generates such a compression scheme. The input is a labelled sample s of . The output is a subsample of size at most that represents some hypothesis consistent with s.
The Algorithm for Constructing a Labelled Compression Scheme.
The compression algorithm: The input is a finite sample s of . Let denote all domain sets corresponding to quadratic binomials that the domain points need to satisfy, and further let . If , then . Otherwise, let where . Then compress in the way used in the proof of Lemma 3.2, denote the compression set for by , and then .
The reconstruction algorithm: The input is a subsample of size at most , and the reconstruction function is asked to predict the labels of the elements of . For any element , if there is a quadratic binomial that contains x whose other three points are in the subsample , then predict the label of x by the scheme in Table . Let B be the set of such x; then for the elements of , assign any set of 0–1 values consistent with the binomials in (5) as their labels.
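The reconstruction step can be sketched as label propagation over binomial quadruples: whenever three of the four points of a binomial are labelled, the table determines the fourth. Since the paper's Table is specific to each UGM, the `rule` callable below is a hypothetical stand-in for that lookup, and the data layout is our own:

```python
def reconstruct_labels(subsample, quads, rule):
    # subsample: dict point -> 0/1 label (the compression set).
    # quads: list of 4-tuples of points, one per quadratic binomial.
    # rule(quad, known, missing): placeholder for the paper's reconstruction
    # table -- given the three known labels on a quadruple, it returns the
    # label of the missing point.
    labels = dict(subsample)
    changed = True
    while changed:
        changed = False
        for quad in quads:
            unknown = [x for x in quad if x not in labels]
            if len(unknown) == 1:
                known = {x: labels[x] for x in quad if x in labels}
                labels[unknown[0]] = rule(quad, known, unknown[0])
                changed = True
    return labels
```

The fixed-point loop matters: a label inferred from one binomial may enable inference in another, as in Example 3.4 where two binomials are chained.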
Theorem 3.3
Let s be a sample labelled consistently with some concept of , . Then the algorithm produces a compression set of size at most and the reconstructed hypothesis is consistent with the original sample s.
Proof.
We first show that the size of the compression set is at most . If , there is at most one pair of points , in the sample s, where . Therefore, . If , there exist at least two pairs of points , and , in sample s where . Let the set of these points be D, , where . By Lemma 3.2, we can compress the labelled set D to of size . Then .
From the algorithm and Lemma 3.2, it follows immediately that the reconstructed hypothesis is consistent with s on , and consequently with the entire sample s. This completes the proof.
Example 3.4
Consider the graph . If , , let denote the concept class induced by this discrete UGM, and then . The associated quadratic binomials are:
(1) ; (2) ; (3) .
It is easy to see that the VC dimension of is 4 (Li & Yang, 2018). Consider the sample which is a concept in (e.g. let , , , ). Using the proposed algorithm, we can compress s to , . The reconstruction function predicts the label '1' for and '1' for by binomials (2) and (3), respectively. That is, the reconstructed hypothesis is exactly s.
If , by the proposed algorithm, s itself is the labelled compression set. Then and are both labelled '0'. This concept does not belong to , which shows that the labelled compression scheme we presented is improper.
Naturally, we can extend Lemma 3.2 to a family of concept classes whose corresponding graphs have three vertices and two edges.
Theorem 3.5
For the graph shown in Figure , let , , where , , i = 2, 3. Let denote the class induced by this UGM. Then there exists a labelled compression scheme of size for .
Proof.
There are domain points in , and (Li & Yang, 2018). For the UG in Figure , , and one can use the sample compression scheme of Lemma 3.2 to construct a scheme for .
For any sample set s, we can divide it into parts according to the value of , denoted by . Note that for the elements in each part , takes the same value . Let the compression sets of be , respectively, where , . Then the compression set of s is their union , and . This completes the proof of Theorem 3.5.
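The partition step in this proof can be sketched generically: split the sample by the value of the partitioning coordinate, compress each block with the base scheme of Lemma 3.2, and take the union. In the sketch below, `base_compress` is a hypothetical stand-in for the Lemma 3.2 compressor:

```python
from collections import defaultdict

def compress_by_partition(sample, part_index, base_compress):
    # sample: dict mapping domain points (tuples) to 0/1 labels.
    # Split by the value of coordinate part_index (X3 in Theorem 3.5),
    # compress each block separately, and return the union.
    parts = defaultdict(dict)
    for x, y in sample.items():
        parts[x[part_index]][x] = y
    comp = {}
    for part in parts.values():
        comp.update(base_compress(part))
    return comp
```

Because the blocks are disjoint, the size of the union is the sum of the per-block compression-set sizes, which is how the overall bound in the theorem is obtained.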
Immediately, we have the following result.
Corollary 3.6
For any UGM whose underlying graph has two cliques , with , there is a labelled sample compression scheme of size equal to the VC dimension of the corresponding induced concept class.
Proof.
Note that we number first the vertex that lies only in , then the vertices in , and the rest last. Suppose , where , .
If (), the two cliques can be viewed as two isolated vertices taking 2 and values respectively, and this case reduces to that of Lemma 3.2. If (), this UG can be viewed as the UG in Figure taking 2, and values respectively, and this case reduces to the setting of Theorem 3.5. This completes the proof.
4. Conclusion and discussion
UGMs have become popular models for classification in statistical learning. In this paper, we focus on whether there exists a sample compression scheme of size equal to the VC dimension for the concept class induced by a discrete UGM. We show that for three types of discrete UGMs the answer is positive. The construction of these labelled compression schemes utilizes the algebraic characterization of discrete UGMs.
A natural question is whether there are unlabelled sample compression schemes of size equal to the VC dimension for these three types of induced concept classes. Whether a general discrete UGM admits a labelled sample compression scheme of size equal to the VC dimension of its induced concept class remains open.
Acknowledgments
We would like to thank the Editor, Managing Editor, the Associate Editor and the reviewer for constructive comments and helpful suggestions.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- Anstee, R. P., Rónyai, L., & Sali, A. (2002). Shattering news. Graphs and Combinatorics, 18(1), 59–73. https://doi.org/10.1007/s003730200003.
- Ben-David, S., & Litman, A. (1998). Combinatorial variability of Vapnik–Chervonenkis classes with applications to sample compression schemes. Discrete Applied Mathematics, 86(1), 3–25. https://doi.org/10.1016/S0166-218X(98)00000-6.
- Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24(6), 377–380. https://doi.org/10.1016/0020-0190(87)90114-1.
- Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. https://doi.org/10.1145/76359.76371.
- Bollobás, B., & Radcliffe, A. J. (1995). Defect Sauer results. Journal of Combinatorial Theory, Series A, 72(2), 189–208. https://doi.org/10.1016/0097-3165(95)90060-8.
- Chalopin, J., Chepoi, V., Moran, S., & Warmuth, M. K. (2019). Unlabeled sample compression schemes and corner peelings for ample and maximum classes. ICALP, 34, 1–15. https://doi.org/10.4230/LIPIcs.ICALP.2019.34.
- Chepoi, V., Knauer, K., & Philibert, M. (2021). Labeled sample compression schemes for complexes of oriented matroids. arXiv:2110.15168v1.
- Floyd, S., & Warmuth, M. K. (1995). Sample compression, learnability, and the Vapnik–Chervonenkis dimension. Machine Learning, 21(3), 269–304. https://doi.org/10.1023/A:1022660318680.
- Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303(5659), 799–805. https://doi.org/10.1126/science.1094068.
- Geiger, D., Heckerman, D., King, H. P., & Meek, C. (2001). Stratified exponential families: graphical models and model selection. The Annals of Statistics, 29(2), 505–529. https://doi.org/10.1214/aos/1009210550.
- Geiger, D., Meek, C., & Sturmfels, B. (2006). On the toric algebra of graphical models. Annals of Statistics, 34(3), 1463–1492. https://doi.org/10.1214/009053606000000263.
- Haussler, D., Sloan, R., & Warmuth, M. K. (1994). Predicting {0,1} functions on randomly drawn points. Information and Computation, 115(2), 248–292. https://doi.org/10.1006/inco.1994.1097.
- Helmbold, D., Sloan, R., & Warmuth, M. K. (1990). Learning nested differences of intersection-closed concept classes. Machine Learning, 5, 165–196. https://doi.org/10.1023/A:1022696716689.
- Knauer, K., & Marc, T. (2020). On tope graphs of complexes of oriented matroids. Discrete & Computational Geometry, 63(2), 377–417. https://doi.org/10.1007/s00454-019-00111-z.
- Kuzmin, D., & Warmuth, M. K. (2007). Unlabeled compression schemes for maximum classes. Journal of Machine Learning Research, 8, 2047–2081. https://doi.org/10.1007/11503415_40.
- Lauritzen, S. L. (1996). Graphical models. Oxford University Press.
- Li, B., & Yang, Y. (2018). Complexity of concept classes induced by discrete Markov networks and Bayesian networks. Pattern Recognition, 82, 31–37. https://doi.org/10.1016/j.patcog.2018.04.026.
- Littlestone, N., & Warmuth, M. K. (1986). Relating data compression and learnability [Unpublished manuscript]. http://www.cse.ucsc.edu/manfred.
- Marc, T. (2022). Unlabeled sample compression schemes for oriented matroids. arXiv:2203.11535v2.
- Moran, S., & Warmuth, M. K. (2016). Labeled compression schemes for extremal classes. In ALT (pp. 34–39). Springer. https://doi.org/10.1007/978-3-319-46379-7_3
- Moran, S., & Yehudayoff, A. (2016). Sample compression schemes for VC classes. In ITA (pp. 1–14). IEEE. https://doi.org/10.1109/ITA.2016.7888187
- Pálvölgyi, D., & Tardos, G. (2020). Unlabeled compression schemes exceeding the VC-dimension. Discrete Applied Mathematics, 276, 102–107. https://doi.org/10.1016/j.dam.2019.09.022.
- Pistone, G., Riccomagno, E., & Wynn, H. P. (2001). Algebraic statistics: Computational commutative algebra in statistics. Chapman and Hall/CRC.
- Rubinstein, B. I. P., & Rubinstein, J. H. (2012). A geometric approach to sample compression. Journal of Machine Learning Research, 13(42), 1221–1261.
- Settimi, R., & Smith, J. Q. (2000). Geometry, moments and conditional independence trees with hidden variables. Annals of Statistics, 28(4), 1179–1205. https://doi.org/10.1214/aos/1015956712.
- Warmuth, M. K. (2003). Compressing to VC dimension many points. In Proceedings of the 16th Annual Conference on Learning Theory (pp. 743–744). Springer. https://doi.org/10.1007/978-3-540-45167-9_60