Abstract
Sample compression schemes were first proposed by Littlestone and Warmuth in 1986. The undirected graphical model is a powerful tool for classification in statistical learning. In this paper, we consider labelled compression schemes for concept classes induced by discrete undirected graphical models. For the undirected graph on two vertices with no edge, where one vertex takes two values and the other takes any finite number of values, we propose an algorithm that establishes a labelled compression scheme whose size equals the VC dimension of the associated concept class. We then extend the result to two other types of undirected graphical models and show the existence of labelled compression schemes of size equal to the VC dimension for the induced concept classes. This work is a step toward solving the sample compression problem for concept classes induced by general discrete undirected graphical models.
1. Introduction
Compression is an important facet of learning. In statistical learning, suitable compression guarantees a small generalization error (Blumer et al., 1987). A significant notion of compression is the sample compression scheme, introduced by Littlestone and Warmuth (1986). There are two types of sample compression schemes, labelled and unlabelled; both are approaches for encoding a finite sample by a small subsample.
A labelled sample compression scheme of size k for a concept class consists of a compression function g and a reconstruction function h. The compression function g compresses a finite sample to a small labelled subsample of size at most k, called the compression set, and the reconstruction function h uses the compression set to construct a concept consistent with the original sample. A crucial fact, the compactness lemma, was given by Ben-David and Litman (1998): if every finite subclass of a class admits a compression scheme of size k, then so does the class itself. The Vapnik–Chervonenkis (VC) dimension is a measure of the complexity of function classes. In statistical learning, the VC dimension is deeply related to probably approximately correct (PAC) learning and to sample compression schemes (Blumer et al., 1989; Floyd & Warmuth, 1995). In particular, for any concept class of VC dimension d, the size of its sample compression scheme must be at least d (Bollobás & Radcliffe, 1995). Furthermore, Floyd and Warmuth (1995) and Warmuth (2003) proposed the sample compression conjecture: every class of VC dimension d admits a sample compression scheme of size O(d). To date, this conjecture remains open. However, it has been verified for many families of concept classes that are natural and important in statistical learning. Floyd and Warmuth (1995) showed that there is a labelled sample compression scheme of size d for any maximum class of VC dimension d. Moran and Warmuth (2016) presented labelled sample compression schemes of size d for ample classes, a natural generalization of maximum classes, based on the Sandwich Lemma (Anstee et al., 2002; Bollobás & Radcliffe, 1995). One noteworthy result is that Moran and Yehudayoff (2016) proved the existence of a labelled compression scheme of size exponential in d for an arbitrary class of VC dimension d. In a recent study, Chepoi et al. (2021) showed that the topes of a complex of oriented matroids (COM) admit a labelled sample compression scheme of size equal to the VC dimension d.
Another type of compression scheme is unlabelled. In an unlabelled sample compression scheme, the compression set carries no labels; in other words, the compression function compresses each given sample to a subset of the sample's domain points. For several specific concept classes, unlabelled sample compression schemes have been explored and a series of results obtained (Chalopin et al., 2019; Helmbold et al., 1990; Kuzmin & Warmuth, 2007; Marc, 2022; Pálvölgyi & Tardos, 2020; Rubinstein & Rubinstein, 2012).
The undirected graphical model (UGM), also known as the Markov network, is an active research topic in statistical learning. Over the past few decades, it has been successfully applied to many real-world problems (Friedman, 2004; Lauritzen, 1996). A statistical model is a family of probability distributions, which, from the viewpoint of algebraic geometry, is (part of) a real algebraic variety (Pistone et al., 2001). In particular, for discrete data, a statistical model is the set of all solutions of a system of polynomials within the probability simplex (Geiger et al., 2001; Settimi & Smith, 2000).
Given an undirected graph (UG) and a sample space, one has an algebraic geometric characterization of the corresponding discrete UGM (Geiger et al., 2006). Building on that work, Li and Yang (2018) developed an algebraic method for computing the VC dimensions of the concept classes induced by general discrete UGMs. Their work lays the theoretical foundation for this paper.
A natural question is whether the sample compression conjecture holds for the concept classes induced by discrete UGMs. In this paper, we consider concept classes induced by three types of discrete UGMs. Using the quadratic binomials associated with a positive UGM, we construct a labelled compression scheme of size equal to the VC dimension d; that is, the compression conjecture holds for these three concept classes. Whether the concept class induced by a general discrete UGM has a sample compression scheme of size equal to the VC dimension remains an open question.
The remainder of this paper is organized as follows. Section 2 introduces the main definitions and notations. Section 3 discusses a property of the one-inclusion graphs of concept classes induced by general UGMs and establishes the existence of labelled sample compression schemes of size equal to the VC dimension for classes induced by three kinds of discrete UGMs. Section 4 summarizes the work of this paper.
2. Preliminaries
In this section, we introduce some formal definitions and notations that appear in this paper.
2.1. Discrete UGMs
We consider UGMs whose underlying UGs are simple. A UG consists of a set V of vertices and a set E of unordered pairs of vertices (undirected edges). A graph is complete if there is an edge between every pair of vertices. Given a UG G, a clique is a maximal complete subgraph of G, and we write C(G) for the set of all cliques of G. Let the vertices of G be discrete random variables, each taking finitely many values. Let x be an n-dimensional vector of their values; the sample space is the product of the value ranges. A UGM is a family of probability distributions, and each element (a probability distribution on the sample space) has the form (1) p(x) = (1/Z) ∏_{K ∈ C(G)} ψ_K(x), where Z is the normalizing constant and ψ_K is a potential function that depends on x only through the values of the variables in K. We focus on the collection of positive probability distributions in the model.
Example 2.1
The graph in Figure is a simple UG on two vertices with no edge, so its cliques are the two singleton vertices. If both variables are binary, the sample space has four values: (1,1), (1,2), (2,1), (2,2).
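To make the factorization in (1) concrete, the sample space and a positive factorized distribution can be enumerated directly. The following is a minimal Python sketch; the function names and the potential tables are ours, chosen for illustration, not the paper's notation:

```python
import itertools

def sample_space(cards):
    # Enumerate the sample space {1..r_1} x ... x {1..r_n}
    # for the list of vertex cardinalities cards.
    return list(itertools.product(*[range(1, r + 1) for r in cards]))

def factorized_dist(cards, potentials):
    # potentials: list of (clique_indices, psi) pairs, where psi maps the
    # tuple of values on the clique to a positive real.  p(x) is the
    # normalized product of clique potentials, as in formula (1).
    space = sample_space(cards)
    unnorm = {}
    for x in space:
        v = 1.0
        for clique, psi in potentials:
            v *= psi(tuple(x[i] for i in clique))
        unnorm[x] = v
    z = sum(unnorm.values())  # normalizing constant Z
    return {x: v / z for x, v in unnorm.items()}
```

For the edgeless two-vertex graph, the cliques are the singletons, so every distribution built this way factors as a product of marginals.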
2.2. Algebraic geometry of discrete UGMs
We work in the real polynomial ring whose indeterminates are the elementary probabilities. A subset I of the ring is an ideal if it satisfies the following conditions: (1) 0 ∈ I; (2) if f, g ∈ I, then f + g ∈ I; and (3) if f ∈ I and h is any element of the ring, then hf ∈ I. Every ideal I is associated with the variety of its positive zeros, namely the set of m-dimensional vectors with positive real components on which every polynomial in I vanishes.
Given a UG G, we say one variable is independent of another given the remaining variables if the corresponding vertices are not adjacent in G; we call this a pairwise conditional independence statement. Each such statement corresponds to a set of quadratic binomials in the elementary probabilities. Consider the ideal generated by the quadratic binomials corresponding to all the pairwise conditional independence statements. The well-known Hammersley–Clifford theorem shows that the positive distributions satisfying these statements are exactly the positive distributions in the model (Geiger et al., 2006; Lauritzen, 1996).
Example 2.2
Consider the same UG and sample space as in Example 2.1. Since the two variables are independent, the unique quadratic binomial associated with this discrete UGM is (2) p11 p22 − p12 p21.
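One can check numerically whether a given positive distribution lies on the variety cut out by such quadratic binomials. This is an illustrative sketch; the tuple-indexed encoding of elementary probabilities is our own convention:

```python
def binomial_residual(p, quad):
    # quad = ((a, b), (c, d)) encodes the binomial p_a * p_b - p_c * p_d,
    # where a, b, c, d index points of the sample space.
    (a, b), (c, d) = quad
    return p[a] * p[b] - p[c] * p[d]

def on_variety(p, binomials, tol=1e-12):
    # A positive distribution lies on the model's variety iff every
    # quadratic binomial vanishes on it (up to floating-point tolerance).
    return all(abs(binomial_residual(p, q)) <= tol for q in binomials)
```

For the binomial (2), an independent distribution satisfies p11 p22 = p12 p21, while a dependent one does not.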
2.3. Concept classes, VC dimension, sample compression schemes
A concept class over a finite domain is a family of functions from the domain to {0, 1}. Alternatively, a concept class can be represented by its one-inclusion graph (Haussler et al., 1994): the vertices of the graph are the concepts of the class, and two concepts are joined by an edge if they differ at a single point. Each function in the class is called a concept, and it can also be viewed as the subset of the domain on which it takes the value 1. For a subset A of the domain, the restriction of the class to A is the class of restrictions of its concepts to A. If for every binary vector over A there is a concept c agreeing with it on A, then A is said to be shattered by the class. The VC dimension of the class is the maximum size of a shattered subset of the domain. The sign function is defined as (3) sgn(y) = 1 if y ≥ 0 and sgn(y) = 0 if y < 0. The concept class induced by a discrete UGM is the set of Boolean functions over the sample space of the form sgn(p − q) for pairs of distributions p, q in the distribution class; hence a concept labels x with 1 if p(x) ≥ q(x) and with 0 otherwise. A concept class is called maximum if its size equals the Sauer bound, the sum of the binomial coefficients (m choose i) for i = 0, ..., d, where m is the size of the domain and d is the VC dimension.
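For small classes such as those considered in this paper, the VC dimension can be computed by brute force directly from the definition of shattering. A minimal sketch, feasible only for small domains; the function name is ours:

```python
from itertools import combinations

def vc_dimension(concepts, domain):
    # concepts: iterable of dicts mapping each domain point to 0 or 1.
    def shattered(A):
        # A is shattered iff the concepts realize all 2^|A| label patterns on A.
        patterns = {tuple(c[x] for x in A) for c in concepts}
        return len(patterns) == 2 ** len(A)

    d = 0
    for k in range(1, len(domain) + 1):
        if any(shattered(A) for A in combinations(domain, k)):
            d = k
        else:
            # Shattering is monotone: if no k-set is shattered,
            # no larger set can be shattered either.
            break
    return d
```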
Example 2.3
According to Example 2.2, clearly there is no pair of distributions whose induced concept takes the excluded values; since the binomial (2) must vanish, that value cannot appear. The concept class is shown in Figure (a). Note that it is not a maximum class. The one-inclusion graph of the class is given in Figure (b).
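The one-inclusion graph of a small class can likewise be built directly from its definition, joining concepts at Hamming distance one. An illustrative sketch; concepts are encoded as 0–1 tuples:

```python
def one_inclusion_edges(concepts):
    # concepts: list of equal-length 0/1 tuples.  An edge joins two
    # concepts that differ in exactly one coordinate.
    edges = []
    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            diff = sum(a != b for a, b in zip(concepts[i], concepts[j]))
            if diff == 1:
                edges.append((concepts[i], concepts[j]))
    return edges
```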
A labelled sample is a finite set of labelled domain points. A labelled sample compression scheme for a concept class is a pair of functions (g, h). The compression function g maps a finite labelled sample that is realizable by some concept of the class to a labelled subsample, the compression set. The reconstruction function h maps this subsample to a hypothesis on the domain that is consistent with the entire sample. The size of the compression scheme is the size of its largest compression set. A sample compression scheme is proper if the reconstructed hypothesis h(g(s)) belongs to the class for every realizable sample s; otherwise, it is called improper.
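The defining property of a labelled scheme of size k can be checked exhaustively for small finite classes. This is a sketch; the threshold class used in the test below is our own toy example of a size-1 scheme, not one of the paper's UGM classes:

```python
from itertools import combinations

def verify_scheme(concepts, domain, compress, reconstruct, k):
    # Defining property of a size-k labelled scheme: for every concept c and
    # every nonempty sample S realizable by c, g(S) is a labelled subsample
    # of S of size <= k, and h(g(S)) agrees with c on all of S.
    for c in concepts:
        for m in range(1, len(domain) + 1):
            for S in combinations(domain, m):
                sample = {x: c(x) for x in S}
                comp = compress(sample)
                assert len(comp) <= k
                assert all(sample[x] == y for x, y in comp.items())
                h = reconstruct(comp)
                assert all(h(x) == y for x, y in sample.items())
    return True
```

As a toy usage, thresholds on a line (VC dimension 1) admit a size-1 scheme: keep the minimal positive example if one exists, otherwise the maximal negative example.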
3. Labelled sample compression schemes
In this section, we point out that the one-inclusion graph of the concept class induced by a discrete UGM is not the tope graph of a COM when the underlying graph is incomplete, and we construct labelled sample compression schemes of size equal to the VC dimension for classes corresponding to three types of discrete UGMs.
3.1. One-inclusion graphs of concept classes induced by general discrete UGMs
For a discrete UGM, the vertices of its underlying UG represent random variables. In this section, by contrast, each vertex of the graphs under consideration represents an object that is not a random variable: in a one-inclusion graph, each vertex is a concept.
Consider a finite, connected and simple graph. The distance between vertices u and v is defined as the length of a shortest u–v path. We say that a graph is isometrically embeddable into another graph if there is a distance-preserving map between their vertex sets. A graph is a partial cube if it admits an isometric embedding into a hypercube graph.
It is useful to introduce two types of metric subgraphs. A subgraph H is even if for every vertex v of H there exists a unique vertex of H at distance equal to the diameter of H. An even subgraph is symmetric if the antipode map preserves distances, i.e. d(u, v) equals the distance between the antipodes of u and v for all u, v in H. A symmetric even subgraph H is also called an antipodal subgraph. Two antipodal subgraphs of the graph in Figure (b) are shown in Figure . A subgraph H is gated if for each vertex v of the ambient graph there exists a gate x in H such that, for every vertex y of H, some shortest path from v to y passes through x. In a partial cube this means that there exists a path from v to H that does not use any colour appearing on the edges of H. We say that a graph is antipodally gated if all of its antipodal subgraphs are gated.
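Both conditions defining an antipodal subgraph can be verified mechanically from BFS distances. An illustrative sketch; the adjacency-list encoding and function names are our own:

```python
from collections import deque

def bfs_dist(adj, s):
    # Single-source distances in an unweighted graph {v: [neighbours]}.
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def is_antipodal(adj):
    # Even: every vertex has a unique vertex at distance diam(H).
    # Symmetric: the antipode map u -> u* preserves distances.
    verts = list(adj)
    dist = {v: bfs_dist(adj, v) for v in verts}
    diam = max(max(d.values()) for d in dist.values())
    anti = {}
    for v in verts:
        far = [u for u in verts if dist[v][u] == diam]
        if len(far) != 1:
            return False  # not even
        anti[v] = far[0]
    return all(dist[u][v] == dist[anti[u]][anti[v]]
               for u in verts for v in verts)
```

For instance, a 4-cycle is antipodal, while a path on three vertices is not even (its middle vertex has no vertex at diameter distance).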
The subgraph in Figure (b) is an antipodal subgraph, yet the vertex '0010' has no gate in it: there is no path from '0010' to the subgraph in Figure (b) that avoids both the red and the blue edges. Similarly, the subgraph induced by the vertex set {'0010', '0110', '1110', '1010'} is a non-gated antipodal subgraph. Therefore, the graph in Figure (b) is not antipodally gated, and by Theorem 1.1 in Knauer and Marc (2020) it is not the tope graph of a COM. More generally, the one-inclusion graph of the concept class induced by any UGM whose underlying graph is incomplete is not the tope graph of a COM.
3.2. Sample compression schemes for the concept classes induced by discrete UGMs
Given a sample space, if a UG is complete, the corresponding concept class is a maximum class (Li & Yang, 2018), and its one-inclusion graph is the tope graph of a COM (Knauer & Marc, 2020). We now consider a labelled compression scheme for the induced class.
Proposition 3.1
The induced concept class has a labelled compression scheme of size 3.
Proof.
Let , and consider the UG in Figure ; and are dependent. Note that the induced concept class contains 15 concepts, and it is a maximum class. Then we can obtain a labelled compression scheme for it (Floyd & Warmuth, 1995; Moran & Warmuth, 2016), and a labelled compression scheme of size 3 for as shown in Table , since and .
Remark 3.1
For in Figure , let , , where , excluding the case . Then we can give the algebraic characterization of this discrete UGM: (4) where , i<l, , j<k. It is easy to see that only and cannot occur on , ; that is, there are 14 functions, denoted by .
Note that . A natural idea would be to embed the class into a maximum class of the same VC dimension that contains . There are 16 functions on 4 domain points in total, so we can delete to get a maximum class of VC dimension 3 which contains . The labelled compression scheme for this maximum class is shown in Table . The concept , and it has four compression sets. If we know , this evidence cannot be used to predict the label of in , though the fact that it comes from is obvious. The key point is that the compression sets corresponding to do not contain in the maximum class . Thus we can assign the labelled compression set to , to , to , and to . Then one can obtain a new compression scheme for as shown in Table . This compression scheme is crucial in this paper.
Remark 3.2
Consider formula (4). If we know any three of the four values , , , , then the value of the unknown probability can be obtained. However, according to Table , we must abide by the following rules: if we want to reconstruct , , , , , we must know , , , , , , respectively. Similarly, if we want to reconstruct concepts , , , , we need to know the specified three points , , , , respectively. The remaining four concepts in the class do not require specific 0–1 values.
Next we will show that when one vertex is binary and the other takes any finite number of values, each concept in the induced class can be compressed to a labelled set of size equal to the VC dimension.
Lemma 3.2
For in Figure , let , , and , let denote the class induced by this discrete UGM, and then for each concept , there exists a compression set of size which can reconstruct c exactly.
Proof.
For , we have , and then the lemma holds. For , has domain points, denoted as . The quadratic binomials associated with this UGM are as follows: (5) where , . We know that (Li & Yang, 2018). There are four cases to discuss.
Case (1): .
Note that , so by Remark 3.2, we must retain and to reconstruct ‘1111’. Therefore, for any m we first retain and . Then for any , if n<m retain ; if n>m retain . Then we can get the compression set of size .
Case (2): . There are two subcases as follows.
Case (2a): and for each n<m. We can first retain and . Then according to reconstruction rules, for n<m we retain . For n>m, if then retain ; if then retain .
Case (2b): and for each n<m. We can first retain , and for n<m. For n>m, if then retain ; if then retain .
Case (3): , and there exists at least one such that , .
Let . We first retain and . Then by the rule shown in Table , for any n<k if , we retain or ; if , we retain or . For any n>k if , we retain ; if , retain or ; if , retain . This means that points are retained.
Case (4): , and there exists at least one such that , .
Let . First, we retain and . Then using similar techniques as in Case (3), it follows that for any n<l if , we retain ; if , we retain ; if , we retain or . For any n>l if , we retain or ; if , we retain or .
In summary, we can get a compression set A of size for any concept c. Clearly, the compression process ensures that we can reconstruct c correctly using only the points in the compression set A.
In fact, in the proof of Lemma 3.2, Case (3) contains the following four subcases: , , , . Case (4) also has four subcases: , , , .
At a high level, our goal is to construct a sample compression scheme for which compresses any sample s of to a subsample of size at most , such that this subsample can be used to reconstruct a concept consistent with s. Next we give an algorithm which generates such a compression scheme. The input is a labelled sample s of . The output is a subsample of size at most that represents some hypothesis consistent with s.
The Algorithm for Constructing a Labelled Compression Scheme.
The compression algorithm: The input is a finite sample s of . Let denote all domain sets corresponding to quadratic binomials that the domain points need to satisfy, and further let . If , then . Otherwise, let where . Then compress in the way used in the proof of Lemma 3.2, denote the compression set for by , and then .
The reconstruction algorithm: The input is a subsample of size at most , and the reconstruction function is asked to predict the labels of the elements of . For any element , if there is a quadratic binomial that contains x whose other three points are in the subsample , then predict the label of x by the scheme in Table . Let B be the set of such x; then for the elements of , assign any set of 0–1 values consistent with the binomials in (5) as their labels.
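The reconstruction step can be sketched as label propagation over binomial quadruples: whenever three of the four points of a binomial are labelled, the table determines the fourth. Since the paper's Table is specific to each UGM, the `rule` callable below is a hypothetical stand-in for that lookup, and the data layout is our own:

```python
def reconstruct_labels(subsample, quads, rule):
    # subsample: dict point -> 0/1 label (the compression set).
    # quads: list of 4-tuples of points, one per quadratic binomial.
    # rule(quad, known, missing): placeholder for the paper's reconstruction
    # table -- given the three known labels on a quadruple, it returns the
    # label of the missing point.
    labels = dict(subsample)
    changed = True
    while changed:
        changed = False
        for quad in quads:
            unknown = [x for x in quad if x not in labels]
            if len(unknown) == 1:
                known = {x: labels[x] for x in quad if x in labels}
                labels[unknown[0]] = rule(quad, known, unknown[0])
                changed = True
    return labels
```

The fixed-point loop matters: a label inferred from one binomial may enable inference in another, as in Example 3.4 where two binomials are chained.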
Theorem 3.3
Let s be a sample labelled consistently with some concept of , . Then the algorithm produces a compression set of size at most and the reconstructed hypothesis is consistent with the original sample s.
Proof.
We first show that the size of the compression set is at most . If , there is at most one pair of points , in the sample s, where . Therefore, . If , there exist at least two pairs of points , and , in sample s where . Let the set of these points be D, , where . By Lemma 3.2, we can compress the labelled set D to of size . Then .
From the algorithm and Lemma 3.2, it follows immediately that the reconstructed hypothesis is consistent with s on , and consequently with the entire sample s. This completes the proof.
Example 3.4
Consider the graph . If , , let denote the concept class induced by this discrete UGM, and then . The associated quadratic binomials are:
(1) ; (2) ; (3) .
It is easy to see that the VC dimension of is 4 (Li & Yang, 2018). Consider the sample which is a concept in (e.g. let , , , ). Using the proposed algorithm, we can compress s to , . The reconstruction function predicts the label '1' for and '1' for by binomials (2) and (3), respectively. That is, the reconstructed hypothesis is exactly s.
If , by the proposed algorithm, s itself is the labelled compression set. Then and are both labelled '0'. This concept does not belong to , which shows that the labelled compression scheme we presented is improper.
Naturally, we can extend Lemma 3.2 to a family of concept classes whose corresponding graphs have three vertices and two edges.
Theorem 3.5
For the graph shown in Figure , let , , where , , i = 2, 3. Let denote the class induced by this UGM. Then there exists a labelled compression scheme of size for .
Proof.
There are domain points in , and (Li & Yang, 2018). For the UG in Figure , , and one can use the sample compression scheme of Lemma 3.2 to construct a scheme for .
For any sample set s, we can divide it into parts according to the value of , denoted by . Note that for the elements in each part , takes the same value . Let the compression sets of be , respectively, where , . Then the compression set of s is their union , and . This completes the proof of Theorem 3.5.
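The partition step in this proof can be sketched generically: split the sample by the value of the partitioning coordinate, compress each block with the base scheme of Lemma 3.2, and take the union. In the sketch below, `base_compress` is a hypothetical stand-in for the Lemma 3.2 compressor:

```python
from collections import defaultdict

def compress_by_partition(sample, part_index, base_compress):
    # sample: dict mapping domain points (tuples) to 0/1 labels.
    # Split by the value of coordinate part_index (X3 in Theorem 3.5),
    # compress each block separately, and return the union.
    parts = defaultdict(dict)
    for x, y in sample.items():
        parts[x[part_index]][x] = y
    comp = {}
    for part in parts.values():
        comp.update(base_compress(part))
    return comp
```

Because the blocks are disjoint, the size of the union is the sum of the per-block compression-set sizes, which is how the overall bound in the theorem is obtained.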
Immediately, we have the following result.
Corollary 3.6
For any UGM whose underlying graph has two cliques , with , there is a labelled sample compression scheme of size equal to the VC dimension of the corresponding induced concept class.
Proof.
Note that we number first the vertex that lies only in , then the vertices in , and the rest last. Suppose , where , .
If (), the two cliques can be viewed as two isolated vertices taking 2 and values respectively, and this case reduces to that of Lemma 3.2. If (), this UG can be viewed as the UG in Figure taking 2, and values respectively, and this case reduces to the setting of Theorem 3.5. This completes the proof.
4. Conclusion and discussion
UGMs have become popular models for classification in statistical learning. In this paper, we focus on whether there exists a sample compression scheme of size equal to the VC dimension for the concept class induced by a discrete UGM. We show that for three types of discrete UGMs the answer is positive. The construction of these labelled compression schemes utilizes the algebraic characterization of discrete UGMs.
A natural question is whether there are unlabelled sample compression schemes of size equal to the VC dimension for these three types of induced concept classes. Whether a general discrete UGM admits a labelled sample compression scheme of size equal to the VC dimension of its induced concept class remains open.
Acknowledgments
We would like to thank the Editor, Managing Editor, the Associate Editor and the reviewer for constructive comments and helpful suggestions.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- Anstee, R. P., Rónyai, L., & Sali, A. (2002). Shattering news. Graphs and Combinatorics, 18(1), 59–73. https://doi.org/10.1007/s003730200003.
- Ben-David, S., & Litman, A. (1998). Combinatorial variability of Vapnik–Chervonenkis classes with applications to sample compression schemes. Discrete Applied Mathematics, 86(1), 3–25. https://doi.org/10.1016/S0166-218X(98)00000-6.
- Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24(6), 377–380. https://doi.org/10.1016/0020-0190(87)90114-1.
- Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. https://doi.org/10.1145/76359.76371.
- Bollobás, B., & Radcliffe, A. J. (1995). Defect Sauer results. Journal of Combinatorial Theory, Series A, 72(2), 189–208. https://doi.org/10.1016/0097-3165(95)90060-8.
- Chalopin, J., Chepoi, V., Moran, S., & Warmuth, M. K. (2019). Unlabeled sample compression schemes and corner peelings for ample and maximum classes. ICALP, 34, 1–15. https://doi.org/10.4230/LIPIcs.ICALP.2019.34.
- Chepoi, V., Knauer, K., & Philibert, M. (2021). Labeled sample compression schemes for complexes of oriented matroids. arXiv:2110.15168v1.
- Floyd, S., & Warmuth, M. K. (1995). Sample compression, learnability, and the Vapnik–Chervonenkis dimension. Machine Learning, 21(3), 269–304. https://doi.org/10.1023/A:1022660318680.
- Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303(5659), 799–805. https://doi.org/10.1126/science.1094068.
- Geiger, D., Heckerman, D., King, H. P., & Meek, C. (2001). Stratified exponential families: graphical models and model selection. The Annals of Statistics, 29(2), 505–529. https://doi.org/10.1214/aos/1009210550.
- Geiger, D., Meek, C., & Sturmfels, B. (2006). On the toric algebra of graphical models. Annals of Statistics, 34(3), 1463–1492. https://doi.org/10.1214/009053606000000263.
- Haussler, D., Sloan, R., & Warmuth, M. K. (1994). Predicting {0,1} functions on randomly drawn points. Information and Computation, 115(2), 248–292. https://doi.org/10.1006/inco.1994.1097.
- Helmbold, D., Sloan, R., & Warmuth, M. K. (1990). Learning nested differences of intersection-closed concept classes. Machine Learning, 5, 165–196. https://doi.org/10.1023/A:1022696716689.
- Knauer, K., & Marc, T. (2020). On tope graphs of complexes of oriented matroids. Discrete & Computational Geometry, 63(2), 377–417. https://doi.org/10.1007/s00454-019-00111-z.
- Kuzmin, D., & Warmuth, M. K. (2007). Unlabeled compression schemes for maximum classes. Journal of Machine Learning Research, 8, 2047–2081. https://doi.org/10.1007/11503415_40.
- Lauritzen, S. L. (1996). Graphical models. Oxford University Press.
- Li, B., & Yang, Y. (2018). Complexity of concept classes induced by discrete Markov networks and Bayesian networks. Pattern Recognition, 82, 31–37. https://doi.org/10.1016/j.patcog.2018.04.026.
- Littlestone, N., & Warmuth, M. K. (1986). Relating data compression and learnability [Unpublished manuscript]. http://www.cse.ucsc.edu/manfred.
- Marc, T. (2022). Unlabeled sample compression schemes for oriented matroids. arXiv:2203.11535v2.
- Moran, S., & Warmuth, M. K. (2016). Labeled compression schemes for extremal classes. In ALT (pp. 34–39). Springer. https://doi.org/10.1007/978-3-319-46379-7_3
- Moran, S., & Yehudayoff, A. (2016). Sample compression schemes for VC classes. In ITA (pp. 1–14). IEEE. https://doi.org/10.1109/ITA.2016.7888187
- Pálvölgyi, D., & Tardos, G. (2020). Unlabeled compression schemes exceeding the VC-dimension. Discrete Applied Mathematics, 276, 102–107. https://doi.org/10.1016/j.dam.2019.09.022.
- Pistone, G., Riccomagno, E., & Wynn, H. P. (2001). Algebraic statistics: Computational commutative algebra in statistics. Chapman and Hall/CRC.
- Rubinstein, B. I. P., & Rubinstein, J. H. (2012). A geometric approach to sample compression. Journal of Machine Learning Research, 13(42), 1221–1261.
- Settimi, R., & Smith, J. Q. (2000). Geometry, moments and conditional independence trees with hidden variables. Annals of Statistics, 28(4), 1179–1205. https://doi.org/10.1214/aos/1015956712.
- Warmuth, M. K. (2003). Compressing to VC dimension many points. In Proceedings of the 16th Annual Conference on Learning Theory (pp. 743–744). Springer. https://doi.org/10.1007/978-3-540-45167-9_60