Research Article

Impartial competitive learning in multi-layered neural networks

Article: 2174079 | Received 01 Nov 2022, Accepted 24 Jan 2023, Published online: 22 Feb 2023

Abstract

The present paper aims to propose a new learning and interpretation method called "impartial competitive learning", meaning that all participants in a competition should be winners. Because of its importance, impartiality is enforced even at the expense of an increased cost in terms of the strength of weights. As a first approximation, three types of impartial competition can be developed: componential, computational, and collective competition. In componential competition, every weight should have, on average, an equal chance to win the competition. In computational competition, all computational procedures should have an equal chance to be applied sequentially in learning. In collective computing for interpretation, all network configurations obtained by learning have an equal chance to participate in a process of interpretation, representing one of the most idealised forms of impartiality. The method was applied to a well-known second-language-learning data set. The intuitive conclusion stressed in that field could not be extracted by conventional natural language processing methods, because they deal only with word frequency. The present method tries to extract a main feature beyond word frequency by having connection weights and computational procedures compete impartially, followed by collective and impartial competition for interpretation.

1. Introduction

The introductory section explains the concept of impartial competition with componential and computational procedures, where all components as well as all computational procedures have an equal chance to win the competition. In addition, this impartiality is applied to interpretation, where all created representations should have an equal chance to be interpreted. This introduction is accompanied by a literature review of related studies. For example, this competition can be seen from the viewpoint of conventional competitive learning and interpretation. The review stresses that those conventional methods have not necessarily paid due attention to the equal chance to win the competition. By contrast, this paper emphasises the equal chance to win the competition, which should be a primary objective of competitive computation.

1.1. Impartiality and competitive computation

The present paper aims to propose a new type of learning method, called "impartial competitive computation". Competitive computing has two fundamental properties: competition among all components and an equal chance to win the competition. Though we need to consider all components and their related procedures for competition, as a first approximation we deal here with three types of competition for interpretation: componential, computational, and collective.

First, competition can be applied to all components in a neural network. For example, any components inside a neural network should compete with each other, which can be called "componential competition". In particular, we try to deal with competition not among neurons but among connection weights, because we try to consider competition as independently of inputs as possible. Then, all connection weights are supposed to compete with each other by increasing and decreasing the potentiality of connection weights. The potentiality represents to what degree weights can contribute to the inner activity of a neural network. When the potentiality increases, the connection weights can move more freely. By increasing the potentiality of connection weights, all weights tend to be used equally.

The second type of competition is applied to computational and optimising procedures. In competitive computing, we need to control the potentiality of connection weights, and a cost is needed to control the potentiality. In addition, errors between outputs and targets should be decreased in supervised learning. In the conventional methods, those procedures are simultaneously applied and controlled. Thus, those procedures must be controlled by changing many parameters as well as hyper-parameters. When the number of parameters increases, it naturally becomes difficult to find a compromise among them. By contrast, in computational competition, all computational procedures are applied independently of each other. All computational procedures should be used as equally as possible. As is the case with componential competition, only one procedure wins the competition and is applied to learning, while all the others cease to be applied. This can make the effect of each computational procedure clearer, leading to the efficient use of all computational procedures. At the present stage of research, the equal chance to win the competition can be algorithmically realised in computational competition.

In sum, the present method can be considered an extension of conventional competitive learning. In particular, one of its most important characteristics lies in the active realisation of an equal chance for all components and computational procedures to win the competition, adaptively or algorithmically. For this reason, we call competitive computation "impartial", to emphasise the equal chance to win the competition. In our competitive computing, we suppose that a neural network, as a model of living systems, should try to use all available resources as efficiently and extensively as possible for coping with new and future situations.

1.2. Impartiality and competitive learning

We need to survey competitive learning in neural networks here, because the present method should be considered an extension of the fundamental property of competitive learning. Competitive learning has received due attention from the early stages of neural network research (Fukushima, 1975; Grossberg, 1976; Rumelhart & Zipser, 1985), because neural networks have tried to imitate living systems with abundant competition. The basic concept in (Rumelhart & Zipser, 1985) is that, for an input, neurons compete with each other, and finally one neuron wins the competition, representing the input. The neuron closest to the input in terms of similarity or distance should be the winner, and the weights to that neuron are updated. One of the most important applications is the self-organising map by Kohonen (Kohonen, 1990), and nowadays such maps have been used extensively as one of the most effective methods for visualisation and clustering (Bao et al., 2022; Bogdan & Rosenstiel, 2001; Brugger et al., 2008; Fernández Rodríguez et al., 2022; Himberg, 2000; Lagani et al., 2021; Lu et al., 2019; Makhzani & Frey, 2014; Xu et al., 2010; Yin, 2002). The simplicity of competition processes such as the winner-take-all (WTA) has had much influence on attempts to simplify the computational procedures of complex neural networks. The WTA has naturally been used in fields directly related to actual biological systems, such as spiking neurons (Peng et al., 2021; Yu et al., 2018). In addition, many computational methods, taking full advantage of the properties of those biological systems, have also focused on the WTA to improve the performance of neural networks (Krotov & Hopfield, 2019; Shinozaki, 2017; Srivastava et al., 2013).

However, as mentioned above, it has been stated from the early stages of research that impartiality should be realised to improve the performance of competitive learning. For example, it has been observed that some neurons can be dead or inactive, preventing all neurons from being used equally. To solve this type of problem, many methods have been developed to eliminate dead neurons (Banerjee & Ghosh, 2004; Choy & Siu, 1998; DeSieno, 1988; Fritzke, 1993, 1996; Li et al., 2022; Van Hulle, 1999, 2004). In those methods, frequently winning neurons are penalised or the entropy of neurons is forced to be maximised, which is quite similar to the method in this paper. Competitive learning principally aims for neurons to respond to some inputs very specifically, and at the same time, for all neurons to respond to those inputs equally, at least on average. Due to the difficulty in dealing with the equal chance, much attention has been paid to the specific responses of neurons. To the best of our knowledge, in conventional competitive learning, the equal chance to win the competition has been treated as secondary. It should be repeated that the present method is certainly an extension of the conventional method. However, attention is primarily paid to the equal chance to win the competition, and computational procedures such as the WTA are considered secondarily. More strongly, the equal chance to win the competition is forced to be realised even at a higher cost in terms of the strength of connection weights. Because the equal chance is a primary objective of competitive computation, it is absolutely necessary to realise it at any cost and at any time.

1.3. Impartiality and interpretation

The impartial competition can also be applied to the problem of interpretation. Neural networks, from the beginning, have been used to create and interpret internal representations relating to the extraction of hidden and unobservable factors in complex data sets (McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986). However, modern interpretation methods in neural networks have depended heavily on observable entities, even when interpreting hidden representations. For example, interpretation problems have received much attention in the active field of convolutional neural networks (CNN), dealing mainly with image data sets, where many different types of interpretation methods have been developed. However, in interpreting weights and neuron activations in hidden layers, those methods have tended to use information directly observable or intuitively acceptable in image data sets (Arbabzadah et al., 2016; Bach et al., 2015; Bai et al., 2022; Binder et al., 2016; da Cunha et al., 2022; Erhan et al., 2009; Lapuschkin et al., 2016; Mahendran & Vedaldi, 2015; Montavon et al., 2019; Nguyen et al., 2019; Sturm et al., 2016). Thus, possible interpretations far from our intuition about the image data sets have been under-estimated due to the difficulty in accepting them. This under-estimation has been well recognised in recent and active discussions on adversarial features and attacks, where features contrary to our well-accepted way of thinking have been excluded completely, though those features may play important roles in the inference of neural networks (Chen et al., 2016; Ilyas et al., 2019; Wang et al., 2019, 2021; Xia et al., 2020; Xie et al., 2017).

This means that modern interpretation methods have focused on specific and human-oriented interpretations, because they seem intuitively the most reasonable. However, one of the main capabilities of neural networks lies in producing a great number of different internal representations, and in some cases we cannot understand their meaning due to the existence of inferences completely different from ours. Even with those incomprehensible representations, neural networks can produce the corresponding outputs. In terms of competitive computing, all those representations should have an equal chance to be interpreted. We need to develop a method that takes into account all, or at least as many different representations as possible, to obtain a more universal and stable interpretation. The present paper proposes a method that takes into account as many representations as possible for interpretation. An interpretation is then realised by unifying all those representations.

1.4. Objectives of the paper

After explaining the importance of impartiality in computing and interpretation, we should state the objectives of the present paper more concretely. This paper aims to stress the importance of the equal chance to win the competition in terms of components, computational procedures, and interpretation, and it tries to show that the equal chance leads us to core factors hidden behind direct and observable relations between inputs and targets, using a data set from a specific science.

First, it is shown that competition with the equal chance to win should be applied to components and computational procedures. In componential competition, competition is realised adaptively. In cases where impartiality cannot easily be realised, a cost-forced method is applied, in which the strength of weights (cost) is forced to increase to reduce differences among weights. In addition, impartiality is enhanced by repeating those forced processes many times. The forced method is introduced to show how important impartiality is in optimising neural networks.

Second, the method is applied to the interpretation of a data set on second-language learning (L2), because no decisive results have been obtained by processing the raw natural-language data sets. The present interpretation method tries to deal with as many internal representations as possible, where all representations should have an equal chance to be interpreted. Then, we try to show that the method can extract an important feature intuitively discussed in the specific science.

1.5. Paper organisation

In Section 2, we try to show how to define and compute the internal potentiality. Then, we show how the five computational procedures used in the experiments, namely cost minimisation and maximisation, potentiality minimisation and maximisation, and error minimisation, compete with each other. In addition, we briefly describe how to interpret the final results. The interpretation is based on the collective competition for interpretation, in which we try to interpret the final results by considering as many representations as possible. In particular, multi-layered neural networks are compressed into the simplest ones without hidden layers so that the final connection weights can be interpreted easily.

In Section 3, on the experimental results, we deal with a second-language-learning data set. With the conventional methods, clear results could not be obtained by observing the raw data sets in the corresponding science. We therefore tried to extract simple linear relations between inputs and outputs as well as non-linear ones, which were supposed to be related to the unobservable factors.

2. Theory and computational methods

In this section, we first explain the basic terms, such as cost and potentiality, needed to understand the present method. Using these terms, we explain the concept of competitive computing with two types of competition: componential and computational. In componential competition, all weights compete with each other to gain higher strength, and in addition, all weights are supposed to have an equal chance, or equal probability, of winning the competition. In computational competition, all computational and optimisation procedures compete with each other, and only one computational procedure wins the competition and is applied at a given period of time. All procedures should have an equal chance to win the competition, as is the case with the weights. Then, we explain how to implement this computing model in actual learning. Finally, we present how to interpret the final results. The interpretation method is called "collective interpretation", because all representations produced by different initial conditions, different learning steps, and different inputs have an equal chance to be interpreted. For this interpretation, we introduce a compression method in which multi-layered neural networks are simplified into networks without hidden layers.

2.1. Componential competition

We introduce here the internal potentiality of components in a neural network. This potentiality is supposed to represent the degree of activity inside a neural network. When the internal potentiality increases, the components are activated internally without considering output information. Thus, this potentiality maximisation corresponds to entropy maximisation in information theory (Cover & Thomas, 1991). However, the potentiality makes the meaning of connection weights easier to understand.

In the first place, we explain conceptually a process of competition from maximum to minimum potentiality. Figure 1 shows the concept of competition from maximum to minimum potentiality. Competition is realised by moving from a state with maximum potentiality to a state with minimum potentiality in Figure 1(a). In the maximum potentiality state, all connection weights have equal strength in Figure 1(b1), while in the minimum potentiality state, only one weight has some strength and all the others become zero in Figure 1(b2).

Figure 1. Transition from a state with the maximum potentiality to one with minimum potentiality.


Let us explain more concretely a network architecture and the corresponding potentiality. We define here the potentiality for connection weights, because we try to define the potentiality inside a network, underestimating the effects of inputs and outputs as much as possible. Figure 1(b) shows an example of a network architecture in which four hidden layers are used, ranging from the second to the fifth layer. We call the input layer the first layer, and the final, sixth layer corresponds to the output layer. Thus, the hidden layers range between the second and the fifth layer. For a simple illustration, we focus on connection weights between the tth and (t+1)th hidden layers, denoted by the notation (t,t+1), where t increases from the second to the fourth layer, namely, over the hidden layers. For computing the potentiality, we need to compute the absolute strength of weights
$$u_{jk}^{(t,t+1)} = \left| w_{jk}^{(t,t+1)} \right|, \tag{1}$$
where t increases from 2 to 4. Then, we normalise this by its maximum value
$$g_{jk}^{(t,t+1)} = \frac{u_{jk}^{(t,t+1)}}{\max_{j'k'} u_{j'k'}^{(t,t+1)}}, \tag{2}$$
where the max operation is over all connection weights between the layers. Then, the internal potentiality can be computed by
$$G^{(t,t+1)} = \sum_{j} \sum_{k} \frac{u_{jk}^{(t,t+1)}}{\max_{j'k'} u_{j'k'}^{(t,t+1)}}. \tag{3}$$
For simplicity's sake, we suppose that at least one weight is larger than zero, because the layers are completely separated when all connection weights are zero.

In addition, we define the complementary potentiality by
$$\bar{g}_{jk}^{(t,t+1)} = 1 - \frac{u_{jk}^{(t,t+1)}}{\max_{j'k'} u_{j'k'}^{(t,t+1)}}. \tag{4}$$
This complementary potentiality becomes smaller when the ordinary potentiality becomes larger. Thus, the complementary potentiality can be used to reduce the strength of weights with higher potentiality.

Finally, the cost can be computed simply as the sum of all absolute weights,
$$C^{(t,t+1)} = \sum_{j} \sum_{k} u_{jk}^{(t,t+1)}. \tag{5}$$
This cost is especially used to realise a state in which all weights have the same strength, corresponding to the maximum potentiality state. By changing the cost in addition to the potentiality, we can produce a number of different network configurations.
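As an illustrative sketch (not the authors' code), the quantities in Equations (1)-(5) can be computed for a single pair of layers as follows; the weight matrix and its size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))   # hypothetical weights w_jk between two adjacent layers

u = np.abs(W)                 # Eq. (1): absolute strength u_jk
g = u / u.max()               # Eq. (2): potentiality normalised by the maximum
G = g.sum()                   # Eq. (3): internal potentiality G over all weights
g_bar = 1.0 - g               # Eq. (4): complementary potentiality
C = u.sum()                   # Eq. (5): cost, the sum of all absolute weights
```

When all weights have equal absolute strength, every `g` entry equals one and `G` reaches its maximum (the number of weights); when only one weight is non-zero, `G` equals one, matching the transition sketched in Figure 1.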

Here, the problem is how to maximise the potentiality so as to realise the equal chance to win the competition. When the potentiality is maximised by reducing the strength of connection weights, as is done in conventional regularisation, some connection weights eventually tend to become larger, carrying specific information, due to the necessity of minimising errors between outputs and targets. To make the potentiality as large as possible and to give all weights an equal chance to win, we introduce a controlled-cost approach in which the cost is changed so as to increase and decrease the potentiality.

2.2. Computational competition

The present study supposes that, in principle, all components and computational procedures compete with each other to improve the performance of neural networks as much as possible. Thus, in addition to the componential competition described above, we need to explain the computational, or procedural, competition. In this paper, five computational procedures, namely potentiality maximisation, potentiality minimisation, cost maximisation, cost minimisation, and error minimisation, are used, as shown in Figure 2(a). In the conventional methods, all those procedures are simultaneously applied, with many learning parameters and hyper-parameters needing careful adjustment, as shown in Figure 2(a). We think that those conventional methods cannot resolve the contradiction among the five computational procedures. The competitive computation tries to use all computational procedures as equally as possible, as shown in Figure 2(b). At a specific learning step, only one computational procedure wins the competition and is actually applied in learning. For example, in Figure 2(b), all computational procedures are serially disentangled, where only one procedure wins the competition at a certain learning time. Figure 3 shows actual network configurations corresponding to the conceptual diagram in Figure 2. When the potentiality is maximised, all connection weights have the same absolute strength in Figure 3(a). Then, when the potentiality is minimised, only one weight tends to become larger, while all the others become zero in Figure 3(b). When the cost is larger, the strength of all connection weights becomes larger in Figure 3(c). When the cost is smaller, the strength of all connection weights is smaller in Figure 3(d). Finally, when the error is minimised in Figure 3(e), some weights tend to become larger in the end.
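The serial disentanglement described above can be sketched as follows; this is a hypothetical scheduler, not the authors' implementation. The five procedure bodies are placeholders that only record which procedure won at each step, and cycling through them guarantees each an equal chance to win.

```python
from collections import Counter
from itertools import cycle

# Placeholder procedure bodies: each records that it won the competition.
def potentiality_max(log): log.append("potentiality_max")
def potentiality_min(log): log.append("potentiality_min")
def cost_max(log):         log.append("cost_max")
def cost_min(log):         log.append("cost_min")
def error_min(log):        log.append("error_min")

procedures = [potentiality_max, potentiality_min, cost_max, cost_min, error_min]

log = []
winner = cycle(procedures)   # serially disentangled: one winner per learning step
for step in range(10):       # two full passes over the five procedures
    next(winner)(log)        # only the winning procedure is applied at this step

counts = Counter(log)        # every procedure wins equally often
```

Because exactly one procedure is applied at a time, the effect of each procedure on the network configuration remains separable, as argued in the text.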

Figure 2. Conventional computing (a) and competitive computation (b), composed of five computational procedures: potentiality max, potentiality min, cost max, cost min, and error min.


Figure 3. Actual network configurations, corresponding to five computation procedures: potentiality max, potentiality min, cost max, cost min, and error minimisation in Figure 2.


2.3. Cost-forced and repeated competitive computing

We explain here the actual learning models used in this paper for easy interpretation of experimental results. The reason why we adopted this model is that generalisation performance could be improved. In addition, because the method called “cost-forced” tends to increase the strength of weights excessively, we need to reduce the cost immediately after the cost augmentation.

Let us first explain how to use the cost to increase the potentiality. Figure 4 shows an example of the cost-forced competitive computing used in the experiments. Learning is composed of three modules: the cost-forced potentiality maximisation module in Figure 4(a), the cost minimisation module in Figure 4(b), and the cost-forced potentiality minimisation module in Figure 4(c). Figure 5 shows actual network configurations corresponding to the conceptual diagram in Figure 4. Figure 5(a) shows an example of a network configuration in the cost-forced potentiality maximisation module. As shown in the figure, connection weights in an initial phase tend to have randomly different strengths. By increasing the cost, the strength of weights becomes stronger, and all connection weights tend to become almost equal. Then, by reducing the errors between outputs and targets, several weights tend to become larger or smaller again. In the cost reduction module in Figure 4(b), we have only two procedures: cost minimisation and error minimisation. These procedures are introduced to reduce the cost as much as possible, because in the first module the cost tends to be extremely augmented in the actual experiments. As shown in Figure 5(b), the strong connection weights obtained in the first cost-forced potentiality augmentation phase are forced to become smaller. In the final module in Figure 4(c), the cost is further reduced, and then the potentiality is also forced to decrease. The potentiality decrease corresponds to specific information augmentation, where only a very few weights tend to become stronger. Finally, by error minimisation, some other weights tend to become larger again for error reduction.

Figure 4. Concept of three computational modules used in this paper: cost-forced potentiality maximisation (cost+potentiality max), cost minimisation, and cost-forced potentiality minimisation (cost+potentiality min).


Figure 5. Actual network configurations of three computational modules, corresponding to the concepts in Figure 4.


In all those processes, we try to increase the potentiality, or to decrease information on inputs, to obtain the minimum possible information. However, it is not so easy to increase the potentiality monotonically. The potentiality is first maximised with a larger cost, and then the potentiality is decreased with a smaller cost. As shown in Figures 4 and 5, those processes of potentiality maximisation and minimisation are repeated many times, with the corresponding cost reduction and augmentation. In the end, many different weight configurations can be produced for better generalisation.

2.4. Practical competitive computational procedures

Let us show how to update connection weights in learning. In the first cost-forced potentiality maximisation module in Figures 4(a) and 5(a), all weights are forced to be equal, even at the expense of increasing the cost in terms of absolute weight strength. Increasing the cost is realised by making the parameter θ larger than one. These large parameter values force connection weights to be large and equal in strength, corresponding to maximum potentiality.

Then, for the practical computation, we modify the original complementary potentiality $\bar{g}$ with some additional parameters to eliminate several extreme values we must face in the middle of cost augmentation. The modified complementary potentiality is given by
$$\bar{h}_{jk}^{(t,t+1)} = \left[ 1 - \frac{u_{jk}^{(t,t+1)}}{\max_{j'k'} u_{j'k'}^{(t,t+1)}} + \epsilon \right]^{\gamma}. \tag{6}$$
These parameters are used to stabilise learning by weakening the effects of the individual potentiality, with ϵ and γ set to very small values. By using this modified complementary and individual potentiality, weights are modified as
$$w_{jk}^{(t,t+1)}(n+1) = \theta \, \bar{h}_{jk}^{(t,t+1)}(n) \, w_{jk}^{(t,t+1)}(n), \tag{7}$$
where weights at the (n+1)th step are obtained by multiplying the previous nth weights by the corresponding modified potentiality. These update rules can be interpreted as follows. First, the parameter θ is increased, and then the potentiality is increased, followed by error minimisation, corresponding to Figures 4(a) and 5(a). All three procedures of cost maximisation, potentiality maximisation, and error minimisation can be performed independently of each other in principle.

In the second module, cost minimisation, in Figures 4(b) and 5(b), we try to reduce the strength of weights by setting θ smaller,
$$w_{jk}^{(t,t+1)}(n+1) = \theta \, w_{jk}^{(t,t+1)}(n). \tag{8}$$
Then, this smaller cost value is assimilated in the phase of error minimisation.

In the third module in Figures 4(c) and 5(c), the weights are modified by the individual potentiality with one parameter,
$$q_{jk}^{(t,t+1)} = \left[ \frac{u_{jk}^{(t,t+1)}}{\max_{j'k'} u_{j'k'}^{(t,t+1)}} \right]^{\gamma}. \tag{9}$$
Then, we use the same type of potentiality assimilation process,
$$w_{jk}^{(t,t+1)}(n+1) = \theta \, q_{jk}^{(t,t+1)}(n) \, w_{jk}^{(t,t+1)}(n), \tag{10}$$
where the parameter θ should be less than one to reduce the strength of weights. This process can be interpreted as follows. First, the parameter θ is reduced below one to make connection weights smaller. Then, the potentiality is decreased gradually, followed by error minimisation.
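The three module updates of Equations (7), (8), and (10) can be sketched as below; the values of θ, ϵ, and γ are illustrative only, and the interleaved error-minimisation steps (ordinary supervised training) are omitted, so this is a sketch under stated assumptions rather than the authors' implementation.

```python
import numpy as np

def cost_forced_potentiality_max(W, theta=1.1, eps=1e-3, gamma=0.1):
    """Module 1, Eqs. (6)-(7): w <- theta * h_bar * w with theta > 1 (equalising)."""
    u = np.abs(W)
    h_bar = (1.0 - u / u.max() + eps) ** gamma   # modified complementary potentiality
    return theta * h_bar * W

def cost_min(W, theta=0.9):
    """Module 2, Eq. (8): w <- theta * w with theta < 1 (shrinking the cost)."""
    return theta * W

def cost_forced_potentiality_min(W, theta=0.9, gamma=0.1):
    """Module 3, Eqs. (9)-(10): w <- theta * q * w with theta < 1 (specialising)."""
    u = np.abs(W)
    q = (u / u.max()) ** gamma                   # individual potentiality
    return theta * q * W

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))
for _ in range(3):                               # the three modules applied in sequence, repeatedly
    W = cost_forced_potentiality_max(W)
    W = cost_min(W)
    W = cost_forced_potentiality_min(W)
```

Note that in module 1 the largest weight receives the smallest multiplier (its complementary potentiality is near zero), while small weights receive multipliers near θ, so repeated application pushes all absolute strengths toward equality, as the text describes.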

2.5. Impartial competitive interpretation

In the impartial competitive interpretation, we suppose that all representations created by a neural network have the same potentiality for interpretation, implying that all are winners in interpretation. This is a very extreme case and, more strongly, one of the most idealised cases of impartial competition. As shown in Figure 6, as many different internal representations as possible are produced by different initial conditions, different learning steps, and different inputs. All those representations are first compressed and unified into the simplest forms for each learning step in Figure 6(a). Then, all compressed weights are averaged over all learning steps, which is called "syntagmatic compression", in Figure 6(b). Then, all those syntagmatically compressed weights are again compressed, or averaged, in a "paradigmatic" way into the final collective weights in Figure 6(c). This interpretation aims to deal not with a specific network configuration but with a space of network configurations generated by learning with different initial conditions.

Figure 6. Concept of compression for collective weights by syntagmatic (b) and paradigmatic (c) compression.


We explain here, in the first place, how to compress weights partially or fully, and then how to unify all those compressed weights to obtain the final collective weights for interpretation (Kamimura, 2019). Model compression methods have been developed to compress complicated multi-layered neural networks into simpler ones (Bucilu et al., 2006; Cheng et al., 2020; Hinton et al., 2015; Luo et al., 2016). Contrary to those models, where compression is performed without keeping information on the corresponding network configuration of the original networks, the present compression tries to keep the original information as much as possible.

Let us show a process of compression in Figure 7 for a specific learning step, initial condition, and input. In the first place, we show how to compress networks fully, or full compression, in Figure 7(a). Now, we compress connection weights from the first to the second layer, denoted by (1,2), and from the second to the third layer, (2,3), for an initial condition and a subset of a data set. Then, we have the compressed weights between the first and the third layer, denoted by (1,3),
$$w_{ik}^{(1,3)} = \sum_{j} w_{ij}^{(1,2)} w_{jk}^{(2,3)}. \tag{11}$$
Those compressed weights are further combined with the weights from the third to the fourth layer, (3,4), and we have the compressed weights between the first and the fourth layer, (1,4),
$$w_{il}^{(1,4)} = \sum_{k} w_{ik}^{(1,3)} w_{kl}^{(3,4)}. \tag{12}$$
By repeating these processes, we have the compressed weights between the first and fifth layer, denoted by $w_{iq}^{(1,5)}$. Using those connection weights, we have the final and fully compressed weights, (1,6),
$$w_{ir}^{(1,6)} = \sum_{q} w_{iq}^{(1,5)} w_{qr}^{(5,6)}. \tag{13}$$
This type of compression is applied to all layers to achieve full compression. In addition, compressed weights are averaged over all initial conditions and subsets of data sets to obtain the final compressed weights, called "collective weights".
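Full compression in Equations (11)-(13) amounts to a chain of matrix products over the six layers; a minimal sketch with hypothetical layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [10, 5, 5, 5, 5, 3]   # input, four hidden layers, output (illustrative)
# weights[t] holds w^(t+1,t+2): the matrix between consecutive layers
weights = [rng.normal(size=(sizes[t], sizes[t + 1])) for t in range(5)]

W_full = weights[0]           # start from w^(1,2)
for W_next in weights[1:]:    # fold in w^(2,3), ..., w^(5,6)
    W_full = W_full @ W_next  # each product realises one of Eqs. (11)-(13)
```

The result `W_full` plays the role of the fully compressed weights $w^{(1,6)}$: one matrix directly relating inputs to outputs.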

Figure 7. Compression from a multi-layered network to the corresponding simplest network in full (a) and partial (b) way.


Partial compression is performed with only one hidden layer, as in Figure 7(b). For example, the weights from the second to the third layer (2,3) are combined with the weights from the first to the second layer (1,2). The resulting weights are then directly combined with the output weights (5,6), giving
$$w_{ir}^{(1,(2,3),6)} = \sum_{q} w_{iq}^{(1,3)} w_{qr}^{(5,6)}, \tag{14}$$
where the notation (2,3) means that only the connection weights between the second and third layers are combined with the input and output layers. In all cases, it is supposed that the number of neurons is the same in all hidden layers, but the method can also be applied to networks with different numbers of neurons in the hidden layers.
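Partial compression keeps only one hidden-to-hidden block and couples it directly to the input and output weights. A minimal sketch under the equal-width assumption stated above (the helper name `partial_compression` is ours, not the paper's):

```python
import numpy as np

def partial_compression(weights, hidden_index):
    """Keep only one hidden-to-hidden weight matrix.

    `weights` = [W12, W23, W34, W45, W56].  With hidden_index = 1
    (the (2,3) weights), this forms w(1,3) = W12 @ W23 and couples it
    directly to the output weights W56, as in Equation (14).  Equal
    hidden-layer widths are assumed so that the shapes line up.
    """
    w_in = weights[0] @ weights[hidden_index]  # e.g. w(1,3)
    return w_in @ weights[-1]                  # couple to output weights

# toy example: 4 inputs, four hidden layers of 3 units, 2 outputs
rng = np.random.default_rng(1)
shapes = [(4, 3), (3, 3), (3, 3), (3, 3), (3, 2)]
ws = [rng.standard_normal(s) for s in shapes]
print(partial_compression(ws, 1).shape)  # (4, 2)
```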

Then, we should explain how the final collective weights are actually obtained. In principle, the final collective weights can be obtained by averaging all representations created with different initial conditions, subsets of data sets, and different learning steps, because we suppose that all instances of connection weights have the same importance and that all of them should be taken into account for interpretation. As shown in Figure 8, we have two types of compression for considering all learning steps, initial conditions, and input patterns: syntagmatic and paradigmatic compression. Figure 8 shows the two types of compression realised by actual networks. Connection weights are trained with an initial condition, a fixed number of learning steps, and an input pattern. For each learning step, we compress the obtained weights into the simplest ones in Figure 8(b1–b3). All those compressed weights are further compressed, or averaged, in Figure 8(b). This process of compression is called "syntagmatic". In addition, all those syntagmatically compressed weights are compressed, or averaged, in Figure 8(c), which is paradigmatic compression. Finally, the compressed weights, called "collective weights", are obtained in Figure 8(d). We should state again that all obtained compressed weights and collective weights have the same importance, and in principle, they are equally taken into account for interpretation. This is one of the most extreme cases of impartial competition.
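The two averaging stages can be sketched as follows; this is our minimal reading (function name and data layout are assumptions), with the compressed weights for each run and learning step stored as NumPy arrays:

```python
import numpy as np

def collective_weights(runs):
    """Average fully compressed weights in two stages.

    `runs[r][s]` is the compressed (inputs x outputs) matrix for run r
    at learning step s.  First average over learning steps within each
    run (syntagmatic compression), then over runs with different
    initial conditions and data subsets (paradigmatic compression).
    Every instance contributes equally, following the impartial
    treatment described in the text.
    """
    syntagmatic = [np.mean(steps, axis=0) for steps in runs]
    return np.mean(syntagmatic, axis=0)
```

When every run contains the same number of learning steps, this two-stage mean coincides with the plain mean over all instances.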

Figure 8. Simple (a), syntagmatic (b), and paradigmatic (c) compression to produce the final collective weights (d).


3. Results and discussion

In this experiment, we tried to distinguish between Japanese learners of English with higher and lower TOEIC scores and tried to examine whether the relative clause could be extracted as an important factor in the distinction, as stated in the specific science.

3.1. Distinction between skilled and unskilled learners

Difficulty with English relative clauses in second-language (L2) learning has received much attention in studies on many languages (Eckman et al., Citation1988; Izumi, Citation2003; O'Grady et al., Citation2003; Papadopoulou & Clahsen, Citation2003). Thus, the use of relative clauses in writing should be one of the main factors differentiating skilled from unskilled learners. Many reports have been published on this difficulty, mainly based on intuitive speculation. To the best of our knowledge, no experimental results have been reported on the importance of the relative clause to English-language learners obtained by analysing actual English texts with natural language processing systems. For example, with the conventional method, using the well-known ICNALE corpus, several experimental results were reported (Wakamatsu, Citation2018), but no explicit results could be obtained, only a statement that the important variables were, naturally, the type and token measures, followed by several ordinary pronouns and functional words such as "I", "the", and so on, which are actually high-frequency words. These results show that the conventional statistical methods could not produce reasonable final results, because they focus on the observable words themselves. We think that those results with conventional statistical methods were confined to high-frequency observable entities. Low-frequency words, which may be related to relative clauses and which are in this sense unobservable, cannot be easily identified. The present paper aims to extract an input related to the relative clause by impartial competitive computing and interpretation.

3.2. Experimental outline

The data set was taken from the data set of ICNALE: the International Corpus Network of Asian Learners of English, as described in the notes on data availability. The input variables were ordered in terms of word frequency, ranging from the most frequent word "the" to the least frequent word "from", selected from among the top 50 most frequent words. In addition to these actual words, we added the type and token measures. Because types and tokens are naturally related to word frequency, these measures should have considerable effects on learning, because word frequency is one of the most important observable factors. Neural networks with ten hidden layers and ten neurons in each hidden layer were used, where the number of input patterns was 344 and the number of input variables was 52. This was a highly redundant network configuration for this problem, chosen to show that a redundant network is necessary for good performance, in particular for generalisation. We tried to improve generalisation performance, because it has been considered one of the main objectives of neural networks. Figure 9 shows how the final collective weights for interpretation were computed for the parameter θ, ten different initial conditions, and subsets of input patterns. In the experiments, we changed the parameter θ to control the cost, or the strength of weights, and observed how generalisation and correlation coefficients changed.

Figure 9. Cost-forced potentiality max (a), cost min (b), cost-forced potentiality min (c), and final collective weights with an important input representing “who” (d) for the L2 data set.


Let us show how the experiments were performed, and in particular how the collective weights for interpretation were computed. In the first place, we determined the value of the parameter θ, and for an initial condition, learning was performed to produce a set of compressed weights in Figure 9(a) by the cost-forced potentiality maximisation module, followed by the cost minimisation module in Figure 9(b) and the cost-forced minimisation module in Figure 9(c). Then, syntagmatic compression was applied to unify the compressed weights over all learning steps, from the first to the final learning step. Those syntagmatically compressed weights were again compressed, or averaged, to produce the final collective weights in Figure 9(d).

In the first place, we used the complementary potentiality to change the connection weights, and for the weights (t,t+1), for example from the second to the third layer, we have
$$\theta \bar{h}_{jk}^{(t,t+1)} = \theta \left[ 1 - \frac{u_{jk}^{(t,t+1)}}{\max_{j'k'} u_{j'k'}^{(t,t+1)} + \epsilon} \right]^{\gamma}, \tag{15}$$
where the parameter θ was increased from 1 to 1.5. When the parameter increased beyond 1, connection weights were forced to be larger in strength. This cost-forced method was introduced to adjust connection weights so that their strength tended to become greater. The other parameter γ was set to 0.05; because the connection weights were changed many times, they tended to take extreme values due to the complementary potentiality, so we tried to eliminate excessive effects on the weight change by reducing this parameter. The final parameter ε was set to 0.01 to avoid zero values. Secondly, we tried to reduce the strength of weights by setting a smaller θ,
$$w_{jk}^{(t,t+1)}(n+1) = \theta\, w_{jk}^{(t,t+1)}(n). \tag{16}$$
The actual small value was 0.95, and because the number of epochs in the error minimisation phase was set to 50, the effect of this small value could be observed in the process of error minimisation. Thirdly, the weights were modified by the individual potentiality with two parameters,
$$\theta q_{jk}^{(t,t+1)} = \theta \left[ \frac{u_{jk}^{(t,t+1)}}{\max_{j'k'} u_{j'k'}^{(t,t+1)}} \right]^{\gamma}. \tag{17}$$
The parameter θ was set to 0.95, and the other parameter γ was set to 0.05. This potentiality had the effect of reducing the strength of connection weights when they were already small: as connection weights became smaller, they became much smaller in strength. In the end, only a small number of weights with larger strength remained, producing lower potentiality.
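The three operations of Equations (15)–(17) can be sketched as elementwise NumPy functions. This is a hedged illustration: the function names are ours, and we assume the potentiality u is the absolute weight strength, as stated in the discussion section.

```python
import numpy as np

def complementary_potentiality(u, theta=1.3, gamma=0.05, eps=0.01):
    """Equation (15): weights with smaller potentiality u get the
    larger modification factor, pushing all weights towards an equal
    chance to win the competition."""
    return theta * (1.0 - u / (u.max() + eps)) ** gamma

def cost_minimisation(w, theta=0.95):
    """Equation (16): uniformly shrink the weights at each epoch n."""
    return theta * w

def individual_potentiality(u, theta=0.95, gamma=0.05):
    """Equation (17): weights with larger potentiality keep more of
    their strength, so only a few strong weights remain."""
    return theta * (u / u.max()) ** gamma

u = np.abs(np.array([0.5, 1.0]))   # toy absolute weight strengths
```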

Finally, we should note the other parameter values of neural learning in the scikit-learn package. To make it easy to reproduce the results in this paper, we kept the original default parameter values as much as possible, except for the hyperbolic tangent activation function and the number of learning epochs, and did not use the conventional regularisation methods.
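A sketch of the base network configuration described above, using scikit-learn's `MLPClassifier`. Setting `alpha=0.0` to disable the default L2 penalty is our reading of "without conventional regularisation methods", and `max_iter=50` reflects the stated epoch count of the error minimisation phase; both are assumptions rather than the author's exact settings.

```python
from sklearn.neural_network import MLPClassifier

# Ten hidden layers of ten units each, tanh activation, and otherwise
# scikit-learn defaults; alpha=0.0 removes the default L2 penalty.
clf = MLPClassifier(hidden_layer_sizes=(10,) * 10,
                    activation="tanh",
                    alpha=0.0,
                    max_iter=50,
                    random_state=0)
```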

3.3. Internal potentiality and cost

The best generalisation performance was obtained when potentiality min-max and cost min-max were repeated many times. This means that good performance was obtained not by a simple and monotonic increase or decrease of potentiality and cost, but by the repeated application of increases and decreases of potentiality and cost. In other words, the repeated and active application of competitive computing is necessary for learning. Neural networks should seek optimal network configurations by actively repeating max-min operations in the course of learning.

Figure 10 shows the potentiality (left) and cost (right) as a function of the number of learning steps for the four methods. As shown in Figure 10(a), with the conventional method, the potentiality (left) and its cost (right) remained unchanged throughout all learning steps. When maximum generalisation was attained in Figure 10(b), potentiality maximisation and minimisation, as well as cost maximisation and minimisation, were repeated 20 times. In the end, however, the learning process showed cost minimisation as well as potentiality maximisation. This method tried to maximise the potentiality and to minimise the corresponding cost by repeating the processes of maximisation and minimisation, seeking optimal network configurations. We think that the monotonic decrease or increase of the conventional regularisation methods could not produce better results due to the complexity of the over-parametrised and redundant networks used in the experiments. When maximum correlation was obtained in Figure 10(c), the potentiality immediately reached almost a maximum point. In addition, the cost was forced to be much larger in the end. This means that learning was not sufficiently performed due to the higher cost. When generalisation was the lowest in our experiment, shown in Figure 10(d), the potentiality did not increase sufficiently, because the cost (right) remained extremely small throughout the entire learning process. This means that cost minimisation was so effective that the networks could not increase the corresponding potentiality, indicating under-fitting.

Figure 10. Internal potentiality (left) and cost (right) as a function of the number of steps when the conventional method was used (a) and when the parameter θ was 1.3 (b), 1.0 (c), and 1.5 (d) for the L2 data set.


These results show that, for increasing generalisation, a monotonic decrease of cost and a monotonic increase or decrease of potentiality were not enough; the networks should repeat increase and decrease operations many times, re-activating the competition processes, to reach optimal network configurations. In conventional terms, we need to repeat a process of regularisation and so-called "de-regularisation" many times. One of the major problems of this method is that the cost could increase extremely in the end, as shown in Figure 10(c), due to the large parameter θ. This means that, in future studies, we should develop a method to restrict the increase in the strength of connection weights.

3.4. Collective weights

The conventional methods and the majority of our methods mainly detected linear correlations between inputs and targets, where only functional words and summary measures such as the type and token naturally showed larger importance values, corresponding to the results so far reported by the conventional methods. On the other hand, the neural network with the best generalisation could produce collective weights different from the correlation coefficients, in which many inputs considered not so important linearly tended to have larger strength or importance.

Figure 11 shows the correlation coefficients between inputs and targets of the original data set (a) and the collective weights by the conventional method (b) and the impartial method (c)–(e), with different values of correlation coefficients and generalisation. Figure 11(a) shows that the correlation coefficients between inputs and targets were larger for "for" (13th input) and "their" (48th input), meaning that the functional words tended to have larger importance, corresponding to the results so far reported. In addition, for easy interpretation, we added the type (1st input) and token (2nd input) to the data set. The type and token on the leftmost side of the figure showed larger correlation coefficients. This shows that the correlation coefficients seem to be based on the frequency of words, in particular functional words and summary measures. These types of experimental results have been observed when analysing texts by many different types of statistical methods (Ishikawa, Citation2013; Wakamatsu, Citation2018).
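The baseline of Figure 11(a), the Pearson correlation between each input variable and the target, can be computed in one vectorised pass. A minimal sketch (function name assumed), with X of shape (patterns, variables):

```python
import numpy as np

def input_target_correlations(X, y):
    """Pearson correlation of every input column with the target.

    Centre both sides, then divide the cross-products by the product
    of the standard deviations; returns one coefficient per variable.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den
```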

Figure 11. Correlation coefficients between inputs and targets of the original data set (a) and collective weights by the conventional (b) and the impartial methods (c)–(e) with different performance values for the L2 data set.


With the conventional method in Figure 11(b), and with our method with maximum correlation in Figure 11(d) and with minimum generalisation performance in Figure 11(e), the collective weights were quite similar to the correlation coefficients in Figure 11(a), though several inputs tended to have slightly larger values. Thus, the conventional methods as well as our method mainly extracted linear relations between inputs and outputs. However, when maximum generalisation performance was obtained in Figure 11(c), the collective weights were clearly different from the correlation coefficients in Figure 11(a), and many inputs tended to have larger values. In particular, input No.1 (type) had the highest score, meaning that the networks tried to classify inputs by the number of types. On the other hand, with the conventional method and with our method with maximum correlation, input No.2 (token) was larger. This means that, to improve generalisation, we need to pay more attention not to the total number of words, but to the number of distinct words. Though our method seemed to detect different inputs as important, we could not see clear characteristics in those collective weights, because too many collective weights were large.

3.5. Relative collective weights

We then examined the effect of non-linearity by computing the ratio of the absolute collective weights to the absolute original correlation coefficients. The results showed clearly that the present method could reveal characteristics that could not be clarified by using the correlation coefficients. In particular, the specific relative pronoun "who" (42nd input), representing the relative clause, tended to have larger relative collective weights. This means that the word "who" could be revealed through the effect of non-linearity, or some combination of inputs.

Figure 12 shows the relative values obtained by dividing the absolute collective weights by the corresponding absolute correlation coefficients. Figure 12(a) shows the relative weights by the conventional method, which were small, meaning that the conventional method produced collective weights close to the correlation coefficients, paying little attention to non-linear relations. Figure 12(b) shows the relative weights when maximum generalisation was obtained. The largest relative weight was observed for input No.11 ("smoking"), which was the topic of the piece of writing. This is natural, because topic words tend to be used frequently in writing; however, neither the collective weights nor the correlation coefficients could capture this word in Figure 11. The second largest was input No.36, representing the preposition "with", which, being a functional word, could not easily be explained. In addition, the third largest was input No.42 ("who"), representing the relative pronoun. This input was not considered important in terms of the linear correlation coefficients in Figure 11(a). Moreover, the other results on the relative weights by the present method show that input No.42 ("who") clearly plays an important role. For example, when maximum correlation was obtained in Figure 12(c), input No.42 ("who") was by far the largest. When generalisation was the lowest in Figure 12(d), input No.42 had a lower value, and the other larger inputs in Figure 12(b) tended to have smaller values. These results show that input No.42, representing the typical relative pronoun, can play a critical role in discrimination on this data set.
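The relative collective weights of Figure 12 are a simple elementwise ratio. A minimal sketch (the guard term `eps` is our addition, to avoid division by near-zero correlations; it is not mentioned in the paper):

```python
import numpy as np

def relative_collective_weights(collective, correlations, eps=1e-12):
    """Ratio of absolute collective weights to absolute input-target
    correlation coefficients.  Inputs whose importance comes mainly
    from non-linear effects stand out, because their collective weight
    is large while their linear correlation is small."""
    return np.abs(collective) / (np.abs(correlations) + eps)

# toy example: the first input is twice as important as its linear
# correlation suggests, the second is purely linear
print(relative_collective_weights(np.array([1.0, 1.0]),
                                  np.array([0.5, 1.0])))
```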

Figure 12. Relative collective weights, computed by dividing the absolute collective weights by the corresponding absolute correlation coefficients by the conventional (a) and three impartial methods (b)–(d) for the L2 data set.


3.6. Partial collective weights

By examining the collective weights partially, we could see that the network with the largest generalisation performance processed information from the first hidden layer onward, with the information decreasing towards the final hidden layer. On the other hand, the other methods, including the conventional methods, did not process information in the hidden layers near the input layer; in particular, with the impartial method at maximum correlation, only the final hidden layer played an important role in learning. The use of almost all hidden layers by our method is related to the corresponding higher generalisation.

Figure 13 shows the partial collective weights obtained by the four methods. Figure 13(a) shows the partial collective weights obtained by the conventional method. The weights had smaller values in all hidden layers except between the second and third hidden layers and in the layers above the seventh. On the contrary, when the best generalisation was obtained in Figure 13(b), all hidden layers except the final one had relatively larger values for all inputs. The final hidden layer, closest to the output layer, had the lowest values. When the largest correlation was observed in Figure 13(c), the partial collective weights were stronger only for the final hidden layer. Finally, when the worst generalisation was obtained in Figure 13(d), the partial collective weights were stronger only for the two highest hidden layers. This means that an almost equal use of all hidden layers was needed to improve generalisation.

Figure 13. Partial collective weights from the first hidden layer to the tenth hidden layer by the conventional (a) and potentiality method (b)–(d) for the L2 data set.


3.7. Correlation and generalisation

The present method could produce the best generalisation at the expense of decreasing the correlation coefficients between inputs and targets. However, the results suggest that, with relatively larger correlation coefficients, we could still obtain better generalisation performance than that obtained by the other conventional methods.

Table 1 summarises the experimental results on generalisation and on the correlation coefficients between the collective weights and the original correlation coefficients between inputs and targets. For the impartial competitive computing method, only three cases are listed: the best generalisation, the largest correlation coefficient, and the lowest generalisation, obtained as the parameter θ was increased from 1 to 1.5. The best generalisation (0.917) was obtained by the impartial method with θ=1.3. However, its correlation coefficient was the lowest (0.405). This naturally means that, to improve generalisation performance, the neural network should capture non-linear or moderating relations among several inputs between inputs and outputs. However, when the parameter θ was 1.5, the correlation coefficient was the second largest (0.877), and the generalisation performance was still the second highest (0.907). This suggests that generalisation performance can be improved while keeping relatively high linear relations. The largest correlation (0.892) was obtained by logistic regression analysis, which still had good generalisation performance (0.896). The conventional method produced the worst generalisation (0.850), with a lower correlation (0.877). The random forest produced the lowest correlation coefficient (0.657), but it showed relatively high generalisation (0.901). The results show that linear relations could capture the main characteristics of the input patterns. However, to further improve generalisation, we need to force networks to consider non-linear relations.

Table 1. Summary of experimental results on average correlation coefficients and generalisation performance for the L2 data set.

3.8. Discussion

The experimental results discussed so far seem to show the effectiveness of impartial competitive learning, which rests on the hypothesis that all components should have an equal chance in competition and, more strongly, that all should be winners. We discuss here the implications of this importance of impartiality for other research areas of neural networks. In addition, two limitations concerning the properties of the potentiality should be discussed. To make the discussion more concrete, we use the recently discussed lottery ticket hypothesis together with regularisation problems, and we add some comments on the relation between generalisation and network size in general. This is because those regularisation problems have tried to find specific weights, neurons, and network configurations among many, contrary to our principle of impartiality. The present method tries to show the importance of using all resources inside neural networks as equally as possible, which is contrary to the specific use of resources in neural networks.

Competitive processes have been well recognised not only in competitive learning but also in many conventional learning methods. For example, many regularisation methods, such as weight decay and even the dropout procedures (Srivastava et al., Citation2014; Wu et al., Citation2021; Xiao et al., Citation2016), can be considered methods that realise a process of competition among neurons and connection weights. In the case of weight decay, as the strength of weights is reduced, competition among weights becomes higher, and finally only a few weights out of many win the competition. This type of regularisation is a method by which connection weights compete with each other under the condition of error minimisation. In our view, one of the main problems in terms of competitive computation is that not all connection weights have an equal chance to win the competition at every state of learning. This means that there is almost no way to restore learning once it has fallen into traps, though random noise inherent to some specific learning methods may be of some use (Gunasekar et al., Citation2017; Mandt et al., Citation2017). Regularisation may therefore be difficult to apply in learning, due to this inequality in winning the competition.

In connection with regularisation, a new concept closely related to the present hypothesis of competitive computing has been stated, namely the lottery ticket hypothesis (Chen et al., Citation2021; Frankle & Carbin, Citation2018; Frankle et al., Citation2019, Citation2020; Malach et al., Citation2020; Tian et al., Citation2019). The hypothesis says, simply, that large randomly initialised neural networks tend to contain a sub-network with lucky initial weights. This hypothesis may explain why over-parameterised neural networks show better generalisation performance. However, one of the main problems with this hypothesis lies in its implicit supposition of equal potentiality of all sub-networks, which is not necessarily true. Some sub-networks are chosen much more frequently, and others have no chance to be chosen, due to initial biases, as is the case with the dead neurons in competitive learning. The present paper aims to eliminate the frequent chance for specific sub-networks to win the competition by making all weights, as well as sub-networks, equally likely to be chosen. In addition, the present paper aims to show that, without a method to make the chance of winning the competition equal, we have much difficulty in extracting an appropriate sub-network. It can be said that the conventional methods depend on the implicit and passive supposition of an equal chance.

At this point, we think that impartiality can explain one of the phenomena inherent to neural networks: even as the network size becomes larger in terms of components, generalisation performance does not necessarily decrease, contrary to our expectation (Arpit et al., Citation2017; Zhang et al., Citation2021). These phenomena can be explained by the well-known concept of implicit regularisation (Gunasekar et al., Citation2017), but they may also be explained by impartiality. When the network size, and thus the number of connection weights, becomes larger, and when the strength of the initial weights is considerably small and close to zero, the connection weights naturally tend to have small and almost equal values in the end. In terms of our potentiality, all individual potentialities of the weights become smaller and almost equal. This means that equal potentiality may be realised in this case, but this is not always true in much larger networks, due to the appearance of dead components, as mentioned in the section on the problems of competitive learning. Thus, the implicit regularisation hypothesis in this case seems to be a passive way of obtaining the higher potentiality discussed in this paper. Contrary to those passive methods, we try to force networks to have equal potentiality, or an equal chance to win the competition, at any time when we face inequality in competition, including in much larger networks. If this is possible, it can be said that, as the network size becomes larger, generalisation becomes correspondingly higher, due to the possible and active existence of higher potentiality.

Finally, two limitations should be pointed out: lower potentiality and how to define the potentiality. In the first place, we have the problem of lower potentiality. As mentioned above, competition is a process of transition from a state of higher potentiality to one of lower potentiality. However, our approach focuses on the higher potentiality, and a state of lower potentiality is not necessarily considered fully. In our experimental results, the computational procedures tried to increase the potentiality, accompanied by only slightly lower potentiality. As might be inferred, a state of lower potentiality is one with fewer connection weights, leading to more interpretable configurations of weights compared with a state of higher potentiality. Higher and lower potentiality are thus mutually exclusive and contradictory, and it is desirable to develop a method to produce a state in which higher and lower potentiality co-exist more naturally.

Second, one of the main limitations, or maybe one of the main shortcomings, of the method proposed here is that the potentiality of a network is principally based on the strength of the absolute weights. Then, as the strength becomes larger, the corresponding potentiality simply becomes larger. However, as may easily be inferred, the strength of the absolute weights does not necessarily represent the importance of the corresponding connection weights. This is the basic shortcoming, or limitation, of the method. We used the strength of the absolute weights as a first approximation of the true potentiality of the components of neural networks, because it is easy to define and compute. The problem is that, when the potentiality increases, all connection weights tend to become the same, carrying no information on the inputs. This is because our potentiality measure was introduced to replace the entropy measure, and the entropy and the potentiality are almost the same in meaning. When the entropy increases, the potentiality increases correspondingly. However, when the entropy increases, all components become uniformly activated, carrying no information on the inputs. Thus, if the potentiality is to represent more realistic situations of information flow in neural networks, it might be better to introduce stronger concepts of potentiality instead of the strength of the absolute weights. As is well known, in the infomax principle by Linsker (Citation1988, Citation1989, Citation1992), a pioneer of the information-theoretic method for neural networks, the variance played important roles, and this measure may be of some use for the potentiality.
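The contrast between the two candidate measures can be made concrete with a small sketch. This is purely illustrative and not part of the paper's method: the strength-based measure is the first approximation used in the paper, while the variance-based alternative is our reading of the infomax-inspired suggestion above.

```python
import numpy as np

def strength_potentiality(w):
    """First-approximation potentiality: mean absolute weight strength.
    Uniformly large weights score high, even though they carry no
    information that differentiates the inputs."""
    return np.abs(w).mean()

def variance_potentiality(w):
    """A variance-based alternative in the spirit of Linsker's infomax:
    perfectly uniform weights score zero, while differentiated weights
    score higher."""
    return np.abs(w).var()

uniform = np.ones(4)            # same strength everywhere
mixed = np.array([0.0, 2.0, 0.0, 2.0])
```

Under the strength measure both configurations can score equally, while the variance measure separates them, which is the limitation the paragraph above points out.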

4. Conclusion

The present paper proposed a new learning method in which all components and computational procedures compete with each other. In addition, we emphasised the importance of an equal chance, or potentiality, to win the competition, where all should be winners. However, an equal chance to win the competition cannot easily be realised, because an equal chance corresponds to losing specific information for learning. For this problem, we introduced the cost-forced method to increase the potentiality of components to win the competition. The potentiality, as well as the corresponding cost, was applied by repeating the processes of potentiality and cost maximisation and minimisation. This repeated cost augmentation aimed to force the potentiality, or chance to win the competition, to increase to the extreme point.

The method was applied to a second-language learning data set. Though much difficulty in using relative clauses has been observed in language acquisition, it was impossible to extract this fact by analysing the data set with natural-language processing systems, because they deal only with word frequency. This paper supposed that this inability to extract the relations was due to the shortcoming of conventional linear statistical methods, in which only direct relations are explored. Our new method based on the impartial potentiality sought both direct and indirect relations by minimising biased information to obtain the necessary information. The experimental results showed that the new method was successful in clarifying linear relations as well as non-linear or indirect ones, by which we could extract the important factor of relative clauses, which have been considered one of the features that make second-language learning difficult.

The paper has aimed to explore the possibility of competitive learning with impartial operations purely to clarify the meaning of competition, without paying attention to large-scale data sets and corresponding industrial applications. Nevertheless, two points are favourable for industrial application. First, the method was developed to simplify complicated computational procedures, and this simplicity may carry over to large-scale industrial applications. Second, the interpretation aims to clarify the meaning of the collective behaviour of neural networks, which may yield a more stable and fixed interpretation for large-scale and complicated data sets.

For practical application, further studies are needed to make information acquisition and potentiality control smoother in competitive computing. For example, the competition was controlled algorithmically; it should be controlled more softly when extracting important information from a small quantity of information. Nevertheless, the present results certainly show that many complex problems in the social and human sciences can be dealt with by neural networks, whose use will make it possible to clarify hidden characteristics of complicated phenomena.

Author's contributions

The author was solely responsible for the conception and design of the work; the acquisition, analysis, and interpretation of data; and the creation of the new software used in the work.

Disclosure statement

The author has no competing interests to declare that are relevant to the content of this article.

Data availability statement

The data set was taken from the research project ICNALE (the International Corpus Network of Asian Learners of English), a collection of controlled essays and speeches produced by learners of English in 10 countries and areas in Asia (https://language.sakura.ne.jp/icnale/). The files for Japanese learners of English, W_JPN_PTJB1_2.txt and W_JPN_SMK_B1.txt, were taken from the project home page.

References

  • Arbabzadah, F., Montavon, G., Müller, K. R., & Samek, W. (2016). Identifying individual facial expressions by deconstructing a neural network. In German Conference on Pattern Recognition (pp. 344–354).
  • Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., & Lacoste-Julien, S. (2017). A closer look at memorization in deep networks. In International Conference on Machine Learning (pp. 233–242).
  • Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K. R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One, 10(7), e0130140. https://doi.org/10.1371/journal.pone.0130140
  • Bai, Y., Wang, H., Tao, Z., Li, K., & Fu, Y. (2022). Dual lottery ticket hypothesis. arXiv preprint, arXiv:2203.04248.
  • Banerjee, A., & Ghosh, J. (2004). Frequency-sensitive competitive learning for scalable balanced clustering on high-dimensional hyperspheres. IEEE Transactions on Neural Networks, 15(3), 702–719. https://doi.org/10.1109/TNN.2004.824416
  • Bao, G., Lin, M., Sang, X., Hou, Y., Liu, Y., & Wu, Y. (2022). Classification of dysphonic voices in Parkinson's disease with semi-supervised competitive learning algorithm. Biosensors, 12(7), 502. https://doi.org/10.3390/bios12070502
  • Binder, A., Montavon, G., Lapuschkin, S., Müller, K. R., & Samek, W. (2016). Layer-wise relevance propagation for neural networks with local renormalization layers. In International Conference on Artificial Neural Networks (pp. 63–71).
  • Bogdan, M., & Rosenstiel, W. (2001). Detection of cluster in self-organizing maps for controlling a prostheses using nerve signals. In 9th European Symposium on Artificial Neural Networks. ESANN. Proceedings. D-facto, Evere, Belgium (pp. 131–136).
  • Brugger, D., Bogdan, M., & Rosenstiel, W. (2008). Automatic cluster detection in Kohonen's SOM. IEEE Transactions on Neural Networks, 19(3), 442–459. https://doi.org/10.1109/TNN.2007.909556
  • Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 535–541).
  • Chen, X., Cheng, Y., Wang, S., Gan, Z., Liu, J., & Wang, Z. (2021). The elastic lottery ticket hypothesis. Advances in Neural Information Processing Systems, 34, 26609–26621.
  • Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems (pp. 2180–2188).
  • Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2020). A survey of model compression and acceleration for deep neural networks.
  • Choy, C. S., & Siu, W. (1998). A class of competitive learning models which avoids neuron underutilization problem. IEEE Transactions on Neural Networks, 9(6), 1258–1269. https://doi.org/10.1109/72.728374
  • Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. John Wiley and Sons, Inc.
  • da Cunha, A., Natale, E., & Viennot, L. (2022). Proving the strong lottery ticket hypothesis for convolutional neural networks. In International Conference on Learning Representations.
  • DeSieno, D. (1988). Adding a conscience to competitive learning. In IEEE International Conference on Neural Networks (Vol. 1, pp. 117–124).
  • Eckman, F. R., Bell, L., & Nelson, D. (1988). On the generalization of relative clause instruction in the acquisition of English as a second language. Applied Linguistics, 9(1), 1–20. https://doi.org/10.1093/applin/9.1.1
  • Erhan, D., Bengio, Y., Courville, A., & Vincent, P. (2009). Visualizing higher-layer features of a deep network. University of Montreal, Technical Report 1341.
  • Fernández Rodríguez, J. D., Maza Quiroga, R. M., Palomo Ferrer, E. J., Ortiz-de-Lazcano-Lobato, J. M., & López-Rubio, E. (2022). A novel continual learning approach for competitive neural networks.
  • Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
  • Frankle, J., Dziugaite, G. K., Roy, D., & Carbin, M. (2020). Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning (pp. 3259–3269).
  • Frankle, J., Dziugaite, G. K., Roy, D. M., & Carbin, M. (2019). Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611.
  • Fritzke, B. (1993). Vector quantization with a growing and splitting elastic net. In ICANN '93: International Conference on Artificial Neural Networks (pp. 580–585). Springer.
  • Fritzke, B. (1996). Automatic construction of radial basis function networks with the growing neural gas model and its relevance for fuzzy logic. In Applied computing: Proceedings of the ACM Symposium on Applied Computing (pp. 624–627). ACM.
  • Fukushima, K. (1975). Cognitron: A self-organizing multi-layered neural network. Biological Cybernetics, 20(3-4), 121–136. https://doi.org/10.1007/BF00342633
  • Grossberg, S. (1976). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics, 23(3), 121–134. https://doi.org/10.1007/BF00344744
  • Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B., & Srebro, N. (2017). Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems, 30, 6152–6160.
  • Himberg, J. (2000). A SOM based cluster visualization and its application for false colouring. In Proceedings of the International Joint Conference on Neural Networks (pp. 69–74).
  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint, arXiv:1503.02531.
  • Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., & Madry, A. (2019). Adversarial examples are not bugs, they are features. arXiv preprint, arXiv:1905.02175.
  • Ishikawa, S. (2013). The ICNALE and sophisticated contrastive interlanguage analysis of Asian learners of English. Learner Corpus Studies in Asia and the World, 1(1), 91–118.
  • Izumi, S. (2003). Processing difficulty in comprehension and production of relative clauses by learners of English as a second language. Language Learning, 53(2), 285–323. https://doi.org/10.1111/1467-9922.00218
  • Kamimura, R. (2019). Neural self-compressor: Collective interpretation by compressing multi-layered neural networks into non-layered networks. Neurocomputing, 323(5), 12–36. https://doi.org/10.1016/j.neucom.2018.09.036.
  • Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1480. https://doi.org/10.1109/5.58325
  • Krotov, D., & Hopfield, J. J. (2019). Unsupervised learning by competing hidden units. Proceedings of the National Academy of Sciences, 116(16), 7723–7731. https://doi.org/10.1073/pnas.1820458116
  • Lagani, G., Falchi, F., Gennaro, C., & Amato, G. (2021). Training convolutional neural networks with competitive Hebbian learning approaches. In International Conference on Machine Learning, Optimization, and Data Science (pp. 25–40).
  • Lapuschkin, S., Binder, A., Montavon, G., Müller, K. R., & Samek, W. (2016). Analyzing classifiers: Fisher vectors and deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2912–2920).
  • Li, P., Tu, S., & Xu, L. (2022). Deep rival penalized competitive learning for low-resolution face recognition. Neural Networks.
  • Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117. https://doi.org/10.1109/2.36
  • Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1(3), 402–411. https://doi.org/10.1162/neco.1989.1.3.402
  • Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4(5), 691–702. https://doi.org/10.1162/neco.1992.4.5.691
  • Lu, Y., Cheung, Y. M., & Tang, Y. Y. (2019). Self-adaptive multiprototype-based competitive learning approach: A k-means-type algorithm for imbalanced data clustering. IEEE Transactions on Cybernetics, 51(3), 1598–1612. https://doi.org/10.1109/TCYB.6221036
  • Luo, P., Zhu, Z., Liu, Z., Wang, X., & Tang, X. (2016). Face model compression by distilling knowledge from neurons. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Mahendran, A., & Vedaldi, A. (2015). Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5188–5196).
  • Makhzani, A., & Frey, B. (2014). A winner-take-all method for training sparse convolutional autoencoders. In NIPS Deep Learning Workshop.
  • Malach, E., Yehudai, G., Shalev-Shwartz, S., & Shamir, O. (2020). Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning (pp. 6682–6691).
  • Mandt, S., Hoffman, M. D., & Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. arXiv preprint, arXiv:1704.04289.
  • McClelland, J. L., & Rumelhart, D. E. (1986). Parallel distributed processing (Vol. 1). MIT Press.
  • Montavon, G., Binder, A., Lapuschkin, S., Samek, W., & Müller, K. R. (2019). Layer-wise relevance propagation: An overview. In Explainable AI: Interpreting, explaining and visualizing deep learning (pp. 193–209). Springer.
  • Nguyen, A., Yosinski, J., & Clune, J. (2019). Understanding neural networks via feature visualization: A survey. In Explainable AI: Interpreting, explaining and visualizing deep learning (pp. 55–76). Springer.
  • O'Grady, W., Lee, M., & Choo, M. (2003). A subject-object asymmetry in the acquisition of relative clauses in Korean as a second language. Studies in Second Language Acquisition, 25(3), 433–448. https://doi.org/10.1017/S0272263103000172
  • Papadopoulou, D., & Clahsen, H. (2003). Parsing strategies in L1 and L2 sentence processing: A study of relative clause attachment in Greek. Studies in Second Language Acquisition, 25(4), 501–528. https://doi.org/10.1017/S0272263103000214
  • Peng, B., Jin, L., & Shang, M. (2021). Multi-robot competitive tracking based on k-WTA neural network with one single neuron. Neurocomputing, 460(4), 1–8. https://doi.org/10.1016/j.neucom.2021.07.020.
  • Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tenses of English verbs. In D. E. Rumelhart, G. E. Hinton, & R. J. Williams (Eds.), Parallel distributed processing (Vol. 2, pp. 216–271). MIT Press.
  • Rumelhart, D. E., & Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9(1), 75–112. https://doi.org/10.1207/s15516709cog0901_5.
  • Shinozaki, T. (2017). Biologically inspired feedforward supervised learning for deep self-organizing map networks. arXiv preprint, arXiv:1710.09574.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
  • Srivastava, R. K., Masci, J., Kazerounian, S., Gomez, F., & Schmidhuber, J. (2013). Compete to compute. Advances in Neural Information Processing Systems, 26, 2312–2321.
  • Sturm, I., Lapuschkin, S., Samek, W., & Müller, K. R. (2016). Interpretable deep neural networks for single-trial EEG classification. Journal of Neuroscience Methods, 274(1), 141–145. https://doi.org/10.1016/j.jneumeth.2016.10.008.
  • Tian, Y., Jiang, T., Gong, Q., & Morcos, A. (2019). Luck matters: Understanding training dynamics of deep relu networks. arXiv preprint, arXiv:1905.13405.
  • Van Hulle, M. M. (1999). Faithful representations with topographic maps. Neural Networks, 12(6), 803–823. https://doi.org/10.1016/S0893-6080(99)00041-6
  • Van Hulle, M. M. (2004). Entropy-based kernel modeling for topographic map formation. IEEE Transactions on Neural Networks, 15(4), 850–858. https://doi.org/10.1109/TNN.2004.828763
  • Wakamatsu, H. (2018). Corpus analysis (in Japanese). In A. Hirai (Ed.), Data analysis for educational psycholinguistics. Tokyo Tosho.
  • Wang, L., Yan, Y., He, K., Wu, Y., & Xu, W. (2021). Dynamically disentangling social bias from task-oriented representations with adversarial attack. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 3740–3750).
  • Wang, T., Zhao, J., Yatskar, M., Chang, K. W., & Ordonez, V. (2019). Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5310–5319).
  • Wu, L., Li, J., Wang, Y., Meng, Q., Qin, T., Chen, W., Zhang, M., & Liu, T. Y. (2021). R-drop: Regularized dropout for neural networks. Advances in Neural Information Processing Systems, 34, 10890–10905.
  • Xia, Y., Zhou, J., Shi, Z., Lu, C., & Huang, H. (2020). Generative adversarial regularized mutual information policy gradient framework for automatic diagnosis. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 1062–1069).
  • Xiao, T., Li, H., Ouyang, W., & Wang, X. (2016). Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1249–1258).
  • Xie, Q., Dai, Z., Du, Y., Hovy, E., & Neubig, G. (2017). Controllable invariance through adversarial feature learning. arXiv preprint, arXiv:1705.11122.
  • Xu, L., Xu, Y., & Chow, T. W. (2010). PolSOM: A new method for multidimensional data visualization. Pattern Recognition, 43(4), 1668–1675. https://doi.org/10.1016/j.patcog.2009.09.025
  • Yin, H. (2002). ViSOM-a novel method for multivariate data projection and structure visualization. IEEE Transactions on Neural Networks, 13(1), 237–243. https://doi.org/10.1109/72.977314
  • Yu, Z., Guo, S., Deng, F., Yan, Q., Huang, K., Liu, J. K., & Chen, F. (2018). Emergent inference of hidden markov models in spiking neural networks through winner-take-all. IEEE Transactions on Cybernetics, 50(3), 1347–1354. https://doi.org/10.1109/TCYB.6221036
  • Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115. https://doi.org/10.1145/3446776