Research Article

Emotion recognition based on convolutional gated recurrent units with attention

Article: 2289833 | Received 22 May 2023, Accepted 27 Nov 2023, Published online: 09 Dec 2023

Abstract

Studying brain activity and deciphering the information in electroencephalogram (EEG) signals has become an emerging research field, and substantial advances have been made in EEG-based emotion classification. However, exploiting different EEG features and their complementarity to discriminate emotions remains challenging. Most existing models extract a single type of feature from the EEG signal while ignoring crucial temporal dynamic information, which, to a certain extent, constrains their classification capability. To address this issue, we propose an Attention-Based Depthwise Parameterized Convolutional Gated Recurrent Unit (AB-DPCGRU) model and validate it with mixed experiments on the SEED and SEED-IV datasets. The experimental outcomes reveal that the accuracy of the model surpasses existing state-of-the-art methods, which confirms the superiority of our approach over currently popular emotion recognition models.

1. Introduction

Emotion is a state that integrates human feelings, thoughts, and behaviours and plays an essential role in human-to-human communication (Bahari & Janghorbani, Citation2013). In addition to logical ability, emotional intelligence is a necessary part of human intelligence. Emotional intelligence consists of a person's psychological response to external or self-driven stimuli, together with the physiological responses that accompany it. Emotions play a pervasive role in people's daily work and life. In medical care, for example, being aware of the emotional state of patients, especially those with expression impairments, allows caregivers to adopt different nursing measures and improve the quality of care. Likewise, in product development, identifying the user's emotional state and understanding the user experience helps refine product features and makes the design better satisfy the user's needs. Moreover, human–computer interaction systems become friendlier and more natural when they can recognise people's emotional states (Yin et al., Citation2020). Therefore, emotion analysis and recognition are essential interdisciplinary research topics in neuroscience, psychology, computer science, and artificial intelligence. However, given our limited understanding of the neural mechanisms that underpin emotional processing, an effective and convenient way to quantify emotion that can provide feedback for disease treatment is still lacking. One of the 20 major questions about the future of humanity covered by Scientific American is whether we can monitor human emotions with wearable technology. After all, smart wearable devices have great potential to boost human–computer interaction and tackle mental illness.

Emotions are sophisticated physiological and psychological processes involving numerous internal and external activities. Electroencephalogram (EEG) signals can provide an objective picture of diverse emotions, making them a reliable tool for recognising true emotions (Bahari & Janghorbani, Citation2013). Increasingly, researchers are dedicated to the EEG analysis of diversified emotions elicited by specific stimulus patterns and to the development of emotional artificial intelligence in human–computer interaction (Yin et al., Citation2020). One research objective is to identify suitable features for EEG emotion recognition via different methods and then optimise the model to augment the classification accuracy. Physiological studies provide an excellent basis for developing EEG emotion recognition models. Studies have shown that when the subject is in a negative mood, low-frequency band signals (delta, theta) are more active than high-frequency band signals (beta, gamma); even within the beta and gamma bands, activation in temporal lobe areas is significantly higher in subjects with positive emotional states than in those with negative emotional states (Jia et al., Citation2020). In response to these findings, researchers use deep learning methods to integrate such features and improve emotion recognition accuracy, for instance, using Convolutional Neural Networks (CNN) or Graph Convolutional Networks (GCN) to capture spatial-spectral features and Long Short-Term Memory (LSTM) networks to capture spatial–temporal signals. Although existing emotion recognition methods have achieved high accuracy rates, most of them consider only unidimensional features without taking into account the complementarity between features and the differences between EEG bands, which may miss essential information in the EEG signal.

To address the above problems, this paper presents an attention-based depthwise over-parameterised convolutional gated recurrent unit network (AB-DPCGRU), which first extracts the time–frequency domain characteristics of EEG signals, combines them with the spatial features, and then integrates them into a three-dimensional EEG information flow. The spatial and frequency domain features in the EEG information flow are then extracted through depthwise over-parameterised convolution, and the temporal features of the EEG are extracted by a Gated Recurrent Unit (GRU). Finally, the emotion classification result is obtained through the linear output layer.

The contributions of this paper are as follows: (1) We propose a new emotion recognition model (AB-DPCGRU) that simultaneously fuses the temporal, spatial, and spectral information of EEG signals and captures the complementarities and differences between features. (2) The multilayer composite linear operation used in the model can be collapsed into a compact single-layer representation after the training stage, which reduces computation at inference time while the over-parameterisation accelerates training. (3) We use the SEED and SEED-IV datasets to evaluate the proposed method. The results demonstrate that the proposed method is superior to other state-of-the-art methods.

The remainder of this paper is as follows. Section 2 describes related work. Section 3 presents the proposed method. Section 4 describes the dataset and parameter settings. Section 5 presents the experimental results and analysis. Section 6 concludes this paper.

2. Related work

Recently, rising interest has been focused on using information about the emotional state contained in the user's EEG signals to augment brain-computer interfaces, yielding the so-called affective brain-computer interface (aBCI) (Mühl et al., Citation2014). The aBCI seeks to empower machines with the capability to perceive, understand and modulate emotions, and the critical issue is to identify emotions from EEG. Before the rise of deep learning, manually extracted emotional features combined with traditional machine learning algorithms achieved good results. Typically, researchers use machine learning to derive spectral, temporal, or spatial features from EEG signals for emotion identification. With respect to spectral feature extraction, frequently employed features comprise the differential entropy (DE) feature (Duan et al., Citation2013), the power spectral density (PSD) feature (Frantzidis et al., Citation2010), the differential asymmetry (DASM) feature (Liu & Sourina, Citation2013), the rational asymmetry (RASM) feature (Lin et al., Citation2010) and the differential caudality (DCAU) feature (Zheng & Lu, Citation2015). For example, Bascil et al. used least squares support vector machines (LS-SVM) and learning vector quantisation (LVQ) to identify different emotions from power spectral density features (Bascil et al., Citation2016); Duan adopted power spectrum features, differential asymmetry features, and rational asymmetry features to evaluate the performance of the support vector machine (SVM), the k-nearest neighbour algorithm (KNN), and the least squares classifier (LSC) (Duan et al., Citation2012). Yet traditional machine learning techniques suffer from severe constraints in feature design and selection: they require extensive expertise, which makes the manual selection of suitable EEG features costly in terms of time and effort. As artificial intelligence advances rapidly, the application of deep learning in brain science is becoming more extensive. Zheng and Lu introduced the Deep Belief Network (DBN) to study the critical frequency bands and channels of EEG for emotion recognition (Zheng & Lu, Citation2015) and confirmed that the DE feature derived from the EEG signal is an accurate and stable classification feature. Besides, Song et al. proposed the Dynamic Graph Convolutional Neural Network (DGCNN) for EEG emotion classification (Song et al., Citation2018). All of these deep models yielded better performance than shallow models.

Nonetheless, there remain a number of challenges to be addressed in the construction of deep learning-based EEG representations, one of which is how to incorporate more useful EEG information to better identify emotions. Many researchers have studied the association between EEG frequency bands and emotion categories in the past decade. Zheng and Lu found that emotion is closely related to four frequency bands, Theta (θ, 4–7 Hz), Alpha (α, 8–13 Hz), Beta (β, 14–30 Hz), and Gamma (γ, 31+ Hz), and confirmed that combining these four bands yields better emotion classification than any individual band (Zheng & Lu, Citation2015). Another problem is how to accurately extract the spatial features of the EEG signal. To improve emotion recognition performance, researchers have explored the intrinsic information in the positional relationships between electrodes. Li et al. proposed a novel bi-hemisphere difference model (Bi-HDM) to learn asymmetric disparities between the two hemispheres for EEG emotion recognition (Li et al., Citation2019). Zhang proposed a method based on a four-directional recurrent neural network to simultaneously capture long-distance spatial correlations between electrodes. In addition, some investigators have found that both the spatial information from multiple electrodes within a time slice and its correlation with the information contained in preceding time slices are crucial for emotion identification. For example, a parallel convolutional recurrent neural network (PCRNN) has been engineered to extract spatial and temporal features from EEG signals through a CNN and an LSTM-based RNN (Hochreiter & Schmidhuber, Citation1997) and then directly integrate the outputs of the CNN and RNN for classification. Li et al. put forward a novel EEG emotion recognition method consisting of a spatiotemporal neural network model coupled with a regional-to-global hierarchical feature learning process to pick up discriminative spatiotemporal EEG features; to acquire spatial features, a bidirectional long short-term memory (Bi-LSTM) network is employed to capture the intrinsic spatial connections of EEG electrodes within and between brain regions (Li et al., Citation2019). Information about the interplay between different brain areas cannot be ignored either. Jung transformed EEG signals into image-based representations and obtained encouraging results (Jung & Sejnowski, Citation2019); Zhang et al. designed a graph convolutional broad network (GCB-Net) to explore more in-depth information in structured graph data (Zhang et al., Citation2019). Therefore, it is of great significance to solve problems in the field of brain science with the help of computer science.

Despite the high accuracy rates achieved by existing emotion recognition methods, most of them consider only a single feature or a combination of two features. There is little literature on using spectral, spatial, and temporal information simultaneously for EEG-based emotion recognition. Owing to the complementarity between features, we propose the AB-DPCGRU model, which considers spatial, spectral, and temporal features simultaneously.

3. Methods and models

Figure 1 shows the whole process of EEG emotion recognition. It mainly comprises the following parts: the 3D data module, the attention module, the over-parameterised convolutional network module, the gated recurrent network module, and the emotion classification output. This paper introduces the 3D data module in Section 3.1, the attention module in Section 3.2, the over-parameterised convolutional network module in Section 3.3, and the gated recurrent network module in Section 3.4.

Figure 1. An outline of the EEG-based emotion recognition framework presented with AB-DPCGRU.


3.1. 3D feature flow

To simultaneously integrate temporal, spatial, and spectral features, this paper constructs a 3D data input structure, inspired by Yang et al. (Citation2018). First, the EEG signal is split into $T_n\ (n = 1, 2, \ldots, N)$ non-overlapping time segments in the time domain. Each time segment $T_n$ is divided into five frequency bands: Delta (δ), Theta (θ), Alpha (α), Beta (β) and Gamma (γ), and annotated with the original labels in the dataset. Unlike Yang et al., we did not discard the data in the Gamma (γ) frequency band, with the aim of retaining as much information from the original data as possible. This paper then extracts the DE features for these five frequency bands respectively (Duan et al., Citation2013). The obtained DE features are mapped into a two-dimensional map according to the positions of the electrodes (Figure 2) to obtain the spatial characteristics of the EEG signal. Finally, the spatial features of consecutive time segments are stacked to form this paper's 3D data input.

Figure 2. Mapping from real 62 electrode channel locations to 2D maps.


The 3D input can be represented as follows:
(1) $M_t^T \in \mathbb{R}^{H \times W \times D}, \quad t \in \{1, 2, \ldots, T\}$
where $M_t^T$ denotes the $t$-th of the $T$ time blocks, $H$ and $W$ respectively denote the height and width of the mapped 2D map, and $D$ denotes the number of frequency bands. The DE feature is selected because it is often adopted as a measure of the complexity of EEG signals, and DE has been proven to be the most reliable feature in emotion recognition (Zheng et al., Citation2017). DE is defined as follows:
(2) $h(X) = -\int_{S} f(x)\log f(x)\,dx$
where $X$ is a continuous random variable, $f(x)$ is its probability density function, and $S$ is the support set of the random variable. For an EEG segment of a certain length that approximately obeys a Gaussian distribution $N(\mu, \sigma^2)$, its DE equals:
(3) $h(X) = -\int_{S} \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \log\!\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\right) dx = \frac{1}{2}\log\!\left(2\pi e \sigma^{2}\right)$
To increase the amount of data, this paper uses a 0.5 s window to calculate the DE feature of each frequency band. The calculated DE features are then normalised with the Z-score model, so that each DE vector is converted into standard scores. The Z-score is calculated as:
(4) $z = \frac{x - \bar{x}}{s}$
where $\bar{x}$ represents the mean of the DE values and $s$ represents their standard deviation.
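To make the feature-extraction step concrete, the following sketch computes band-wise DE features over 0.5 s windows using the Gaussian closed form in Equation (3) and then applies the Z-score normalisation of Equation (4). The band edges, filter order, and array layout are illustrative assumptions rather than the exact configuration used in the paper; the resulting (channels × bands) slice for each window would then be scattered into the H × W × D grid according to the electrode map in Figure 2.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 200                      # sampling rate after downsampling (Hz)
WIN = int(0.5 * FS)           # 0.5 s window length in samples
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}   # assumed band edges

def band_de(eeg_channel, low, high):
    """Differential entropy of one channel per 0.5 s window, assuming each
    band-limited windowed segment is Gaussian: DE = 0.5*log(2*pi*e*sigma^2)."""
    b, a = butter(4, [low / (FS / 2), high / (FS / 2)], btype="band")
    filtered = filtfilt(b, a, eeg_channel)
    n_win = len(filtered) // WIN
    segs = filtered[: n_win * WIN].reshape(n_win, WIN)
    var = segs.var(axis=1) + 1e-8
    return 0.5 * np.log(2 * np.pi * np.e * var)          # Equation (3)

def de_features(eeg):
    """eeg: (n_channels, n_samples) -> (n_windows, n_channels, n_bands)."""
    feats = np.stack([np.stack([band_de(ch, lo, hi) for ch in eeg], axis=1)
                      for lo, hi in BANDS.values()], axis=2)
    # Z-score normalisation per channel and band (Equation (4))
    return (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
```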

3.2. Attention module

Previous studies have shown that embedding attention modules into existing CNNs can significantly improve performance; for example, attention mechanisms such as BAM and CBAM have brought considerable performance gains. The scSE attention module (Roy et al., Citation2018), proposed in 2018 and applied to image segmentation, is used in this paper and is a variant of SENet. To better understand the role of the scSE module, this paper first introduces the SE (Squeeze-and-Excitation) module in SENet, which explicitly models the interdependence between feature channels. A "feature recalibration" strategy is deployed, whose essence is to introduce an attention mechanism between channels. More concretely, the significance of each feature channel is learned automatically; useful features are then enhanced according to this importance, while features that contribute little to the task at hand are suppressed. The SE module mainly includes two operations: Squeeze and Excitation. The first step applies a global pooling operation to the backbone features, compressing the spatial dimensions into one real number per channel. These values then pass through a fully connected layer that reduces the dimensionality to 1/16 of the input, followed by a ReLU and a second FC layer that restores the original dimensionality. Finally, the weights are normalised to between 0 and 1 through a Sigmoid layer and used to reweight the original feature channels, which completes one SE module operation.

The SE module described above only excites along the channel dimension and ignores spatial information; in the scSE terminology it corresponds to the Spatial Squeeze and Channel Excitation block (cSE). Therefore, this paper introduces another SE block, which is "squeezed" along the channels and "excited" spatially, called the Channel Squeeze and Spatial Excitation block (sSE). Finally, this paper uses the concurrent spatial and channel SE block (scSE), which recalibrates the feature maps along the channel and spatial dimensions respectively and then combines the two outputs. Unlike in image processing, this study applies attentional fusion to the spatial and spectral information of each time slice of the 3D EEG features, so that the feature maps provide more informative spatial and spectral representations, as shown in Figure 3.

Figure 3. Attention module. Where U^sSE extracts channel attention and U^cSE extracts spectral attention.


The attention module comprises channel attention and spectral attention. The role of channel attention is to use the 2D-map position information of the electrodes in each time slice to generate weights; the weights represent the importance of particular electrode positions and thus reflect the degree to which different brain regions are activated by emotions. The specific process is as follows: (1) the feature map is converted from [C, H, W] to [C, 1, 1] by global average pooling; (2) two 1 × 1 × 1 convolutions process the pooled information, yielding a final C-dimensional vector; (3) the sigmoid function normalises this vector to obtain the corresponding mask; (4) the information-calibrated feature map is obtained by multiplying the mask channel-wise with the input.

The role of spectral attention is to use the relationships among the five frequency bands of each time slice to generate the corresponding weights. The process is detailed below: (1) a 1 × 1 × 1 convolution is applied directly to the feature map, changing it from [C, H, W] to [1, H, W]; (2) a sigmoid activation produces the spatial attention map; (3) the attention map is then applied to the initial feature map to calibrate the spatial information.

In general, the attention module performs spatial and spectral attention fusion on the 3D input data stream, and the recalibrated features are more easily extracted by the subsequent over-parameterised convolution module. After the attention module, the spatial and spectral information of each time slice is reweighted, while the input and output dimensions remain unchanged.
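To make the two attention branches concrete, here is a minimal PyTorch sketch of a concurrent scSE-style block operating on feature maps of shape [C, H, W]; the reduction ratio and the additive fusion of the two branches are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SCSEBlock(nn.Module):
    """Concurrent spatial and channel squeeze-and-excitation (scSE) sketch."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        # cSE branch: squeeze the spatial dimensions, excite the channels
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # [C, H, W] -> [C, 1, 1]
            nn.Conv2d(channels, channels // reduction, 1),  # first 1x1 convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # back to C channels
            nn.Sigmoid(),                                   # channel mask in (0, 1)
        )
        # sSE branch: squeeze the channels, excite the spatial positions
        self.sse = nn.Sequential(
            nn.Conv2d(channels, 1, 1),                      # [C, H, W] -> [1, H, W]
            nn.Sigmoid(),                                   # spatial mask in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reweight the input with both masks and combine; shapes are unchanged.
        return x * self.cse(x) + x * self.sse(x)

# Example: one time slice with 5 band channels on an 8x9 electrode map (illustrative).
feat = torch.randn(16, 5, 8, 9)
out = SCSEBlock(channels=5, reduction=1)(feat)
print(out.shape)   # torch.Size([16, 5, 8, 9])
```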

3.3. Over-parameterised convolution module

Generally speaking, approaches to emotion recognition fall into two categories: statistics-based classification and deep learning-based methods. Currently, the most broadly employed technique for extracting spectral-spatial features from EEG signals is the convolutional neural network (CNN) (Cecotti & Graeser, Citation2008), as CNNs can express highly complex functions and have solved many classic computer vision problems (e.g. image classification, detection, and segmentation) with great success. It is generally accepted that increasing the depth of a model increases its expressiveness and improves its performance. However, simply adding extra linear layers does not work as expected: multiple consecutive linear layers can be replaced by a single linear layer with fewer parameters, so stacking them merely increases model complexity and can degrade performance. Cao et al. (Citation2022) proposed the Depthwise Over-parameterized Convolutional Layer (DO-Conv) and showed through extensive experiments that replacing traditional CNN convolutional layers with it improves performance on many classic vision tasks. DO-Conv augments a regular convolution with an additional over-parameterisation component: a depthwise convolution. Using a deeper network structure, however, tends to increase the computational load. To solve this problem, the multi-layer composite linear operation used by DO-Conv can be collapsed into a single compact layer after the training stage; in the inference phase, DO-Conv can thus be converted into a traditional convolution operation, reducing the amount of computation. In addition, DO-Conv can improve the performance of the fused model and accelerate the training process of the network (Arora et al., Citation2018).

The construction of the depthwise over-parameterised convolution is displayed in Figure 4. The calculation process of Figure 4(a), called feature composition, is: (1) perform a depthwise convolution on the input features to obtain an intermediate feature; (2) perform a traditional convolution on the intermediate feature to obtain the final result. The calculation process of Figure 4(b), called kernel composition, is: (1) multiply the two kernels to obtain a new weight; (2) perform a traditional convolution on the input features with the new weight to obtain the final result. The depthwise convolution operator "∘" is first applied to the depthwise kernel D and the input feature P to generate the transformed feature P′ = D ∘ P; the regular convolution operator "∗" is then applied to the conventional kernel W and the feature P′ to generate the output O = W ∗ P′. Therefore, the output of the depthwise over-parameterised convolution can be expressed as O = W ∗ (D ∘ P). As shown in Figure 4, M = 8 and N = 9 are the spatial dimensions of the input, Cin = 5 is the number of input channels, Cout = 64 is the number of output channels, and dmul = M × N is the depth multiplier of the depthwise convolution.
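As a rough illustration of the kernel-composition view (not the authors' implementation), the sketch below folds a depthwise kernel D into a conventional kernel W before a single standard convolution is applied, so inference costs the same as an ordinary convolution; the initialisation and the choice d_mul = K × K are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DOConv2d(nn.Module):
    """Sketch of a depthwise over-parameterised convolution (kernel composition).

    W: (C_out, C_in, d_mul)   conventional kernel over the depthwise outputs
    D: (C_in, d_mul, K*K)     depthwise kernel, d_mul outputs per input channel
    At inference the two collapse into one ordinary (C_out, C_in, K, K) kernel.
    """
    def __init__(self, c_in, c_out, k=3, d_mul=None, padding=1):
        super().__init__()
        self.k, self.padding = k, padding
        d_mul = d_mul or k * k                          # assumed depth multiplier
        self.W = nn.Parameter(torch.randn(c_out, c_in, d_mul) * 0.01)
        self.D = nn.Parameter(torch.randn(c_in, d_mul, k * k) * 0.01)

    def folded_kernel(self):
        # W'[o, c, k] = sum_d W[o, c, d] * D[c, d, k]  -> ordinary conv kernel
        w = torch.einsum("ocd,cdk->ock", self.W, self.D)
        return w.reshape(w.size(0), w.size(1), self.k, self.k)

    def forward(self, x):
        return F.conv2d(x, self.folded_kernel(), padding=self.padding)

x = torch.randn(4, 5, 8, 9)                 # (batch, C_in=5, M=8, N=9)
y = DOConv2d(c_in=5, c_out=64)(x)
print(y.shape)                               # torch.Size([4, 64, 8, 9])
```

During training, W and D are updated as separate parameters (the over-parameterisation), while `folded_kernel` realises the collapse into a single compact layer mentioned above.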

Figure 4. DO-Conv process. The calculation method of figure (a) is called feature composition, and the calculation method of the figure (b) is called kernel composition.


To fuse the spectral-spatial features of the EEG as much as possible, this paper augments the convolutional layers with an additional depthwise convolution, in which each input channel is convolved with a separate 2D kernel. Combining the two convolutions constitutes over-parameterisation, since it increases the number of learnable parameters while the resulting linear operation can still be represented by a single convolutional layer. A DO-Conv layer is added after the traditional CNN convolution layer, which constitutes this paper's Depthwise Parameterized Convolution (DPC) module. For a sample Xn, we retrieve spectral and spatial information from each of its temporal slices through the DPC module. Unlike traditional CNNs, where each convolutional layer is usually followed by a pooling layer, only the last convolutional layer in this paper is followed by a pooling layer. Pooling reduces the number of parameters but also entails a loss of information; since the two-dimensional map of sample Xn is small, all the information should be retained rather than pooled away merely to reduce the number of parameters. This is why this paper uses a pooling layer only after the last convolutional layer.

The module used in this paper is shown in Figure 5. First, the input passes through a standard convolutional layer with 32 feature maps and a 3×3 filter; it then passes through a DPC convolution layer with 64 feature maps and a DPC convolution layer with 128 feature maps, both with 3×3 filters. The result is fed into two DPC branches of different depths, a structure used to fuse information from two receptive-field sizes, and the feature maps obtained by the two branches are combined. A further DPC layer then reduces the number of feature-map channels to 32, followed by an average pooling layer of size 2×2 to alleviate overfitting and enhance the robustness of the network, which reduces the feature-map size to 2×2. Finally, the output of the pooling layer is flattened and fed into a fully connected layer with 64 units.
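One plausible, simplified realisation of this stack is sketched below; it reuses the DOConv2d sketch from above, and the branch wiring, padding, and activation placement are assumptions where the text is ambiguous.

```python
import torch
import torch.nn as nn

class DPCBlock(nn.Module):
    """DPC block sketch: a standard convolution followed by a DO-Conv layer."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.do_conv = DOConv2d(c_out, c_out, k=3, padding=1)   # sketch defined above
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.do_conv(self.act(self.conv(x))))

class ConvModule(nn.Module):
    """Simplified stack: 32 -> 64 -> 128 feature maps, two parallel DPC branches,
    channel reduction to 32, 2x2 average pooling, then a 64-unit linear layer."""
    def __init__(self, in_ch=5, h=8, w=9):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
                                  DPCBlock(32, 64), DPCBlock(64, 128))
        self.branch_a = DPCBlock(128, 128)                                      # shallow branch
        self.branch_b = nn.Sequential(DPCBlock(128, 128), DPCBlock(128, 128))   # deeper branch
        self.reduce = DPCBlock(256, 32)
        self.pool = nn.AvgPool2d(2)
        self.fc = nn.Linear(32 * (h // 2) * (w // 2), 64)

    def forward(self, x):
        x = self.stem(x)
        x = torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)   # fuse the two branches
        x = self.pool(self.reduce(x))
        return self.fc(x.flatten(1))

slice_feat = torch.randn(16, 5, 8, 9)      # one time slice per sample
print(ConvModule()(slice_feat).shape)      # torch.Size([16, 64])
```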

Figure 5. Over-parameterised convolution module. The structure of this module for frequency and spatial feature learning.


3.4. Gated recurrent network module

Given that EEG signals encompass dynamic content, changes between the time slices of the 3D structure may hide extra information that can contribute to more accurate emotion classification (Yang et al., Citation2018). Therefore, we utilise a Gated Recurrent Unit (GRU), a variant of the Recurrent Neural Network (RNN), to extract temporal information from the CNN output. Compared with a plain RNN, the GRU addresses the inability of recurrent neural networks to handle long-range dependencies. Compared with the Long Short-Term Memory (LSTM) network, another RNN variant, the GRU has fewer parameters, trains faster, and requires less data. For samples with a small amount of data, such as EEG signals, the GRU may therefore work better (Lian et al., Citation2018).

After the DPC module extracts the spatial-spectral features from the three-dimensional (3D) feature information flow, the GRU module extracts the temporal features, thereby obtaining the spatial, spectral, and temporal features of the EEG signal; the emotion classification result is finally obtained through two linear layers, as exhibited in Figure 6.

Figure 6. The layout of GRU module for temporal feature acquisition.


The output sequence of the CNN can be expressed as $Q_n = (q_1, q_2, \ldots, q_T)$, where $q_t \in \mathbb{R}^{120},\ t = 1, 2, \ldots, T$. This paper uses a GRU layer with 32 memory cells to mine the temporal dependencies within segments. The GRU has only two gates, an update gate and a reset gate; its structure is depicted in Figure 7, where $z_t$ denotes the update gate and $r_t$ the reset gate. These gates govern the extent to which state information from the previous moment is carried into the current state: a larger update-gate value means more of the previous state is brought in, while a smaller reset-gate value means less of the previous state is written into the current candidate state $\tilde{h}_t$. The output of the GRU layer is calculated as follows:
(5) $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
(6) $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
(7) $\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$
(8) $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
(9) $y_t = \sigma(W_o \cdot h_t)$
where $\sigma$ is the sigmoid function, the $W$ are weight matrices, $z_t$ is the update gate, $r_t$ is the reset gate, and $x_t$ is the input. After the GRU layer, each EEG segment is represented as $y_n \in \mathbb{R}^{32}$, so the model completes the extraction of the spectral, spatial, and temporal features of the EEG signal. From the final feature representation $y_n$, the label of the original EEG segment $x_n$ is predicted by a linear transformation:
(10) $\mathrm{OUT} = A y_n = [\mathrm{out}_1, \mathrm{out}_2, \ldots, \mathrm{out}_M]$
where $A$ is the transformation matrix and $M$ is the number of emotion categories. The output is then routed to a softmax classifier and trained with the cross-entropy loss:
(11) $L = -\sum_{c=1}^{M} y_{o,c}\log(p_{o,c})$
where $M$ denotes the number of categories, $y_{o,c}$ is the binary indicator of whether category label $c$ is the correct classification of observation $o$, and $p_{o,c}$ is the predicted probability that observation $o$ falls into category $c$.
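For illustration only, Equations (5)-(11) amount to a single GRU layer followed by a linear classifier trained with cross-entropy, roughly as in the sketch below; the 120-dimensional input and 32 hidden units follow the text, while the batch size, sequence length, and three output classes (for SEED) are assumptions.

```python
import torch
import torch.nn as nn

class GRUHead(nn.Module):
    """GRU temporal module plus linear classifier, mirroring Eqs. (5)-(11)."""
    def __init__(self, in_dim=120, hidden=32, n_classes=3):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)   # update/reset gates inside
        self.classifier = nn.Linear(hidden, n_classes)         # Eq. (10): OUT = A * y_n

    def forward(self, q):                  # q: (batch, T, in_dim) CNN output sequence
        _, h_n = self.gru(q)               # h_n: (1, batch, hidden), final state y_n
        return self.classifier(h_n[-1])    # logits over the M emotion categories

model = GRUHead()
q = torch.randn(16, 6, 120)                # T = 6 time slices per sample (illustrative)
logits = model(q)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (16,)))   # Eq. (11)
print(logits.shape, float(loss))
```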

Figure 7. Internal structure of GRU. The update gate and reset gate are incorporated.


Algorithm 1 presents the calculation process of AB-DPCGRU.
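Algorithm 1 is not reproduced here, but the overall forward pass it describes can be summarised by the following hedged sketch, which simply chains the SCSEBlock, ConvModule, and GRUHead sketches from Sections 3.2-3.4 over the T time slices of one 3D sample; the exact wiring is an assumption based on Figure 1.

```python
import torch
import torch.nn as nn

class ABDPCGRU(nn.Module):
    """End-to-end sketch: scSE attention -> DPC convolution -> GRU -> classifier.
    SCSEBlock, ConvModule and GRUHead refer to the sketches given above."""
    def __init__(self, bands=5, n_classes=3):
        super().__init__()
        self.attention = SCSEBlock(channels=bands, reduction=1)
        self.conv = ConvModule(in_ch=bands)
        # in_dim matches the 64-unit ConvModule sketch; the text mentions 120-dim q_t.
        self.head = GRUHead(in_dim=64, hidden=32, n_classes=n_classes)

    def forward(self, x):                       # x: (batch, T, bands, H, W)
        b, t = x.shape[:2]
        slices = x.flatten(0, 1)                # process every time slice independently
        feats = self.conv(self.attention(slices))        # (batch*T, 64)
        return self.head(feats.view(b, t, -1))           # logits: (batch, n_classes)

x = torch.randn(8, 6, 5, 8, 9)                  # 8 samples, 6 slices, 5 bands, 8x9 map
print(ABDPCGRU()(x).shape)                       # torch.Size([8, 3])
```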

4. Experiment

In this section, the SEED and SEED-IV public emotion datasets are introduced, the experimental setup for our method is set forth, and the results obtained are reported and discussed.

4.1. Introduction to the SEED and SEED-IV dataset

The SEED and SEED-IV datasets were provided by the BCMI laboratory of Shanghai Jiao Tong University (Zheng & Lu, Citation2015). Fifteen Chinese subjects (7 males and 8 females, with an average age of 23.27) participated in the experiments. The subjects were instructed to watch 15 film clips in each experiment. Each clip targets only one emotion: SEED covers positive, neutral and negative emotions, while SEED-IV covers happy, sad, fearful and neutral emotions. Each trial contains 5 s of cues, about 4 min of film footage, 45 s of self-assessment and a 15 s break after playback. The order in which the films are shown is arranged so that two clips targeting the same emotion are never displayed in succession. While the subjects viewed the clips, their EEG signals were recorded by a 62-channel ESI NeuroScan system with electrodes positioned according to the international 10–20 system. At the end of the experiment, only those trials that elicited the target emotion, as judged from the subjects' responses, were retained for further analysis. Each subject underwent the experiment three times, so each subject has three sets of data. For feedback, participants were asked to complete a questionnaire immediately after viewing each clip to record their emotional response. The detailed protocol is shown in Figure 8. Before the datasets were published, they underwent preliminary preprocessing: EEG segments heavily contaminated by electromyography (EMG) and electrooculography (EOG) were manually removed, the data were downsampled to 200 Hz, and a 4–50 Hz bandpass filter was applied to the EEG signals to screen out noise.
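For reference, preprocessing of the kind described (downsampling to 200 Hz and 4-50 Hz bandpass filtering) could be reproduced roughly as follows; the 1000 Hz original sampling rate and the filter order are assumptions rather than details stated in the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

RAW_FS, TARGET_FS = 1000, 200      # assumed original rate; 200 Hz per the paper

def preprocess(eeg_raw):
    """eeg_raw: (n_channels, n_samples) array sampled at RAW_FS.
    Returns the signal downsampled to 200 Hz and bandpass filtered to 4-50 Hz."""
    eeg = resample_poly(eeg_raw, TARGET_FS, RAW_FS, axis=1)      # downsample to 200 Hz
    b, a = butter(4, [4 / (TARGET_FS / 2), 50 / (TARGET_FS / 2)], btype="band")
    return filtfilt(b, a, eeg, axis=1)                           # zero-phase 4-50 Hz bandpass

clean = preprocess(np.random.randn(62, 60 * RAW_FS))             # one minute of 62-channel data
print(clean.shape)                                               # (62, 12000)
```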

Figure 8. The protocol used in the emotion experiment.


4.2. Experimental setup

Here, the AB-DPCGRU model was built with the PyTorch framework and trained on an NVIDIA GeForce GTX 1660 SUPER GPU. The selected network parameters are exhibited in Table 1.

Table 1. Network parameter selection.

To assess the quality of the model, three evaluation metrics are used: accuracy, F1-score, and the Kappa coefficient. Accuracy (Acc) is the proportion of correctly categorised samples in the total sample size; the F1-score is the harmonic mean of recall and precision; Kappa is a measure of consistency, applied in classification to measure how well the model's predictions align with the actual class labels. They are calculated as follows:
(12) $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
(13) $P_e = \frac{(TP + FP)\times(FN + TP) + (FN + TN)\times(FP + TN)}{N^{2}}$
(14) $\mathrm{Kappa} = \frac{\mathrm{Accuracy} - P_e}{1 - P_e}$
where TP signifies true positives, TN true negatives, FP false positives, FN false negatives, and N the total number of samples.
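The metrics in Equations (12)-(14) can be computed directly from a 2 × 2 confusion matrix, as in the sketch below; extending them to the multi-class experiments (e.g. per-class averaging for F1) is an assumption not spelled out in the text.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, F1 and Cohen's kappa from a 2x2 confusion matrix (Eqs. 12-14)."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)            # harmonic mean
    pe = ((tp + fp) * (fn + tp) + (fn + tn) * (fp + tn)) / n**2   # chance agreement
    kappa = (acc - pe) / (1 - pe)
    return acc, f1, kappa

print(binary_metrics(tp=45, tn=40, fp=5, fn=10))   # (0.85, 0.857..., 0.70...)
```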

5. Results analysis and comparison

To prove the model's effectiveness, five-fold cross-validation (the features are uniformly partitioned into five subsets, one of which is selected in turn as the test set while the other four serve as the training set) and ablation experiments were carried out for each module of the model, along with inter-model comparison experiments. Figure 9 shows the confusion matrix of our model on the SEED dataset; it presents the probabilities of the model correctly classifying positive, neutral and negative emotions as well as the probabilities of misclassification. The results show that positive emotions are easier for our model to identify than negative and neutral emotions.

Figure 9. The confusion graph of the SEED dataset.


5.1. Five-fold cross-validation

Cross-validation is a technique for estimating the performance of statistical predictive models on independent datasets; its purpose is to ensure that the model and data work well together. Cross-validation is performed during the training phase, allowing the user to evaluate whether the model underfits or overfits the data. The data used for cross-validation must come from the same distribution of the target variable; otherwise, the evaluation will not be meaningful. Although the validation process cannot directly locate the problem, it can sometimes reveal that the stability of the model is problematic. In this paper, we use five-fold cross-validation, with each fold trained for 100 epochs, and the accuracy results are depicted in Figure 10.
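A minimal sketch of such a five-fold protocol is given below; the scikit-learn splitter and the logistic-regression classifier are only stand-ins for the actual 100-epoch AB-DPCGRU training, included to illustrate the fold structure rather than the authors' pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in data: 300 DE feature vectors (62 channels x 5 bands), 3 labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 62 * 5))
y = rng.integers(0, 3, size=300)

accs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # A simple classifier stands in for the per-fold 100-epoch AB-DPCGRU training.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(f"per-fold accuracy: {np.round(accs, 3)}, mean: {np.mean(accs):.3f}")
```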

Figure 10. Five-fold cross-validation accuracy.


5.2. Ablation experiment

To verify the validity of each module in the AB-DPCGRU model, ablation experiments were carried out on the SEED dataset. To ascertain the contribution of the data processing, the 3D data input module was removed and the original data were fed directly into the model, giving an accuracy of 67.40%, a Kappa coefficient of 54.29%, and an F1-score of 70.39%. To verify the attention module, the scSE module was removed, giving an accuracy of 92.20%, a Kappa coefficient of 88.45%, and an F1-score of 91.18%. To study the over-parameterised convolution module, all over-parameterised convolutions were replaced with standard convolutions, giving an accuracy of 89.79%, a Kappa coefficient of 86.25%, and an F1-score of 87.57%. To study the GRU module, the classification results were output directly through a linear layer, giving an accuracy of 82.50%, a Kappa coefficient of 80.03%, and an F1-score of 79.28%. The results are presented in Table 2.

Table 2. Ablation studies of each module.

5.3. The performances of the compared methods

To demonstrate the effectiveness of our model, we compared AB-DPCGRU with the following commonly used models.

HCNN (Li et al., Citation2018): It extracts the DE features of the gamma frequency band as input and uses a hierarchical CNN architecture to model the EEG features and classify sentiment.

DGCNN (Song et al., Citation2018): Dynamic Graph CNN to model multi-channel EEG features and perform EEG sentiment classification.

Bi-HDM (Li et al., Citation2020): Bi-hemisphere difference model learning asymmetric disparities between two hemispheres for EEG sentiment recognition.

RGNN (Zhong et al., Citation2020): Regularized Graph Neural Network that considers the biological topology among different brain regions to capture local and global relationships between EEG channels.

The results are shown in Tables 3 and 4. HCNN extracts only the DE features of the γ band as input and considers the spectral-spatial information of EEG signals. DGCNN takes into account only the spatial information of EEG signals gathered from different channels and uses graph convolution to extract it. Bi-HDM extracts the spatial information of EEG signals using two directional RNNs, and its classification accuracy reaches 91.15%. RGNN models inter-channel relationships through the adjacency matrix of a graph neural network, where neuroscientific theories of human brain organisation inspire the connectivity and sparsity of the adjacency matrix, but it only considers spatio-temporal information. This paper concludes that simultaneously considering frequency, spatial and temporal information benefits emotion recognition. The proposed AB-DPCGRU achieves relatively advanced performance on the SEED and SEED-IV datasets.

Table 3. Performance comparison of the models on the SEED dataset.

Table 4. Performance comparison of the models on the SEED-IV dataset.

5.4. Discussion

In this paper, we propose an Attention-Based Depthwise Parameterized Convolutional Gated Recurrent Unit (AB-DPCGRU) model for emotion recognition. The above results show that our proposed method achieves good classification results on two public emotion datasets, indicating that the method has good potential for wider application. In the future, we may apply this method to other fields of neuroscience, and even to computer science more broadly (Guo et al., Citation2022; Li et al., Citation2023b; Yang et al., Citation2022, Citation2023).

6. Conclusion

In this paper, we present the AB-DPCGRU model for EEG emotion recognition. The complementarity between different features is effectively exploited by integrating the temporal, spatial, and spectral information of EEG signals into three-dimensional features. The temporal-spectral and spatial-spectral information is further fused through the attention module; the spatial-spectral data are then processed by the over-parameterised convolution module, the GRU extracts the temporal information from the output of the over-parameterised convolution module, and finally the linear layers perform classification to produce the prediction results. By comparison with other modelling studies, this paper demonstrates the value of extracting spectral, spatial and temporal information from EEG simultaneously. In short, the proposed model can be used to monitor the emotional state of patients and drivers and thereby contribute to society.

In future work, we can use a teacher-student network to compress the model proposed in this paper, reducing training time and improving its deployability. In addition, we can explore the brain patterns of different emotions through the synergistic effects between different brain regions.

Acknowledgement

This work is supported by the National Natural Science Foundation of China under grant number 62272067; the Sichuan Science and Technology Program under Grant Nos. 2023NSFSC0499, 2023YFG0018; the LOST 2030 Brain Project No. 2022ZD0208500; the Scientific Research Foundation of Chengdu University of Information Technology under Grant Nos. KYQN202208, KYQN202206, and the 2011 Collaborative Innovation Center for Image and Geospatial Information of Sichuan Province.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work is supported by the National Natural Science Foundation of China under grant number 62272067; the Sichuan Science and Technology Program under grant numbers 2023NSFSC0499, 2023YFG0018; the LOST 2030 Brain Project No. 2022ZD0208500; the Scientific Research Foundation of Chengdu University of Information Technology under grant number KYQN202208, KYQN202206, and the 2011 Collaborative Innovation Center for Image and Geospatial Information of Sichuan Province.

References

  • Arora, S., Cohen, N., & Hazan, E. (2018). On the optimization of deep networks: Implicit acceleration by overparameterization. In International conference on machine learning (pp. 244–253). PMLR.
  • Bahari, F., & Janghorbani, A. (2013). EEG-based emotion recognition using recurrence plot analysis and k nearest neighbor classifier. In 2013 20th Iranian Conference on Biomedical Engineering (ICBME) (pp. 228–233). IEEE.
  • Bascil, M. S., Tesneli, A. Y., & Temurtas, F. (2016). Spectral feature extraction of EEG signals and pattern recognition during mental tasks of 2-D cursor movements for BCI using SVM and ANN. Australasian Physical & Engineering Sciences in Medicine, 39(3), 665–676. https://doi.org/10.1007/s13246-016-0462-x.
  • Cao, J., Li, Y., Sun, M., Chen, Y., Lischinski, D., Cohen-Or, D., Chen, B., & Tu, C. (2022). Do-conv: Depthwise over-parameterized convolutional layer. IEEE Transactions on Image Processing.
  • Cecotti, H., & Graeser, A. (2008). Convolutional neural network with embedded Fourier transform for EEG classification. In 2008 19th International conference on pattern recognition (pp. 1–4). IEEE.
  • Duan, R. N., Wang, X. W., & Lu, B. L. (2012). EEG-based emotion recognition in listening music by using support vector machine and linear dynamic system. In International conference on neural information processing (pp. 468–475). Springer, Berlin, Heidelberg.
  • Duan, R. N., Zhu, J. Y., & Lu, B. L. (2013). Differential entropy feature for EEG-based emotion classification. In 2013 6th International IEEE/EMBS conference on neural engineering (NER) (pp. 81–84). IEEE.
  • Frantzidis, C. A., Bratsas, C., Papadelis, C. L., Konstantinidis, E., Pappas, C., & Bamidis, P. D. (2010). Toward emotion aware computing: an integrated approach using multichannel neurophysiological recordings and affective visual stimuli. IEEE Transactions on Information Technology in Biomedicine, 14(3), 589–597. https://doi.org/10.1109/TITB.2010.2041553.
  • Gao, D., Li, P., Wang, M., Liang, Y., Liu, S., Zhou, J., Wang, L., & Zhang, Y. (2023). CSF-GTNet: A novel multi-dimensional feature fusion network based on Convnext-GeLU-BiLSTM for EEG-signals-enabled fatigue driving detection. IEEE Journal of Biomedical and Health Informatics..
  • Guo, Y., Zhou, D., Li, P., Li, C., & Cao, J. (2022). Context-aware poly (A) signal prediction model via deep spatial–temporal neural networks. IEEE Transactions on Neural Networks and Learning Systems.
  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.
  • Jia, Z., Lin, Y., Cai, X., Chen, H., Gou, H., & Gou, H. (2020). SST-emotionnet: Spatial-spectral-temporal based attention 3D dense network for EEG emotion recognition. In Proceedings of the 28th ACM International conference on multimedia (pp. 2909–2917).
  • Jung, T. P., & Sejnowski, T. J. (2019). Utilizing deep learning towards multi-modal bio-sensing and vision-based affective computing. IEEE Transactions on Affective Computing, 13(1), 96–107.
  • Li, J., Zhang, Z., & He, H. (2018). Hierarchical convolutional neural networks for EEG-based emotion recognition. Cognitive Computation, 10(2), 368–380. https://doi.org/10.1007/s12559-017-9533-x
  • Li, P., Zhang, Y., Liu, S., Lin, L., Zhang, H., Tang, T., & Gao, D. (2023a). An EEG-based brain cognitive dynamic recognition network for representations of brain fatigue. Applied Soft Computing, 146, 110613. https://doi.org/10.1016/j.asoc.2023.110613
  • Li, W., Guo, Y., Wang, B., & Yang, B. (2023b). Learning spatiotemporal embedding with gated convolutional recurrent networks for translation initiation site prediction. Pattern Recognition, 136, 109234. https://doi.org/10.1016/j.patcog.2022.109234
  • Li, Y., Wang, L., Zheng, W., Zong, Y., Qi, L., Cui, Z., Zhang, T., & Song, T. (2020). A novel bi-hemispheric discrepancy model for EEG emotion recognition. IEEE Transactions on Cognitive and Developmental Systems, 13(2), 354–367. https://doi.org/10.1109/TCDS.2020.2999337
  • Li, Y., Zheng, W., Wang, L., Zong, Y., & Cui, Z. (2019). From regional to global brain: A novel hierarchical spatial-temporal neural network model for EEG emotion recognition. IEEE Transactions on Affective Computing.
  • Lian, Z., Li, Y., Tao, J., & Huang, J. (2018). Investigation of multimodal features, classifiers and fusion methods for emotion recognition. arXiv preprint arXiv:1809.06225.
  • Lin, L., Li, P., Wang, Q., Bai, B., Cui, R., Yu, Z., Gao, D., & Zhang, Y. (2023). An EEG-based cross-subject interpretable CNN for game player expertise level classification. Expert Systems with Applications, 121658.
  • Lin, Y. P., Wang, C. H., Jung, T. P., Wu, T.-L., Jeng, S.-K., Duann, J.-R., & Chen, J.-H. (2010). EEG-based emotion recognition in music listening. IEEE Transactions on Biomedical Engineering, 57(7), 1798–1806. https://doi.org/10.1109/TBME.2010.2048568
  • Liu, Y., & Sourina, O. (2013). Real-time fractal-based valence level recognition from EEG. In Transactions on computational science XVIII (pp. 101–120). Springer.
  • Mühl, C., Allison, B., Nijholt, A., & Chanel, G. (2014). A survey of affective brain computer interfaces: principles, state-of-the-art, and challenges. Brain-Computer Interfaces, 1(2), 66–84. https://doi.org/10.1080/2326263X.2014.912881
  • Rouast, P. V., Adam, M. T. P., & Chiong, R. (2019). Deep learning for human affect recognition: Insights and new developments. IEEE Transactions on Affective Computing, 12(2), 524–543. https://doi.org/10.1109/TAFFC.2018.2890471
  • Roy, A. G., Navab, N., & Wachinger, C. (2018). Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In International conference on medical image computing and computer-assisted intervention (pp. 421–429). Springer.
  • Song, T., Zheng, W., Song, P., & Cui, Z. (2018). EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Transactions on Affective Computing, 11(3), 532–541. https://doi.org/10.1109/TAFFC.2018.2817622
  • Yang, S., Zhou, D., Cao, J., & Guo, Y. (2022). Rethinking low-light enhancement via transformer-GAN. IEEE Signal Processing Letters, 29, 1082–1086. https://doi.org/10.1109/LSP.2022.3167331
  • Yang, S., Zhou, D., Cao, J., & Guo, Y. (2023). LightingNet: An integrated learning method for low-light image enhancement. IEEE Transactions on Computational Imaging, 9, 29–42. https://doi.org/10.1109/TCI.2023.3240087
  • Yang, Y., Wu, Q., Fu, Y., & Chen, X. (2018). Continuous convolutional neural network with 3D input for EEG-based emotion recognition. In International conference on neural information processing (pp. 433–443). Springer.
  • Yin, Z., Liu, L., Chen, J., Zhao, B., & Wang, Y. (2020). Locally robust EEG feature selection for individual-independent emotion recognition. Expert Systems with Applications, 162, 113768. https://doi.org/10.1016/j.eswa.2020.113768
  • Zhang, T., Wang, X., Xu, X., & Chen, C. L. P. (2019). GCB-Net: Graph convolutional broad network and its application in emotion recognition. IEEE Transactions on Affective Computing, 13(1), 379–388. https://doi.org/10.1109/TAFFC.2019.2937768
  • Zheng, W. L., & Lu, B. L. (2015). Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development, 7(3), 162–175. https://doi.org/10.1109/TAMD.2015.2431497
  • Zheng, W. L., Zhu, J. Y., & Lu, B. L. (2017). Identifying stable patterns over time for emotion recognition from EEG. IEEE Transactions on Affective Computing, 10(3), 417–429. https://doi.org/10.1109/TAFFC.2017.2712143
  • Zhong, P., Wang, D., & Miao, C. (2020). EEG-based emotion recognition using regularized graph neural networks. IEEE Transactions on Affective Computing.