Multi-branch feature learning based speech emotion recognition using SCAR-NET

Keji Mao, College of Computer Science and Technology, College of Software, Zhejiang University of Technology, Hangzhou, People's Republic of China

Yuxiang Wang, College of Computer Science and Technology, College of Software, Zhejiang University of Technology, Hangzhou, People's Republic of China

Ligang Ren, College of Computer Science and Technology, College of Software, Zhejiang University of Technology, Hangzhou, People's Republic of China

Jinhong Zhang, College of Computer Science and Technology, College of Software, Zhejiang University of Technology, Hangzhou, People's Republic of China

Jiefan Qiu, College of Computer Science and Technology, College of Software, Zhejiang University of Technology, Hangzhou, People's Republic of China

Guanglin Dai, College of Computer Science and Technology, College of Software, Zhejiang University of Technology, Hangzhou, People's Republic of China. Correspondence: [email protected]

Abstract

Speech emotion recognition (SER) is an active research area in affective computing. Recognizing emotions from speech signals helps to assess human behaviour and has promising applications in human-computer interaction. The performance of deep learning-based SER methods relies heavily on feature learning. In this paper, we propose SCAR-NET, an improved convolutional neural network that extracts emotional features from speech signals and classifies them. The work has two main parts. First, we extract spectral, temporal, and spectral-temporal correlation features through three parallel paths. Second, we design split-convolve-aggregate residual blocks for multi-branch deep feature learning. The resulting features are refined by global average pooling (GAP) and passed through a softmax classifier to generate predictions for the different emotions. We also conduct a series of experiments to evaluate the robustness and effectiveness of SCAR-NET, which achieves 96.45%, 83.13%, and 89.93% accuracy on the speech emotion datasets EMO-DB, SAVEE, and RAVDESS, respectively. These results demonstrate the strong performance of SCAR-NET.
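The final stage described in the abstract, global average pooling followed by a softmax classifier, can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the linear layer between pooling and softmax, the function names, and the toy feature maps, weights, and class count are all assumptions made for illustration.

```python
import math


def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (a list of rows) to its mean value."""
    pooled = []
    for fmap in feature_maps:
        total = sum(sum(row) for row in fmap)
        count = sum(len(row) for row in fmap)
        pooled.append(total / count)
    return pooled


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def classify(feature_maps, weights, biases):
    """GAP -> linear layer (assumed, hypothetical weights) -> softmax."""
    pooled = global_average_pool(feature_maps)
    logits = [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical toy input: 2 channels of size 2x2 and 3 emotion classes.
feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
weights = [[0.5, -1.0], [0.1, 0.2], [-0.3, 0.4]]  # one row per class
biases = [0.0, 0.0, 0.0]

probs = classify(feature_maps, weights, biases)
predicted_class = probs.index(max(probs))
```

Pooling reduces each channel to a single number regardless of the spatial size of the feature map, so the classifier input has one value per channel; the softmax output is a probability distribution over the emotion classes.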

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Basic Public Welfare Research Project of Zhejiang Province [grant number LGG22F020014] and the National Natural Science Foundation of China [grant number 62072410].
