Research Article

Cross-modal learning with multi-modal model for video action recognition based on adaptive weight training

Article: 2325474 | Received 25 Dec 2023, Accepted 26 Feb 2024, Published online: 27 Mar 2024
 

Abstract

The canonical video action recognition methods usually label categories with numbers or one-hot vectors and train neural networks to classify a fixed set of predefined categories, which constrains both their ability to recognise complex actions and their transferability to unseen concepts. In contrast, cross-modal learning can improve the performance of individual modalities. Based on the fact that a better action recogniser can be built by reading the sentences used to describe actions, we exploited the recent multimodal foundation model CLIP for action recognition. In this study, an effective vision-language action recognition adaptation was implemented based on few-shot examples spanning different modalities. We added semantic information to action categories by treating textual and visual labels as training examples for action classifier construction, rather than simply labelling them with numbers. Because words in a sentence and frames in a video differ in importance, simply averaging all sequential features may ignore keywords or key video frames. To capture sequential and hierarchical representations, a weighted token-wise interaction mechanism was employed to exploit the pair-wise correlations adaptively. Extensive experiments on public datasets show that cross-modal action recognition learning benefits downstream action image classification; in other words, the proposed method trains better action classifiers by reading the sentences that describe the actions themselves. The proposed method not only achieves good generalisation and zero-shot/few-shot transfer ability on out-of-distribution (OOD) test sets, but also has lower computational complexity owing to the lightweight interaction mechanism, reaching 84.15% Top-1 accuracy on Kinetics-400.
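The weighted token-wise interaction described above can be pictured as scoring every frame-word pair and aggregating the scores with learned per-token importance weights. The following is a minimal PyTorch sketch of one way such a mechanism could look, assuming CLIP-style frame and word token embeddings of a shared dimension; the per-token scalar weighting layers and the max-pooling aggregation here are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedTokenInteraction(nn.Module):
    """Illustrative weighted token-wise video-text similarity (sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        # Hypothetical design: one learned scalar importance score per token.
        self.frame_weight = nn.Linear(dim, 1)
        self.word_weight = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # frames: (B, Nf, D) frame embeddings; words: (B, Nw, D) text token embeddings.
        f = F.normalize(frames, dim=-1)
        w = F.normalize(words, dim=-1)

        # Adaptive token weights, normalised over the sequence dimension.
        wf = self.frame_weight(f).softmax(dim=1)   # (B, Nf, 1)
        ww = self.word_weight(w).softmax(dim=1)    # (B, Nw, 1)

        # Pair-wise cosine similarity between every frame and word token.
        sim = torch.einsum("bnd,bmd->bnm", f, w)   # (B, Nf, Nw)

        # Each frame attends to its best-matching word and vice versa;
        # the two views are combined using the learned token weights.
        frame_to_word = (sim.max(dim=2).values.unsqueeze(-1) * wf).sum(dim=1)
        word_to_frame = (sim.max(dim=1).values.unsqueeze(-1) * ww).sum(dim=1)
        return 0.5 * (frame_to_word + word_to_frame).squeeze(-1)  # (B,)


# Usage: similarity between video clips and textual action descriptions.
model = WeightedTokenInteraction(dim=512)
clip_tokens = torch.randn(2, 8, 512)   # 8 frame tokens per clip
text_tokens = torch.randn(2, 12, 512)  # 12 word tokens per prompt
scores = model(clip_tokens, text_tokens)
print(scores.shape)  # torch.Size([2])
```

Compared with averaging all tokens, this kind of weighted aggregation lets salient words and key frames dominate the video-text score, which is the intuition behind the adaptive weight training discussed in the abstract.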

Acknowledgments

This work was partially supported by the National Key R&D Program of China under Grant Nos. 2020YFC0832500 and 2023YFB4503903, the National Natural Science Foundation of China under Grant No. U22A20261, the Gansu Province Science and Technology Major Project - Industrial Project under Grant No. 22ZD6GA048, the Gansu Province Key Research and Development Plan - Industrial Project under Grant No. 22YF7GA004, the Gansu Provincial Science and Technology Major Special Innovation Consortium Project under Grant No. 21ZD3GA002, the Fundamental Research Funds for the Central Universities under Grant No. lzujbky-2022-kb12, and the Supercomputing Center of Lanzhou University.

Disclosure statement

No potential conflict of interest was reported by the author(s).