
Cross-modal learning with multi-modal model for video action recognition based on adaptive weight training

Article: 2325474 | Received 25 Dec 2023, Accepted 26 Feb 2024, Published online: 27 Mar 2024
