
Cross-modal learning with multi-modal model for video action recognition based on adaptive weight training

Article: 2325474 | Received 25 Dec 2023, Accepted 26 Feb 2024, Published online: 27 Mar 2024
