Research Article

FRIC: a framework for few-shot remote sensing image captioning

Article: 2337240 | Received 12 Jan 2024, Accepted 25 Mar 2024, Published online: 04 Apr 2024

ABSTRACT

Training image captioning (IC) models requires a large number of caption-labeled samples, a requirement that is rarely met in practical remote sensing scenarios; with only a few samples, model performance degrades. We characterize the few-shot problems in remote sensing image captioning (RC), design two research schemes, and propose a few-shot RC framework, the few-shot remote sensing image captioning framework (FRIC). FRIC requires no additional samples and builds on a simple base model; it seeks performance gains from split samples while reducing the negative effects of noise. Unlike previous works that use 100% of the samples to simulate few-shot scenarios, FRIC uses less than 1.0% of the data to simulate actual few-shot scenarios. Whereas previous works focus on improving the encoder, FRIC focuses on optimizing the decoder with parameter ensemble, multi-model ensemble and self-distillation. With limited caption-labeled samples, FRIC trains a simple base model to generate captions that meet human expectations. FRIC shows clear advantages over other methods when trained with only 0.8% of the samples in RC datasets; no previous work has trained an RC model with such a small amount of data. In addition, the effectiveness of the components of FRIC is verified with ablation experiments.
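To make the decoder-side techniques named above concrete, the sketch below illustrates generic forms of parameter ensemble (averaging decoder checkpoint weights), multi-model ensemble (averaging per-step logits), and self-distillation (a KL term against an ensembled teacher) in PyTorch. This is a minimal illustration under assumed shapes and hyperparameters, not the authors' implementation; all function names and arguments are hypothetical.

```python
# Generic sketch of decoder-side parameter ensemble, multi-model ensemble,
# and self-distillation. Illustrative only; not the FRIC implementation.
import copy
import torch
import torch.nn.functional as F


def parameter_ensemble(state_dicts):
    """Average the weights of several decoder checkpoints (parameter ensemble)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg


def multi_model_ensemble_logits(decoders, features, tokens):
    """Average the per-step vocabulary logits of several decoders (multi-model ensemble)."""
    with torch.no_grad():
        logits = torch.stack([d(features, tokens) for d in decoders])  # (M, B, T, V)
    return logits.mean(dim=0)


def self_distillation_loss(student_logits, teacher_logits, targets,
                           temperature=2.0, alpha=0.5):
    """Cross-entropy on the ground-truth captions plus a temperature-scaled
    KL divergence toward the (ensembled) teacher distribution."""
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                         targets.reshape(-1), ignore_index=0)
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl
```

In this reading, the averaged or ensembled teacher supplies soft targets for the student decoder, which is one common way to combine checkpoint averaging with self-distillation when labeled captions are scarce.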

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The UCM-Captions dataset, Sydney-Captions dataset and RSICD dataset can be found at https://github.com/201528014227051/RSICD_optima. The MASATI dataset can be found at https://www.iuii.ua.es/datasets/masati/index.html.