Research Article

FRIC: a framework for few-shot remote sensing image captioning

Article: 2337240 | Received 12 Jan 2024, Accepted 25 Mar 2024, Published online: 04 Apr 2024

ABSTRACT

The training of image captioning (IC) models requires a large number of caption-labeled samples, a requirement that is usually difficult to satisfy in actual remote sensing scenarios, and model performance suffers as a result of these few-shot problems. We describe the few-shot problems in remote sensing image captioning (RC) and design two research schemes. We then propose a few-shot RC framework, the few-shot remote sensing image captioning framework (FRIC). FRIC needs no additional samples and uses a simple base model; it aims to gain performance from split samples while reducing the negative effects of noise. Unlike previous works that use 100% of the samples to simulate few-shot scenarios, FRIC uses less than 1.0% of the data to simulate actual few-shot scenarios. While previous works focus on improving the encoder, FRIC focuses on optimizing the decoder with parameter ensemble, multi-model ensemble and self-distillation. FRIC can train a simple base model with limited caption-labeled samples to generate captions that meet human expectations. FRIC shows clear advantages over other methods when trained with only 0.8% of the samples of RC datasets; no previous work has trained an RC model with such a small amount of data. In addition, the effectiveness of the components in FRIC is verified with ablation experiments.

1. Introduction

Conventional remote sensing image interpretation tasks obtain image-level semantic information by processing tags or key targets, while image captioning (IC) is concerned with generating a sentence that describes the image. IC has received a great deal of attention in remote sensing, for example in disaster risk assessment (Liu et al. Citation2018) and image retrieval (Cheng et al. Citation2021). However, there are few-shot problems in RC: caption-labeled remote sensing samples are always scarce. The reasons for this scarcity are as follows. Obtaining remote sensing images of scenes containing specific targets is difficult due to various limitations, and RC requires the images to contain content rich enough to form informative descriptive sentences. In addition, the difficulty and cost of equipping each sample with caption labels are high. These factors limit the number of caption-labeled samples. The common RC datasets are the remote sensing image captioning dataset (RSICD) (Lu et al. Citation2017), the University of California Merced (UCM)-Captions dataset (Qu et al. Citation2016), and the Sydney-Captions dataset (Qu et al. Citation2016). RSICD contains only 10,921 images, UCM-Captions only 2100 images, and Sydney-Captions only 613 images, whereas the common natural-image dataset, the Microsoft common objects in context (COCO) dataset (Lin et al. Citation2014), contains 123k samples. RC models are therefore generally unable to receive sufficient training and often overfit. The scarcity of caption-labeled samples is even more severe in actual remote sensing scenarios, and it is hard to solve the few-shot problems by simply increasing the number of samples. Instead, we need better ways of using the limited samples, so that the RC model can be applied to few-shot remote sensing scenarios and acquire knowledge that transfers to new samples. This is the core requirement of few-shot learning, which aims to train a model to generalize limited supervised information to new samples or new tasks, thereby closing the gap between the model and the human.

For resource-constrained scenarios, we develop two research schemes for few-shot RC and then propose a few-shot RC framework, the few-shot remote sensing image captioning framework (FRIC). FRIC splits labeled remote sensing samples into unlabeled images and unpaired captions. The caption-labeled samples are divided into multiple non-overlapping parts and fed to multiple base models for training. The base models are ensembled into a powerful ‘self-teacher' model through parameter ensemble and multi-model ensemble. The module of training with split samples helps the models learn from unlabeled data, and measures are set to reduce the adverse impact of noise. We set up contrast experiments in two schemes and ablation experiments to prove the effectiveness of FRIC. The contributions of FRIC can be summarized as follows:

  • We have structured two research schemes concerning the simulation of few-shot RC scenarios and the resource constraints, which will benefit the consistency of subsequent research works.

  • We use less than 1.0% of the data to simulate actual few-shot scenarios, while previous works use 100%.

  • Without using any additional samples or models, FRIC splits the limited caption-labeled remote sensing samples into unlabeled images and unpaired captions as supplemental samples and optimizes the decoding flow.

  • The outstanding performance of FRIC in few-shot RC is demonstrated in contrast experiments, and the effectiveness of components is verified in ablation experiments. FRIC significantly reduces the sample dependence of the model and demonstrates robustness and generalizability.

2. Related work

2.1. Remote sensing IC

There are three different paradigms for RC methods: the template paradigm, the retrieval paradigm and the encoder–decoder paradigm. Shi and Zou (Citation2017) first proposed an RC method using the template paradigm; this type of method can generate accurate sentences but lacks flexibility. The retrieval paradigm relies on large image-caption databases; Wang et al. (Citation2019) used the Mahalanobis matrix to retrieve the captions closest to the target image. Methods using the encoder–decoder paradigm are more flexible. Zhang et al. (Citation2022) fused global and local visual features from remote sensing images and filtered redundant feature components to greatly improve the relevance of the generated captions. Chen et al. (Citation2022) proposed a pure transformer architecture for RC with powerful performance; Liu, Zhao, and Shi (Citation2022) proposed a multilayer aggregated transformer to extract features and generate sufficient captions; and Zhuang et al. (Citation2022) proposed a transformer encoder–decoder with grid features to improve RC performance. Du et al. (Citation2023) designed a transformer equipped with deformable attention for RC and achieved higher accuracy. Liu et al. (Citation2023) introduced a pretrained large language model into RC and achieved very good results. Liu et al. (Citation2022) proposed a dual-branch transformer encoder to improve the feature discrimination capacity and showed significant advantages.

Some works have noticed the overfitting problems or the high demand for samples in RC. Li et al. (Citation2020) explored the overfitting problem caused by using cross-entropy loss and presented a truncated cross-entropy (TCE) loss to mitigate it. Considering that training the encoder–decoder method requires many samples, Hoxha and Melgani (Citation2022) used support vector machines (SVMs) as the decoder to model the dependencies of existing words and obtained better results than the vanilla encoder–decoder method. Yang, Ni, and Ren (Citation2022) proposed a meta-learning-based RC framework, which improves the encoder to reduce the need for caption-labeled data. These works only address overfitting in the encoder and neglect the decoder. In addition, they did not further reduce the sample size, leaving a gap with actual few-shot scenarios.

2.2. Few-shot learning

Few-shot learning was initially applied in classification tasks, and most recent advances come from meta-learning. Meta-learning can be divided into three categories: optimization-based, metric-based and model-based. Optimization-based methods learn a prior that allows the model to learn quickly from limited new samples. Model-agnostic meta-learning (MAML) was proposed by Finn, Abbeel, and Levine (Citation2017) to find an initialization that facilitates model fine-tuning, and out-of-distribution (OOD)-MAML (Jeong and Kim Citation2020) improves upon it. Metric-based methods learn a feature space in which the similarity between samples and labels is calculated. ProtoNet (Snell, Swersky, and Zemel Citation2017) used prototypes to represent data and classified according to distance; Lyu and Wang (Citation2023) proposed compositional prototypical networks to improve the reusability of features. Model-based methods construct an internal state of the model that adapts to the input sample. Meta networks (Munkhdalai and Yu Citation2017) are a representative work containing a base learner and a meta learner: the base learner helps to complete the task, and the meta learner calculates weights based on the information provided by the base learner, allowing the base learner to quickly generalize across tasks.

Few-shot learning has a variety of applications in remote sensing. Few-shot learning in scene classification has received a lot of attention. Many works are designed for optical images (Li et al. Citation2022) and synthetic aperture radar (SAR) images (Shang et al. Citation2018). In few-shot semantic segmentation, the main challenge is to efficiently extract features to discriminate between multiple classes. Most existing works are on hyperspectral images (Kemker, Salvaggio, and Kanan Citation2018). In few-shot object detection, the main challenge is to overcome the overfitting and obtain discriminative features that are useful for multiscale problems, especially in optical images (Zhang et al. Citation2022).

3. Research schemes

The first thing to discuss is simulating few-shot RC scenarios.

Used in many few-shot learning works, the N-way K-shot paradigm is task-based: in each training task, the model sees only 1 or 5 samples per class, but throughout training, samples are randomly drawn from the training set to construct a rich set of tasks. This requires a training set of a certain scale; otherwise, rich tasks cannot be constructed. The RC datasets and actual scenarios cannot provide sufficient samples to implement it, and training a multimodal model on many tasks would impose a significant computational burden. Therefore, we do not use the N-way K-shot paradigm but instead the percentage sampling of Chen, Jiang, and Zhao (Citation2021). Yang, Ni, and Ren (Citation2022) also aimed to train RC models with limited caption-labeled data; the researchers considered the three RC datasets relatively small compared to natural datasets, so they directly used 100% of each dataset for research. We think this cannot simulate few-shot scenarios well, so we adopt percentage sampling to simulate few-shot RC scenarios: instead of using all samples in the training set, only a fraction of the samples is used for training. We design two schemes to research few-shot RC. The sampling percentage needs to be no more than 1.0%; we take 0.5%, 0.8%, and 1.0% as the sampling percentages.
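As a concrete illustration, the percentage sampling described above can be implemented in a few lines of Python. The sketch below assumes the training split is available as a list of (image, caption) pairs; the dataset size and file names are illustrative only.

```python
import random

def percentage_sample(train_pairs, percent, seed=0):
    """Sample a fixed percentage of caption-labeled pairs to simulate a
    few-shot RC scenario (e.g. 0.5%, 0.8% or 1.0% of the training set)."""
    rng = random.Random(seed)  # fixed seed keeps the few-shot subset reproducible
    k = max(1, round(len(train_pairs) * percent / 100.0))
    return rng.sample(train_pairs, k)

# Hypothetical training split of (image path, caption) pairs.
train_pairs = [(f"img_{i}.tif", f"caption {i}") for i in range(8734)]
few_shot_pairs = percentage_sample(train_pairs, percent=0.8)
print(len(few_shot_pairs))  # about 70 pairs at a 0.8% sampling percentage
```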

In the first scheme, we take percentage sampling to make the few-shot scenarios more realistic and rigorous. It focuses on forcing the model to work well with sparse caption-labeled samples. The models will be evaluated through existing metrics. In the second scheme, the RC model is also trained with less than 1.0% of the RC datasets, while the test samples are raw remote sensing samples without ground truth captions. We need to design new metrics that do not use ground truth captions. The second scheme focuses more on training a general model. The expected model can apply the knowledge from the limited samples to describe the raw samples, just like a human. The generated captions need to be subjectively assessed. The second scheme is closer to the actual scenarios. These two schemes, respectively, reflect two goals in few-shot learning: the first is to obtain sufficient performance from limited samples to reduce the demand for a large number of labeled samples, and the second is to learn from limited samples and apply to new samples as humans.

The reason for introducing the second scheme is the particularity of the few-shot RC task. It is a cross-modal task with the goal of generating captions that accurately describe key information in raw remote sensing samples. The evaluation of the RC model cannot be carried out according to the methods in computer vision (CV) or natural language processing (NLP). In addition, the existence of the few-shot problem requires attention to whether the captions generated for unfamiliar samples meet the requirements. The scores of existing metrics cannot directly reflect whether the generated captions contain correct information, especially in few-shot scenarios. However, raw samples do not have ground truth captions, and the calculation of existing evaluation metrics must use ground truth captions. New metrics need to be designed.

The use of additional samples and models pre-trained on other remote sensing samples should be restricted. Improving the efficiency of using the model on the limited samples as well as the mobility and adaptability of the model are essential to consider. Additional samples and models can cause interference in few-shot learning studies.

4. Method

The few-shot problem often arises during sample collection, let alone in the later process of labeling with descriptive captions. As a direct result, it is not possible in RC to additionally introduce a large number of unpaired images to solve the few-shot problem, as in Chen, Jiang, and Zhao (Citation2021). Therefore, FRIC does not introduce new samples; it splits the existing limited paired samples into unlabeled remote sensing images and unpaired captions and uses them as supplementary training samples. We adopt various measures during training to maximize the model's utilization of the limited samples. We provide an algorithmic flowchart to show the implementation of FRIC (Figure 1).

Figure 1. Algorithmic flowchart of FRIC.


4.1. Base model

FRIC uses ResNet and long short-term memory (LSTM) as the encoder and decoder, respectively. Given a remote sensing image, we feed it to the ResNet encoder, which encodes the visual features of the image into an image embedding. The quality of the ResNet features directly determines the key information contained in the image embedding. The image embedding is then input into the LSTM-based decoder. ResNet and LSTM are still favored by researchers today, and the community continues to update pre-trained models. They can be replaced with more complex and powerful models to build stronger RC models, and many researchers have done so. Choosing ResNet and LSTM keeps FRIC orthogonal to these works and better demonstrates its own effectiveness.
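A minimal sketch of such a CNN-LSTM base model is shown below, assuming a ResNet-152 encoder and a single-layer LSTM decoder trained with teacher forcing. The vocabulary size, layer widths and the `weights` argument are illustrative and may need adjusting to the torchvision version in use; this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchvision

class BaseCaptioner(nn.Module):
    """Sketch of the base model: ResNet image encoder + LSTM caption decoder."""
    def __init__(self, vocab_size, embed_size=512, hidden_size=512):
        super().__init__()
        backbone = torchvision.models.resnet152(weights=None)  # pretrained weights optional
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        self.enc_proj = nn.Linear(backbone.fc.in_features, embed_size)
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.decoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)        # (B, 2048) image features
        img_embed = self.enc_proj(feats).unsqueeze(1)  # image embedding as the first step
        word_embed = self.embed(captions[:, :-1])      # teacher forcing on caption words
        states, _ = self.decoder(torch.cat([img_embed, word_embed], dim=1))
        return self.out(states)                        # per-step word logits

model = BaseCaptioner(vocab_size=368)  # 368 words, as in UCM-Captions
logits = model(torch.randn(2, 3, 256, 256), torch.randint(0, 368, (2, 12)))
```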

4.2. Self-ensemble of models

We modify the ensemble method to integrate the parameters of the model at different time points, achieving a self-ensemble (S-es) of models; the model with the integrated parameters is the S-es model. A conventional output ensemble introduces noise and errors during calculation and is easily affected by them. In a parameter ensemble, performance is transferred between models through parameters, without computing over the noisy outputs of multiple models, which reduces the propagation of errors within the ensembled model; calculation with integrated parameters can therefore reduce the effect of noise. The S-es model is continuously updated over time. To avoid the collapse of the S-es model during training, a parameter difference between the models at adjacent time points must be guaranteed. We introduce the exponential moving average (EMA) to obtain the parameters of the k-th generation S-es model:

$$\bar{\theta}_k = \alpha\,\bar{\theta}_{k-1} + (1-\alpha)\,\theta_k \tag{1}$$

where $\alpha$ is the decay rate, taken in $[0,1]$, and $k$ represents the time step of training.
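A sketch of the EMA parameter ensemble in Equation (1) is given below; the teacher holds the integrated parameters $\bar{\theta}_k$ and the student provides the current parameters $\theta_k$, with the decay rate following the setting α = 0.99 reported in Section 5.3.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Parameter ensemble via EMA (Eq. 1):
    theta_bar_k = alpha * theta_bar_{k-1} + (1 - alpha) * theta_k."""
    for p_bar, p in zip(teacher.parameters(), student.parameters()):
        p_bar.mul_(alpha).add_(p, alpha=1.0 - alpha)
    for b_bar, b in zip(teacher.buffers(), student.buffers()):
        b_bar.copy_(b)  # keep running statistics (e.g. BatchNorm) in sync
```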

To build a more robust S-es model, we add a multi-model ensemble. We train N different base models $\theta_k^n$ (n = 1, 2, ..., N) with N non-overlapping sets of caption-labeled samples and integrate their outputs as the final output of the model. At this point, the model $\Theta_k$ is composed of multiple models. We use the parameter ensemble to obtain N models $\bar{\theta}_k^n$, which are fed separately with non-overlapping samples. The schematic diagram of S-es is shown in Figure 2.

Figure 2. S-es model includes parameter ensemble and multi-model ensemble. Parameter ensemble updates model parameters through EMA. The multi-model ensemble is to integrate base models with different parameters.


The parameter ensemble and multi-model ensemble improve FRIC from the perspective of learning efficiency. We integrate the S-es model into self-distillation (S-dl) as the teacher model to guide the base models in each iteration round.

4.3. Module of training with split samples

To mine more supervision, we design a new perspective on using the limited samples: we split caption-labeled samples into unlabeled images and unpaired captions as supplementary samples. In the training process, the paired samples follow the process of the S-es model in the previous section: images are input into the k-th generation model $\Theta_k$ to generate captions for comparison with the ground truth captions y. The unlabeled images x and unpaired captions y are sent to the k-th generation S-es model $\bar{\Theta}_k$, whose performance is then used to train $\Theta_{k+1}$. A minimal sketch of the sample split is shown below.
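The sample split and the division into N non-overlapping parts can be sketched as follows; the pair format is the same hypothetical (image, caption) tuple used earlier.

```python
def split_paired_samples(paired):
    """Split caption-labeled pairs into unlabeled images and unpaired captions,
    which FRIC reuses as supplementary training samples."""
    unlabeled_images = [img for img, _ in paired]
    unpaired_captions = [cap for _, cap in paired]
    return unlabeled_images, unpaired_captions

def split_into_parts(samples, n_models=3):
    """Divide samples into N non-overlapping parts, one per base model."""
    return [samples[i::n_models] for i in range(n_models)]
```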

4.3.1. Training with unlabeled remote sensing images

We input unlabeled images x to $\bar{\Theta}_k$ and use beam search to obtain a series of pseudo-captions. Reducing the differences between these pseudo-captions and the captions generated by the model $\Theta_k$ is taken as the training goal, in the hope that the new model will absorb the additional knowledge contained in the pseudo-captions.

However, using unlabeled images without ground truth captions introduces noise. The S-es model is not completely reliable during training, and the pseudo-captions it generates from unlabeled images contain noise and are less reliable than ground truth captions. We therefore compute a confidence score for the pseudo-captions to control their influence during training:

$$\mathrm{Confidence}(W^t) = \frac{e^{\log\left(\prod_{i=1}^{N_t} p(w_i^t \mid w_1^t w_2^t \cdots w_{i-1}^t)\right)}}{\sum_{t=1}^{T} e^{\log\left(\prod_{i=1}^{N_t} p(w_i^t \mid w_1^t w_2^t \cdots w_{i-1}^t)\right)}} \tag{2}$$

where $W^t$ is one of the T captions obtained by beam search, $w_i^t$ is its i-th word, $p(w_i^t \mid w_1^t w_2^t \cdots w_{i-1}^t)$ is the probability with which the model generates the i-th word, and $N_t$ is the sentence length. The more noise a caption contains, the lower its confidence score, and the less weight it receives in training.
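Because the sentence probability in Equation (2) is an exponentiated sum of word log-probabilities normalized over the T beam candidates, the confidence score reduces to a softmax over sentence log-probabilities. The sketch below follows that reading; the per-word log-probabilities are assumed to come from the beam search.

```python
import torch

def caption_confidence(logprobs_per_caption):
    """Confidence scores (Eq. 2) for T beam-search pseudo-captions: each caption's
    sentence probability, normalized over the beam. Inputs are lists of per-word
    log-probabilities log p(w_i | w_1 ... w_{i-1})."""
    sentence_logprob = torch.tensor([sum(lp) for lp in logprobs_per_caption])
    return torch.softmax(sentence_logprob, dim=0)  # exp(sum log p) / sum_t exp(sum log p)

# Hypothetical beam of T = 3 captions with different lengths.
conf = caption_confidence([[-0.2, -0.5, -0.1], [-1.0, -0.8], [-2.0, -1.5, -0.7, -0.3]])
print(conf)  # noisier (less probable) captions receive lower confidence
```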

We input N groups of different unlabeled samples $x_n$, obtained by dividing the unlabeled images x, to the N models, where n = 1, 2, ..., N. Taking model $\bar{\theta}_k^n$ as an example, we input $x_n$ into $\bar{\theta}_k^n$ and obtain T captions $W_n^t$ through beam search, where t = 1, 2, ..., T. At the same time, we also feed $x_n$ to the N different base models of the model $\Theta_k$ to get a series of output captions $f(x_n \mid \theta_k^n)$. We calculate the mean square error between each caption $W_n^t$ and $f(x_n \mid \theta_k^n)$, multiply it by the confidence score of $W_n^t$, and then sum over all terms. FRIC uses all N models in each iteration. Self-distillation for few-shot image captioning (SD-FSIC) uses 1% of the Microsoft COCO dataset as paired data and the rest as unlabeled data; with such abundant unlabeled data, SD-FSIC trains the model adequately. In few-shot remote sensing scenarios, it is difficult to collect such abundant unlabeled data. Therefore, we feed the available paired samples and the split unlabeled data to all models in each training iteration. The loss function $L_x$ from training with unlabeled images is

$$L_x = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T} \mathrm{confidence}(W_n^t)\,\mathrm{MSE}\big(W_n^t, f(x_n \mid \theta_k^n)\big) \tag{3}$$

where MSE denotes the mean square error and k represents the time step of S-dl. Training by continuously decreasing $L_x$ induces the model $\Theta_k$ to follow $\bar{\Theta}_k$ and learn additional knowledge from the unlabeled images. Each iteration of S-dl training updates the parameters of the model $\Theta_k$, while $\bar{\Theta}_k$ updates its parameters via EMA. A schematic diagram of training with unlabeled images is shown in Figure 3.
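A sketch of the loss in Equation (3) is given below, assuming that the teacher's pseudo-captions and the base model's outputs are available as per-word probability tensors of matching shape; the exact tensor representation is not specified in the text, so this is only one possible reading.

```python
import torch
import torch.nn.functional as F

def unlabeled_image_loss(pseudo_outputs, student_outputs, confidence):
    """Loss L_x (Eq. 3) for one base model: confidence-weighted MSE between the
    teacher's T beam pseudo-captions and the base model's outputs for the same
    unlabeled images; in FRIC this is further averaged over the N base models."""
    loss = torch.tensor(0.0)
    for t, (pseudo, student) in enumerate(zip(pseudo_outputs, student_outputs)):
        loss = loss + confidence[t] * F.mse_loss(student, pseudo.detach())
    return loss

# Hypothetical shapes: T = 3 captions, length 12, vocabulary of 368 words.
T, L, V = 3, 12, 368
pseudo = [torch.rand(L, V) for _ in range(T)]
student = [torch.rand(L, V) for _ in range(T)]
print(unlabeled_image_loss(pseudo, student, torch.tensor([0.5, 0.3, 0.2])))
```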

Figure 3. Schematic diagram of training with unlabeled remote sensing images. Training is achieved by comparing the confidence-corrected pseudo-captions with the output captions from base models.


4.3.2. Training with unpaired captions

To train IC models with unpaired captions, we extract pseudo-image features from the unpaired captions. We input the unpaired captions to the decoder $D_{\bar{\theta}_k^n}$ of $\bar{\theta}_k^n$. The structure of $D_{\bar{\theta}_k^n}$ is an LSTM, and its output at the last time step represents the pseudo-image features f(y) corresponding to the unpaired captions.

It is necessary to consider the reliability of the pseudo-features that the S-es model extracts from the unpaired captions y. Unpaired captions are not generated by the S-es model but are split from paired samples, so evaluating their reliability is pointless. It is also undesirable to evaluate the reliability of the captions generated by the base models from pseudo-image features: that reliability reflects both the performance of the S-es model and that of the base model, so it cannot be used to improve the guidance of the S-es model. Moreover, image data are not as concise as text data, and it is difficult to measure the reliability of pseudo-image features directly and conveniently. Here, we use gradient descent (GD) to optimize the pseudo-image features before inputting them to the base models. The target pseudo-image features f(y) optimized by GD are

$$f(y) = \arg\min_{F(y)} \sum_{n=1}^{N} \mathrm{CE}\big(y, D_{\bar{\theta}_k^n}(F(y))\big) \tag{4}$$

where CE represents the cross-entropy loss. As with training on unlabeled images, we still use all N models in each training iteration. We feed the pseudo-image features f(y) to the decoders $D_{\theta_k^n}$ in the N models $\theta_k^n$ of the base model $\Theta_k$ to obtain N output captions $D_{\theta_k^n}(f(y))$, where n = 1, 2, ..., N. A schematic diagram of training with unpaired captions is shown in Figure 4.
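The GD refinement in Equation (4) can be sketched as follows. The teacher decoders are assumed to be callables mapping (features, captions) to per-word logits, which is a hypothetical interface rather than the authors' exact one; the feature dimension and step count follow the settings reported in Section 5.3.

```python
import torch
import torch.nn.functional as F

def optimize_pseudo_features(teacher_decoders, captions, feat_dim=512, steps=100, lr=0.1):
    """Refine pseudo-image features F(y) by gradient descent (Eq. 4): minimize the
    summed cross-entropy of the N teacher decoders forced to reproduce the
    unpaired captions y (a LongTensor of token ids with shape (B, L))."""
    feats = torch.zeros(captions.size(0), feat_dim, requires_grad=True)
    opt = torch.optim.SGD([feats], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(
            F.cross_entropy(dec(feats, captions).flatten(0, 1), captions.flatten())
            for dec in teacher_decoders
        )
        loss.backward()
        opt.step()
    return feats.detach()
```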

Figure 4. Schematic diagram of training with unpaired captions. The pseudo image features optimized by GD are fed into the base models to facilitate training.


The difference between these captions and the unpaired captions y represents the disparity between the base model and the S-es model; it is also the loss $L_y$ of using the unpaired captions y to assist model training:

$$L_y = \frac{1}{N}\sum_{n=1}^{N} \mathrm{CE}\big(y, D_{\theta_k^n}(f(y))\big) \tag{5}$$
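Under the same hypothetical decoder interface as above, the loss in Equation (5) averages the cross-entropy of the N base-model decoders on the optimized pseudo-image features:

```python
import torch.nn.functional as F

def unpaired_caption_loss(student_decoders, pseudo_feats, captions):
    """Loss L_y (Eq. 5): mean cross-entropy between the unpaired captions y and the
    captions the N base-model decoders produce from the refined pseudo-features."""
    losses = [
        F.cross_entropy(dec(pseudo_feats, captions).flatten(0, 1), captions.flatten())
        for dec in student_decoders
    ]
    return sum(losses) / len(losses)
```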

4.3.3. Training with paired samples

It is necessary to use the few paired remote sensing samples (x, y) to train the model $\Theta_k$; the performance of $\Theta_k$ is the cornerstone of the performance improvement obtained through S-es and S-dl. We divide the paired samples into N non-overlapping parts and send them to the N base models $\theta_k^n$. The objective is to minimize the CE loss $L_{(x,y)}$ between the ground truth captions and the captions generated by $\theta_k^n$:

$$L_{(x,y)} = \frac{1}{N}\sum_{n=1}^{N} \mathrm{CE}\big(y_n, f(x_n \mid \theta_k^n)\big) \tag{6}$$

To sum up, the paired samples (x, y), unlabeled images x and unpaired captions y are all used to train $\theta_k^n$ in $\Theta_k$. The total loss L in this process is

$$L = L_{(x,y)} + \lambda_x L_x + \lambda_y L_y = \frac{1}{N}\sum_{n=1}^{N} \mathrm{CE}\big(y_n, f(x_n \mid \theta_k^n)\big) + \frac{\lambda_x}{N}\sum_{n=1}^{N}\sum_{t=1}^{T} \mathrm{confidence}(W_n^t)\,\mathrm{MSE}\big(W_n^t, f(x_n \mid \theta_k^n)\big) + \frac{\lambda_y}{N}\sum_{n=1}^{N} \mathrm{CE}\big(y, D_{\theta_k^n}(f(y))\big) \tag{7}$$

where $\lambda_x$ and $\lambda_y$ are hyperparameters weighting the loss $L_x$ from the unlabeled remote sensing images x and the loss $L_y$ from the unpaired captions y in the total loss L, respectively. The base model $\theta_k^n$ is updated by continuously reducing the loss L to its minimum. We provide an overall diagram to show the data flows (Figure 5).
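Combining the three terms, one training step reduces to the weighted sum in Equation (7). A minimal sketch with the weights from Section 5.3 (λx = 0.1, λy = 1) follows; the loss values are illustrative stand-ins for Equations (6), (3) and (5).

```python
import torch

def fric_total_loss(l_pair, l_x, l_y, lambda_x=0.1, lambda_y=1.0):
    """Total objective (Eq. 7): L = L_(x,y) + lambda_x * L_x + lambda_y * L_y."""
    return l_pair + lambda_x * l_x + lambda_y * l_y

# Illustrative values only; in training these come from Eqs. (6), (3) and (5).
loss = fric_total_loss(torch.tensor(2.1), torch.tensor(0.4), torch.tensor(1.3))
print(loss)  # a real step would call loss.backward(), an optimizer step and the EMA update
```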

Figure 5. Overall diagram of FRIC, containing three data flows.


5. Experimental setup

5.1. Datasets

There are three commonly used datasets for RC. UCM-Captions contains 2100 samples from 21 remote sensing scenes, each with a resolution of 256 × 256 pixels; the developers use 368 different words to describe the samples. Sydney-Captions contains 613 remote sensing samples of 7 remote sensing scenes, each with a resolution of 500 × 500 pixels, and uses 237 different words in its captions. Sydney-Captions has the highest image quality and more detailed captions, but the number of samples is so small that it presents the highest learning difficulty. RSICD collects 10,921 samples from 30 remote sensing scenes; the resolution of all samples is set to 224 × 224 pixels, and the dataset uses 3325 different words to describe them. It has the most exhaustive ground truth captions of the three datasets. RSICD contains the most samples, but its captions are very complex, which increases the difficulty. Each of these datasets provides five captions for each sample.

The maritime satellite image dataset (MASATI) (Gallego, Pertusa, and Gil Citation2018) was compiled from optical remote sensing images obtained from Bing Maps. Each image is in portable network graphics (PNG) format with a resolution of 512 × 512 pixels. MASATI has 7389 images and is usually used for object recognition and object detection.

5.2. Metrics

In the experiments of the first scheme, we used the same metrics as existing RC works. Bilingual evaluation understudy (BLEU) was proposed in 2002 to evaluate machine translation systems (Papineni et al. Citation2002); BLEU-n (n = 1, 2, 3, 4) is calculated according to the n-gram matching rule. Recall-oriented understudy for gisting evaluation (ROUGE_L) is an evaluation metric proposed by Lin (Citation2004). Consensus-based image description evaluation (CIDEr-D) is a metric proposed for image description problems (Vedantam et al. Citation2014); its designers introduced a Gaussian penalty and limited the score of words appearing multiple times. Metric for evaluation of translation with explicit ordering (METEOR) (Banerjee and Lavie Citation2005) is modified from BLEU, and semantic propositional image caption evaluation (SPICE) is a metric proposed by Anderson et al. (Citation2016) for evaluating caption quality. The calculation of these metrics requires the ground truth captions of the test samples.

In the experiments of the second scheme, the above metrics cannot be used because the test samples have no ground truth captions. We instead use two metrics that evaluate the performance of the RC model based on the generated sentences themselves.

Perplexity can be used to evaluate the quality of sentences and to measure a sentence generation model. When perplexity is used to evaluate language models, a given dataset is used to assess the ability of different models to generate its data; different models obtain different results for the same input, and the better the model, the smaller its perplexity on the dataset. A generation model can be seen as generating words one by one according to different probabilities. Perplexity computes the probability of a sentence word by word to judge the confidence of the model when generating it: the model's conditional probability of the next word given all previously generated words is computed at each position, these conditional probabilities are multiplied together, and the N-th root of the inverse product gives the perplexity of the whole sentence:

$$\mathrm{Perplexity}(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(w_i \mid w_1 w_2 \cdots w_{i-1})}} \tag{8}$$

where W represents the sentence, and wi (i = 1,2 … N) represents the word. N represents the length of the sentence. P and p represent probabilities. The smaller the perplexity, the better the modeling ability of the model.
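A sketch of the perplexity computation in Equation (8), taking as input the per-word conditional log-probabilities reported by the captioning model:

```python
import math

def perplexity(word_logprobs):
    """Sentence perplexity (Eq. 8): the N-th root of the inverse sentence probability,
    computed from log p(w_i | w_1 ... w_{i-1}); lower is better."""
    n = len(word_logprobs)
    return math.exp(-sum(word_logprobs) / n)

print(perplexity([math.log(0.5), math.log(0.25), math.log(0.5)]))  # about 2.52
```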

Entropy measures the uncertainty of random variables or of a system of variables. The signal (variable) emitted by an information source is uncertain, and we can measure the uncertainty of a variable according to its probability of occurrence; the uncertainty of the system is determined by all variables in the system. We regard the model as the information source and each generated word wi (i = 1, 2, ..., N) as a variable, so the sentence W can be regarded as a discrete random variable system, and its entropy is

$$\mathrm{Entropy}(W) = -\sum_{i=1}^{N} p(w_i)\,\log p(w_i) \tag{9}$$

where $p(w_i)$ is the ratio of occurrences of word $w_i$ in the whole sentence W. The smaller the entropy, the smaller the uncertainty of the system and the better its stability.
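A matching sketch for Equation (9), where the word probabilities are relative frequencies within the generated sentence:

```python
import math
from collections import Counter

def sentence_entropy(words):
    """Entropy (Eq. 9) of a generated sentence; p(w_i) is the relative frequency of
    word w_i in the sentence, and lower entropy indicates a more stable output."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

print(sentence_entropy("many buildings are around a playground".split()))  # ln(6) ≈ 1.79
```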

5.3. Hyperparameter settings

We load ResNet152 from torchvision as the encoder for FRIC. The encoding size and the dimension of the LSTM state are set to 512. The decay rate α in the S-es model is 0.99, and the number of base models in the multi-model ensemble is 3. The number of iterations in S-dl is set to 100. We take λx = 0.1 and λy = 1. The number of training epochs is 130, the batch size is 50, and the optimizer is Adam.

6. Experiments in the first scheme

We take percentage sampling to make the experimental environment more rigorous. The sampling percentages range from 0.5% to 80%. When FRIC can use more samples to train the model, it is foreseeable that the model will improve; we design three groups of contrast experiments to verify this.

As presented in Tables 1–3, when we gradually increase the sampling percentage from 0.5% to 80%, the model gradually becomes stronger. This shows that FRIC can adapt to different data scales rather than overfitting to a certain low sampling percentage. We visualize the CIDEr-D (C-D) scores of the model with different sampling percentages in Figure 6, which shows how varying data quantities impact model performance on the three datasets.

Figure 6. The trend of C-D score changes in models with different sampling percentages.


Table 1. Comparison results of FRIC trained with different sampling percentages on UCM-Captions.

Table 2. Comparison results of FRIC trained with different sampling percentages on Sydney-Captions.

Table 3. Comparison results of FRIC trained with different sampling percentages on RSICD.

We also increase the sampling percentage to 100% and conduct three groups of experiments to compare FRIC with classical and recent RC methods. The multimodal recurrent neural network (mRNN) (Qu et al. Citation2016) uses a pretrained convolutional neural network (CNN) as the encoder and a recurrent neural network (RNN) as the decoder; hard-attention (Lu et al. Citation2017) adopts a CNN-LSTM structure; the TCE loss-based method (Li et al. Citation2020) adopts GoogLeNet + LSTM; SVM-D concatenation (CONC) takes an SVM as the decoder with VGG16 as the encoder; and the multilevel and contextual attention network (MLCA-Net) uses VGG16 as the encoder and LSTM as the decoder. Self-critical sequence training for image captioning (SCST) (Kandala et al. Citation2022) selects a transformer as the decoder. When the LSTM decoder in FRIC is replaced by a transformer, we obtain TR-FRIC. Comparison results from the experiments are shown in Tables 4–6.

Table 4. Comparison results of methods trained with a sampling percentage of 100% on UCM-Captions.

Table 5. Comparison results of methods trained with a sampling percentage of 100% on Sydney-Captions.

Table 6. Comparison results of methods trained with a sampling percentage of 100% on RSICD.

FRIC achieves outstanding performance when trained with the same number of samples as other methods. TR-FRIC scores higher than SCST, which also includes a transformer. TR-FRIC is better than FRIC in most experiments. While these methods use various new structures or special auxiliary models, FRIC uses only a simple base model. FRIC can effectively improve the efficiency of the model in utilizing samples.

Most importantly, we compare the performance of FRIC with other methods in the same few-shot scenario. We choose hard-attention, mRNN and SD-FSIC for comparison. Hard-attention and mRNN have good adaptability, and SD-FSIC is a few-shot IC method proposed for natural images with good universality and performance. SD-FSIC has been trained with a 0.8% sampling percentage to achieve RC training with limited samples; following the 0.8% setting is beneficial for comparing FRIC with it and for the subsequent development of few-shot RC. The comparison results are presented in Tables 7–9.

Table 7. Comparison results of methods trained with a sampling percentage of 0.8% on UCM-Captions.

Table 8. Comparison results of methods trained with a sampling percentage of 0.8% on Sydney-Captions.

Table 9. Comparison results of methods trained with a sampling percentage of 0.8% on RSICD.

We have marked the highest score for each metric in the tables. FRIC performs better than the other methods in the experiments, especially on the Sydney-Captions and RSICD datasets, where it shows obvious advantages. FRIC exhibits stronger performance than SD-FSIC and achieves advantages on various metrics. Both are few-shot IC methods, and FRIC is more suitable for remote sensing scenarios than SD-FSIC, which indicates that the designs in FRIC targeting the characteristics of remote sensing scenarios are effective.

To further analyze the robustness of FRIC, particularly under different noise levels or variations in sample quality, we set up two comparison groups on the three datasets: the first group uses 20% paired samples, and the second group uses 20% paired samples with an additional 80% unpaired samples. We quantitatively compare the performance of the models trained in the two groups, as presented in Tables 10–12.

Table 10. Comparison results of training with or without additional unpaired samples on UCM-Captions dataset.

Table 11. Comparison results of training with or without additional unpaired samples on Sydney-Captions dataset.

Table 12. Comparison results of training with or without additional unpaired samples on RSICD dataset.

FRIC obtains performance improvements from the additional unlabeled samples on all three datasets. Even though more unlabeled samples bring more noise and the noise level increases, FRIC can reduce the negative impact of the noise and ensure that the model keeps learning. FRIC therefore has good applicability in real-world scenarios.

7. Experiments in the second scheme

We choose 0.8% as the sampling percentage to obtain training samples from the three datasets. The second scheme focuses on adaptability to raw samples, so we randomly select 200 images from the MASATI dataset as raw samples without ground truth captions. We continue the comparison between FRIC and SD-FSIC in this scheme, divided into a quantitative comparison and a qualitative comparison.

7.1. Quantitative comparison

When conducting the quantitative comparison, we use the metrics designed in Section 5.2, perplexity and entropy, to measure the performance of the models. The 200 remote sensing images are uncorrelated, so we use the average of the perplexity scores (mean perplexity) and the average of the entropy scores (mean entropy) as the final scores. In addition, we multiply mean perplexity and mean entropy and use the product as a composite metric. We compare FRIC and SD-FSIC on the three datasets. The quantitative results are presented in Tables 13–15; the lowest score in each table is in bold.

Table 13. Quantitative results in the second scheme on UCM-Captions.

Table 14. Quantitative results in the second scheme on Sydney-Captions.

Table 15. Quantitative results in the second scheme on RSICD.

The scores of FRIC in the three tables are lower, and the lowest scores in all three tables are obtained by FRIC. This means that our designs enable the model to generate a definite caption more stably when facing unfamiliar samples, instead of being confused about which word should be generated next. FRIC has better model stability than SD-FSIC.

7.2. Qualitative comparison

In addition to the quantitative comparison of metric scores, we also qualitatively compare the generated captions to verify the model's generalization ability on raw samples. The qualitative comparison results are shown in Figure 7. The blue words represent correct information that matches the image content, while the green words represent incorrect information. For the first image, the FRIC model trained on the RSICD dataset accurately described the category and shape of the key targets, while the caption output by the SD-FSIC model contained significant errors. For the second image, the FRIC model described most of the key information with only a few errors, whereas the SD-FSIC model output a large amount of erroneous information and lacked descriptions of most key information. In this group of experiments, FRIC showed better performance than SD-FSIC, indicating that FRIC is more suitable for remote sensing images. FRIC can function normally even in scenarios with extremely scarce samples, ensuring that the model outputs correct and semantically rich captions.

Figure 7. Captions generated by FRIC and SD-FSIC for samples.


8. Ablation experiments

After describing the designs in FRIC, some questions naturally arise. First, splitting the limited samples into multiple parts to train models may lead to poorer performance, and it is unknown whether the multi-model ensemble can yield a better model. Second, as the basis of S-dl, the S-es model also receives unlabeled samples containing noise. Although we have designed measures to eliminate adverse effects when unlabeled images and unpaired captions are input to multiple base models, noise will inevitably be fed into each model and eventually into the ensemble; it is unclear whether the multi-model ensemble suppresses or amplifies this noise. Third, the paired samples have already been used to train the base model, and it is uncertain whether the module of training with split samples can extract extra useful knowledge from them. Addressing these questions is instructive for designing a more concise model, so we implement a series of ablation experiments. The parameter ensemble is the core of the self-ensemble model, as it updates the parameters of the model. If the parameter ensemble were removed, the self-ensemble model would fail to form a difference with the target model, the subsequent multi-model ensemble and S-dl modules would lose their meaning, and FRIC would degenerate into the base model. However, the gap between FRIC and the base model is caused by multiple modules, so we do not ablate the parameter ensemble in the ablation experiments.

8.1. Ablation of multi-model ensemble

To explore the effectiveness of the multi-model ensemble, we set up two groups of contrast experiments. In one group, the full FRIC is used and 2.4% of the training samples are divided among three different models for non-overlapping training. In the other group, FRIC without the multi-model ensemble, named S-FRIC, is trained with all 2.4% of the samples. The results of the contrast experiments are presented in Tables 16–18. Before the experiments, we predicted that the first setting would perform worse than the second, because dividing the training samples makes the samples assigned to each model scarcer. Surprisingly, FRIC performs better than S-FRIC. This means that the multi-model ensemble compensates for the loss caused by the significant reduction in per-model training data: non-overlapping training ensures the variability between the ensembled models and hence the effectiveness of the ensemble, which is meaningful for fully learning the limited samples. The above discussion addresses the first question. To further explore the effect of this design, we also train S-FRIC with 0.8% of the samples; the FRIC trained with 2.4% of the samples can be regarded as an ensemble of three different S-FRICs each trained with 0.8%.

Table 16. Ablation results of the multi-model ensemble on UCM-Captions.

Table 17. Ablation results of the multi-model ensemble on Sydney-Captions.

Table 18. Ablation results of the multi-model ensemble on RSICD.

The amount of noise contained in the unpaired data obtained by splitting 0.8% or 2.4% of the paired data is fixed. FRIC 2.4% provides a greater improvement over S-FRIC 0.8% than S-FRIC 2.4% does. This additional performance improvement proves that the multi-model ensemble can suppress noise and protect the stability of the model in few-shot scenarios. Thus, the second question, concerning the multi-model ensemble, has been answered.

8.2. Ablation of self-distillation and the module of training with split samples

S-dl is used in FRIC to obtain performance from unlabeled samples; its role directly reflects the performance improvement the model gains from the unpaired samples. We set up four comparison methods in addition to FRIC. The first, ‘No un caption', removes the module that uses unpaired captions; the second, ‘No un image', removes the module that uses unlabeled images; the third, ‘No S-dl', removes self-distillation entirely; and the fourth, the remote sensing image captioning framework (RIC), removes the whole module of training with split samples. The fifth is FRIC itself. The sampling percentage is 0.8%. The ablation results on the three datasets are given in Tables 19–21.

Table 19. Ablation results of the self-distillation on UCM-Captions.

Table 20. Ablation results of the self-distillation on Sydney-Captions.

Table 21. Ablation results of the self-distillation on RSICD.

FRIC performs better than ‘No S-dl', and the performance improvement is obvious. The performance of ‘No un caption' and ‘No un image' shows that even though the base model has already learned from the paired samples, S-dl can still obtain extra performance from the unlabeled images or unpaired captions; the third question has thus been resolved. Comparing the scores on the three datasets, the effect of FRIC is most evident on Sydney-Captions, which has the fewest samples. ‘No un caption' performed worse than ‘No un image' in all three sets of experiments, suggesting that FRIC makes fuller use of the captions. It also shows that, if FRIC is expected to play a better role, detailed and accurate captions must be provided.

9. Discussions

In the first scheme, FRIC achieved higher scores than previous works, which means that FRIC can reduce the demand for caption-labeled samples. The higher scores of TR-FRIC compared to SCST (which also uses a transformer) prove the effectiveness of FRIC for complex structures. In the second scheme, FRIC correctly described raw samples, while SD-FSIC made mistakes, indicating that FRIC has better adaptability and can ensure the normal operation of few-shot models.

However, the model complexity of FRIC is higher. The complexity comes from the ensemble of multiple base models and the iterative optimization during training with labeled and unlabeled data. The training process includes three different parts, but they run in parallel, so the model complexity of FRIC is O(M + N), where M is the number of base models and N is the number of iterations; the base model does not include the ensemble of models and its complexity is only O(M). This cost is expected, as FRIC trades it for a reduced demand for labeled data. The training resources of FRIC are 3.5 times those of the base model, the training time is 3 times that of the base model, and the inference time is roughly the same.

10. Conclusions

Very little previous work has systematically defined and addressed the few-shot problem in RC. Although some works address overfitting due to insufficient samples, there is no direct discussion of how to train RC models in few-shot scenarios; they improve RC indirectly by improving scene classification, ignoring the few-shot problem in the process of decoding image features into sentences. Most works also do not further reduce the sample size, so the few-shot setting is not intuitively represented. We attempt to address these issues and develop a reasonable and sustainable way to study few-shot RC. We discuss the universality of the few-shot problem in RC and what few-shot learning should achieve. We design two research schemes, and the implementation of the second scheme is an important novelty of this work: previous works mostly relied on metric scores, while the second scheme mainly considers the captions generated by few-shot models, making the research on few-shot RC more reasonable and effective. We use less than 1% of the data to train the models; such a small amount of data has not been used in previous RC works.

Building on the above, we propose a framework named FRIC. Without the need for additional data or external models, FRIC improves the model by improving the use of the limited caption-labeled remote sensing samples: on the one hand, it improves learning efficiency; on the other hand, it adds new ways of using caption-labeled samples. The former is realized through S-es, including the parameter ensemble and multi-model ensemble; the latter splits the caption-labeled samples into unlabeled images and unpaired captions and obtains performance through the module of training with split samples. We then verify the performance of FRIC in the two schemes and compare FRIC with other methods, especially SD-FSIC. The experimental results show that FRIC is outstanding in few-shot scenarios, with obvious advantages over other methods when trained with 0.8% of the samples of the RC datasets. The ablation results verify the effectiveness of the components and answer the three questions raised above. Challenges remain: FRIC is easily affected by the quality of the labeled samples. It is possible to train few-shot RC models using unpaired captions from a wider range of sources, but this requires a greater ability to deal with noise. Training an IC model on natural datasets and then fine-tuning it on limited remote sensing samples could also be explored; however, the IC model incorporates data from two modalities, and its fine-tuning will be complex.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The UCM-Captions dataset, Sydney-Captions dataset and RSICD dataset can be found at https://github.com/201528014227051/RSICD_optima. The MASATI dataset can be found at https://www.iuii.ua.es/datasets/masati/index.html.

References

  • Anderson, Peter, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. “SPICE: Semantic Propositional Image Caption Evaluation.” Paper presented at the European conference on computer vision, Amsterdam, The Netherlands.
  • Banerjee, Satanjeev, and Alon Lavie. 2005. “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.” Paper presented at the IEEvaluation@ACL, AnnArbor, MI, USA, 11 June 2005.
  • Chen, Xianyu, Ming Jiang, and Qi Zhao. 2021. “Self-Distillation for Few-Shot Image Captioning.” 2021 IEEE winter conference on applications of computer vision (WACV):545–555.
  • Chen, Zihang, Junjue Wang, Ailong Ma, and Yanfei Zhong. 2022. “TypeFormer: Multiscale Transformer With Type Controller for Remote Sensing Image Caption.” IEEE Geoscience and Remote Sensing Letters 19:1–5.
  • Cheng, Qimin, Deqiao Gan, Peng Fu, Haiyan Huang, and Yuzhuo Zhou. 2021. “A Novel Ensemble Architecture of Residual Attention-Based Deep Metric Learning for Remote Sensing Image Retrieval.” Remote Sensing 13:3445. https://doi.org/10.3390/rs13173445.
  • Du, Runyan, Wei Cao, Wenkai Zhang, Guo Zhi, Xianchen Sun, Shuoke Li, and Jihao Li. 2023. “From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16:7704–7717. https://doi.org/10.1109/JSTARS.2023.3305889.
  • Finn, Chelsea, P. Abbeel, and Sergey Levine. 2017. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.” Paper presented at the international conference on machine learning, Sydney, Australia.
  • Gallego, Antonio Javier, A. Pertusa, and Pablo Gil. 2018. “Automatic Ship Classification from Optical Aerial Images with Convolutional Neural Networks.” Remote Sensing 10:511. https://doi.org/10.3390/rs10040511.
  • Hoxha, Genc, and Farid Melgani. 2022. “A Novel SVM-Based Decoder for Remote Sensing Image Captioning.” IEEE Transactions on Geoscience and Remote Sensing 60:1–14.
  • Jeong, Taewon, and Heeyoung Kim. 2020. “OOD-MAML: Meta-Learning for Few-Shot Out-of-Distribution Detection and Classification.” Paper presented at the neural information processing systems, Vancouver, Canada.
  • Kandala, Hitesh, Sudipan Saha, Biplab Banerjee, and Xiao Xiang Zhu. 2022. “Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning.” IEEE Geoscience and Remote Sensing Letters 19:1–5. https://doi.org/10.1109/LGRS.2022.3198234.
  • Kemker, Ronald, Carl Salvaggio, and Christopher Kanan. 2018. “Algorithms for Semantic Segmentation of Multispectral Remote Sensing Imagery Using Deep Learning.” ISPRS Journal of Photogrammetry and Remote Sensing 145: 60–77.
  • Li, Xiaomin, D. Shi, Xiaolei Diao, and Hao Xu. 2022. “SCL-MLNet: Boosting Few-Shot Remote Sensing Scene Classification via Self-Supervised Contrastive Learning.” IEEE Transactions on Geoscience and Remote Sensing 60:1–12.
  • Li, Xuelong, Xueting Zhang, Wei Huang, and Qi Wang. 2020. “Truncation Cross Entropy Loss for Remote Sensing Image Captioning.” IEEE Transactions on Geoscience and Remote Sensing 59:5246–5257. https://doi.org/10.1109/TGRS.2020.3010106.
  • Lin, Chin-Yew. 2004. “ROUGE: A Package for Automatic Evaluation of Summaries.” Paper presented at the annual meeting of the association for computational linguistics, Barcelona, Spain.
  • Lin, Tsung-Yi, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. “Microsoft COCO: Common Objects in Context.” Paper presented at the European conference on computer vision, Zurich, Switzerland.
  • Liu, Qingrong, Chengqing Ruan, Shan Zhong, Jian Li, Zhonghui Yin, and Xihu Lian. 2018. “Risk Assessment of Storm Surge Disaster Based on Numerical Models and Remote Sensing.” International Journal of Applied Earth Observation and Geoinformation 68: 20–30.
  • Liu, Chenyang, Rui Zhao, Jianqi Chen, Zipeng Qi, Zhengxia Zou, and Zhen Xia Shi. 2023. “A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning.” IEEE Transactions on Geoscience and Remote Sensing 61:1–18.
  • Liu, Chenyang, Rui Zhao, Hao Chen, Zhengxia Zou, and Zhen Xia Shi. 2022. “Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset.” IEEE Transactions on Geoscience and Remote Sensing 60:1–20.
  • Liu, Chenyang, Rui Zhao, and Zhen Xia Shi. 2022. “Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer.” IEEE Geoscience and Remote Sensing Letters 19:1–5.
  • Lu, Xiaoqiang, Binqiang Wang, Xiangtao Zheng, and Xuelong Li. 2017. “Exploring Models and Data for Remote Sensing Image Caption Generation.” IEEE Transactions on Geoscience and Remote Sensing 56:2183–2195.
  • Lyu, Qiang, and Weiqiang Wang. 2023. “Compositional Prototypical Networks for Few-Shot Classification.” ArXiv abs/2306.06584.
  • Munkhdalai, Tsendsuren, and Hong Yu. 2017. “Meta Networks.” Proceedings of Machine Learning Research 70:2554–2563.
  • Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “Bleu: A Method for Automatic Evaluation of Machine Translation.” Paper presented at the annual meeting of the association for computational linguistics, Philadelphia, PA, USA.
  • Qu, Bo, Xuelong Li, Dacheng Tao, and Xiaoqiang Lu. 2016. “Deep Semantic Understanding of High Resolution Remote Sensing Image.” 2016 international conference on computer, information and telecommunication systems (CITS), Kunming, China: 1–5.
  • Shang, Ronghua, Jiaming Wang, Licheng Jiao, R. Stolkin, Biao Hou, and Yangyang Li. 2018. “SAR Targets Classification Based on Deep Memory Convolution Neural Networks and Transfer Parameters.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (8): 2834–2846. https://doi.org/10.1109/JSTARS.2018.2836909.
  • Shi, Zhenwei, and Zhengxia Zou. 2017. “Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?” IEEE Transactions on Geoscience and Remote Sensing 55:3623–3634. https://doi.org/10.1109/TGRS.2017.2677464.
  • Snell, Jake, Kevin Swersky, and Richard S. Zemel. 2017. “Prototypical Networks for Few-Shot Learning.” ArXiv abs/1703.05175.
  • Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. 2014. “CIDER: Consensus-Based Image Description Evaluation.” 2015 IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA: 4566–4575.
  • Wang, Binqiang, Xiaoqiang Lu, Xiangtao Zheng, and Xuelong Li. 2019. “Semantic Descriptions of High-Resolution Remote Sensing Images.” IEEE Geoscience and Remote Sensing Letters 16:1274–1278. https://doi.org/10.1109/LGRS.2019.2893772.
  • Yang, Qiaoqiao, Zihao Ni, and Pengxin Ren. 2022. “Meta Captioning: A Meta Learning Based Remote Sensing Image Captioning Framework.” ISPRS Journal of Photogrammetry and Remote Sensing 186: 190–200.
  • Zhang, Haopeng, Xingyu Zhang, Gang Meng, Chen Guo, and Zhi-guo Jiang. 2022. “Few-Shot Multi-Class Ship Detection in Remote Sensing Images Using Attention Feature Map and Multi-Relation Detector.” Remote Sensing 14:2790. https://doi.org/10.3390/rs14122790.
  • Zhang, Zhengyuan, Wenkai Zhang, Menglong Yan, Xin Gao, Kun Fu, and Xian Sun. 2022. “Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning.” IEEE Transactions on Geoscience and Remote Sensing 60:1–16.
  • Zhuang, Shuo, Pingping Wang, Gang Wang, Di Wang, Jinyong Chen, and Feng Gao. 2022. “Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer.” IEEE Geoscience and Remote Sensing Letters 19:1–5.