Research Article

FRIC: a framework for few-shot remote sensing image captioning

Article: 2337240 | Received 12 Jan 2024, Accepted 25 Mar 2024, Published online: 04 Apr 2024

ABSTRACT

The training of image captioning (IC) models requires a large number of caption-labeled samples, a requirement that is usually difficult to satisfy in actual remote sensing scenarios, and model performance suffers as a result of these few-shot problems. We describe the few-shot problems in remote sensing image captioning (RC) and design two research schemes. We then propose a few-shot RC framework, the few-shot remote sensing image captioning framework (FRIC). FRIC needs no additional samples and uses a simple base model; it aims to gain performance from split samples while reducing the negative effects of noise. Unlike previous works that use 100% of the samples to simulate few-shot scenarios, FRIC uses less than 1.0% of the data to simulate actual few-shot scenarios. While previous works focus on improving the encoder, FRIC focuses on optimizing the decoder with parameter ensemble, multi-model ensemble and self-distillation. FRIC can train a simple base model with limited caption-labeled samples to generate captions that meet human expectations. FRIC shows clear advantages over other methods when trained with only 0.8% of the samples of RC datasets; no previous work has trained an RC model with such a small amount of data. In addition, the effectiveness of the components in FRIC is verified with ablation experiments.

1. Introduction

Conventional remote sensing image interpretation tasks obtain image-level semantic information by processing tags or key targets, while image captioning (IC) is concerned with generating a sentence that describes the image. IC has received a great deal of attention in remote sensing, for example in disaster risk assessment (Liu et al. Citation2018) and image retrieval (Cheng et al. Citation2021). However, there are few-shot problems in RC: caption-labeled remote sensing samples are always scarce. The reasons for this scarcity are as follows. Obtaining remote sensing images of scenes containing specific targets is difficult due to various limitations, and RC requires the images to contain content rich enough to form informative descriptive sentences. In addition, the difficulty and cost of equipping each sample with caption labels are high. These factors limit the number of caption-labeled samples. The common RC datasets are the remote sensing image captioning dataset (RSICD) (Lu et al. Citation2017), the University of California Merced (UCM)-Captions dataset (Qu et al. Citation2016), and the Sydney-Captions dataset (Qu et al. Citation2016). RSICD contains only 10,921 images, UCM-Captions only 2100 images, and Sydney-Captions only 613 images, whereas the common natural-image dataset, the Microsoft common objects in context (COCO) dataset (Lin et al. Citation2014), contains 123k samples. RC models are therefore generally unable to receive sufficient training and often overfit. The scarcity of caption-labeled samples is even more severe in actual remote sensing scenarios, and it is hard to solve the few-shot problems by simply increasing the number of samples. Instead, we need better ways of using the limited samples, so that the RC model can be applied to few-shot remote sensing scenarios and acquire knowledge that transfers to new samples. This is the core requirement of few-shot learning, which aims to train a model to generalize limited supervised information to new samples or new tasks, thereby closing the gap between the model and the human.

For resource-constrained scenarios, we develop two research schemes for few-shot RC and then propose a few-shot RC framework, the few-shot remote sensing image captioning framework (FRIC). FRIC splits labeled remote sensing samples into unlabeled images and unpaired captions. The caption-labeled samples are divided into multiple non-overlapping parts and fed to multiple base models for training. The base models are ensembled into a powerful ‘self-teacher' model through parameter ensemble and multi-model ensemble. The module of training with split samples helps the models learn from unlabeled data, and measures are set to reduce the adverse impact of noise. We set up contrast experiments in two schemes and ablation experiments to prove the effectiveness of FRIC. The contributions of FRIC can be summarized as follows:

  • We have structured two research schemes concerning the simulation of few-shot RC scenarios and the resource constraints, which will benefit the consistency of subsequent research works.

  • We use less than 1.0% of the data to simulate actual few-shot scenarios, while previous works use 100%.

  • Without using any additional samples or models, FRIC splits the limited caption-labeled remote sensing samples into unlabeled images and unpaired captions as supplemental samples and optimizes the decoding flow.

  • The outstanding performance of FRIC in few-shot RC is demonstrated in contrast experiments, and the effectiveness of components is verified in ablation experiments. FRIC significantly reduces the sample dependence of the model and demonstrates robustness and generalizability.

2. Related work

2.1. Remote sensing IC

There are three different paradigms for RC methods: the template paradigm, the retrieval paradigm and the encoder–decoder paradigm. Shi and Zou (Citation2017) first proposed an RC method using the template paradigm; this type of method can generate accurate sentences but lacks flexibility. The retrieval paradigm relies on large image-caption databases; Wang et al. (Citation2019) used the Mahalanobis matrix to retrieve the captions closest to the target image. Methods using the encoder–decoder paradigm are more flexible. Zhang et al. (Citation2022) fused global and local visual features from remote sensing images and filtered redundant feature components to greatly improve the relevance of the generated captions. Chen et al. (Citation2022) proposed a pure transformer architecture for RC with powerful performance; Liu, Zhao, and Shi (Citation2022) proposed a multilayer aggregated transformer to extract features and generate sufficient captions; and Zhuang et al. (Citation2022) proposed a transformer encoder–decoder with grid features to improve RC performance. Du et al. (Citation2023) designed a transformer equipped with deformable attention for RC and achieved higher accuracy. Liu et al. (Citation2023) introduced a pretrained large language model into RC and achieved very good results. Liu et al. (Citation2022) proposed a dual-branch transformer encoder to improve the feature discrimination capacity and showed significant advantages.

Some works have noticed the overfitting problems or the high demand for samples in RC. Li et al. (Citation2020) explored the overfitting problem caused by using cross-entropy loss and presented a truncated cross-entropy (TCE) loss to mitigate it. Considering that training the encoder–decoder method requires many samples, Hoxha and Melgani (Citation2022) used support vector machines (SVMs) as the decoder to model the dependencies of existing words and obtained better results than the vanilla encoder–decoder method. Yang, Ni, and Ren (Citation2022) proposed a meta-learning-based RC framework, which improves the encoder to reduce the need for caption-labeled data. These works only address overfitting in the encoder and neglect the decoder. In addition, they did not further reduce the sample size, leaving a gap with actual few-shot scenarios.

2.2. Few-shot learning

Few-shot learning was initially applied in classification tasks, and most recent advances come from meta-learning. Meta-learning can be divided into three categories: optimization-based, metric-based and model-based. Optimization-based methods learn a prior that allows the model to learn quickly from limited new samples. Model-agnostic meta-learning (MAML) was proposed by Finn, Abbeel, and Levine (Citation2017) to find an initialization that facilitates model fine-tuning, and out-of-distribution (OOD)-MAML (Jeong and Kim Citation2020) improves upon it. Metric-based methods learn a feature space in which the similarity between samples and labels is calculated. ProtoNet (Snell, Swersky, and Zemel Citation2017) used prototypes to represent data and classified according to distance; Lyu and Wang (Citation2023) proposed compositional prototypical networks to improve the reusability of features. Model-based methods construct an internal state of the model that adapts to the input sample. Meta networks (Munkhdalai and Yu Citation2017) are a representative work containing a base learner and a meta learner: the base learner helps to complete the task, and the meta learner calculates weights based on the information provided by the base learner, allowing the base learner to quickly generalize across tasks.

Few-shot learning has a variety of applications in remote sensing. Few-shot learning in scene classification has received a lot of attention. Many works are designed for optical images (Li et al. Citation2022) and synthetic aperture radar (SAR) images (Shang et al. Citation2018). In few-shot semantic segmentation, the main challenge is to efficiently extract features to discriminate between multiple classes. Most existing works are on hyperspectral images (Kemker, Salvaggio, and Kanan Citation2018). In few-shot object detection, the main challenge is to overcome the overfitting and obtain discriminative features that are useful for multiscale problems, especially in optical images (Zhang et al. Citation2022).

3. Research schemes

The first thing to discuss is simulating few-shot RC scenarios.

Used in many few-shot learning works, the N-way K-shot paradigm is task-based: in each training task, the model sees only 1 or 5 samples per class, but throughout training, samples are randomly drawn from the training set to construct a rich set of tasks. This requires a training set of a certain scale; otherwise, rich tasks cannot be constructed. The RC datasets and actual scenarios cannot provide sufficient samples to implement it, and training a multimodal model on many tasks would impose a significant computational burden. Therefore, we do not use the N-way K-shot paradigm but instead the percentage sampling of Chen, Jiang, and Zhao (Citation2021). Yang, Ni, and Ren (Citation2022) also aimed to train RC models with limited caption-labeled data; the researchers considered the three RC datasets relatively small compared to natural datasets, so they directly used 100% of each dataset for research. We think this cannot simulate few-shot scenarios well, so we adopt percentage sampling to simulate few-shot RC scenarios: instead of using all samples in the training set, only a fraction of the samples is used for training. We design two schemes to research few-shot RC. The sampling percentage needs to be no more than 1.0%; we take 0.5%, 0.8%, and 1.0% as the sampling percentages.
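As a concrete illustration, the percentage sampling described above can be implemented in a few lines of Python. The sketch below assumes the training split is available as a list of (image, caption) pairs; the dataset size and file names are illustrative only.

```python
import random

def percentage_sample(train_pairs, percent, seed=0):
    """Sample a fixed percentage of caption-labeled pairs to simulate a
    few-shot RC scenario (e.g. 0.5%, 0.8% or 1.0% of the training set)."""
    rng = random.Random(seed)  # fixed seed keeps the few-shot subset reproducible
    k = max(1, round(len(train_pairs) * percent / 100.0))
    return rng.sample(train_pairs, k)

# Hypothetical training split of (image path, caption) pairs.
train_pairs = [(f"img_{i}.tif", f"caption {i}") for i in range(8734)]
few_shot_pairs = percentage_sample(train_pairs, percent=0.8)
print(len(few_shot_pairs))  # about 70 pairs at a 0.8% sampling percentage
```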

In the first scheme, we take percentage sampling to make the few-shot scenarios more realistic and rigorous. It focuses on forcing the model to work well with sparse caption-labeled samples. The models will be evaluated through existing metrics. In the second scheme, the RC model is also trained with less than 1.0% of the RC datasets, while the test samples are raw remote sensing samples without ground truth captions. We need to design new metrics that do not use ground truth captions. The second scheme focuses more on training a general model. The expected model can apply the knowledge from the limited samples to describe the raw samples, just like a human. The generated captions need to be subjectively assessed. The second scheme is closer to the actual scenarios. These two schemes, respectively, reflect two goals in few-shot learning: the first is to obtain sufficient performance from limited samples to reduce the demand for a large number of labeled samples, and the second is to learn from limited samples and apply to new samples as humans.

The reason for introducing the second scheme is the particularity of the few-shot RC task. It is a cross-modal task with the goal of generating captions that accurately describe key information in raw remote sensing samples. The evaluation of the RC model cannot be carried out according to the methods in computer vision (CV) or natural language processing (NLP). In addition, the existence of the few-shot problem requires attention to whether the captions generated for unfamiliar samples meet the requirements. The scores of existing metrics cannot directly reflect whether the generated captions contain correct information, especially in few-shot scenarios. However, raw samples do not have ground truth captions, and the calculation of existing evaluation metrics must use ground truth captions. New metrics need to be designed.

The use of additional samples and models pre-trained on other remote sensing samples should be restricted. Improving the efficiency of using the model on the limited samples as well as the mobility and adaptability of the model are essential to consider. Additional samples and models can cause interference in few-shot learning studies.

4. Method

The few-shot problem often arises during sample collection, let alone in the later process of labeling with descriptive captions. As a direct result, it is not possible in RC to additionally introduce a large number of unpaired images to solve the few-shot problem, as in Chen, Jiang, and Zhao (Citation2021). Therefore, FRIC does not introduce new samples; it splits the existing limited paired samples into unlabeled remote sensing images and unpaired captions and uses them as supplementary training samples. We adopt various measures during training to maximize the model's utilization of the limited samples. We provide an algorithmic flowchart to show the implementation of FRIC (Figure 1).

Figure 1. Algorithmic flowchart of FRIC.


4.1. Base model

FRIC uses ResNet and long short-term memory (LSTM) as the encoder and decoder, respectively. Given a remote sensing image, we feed it to the ResNet encoder, which encodes the visual features of the image into an image embedding. The quality of the ResNet features directly determines the key information contained in the image embedding. The image embedding is then input into the LSTM-based decoder. ResNet and LSTM are still favored by researchers today, and the community continues to update pre-trained models. They can be replaced with more complex and powerful models to build stronger RC models, and many researchers have done so. Choosing ResNet and LSTM keeps FRIC orthogonal to these works and better demonstrates its own effectiveness.
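A minimal sketch of such a CNN-LSTM base model is shown below, assuming a ResNet-152 encoder and a single-layer LSTM decoder trained with teacher forcing. The vocabulary size, layer widths and the `weights` argument are illustrative and may need adjusting to the torchvision version in use; this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchvision

class BaseCaptioner(nn.Module):
    """Sketch of the base model: ResNet image encoder + LSTM caption decoder."""
    def __init__(self, vocab_size, embed_size=512, hidden_size=512):
        super().__init__()
        backbone = torchvision.models.resnet152(weights=None)  # pretrained weights optional
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        self.enc_proj = nn.Linear(backbone.fc.in_features, embed_size)
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.decoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)        # (B, 2048) image features
        img_embed = self.enc_proj(feats).unsqueeze(1)  # image embedding as the first step
        word_embed = self.embed(captions[:, :-1])      # teacher forcing on caption words
        states, _ = self.decoder(torch.cat([img_embed, word_embed], dim=1))
        return self.out(states)                        # per-step word logits

model = BaseCaptioner(vocab_size=368)  # 368 words, as in UCM-Captions
logits = model(torch.randn(2, 3, 256, 256), torch.randint(0, 368, (2, 12)))
```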

4.2. Self-ensemble of models

We modify the ensemble method to integrate the parameters of the model at different time points, achieving a self-ensemble (S-es) of models; the model with the integrated parameters is the S-es model. A conventional output ensemble introduces noise and errors during calculation and is easily affected by them. In a parameter ensemble, performance is transferred between models through parameters, without computing over the noisy outputs of multiple models, which reduces the propagation of errors within the ensembled model; calculation with integrated parameters can therefore reduce the effect of noise. The S-es model is continuously updated over time. To avoid the collapse of the S-es model during training, a parameter difference between the models at adjacent time points must be guaranteed. We introduce the exponential moving average (EMA) to obtain the parameters of the k-th generation S-es model:

$$\bar{\theta}_k = \alpha\,\bar{\theta}_{k-1} + (1-\alpha)\,\theta_k \tag{1}$$

where $\alpha$ is the decay rate, taken in $[0,1]$, and $k$ represents the time step of training.
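A sketch of the EMA parameter ensemble in Equation (1) is given below; the teacher holds the integrated parameters $\bar{\theta}_k$ and the student provides the current parameters $\theta_k$, with the decay rate following the setting α = 0.99 reported in Section 5.3.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Parameter ensemble via EMA (Eq. 1):
    theta_bar_k = alpha * theta_bar_{k-1} + (1 - alpha) * theta_k."""
    for p_bar, p in zip(teacher.parameters(), student.parameters()):
        p_bar.mul_(alpha).add_(p, alpha=1.0 - alpha)
    for b_bar, b in zip(teacher.buffers(), student.buffers()):
        b_bar.copy_(b)  # keep running statistics (e.g. BatchNorm) in sync
```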

To build a more robust S-es model, we add a multi-model ensemble. We train N different base models $\theta_k^n$ (n = 1, 2, ..., N) with N non-overlapping sets of caption-labeled samples and integrate their outputs as the final output of the model. At this point, the model $\Theta_k$ is composed of multiple models. We use the parameter ensemble to obtain N models $\bar{\theta}_k^n$, which are fed separately with non-overlapping samples. The schematic diagram of S-es is shown in Figure 2.

Figure 2. S-es model includes parameter ensemble and multi-model ensemble. Parameter ensemble updates model parameters through EMA. The multi-model ensemble is to integrate base models with different parameters.


The parameter ensemble and multi-model ensemble improve FRIC from the perspective of learning efficiency. We integrate the S-es model into self-distillation (S-dl) as the teacher model to guide the base models in each iteration round.

4.3. Module of training with split samples

To mine more supervision, we design a new perspective on using the limited samples: we split caption-labeled samples into unlabeled images and unpaired captions as supplementary samples. In the training process, the paired samples follow the process of the S-es model in the previous section: images are input into the k-th generation model $\Theta_k$ to generate captions for comparison with the ground truth captions y. The unlabeled images x and unpaired captions y are sent to the k-th generation S-es model $\bar{\Theta}_k$, whose performance is then used to train $\Theta_{k+1}$. A minimal sketch of the sample split is shown below.
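The sample split and the division into N non-overlapping parts can be sketched as follows; the pair format is the same hypothetical (image, caption) tuple used earlier.

```python
def split_paired_samples(paired):
    """Split caption-labeled pairs into unlabeled images and unpaired captions,
    which FRIC reuses as supplementary training samples."""
    unlabeled_images = [img for img, _ in paired]
    unpaired_captions = [cap for _, cap in paired]
    return unlabeled_images, unpaired_captions

def split_into_parts(samples, n_models=3):
    """Divide samples into N non-overlapping parts, one per base model."""
    return [samples[i::n_models] for i in range(n_models)]
```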

4.3.1. Training with unlabeled remote sensing images

We input unlabeled images x to $\bar{\Theta}_k$ and use beam search to obtain a series of pseudo-captions. Reducing the differences between these pseudo-captions and the captions generated by the model $\Theta_k$ is taken as the training goal, in the hope that the new model will absorb the additional knowledge contained in the pseudo-captions.

However, using unlabeled images without ground truth captions introduces noise. The S-es model is not completely reliable during training, and the pseudo-captions it generates from unlabeled images contain noise and are less reliable than ground truth captions. We therefore compute a confidence score for the pseudo-captions to control their influence during training:

$$\mathrm{Confidence}(W^t) = \frac{e^{\log\left(\prod_{i=1}^{N_t} p(w_i^t \mid w_1^t w_2^t \cdots w_{i-1}^t)\right)}}{\sum_{t=1}^{T} e^{\log\left(\prod_{i=1}^{N_t} p(w_i^t \mid w_1^t w_2^t \cdots w_{i-1}^t)\right)}} \tag{2}$$

where $W^t$ is one of the T captions obtained by beam search, $w_i^t$ is its i-th word, $p(w_i^t \mid w_1^t w_2^t \cdots w_{i-1}^t)$ is the probability with which the model generates the i-th word, and $N_t$ is the sentence length. The more noise a caption contains, the lower its confidence score, and the less weight it receives in training.
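Because the sentence probability in Equation (2) is an exponentiated sum of word log-probabilities normalized over the T beam candidates, the confidence score reduces to a softmax over sentence log-probabilities. The sketch below follows that reading; the per-word log-probabilities are assumed to come from the beam search.

```python
import torch

def caption_confidence(logprobs_per_caption):
    """Confidence scores (Eq. 2) for T beam-search pseudo-captions: each caption's
    sentence probability, normalized over the beam. Inputs are lists of per-word
    log-probabilities log p(w_i | w_1 ... w_{i-1})."""
    sentence_logprob = torch.tensor([sum(lp) for lp in logprobs_per_caption])
    return torch.softmax(sentence_logprob, dim=0)  # exp(sum log p) / sum_t exp(sum log p)

# Hypothetical beam of T = 3 captions with different lengths.
conf = caption_confidence([[-0.2, -0.5, -0.1], [-1.0, -0.8], [-2.0, -1.5, -0.7, -0.3]])
print(conf)  # noisier (less probable) captions receive lower confidence
```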

We input N groups of different unlabeled samples $x_n$, obtained by dividing the unlabeled images x, to the N models, where n = 1, 2, ..., N. Taking model $\bar{\theta}_k^n$ as an example, we input $x_n$ into $\bar{\theta}_k^n$ and obtain T captions $W_n^t$ through beam search, where t = 1, 2, ..., T. At the same time, we also feed $x_n$ to the N different base models of the model $\Theta_k$ to get a series of output captions $f(x_n \mid \theta_k^n)$. We calculate the mean square error between each caption $W_n^t$ and $f(x_n \mid \theta_k^n)$, multiply it by the confidence score of $W_n^t$, and then sum over all terms. FRIC uses all N models in each iteration. Self-distillation for few-shot image captioning (SD-FSIC) uses 1% of the Microsoft COCO dataset as paired data and the rest as unlabeled data; with such abundant unlabeled data, SD-FSIC trains the model adequately. In few-shot remote sensing scenarios, it is difficult to collect such abundant unlabeled data. Therefore, we feed the available paired samples and the split unlabeled data to all models in each training iteration. The loss function $L_x$ from training with unlabeled images is

$$L_x = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T} \mathrm{confidence}(W_n^t)\,\mathrm{MSE}\big(W_n^t, f(x_n \mid \theta_k^n)\big) \tag{3}$$

where MSE denotes the mean square error and k represents the time step of S-dl. Training by continuously decreasing $L_x$ induces the model $\Theta_k$ to follow $\bar{\Theta}_k$ and learn additional knowledge from the unlabeled images. Each iteration of S-dl training updates the parameters of the model $\Theta_k$, while $\bar{\Theta}_k$ updates its parameters via EMA. A schematic diagram of training with unlabeled images is shown in Figure 3.
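A sketch of the loss in Equation (3) is given below, assuming that the teacher's pseudo-captions and the base model's outputs are available as per-word probability tensors of matching shape; the exact tensor representation is not specified in the text, so this is only one possible reading.

```python
import torch
import torch.nn.functional as F

def unlabeled_image_loss(pseudo_outputs, student_outputs, confidence):
    """Loss L_x (Eq. 3) for one base model: confidence-weighted MSE between the
    teacher's T beam pseudo-captions and the base model's outputs for the same
    unlabeled images; in FRIC this is further averaged over the N base models."""
    loss = torch.tensor(0.0)
    for t, (pseudo, student) in enumerate(zip(pseudo_outputs, student_outputs)):
        loss = loss + confidence[t] * F.mse_loss(student, pseudo.detach())
    return loss

# Hypothetical shapes: T = 3 captions, length 12, vocabulary of 368 words.
T, L, V = 3, 12, 368
pseudo = [torch.rand(L, V) for _ in range(T)]
student = [torch.rand(L, V) for _ in range(T)]
print(unlabeled_image_loss(pseudo, student, torch.tensor([0.5, 0.3, 0.2])))
```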

Figure 3. Schematic diagram of training with unlabeled remote sensing images. Training is achieved by comparing the confidence-corrected pseudo-captions with the output captions from base models.


4.3.2. Training with unpaired captions

To train IC models with unpaired captions, we extract pseudo-image features from the unpaired captions. We input the unpaired captions to the decoder $D_{\bar{\theta}_k^n}$ of $\bar{\theta}_k^n$. The structure of $D_{\bar{\theta}_k^n}$ is an LSTM, and its output at the last time step represents the pseudo-image features f(y) corresponding to the unpaired captions.

It is necessary to consider the reliability of the pseudo-features that the S-es model extracts from the unpaired captions y. Unpaired captions are not generated by the S-es model but are split from paired samples, so evaluating their reliability is pointless. It is also undesirable to evaluate the reliability of the captions generated by the base models from pseudo-image features: that reliability reflects both the performance of the S-es model and that of the base model, so it cannot be used to improve the guidance of the S-es model. Moreover, image data are not as concise as text data, and it is difficult to measure the reliability of pseudo-image features directly and conveniently. Here, we use gradient descent (GD) to optimize the pseudo-image features before inputting them to the base models. The target pseudo-image features f(y) optimized by GD are

$$f(y) = \arg\min_{F(y)} \sum_{n=1}^{N} \mathrm{CE}\big(y, D_{\bar{\theta}_k^n}(F(y))\big) \tag{4}$$

where CE represents the cross-entropy loss. As with training on unlabeled images, we still use all N models in each training iteration. We feed the pseudo-image features f(y) to the decoders $D_{\theta_k^n}$ in the N models $\theta_k^n$ of the base model $\Theta_k$ to obtain N output captions $D_{\theta_k^n}(f(y))$, where n = 1, 2, ..., N. A schematic diagram of training with unpaired captions is shown in Figure 4.
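The GD refinement in Equation (4) can be sketched as follows. The teacher decoders are assumed to be callables mapping (features, captions) to per-word logits, which is a hypothetical interface rather than the authors' exact one; the feature dimension and step count follow the settings reported in Section 5.3.

```python
import torch
import torch.nn.functional as F

def optimize_pseudo_features(teacher_decoders, captions, feat_dim=512, steps=100, lr=0.1):
    """Refine pseudo-image features F(y) by gradient descent (Eq. 4): minimize the
    summed cross-entropy of the N teacher decoders forced to reproduce the
    unpaired captions y (a LongTensor of token ids with shape (B, L))."""
    feats = torch.zeros(captions.size(0), feat_dim, requires_grad=True)
    opt = torch.optim.SGD([feats], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(
            F.cross_entropy(dec(feats, captions).flatten(0, 1), captions.flatten())
            for dec in teacher_decoders
        )
        loss.backward()
        opt.step()
    return feats.detach()
```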

Figure 4. Schematic diagram of training with unpaired captions. The pseudo image features optimized by GD are fed into the base models to facilitate training.


The difference between these captions and the unpaired captions y represents the disparity between the base model and the S-es model; it is also the loss $L_y$ of using the unpaired captions y to assist model training:

$$L_y = \frac{1}{N}\sum_{n=1}^{N} \mathrm{CE}\big(y, D_{\theta_k^n}(f(y))\big) \tag{5}$$
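Under the same hypothetical decoder interface as above, the loss in Equation (5) averages the cross-entropy of the N base-model decoders on the optimized pseudo-image features:

```python
import torch.nn.functional as F

def unpaired_caption_loss(student_decoders, pseudo_feats, captions):
    """Loss L_y (Eq. 5): mean cross-entropy between the unpaired captions y and the
    captions the N base-model decoders produce from the refined pseudo-features."""
    losses = [
        F.cross_entropy(dec(pseudo_feats, captions).flatten(0, 1), captions.flatten())
        for dec in student_decoders
    ]
    return sum(losses) / len(losses)
```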

4.3.3. Training with paired samples

It is necessary to use the few paired remote sensing samples (x, y) to train the model $\Theta_k$; the performance of $\Theta_k$ is the cornerstone of the performance improvement obtained through S-es and S-dl. We divide the paired samples into N non-overlapping parts and send them to the N base models $\theta_k^n$. The objective is to minimize the CE loss $L_{(x,y)}$ between the ground truth captions and the captions generated by $\theta_k^n$:

$$L_{(x,y)} = \frac{1}{N}\sum_{n=1}^{N} \mathrm{CE}\big(y_n, f(x_n \mid \theta_k^n)\big) \tag{6}$$

To sum up, the paired samples (x, y), unlabeled images x and unpaired captions y are all used to train $\theta_k^n$ in $\Theta_k$. The total loss L in this process is

$$L = L_{(x,y)} + \lambda_x L_x + \lambda_y L_y = \frac{1}{N}\sum_{n=1}^{N} \mathrm{CE}\big(y_n, f(x_n \mid \theta_k^n)\big) + \frac{\lambda_x}{N}\sum_{n=1}^{N}\sum_{t=1}^{T} \mathrm{confidence}(W_n^t)\,\mathrm{MSE}\big(W_n^t, f(x_n \mid \theta_k^n)\big) + \frac{\lambda_y}{N}\sum_{n=1}^{N} \mathrm{CE}\big(y, D_{\theta_k^n}(f(y))\big) \tag{7}$$

where $\lambda_x$ and $\lambda_y$ are hyperparameters weighting the loss $L_x$ from the unlabeled remote sensing images x and the loss $L_y$ from the unpaired captions y in the total loss L, respectively. The base model $\theta_k^n$ is updated by continuously reducing the loss L to its minimum. We provide an overall diagram to show the data flows (Figure 5).
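Combining the three terms, one training step reduces to the weighted sum in Equation (7). A minimal sketch with the weights from Section 5.3 (λx = 0.1, λy = 1) follows; the loss values are illustrative stand-ins for Equations (6), (3) and (5).

```python
import torch

def fric_total_loss(l_pair, l_x, l_y, lambda_x=0.1, lambda_y=1.0):
    """Total objective (Eq. 7): L = L_(x,y) + lambda_x * L_x + lambda_y * L_y."""
    return l_pair + lambda_x * l_x + lambda_y * l_y

# Illustrative values only; in training these come from Eqs. (6), (3) and (5).
loss = fric_total_loss(torch.tensor(2.1), torch.tensor(0.4), torch.tensor(1.3))
print(loss)  # a real step would call loss.backward(), an optimizer step and the EMA update
```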

Figure 5. Overall diagram of FRIC, containing three data flows.


5. Experimental setup

5.1. Datasets

There are three commonly used datasets for RC. UCM-Captions contains 2100 samples from 21 remote sensing scenes, each with a resolution of 256 × 256 pixels; the developers use 368 different words to describe the samples. Sydney-Captions contains 613 remote sensing samples of 7 remote sensing scenes, each with a resolution of 500 × 500 pixels, and uses 237 different words in its captions. Sydney-Captions has the highest image quality and more detailed captions, but the number of samples is so small that it presents the highest learning difficulty. RSICD collects 10,921 samples from 30 remote sensing scenes; the resolution of all samples is set to 224 × 224 pixels, and the dataset uses 3325 different words to describe them. It has the most exhaustive ground truth captions of the three datasets. RSICD contains the most samples, but its captions are very complex, which increases the difficulty. Each of these datasets provides five captions for each sample.

The maritime satellite image dataset (MASATI) (Gallego, Pertusa, and Gil Citation2018) was compiled from optical remote sensing images obtained from Bing Maps. Each image is in portable network graphics (PNG) format with a resolution of 512 × 512 pixels. MASATI has 7389 images and is usually used for object recognition and object detection.

5.2. Metrics

In the experiments of the first scheme, we used the same metrics as existing RC works. Bilingual evaluation understudy (BLEU) was proposed in 2002 to evaluate machine translation systems (Papineni et al. Citation2002); BLEU-n (n = 1, 2, 3, 4) is calculated according to the n-gram matching rule. Recall-oriented understudy for gisting evaluation (ROUGE_L) is an evaluation metric proposed by Lin (Citation2004). Consensus-based image description evaluation (CIDEr-D) is a metric proposed for image description problems (Vedantam et al. Citation2014); its designers introduced a Gaussian penalty and limited the score of words appearing multiple times. Metric for evaluation of translation with explicit ordering (METEOR) (Banerjee and Lavie Citation2005) is modified from BLEU, and semantic propositional image caption evaluation (SPICE) is a metric proposed by Anderson et al. (Citation2016) for evaluating caption quality. The calculation of these metrics requires the ground truth captions of the test samples.

In the experiments of the second scheme, the above metrics cannot be used because the test samples have no ground truth captions. We instead use two metrics that evaluate the performance of the RC model based on the generated sentences themselves.

Perplexity can be used to evaluate the quality of sentences and to measure a sentence generation model. When perplexity is used to evaluate language models, a given dataset is used to assess the ability of different models to generate its data; different models obtain different results for the same input, and the better the model, the smaller its perplexity on the dataset. A generation model can be seen as generating words one by one according to different probabilities. Perplexity computes the probability of a sentence word by word to judge the confidence of the model when generating it: the model's conditional probability of the next word given all previously generated words is computed at each position, these conditional probabilities are multiplied together, and the N-th root of the inverse product gives the perplexity of the whole sentence:

$$\mathrm{Perplexity}(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(w_i \mid w_1 w_2 \cdots w_{i-1})}} \tag{8}$$

where W represents the sentence, and wi (i = 1,2 … N) represents the word. N represents the length of the sentence. P and p represent probabilities. The smaller the perplexity, the better the modeling ability of the model.
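A sketch of the perplexity computation in Equation (8), taking as input the per-word conditional log-probabilities reported by the captioning model:

```python
import math

def perplexity(word_logprobs):
    """Sentence perplexity (Eq. 8): the N-th root of the inverse sentence probability,
    computed from log p(w_i | w_1 ... w_{i-1}); lower is better."""
    n = len(word_logprobs)
    return math.exp(-sum(word_logprobs) / n)

print(perplexity([math.log(0.5), math.log(0.25), math.log(0.5)]))  # about 2.52
```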

Entropy measures the uncertainty of random variables or of a system of variables. The signal (variable) emitted by an information source is uncertain, and we can measure the uncertainty of a variable according to its probability of occurrence; the uncertainty of the system is determined by all variables in the system. We regard the model as the information source and each generated word wi (i = 1, 2, ..., N) as a variable, so the sentence W can be regarded as a discrete random variable system, and its entropy is

$$\mathrm{Entropy}(W) = -\sum_{i=1}^{N} p(w_i)\,\log p(w_i) \tag{9}$$

where $p(w_i)$ is the ratio of occurrences of word $w_i$ in the whole sentence W. The smaller the entropy, the smaller the uncertainty of the system and the better its stability.
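A matching sketch for Equation (9), where the word probabilities are relative frequencies within the generated sentence:

```python
import math
from collections import Counter

def sentence_entropy(words):
    """Entropy (Eq. 9) of a generated sentence; p(w_i) is the relative frequency of
    word w_i in the sentence, and lower entropy indicates a more stable output."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

print(sentence_entropy("many buildings are around a playground".split()))  # ln(6) ≈ 1.79
```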

5.3. Hyperparameter settings

We load ResNet152 from torchvision as the encoder for FRIC. The encoding size and the dimension of the LSTM state are set to 512. The decay rate α in the S-es model is 0.99, and the number of base models in the multi-model ensemble is 3. The number of iterations in S-dl is set to 100. We take λx = 0.1 and λy = 1. The number of training epochs is 130, the batch size is 50, and the optimizer is Adam.

6. Experiments in the first scheme

We take percentage sampling to make the experimental environment more rigorous. The sampling percentages range from 0.5% to 80%. When FRIC can use more samples to train the model, it is foreseeable that the model will improve; we design three groups of contrast experiments to verify this.

As presented in Tables 1–3, when we gradually increase the sampling percentage from 0.5% to 80%, the model gradually becomes stronger. This shows that FRIC can adapt to different data scales rather than overfitting to a certain low sampling percentage. We visualize the CIDEr-D (C-D) scores of the model with different sampling percentages in Figure 6, which shows how varying data quantities impact model performance on the three datasets.

Figure 6. The trend of C-D score changes in models with different sampling percentages.


Table 1. Comparison results of FRIC trained with different sampling percentages on UCM-Captions.

Table 2. Comparison results of FRIC trained with different sampling percentages on Sydney-Captions.

Table 3. Comparison results of FRIC trained with different sampling percentages on RSICD.

We also increase the sampling percentage to 100% and conduct three groups of experiments to compare FRIC with classical and recent RC methods. The multimodal recurrent neural network (mRNN) (Qu et al. Citation2016) uses a pretrained convolutional neural network (CNN) as the encoder and a recurrent neural network (RNN) as the decoder; hard-attention (Lu et al. Citation2017) adopts a CNN-LSTM structure; the TCE loss-based method (Li et al. Citation2020) adopts GoogLeNet + LSTM; SVM-D concatenation (CONC) takes an SVM as the decoder with VGG16 as the encoder; and the multilevel and contextual attention network (MLCA-Net) uses VGG16 as the encoder and LSTM as the decoder. Self-critical sequence training for image captioning (SCST) (Kandala et al. Citation2022) selects a transformer as the decoder. When the LSTM decoder in FRIC is replaced by a transformer, we obtain TR-FRIC. Comparison results from the experiments are shown in Tables 4–6.

Table 4. Comparison results of methods trained with a sampling percentage of 100% on UCM-Captions.

Table 5. Comparison results of methods trained with a sampling percentage of 100% on Sydney-Captions.

Table 6. Comparison results of methods trained with a sampling percentage of 100% on RSICD.

FRIC achieves outstanding performance when trained with the same number of samples as other methods. TR-FRIC scores higher than SCST, which also includes a transformer. TR-FRIC is better than FRIC in most experiments. While these methods use various new structures or special auxiliary models, FRIC uses only a simple base model. FRIC can effectively improve the efficiency of the model in utilizing samples.

Most importantly, we compare the performance of FRIC with other methods in the same few-shot scenario. We choose hard-attention, mRNN and SD-FSIC for comparison. Hard-attention and mRNN have good adaptability, and SD-FSIC is a few-shot IC method proposed for natural images with good universality and performance. SD-FSIC has been trained with a 0.8% sampling percentage to achieve RC training with limited samples; following the 0.8% setting is beneficial for comparing FRIC with it and for the subsequent development of few-shot RC. The comparison results are presented in Tables 7–9.

Table 7. Comparison results of methods trained with a sampling percentage of 0.8% on UCM-Captions.

Table 8. Comparison results of methods trained with a sampling percentage of 0.8% on Sydney-Captions.

Table 9. Comparison results of methods trained with a sampling percentage of 0.8% on RSICD.

We have marked the highest score for each metric in the tables. FRIC performs better than the other methods in the experiments, especially on the Sydney-Captions and RSICD datasets, where it shows obvious advantages. FRIC exhibits stronger performance than SD-FSIC and achieves advantages on various metrics. Both are few-shot IC methods, and FRIC is more suitable for remote sensing scenarios than SD-FSIC, which indicates that the designs in FRIC targeting the characteristics of remote sensing scenarios are effective.

To further analyze the robustness of FRIC, particularly under different noise levels or variations in sample quality, we set up two comparison groups on the three datasets: the first group uses 20% paired samples, and the second group uses 20% paired samples with an additional 80% unpaired samples. We quantitatively compare the performance of the models trained in the two groups, as presented in Tables 10–12.

Table 10. Comparison results of training with or without additional unpaired samples on UCM-Captions dataset.

Table 11. Comparison results of training with or without additional unpaired samples on Sydney-Captions dataset.

Table 12. Comparison results of training with or without additional unpaired samples on RSICD dataset.

FRIC obtains performance improvements from the additional unlabeled samples on all three datasets. Even though more unlabeled samples bring more noise and the noise level increases, FRIC can reduce the negative impact of the noise and ensure that the model keeps learning. FRIC therefore has good applicability in real-world scenarios.

7. Experiments in the second scheme

We choose 0.8% as the sampling percentage to obtain training samples from the three datasets. The second scheme focuses on adaptability to raw samples, so we randomly select 200 images from the MASATI dataset as raw samples without ground truth captions. We continue the comparison between FRIC and SD-FSIC in this scheme, divided into a quantitative comparison and a qualitative comparison.

7.1. Quantitative comparison

When conducting the quantitative comparison, we use the metrics designed in Section 5.2, perplexity and entropy, to measure the performance of the models. The 200 remote sensing images are uncorrelated, so we use the average of the perplexity scores (mean perplexity) and the average of the entropy scores (mean entropy) as the final scores. In addition, we multiply mean perplexity and mean entropy and use the product as a composite metric. We compare FRIC and SD-FSIC on the three datasets. The quantitative results are presented in Tables 13–15; the lowest score in each table is in bold.

Table 13. Quantitative results in the second scheme on UCM-Captions.

Table 14. Quantitative results in the second scheme on Sydney-Captions.

Table 15. Quantitative results in the second scheme on RSICD.

The scores of FRIC in the three tables are lower, and the lowest scores in all three tables are obtained by FRIC. This means that our designs enable the model to generate a definite caption more stably when facing unfamiliar samples, instead of being confused about which word should be generated next. FRIC has better model stability than SD-FSIC.

7.2. Qualitative comparison

In addition to the quantitative comparison of metric scores, we also qualitatively compare the generated captions to verify the model's generalization ability on raw samples. The qualitative comparison results are shown in Figure 7. The blue words represent correct information that matches the image content, while the green words represent incorrect information. For the first image, the FRIC model trained on the RSICD dataset accurately described the category and shape of the key targets, while the caption output by the SD-FSIC model contained significant errors. For the second image, the FRIC model described most of the key information with only a few errors, whereas the SD-FSIC model output a large amount of erroneous information and lacked descriptions of most key information. In this group of experiments, FRIC showed better performance than SD-FSIC, indicating that FRIC is more suitable for remote sensing images. FRIC can function normally even in scenarios with extremely scarce samples, ensuring that the model outputs correct and semantically rich captions.

Figure 7. Captions generated by FRIC and SD-FSIC for samples.


8. Ablation experiments

After describing the designs in FRIC, some questions naturally arise. First, splitting the limited samples into multiple parts to train models may lead to poorer performance, and it is unknown whether the multi-model ensemble can yield a better model. Second, as the basis of S-dl, the S-es model also receives unlabeled samples containing noise. Although we have designed measures to eliminate adverse effects when unlabeled images and unpaired captions are input to multiple base models, noise will inevitably be fed into each model and eventually into the ensemble; it is unclear whether the multi-model ensemble suppresses or amplifies this noise. Third, the paired samples have already been used to train the base model, and it is uncertain whether the module of training with split samples can extract extra useful knowledge from them. Addressing these questions is instructive for designing a more concise model, so we implement a series of ablation experiments. The parameter ensemble is the core of the self-ensemble model, as it updates the parameters of the model. If the parameter ensemble were removed, the self-ensemble model would fail to form a difference with the target model, the subsequent multi-model ensemble and S-dl modules would lose their meaning, and FRIC would degenerate into the base model. However, the gap between FRIC and the base model is caused by multiple modules, so we do not ablate the parameter ensemble in the ablation experiments.

8.1. Ablation of multi-model ensemble

To explore the effectiveness of the multi-model ensemble, we set up two groups of contrast experiments. In one group, the full FRIC is used and 2.4% of the training samples are divided among three different models for non-overlapping training. In the other group, FRIC without the multi-model ensemble, named S-FRIC, is trained with all 2.4% of the samples. The results of the contrast experiments are presented in Tables 16–18. Before the experiments, we predicted that the first setting would perform worse than the second, because dividing the training samples makes the samples assigned to each model scarcer. Surprisingly, FRIC performs better than S-FRIC. This means that the multi-model ensemble compensates for the loss caused by the significant reduction in per-model training data: non-overlapping training ensures the variability between the ensembled models and hence the effectiveness of the ensemble, which is meaningful for fully learning the limited samples. The above discussion addresses the first question. To further explore the effect of this design, we also train S-FRIC with 0.8% of the samples; the FRIC trained with 2.4% of the samples can be regarded as an ensemble of three different S-FRICs each trained with 0.8%.

Table 16. Ablation results of the multi-model ensemble on UCM-Captions.

Table 17. Ablation results of the multi-model ensemble on Sydney-Captions.

Table 18. Ablation results of the multi-model ensemble on RSICD.

The amount of noise contained in the unpaired data obtained by splitting 0.8% or 2.4% of the paired data is fixed. FRIC 2.4% provides a greater improvement over S-FRIC 0.8% than S-FRIC 2.4% does. This additional performance improvement proves that the multi-model ensemble can suppress noise and protect the stability of the model in few-shot scenarios. Thus, the second question, concerning the multi-model ensemble, has been answered.

8.2. Ablation of self-distillation and the module of training with split samples

S-dl is used in FRIC to obtain performance from unlabeled samples; its role directly reflects the performance improvement the model gains from the unpaired samples. We set up four comparison methods in addition to FRIC. The first, ‘No un caption', removes the module that uses unpaired captions; the second, ‘No un image', removes the module that uses unlabeled images; the third, ‘No S-dl', removes self-distillation entirely; and the fourth, the remote sensing image captioning framework (RIC), removes the whole module of training with split samples. The fifth is FRIC itself. The sampling percentage is 0.8%. The ablation results on the three datasets are given in Tables 19–21.

Table 19. Ablation results of the self-distillation on UCM-Captions.

Table 20. Ablation results of the self-distillation on Sydney-Captions.

Table 21. Ablation results of the self-distillation on RSICD.

FRIC performs better than ‘No S-dl', and the performance improvement is obvious. The performance of ‘No un caption' and ‘No un image' shows that even though the base model has already learned from the paired samples, S-dl can still obtain extra performance from the unlabeled images or unpaired captions; the third question has thus been resolved. Comparing the scores on the three datasets, the effect of FRIC is most evident on Sydney-Captions, which has the fewest samples. ‘No un caption' performed worse than ‘No un image' in all three sets of experiments, suggesting that FRIC makes fuller use of the captions. It also shows that, if FRIC is expected to play a better role, detailed and accurate captions must be provided.

9. Discussions

In the first scheme, FRIC achieved higher scores than previous works, which means that FRIC can reduce the demand for caption-labeled samples. The higher scores of TR-FRIC compared to SCST (which also uses a transformer) prove the effectiveness of FRIC for complex structures. In the second scheme, FRIC correctly described raw samples, while SD-FSIC made mistakes, indicating that FRIC has better adaptability and can ensure the normal operation of few-shot models.

However, the model complexity of FRIC is higher. The complexity comes from the ensemble of multiple base models and the iterative optimization during training with labeled and unlabeled data. The training process includes three different parts, but they run in parallel, so the model complexity of FRIC is O(M + N), where M is the number of base models and N is the number of iterations; the base model does not include the ensemble of models and its complexity is only O(M). This cost is expected, as FRIC trades it for a reduced demand for labeled data. The training resources of FRIC are 3.5 times those of the base model, the training time is 3 times that of the base model, and the inference time is roughly the same.

10. Conclusions

Very little previous work has systematically defined and addressed the few-shot problem in RC. Although some works address overfitting due to insufficient samples, there is no direct discussion of how to train RC models in few-shot scenarios; they improve RC indirectly by improving scene classification, ignoring the few-shot problem in the process of decoding image features into sentences. Most works also do not further reduce the sample size, so the few-shot setting is not intuitively represented. We attempt to address these issues and develop a reasonable and sustainable way to study few-shot RC. We discuss the universality of the few-shot problem in RC and what few-shot learning should achieve. We design two research schemes, and the implementation of the second scheme is an important novelty of this work: previous works mostly relied on metric scores, while the second scheme mainly considers the captions generated by few-shot models, making the research on few-shot RC more reasonable and effective. We use less than 1% of the data to train the models; such a small amount of data has not been used in previous RC works.

Building on the above, we propose a framework named FRIC. Without the need for additional data or external models, FRIC improves the model by improving the use of the limited caption-labeled remote sensing samples: on the one hand, it improves learning efficiency; on the other hand, it adds new ways of using caption-labeled samples. The former is realized through S-es, including the parameter ensemble and multi-model ensemble; the latter splits the caption-labeled samples into unlabeled images and unpaired captions and obtains performance through the module of training with split samples. We then verify the performance of FRIC in the two schemes and compare FRIC with other methods, especially SD-FSIC. The experimental results show that FRIC is outstanding in few-shot scenarios, with obvious advantages over other methods when trained with 0.8% of the samples of the RC datasets. The ablation results verify the effectiveness of the components and answer the three questions raised above. Challenges remain: FRIC is easily affected by the quality of the labeled samples. It is possible to train few-shot RC models using unpaired captions from a wider range of sources, but this requires a greater ability to deal with noise. Training an IC model on natural datasets and then fine-tuning it on limited remote sensing samples could also be explored; however, the IC model incorporates data from two modalities, and its fine-tuning will be complex.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The UCM-Captions dataset, Sydney-Captions dataset and RSICD dataset can be found at https://github.com/201528014227051/RSICD_optima. The MASATI dataset can be found at https://www.iuii.ua.es/datasets/masati/index.html.

References

  • Anderson, Peter, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. “SPICE: Semantic Propositional Image Caption Evaluation.” Paper presented at the European conference on computer vision, Amsterdam, The Netherlands.
  • Banerjee, Satanjeev, and Alon Lavie. 2005. “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.” Paper presented at the IEEvaluation@ACL, AnnArbor, MI, USA, 11 June 2005.
  • Chen, Xianyu, Ming Jiang, and Qi Zhao. 2021. “Self-Distillation for Few-Shot Image Captioning.” 2021 IEEE winter conference on applications of computer vision (WACV):545–555.
  • Chen, Zihang, Junjue Wang, Ailong Ma, and Yanfei Zhong. 2022. “TypeFormer: Multiscale Transformer With Type Controller for Remote Sensing Image Caption.” IEEE Geoscience and Remote Sensing Letters 19:1–5.
  • Cheng, Qimin, Deqiao Gan, Peng Fu, Haiyan Huang, and Yuzhuo Zhou. 2021. “A Novel Ensemble Architecture of Residual Attention-Based Deep Metric Learning for Remote Sensing Image Retrieval.” Remote Sensing 13:3445. https://doi.org/10.3390/rs13173445.
  • Du, Runyan, Wei Cao, Wenkai Zhang, Guo Zhi, Xianchen Sun, Shuoke Li, and Jihao Li. 2023. “From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16:7704–7717. https://doi.org/10.1109/JSTARS.2023.3305889.
  • Finn, Chelsea, P. Abbeel, and Sergey Levine. 2017. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.” Paper presented at the international conference on machine learning, Sydney, Australia.
  • Gallego, Antonio Javier, A. Pertusa, and Pablo Gil. 2018. “Automatic Ship Classification from Optical Aerial Images with Convolutional Neural Networks.” Remote Sensing 10:511. https://doi.org/10.3390/rs10040511.
  • Hoxha, Genc, and Farid Melgani. 2022. “A Novel SVM-Based Decoder for Remote Sensing Image Captioning.” IEEE Transactions on Geoscience and Remote Sensing 60:1–14.
  • Jeong, Taewon, and Heeyoung Kim. 2020. “OOD-MAML: Meta-Learning for Few-Shot Out-of-Distribution Detection and Classification.” Paper presented at the neural information processing systems, Vancouver, Canada.
  • Kandala, Hitesh, Sudipan Saha, Biplab Banerjee, and Xiao Xiang Zhu. 2022. “Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning.” IEEE Geoscience and Remote Sensing Letters 19:1–5. https://doi.org/10.1109/LGRS.2022.3198234.
  • Kemker, Ronald, Carl Salvaggio, and Christopher Kanan. 2018. “Algorithms for Semantic Segmentation of Multispectral Remote Sensing Imagery Using Deep Learning.” ISPRS Journal of Photogrammetry and Remote Sensing 145: 60–77.
  • Li, Xiaomin, D. Shi, Xiaolei Diao, and Hao Xu. 2022. “SCL-MLNet: Boosting Few-Shot Remote Sensing Scene Classification via Self-Supervised Contrastive Learning.” IEEE Transactions on Geoscience and Remote Sensing 60:1–12.
  • Li, Xuelong, Xueting Zhang, Wei Huang, and Qi Wang. 2020. “Truncation Cross Entropy Loss for Remote Sensing Image Captioning.” IEEE Transactions on Geoscience and Remote Sensing 59:5246–5257. https://doi.org/10.1109/TGRS.2020.3010106.
  • Lin, Chin-Yew. 2004. “ROUGE: A Package for Automatic Evaluation of Summaries.” Paper presented at the annual meeting of the association for computational linguistics, Barcelona, Spain.
  • Lin, Tsung-Yi, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. “Microsoft COCO: Common Objects in Context.” Paper presented at the European conference on computer vision, Zurich, Switzerland.
  • Liu, Qingrong, Chengqing Ruan, Shan Zhong, Jian Li, Zhonghui Yin, and Xihu Lian. 2018. “Risk Assessment of Storm Surge Disaster Based on Numerical Models and Remote Sensing.” International Journal of Applied Earth Observation and Geoinformation 68: 20–30.
  • Liu, Chenyang, Rui Zhao, Jianqi Chen, Zipeng Qi, Zhengxia Zou, and Zhen Xia Shi. 2023. “A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning.” IEEE Transactions on Geoscience and Remote Sensing 61:1–18.
  • Liu, Chenyang, Rui Zhao, Hao Chen, Zhengxia Zou, and Zhen Xia Shi. 2022. “Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset.” IEEE Transactions on Geoscience and Remote Sensing 60:1–20.
  • Liu, Chenyang, Rui Zhao, and Zhen Xia Shi. 2022. “Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer.” IEEE Geoscience and Remote Sensing Letters 19:1–5.
  • Lu, Xiaoqiang, Binqiang Wang, Xiangtao Zheng, and Xuelong Li. 2017. “Exploring Models and Data for Remote Sensing Image Caption Generation.” IEEE Transactions on Geoscience and Remote Sensing 56:2183–2195.
  • Lyu, Qiang, and Weiqiang Wang. 2023. “Compositional Prototypical Networks for Few-Shot Classification.” ArXiv abs/2306.06584.
  • Munkhdalai, Tsendsuren, and Hong Yu. 2017. “Meta Networks.” Proceedings of Machine Learning Research 70:2554–2563.
  • Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “Bleu: A Method for Automatic Evaluation of Machine Translation.” Paper presented at the annual meeting of the association for computational linguistics, Philadelphia, PA, USA.
  • Qu, Bo, Xuelong Li, Dacheng Tao, and Xiaoqiang Lu. 2016. “Deep Semantic Understanding of High Resolution Remote Sensing Image.” 2016 international conference on computer, information and telecommunication systems (CITS), Kunming, China: 1–5.
  • Shang, Ronghua, Jiaming Wang, Licheng Jiao, R. Stolkin, Biao Hou, and Yangyang Li. 2018. “SAR Targets Classification Based on Deep Memory Convolution Neural Networks and Transfer Parameters.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (8): 2834–2846. https://doi.org/10.1109/JSTARS.2018.2836909.
  • Shi, Zhenwei, and Zhengxia Zou. 2017. “Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?” IEEE Transactions on Geoscience and Remote Sensing 55:3623–3634. https://doi.org/10.1109/TGRS.2017.2677464.
  • Snell, Jake, Kevin Swersky, and Richard S. Zemel. 2017. “Prototypical Networks for Few-Shot Learning.” ArXiv abs/1703.05175.
  • Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. 2014. “CIDER: Consensus-Based Image Description Evaluation.” 2015 IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA: 4566–4575.
  • Wang, Binqiang, Xiaoqiang Lu, Xiangtao Zheng, and Xuelong Li. 2019. “Semantic Descriptions of High-Resolution Remote Sensing Images.” IEEE Geoscience and Remote Sensing Letters 16:1274–1278. https://doi.org/10.1109/LGRS.2019.2893772.
  • Yang, Qiaoqiao, Zihao Ni, and Pengxin Ren. 2022. “Meta Captioning: A Meta Learning Based Remote Sensing Image Captioning Framework.” ISPRS Journal of Photogrammetry and Remote Sensing 186: 190–200.
  • Zhang, Haopeng, Xingyu Zhang, Gang Meng, Chen Guo, and Zhi-guo Jiang. 2022. “Few-Shot Multi-Class Ship Detection in Remote Sensing Images Using Attention Feature Map and Multi-Relation Detector.” Remote Sensing 14:2790. https://doi.org/10.3390/rs14122790.
  • Zhang, Zhengyuan, Wenkai Zhang, Menglong Yan, Xin Gao, Kun Fu, and Xian Sun. 2022. “Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning.” IEEE Transactions on Geoscience and Remote Sensing 60:1–16.
  • Zhuang, Shuo, Pingping Wang, Gang Wang, Di Wang, Jinyong Chen, and Feng Gao. 2022. “Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer.” IEEE Geoscience and Remote Sensing Letters 19:1–5.