Research Article

Exploring latent weight factors and global information for food-oriented cross-modal retrieval

Article: 2233714 | Received 04 Apr 2023, Accepted 03 Jul 2023, Published online: 28 Jul 2023

Abstract

Food-oriented cross-modal retrieval aims to retrieve relevant recipes given food images, or vice versa. The main challenge is the semantic gap between the two modalities, recipe text and food images. Though several studies have been introduced to bridge this gap, they still suffer from two major limitations: 1) simple embedding concatenation can only capture simple interactions rather than the complex interactions between different recipe components; 2) image feature extraction based on convolutional neural networks only considers the local features of an image and ignores its global features, as well as the interactions between the extracted features. This paper proposes a novel method based on Latent Component Weight Factors and Global Information (LCWF-GI) to learn robust recipe and image representations for food-oriented cross-modal retrieval. The proposed method integrates the textual embeddings of different recipe components into a compact recipe representation using latent component-specific weight factors. A Transformer encoder is utilised to capture the intra-modality interactions and the importance of different extracted image features for enhanced image representations. Finally, a bi-directional triplet loss is used to perform retrieval learning. Experimental results on the Recipe 1M dataset show that our LCWF-GI method achieves competitive improvements.

1. Introduction

With the exponential growth of food-related multimodal resources on the Web, large-scale food-domain information resources, such as cooking videos, food images, and recipe texts, can be easily accessed. These information resources not only touch on people's privacy but are also closely related to their health and diet (Liang et al., Citation2022b; Zhang et al., Citation2023). As an emerging research field, food computing has attracted wide attention (Min et al., Citation2017). Several relevant tasks in food computing have been explored, including but not limited to: food-oriented cross-modal retrieval (Chen et al., Citation2021; Fu et al., Citation2020; Salvador et al., Citation2017; Salvador et al., Citation2021; Wang et al., Citation2019), food classification (Krizhevsky et al., Citation2017; Martinel et al., Citation2018; Min et al., Citation2017), recipe recommendation (Elsweiler et al., Citation2017; Sanjo & Katsurai, Citation2017; Trattner & Elsweiler, Citation2017), and recipe ingredient analysis (Kagaya et al., Citation2014; Matsuda et al., Citation2012). Food-oriented cross-modal retrieval, as an essential way to acquire recipe or food image information, has received wide attention from researchers. In this paper, we focus on this retrieval task, whose purpose is to retrieve recipe texts given food images, or vice versa. Since this task involves two modalities (recipe text and food image), there exists a semantic gap between them (Zhao et al., Citation2023). This leads to a sharp decline in the performance of food-oriented cross-modal retrieval compared with single-modal retrieval (Peng et al., Citation2017). How to bridge the semantic gap between recipes and images is the main challenge.

To address this challenge, existing studies have shown that learning consistent semantic representations of images and recipes is a feasible solution. Generally, a recipe text contains three components: title, ingredient, and cooking instruction. Most existing studies utilise popular deep learning techniques, such as long short-term memory (LSTM) and its variants (Carvalho et al., Citation2018; Fu et al., Citation2020; Salvador et al., Citation2017), as well as Transformer encoders (Salvador et al., Citation2021), to learn the textual embeddings of the three recipe components and directly concatenate these embeddings to generate the final recipe representations. A convolutional neural network (such as the ResNet-50 model (He et al., Citation2016)) is then used to learn image representations, and a ranking function is leveraged to compute the semantic similarity between image and recipe representations. These studies can bridge the semantic gap between recipes and images to some degree and achieve reasonable retrieval performance. However, they suffer from two main limitations.

  1. The simple and direct embedding concatenation of different recipe components can only model simple interactions rather than the complex interactions between them. This straightforward concatenation therefore leads to less desirable final recipe representations.

  2. Image feature extraction based on a convolutional neural network (the ResNet-50 model) only considers the local features of an image; it ignores the global features of the image and the interactions between the different extracted features.

To deal with these issues, this paper proposes a novel method based on Latent Component Weight Factors and Global Information (referred to as LCWF-GI) for food-oriented cross-modal retrieval. The proposed method utilises three Transformer encoders with different parameters to extract textual features of the three recipe components (title, ingredient, and cooking instruction) respectively. In fact, the three recipe components closely correlate with each other; for instance, the cooking steps in the cooking instructions depend on the ingredients and quantities listed in the ingredient component. Tensor decomposition and tensor representation can be used to capture the relatedness between these recipe components and make full use of their semantic information and features to guide robust recipe representation learning. Since a tensor is constructed by the outer product over different inputs, we use the outer product over the textual embeddings of the three recipe components to construct a tensor, in which each order represents one kind of recipe component. We exploit tensor decomposition to obtain latent weight factors of the three recipe components, which represent the corresponding semantic features of these components. Our LCWF-GI method then leverages the latent component weight factors and tensor representation (Fukui et al., Citation2016; Liu et al., Citation2018; Zadeh et al., Citation2017) to fuse and integrate these textual embeddings into a final compact recipe representation with richer semantic information. As mentioned above, previous studies directly concatenated the textual embeddings of the three recipe components to acquire the final recipe representations. Such embedding concatenation can only capture simple interactions between the three recipe components by combining their embeddings, and it may miss important information. For example, if two textual embeddings of recipe components are similar to each other in some dimensions and carry unique information in other dimensions, the direct embedding concatenation may obscure their similarity. Unlike simple embedding concatenation, our proposed method applies the element-wise product over the textual embeddings of the three components and their corresponding component-specific weight factors to obtain the final recipe representations. This method can capture complex interactions because the element-wise product allows all elements of the component embeddings to interact with each other across their dimensions. This element-level operation not only retains the information of each component but also produces new embeddings through the interactions between elements. Furthermore, after extracting the local image features with the ResNet-50 model (He et al., Citation2016), we introduce a Transformer encoder to capture the global information of images and the interactions between different image features for enhanced image semantic representations. Finally, our proposed LCWF-GI method uses a bi-directional triplet loss function to perform the semantic matching between recipe and image representations. To make full use of the textual information of different recipe components and reduce the dependence on recipe-image pairs, our LCWF-GI method also considers the implicit semantic alignment between different recipe components. Several experiments are conducted on the standard Recipe 1M dataset. Compared with representative food-oriented cross-modal retrieval baseline methods, experimental results show that our LCWF-GI method can effectively improve retrieval performance.

The contributions can be summarised as follows:

  1. We propose a novel method based on Latent Component Weight Factors and Global Information to learn consistent modality representations for food-oriented cross-modal retrieval. This proposed method utilises latent component weight factors to fuse the textual embeddings of different recipe components for the final recipe representations with deep semantic information. The generated representations can capture the complex interactions between the textual information of different components.

  2. Based on the ResNet-50 model, we further introduce a Transformer encoder to learn the enhanced image semantic representations, which can capture the global information of images and the interaction relationships between the extracted image features.

  3. Experimental results conducted on the Recipe 1M dataset show that our proposed method is superior to the representative baseline methods.

2. Related work

2.1. Food computing

With the exponential growth of multimodal food data available on the Internet, which is closely relevant to people's daily diet and health, food computing is becoming an emerging research area. There are many popular research tasks in this area, such as food-oriented cross-modal retrieval (Chen et al., Citation2021; Fu et al., Citation2020; Salvador et al., Citation2017; Salvador et al., Citation2021; Wang et al., Citation2019), food classification (Krizhevsky et al., Citation2017; Martinel et al., Citation2018; Min et al., Citation2017), recipe recommendation (Elsweiler et al., Citation2017; Sanjo & Katsurai, Citation2017; Trattner & Elsweiler, Citation2017), and recipe ingredient analysis (Kagaya et al., Citation2014; Matsuda et al., Citation2012). Owing to their power, deep learning techniques are widely used in many application domains (Li et al., Citation2022; Liang et al., Citation2022a; Long et al., Citation2022). Similarly, in these tasks, researchers also tend to use deep learning or machine learning methods to extract the semantic information of recipes and images. For example, several studies (Kawano & Yanai, Citation2014; Liu et al., Citation2016; Yanai & Kawano, Citation2015) used convolutional neural network-based methods to analyse or identify food in images. Kusmierczyk et al. leveraged Latent Dirichlet Allocation (LDA) (Zhou et al., Citation2022a; Zhou et al., Citation2022b) to discover latent topic information in cooking instructions and exploited it to predict the nutritional value of meals (Kusmierczyk & Nørvåg, Citation2016). Similar to the work on recipe ingredient identification, most studies on food image classification utilise techniques such as transfer learning (Lee et al., Citation2018) and deep convolutional neural networks (Krizhevsky et al., Citation2017; Min et al., Citation2017) to learn image features for classifying food images.

2.2. Food-oriented cross-modal retrieval

Our work focuses on the food-oriented cross-modal retrieval task, where relevant recipes are retrieved from a large collection of recipe texts given a food image, or vice versa. Since this task involves two modalities (recipe text and food image), there exists a semantic gap between the heterogeneous modalities. How to bridge this semantic gap to improve retrieval performance is the main challenge (Peng et al., Citation2017). To address this issue, learning consistent semantic representations of recipes and images and performing effective modality alignment are both possible solutions. Since consistent modality semantic representations can better assist modality alignment, we focus on learning consistent semantic representations of recipes and images.

Current research studies can be classified into two categories. The first category learns the semantic representations of recipes and images. The methods in this category tend to use different neural networks (such as LSTM and Bi-LSTM) and a popular deep convolutional network (the ResNet-50 model (He et al., Citation2016)) to learn the semantic representations of the two modalities. Standard metric learning functions (the triplet loss and its variants) are then leveraged to perform semantic matching between modality representations for food-oriented cross-modal retrieval learning. The major difference between methods in this category lies in the neural networks used to learn the recipe semantic representations. Therefore, we do not introduce the image semantic representation learning and retrieval ranking functions in detail and focus only on recipe representation learning. An earlier study (Salvador et al., Citation2017) used Bi-LSTM and LSTM to extract textual features from two recipe components (ingredient and cooking instruction) to obtain the corresponding textual embeddings, and then directly concatenated them for the final recipe representations. Similarly, Carvalho et al. (Carvalho et al., Citation2018) used Bi-LSTM and hierarchical LSTM to encode the textual information of the ingredient and cooking instruction components and then generated the final recipe representations by direct embedding concatenation. Since these two studies only considered the textual information of two recipe components (ingredient and cooking instruction) and encoded the recipe text in a simple way with LSTM and Bi-LSTM, the quality of the generated recipe semantic representations needs to be further improved. To make full use of the textual information of different recipe components and explore deeper recipe semantics, Chen et al. (Chen et al., Citation2018) leveraged Bi-GRU and attention networks to capture the word-word interactions in the recipe title, ingredient and cooking instruction, as well as the sentence-sentence interactions in the cooking instruction, to produce textual embeddings of the three recipe components, which were then concatenated to generate the final recipe semantic representation. Compared with the earlier work (Carvalho et al., Citation2018; Salvador et al., Citation2017), their proposed method considered the textual information of the title component in addition to the ingredient and cooking instruction components. They also introduced an attention mechanism to capture fine-grained information about the three recipe components for better recipe semantic representations. However, the ability to capture long-range dependencies in the recipe text still needs to be improved when encoding the recipe components. The Transformer architecture can address this limitation of Bi-GRU and Bi-LSTM and better capture the long-range dependencies among the textual information in different recipe components. Salvador et al. (Salvador et al., Citation2021) proposed a hierarchical Transformer framework for food-oriented cross-modal retrieval, where three hierarchical Transformer encoders (Vaswani et al., Citation2017) encode the word-level and sentence-level information of the three recipe components (title, ingredient, and cooking instruction) to obtain their corresponding textual embeddings.
These textual embeddings are then concatenated to generate the final recipe representations. Pham et al. (Pham et al., Citation2021) used a Bi-LSTM to encode the title and a tree-structured LSTM as the component text encoder to capture the hierarchical relationships and more meaningful semantic information in the ingredient and cooking instruction components; the learned textual embeddings are then concatenated for the final recipe representations. Overall, these studies perform recipe representation learning by encoding each recipe component with a separate neural network and then directly concatenating the resulting embeddings to construct the final recipe semantic representations. However, direct and simple embedding concatenation can only capture simple interactions rather than complex interactions between the textual information of different recipe components. Furthermore, some information among the textual embeddings of these recipe components may be lost.

The second category mainly focuses on modality alignment. After obtaining the semantic representations of recipes and images, the methods in this category propose complex models or loss functions to align and refine modality representations for food-oriented cross-modal retrieval. The recipe representation learning of these methods is similar to that in the first category, and they also utilise the ResNet-50 model as the image encoder to generate image semantic representations. Therefore, we do not introduce the recipe and image semantic representation learning here and focus only on the modality alignment methods. Wang et al. (Wang et al., Citation2019) proposed an adversarial network-based modality alignment method to align the learned recipe and image semantic representations, and used a cross-modal translation consistency function to assist this alignment process for retrieval learning. This method guided the learning of the modality semantic representations by using the discriminators and generators of an adversarial network to bridge the gap between them. Similarly, Zhu et al. (Zhu et al., Citation2019) used an augmented adversarial network to align the semantic representations of recipes and images and introduced a two-level ranking loss function to improve the robustness of the alignment. Fu et al. (Fu et al., Citation2020) used a cross-modal attention mechanism to learn the interactions between recipe and image representations and generate new semantic representations for them; the enhanced modality representations are then aligned with a hidden variable model. Wang et al. (Wang et al., Citation2022) introduced a semantic consistency network to regularise and align the recipe and image semantic representations. These methods focus on bridging the semantic distance between the representations of the two modalities by using a complex alignment model or auxiliary loss functions.

Overall, existing food-oriented cross-modal retrieval methods typically use a deep convolutional neural network (the ResNet-50 model (He et al., Citation2016)) to encode the image semantic information, and different neural networks to learn the textual embeddings of the components in a recipe, which are then directly concatenated to generate the final recipe representations. These methods have two shortcomings. First, they can only capture simple interactions rather than complex interactions between the textual information in different recipe components. Second, image representations based on the ResNet-50 model only consider the local information of an image and ignore its global information and the interactions between the different extracted image features. Therefore, our work proposes a novel consistent semantic representation learning method for food-oriented cross-modal retrieval.

3. The proposed method

In the food-oriented cross-modal retrieval task, the aim is to match the representations of images and recipes. Given a set of recipe-image pairs of size $N$: $\{x_i^R, x_i^I\}_{i=1}^{N}$, where $x_i^R \in R$ denotes a recipe text from the recipe set $R$ and $x_i^I \in I$ denotes an image from the food image set $I$. Each recipe consists of three components, title, ingredient and cooking instruction, denoted as $x_i^R = (t^{ttl}, g^{ing}, s^{ins})$. Existing studies (Fu et al., Citation2020; Salvador et al., Citation2017; Salvador et al., Citation2021; Wang et al., Citation2019) tend to use popular neural networks to encode the textual information of the recipe components $(t^{ttl}, g^{ing}, s^{ins})$ into the corresponding textual embeddings $(e_t^{ttl}, e_g^{ing}, e_s^{ins})$, and then concatenate them to generate the final recipe text representation $E^R = [e_t^{ttl}, e_g^{ing}, e_s^{ins}]$. In this work, the goal of our proposed method is to generate an improved image semantic representation $E^I$ and a refined recipe representation $E^R$. A Transformer architecture is used to encode image features from the ResNet-50 model for the improved image semantic representation $E^I$. The textual embeddings of the different recipe components are leveraged to learn the refined recipe representation $E^R$ with the latent component weight factors and tensor decomposition. The framework of our proposed method is shown in Figure 1, which includes three main components: 1) image representation learning; 2) recipe representation learning; 3) food-oriented cross-modal retrieval learning. The symbols used in this paper and their corresponding meanings are shown in Table 1.
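To make the notation concrete, the following sketch shows one possible way to organise a recipe-image pair $(x_i^R, x_i^I)$ in code; the field names are our own illustration and are not part of the Recipe 1M release.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Recipe:
    """A recipe x_i^R with its three components (hypothetical field names)."""
    title: str                 # t^ttl: a single sentence
    ingredients: List[str]     # g^ing: one sentence per ingredient line
    instructions: List[str]    # s^ins: one sentence per cooking step

@dataclass
class RecipeImagePair:
    """One training sample (x_i^R, x_i^I)."""
    recipe: Recipe
    image_path: str            # path to the food image x_i^I
```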

Figure 1. The overall framework of the proposed method.


Table 1. The symbols and corresponding meanings in the current paper

3.1. Image representation learning

Existing studies (Fu et al., Citation2020; Salvador et al., Citation2021; Wang et al., Citation2019) usually use a deep convolutional network (the ResNet-50 model (He et al., Citation2016)) as an image encoder to extract image features and regard the corresponding embeddings as image representations. This image encoder can be defined as $\psi_{img}$, and the feature embedding of an image is defined as:
(1) $e_{res}^I = \psi_{img}(x_i^I)$
where $e_{res}^I$ represents the corresponding feature embedding of the image $x_i^I$ obtained with the ResNet-50 model. As mentioned in the existing studies (Fu et al., Citation2020; Salvador et al., Citation2021; Wang et al., Citation2019), the image feature embeddings are obtained from the last two layers of the ResNet-50 model (because the embeddings of the last layer are combined with a softmax function for the prediction task). Therefore, the feature embedding $e_{res}^I$ is regarded as the image representation in these studies.
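As a rough illustration of Equation (1), the following PyTorch sketch extracts local feature embeddings $e_{res}^I$ from a ResNet-50 backbone by dropping its final classification (softmax) layer; the exact layers retained, the input resolution and the output shape are assumptions rather than the authors' exact configuration.

```python
import torch
import torchvision.models as models

# ResNet-50 backbone pre-trained on ImageNet.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# psi_img: keep everything up to (but excluding) the pooling and fc/softmax layers.
psi_img = torch.nn.Sequential(*list(resnet.children())[:-2])

def extract_image_features(images: torch.Tensor) -> torch.Tensor:
    """Equation (1): e_res^I = psi_img(x_i^I).

    images: (B, 3, 224, 224) batch of food images.
    Returns a sequence of local feature embeddings (B, 49, 2048),
    one per spatial location of the last convolutional feature map.
    """
    fmap = psi_img(images)                  # (B, 2048, 7, 7)
    return fmap.flatten(2).transpose(1, 2)  # (B, 49, 2048)
```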

Though feature embeddings as image representations can represent the image semantic information to some extent, the feature extraction based on the ResNet-50 model only considers the local information rather than the global information of images. Furthermore, there may be correlations between the different extracted feature embeddings of the same image, while the interactions between them are also neglected. To alleviate these limitations and learn richer image semantic information, a 2-layer Transformer encoder (Vaswani et al., Citation2017) is used to re-encode the feature embeddings $e_{res}^I$ from the ResNet-50 model to generate the enhanced image semantic representations. In this 2-layer Transformer encoder, each layer has 4 self-attention heads, and the multi-head self-attention mechanism can jointly focus on important information at different locations to capture the interaction information between the extracted local features. The image feature embedding $e_{res}^I$ extracted by the ResNet-50 model is treated as the queries, keys and values of the self-attention network, and the multi-head self-attention mechanism of the image Transformer encoder can be expressed as:
(2) $\mathrm{Head}_i(e_{res}^I) = \mathrm{softmax}\left(\frac{[W_I^q e_{res}^I]^T W_I^k e_{res}^I}{\sqrt{d_I}}\right)[W_I^v e_{res}^I]^T$
(3) $\mathrm{MultiHead}(e_{res}^I) = [\mathrm{Head}_1, \ldots, \mathrm{Head}_4]^T W$
where $\mathrm{Head}_i$ represents the $i$th self-attention head in the image Transformer encoder, $W_I^q, W_I^k, W_I^v, W$ are the trainable weight matrices, and $d_I$ is the scaling factor, equal to the dimension of a self-attention layer.

To avoid gradient vanishing and speed up convergence, the multi-head self-attention layer, the feed-forward neural network, and the normalisation layer are combined to generate the final image semantic representation.
(4) $e_i^I = \mathrm{LN}(e_{res}^I + \mathrm{MultiHead}(e_{res}^I))$
(5) $E^I = \mathrm{LN}(e_i^I + \mathrm{FFN}(e_i^I))$
where $\mathrm{MultiHead}$ is the multi-head self-attention layer, $\mathrm{FFN}$ is the feed-forward neural network, $\mathrm{LN}$ is the normalisation layer, and $E^I$ is the final enhanced image semantic representation. The enhanced image representation considers not only the local information of images but also the global information and the interactions between the different extracted local image features. Therefore, the image representation $E^I$ contains richer semantic information.
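A minimal sketch of the 2-layer, 4-head image Transformer encoder described by Equations (2)-(5), using PyTorch's built-in encoder layer (which already combines multi-head self-attention, a feed-forward network, residual connections and layer normalisation); the input projection and mean pooling are our own assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn

class ImageTransformerEncoder(nn.Module):
    """Re-encodes local ResNet-50 features to capture global information
    and interactions between the extracted features (Section 3.1)."""

    def __init__(self, in_dim: int = 2048, d_model: int = 1024,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)   # map ResNet features to d_model
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, e_res: torch.Tensor) -> torch.Tensor:
        # e_res: (B, 49, in_dim) local features from the ResNet-50 backbone.
        h = self.encoder(self.proj(e_res))       # Equations (2)-(5)
        return h.mean(dim=1)                     # pooled image representation E^I
```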

3.2. Recipe representation learning

3.2.1. Textual information encoding for recipe components

Each recipe contains three components: title, ingredient, and cooking instruction. The recipe title is usually a single sentence that highly summarises the recipe, while the ingredient and cooking instruction components are both composed of multiple sentences. To generate strong recipe semantic representations, similar to the work in (Salvador et al., Citation2021), we use three Transformer encoders with the same structure but different parameters to encode the textual information of the three recipe components (title, ingredient and cooking instruction) respectively. As mentioned before, the ingredient and cooking instruction components usually consist of multiple sentences, while the title component consists of a single sentence. A sequence (sentence) includes many word tokens and is denoted as $Sq = (wd_0, \ldots, wd_t)$, where $Sq$ represents a sentence and $wd$ a word token. The title can then be expressed as $t^{ttl} = Sq^{ttl}$, where $Sq^{ttl}$ is the sentence of the title, and the ingredient and cooking instruction components can be denoted as $g^{ing} = (Sq_0^{ing}, \ldots, Sq_j^{ing}, \ldots, Sq_M^{ing})$ and $s^{ins} = (Sq_0^{ins}, \ldots, Sq_j^{ins}, \ldots, Sq_P^{ins})$ respectively. These component text encoders make full use of the multi-head self-attention mechanism in the Transformer architecture to capture the long-range dependencies and word-word interactions of the textual information in the three recipe components, which allows the textual embeddings of these components to carry deep semantic information. The architecture of these component text encoders is similar to that of the image Transformer encoder used in Section 3.1, i.e. a 2-layer Transformer encoder (Vaswani et al., Citation2017) with 4 self-attention heads in each layer. The recipe title $t^{ttl}$ is treated as the queries, keys, and values of the self-attention network in the Transformer text encoder for the title component, and the multi-head self-attention mechanism in this component text encoder can be defined as:
(6) $\mathrm{Head}_i(e_t) = \mathrm{softmax}\left(\frac{[W_{ttl}^q t^{ttl}]^T W_{ttl}^k t^{ttl}}{\sqrt{d_{TI}}}\right)[W_{ttl}^v t^{ttl}]^T$
(7) $\mathrm{MultiHead}(e_t) = [\mathrm{Head}_1, \ldots, \mathrm{Head}_4]^T W_c$
where $\mathrm{Head}_i$ denotes the $i$th self-attention head in the Transformer encoder for the title component, $W_{ttl}^q, W_{ttl}^k, W_{ttl}^v, W_c$ are the weight matrices to be learned, and $d_{TI}$ is the scaling factor, whose size is equal to the dimension of the self-attention layer.

Similar to the image encoder, to improve the model performance and avoid gradient vanishing, we also combine the multi-head self-attention layer, the feed-forward neural network, and the normalisation layer to generate the textual embedding of the title component.
(8) $e_t^{ttl} = \mathrm{LN}(e_t + \mathrm{MultiHead}(e_t))$
(9) $e_t^{ttl} = \mathrm{LN}(e_t^{ttl} + \mathrm{FFN}(e_t^{ttl}))$
where $\mathrm{MultiHead}$ is the multi-head self-attention layer, $\mathrm{FFN}$ is the feed-forward neural network, $\mathrm{LN}$ is the normalisation layer, and $e_t^{ttl}$ represents the semantic textual embedding of the title component. The textual embeddings $e_g^{ing}$ and $e_s^{ins}$ of the ingredient and cooking instruction components are obtained in the same way.
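The three component text encoders share the same structure but not their parameters; the sketch below illustrates this in PyTorch, where the tokenisation, vocabulary size and mean pooling are hypothetical choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ComponentTextEncoder(nn.Module):
    """2-layer, 4-head Transformer text encoder for one recipe component."""

    def __init__(self, vocab_size: int, d_model: int = 512,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) word tokens of one recipe component.
        h = self.encoder(self.embed(token_ids))  # Equations (6)-(9)
        return h.mean(dim=1)                     # component embedding, e.g. e_t^ttl

# Same structure, different parameters for the three components.
VOCAB_SIZE = 30_000                              # hypothetical vocabulary size
title_enc = ComponentTextEncoder(VOCAB_SIZE)
ingredient_enc = ComponentTextEncoder(VOCAB_SIZE)
instruction_enc = ComponentTextEncoder(VOCAB_SIZE)
```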

3.2.2. Recipe representation learning based on latent weight factors

Most current methods generate textual embeddings of the different recipe components and then obtain the final recipe semantic representations by directly concatenating these embeddings. However, simple embedding concatenation can only model and capture simple interactions rather than the complex interactions between the textual information of different components. Unlike these methods (Fu et al., Citation2020; Salvador et al., Citation2017; Salvador et al., Citation2021; Wang et al., Citation2019), we propose a novel method based on Latent Component Weight Factors and Global Information (referred to as LCWF-GI) to learn consistent modality representations for food-oriented cross-modal retrieval. The proposed LCWF-GI method first learns the textual embeddings of the three components (title, ingredient and cooking instruction) via the component Transformer encoders. Latent component-specific weight factors are then used to fuse and integrate these textual embeddings into a unified and compact recipe semantic representation. As mentioned before, a recipe contains three components: the title summarises the name of the recipe, while the ingredient and cooking instruction components provide the ingredients, quantities and units used in the recipe and the cooking steps to prepare the dish respectively. These three recipe components are interdependent and intimately connected to each other. To learn robust final recipe representations with richer semantic information, we fuse and integrate their corresponding textual embeddings into a unified embedding that represents the final recipe representation. The fusion method in our work is different from that in the studies (Arevalo et al., Citation2017; Yuan et al., Citation2021; Yuan et al., Citation2022a; Yuan et al., Citation2022b; Yuan et al., Citation2022c). For example, Yuan et al. (Yuan et al., Citation2022a) utilised visual features to guide the text representations, and the element-wise product operation was used to compute the cross-modal similarity between visual and text representations. In contrast, our work focuses only on leveraging the textual information of the three recipe components to learn recipe representations, and the element-wise product operation is leveraged to fuse the textual embeddings of the three recipe components (title, ingredient, and cooking instruction) to acquire the final recipe representations. Moreover, the recipe representation generation depends only on the recipe components and has no connection with images. Finally, the bi-directional triplet loss function, rather than the element-wise product, is adopted to calculate the cross-modal similarity between the representations of recipes and images.

As mentioned before, the three recipe components closely correlate with each other; for example, the cooking steps in the cooking instructions depend on the ingredients and quantities listed in the ingredient component. Tensor decomposition and tensor representation can be utilised to capture the relatedness between these recipe components and make full use of their semantic information and features to guide the recipe representations. Therefore, we leverage tensor decomposition to acquire latent weight factors of the three recipe components, because the latent weight factors represent the corresponding semantic features of these components. Tensor representation can transform different input representations into a high-dimensional tensor and map this tensor into a low-dimensional vector space (Fukui et al., Citation2016; Liu et al., Citation2018; Zadeh et al., Citation2017). Although the tensor fusion network is commonly used in multimodal learning (Fukui et al., Citation2016; Jin et al., Citation2020; Liu et al., Citation2018; Zadeh et al., Citation2017), we are the first to attempt to use tensor representation to fuse three recipe components for generating the final recipe representations; that is, we use the tensor fusion network to fuse different recipe components rather than multimodal data. In this work, we use the textual embeddings of the three recipe components $(e_t^{ttl}, e_g^{ing}, e_s^{ins})$ to construct an input tensor $X$ via the outer product operation, so each order of the input tensor $X$ represents one component. To model the interactions between the three recipe components as well as to maintain the semantic information of each component, a 1 is appended to the textual embedding of each recipe component before constructing the input tensor $X$ via the outer product operation, i.e. $x_1 = [e_t^{ttl}, 1] \in \mathbb{R}^{d_1}$, $x_2 = [e_g^{ing}, 1] \in \mathbb{R}^{d_2}$, $x_3 = [e_s^{ins}, 1] \in \mathbb{R}^{d_3}$. The constructed input tensor $X$ can be defined as:
(10) $X = x_1 \otimes x_2 \otimes x_3$
where $\otimes$ denotes the outer tensor product, $X \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ is an order-3 tensor, and $d_1$, $d_2$, $d_3$ denote the dimensionality of $x_1$, $x_2$, $x_3$ (the textual embeddings of the title, ingredient, and cooking instruction components after appending 1) respectively.
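For illustration, the sketch below constructs the order-3 tensor $X$ of Equation (10) explicitly for a single recipe; in practice Equation (15) avoids materialising $X$, so this code is purely explanatory.

```python
import torch

def build_input_tensor(e_ttl: torch.Tensor, e_ing: torch.Tensor,
                       e_ins: torch.Tensor) -> torch.Tensor:
    """Equation (10): X = x1 (outer) x2 (outer) x3, for one recipe (no batch dim)."""
    one = torch.ones(1)
    x1 = torch.cat([e_ttl, one])   # x1 = [e_t^ttl, 1] in R^{d1}
    x2 = torch.cat([e_ing, one])   # x2 = [e_g^ing, 1] in R^{d2}
    x3 = torch.cat([e_ins, one])   # x3 = [e_s^ins, 1] in R^{d3}
    # Outer product over the three vectors gives an order-3 tensor in R^{d1 x d2 x d3}.
    return torch.einsum('i,j,k->ijk', x1, x2, x3)
```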

The tensor $X$ is then fed into a linear layer $g(\cdot)$ to generate the recipe semantic representation, which can be defined as:
(11) $E^R = g(X; W, b) = X \cdot W + b$
where $g(\cdot)$ denotes a linear layer, and $W$ and $b$ are the weight tensor and bias of the linear layer respectively. Since $X$ is an order-3 tensor (the order of $X$ equals the number of recipe components used, and our work uses all three), the weight tensor $W \in \mathbb{R}^{d_1 \times d_2 \times d_3 \times d_r}$ is an order-4 tensor. The extra order ($d_r$) of $W$ corresponds to the dimension size $d_r$ of the final output recipe representation $E^R$. In the dot product operation $X \cdot W$, the weight tensor $W$ can be viewed as a stack of $d_r$ order-3 tensors; in other words, $W$ can be partitioned into $d_r$ order-3 tensors $W_k \in \mathbb{R}^{d_1 \times d_2 \times d_3}$, $k = 1, \ldots, d_r$.

To efficiently acquire the final recipe semantic representation, the LCWF-GI method decomposes the weight tensor $W$ into a set of latent weight factors corresponding to the three components. This tensor decomposition not only projects the semantic information of the recipe components into a low-dimensional space to capture their semantic features, but also captures the relevance between these components. As $W$ is an order-4 tensor, directly decomposing it with traditional methods would generate four kinds of elements. As mentioned earlier, the weight tensor $W$ is formed by stacking $d_r$ order-3 tensors $W_k \in \mathbb{R}^{d_1 \times d_2 \times d_3}$, $k = 1, \ldots, d_r$, so the decomposition of $W$ is converted into a separate decomposition of each $W_k$. For any order-3 tensor $W_k$, there always exists an exact decomposition into vectors, defined as:
(12) $W_k = \sum_{j=1}^{\ell} \left( w_{k(j)}^{ttl} \otimes w_{k(j)}^{ing} \otimes w_{k(j)}^{ins} \right)$
where $\otimes$ is the outer product, $w_{k(j)}^{ttl}$, $w_{k(j)}^{ing}$ and $w_{k(j)}^{ins}$ are the weight vectors of the title, ingredient and cooking instruction components obtained by decomposing the $k$-th order-3 tensor $W_k$, and $\ell$ is the minimum rank value that makes this decomposition form valid. The set of weight vectors $\{w_{k(j)}^{ttl}, w_{k(j)}^{ing}, w_{k(j)}^{ins}\}_{j=1}^{\ell}$ is called the latent weight factors of the order-3 weight tensor $W_k$. These weight vectors (low-order tensors in a low-dimensional semantic space) also represent the corresponding semantic features of the three recipe components (title, ingredient, and cooking instruction).

In contrast to this decomposition process, we attempt to recover the original weight tensor via a low-rank approximation, which keeps the important features and removes the noisy ones. Therefore, a fixed rank value $r$ is used. The $r$ latent weight decomposition factors $\{w_{k(j)}^{ttl}, w_{k(j)}^{ing}, w_{k(j)}^{ins}\}_{j=1}^{r}$, $k = 1, \ldots, d_r$, are leveraged to parametrically represent the recipe component fusion; in other words, we leverage the $r$ decomposition factors $\{w_{k(j)}^{ttl}, w_{k(j)}^{ing}, w_{k(j)}^{ins}\}_{j=1}^{r}$ to reconstruct the low-rank weight tensor $W_k$. Each recipe component has its corresponding low-rank latent component weight factors, obtained by combining and concatenating these weight vectors. For example, for the title component, the corresponding vectors are combined and concatenated to obtain the low-rank latent weight factors $w_{(j)}^{ttl} = [w_{1(j)}^{ttl}, w_{2(j)}^{ttl}, \ldots, w_{k(j)}^{ttl}, \ldots, w_{d_r(j)}^{ttl}]$, where $d_r$ is the dimension size of the final recipe representation; then $\{w_{(j)}^{ttl}\}_{j=1}^{r}$ are the low-rank latent weight factors of the title component. In a similar way, the low-rank latent weight factors $\{w_{(j)}^{ing}\}_{j=1}^{r}$ of the ingredient component and $\{w_{(j)}^{ins}\}_{j=1}^{r}$ of the cooking instruction component can be obtained. The weight tensor $W$ can then be reformulated with the low-rank latent weight factors of the three recipe components, and Equation (11) can be rewritten as:
(13) $W = \sum_{j=1}^{r} \left( w_{(j)}^{ttl} \otimes w_{(j)}^{ing} \otimes w_{(j)}^{ins} \right)$
(14) $E^R = X \cdot \left( \sum_{j=1}^{r} \left( w_{(j)}^{ttl} \otimes w_{(j)}^{ing} \otimes w_{(j)}^{ins} \right) \right)$
where $w_{(j)}^{ttl} \in \mathbb{R}^{d_1 \times d_r}$, $w_{(j)}^{ing} \in \mathbb{R}^{d_2 \times d_r}$, $w_{(j)}^{ins} \in \mathbb{R}^{d_3 \times d_r}$. They share the same second dimension $d_r$, and the outer product is performed only along the first dimension, i.e. $w_{(j)}^{ttl} \otimes w_{(j)}^{ing} \in \mathbb{R}^{d_1 \times d_2 \times d_r}$, $w_{(j)}^{ttl} \otimes w_{(j)}^{ins} \in \mathbb{R}^{d_1 \times d_3 \times d_r}$, $w_{(j)}^{ing} \otimes w_{(j)}^{ins} \in \mathbb{R}^{d_2 \times d_3 \times d_r}$.

To efficiently fuse the textual information of the different recipe components, the input tensor $X$ can be decomposed into the textual embeddings of the three components in parallel with the decomposition of the weight tensor $W$. According to Equation (10) and the mixed-product property of tensors, Equation (14) can be rewritten as:
(15) $E^R = X \cdot \left( \sum_{j=1}^{r} \left( w_{(j)}^{ttl} \otimes w_{(j)}^{ing} \otimes w_{(j)}^{ins} \right) \right) = \sum_{j=1}^{r} \left( X \cdot \left( w_{(j)}^{ttl} \otimes w_{(j)}^{ing} \otimes w_{(j)}^{ins} \right) \right) = \sum_{j=1}^{r} \left( (x_1 \otimes x_2 \otimes x_3) \cdot \left( w_{(j)}^{ttl} \otimes w_{(j)}^{ing} \otimes w_{(j)}^{ins} \right) \right) = \left[ \sum_{j=1}^{r} \left( x_1 \cdot w_{(j)}^{ttl} \right) \right] \circ \left[ \sum_{j=1}^{r} \left( x_2 \cdot w_{(j)}^{ing} \right) \right] \circ \left[ \sum_{j=1}^{r} \left( x_3 \cdot w_{(j)}^{ins} \right) \right]$
where $\otimes$ is the outer product, $\cdot$ is the dot product, and $\circ$ denotes the element-wise product. After decomposing the weight tensor $W$ and the input tensor $X$ in parallel, we can directly compute the recipe representations without actually constructing the tensor $X$. As seen from Equation (15), the final recipe semantic representation can be expressed as the element-wise product over the textual embeddings of the three recipe components and their corresponding latent weight factors. Equation (15) corresponds to the Fusion of Textual Embeddings of Multiple Components based on Latent Weight Factors module in Figure 1. Compared with the direct embedding concatenation method, the proposed LCWF-GI method allows interactions between individual elements of the textual embeddings of different recipe components through the element-wise product operation. Therefore, the final recipe representation can capture complex interactions, rather than only simple interactions, between the textual information of different recipe components.
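A minimal PyTorch sketch of the low-rank fusion in Equation (15): each component embedding (with the appended 1) is projected by its $r$ latent weight factors, the projections are summed over the rank, and the three results are combined by an element-wise product. The module interface and initialisation scale are our own assumptions.

```python
import torch
import torch.nn as nn

class LowRankComponentFusion(nn.Module):
    """Fuses title/ingredient/instruction embeddings via latent weight factors
    without materialising the order-3 tensor X (Equation (15))."""

    def __init__(self, d1: int, d2: int, d3: int, d_out: int, rank: int = 4):
        super().__init__()
        # Latent weight factors {w_(j)^ttl}, {w_(j)^ing}, {w_(j)^ins}, j = 1..r.
        self.w_ttl = nn.Parameter(torch.randn(rank, d1 + 1, d_out) * 0.02)
        self.w_ing = nn.Parameter(torch.randn(rank, d2 + 1, d_out) * 0.02)
        self.w_ins = nn.Parameter(torch.randn(rank, d3 + 1, d_out) * 0.02)

    def forward(self, e_ttl, e_ing, e_ins):
        # Append the constant 1 to each batched embedding, as in Equation (10).
        ones = e_ttl.new_ones(e_ttl.size(0), 1)
        x1 = torch.cat([e_ttl, ones], dim=1)             # (B, d1+1)
        x2 = torch.cat([e_ing, ones], dim=1)             # (B, d2+1)
        x3 = torch.cat([e_ins, ones], dim=1)             # (B, d3+1)
        # Sum over the r rank-1 factors of each component: sum_j (x . w_(j)).
        f1 = torch.einsum('bd,rdo->bo', x1, self.w_ttl)
        f2 = torch.einsum('bd,rdo->bo', x2, self.w_ing)
        f3 = torch.einsum('bd,rdo->bo', x3, self.w_ins)
        # Element-wise product of the three fused terms gives E^R.
        return f1 * f2 * f3                              # (B, d_out)
```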

3.3. Food-oriented cross-modal retrieval learning

Similar to the work in (Salvador et al., Citation2021), our objective loss function contains two parts. The first part is a bi-directional triplet loss function $L_{pair}$ based on recipe-image pairs, which is mainly used to learn the similarity between the representations of recipes and images. The second part is a bi-directional triplet loss function $L_{rep}$ based on the textual embeddings of different recipe components (title-ingredient, title-cooking instruction, ingredient-cooking instruction, etc.), which is used to ensure that the textual information of different recipe components is as similar as possible. These two loss functions are combined as our final loss function:
(16) $L = \alpha L_{pair} + \beta L_{rep}$
where $\alpha$ and $\beta$ are the balance factors of the recipe-image pair loss and the component pair loss respectively; setting $\alpha = 1$ and $\beta = 1$ means that the two loss functions are considered simultaneously.

Recipe-image pair loss function $L_{pair}$. The goal of this loss function is to make the similarity between an anchor and the positive samples higher than its similarity to the negative samples. For example, when a food image sample is used as an anchor, the similarity between the image anchor and the positive recipe samples should be higher than its similarity to the negative recipe samples, and vice versa when a recipe sample is used as an anchor.
(17) $L_{pair} = \max\left(0,\, s(E^I, E_n^R) - s(E^I, E_p^R) + m\right) + \max\left(0,\, s(E_n^I, E^R) - s(E_p^I, E^R) + m\right)$
where $m$ is the margin value in the bi-directional triplet loss function and $s(\cdot)$ is the cosine similarity measurement. $E_p^R$ and $E_p^I$ are positive recipe and food image samples respectively, while $E_n^R$ and $E_n^I$ are negative recipe and food image samples drawn from the mini-batch.
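A sketch of the bi-directional triplet loss in Equation (17), assuming cosine similarity, margin $m$, and pre-selected negative samples; in-batch negative mining is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def bi_directional_triplet_loss(E_img, E_rec, E_img_neg, E_rec_neg, m: float = 0.3):
    """Equation (17): (E_img, E_rec) is a matching pair; *_neg are negatives."""
    s = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    loss_i2r = F.relu(s(E_img, E_rec_neg) - s(E_img, E_rec) + m)   # image as anchor
    loss_r2i = F.relu(s(E_img_neg, E_rec) - s(E_img, E_rec) + m)   # recipe as anchor
    return (loss_i2r + loss_r2i).mean()
```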

Loss function $L_{rep}$ of different recipe components. We attempt to reduce the dependence of model training on recipe-image pairs and make full use of the relationships between the textual information of different recipe components. A self-supervised loss function between the different pairs of the three recipe components (title-ingredient, title-cooking instruction, ingredient-cooking instruction, etc.) (Salvador et al., Citation2021) is adopted. This loss is the average of the bi-directional triplet losses over the different textual embedding pairs of the three recipe components. Its objective is similar to that of the bi-directional triplet loss for recipe-image pairs: the similarity between an anchor and the positive samples should be higher than that between the anchor and the negative samples.
(18) $L_{rep} = \frac{1}{6} \sum_{(a,b)} \left( \max\left(0,\, s(a, b_n) - s(a, b_p) + m\right) + \max\left(0,\, s(a_n, b) - s(a_p, b) + m\right) \right)$
where $a, b \in \{e_t^{ttl}, e_g^{ing}, e_s^{ins}\}$ denote the textual embeddings of any two recipe components (title, ingredient, and cooking instruction), the factor $1/6$ corresponds to the total of six recipe component pairs over which the loss is averaged, and $m$ is the margin value in the triplet loss function. $a_p$ and $b_p$ are the positive textual embedding samples of the two recipe components, while $a_n$ and $b_n$ are the negative textual embedding samples of the two recipe components within a mini-batch.
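The component-pair loss in Equation (18) applies the same bi-directional triplet term to the six ordered pairs of component embeddings; below is a self-contained sketch, under the assumption that negatives come from another recipe in the mini-batch, together with the combined objective of Equation (16).

```python
from itertools import permutations
import torch.nn.functional as F

def _bi_triplet(a, b, a_n, b_n, m):
    """Bi-directional triplet term for one anchor pair (a, b) of component embeddings."""
    s = lambda u, v: F.cosine_similarity(u, v, dim=-1)
    return (F.relu(s(a, b_n) - s(a, b) + m) + F.relu(s(a_n, b) - s(a, b) + m)).mean()

def component_pair_loss(comp, comp_neg, m: float = 0.3):
    """Equation (18): average over the six ordered component pairs.

    comp / comp_neg: dicts mapping 'ttl', 'ing', 'ins' to the embeddings of
    the anchor recipe and of a negative recipe respectively.
    """
    pairs = list(permutations(['ttl', 'ing', 'ins'], 2))   # six ordered pairs
    losses = [_bi_triplet(comp[a], comp[b], comp_neg[a], comp_neg[b], m)
              for a, b in pairs]
    return sum(losses) / len(pairs)

def total_loss(L_pair, L_rep, alpha: float = 1.0, beta: float = 1.0):
    """Equation (16): L = alpha * L_pair + beta * L_rep."""
    return alpha * L_pair + beta * L_rep
```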

4. Experiments

4.1. Dataset and evaluation metrics

Following the existing studies (Fu et al., Citation2020; Salvador et al., Citation2017; Salvador et al., Citation2021; Wang et al., Citation2019), we conduct experiments on the standard Recipe 1M dataset (Salvador et al., Citation2017) for food-oriented cross-modal retrieval. It is a large publicly available dataset of English recipes and food images. The original dataset consists of 1 million recipes and 900,000 food images; after preprocessing, 341,421 recipe-image pairs remain (Salvador et al., Citation2017). All the recipe-image pairs are divided into a training set (238,999), a validation set (51,119) and a test set (51,303). Since this work considers the relationships between the pairs of textual information of different recipe components, 482,231 samples from the rest of the dataset (recipes only, without corresponding images) are also used to train our proposed method.

The food-oriented cross-modal retrieval task consists of two sub-tasks, i.e. image-to-recipe retrieval (Image-to-Recipe) and recipe-to-image retrieval (Recipe-to-Image). Ten subsets of 1,000 (1 K) and 10,000 (10 K) recipe-image pairs are extracted from the test set to validate our proposed method. Our proposed method and the representative food-oriented cross-modal retrieval baseline methods are evaluated using Median Rank (MedR), Recall@1 (R1), Recall@5 (R5), and Recall@10 (R10). The reported results are averaged over the 10 subsets of 1 K and 10 K test samples respectively.
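For reference, MedR and Recall@K over a 1 K or 10 K subset can be computed from a query-candidate similarity matrix as sketched below; this is a generic implementation, not the authors' evaluation script.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """sim[i, j]: similarity between query i and candidate j, where the
    ground-truth match of query i is candidate i (as in Recipe 1M pairs)."""
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                      # candidates sorted by similarity
    ranks = np.array([np.where(order[i] == i)[0][0] + 1   # rank of the true match (1 = best)
                      for i in range(n)])
    return {
        'MedR': float(np.median(ranks)),
        'R@1': float(np.mean(ranks <= 1)),
        'R@5': float(np.mean(ranks <= 5)),
        'R@10': float(np.mean(ranks <= 10)),
    }
```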

4.2. Comparison methods

We compare our proposed method with several representative food-oriented cross-modal retrieval methods, which are described as follows.

JE (CVPR2017) (Salvador et al., Citation2017): Salvador et al. used LSTM and the ResNet-50 model to learn the semantic representations of recipes and images respectively, then used a semantic regularisation function to align these modality representations for retrieval learning.

ATTEN (MM2018) (Chen et al., Citation2018): Chen et al. introduced hierarchical attention networks to capture word-word and sentence-sentence interactions between the textual information of each recipe component for enhanced recipe representations. Then they learned consistent semantic representations of two modalities based on a bidirectional triplet loss function.

AdaMine (SIGIR2018) (Carvalho et al., Citation2018): Carvalho et al. used Bi-LSTM and the ResNet-50 model to learn recipe and image representations respectively. The category label information, a bi-directional triplet loss and ranking loss are used for the retrieval purpose.

R2GAN (CVPR2019) (Zhu et al., Citation2019): Zhu et al. used two discriminators in augmented adversarial networks to generate recipe representations and proposed a new two-level ranking loss function to align the semantic representations of images and recipes.

MCEN (CVPR2020) (Fu et al., Citation2020): Fu et al. first extracted recipe and image features based on Bi-GRU and the ResNet-50 model respectively, then used a cross-modal attention mechanism to learn their consistent representations. Finally, they aligned the semantic representations via a hidden variable model.

ACME (CVPR2019) (Wang et al., Citation2019): Wang et al. proposed a modality alignment method based on adversarial networks and used an improved triplet loss function and cross-modal translation consistency loss function to assist the modality alignment for food-oriented cross-modal retrieval.

HTL (CVPR2021) (Salvador et al., Citation2021): Salvador et al. encoded the textual information of the three recipe components separately based on a hierarchical Transformer encoder and fused them to obtain the final recipe semantic representations. Alignment between different recipe components and between image-recipe pairs based on triplet loss functions is then introduced to perform food-oriented cross-modal retrieval.

JEMA (CIKM2021) (Xie et al., Citation2021): Xie et al. focused on modality alignment and presented a three-tier modality alignment optimisation method for image-recipe retrieval.

IMHF (ICMR2021) (Li et al., Citation2021): Li et al. learned image and recipe representations through a unified Transformer framework with inter- and intra-modality fusion rather than separate encoders for food-oriented cross-modal retrieval.

SCAN (TMM2022) (Wang et al., Citation2022): Wang et al. combined attention mechanism and Bi-LSTM to generate recipe representations and introduced semantic consistency networks to regularise and align the modality semantic representations.

MSJE (TSC2022) (Xie et al., Citation2022): Xie et al. incorporated TFIDF semantics, sequence semantics, and category semantics to learn enhanced joint embeddings for food-oriented cross-modal retrieval.

LCWF-GI: This is our proposed method that considers the latent component-specific weight factors and global information to learn consistent modality representations for food-oriented cross-modal retrieval. This method uses latent component-specific weight factors to fuse the textual embeddings of three recipe components to generate the final recipe representations. It also considers the global information of images and interactions between different image features to produce enhanced image representations.

4.3. Experimental settings

In the experimental setup, we use the ResNet-50 model (He et al., Citation2016) pre-trained on ImageNet (Dosovitskiy et al., Citation2020) to initialise the image encoder for extracting image features, and the dimension of the final enhanced image semantic representation is 1024. The output dimension of the textual embeddings of all three recipe components is 512, and the dimension of the final recipe representation generated after fusion is 1024. The Adam optimiser is used to train our proposed method with a learning rate of 0.0001 and a batch size of 128. The rank $r$ of the latent component-specific weight factors is set to 4 when generating recipe representations. The margin parameter $m$ of the bi-directional triplet loss function is 0.3. The hyperparameters $\alpha$ and $\beta$ of the final objective loss function are both set to 1.0.
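For convenience, the reported hyperparameters can be gathered into a single configuration; the sketch below simply restates the values above, with field names of our own choosing.

```python
# Hyperparameters reported in Section 4.3, collected as a plain dictionary.
config = {
    'image_repr_dim': 1024,       # final enhanced image representation
    'component_embed_dim': 512,   # title / ingredient / instruction embeddings
    'recipe_repr_dim': 1024,      # fused recipe representation
    'optimizer': 'Adam',
    'learning_rate': 1e-4,
    'batch_size': 128,
    'fusion_rank_r': 4,           # rank of the latent component weight factors
    'triplet_margin_m': 0.3,
    'loss_alpha': 1.0,
    'loss_beta': 1.0,
}
```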

5. Result discussion

5.1. Main retrieval results

Experimental results of our proposed LCWF-GI method and the representative food-oriented cross-modal retrieval baseline methods for the Image-to-Recipe and Recipe-to-Image retrieval tasks are shown in Table 2. The bold values in this table indicate the best results, achieved by our LCWF-GI method. The representative baseline methods are divided into consistent representation learning-based methods (focusing on learning high-quality recipe representations) and modality alignment-based methods (aiming at aligning the semantic representations of recipes and images). From Table 2, several conclusions can be drawn.

  1. Overall, the proposed LCWF-GI method achieves more competitive results than all baseline methods except the HTL approach on both retrieval tasks.

  2. On the 1 K test set, the LCWF-GI method shows performance similar to the HTL method on both retrieval tasks in terms of MedR, Recall@5 and Recall@10. For the Recall@1 metric, the LCWF-GI method performs better than all consistent representation learning-based methods (JE, ATTEN, AdaMine, IMHF, MSJE and HTL) on both tasks. For example, compared with the HTL method, the LCWF-GI method improves Recall@1 from 58.6 to 59.4 and from 59.2 to 60.1 on the Image-to-Recipe and Recipe-to-Image retrieval tasks respectively. This indicates that the enhanced recipe and image semantic representations can effectively improve retrieval performance. However, in terms of Recall@5, the performance of the LCWF-GI method is slightly weaker than the HTL method on the Image-to-Recipe and Recipe-to-Image tasks. This is because the LCWF-GI method cannot fully exploit the relationships between different components to learn their own useful information, leading to inferior performance on Recall@5. Compared with the modality alignment-based methods (R2GAN, MCEN, ACME, SCAN and JEMA), the LCWF-GI method has more obvious advantages in terms of the recall metrics (Recall@1, Recall@5, and Recall@10). Compared with the best modality alignment-based method (JEMA), the LCWF-GI method improves from 58.1 to 59.4, 85.8 to 86.8, and 92.2 to 92.5 on the Image-to-Recipe task for the three recall metrics (Recall@1, Recall@5 and Recall@10) respectively. On the Recipe-to-Image task, the LCWF-GI method increases from 58.5 to 60.1, 86.2 to 86.7, and 92.3 to 92.7 for the same metrics. Besides, the LCWF-GI method shows clear advantages over the second-best modality alignment-based method (SCAN) on the Image-to-Recipe and Recipe-to-Image tasks for the three recall metrics. Although the modality alignment of the SCAN and JEMA methods can bridge the semantic gap between the recipe and image representations to some degree, enhanced and stronger semantic representations of recipes and images are more conducive to retrieval.

  3. On the 10 K test set, as the test set size increases (from 1 K to 10 K), the retrieval performance of the LCWF-GI method and all baseline methods shows a significant decrease compared with their performance on the 1 K test set, because a larger test set makes retrieval more difficult. However, compared to the HTL method, the LCWF-GI method improves from 27.3 to 27.9, 55.2 to 56.0, and 67.0 to 67.8 on the Image-to-Recipe retrieval task in terms of Recall@1, Recall@5, and Recall@10 respectively. For the Recipe-to-Image retrieval task, the Recall@1, Recall@5, and Recall@10 values of our LCWF-GI method increase from 27.9 to 28.6, 55.3 to 55.8, and 66.9 to 67.9 respectively. The LCWF-GI method thus outperforms the HTL method by a modest margin on both retrieval tasks. There are two possible reasons. First, the LCWF-GI method generates the final recipe semantic representation by fusing the textual embeddings of different recipe components via the latent component weight factors, which can capture the complex interactions and complementary information between the textual information of different components. Second, the LCWF-GI method introduces a Transformer encoder to re-encode the image features from the ResNet-50 model for enhanced image semantic representations; these refined image representations consider the global information of images and capture the interactions between different extracted local image features. Compared with the modality alignment-based JEMA and SCAN methods, the proposed LCWF-GI method not only achieves better scores in terms of the MedR metric, but also performs well on Recall@1, Recall@5, and Recall@10 for the Image-to-Recipe and Recipe-to-Image tasks. The performance of the LCWF-GI method is significantly better than that of the modality alignment-based methods, which further verifies that high-quality recipe and image representations can better bridge the semantic gap between the two modalities.

Table 2. The comparison results of our proposed method and popular baseline methods. In this table, bold values represent the best scores.

5.2. Ablation studies

To explore the contributions of the textual information of the different recipe components (title, ingredient and cooking instruction) in our proposed LCWF-GI method, we conduct ablation studies on the 1 K and 10 K test sets. First, the textual information of the title and ingredient components is fused by latent weight factors to generate the recipe representation (denoted as TL + IG). The textual information of the title and cooking instruction components (denoted as TL + IS), and of the ingredient and cooking instruction components (denoted as IG + IS), is fused in a similar way to produce the final recipe representations. The proposed LCWF-GI method fuses the textual information of all three recipe components, including title, ingredient and cooking instruction (denoted as TL + IG + IS). Experimental results are reported in Table 3, from which the following conclusions can be drawn: (1) The recipe semantic representations generated by fusing the textual information of all three recipe components (TL + IG + IS) achieve better retrieval performance than those fusing only two components. (2) Compared with TL + IG and TL + IS, the two-component fusion IG + IS is more conducive to generating high-quality recipe representations. This is because the textual information of the ingredient and cooking instruction components contains multiple sentences, while the title component contains only one sentence; thus the ingredient and cooking instruction components carry more useful information than the single-sentence title component.

Table 3. The food-oriented cross-modal retrieval results of this proposed method fusing different recipe components. In this table, bold values represent the best scores.

5.3. Parameter settings

We also explore the effect of different rank values of the latent weight factors in the LCWF-GI method. Since the differences are more obvious for Recall@1 than for Recall@5 and Recall@10, only the Recall@1 values are reported. Figure 2 shows the performance of our proposed LCWF-GI method on the 1 K and 10 K test sets for the two retrieval tasks when the latent weight factors are set to different rank values. From the experimental results, it is clear that our proposed method performs best when the rank value of the latent weight factors is 4.

Figure 2. The retrieval results of latent weight factor with different rank values in our proposed LCWF-GI method on Image-to-Recipe and Recipe-to-Image tasks on 1 and 10 K test set. (a) The Value of Recall@1 of Our LCWF-GI method for Image-to-Recipe and Recipe-to-Image Retrieval Tasks on 1 K Test Set; (b) The Value of Recall@1 of Our LCWF-GI method for Image-to-Recipe and Recipe-to-Image Retrieval Tasks on 10 K Test Set.

5.4. Visualisation example

Figure 3 shows an example of food-oriented cross-modal retrieval on the 10 K test set using our proposed LCWF-GI method and the representative baseline (the HTL method). To save space, only the Recipe-to-Image retrieval task is visualised as an example. Two different recipes (Honey Lemon Fruit Salad and Blue Ribbon Apple Crumb Pie) are selected as queries, and both our proposed method and the HTL method are used to retrieve the relevant food images, returning the top five results. Compared with the ground-truth images, the visualisation shows that our proposed method retrieves four to five relevant results for recipes 1 and 2 respectively, while the HTL method returns only three to four relevant results. Therefore, our proposed method retrieves food images related to recipes more effectively than the HTL method, indicating that it better learns consistent semantic representations of recipes and images and uses them to bridge the semantic gap between the two modalities.
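For reference, a minimal sketch of how such a ranked list can be produced is shown below: the learned recipe embedding of a query is compared with all candidate image embeddings by cosine similarity and the five most similar images are returned. The embedding dimension and gallery size here are illustrative stand-ins, not the actual learned embeddings.

```python
# Minimal sketch of Recipe-to-Image ranking: cosine similarity between one
# recipe embedding and all candidate image embeddings, then take the top-5.
import torch
import torch.nn.functional as F


def top_k_images(recipe_emb, image_embs, k=5):
    """recipe_emb: (D,), image_embs: (N, D); returns indices of the k most similar images."""
    sims = F.cosine_similarity(recipe_emb.unsqueeze(0), image_embs, dim=1)  # (N,)
    return torch.topk(sims, k=k).indices


if __name__ == "__main__":
    recipe_emb = torch.randn(1024)            # stand-in for a learned recipe embedding
    image_embs = torch.randn(10000, 1024)     # stand-in for the 10 K image gallery
    print(top_k_images(recipe_emb, image_embs, k=5))
```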

Figure 3. An example of Recipe-to-Image retrieval by our proposed method and the HTL baseline method on the 10 K test set. The first column is the recipe-text query, the second column shows the corresponding ground truth, and the final column shows the top-five retrieval results.

5.5. Complexity analysis

In our work, we use the parameter count and the training time of one epoch to evaluate model complexity, because recording the whole training process would take too much time. The parameter counts and per-epoch training times are shown in Table 4. From this table, we can see that our proposed method contains about 100.25 million (M) parameters, while the HTL method includes about 93.95 million. Compared with the best baseline method (the HTL method), our proposed method therefore has slightly more parameters, but its training time per epoch is shorter. Furthermore, in our experiments we observe that our method reaches comparable performance with far fewer epochs. The results (to save time and space, we only conduct experiments on the 1 K test set) are shown in Figure 4. Our proposed LCWF-GI method achieves better or similar performance compared with the HTL method while requiring much fewer training epochs, which indicates that the LCWF-GI method can save training time.
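The following is a small sketch of how the numbers in Table 4 can be measured in PyTorch: count the trainable parameters of a model and time one training epoch. The model, data, and loss below are synthetic stand-ins, not the LCWF-GI pipeline.

```python
# Minimal sketch for measuring model complexity: parameter count (in millions)
# and the wall-clock time of one training epoch. Model and data are stand-ins.
import time
import torch
import torch.nn as nn


def count_parameters(model: nn.Module) -> float:
    """Return the number of trainable parameters in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6


def time_one_epoch(model, loader, loss_fn, optimizer) -> float:
    """Return the wall-clock time (s) of one training epoch."""
    model.train()
    start = time.time()
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return time.time() - start


if __name__ == "__main__":
    model = nn.Linear(1024, 1024)                                       # stand-in model
    data = [(torch.randn(32, 1024), torch.randn(32, 1024)) for _ in range(10)]
    opt = torch.optim.Adam(model.parameters())
    print(f"{count_parameters(model):.2f} M parameters")
    print(f"{time_one_epoch(model, data, nn.MSELoss(), opt):.2f} s per epoch")
```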

Figure 4. The results achieved by our LCWF-GI method and the best baseline method (HTL) on the two retrieval tasks on the 1 K test set when using different numbers of training epochs. The solid curves represent our LCWF-GI method and the dashed curves represent the HTL method. The orange, green and blue curves correspond to Recall@1, Recall@5 and Recall@10, respectively.

Table 4. The parameter counts and per-epoch training time of our LCWF-GI method and the best baseline method (HTL). M and s denote millions and seconds, respectively.

5.6. Attention visualisation in the Transformer

We also visualise the self-attention mechanism in the Transformer to further understand what it learns during recipe representation learning. As mentioned before, we use three Transformer encoders with the same structure but different parameters to encode the textual information of the three recipe components (title, ingredient and cooking instruction) respectively, so the attention behaviour is similar across the three components. We therefore take the attention in the Transformer for the cooking instruction component as an example; the visualisation is shown in Figure 5. The cooking instruction contains two sentences, “cook chicken on grill for about 30 min” and “turning it once brush chicken with sauce”. Blue indicates a positive association and orange a negative one: the deeper the blue of an associated word on the right, the more closely it is related to the word selected in the left sentence, whereas deeper orange indicates that the two words are not associated. We can observe that the attention mechanism not only captures the important words in the cooking instruction component, but also captures the long-distance dependencies between words in different sentences of the cooking instruction.
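As a rough illustration of how an attention map like the one in Figure 5 can be extracted, the sketch below runs the instruction tokens through a single self-attention layer and plots the resulting weight matrix. The token embeddings are random stand-ins; in the actual model they would come from the trained instruction Transformer encoder, and the layer sizes are assumptions.

```python
# Minimal sketch for inspecting self-attention over cooking instruction tokens.
# Embeddings are random stand-ins for the trained encoder's token representations.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

tokens = ("cook chicken on grill for about 30 min "
          "turning it once brush chicken with sauce").split()
embed_dim, n_heads = 64, 4
x = torch.randn(1, len(tokens), embed_dim)               # (B, seq_len, dim) stand-in

attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
_, weights = attn(x, x, x, need_weights=True)            # weights: (B, seq_len, seq_len)

plt.imshow(weights[0].detach().numpy(), cmap="Blues")    # darker = stronger attention
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title("Self-attention over the cooking instruction tokens")
plt.tight_layout()
plt.show()
```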

Figure 5. Visualisation of the attention mechanism in the Transformer for the cooking instruction component. (a) The self-attention visualisation of a sentence in the cooking instruction component; (b) The self-attention visualisation of sentence (a) and the next sentence in the cooking instruction component.

6. Conclusion

In this paper, we propose a novel method based on latent weight factors and global information to learn consistent modality representations for food-oriented cross-modal retrieval. Unlike existing retrieval methods, the proposed method first extracts the latent textual information of the three recipe components (title, ingredient and cooking instruction) with their corresponding Transformer encoders. We then fuse the textual information of these components by using latent weight factors and a tensor representation to generate richer recipe semantic representations: the final recipe representation is obtained as the element-wise product over the textual embeddings of the three recipe components and their corresponding latent weight factors. Compared with direct embedding concatenation, which only captures simple interactions between components, our method allows interactions between individual elements of the component embeddings through the element-wise product, thereby capturing the complex interactions between the textual information of the components. In addition, we introduce a Transformer encoder to re-encode the image features from the ResNet-50 model and learn enhanced image semantic representations. The newly generated image representations take the global information of images into consideration and leverage the self-attention mechanism of the Transformer encoder to capture the interactions between the extracted local image features. Extensive experiments on the standard Recipe 1M dataset show that our proposed method obtains better retrieval results than the existing representative baseline methods. In future work, we will further consider the hierarchical alignment between the textual information of different components and explore new retrieval learning objective functions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

This work was supported in part by the Hunan Provincial Natural Science Foundation of China [grant no 2022JJ30020], the Guangdong Basic and Applied Basic Research Foundation of China [grant no 2023A1515012718], the Philosophy and Social Sciences 14th Five-Year Plan Project of Guangdong Province [grant no GD23CTS03], the Scientific Research Fund of Hunan Provincial Education Department [grant no 21A0319], and the Hunan Provincial Innovation Foundation for Postgraduate [grant no CX20210986].

References

  • Arevalo, J., Solorio, T., Montes-y-Gómez, M., & González, F. A. (2017). Gated multimodal units for information fusion. Proceedings of the 5th international conference on learning Representations (ICLR, Workshop), Toulon, France.
  • Carvalho, M., Cadène, R., Picard, D., Soulier, L., Thome, N., & Cord, M. (2018). Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. Proceedings of the 41st international ACM SIGIR conference on research & development in information retrieval (SIGIR), Melbourne, VIC, Australia.
  • Chen, J.-J., Ngo, C.-W., Feng, F.-L., & Chua, T.-S. (2018). Deep understanding of cooking procedure for cross-modal recipe retrieval. Proceedings of the 26th ACM international conference on multimedia, Seoul, Republic of Korea.
  • Chen, Y., Zhou, D., Li, L., & Han, J.-M. (2021). Multimodal encoders for food-oriented cross-modal retrieval. Proceedings of the Web and Big data and the 5th international joint conference (APWeb-WAIM), Guangzhou, China.
  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., & Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of the 9th international conference on learning representations (ICLR), online.
  • Elsweiler, D., Trattner, C., & Harvey, M. (2017). Exploiting food choice biases for healthier recipe recommendation. Proceedings of the 40th international acm sigir conference on research and development in information retrieval (SIGIR), Shinjuku, Tokyo, Japan.
  • Fu, H., Wu, R., Liu, C., & Sun, J. (2020). Mcen: Bridging cross-modal gap between cooking recipes and dish images with latent variable model. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA.
  • Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., & Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. Proceedings of the 2016 conference on empirical methods in natural language processing (EMNLP), Austin, Texas, USA.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA.
  • Jin, T., Huang, S., Li, Y., & Zhang, Z. (2020). Dual low-rank multimodal fusion. Findings of the association for computational linguistics: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), Abu Dhabi, United Arab Emirates.
  • Kagaya, H., Aizawa, K., & Ogawa, M. (2014). Food detection and recognition using convolutional neural network. Proceedings of the 22nd ACM international conference on multimedia, Orlando, FL, USA.
  • Kawano, Y., & Yanai, K. (2014). Food image recognition with deep convolutional features. Proceedings of the 2014 ACM international joint conference on pervasive and ubiquitous computing: Adjunct publication, Seattle, WA, USA.
  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
  • Kusmierczyk, T., & Nørvåg, K. (2016). Online food recipe title semantics: Combining nutrient facts and topics. Proceedings of the 25th ACM international on conference on information and knowledge management (CIKM), Indianapolis, IN, USA.
  • Lee, K.-H., He, X., Zhang, L., & Yang, L. (2018). Cleannet: Transfer learning for scalable image classifier training with label noise. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Salt Lake, UT, USA.
  • Li, J., Sun, J., Xu, X., Yu, W., & Shen, F. (2021). Cross-Modal image-recipe retrieval via intra-and inter-modality hybrid fusion. Proceedings of the 2021 international conference on multimedia retrieval, Taipei, Taiwan.
  • Li, Y., Liang, W., Peng, L., Zhang, D., Yang, C., & Li, K.-C. (2022). Predicting drug-target interactions via dual-stream graph neural network. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1–11. https://doi.org/10.1109/TCBB.2022.3204188
  • Liang, W., Li, Y., Xie, K., Zhang, D., Li, K.-C., Souri, A., & Li, K. (2022a). Spatial-temporal aware inductive graph neural network for C-ITS data recovery. IEEE Transactions on Intelligent Transportation Systems, 1–12. https://doi.org/10.1109/TITS.2022.3156266
  • Liang, W., Yang, Y., Yang, C., Hu, Y., Xie, S., Li, K.-C., & Cao, J. (2022b). PDPChain: A consortium blockchain-based privacy protection scheme for personal data. IEEE Transactions on Reliability, 1–13. https://doi.org/10.1109/TR.2022.3190932
  • Liu, C., Cao, Y., Luo, Y., Chen, G., Vokkarane, V., & Ma, Y. (2016). Deepfood: Deep learning-based food image recognition for computer-aided dietary assessment. Proceedings of the 14th international conference on smart homes and health telematics (ICOST), Wuhan, China.
  • Liu, Z., Shen, Y., Lakshminarasimhan, V. B., Liang, P. P., Zadeh, A., & Morency, L.-P. (2018). Efficient low-rank multimodal fusion with modality-specific factors. Proceedings of the 56th annual meeting of the association for computational linguistics (ACL), Melbourne, VIC, Australia.
  • Long, J., Liang, W., Li, K.-C., Wei, Y., & Marino, M. D. (2022). A regularized cross-layer ladder network for intrusion detection in industrial internet of things. IEEE Transactions on Industrial Informatics, 19(2), 1747–1755. https://doi.org/10.1109/TII.2022.3204034
  • Martinel, N., Foresti, G. L., & Micheloni, C. (2018). Wide-slice residual networks for food recognition. Proceedings of the 2018 IEEE winter conference on applications of computer vision (WACV), Lake Tahoe, NV, USA.
  • Matsuda, Y., Hoashi, H., & Yanai, K. (2012). Recognition of multiple-food images by detecting candidate regions. Proceedings of the 2012 IEEE international conference on multimedia and expo (ICME), Melbourne, VIC, Australia.
  • Min, W., Bao, B.-K., Mei, S., Zhu, Y., Rui, Y., & Jiang, S. (2017). You are what you eat: Exploring rich recipe information for cross-region food analysis. IEEE Transactions on Multimedia, 20(4), 950–964. https://doi.org/10.1109/TMM.2017.2759499
  • Peng, Y., Huang, X., & Zhao, Y. (2017). An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges. IEEE Transactions on Circuits and Systems for Video Technology, 28(9), 2372–2385. https://doi.org/10.1109/TCSVT.2017.2705068
  • Pham, H. X., Guerrero, R., Pavlovic, V., & Li, J. (2021). CHEF: Cross-modal hierarchical embeddings for food domain retrieval. Proceedings of the AAAI conference on artificial intelligence (AAAI), online.
  • Salvador, A., Gundogdu, E., Bazzani, L., & Donoser, M. (2021). Revamping cross-modal recipe retrieval with hierarchical transformers and self-supervised learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), online.
  • Salvador, A., Hynes, N., Aytar, Y., Marin, J., Ofli, F., Weber, I., & Torralba, A. (2017). Learning cross-modal embeddings for cooking recipes and food images. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, USA.
  • Sanjo, S., & Katsurai, M. (2017). Recipe popularity prediction with deep visual-semantic fusion. Proceedings of the 2017 ACM on conference on information and knowledge management (CIKM), Singapore.
  • Trattner, C., & Elsweiler, D. (2017). Investigating the healthiness of internet-sourced recipes: Implications for meal planning and recommender systems. Proceedings of the 26th international conference on world wide web (WWW), Perth, Australia.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł, & Polosukhin, I. (2017). Attention is all you need. Proceedings of advances in neural information processing systems (NeurIPS), Long Beach, CA, USA.
  • Wang, H., Sahoo, D., Liu, C., Lim, E.-P., & Hoi, S. C. (2019). Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), Long Beach, CA, USA.
  • Wang, H., Sahoo, D., Liu, C., Shu, K., Achananuparp, P., Lim, E.-P., & Hoi, S. C. (2022). Cross-modal food retrieval: Learning a joint embedding of food images and recipes with semantic consistency and attention mechanism. IEEE Transactions on Multimedia, 24, 2515–2525. https://doi.org/10.1109/TMM.2021.3083109
  • Xie, Z., Liu, L., Li, L., & Zhong, L. (2021). Learning joint embedding with modality alignments for cross-modal retrieval of recipes and food images. Proceedings of the 30th ACM international conference on information & knowledge management (CIKM), Queensland, Australia.
  • Xie, Z., Liu, L., Wu, Y., Li, L., & Zhong, L. (2022). Learning tfidf enhanced joint embedding for recipe-image cross-modal retrieval service. IEEE Transactions on Services Computing, 15(6), 3304–3316. https://doi.org/10.1109/TSC.2021.3098834
  • Yanai, K., & Kawano, Y. (2015). Food image recognition using deep convolutional network with pre-training and fine-tuning. Proceedings of 2015 IEEE international conference on multimedia & expo workshops (ICMEW), Turin, Italy.
  • Yuan, Z., Zhang, W., Fu, K., Li, X., Deng, C., Wang, H., & Sun, X. (2022a). Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–19. https://doi.org/10.1109/TGRS.2021.3078451
  • Yuan, Z., Zhang, W., Rong, X., Li, X., Chen, J., Wang, H., Fu, K., & Sun, X. (2021). A lightweight multi-scale crossmodal text-image retrieval method in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–19. https://doi.org/10.1109/TGRS.2021.3124252
  • Yuan, Z., Zhang, W., Tian, C., Mao, Y., Zhou, R., Wang, H., Fu, K., & Sun, X. (2022b). MCRN: A multi-source cross-modal retrieval network for remote sensing. International Journal of Applied Earth Observation and Geoinformation, 115, 103071. https://doi.org/10.1016/j.jag.2022.103071
  • Yuan, Z., Zhang, W., Tian, C., Rong, X., Zhang, Z., Wang, H., Fu, K., & Sun, X. (2022c). Remote sensing cross-modal text-image retrieval based on global and local information. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–16. https://doi.org/10.1109/TGRS.2022.3163706
  • Zadeh, A., Chen, M., Poria, S., Cambria, E., & Morency, L.-P. (2017). Tensor fusion network for multimodal sentiment analysis. Proceedings of the 2017 conference on empirical methods in natural language processing (EMNLP), Copenhagen, Denmark.
  • Zhang, S., Hu, B., Liang, W., Li, K.-C., & Gupta, B. B. (2023). A caching-based dual K-anonymous location privacy-preserving scheme for edge computing. IEEE Internet of Things Journal, 1–14. https://doi.org/10.1109/JIOT.2023.3235707
  • Zhao, W., Zhou, D., Cao, B., Zhang, K., & Chen, J. (2023). Adversarial modality alignment network for cross-modal molecule retrieval. IEEE Transactions on Artificial Intelligence, 1–12. https://doi.org/10.1109/TAI.2023.3254518
  • Zhou, D., Peng, X., Li, L., & Han, J.-M. (2022a). Cross-lingual embeddings with auxiliary topic models. Expert Systems with Applications, 190, 116194. https://doi.org/10.1016/j.eswa.2021.116194
  • Zhou, D., Qu, W., Li, L., Tang, M., & Yang, A. (2022b). Neural topic-enhanced cross-lingual word embeddings for CLIR. Information Sciences, 608, 809–824. https://doi.org/10.1016/j.ins.2022.06.081
  • Zhu, B., Ngo, C.-W., Chen, J., & Hao, Y. (2019). R2gan: Cross-modal recipe retrieval with generative adversarial network. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), Long Beach, CA, USA.