Research Article

Contour wavelet diffusion – a fast and high-quality facial expression generation model

Article: 2316023 | Received 11 Sep 2023, Accepted 02 Feb 2024, Published online: 14 Feb 2024

Abstract

Facial expressions are important for conveying information in human interactions. Diffusion models can generate high-quality images of clear and discriminative faces, but their training and inference times are often prolonged, hampering practical application. Latent-space diffusion models have shown promise in speeding up training by operating on feature-space parameters, but they require additional network structures. To address these limitations, we propose a contour wavelet diffusion model that accelerates both training and inference. We use a contour wavelet transform to extract components from images and features, achieving substantial acceleration while preserving reconstruction quality. A normalised random channel attention module enhances the quality of generated images by focusing on high-frequency information. We also include a reconstruction loss function to improve convergence speed. Experimental results demonstrate the effectiveness of our approach in boosting the training and inference speeds of diffusion models without sacrificing image quality. Fast generation of facial expressions provides a smoother and more natural user experience, which is important for real-time applications. In addition, faster inference saves computational resources, reduces system cost, and improves energy efficiency, which promotes the development and application of this technology.

1. Introduction

Facial expression generation has achieved significant success and rapid progress in recent years. Expression images are used in many areas of life, such as security, film production, and healthcare, yet such data are scarce and difficult to acquire. Recently, image generation methods based on diffusion models have garnered increasing attention from both academia and industry due to their ability to generate high-quality images (Croitoru et al., Citation2023; Ramesh et al., Citation2021). Diffusion models progressively recover realistic images from random noise, offering a wide range of applications, including text-to-image generation, image-to-image translation, image restoration, and image repair. Some recent diffusion models have outperformed state-of-the-art generative adversarial networks (GANs) and exhibit better mode coverage. OpenAI (Bubeck et al., Citation2023) provides users with the ability to generate realistic images that meet specific requirements through textual instructions, ushering in a new era of artificial intelligence.

Despite the remarkable effectiveness of diffusion models, they suffer from a fundamental issue: slow training and inference. This is the key weakness that prevents diffusion models from being adopted as widely as GANs. Early diffusion models typically require over a thousand iterative steps to ensure good performance, so generating a single image takes several minutes, roughly ten times longer than GAN models (Goodfellow et al., Citation2014). While some methods (Brophy et al., Citation2023; Huang et al., Citation2022; Lu et al., Citation2022) address this by operating in the feature space or reducing the number of sampling steps, they require additional operations and have limited room for improvement; generating even small images still takes a long time, over 100 times slower than GANs. DiffusionGAN (Wang et al., Citation2022) made a breakthrough in inference speed by combining diffusion and GANs in a single system, reducing the number of sampling steps to four and the inference time for generating 32 × 32 images to a fraction of a second. However, DiffusionGAN is still at least four times slower than StyleGAN (Karras et al., Citation2020). Moreover, its computational complexity increases exponentially with larger inputs, resulting in lengthy convergence times that cannot meet the demands of large-scale tasks.

To address the issue of prolonged model inference time, we propose the Contour Wavelet Diffusion Model. Owing to its multi-directionality and anisotropic analysis, the model can simultaneously obtain frequency and positional information of feature maps. Additionally, we propose an attention module that effectively focuses on crucial high-frequency information, improving the overall quality of image generation. Finally, we propose a balanced loss function that preserves the network's ability to learn change consistency while also improving processing speed. Our model achieves accelerated performance compared to the original diffusion model while preserving high-quality image generation.

The structure and content of the full paper are as follows:

In the second part, we introduce the related work on diffusion models and wavelet transformation. The third part describes the proposed methodology. In the fourth part, the effectiveness of the method is demonstrated through experimental results. Analysis and discussions are presented in the fifth part. Finally, the conclusions are drawn in the sixth part.

2. Related work

2.1. Facial expression generation

In a comprehensive study, Paul Ekman and his colleagues (Ekman et al., Citation1971) examined the control of facial expressions, starting from the organisation of facial muscles. Their research in the late 1970s aimed to establish a clear connection between facial expressions and facial muscle groups, resulting in the development of the Facial Action Coding System (FACS) (Ekman, Citation2004). FACS divided the face into 46 independent motion units, allowing the representation of different controlled areas and corresponding expressions through a collection of images. This method effectively captured facial motion changes in an intuitive manner. However, the labelling process proved to be time-consuming and labour-intensive due to the complexity of facial muscles, and the coding system ran slowly.

To address these challenges, the MPEG-4 facial motion parameter method (facial animation parameter, FAP) was introduced (Ekman & Friesen, Citation1978). FAP proposed a comprehensive set of facial movements based on the motion of facial muscles. The approach involved creating a standardised facial template and adjusting the details according to individual faces. By sharing motion parameters of the neutral face, FAP significantly improved the recognition accuracy of facial expressions and reduced repetitive work. Additionally, Ying-li Tian proposed the use of Gabor wavelets for facial unit recognition, similar to motion coding. The face was divided into upper and lower parts, and independent motion units were assigned to them. Geometric features were introduced for face recognition (Pandzic, Citation2002).

In the field of deep learning, Batista et al. utilised deep neural networks (DNN) to estimate action units (AU) considering multi-head pose, while Sayan et al. applied convolutional neural networks to perform cross-experiments on multiple datasets, simultaneously predicting multiple AUs (Batista et al., Citation2017; Zhang et al., Citation2012). However, as deep network models continued to evolve, relying solely on deep learning was no longer sufficient to meet various practical requirements. Challenges emerged, including suboptimal performance compared to traditional methods and the inability to implement computationally-intensive deep learning on embedded devices. As a result, approaches that integrate deep learning with other technologies have been explored.

For example, Gudi et al. (Citation2015) employed a 7-layer convolutional neural network for AU recognition, while Walecki et al. (Citation2017) combined convolutional neural networks with conditional random fields (CRF) to capture AU dependencies and reduce correlations between AUs, with convolutional neural networks used for feature extraction (Carvalho et al., Citation2004; Gudi et al., Citation2015). Furthermore, Tran et al. (Citation2017) introduced a combination of variational autoencoder (VAE) and nonparametric Gaussian process. This approach jointly obtained latent representations and multiple ordinal output classifiers (Walecki et al., Citation2016). The focus on facial expression generation can provide valuable insights for facial expression recognition modelling and micro-expression research. However, the existing facial expression generation models are still relatively limited. Moreover, current deep generative models encounter challenges such as the need for improved effectiveness, slow convergence speeds, high computational demands, and difficulty in training. Therefore, enhancing existing models or proposing novel generative models becomes essential to address these limitations.

2.2. Diffusion model

Motivated by the principles of non-equilibrium thermodynamics, diffusion models have emerged as a promising approach in the field of image generation. These models utilise the concepts of Markov chains and iteratively remove and restore image noise to generate target images (Tran et al., Citation2017). Unlike traditional generative models such as Generative Adversarial Networks (GANs) (Goodfellow et al., Citation2014; Ho et al., Citation2020) and Variational Autoencoders (VAEs) (Karras et al., Citation2019; Kingma & Welling, Citation2013), diffusion models involve multiple denoising steps during the sampling process. Additionally, diffusion models employ a fixed process in which each latent variable has the same dimension as the original input.

While several other methods have pursued similar motivations, they have often relied on reverse processes like score matching (Song & Ermon, Citation2020; Taubman et al., Citation2002; Vahdat et al., Citation2021). Subsequent research efforts have focused on enhancing the sample quality of diffusion models (Dhariwal & Nichol, Citation2021; Xiao et al., Citation2020). Some researchers have even explored extending this process to the latent space, enabling successful text-to-image generation (Esser et al., Citation2021; Ramesh et al., Citation2021; Ramesh et al., Citation2022; Song, Sohl-Dickstein, et al., Citation2020). Notably, certain methods (Rombach et al., Citation2022; Vincent, Citation2011) aim to improve the efficiency of the sampling process to achieve better convergence and faster sampling. Attempts have also been made to decompose the Markov chain into non-Markov chains in order to expedite the sampling process (Sohl-Dickstein et al., Citation2015). However, it is important to consider that there is often a trade-off between sampling speed and sample quality.

In addition to these advancements, diffusion models have undergone various improvements and extensions. The Denoising Diffusion Probabilistic Model (DDPM) (Luhman & Luhman, Citation2021) and GLIDE (Ma et al., Citation2022) have emerged as prominent variations. More recently, the Diffusion GAN (Song et al., Citation2021) introduced a novel approach to handle complex multimodal distributions by combining generative adversarial networks with large step sizes. This framework reduces the number of denoising steps to just a few, resulting in a significant reduction in inference time to a fraction of a second. Nevertheless, it still lags behind competing GAN-based methods in terms of speed.

To further unleash the potential of the diffusion model, our proposed approach incorporates new wavelet components into the existing framework. These additions are aimed at enhancing the quality of generated samples while maintaining computational efficiency. By leveraging the distinctive properties of wavelets, our novel methodology increases the expressiveness and diversity of generated images, thereby opening up new possibilities in the realms of image synthesis and understanding. Through our research, we strive to contribute to the continuous advancement of diffusion models and their applications in the field of image generation.

2.3. Contour wavelet transform

Wavelet transform has the advantages of time–frequency localisation and multi-scale, multi-resolution analysis, and has been widely used in the image processing field. Although the commonly used discrete wavelet transform (DWT) (Gupta et al., Citation2023) can effectively capture the singularity of one-dimensional signals, it is difficult to represent higher-dimensional geometric features. For two-dimensional images, geometric features such as edges, contours and textures have high-dimensional singularities and contain most of the information, so wavelets are no longer the optimal basis for representing images. The commonly used two-dimensional wavelets are composed of tensor products of two one-dimensional orthogonal wavelets, which have very limited directionality, covering only horizontal, vertical and diagonal directions, and cannot represent the geometric features of images in the sparsest way.

The multi-resolution property of two-dimensional separable wavelets is realised by using basis functions with square support domains of different scales. As the resolution increases, the scale becomes finer, and the representation ultimately amounts to approximating a curve with many "points". At scale $j$, the side length of the support domain of the wavelet basis is approximately $2^{-j}$, and the number of wavelet coefficients in the transform domain with amplitude exceeding $2^{-j}$ is at least of order $O(2^{j})$, which grows exponentially with $j$. As the support domain becomes finer, the number of non-zero wavelet coefficients therefore increases exponentially, leaving a large number of non-negligible coefficients. Consequently, when two-dimensional separable wavelets nonlinearly approximate images with rich contour details, the approximation error decays extremely slowly, which means they cannot sparsely approximate the original image.

In order to find an effective image representation method that satisfies these characteristics, researchers have made substantial efforts and successively proposed multi-resolution, multi-directional image representation methods such as the steerable filter (Nichol et al., Citation2021), the dual-tree complex wavelet (Yang et al., Citation2022), the ridgelet transform (Shensa, Citation1992) and the curvelet transform (Freeman & Adelson, Citation1991). The ridgelet is the optimal basis for representing images with straight edges, and the curvelet is a generalisation of the ridgelet: a multi-directional basis defined in the two-dimensional continuous space $\mathbb{R}^2$ with good spatial and frequency-domain locality and good nonlinear approximation performance. The curvelet is the optimal basis for representing images with smooth, twice-differentiable curve edges. Bamberger and Smith first proposed directional filter banks (DFB) (Kaur et al., Citation2020), which can effectively capture the directional information in images and realise their multi-directional decomposition. With the development of directional filter banks, several geometric image transforms based on them have been proposed, such as the contourlet transform (Do & Vetterli, Citation2003), a "true" two-dimensional image representation method. It has good directionality and anisotropy, and can accurately capture the edges in images into subbands of different scales and frequencies. However, this transform has a 4/3 redundancy. Based on the contourlet, Crisp-Contourlet (Bamberger & Smith, Citation1992) uses non-separable filters at all levels, solving the redundancy problem of the contourlet. Hong et al. proposed the Octave-Band directional filter bank (Da Cunha et al., Citation2006), which can flexibly divide the image spectrum according to actual needs. Later, Eslami et al. proposed the wavelet-based contourlet transform (WBCT) (Eslami & Radha, Citation2004), which applies directional filter banks to all detail subbands of the wavelet transform, retaining many advantages of both the wavelet transform and the directional filter bank. With the development of directional filter banks, spectrum division has become more and more flexible, and more effective geometric image representation methods are still being sought. In this paper, we combine the geometric properties of the contourlet transform with the diffusion model, which further accelerates image generation and improves the quality of the generated images.

3. Methods

Section 3.1 introduces our proposed Contour Wavelet Diffusion framework for more effective sampling. Section 3.2 describes a Normalised Random Channel Attention module that enhances image generation. Section 3.3 outlines the loss function used for network model training.

The Contour Wavelet framework that we propose is illustrated in Figure 1. The input image undergoes contour wavelet decomposition, resulting in a low-frequency subband and 8 × 3 high-frequency subbands (8 corresponding to the number of directions). These subbands are then concatenated into a single feature and fed into the forward process of our model. This decomposition, as shown in the diagram below, enables the extraction of directionally-anisotropic high-frequency information, resulting in higher-quality detailed images.
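To make the packing of subbands concrete, the sketch below is a minimal illustration of stacking a multi-band decomposition into a single channel-wise tensor. It uses PyWavelets' separable 2-D DWT as a stand-in for the contour wavelet transform (the actual contourlet decomposition into one low-frequency subband and 8 × 3 directional subbands is not provided by PyWavelets), and the function name decompose_to_channels is ours, chosen for illustration only.

```python
# Minimal sketch: pack a multi-band decomposition into one channel-stacked tensor.
# NOTE: pywt.dwt2 is a separable 2-D DWT (LL, LH, HL, HH) used here only as a
# stand-in for the paper's contour wavelet transform, which produces one
# low-frequency subband plus 8 x 3 directional high-frequency subbands.
import numpy as np
import pywt

def decompose_to_channels(image: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """image: (H, W) or (C, H, W); returns subbands stacked along the channel axis,
    each at half the spatial resolution (i.e. 1/4 of the original number of pixels)."""
    if image.ndim == 2:
        image = image[None]                      # add a channel axis
    bands = []
    for ch in image:
        ll, (lh, hl, hh) = pywt.dwt2(ch, wavelet)
        bands.extend([ll, lh, hl, hh])           # low-frequency first, then details
    return np.stack(bands, axis=0)               # (4*C, H/2, W/2)

if __name__ == "__main__":
    x = np.random.rand(3, 256, 256).astype(np.float32)   # a CelebA-sized RGB image
    feat = decompose_to_channels(x)
    print(feat.shape)                            # (12, 128, 128)
```

The key point mirrored here is that every subband has half the height and width of the input, so the subsequent network operates on feature maps with a quarter of the original pixels.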

Figure 1. Contour wavelet transformation (the image is from the public CelebA dataset; the same applies to the following figures).


Because the spatial scale of the image is reduced to one-fourth after the contour wavelet transform, the computational complexity is effectively reduced. Unlike traditional diffusion models, we use the obtained noisy image as a condition and employ conditional GAN training to ensure consistency between the generated images and the forward process. By directly generating images with a GAN, the additional operations required by traditional diffusion models are avoided. Additionally, the computational cost of the model is significantly reduced due to the one-fourth reduction in spatial dimension.

3.1. Contour wavelet generator

The core of our framework is the contour wavelet generation network, shown in Figure 3. For simplicity, the time-step embedding t and the latent embedding z are omitted from the figure, but they are injected into each block of the denoising process. The input is a noisy wavelet subband of shape [C × H × W] at time step t, which is processed by a series of proposed components, including contour wavelet upsampling blocks (CWUB), contour wavelet downsampling blocks (CWDB), and skip connections. The output of the model approximates the undisturbed input.

Figure 2. Contour wavelet diffusion network.


The modules of Figure 3 are all contained in the Generator module of Figure 2, with the aim of combining the properties of the contour wavelet transform with the features extracted by the network, thus obtaining higher-quality images. In the Generator, features first undergo a down-sampling operation that incorporates contour wavelet transforms, followed by an up-sampling operation that incorporates inverse wavelet transforms. This process yields high-quality images at each step of generation. To ensure that information is not lost during the inverse transformation, we use skip connections to recover the up-sampled features from the down-sampled features.

Figure 3. Contour wavelet generation network.


The contour wavelet generation network mainly consists of upsampling and downsampling operations, as shown in Figure 3(a). Traditional methods rely on blur kernels during downsampling and upsampling to reduce aliasing; instead, we exploit the intrinsic properties of the contour wavelet transform. Specifically, the downsampling block receives a tuple of input features F, latent z, and temporal embedding t, and obtains the downsampled features and high-frequency subbands through a combination of two downsampling modules and residual modules; symmetrically, the upsampling block combines upsampling modules and residual modules to recover the sampled features. In addition, a batch-normalised stochastic attention module is applied to the downsampled features to further enhance detailed features. The detailed structure of CWDB and CWUB is shown in Figure 3(b): the input features pass through a series of AdaGN (Adaptive Group Normalisation) and convolution operations to produce the final output. In effect, this strengthens the identification of high-frequency information.
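The following PyTorch sketch shows one plausible reading of a wavelet-based downsampling block under the description above; the class name ContourWaveletDown, the fixed Haar analysis bank (standing in for the contour wavelet filters), and the AdaGN-style modulation from a combined (t, z) embedding are all our assumptions rather than the authors' exact design.

```python
# Hypothetical sketch of a wavelet-based downsampling block (in the spirit of CWDB).
# A fixed Haar analysis bank stands in for the contour wavelet filters, and AdaGN
# is approximated by GroupNorm whose scale/shift come from a (t, z) embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_weights(channels: int) -> torch.Tensor:
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    bank = torch.stack([ll, lh, hl, hh])                     # (4, 2, 2)
    # one 4-filter analysis bank per input channel (grouped convolution)
    return bank.repeat(channels, 1, 1).unsqueeze(1)          # (4*C, 1, 2, 2)

class ContourWaveletDown(nn.Module):
    def __init__(self, channels: int, emb_dim: int):
        super().__init__()
        self.register_buffer("bank", haar_weights(channels))
        self.norm = nn.GroupNorm(4, 4 * channels, affine=False)
        self.to_scale_shift = nn.Linear(emb_dim, 8 * channels)   # AdaGN-style modulation
        self.conv = nn.Conv2d(4 * channels, 4 * channels, 3, padding=1)

    def forward(self, x, emb):
        """x: (B, C, H, W); emb: (B, emb_dim) combined time/latent embedding."""
        b, c = x.shape[0], x.shape[1]
        sub = F.conv2d(x, self.bank, stride=2, groups=c)         # (B, 4*C, H/2, W/2)
        sub = sub.view(b, c, 4, *sub.shape[-2:])
        low, high = sub[:, :, 0], sub[:, :, 1:].flatten(1, 2)    # LL band / 3 detail bands
        feats = torch.cat([low, high], dim=1)                    # (B, 4*C, H/2, W/2)
        scale, shift = self.to_scale_shift(emb).chunk(2, dim=1)
        h = self.norm(feats) * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.conv(F.silu(h)), high                        # features + high-freq skip
```

The returned high-frequency subbands can be carried through skip connections so that the matching upsampling block can recover detail during the inverse transform, as described above.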

3.2. Batch-normalised stochastic attention module

During the feature extraction stage, fine-grained detail features are extracted relatively weakly, and occlusion can further contaminate the extraction of surrounding features. Inspired by CBAM (Eslami & Radha, Citation2004), we propose a Batch-Norm Stochastic Channel Attention (BSCA) module. While convolutional neural networks can form complex patterns by stacking sufficiently deep layers, information does not flow directly between lower and deeper layers. The attention module helps filter out abnormal activations caused by occlusion in the lower layers while focusing on pixels at the same position across channels. It applies batch normalisation (BN) to measure the importance of different channels with trainable factors, using pooling over the spatial dimension. Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, we generate three spatial contexts, $F^{c}_{avg}$, $F^{c}_{max}$, and $F^{c}_{sto}$, using average pooling, max pooling, and stochastic pooling, respectively. Each context is passed through a shared multi-layer perceptron (MLP), producing three attention vectors of dimension $C \times 1 \times 1$. The three vectors are summed element-wise, and the channel attention map $M_c$ is obtained through a sigmoid function, as shown in Equation (1). (1) $M_c(F) = \sigma\big(W_1(W_0(\mathrm{BN}(F^{c}_{avg}))) + W_1(W_0(\mathrm{BN}(F^{c}_{max}))) + W_1(W_0(\mathrm{BN}(F^{c}_{sto})))\big)$ where $\sigma$ denotes the sigmoid function, $M_c(F)$ is the channel attention map, $W_0$ and $W_1$ are the shared MLP weights, BN denotes batch normalisation, and $F^{c}_{avg}$, $F^{c}_{max}$, and $F^{c}_{sto}$ denote the average-pooled, max-pooled, and stochastic-pooled contexts of feature F (Figure 4).
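A minimal PyTorch sketch of Equation (1) is given below. The reduction ratio of the shared MLP and the way stochastic pooling samples a spatial position are assumptions made for illustration; only the overall structure (three pooled contexts, batch normalisation, a shared MLP, summation, and a sigmoid) follows the text.

```python
# Sketch of the batch-normalised stochastic channel attention (BSCA) of Eq. (1).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BSCA(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.bn = nn.BatchNorm1d(channels)
        self.mlp = nn.Sequential(                    # shared W0, W1 of Eq. (1)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    @staticmethod
    def _stochastic_pool(x):
        # sample one spatial position per (batch, channel) with probability
        # proportional to the activation magnitude (one possible reading of
        # "stochastic pooling"; the paper does not spell out the sampling rule)
        b, c, h, w = x.shape
        probs = x.flatten(2).abs().reshape(b * c, -1) + 1e-8     # (B*C, H*W)
        idx = torch.multinomial(probs, 1)
        vals = x.flatten(2).reshape(b * c, -1).gather(1, idx)
        return vals.view(b, c)

    def forward(self, x):
        b, c, _, _ = x.shape
        f_avg = F.adaptive_avg_pool2d(x, 1).view(b, c)
        f_max = F.adaptive_max_pool2d(x, 1).view(b, c)
        f_sto = self._stochastic_pool(x)
        att = sum(self.mlp(self.bn(f)) for f in (f_avg, f_max, f_sto))
        m_c = torch.sigmoid(att).view(b, c, 1, 1)    # channel attention map M_c(F)
        return x * m_c                               # re-weight channels of the input
```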

Figure 4. Batch-normalised stochastic attention module.


3.3. Balanced loss function

The traditional diffusion process requires thousands of time steps to gradually diffuse each input $x_0 \sim p(x_0)$ into pure Gaussian noise. The posterior of the diffused image $x_t$ at time step $t$ has a closed form, $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1-\bar{\alpha}_t) I\big)$, where $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $\beta_t \in (0,1)$ is a parameter that can be learned or fixed during the forward process.
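For reference, the closed form above corresponds to the standard one-shot forward sampling step, sketched below; the linear beta schedule is an assumption, chosen only to make the snippet self-contained.

```python
# Sketch of the standard closed-form forward step q(x_t | x_0):
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # a common fixed beta schedule (assumption)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Diffuse x0 to time step t in one shot (t: integer indices of shape (B,))."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```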

As the diffusion process adds relatively small noise at each step, the reverse transition $q(x_{t-1} \mid x_t)$ can be approximated as a Gaussian. Therefore, the trained denoising process $p_\theta(x_{t-1} \mid x_t)$ can be parameterised based on $q(x_{t-1} \mid x_t, x_0)$, where $\mu_\theta(x_t, t)$ and $\sigma^2$ are the mean and variance parameters of the denoising model. The goal is to minimise the distance between the true denoising distribution $q(x_{t-1} \mid x_t)$ and the parameterised distribution $p_\theta(x_{t-1} \mid x_t)$ using the Kullback-Leibler (KL) divergence (Van Erven & Harremos, Citation2014). Unlike traditional diffusion methods, DiffusionGAN achieves larger step sizes, and thus faster sampling, through a generative adversarial network. It introduces a discriminator $D_\xi$ and optimises the generator and discriminator adversarially, where fake samples are drawn from the conditional generator $p_\theta(x_{t-1} \mid x_t)$. Because of the large step size, $q(x_{t-1} \mid x_t)$ is no longer a Gaussian distribution, so DiffusionGAN implicitly models this complex multimodal distribution with a generator $G_\theta(x_t, z, t)$ conditioned on a latent variable $z \sim \mathcal{N}(0, I)$. Specifically, DiffusionGAN first predicts an undisturbed sample $x_0'$ using the generator $G_\theta(x_t, z, t)$, then obtains the corresponding perturbed sample $x_{t-1}'$ using $q(x_{t-1} \mid x_t, x_0')$, and the discriminator $D_\xi(x_{t-1}, x_t, t)$ distinguishes between real pairs $(x_{t-1}, x_t)$ and fake pairs $(x_{t-1}', x_t)$. (2) $\displaystyle \frac{1}{n}\sum_{i=1}^{n}\big|F(y_i) - F(y_{p,i})\big|$ To eliminate the instability caused by individual samples during training and enhance the model's generalisation, we propose a balanced loss function, as shown in Equation (2). In this equation, $y$ and $y_p$ respectively represent the label and the generated image, and $n$ is the number of samples in a batch. Compared with traditional methods, the balanced loss function allows the new model to adopt a larger step size, thereby accelerating the sampling speed.
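A direct sketch of Equation (2) is shown below. The transform F is left as a caller-supplied function (identity by default) and the per-sample aggregation is an assumption, since the equation does not spell out how the absolute difference is reduced over pixels.

```python
# Sketch of the balanced loss of Eq. (2): a mean absolute difference between a
# transform F(.) of the real image y and of the generated image y_p.
import torch

def balanced_loss(y: torch.Tensor, y_p: torch.Tensor, transform=lambda v: v) -> torch.Tensor:
    """y, y_p: batches of shape (n, ...); returns (1/n) * sum_i |F(y_i) - F(y_p_i)|."""
    diff = (transform(y) - transform(y_p)).abs()
    # sum the absolute difference within each sample, then average over the batch
    return diff.flatten(1).sum(dim=1).mean()
```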

4. Experiment

In this section, we first provide detailed information about the experimental setup, followed by empirical experiments on different datasets, including CIFAR-10, STL-10, and CelebAHQ. Finally, we discuss important components of our proposed framework. For datasets, we conducted experiments on CIFAR-10 32 × 32, STL-10 64 × 64, and two higher-resolution datasets, including CelebA HQ 256 × 256 and CelebA HQ 512 × 512.

4.1. Dataset and experimental environment

This experiment used three publicly available datasets: CIFAR-10, STL-10, and CelebAHQ.

  • CIFAR-10

CIFAR-10 is a commonly used image classification dataset consisting of colour images belonging to 10 different classes. Each class contains 6000 32 × 32-pixel images, with 50,000 images used for training and 10,000 for testing. The dataset covers various object categories, including airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. CIFAR-10 is frequently utilised to evaluate the performance of image classification models.

  • STL-10

STL-10 is another widely used image classification dataset, similar to CIFAR-10 but with larger images. STL-10 comprises images from 10 different classes, with 500 training and 800 testing images per class. Each image is a 96 × 96-pixel colour image. Compared to CIFAR-10, STL-10 contains larger and more complex images, demanding higher performance from models.

  • CelebAHQ

CelebAHQ is a dataset used for facial image processing tasks, mainly in tasks such as face generation and facial attribute analysis. The images in CelebAHQ are generated from the CelebA dataset, which contains a large number of high-quality images from celebrities. CelebAHQ is a high-resolution version of the CelebA dataset, featuring larger dimensions and higher image resolutions. This dataset can be used to train models such as Generative Adversarial Networks (GANs) to generate realistic facial images.

The experimental environment uses the cloud server provided by the Kaggle platform; the specific configuration is shown in Table 1.

Table 1. Experimental environment.

4.2. Evaluation metrics

We use the commonly used Fréchet Inception Distance (FID) (Obukhov & Krasnyanskiy, Citation2020) to evaluate the fidelity of generated samples. The calculation process is as follows.

Assume the true distribution $P_r$ and the generated distribution $P_g$ are each modelled as a multidimensional Gaussian with parameters $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$, where $\mu$ and $\Sigma$ are the mean vector and covariance matrix, respectively. FID is computed as: (3) $\mathrm{FID} = d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big) = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$ where Tr denotes the trace of a matrix (the sum of its diagonal elements). The smaller the FID value, the better the fidelity.
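A small NumPy/SciPy sketch of Equation (3) is given below; fitting (μ, Σ) to Inception-v3 activations of the real and generated images is assumed to happen outside the function.

```python
# Sketch of Eq. (3): FID between two Gaussians fitted to real and generated features.
import numpy as np
from scipy import linalg

def fid(mu_r, sigma_r, mu_g, sigma_g):
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)   # (Sigma_r Sigma_g)^(1/2)
    covmean = np.real(covmean)                                  # drop tiny imaginary parts
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Usage sketch: given Inception activations `acts` of shape (N, D),
# mu = acts.mean(axis=0) and sigma = np.cov(acts, rowvar=False), then call fid(...).
```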

We use the Inception Score (IS) to measure the diversity of generated images, which is defined as follows: (4) $\mathrm{IS} = D_{KL}(P \,\Vert\, Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)}$ where $D_{KL}$ denotes the Kullback–Leibler divergence, which measures the distance between two distributions. A larger $D_{KL}$ value indicates better diversity.
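The snippet below sketches the KL-divergence term that Equation (4) is built on, for two discrete distributions P and Q; the epsilon smoothing is an assumption added only to avoid division by zero.

```python
# Sketch of the KL divergence in Eq. (4) for two discrete distributions P and Q.
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    p = p / p.sum()                               # normalise to valid distributions
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```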

For sampling time, we measure it by the number of function evaluations (NFE) and the clock time required to generate 100 images on the P100 GPU.

4.3. Implementation details

For all experiments, we adopt the same training configuration as DDGAN. The training epochs are set to 500 for CelebA HQ 256 × 256 and 1800 for CIFAR-10 32 × 32. For CelebA HQ 512 × 512, we train our model and DDGAN for 400 epochs. The proposed method is implemented for each dataset in the experimental environment shown in Table 1. It is worth noting that, owing to the efficiency of our wavelet diffusion framework, our models require fewer GPU resources and less computation than DDGAN. Moreover, because we reduce the image size, we can use shallower network structures, which further alleviates the computational burden.

4.4. Experimental results

For CIFAR-10, as shown in Table 2, our contour wavelets greatly improve inference time: diffusion-based sampling takes only 0.09 s, two times faster than DDGAN. This gives us real-time performance on par with GAN-based methods, while retaining a significant margin over other diffusion models in terms of sampling speed.

Table 2. Results for unconditional generation on CIFAR-10.

For STL-10, as shown in Table 3, we not only obtain a better FID score of 12.32 but also a faster sampling time of 0.41 s. Our method converges faster than the baseline, especially in the early stages, where the samples generated by DDGAN are unable to recover the overall shape and structure of the objects. In addition, we provide sample images generated by the model in Figure 5.

Figure 5. Generated samples on STL-10.


Table 3. Results on STL-10.

At a resolution of 256 × 256 on CelebA HQ, our model outperforms some prominent diffusion and GAN baselines with an FID of 6.35 and a recall of 0.39, while achieving over 2× speedup, as shown in Table 4. On the high-resolution CelebA HQ 512 × 512 (Table 5), our model significantly outperforms DDGAN in both image quality (FID 4.98 vs 5.25) and sampling time (1.58 s vs 3.42 s), and achieves a higher recall of 0.42.

Table 4. Results on CelebA HQ 256 × 256

Table 5. Results on CelebA-HQ 512 × 512

In Table 5, we achieve superior image quality at 4.98 compared to other diffusion models while achieving results comparable to GANs. It is worth noting that the inference time (1.58 s for 25 samples) remains low thanks to our largely unchanged network configuration, with only two modifications: removing the attention layer and using a patch size of 2. Our model provides 2× faster inference than DDGAN while surpassing StyleGAN2 (Tran et al., Citation2017) in terms of recall (0.6). For qualitative results, randomly generated samples are shown in Figure 6.

Figure 6. Generated samples of CelebA dataset.


4.5. Ablation study

In this section, we validate the contribution of each individual component of our proposed contourlet diffusion on CelebA HQ 256 × 256 in Table 6, where the complete model includes the contour wavelet generation network, the batch-normalised stochastic attention module, and the balanced loss function. As can be seen, each component contributes to the performance of the model. By applying all three proposed components, our method achieves the best performance at 6.35, with the contour wavelet generation network being the most important component in the generator design. The remaining components still bring some performance improvement, at only a small cost in running speed.

Table 6. Ablation study of wavelet generator on CelebA-HQ 256 × 256. Each setting is trained for 500 epochs

As shown in the fourth row of the ablation experiment, when we use contour wavelets only on DDGAN, the FID is 6.58, an improvement of 1.06 over DDGAN, with no significant increase in time (see Table 6). This shows that our proposed contour wavelet is more effective than DDGAN.

We have successfully showcased the speed of our models when processing single images, making them highly suitable for real-life applications. Table 7 provides an overview of their processing time and key parameters. Notably, our wavelet diffusion models can generate images as large as 1024 × 1024 in a swift 0.12 s, marking near real-time performance of this calibre for a diffusion model.

Table 7. Ablation study of contour wavelet diffusion network on CelebA-HQ 256 × 256. Each setting is trained for 500 epochs.

5. Discussion

5.1. Contributions

To address the issue of prolonged model inference time, we propose the Contour Wavelet Diffusion Model, which utilises the characteristics of wavelet transformation to speed up the original diffusion model while ensuring the quality of generated images. The attention module in the model enables effective focus on important high-frequency information, thereby further improving the quality of image generation. The model incorporates a reconstruction loss function to ensure the network learns variation consistency while further enhancing speed.

Three public datasets, CIFAR-10, STL-10, and CelebAHQ, were used in the experiments. The Contour Wavelet Diffusion Model achieves good results on these datasets. In particular, on the high-resolution CelebA HQ 256 × 256 and CelebA HQ 512 × 512 datasets, its performance outperforms some well-known GAN and diffusion baselines, while achieving over twice the speed. Experiments show that the Contour Wavelet Diffusion Model can decouple complex low-frequency information from easily learned high-frequency information, and reducing the size of the low-frequency space makes the original diffusion model easier to train. It has broad application prospects in various fields.

5.2. Shortcomings and prospects

While our proposed Contour Wavelet Diffusion method shows promising results in accelerating the training and inference speed of the diffusion model while maintaining image generation quality, there are still some limitations and opportunities for future research.

One of the main shortcomings of our method is that it relies on wavelet transforms, which are computationally expensive themselves, especially when dealing with high-resolution images. Although we have achieved acceleration by processing down-sampled components, there is still room for further optimisation to reduce the overall computational burden.

Additionally, while our attention mechanism improves image generation quality by focusing on important high-frequency information, it might not be fully effective in handling complex and subtle facial expressions. Further research could explore alternative attention mechanisms or hybrid approaches to enhance the model's ability to capture intricate facial details.

Moreover, the experimental evaluation primarily focuses on public datasets, which may not fully represent the diversity of real-world scenarios. Evaluating the proposed method on more extensive and diverse datasets, including in-the-wild and cross-cultural facial expressions, would provide a more comprehensive assessment of its performance and generalizability.

Furthermore, the proposed balanced loss function has shown improvements in convergence speed, but investigating other loss functions and regularisation techniques could potentially lead to even better results and robustness against overfitting.

The learnable Gabor filter bank is also an effective tool for biometric feature extraction and representation, especially for orientation information encoding. However, in our experiments, we found that its generative performance is not as good as the proposed contourlet wavelet. This is because the wavelet transform offers a variable time–frequency window, whereas the time–frequency localisation of the Gabor function is fixed, which also means it cannot provide the optimal resolution simultaneously across all scales. In terms of prospects, exploring hardware-specific optimisations and leveraging parallel computing techniques could significantly reduce the training and inference times, making the Contour Wavelet Diffusion method more practical for real-time applications on various devices.

Lastly, considering the rapid advancements in machine learning and computer vision, future research could explore integrating our method with other state-of-the-art facial attribute reconstruction techniques, such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs), to further enhance the quality and diversity of generated faces.

In conclusion, the Contour Wavelet Diffusion method presented in this paper represents a step forward in accelerating diffusion models for facial expression reconstruction. While our experiments demonstrate its effectiveness, addressing the mentioned shortcomings and pursuing the outlined prospects would pave the way for even more efficient and powerful models in the field of facial attribute reconstruction and other related applications.

6. Conclusion

This paper proposes a new diffusion model based on contour wavelets. Firstly, we perform a wavelet decomposition on both the image and feature levels to extract anisotropic low- and high-frequency components, which are processed to achieve acceleration. Additionally, due to the well-behaved reconstruction property of wavelet transforms, good image generation quality can be maintained. Secondly, we propose an attention mechanism that effectively focuses on important high-frequency information, thereby improving image generation quality. Finally, we introduce a reconstruction loss function to further enhance the model's convergence speed. The key to the effectiveness of our model lies in the ability to decouple complex low-frequency information from easy-to-learn high-frequency information; reducing the scale of the low-frequency space makes the original diffusion model easier to train.

This study is of significant importance as it enhances generation speed by reducing computational complexity. Firstly, in real-time applications, generating realistic facial expressions within short periods is crucial for delivering a smooth user experience. Secondly, it helps save resources and costs by reducing computational complexity. This is especially important for mobile devices and embedded systems, as it extends battery life and reduces resource consumption. Additionally, it assists in reducing deployment and maintenance costs. Lastly, it expands the scope of application areas for facial expression generation technology. Efficient facial expression generation, enabled by improved computational efficiency and generation speed, can be applied not only in domains such as movies, games, and virtual reality but also in various fields including healthcare, education, human–computer interaction, and affective computing. This expansion brings forth opportunities for innovation and application in these domains.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The CIFAR-10 dataset can be downloaded from "http://www.cs.toronto.edu/~kriz/cifar.html"; the STL-10 dataset can be downloaded from "https://cs.stanford.edu/~acoates/stl10/"; the CelebA-HQ dataset can be downloaded from "http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html".

Additional information

Funding

The author(s) reported there is no funding associated with the work featured in this article.

References

  • Abdal, R., Qin, Y., & Wonka, P. (2019). Image2stylegan: How to embed images into the stylegan latent space? 2019 IEEE/CVF International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2019.00453
  • Bamberger, R. H., & Smith, M. J. (1992). A filter bank for the directional decomposition of images: Theory and design. IEEE Transactions on Signal Processing, 40(4), 882–893. https://doi.org/10.1109/78.127960
  • Batista, J. C., Albiero, V., Bellon, O. R., & Silva, L. (2017). Aumpnet: Simultaneous action units detection and intensity estimation on multipose facial images using a single convolutional neural network. 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2017.
  • Brophy, E., Wang, Z., She, Q., & Ward, T. (2023). Generative adversarial networks in time series: A systematic literature review. ACM Computing Surveys, 55(10), Article 199. https://doi.org/10.1145/3559540
  • Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., & Lundberg, S. (2023). Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:.12712.
  • Carvalho, T. G., Thiberge, S., Sakamoto, H., & Ménard, R. (2004). Conditional mutagenesis using site-specific recombination in Plasmodium berghei. Proceedings of the National Academy of Sciences, 101(41), 14931–14936. https://doi.org/10.1073/pnas.0404416101
  • Croitoru, F.-A., Hondru, V., Ionescu, R. T., & Shah, M. (2023). Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10850–10869. https://doi.org/10.1109/tpami.2023.3261988
  • Da Cunha, A. L., Zhou, J., & Do, M. N. (2006). The nonsubsampled contourlet transform: Theory, design, and applications. IEEE Transactions on Image Processing, 15(10), 3089–3101. https://doi.org/10.1109/TIP.2006.877507
  • Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794.
  • Do, M. N., & Vetterli, M. (2003). The finite ridgelet transform for image representation. IEEE Transactions on Image Processing, 12(1), 16–28. https://doi.org/10.1109/TIP.2002.806252
  • Ekman, P. (2004). Emotions revealed. BMJ, 328(Suppl S5), 0405184. https://doi.org/10.1136/sbmj.0405184
  • Ekman, P., & Friesen, W. V. (1978). Facial action coding system [dataset]. In PsycTESTS Dataset. American Psychological Association (APA). https://doi.org/10.1037/t27734-000
  • Ekman, P., Friesen, W. V., & Tomkins, S. S. (1971). Facial affect scoring technique: A first validity study.
  • Eslami, R., & Radha, H. (2004). Wavelet-based contourlet transform and its application to image coding. 2004 International Conference on Image Processing, 2004. ICIP'04.
  • Esser, P., Rombach, R., Blattmann, A., & Ommer, B. (2021). Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. Advances in Neural Information Processing Systems, 34, 3518–3532.
  • Freeman, W. T., & Adelson, E. H. (1991). The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9), 891–906. https://doi.org/10.1109/34.93808
  • Gao, R., Song, Y., Poole, B., Wu, Y. N., & Kingma, D. P. (2020). Learning energy-based models by diffusion recovery likelihood. arXiv preprint arXiv:.08125.
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Stat, 10, 1050.
  • Gudi, A., Tasli, H. E., Den Uyl, T. M., & Maroulis, A. (2015). Deep learning based facs action unit occurrence and intensity estimation. 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).
  • Gupta, G., Khan, S., Guleria, V., Almjally, A., Alabduallah, B. I., Siddiqui, T., Albahlal, B. M., Alajlan, S. A., & Al-Subaie, M. (2023). DDPM: A dengue disease prediction and diagnosis model using sentiment analysis and machine learning algorithms. Diagnostics, 13(6), 1093. https://doi.org/10.3390/diagnostics13061093
  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
  • Huang, J., Cui, K., Guan, D., Xiao, A., Zhan, F., Lu, S., Liao, S., & Xing, E. (2022). Masked generative adversarial networks are data-efficient generation learners. Advances in Neural Information Processing Systems, 35, 2154–2167.
  • Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:.10196.
  • Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing and improving the image quality of stylegan. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  • Kaur, G., Agarwal, R., & Patidar, V. (2020). Semi-blind robust watermarking with dual complex tree wavelet based hybrid transform and SVD. 2020 IEEE 17th India Council International Conference (INDICON).
  • Kingma, D. P., & Dhariwal, P. (2018, December). Glow: Generative flow with invertible 1 × 1 convolutions. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (pp. 10236–10245).
  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Kong, Z., & Ping, W. (2021). On fast sampling of diffusion probabilistic models. arXiv preprint arXiv:.00132.
  • Lu, Y., Chen, D., Olaniyi, E., & Huang, Y. (2022). Generative adversarial networks (GANs) for image augmentation in agriculture: A systematic review. Computers and Electronics in Agriculture, 200, 107208. https://doi.org/10.1016/j.compag.2022.107208
  • Luhman, E., & Luhman, T. (2021). Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:.02388.
  • Ma, H., Zhang, L., Zhu, X., & Feng, J. (2022). Accelerating score-based generative models with preconditioned diffusion sampling. European Conference on Computer Vision.
  • Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., & Chen, M. (2021). Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:.10741.
  • Obukhov, A., & Krasnyanskiy, M. (2020). Quality assessment method for GAN based on modified metrics inception score and Fréchet inception distance. Advances in Intelligent Systems and Computing, 102–114. https://doi.org/10.1007/978-3-030-63322-6_8
  • Oord, A. V. D., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., & Kavukcuoglu, K. (2016, December). Conditional image generation with PixelCNN decoders. In Proceedings of the 30th International Conference on Neural Information Processing Systems (pp. 4797–4805).
  • Pandzic, I. S. (2002). MPEG-4 facial animation framework for the web and mobile applications. In MPEG-4 Facial Animation (pp. 65–79). Portico. https://doi.org/10.1002/0470854626.ch4
  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:.06125, 1(2), 3.
  • Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-shot text-to-image generation. International Conference on Machine Learning.
  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  • Salimans, T., & Ho, J. (2022). Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:.00512.
  • Shensa, M. J. (1992). The discrete wavelet transform: Wedding the a trous and Mallat algorithms. IEEE Transactions on Signal Processing, 40(10), 2464–2482. https://doi.org/10.1109/78.157290
  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning.
  • Song, J., Meng, C., & Ermon, S. (2020). Denoising diffusion implicit models. arXiv preprint arXiv:.02502.
  • Song, Y., Durkan, C., Murray, I., & Ermon, S. (2021). Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34, 1415–1428.
  • Song, Y., & Ermon, S. (2020). Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33, 12438–12448.
  • Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:.13456.
  • Taubman, D. S., Marcellin, M. W., & Rabbani, M. (2002). JPEG2000: Image compression fundamentals, standards and practice. Journal of Electronic Imaging, 11(2), 286–287. https://doi.org/10.1117/1.1469618
  • Tran, D. L., Walecki, R., Rudovic, O., Eleftheriadis, S., Schuller, B., & Pantic, M. (2017). Deepcoder: Semi-parametric variational autoencoders for automatic facial action coding. 2017 IEEE International Conference on Computer Vision (ICCV).
  • Vahdat, A., & Kautz, J. (2020). NVAE: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33, 19667–19679.
  • Vahdat, A., Kreis, K., & Kautz, J. (2021). Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34, 11287–11302.
  • Van Erven, T., & Harremos, P. (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820. https://doi.org/10.1109/TIT.2014.2320500
  • Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7), 1661–1674. https://doi.org/10.1162/NECO_a_00142
  • Walecki, R., Rudovic, O., Pavlovic, V., & Pantic, M. (2016). Copula ordinal regression for joint estimation of facial action unit intensity. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Walecki, R., Rudovic, O., Pavlovic, V., & Pantic, M. (2017). Variable-state latent conditional random field models for facial expression analysis. Image and Vision Computing, 58, 25–37. https://doi.org/10.1016/j.imavis.2016.04.009
  • Wang, Z., Zheng, H., He, P., Chen, W., & Zhou, M. (2022). Diffusion-gan: Training gans with diffusion. arXiv preprint arXiv:.02262.
  • Xiao, Z., Kreis, K., Kautz, J., & Vahdat, A. (2020). Vaebm: A symbiosis between variational autoencoders and energy-based models. arXiv preprint arXiv:.00654.
  • Yang, M., Wang, Z., Chi, Z., & Feng, W. (2022). WaveGAN: Frequency-aware GAN for high-fidelity few-shot image generation. European Conference on Computer Vision.
  • Zhang, S., Li, L., & Zhao, Z. (2012). Facial expression recognition based on Gabor wavelets and sparse representation. 2012 IEEE 11th International Conference on Signal Processing.