Research Article

Label Noise Robust Crowd Counting with Loss Filtering Factor

Article: 2329859 | Received 10 Nov 2023, Accepted 01 Mar 2024, Published online: 21 Mar 2024

ABSTRACT

Crowd counting, a crucial computer vision task, aims at estimating the number of individuals in various environments. Each person in crowd counting datasets is typically annotated by a point at the center of the head. However, challenges like dense crowds, diverse scenarios, significant obscuration, and low resolution lead to inevitable label noise, adversely impacting model performance. Driven by the need to enhance model robustness in noisy environments and improve accuracy, we propose the Loss Filtering Factor (LFF) and the corresponding Label Noise Robust Crowd Counting (LNRCC) training scheme. LFF innovatively filters out losses caused by label noise during training, enabling models to focus on accurate data, thereby increasing reliability. Our extensive experiments demonstrate the effectiveness of LNRCC, which consistently improves performance across all models and datasets, with an average enhancement of 3.68% in Mean Absolute Error (MAE), 6.7% in Mean Squared Error (MSE) and 4.68% in Grid Average Mean Absolute Error (GAME). The universal applicability of this approach, coupled with its ease of integration into any neural network model architecture, marks a significant advancement in the field of computer vision, particularly in addressing the pivotal issue of accuracy in crowd counting under challenging conditions.

Introduction

Crowd counting represents a prominent computer vision task, aimed at automatically estimating the number of individuals in unconstrained scenes (Idrees et al. Citation2013; Laradji et al. Citation2018; Liu et al. Citation2019; Ma et al. Citation2019). This task has garnered significant attention in recent years, with extensive research and implementation across various real-world scenarios, including smart buildings (Zou et al. Citation2018), traffic monitoring (Marsden et al. Citation2018; Zhang et al. Citation2017), and public spaces in Saudi Arabia (Alotibi et al. Citation2019). By harnessing real-time image data, crowd counting facilitates applications such as video surveillance (Wang, Hou, and Chau Citation2019), enhanced security (Chan, John Liang, and Vasconcelos Citation2008), and efficient bandwidth allocation (Zou et al. Citation2017).

In general, crowd counting is challenging due to heavy overlaps and occlusions, complex and noisy backgrounds, and variations in perspective and illumination. In the past decade, a number of crowd counting algorithms have been proposed in the literature. Most early works estimate crowd counts by detecting people, bodies, or heads in the image (Li et al. Citation2008; Rabaud and Belongie Citation2006), which can yield inaccurate estimates and incur considerable computational cost in dense crowds because of the heavy overlaps and occlusions of people. Currently, methods that cast crowd counting as a density map estimation problem and combine it with convolutional neural networks (CNNs) have made remarkable progress (Boominathan, Kruthiventi, and Venkatesh Babu Citation2016; Chen et al. Citation2021; Cheng et al. Citation2019; Lin et al. Citation2022; Ma et al. Citation2019; Wang et al. Citation2020). In these methods, the values of the crowd density map regressed by CNNs are summed to give the total size of the crowd.

Training crowd counting models effectively poses a challenge due to the nature of most available datasets, which provide only point annotations for each training image, typically denoting the center of a person's head (Idrees et al. Citation2013, Citation2018; Zhang et al. Citation2016a). One prevalent approach involves transforming the point annotations into density maps using a Gaussian kernel, treating these density maps as the "ground truth," and training the model by regressing values at each pixel within the density map. Furthermore, recent studies (Ma et al. Citation2019; Wang et al. Citation2020; Wan, Liu, and Chan Citation2021; Zhang et al. Citation2016a) have explored alternative methods to enhance this point-to-density-map conversion process, yielding promising performance improvements.

The quality of datasets and the accuracy of point annotations have a profound impact on the performance of crowd count estimators in crowd counting learning tasks (Gao et al. Citation2020). Presently, point annotations for crowd counting are primarily acquired through manual labor, requiring the meticulous labeling of each person in every image of the dataset (Jingying Citation2021; Li et al. Citation2021). However, due to factors such as overlapping individuals, low image resolution, and extremely high crowd densities – especially near the vanishing point of the image – annotators inevitably produce a significant number of mislabeled annotations. Furthermore, given that each point annotation covers only a tiny fraction of a person's head, there is inherent spatial error; in other words, not every point annotation precisely corresponds to the center of a person's head. As a result, these "unknown" mislabeled point annotations and spatial errors can also hurt the training of crowd count estimators.

To alleviate the impact of annotation noise and bridge the research gap, we introduce the Loss Filtering Factor (LFF) to assist the model in filtering out losses (at the pixel level) that are most likely due to annotation noise, and we further propose the Label Noise Robust Crowd Counting (LNRCC) training scheme based on LFF, which enables the model to focus on more critical non-noise losses during the training process. Specifically, LNRCC first initializes a crowd counting model and trains it on the given training images and annotations to obtain an initial model. Then, predictions are made on test images using this initial model, and the deviation between predictions and annotations is calculated to represent the loss for each data point. LNRCC then sorts this loss vector to rank the losses from small to large. After that, it generates a binary mask vector based on the sorted losses and a hyperparameter θ, which acts as a filter to remove certain losses. Using this mask, LNRCC calculates the filtered losses by zeroing out some elements of the original loss vector so that only selected losses are used for supervision. Finally, the initial model is updated using LFF as the loss function. The steps are repeated until the model converges. In this way, LNRCC selectively filters out likely noise-induced losses, enabling the model to focus more on critical non-noise data during training and enhancing robustness against label noise.

We evaluate the performance of our proposed LNRCC method across various backbone networks (Liu et al. Citation2021; Ma et al. Citation2019; Wang et al. Citation2020; Zhang, Choi, and Hong Citation2022) using multiple datasets (Idrees et al. Citation2013, Citation2018; Liu et al. Citation2021; Zhang et al. Citation2016a). Our comprehensive experimental results demonstrate that our LNRCC effectively enhances the robustness of crowd counting models in label-noise environments while significantly improving their overall learning performance.

Recent progress in crowd counting has seen a variety of innovative approaches tailored to specific challenges in the field. The Scale Region Recognition Network (Guo et al. Citation2023) and the Scale-Context Perceptive Network (Zhai et al. Citation2023) address scale variations and context-aware processing, crucial for applications in intelligent transportation systems and smart cities. Attention mechanisms have been pivotal in enhancing accuracy within dense crowds, as seen in the Group-split Attention Network (Zhai et al. Citation2022), Spatial-Frequency Attention Network (Guo et al. Citation2022), and FPANet (Zhai et al. Citation2023). These models employ attention to manage complex crowd scenes effectively.

In contrast to these developments, our research introduces the Loss Filtering Factor (LFF) and the Label Noise Robust Crowd Counting (LNRCC) training scheme. While the aforementioned SOTA models focus on structural, attentional, and scale-aware aspects of the networks, our LFF approach specifically targets the challenge of label noise in crowd counting datasets. By filtering out noise-impacted losses, LFF enables a more accurate and reliable training process, addressing a key challenge that has been less explored in these recent advancements.

This paper is a significant extension of our prior conference paper. Our contributions are summarized as follows:

  • We identify that label noise, encompassing both spatial noise and quantity noise within the training data, exerts a significant influence on the reduction of learning performance in crowd counting models.

  • We propose the loss filtering factor (LFF) to alleviate the negative impact of label noise by filtering losses that are likely caused by noise during model training.

  • We propose the Label Noise Robust Crowd Counting (LNRCC) Training Scheme based on LFF. LNRCC is a comprehensive training pipeline designed to bolster the robustness of crowd counting models in label-noise environments.

Related Work

Crowd Counting

Crowd counting has been extensively studied as a fundamental issue in computer vision. The approaches to this problem can be categorized into three types: detection, direct count regression, and point supervision.

Initially, most methods (Li et al. Citation2008; Liu et al. Citation2019; Rabaud and Belongie Citation2006) focused on detecting individuals, heads, or upper bodies in images. However, this approach faces significant challenges in dense crowds, primarily due to the extensive occlusions and the labor-intensive nature of bounding box annotation.

Transitioning to the next phase of development, later methods (Idrees et al. Citation2018; Jiang et al. Citation2019; Li, Zhang, and Chen Citation2018) moved away from detection-based approaches. Instead, they regress to a "ground truth" density map created from point annotations. These methods use location information to learn a density map for each training sample. Nonetheless, they often assume an even crowd distribution, which is not always reflected in images due to factors like camera angles and imaging techniques. In addressing these challenges, Hu et al. (Citation2022) introduced RDC-SAL, a framework that combines refine distance compensating with quantum scale-aware learning, significantly enhancing feature extraction in dense scenes.

Further evolving the methodology, recent works (Dong et al. Citation2020; Ma et al. Citation2019; Wang et al. Citation2021) have suggested using point supervision directly, bypassing the need for generating density maps. These advancements have led to novel approaches like optimal transport (Ma et al. Citation2021) and divergent measuring techniques, focusing on weak supervision without relying on Gaussian distribution assumptions. In this context, Hafeezallah, Al-Dhamari, and Abd Rahman Abu-Bakar (Citation2022) introduce an innovative approach using a multi-scale network with an integrated attention unit, further enhancing the accuracy and robustness of crowd counting in challenging scenarios. The QE-DAL framework by Hu, Tang, and Yang (Citation2023) leverages quantum computing for feature extraction in dense crowd scenes, providing a new dimension in crowd counting methodologies.

Further innovations include the integration of multiple attention mechanisms and scale-aware strategies, such as in the Triple Attention and Scale-Aware Network (Guo et al. Citation2022), designed for remote sensing, and the Dense Attention Fusion Network (Guo et al. Citation2023), which focuses on IoT systems. The Multiscale Aggregation Network (Guo et al. Citation2022) offers a unique approach with its smooth inverse mapping technique. Additionally, the comprehensive analysis of crowd counting methodologies in IoT by Gao et al. (Citation2023) provides valuable insights into the applicability of various approaches in IoT scenarios.

Other notable works include the Lightweight Ghost Attention Pyramid Network (Guo et al. Citation2023), which offers an efficient solution for smart city applications, the DA2Net (Zhai et al. Citation2023), a dual attention-aware network designed for robust crowd counting, and the Attentive Hierarchy ConvNet (Zhai et al. Citation2023), which emphasizes hierarchical convolutional approaches for smart city environments.

In crowd counting tasks, selecting an effective loss function is critical. Initially, Euclidean loss was commonly used, focusing on minimizing the MSE between predicted density maps and the ground truth (Li, Zhang, and Chen Citation2018; Zhang et al. Citation2016a). While simple and flexible, this approach overlooks the correlation between adjacent pixels, limiting the quality of density maps.

To address these limitations, newer methods introduced structural similarity-based losses, like the SSIM loss in SANet (Cao et al. Citation2018). These allow models to learn local correlations at various scales, but they struggle with scale variations. Here, the approach of (Hafeezallah, Al-Dhamari, and Abd Rahman Abu-Bakar Citation2022) demonstrates the efficacy of leveraging multi-scale features and attention mechanisms to address scale variations in crowd scenes.

Another development in this field involves the use of multi-task learning frameworks (Sindagi and Patel Citation2019; Wei, Yuan, and Wang Citation2020; Hafeezallah, Al-Dhamari, and Abd Rahman Abu-Bakar Citation2022). These frameworks, while effective in crowded scenes, are sensitive to hyperparameters and require precise tuning. Strategies like divide-and-conquer (Xiong et al. Citation2019) also emerged, offering efficient segmentation but at higher computational costs.

Additionally, diverse loss optimization strategies have been explored. For instance, CNN-Boosting uses layered boosting and selective sampling (Walach and Wolf Citation2016), and D-ConvNet employs deep negative correlation learning (Shi et al. Citation2018), enhancing counting robustness. In the context of scene classification and motion pattern analysis, Mohammed et al. (Citation2023) introduce adaptive synthetic oversampling and fully connected deep neural networks, providing new insights into the classification of crowd scenes based on motion patterns.

Noisy Labels

The effectiveness of deep neural networks is dependent on having access to high-quality labeled training data because label mistakes (label noise) in training data can significantly impair model performance on clean test data (Zhang et al. Citation2021). Unfortunately, samples with faulty or incorrect labels are virtually always present in big training datasets. Recently, an increasing number of academics have focused on this issue. Han et al. proposed a Co-teaching model for combating noisy labels (Han et al. Citation2018), Jiang et al. contributed to a deeper understanding of deep learning using non-synthetic noisy labels (Jiang et al. Citation2020). To combat overfitting on faulty labels, Jiang et al. proposed the MentorNet method, a strategy for learning another neural network (Jiang et al. Citation2018).

However, these methods cannot transfer well to the crowd counting problem due to insufficient datasets for controlled tests and high computing costs. On the other hand, studies have shown that deep learning models outperform humans in a variety of activities, such as image classification (Zoph et al. Citation2018), Go (Silver et al. Citation2016), and speech recognition (Pham et al. Citation2019). Therefore, for crowd counting problems, particularly high-density crowd counting tasks, we may expect that, under certain conditions, existing deep learning models predict more true signals than human-labeled annotations.

Proposed Method

Background and Motivation

Unlike other tasks, label noise in datasets is unavoidable for crowd counting tasks because of the dense crowd, variety of scenarios, and significant obscuration.

On the one hand, most popular crowd benchmarks have a large crowd density, making consistency and accuracy in point annotations challenging. The statistics of the multi-scene datasets for dense crowd counting are summarized in Table 1. The majority of datasets feature an average of more than a hundred persons per image.

Table 1. Average count and average density of popular crowd datasets.

On the other hand, based on observation, the label noise in popular crowd benchmarks may be separated into two main categories: spatial noise and quantity noise. The relative locations of labeled points vary for the same images in the dataset: some are the pixel at the center of the head, whereas others are just a random pixel within the person (e.g., on the chest or waist); in some cases, points outside the person are also annotated. Such annotation inaccuracies are referred to as spatial noise, as shown in Figure 1(a). Moreover, due to the limited availability of high-resolution crowd images, there is quantity noise in the dataset, particularly for images with high crowd density and low resolution, in the form of missing and duplicate annotations, as shown in Figure 1(b). Both of the aforementioned types of label noise will undoubtedly affect the training and performance of deep learning models to some extent (Zhang et al. Citation2021).

Figure 1. This figure highlights the label noise problems in existing dense crowd datasets. (a) shows cases where the annotations are in other parts of the body (not in the center of the head) or outside the body, while (b) shows examples of both duplicate and missing annotations.


Loss Filtering Factor

The Loss Filtering Factor is proposed to alleviate the negative effects of label noise based on the assumption that the trained neural network model predicts more correct signals under certain conditions than human-labeled annotations (Khan, Menouar, and Hamila Citation2023a). The proposed Loss Filtering Factor can filter out the losses assumed to be caused by label noise (quantity noise and spatial noise) during the training process.

Figure 2 shows a simple (2-D) example of training with and without the Loss Filtering Factor. During the training process, the model's predicted values in epoch T will be closer to the label values than in epoch T−1, i.e., the total loss of epoch T is less than that of epoch T−1. When training without the loss filtering factor, the model treats the losses caused by label noise as regular losses; therefore, even though the total loss is reduced, the non-noise losses are not always minimized. When training with the Loss Filtering Factor, the losses believed to be caused by label noise are filtered out. As a result, both the total loss and the non-noise losses decrease.

Figure 2. Comparison of training with (right) and without (left) the loss filtering factor.


The inspiration for the LFF concept is rooted in the principle of selective attention in human perception, analogous to focusing on relevant information while filtering out the less pertinent. This approach is critical in environments with label noise, a prevalent issue in computer vision tasks like crowd counting. LFF is designed to dynamically adjust the weight of each data point in the loss function, based on its estimated level of noise. This selective filtering allows the model to prioritize learning from non-noisy, reliable data, thereby enhancing the overall model performance and robustness against label noise. The development of LFF is a response to the need for simple yet effective solutions in machine learning models, emphasizing tailored approaches to address specific challenges effectively, rather than applying generic methods.

The proposed method employs the mask $M=[m_j]_{j=1}^{N}$ to selectively supervise the training of the neural network model. It can be defined as:

$$L_{FF} = \sum_{j=1}^{N} m_j \cdot l_j \tag{1}$$

where $L=[l_j]_{j=1}^{N}$ is the deviation between the predicted values and the label values. For example, for methods based on point supervision, $l_j$ is the difference between each predicted point value and the corresponding label point value, whereas for density-map-based methods, $l_j$ is the difference between the predicted and the generated density maps. The value of $N$ is determined by the loss function: in MSE loss, $N$ equals the size of the density map, while in Bayesian loss, $N$ equals the number of annotated points.

Given that there is unavoidably some label noise in datasets and that the trained model sometimes predicts more accurate signals than the annotations, the mask $M$ provides a mechanism to filter out some losses, preventing them from participating in the back-propagation process of model training. We consider the deviation $l_j$ between predictions and labels to be a type of label uncertainty. According to the foregoing argument, if $l_j$ is excessively large, it is most likely generated by label noise. In this case, we dynamically diminish or eliminate its weight in back-propagation. For efficient computation, we adopt $M$ as a binary vector. After calculating all losses $[l_j]_{j=1}^{N}$, we obtain the sorted list $S=[e_j]_{j=1}^{N}$ by sorting $[l_j]_{j=1}^{N}$ in ascending order. Then $m_j$ is defined as follows:

$$m_j = \begin{cases} 1, & \text{if } l_j \in \{S(1), S(2), S(3), \ldots, S(\lfloor \theta N \rfloor)\} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $\theta$ is the parameter of the Loss Filtering Factor and $\lfloor \theta N \rfloor$ denotes the largest integer not exceeding $\theta N$.

Let $F = F(l_j)$ denote the loss function used by the model. When the Loss Filtering Factor $\theta = 1.0$, this is equivalent to not using the Loss Filtering Factor and passing all losses $[l_j]$ directly to the loss function $F$; when $\theta = 0.85$, 15% of the label values are considered to be noise, and only the 85% of label values with the lowest deviations from the predictions are involved in supervision. Finally, the overall loss function with the Loss Filtering Factor is as follows:

$$F_{LFF} = F(L_{FF}) = F\!\left(\sum_{j=1}^{N} m_j \cdot l_j\right) \tag{3}$$
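For illustration, a minimal PyTorch-style sketch of Equations (1)–(3) is given below. The function name lff_loss and the assumption that the per-element deviations arrive as a one-dimensional tensor are illustrative choices rather than the released implementation; the base loss F is simply a sum here.

import torch

def lff_loss(per_element_losses: torch.Tensor, theta: float = 0.9) -> torch.Tensor:
    # per_element_losses: 1-D tensor [l_1, ..., l_N] of non-negative deviations
    # between predictions and labels (pixel-wise or point-wise, depending on
    # the base loss). theta is the Loss Filtering Factor parameter.
    n = per_element_losses.numel()
    keep = int(theta * n)  # floor(theta * N), as in Eq. (2)
    # Indices of the `keep` smallest losses; the largest losses are presumed noise-induced.
    _, kept_idx = torch.topk(per_element_losses, keep, largest=False)
    mask = torch.zeros_like(per_element_losses)  # binary mask m_j from Eq. (2)
    mask[kept_idx] = 1.0
    # Filtered loss L_FF = sum_j m_j * l_j, Eq. (1); the mask carries no gradient,
    # so back-propagation flows only through the selected losses.
    return (mask * per_element_losses).sum()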

Label Noise Robust Crowd Counting Training Scheme

Based on LFF, we design the Label Noise Robust Crowd Counting (LNRCC) training scheme to enhance the robustness of crowd counting models and improve their learning performance in label-noise environments. Algorithm 1 illustrates the steps of LNRCC.

LNRCC first initializes a crowd counting model and trains it on the given training images and annotations to obtain an initial model. Then, predictions are made on test images using this initial model, and the deviation vector between predictions and annotations is calculated to represent the loss for each data point. Next, LNRCC sorts this loss vector in ascending order to rank the losses from small to large. After that, it generates a binary mask vector based on the sorted losses and a hyperparameter θ, which acts as a filter to remove certain losses. Using this mask, LNRCC calculates the filtered losses by zeroing out some elements of the original loss vector, so that only the selected losses are used for supervision. Finally, the initial model is updated using LFF as the loss function. These steps are repeated until the model converges. In this way, LNRCC selectively filters out likely noise-induced losses, enabling the model to focus more on critical non-noise data during training and enhancing robustness against label noise.
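A hedged Python-style sketch of this training loop is shown below. The model, data loader, and the element-wise absolute deviation used as the base loss are assumptions for illustration, since the scheme itself is agnostic to the underlying architecture and loss.

import torch

def train_lnrcc(model, loader, optimizer, theta=0.9, epochs=100, device="cuda"):
    # Sketch of the LNRCC loop: per-element losses are sorted, the largest
    # (1 - theta) fraction is filtered out as presumed label noise, and only
    # the remaining losses drive back-propagation.
    model.to(device).train()
    for epoch in range(epochs):
        for images, targets in loader:  # targets: per-point or per-pixel labels
            images, targets = images.to(device), targets.to(device)
            preds = model(images)
            # Deviation vector L = [l_j]; an element-wise absolute deviation
            # stands in for whichever base loss (MSE, Bayesian, OT) is used.
            losses = (preds - targets).abs().flatten()
            keep = int(theta * losses.numel())
            kept, _ = torch.topk(losses, keep, largest=False)  # smallest theta*N losses
            filtered_loss = kept.sum()  # only the selected losses supervise the model
            optimizer.zero_grad()
            filtered_loss.backward()
            optimizer.step()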

Algorithm 1 Label Noise Robust Crowd Counting Training Scheme

Figure 3 presents the flowchart of the Label Noise Robust Crowd Counting (LNRCC) framework. The process commences with the input of a crowd counting dataset, which feeds into a deep learning model exemplified by architectures such as VGG19 or HRNet. These models are adept at capturing complex spatial features from high-density crowd images. The convolutional layers, characterized by their depth and the size of their convolutional kernels, progressively reduce the spatial dimensions while increasing the depth of the feature maps. Following feature extraction, the flowchart delineates the application of a Bayesian loss function, $L^{Bayes}$, which is computed as a sum over the transformation of individual errors, $\mathcal{F}(1 - E[c_n])$, to enhance the model's robustness to noise. Subsequently, the overall loss function incorporates the Loss Filtering Factor (LFF), which is designed to mitigate the impact of noisy labels during training. The LFF is applied through a selective weighting mechanism, where certain data points are emphasized or disregarded based on their estimated noise levels. Finally, the denoising block represents the process of cleansing the output, further refining the count estimates by filtering out noise-induced inconsistencies. This integrative approach facilitates a more accurate and noise-resilient crowd counting model, as evidenced by the reduced noise in the denoised output compared to the initial model predictions.

Figure 3. The flowchart of the LNRCC.


Experiments

In this section, we apply our LFF to multiple networks and datasets. The experimental results show that LFF can effectively enhance the robustness of crowd counting models in label-noise environments and improve their learning performance.

Evaluation Metrics

MAE and MSE are two extensively used metrics for evaluating crowd count estimate methods. They are defined as follows:

$$MAE = \frac{1}{K}\sum_{k=1}^{K}\left|N_k^{GT} - N_k\right| \tag{4}$$
$$MSE = \frac{1}{K}\sum_{k=1}^{K}\left|N_k^{GT} - N_k\right|^2 \tag{5}$$

where $K$ is the number of test images, and $N_k^{GT}$ and $N_k$ are the label (ground-truth) count and the estimated count for the $k$-th image, respectively.

Another important metric is the GAME (Khan, Menouar, and Hamila Citation2023b). GAME aims to improve the assessment of localization in crowd counting. It divides each image into $L^2$ non-overlapping grids and calculates the MAE within each grid, summing up the errors. The GAME metric is particularly useful for evaluating the spatial distribution accuracy of the estimated count. The formula for GAME is given by:

$$GAME(L) = \frac{1}{K}\sum_{k=1}^{K}\sum_{l=1}^{L^2}\left|N_{k,l}^{GT} - N_{k,l}\right| \tag{6}$$

where $L^2$ represents the total number of grids the image is divided into, $K$ is the number of test images, $N_{k,l}^{GT}$ is the ground-truth count in the $l$-th grid of the $k$-th image, and $N_{k,l}$ is the estimated count in the same grid.
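The sketch below computes these metrics as described. The density-map inputs to game() and the partition of each map into an L x L grid ($L^2$ cells) follow our reading of Equation (6) and should be taken as illustrative assumptions.

import numpy as np

def mae_mse(gt_counts, pred_counts):
    # Image-level counting errors over K test images (Eqs. 4-5).
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    mae = np.mean(np.abs(gt - pred))
    mse = np.mean(np.abs(gt - pred) ** 2)
    return mae, mse

def game(gt_maps, pred_maps, L=2):
    # GAME(L) as in Eq. (6): each density map is split into an L x L grid
    # (L^2 non-overlapping cells); per-cell absolute count errors are summed
    # and then averaged over the K test images.
    errors = []
    for gt, pred in zip(gt_maps, pred_maps):
        h, w = gt.shape
        err = 0.0
        for i in range(L):
            for j in range(L):
                ys, ye = i * h // L, (i + 1) * h // L
                xs, xe = j * w // L, (j + 1) * w // L
                err += abs(gt[ys:ye, xs:xe].sum() - pred[ys:ye, xs:xe].sum())
        errors.append(err)
    return float(np.mean(errors))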

Datasets

There are numerous crowd counting public datasets available nowadays. Based on the popularity of crowd counting datasets, ShanghaiTech (Zhang et al. Citation2016a), UCF-QNRF (Idrees et al. Citation2018), UCF_CC_50 (Idrees et al. Citation2013) will be used in this paper. Additionally, RGBT-CC (Liu et al. Citation2021), the first large-scale RGBT Crowd Counting benchmark, published in 2021, will also be used in this paper.

ShanghaiTech

(Zhang et al. Citation2016a) consists of Parts A and B. Part A contains 300 training images and 182 testing images, while Part B includes 400 training images and 316 testing images. According to Table 1, Part A has a significantly higher density than Part B. It is the most popular crowd counting benchmark, with a usage rate of more than 90% in relevant research.

UCF-QNRF

(Idrees et al. Citation2018) is one of the largest crowd counting datasets, with 1,535 images and 1.25 million point annotations. It is a difficult dataset to analyze since it has a wide variety of counts, image resolutions, lighting conditions, and viewpoints and contains a dense crowd. The training set includes 1,201 images, while the remaining 334 images are used for testing. It also has a usage rate of more than 70% in relevant research.

UCF_CC_50

(Idrees et al. Citation2013) includes 50 grayscale images with varying but high resolutions; the images have an average resolution of 2013×2902. The average count per image is 1,279, and the minimum and maximum counts are 94 and 4,532, respectively. It is the second most popular crowd counting benchmark, with a usage rate of more than 80% in relevant research.

RGBT-CC

(Liu et al. Citation2021) is the first publicly available RGBT dataset for crowd counting, containing 2,030 pairs of representative RGB-thermal images, 1,013 of which are captured in light and 1,017 of which are captured in darkness. Each image is the same size (640×480), and 1,030 pairs are used for training, 200 pairs for validation, and 800 for testing. A total of 138,389 persons are marked with point annotations, on average 68 people per image.

Neural Network Model

In this experiment, we select several representative crowd counting models and add different Loss Filtering Factors to examine the efficiency of the proposed method.

BL

(Ma et al. Citation2019) is a loss function that builds a density contribution probability model using point annotations. Rather than restricting the value at each pixel in the density map, the BL training loss uses more reliable supervision on the count expectation at each annotated point. It significantly outperforms the baseline loss on many crowd counting datasets, including UCF-QNRF, ShanghaiTech, and UCF_CC_50, and exceeds the previous best approaches on the UCF-QNRF dataset. Additionally, many recent studies use BL as the loss function for their methods.

DMCC

(Wang et al. Citation2020) uses Distribution Matching for crowd counting. It uses the Optimal Transport method to evaluate the similarity between the normalized predicted density map and the normalized ground truth density map, as well as a Total Variation loss to stabilize the OT computation. DMCC method also outperforms the previous state-of-the-art results on the ShanghaiTech and UCF_CC_50 datasets.

CSCA

(Zhang, Choi, and Hong Citation2022) are modular building blocks that can be easily integrated into any modality-specific architecture. Through spatial-wise cross-modal attention, the CSCA blocks first spatially capture global functional connections across modalities with little overhead. Then, cross-modal features with spatial attention are refined using adaptive channel-wise feature aggregation. This method greatly improves performance across various backbone networks and outperforms the previous state-of-the-art results on the RGBT-CC dataset.

IADM

(Liu et al. Citation2021) is a crossmodal collaborative representation learning framework, which consists of multiple modality-specific branches, a modality-shared branch, and an Information Aggregation Distribution Module to capture the complementary information of different modalities fully. It incorporates two collaborative information transfers to dynamically enhance the modality-shared and modality-specific representations with a dual information propagation mechanism. Moreover, this method is universal for multi-modal crowd counting, and experiment results demonstrate its effectiveness.

Figure 4 illustrates the detailed architecture of the VGG19 model, a deep convolutional neural network widely utilized in the fields of image recognition and processing. The VGG19 model comprises 19 layers, including 16 convolutional layers and 3 fully connected layers. A distinctive feature of this model is the use of numerous small-sized (3×3) convolutional kernels, allowing the network to learn image features more deeply while maintaining the receptive field. Multiple pooling layers are interspersed between the convolutional layers, serving to reduce dimensions and computational load. The VGG19 demonstrates exceptional performance in image recognition tasks, owing its effectiveness to its deep structure and the application of small-sized convolutional kernels.

Figure 4. The architecture of the VGG19.


Implementation Details

In this experiment, we use the standard image classification network VGG-19 as the backbone and build each model according to the official implementation of the methods in Sec. 4.3. The backbone is pre-trained on ImageNet, and the Adam optimizer with an initial learning rate of $10^{-5}$ is used to update the parameters.

For training, images from various datasets are randomly cropped into different sizes. The crop size is 256×256 for ShanghaiTech Part A, UCF_CC_50 and RGBT-CC, where image resolutions are smaller, and 512×512 for ShanghaiTech Part B and UCF-QNRF. We perform five-fold cross-validation to obtain the average test result on the UCF_CC_50 dataset, since it is a small-scale dataset with no designated split for training and testing.
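A brief sketch of this setup is given below, assuming torchvision's ImageNet-pretrained VGG-19 as the feature extractor; the crop-size mapping mirrors the values stated above, while everything beyond the backbone and optimizer is hypothetical.

import torch
from torchvision import models, transforms

# ImageNet-pretrained VGG-19 feature extractor used as the backbone.
backbone = models.vgg19(pretrained=True).features
# Adam optimizer with an initial learning rate of 1e-5.
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-5)

# Dataset-specific random crop sizes used during training.
crop_size = {"ShanghaiTech Part A": 256, "UCF_CC_50": 256, "RGBT-CC": 256,
             "ShanghaiTech Part B": 512, "UCF-QNRF": 512}
train_transform = transforms.Compose([
    transforms.RandomCrop(crop_size["UCF-QNRF"]),
    transforms.ToTensor(),
])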

Experimental Evaluations

We compare the models with and without the proposed Loss Filtering Factor on the benchmark datasets described in Sec. 4.2. We set the model without the Loss Filtering Factor (θ=1) as the baseline and compare the models' performances at various Loss Filtering Factor values. For a fair comparison, the same random seeds and the same other parameters are used for each set of experiments on a given model. The experimental results are shown in Table 2.

Table 2. (a): benchmark evaluations on five benchmark crowd counting datasets using the MAE, MSE and GAME(2) metrics. All of the models use the VGG19 neural network.

We conduct five experiments on each dataset separately for each model structure, including a model without the proposed method (θ=1) and four models using the Loss Filtering Factor with different values of θ.

Quantitative Results

In all experiments, the proposed Loss Filtering Factor consistently improves the performance of all four models on all datasets used in the experiments, by an average of 3.68% for MAE, 6.7% for MSE and 4.68% for GAME. For most models and datasets, using the Loss Filtering Factor with a θ of 0.9 or 0.95 yields the best performance. When θ=0.9, it provides the greatest performance improvement, boosting the counting accuracy of DMCC on the ShanghaiTechB dataset by 11.38% and 9.12% for MAE and MSE, respectively.

From the perspective of the model, using the appropriate Loss Filtering Factor yields improvements of around 6.81% for DMCC and 5.27% for BL across all five datasets, and around 5.91% for CSCA and 5.11% for IADM on the RGBT-CC dataset.

From the perspective of the dataset, using the appropriate Loss Filtering Factor yields average improvements of 5.36% for models trained on UCF-QNRF, 3.18% on ShanghaiTechA, 9.77% on ShanghaiTechB, 5.93% on UCF_CC_50 and 5.74% on RGBT-CC, respectively.

In order to compare the impact of different backbone networks on the LNRCC scheme, we conducted experiments replacing the backbone network. Our experimental results in Table 3 indicate that while HRNet integration with our LNRCC scheme showed significant performance improvements, ResNeXt did not achieve similar success, facing issues with error margins and convergence. This highlights the necessity for model-specific adaptations of our methods, as evidenced by the effectiveness of VGG19 and the adaptability of LFF across various network architectures.

Table 3. For the BL model, we modified the backbone network by replacing VGG19 with HRNet and ResNeXt, respectively, to test the generalization and validity of LFF.

Key Issues and Discussion

The change trends of MAE, MSE and GAME of model BL with different θ values compared to the baseline model are shown in Figure 5 (upper left). As θ decreases, MAE first drops and then rises, reaching its lowest point at θ = 0.95 and improving by 2.2% compared to the baseline model. MSE exhibits a similar trend, hitting its lowest point at θ = 0.95 with a 4% improvement. GAME also follows a similar pattern, with the lowest value at θ = 0.90, indicating a 4.2% decrease compared to the baseline. Figure 5 (lower left) presents the change trends of MAE and MSE of model DMCC with varying θ values against the baseline model. Both MAE and MSE first fall and then climb as θ decreases, sharing the same lowest point at θ = 0.90, where MAE improves by 6.4% and MSE by 6.8%. GAME trends similarly, reaching its minimum at θ = 0.90 with a 3.4% improvement over the baseline. Figure 5 (upper right) illustrates how MAE and MSE of model CSCA change with different θ values versus the baseline model. The variations of MAE and MSE are analogous, declining initially and then rising as θ decreases, with the lowest points occurring at θ = 0.95 and θ = 0.90, respectively; MAE improves by 3.1% and MSE by 8.7%. For GAME, the minimum value is attained at θ = 0.95, a 3.7% improvement over the baseline. The change trends of MAE and MSE of model IADM across different θ values in contrast to the baseline model are depicted in Figure 5 (lower right). MAE and MSE exhibit similar tendencies, both dropping first and then growing as θ decreases, with the lowest point occurring at θ = 0.90; MAE improves by 3% and MSE by 7.3%. GAME follows a similar pattern, with its most substantial reduction of 7.4% occurring at θ = 0.90.

Figure 5. Illustration of the effectiveness of parameter θ under LFF in different models.


Effect of θ. To analyze the impact of θ selection on model performance, we divide the result of each model with the Loss Filtering Factor (θ≠1) by the corresponding result of the baseline model (θ=1) and take the average. The result is shown in Figure 6.

Figure 6. Effect of loss filtering factor θ.


Although the best value of θ varies for different models and datasets, the proposed Loss Filtering Factor increases model performance when θ is between 0.85 and 1, and for most models, setting θ to 0.95 or 0.9 maximizes the model's performance. However, as θ is set smaller, the performance of the models decreases significantly.

The effect of θ can be explained in the following ways:

  • About 10% of the point annotations in the dataset are noisy and will negatively influence the model's counting performance when adopted in training. In other words, there is about 10% label noise in the dataset, and reducing this label noise during training can improve the model's counting performance. This may also explain why the best θ choices for the same model vary across datasets: the levels of label noise vary among datasets.

  • By using the proposed Loss Filtering Factor, the model does not use all data points for supervision in each training epoch, similar to a dropout layer in a neural network. Even if the data removed from training by the factor θ are not label noise, the removal can still improve generalization and decrease the model's overfitting to the training data.

  • Setting θ too small will filter out too much training data. When θ is extremely small, only a tiny portion of data may be selected for training during each iteration. In other words, the model can only see partial information of the training data each time, making it hard to grasp the overall distribution. This will lead to the model’s inability to sufficiently learn the patterns in the training data, resulting in underfitting. The experimental results show that when θ is 0.75, the performance of most models is inferior compared to θ values of 0.85–0.95, indicating that a too small θ causes underfitting. In general, regularization techniques improve generalization capability while sacrificing fitting accuracy on the training data. Only by finding a balancing point can both overfitting and underfitting be avoided. An excessively small θ disrupts this balance by over-regularizing, thus leading to underfitting.

The above explanation shows that a suitable θ can significantly increase model performance because it reduces the dataset's label noise during training, improves generalization, and decreases overfitting. However, a too-small θ (θ ≤ 0.80) results in lower model accuracy because the ground truth is insufficiently used and the model is poorly supervised.

Effect on Model Convergence Speed

Through experiments, we observe that after adding the Loss Filtering Factor, the model's training time per epoch remains relatively constant. The model's convergence, on the other hand, becomes slower, and the slowdown varies widely between models. In general, the total number of epochs required to complete training for a model using the Loss Filtering Factor is around 5% more than for a model without it.

Ablation Studies

We perform an ablation study on the UCF-QNRF dataset by comparing the proposed Loss Filtering Factor with a baseline that randomly removes the same number of points during training. Table 4 provides the quantitative results.

Table 4. BL and DMCC are the original models (θ=1). We use θ=0.9 for models BL+LFF and DMCC+LFF, and randomly remove 10% of the data supervision in each training epoch for models BL+random and DMCC+random.

When 10% of the data supervision was randomly removed during training, the overall performance of the model improved only slightly. This indicates that randomly removing data supervision in each epoch has limited influence on model performance, and shows that the proposed Loss Filtering Factor improves model performance by filtering out the label noise that harms counting accuracy, rather than by merely removing some data supervision at random during training.

Conclusions

In this work, we propose the LFF and design the corresponding framework, namely the LNRCC training scheme, to mitigate the impact of label noise in crowd counting learning tasks. Our proposed approach allows models to filter out likely noise-induced losses during training, enabling them to focus on more reliable signals in the data. We evaluate LNRCC using crowd images from multiple scenarios, such as malls, airports, and streets. The experimental results show that our LNRCC can effectively improve the learning performance of crowd counting models and mitigate the impact of noisy labels in such tasks. As the existing LNRCC relies on manual parameter setting, introducing reinforcement learning into LNRCC to achieve self-tuning could be an appealing research direction in future crowd analytics.

Disclosure statement

The authors report no potential conflict of interest.

Data availability statement

The data that support the findings of this study are openly available in the ShanghaiTech, UCF-QNRF, UCF_CC_50, and RGBT-CC datasets.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/08839514.2024.2329859.

References

  • Alotibi, M. H., S. Kammoun Jarraya, M. Salamah Ali, and K. Moria. 2019. Cnn-based crowd counting through iot: Application for saudi public places. Procedia Computer Science 163:134–25. doi:10.1016/j.procs.2019.12.095.
  • Boominathan, L., S. S. Kruthiventi, and R. Venkatesh Babu. 2016. “Crowdnet: A deep convolutional network for dense crowd counting.” In Proceedings of the 24th ACM international conference on Multimedia, Amsterdam, The Netherlands, 640–44.
  • Cao, X., Z. Wang, Y. Zhao, and F. Su. 2018. “Scale aggregation network for accurate and efficient crowd counting.” In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 734–50.
  • Chan, A. B., Z.-S. John Liang, and N. Vasconcelos. 2008. “Privacy preserving crowd monitoring: Counting people without people models or tracking.” In 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 1–7.
  • Cheng, Z.-Q., J.-X. Li, Q. Dai, X. Wu, and A. G. Hauptmann. 2019. “Learning spatial awareness to improve crowd counting.” In Proceedings of the IEEE/CVF international conference on computer vision, Seoul, Korea, 6152–61.
  • Chen, Y., J. Ma, Y. Du, R. Sun, and B. Niu. 2021. “Research on elementary school students’ books recommendation algorithm based on words and character library.” In 2021 2nd International Conference on Artificial Intelligence and Information Systems, ICAIIS 2021, Chongqing, China.
  • Dong, Z., R. Zhang, X. Shao, and Y. Li. 2020. Scale-recursive network with point supervision for crowd scene analysis. Neurocomputing 384:314–24. doi:10.1016/j.neucom.2019.12.070.
  • Gao, G., J. Gao, Q. Liu, Q. Wang, and Y. Wang. 2020. CNN-based density estimation and crowd counting: A survey. arXiv preprint arXiv:2003.12783.
  • Gao, M., A. Souri, M. Zaker, W. Zhai, X. Guo, and Q. Li. 2023. A comprehensive analysis for crowd counting methodologies and algorithms in internet of things. Cluster Computing 27 (1):859–73. doi:10.1007/s10586-023-03987-y.
  • Guo, X., M. Anisetti, M. Gao, and G. Jeon. 2022. Object counting in remote sensing via triple attention and scale-aware network. Remote Sensing 14 (24):6363. doi:10.3390/rs14246363.
  • Guo, X., M. Gao, W. Zhai, Q. Li, K. Hyung Kim, and G. Jeon. 2023. Dense attention fusion network for object counting in IoT system. Mobile Networks and Applications 28 (1):359–68. doi:10.1007/s11036-023-02090-1.
  • Guo, X., M. Gao, W. Zhai, Q. Li, and G. Jeon. 2023. Scale region recognition network for object counting in intelligent transportation system. IEEE Transactions on Intelligent Transportation Systems 24 (12):15920–29. doi:10.1109/TITS.2023.3296571.
  • Guo, X., M. Gao, W. Zhai, Q. Li, J. Pan, and G. Zou. 2022. Multiscale aggregation network via smooth inverse map for crowd counting. Multimedia Tools & Applications 1–15. doi:10.1007/s11042-022-13664-8.
  • Guo, X., M. Gao, W. Zhai, J. Shang, and Q. Li. 2022. Spatial-frequency attention network for crowd counting. Big Data 10 (5):453–65. doi:10.1089/big.2022.0039.
  • Guo, X., K. Song, M. Gao, W. Zhai, Q. Li, and G. Jeon. 2023. Crowd counting in smart city via lightweight ghost attention pyramid network. Future Generation Computer Systems 147:328–38. doi:10.1016/j.future.2023.05.013.
  • Hafeezallah, A., A. Al-Dhamari, and S. Abd Rahman Abu-Bakar. 2022. Multi-scale network with integrated attention unit for crowd counting. Computers Materials & Continua 73 (2):3879–3903. doi:10.32604/cmc.2022.028289.
  • Han, B., Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in Neural Information Processing Systems, Montreal, Canada, 31.
  • Hu, R., Z.-R. Tang, E. Q. Wu, Q. Mo, R. Yang, and J. Li. 2022. RDC-SAL: Refine distance compensating with quantum scale-aware learning for crowd counting and localization. Applied Intelligence 52 (12):14336–48. doi:10.1007/s10489-022-03238-4.
  • Hu, R., Z. Tang, and R. Yang. 2023. QE-DAL: A quantum image feature extraction with dense distribution-aware learning framework for object counting and localization. Applied Soft Computing 138:110149. doi:10.1016/j.asoc.2023.110149.
  • Idrees, H., I. Saleemi, C. Seibert, and M. Shah. 2013. “Multi-source Multi-scale Counting in Extremely Dense Crowd Images.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, June.
  • Idrees, H., M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah. 2018. “Composition loss for counting, density map estimation and localization in dense crowds.” In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 532–46.
  • Jiang, L., D. Huang, M. Liu, and W. Yang. 2020. “Beyond synthetic noise: Deep learning on controlled noisy labels.” In International Conference on Machine Learning, Vienna, 4804–15. PMLR.
  • Jiang, X., Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao. 2019. “Crowd counting and density estimation by trellis encoder-decoder networks.” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA, 6133–42.
  • Jiang, L., Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei. 2018. “Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels.” In International conference on machine learning, Stockholm, Sweden, 2304–13. PMLR.
  • Jingying, W. 2021. “A survey on crowd counting methods and datasets.” In Advances in Computer, Communication and Computational Sciences: Proceedings of IC4S 2019, Bangkok, Thailand, 851–63. Springer.
  • Khan, M. A., H. Menouar, and R. Hamila. 2023a. “Crowd density estimation using imperfect labels.” In 2023 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 1–6.
  • Khan, M. A., H. Menouar, and R. Hamila. 2023b. Revisiting crowd counting: State-of-the-art, trends, and future perspectives. Image and Vision Computing 129:104597. doi:10.1016/j.imavis.2022.104597.
  • Laradji, I. H., N. Rostamzadeh, P. O. Pinheiro, D. Vazquez, and M. Schmidt. 2018. “Where are the blobs: Counting by localization with point supervision.” In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, September.
  • Li, B., H. Huang, A. Zhang, P. Liu, and C. Liu. 2021. Approaches on crowd counting and density estimation: A review. Pattern Analysis and Applications 24 (3):853–74. doi:10.1007/s10044-021-00959-z.
  • Lin, H., Z. Ma, R. Ji, Y. Wang, and X. Hong. 2022. “Boosting crowd counting via multifaceted attention.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19628–37.
  • Liu, L., J. Chen, H. Wu, G. Li, C. Li, and L. Lin. 2021. “Cross-modal collaborative representation learning and a large-scale rgbt benchmark for crowd counting.” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 4823–33.
  • Liu, Y., M. Shi, Q. Zhao, and X. Wang. 2019. “Point in, box out: Beyond counting persons in crowds.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, June.
  • Li, Y., X. Zhang, and D. Chen. 2018. “Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes.” In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 1091–100.
  • Li, M., Z. Zhang, K. Huang, and T. Tan. 2008. “Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection.” In 2008 19th International Conference on Pattern Recognition, Tampa, FL, USA, 1–4.
  • Marsden, M., K. McGuinness, S. Little, C. E. Keogh, and N. E. O’Connor. 2018. “People, Penguins and petri dishes: Adapting object counting models to new visual domains and object types without forgetting.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, June.
  • Ma, Z., X. Wei, X. Hong, and Y. Gong. 2019. “Bayesian loss for crowd count estimation with point supervision.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 6142–51.
  • Ma, Z., X. Wei, X. Hong, H. Lin, Y. Qiu, and Y. Gong. 2021. “Learning to count via unbalanced optimal transport.” In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2319–27.
  • Mohammed, M. S., A. Al-Dhamari, W. Saeed, F. N. Al-Aswadi, S. Abdulla Mohsen Saleh, and M. N. Marsono. 2023. Motion pattern-based scene classification using adaptive synthetic oversampling and fully connected deep neural network. IEEE Access 11:119659–75. doi:10.1109/ACCESS.2023.3327463.
  • Pham, N.-Q., T.-S. Nguyen, J. Niehues, M. Müller, S. Stüker, and A. Waibel. 2019. Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377.
  • Rabaud, V., and S. Belongie. 2006. “Counting crowded moving objects.” In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, USA, Vol. 1, 705–11.
  • Shi, Z., L. Zhang, Y. Liu, X. Cao, Y. Ye, M.-M. Cheng, and G. Zheng. 2018. “Crowd counting with deep negative correlation learning.” In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 5382–90.
  • Silver, D., A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. 2016. Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–89. doi: 10.1038/nature16961.
  • Sindagi, V. A., and V. M. Patel. 2019. Ha-ccn: Hierarchical attention-based crowd counting network. IEEE Transactions on Image Processing 29:323–35. doi:10.1109/TIP.2019.2928634.
  • Walach, E., and L. Wolf. 2016. “Learning to count with cnn boosting.” In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, Amsterdam, the Netherlands, 660–76. Springer.
  • Wang, Y., J. Hou, and L.-P. Chau. 2019. “Object counting in video surveillance using multi-scale density map regression.” In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2422–26.
  • Wang, Y., J. Hou, X. Hou, and L.-P. Chau. 2021. A self-training approach for point-supervised object detection and counting in crowds. IEEE Transactions on Image Processing 30:2876–87. doi:10.1109/TIP.2021.3055632.
  • Wang, B., H. Liu, D. Samaras, and M. Hoai Nguyen. 2020. Distribution matching for crowd counting. Advances in Neural Information Processing Systems 33:1595–607.
  • Wan, J., Z. Liu, and A. B. Chan. 2021. “A generalized loss function for crowd counting and localization.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, June, 1974–83.
  • Wei, B., Y. Yuan, and Q. Wang. 2020. “MSPNET: Multi-supervised parallel network for crowd counting.” In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2418–22. IEEE.
  • Xiong, H., H. Lu, C. Liu, L. Liu, Z. Cao, and C. Shen. 2019. “From open set to closed set: Counting objects by spatial divide-and-conquer.” In Proceedings of the IEEE/CVF international conference on computer vision, Long Beach, CA, USA, 8362–71.
  • Zhai, W., M. Gao, M. Anisetti, Q. Li, S. Jeon, and J. Pan. 2022. Group-split attention network for crowd counting. Journal of Electronic Imaging 31 (4):041214–041214. doi:10.1117/1.JEI.31.4.041214.
  • Zhai, W., M. Gao, X. Guo, and Q. Li. 2023. Scale-context perceptive network for crowd counting and localization in smart city system. IEEE Internet of Things Journal 10 (21):18930–40. doi:10.1109/JIOT.2023.3268226.
  • Zhai, W., M. Gao, Q. Li, G. Jeon, and M. Anisetti. 2023. Fpanet: Feature pyramid attention network for crowd counting. Applied Intelligence 53 (16):19199–216. doi:10.1007/s10489-023-04499-3.
  • Zhai, W., M. Gao, A. Souri, Q. Li, X. Guo, J. Shang, and G. Zou. 2023. An attentive hierarchy ConvNet for crowd counting in smart city. Cluster Computing 26 (2):1099–111. doi:10.1007/s10586-022-03749-2.
  • Zhai, W., Q. Li, Y. Zhou, X. Li, J. Pan, G. Zou, and M. Gao. 2023. DA2Net: A dual attention-aware network for robust crowd counting. Multimedia Systems 29 (5):3027–40. doi:10.1007/s00530-021-00877-4.
  • Zhang, C., S. Bengio, M. Hardt, B. Recht, and O. Vinyals. 2021. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64 (3):107–15. doi:10.1145/3446776.
  • Zhang, Y., S. Choi, and S. Hong. 2022. “Spatio-channel attention blocks for cross-modal crowd counting.” In Proceedings of the Asian Conference on Computer Vision, Macau SAR, China, 90–107.
  • Zhang, S., G. Wu, J. P. Costeira, and J. M. F. Moura. 2017. “Understanding traffic density from large-scale web camera data.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July.
  • Zhang, Y., D. Zhou, S. Chen, S. Gao, and Y. Ma. 2016a. “Single-image crowd counting via multi-column convolutional neural network.” In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, USA, 589–97.
  • Zoph, B., V. Vasudevan, J. Shlens, and Q. V. Le. 2018. “Learning transferable architectures for scalable image recognition.” In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 8697–710.
  • Zou, H., Y. Zhou, J. Yang, W. Gu, L. Xie, and C. Spanos. 2017. “Freecount: Device-free crowd counting with commodity wifi.” In GLOBECOM 2017-2017 IEEE Global Communications Conference, Singapore, 1–6. IEEE.
  • Zou, H., Y. Zhou, J. Yang, and C. J. Spanos. 2018. Device-free occupancy detection and crowd counting in smart buildings with WiFi-enabled IoT. Energy and Buildings 174:309–22. doi:10.1016/j.enbuild.2018.06.040.