
A cross-view intelligent person search method based on multi-feature constraints

Article: 2346259 | Received 28 Nov 2023, Accepted 17 Apr 2024, Published online: 26 Apr 2024

ABSTRACT

Person search aims to simultaneously locate and identify queried persons across different scenes, which is crucial for disaster emergency management and public safety. However, the high variability of environmental features across video scenes, together with the susceptibility of person search to occlusions and dense crowds, leaves existing methods inefficient and inaccurate when searching for persons across views. Therefore, we propose a cross-view intelligent person search method based on multifeature constraints. First, we establish a global-local context-aware (GLCA) module that fully extracts discriminative person features. Second, we construct a semantic complementarity and feature aggregation (SCFA) module to constrain person features at different scales and in different contexts. Third, we constrain the method with person spatial, person identity, and detection confidence features to improve person search accuracy. Finally, we construct a case experiment dataset, select two public benchmark datasets, and conduct a detailed experimental analysis on them. The results show that our method applies well to person search tasks in complex scenarios and outperforms 25 other state-of-the-art algorithms, with mAP improved by 0.41%–19.71%. This approach effectively enhances the informatization level of disaster emergency rescue and public safety management.

1. Introduction

Human activity information is crucial for building digital earth spaces and supporting sustainable development in the future (Che et al. Citation2023; Hu et al. Citation2021). In particular, with the rapid growth of virtual reality (VR), augmented reality (AR), and mixed reality (MR) technologies in recent years, computer-simulated human activity data have been used in various virtual simulations (e.g. disaster simulation, urban simulation, and construction simulation) to help us better understand and plan human activities (Li et al. Citation2024; Wu et al. Citation2023; Zhang et al. Citation2023). However, there is always an unbridgeable gap between reality and VR. Virtual human activity data may lead to inaccurate or unrealistic simulation results (Ray and Claramunt Citation2023; Zhu et al. Citation2024a), hindering the future development of urban and disaster digital twins (Therias and Rafiee Citation2023; Zhu et al. Citation2024b).

Person search is a technique for accurately obtaining actual human activity data. It consists of two phases, pedestrian detection and person reidentification, and aims to locate and identify individuals of interest within diverse scenes simultaneously. It can be used as an effective digital search tool for the timely collection of personal information from multiple surveillance videos and can assist in tasks such as social security maintenance, disaster emergency rescue, and urban digital twins (Qian et al. Citation2022; Singh, Khare, and Jethva Citation2022; Wang et al. Citation2023). Different from the traditional person reidentification method (Figure 1), person search encompasses two subtasks: pedestrian detection and person reidentification (Figure 2) (Fiaz et al. Citation2022). Accordingly, person search grapples with the twofold challenges of pedestrian detection and person reidentification. Despite the considerable scholarly attention devoted to this area of research, person search remains a formidable task due to variations in context, scale, and other factors across different scenarios.

Figure 1. Person reidentification: matching with manually cropped pedestrians. The green box represents the correct recognition result, and the red box represents the incorrect recognition result.


Figure 2. Person search: finding from whole-scene images. The green box represents the correct search result.


Benefiting from the ability of deep learning to automatically and rapidly extract features (Chen et al. Citation2023; Feng et al. Citation2023; Xie et al. Citation2020; Xie et al. Citation2022a; Xie et al. Citation2022b), researchers have developed many deep learning-based person search methods. The existing methods for person search can be divided into two broad categories: two-step methods and end-to-end methods. The two-step method (Dong et al. Citation2020; Kalayeh et al. Citation2018; Lan, Zhu, and Gong Citation2018; Si et al. Citation2018; Zheng, Zheng, and Yang Citation2018) divides the person search task into two subtasks: person detection (Li et al. Citation2020; Xie et al. Citation2022) and person reidentification (Kim et al. Citation2023; Zhang and Wang Citation2023). Each subtask is separately optimized to enhance adaptivity. However, the independent processing of these subtasks makes the methodology time-consuming and labor-intensive. The end-to-end method integrates detection and reidentification into the same network framework and optimizes the two subtasks simultaneously (Chen et al. Citation2020; Fiaz et al. Citation2022; Zhao et al. Citation2022). This approach makes the whole process more efficient and convenient. Although the end-to-end method has achieved some results, there are still several shortcomings.

  1. In various scenarios, the semantic features of the target individual, the surrounding environment, and nearby individuals tend to shift. Nevertheless, the majority of existing methods primarily focus on information in the vicinity of the target person's center (Yan et al. Citation2021; Yu et al. Citation2022), neglecting to fully explore and utilize a broader scope of relevant scene information and background person data.

  2. Different from ordinary pedestrian detection, person search encompasses the task of matching the same individual across multiple cameras. First, contrary to continuous tracking within the same viewpoint, images from different surveillance scenes may result in occlusion of the target individual and variances in scale. Second, in densely populated settings, similar individuals can interfere with the target person search, thereby exacerbating the person search complexity, as illustrated in Figure 3.

  3. Most existing methods strive to enhance the accuracy of person reidentification by refining physical models or algorithms. However, the simple distance function employed in person reidentification networks fails to gauge the similarity of pedestrian feature vectors comprehensively and precisely. This leads to a diminished accuracy of person reidentification algorithms (An, Wang, and Liu Citation2023; Wang and Guo Citation2022; Yun et al. Citation2021; Zheng et al. Citation2017).

Figure 3. Examples of person search interference in different scenarios.


Given the above shortcomings, this paper proposes a multifeature-constrained cross-view intelligent person search method adapted to the person search task in complex environments, starting from the appearance characteristics of people in the scene. First, we built a global-local context-aware (GLCA) module to fully capture the differential features of personnel contexts. Second, we constructed a semantic complementarity and feature aggregation (SCFA) module to enhance the performance of personnel features at different scales. Finally, based on the previous features, we combined the spatial features, personal identity features, and detection confidence features to constrain the pedestrian search and realize an accurate person search for cross-view images.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 presents a detailed description of the proposed method and the loss functions. Section 4 introduces the dataset descriptions and hyperparameter configuration. Section 5 presents a comprehensive analysis of the experiments. Sections 6 and 7 provide the discussion and conclusion, respectively.

2. Related work

2.1. Person reidentification

The aim of person reidentification is to match specific pedestrians in a large number of pedestrian images. Early studies mainly used handcrafted features for reidentification tasks (Lan, Zhu, and Gong Citation2018). As convolutional neural networks (CNNs) continue to evolve, deep learning-based methodologies have established strong footholds in the domain of person reidentification (Huertas-Tato et al. Citation2022; Wu et al. Citation2022). Sun et al. (Citation2018) proposed refined part pooling to reassign outliers in part models. Wang et al. (Citation2018) split pedestrian images into patches to learn features from multiple granularities. Zhang et al. (Citation2020) proposed an effective relationship-aware global attention module that captures global structural information for better attentional learning. Wang et al. (Citation2021) introduced a new pyramid spatiotemporal aggregation framework that gradually aggregates frame-level features and fuses hierarchical spatiotemporal features into a final video-level representation. Qu, Zhang, and Zhang (Citation2023) proposed a parallel mixed-attention network to complement spatial and channel information in Re-ID. Zhang et al. (Citation2023) proposed a complementary network for person reidentification, which aims to extract more discriminative and robust local representations through complementary cues.

Although the person reidentification task can effectively identify a specific person from a large number of images, most existing person reidentification algorithms focus on matching cropped pedestrian images between queries and candidates. However, in real-world scenarios, pedestrian bounding box annotations are often unavailable. Therefore, person reidentification is not suitable for identifying target pedestrians in practical complex scenarios.

2.2. Person search

Person search usually includes two stages, pedestrian detection and person reidentification, which aim to locate and identify specific pedestrians from given images or videos. Compared to the singular task of person reidentification, the person search task is better suited for real-world applications because the image library comprises natural images that encompass both individuals and their surrounding environments. The existing person search methods can be divided into two-step methods and end-to-end methods. The two-step method requires independent optimization of two subtask frameworks. Chen et al. proposed a mask-guided person search method that models the foreground person and the original image block separately, obtaining rich representations from two separate CNN streams (Chen, Huang, and Tao Citation2018). Wang et al. (Citation2020) proposed a task-consistent two-stage person search framework that included an identity-guided detector module to generate a query-like bounding box for the reidentification stage. Han et al. (Citation2019) proposed a refinement framework for person localization based on reidentification, which can obtain more reliable bounding boxes and provide more discriminative feature embeddings for reidentification tasks.

The end-to-end method of person search combines the tasks of pedestrian detection and reidentification into a unified model, aiming to construct an efficient network that provides search results without the need for additional steps. The first end-to-end person search framework was proposed by Xiao et al. (Citation2017). Since then, numerous successful works have emerged as subsequent researchers have delved into this field (Fiaz et al. Citation2022; Li and Miao Citation2021; Yan et al. Citation2021; Yu et al. Citation2022; Zhao et al. Citation2022). For example, Yan et al. (Citation2021) introduced the first anchorless person search framework and designed a feature-aligned network to solve the mismatch problem. Considering the connection between detection and reidentification, Li and Miao (Citation2021) designed a sequential network for person search and proposed the contextual-based global matching (CBGM) algorithm to extract context information from different people. Zhao et al. (Citation2022) focused on the noise in person searches and proposed a context-contrasted loss to enhance the identification features among pedestrians while maintaining feature consistency for the same identity. Yu et al. (Citation2022) proposed a cascade occluded attention transformer (COAT) for end-to-end person searches, which solves the issues of scale, viewpoint, and occlusion.

In general, these existing methods use the information from the detected bounding box to extract reidentification features in the reidentification stage. However, when considering an entire scene image or video, information from not only the human-centered adjacent area but also the background information of the scene and other people present can be obtained. Additionally, the scale of different scenes varies significantly, leading to occlusion or situations with a high density of individuals. Therefore, leveraging this abundant information to supplement or enhance the reidentification features can improve the search performance.

2.3. Context information for person search

Context information has been proven effective in various computer vision tasks (Xie et al. Citation2022). In the detection task, the target object is not only related to itself or adjacent information but also strongly connected to the global scene information. Therefore, leveraging global context information is often employed to enhance target features and achieve more accurate localization (Chen, Huang, and Tao Citation2018; Liu et al. Citation2018). In the reidentification task, researchers often explore additional information between the target person and adjacent individuals to establish connections between their features and improve reidentification performance (Yan et al. Citation2020; Zhu et al. Citation2020). In the task of person searching, researchers usually examine the local context of individuals through image matching (Li and Miao Citation2021). Additionally, Yan et al. (Citation2021) studied context cues at three levels – detection, memory, and scene – in unconstrained natural images using weak supervision. However, these methods often utilize either global scene information or local adjacent person information without fully exploiting both types of information to enhance the final person reidentification representation.

3. Method

3.1. Research framework overview

The proposed method is shown in Figure 4. The method takes an image as input data and outputs the results of a person search, forming an end-to-end mapping that includes two subtasks: pedestrian detection and person reidentification. First, we designed the GLCA module to learn the global context information and local person context; this module establishes the relationships between the target person, the scene, and adjacent persons. Second, we constructed the SCFA module to build the connection between global and local context information and aggregate multiple types of feature information. Finally, we incorporated a precise detection frame coordinate constraint (Constraint 1), a person identity feature constraint (Constraint 2), and a detection frame confidence constraint (Constraint 3) to improve the comprehensive measurement capability of our framework for pedestrian features.

Figure 4. Person intelligent search framework.


3.2. Global-local context-aware module

Contextual information has been proven to play an important role in person search-related tasks (Yan et al. Citation2019; Yan et al. Citation2021). However, in previous works, most researchers used GNNs to learn local contextual information, which significantly increased the complexity of the method. Therefore, to address this issue and consider the diverse characteristics of individuals in various scenarios, a simple and effective spatial-channel correlation attention (SCCA) submodule is developed to efficiently collect and embed the image's contextual information, as depicted in Figure 5. This submodule is deployed in both the pedestrian detection and person reidentification stages, thereby effectively addressing the problem of insufficient feature extraction caused by occlusions and person-intensive situations. It is noteworthy that, as seen in Figure 5, the GLCA module may appear somewhat similar to the CBAM module (Woo et al. Citation2018) previously proposed by researchers, as both utilize channel and spatial attention mechanisms. However, in the module we propose, the way channel attention and spatial attention are connected is different and more elaborate. Moreover, the designed module makes better use of the original features, allowing the original features from each stage to be better preserved in subsequent processes. This module is capable of more effectively exploring contextual semantics in the task of person search.

Figure 5. The global-local context-aware module framework.


The output of the backbone in the detection stage is further processed because it contains rich global response information, enhancing the context features among objects at the global level. In the reidentification stage, the features corresponding to the person area are first extracted. Then, the information shared between adjacent local areas is learned to reduce the differences and achieve more accurate person reidentification. It is important to note that when the target person is occluded in the image or video, the feature expression of the target person tends to weaken. Additionally, in scenes with a large number of people, similar individuals can interfere with the characteristics of the target person, leading to insufficient feature extraction. However, the GLCA module leverages both the global scene context information and the local person context information to enhance the reidentification features and improve target person identification, thereby increasing the distinctiveness between the target person and the background individuals.

Specifically, the same method is used to extract the global and local contexts, but the inputs of the modules are different. For the global context, the RoI-aligned pooled features are sent to the submodule, and for the local context, the filtered person features are sent to the submodule. First, the module input (represented by F) is sent in parallel to adaptive maximum pooling and adaptive average pooling. Second, two 1 × 1 convolutions follow, after which the two branches are added, and the channel attention map is obtained by using the sigmoid function. Then, the original input is multiplied by the channel attention map and added through a skip connection to obtain F_CA. Next, F_CA is pooled by the same maximum pooling and average pooling, and the results are fused by a concatenation operation. Finally, the spatial attention map is obtained by the sigmoid function and multiplied by F_CA to obtain F_CA-SA. Mathematically, this process can be described as shown in Equations (1) and (2):
$$F_{CA} = M_C(F) \otimes F \quad (1)$$
$$F_{CA\text{-}SA} = M_S(F_{CA}) \otimes F_{CA} \quad (2)$$
where F represents the original input, $M_C$ and $M_S$ represent the channel attention map and spatial attention map, respectively, $F_{CA}$ represents the feature map generated by the multiplication and addition, and $F_{CA\text{-}SA}$ is the final output of the submodule.
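To make the data flow concrete, a minimal PyTorch sketch of this submodule is given below. The class name SCCA, the channel reduction ratio, and the 7 × 7 spatial convolution are illustrative assumptions rather than details taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class SCCA(nn.Module):
    """Sketch of the spatial-channel correlation attention submodule (Eqs. (1)-(2)).

    Assumptions: the channel branch uses parallel adaptive max/avg pooling with a
    shared 1x1-conv bottleneck; the spatial branch concatenates channel-wise
    max/mean maps. Hyperparameters are illustrative.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Channel attention map M_C (Eq. (1)); the skip connection follows the prose.
        m_c = self.sigmoid(self.mlp(self.max_pool(f)) + self.mlp(self.avg_pool(f)))
        f_ca = f * m_c + f
        # Spatial attention map M_S applied to F_CA (Eq. (2)).
        max_map, _ = f_ca.max(dim=1, keepdim=True)
        mean_map = f_ca.mean(dim=1, keepdim=True)
        m_s = self.sigmoid(self.spatial_conv(torch.cat([max_map, mean_map], dim=1)))
        return f_ca * m_s


# Usage example with RoI-aligned features (8 proposals, 1024 channels, 14 x 14).
feats = torch.randn(8, 1024, 14, 14)
out = SCCA(1024)(feats)  # same shape as the input
```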

3.3. Semantic complementarity and feature aggregation module

The results of feature extraction reveal that the global context encompasses abstract semantic information, whereas the local context comprises more intricate and detailed features. However, the existing methods directly aggregate the two types of context (Zheng et al. Citation2021). Although some results have been achieved, this practice further increases the differences between person characteristics, resulting in rough or even incorrect discriminant features. Therefore, considering the obvious scale differences among people in the same or different scenes, this paper proposes a semantic complementarity module, which develops a cross-fusion strategy to establish a collaborative mechanism between global and local context information, as shown in Figure 6.

  1. The obtained global and local context information is passed through convolutions and the sigmoid function to obtain attention maps, which capture the weight information in the global and local feature maps and help the network focus on key areas, as described in Equations (3) and (4):
$$A_1 = \sigma(\mathrm{Conv}_{3\times3}(\mathrm{Conv}_{3\times3}(F_{Global}))) \quad (3)$$
$$A_2 = \sigma(\mathrm{Conv}_{3\times3}(\mathrm{Conv}_{3\times3}(F_{Local}))) \quad (4)$$
where $\sigma(\cdot)$ represents the sigmoid function, $\mathrm{Conv}_{3\times3}(\cdot)$ represents a 3 × 3 convolution, $F_{Global}$ represents the global context of the detection stage, and $F_{Local}$ represents the local context of the reidentification stage.

  2. By performing cross-multiplication and addition, effective feature complementation is carried out, resulting in feature maps that contain more valuable information. This process aims to minimize feature differences prior to aggregation, as expressed in Equations (5) and (6):
$$C_1 = \mathrm{AMP}((A_1 \oplus 1) \otimes \mathrm{Conv}_{3\times3}(F_{Global})) \quad (5)$$
$$C_2 = \mathrm{AMP}((A_2 \oplus 1) \otimes \mathrm{Conv}_{3\times3}(F_{Local})) \quad (6)$$
where $C_i\ (i = 1, 2)$ represents the cross-refined features, $\mathrm{Conv}_{3\times3}(\cdot)$ represents a 3 × 3 convolution, $A_i\ (i = 1, 2)$ represents the attention maps, $F_{Global}$ represents the global context in the detection stage, $F_{Local}$ represents the local context in the reidentification stage, $\mathrm{AMP}(\cdot)$ represents adaptive max pooling, $\oplus$ represents elementwise addition, and $\otimes$ represents elementwise multiplication.

  3. As Equation (8) shows, the reidentification features are concatenated with the two complementary contextual features, and the aggregated features are unified through an attention mechanism to ensure that the network focuses on the most important information. Specifically, the aggregated features are fed into the channel attention mechanism to weight each position and channel independently. After that, norm-aware embedding (NAE) is used to update the shared feature vectors of detection and reidentification. Finally, the OIM loss is used to learn and optimize the final representation of the target person. (A code-level sketch of these three steps is given after this list.)
$$\mathrm{Fuse} = \mathrm{Cat}(C_1, C_2, F_{Select}) \quad (7)$$
$$f = \sigma(\mathrm{Conv}_{1\times1}(\mathrm{Fuse})) \otimes \mathrm{Fuse} \quad (8)$$
where $\mathrm{Cat}(\cdot)$ represents the concatenation operation, $C_i\ (i = 1, 2)$ represents the cross-refined features, $F_{Select}$ represents the selected person features in the reidentification stage, $\mathrm{Conv}_{1\times1}(\cdot)$ represents a 1 × 1 convolution, $\sigma(\cdot)$ represents the sigmoid function, and $\otimes$ represents elementwise multiplication.
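A compact PyTorch sketch of the three steps above (Equations (3)-(8)) follows. The layer widths, the pooled output size, and pooling the selected re-identification features to the same resolution before concatenation are assumptions made so that the example runs end to end; they are not taken from the paper's code.

```python
import torch
import torch.nn as nn

class SCFA(nn.Module):
    """Sketch of the semantic complementarity and feature aggregation module."""

    def __init__(self, channels: int, pooled_size: int = 14):
        super().__init__()
        self.att_global = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        self.att_local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        self.refine_global = nn.Conv2d(channels, channels, 3, padding=1)
        self.refine_local = nn.Conv2d(channels, channels, 3, padding=1)
        self.amp = nn.AdaptiveMaxPool2d(pooled_size)
        self.fuse_att = nn.Sequential(nn.Conv2d(3 * channels, 3 * channels, 1), nn.Sigmoid())

    def forward(self, f_global, f_local, f_select):
        a1 = self.att_global(f_global)                            # Eq. (3)
        a2 = self.att_local(f_local)                              # Eq. (4)
        c1 = self.amp((a1 + 1) * self.refine_global(f_global))    # Eq. (5)
        c2 = self.amp((a2 + 1) * self.refine_local(f_local))      # Eq. (6)
        fuse = torch.cat([c1, c2, self.amp(f_select)], dim=1)     # Eq. (7)
        return self.fuse_att(fuse) * fuse                         # Eq. (8)


# Usage: global scene features, local person features, and selected re-ID features.
f_g = torch.randn(2, 256, 32, 32)
f_l = torch.randn(2, 256, 14, 14)
f_s = torch.randn(2, 256, 14, 14)
out = SCFA(256)(f_g, f_l, f_s)  # (2, 768, 14, 14)
```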

3.4. Multifeature constraints for person reidentification

In Sections 3.2 and 3.3, person-related features are thoroughly extracted to constrain the network's focus on the target individuals. However, existing person reidentification methods commonly rely on distance functions as the standard; consequently, these methods may not comprehensively and accurately measure the similarity of pedestrian feature vectors, leading to low recognition accuracy. A more comprehensive feature-based constraint is needed to further guide the network to address this issue. Therefore, this paper introduces a triple feature constraint in the pedestrian reidentification stage, which includes a precise detection frame coordinate constraint, a person identity feature constraint, and a pedestrian detection frame confidence constraint.

  1. Precise detection frame coordinate constraint

Figure 6. The semantic complementarity and feature aggregation module framework.


The precise detection frame coordinate is a spatial feature that refers to the location information of the bounding box for pedestrian detection, including the upper-left corner coordinate, width, and height. The pedestrian detection subnetwork obtains several candidate regions, which may contain pedestrians, through the anchor box mechanism in the region proposal network. Reusing the CNN features, the network generates a feature map through the candidate region pooling layer, as depicted in Figure 7(a). However, due to two floating-point rounding operations, a significant pixel error can occur. This paper proposes the use of bilinear interpolation to mitigate this error. First, the original network's floating-point rounding operation is removed, and the floating-point numbers are retained during the mapping process. Second, sampling points are set for each region, and these points are estimated using bilinear interpolation. This process avoids coordinate rounding, resulting in better preservation of the spatial position information of the candidate regions. Consequently, more accurate person detection box coordinates can be obtained, as shown in Figure 7(b).
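The rounding issue and its bilinear-interpolation remedy correspond to the difference between RoI Pool and RoI Align; the short torchvision-based illustration below uses an arbitrary feature map, stride, and box purely for demonstration.

```python
import torch
from torchvision.ops import roi_align, roi_pool

# A single feature map of stride 16 and one candidate box in image coordinates.
features = torch.randn(1, 256, 50, 50)
boxes = torch.tensor([[0, 103.7, 58.2, 245.9, 310.4]])  # (batch_idx, x1, y1, x2, y2)

# RoI Pool rounds box coordinates and bin boundaries to integers twice,
# which can shift the pooled region by several pixels in the original image.
pooled_rounded = roi_pool(features, boxes, output_size=(14, 14), spatial_scale=1 / 16)

# RoI Align keeps floating-point coordinates and samples each bin with
# bilinear interpolation, preserving the spatial position of the candidate region.
pooled_aligned = roi_align(features, boxes, output_size=(14, 14),
                           spatial_scale=1 / 16, sampling_ratio=2)

print(pooled_rounded.shape, pooled_aligned.shape)  # both (1, 256, 14, 14)
```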

  2. Person identity feature constraint

Figure 7. Precise detection frame coordinate constraints.


Accurate identification of pedestrian identities requires the extraction of unique characteristics that can effectively express their individuality. These specific identity features play a crucial role in measuring the differences between pedestrians. This paper incorporates identity characteristic constraints that describe pedestrian identities and utilizes the online instance matching (OIM) loss function (Xiao et al. Citation2017) to update network parameters. The approach involves constructing a lookup table and a circular queue for pedestrians with labeled and unlabeled identities, respectively, within the dataset for optimization purposes, as illustrated in Figure 8.

Figure 8. Person identity feature constraints.


Assuming that the number of labeled pedestrians in the dataset is L and the feature dimension is d, a lookup table $V \in \mathbb{R}^{d \times L}$ is designed to store the labeled pedestrian features. In forward propagation, let the sample pedestrian feature $x \in \mathbb{R}^{d}$ be labeled t; then, the cosine distances $V^{T}x$ between x and the pedestrian features in the lookup table are calculated. In backpropagation, if the model matches the identity information of sample x, its feature is updated in the lookup table. The calculation method is shown in Equation (9):
$$v_t \leftarrow \gamma v_t + (1 - \gamma)x \quad (9)$$
where $v_t$ is the detection box feature after L2 regularization and $\gamma \in [0, 1]$.

In addition, considering the large number of unlabeled pedestrians in person reidentification, effectively utilizing their feature information can also promote the correct identification of labeled pedestrians. In this paper, a circular queue $U \in \mathbb{R}^{d \times Q}$ of size Q is designed to store unlabeled pedestrian features in the dataset. In forward propagation, the distances $U^{T}x$ between the pedestrian feature x and the features in the circular queue are calculated. Then, the two sets of distance vectors obtained from the lookup table and the circular queue are combined, and the probability distribution of the sample is calculated according to Equations (10) and (11):
$$p_i = \frac{\exp(v_i^{T}x / T)}{\sum_{j=1}^{L}\exp(v_j^{T}x / T) + \sum_{k=1}^{Q}\exp(u_k^{T}x / T)} \quad (10)$$
$$q_i = \frac{\exp(u_i^{T}x / T)}{\sum_{j=1}^{L}\exp(v_j^{T}x / T) + \sum_{k=1}^{Q}\exp(u_k^{T}x / T)} \quad (11)$$
where $p_i$ denotes the probability that sample x is predicted to be the i-th labeled identity, $q_i$ denotes the probability that sample x is predicted to be the i-th unlabeled pedestrian feature in the circular queue, and T is the temperature parameter controlling the probability distribution. The ultimate goal of the online instance matching loss function is to maximize the log-likelihood of sample x, as shown in Equation (12):
$$L_{OIM} = E_x[\log p_t] \quad (12)$$

The gradient of the online instance matching loss with respect to sample x is computed, and the network parameters are continually updated, so that the neural network is trained effectively.
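The lookup table, circular queue, and probability computation of Equations (9)-(12) can be sketched as follows. Buffer sizes, the temperature, and the momentum value are placeholders, and the update rule is simplified relative to the description above.

```python
import torch
import torch.nn.functional as F

class OIMLoss(torch.nn.Module):
    """Sketch of online instance matching (Eqs. (9)-(12)).

    A lookup table V stores features of the L labeled identities and a circular
    queue U stores Q unlabeled-person features; both are updated outside autograd.
    """

    def __init__(self, feat_dim=256, num_labeled=482, queue_size=500,
                 momentum=0.5, temperature=0.1):
        super().__init__()
        self.register_buffer("lut", torch.zeros(num_labeled, feat_dim))   # V
        self.register_buffer("queue", torch.zeros(queue_size, feat_dim))  # U
        self.momentum = momentum        # gamma in Eq. (9)
        self.temperature = temperature  # T in Eqs. (10)-(11)

    def forward(self, x, labels):
        x = F.normalize(x, dim=1)
        # Similarities against labeled (lookup table) and unlabeled (queue) features.
        logits = torch.cat([x @ self.lut.t(), x @ self.queue.t()], dim=1) / self.temperature
        labeled = labels >= 0
        # Eq. (12): maximize the log-likelihood of the true identity for labeled samples.
        loss = F.cross_entropy(logits[labeled], labels[labeled])
        with torch.no_grad():
            for feat, lab in zip(x, labels):
                if lab >= 0:  # Eq. (9): momentum update of the lookup table entry.
                    self.lut[lab] = F.normalize(
                        self.momentum * self.lut[lab] + (1 - self.momentum) * feat, dim=0)
                else:         # unlabeled sample: push into the circular queue.
                    self.queue = torch.cat([self.queue[1:], feat.unsqueeze(0)], dim=0)
        return loss


# Usage: 8 proposal features; labels >= 0 are known identities, -1 is unlabeled.
feats = torch.randn(8, 256)
labels = torch.tensor([0, 5, -1, 12, -1, 3, 7, -1])
loss = OIMLoss()(feats, labels)
```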

  3. Pedestrian detection frame confidence constraint

The detection frame confidence is a feature used to measure whether a bounding box contains a real pedestrian target. The primary objective of person reidentification is to identify and match the most similar image to the given pedestrian within the candidate pedestrian database constructed from multiple cameras based on the target pedestrian image. Therefore, it is crucial to calculate the feature similarity between the target and the candidate pedestrians. Person reidentification determines the similarity between pedestrian images by evaluating the similarity between their features. Pedestrian images with the highest feature similarity are considered to have the same identity. Thus, designing an effective method for measuring feature similarity is of utmost importance in person reidentification. In this paper, the feature similarity measurement algorithm is further constrained by incorporating the confidence scores of all the candidate pedestrian detection boxes derived from the output results of the pedestrian detection subnetwork, as illustrated in Figure 9.

Figure 9. Pedestrian detection frame confidence constraints.


First, the confidences of all the candidate pedestrian detection boxes are linearly normalized to [0, 1]. Second, the cosine distance between each candidate pedestrian detection frame and the pedestrian to be identified is calculated via Equation (13). Finally, using Equation (14), the normalized confidence of the candidate pedestrian detection frame is used as a weight and multiplied by the cosine distance to measure the feature similarity between the candidate pedestrian and the pedestrian to be identified:
$$d_{cos}(X_1, X_2) = (X_1 \otimes X_2)^{T}U \quad (13)$$
$$d_{weight\text{-}cos}(X_1, X_2) = d_{cos}(X_1, X_2) \times W \quad (14)$$
where $X_1$ and $X_2$ represent the feature column vectors of the candidate pedestrian and the pedestrian to be identified, respectively, $(\cdot)^{T}$ represents the vector transpose, $X_1 \otimes X_2$ represents the elementwise multiplication of the feature vectors $X_1$ and $X_2$, U denotes the constant vector $[1, 1, \ldots, 1]^{T}$, and W is the normalized confidence.
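Equations (13) and (14) can be sketched as below; the min-max scaling used to normalize the detection confidences to [0, 1] is an assumption about the exact normalization.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_similarity(query_feat, gallery_feats, det_scores):
    """Eqs. (13)-(14): cosine similarity weighted by normalized detection confidence.

    query_feat:    (d,)   feature of the person to be identified
    gallery_feats: (n, d) features of candidate pedestrian detections
    det_scores:    (n,)   raw detection confidences of the candidate boxes
    """
    # Linearly normalize confidences to [0, 1] (assumed min-max scaling).
    w = (det_scores - det_scores.min()) / (det_scores.max() - det_scores.min() + 1e-12)
    # Eq. (13): cosine similarity between each candidate and the query.
    d_cos = F.cosine_similarity(gallery_feats, query_feat.unsqueeze(0), dim=1)
    # Eq. (14): weight the similarity by the detection confidence.
    return d_cos * w


# Usage: rank 5 candidate detections for one query person.
q = torch.randn(256)
g = torch.randn(5, 256)
scores = torch.tensor([0.98, 0.75, 0.42, 0.91, 0.60])
ranking = confidence_weighted_similarity(q, g, scores).argsort(descending=True)
```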

3.5. Hybrid loss functions

The loss function is used for supervision in both the detection and reidentification stages. The regression loss uses the smooth-L1 loss function (Equation (15)), the classification loss uses the cross-entropy loss function (Equations (16) and (17)), and the reidentification loss uses the OIM loss (Equation (18)). The overall loss function is shown in Equation (19):
$$\mathcal{L}_{reg} = \frac{1}{N}\sum_{i=1}^{N} L_{smooth\text{-}L1}(t_i, y_i) \quad (15)$$
where N denotes the number of positive samples, $t_i$ denotes the coordinates of the i-th positive sample, and $y_i$ denotes the coordinates of the corresponding ground-truth box.
$$\mathcal{L}_{cls1} = \frac{1}{N}\sum_{i=1}^{N} y_i L_{CE}(p_i, y_i) \quad (16)$$
$$\mathcal{L}_{cls2} = \frac{1}{N}\sum_{i=1}^{N} y_i L_{CE}(f_{SCFA}, y_i) \quad (17)$$
where N denotes the number of samples, $p_i$ is the predicted classification probability of the i-th sample, $y_i$ denotes the corresponding ground-truth label, and $L_{CE}(\cdot)$ is the cross-entropy loss function. The settings of $\lambda_i\ (i = 1, 2, 3, 4, 5)$ follow those of Li and Miao (Citation2021).
$$\mathcal{L}_{reID\text{-}SCFA} = \mathrm{OIM}(f_{SCFA}) \quad (18)$$
$$\mathcal{L}_{joint} = \lambda_1\mathcal{L}_{reg1} + \lambda_2\mathcal{L}_{cls1} + \lambda_3\mathcal{L}_{reg2} + \lambda_4\mathcal{L}_{cls2} + \lambda_5\mathcal{L}_{reID\text{-}SCFA} \quad (19)$$
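The way the individual terms combine into the joint objective of Equation (19) is illustrated below; the λ weights shown are placeholders rather than the values adopted from Li and Miao (Citation2021), and the per-term losses are dummy tensors.

```python
import torch
import torch.nn.functional as F

def joint_loss(reg1, cls1, reg2, cls2, reid_scfa,
               lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Eq. (19): weighted sum of detection and re-identification losses.

    reg1/cls1 come from the first (detection) stage, reg2/cls2 from the second
    stage, and reid_scfa is the OIM loss on the SCFA features. The lambda values
    here are placeholders, not the paper's settings.
    """
    l1, l2, l3, l4, l5 = lambdas
    return l1 * reg1 + l2 * cls1 + l3 * reg2 + l4 * cls2 + l5 * reid_scfa


# Example with per-term losses computed on dummy predictions.
reg1 = F.smooth_l1_loss(torch.randn(4, 4), torch.randn(4, 4))            # Eq. (15)
cls1 = F.cross_entropy(torch.randn(4, 2), torch.tensor([0, 1, 1, 0]))    # Eq. (16)
reg2 = F.smooth_l1_loss(torch.randn(4, 4), torch.randn(4, 4))
cls2 = F.cross_entropy(torch.randn(4, 2), torch.tensor([1, 0, 1, 0]))    # Eq. (17)
reid = torch.tensor(2.3)  # placeholder for the OIM loss of Eq. (18)
total = joint_loss(reg1, cls1, reg2, cls2, reid)
```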

4. Dataset descriptions and hyperparameter configuration

4.1. Dataset descriptions

  1. Benchmark datasets

The experimental datasets used in this paper mainly include the CUHK-SYSU dataset (Xiao et al. Citation2017) and the PRW dataset (Zheng et al. Citation2017). CUHK-SYSU is a large-scale dataset commonly used for person search research, consisting mainly of images extracted from road surveillance videos and movie scenes. The average image resolution is 800 × 600 pixels. We follow Xiao et al. in adopting the standard training/test split, with a training set of 5,532 identities and 11,206 images and a test set of 2,900 query persons and 6,978 images. The PRW dataset is derived from surveillance videos on the campus of Tsinghua University captured by six cross-regional cameras. The scene images in the dataset have an average resolution of 1920 × 1080 pixels. Following the standard data split, the training set contains 5,704 images with 482 different identities, while the test set includes 2,057 query persons and 6,112 images.

  2. Self-constructed dataset

To evaluate the generalization of the proposed method, a small dataset sourced from Southwest Jiaotong University is constructed. This dataset consists of four different scenes captured by two cross-regional cameras. The images in the dataset have a resolution of 1280 × 720 pixels, and the total time of the dataset is 305 s, with a total number of 15,450 frames. To further verify the generalization of the method, the entire dataset is used as the testing set without any training process.

4.2. Hyperparameter configuration

  1. Implementation details

All experiments are performed on a server equipped with an NVIDIA GeForce RTX 3090 (24 GB) GPU and employ ResNet50 as the backbone. We use a classification model pre-trained on ImageNet to initialize our model. During training, the model is trained for 16 epochs on the CUHK-SYSU dataset and 13 epochs on the PRW dataset. The batch size is set to 4, and each image is resized to 900 × 1500 pixels for both the CUHK-SYSU and PRW datasets. Stochastic gradient descent (SGD) is used to optimize the model, with a learning rate of 0.003, momentum of 0.9, and weight decay of 0.0005. During testing, the threshold of non-maximum suppression (NMS) is set to the same value as in Zheng et al. (Citation2021) for consistency. Additionally, the parameters used for CBGM (contextual-based global matching) are consistent with those used by Li and Miao (Citation2021).
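For reference, the optimizer and training configuration described above can be expressed as the following sketch; the model object is a placeholder for the full framework of Section 3.

```python
import torch

# Placeholder model; the actual network is the framework described in Section 3.
model = torch.nn.Linear(10, 2)

# SGD with the reported hyperparameters: lr 0.003, momentum 0.9, weight decay 5e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.003,
                            momentum=0.9, weight_decay=0.0005)

# Training-time image size, batch size, and epoch counts as reported above.
BATCH_SIZE = 4
RESIZE_HW = (900, 1500)
EPOCHS = {"CUHK-SYSU": 16, "PRW": 13}
```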

  2. Evaluation metrics

We used two standard evaluation metrics for the person search task, namely mean average precision (mAP) and top-k accuracy.

The mAP measures the average performance of an algorithm on the entire test set and is calculated from the average precision (AP). AP refers to the average precision for the t-th image to be identified, as shown in Equation (20):
$$AP = \frac{1}{p}\left(\frac{m_1}{loc_1} + \frac{m_2}{loc_2} + \cdots + \frac{m_p}{loc_p}\right) \quad (20)$$
where p represents the number of pedestrian images in the candidate pedestrian library that match the image to be identified, $m_i$ represents the number of matched images up to the i-th correct match after sorting the candidate pedestrian library, and $loc_i$ denotes the position of the i-th matching result in the sorted list. The mAP is the average of the AP over all the images to be identified, as shown in Equation (21):
$$mAP = \frac{1}{c}\sum_{t=1}^{c} AP_t \quad (21)$$
where c represents the number of images to be identified and $AP_t$ denotes the AP value for the t-th image to be identified.

Top-k accuracy represents the probability of correctly identifying the target image within the top k images when sorting the image similarities in descending order, which can be calculated using Equation (22):
$$Top\text{-}k = \frac{\sum_{i \in \{1, 2, \ldots, N\}} S_i}{N} \quad (22)$$
where N represents the number of images in the candidate pedestrian library. When the i-th image in the candidate pedestrian library matches the image to be identified, $S_i$ is set to 1; otherwise, $S_i$ is set to 0. Top-1 means that the first image matched each time is exactly the image to be identified, which is the most fundamental goal of the person search task.
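Both metrics can be computed as sketched below, assuming 1-indexed ranks and the usual convention that a query counts as a top-k hit when its first correct match appears within the first k results.

```python
def average_precision(match_positions, num_matches):
    """Eq. (20): AP for one query.

    match_positions: 1-indexed ranks at which correct matches appear (ascending);
    num_matches (p): number of ground-truth matches for this query.
    """
    if num_matches == 0:
        return 0.0
    return sum(i / loc for i, loc in enumerate(sorted(match_positions), start=1)) / num_matches


def mean_average_precision(all_match_positions, all_num_matches):
    """Eq. (21): mean of AP over all queries."""
    aps = [average_precision(p, n) for p, n in zip(all_match_positions, all_num_matches)]
    return sum(aps) / len(aps)


def top_k_accuracy(rank_of_first_match, k):
    """Eq. (22), standard convention: fraction of queries whose first correct
    match is ranked within the top k."""
    hits = [1 if r <= k else 0 for r in rank_of_first_match]
    return sum(hits) / len(hits)


# Example: two queries; the first has matches at ranks 1 and 3, the second at rank 2.
print(mean_average_precision([[1, 3], [2]], [2, 1]))  # (1/2*(1/1 + 2/3) + 1/1*(1/2)) / 2
print(top_k_accuracy([1, 2], k=1))                    # 0.5
```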

5. Experimental evaluation

5.1. Comparison with state-of-the-art methods

In order to validate the effectiveness of the proposed method, 25 advanced methods are selected for performance comparison on the CUHK-SYSU and PRW datasets. These methods can be divided into two categories, namely two-step and end-to-end. The two-step methods include MGTS (Chen et al. Citation2018), CLSA (Lan, Zhu, and Gong Citation2018), RDLR (Han et al. Citation2019), and Faster R-CNN + PCB + OR (Yao and Xu Citation2021). The end-to-end methods include OIM (Xiao et al. Citation2017), IAN (Xiao et al. Citation2019), QEEPS (Munjal et al. Citation2019), HOIM (Chen et al. Citation2020), APNet (Zhong et al. Citation2020), BINet (Dong et al. Citation2020), NAE (Chen et al. Citation2020), NAE + (Chen et al. Citation2020), DIOIM (Dai et al. Citation2020), BUFF (Yang et al. Citation2020), IIDFC (Hou et al. Citation2021), OR (Yao and Xu Citation2021), PGSFL (Kim et al. Citation2021), AlignPS (Yan et al. Citation2021), AlignPS + (Yan et al. Citation2021), DMRNet (Han et al. Citation2021), SeqNet (Li and Miao Citation2021), SeqNet + NAE (Li and Miao Citation2021), CANR (Zhao et al. Citation2022), CANR + (Zhao et al. Citation2022), and QGN (Munjal et al. Citation2023). The detailed results comparing all the methods are recorded in Table 1.

Table 1. The results of the proposed method compared with existing state-of-the-art networks.

It can be observed in Table 1 that for the CUHK-SYSU dataset, the proposed method reaches 95.21% mAP and 95.75% Top-1 accuracy. For the PRW dataset, the proposed method reaches 46.78% mAP and 84.69% Top-1 accuracy. In addition, to verify whether the CBGM algorithm is complementary to the proposed method, this paper deploys it in the test phase. On the CUHK-SYSU dataset, the combined method further increases the mAP by 0.25%, but Top-1 accuracy is reduced by 0.1%; compared with other methods, the combined method shows the most advanced performance. On the PRW dataset, both metrics improve, with an increase of 0.85% in mAP and 2.72% in Top-1 accuracy; compared with other methods, the combined method still achieves the best performance. In summary, our method achieves state-of-the-art performance on the CUHK-SYSU dataset without additional algorithms or auxiliary information. On the PRW dataset, although slightly behind SeqNet + CBGM, the proposed method still achieves competitive results. This quantitative analysis strongly demonstrates the effectiveness of the proposed method in the person search task.

It is worth noting that, as the size of the image gallery increases, the performance of person search is affected to varying degrees. To validate this effect, a set of experiments on the CUHK-SYSU dataset with different gallery sizes is conducted. The gallery sizes include 50, 100, 500, 1000, 2000, and 4000. Figure 10 summarizes the performance comparison of the proposed method with other state-of-the-art methods. First, as the gallery size increases, the performance of all methods tends to decrease; however, the proposed method degrades more slowly than most methods. Second, the proposed method consistently maintains the best performance under different gallery sizes, and as the gallery size increases, the advantages of the proposed method become more significant. In general, the experimental results demonstrate the robustness of the proposed method.

Figure 10. Performance comparison in terms of different gallery sizes on CUHK-SYSU.


5.2. Different scenario analysis

In order to further validate the effectiveness of the proposed method in person search tasks, qualitative analyses are conducted (Sections 5.2 and 5.3) to examine the coping ability of the proposed method in different scenarios. It is worth noting that the dataset includes both real-world scenes and movie scenes, so this paper analyzes them separately. Figures 11 and 12 show the search results in different scenarios.

Figure 11. Comparison of visual results of person search in real scenes.


In real-world scenes, this paper analyzes three challenging situations: occlusion, long-distance small targets, and person-intensive scenarios. For the case where the person is occluded, Figure 11 shows the search results for two scenarios: the first group of images displays the search results when the target person in the gallery is occluded, while the last group shows the search results when the person in the query image is occluded. From the first group of images, it can be observed that the target person in the gallery is heavily occluded by umbrellas and is even completely occluded in some images, which is extremely challenging; however, the proposed method can accurately search for and identify the target person. In the last group of images, the target person in the query is occluded by adjacent people, which may weaken the target person features learned by the method and lead to uncertain or incorrect results; nevertheless, the proposed method can still find the target person in such situations. From the second group of images, it can be seen that the target person gradually moves away from the camera's field of view and is partially occluded, but the proposed method copes with this situation effectively. The third group of images shows the results of the proposed method in person-intensive scenarios. In this group, not only does the person density increase, but the posture of the target person and the position of the backpack also change. These changes may cause confusion between the characteristics of the target person and those of other people, increasing the uncertainty of the final result; however, the proposed method is still able to handle such scenarios well.

In the movie scenes, this paper analyzes three aspects: illumination change, long-distance small targets, and occlusion. First, from the first and second groups of images in Figure 12, it can be observed that the illumination on the target person changes significantly. In particular, in the second group of images, the light becomes very dim and the clothing color and posture of an adjacent person are very similar to those of the target person, which undoubtedly makes identification more difficult. Although SeqNet is also able to search for and locate the target person, the proposed method achieves a higher similarity score and a more accurate detection box. Second, in the third group of images, the distance of the target person in the gallery changes and the person is occluded by an adjacent person, which blurs the similar features around the target person. Similarly, in the fourth group of images, the target person in the gallery is also partially occluded, which causes other methods to retrieve the wrong person, whereas the proposed method handles these situations better.

Figure 12. Comparison of visual results of person search in movie scenes.


In general, the method in this paper has better search performance than other methods when dealing with occlusion, long-distance small targets, person-intensive scenarios, and illumination changes. This is attributed to the fact that the proposed method focuses on both global and local context information and allows the two to interact effectively. By strengthening the feature representation of the target person, the search performance for the target person is improved.

5.3. Generalization analysis

In order to verify the generalization and transferability of the method proposed in this paper, robustness test experiments are carried out on the self-constructed dataset. Specifically, we do not train from scratch on this dataset but directly apply the model trained on the CUHK-SYSU dataset to the new dataset to verify the generalization of the model, as shown in Figure 13.

Figure 13. Method generalization analysis in different scenarios.


From the first set of search results, it can be seen that in the different video clips, the posture of the person changes greatly compared to the query, and in some clips the person is occluded by the vehicle being ridden. In the last clip, the body of the target person is almost beyond the camera's field of view, which causes some interference with the person search; however, the proposed method can accurately search for the target person despite these disturbances. From the second set of search results, for the man wearing black clothes, it can be seen that two accompanying pedestrians are similar to the target. In addition, the target is occluded to different degrees, which causes some interference with identification. In camera 4, the scene brightness also changes and the light becomes darker, which makes the characteristics of several black-clothed men standing side by side even more similar. However, the method proposed in this paper can overcome these interference factors: it not only identifies the small-scale target pedestrian in the two videos but also accurately identifies the partially occluded target pedestrian.

5.4. Effects of person detection

Better detection results can usually improve the performance of person search. Therefore, this paper selects some representative methods and establishes a comparative experiment on the final search performance when using ground truth boxes and detected bounding boxes. From Table 2, it can be seen that when using ground truth boxes, all methods show improved performance, which proves that the detection results have a substantial impact on the final search performance. Additionally, it is worth noting that when using ground truth boxes, the proposed approach improves by 0.47% on the CUHK-SYSU dataset, while OIM, NAE, and DIOIM improve by 2.4%, 2.6%, and 2.6%, respectively. The ground truth boxes bring less improvement to the proposed method than to other methods, and the same holds on the PRW dataset, which shows that the proposed method benefits more from its enhanced person reidentification features.

Table 2. Comparative results using the detected (det.) and ground truth (gt.) bounding boxes.

5.5. Ablation study

In order to validate the effectiveness of the proposed modules, five groups of ablation experiments are established. Table 3 records the performance of the different components. It can be seen that each proposed module brings a certain improvement to the final performance. Compared with the baseline, when GCA (global context attention) is added, the mAP and Top-1 accuracy improve by 1.71% and 1.24%, respectively. When LCA (local context attention) is added, the mAP and Top-1 accuracy improve by 0.33% and 0.31%, respectively. It can be observed that when a module is added alone, the performance does not improve much. This is because LCA aims to capture the characteristics of locally adjacent people; when encountering dense crowds or adjacent people whose characteristics are similar to the target person's, the features become blurred, which limits the final retrieval performance. Similarly, when the SCFA module is added alone, the improvement in mAP is not significant because blindly integrating and jointly learning features from the two stages results in the misalignment of semantic features. However, when all the modules are integrated together, the performance of the entire model is significantly improved.

Table 3. Different module ablation experiment of the proposed method.

Additionally, we calculate the FLOPs and the number of parameters for the method proposed in this paper and for the highly competitive SeqNet. To ensure a fair comparison, the input images for both models are resized to 900 × 1500 pixels, with the same batch size. The method presented in this paper has 33.04M parameters and 386.79G FLOPs, while SeqNet has 79.51M parameters and 521.02G FLOPs. Combining Sections 5.1, 5.2, and 5.4, it is evident that the method introduced in this paper achieves higher performance with lower complexity, further demonstrating its superiority in the person search task.
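The parameter count reported above can be reproduced with a simple helper such as the one below; the backbone shown is only a placeholder, and FLOPs would additionally require a profiling tool together with the 900 × 1500 input size.

```python
import torch
from torchvision.models import resnet50

def count_parameters(model: torch.nn.Module) -> float:
    """Return the number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Placeholder backbone only; the full framework adds detection and re-ID heads.
backbone = resnet50()
print(f"{count_parameters(backbone):.2f}M trainable parameters")
```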

6. Discussion

6.1. Applicability and implications

With the development of deep neural networks and the demand for intelligent video surveillance, attention to person search techniques in computer vision has increased significantly (Fiaz et al. Citation2022; Qian et al. Citation2022). In this paper, we innovatively utilized the global scene context and local person context to complement individual person features. Furthermore, we combined multiple feature constraints to form an intelligent person search method for cross-view images. Extensive experiments and analyses were conducted on two public datasets, and the results show that our method outperforms 25 other state-of-the-art methods and can be readily applied to the person search task in complex scenes. In addition, we built a new person search dataset to further demonstrate the generalizability of the proposed method.

Our intelligent search methodology provides digital earth applications with essential data on human activities and movements; such real data help us better understand the planet's phenomena, resources, and social issues (Li et al. Citation2021; Zhang et al. Citation2023). Our approach can help GIS experts understand population distribution, human flow dynamics, and urban activity. This information can be used for the virtual simulation of personnel and is essential for creating urban or disaster digital twins (Therias and Rafiee Citation2023). Our method can also provide safety and disaster emergency managers with accurate information about the locations of people, which can help build digital disaster emergency responses and maintain public safety (Singh, Khare, and Jethva Citation2022; Zhu et al. Citation2024b, Citation2024c).

6.2. Limitations

Figure 14 shows the shortcomings of the proposed method in different scenarios. In the two sets of displayed scenes, the target person in the query has a low resolution and is far from the camera. The leftmost image shows the target person in the query, and the right image shows the wrongly retrieved person (black box). This is because the proposed method does not perform much additional processing during the feature extraction stage, which may lead to insufficient extraction of deeper features from the target person. As a result, the search results of our method deviate in low-resolution images, and low-resolution target persons cannot be accurately located.

Figure 14. Methods for insufficient case visualization analysis.


7. Conclusion and future work

Person search can provide human activity information for the construction of the digital earth, thus promoting the development of informatization in disaster emergency response and public safety. In this paper, we propose a cross-view intelligent person search method that innovatively exploits global-local context information and introduces multiple feature constraints, realizing accurate cross-view person searches in complex environments. The main contributions of this paper are as follows.

First, the GLCA module is proposed and deployed in the pedestrian detection and person reidentification stages. This module makes good use of the global scene context and local person context to supplement person reidentification details, which solves the problem of insufficient extraction of person feature information in cases of occlusion and person-intensive situations.

Second, the SCFA module is designed to establish a collaborative mechanism between global and local context information. This module enhances the aggregation ability of the search method for personnel features of different scales. This approach solves the problem of scale differences between people in the same or different scenes.

Third, the person spatial feature constraint, person identity feature constraint, and detection confidence feature constraint are included to comprehensively and accurately measure pedestrian feature vector similarity, which improves the discriminative ability of ambiguous person features in the process of pedestrian reidentification.

Finally, an end-to-end framework for intelligent person search is formed. The experimental results show that our method outperforms 25 other deep learning methods for person search and can better deal with complex situations such as occlusions, scale changes, and high person density, while offering greater generalizability and transferability. In addition, the proposed method can compensate for the lack of real human activity information in existing disaster twin scenes. It can provide accurate human activity data for disaster emergency rescue and public safety management.

In future research, attribute features such as pedestrian sex, walking posture, clothing color, and backpacks will be considered to describe pedestrians more comprehensively, as this can improve the ability to search for low-resolution target people in complex scenes. Moreover, we will incorporate the person search method into disaster twin scenes to construct a digital twin in which the natural and social environments are intertwined.

Author contributions

Conceptualization, Jun Zhu and Yakun Xie; Methodology, Jinbin Zhang; Resources, Hongyu Chen, Tianyi Zhang, Hengchao Gu and Huijie Lian; Software, Jinbin Zhang and Tianyi Zhang; Validation, Jun Zhu and Yakun Xie; Writing – original draft, Jinbin Zhang and Hongyu Chen; Writing – review & editing, Jun Zhu and Yakun Xie. All authors have read and agreed to the published version of the manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data that support the findings of this study are available from the corresponding author, upon reasonable request.

Additional information

Funding

This paper was supported by the National Key Research and Development Program of China (Grant No. 2022YFC3005703), the National Natural Science Foundation of China (Grant Nos. 42301473, 42271424, and 42171397), the Postdoctoral Innovation Talents Support Program (BX20230299), and the China Postdoctoral Science Foundation (2023M742884).

References

  • An, F., J. Wang, and R. Liu. 2023. “Pedestrian Re-Identification Algorithm Based on Attention Pooling Saliency Region Detection and Matching.” IEEE Transactions on Computational Social Systems 99: 1–9.
  • Che, M., Y. Nian, S. Chen, H. Zhang, and T. Pei. 2023. “Spatio-Temporal Characteristics of Human Activities Using Location big Data in Qilian Mountain National Park.” International Journal of Digital Earth 16 (1): 3794–3809. https://doi.org/10.1080/17538947.2023.2259926.
  • Chen, H., D. Feng, S. Cao, W. Xu, Y. Xie, J. Zhu, and H. Zhang. 2023. “Slice-to-Slice Context Transfer and Uncertain Region Calibration Network for Shadow Detection in Remote Sensing Imagery.” ISPRS Journal of Photogrammetry and Remote Sensing 203: 166–182. https://doi.org/10.1016/j.isprsjprs.2023.07.027.
  • Chen, Z., S. Huang, and D. Tao. 2018. “Context Refinement for Object Detection.” Proceedings of the European Conference on Computer Vision (ECCV), 71–86.
  • Chen, D., S. Zhang, W. Ouyang, J. Yang, and B. Schiele. 2020. “Hierarchical Online Instance Matching for Person Search.” Proceedings of the AAAI Conference on Artificial Intelligence, 34 (7), 10518–10525.
  • Chen, D., S. Zhang, W. Ouyang, J. Yang, and Y. Tai. 2018. “Person Search via a Mask-Guided Two-Stream CNN Model.” Proceedings of the European Conference on Computer Vision (ECCV), 734–750.
  • Chen, D., S. Zhang, J. Yang, and B. Schiele. 2020. “Norm-aware Embedding for Efficient Person Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12615–12624.
  • Dai, J., P. Zhang, H. Lu, and H. Wang. 2020. “Dynamic Imposter Based Online Instance Matching for Person Search.” Pattern Recognition 100: 107120. https://doi.org/10.1016/j.patcog.2019.107120.
  • Dong, W., Z. Zhang, C. Song, and T. Tan. 2020a. “Bi-directional Interaction Network for Person Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2839–2848.
  • Dong, W., Z. Zhang, C. Song, and T. Tan. 2020b. “Instance Guided Proposal Network for Person Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2585–2594.
  • Feng, D., H. Chen, S. Liu, Z. Liao, X. Shen, Y. Xie, and J. Zhu. 2023. “Boundary-Semantic Collaborative Guidance Network With Dual-Stream Feedback Mechanism for Salient Object Detection in Optical Remote Sensing Imagery.” IEEE Transactions on Geoscience and Remote Sensing 61: 1–17.
  • Fiaz, M., H. Cholakkal, S. Narayan, R. M. Anwer, and F. S. Khan. 2022. “PS-ARM: An end-to-end Attention-Aware Relation Mixer Network for Person Search.” Proceedings of the Asian Conference on Computer Vision, 3828–3844.
  • Han, C., J. Ye, Y. Zhong, X. Tan, C. Zhang, C. Gao, and N. Sang. 2019. “Re-id Driven Localization Refinement for Person Search.” Proceedings of the IEEE/CVF International Conference on Computer Vision, 9814–9823.
  • Han, C., Z. Zheng, C. Gao, N. Sang, and Y. Yang. 2021. “Decoupled and Memory-Reinforced Networks: Towards Effective Feature Learning for One-Step Person Search.” Proceedings of the AAAI Conference on Artificial Intelligence, 35 (2): 1505–1512.
  • Hou, S., C. Zhao, Z. Chen, J. Wu, Z. Wei, and D. Miao. 2021. “Improved Instance Discrimination and Feature Compactness for end-to-end Person Search.” IEEE Transactions on Circuits and Systems for Video Technology 32 (4): 2079–2090. https://doi.org/10.1109/TCSVT.2021.3082775.
  • Hu, T., S. Wang, B. She, M. Zhang, X. Huang, Y. Cui, J. Khuri, et al. 2021. “Human Mobility Data in the COVID-19 Pandemic: Characteristics, Applications, and Challenges.” International Journal of Digital Earth 14 (9): 1126–1147. https://doi.org/10.1080/17538947.2021.1952324.
  • Huertas-Tato, J., A. Martín, J. Fierrez, and D. Camacho. 2022. “Fusing CNNs and Statistical Indicators to Improve Image Classification.” Information Fusion 79: 174–187. https://doi.org/10.1016/j.inffus.2021.09.012.
  • Kalayeh, M. M., E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah. 2018. “Human Semantic Parsing for Person re-Identification.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1062–1071.
  • Kim, H., S. Joung, I. J. Kim, and K. Sohn. 2021. “Prototype-guided Saliency Feature Learning for Person Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4865–4874.
  • Kim, M., S. Kim, J. Park, S. Park, and K. Sohn. 2023. “PartMix: Regularization Strategy to Learn Part Discovery for Visible-Infrared Person re-Identification.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18621–18632.
  • Lan, X., X. Zhu, and S. Gong. 2018. “Person Search by Multi-Scale Matching.” Proceedings of the European Conference on Computer Vision (ECCV), 536–552.
  • Li, J., S. Liao, H. Jiang, and L. Shao. 2020. “Box Guided Convolution for Pedestrian Detection.” Proceedings of the 28th ACM International Conference on Multimedia, 1615–1624.
  • Li, Z., and D. Miao. 2021. “Sequential End-to-End Network for Efficient Person Search.” Proceedings of the AAAI Conference on Artificial Intelligence, 35 (3): 2011–2019.
  • Li, W., J. Zhu, L. Fu, Q. Zhu, Y. Xie, and Y. Hu. 2021. “An Augmented Representation Method of Debris Flow Scenes to Improve Public Perception.” International Journal of Geographical Information Science 35 (8): 1521–1544. https://doi.org/10.1080/13658816.2020.1833016.
  • Li, W., J. Zhu, Q. Zhu, J. Zhang, X. Han, and Y. Dehbi. 2024. “Visual Attention-Guided Augmented Representation of Geographic Scenes: A Case of Bridge Stress Visualization.” International Journal of Geographical Information Science 38 (3): 527–549. https://doi.org/10.1080/13658816.2023.2301313.
  • Liu, Y., R. Wang, S. Shan, and X. Chen. 2018. “Structure Inference net: Object Detection Using Scene-Level Context and Instance-Level Relationships.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6985–6994.
  • Munjal, B., S. Amin, F. Tombari, and F. Galasso. 2019. “Query-guided End-to-End Person Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 811–820.
  • Munjal, B., A. Flaborea, S. Amin, F. Tombari, and F. Galasso. 2023. “Query-guided Networks for Few-Shot Fine-Grained Classification and Person Search.” Pattern Recognition 133: 109049. https://doi.org/10.1016/j.patcog.2022.109049.
  • Qian, B., Y. Wang, H. Yin, R. Hong, and M. Wang. 2022. “Switchable Online Knowledge Distillation.” European Conference on Computer Vision, 449–466, Cham: Springer Nature Switzerland.
  • Qu, J., Y. Zhang, and Z. Zhang. 2023. “PMA-Net: A Parallelly Mixed Attention Network for Person re-Identification.” Displays 78: 102437. https://doi.org/10.1016/j.displa.2023.102437.
  • Ray, C., and C. Claramunt. 2003. “A Distributed System for the Simulation of People Flows in an Airport Terminal.” Knowledge-Based Systems 16 (4): 191–203. https://doi.org/10.1016/S0950-7051(03)00013-3.
  • Si, J., H. Zhang, C. G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang. 2018. “Dual Attention Matching Network for Context-Aware Feature Sequence Based Person re-Identification.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5363–5372.
  • Singh, N. K., M. Khare, and H. B. Jethva. 2022. “A Comprehensive Survey on Person re-Identification Approaches: Various Aspects.” Multimedia Tools and Applications 81 (11): 15747–15791. https://doi.org/10.1007/s11042-022-12585-w.
  • Sun, Y., L. Zheng, Y. Yang, Q. Tian, and S. Wang. 2018. “Beyond Part Models: Person Retrieval with Refined Part Pooling (and a Strong Convolutional Baseline).” Proceedings of the European Conference on Computer Vision (ECCV), 480–496.
  • Therias, A., and A. Rafiee. 2023. “City Digital Twins for Urban Resilience.” International Journal of Digital Earth 16 (2): 4164–4190. https://doi.org/10.1080/17538947.2023.2264827.
  • Wang, H., and Y. Guo. 2022. “A Novel Pedestrian re-Identification Algorithm Framework Based on Deep Learning.” Second International Symposium on Computer Technology and Information Science (ISCTIS 2022), 12474: 152–159.
  • Wang, C., B. Ma, H. Chang, S. Shan, and X. Chen. 2020. “TCTS: A Task-Consistent Two-Stage Framework for Person Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11952–11961.
  • Wang, M., H. Ma, S. Liu, and Z. Yang. 2023. “A Novel Small-Scale Pedestrian Detection Method Base on Residual Block Group of CenterNet.” Computer Standards & Interfaces 84: 103702. https://doi.org/10.1016/j.csi.2022.103702.
  • Wang, G., Y. Yuan, X. Chen, J. Li, and X. Zhou. 2018. “Learning Discriminative Features with Multiple Granularities for Person re-Identification.” Proceedings of the 26th ACM International Conference on Multimedia, 274–282.
  • Wang, Y., P. Zhang, S. Gao, X. Geng, H. Lu, and D. Wang. 2021. “Pyramid Spatial-Temporal Aggregation for Video-Based Person Re-Identification.” Proceedings of the IEEE/CVF International Conference on Computer Vision, 12026–12035.
  • Woo, S., J. Park, J. Y. Lee, and I. S. Kweon. 2018. “CBAM: Convolutional Block Attention Module.” Proceedings of the European Conference on Computer Vision (ECCV), 3–19.
  • Wu, Z. Z., J. Xu, Y. Wang, F. Sun, M. Tan, and T. Weise. 2022. “Hierarchical Fusion and Divergent Activation Based Weakly Supervised Learning for Object Detection from Remote Sensing Images.” Information Fusion 80: 23–43. https://doi.org/10.1016/j.inffus.2021.10.010.
  • Wu, J., J. Zhu, J. Zhang, P. Dang, W. Li, Y. Guo, L. Fu, et al. 2023. “A Dynamic Holographic Modelling Method of Digital Twin Scenes for Bridge Construction.” International Journal of Digital Earth 16 (1): 2404–2425. https://doi.org/10.1080/17538947.2023.2229792.
  • Xiao, T., S. Li, B. Wang, L. Lin, and X. Wang. 2017. “Joint Detection and Identification Feature Learning for Person Search.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3415–3424.
  • Xiao, J., Y. Xie, T. Tillo, K. Huang, Y. Wei, and J. Feng. 2019. “IAN: The Individual Aggregation Network for Person Search.” Pattern Recognition 87: 332–340. https://doi.org/10.1016/j.patcog.2018.10.028.
  • Xie, Y., D. Feng, H. Chen, Z. Liao, J. Zhu, C. Li, and S. W. Baik. 2022. “An Omni-Scale Global–Local Aware Network for Shadow Extraction in Remote Sensing Imagery.” ISPRS Journal of Photogrammetry and Remote Sensing 193: 29–44. https://doi.org/10.1016/j.isprsjprs.2022.09.004.
  • Xie, Y., D. Feng, H. Chen, Z. Liu, W. Mao, J. Zhu, Y. Hu, and S. W. Baik. 2022a. “Damaged Building Detection from Post-Earthquake Remote Sensing Imagery Considering Heterogeneity Characteristics.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–17.
  • Xie, Y., D. Feng, X. Shen, Y. Liu, J. Zhu, T. Hussain, and S. W. Baik. 2022b. “Clustering Feature Constraint Multiscale Attention Network for Shadow Extraction from Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–14.
  • Xie, Y., J. Zhu, Y. Cao, D. Feng, M. Hu, W. Li, Y. Zhang, and L. Fu. 2020. “Refined Extraction of Building Outlines from High-Resolution Remote Sensing Imagery Based on a Multifeature Convolutional Neural Network and Morphological Filtering.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13: 1842–1855. https://doi.org/10.1109/JSTARS.2020.2991391.
  • Xie, Y., J. Zhu, J. Lai, P. Wang, D. Feng, Y. Cao, T. Hussain, and S. W. Baik. 2022. “An Enhanced Relation-Aware Global-Local Attention Network for Escaping Human Detection in Indoor Smoke Scenarios.” ISPRS Journal of Photogrammetry and Remote Sensing 186: 140–156. https://doi.org/10.1016/j.isprsjprs.2022.02.006.
  • Yan, Y., J. Li, J. Qin, S. Bai, S. Liao, L. Liu, F. Zhu, and L. Shao. 2021. “Anchor-free Person Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7690–7699.
  • Yan, Y., J. Qin, B. Ni, J. Chen, L. Liu, F. Zhu, W. S. Zheng, X. Yang, and L. Shao. 2020. “Learning Multi-Attention Context Graph for Group-Based re-Identification.” IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (6): 7001–7018. https://doi.org/10.1109/TPAMI.2020.3032542.
  • Yan, Y., Q. Zhang, B. Ni, W. Zhang, M. Xu, and X. Yang. 2019. “Learning Context Graph for Person Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Yang, W., D. Li, X. Chen, and K. Huang. 2020. “Bottom-up Foreground-Aware Feature Fusion for Person Search.” Proceedings of the 28th ACM International Conference on Multimedia, 3404–3412.
  • Yao, H., and C. Xu. 2021. “Joint Person Objectness and Repulsion for Person Search.” IEEE Transactions on Image Processing 30: 685–696. https://doi.org/10.1109/TIP.2020.3038347.
  • Yu, R., D. Du, R. LaLonde, D. Davila, C. Funk, A. Hoogs, and B. Clipp. 2022. “Cascade Transformers for End-to-End Person Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7267–7276.
  • Yun, X., M. Ge, Y. Sun, K. Dong, and X. Hou. 2021. “Margin CosReid Network for Pedestrian Re-Identification.” Applied Sciences 11 (4): 1775. https://doi.org/10.3390/app11041775.
  • Zhang, Z., C. Lan, W. Zeng, X. Jin, and Z. Chen. 2020. “Relation-aware Global Attention for Person Re-Identification.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Zhang, G., W. Lin, A. Kumar Chandran, and X. Jing. 2023. “Complementary Networks for Person Re-Identification.” Information Sciences 633: 70–84. https://doi.org/10.1016/j.ins.2023.02.016.
  • Zhang, Y., and H. Wang. 2023. “Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2153–2162.
  • Zhang, J., J. Zhu, P. Dang, J. Wu, Y. Zhou, W. Li, L. Fu, Y. Guo, and J. You. 2023. “An Improved Social Force Model (ISFM)-Based Crowd Evacuation Simulation Method in Virtual Reality with a Subway Fire as a Case Study.” International Journal of Digital Earth 16 (1): 1186–1204. https://doi.org/10.1080/17538947.2023.2197261.
  • Zhao, C., Z. Chen, S. Dou, Z. Qu, J. Yao, J. Wu, and D. Miao. 2022. “Context-aware Feature Learning for Noise Robust Person Search.” IEEE Transactions on Circuits and Systems for Video Technology 32 (10): 7047–7060. https://doi.org/10.1109/TCSVT.2022.3179441.
  • Zheng, P., J. Qin, Y. Yan, S. Liao, B. B. Ni, X. Cheng, and L. Shao. 2021. “Global-Local Context Network for Person Search.” arXiv preprint arXiv:2112.02500.
  • Zheng, L., H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian. 2017. “Person Re-Identification in the Wild.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1367–1376.
  • Zheng, Z., L. Zheng, and Y. Yang. 2018. “Pedestrian Alignment Network for Large-Scale Person re-Identification.” IEEE Transactions on Circuits and Systems for Video Technology 29 (10): 3037–3045. https://doi.org/10.1109/TCSVT.2018.2873599.
  • Zhong, Y., X. Wang, and S. Zhang. 2020. “Robust Partial Matching for Person Search in the Wild.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6827–6835.
  • Zhu, J., P. Dang, Y. Cao, J. Lai, Y. Guo, P. Wang, and W. Li. 2024c. “A Flood Knowledge-Constrained Large Language Model Interactable with GIS: Enhancing Public Risk Perception of Floods.” International Journal of Geographical Information Science, 38 (4): 603–625.
  • Zhu, J., P. Dang, J. Zhang, Y. Cao, J. Wu, W. Li, Y. Hu, and J. You. 2024a. “The Impact of Spatial Scale on Layout Learning and Individual Evacuation Behavior in Indoor Fires: Single-Scale Learning Perspectives.” International Journal of Geographical Information Science 38 (1): 77–99. https://doi.org/10.1080/13658816.2023.2271956.
  • Zhu, J., H. Yang, W. Lin, N. Liu, J. Wang, and W. Zhang. 2020. “Group re-Identification with Group Context Graph Neural Networks.” IEEE Transactions on Multimedia 23: 2614–2626. https://doi.org/10.1109/TMM.2020.3013531.
  • Zhu, J., J. Zhang, Q. Zhu, W. Li, J. Wu, and Y. Guo. 2024b. “A Knowledge-Guided Visualization Framework of Disaster Scenes for Helping the Public Cognize Risk Information.” International Journal of Geographical Information Science, 38 (4): 626–653.