Combined visual and spatial-temporal information for appearance change person re-identification

Article: 2197695 | Received 27 Jan 2023, Accepted 25 Mar 2023, Published online: 25 Apr 2023

Abstract

Person re-identification (ReID) seeks to identify the same individual across different cameras by matching their corresponding images. Current ReID datasets are limited in size and diversity, especially in terms of clothing changes, making traditional techniques vulnerable to appearance variations. Further, current approaches rely heavily on appearance features for discrimination, which is unreliable when a person’s appearance changes. We hypothesize that ReID accuracy can be enhanced by training the ReID model on a large volume of diversified training data and by combining multiple features for discrimination. We use the image channel shuffling data augmentation method to produce a large volume of diversified training data. Also, a two-stream visual and spatial-temporal method is proposed to learn features suitable for appearance change scenarios. The appearance features obtained from the visual stream are combined with spatio-temporal information to discriminate between two people. The proposed approach is evaluated for its robustness on short-term and long-term datasets. The presented two-stream approach outperforms earlier methods, achieving Rank1 accuracy of 98.6% on Market1501, 95.52% on DukeMTMC-reID, 76.21% on LTCC, and 91.76% on VC-Clothes.

1. Introduction

Person re-identification (ReID) is the task of associating images of the same person captured by different cameras, or by the same camera at different times. Deep learning has revolutionized the field of person re-identification by enabling the development of highly accurate and robust models (Bilakeri & Karunakar, Citation2023; Bilakeri & Kotegar, Citation2022; Movassagh et al., Citation2021) for this task. The ReID task is a key element of surveillance systems and a variety of other intelligent applications, e.g., forensic video investigation, crime investigation, and person tracking (Bilakeri & K, Citation2022). The most challenging part of ReID is appearance variation, which refers to the possibility that two people could appear to be the same person or that the same person could appear different. Appearance variation occurs when a person re-enters a surveillance network after a short time gap (a few hours) or a long time gap (a few days, weeks, or years). Short-term appearance variation is caused by illumination change, pose and viewpoint variation, background change, etc., whereas clothing change is the most common cause of long-term variation. Each person’s images in the Market1501 (L. Zheng et al., Citation2015) and DukeMTMC-reID (Z. Zheng et al., Citation2017) datasets were collected during a short period (on the same day); therefore, these are considered short-term datasets. In contrast, LTCC (Qian et al., Citation2020) and VC-Clothes (Wan et al., Citation2020) are long-term datasets because they were collected over a considerably longer period of time. Because of these short-term and long-term appearance variations, the same person appears to be different, or two people appear to be the same. Consistent re-identification under such variations remains unsolved.

There has been tremendous growth in ReID performance with the advent of deep learning, which has dominated other techniques (e.g., handcrafted feature engineering) in extracting discriminative features. Recently, by utilising deep learning for feature extraction, cutting-edge person ReID algorithms (G. Wang et al., Citation2016; Zhong, Zheng, Cao, et al., Citation2017; Zhuo et al., Citation2018) have significantly improved the results. However, performance degrades when the same person is tested under different appearance variations or when two persons wear the same clothing pattern. The first reason is that most benchmark datasets, such as Market1501 and DukeMTMC-reID, have a large number of classes with limited samples per class and a lack of diversity. Existing ReID methods address this issue using data augmentation methods such as random cropping (Krizhevsky et al., Citation2012), random flipping (Simonyan & Zisserman, Citation2014), and random erasing (Zhong, Zheng, Kang, et al., Citation2017). However, these methods increase the sample size without encoding class-sensitive information during the transformation. Recently, Generative Adversarial Networks (GANs) have been widely used for augmentation; however, their training is computationally expensive and takes a long time to converge. The second reason for ReID’s performance drop in the appearance change scenario is the lack of an efficient feature learning approach.

Recent studies have employed structural features such as body size, body shape, and horizontal/vertical body-part features. These structural features help to exploit whole-body and part-level features for the ReID task. Various attention-based and part-based methods have also been proposed to achieve efficient performance. However, most attention mechanisms rely on a pose estimation model, and detecting key points in the data considered for the study is time-consuming. Instead of appearance features, various methods such as video-based or image-to-video ReID methods (G. Wang et al., Citation2017; Li et al., Citation2018) aim to acquire visual representations that are invariant to spatial and temporal factors. Nevertheless, these methods solely prioritize visual attributes; most prior arts have ignored spatial-temporal constraints between multiple cameras. For instance, an individual observed by Camera 1 at a specific time instance t should not be visible on Camera 2, located at a distance from Camera 1, at a nearby time t + Δt. A large portion of the irrelevant target images is removed from the gallery by such a spatial-temporal constraint, significantly reducing the appearance ambiguity issue.

In light of the above discussion, we present a novel approach that uses image channel shuffling to generate images with significant appearance variations. Furthermore, we propose a two-stream approach for discriminative feature learning that considers both visual and spatial-temporal information. The visual stream extracts content and edge features, and the spatio-temporal stream uses readily available information such as the time-stamp of each frame. The visual stream features have more discriminative capability than the spatial-temporal information; therefore, when both are combined, the visual stream features contribute more towards improving the recognition rate, while adding spatial-temporal information improves the overall performance. The major contributions of our study are as follows:

  1. We show that the image channel shuffling data augmentation increases the sample size and introduces diversity in terms of the appearance change of a person.

  2. We validate that the inclusion of image channel shuffling has been helpful in improving the model generalization capability.

  3. We show that the integration of visual features (edge, content) and spatio-temporal information makes the proposed method robust for the appearance change scenario.

  4. The experimental results obtained on short-term and long-term datasets show that our method outperforms prior arts.

  5. We present an experimental investigation of cross-domain evaluation, which shows that the proposed method generalizes better to cross-domain re-identification.

The remainder of the paper is structured as follows. Section 2 reviews related works. Section 3 presents the proposed combined visual and spatio-temporal approach to generate more discriminative features. Experimental results and discussions are given in Section 4. Finally, conclusions are drawn in Section 5.

2. Related works

The current person ReID techniques focus on deep learning for appearance representation. Given the vast variety of studies on person ReID, we review the methods most closely related to our work. The existing works are reviewed as short-term and long-term person re-identification in our study.

2.1. Short-term person re-identification

The challenge of the ReID task is drawing increasing attention as surveillance systems become more common in the real world. As previously stated, practically all of the currently available ReID datasets (L. Zheng et al., Citation2015; Z. Zheng et al., Citation2017) were collected over a short period. Therefore, the clothes’ appearance is more or less consistent for the same person. Data collected over a short time exhibits appearance changes caused by illumination, pose, viewpoint, and occlusion. In the era of deep learning, considerable work has gone into creating methods for automatically re-identifying people using robust distance measurements or discriminative features. However, models trained on the current datasets (L. Zheng et al., Citation2015; Z. Zheng et al., Citation2017) are not robust when a person appears in changed clothes. Therefore, there is a need to supervise the model on large and diversified datasets. Aiming at this, Yu et al. (Z. Yu et al., Citation2021) proposed a semi-supervised clothing-invariant feature learning framework for discriminative embedding learning with a GAN that simulates clothing. To make the ReID model invariant to external factors, Deng et al. (Citation2018) suggested the use of a similarity-preserving GAN to address domain discrepancies through image style transfer. The work presented by Wei et al. (Citation2018) transformed the person from a source domain to a target domain to handle appearance changes caused by environmental effects. Recently, several deep learning models for person ReID with pose guidance have been introduced, aimed at accounting for changes in the appearance of individuals. For example, Qian et al. (Citation2018) developed an image generator capable of producing images of the same person in various poses, while Lv et al. (Citation2018) proposed a method for transferring view-invariant representations from source data to target data using asymmetric multi-task dictionary learning. However, these works rely on GANs to construct images with an appearance-change effect, and the time complexity of GANs is considerably high, which limits their application in real-world scenarios. Some works have aimed to propose appearance-invariant models by considering discriminative features. For example, Bhujel et al. (Citation2020) and Sun et al. (W. Sun et al., Citation2021) focused on local and global body features and part features such as horizontal and vertical parts. The attention mechanism also has profound potential in the field of person ReID, and various attention-based methods have been presented (Fan et al., Citation2020; M. Zheng et al., Citation2019; Si et al., Citation2019; Xu et al., Citation2021). Xu et al. (Citation2021) introduced a dual attention network that involves pose-guided attention and an activation-based attention model to solve the ReID problem. Zhang et al. (Citation2021) presented a part-guided attention network that exploits the human body’s key points; based on the key points, horizontal parts are generated, and then inter- and intra-local part relations are exploited for the ReID task. However, part-based and attention-based approaches (Xu et al., Citation2021; Zhang et al., Citation2021) depend heavily on a pose estimation model, which adds the extra burden of estimating ground-truth key points.
Another line of research focused on spatial-temporal data in order to create appearance-invariant ReID models. In the context of image-to-video person ReID, Wang et al. (G. Wang et al., Citation2017) devised a point-to-set technique. To identify several differentiating parts of a person in a video, Li et al. (Citation2018) created a spatial-temporal attention model. Wang et al. (G. Wang et al., Citation2019) proposed a two-stream network that combines visual features from horizontal body-part features with spatial-temporal information from frame sequences. However, equal horizontal partitioning does not solve misalignment, thereby increasing misclassification. Also, the model is trained on minimal samples; therefore, it performs poorly when a person’s appearance changes due to illumination, viewpoint, or background change. Suitable data augmentation techniques that increase the sample size and add an appearance-variation effect to the dataset are therefore essential. Moreover, more discriminative features, such as content, edge, and spatio-temporal information, combined with the right data augmentation, would better solve person re-identification in an appearance change scenario.

2.2. Long-term person re-identification

Due to its widespread applicability, the topic of clothing change has received increased attention in recent years. In long-term person re-identification, a person changes his/her clothes frequently. Recent person ReID methods have focused on exploring appearance-invariant features for discrimination. Yang et al. (Citation2019) presented a technique to capture the contour sketch of the body’s features; this work primarily concentrates on learning invariant and discriminative ReID features by considering clothing information. Chen et al. (J. Chen et al., Citation2021) exploit texture-insensitive 3D shape learning from 2D images to re-identify a person under clothing change. Hong et al. (Citation2021) presented a two-stream network that extracts body shape cues to handle the cloth change challenge. However, it is observed that body shape cues are ineffective when two individuals’ body shapes are similar. Yu et al. (S. Yu et al., Citation2020) offer a two-branch network that retrieves biometric data and clothing properties through two submodels, respectively. They also introduce COCAS, a large-scale dataset focusing on pedestrian clothing modification. Recently, Gao et al. (Z. Gao, H. Wei, W. Guan, J. Nie, et al., Citation2022) proposed human semantic attention and a visual cloth shielding model, jointly supervised end-to-end, to extract more discriminative and robust features irrelevant to appearance change. Gu et al. (Citation2022) presented a clothes-adversarial loss function to extract cloth-irrelevant representations by penalizing the predictive capability of the ReID model with respect to clothes. Gao et al. (Z. Gao, H. Wei, W. Guan, W. Nie, et al., Citation2022) proposed a multi-granular representation (MGR) technique in which multilevel and multi-granular feature information is adaptively extracted, and created a cloth desensitization network (CDN) to increase feature robustness for people wearing various outfits. However, performance improvement is still needed when someone arrives with changed clothes. The model’s performance can be improved by training it on a larger sample size, and a combination of multiple discriminative features would better solve the problem of long-term person re-identification.

In the proposed work, we present a two-stream framework in which the combination of visual and spatio-temporal information is considered to make the model invariant to both short-term and long-term appearance changes. Although conceptually related, our work differs in several aspects from (F. Chen et al., Citation2021; G. Wang et al., Citation2019): (i) We address ReID under both long-term and short-term appearance variation, whereas (F. Chen et al., Citation2021; G. Wang et al., Citation2019) focus only on short-term ReID. (ii) We use the channel shuffling data augmentation method to increase the sample size and produce more diversified training samples on both long-term and short-term benchmarks, whereas in (F. Chen et al., Citation2021), channel shuffling is applied only to a short-term dataset. (iii) In the proposed two-stream approach, the visual stream extracts content and edge features that are combined with the spatial-temporal information obtained from the spatial-temporal stream for discrimination. In (G. Wang et al., Citation2019), the two-stream framework extracts spatial-temporal information and visual features from a part-based convolution model, which is prone to spatial misalignment. Our approach extracts multiple features (content, edge) from the visual stream together with appearance-invariant spatial-temporal information, which is efficient and comprehensive for appearance change scenarios.

3. Combined visual and spatial-temporal information approach

The proposed ReID framework shown in Figure 1 is a two-branch network involving a Visual Feature Stream (VFS) and a Spatio-temporal Stream (STS). The two main components of the VFS are the data augmentation and feature-disentangling networks. The data augmentation method takes RGB images as input and performs channel shuffling to generate the same image in six colour versions. This method increases sample size and diversity in terms of appearance change. Next, a randomly picked augmented image and the original image are passed to the feature-disentangling network, which contains content and edge encoder modules. The encoders mine the appearance and edge features from the input images. The two encoders are trained to minimize the reconstruction error between the original and reconstructed images, while also maximizing the distance between the identity-related features of different people using identification and contrastive losses. Training is not necessary for the spatio-temporal stream, since the spatio-temporal information of each person in each camera is available from the dataset. During inference, an image pair is passed to the VFS and STS. The VFS extracts and compares the two images’ content and edge features to generate a similarity score. In parallel, the STS gives the probability that the two images form a positive pair (images of the same person) between disjoint cameras. We use a joint metric that considers the class probability of the two input images from the visual stream and the probability of a positive image pair from the spatial-temporal stream to classify whether both images are of the same person. The subsequent subsections provide a comprehensive description of each module.

Figure 1. The conceptual framework of the proposed method.

3.1. Visual feature stream

3.1.1. Data augmentation and soft label assignment

Increasing the training sample size by including additional images that contain appearance variations is essential to improve the generalization ability of the ReID method. With this aim, we utilize the image channel shuffling (F. Chen et al., Citation2021) data augmentation technique. In channel shuffling data augmentation, an input image is split into its channels (i.e., R, G, B), after which the order of the channels is changed (e.g., G, R, B), and finally, the channels are combined to create a new image. As shown in Figure 2, we generate five images for each input image, each with a distinct channel order (i.e., RBG, GRB, GBR, BRG, BGR). After data augmentation, the generated images are labelled using soft label assignment, which takes the channel order and the connections between the set of generated images into account. Let P denote the number of original identities. Since the group of generated images has wider appearance variation, we define a label vector of 6P bits. Let p index the identities 1, 2, …, 6P of the labelled data, and let y be the position of the 1, as in one-hot encoding. The ground truth label is defined in equation 1.
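The augmentation step itself is straightforward; a minimal sketch (our own illustrative code using NumPy and Pillow; the file name is hypothetical) that generates the five channel-shuffled variants of an input image is shown below.

```python
# Minimal sketch of channel-shuffling data augmentation: one RGB image yields
# five channel-permuted copies (the original RGB order is excluded).
import itertools

import numpy as np
from PIL import Image


def channel_shuffle_variants(rgb_image: np.ndarray) -> dict:
    """Return the five channel-permuted copies of an H x W x 3 RGB image."""
    variants = {}
    for perm in itertools.permutations(range(3)):
        if perm == (0, 1, 2):                      # skip the original RGB order
            continue
        key = "".join("RGB"[c] for c in perm)      # e.g. "GRB", "BGR"
        variants[key] = rgb_image[:, :, list(perm)]
    return variants


image = np.asarray(Image.open("person_0001.jpg").convert("RGB"))  # hypothetical file
augmented = channel_shuffle_variants(image)
print(sorted(augmented))  # ['BGR', 'BRG', 'GBR', 'GRB', 'RBG']
```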

Figure 2. Samples generated using the channel order shuffling approach.
(1) $\mathrm{Id}(p)=\begin{cases}0, & p\neq y\\ 1, & p=y\end{cases}$

If the class identification number p equals the position y in the one-hot encoding, the label is 1 (correct ground truth); otherwise, it is 0. In the soft label method, the label Id(·) is determined by three components, as formulated in equation 2.

(2) $\mathrm{Id}(p)=\mathrm{Id}_m(p)+\mathrm{Id}_c(p)+\mathrm{Id}_e(p)$

where $\mathrm{Id}_m(p)$, $\mathrm{Id}_c(p)$, and $\mathrm{Id}_e(p)$ are the main component, content information, and edge information, respectively. The main component gives the training sample’s identity information. Content information represents the association between images with the same channel sequence. Edge information measures how closely the generated image resembles the original. The soft label is a probability distribution, so its component weights satisfy equation 3.

(3) $E_m+E_c+E_e=1$

where $E_m$, $E_c$, and $E_e$ are the weights assigned to the main, content, and edge components, respectively. $\mathrm{Id}_m(\cdot)$ is defined by equation 4.

(4) $\mathrm{Id}_m(p)=\begin{cases}0, & p\neq y\\ E_m, & p=y\end{cases}$

$\mathrm{Id}_c(\cdot)$ focuses on channel-order consistency. The content weight $E_c$ represents the relation of content information among images with the same channel sequence. For every channel sequence, P identities share that sequence; since there are 6P identities in total, $E_c$ is split evenly across P bits of the soft label. It is defined by equation 5:

(5) $\mathrm{Id}_c(p)=\begin{cases}E_c/P, & p \text{ and } y \text{ have the same channel order}\\ 0, & \text{otherwise}\end{cases}$

$\mathrm{Id}_e(\cdot)$ denotes the edge resemblance among the set of generated images. The edge weight represents the correlation between the original and generated images. The six generated images of each identity share similar edges; therefore, $E_e$ is split evenly across 6 bits of the soft label encoding, as defined by equation 6.

(6) $\mathrm{Id}_e(p)=\begin{cases}E_e/6, & p \text{ and } y \text{ are from the same group}\\ 0, & \text{otherwise}\end{cases}$

The soft label assignment is flexible in extracting the desired information, such as content and edge: either can be emphasized by setting $E_c$ or $E_e$ to a larger value. The considered data augmentation relies solely on internally available information and can extract distinguishing features from image samples using ID labels. The original image and the augmented, soft-labelled images are passed to the feature-disentangling module to extract content and edge features.
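To make the soft-label definition concrete, the sketch below builds the 6P-dimensional label of equations 2–6 for one generated sample. The class layout (position = channel_order × P + original identity) is our own illustrative assumption, not a detail given in the paper.

```python
import numpy as np


def soft_label(orig_id: int, channel_order: int, num_ids: int,
               e_m: float = 0.7, e_c: float = 0.2, e_e: float = 0.1) -> np.ndarray:
    """Build the 6P-dimensional soft label of equations 2-6.

    Layout assumption (ours): position = channel_order * num_ids + orig_id,
    i.e. the 6P augmented classes are grouped first by channel order, then by person.
    """
    label = np.zeros(6 * num_ids)

    # Main component (eq. 4): weight E_m on the sample's own augmented identity.
    y = channel_order * num_ids + orig_id
    label[y] += e_m

    # Content component (eq. 5): E_c spread over the P identities with this channel order.
    same_order = np.arange(channel_order * num_ids, (channel_order + 1) * num_ids)
    label[same_order] += e_c / num_ids

    # Edge component (eq. 6): E_e spread over the 6 channel versions of the same person.
    same_person = orig_id + num_ids * np.arange(6)
    label[same_person] += e_e / 6

    return label


lbl = soft_label(orig_id=3, channel_order=2, num_ids=751)  # Market1501 has 751 training IDs
assert abs(lbl.sum() - 1.0) < 1e-9  # E_m + E_c + E_e = 1 (eq. 3)
```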

3.1.2. Feature disentangling

The feature-disentangling module takes two input images: the original image (OI) and a randomly chosen generated image (GI). The pair of input images is processed by two encoders, $E_c$ and $E_e$, which are structured identically but have different parameters. This yields content (Oc, Gc) and edge (Oe, Ge) feature maps from the images. The encoders are built on the DenseNet121 (G. Huang et al., Citation2017) backbone network. A shared decoder then reconstructs the images from the encoders’ output feature maps. The model is supervised with reconstruction, contrastive, and identification loss functions for better and faster convergence.
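A minimal PyTorch-style sketch of this module is given below (class and variable names are ours, and the decoder is left as a user-supplied placeholder); it also produces the swapped reconstructions discussed in the following paragraphs.

```python
import torch
import torch.nn as nn
from torchvision import models


class Disentangler(nn.Module):
    """Two DenseNet121-based encoders (content, edge) with a shared decoder."""

    def __init__(self, decoder: nn.Module):
        super().__init__()
        # Convolutional feature extractors; load pretrained weights as needed.
        self.content_enc = models.densenet121(weights=None).features   # E_c
        self.edge_enc = models.densenet121(weights=None).features      # E_e (same structure, separate parameters)
        self.decoder = decoder                                          # shared decoder D

    def forward(self, original, generated):
        o_c, o_e = self.content_enc(original), self.edge_enc(original)
        g_c, g_e = self.content_enc(generated), self.edge_enc(generated)
        # Self-reconstructions and swap-reconstructions (content + edge concatenated).
        rec_o = self.decoder(torch.cat([o_c, o_e], dim=1))
        rec_g = self.decoder(torch.cat([g_c, g_e], dim=1))
        rec_swap_o = self.decoder(torch.cat([o_c, g_e], dim=1))  # original content + generated edge
        rec_swap_g = self.decoder(torch.cat([g_c, o_e], dim=1))  # generated content + original edge
        return (o_c, o_e, g_c, g_e), (rec_o, rec_g, rec_swap_o, rec_swap_g)
```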

The decoder network reconstructs the original and generated images by concatenating their respective extracted features (content and edge) to ensure that the extracted features capture representative information. Then, a pixel-wise self-reconstructive L2 loss is computed to measure the similarity between the original (OI, GI) and reconstructed (ROI, RGI) images. The self-reconstructive loss ($L_{rec}^{self}$) is computed using equation 7.

(7) $L_{rec}^{self}=\sum_{x\in\{O,G\}}\left\|I_x-D\big(E_c(I_x),E_e(I_x)\big)\right\|_2$

where x denotes either the original or the generated image, and $D(E_c(I_x),E_e(I_x))$ denotes the image reconstructed by combining the content and edge information of that same image (e.g., ROI from (Oc, Oe) for the original image; the same definition applies to the generated image). We hypothesize that the generated and original images share common edge information but differ in content information. To verify this hypothesis, the content information of the original and generated images is swapped: the content features of the original image are combined with the edge features of the generated one to reconstruct a new image, represented as R(Ge, Oc), and the content features of the generated image are combined with the original image’s edge features to reconstruct a new image named R(Oe, Gc). The reconstructed images (R(Ge, Oc), R(Oe, Gc)) are expected to match the original images (OI, GI), respectively. The reconstructive swap loss ($L_{rec}^{swap}$) is computed using equation 8.

(8) $L_{rec}^{swap}=\sum_{(x,y)\in\{O,G\},\,x\neq y}\left\|I_x-D\big(E_c(I_x),E_e(I_y)\big)\right\|_2$

where $D(E_c(I_x),E_e(I_y))$ denotes the image reconstructed by combining the content of image $I_x$ with the edge of image $I_y$. The reconstructive loss ($L_{rec}$) is the sum of the self and swap reconstructive losses, as defined in equation 9.

(9) $L_{rec}=L_{rec}^{self}+L_{rec}^{swap}$
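Assuming reconstructions produced by a disentangling module as sketched earlier, the reconstruction losses of equations 7–9 could be computed as follows; using mean-squared error as the pixel-wise L2 penalty is an implementation choice on our part.

```python
import torch.nn.functional as F


def reconstruction_loss(original, generated, rec_o, rec_g, rec_swap_o, rec_swap_g):
    """L_rec = L_rec_self + L_rec_swap (equations 7-9), with pixel-wise L2 penalties."""
    # Self-reconstruction (eq. 7): each image vs. its own (content, edge) reconstruction.
    l_self = F.mse_loss(rec_o, original) + F.mse_loss(rec_g, generated)
    # Swap reconstruction (eq. 8): own content + the other image's edge should still
    # reproduce the image whose content was used, since the edge is shared.
    l_swap = F.mse_loss(rec_swap_o, original) + F.mse_loss(rec_swap_g, generated)
    return l_self + l_swap  # eq. 9
```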

Also, the ReID model is trained using a contrastive loss ($L_{con}$) for faster convergence. The contrastive loss is computed separately for the content ($L_{con}^{c}$) and edge ($L_{con}^{e}$) features, as defined in equations 10 and 11. The total contrastive loss is computed using equation 12.

(10) $L_{con}^{c}=\max\!\left(\mathrm{margin}_c-\left\|f_o^c-f_g^c\right\|_2,\,0\right)$
(11) $L_{con}^{e}=\left\|f_o^e-f_g^e\right\|_2$

where $f_o^c$ and $f_g^c$ are the content feature vectors of the original and augmented images, respectively, and $\mathrm{margin}_c$ is a hyper-parameter specifying the minimum gap between $f_o^c$ and $f_g^c$.

(12) $L_{con}=\alpha_c L_{con}^{c}+\alpha_e L_{con}^{e}$

where αc and αe are the trade-off parameters. To boost the identification ability of the learned feature, the ID loss is computed on the extracted content and edge features of the original and the generated images. The ID loss (Lid) is defined as per equation 13.

(13) $L_{id}=\sum_{x\in\{O,G\}}L_{id}^{c,x}+\sum_{x\in\{O,G\}}L_{id}^{e,x}$

where $L_{id}^{c,x}$ and $L_{id}^{e,x}$ are the cross-entropy losses between the predicted and ground-truth probabilities for the content and edge features of image x.

Overall, the visual feature stream is trained using the reconstructive, contrastive, and ID losses. As in equation 14, the final loss ($L_{total}$) is the weighted sum of the three loss functions, where β and γ are the weights assigned to the reconstructive and ID losses to balance the different tasks.

(14) $L_{total}=L_{con}+\beta L_{rec}+\gamma L_{id}$
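A minimal sketch of the contrastive, ID, and total losses (equations 10–14) follows; the margin, trade-off, and weighting values are illustrative placeholders rather than the paper's tuned settings, and the soft-label targets come from the assignment in Section 3.1.1.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(f_o_c, f_g_c, f_o_e, f_g_e, margin_c=1.0, alpha_c=0.5, alpha_e=0.5):
    """Equations 10-12: push content features apart, pull edge features together."""
    d_content = torch.norm(f_o_c - f_g_c, p=2, dim=1)
    d_edge = torch.norm(f_o_e - f_g_e, p=2, dim=1)
    l_con_c = torch.clamp(margin_c - d_content, min=0).mean()   # eq. 10
    l_con_e = d_edge.mean()                                     # eq. 11
    return alpha_c * l_con_c + alpha_e * l_con_e                # eq. 12


def id_loss(logits_list, soft_label_list):
    """Equation 13: cross-entropy of each content/edge head against its soft-label target."""
    return sum(torch.mean(torch.sum(-t * F.log_softmax(z, dim=1), dim=1))
               for z, t in zip(logits_list, soft_label_list))


def total_loss(l_con, l_rec, l_id, beta=1.0, gamma=1.0):
    """Equation 14: weighted sum of the contrastive, reconstruction, and ID losses."""
    return l_con + beta * l_rec + gamma * l_id
```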

During the testing phase, the 512-dimensional content and edge features are concatenated to form the final person representation, a 1024-dimensional vector. The similarity between the gallery and query images is computed using the Euclidean distance metric. To boost the discriminative power of our proposed method, we also estimate the spatio-temporal probability for the same two images, as described in the following subsection.
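As an illustration of this inference step (our own sketch; tensor shapes are assumed to be 1×512 for the query features and N×512 for the gallery features):

```python
import torch


def person_distance(query_content, query_edge, gallery_content, gallery_edge):
    """Concatenate the 512-d content and edge features into a 1024-d representation
    and rank gallery images by Euclidean distance (smaller = more similar)."""
    q = torch.cat([query_content, query_edge], dim=-1)        # 1 x 1024
    g = torch.cat([gallery_content, gallery_edge], dim=-1)    # N x 1024
    return torch.cdist(q, g).squeeze(0)                       # N distances to the query
```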

3.2. Spatial-Temporal stream

A spatial-temporal stream is employed in our work to obtain complementary spatial-temporal information that aids the visual feature stream. The spatial-temporal distribution is computed using the non-parametric Parzen window approach (Parzen, Citation1962) applied to a histogram. This stream does not require training; we train only the visual stream. Consider two images $I_i$ and $I_j$ of identities $ID_i$ and $ID_j$, captured by cameras $C_i$ and $C_j$ at times $t_i$ and $t_j$, respectively. We estimate a rough spatial-temporal frequency distribution to represent the likelihood of a positive image pair by equation 15.

(15) $\hat{P}(Y=1\mid k,C_i,C_j)=\dfrac{n^{k}_{c_i c_j}}{\sum_{l} n^{l}_{c_i c_j}}$

where k indexes the histogram bin whose time interval $((k-1)\Delta t, k\Delta t)$ contains the gap $t_i-t_j$, $n^{k}_{c_i c_j}$ is the number of person-image pairs from cameras $c_i$ and $c_j$ whose time gap falls in the k-th bin, and $\Delta t$ is a small value. The event Y = 1 denotes that $I_i$ and $I_j$ are images of the same person. We smooth the histogram using the Parzen window approach defined by equation 16.

(16) $\hat{P}(Y=1\mid k,C_i,C_j)=\dfrac{1}{Z}\sum_{l}\hat{P}(Y=1\mid l,C_i,C_j)\,K(l-k)$

where $K(\cdot)$ is the kernel and $Z=\sum_{k}\hat{P}(Y=1\mid k,C_i,C_j)$ is the normalization factor. In this work, the Gaussian function is used as the kernel K, defined by equation 17:

(17) $K(x)=\dfrac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{x^{2}}{2\sigma^{2}}}$

where x is the kernel input (the bin offset), σ is the standard deviation, and $\sigma^2$ is the variance.
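A small NumPy sketch of equations 15–17 for a single camera pair is given below (function names are ours; in our experiments Δt = 100 and σ = 50, as noted in Section 4.1).

```python
import numpy as np


def st_histogram(time_gaps, num_bins, delta_t=100):
    """Rough spatial-temporal distribution (eq. 15): histogram of positive-pair
    time gaps for one camera pair, normalised to sum to one."""
    bins = np.minimum(np.abs(np.asarray(time_gaps)) // delta_t, num_bins - 1).astype(int)
    counts = np.bincount(bins, minlength=num_bins).astype(float)
    return counts / counts.sum()


def parzen_smooth(hist, sigma=50.0):
    """Smooth the histogram with a Gaussian kernel (eqs. 16-17) and renormalise."""
    k = np.arange(len(hist))
    # K(l - k) evaluated for every pair of bins l, k.
    kernel = np.exp(-((k[None, :] - k[:, None]) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    smoothed = kernel @ hist
    return smoothed / smoothed.sum()
```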

The spatio-temporal stream evaluates the likelihood of two images pertaining to the same individual. The probability estimates from VFS and STS are considered in the joint metric phase for classification.

3.3. Joint metric

The spatial-temporal distribution and the visual similarity score are independent, since they capture two different patterns. The joint probability is defined as follows:

(18) $P(y=1\mid x_i,x_j,k,c_i,c_j)=s(x_i,x_j)\,P(y=1\mid k,c_i,c_j)$

where $s(x_i,x_j)$ represents the visual similarity of the two persons $x_i$ and $x_j$, and $P(y=1\mid k,c_i,c_j)$ is the spatial-temporal probability. Equation 18 has two shortcomings: (i) it treats the visual similarity score directly as a visual similarity probability, and (ii) it relies on the unreliable and uncontrollable spatial-temporal probability (people may appear from anywhere at any time). Directly employing $P(y=1\mid k,c_i,c_j)$ therefore leads to lower recall. For example, suppose one gallery image has a visual similarity score of 0.9 and a spatial-temporal probability of 0.01, while another has a visual similarity score of 0.3 and a spatial-temporal probability of 0.1. According to equation 18, the second gallery image is retrieved, and the image with the low spatial-temporal probability is treated as irrelevant, which is not acceptable in real-world applications. For instance, a thief walks faster than a normal person, so the spatial-temporal probability for the thief is naturally low and the thief would fail to be retrieved. To overcome this, the visual similarity score is transformed into a visual probability. First, Laplace smoothing is employed to adjust the probability of rare events (the zero-frequency problem), which alleviates unreliable probability estimation. Second, we apply the logistic function, defined in equation 19, for binary classification.

(19) $f(x;\lambda,\gamma)=\dfrac{1}{1+\lambda e^{-\gamma x}}$

where λ and γ are the smoothing and shrinking factors, respectively. Using the Laplace smoothing and the logistic function together is known as the logistic smoothing strategy. Equation 18 is then modified as follows:

(20) $P_{joint}=f(s;\lambda_0,\gamma_0)\cdot f(P_{st};\lambda_1,\gamma_1)$

where $P_{joint}$ is $P(Y=1\mid x_i,x_j,k,c_i,c_j)$, s is the visual similarity score $s(x_i,x_j)$, and $P_{st}=P(Y=1\mid k,c_i,c_j)\in(0,1)$ is the spatial-temporal probability. The λ and γ terms are the smoothing and shrinking factors, respectively. Employing the logistic function to convert the similarity score into a binary classification probability is straightforward.
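The joint metric of equations 19–20 can be sketched as follows; the default parameter values follow the settings reported in Section 4.1, and the function names are ours.

```python
import numpy as np


def logistic(x, lam, gamma):
    """Logistic smoothing function f(x; lambda, gamma) of equation 19."""
    return 1.0 / (1.0 + lam * np.exp(-gamma * x))


def joint_probability(visual_score, st_prob, lam0=1.0, gamma0=5.0, lam1=2.0, gamma1=5.0):
    """Joint metric of equation 20: both scores are squashed by the logistic function
    and multiplied, so a small spatial-temporal probability can no longer veto a
    strong visual match."""
    return logistic(visual_score, lam0, gamma0) * logistic(st_prob, lam1, gamma1)
```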

4. Results and discussions

4.1. Implementation details

DenseNet121 is used as the backbone for both the content and edge encoders ($E_c$, $E_e$); it has a larger receptive field than VGG and ResNet50, which helps to extract better global features of a pedestrian (F. Chen et al., Citation2021). As max pooling extracts more relevant features than average pooling, the global average pooling layer is replaced with an adaptive max-pooling layer, and two fully connected layers (fc1, fc2) are added. fc1 maps the feature map to a 512-dimensional feature vector, and fc2 generates the class probabilities. The number of output neurons is set to six times the number of training IDs. The decoder structure is given in Table 1.

Table 1. Decoder structure
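The encoder head described above might be assembled as in the following sketch (our own illustrative code; the placement of dropout is an assumption on our part).

```python
import torch.nn as nn
from torchvision import models


class EncoderHead(nn.Module):
    """DenseNet121 feature extractor with the head described above:
    adaptive max pooling, fc1 -> 512-d embedding, fc2 -> 6P class logits."""

    def __init__(self, num_train_ids: int, dropout: float = 0.5):
        super().__init__()
        self.features = models.densenet121(weights=None).features  # 1024-channel output
        self.pool = nn.AdaptiveMaxPool2d(1)
        self.fc1 = nn.Linear(1024, 512)
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(512, 6 * num_train_ids)  # six times the training IDs

    def forward(self, x):
        f = self.pool(self.features(x)).flatten(1)
        emb = self.fc1(f)                    # 512-d feature used at test time
        logits = self.fc2(self.dropout(emb)) # class probabilities after softmax
        return emb, logits
```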

This work follows the empirically tested hyper-parameter values for the loss functions and the soft label computation from (F. Chen et al., Citation2021). All images are resized to 256×128. The network is trained with the SGD optimizer using a batch size of 24 and a dropout rate of 0.5. The learning rate is set to 0.01 for the initial layers and decayed by a factor of 0.1 every 35 epochs; the model is trained for 90 epochs. For the spatial-temporal stream, Δt is set to 100, and the σ value of the Gaussian kernel is set to 50. In the joint metric, $\lambda_0$, $\lambda_1$, $\gamma_0$, and $\gamma_1$ are set to 1, 2, 5, and 5, respectively. The spatial-temporal stream does not require training, since the spatial-temporal information is generated from the image annotations.
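An illustrative training loop matching these settings is sketched below; the momentum value and the helper names are our assumptions.

```python
import torch


def train(model, train_loader, compute_loss, epochs=90):
    """Illustrative training loop: SGD, lr 0.01, decayed by 0.1 every 35 epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=35, gamma=0.1)
    for _ in range(epochs):
        for batch in train_loader:              # batch size 24, images resized to 256x128
            optimizer.zero_grad()
            loss = compute_loss(model, batch)   # L_total of equation 14
            loss.backward()
            optimizer.step()
        scheduler.step()
```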

The model is tested on the public benchmark datasets Market1501, DukeMTMC-reID, LTCC, and VC-Clothes. The datasets’ details are summarized in Table 2.

Table 2. Dataset details

4.2. Comparison with the prior arts

Our method is evaluated on the Market1501 and DukeMTMC-reID datasets to analyse its performance in short-term settings. The performance attained by our approach on the short-term datasets is compared with several cutting-edge techniques in Table 3. We also verify the performance on the long-term datasets (LTCC and VC-Clothes), and the outcomes are compiled in Table 4. The performance is superior to existing approaches in both short-term and long-term scenarios.

Table 3. Performance comparison with prior arts on Market1501 and DukeMTMC-reID. The best results are highlighted in bold

Table 4. Performance comparison with prior arts on LTCC and VC-Clothes. The obtained results are for the cloth-change setting

The PGCN (Zhang et al., Citation2021) and DAReID (Xu et al., Citation2021) methods require a pose estimation model to obtain ground-truth joint points, which is an extra burden. Moreover, the PGCN model focuses more on de-noising images than on extracting robust features for the ReID task, and most images are discarded based on uncertainty criteria. In our work, the data augmentation increases the dataset size and introduces more appearance-change diversity. The content and edge features and the complementary spatial-temporal information effectively handle the ReID task in an appearance change scenario. Together, the efficient visual stream and complementary spatial-temporal stream enhance the ReID performance compared with existing techniques. The best results are highlighted in bold, and results are reported both without and with re-ranking.

To assess the robustness of the proposed technique for appearance change scenarios, it is evaluated on the LTCC and VC-Clothes datasets. Our strategy outperforms cutting-edge approaches, as shown by the findings given in Table 4.

We visualized the top 10 retrieved results on the Market1501 and LTCC datasets. Figures 3 and 4 show the retrieved results on Market1501 and LTCC, respectively, comparing retrieval with the visual feature stream alone against retrieval with the combined visual and spatial-temporal features. We can see that the combined features are more favourable for retrieving the correct gallery image than the visual features alone. The retrieved images are illustrated directly from the gallery set without resizing; therefore, their sizes vary. The proposed model performs efficiently on both long-term and short-term datasets.

Figure 3. Market1501 dataset: Top 10 retrieved gallery images for the given query image.

Figure 4. LTCC dataset: Top 10 retrieved gallery images for the given query image.

A limitation of the proposed method is that, to integrate the visual and spatio-temporal information, the input data must provide a time-stamp for each person in every camera view.

4.3. Ablation study and discussions

To check the effectiveness of data augmentation (DA), feature disentangling (FD), and spatial-temporal (ST) modules, we analyzed them by systematic experiment under the same dataset settings. The ablation study is conducted on Market1501 and DukeMTMC-reID datasets.

Baseline

The baseline is the simple DenseNet121 network with a fully connected layer to perform classification. The number of neurons is set to the number of identities in the considered dataset. The model is trained with identification loss.

Baseline+DA

The channel shuffling data augmentation is applied to all the images in the training set. The generated images are assigned new IDs using the soft-label assignment approach. The number of neurons in the fully connected layers is set to six times the original identities. The model is trained with identification loss.

Baseline+DA+FD

The feature-disentangling method is employed to extract the content and edge features, with the soft-label weights set to Em = 0.7, Ec = 0.2, and Ee = 0.1. The model is trained with the ID, contrastive, and reconstruction losses. The number of neurons in the fully connected layer is six times the number of original IDs. The model developed with Baseline+DA+FD represents the visual stream.

Baseline+DA+FD+ST

To complement the visual stream, spatial-temporal information is used. The joint metric method is applied to evaluate the performance of the model.

The ablation study results are presented in Table 5. Adding the components one by one shows a steady performance improvement. Additionally, the interaction of these components is advantageous for improving the overall performance.

Table 5. Ablation study conducted with combinations of different features

4.3.1. Effect of visual stream, spatial-temporal stream, and joint metric

The performance is analyzed considering visual and spatial-temporal streams separately, then the combination of both features is utilized in the joint metric.

Effectiveness of visual stream

To check the benefit of the visual stream, we discarded it and considered only spatial-temporal information for re-identification. Rank1 accuracy drops sharply, from 98.63% to 12.01% on Market1501 and from 95.52% to 8.13% on DukeMTMC-reID. The results are listed in Table 6. The visual features thus play a very important role in this approach.

Table 6. Analysis of the effectiveness of the visual stream, spatial-temporal stream, and joint metric on the ReID performance

Effectiveness of spatial-temporal stream

The benefit of having the STS is analysed by detaching this module. The performance is assessed considering only the visual stream, and the findings are shown in Table 6. Rank1 accuracy drops from 98.63% to 94.29% on Market1501 and from 95.52% to 88.62% on DukeMTMC-reID.

Effectiveness of Joint Metric

The joint metric uses visual and spatial-temporal probability to discriminate between two individuals. The combined features achieved an improved Rank1 recognition rate of 98.63% on Market1501 and 95.52% on DukeMTMC-reID datasets, respectively.

It is observed from Table 6 that the visual stream plays a vital role, and the spatial-temporal stream is complementary in further enhancing the ReID performance. The joint metric module helps in adjusting the probability of rare events.

4.3.2. Cross-dataset ReID

There is a considerable domain gap between various ReID datasets, which results in a significant decline in performance when a model trained on one dataset is applied directly to another. To showcase the ability of our method to generalize across domains, we carry out a comparative analysis in a cross-domain environment. Cross-domain ReID is conducted by training the model on Market1501 and testing it on the DukeMTMC-reID dataset, and vice versa. The obtained results are summarized in Table 7. We make two observations: first, all models suffer major performance degradation, and second, our approach is superior to the existing methods.

Table 7. The performance of the proposed model on cross-domain ReID compared with state-of-the-art methods. Md indicates trained on Market1501 and tested on DukeMTMC-reID, and vice versa

5. Conclusion

To address the scarcity of training data and its lack of diversity, in this paper we explored the image channel shuffling data augmentation approach, which increases the sample size and introduces an appearance-variation effect into the benchmark datasets. The model trained with this augmentation technique shows improved generalization capability on unseen domains. Further, combining multiple features is more robust than using appearance features alone. Aiming at this, we proposed a two-stream network that combines visual (content, edge) and spatio-temporal features. The considered features have shown their significance by consistently re-identifying a person under both long-term and short-term appearance variations. The presented two-stream approach outperforms earlier methods, achieving Rank1 accuracy of 98.6% on Market1501, 95.52% on DukeMTMC-reID, 76.21% on LTCC, and 91.76% on VC-Clothes. These results evidence that our method is robust in re-identifying a person in an appearance change scenario.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

No funding was received for conducting this study.

References

  • Bhujel, N., Jun, L., Yun, Y. W., & Wang, H. (2020). Towards understanding and inferring the crowd: Guided second order attention networks and re-identification for multi-object tracking. Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), The Caesars Forum Conventional Center in Las Vegas, USA (pp. 9999–17). IEEE.
  • Bilakeri, S., & K, A. K. (2022). Multi-object tracking by multi-feature fusion to associate all detected boxes. Cogent Engineering, 9(1), 2151553. https://doi.org/10.1080/23311916.2022.2151553
  • Bilakeri, S., & Karunakar, A. K. (2023). Triplet Multi-task Learning Strategy for Person Re-identification Using Deep Learning. Proceedings of International Conference on Data Science and Applications: ICDSA 2022. School of Mobile Computing and Communication, Jadavpur University, Kolkata, India, 2, 447–461. Springer Nature.
  • Bilakeri, S., & Kotegar, K. (2022). Strong baseline with auto-encoder for scale-invariant person re-identification. Proceedings of the 2022 International Conference on Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER), JNN College of Engineering, Shimoga, Karnataka, India (pp. 1–6). IEEE.
  • Chen, J., Jiang, X., Wang, F., Zhang, J., Zheng, F., Sun, X., & Zheng, W. S. (2021). Learning 3d shape feature for texture-insensitive person re-identification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, United States (pp. 8146–8155).
  • Chen, F., Wang, N., Tang, J., & Zhu, F. (2021). A feature disentangling approach for person re-identification via self-supervised data augmentation. Applied Soft Computing, 100, 106939. https://doi.org/10.1016/j.asoc.2020.106939
  • Che, J., Zhang, Y., Yang, Q., & He, Y. (2022). Research on person re-identification based on posture guidance and feature alignment. Multimedia Systems, 29(2), 1–8. https://doi.org/10.1007/s00530-022-01016-3
  • Deng, W., Zheng, L., Ye, Q., Kang, G., Yang, Y., & Jiao, J. (2018). Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA. (pp. 994–1003).
  • Fan, H., Zhu, L., Yang, Y., & Wu, F. (2020). Recurrent attention network with reinforced generator for visual dialog. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(3), 1–16. https://doi.org/10.1145/3390891
  • Gao, Z., Wei, H., Guan, W., Nie, W., Liu, M., & Wang, M. (2021). Multigranular visual-semantic embedding for cloth-changing person re-identification. Proceedings of the 30th ACM International Conference on Multimedia, Lisbon Congress Center, Portugal (pp. 3703–3711).
  • Gao, Z., Wei, H., Guan, W., Nie, W., Liu, M., Wang, M. (2022). Multigranular visual-semantic embedding for cloth-changing person re-identification. Proceedings of the 30th ACM International Conference on Multimedia, Lisbon Congress Center, Portugal (pp. 3703–3711).
  • Gao, Z., Wei, H., Guan, W., Nie, J., Wang, M., & Chen, S. (2022). A semantic-aware attention and visual shielding network for cloth-changing person re-identification.
  • Gu, X., Chang, H., Ma, B., Bai, S., Shan, S., Chen, X. (2022) Clothes-changing person re-identification with rgb modality only. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans Ernest N. Morial Convention Center New Orleans, Louisiana (pp. 1060–1069).
  • Hong, P., Wu, T., Wu, A., Han, X., Zheng, W. S. (2021) Fine-grained shape-appearance mutual learning for cloth-changing person re-identification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA (pp. 10513–10522).
  • Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K. Q. (2017) Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, Hawaii (pp. 4700–4708).
  • Huang, H., Yang, W., Chen, X., Zhao, X., Huang, K., Lin, J., Huang, G., & Du, D. (2018). Eanet: Enhancing alignment for cross-domain person re-identification.
  • Jia, Z., Li, Y., Tan, Z., Wang, W., Wang, Z., & Yin, G. (2022). Domain-invariant feature extraction and fusion for cross-domain person re-identification. The Visual Computer, 39(3), 1–12. https://doi.org/10.1007/s00371-022-02398-1
  • Jia, X., Zhong, X., Ye, M., Liu, W., & Huang, W. (2022). Complementary data augmentation for cloth-changing person re-identification. IEEE Transactions on Image Processing, 31, 4227–4239. https://doi.org/10.1109/TIP.2022.3183469
  • Jin, X., He, T., Zheng, K., Yin, Z., Shen, X., Huang, Z., Feng, R., Huang, J., Hua, X. S., & Chen, Z. (2021). Cloth-changing person re-identification from a single image with gait prediction and regularization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14278–14287).
  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25(6), 1097–1105.
  • Li, S., Bak, S., Carr, P., & Wang, X. (2018). Diversity regularized spatiotemporal attention for video-based person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA (pp. 369–378).
  • Lv, J., Chen, W., Li, Q., & Yang, C. (2018). Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7948–7956).
  • Movassagh, A. A., Alzubi, J. A., Gheisari, M., Rahimi, M., Mohan, S., Abbasi, A. A., & Nabipour, N. (2021). Artificial neural networks training algorithm integrating invasive weed optimization with differential evolutionary model. Journal of Ambient Intelligence and Humanized Computing, 1–9. https://doi.org/10.1007/s12652-020-02623-6
  • Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33(3), 1065–1076. https://doi.org/10.1214/aoms/1177704472
  • Pervaiz, N., Fraz, M., & Shahzad, M. (2022). Per-former: Rethinking person re-identification using transformer augmented with self-attention and contextual mapping. The Visual Computer, 1–16. https://doi.org/10.1007/s00371-022-02577-0
  • Qian, X., Fu, Y., Xiang, T., Wang, W., Qiu, J., Wu, Y., Jiang, Y. G., & Xue, X. (2018). Pose-normalized image generation for person re-identification. Proceedings of the European conference on computer vision (ECCV), Munich, Germany (pp. 650–667).
  • Qian, X., Wang, W., Zhang, L., Zhu, F., Fu, Y., Xiang, T., Jiang, Y. G., & Xue, X. (2020). Long-term cloth-changing person re-identification. Proceedings of the Asian Conference on Computer Vision, Kyoto.
  • Si, C., Chen, W., Wang, W., Wang, L., & Tan, T. (2019). An attention enhanced graph convolutional lstm network for skeleton-based action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, California (pp. 1227–1236).
  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition.
  • Suh, Y., Wang, J., Tang, S., Mei, T., & Lee, K. M. (2018). Part-aligned bilinear representations for person re-identification. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, (pp. 402–419).
  • Sun, W., Xie, J., Qiu, J., & Ma, Z. (2021). Part uncertainty estimation convolutional neural network for person re-identification. Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage-Alaska, Denaʼina Civic and Convention Center, (pp. 2304–2308).
  • Sun, Y., Zheng, L., Yang, Y., Tian, Q., & Wang, S. (2018). Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). Proceedings of the European conference on computer vision (ECCV), Tel Aviv, Israel (pp. 480–496).
  • Wang, X., Huang, Y., Wang, Q., Chen, Y., & Shen, Y. (2020). Multi-stream refining network for person re-identification. IEEE Access, 9, 6596–6607. https://doi.org/10.1109/ACCESS.2020.3048119
  • Wang, G., Lai, J., Huang, P., & Xie, X. (2019). Spatial-temporal person re-identification. Proceedings of the AAAI conference on artificial intelligence, Honolulu, Hawaii, 33, 8933–8940.
  • Wang, G., Lai, J., & Xie, X. (2017). P2snet: Can an image match a video for person re-identification in an end-to-end way? IEEE Transactions on Circuits and Systems for Video Technology, 28(10), 2777–2787. https://doi.org/10.1109/TCSVT.2017.2748698
  • Wang, G., Lin, L., Ding, S., Li, Y., & Wang, Q. (2016). Dari: Distance metric and representation integration for person verification. Proceedings of the AAAI Conference on Artificial Intelligence, The Phoenix Convention Center, Phoenix, Arizona, USA, 30.
  • Wang, Y., Wang, Z., Jiang, M., Zhang, H., & Tang, E. (2021). Graph convolutional network for person re-identification based on part representation. Wireless Personal Communications, 127(S1), 1–16. https://doi.org/10.1007/s11277-021-08781-w
  • Wan, F., Wu, Y., Qian, X., & Fu, Y. (2020). When person re-identification meets changing clothes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, Washington, USA (pp. 830–831).
  • Wei, L., Zhang, S., Gao, W., & Tian, Q. (2018). Person transfer gan to bridge domain gap for person re-identification. Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, Utah (pp. 79–88).
  • Xu, Y., Zhao, L., & Qin, F. (2021). Dual attention-based method for occluded person re-identification. Knowledge-Based Systems, 212, 106554. https://doi.org/10.1016/j.knosys.2020.106554
  • Yang, Q., Wu, A., & Zheng, W. S. (2019). Person re-identification by contour sketch under moderate clothing change. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6), 2029–2046. https://doi.org/10.1109/TPAMI.2019.2960509
  • Yu, S., Li, S., Chen, D., Zhao, R., Yan, J., & Qiao, Y. (2020). Cocas: A large-scale clothes changing person dataset for re-identification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA (pp. 3400–3409).
  • Yu, Z., Zhao, Y., Hong, B., Jin, Z., Huang, J., Cai, D., He, X., & Hua, X. S. (2021). Apparel-invariant feature learning for person re-identification. IEEE Transactions on Multimedia.
  • Zhang, Z., Zhang, H., Liu, S., Xie, Y., & Durrani, T. S. (2021). Part-guided graph convolution networks for person re-identification. Pattern Recognition, 120, 108155. https://doi.org/10.1016/j.patcog.2021.108155
  • Zheng, M., Karanam, S., Wu, Z., & Radke, R. J. (2019). Re-identification with consistent attentive siamese networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, California (pp. 5735–5744).
  • Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q. (2015). Scalable person re-identification: A benchmark. Proceedings of the IEEE International Conference on Computer Vision (ICCV), NW Washington, DC. United States.
  • Zheng, Z., Yang, X., Yu, Z., Zheng, L., Yang, Y., & Kautz, J. (2019). Joint discriminative and generative learning for person re-identification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2138–2147).
  • Zheng, Z., Zheng, L., & Yang, Y. (2017). Unlabeled samples generated by gan improve the person re-identification baseline in vitro. Proceedings of the IEEE international conference on computer vision, Venice, Italy (pp. 3754–3762).
  • Zhong, Z., Zheng, L., Cao, D., & Li, S. (2017). Re-ranking person re-identification with k-reciprocal encoding. Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, Hawaii (pp. 1318–1327).
  • Zhong, Z., Zheng, L., Kang, G., Li, S., & Yi, Y. (2017). Random erasing data augmentation. Proceedings of the AAAI conference on artificial intelligence, Hilton New York Midtown, New York, New York, USA, 34(7), 13001–13008.
  • Zhuo, J., Chen, Z., Lai, J., & Wang, G. (2018). Occluded person re-identification. Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), IEEE, Miami, Florida, USA (pp. 1–6).