Research Article

Human Activity Detection Events Through Human Eye Reflection using Bystander Analyzer

Article: 2321551 | Received 20 Feb 2023, Accepted 11 Feb 2024, Published online: 20 Mar 2024

ABSTRACT

Automatic video-based human activity recognition has shown promise and is widely used in video surveillance for diverse purposes. However, substantial performance issues still make real-world implementation challenging. The main barrier to accurate detection of human movement remains the perspective problem, which arises because video sequences are frequently shot from arbitrary camera angles. Therefore, this study focuses on how the reflection in the human eye, seen from different camera views, can be used to detect human activity or identify intruders. First, a Gaussian Mixture Model is used for pre-processing, followed by Minimum Spanning Tree segmentation: using a pixel-based Kruskal approach, the minimum weight of the input data set is precisely determined and the image is segmented. Uncorrelated discriminant features are then extracted through a transformation based on the Karhunen-Loeve expansion, which isolates human behavior using the Pearson Correlation Coefficient. Finally, a Deep Lens Classifier is employed to detect any suspicious human behavior. With 78.8774% accuracy, 28.6961% sensitivity, 98.50% specificity, 75.6734% precision, 48.781% recall, and 65.10% F-measure, this approach is distinctive in the field of human activity detection. The proposed system is implemented in MATLAB and achieves accurate detection.

Introduction

In the modern world, crime and violence have become increasingly prevalent, and surveillance cameras continuously capture everyday events from a variety of angles. These devices have been in use for some time to collect data and keep track of people, events, and activities (Moriya Citation2019). Promising public safety and security applications include access control, crowd flux statistics, congestion analysis, and human behavior monitoring (Adil, Simon, and Kumar Khatri Citation2019). A variety of tools are used to reduce or manage the problem, and video surveillance is the best option for both public and private settings. Video surveillance is effective when it succeeds in identifying any peculiar or suspicious behavior. The majority of surveillance equipment in use today, however, is operated by individuals.

Detecting any odd behavior therefore requires ongoing human observation (Koyama and Arikuma Citation2019), and the effectiveness of the technology decreases as the human operator fatigues. Video surveillance in dynamic circumstances, especially of people, is one of the computer vision and artificial intelligence fields attracting the most research interest (Dhulekar et al. Citation2018). Along with comprehending and describing object behaviors, video surveillance aims to recognize, track, and detect multiple objects from image sequences in a dynamic scene with numerous cameras (Tripathi, Singh Jalal, and Chandh Agrawal Citation2018). In addition to replacing human eyes with cameras, video surveillance seeks to automate the surveillance procedure, and it is essential to recognize human movement and activity from image sequences (Nikouei et al. Citation2019). Furthermore, camera footage is frequently taken as evidence during criminal investigations (Sheu et al. Citation2019). The images such cameras capture of individuals are used to assemble networks of connections or to link individuals to certain locations (Yakkali, Nayar, and Indu Citation2018). It is especially important to identify the photographer and other individuals who were present but are not immediately visible in the image. When photos depict illegal activity, such as when hostage-takers or child sex abusers capture pictures of their victims, bystander identification is essential (Yakkali RT, et al. Citation2018, Gomez et al. Citation2018, Shiraishi and Uda Citation2019). The automation of video surveillance is used to address this identification problem.

Automatic human activity identification has recently drawn a great deal of interest in video analysis due to rising demands from many applications, including surveillance, entertainment, and healthcare systems (Kamthe and Patil Citation2018). Automatic reporting of a person loitering with luggage at an airport or train station is one example of how detected aberrant activity can be used to warn the appropriate authority of potential criminal or harmful actions. Similarly, activity detection can enhance human-computer interaction (HCI) in a gaming environment, for example by automatically identifying a player's actions during a tennis match so that a computer avatar plays the game for the player (Rautaray and Agrawal Citation2015). Activity recognition in a healthcare system also helps with patient rehabilitation, for example by automatically recognizing a patient's actions to speed up the rehabilitation process (Zhou et al. Citation2020). Human activity recognition is represented at three levels: low-level core technology, mid-level human activity recognition systems, and high-level applications. The three primary processing phases at the core-technology level are object segmentation, feature extraction and representation, and activity detection and classification algorithms (Chen and Nugent Citation2019). The video sequence is first segmented to isolate the human object. A set of features is then derived from the human object to depict its shape, silhouette, colors, positions, and body motions. An activity detection or classification method then uses the collected features to identify distinct human activities (Nadeem, Jalal, and Kim Citation2021). The second level, human activity recognition systems, covers three crucial recognition systems: aberrant activity recognition, multiple-person interaction and crowd behavior, and single-person activity recognition. At the third level, the recognized results are applied to surveillance, entertainment, or healthcare systems (Denis and Madhubala Citation2021). Even so, automatically detecting suspicious activities in video surveillance remains challenging in the modern era, especially for long-term video feeds. Visual surveillance systems in the public and private sectors face significant challenges in detecting suspicious conduct (Amrutha, Jyotsna, and Amudha Citation2020).

Long-term video surveillance systems must deal with complex circumstances and the misleading nature of unexpected conduct, and the challenging job of identifying such activity from video has sparked great interest in computer vision. Understanding human activity requires more than pattern recognition; it also requires understanding how different body parts move (Elharrouss, Almaadeed, and Al-Maadeed Citation2021). Although automatic video-based human activity recognition has shown promise in video surveillance for a variety of purposes, serious performance difficulties still make real-world implementation challenging. Existing research on camera surveillance primarily identifies human activity from walking surface conditions, clothing and footgear, carried objects, and a person's behavior; as a result, accurate human activity recognition is severely hampered by inter- and intra-camera issues. Therefore, this work proposes to detect suspicious and violent human activity automatically using security camera data.

  • This study applies a Minimum Spanning Tree (MST)-based segmentation method, which computes the minimum weight of the input data set and applies a pixel-based Kruskal approach for precise segmentation.

  • In addition, the Pearson Correlation Coefficient is used, through an uncorrelated discriminant transformation with Karhunen-Loeve expansion, to correctly extract the tracking behavior of human traits. The proposed approach thus addresses the problem of arbitrary camera viewpoints.

  • The Deep Lens Classifier (DLC), which identifies human activities using the Kullback-Leibler divergence, is designed to properly categorize the findings.

Therefore, a human activity detection scheme based on human eye reflection, employing the Bystander Analyzer, is presented to address all of the concerns mentioned.

The remainder of this paper is organized as follows: Section 2 surveys related work on the classification problem in human activity detection. Section 3 details the proposed method. Section 4 presents the outcomes of applying the suggested strategy. Section 5 concludes the paper, followed by the list of references.

Literature Survey

While there are numerous methods for analyzing video camera images to identify activities, Roy et al. (Roy and HariOm Citation2018) described a technique for automatically identifying a person's suspicious or aggressive behavior from surveillance footage. HOG features extracted from the video frames were used to train an SVM classifier. Test frames from the surveillance video are read and processed to determine whether they are violent or normal. An alarm is set off to notify the controller if any frames designated as violent are found. The system also records the time spent loitering in a monitored area and flags possible suspicious activity for prompt checking if that time exceeds a predetermined threshold. However, the system has yet to be updated to use information from multiple cameras positioned at various angles on the same region of interest to improve prediction accuracy. To decrease operational lag, the generated feature vectors are adjusted to lower their dimension; the classification accuracy, however, still needs to be raised.

Following that, Olmos et al. (Olmos, Tabik, and Herrera Citation2018) presented a cutting-edge automatic firearm detection system for videos, suitable for monitoring and control. The detection problem is reformulated as one of minimizing false positives and is solved by first building the essential training data set under the guidance of a VGG-16-based classifier, and then evaluating the best classification model using both the sliding-window and region-proposal approaches. The Faster R-CNN-based model showed the most encouraging results. A remaining limitation is that training time grows when CNN-based classifiers must account for a larger number of classes.

Sultani et al. (Sultani, Chen, and Shah Citation2018) explained that, due to the intricacy of realistic anomalies, a deep learning strategy is best for anomaly detection, as it can exploit both typical and unusual videos. In that paper, a general model of anomaly detection is developed using a deep MIL framework with weakly labeled data, minimizing the labor-intensive temporal annotation of anomalous segments in training videos. A fresh large-scale anomaly dataset with a range of genuine anomalies is introduced to validate the methodology. The experimental outcomes demonstrate that the anomaly detection methodology outperforms baseline methods by a large margin. The paper also shows how the dataset functions for identifying unusual activity, and the approach is appropriate for handling big databases.

Ghasemi et al. (Ghasemi and And Ravi Kumar Citation2017) adopted a Gaussian Mixture Model (GMM) to generate candidate regions with motion features indicative of suspicious activity, retrieved from the magnitude data of optical flow; they refer to this approach as the Suspicious Activity Region Detector (SARD). Even in cluttered settings, experimental results on several benchmark datasets show that the framework outperforms the state of the art in detection accuracy and processing performance, thereby strengthening the state of the work.

Singh et al. (Singh, Patil, and Omkar Citation2018) proposed a real-time Drone Surveillance System (DSS) framework that first employs an FPN network to detect humans and then uses the suggested SHDL network to estimate their poses. An SVM uses the estimated poses to identify violent individuals. The SHDL network accelerates training on comparatively few labeled samples by combining ScatterNet features with structural priors. This framework helps identify people involved in violent behavior in public places or at large gatherings, although the classification process takes a long time.

Poulose et al. (Citation2022) suggested a human image threshing (HIT) machine-based HAR system that recognizes activity from an image dataset captured by a smartphone camera. The HIT machine effectively employs a mask region-based convolutional neural network (R-CNN) for human body recognition, a facial image threshing machine (FIT) for picture cropping and resizing, and a deep learning model for activity classification. A number of tests proved the effectiveness of the HIT machine-based HAR system; its deep learning model, the ResNet architecture, produced an accuracy of 98.53%. The authors compared the HIT machine's impact on activity recognition (with and without the HIT machine) using the HAR picture dataset, which is also used by a CNN model, and included additional explanations of FIT machine operation, mask R-CNN, and the deep learning models.

Mutegeki Ronald et al. (Ronald, Poulose, and Han Citation2021) contrasted the performance of their suggested model with other recently proposed DL designs addressing the HAR problem. Across all four datasets, the suggested model beats these methods in accuracy, cross-entropy loss, and F1 score. The UCI HAR smartphone dataset, Opportunity activity recognition dataset, Daphnet freezing-of-gait dataset, and PAMAP2 physical activity monitoring dataset are used to validate the proposed iSPLInception model. The trials and analysis show that iSPLInception performs very well for HAR applications. The HAR field shares a problem with image recognition: a model will always perform badly on data it has never seen. Transfer learning overcomes this difficulty, and the iSPLInception model is flexible: it is easily scalable and does not significantly degrade in performance when expanded with additional inception modules.

According to the studies evaluated, accuracy in (Roy and HariOm Citation2018) needs to be increased, and training time in (Olmos, Tabik, and Herrera Citation2018) needs to be reduced. Additionally, (Sultani, Chen, and Shah Citation2018) fails to manage a large collection of data, (Ghasemi and And Ravi Kumar Citation2017) identifies robustness as the root of the problem, and (Singh, Patil, and Omkar Citation2018) suffers from an extremely high classification time. In (Zemni et al. Citation2019), a step forward based on more general wavelet tools is proposed, along with a new approach for the reconstruction of signals and images. Zhang et al. (Citation2020) develop a spectral feature extraction and perform seizure prediction based on uncorrelated multilinear discriminant analysis (UMLDA) and a Support Vector Machine (SVM). Taking all of the aforementioned problems into account, the proposed system resolves them by implementing efficient segmentation and classification, using corneal pictures to detect suspicious human activity in public settings with an effective deep learning method.

Corneal Image Suspicious Event Detection Using Bystander Analyzer

The approach begins by assembling a dataset of corneal images that may include pictures taken under different circumstances. Both typical and suspicious events, such as anomalies, illnesses, or injuries, are labeled, and the dataset is annotated accordingly. The corneal images are preprocessed to improve quality, eliminate noise, and standardize the format, and the pertinent characteristics, incorporating color information, shape aspects, and texture analysis, are extracted from them. Here, a "bystander analyzer" means a system or algorithm that evaluates corneal pictures in relation to events or surrounding factors; the analysis algorithm therefore considers contextual information and other aspects that affect the evaluation. A mechanism is then established to identify questionable occurrences from the machine learning model's output, which can entail a more complex strategy such as object detection or a threshold for anomaly detection. The context analysis of corneal pictures performed by the bystander analyzer is integrated into the system as a whole, taking into account how bystander information can affect the way corneal images are evaluated. To make sure the model generalizes, it is validated on a separate dataset, and the system's functionality is evaluated under varied circumstances and with possible fluctuations in the input data. A feedback loop supports ongoing development: more information is gathered and the model is adjusted in response to user comments and changing needs.

Technology for recognizing human behavior plays a large role in many applications, yet certain concerns remain because of its impact on real-time applications, such as the inability to accurately detect human activity due to inter- and intra-camera issues and viewpoint problems. This study presents a Bystander Analyzer that focuses on recognizing eye reflection images from different camera viewpoints, which are used to identify human activities. Since current methods only recognize walking surface conditions or human-made objects, they cannot recognize human activity in an eye reflection image; the suggested system therefore detects human activity in eye reflection images by using corneal images. To extract the target activity of the eye reflection image, obtained with the Gaussian Mixture Model, the images are segmented using the Minimum Spanning Tree, weighting the input data by the Kruskal approach; this segmentation reduces the complexity of each image. Because prior feature extraction techniques profoundly neglected image features, the centroid, movement, speed, direction, and dimensions are determined in minute detail from the segmented image using the uncorrelated discriminant KL expansion, which also reduces the dimensionality of the image.

The suggested work comprises three primary mechanisms of the Bystander Analyzer framework, as shown in Figure 1: the preprocessing phase, the feature extraction phase, and the recognition phase. The first stage, preprocessing, detects moving objects through background modeling while also removing noise. The target activity of the eye reflection picture is then extracted by segmenting the image, which has already been preprocessed through background modeling with the Gaussian Mixture Model: the preprocessed images are segmented using the Minimum Spanning Tree, with the Kruskal technique weighting the input data. The uncorrelated discriminant KL expansion then calculates very fine information from the segmented image, including the centroid, movement, speed, direction, and dimensions. In the final classification phase, the actions in the input video are classified using the eye reflection image, and any suspicious behavior is picked up.

Figure 1. Proposed framework for Bystander Analyzer.

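To make the three-phase flow concrete, the following minimal Python sketch outlines how the stages could be chained. It is illustrative only: the paper's implementation is in MATLAB, the stage bodies here are placeholders (real versions are sketched in the sections below), and only the OpenCV background subtractor is an actual library call.

```python
import cv2
import numpy as np

def mst_segment(frame, fg_mask):
    # Placeholder for the MST/Kruskal segmentation of the eye-reflection region.
    return cv2.bitwise_and(frame, frame, mask=fg_mask)

def udt_kl_features(region):
    # Placeholder for the uncorrelated discriminant KL-expansion features.
    gray = cv2.cvtColor(region, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (32, 32)).ravel().astype(np.float32)

def analyze_video(path, classify):
    """Run the three Bystander Analyzer phases over a video (hypothetical API)."""
    cap = cv2.VideoCapture(path)
    subtractor = cv2.createBackgroundSubtractorMOG2()  # phase 1: GMM background model
    labels = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        fg = subtractor.apply(frame)             # moving-object mask
        region = mst_segment(frame, fg)          # phase 2a: segmentation
        feats = udt_kl_features(region)          # phase 2b: feature extraction
        labels.append(classify(feats))           # phase 3: Deep Lens Classifier
    cap.release()
    return labels
```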

Minimum Spanning Tree Segmentation

A tree is a subgraph of a directed or undirected connected graph in which the vertices are connected to form a path. By joining the graph's edges and using edge weights to determine the path, a weighted graph can generate multiple spanning trees; the minimum spanning tree is the one with the shortest total path. The Kruskal approach, which assigns each edge a weight and admits the lower-weight edges into the Minimum Spanning Tree, involves four processes: preprocessing, graph conversion, determining the shortest path, and segmentation (Bianchi et al. Citation2019). Video surveillance images are gathered, and the Bystander Analyzer takes on the duty of collecting video frames V1, V2, V3, and so on, converting the gathered input into single video frames. The Bystander Analyzer's preprocessing and modeling aid the segmentation process, which is used to analyze the hidden information in the corneal images; that is, the reflection images are deeply explored.

Gaussian Mixture Model

The Kruskal approach is used for this processing along with the Gaussian Mixture Model (GMM) (Qiao et al. Citation2019). A GMM is a probability density function used to describe images as pixels, with each pixel treated as a stochastic variable and $D_t$ designating the pixel observation at frame $t$. The likelihood of the image is given as follows:

(1)   $X(D_t) = \sum_{a=1}^{k} M_{a,t}\, N\!\left(D_t;\, \eta_{a,t}, \sigma_{a,t}^{2}\right)$

where $M_{a,t}$ defines the weight of component $a$ among the total number of regions, satisfying the weight condition $\sum_{a=1}^{k} M_a = 1$, and $N(\eta_a, \sigma_a^2)$ represents the $a$-th Gaussian component with mean $\eta_a$ and standard deviation $\sigma_a$. The Gaussian probability density function is as follows:

(2)   $N\!\left(D_t;\, \eta_a, \sigma_a^{2}\right) = \frac{1}{\sigma_a \sqrt{2\pi}} \exp\!\left(-\frac{(D_t - \eta_a)^2}{2\sigma_a^2}\right)$

Assuming the RGB channels are uncorrelated, with intensity differences between each other, and share the same standard deviation, the covariance of the changing random variable is computed for each component as:

(3)   $\Sigma_{a,t} = \sigma_{a,t}^{2}\, I$

Thus, the preprocessed image $I_p$ is segmented using the minimum spanning tree to obtain the cornea for image analysis, with the Gaussian mixture model supplying the background and foreground evidence.

Following that, thresholding determines whether the GMM classifies each pixel as background or foreground: the first components whose cumulative weight exceeds the threshold are regarded as modeling the background, and the remaining components model the foreground.

(4)   $Y = \arg\min_{y} \left( \sum_{a=1}^{y} M_{a,t} > T \right)$

Where T is the threshold

(5)   $M_{a,t} = (1-\eta)\, M_{a,t-1} + \eta$

η denotes the learning rate
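Equations (1)-(5) follow the standard mixture-of-Gaussians background model, for which OpenCV provides a ready implementation. Below is a minimal sketch, assuming OpenCV is installed; the file name and parameter values are illustrative, with setBackgroundRatio playing the role of the threshold T in Eq. (4) and learningRate the role of η in Eq. (5).

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")        # illustrative file name
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
subtractor.setBackgroundRatio(0.9)                # cumulative-weight threshold T of Eq. (4)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # learningRate corresponds to eta in Eq. (5); -1 derives it from `history`
    fg_mask = subtractor.apply(frame, learningRate=-1)
    foreground = cv2.bitwise_and(frame, frame, mask=fg_mask)  # moving pixels only
cap.release()
```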

The eye reflection is then segmented through the minimum spanning tree, which operates on the minimum weight of the input data sets for accurate segmentation with the aid of the Kruskal approach. N continuous segments of the preprocessed input image are then provided, so different behavior patterns are segregated separately from each video segment. Initially, the input video data is converted into frames; the minimum spanning tree then forms the graph and decides the minimum path through which the segmentation occurs.

The minimal-range technique generates more than one minimal-range tree by taking into account both the similarity criterion within the selected region and the smallest weighted tree; the suggested segmentation algorithm accordingly proceeds through the following steps. The gray level of each pixel in a surveillance camera video image is represented as a matrix, with the matrix entries expressing the gray level of the camera image. Each pixel position in the camera image is identified from its nearest neighboring pixels, connecting along the axes right, left, down, up, and the lower right, upper right, upper left, and lower left corners. The weight of each corresponding direction is obtained using the formula:

(6)   $W(v_i, v_j) = \left| I_p(v_i) - I_p(v_j) \right| + 1$

where $I_p(v)$ denotes the gray-level pixel of image position $v$. The weighted graph is thus represented as an adjacency matrix. The Kruskal technique is extended with a similarity criterion, so selection is based not only on the least weight but also on similarity: an edge joins the minimal-range tree only if its weight is less than the similarity criterion's value, since it then satisfies the relevant condition.

Every pixel in a minimal tree receives the same color. Based on the average gray level of the pixels on a minimum-range tree, the color of each pixel is determined by the condition:

(7)   $\text{color} = \begin{cases} \text{black}, & \text{if average} > T \\ \text{white}, & \text{if average} \leq T \end{cases}$

Where T is the threshold for black and white colors,

(8)   $T = \frac{\max_v I_p(v) + \min_v I_p(v)}{2}$

The image is initially segmented based on the similarity density of color-matched pixels, and the segmented regions are then returned to represent the specific images. The segmented image $I_{seg}$ is thus provided, in which the relevant part of the face, particularly the eye, is segmented in detail. This enables acquisition of the reflected eye image, followed by analysis of the subject present in the reflection. Following segmentation, the features of successive frames are extracted to recognize abnormal and normal activity, as discussed in the next section.
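The graph construction, Kruskal merging, and black/white thresholding of Eqs. (6)-(8) can be sketched directly in Python. The sketch below is illustrative rather than the paper's MATLAB code: it uses 4-connectivity for brevity (the text describes eight directions), a plain union-find, and a user-supplied similarity criterion `sim_threshold`.

```python
import numpy as np

def kruskal_segment(gray, sim_threshold):
    """Union-find over pixel edges sorted by the Eq. (6) weight |Ip(vi)-Ip(vj)|+1."""
    h, w = gray.shape
    idx = lambda r, c: r * w + c
    edges = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:   # horizontal neighbor
                edges.append((abs(int(gray[r, c]) - int(gray[r, c + 1])) + 1,
                              idx(r, c), idx(r, c + 1)))
            if r + 1 < h:   # vertical neighbor
                edges.append((abs(int(gray[r, c]) - int(gray[r + 1, c])) + 1,
                              idx(r, c), idx(r + 1, c)))
    edges.sort()            # Kruskal: consider lightest edges first

    parent = list(range(h * w))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for wgt, a, b in edges:
        if wgt >= sim_threshold:    # only edges below the similarity criterion merge
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb         # the edge joins two minimal trees

    labels = np.array([find(i) for i in range(h * w)]).reshape(h, w)
    # Eqs. (7)-(8): color each region black/white against the mid-range threshold
    T = (int(gray.max()) + int(gray.min())) / 2
    out = np.zeros_like(gray)
    for region in np.unique(labels):
        mask = labels == region
        out[mask] = 0 if gray[mask].mean() > T else 255
    return out
```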

Feature Extraction by Uncorrelated Discriminant Transformation with KL Expansion

Feature extraction is essential for accurately recognizing the reflection in ocular pictures, capturing and representing human action through elements including shape, shadow motion, and color. Therefore, the Karhunen-Loeve expansion with an uncorrelated discriminant transformation is introduced to extract the behavior of human traits based on the Pearson Correlation Coefficient (Zhang et al. Citation2020, July).

In existing systems, uncorrelated discriminant analysis has only been used to forecast and extract spectral features from images of epileptic seizures; in the proposed uncorrelated discriminant transformation with KL expansion, the analysis is instead applied to reflection images to extract features for recognition. This is a useful technique for uncorrelated data. Because no conventional data set contains multiple abnormal behaviors in a single video, a dataset of 45 compressed videos with various activities was built. Numerous human movements are involved, such as crawling, running, and walking, and the videos present both typical and atypical behavioral activity patterns.

Feature extraction is a useful strategy for choosing features and improving class separation. The original dimension of the space is $N = m \times n$, and the segmented picture $I_{seg}$ has a resolution of $m \times n$. Let $K$ be the number of training samples. Because the low resolution satisfies $K > N + L$, the uncorrelated discriminant transformation is utilized to extract the features of the images. The steps for extracting the characteristics are as follows:

Step A. Calculate the $N \times N$ between-class scatter matrix $S_b$, the $N \times N$ within-class scatter matrix $S_w$, and the $N \times N$ total population scatter matrix $S_t$ in the original X-space with $K$ training samples.

Step B. According to the eigen-equation $S_b u = \lambda S_t u$, calculate all the $N$-dimensional uncorrelated discriminant vectors $u_1, u_2, u_3, \ldots, u_K$, whose number is the minimum of $k$ and the optimal dimensionality of the feature space, $(L-1)$ for an $L$-class problem, i.e.

(9)   $K = \min(k, L-1)$

Step C. According to $Y = u_i^{T} X$, perform the linear transformation from the original X-space. The following steps extract the uncorrelated discriminant features of face images based on the KL expansion:

Step 1. Calculate the $L \times L$ between-class scatter matrix $S_b$ and total population scatter matrix $S_t$ in the KL compression feature W-space $R^L$ with $K$ training samples.

Step 2. According to the eigen-equation, calculate all the uncorrelated discriminant vectors $u_1, u_2, u_3, \ldots, u_{L-1}$, whose number is $(L-1)$.

Step 3. Execute the linear transformation from the KL compression feature (W-space $R^L$) to the uncorrelated discriminant feature (Y-space $R^{L-1}$):

(10)   $Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{L-1} \end{bmatrix} = \begin{bmatrix} u_1 & u_2 & \cdots & u_{L-1} \end{bmatrix}^{T} W$

Note that when $K > N + L$ and $K > N$, the uncorrelated discriminant features of the images are based on the KL expansion. The extracted features are then classified deeply to identify the person in the reflected image and to recognize the activity performed.
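Steps 1-3 can be sketched with standard linear algebra: a KL (principal component) compression followed by generalized eigenvectors of the scatter matrices. The sketch below assumes NumPy and SciPy; the small ridge term and the function name are illustrative, not from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def udt_kl_features(X, y, kl_dim):
    """Sketch of Steps 1-3: KL (PCA) compression to W-space, then uncorrelated
    discriminant vectors from the generalized eigen-problem Sb u = lambda St u."""
    # KL expansion: project onto the top-kl_dim principal directions (W-space)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Xc @ Vt[:kl_dim].T

    # Between-class (Sb) and total-population (St) scatter matrices in W-space
    mean_all = W.mean(axis=0)
    classes = np.unique(y)
    Sb = sum(np.sum(y == c) * np.outer(W[y == c].mean(0) - mean_all,
                                       W[y == c].mean(0) - mean_all)
             for c in classes)
    St = (W - mean_all).T @ (W - mean_all)

    # Generalized eigenvectors; keep the top L-1 as discriminant directions (Eq. 9)
    vals, vecs = eigh(Sb, St + 1e-6 * np.eye(kl_dim))   # ridge added for stability
    order = np.argsort(vals)[::-1][: len(classes) - 1]
    U = vecs[:, order]
    return W @ U                                        # Y-space features (Eq. 10)
```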

Classification by Deep Lens Classifier

Figure 2 describes how the Deep Lens Classifier performs classification based on features collected from the image reflected by the eye, such as the deep features of the lens used for deep analysis. This deep learning classifier uses VGG-Net to perform its classifications. Compared to the architectures of other successful models, VGG-Net's architecture is straightforward, easy to extend, and has a high recognition rate. In this study, a VGG-net model with five nets was used for picture retrieval; the five nets are labeled with five different letters, A to E. The first convolution layer's width is 64, and after each max-pooling layer it doubles until it reaches 512, where it stays. There are five max-pooling layers in addition to the convolution layers, and the convolution and pooling layers of the five nets that make up the VGG-net share the same parametric values.

Figure 2. Classification process by Deep Lens Classifier.


Regardless of the number of convolution layers in each convolutional group, this scheme keeps the layer configuration consistent across all groups of convolution layers. Deeper network training, however, not only greatly increases computing demands but also imposes strict hardware requirements. In this paper, the VGG-net model is used to extract picture characteristics. Each image is resized to 224 × 224 to keep the network's processing requirements manageable. After the final FC layer is removed, the network's output is a 4096-dimensional vector.
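The 4096-dimensional descriptor described here corresponds to the activations of VGG-16's penultimate fully connected layer. A minimal sketch using torchvision (an assumption on our part; the paper's implementation is in MATLAB, and a recent torchvision is assumed) follows:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Load VGG-16 and drop the final classification layer so each image maps to
# the 4096-dimensional FC activation described above.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),           # every image resized to 224x224, as in the text
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(pil_image):
    with torch.no_grad():
        x = preprocess(pil_image).unsqueeze(0)   # add batch dimension
        return vgg(x).squeeze(0)                 # shape: (4096,)
```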

LPI is suggested to uncover the discriminative internal organization of the document space and to extract the most discriminative features buried inside the data space. Based on the similarity condition, the training condition is formed as follows: the LPI samples Y form the input layer with K inputs:

(11)   $Y = [Y_1, Y_2, \ldots, Y_K]$

The hidden layer is described by:

(12)   $Z_j = \sigma\!\left( \sum_{i=1}^{K} \omega_i Y_i + a_j \right)$

where $\omega_i$ represents the weight of the $i$-th input-layer unit and $a_j$ represents the bias (error) term.

The output of the hidden layer is thus used to find the similarity between the trained data and the sample data set.

Using the similarity idea and the training data from Equations 11 and 12 above, the hidden structure can be detected and compared to the sample dataset. During training, the classifier cannot correlate the degree of similarity directly; instead, it uses the Kullback-Leibler divergence, a measure between probability distributions (also known as relative entropy). In the straightforward case, a Kullback-Leibler criterion is applied to the training data to determine how similar the activity is to the condition:

(13)   $\alpha_K(W, a) = \frac{1}{n} \sum_{i=1}^{n} \ell_i + \chi \sum_{j} \left[\, p \log\frac{p}{p_j} + (1-p) \log\frac{1-p}{1-p_j} \right]$

where the first term averages the per-sample training loss $\ell_i$, $p$ is the expected (target) sparsity, which is close to zero, $p_j$ is the average output value of the $j$-th hidden unit, and $\chi$ denotes the weight controlling sparsity. In the straightforward case, a Kullback-Leibler divergence of 0 denotes that the two distributions behave similarly, if not identically, whereas a divergence of 1 denotes otherwise. This criterion is used to detect the similarity of the activity using the training data.
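The sparsity penalty in Eq. (13) can be evaluated directly. The short sketch below, with illustrative numbers (not the paper's data), computes the Kullback-Leibler term between a target sparsity p and measured average activations p_j:

```python
import numpy as np

def kl_sparsity(p, p_hat):
    """KL divergence between the target sparsity p and the average hidden-unit
    activations p_hat, i.e. the penalty term of Eq. (13)."""
    p_hat = np.clip(p_hat, 1e-8, 1 - 1e-8)   # avoid log(0)
    return np.sum(p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat)))

# Illustrative: target sparsity near zero; the second unit is far too active.
penalty = kl_sparsity(0.05, np.array([0.06, 0.40]))
# A penalty near 0 means the activations match the target distribution;
# larger values flag dissimilarity, as the text describes.
```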

The Bystander Analyzer therefore consists of a Minimum Spanning Tree (MST)-based segmentation, which calculates the minimum weight of the input data set for precise segmentation with the aid of the pixel-based Kruskal approach, yielding the eye-reflected images. Following segmentation, the detailed human activity is represented in the image as features such as shape, silhouette, colors, and motions. The uncorrelated discriminant transformation with KL expansion is used to accurately extract the tracking behavior of human features, solving the problem of arbitrary camera viewpoints between and within frames. Because normal behavior and suspected activity look similar in an eye reflection, the activity must be identified precisely after extracting the features so that the human behavior can be detected. Finally, the suggested framework offers greater retrieval flexibility in recovering the precise identification of humans from eye reflections in video surveillance, and it effectively combines strategies that produce superior performance in terms of a reduced loss function and shorter time delay during activity recognition.

Results and Discussion

This section includes the experiments performed on the proposed approach, the dataset used, the environmental conditions, setup, and constraints imposed. Furthermore, the results of different experiments are calculated and are also compared with previous works in anomalous activity detection.

Dataset

A standard dataset containing multiple anomalous activities in a single video is unavailable, so a unique dataset was produced. The dataset is divided into two classes, representing typical and deviant human behavior: normal classes contain only videos of normal activities, while abnormal classes contain videos of both normal and abnormal actions. The synthesized dataset consists of 45 videos of three activities: walking, running, and crawling, each performed by multiple humans. Direction is also considered as a parameter for finding anomalies, so scenarios of humans performing activities in different directions are taken into account. The proposed work is scenario-dependent, as an activity may be normal in one scenario but anomalous in another.

Experimental Setup

The proposed method is implemented in the MATLAB working platform.

Simulation Outputs

The proposed framework is trained using a predefined set of rules and tested on various videos; some test video snapshots are shown in Figure 3. The figure shows the identification of a person from an individual's eye reflection, which helps find persons present in the activity who are not captured directly by the camera video.

Figure 3. Suspicious event occurrence (a) input image, (b) segmented output, (c) feature extraction output.


Suspicious Event Occurrence

Figure 3(a) shows an input image taken from the video frame. The image is preprocessed using the Gaussian Mixture Model (GMM) and then segmented using the Minimum Spanning Tree (MST); the segmented output is given in Figure 3(b) and the feature extraction output in Figure 3(c).

Eye Reflection Image Analysis

Figure 4 presents the deep analysis of the eye image. From the deep segmentation of the image, the eye reflection from the cornea is obtained and further analyzed by extracting the essential elements through which suspicious activity is detected; the features are extracted using the uncorrelated discriminant transformation with KL expansion. From the features obtained, classification is performed through the Deep Lens Classifier. Finally, the classified output of the Deep Lens Classifier yields the image from the corneal reflection, and through further analysis the activity of the person performing it is also detected.

Figure 4. Eye reflection image Analysis.


Performance Metrics

The following performance evaluation measures have been used to assess the performance of the suggested methodology.

Accuracy

Accuracy is the most crucial performance indicator for a classification system. It statistically assesses a classification test's ability to correctly identify or exclude a condition. Accuracy is defined as the proportion of genuine results, both TP and TN, among all instances studied, i.e., the ratio of correctly classified samples to the total number of samples, as given in (14):

(14)   $\text{Accuracy} = \frac{TP + TN}{P + N}$

Sensitivity

Sensitivity is another statistical metric for evaluating the effectiveness of a classification strategy. It measures a prediction model's capacity to select examples of a certain class from a dataset. Sensitivity is the probability of delivering a positive outcome when an attack actually occurs. It is also described as the percentage of correctly classified occurrences out of all those found, as given in (15):

(15)   $\text{Sensitivity} = \frac{TP}{TP + FN}$

Specificity

Specificity is a statistical measure used to evaluate how well the classification test performs for a specific class. It is the likelihood of receiving a negative outcome when the instance is indeed negative, as described in (16):

(16)   $\text{Specificity} = \frac{TN}{TN + FP}$

Precision

Precision is estimated as the ratio of true positives to all predicted positives, as given in (17):

(17)   $\text{Precision} = \frac{TP}{TP + FP}$

Recall

The recall is the ratio of correctly predicted positive observations to all observations in the actual class

(18)   $\text{Recall} = \frac{TP}{TP + FN}$

F-Measure

F-measure is the better measure to use when a balance between precision and recall is needed, especially with an uneven class distribution (a large number of actual negatives):

(19)   $F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
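For reference, Equations (14)-(19) can be computed from raw confusion-matrix counts as in the short sketch below; the counts shown are illustrative, not the paper's results.

```python
def metrics(tp, tn, fp, fn):
    """Equations (14)-(19) computed from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)   # Eq. (14); P + N = all samples
    sensitivity = tp / (tp + fn)                    # Eq. (15); identical to recall, Eq. (18)
    specificity = tn / (tn + fp)                    # Eq. (16)
    precision   = tp / (tp + fp)                    # Eq. (17)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (19)
    return accuracy, sensitivity, specificity, precision, f1

print(metrics(tp=40, tn=50, fp=5, fn=5))  # illustrative counts
```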

Figure 5 reports the performance metrics: accuracy, sensitivity, specificity, recall, F-measure, and precision. The overall accuracy of the proposed system is 78.624%, its sensitivity is 28.6961%, its specificity is 98.50%, its precision is 75.6734%, its recall is about 48.781%, and its F-measure is 65.1%.

Figure 5. Comparison in term of accuracy, sensitivity, specificity, f-measurement, recall, precision.


Performance Metrics Based on Activity Recognition

Thus, from the classified images, the suspicious activity is detected; the parameters indicating the recognition quality are tabulated in Table 1.

Table 1. Activity recognition parameters.

Table 1 describes actions such as slapping, punching, shooting, choking, and chain snatching. Comparing these actions, slapping has the highest parameters in both precision and recall, while chain snatching is highest in accuracy. The analyzed eye reflection thus captures suspicious activities such as slapping, punching, shooting, choking, chain snatching, and so on. The table lists the recognition rate parameters, and the graphical representation in Figure 6 provides more detail.

Figure 6. (a) Recall rate for activity recognition, (b) precision rate for activity recognition, (c) accuracy rate for activity recognition.


Overall, the eye-reflected image is obtained from the deeply analyzed video image; those images are deeply analyzed by compiling their features before being submitted to the Deep Lens Classifier for classification, so that the person in the reflected image is identified and, through the analysis, the anomalous activity is recognized.

Comparison Analysis

This section illustrates how the unique dataset performs differently from other standard datasets and shows the findings using several metrics. The following metrics are considered to assess the performance of the novel dataset: accuracy, specificity, and sensitivity.

Figure 7 presents the comparison analysis for accuracy, sensitivity, and specificity. It demonstrates that the novel dataset's accuracy is 77.874%, its sensitivity is 27.7%, and its output specificity is 99.9%, which is higher than the standard datasets Columbia Gaze, OpenEDS Facebook, MPII Gaze, Kaggle, and Cairo.

Figure 7. Comparison in term of accuracy, specificity and sensitivity.


Table 2. Comparison of suggested techniques.

Table 2 lists the compared datasets: Columbia Gaze, OpenEDS Facebook, MPII Gaze, Kaggle, and Cairo. In contrast to the others, the proposed method achieves a high specificity of 99.9%, a sensitivity of 27.7%, and an accuracy of 77.874%.

Figure 8 presents the comparison analysis for accuracy, sensitivity, specificity, F-measure, recall, and precision. From the graph, the accuracy of the proposed system reaches 78.624%, its overall sensitivity 28.6961%, its overall precision 73.2734%, its specificity 98.50%, its recall 43.781%, and its F-measure 61.8493%, which is higher than baseline approaches such as the Support Vector Machine (SVM), K-means, Distance-Based Network Topology (DBNT), Gait-Based Person Identification (GBI), and the Deep Convolutional Neural Network (DCNN).

Figure 8. Comparison in term of accuracy, sensitivity, specificity, f-measure, recall, precision.


Table 3. Comparision in terms of activity recognition parameters.

Table 3 lists the compared techniques: SVM, K-means, DBNT, GBI, and DCNN. In comparison to all other approaches, the suggested method has a higher sensitivity rate (28.6961%), a higher specificity rate (98.50%), and a higher accuracy rate (78.624%). This section thus compares the performance of the proposed method with previous methodologies, considering the Support Vector Machine (SVM), K-means, Distance-Based Network Topology (DBNT), Gait-Based Person Identification (GBI), and Deep Convolutional Neural Network (DCNN) on the parameters of accuracy, specificity, and sensitivity.

Figure 9 illustrates the error (%) of human activity detection against the percentage of training data for the various techniques. From the graph, the error of the proposed output reaches 6%, which is lower than the existing outputs for the Baseline, Overall, Walking, and Climbing-Up cases.

Figure 9. Comparison of percentage of training data.


Figure 10 depicts the prediction (%) of human activity detection for the various techniques. From the graph, the prediction of the proposed output reaches 98%, which is higher than the existing outputs of the baseline, CNN AlexNet, CNN VGG-VD16, deep representation AlexNet, and deep representation VGG-VD16. The proposed framework thus overcomes the given challenge by detecting and locating anomalous events in surveillance videos through the reflection of the eye cornea.

Figure 10. Comparison in terms of prediction.


Figure 11 depicts the true positive rate (%) of human activity detection compared with various methods. From the graph, the true positive rate of the proposed output reaches 99%, which is higher than the existing outputs of the baselines of Waqas Sultani, Roberto Olmos, and Abouzar Ghasemi. The proposed framework thus overcomes the given challenges.

Figure 11. Comparison in term of true positive rate.


Table 4. Confusion matrix analysis of proposed method.

Table 4 compares the true positive rates of several previously presented approaches, including the automatic firearm detector, the suspicious activity region detector, and deep MIL ranking. The true positive rate of the suggested approach (99%) is higher than those of all the other methods, confirming the gain achieved by the proposed methodology.

Overall, the research proposed a technique for using corneal reflection to identify unusual activity in security footage. It employed a unique dataset, as no standard one exists, and concentrated on actions such as walking, running, and crawling. The MATLAB implementation showed an overall accuracy of 78.624%; a significant limitation, however, was the reduced sensitivity of 28.6961%. Activity recognition metrics showed that performance varied across actions; the proposed approach performed best in terms of specificity (99.9%) but could be improved in sensitivity when compared to baseline methods. Comparison assessments of its error rates, prediction percentages, and true positive rates showed the system performing comparably or better. All things considered, the suggested approach shows promise for surveillance applications, while highlighting the need to increase sensitivity.

Conclusion

This work presents a novel framework with an efficient, partially supervised deep learning algorithm for recognizing and localizing anomalous events in surveillance footage via the reflection of the eye cornea. It is more adaptable in obtaining precise person recognition via eye reflection in video surveillance, and it successfully integrates strategies that improve activity recognition performance in terms of a smaller loss function and faster recognition times. The method builds on a human activity detection framework with strong accuracy, specificity, precision, and F-measure, and these thorough analyses made the final detection results possible. On difficult datasets, both the qualitative and quantitative findings demonstrate that the strategy beats cutting-edge approaches. Future studies will concentrate on improving the approach through earlier fusion, shorter training times, and better accuracy, to create a more straightforward network for anomaly detection.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Adil, M. D., R. Simon, and S. Kumar Khatri. 2019. Automated invigilation system for detection of suspicious activities during examination. In 2019 Amity International Conference on Artificial Intelligence (AICAI), Dubai, United Arab Emirates, 361–26. IEEE. February 04-06 2019.
  • Amrutha, C. V., C. Jyotsna, and J. Amudha 2020, March. Deep learning approach for suspicious activity detection from surveillance video. In 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore, India, 335–39. IEEE.
  • Bianchi, V., M. Bassoli, G. Lombardo, P. Fornacciari, M. Mordonini, and I. De Munari. 2019. IoT wearable sensor and deep learning: An integrated approach for personalized human activity recognition in a smart home environment. IEEE Internet of Things Journal 6 (5):8553–8562. doi:10.1109/JIOT.2019.2920283.
  • Chen, L., and C. D. Nugent. 2019. Human activity recognition and behaviour analysis. Berlin/Heidelberg, Germany: Springer International Publishing.
  • Denis, R., and P. Madhubala. 2021. Hybrid data encryption model integrating multi-objective adaptive genetic algorithm for secure medical data communication over cloud-based healthcare systems. Multimedia Tools and Applications 80 (14):21165–21202. doi:10.1007/s11042-021-10723-4.
  • Dhulekar, P. A., S. T. Gandhe, N. Sawale, V. Shinde, and S. Khute 2018. Surveillance system for detection of suspicious human activities at war field. In 2018 International Conference On Advances in Communication and Computing Technology (ICACCT). 357–60. Sangamner, India: IEEE. February08-09 2018.
  • Elharrouss, O., N. Almaadeed, and S. Al-Maadeed. 2021. A review of video surveillance systems. Journal of Visual Communication and Image Representation 77:103116. doi:10.1016/j.jvcir.2021.103116.
  • Ghasemi, A., and C. N. And Ravi Kumar 2017. A novel algorithm to predict and detect suspicious behaviors of people in public areas for surveillance cameras. In 2017 International Conference on Intelligent Sustainable Systems (ICISS), Palladam, India, 168–75. December 07-08 2017.
  • Gomez, A. H. F., T. E. F. Lozada, L. A. Llerena, J. A. B. Hurtado, R. E. R. Ordoñez, F. G. S. Carrillo, J. Naranjo-Santamaria, and T. A. Barros 2018. Identification of human behavior patterns based on the GSP algorithm. In International Conference Europe Middle East & North Africa Information Systems and Technologies to Support Learning. 572–79. Cham, Springer, 25 October 2018.
  • Kamthe, U. M., and C. G. Patil 2018. Suspicious Activity Recognition in Video Surveillance System. In IEEE in 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA). Pune, India: 1–6. August 16-18 2018.
  • Koyama, K., and T. Arikuma. 2019. Target object identifying device, target object identifying method and target object identifying program. U.S. Patent 10,430,666.
  • Moriya, A. 2019. Suspicious person detection device, suspicious person detection method, and program. U.S. Patent Application 16/315,235.
  • Nadeem, A., A. Jalal, and K. Kim. 2021. Automatic human posture estimation for sport activity recognition with robust body parts detection and entropy markov model. Multimedia Tools and Applications 80 (14):21465–21498. doi:10.1007/s11042-021-10687-5.
  • Nikouei, S. Y., Y. Chen, A. Aved, E. Blasch, and R. Timothy Faughnan 2019. I-safe: Instant suspicious activity identification at the edge using fuzzy decision making. In Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, Washington DC, 101–12.
  • Olmos, R., S. Tabik, and F. Herrera. 2018. Automatic handgun detection alarm in videos using deep learning. Neurocomputing 275:66–72. doi:10.1016/j.neucom.2017.05.012.
  • Poulose, A., J. H. Kim, D. S. Han, and J. Medina. 2022. HIT HAR: Human image threshing machine for human activity recognition using deep learning models. Computational Intelligence and Neuroscience 2022:1–21. doi:10.1155/2022/1808990.
  • Qiao, J., X. Cai, Q. Xiao, Z. Chen, P. Kulkarni, C. Ferris, S. Kamarthi, and S. Sridhar. 2019. Data on MRI brain lesion segmentation using K-means and gaussian mixture model-expectation maximization. Data in Brief 27:104628. doi:10.1016/j.dib.2019.104628.
  • Rautaray, S. S., and A. Agrawal. January 01, 2015. Vision based hand gesture recognition for human computer interaction: A survey. Artificial Intelligence Review 43 (1):1–54. doi:10.1007/s10462-012-9356-9.
  • Ronald, M., A. Poulose, and D. S. Han. 2021. iSPLInception: An inception-ResNet deep learning architecture for human activity recognition. IEEE Access 9:68985–69001. doi:10.1109/ACCESS.2021.3078184.
  • Roy, P., and H. HariOm. 2018. Suspicious and violent activity detection of humans using HOG features and SVM Classifier in surveillance videos. Advances in Soft Computing and Machine Learning in Image Processing 277–94.
  • Sheu, R.-K., M. Pardeshi, L.-C. Chen, and S. Ming Yuan. 2019. STAM-CCF: Suspicious tracking across multiple camera based on correlation filters. Sensors 19 (13):3016. doi:10.3390/s19133016.
  • Shiraishi, M., and R. Uda. 2019. Detection of suspicious person with kinect by action coordinate. In International Conference on Ubiquitous Information Management and Communication. Cham, Springer. 456–72. May 23 2019.
  • Singh, A., D. Patil, and S. N. Omkar. 2018. Eye in the sky: Real-time drone surveillance system (DSS) for violent individuals identification using ScatterNet hybrid deep learning network. Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 1629–1637.
  • Sultani, W., C. Chen, and M. Shah. 2018. Real-world anomaly detection in surveillance videos. Center for Research in Computer Vision (CRCV). University of Central Florida (UCF).
  • Tripathi, R. K., A. Singh Jalal, and S. Chandh Agrawal. 2018. Suspicious human activity recognition: A review. Artificial Intelligence Review 50 (2):283–339. doi:10.1007/s10462-017-9545-7.
  • Yakkali, R. T., R. Nayar, and S. Indu. 2018. Object tracking and suspicious activity identification during occlusion. International Journal of Computer Applications 179: 29–34.
  • Zemni, M., M. Jallouli, A. B. Mabrouk, A. Mahjoub, and M. A. Mahjoub. 2019. Explicit haar–Schauder multiwavelet filters and algorithms. Part II: Relative entropy-based estimation for optimal modeling of biomedical signals. International Journal of Wavelets Multiresolution and Information Processing 1950038 (5):1950038. doi:10.1142/S0219691319500383.
  • Zhang, R., J. Xinyu, C. Dai, and W. Chen. 2020, July. Tensor-based uncorrelated multilinear discriminant analysis for epileptic seizure prediction. In IEEE in 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). 541–44. Montreal, QC, Canada. July 20-24 2020.
  • Zhou, X., W. Liang, I. Kevin, K. Wang, H. Wang, L. T. Yang, and Q. Jin. 2020. Deep-learning-enhanced human activity recognition for internet of healthcare things. IEEE Internet of Things Journal 7 (7):6429–38. doi:10.1109/JIOT.2020.2985082.