
Behavior-Driven Video Analytics System for Critical Infrastructure Protection

Phillip Curtis (1,2), M’IEEE; Moufid Harb (1), SM’IEEE; Rami Abielmona (1), SM’IEEE; Emil Petriu (2), F’IEEE

(1) Larus Technologies, Research & Engineering Division, Ottawa, Canada
{moufid.harb, rami.abielmona}@larus.com

(2) University of Ottawa, School of Electrical Engineering and Computer Science, Ottawa, Canada
{pcurtis, petriu}@eecs.uottawa.ca

Abstract— The convergence of a security-aware environment with the proliferation of inexpensive high-quality video imaging devices has led to the deployment of cameras at a large number of critical infrastructure sites. As many cameras are needed to keep all key access points under continuous observation, an operator of the surveillance system may become distracted by the many video feeds, possibly missing key events, such as suspicious individuals leaving an object behind or approaching a door. By providing an automated system for monitoring these types of events within a video feed, some of the burden placed on the operator is alleviated, thereby increasing the overall reliability and performance of the monitoring system, as well as providing archival capability for future investigations. In this paper, we propose a solution that uses a background subtraction based segmentation method to determine objects within the scene. An artificial neural network classifier is then employed to determine the class of each object detected in each frame, which is then temporally filtered using Bayesian inference to minimize the effect of occasional misclassifications. The behavior of the object is then determined based on its classification and spatio-temporal properties, and if the object is considered of interest, feedback is provided to the background subtraction segmentation technique to prevent background fading.

Keywords—computer vision, neural networks, Bayesian inference, classification, computational intelligence, background subtraction, segmentation, behavior analysis, critical infrastructure protection, territorial security, video analytics

I. INTRODUCTION

The use of video feeds within surveillance applications is becoming quite popular due to the increased demand for ensuring the security of buildings and other infrastructures, as well as the declining cost and increased precision of digital video cameras. The increased usage of multiple video sources to cover a large perimeter surrounding a critical infrastructure imposes a large burden on the system operators, who cannot physically concentrate on simultaneously observing all the remotely distributed video feeds. This leads to fatigued, stressed, and overworked operators who end up possibly missing important events [1].

The increased processing power, and reduced costs, of current computing technologies can be used to help solve this problem, mainly through the application of computer vision (CV) and computational intelligence (CI) techniques. Using computer vision techniques, objects can be detected and extracted from the video stream. These objects can then be classified based on supervised learning techniques, and their behavior monitored for undesirable events. When an undesirable event occurs, the operator can then be alerted, and the video stream annotated to indicate this fact so the operator can make a decision on the potential response.

The solution proposed in this work uses a background subtraction method of extracting objects of interest, which is updated adaptively based on the classes detected and observed behavior. After the objects have been extracted from the scene, an artificial neural network (ANN) combined with a temporal Bayesian filter classifies each object. The behavior of the classified object, such as entering a restricted zone, stopping, or abandoning another object, is then determined. Based on these behaviors, alerts and annotations to the video are enacted (if necessary), and the information is fed back into the background subtraction model to prevent objects belonging to classes of interest from being introduced into the background model.

The proposed solution is capable of detecting several behaviors of interest in surveillance activities, including restricted zone intrusion by objects of select classes (e.g. car, person, bird, or maritime vessel), abandoned object detection, and stopped object detection, while handling the issue of background fading inherent in most background subtraction techniques. It is implemented in C++, in part using the open source OpenCV library [2].

The rest of the paper is structured as follows. Section II briefly reviews relevant works. Section III unveils the proposed behavior-driven classification methodology and Section IV illustrates its application within critical infrastructure protection. Section V sheds light on the empirical evaluation before some final conclusions and future directions are elaborated upon in Section VI.

II. LITERATURE REVIEW

The first subsection reviews segmentation techniques found in the literature, while the second subsection discusses classification techniques.

A. Segmentation Review

Segmentation is the clustering of regions sharing similar spatio-temporal properties, such as color, texture, location, and motion. Image segmentation techniques, such as watershed algorithms [3] and k-means clustering [4][5], only use spatial properties of a single image to segment the image. While producing good results, they tend to take an extensive amount of computational time, and are not directly suitable for segmenting video streams. On the other hand, video streams can take temporal properties into consideration, thereby using the additional information to minimize the computational resources required for segmentation.

978-1-4799-5431-5/14/$31.00 ©2014 IEEE

Some video segmentation techniques [6][7][8] rely on performing an accurate segmentation of the first frame using slower image-based techniques, and then track the inter-frame changes, refining the segmentation in each subsequent frame. These techniques work well in situations with minimal object motion between frames; however, when there are significant changes, such as the introduction of new objects, a reinitialization of the segmentation may be necessary. In situations where this occurs frequently, the goal of reducing segmentation-related computational resources by tracking changes between frames is defeated.

Yet other techniques model the scene stochastically, such that when a new object is introduced, it can more easily be detected. Mixture of Gaussians (MoG) is one such class of techniques [9][10][11]. The primary issue with these algorithms is that foreground objects are eventually adapted into the background once they become stationary. They also require a relatively static scene, with slowly varying illumination, and a static camera. Some research has been done on methods to model dynamic backgrounds [12] with illumination invariance [13][14] characteristics.
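The background-fading issue noted above can be illustrated with a deliberately simplified model. The following Python sketch is not the MoG algorithm of [9][10][11]: it assumes a single running Gaussian per grayscale pixel, and the names `PixelModel`, `learning_rate`, and the 2.5 standard deviation match test are illustrative choices rather than values from the cited works.

```python
class PixelModel:
    """Single running Gaussian for one grayscale pixel (a simplification of MoG)."""

    def __init__(self, mean, var=25.0, learning_rate=0.05, k=2.5):
        self.mean = float(mean)
        self.var = float(var)
        self.learning_rate = learning_rate
        self.k = k  # match threshold in standard deviations (assumed value)

    def is_foreground(self, value):
        # A measurement far from the modelled distribution is an anomaly.
        return abs(value - self.mean) > self.k * self.var ** 0.5

    def update(self, value):
        # Blend the new measurement into the model; a rate of 0 freezes it.
        a = self.learning_rate
        diff = value - self.mean
        self.mean += a * diff
        self.var = max((1 - a) * self.var + a * diff * diff, 4.0)  # variance floor


def frames_until_absorbed(background, obj, max_frames=1000, **kwargs):
    """Count frames before a stationary object stops being flagged as foreground."""
    model = PixelModel(background, **kwargs)
    for n in range(1, max_frames + 1):
        if not model.is_foreground(obj):
            return n
        model.update(obj)
    return None  # never absorbed within max_frames
```

With a non-zero learning rate the stationary object is eventually treated as background, whereas `frames_until_absorbed(100, 200, learning_rate=0.0)` returns `None` because a frozen model never adapts to it.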

B. Classification Review

Classifiers exist in two different flavors: unsupervised and supervised. Unsupervised techniques extract knowledge from a scene without a priori knowledge, and are typically used for clustering and segmentation. Supervised classifiers, however, involve training the classifier through a machine learning procedure that presents many labeled samples of each class that needs to be identified. For each sample, the scene is typically processed by extracting a feature vector that is then fed into the classifier. There are several supervised classification techniques that are commonly used for image and video processing, with the most popular being the support vector machine (SVM), boosted classifiers, and the artificial neural network (ANN).

The SVM is a binary classifier that maps the feature vector into a multi-dimensional vector space and defines a partition (the classification threshold) such that the distance between classes within the vector space is maximal [15][16]. By ensuring the distance between the feature vectors representing classes is maximal, discrimination of features representing each class is made easier, and determining which class an object belongs to becomes the detection of which side of the hyper-planar class partition the feature vector lies.

Boosted classifiers combine many weak classifiers, each only slightly better than random chance, to produce a stronger result [17][18]. These classifiers are typically fast and simple to use, allowing for much parallelization, but at the cost of longer training periods.

ANNs are inspired by neural biology, and are quite flexible in modelling any desired system [19][20]. They consist of several inter-networked neurons. An individual neuron accepts a weighted combination of input values that is processed by a typically non-linear function to generate an output value; the weights and biases of all the neurons in the network are adapted during training based on the desired output.
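The neuron model described in this paragraph can be sketched in a few lines. This is a minimal Python illustration only; the `tanh` activation and the function names `neuron`, `layer`, and `mlp_forward` are assumptions made for the example and are not taken from [19][20].

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a non-linear function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(z)  # assumed activation function


def layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all see the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def mlp_forward(inputs, layers):
    """Feed the inputs through each (weight_rows, biases) layer in turn."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations
```

Training adjusts the entries of `weight_rows` and `biases`; the forward pass itself stays unchanged.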

III. PROPOSED SOLUTION

The proposed solution is based on three interconnected modules (see Fig. 1): an object extraction module, a classifier module, and a behavior engine that generates feedback to the object extraction module, as well as the annotated output frame and any necessary system alerts. Subsections A, B, and C discuss the three respective modules of the proposed solution.

Fig. 1. Block diagram illustrating the proposed solution

A. Object Extraction

The object extraction technique that has been employed (see Fig. 2) uses a background subtraction based approach. This is followed by a standard 8-connected components algorithm applied to the foreground image, with a Kalman tracker combined with a nearest neighbor matching technique to perform correspondence of detected objects between frames. The background subtraction based segmenter is the MoG technique [11], which models each pixel in the scene by a mixture of Gaussian distributions that capture the most common elements observed over time for that particular pixel. Any measurement that does not fit into these distributions is considered an anomaly, and is labelled as belonging to a foreground object. In order to prevent the background model from incorporating objects that are of interest, the learning parameter is set to zero when the video frame is first introduced, and then set back to its regular value when the training image for that particular frame has been decided by the behavior engine, as detailed in subsection C. The Kalman tracker is used to predict the bounding box that each previously detected object will have in the current frame, which is then used to determine matches based on a nearest neighbor comparison with the bounding boxes of objects extracted in the current frame.
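The frame-to-frame correspondence step described above can be sketched as follows. This Python illustration makes several assumptions that are not in the paper: a constant-velocity `predict` stands in for the Kalman prediction, boxes are compared by the distance between their centers, matching is greedy, and the `max_dist` gate of 50 pixels is a placeholder.

```python
def center(box):
    """Center of a bounding box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def predict(box, velocity):
    """Constant-velocity prediction; stands in for the Kalman predict step."""
    x, y, w, h = box
    vx, vy = velocity
    return (x + vx, y + vy, w, h)


def match_tracks(predicted, detections, max_dist=50.0):
    """Greedy nearest-neighbor matching of predicted boxes to detected boxes.

    predicted: {track_id: box}; detections: list of boxes.
    Returns {track_id: detection_index}. Unmatched detections would start
    new tracks; unmatched tracks are left for the caller to age out.
    """
    pairs = []
    for tid, pbox in predicted.items():
        px, py = center(pbox)
        for j, dbox in enumerate(detections):
            dx, dy = center(dbox)
            dist = ((px - dx) ** 2 + (py - dy) ** 2) ** 0.5
            if dist <= max_dist:
                pairs.append((dist, tid, j))
    matches, used = {}, set()
    for dist, tid, j in sorted(pairs):  # closest pairs are assigned first
        if tid not in matches and j not in used:
            matches[tid] = j
            used.add(j)
    return matches
```

A detection that lies farther than `max_dist` from every predicted box is left unmatched, which is how a newly appearing object would be recognized in this sketch.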

Fig. 2. Block diagram illustrating the object extraction process

B. Classifier

The classifier architecture, as shown in Fig. 3, contains a feature extractor that produces a feature vector for each tracked object. These feature vectors are then fed into a parallel bank of Multi-Layer Perceptron (MLP) ANN binary classifiers. The output from the ANNs is then fed into a temporal Bayesian classification filter, to minimize the effect that false positives and false negatives have on the overall system.

Fig. 3. Block diagram illustrating the classification process

The features that are provided to the ANN classifier are extracted from the contents of a subimage defined by the object’s bounding box. The first feature is the mean color-corrected red, green, and blue (RGB) values of the subimage (3 values). The second feature is a grayscale version of the subimage that has been rescaled to 4x4 pixels in size (16 values), and the final feature is a black and white version of the grayscale image, thresholded using Otsu's method [21] (16 values). This results in a total feature vector length of 35 values. This vector is then normalized to be between -1 and 1, with the normalization limits chosen based on the dataset used for training. The ANN is a simple feed forward (FF) type MLP that has input, hidden, and output layers. The output layer has 2 neurons with binary output values: {1,0} indicates that the object belongs to the class, and {0,1} indicates that it does not. Each classifier has a different number of hidden neurons, which was found through training. By using a short feature vector, the speed of the classification is improved, at the cost of a potentially higher misclassification rate.
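The composition of the 35-value feature vector can be sketched as below. This Python illustration rests on several assumptions: color correction is omitted, the rescaling is done by block averaging, Otsu's method is applied to the 4x4 grayscale image, and values are normalized from the fixed range 0..255 rather than from limits derived from a training set. The authors' implementation is in C++ with OpenCV, so none of these function names come from the paper.

```python
def to_gray(pixel):
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) / 1000.0  # common luma weights (assumed)


def resize_gray(image, size=4):
    """Block-average an RGB image (rows of (r, g, b)) to size*size gray values."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(size):
        r0 = i * h // size
        r1 = max((i + 1) * h // size, r0 + 1)
        for j in range(size):
            c0 = j * w // size
            c1 = max((j + 1) * w // size, c0 + 1)
            block = [to_gray(image[r][c]) for r in range(r0, r1) for c in range(c0, c1)]
            out.append(sum(block) / len(block))
    return out


def otsu_threshold(values):
    """Otsu's method on values in 0..255: maximize the between-class variance."""
    hist = [0] * 256
    for v in values:
        hist[min(255, max(0, int(round(v))))] += 1
    total = len(values)
    sum_all = sum(i * hist[i] for i in range(256))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def feature_vector(image):
    """3 mean RGB values + 16 gray values + 16 binary values = 35 features."""
    pixels = [p for row in image for p in row]
    mean_rgb = [sum(p[c] for p in pixels) / len(pixels) for c in range(3)]
    gray = resize_gray(image)
    t = otsu_threshold(gray)
    binary = [255.0 if v > t else 0.0 for v in gray]
    raw = mean_rgb + gray + binary
    return [v / 127.5 - 1.0 for v in raw]  # map 0..255 to -1..1 (assumed limits)
```

For a subimage whose left half is black and whose right half is white, the grayscale and binary parts of the vector both alternate between -1 and 1 across each row of the 4x4 grid.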

The classes that are currently recognized by the MLP binary classifiers are bird, person, car, and maritime vessel. These classifiers were trained using a combination of the 2007 and 2008 datasets of the Pattern Analysis, Statistical Modelling, and Computational Learning (PASCAL) Visual Object Classes Challenge [22], and an internally maintained dataset of images for the targeted categories. The four aforementioned classes were chosen since individuals and cars are typically objects of interest for critical infrastructure protection applications, while birds and maritime vessels are of concern within maritime situational awareness.

The scaled conjugate gradient back-propagation method was used for training since the dataset is quite large and this method can tackle such data with low memory consumption. Table 1 shows the training and testing results of the designed classifiers. The classifiers were capable of classifying a significant number of images that contain an object of a targeted category. As shown in Fig. 4, the person classifier had the highest false positive rate and the lowest false negative rate. This is mainly due to the dataset itself, which contains a large number of persons.

The MLP ANN binary classifiers for each class can be executed in parallel, with the classification, $C_{k,n}$, for the kth object, $O_{k,n}$, at frame n being chosen based on the highest activation output of all the classifiers being used, while ensuring that there is a sufficient delta, α, between the activation levels. If no single class is dominant or if the dominant class has an activation level below a threshold, β, then the classification of the object is considered to be unknown, as in (1).
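The selection rule described in this paragraph can be sketched as follows. The Python function below is illustrative only: the default values of `alpha` and `beta` are placeholders, since the paper does not state the values it uses, and the dictionary interface is an assumption.

```python
def classify(activations, alpha=0.2, beta=0.5):
    """Pick the dominant class or 'unknown'.

    activations maps a class name to the activation of that class's binary MLP.
    alpha is the required margin over the runner-up and beta is the minimum
    activation; both defaults are placeholders, not values from the paper.
    """
    ranked = sorted(activations.items(), key=lambda kv: kv[1], reverse=True)
    best_class, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if best > beta and best - runner_up > alpha:
        return best_class
    return "unknown"
```

For example, activations of 0.95 for car and at most 0.10 for the other classes yield "car", while 0.60 for car against 0.55 for person yields "unknown" because the margin is too small.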

Table 1. Results of training and testing for each of the NN classifiers

Class            Result  Data Size  CC = AC (%)  TP (%)  FN (%)  TN (%)  FP (%)
Person           Tr      12684      96.2         46.3    3.7     49.2    0.83
Person           Ts      14976      59.8         21.9    28.1    34.0    16.0
Maritime Vessel  Tr      12889      98.8         38.2    11.8    50.0    0.01
Maritime Vessel  Ts      14976      96.5         11.8    38.2    49.2    0.76
Car              Tr      6904       98.3         38.5    11.5    50.0    0.01
Car              Ts      6347       89.7         8.6     41.4    47.7    2.37
Bird             Tr      6904       98.6         44.3    5.7     50.0    0.03
Bird             Ts      6347       89.5         13.2    36.8    46.3    3.7

Tr: Training; Ts: Testing; AC: Accuracy; CC: Correct Classification; TP: True Positive; FN: False Negative; TN: True Negative; FP: False Positive

Fig. 4. TP, TN, FP, FN values of testing on unseen images [bar chart: percentage (0 to 60) of TrueP, FalseN, TrueN, and FalseP for the car, person, bird, and maritime vessel classifiers]

$$C_{k,n} = \begin{cases} \arg\max_{i} A_i(O_{k,n}), & \text{if } A_i(O_{k,n}) > \beta \text{ and } A_i(O_{k,n}) - A_j(O_{k,n}) > \alpha,\ \forall j \neq i \\ \text{unknown}, & \text{otherwise} \end{cases} \qquad (1)$$

where $A_i(O_{k,n})$ is the activation output of the classifier for class i.

To prevent the effect of temporary misclassifications in the form of false positives and false negatives, a Bayesian inference predictor, (2), has been implemented to perform temporal filtering of the ANN classifier output, where $P(X_n|O_n)$ is the probability that the object belongs to class X at time n given the current observation $O_n$, $L(X_n|O_n)$ is the likelihood that the observation O results in the classification X for the current observation at time n (which is determined by the normalized output of the ANN classifier), and $P(X_{n-1}|O_{n-1})$ is the probability that the object belongs to class X given observation O at the previous instant in time, n-1.

$$P(X_n|O_n) = \frac{P(X_{n-1}|O_{n-1}) \cdot L(X_n|O_n)}{P(X_{n-1}|O_{n-1}) \cdot L(X_n|O_n) + \left(1 - P(X_{n-1}|O_{n-1})\right) \cdot \left(1 - L(X_n|O_n)\right)} \qquad (2)$$

The object’s current class is decided by the dominant probability out of all classes, including the unknown object class. Note that when the output of the MLP is unknown, the Bayesian temporal filter is not updated in order to prevent situations that the classifier does not recognize, such as uneven lighting or occlusion, from suppressing the current classification of the object.
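The temporal filter can be sketched in Python as follows. The function `bayes_update` is a direct transcription of (2); the `TemporalFilter` class around it is an assumption about the bookkeeping, in particular that the belief in a competing class is the complement of the belief in the currently held class, which the paper does not spell out. The numbers in the usage note are taken from Table 2.

```python
def bayes_update(prior, likelihood):
    """Equation (2): fuse the previous belief with the current ANN likelihood."""
    num = prior * likelihood
    den = num + (1.0 - prior) * (1.0 - likelihood)
    return num / den if den > 0.0 else prior


class TemporalFilter:
    """Tracks the believed class of one object across frames."""

    def __init__(self):
        # New objects start as unknown with a probability of 0.5.
        self.label, self.prob = "unknown", 0.5

    def step(self, observed, likelihood):
        if observed == "unknown":
            return self.label, self.prob  # the filter is not updated
        if observed == self.label:
            self.prob = bayes_update(self.prob, likelihood)
        else:
            # Assumption: the prior for a competing class is the complement
            # of the belief in the currently held class.
            challenger = bayes_update(1.0 - self.prob, likelihood)
            if challenger > 0.5:
                self.label, self.prob = observed, challenger
            else:
                self.prob = 1.0 - challenger
        return self.label, self.prob
```

Feeding the filter the observations unknown (1.0000), car (0.9997), and car (0.9997) gives unknown at 0.5, then car at 0.9997, then car at a probability that rounds to 1.0000. A subsequent person observation at 0.9997 leaves the label at car.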

C. Behavior Engine

The behavior engine process, as shown in Fig. 5, consists of a behavior analyzer unit that looks for specific behavior from certain classes of tracked objects, followed by an annotation unit that generates an operator output in the form of an annotated video frame and alerts, as well as a unit that generates a training frame for feedback into the object extractor module.

Fig. 5. Block diagram illustrating the behavior engine

The behavior analyzers that are currently implemented include an intrusion detection analyzer, an abandoned object analyzer, and a counting object analyzer. The intrusion detection analyzer monitors for the intrusion of a restricted zone (e.g. a preselected subregion of the image) by an object from a selected class, or classes. The abandoned object analyzer monitors for the separation of a smaller object from a larger parent object of a specific class, or classes, within a particular predefined subregion of the image space. The counting object analyzer counts the number of objects from a particular class or classes that has crossed a predefined subregion of the image space.
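The intrusion detection analyzer described above can be sketched as a class filter combined with a point-in-polygon test. In this Python illustration, testing the center of the bounding box against the zone is an assumption, as the paper does not state which point or region of the object is tested, and the dictionary layout of a tracked object is hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def intrusion_alerts(objects, zone, alert_classes):
    """Return ids of objects of an alerting class whose box center is in the zone.

    objects: list of dicts with 'id', 'label', and 'box' = (x, y, w, h).
    """
    alerts = []
    for obj in objects:
        x, y, w, h = obj["box"]
        center = (x + w / 2.0, y + h / 2.0)
        if obj["label"] in alert_classes and point_in_polygon(center, zone):
            alerts.append(obj["id"])
    return alerts
```

Passing `{"person"}` as `alert_classes` means that a car inside the zone is ignored by this analyzer while a person inside the zone raises an alert.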

The annotation unit marks up the video stream, highlighting objects and classes of interest, as well as providing alerts based on the behavior analysis. The training frame creation unit generates the training frame based on the background model, the current frame, and the objects of interest produced by the behavior analyzer unit, such that the background subtraction will not integrate objects of interest into the background model. The produced training frame is then fed back into the object extraction module.
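One plausible way to build such a training frame is sketched below. This Python illustration assumes that pixels belonging to objects of interest are replaced by the corresponding pixels of the current background model, and that objects are delimited by their bounding boxes; the paper does not specify either detail, and the function name is hypothetical.

```python
def make_training_frame(current, background, boxes_of_interest):
    """Build the frame the background model is allowed to learn from.

    current, background: 2-D lists of pixel values with the same shape.
    boxes_of_interest: (x, y, w, h) boxes of objects that must not fade.
    Pixels inside any such box are taken from the background model
    (assumed behavior); everything else comes from the current frame.
    """
    training = [row[:] for row in current]  # copy so the input is untouched
    height, width = len(current), len(current[0])
    for x, y, w, h in boxes_of_interest:
        for r in range(max(0, y), min(height, y + h)):
            for c in range(max(0, x), min(width, x + w)):
                training[r][c] = background[r][c]
    return training
```

Because the regions of interest never reach the background update in this sketch, a stationary object of interest continues to be segmented as foreground.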

IV. CASE STUDY: CRITICAL INFRASTRUCTURE PROTECTION

In this paper, two scenarios are considered. The first scenario, shown in Fig. 6 a), consists of monitoring a pair of dumpsters for their unauthorized usage. The second scenario, shown in Fig. 6 b), consists of monitoring a doorway for unauthorized access. In each scenario, the region of interest for monitoring intrusion is a polygon that is drawn on the images in blue.

In the first scenario, a car enters from the bottom left corner and drives up to the dumpsters. A person exits the car, grabs a bag of garbage, tosses it into the garbage bin, reenters the car, and then drives away to the right. As vehicles may be common, they will not cause an alert, yet they will still be actively monitored. A person by the dumpster, on the other hand, indicates a condition that should be handled by an operator to ensure the person is authorized to use that resource, and as a result an alert will be sent when a person enters that zone.

In the second scenario, a person walks in from the right, stops at the door, drops a bag, and then walks away towards the left. In this scenario, people are actively monitored, and hence will cause alerts when they intrude upon the monitored zone. Furthermore, any abandoned objects within this zone are monitored, with an alert being generated if one is detected.


Fig. 6. Demonstrating the two considered scenarios of a) monitoring of a dumpster, and b) monitoring of a doorway for suspicious activities

V. EXPERIMENTAL RESULTS

The scenarios were captured using two different cameras and frame rates. The first camera is a Vivotek security camera, capturing at a variable frame rate at a resolution of 640x480, which was used to acquire the video of the first scenario. The second camera is a Logitech webcam, capturing at 30 fps at a resolution of 640x480, which was used to acquire the video of the second scenario. Each video was saved in the MP4 video format, and subsequently processed offline to allow for the repeatability and thorough analysis of the results. This is not to state that the system can only operate offline; in fact the proposed system is capable of operating online in real-time in a wide range of situations. The video processing algorithm was implemented in optimized C++, using a combination of OpenCV and in-house libraries.

The key moments of the first scenario are shown in Fig. 7 (located at the end of the paper), where the first column contains the frame number for the corresponding row, the second column contains the annotated output video frames, the third column contains the detected objects, and the fourth column contains the feedback training frames. The car comes into view and is classified as unknown in frame 246. In frame 249 it is correctly classified as a car. The car continues moving until frame 297, where it stops. Notice that the car crossing into the intrusion polygon does not trigger an alert, highlighting that the behavioral module correctly distinguishes between classes when processing behaviors. In frame 387, an object that has been correctly classified as a person has exited the car with a garbage bag in hand and is about to toss it into the dumpster. In frame 390, an alert is generated as an intrusion by an object classified as a person has been detected, which causes the intrusion polygon to alternate between blue, green, and red. In frame 447, the person has reentered the car, and drives away in frame 479. Notice that in the training images, the regions of the image that correspond to unknown objects, cars, and persons have not been introduced into the feedback training image, thereby preventing objects of potential interest from being incorporated into the background model, even with the car being stationary for over 42 seconds.

Table 2 provides key moments of the classifier performance for the car object in scenario 1. The first column is the frame number in which the classification took place, the second column indicates the classification of the object from the previous frame (unknown with a probability of 0.5 by default for new objects), the third column indicates the output classification and probability of the MLP ANN classifier, and the fourth column is the current classification after the temporal Bayesian inference filter has been integrated with the MLP ANN observation. The object is first detected at frame 245, when it is classified as unknown with a probability of 1.0000 by the MLP ANN classifier. As previously mentioned, since the Bayesian temporal filter is not updated upon an unknown classification, the resulting classification is still unknown with a probability of 0.5000. This situation remains unchanged until frame 249, when the MLP ANN finally recognizes the object as a car with a probability of 0.9997. The output of the Bayesian filter becomes car with a probability of 0.9997. In the following frame (#250), the MLP ANN classifier produces another classification of car with a probability of 0.9997. This results in the reinforcement of the Bayesian belief that the object is a car, but now with a probability of 1.0000. In frame 285, the MLP ANN classifier produces a misclassification, with the class being person with a probability of 0.9997. Due to the high belief that the Bayesian temporal classification filter currently has, the resulting classification is still car with a probability of 1.0000, thereby preventing the misclassification from affecting the final classification, and any potential action based on that classification.

Table 2. Demonstrating key instances in the classification of the car object in scenario 1

         Previous Classification   MLP ANN Classifier Classification   Current Classification
Frame #  Class     Prob.           Class     Prob.                     Class     Prob.
245      Unknown   0.5000          Unknown   1.0000                    Unknown   0.5000
249      Unknown   0.5000          Car       0.9997                    Car       0.9997
250      Car       0.9997          Car       0.9997                    Car       1.0000
285      Car       1.0000          Person    0.9997                    Car       1.0000

The key moments of the second scenario are shown in Fig. 8 (located at the end of the paper), using the same column order as previously defined for Fig. 7. In frame 239, an individual enters the frame from the right and is initially misclassified as a bird. By frame 244, this individual is mostly in the scene and is correctly classified as a person. In frame 276, he enters the region by the door, triggering an intrusion alert, causing the outlining polygon to alternate between blue, green, and red. In frame 326, the individual stops for a bit, and drops a bag. By frame 401, he has walked away from the door, but the bag has been identified as an unknown object, still triggering the intrusion alert. In frame 459, this unknown object has been determined to be an abandoned object, which has created yet another alert, indicated by the thicker red boundary around the object with an ‘A’ drawn in the interior. At frame 495, the individual has completely left the scene and the abandoned object is still triggering both the intrusion alert and the abandoned object alert. Furthermore, a track in green indicating the individual’s center of gravity over time has been traced through the scene. Finally, because the interest is in persons and unknown objects, all objects corresponding to these two classes have not been fed back into the training image, while objects of other classes have, such as when the individual was misclassified as a bird in frame 239. This keeps both of the monitored classes, person and unknown, from being integrated into the background model, thereby allowing the detection, tracking, and behavior analysis to take place for objects of these classes in subsequent frames.

Table 3 provides key moments of the classifier performance for the person object in scenario 2. The organization is identical to that previously described for Table 2. In this scenario, the person object first enters the scene at frame 217, where it is classified as unknown with a probability of 1.0000. However, in frame 231, this object is misclassified as bird with a probability of 0.7470 by the MLP ANN classifier, resulting in the output classification of bird by the Bayesian temporal filter. In the following frame, the MLP ANN resumes its classification of the object as unknown with a probability of 1.0000, but as previously discussed, the Bayesian temporal classification filter is not updated when the MLP ANN classification is unknown. In frame 244, the MLP ANN classifier finally correctly classifies the object as a person with a probability of 0.9310, which results in a Bayesian filter output of person with a probability of 0.8206. In each of the two following frames, 245 and 246, the MLP ANN classifier produces a classification of person with probabilities of 0.9271 and 0.9995 respectively. This reinforces the Bayesian belief that the correct classification is person, with the probabilities evolving to 0.9831 and 1.0000 in those two successive frames. In frame 271, the MLP ANN classifier produces a misclassification of bird with a probability of 0.8311, which does not affect the Bayesian belief that the object is a person with a probability of 1.0000, thereby further demonstrating the benefit of the temporal Bayesian classification filter.
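As a check on these figures, the filter outputs for frames 244 and 245 can be recomputed from (2). The first line assumes, since the paper does not state it explicitly, that the prior belief in person at frame 244 is the complement of the belief in bird, 1 - 0.7470 = 0.2530.

```latex
\begin{align*}
P(X_{244}|O_{244}) &= \frac{0.2530 \cdot 0.9310}{0.2530 \cdot 0.9310 + 0.7470 \cdot 0.0690}
                    = \frac{0.235543}{0.287086} \approx 0.8205 \\
P(X_{245}|O_{245}) &= \frac{0.8206 \cdot 0.9271}{0.8206 \cdot 0.9271 + 0.1794 \cdot 0.0729}
                    = \frac{0.760778}{0.773857} \approx 0.9831
\end{align*}
```

The first value is within 0.0002 of the tabulated 0.8206, a difference that may stem from the four-decimal rounding of the inputs, and the second matches the tabulated 0.9831.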

Table 3. Demonstrating key instances in the classification of the person object in scenario 2

         Previous Classification   MLP ANN Classifier Classification   Current Classification
Frame #  Class     Prob.           Class     Prob.                     Class     Prob.
217      Unknown   0.5000          Unknown   1.0000                    Unknown   0.5000
231      Unknown   0.5000          Bird      0.7470                    Bird      0.7470
232      Bird      0.7470          Unknown   1.0000                    Bird      0.7470
244      Bird      0.7470          Person    0.9310                    Person    0.8206
245      Person    0.8206          Person    0.9271                    Person    0.9831
246      Person    0.9831          Person    0.9995                    Person    1.0000
271      Person    1.0000          Bird      0.8311                    Person    1.0000

VI. CONCLUSION

The proposed video analytics system correctly extracts interesting objects from the scene, tracks and classifies them, determines their behavior, and finally provides relevant alerts to the operator so that a potential action can be determined. The performance of the proposed system was demonstrated in the two scenarios presented in this paper.

Firstly, objects of interest are extracted based on the MoG background subtraction technique. These objects are then tracked throughout an image sequence using a combination of Kalman tracking and nearest neighbor matching, which is clearly demonstrated in the results for scenario 2, where a track is drawn that follows the motion of the person through the scene. These tracked objects are then classified. The proposed classification module contains four binary classifiers, one for each of the following classes: car, person, bird, and maritime vessel. To mitigate the inevitable occasional misclassifications and improve the reliability of the end classification, a temporal Bayesian filter is applied to the classifier output. This was demonstrated in scenario 1, where the car was misclassified as a person in frame 285 but the output from the Bayesian filter was still a car, and additionally demonstrated in scenario 2, where the person was initially considered a bird in frame 231 but later confirmed to be a person in frame 244.

After the objects have been classified, their behavior is analyzed. In the scenario that monitors the dumpster, cars do not trigger an intrusion alert, but are still monitored. The person on the other hand does trigger the intrusion alert. In scenario two, the person crosses into the intrusion region, which triggers an alert. Furthermore, the person leaves a bag behind, which continues the intrusion alert, while also producing an abandoned object alert. This knowledge is fed back into the MoG segmenter by adjusting the training image to not contain objects that are of interest for the monitoring application, and hence reducing the chance of forgetting, or not observing, interesting objects that could be of highest importance for critical infrastructure protection.

Future enhancements are being planned for the current system. Firstly, the segmenter will be enhanced to handle more dynamic scenes with camera movement. Secondly, the computer vision techniques will be made more illumination-invariant such that they can handle greater light variation across the scene. Finally, further in-depth behavior analysis capabilities will be developed, such as vandalism and smoke/fire detection.

ACKNOWLEDGMENT

This work was partially supported by Mitacs under its Accelerate Cluster program, by the Ministry of Economic Development and Innovation of Ontario under an ORF-RE3 grant, and by the Natural Sciences and Engineering Research Council of Canada (NSERC).

REFERENCES

[1] L. G. Weiss, "Autonomous Robots in the Fog of War," IEEE Spectrum, vol. 48, no. 8, pp. 30-34, 56-57, August 2011.

[2] OpenCV dev team, "OpenCV 3.0.0-dev documentation," June 2014, WWW: http://docs.opencv.org/master/index.html#.

[3] J. B. Roerdink and A. Meijster, "The Watershed Transform: Definitions, Algorithms and Parallelization Strategies," Fundamenta Informaticae, vol. 41, pp. 187-228, 2001.

[4] K. Alsabti, S. Ranka and V. Singh, "An efficient k-means clustering algorithm," Syracuse University SURFACE Electrical Engineering and Computer Science, Syracuse, 1997.

[5] C. Ding and X. He, "K-means Clustering via Principal Component Analysis," in Proceedings of the International Conference on Machine Learning, Banff, 2004.

[6] K. Ryan, A. Amer and L. Gagnon, "Video Object Segmentation Based on Object Enhancement and Region Merging," in IEEE International Conference on Multimedia and Expo, Toronto, 2006.

[7] M. EL Hassani, S. Jehan-Besson, L. Brun, M. Revenu, M. Duranton, D. Tschumperlé and D. Rivasseau, "A Time-Consistent Video Segmentation Algorithm designed for Real-Time Implementation," VLSI Design, vol. 2008, p. 12, 2008.

[8] W.-C. Hu, "Real-Time On-Line Video Object Segmentation Based on Motion Detection Without Background Construction," International Journal of Innovative Computing, Information, and Control, vol. 7, no. 4, pp. 1845-1860, 2011.

[9] N. Friedman and S. Russell, "Image Segmentation in Video Sequences: A Probabilistic Approach," in Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI'97), 1997.

[10] P. KaewTraKulPong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking With Shadow Detection," in Proceedings of the European Workshop on Advanced Video Based Surveillance Systems., 2001.

[11] Z. Zivkovic, "Improved Adaptive Gaussian Mixture Model for Background Subtraction," in Proceedings of the International Conference of Pattern Recognition, 2004.

[12] Y. Sheikh and M. Shah, "Bayesian Modeling of Dynamic Scenes for Object Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1778-1792, 2005.

[13] K. Avgerinakis, A. Briassouli and I. Kompatsiaris, "Real Time Illumination Invariant Motion Change Detection," in ACM-MM ARTEMIS International Workshop, Firenze, Italy, 2010.

[14] M. S. Drew, J. Wei and Z.-N. Li, "Illumination-Invariant Color Object Recognition via Compressed Chromaticity Histograms of Color-Channel-Normalized Images," in IEEE International Conference on Computer Vision, Bombay, India, 1998.

[15] G. Anthony, H. Gregg and M. Tshilidzi, "Image Classification Using SVMs: One-against-One Vs One-against-All," in Proceedings of the Asian Conference on Remote Sensing, 2007.

[16] B. Banerjee, T. Bhattacharjee and N. Chowdhury, "Image Object Classification Using Scale Invariant Feature Transform Descriptor with Support Vector Machine Classifier with Histogram Intersection Kernel," in Information and Communication Technologies: Communications in Computer and Information Science, Berlin, Springer Berlin Heidelberg, 2010, pp. 443-448.

[17] J. Shotton, J. Winn, C. Rother and A. Criminisi, "TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context," International Journal of Computer Vision, vol. 81, no. 1, pp. 2-23, 2009.

[18] R. Lienhart, A. Kuranov and V. Pisarevsky, "Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection," Intel Corp., Santa Clara, 2002.

[19] G. P. Zhang, "Neural Networks for Classification: A Survey," IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, vol. 30, no. 4, pp. 451-462, 2000.

[20] C. Goerick, D. Noll and M. Werner, "Artificial Neural Networks in Real Time Car Detection and Tracking Applications," Pattern Recognition Letters: Neural Networks for Computer Vision Applications, vol. 17, no. 4, pp. 335-343, 1996.

[21] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.

[22] The PASCAL Visual Object Classes Homepage, http://pascallin.ecs.soton.ac.uk/challenges/VOC/, July 7th, 2014.

[Fig. 7 image grid: columns Frame #, Annotated Frame, Segmented Image, Training Image; rows for frames 246, 249, 297, 387, 390, 447, and 479]

Fig. 7. Demonstrating the annotation, segmentation, and training images with their corresponding video frame # of key moments that occurred during scenario 1

[Fig. 8 image grid: columns Frame #, Annotated Frame, Segmented Image, Training Image; rows for frames 239, 244, 276, 326, 401, 459, and 495]

Fig. 8. Demonstrating the annotation, segmentation, and training images with their corresponding video frame # of key moments that occurred during scenario 2