Human activity capturer and classifier wins first prize at VINCI Energies hackathon
Current situation – Human video operators
Security cameras send live video feeds to a control center, where operators try to detect emergencies that require a response. A human can focus on only one video feed at a time, yet the job requires operators to keep track of many. They are drowning in data and have no means of prioritizing what to watch. When an event of interest happens, e.g. when a person faints, it often goes unnoticed until valuable minutes have passed. Sometimes it is lost in the mountain of video and never seen at all. This delay in response costs lives. For example, after a cardiac arrest the probability of survival drops by 10% (source) each minute until the person receives proper help.
Solution – detecting human behaviour and suggesting a response
Using A.I., we can improve this system by giving each video feed the attention it deserves at all times. Out of all contestants, our team built both the most accurate and the fastest model for detecting human behaviour in real time. We used a state-of-the-art two-stage deep neural network to detect the persons, objects, motions and behaviour in the frames.
VINCI Energies challenged our team to classify individual video files, but we realised that the actual use case would involve streaming video. That’s why we tailored our solution to show live probabilities for the behaviours of interest. When the live probability of a behaviour reaches a critical threshold, the model suggests the appropriate response to the operator.
In case of a medical emergency, the operator can send an ambulance to the right location with a single click, saving crucial minutes. The lower the response time, the more valuable the solution. Additionally, by focussing on the videos where things are actually happening, each operator can handle more video streams and costs go down. A double win!
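The threshold logic described above can be illustrated with a minimal sketch. It assumes the model emits one dictionary of behaviour probabilities per time step; the behaviour labels, threshold values, response texts and function names are all hypothetical and chosen only for illustration, not taken from the winning implementation.

```python
# Hypothetical behaviours, thresholds and responses (illustrative values only).
THRESHOLDS = {"fainting": 0.8, "fighting": 0.9}
RESPONSES = {"fainting": "dispatch ambulance", "fighting": "alert security"}


def suggest_responses(live_probs, thresholds=THRESHOLDS, responses=RESPONSES):
    """Return (behaviour, probability, response) tuples for every behaviour
    whose live probability is at or above its threshold, most likely first."""
    hits = [
        (behaviour, prob, responses[behaviour])
        for behaviour, prob in live_probs.items()
        if behaviour in thresholds and prob >= thresholds[behaviour]
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def monitor(stream):
    """Yield (time_step, behaviour, response) whenever a threshold is reached.

    `stream` is any iterable of per-time-step probability dictionaries."""
    for time_step, live_probs in enumerate(stream):
        for behaviour, _prob, response in suggest_responses(live_probs):
            yield time_step, behaviour, response


# Usage: two time steps of one feed; only the second crosses a threshold.
feed = [
    {"fainting": 0.10, "fighting": 0.20},
    {"fainting": 0.85, "fighting": 0.30},
]
suggestions = list(monitor(feed))
```

Using a separate threshold per behaviour is a design assumption here: it would let rare but critical events such as fainting trigger a suggestion at a lower probability than less urgent ones.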
Video is an interesting case for a classification problem because each video contains important information in both the spatial and the temporal domain. Spatial information is available in each individual frame (e.g. the shape and location of objects in the frame), while temporal information lies in the relation of a frame to earlier and later frames. To capture all of this information and create an efficient classifier, we built a deep neural network architecture that combines the effectiveness of Convolutional Neural Networks (CNNs) at detecting spatial features with the ability of Long Short-Term Memory networks (LSTMs) to capture temporal features.
Using transfer learning, we avoided training a CNN from scratch and instead employed the pre-trained Inception-v3 model from Google, which gives state-of-the-art results on the ImageNet classification challenge (3.46% top-5 error, 1.2 million images, 1000 classes). For each frame, we extracted features from the final pool layer of Inception-v3; the resulting sequence of per-frame feature vectors was the input to the LSTM. The model was trained on an Nvidia-provided DGX-1 station (8x Tesla V100 GPUs) and managed to accurately classify behaviours in the given videos.
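To make the data flow of the second stage concrete, here is a minimal, dependency-free sketch of an LSTM that consumes one feature vector per frame and outputs class probabilities for the whole clip. It is not the team's code: the class name `TinyLSTMClassifier`, the tiny dimensions and the random, untrained weights are assumptions for illustration. In the real pipeline each frame vector would come from the final pool layer of Inception-v3 (2048 values per frame) and the weights would be learned.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMClassifier:
    """Untrained LSTM with a softmax head over per-frame feature vectors."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, feat_dim, hidden, n_classes, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        # One weight matrix per gate, applied to [frame_features, previous_h].
        self.W = {g: rand_matrix(hidden, feat_dim + hidden, rng) for g in self.GATES}
        self.b = {g: [0.0] * hidden for g in self.GATES}
        self.W_out = rand_matrix(n_classes, hidden, rng)

    def step(self, x, h, c):
        """Advance the LSTM by one frame; returns the new (h, c)."""
        z = x + h  # list concatenation of frame features and previous hidden state
        pre = {
            g: [a + b for a, b in zip(matvec(self.W[g], z), self.b[g])]
            for g in self.GATES
        }
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        o = [sigmoid(v) for v in pre["output"]]
        cand = [math.tanh(v) for v in pre["candidate"]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

    def predict_proba(self, frame_features):
        """Run all frames through the LSTM, then softmax the final hidden state."""
        h = [0.0] * self.hidden
        c = [0.0] * self.hidden
        for x in frame_features:
            h, c = self.step(list(x), h, c)
        logits = matvec(self.W_out, h)
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


# Usage: a clip of 5 frames with 8 stand-in features each, 3 behaviour classes.
rng = random.Random(1)
clip = [[rng.random() for _ in range(8)] for _ in range(5)]
model = TinyLSTMClassifier(feat_dim=8, hidden=4, n_classes=3)
probs = model.predict_proba(clip)
```

Because the CNN features are computed once per frame and only the recurrent part sees the sequence, the expensive spatial stage can be reused unchanged, which is the benefit of the transfer-learning setup described above.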
Future – activating passive cameras
When our model makes a poor suggestion, the operator simply declines it and the model learns from its mistake. The operator and the model thus improve each other's performance. In the future, our model will be even more accurate and will be deployed on cameras that currently have no operator and solely collect evidence. Since the large majority of security camera systems fall into this category, the potential is huge!