Abstract

Distress calls have played a part in human evolution: humans have long used panic and distress calls to signal an urgent need or to warn others of potential danger. Likewise, humans have little difficulty identifying these sounds and responding accordingly.

This study aims to build a deep learning model that recognizes sounds associated with alarming and potentially life-threatening situations. Using audio extracted from YouTube videos and split into five-second clips, we developed three models that differ in their preprocessed inputs: (1) Fourier-transformed signals fed into a three-layer fully connected neural network, (2) audio images fed into a basic convolutional neural network (CNN), and (3) spectrograms fed into a basic CNN.
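The preprocessing steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the 8 kHz sample rate, the 256-sample Hann-windowed frames with a 128-sample hop, the hidden-layer sizes, the ReLU/softmax activations, and the random, untrained weights. The sketch shows how a recording could be cut into non-overlapping five-second clips, converted to FFT-magnitude feature vectors for a three-layer fully connected network, and converted to a magnitude spectrogram that a CNN could take as input. The softmax output assumes each clip receives exactly one class. A true multi-label setup would instead use an independent sigmoid for each class.

```python
import numpy as np


def split_into_clips(signal, sample_rate, clip_seconds=5):
    """Split a 1-D signal into non-overlapping clips, dropping the trailing remainder."""
    clip_len = int(sample_rate * clip_seconds)
    n_clips = len(signal) // clip_len
    return signal[: n_clips * clip_len].reshape(n_clips, clip_len)


def fourier_features(clips):
    """Magnitude of the real FFT of each clip: one feature vector per clip."""
    return np.abs(np.fft.rfft(clips, axis=1))


def spectrogram(clip, frame_len=256, hop=128):
    """Magnitude STFT with a Hann window (assumed parameters): rows are frames, columns are frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(clip) - frame_len) // hop
    frames = np.stack([clip[i * hop : i * hop + frame_len] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected network: ReLU hidden layers, softmax output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)
    logits = h @ weights[-1] + biases[-1]
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# Usage with synthetic noise standing in for a YouTube audio track.
rng = np.random.default_rng(0)
sr = 8000                                 # hypothetical sample rate (Hz)
audio = rng.standard_normal(sr * 12)      # 12 s of audio -> two full 5 s clips
clips = split_into_clips(audio, sr)       # shape (2, 40000)
feats = fourier_features(clips)           # shape (2, 20001)
spec = spectrogram(clips[0])              # shape (311, 129), a CNN-style 2-D input

# Three weight matrices = three fully connected layers (hypothetical sizes, untrained).
sizes = [feats.shape[1], 64, 32, 2]       # 2 outputs, e.g. binary danger / no danger
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
probs = mlp_forward(feats, weights, biases)  # per-clip class probabilities
```

In practice the weights would be learned from labeled clips. Frame length and hop trade time resolution against frequency resolution in the spectrogram.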

We find that the Fourier transform neural network model achieved the highest accuracy: 80.5% for binary classification and 61.25% for multilabel classification. Both CNN models achieved high training and validation accuracies but failed to generalize, owing to the limited quantity and variety of the training data. These models could serve as a basis for IoT devices and sensor technologies designed for automatic danger detection in residential and commercial areas.