Brief Explanation of Purpose
The aim of the project was to develop a machine learning model capable of classifying urban sounds recorded in New York City into one of 10 predefined classes, such as dog bark, car horn, and gunshot. Accurate sound classification is valuable for applications ranging from environmental monitoring to public safety.
The project includes a Jupyter Notebook and a write-up that breaks down the math behind each machine learning model used. For more details, feel free to check out both.
Project Overview
Dataset
The dataset consisted of two-second sound clips recorded in New York City, each labeled with one of the 10 predefined classes. The data was divided into training and test sets, with 70% used for training and the remaining 30% for testing the models.
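The 70/30 split described above can be sketched with scikit-learn; the feature matrix and labels below are random placeholders standing in for the real clip features and class labels.

```python
# Sketch of a 70/30 train/test split with scikit-learn.
# X and y are placeholder data, not the actual dataset.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))     # 100 clips, 40 features each
y = rng.integers(0, 10, size=100)  # labels for the 10 classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)
print(X_train.shape, X_test.shape)  # (70, 40) (30, 40)
```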
Data Pipeline
We initially experimented with two data representations: the raw amplitude of the sound over time and Mel spectrograms. The latter proved more useful, yielding better accuracy, ROC curves, and AUC scores. Data augmentation techniques such as frequency and time masking were also applied, although they produced no significant performance improvement.
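A minimal sketch of the frequency- and time-masking augmentation mentioned above, applied to a placeholder array (in the real pipeline this would be a Mel spectrogram, e.g. computed with librosa; the mask widths and array shape here are illustrative assumptions):

```python
# Frequency and time masking on a spectrogram-shaped array.
import numpy as np

def freq_mask(spec, width, rng):
    """Zero out a random band of consecutive frequency bins (rows)."""
    out = spec.copy()
    f0 = rng.integers(0, spec.shape[0] - width)
    out[f0:f0 + width, :] = 0.0
    return out

def time_mask(spec, width, rng):
    """Zero out a random band of consecutive time frames (columns)."""
    out = spec.copy()
    t0 = rng.integers(0, spec.shape[1] - width)
    out[:, t0:t0 + width] = 0.0
    return out

rng = np.random.default_rng(42)
spec = rng.random((128, 87))  # e.g. 128 Mel bins x 87 frames (~2 s clip)
augmented = time_mask(freq_mask(spec, 12, rng), 8, rng)
```

Both masks leave the spectrogram's shape unchanged, so augmented examples can be fed to the same model as the originals.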
Model Selection
We decided to use the Random Forest Classifier (RFC) as it offered a balance between accuracy and model interpretability. While Convolutional Neural Networks (CNNs) demonstrated slightly better accuracy, their complexity made them harder to interpret and thus less suitable for our purposes.
Model Tuning
For the RFC, hyperparameter tuning was carried out using both random and grid searches. A k-fold cross-validation strategy was employed, and the best-performing hyperparameters were selected based on the average validation score.
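The search procedure described above can be sketched as a grid search with 5-fold cross-validation; the parameter grid and fold count here are illustrative assumptions, not the project's actual settings.

```python
# Grid search over RFC hyperparameters with k-fold cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 10, size=300)

param_grid = {"n_estimators": [20, 50], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # k-fold cross-validation
    scoring="accuracy",  # best params chosen by mean validation score
)
search.fit(X, y)
best = search.best_params_
```

`RandomizedSearchCV` has the same interface and samples the grid instead of enumerating it, which is the usual way to combine the random and grid searches mentioned above.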
Bias-Variance Trade-off
Both the baseline models and the RFC showed signs of overfitting, although the RFC maintained relatively consistent performance across validation sets, indicating better control of variance.
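One quick way to surface the kind of overfitting noted above is to compare training accuracy against mean cross-validated accuracy: a large gap suggests high variance. The example below uses pure-noise placeholder labels, which makes the gap extreme by construction.

```python
# Diagnosing overfitting via the train-vs-validation accuracy gap.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)  # random (noise) binary labels

rfc = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
train_acc = rfc.score(X, y)                     # near 1.0: trees memorize
cv_acc = cross_val_score(rfc, X, y, cv=5).mean()  # near chance on noise
gap = train_acc - cv_acc                        # large gap => high variance
```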
Evaluation Metrics
Models were primarily evaluated using mean accuracy, balanced (per-class) accuracy, and AUC scores. We recognize the limitations of relying solely on accuracy and recommend that future work identify additional evaluation metrics.
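The metrics named above map directly onto scikit-learn functions; the predictions below are random placeholders, and the one-vs-rest averaging for multiclass AUC is an assumption about how the scores were computed.

```python
# Balanced accuracy and multiclass (one-vs-rest) AUC with scikit-learn.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=300)       # true labels, 10 classes
proba = rng.random((300, 10))
proba /= proba.sum(axis=1, keepdims=True)    # rows sum to 1
y_pred = proba.argmax(axis=1)                # hard predictions

bal_acc = balanced_accuracy_score(y_true, y_pred)       # mean per-class recall
auc = roc_auc_score(y_true, proba, multi_class="ovr")   # one-vs-rest AUC
```

Balanced accuracy averages recall over the classes, so it does not reward a model for favoring frequent classes the way plain accuracy does.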
Domain-Specific Evaluation
The problem was framed as a classification task, assuming all urban sounds belong to one of the 10 classes. This simplification is a limitation, as real-world sounds can be more varied.
Design Review
The RFC model could benefit from additional hyperparameter tuning. We also recognize the limitations of the cross-entropy loss function used and suggest that other loss functions could be explored for better performance.
Ethical Considerations
Recording urban sounds may capture personal conversations, posing ethical and privacy concerns. These must be considered if the model is to be deployed for monitoring purposes.
Optional Exploration: Pre-Trained CNNs
We also explored the use of a pre-trained ResNet model through transfer learning. Despite data augmentation and selective fine-tuning of model layers, the ResNet model underperformed compared to the RFC and simple CNN models. This suggests that a longer training schedule may be needed, and it raises questions about the effectiveness of transfer learning for this particular classification task.
Final Reflections
The project successfully demonstrated that machine learning models can be used for urban sound classification. However, several areas could be improved, including model tuning, evaluation metrics, and ethical considerations. Future work could focus on expanding the types of sounds classified or investigating the utility of more complex neural network architectures.