NikolaiHerrmann/audio-classifier

Goal

This project classifies multivariate time series to perform speaker identification. We specifically investigate speech recordings from which linear prediction cepstral coefficients (LPCCs) have been extracted. Three classifiers are examined for this task: a simple Convolutional Neural Network (CNN) using 1D convolutions, a Random Forest classifier, and a Support Vector Machine. The latter two use hand-crafted features (the mean, standard deviation, and slope of each time series). The classifiers were trained and tested on two different datasets, the Japanese Vowels dataset and the (English) Free Spoken Digit dataset, so their performance is evaluated in two scenarios that vary in language, recording length, and task. We find that the classifiers using hand-crafted features outperformed the neural network.
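
As a rough sketch, the hand-crafted features mentioned above (per-channel mean, standard deviation, and slope) could be computed as follows. The function name and the (timesteps, channels) array layout are our assumptions, not the repo's actual API:

```python
import numpy as np

def handcrafted_features(series: np.ndarray) -> np.ndarray:
    """series: (timesteps, channels) array, e.g. one LPCC per channel.

    Returns the per-channel mean, standard deviation, and slope of a
    least-squares line fit, concatenated into one feature vector.
    """
    t = np.arange(series.shape[0])
    # np.polyfit with a 2D y fits one line per column; row 0 holds slopes.
    slopes = np.polyfit(t, series, deg=1)[0]
    return np.concatenate([series.mean(axis=0), series.std(axis=0), slopes])
```

With C channels this yields a 3C-dimensional vector per recording, which is what the Random Forest and SVM would consume instead of the raw time series.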

Run instructions

  1. Install the Python requirements (Python 3.10.9).
  2. Run the train.py file:
python train.py

Japanese Vowel (/ae/) Data Set

Size

  • 640 speaker recordings
  • 9 unique speakers
  • Split:
    • Train: 270 (30 recordings per speaker)
    • Test: 370 (24-88 recordings per speaker)

Parameters

  • LPCC order of 12
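
The repo's extraction code is not shown here, but a minimal NumPy sketch of how 12th-order LPCCs can be obtained is below: LPC via the autocorrelation method (Levinson-Durbin), followed by the standard LPC-to-cepstrum recursion. Function names and framing choices are ours:

```python
import numpy as np

def lpc(x: np.ndarray, order: int) -> np.ndarray:
    """LPC coefficients [1, a_1, ..., a_p] via Levinson-Durbin."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # Not in-place: the RHS is evaluated before assignment.
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpcc(a: np.ndarray, n_ceps: int) -> np.ndarray:
    """Cepstral coefficients c_1..c_n from LPC coefficients
    (assumes n_ceps <= LPC order)."""
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = -a[n] - sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c[1:]

# e.g. lpcc(lpc(frame, 12), 12) -> one 12-dim vector per speech frame
```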

Source

Spoken Digit Data Set

  • Extract all zip files in spoken_digits
  • One recording per txt file

Size

  • 3000 speaker recordings
  • 6 unique speakers
  • 50 recordings of each digit per speaker

Parameters

  • WAV files with a sample rate of 8 kHz (relatively low)
  • Recordings are trimmed, with almost no silence at the start or end
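
For illustration, trimming of the kind described above can be done with a simple amplitude threshold; the function name and the 0.01 threshold are our choices, not necessarily how the dataset was prepared:

```python
import numpy as np

def trim_silence(x: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading and trailing samples below an amplitude threshold."""
    voiced = np.flatnonzero(np.abs(x) >= threshold)
    if voiced.size == 0:
        return x[:0]  # all-silence input
    return x[voiced[0]:voiced[-1] + 1]
```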

Feature Extraction

Source

About

ML pipeline for speaker identification
