Music Genre Classifier

A simple music genre classifier based off of audio features.

Main Goal: Accurately classify unseen songs into one of the given genres using the following

PCA
KNN
SVM
Random Forest
Neural Network

Sub-Goal: Discover the best method for identifying fusion songs, that is songs that are somewhat in-between two genres by using only non-fusion training data

Installation

Clone the repository
Install the required packages

pip install -r requirements.txt

** NOTE: You will need to use python version <= 3.12; python 3.13 is missing one of the necessary dependancies for the librosa package **

Download the data

import kagglehub

# Download latest version
path = kagglehub.dataset_download("andradaolteanu/gtzan-dataset-music-genre-classification")

print("Path to dataset files:", path)

Biggest takeaways

Feature Selection

Feature selection can be somewhat ad-hoc, especially when multiple features describe the same underlying concept. For example, 'mfccx_mean' where x varies from 1 to 10. To address this, consider the following approaches:

Correlation Analysis: Identify and remove highly correlated features to reduce redundancy.
Feature Importance: Use algorithms like Random Forest or Gradient Boosting to rank feature importance and select the most relevant ones.
Dimensionality Reduction: Apply techniques like PCA or t-SNE, but be cautious of potential data leakage and ensure proper cross-validation.
Domain Knowledge: Leverage domain expertise to select features that are known to be significant.

By applying these methods, you can create a more robust and interpretable model.

Observations

PCA can cause unnecessary confusion in understanding what is actually important for a dataset. Especially because it's easy to use data transformed via PCA for both testing and training which is actually a data leak if you do PCA based on all of the data. However, if you don't use all the data PCA can underexaggerate important features of that data. It seems therefore that PCA is most useful for the purpose of visualization.

In other KNN examples I saw people obtain the optimal K using all the data. This is a large oversight and should be very intentionally avoided as it is a clear data leak which then overexaggerates the test accuracy of these models. A similar data leak also tends to happen in normalization when the scaler is fitted on all of the data instead of just the training data, and it is then used to fit all the data. This is using data that you shouldn't have access to in training.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.gitignore		.gitignore
README.md		README.md
main.ipynb		main.ipynb
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Music Genre Classifier

Main Goal: Accurately classify unseen songs into one of the given genres using the following

Sub-Goal: Discover the best method for identifying fusion songs, that is songs that are somewhat in-between two genres by using only non-fusion training data

Installation

Biggest takeaways

Feature Selection

Observations

Acknowledgements

About

Uh oh!

Releases

Packages

Languages

jamiecmarks/genre_classifier

Folders and files

Latest commit

History

Repository files navigation

Music Genre Classifier

Main Goal: Accurately classify unseen songs into one of the given genres using the following

Sub-Goal: Discover the best method for identifying fusion songs, that is songs that are somewhat in-between two genres by using only non-fusion training data

Installation

Biggest takeaways

Feature Selection

Observations

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages