IMEMNet-C is a multimodal dataset for aligning images, music, and musical captions using valence-arousal (VA) scores.

schoi828/IMEMNet-C

IMEMNet-C

IMEMNet-C is an extended version of the IMEMNet dataset, designed for multimodal research involving images, music, and musical captions. It adds textual descriptions (musical captions) for the music clips, so that relationships among all three modalities can be studied together. The dataset accompanies the paper "MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions", accepted at the Artificial Intelligence for Music workshop at AAAI 2025.

Key Features

  • Multimodal data:
    • Images: 24,756 real-world images.
    • Music Clips: 25,944 music clips.
    • Musical Captions: Text descriptions corresponding to the music clips.
  • Emotion annotations:
    • Valence and Arousal (VA) annotations normalized to a [0, 1] range.
    • Enables emotional alignment across modalities.
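
Because every item carries a (valence, arousal) pair in [0, 1], one simple way to compare items across modalities is by their distance in VA space. The sketch below is an illustration only, not necessarily the measure used in the MMVA paper: `va_similarity` and `rank_by_va` are hypothetical helper names, and the similarity is assumed to be one minus the Euclidean distance scaled by the largest possible distance in the unit square.

```python
import math

# Largest possible Euclidean distance between two points in [0, 1] x [0, 1].
MAX_VA_DISTANCE = math.sqrt(2.0)


def va_similarity(a, b):
    """Map the VA distance between two (valence, arousal) pairs to [0, 1].

    1.0 means identical VA scores; 0.0 means opposite corners of the VA square.
    This scaling is an assumption made for illustration.
    """
    return 1.0 - math.dist(a, b) / MAX_VA_DISTANCE


def rank_by_va(query, candidates):
    """Return (key, similarity) pairs sorted from most to least similar."""
    scored = [(key, va_similarity(query, va)) for key, va in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# VA scores taken from the sample rows in the Data Format section.
image_va = (0.012802, 0.863233)  # image 3102
clips = {
    "725-16": (0.671875, 0.566878),
    "118-12": (0.539571, 0.416638),
}

ranking = rank_by_va(image_va, clips)
for clip_id, score in ranking:
    print(f"{clip_id}: {score:.3f}")
```

The same function works for any pair of modalities, since captions inherit the VA scores of the clips they describe.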

Source Datasets

IMEMNet-C is built from existing source datasets, including DEAM (music), IAPS (images), and EMOTIC (images). To use IMEMNet-C, you must obtain permission for and download each source dataset, and comply with its license agreement.

Data Format

Image VA CSV

Image IDs correspond to image filenames without the extension; the image files use the extension .jpg or .JPG.

| row ID | Image ID | Valence | Arousal |
|---|---|---|---|
| 0 | 3053 | 0 | 0.921847 |
| 1 | 3102 | 0.012802 | 0.863233 |
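
A minimal sketch of reading the image VA file with Python's standard `csv` module follows. The header names, the comma delimiter, and the helper names `load_image_va` and `find_image` are assumptions for illustration; check them against the actual CSV before use. The sample text holds the two rows shown above.

```python
import csv
import io
from pathlib import Path

# Two sample rows copied from the table above. The header names and the
# comma delimiter are assumptions; adjust them to match the real CSV.
SAMPLE_IMAGE_CSV = """row ID,Image ID,Valence,Arousal
0,3053,0,0.921847
1,3102,0.012802,0.863233
"""


def load_image_va(handle):
    """Return {image_id: (valence, arousal)} from an open CSV file object."""
    table = {}
    for row in csv.DictReader(handle):
        # IDs are kept as strings so they can be matched against filenames.
        table[row["Image ID"]] = (float(row["Valence"]), float(row["Arousal"]))
    return table


def find_image(image_dir, image_id):
    """Return the path <image_id>.jpg or <image_id>.JPG if it exists, else None."""
    for ext in (".jpg", ".JPG"):
        candidate = Path(image_dir) / f"{image_id}{ext}"
        if candidate.is_file():
            return candidate
    return None


image_va = load_image_va(io.StringIO(SAMPLE_IMAGE_CSV))
print(image_va["3102"])
```

For a real file, replace `io.StringIO(SAMPLE_IMAGE_CSV)` with `open(path, newline="")`, where `path` points to the image VA CSV.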

Music VA CSV

Clip IDs correspond to audio filenames without the extension; the audio files use the extension .wav or .WAV.

| row ID | Clip ID | Valence | Arousal | Caption |
|---|---|---|---|---|
| 0 | 725-16 | 0.671875 | 0.566878 | The jazz song showcases a vibraphone solo melody. It is accompanied by punchy snare and kick hits, shimmering cymbals, and a groovy bass guitar. The composition conveys an energetic and passionate mood. |
| 1 | 118-12 | 0.539571 | 0.416638 | The instrumental piece showcases a medium tempo with a keyboard accompaniment. It has a steady drumming rhythm and a percussive bass line. Various percussion hits are incorporated for added texture. The composition is energetic and passionate. Despite the low audio quality, the musical elements remain distinct. |
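
The music VA file can be read the same way. The sketch below assumes a comma-delimited file with the header names shown above and captions wrapped in double quotes (they contain commas); `MusicClip` and `load_music_va` are hypothetical names introduced for illustration. It also checks that each score falls in the documented [0, 1] range.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class MusicClip:
    clip_id: str   # kept as a string: IDs such as "725-16" are not numbers
    valence: float
    arousal: float
    caption: str


# First sample row from the table above. Quoting of the caption field is an
# assumption about how the real CSV is written.
SAMPLE_MUSIC_CSV = (
    "row ID,Clip ID,Valence,Arousal,Caption\n"
    '0,725-16,0.671875,0.566878,"The jazz song showcases a vibraphone solo '
    "melody. It is accompanied by punchy snare and kick hits, shimmering "
    "cymbals, and a groovy bass guitar. The composition conveys an energetic "
    'and passionate mood."\n'
)


def load_music_va(handle):
    """Parse an open CSV file object into a list of MusicClip records."""
    clips = []
    for row in csv.DictReader(handle):
        valence = float(row["Valence"])
        arousal = float(row["Arousal"])
        # VA annotations are documented as normalized to [0, 1].
        if not (0.0 <= valence <= 1.0 and 0.0 <= arousal <= 1.0):
            raise ValueError(f"VA score out of range for clip {row['Clip ID']}")
        clips.append(MusicClip(row["Clip ID"], valence, arousal, row["Caption"]))
    return clips


music_clips = load_music_va(io.StringIO(SAMPLE_MUSIC_CSV))
print(music_clips[0].clip_id, music_clips[0].caption[:40])
```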
