FUSA-Net: Cross-Modal Music Retrieval

M.Sc. thesis — matching sheet music images to audio recordings using deep learning, without any human-labeled pairs.

The Problem

Finding the audio recording that matches a given sheet music page — or vice versa — is a task humans do naturally, but machines struggle with. Audio and sheet music live in completely different representational spaces: one is sound, the other is a visual document. Bridging them requires learning a shared space where both can be compared directly.

The Dataset

No large-scale public dataset existed for this task. I built PDMX-FUSA from scratch: 116,626 public-domain piano scores transformed into 291,648 training examples, each containing a synthesized audio recording, a high-resolution score image, and MIDI data — all aligned at the measure level.

PDMX-FUSA dataset generation pipeline — Pipeline used to transform public-domain piano scores into aligned audio, score-image, and MIDI training examples.

The Architecture

FUSA-Net is a dual-encoder model: one branch processes audio, the other processes score images. Both produce embeddings in a shared 512-dimensional space, trained with contrastive learning so that matching pairs are pushed together and non-matching pairs are pushed apart. Two design decisions made a significant difference. First, I used CCA (Canonical Correlation Analysis) to initialize the model in a geometrically favorable starting point, which made training 4.8× faster than standard approaches and more stable. Second, I added auxiliary prediction tasks — key, meter, polyphony — that force the shared space to organize itself around musically meaningful structure, not just identity matching.

FUSA-Net dual-encoder training diagram — Training setup for FUSA-Net: audio and score-image encoders, CCA-based projection, contrastive loss, and auxiliary musical prediction tasks.

Results

Recall@1 = 66.87% · Recall@10 = 92.24% · Modality Gap = 0.036 In practical terms: given a sheet music page, the system finds the correct audio recording in its top 10 results more than 9 out of 10 times. Even when the top result is wrong, it is musically coherent — matching the correct polyphony 91.4% of the time and the correct meter 70.6% of the time.

Stack

PythonPyTorchtransformerscontrastive learningCCA