FUSA-Net
Dual-encoder model for image-to-audio and audio-to-image retrieval. M.Sc. thesis, PUC Chile.
FUSA-Net aligns sheet-music images and audio representations in a shared embedding space.
The system uses contrastive learning to retrieve the matching modality without requiring paired metadata at inference time.
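To make the contrastive objective concrete, here is a minimal, framework-free sketch of a symmetric InfoNCE-style loss over a batch of matched image/audio embeddings. It is an illustration under assumptions, not the thesis code: the function names, the temperature of 0.07, and the toy vectors are all hypothetical, and the actual FUSA-Net loss may differ.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, audio_emb, temperature=0.07):
    """Average of the image-to-audio and audio-to-image losses.

    Row i of image_emb and row i of audio_emb are assumed to be a matching pair.
    The temperature value is an assumed default, not taken from the source.
    """
    img = [l2_normalize(v) for v in image_emb]
    aud = [l2_normalize(v) for v in audio_emb]
    # logits[i][j] = cosine similarity of image i and audio j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(i, j)) / temperature for j in aud] for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch: two image embeddings and two candidate sets of audio embeddings.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(image_emb, [[0.9, 0.1], [0.1, 0.9]])
swapped = symmetric_contrastive_loss(image_emb, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

The loss is low when each image is closest to its own audio and high when the pairs are swapped, which is the property that pulls the two encoders into a shared space.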
Metrics
Recall@1 66.87%, Recall@10 92.24%, modality gap 0.036
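For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. It assumes that query i's correct match is candidate i, and that the modality gap is the Euclidean distance between the centroids of the two embedding sets. The source does not state which definitions were used, so both are assumptions, and the function names and toy numbers are hypothetical.

```python
import math


def recall_at_k(similarity, k):
    """Fraction of queries whose true match (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of candidates scoring strictly higher than the true match.
        # Ties are therefore resolved in favour of the true match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_emb, audio_emb):
    """Euclidean distance between the two modality centroids (assumed definition)."""
    return math.dist(centroid(image_emb), centroid(audio_emb))


# Toy similarity matrix for three queries and three candidates.
sim = [
    [0.9, 0.1, 0.0],  # true match ranked first
    [0.2, 0.3, 0.8],  # true match ranked second
    [0.1, 0.7, 0.6],  # true match ranked second
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
gap = modality_gap([[0.0, 0.0], [0.0, 0.0]], [[3.0, 4.0], [3.0, 4.0]])
print(r1, r2, gap)
```

Under these definitions, a higher Recall@K means the correct item appears among the top K results more often, and a smaller gap means the image and audio embeddings occupy closer regions of the shared space.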
Stack
PyTorch, transformers, CCA