Musicologist
A framework for interpreting diffusion-based music generators by analyzing how musical concepts emerge across diffusion trajectories.
Project Overview
Musicologist introduces a novel framework for analyzing how musical concepts emerge during the reverse diffusion process in generative audio models. Built on Stable Audio 1.0 as a backbone, the project traces both high-level concepts (e.g., genre, mood) and low-level features (e.g., rhythmic drive, timbre) as they form across diffusion steps.
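The trajectory-analysis idea can be sketched as a generic reverse-diffusion loop that decodes the partially denoised latent at chosen checkpoints. This is a minimal illustration rather than the project's exact pipeline: `denoise_step` and `decode` are hypothetical stand-ins for the backbone sampler's single step and its latent-to-audio decoder.

```python
# Minimal sketch of capturing a diffusion trajectory, assuming a generic
# sampler interface. `denoise_step` and `decode` are hypothetical callables,
# not the actual Stable Audio 1.0 API.
import torch

@torch.no_grad()
def truncated_trajectory(latent, text_cond, denoise_step, decode,
                         num_steps=100, checkpoints=(10, 25, 40, 60, 100)):
    """Run reverse diffusion, decoding the partially denoised latent at
    selected steps so each snapshot can be probed for musical concepts."""
    snapshots = {}
    for step in range(1, num_steps + 1):
        # One text-conditioned reverse-diffusion step.
        latent = denoise_step(latent, text_cond, step)
        if step in checkpoints:
            # Decode the intermediate latent to audio for later analysis.
            snapshots[step] = decode(latent)
    return snapshots
```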
Our methodology combines two techniques: text-guided truncated sampling, which captures the evolving semantic structure of partially denoised audio, and CLAP embeddings paired with Concept Activation Vectors (CAVs), which identify fine-grained auditory attributes. Results show that key musical qualities stabilize roughly 40% of the way through the diffusion trajectory, providing insight into the structured emergence of concepts in music generation.
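As one illustration of the CAV step, the sketch below fits a linear probe on CLAP embeddings and takes the unit normal of its decision boundary as the concept direction. This follows the standard CAV recipe and is an assumption about the details, not the repository's exact code; `positive_embeds` and `negative_embeds` are hypothetical arrays of CLAP audio embeddings for clips that do and do not exhibit a concept.

```python
# Minimal CAV sketch over precomputed CLAP embeddings. The embedding arrays
# are hypothetical inputs (shape: [num_clips, embed_dim]).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cav(positive_embeds: np.ndarray, negative_embeds: np.ndarray) -> np.ndarray:
    """Fit a linear probe separating concept examples from counterexamples;
    the unit normal of its decision boundary is the Concept Activation Vector."""
    X = np.vstack([positive_embeds, negative_embeds])
    y = np.concatenate([np.ones(len(positive_embeds)),
                        np.zeros(len(negative_embeds))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    cav = probe.coef_[0]
    return cav / np.linalg.norm(cav)

def concept_score(embedding: np.ndarray, cav: np.ndarray) -> float:
    """Signed projection of an audio embedding onto the concept direction."""
    return float(embedding @ cav)
```

Scoring the audio decoded at each diffusion checkpoint against such a CAV yields per-step concept curves, which is the kind of signal from which a stabilization point like the ~40% mark can be read off.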
This work advances interpretable AI in music by offering tools and datasets for analyzing the conceptual building blocks of audio generation. It contributes to the broader goal of making generative models more transparent, controllable, and useful for AI-assisted composition.