AI tool points to simpler way of shaping immersive audio environments

Engineers at the University of Pennsylvania School of Engineering and Applied Science have developed an AI-powered audio editing system that allows users to modify immersive sound environments using natural language, offering a glimpse into how complex spatial audio could become more accessible to a wider range of users.

Called SmartDJ, the system enables users to issue high-level instructions such as “make this sound like a busy office”, or to request a more natural ambience such as a rainforest, and then automatically plans and executes the steps required to achieve that result.

Unlike earlier AI audio-editing tools, which relied on rigid, template-based commands, SmartDJ is designed to interpret more intuitive requests. Previous systems also typically worked with single-channel, or mono, audio, limiting their ability to preserve the spatial cues essential for immersive listening experiences. SmartDJ, by contrast, operates on stereo audio, allowing it to better maintain and reshape the spatial structure of a scene.

The system combines different types of AI models to achieve this. Language models are used to interpret user input, while diffusion models generate and modify audio. To connect the two, the research team introduced an audio language model (ALM), trained on both sound and text.

The ALM analyses the original audio alongside the user’s prompt and breaks the request into a sequence of smaller editing actions, such as adding, removing, or repositioning sounds. A diffusion model then carries out these actions step by step.
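In rough pseudocode terms, this plan-then-execute flow might look like the sketch below. All names here are hypothetical stand-ins: `plan_edits` plays the role of the ALM planner, and `apply_edits` stands in for the diffusion model that carries out each action in order; the real system operates on stereo waveforms, not strings.

```python
from dataclasses import dataclass

@dataclass
class EditAction:
    op: str         # "add", "remove", or "reposition"
    sound: str      # e.g. "phone_ringing"
    position: str   # stereo placement, e.g. "right"
    gain_db: float  # level for the added sound

def plan_edits(prompt: str) -> list[EditAction]:
    """Hypothetical stand-in for the ALM planner: maps a natural-language
    request to a sequence of atomic editing actions."""
    if "busy office" in prompt:
        return [
            EditAction("add", "keyboard_typing", "left", -6.0),
            EditAction("add", "phone_ringing", "right", -3.0),
            EditAction("add", "background_chatter", "center", -12.0),
        ]
    return []

def apply_edits(audio: str, actions: list[EditAction]) -> str:
    # Stand-in for the diffusion model: apply each planned action in order.
    for action in actions:
        audio = f"{audio} + {action.op}({action.sound}@{action.position})"
    return audio

plan = plan_edits("make this sound like a busy office")
result = apply_edits("input_stereo", plan)
```

The point of the decomposition is that each `EditAction` is small enough for a generative model to execute reliably, while the full sequence realizes the high-level request.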

Importantly, SmartDJ is designed to be interpretable, allowing users to see and adjust each stage of the editing process. For example, a request to create a busy office soundscape might result in an instruction such as adding a phone ringing on the right-hand side at a specific level. Users can then refine or remove individual steps to adjust the final output.
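The interpretability described above can be illustrated with a minimal sketch: the plan is exposed as a list of human-readable steps, and the user drops one step before the final render. The step fields and wording here are illustrative assumptions, not SmartDJ's actual interface.

```python
# Hypothetical plan for "make this sound like a busy office": each step is
# shown to the user before rendering, so it can be refined or removed.
plan = [
    {"op": "add", "sound": "keyboard typing", "side": "left",   "gain_db": -6},
    {"op": "add", "sound": "phone ringing",   "side": "right",  "gain_db": -3},
    {"op": "add", "sound": "chatter",         "side": "center", "gain_db": -12},
]

def describe(step: dict) -> str:
    # Render one editing action as a readable instruction.
    return f"{step['op']} {step['sound']} on the {step['side']} at {step['gain_db']} dB"

# The user inspects the plan...
for step in plan:
    print(describe(step))

# ...then removes an unwanted step before the final output is rendered.
refined = [s for s in plan if s["sound"] != "phone ringing"]
```

Because editing happens at the level of this visible plan rather than inside an opaque model, rejecting or adjusting one sound does not require regenerating the whole scene from scratch.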

“We use language models to deal with text,” said Zitong Lan, doctoral student in Electrical and Systems Engineering and first author of the study. “We further use diffusion models to edit sounds.”

“With SmartDJ, users can describe the outcome they want in natural language, and the system figures out how to make it happen,” added Mingmin Zhao, assistant professor in Computer and Information Science and senior author of the study.

The research was presented at the 2026 International Conference on Learning Representations (ICLR).

Top image: Sylvia Zhang, Penn Engineering
