Raw Audio VAE


In Proceedings of Sound and Music Computing 2023
 
Kıvanç Tatar

Chalmers University of Technology
Sweden
tatar@chalmers.se
Kelsey Cotton 

Chalmers University of Technology
Sweden
kelsey@chalmers.se

Daniel Bisig

Zurich University of the Arts
Switzerland
daniel.bisig@zhdk.ch


Abstract


Research into Deep Learning applications in sound and music computing has gathered interest in recent years; however, there is still a missing link between these new technologies and how they can be incorporated into real-world artistic practices. In this work, we explore a well-known Deep Learning architecture, the Variational Autoencoder (VAE). VAEs have been used in many areas to generate latent spaces in which data points are organized so that similar data points are located close to each other. Previously, VAEs have been used to generate latent timbre spaces or latent spaces of symbolic music excerpts. Applying a VAE to audio features of timbre requires a vocoder to transform the timbre generated by the network into an audio signal, which is computationally expensive. In this work, we apply VAEs to raw audio data directly, bypassing audio feature extraction. This approach allows practitioners to use any audio recording while retaining flexibility and control over the aesthetics through dataset curation. The lower computation time of audio signal generation allows the raw audio approach to be incorporated into real-time applications. We propose three strategies to explore latent spaces of audio and timbre for sound design applications. By doing so, we aim to initiate a conversation on artistic approaches and strategies for utilizing latent audio spaces in sound and music practices.
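The core idea above — encoding raw-audio frames directly into a low-dimensional latent space, with no spectral feature extraction or vocoder — can be sketched as follows. This is a minimal illustration of the standard VAE reparameterization step, not the paper's actual architecture; the frame length, latent dimensionality, and the stand-in linear "encoder" are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

frame_len = 1024   # samples per raw-audio frame (assumed, not from the paper)
latent_dim = 8     # latent dimensionality (assumed)

# Stand-in "encoder": fixed random linear maps producing the mean and
# log-variance of the approximate posterior q(z | x). A real raw-audio
# VAE would use trained neural network layers here.
W_mu = rng.normal(scale=0.01, size=(latent_dim, frame_len))
W_logvar = rng.normal(scale=0.01, size=(latent_dim, frame_len))

def encode(x):
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar):
    # z = mu + sigma * eps: the standard VAE reparameterization trick,
    # which keeps sampling differentiable during training
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

x = rng.normal(size=frame_len)   # one raw-audio frame (placeholder signal)
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
print(z.shape)                   # one latent vector per audio frame
```

Because decoding a latent vector back to a raw waveform is a single forward pass, with no vocoder stage, this is what makes the approach cheap enough for real-time use.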

Paper ->
Code  ->

The artwork titled Coding the Latent uses three interpolation strategies presented in this paper, within a live coding environment, performed at Kubus, ZKM, Karlsruhe.

Network

Deep Learning Network in RawAudioVAE
Interpolation Strategies

Three interpolation strategies in RawAudioVAE
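As an illustration of what interpolating between two latent vectors can look like, the sketch below shows two generic techniques: straight-line (linear) interpolation and spherical interpolation. These are common latent-space operations offered here for orientation only; the three strategies proposed in the paper may differ from them.

```python
import numpy as np

def lerp(z0, z1, t):
    # Straight-line interpolation between two latent vectors.
    return (1.0 - t) * z0 + t * z1

def slerp(z0, z1, t):
    # Spherical interpolation: follows the great-circle arc between the
    # two vectors, which better preserves vector norm in
    # high-dimensional Gaussian latent spaces.
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(z0, z1, t)  # vectors (nearly) parallel
    return (np.sin((1.0 - t) * omega) * z0
            + np.sin(t * omega) * z1) / np.sin(omega)

# Example: midpoint between two orthogonal unit latent vectors.
z0 = np.array([1.0, 0.0, 0.0])
z1 = np.array([0.0, 1.0, 0.0])
mid = slerp(z0, z1, 0.5)
print(np.round(mid, 4))  # stays on the unit sphere: [0.7071 0.7071 0.]
```

Decoding each interpolated latent vector back into a raw-audio frame yields a smooth morph between the two source sounds, which is what makes such strategies usable for sound design.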

Acknowledgements


This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program – Humanities
and Society (WASP-HS) funded by the Marianne and Marcus Wallenberg Foundation and the Marcus and Amalia
Wallenberg Foundation. Additionally, this research was previously supported by the Swiss National Science Foundation, and Canada Council for the Arts.

    
Copyright
Kıvanç Tatar
©2018-2022
