SoloAudio from Johns Hopkins Released!
So essentially,
SoloAudio can extract a target sound from a mixture of overlapping sounds, guided by an audio or text clue
Paper: SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer (5 Pages)
Researchers from Johns Hopkins University are interested in identifying a particular sound within a complex acoustic environment.
Hmm... What’s the background?
Humans can easily focus on a particular sound within a complex acoustic environment, even with overlapping sounds. This paper aims to computationally replicate this human ability. Target Sound Extraction (TSE) focuses on extracting sounds of interest from a mixture of overlapping audio, using clues about the target sound class. These clues can be one-hot labels, audio clips, or images.
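To make the task setup concrete, here is a minimal sketch of what one TSE example could look like with a one-hot label clue. The class names, the toy four-sample "waveforms", and the helper functions are all hypothetical and only illustrate the inputs and the expected output of the task, not the paper's data pipeline.

```python
from typing import Dict, List, Tuple

Waveform = List[float]

# Hypothetical sound classes and toy 4-sample "waveforms" (illustration only).
CLASSES = ["dog_bark", "siren", "speech"]
sources: Dict[str, Waveform] = {
    "dog_bark": [0.5, 0.0, -0.5, 0.0],
    "siren": [0.25, 0.25, 0.25, 0.25],
}


def mix(waveforms: List[Waveform]) -> Waveform:
    """Sum overlapping sources sample by sample to form the mixture."""
    return [sum(samples) for samples in zip(*waveforms)]


def one_hot(target: str) -> List[int]:
    """Encode the target sound class as a one-hot label clue."""
    return [1 if name == target else 0 for name in CLASSES]


def make_example(target: str) -> Tuple[Waveform, List[int], Waveform]:
    """Return (mixture, clue, target waveform) for one TSE example."""
    return mix(list(sources.values())), one_hot(target), sources[target]


mixture, clue, target = make_example("siren")
```

A TSE model receives `mixture` and `clue` and is expected to output something close to `target`. With an audio or language clue, the one-hot vector would be replaced by an embedding of a reference clip or a text description.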
Existing discriminative models for TSE often struggle with overlapping sounds. Generative models such as DPM-TSE, which is based on denoising diffusion probabilistic models (DDPMs), show promise, but their reconstruction quality and generalization ability are limited by their reliance on log-mel spectrograms and in-domain labels.
OK, so what is proposed in the research paper?
The paper proposes SoloAudio, a diffusion-based generative model for TSE. Key features of SoloAudio:
Uses a skip-connected Transformer instead of a U-Net for latent feature processing
Supports both audio- and language-oriented TSE by using a CLAP model to extract target sound features
Leverages synthetic audio from text-to-audio (T2A) models for training, improving generalization to out-of-domain data and unseen sound events
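The "skip-connected Transformer" in the list above can be pictured with a small sketch. This summary does not describe the exact wiring, so the code below assumes the common long-skip pattern in which the outputs of the first half of the blocks are saved and fed into the matching blocks of the second half. The blocks are toy functions standing in for real Transformer layers, and merging by addition is an assumption as well.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def run_skip_connected(blocks: List[Block], x: Vector) -> Vector:
    """Run an odd-length stack of blocks with long skip connections.

    Blocks in the first half save their outputs on a stack. After the
    middle block, each block in the second half first adds the matching
    saved output (last saved, first used) and then runs.
    """
    if len(blocks) % 2 == 0:
        raise ValueError("expected an odd number of blocks (in, middle, out)")
    half = len(blocks) // 2
    skips: List[Vector] = []

    for block in blocks[:half]:  # "in" blocks
        x = block(x)
        skips.append(x)

    x = blocks[half](x)  # middle block

    for block in blocks[half + 1:]:  # "out" blocks
        skip = skips.pop()
        # Assumption: merge by addition. Real models often concatenate
        # the two tensors and project them with a learned linear layer.
        x = block([a + b for a, b in zip(x, skip)])
    return x


def scale(factor: float) -> Block:
    """Toy stand-in for a Transformer block: multiply every element."""
    return lambda v: [factor * e for e in v]


blocks = [scale(2.0), scale(3.0), scale(1.0), scale(1.0), scale(1.0)]
result = run_skip_connected(blocks, [1.0, -1.0])
print(result)  # [14.0, -14.0]
```

The point of the long skips is that early, fine-grained features reach the later blocks directly, which is the role the encoder-to-decoder shortcuts play in a U-Net.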
SoloAudio achieves better target sound rendering and separation than discriminative models, and it improves reconstruction quality by working in a VAE latent space instead of the mel-spectrogram space. The model also exhibits strong zero-shot and few-shot capabilities thanks to its use of language-oriented TSE and synthetic training data.
What’s next?
The authors outline several directions for future work:
Enhance the sampling speed of SoloAudio
Explore more effective T2A tools and audio-text alignment methods
Scale up training with larger datasets
Investigate the use of alternative target references, such as images and videos