Listen to some of the music samples generated from text, and you get a good idea of what Google’s MusicLM has been designed to do. Here you can find 30-second snippets created from descriptions that list a genre, specific instruments or a vibe.
MusicLM draws on a large dataset of unlabelled music, alongside MusicCaps, a set of music-text pairs that combines rich text descriptions provided by human experts with audio clips from Google’s AudioSet. It was trained on 280,000 hours of music to learn to generate coherent songs from text descriptions such as “a calming violin melody backed by a distorted guitar riff.” MusicLM treats music generation as a hierarchical sequence-to-sequence modelling task and generates sound snippets based on text input:
This is an r&b/hip-hop music piece. There is a male vocal rapping and a female vocal singing in a rap-like manner. The beat is comprised of a piano playing the chords of the tune with an electronic drum backing. The atmosphere of the piece is playful and energetic. This piece could be used in the soundtrack of a high school drama movie/TV show. It could also be played at birthday parties or beach parties.
I have to say that some of the samples sound really good, although I’m unsure about the quality of the vocals on some of them …
We can hear a choir, singing a Gregorian chant, and a drum machine, creating a rhythmic beat. The slow, stately sounds of strings provide a calming backdrop for the fast, complex sounds of futuristic electronic music.
MusicLM is trained by breaking audio into sequences of tokens and learning how those tokens relate to the meaning expressed in text descriptions. The way I understand it, several kinds of tokens are involved: a joint music-text model (MuLan) turns a caption or an audio clip into conditioning tokens, semantic tokens capture the longer-term structure of the music, and acoustic tokens capture the fine detail of the sound. The language model receives user captions or input audio, predicts semantic tokens and then generates acoustic tokens; these are the pieces of sound that are decoded into the song output.
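To make that flow a bit more concrete, here is a toy Python sketch of the staged token pipeline as I understand it: caption to conditioning tokens, then semantic tokens, then acoustic tokens. It is not Google's code and it contains no trained model. Every function name, token count and vocabulary size below is a made-up placeholder, and the "models" are simply deterministic hashes, so the only thing it illustrates is the order in which tokens are produced and what each stage is conditioned on.

```python
import hashlib
from typing import List

VOCAB_SIZE = 1024          # placeholder vocabulary size (assumption)
SEMANTIC_PER_SECOND = 25   # placeholder token rate (assumption)
ACOUSTIC_PER_SEMANTIC = 2  # placeholder ratio of acoustic to semantic tokens (assumption)


def _toy_tokens(seed: str, count: int) -> List[int]:
    """Stand-in for a trained model: derive `count` token ids from a seed string."""
    tokens = []
    for i in range(count):
        digest = hashlib.sha256(f"{seed}:{i}".encode("utf-8")).digest()
        tokens.append(int.from_bytes(digest[:4], "big") % VOCAB_SIZE)
    return tokens


def text_to_conditioning_tokens(caption: str) -> List[int]:
    """Hypothetical stage 0: turn the caption into conditioning tokens."""
    return _toy_tokens("cond|" + caption, count=12)


def conditioning_to_semantic_tokens(cond: List[int], seconds: int) -> List[int]:
    """Hypothetical stage 1: predict coarse 'semantic' tokens from the conditioning."""
    seed = "sem|" + ",".join(map(str, cond))
    return _toy_tokens(seed, count=SEMANTIC_PER_SECOND * seconds)


def semantic_to_acoustic_tokens(cond: List[int], semantic: List[int]) -> List[int]:
    """Hypothetical stage 2: predict fine 'acoustic' tokens from both earlier stages."""
    seed = "aco|" + ",".join(map(str, cond + semantic))
    return _toy_tokens(seed, count=len(semantic) * ACOUSTIC_PER_SEMANTIC)


def generate(caption: str, seconds: int = 2) -> List[int]:
    """Run the three toy stages in order and return the acoustic tokens."""
    cond = text_to_conditioning_tokens(caption)
    semantic = conditioning_to_semantic_tokens(cond, seconds)
    return semantic_to_acoustic_tokens(cond, semantic)


if __name__ == "__main__":
    tokens = generate("a calming violin melody backed by a distorted guitar riff")
    print(len(tokens), "acoustic tokens; first five:", tokens[:5])
```

In a real system the final step would hand the acoustic tokens to an audio decoder to produce a waveform; the sketch stops at the tokens because that decoder is exactly the part that cannot be faked with a few lines of code.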
I can imagine that the likes of Riffusion and Dance Diffusion use similar approaches, basing generated audio on text descriptions. Again, listening to some of the samples, the snippets do sound fairly cohesive, but the vocals are hard to make sense of.
Main learning point: It’s early days but Google’s MusicLM samples do sound credible, and I’m curious to see what impact further developments in this space will have.
Related links for further learning: