OpenAI is the non-profit artificial intelligence company backed by (among others) tech mogul Elon Musk. Just under a year ago it showed off MuseNet: “a deep neural network that can generate 4-minute musical compositions with 10 different instruments, and can combine styles from country to Mozart to the Beatles”.
Now it’s following up with a new system called Jukebox: “A neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles.”
It’s going to ruffle many feathers within the music community.
Here’s why. “Provided with genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch,” explains its introductory blog post. It can create original music; rewrite existing music; ‘complete’ songs based on 12-second samples; and even do deepfake-style goofy covers. Samples are offered “in the style” of Elvis Presley, Katy Perry, Frank Sinatra, Nas, Bruno Mars and others.
“To train this model, we crawled the web to curate a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki,” explained OpenAI’s team. That’s where the feather-ruffling may come in.
“This is pretty amazing from OpenAI – AI-created songs, lyrics and voices,” tweeted Ed Newton-Rex, formerly co-founder and CEO of Jukedeck, one of the first modern AI-music startups, who now works at Bytedance’s European AI Lab. “And another example that AI is only as good as its training data – without musicians’ music to train on, it wouldn’t work. Speaks to big copyright battles in the years ahead.”
There are two different issues here, both of them chewy. First, the copyright questions around training a musical AI on recorded music – a procedure that almost always requires making a copy of that music. That’s something we delved into with law firm Reed Smith’s Sophie Goossens last November: she thinks this issue is “something we should be asking many more questions about”, while also concluding that in the US, the general view is that this kind of training is considered ‘fair use’.
The second issue is the output, and this one has several strands. First, Jukebox can create new music for existing lyrics – we’ll be honest, its revamp of Rick Astley’s ‘Never Gonna Give You Up’ isn’t a patch on the original – and if those lyrics are covered by copyright, that raises questions of its own. Second, if this music is ‘in the style of’ an artist, including the singing, that kicks over another can of worms. Witness the story earlier this week (in the separate but related category of speech synthesis) about Roc Nation filing takedowns for deepfake recordings of Jay-Z.
In some quarters of the music community, there’ll be an instinctive recoil from technology like Jukebox, whether because of its big-tech backers, its copyright implications, or worries about its potential to devalue human-made music. That said, other musicians and industry folk will be excited and curious about Jukebox: perhaps for how artists themselves could make creative use of this tech, or because of what it might teach us about creativity itself.
All these reactions and more are understandable: like all technology, the potential of AI music systems to be good or bad for humans depends very much on what humans choose to do with them.
That’s why, as an industry and a community of musicians, what we need to do now is lean in to Jukebox and other AI music technology: poke, prod and play with it, ask its creators lots of questions (and yes, ask lawyers lots of questions too), and give ourselves as good an understanding as possible of what this tech is really capable of – so we can start to form reasoned opinions about what it might mean.
(What’s that? It’d be great if someone had recently written a report that might help? We have you covered!)
One final note on the dystopian worries. “While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a significant gap between these generations and human-created music,” admit its creators. It’s still not very good at repeating choruses; the recordings are a bit noisy; it takes nine hours to render one minute of audio; and it doesn’t know anything about non-Western music or non-English lyrics.
It won’t be writing a new ‘Despacito’ just yet, then, but this technology is improving rapidly, and that’s why we need to be engaging with it as often and as deeply as we can. Even when it’s ‘Frank Sinatra’ singing about Christmas time being “hot tub time”…