Google researchers have developed an AI that can construct minutes-long musical pieces from text prompts and can even translate a whistled or hummed melody into different instruments. However, due to the potential risks, the company has no plans to release it.
Google’s MusicLM isn’t the first generative AI system for music. Other attempts include Riffusion, Dance Diffusion, Google’s own AudioLM, and OpenAI’s Jukebox. But owing to technical constraints and limited training data, none has managed to produce songs that are both compositionally sophisticated and high-fidelity.
MusicLM may be the first to do so.
In a research paper, Google researchers describe MusicLM as “a model generating high-fidelity music from text descriptions such as ‘a calming violin melody backed by a distorted guitar riff.’”
“We demonstrate that MusicLM can be conditioned on both text and melody by transforming whistled and hummed melodies according to the style indicated in a text caption,” according to the paper.
MusicLM arrives amid the rapid rise of OpenAI’s popular chatbot ChatGPT, which reportedly prompted Google to declare a “code red,” a response the tech giant described as “equivalent to pulling the fire alarm.”
According to The Times, the company is fast-tracking the release of 20 new products to keep pace, including a version of Google Search with AI chatbot features.
However, Google announced that it has no plans to make MusicLM accessible to the general public, citing several risks, including programming biases that could lead to cultural appropriation and lack of representation, technical issues, and, most importantly, “the potential misappropriation of creative content.”
The researchers found that about 1% of generated examples directly replicated known existing music from the training data, a rate high enough to raise copyright concerns.
“We strongly emphasize the need for more future work in tackling these risks associated with music generation — we have no plans to release models at this point,” the study states.
The study also highlights the technology’s current shortcomings: the model struggles with negations and temporal sequencing in text prompts, and generated vocals are of low quality. In future work, the researchers plan to “model high-level song structure such as introduction, verse, and chorus.”