Stability AI, gunning for a hit, launches an AI-powered music generator

A year ago, London-based startup Stability AI, the creators of the open source image-generating AI model known as Stable Diffusion, introduced Dance Diffusion, a model capable of generating songs and sound effects based on text descriptions. This marked Stability AI’s initial venture into generative audio, showcasing the company’s interest in the emerging field of AI music creation tools. However, in the year following Dance Diffusion’s announcement, Stability’s generative audio efforts appeared dormant.

The research organization funded by Stability to develop Dance Diffusion, called Harmonai, stopped updating the model last year. And unlike Stability's other models, Dance Diffusion never received a user-friendly interface; installing it meant working directly with the source code.

Now, under investor pressure to turn more than $100 million in capital into revenue-generating products, Stability is making a significant commitment to audio. Today it unveiled Stable Audio, which it claims is the first tool capable of producing high-quality, 44.1 kHz music suitable for commercial use via a technique called latent diffusion. The roughly 1.2-billion-parameter model is trained on audio metadata and audio file durations, giving users greater control over the content and length of synthesized audio than earlier generative music tools offered.

Ed Newton-Rex, VP of audio for Stability AI, explained, “Stability AI is on a mission to unlock humanity’s potential by building foundational AI models across a number of content types or ‘modalities.’ We started with Stable Diffusion and have grown to include languages, code, and now music. We believe the future of generative AI is multimodality.”

Stable Audio wasn’t solely developed by Harmonai; Stability’s audio team, formed in April, created a new model inspired by Dance Diffusion to serve as the foundation for Stable Audio, which Harmonai then trained.

Stable Audio distinguishes itself from Dance Diffusion by generating longer audio and letting users guide generation with text prompts and desired durations. Prompts for some genres, such as EDM and ambient, work exceptionally well, while prompts for melodic, classical, and jazz music produce more unconventional results.

Stability declined requests to test Stable Audio before its launch. At present, Stable Audio is exclusively accessible through a web app, with no plans announced to release the model behind it as open source.

Stability provided samples showcasing Stable Audio’s capabilities across various genres, mainly focusing on EDM. These samples demonstrated more coherent, melodic, and musical results compared to many previous audio generation models, although they lacked a certain degree of creativity.

Achieving optimal results with Stable Audio, as with other generative tools, involves crafting prompts that capture the nuances of the desired song, including genre, tempo, prominent instruments, and emotional aspects.
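One way to keep such prompts consistent is to assemble them programmatically from the attributes the article mentions. The sketch below is a hypothetical helper, not part of Stable Audio's interface; the function name and attribute order are assumptions for illustration.

```python
def build_prompt(genre, bpm=None, instruments=None, mood=None):
    """Assemble a descriptive music prompt from genre, mood,
    prominent instruments, and tempo (a hypothetical helper)."""
    parts = [genre]
    if mood:
        parts.append(mood)
    if instruments:
        parts.append(", ".join(instruments))
    if bpm:
        parts.append(f"{bpm} BPM")
    return ", ".join(parts)

print(build_prompt("Ambient Techno", bpm=124,
                   instruments=["synth pads", "analog drums"],
                   mood="hypnotic"))
# → Ambient Techno, hypnotic, synth pads, analog drums, 124 BPM
```

The resulting string is simply pasted into the tool's prompt field; making tempo and instrumentation explicit tends to narrow the space of outputs.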

Stable Audio utilizes latent diffusion, a technique similar to the one underlying Stable Diffusion's image generation. The model starts from random noise and gradually removes it, steering the audio toward the provided text description. This approach lets Stable Audio maintain coherency for longer durations, up to approximately 90 seconds, a notable improvement over some other song-generating AI models.
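The denoising loop at the heart of that process can be sketched in miniature. The toy below is an illustration of the general technique, not Stable Audio's implementation: `fake_denoiser` stands in for the text-conditioned neural network, and the "audio" is a plain waveform rather than a compressed latent.

```python
import numpy as np

def toy_reverse_diffusion(denoise_step, shape, steps=50, seed=0):
    """Illustrative reverse-diffusion loop: begin with pure Gaussian
    noise and repeatedly apply a denoising step that nudges the
    sample toward the (conditioned) data distribution."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # pure noise at the final timestep
    for t in range(steps, 0, -1):
        x = denoise_step(x, t / steps)  # strip away a little noise per step
    return x

# Stand-in "denoiser" that pulls the sample toward a fixed target
# waveform; a real model would be a network conditioned on the text
# prompt and desired duration.
target = np.sin(np.linspace(0, 2 * np.pi, 1024))

def fake_denoiser(x, t):
    return x + 0.2 * (target - x)

audio = toy_reverse_diffusion(fake_denoiser, shape=(1024,))
```

After 50 steps the sample has converged to the target waveform; in the real system, running this loop on a compact latent rather than raw samples is what makes minute-scale, 44.1 kHz generation tractable.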

Stable Audio’s capabilities extend beyond music generation; it can also replicate various sounds, such as a passing car or a drum solo.

To train Stable Audio, Stability AI partnered with the commercial music library AudioSparx, providing a collection of around 800,000 songs from independent artists. Measures were taken to exclude vocal tracks from the training data, addressing potential ethical and copyright concerns related to “deepfake” vocals.

Stability said that because Stable Audio is primarily designed to generate instrumental music, the risk of misinformation and vocal deepfakes is minimal. The company is actively working on implementing content authenticity standards and watermarking in its audio models to address emerging risks.

AI-generated tracks closely resembling popular artists like Harry Styles or The Eagles shouldn't be possible, since the model's training data excludes major-label music. However, there appears to be a loophole: AudioSparx's library includes songs "in the style of" well-known artists.

Stability AI offers Stable Audio through various pricing tiers, with the Pro tier priced at $11.99 per month, allowing users to generate 500 commercializable tracks up to 90 seconds long each month. Free tier users can create 20 non-commercializable tracks, each lasting 20 seconds. Users who wish to incorporate AI-generated music from Stable Audio into apps, software, or websites with over 100,000 monthly active users must sign up for an enterprise plan.

Stability’s terms of service grant the company rights to use customer prompts, songs, and tool activity data for various purposes, including developing future models and services. Customers are required to indemnify Stability against intellectual property claims related to songs created with Stable Audio.

Artists whose work contributed to Stable Audio’s training data had the option to opt out. Approximately 10% of them chose to do so, and Stability has arranged a revenue-sharing agreement with AudioSparx for musicians on the platform, allowing them to share in the profits generated by Stable Audio.

Stability AI has faced financial challenges and criticism related to employee wages and payroll taxes. Despite raising $25 million in a convertible note offering, the company’s valuation has not significantly increased since its last valuation at $1 billion. It remains to be seen whether Stable Audio will positively impact Stability AI’s financial outlook.
