Google DeepMind has unveiled an innovative AI tool designed for crafting soundtracks for videos. This new tool not only uses textual prompts to generate audio but also analyzes the video content to enhance the relevance and impact of the sounds produced.
According to DeepMind, the tool can create a range of audio, from dramatic scores and realistic sound effects to dialogue that matches the video’s characters and tone. Demonstrations on DeepMind’s website show it in action. For a video of a car driving through a cyberpunk cityscape, the prompt “cars skidding, car engine throttling, angelic electronic music” produced audio in which the skidding sounds line up with the car’s movements on screen. Another example pairs a scene of jellyfish with an underwater soundscape generated from the prompt “jellyfish pulsating under water, marine life, ocean.”
What sets this tool apart is its flexibility: a text prompt is optional, and users do not have to manually align the audio with the video scenes. DeepMind says the tool can produce an “unlimited” number of soundtracks for a video, leaving plenty of room for customization.
Its ability to work from the video itself distinguishes it from other AI-driven audio tools, such as those from ElevenLabs, which rely solely on text prompts. It could also complement AI-generated video from DeepMind’s own Veo and from OpenAI’s Sora, the latter of which is expected to integrate audio features.
DeepMind trained the tool on a combination of video, audio, and annotations that include detailed sound descriptions and dialogue transcripts, which helps it synchronize the generated audio with what happens on screen.
However, challenges remain. Lip-syncing still needs work, as a demo video of a claymation family shows, and audio quality can suffer when the source video is unclear.
The tool is not yet publicly available, and DeepMind says it will conduct “rigorous safety assessments and testing” before release. Once launched, its audio outputs will carry Google’s SynthID watermark to indicate that they are AI-generated.