In the future, editing audio might be as easy as opening up Photoshop and cropping a picture. Adobe’s Project VoCo, two years in the making, is designed to make audio editing “really easy for the average person,” according to Zeyu Jin, an audio researcher and intern at Adobe’s Creative Technologies Lab. With Project VoCo, you can crop out certain words by searching through a transcript, and even generate new words in the speaker’s voice.
The program debuted as one of 11 experimental projects at Adobe Sneaks, an event where the company shows off new technology “that doesn’t have a place in a product yet—or may never,” as Adobe Senior Research Scientist Stephen DiVerdi explains it.
Project VoCo needs only an audio sample and a transcript of the recording. You can then edit the transcript and let the program handle the audio, instead of cropping and stitching together the recording yourself. If you need to edit out curses or misspoken words, it’s just a matter of searching the text of the transcript. More impressively, the program can analyze a person’s voice and create new speech that sounds just like them by cobbling together syllables and sounds the person used in the initial recording. (Because of this process, you can’t insert words that require sounds the person never used in the audio sample provided.)
For instance, you can change this first sentence below into one with a whole different meaning:
See a live demonstration at the recent Adobe Max conference in the video below. The meat of the demonstration starts just before the one-minute mark.
The program doesn’t need much data to synthesize someone’s voice: 10 minutes of audio is enough, though 30 minutes makes for a more convincing mimic.
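To make the transcript-driven editing described above more concrete, here is a minimal sketch in Python. It is not Adobe’s method, which the demo does not spell out. It assumes each transcript word already comes with the span of audio samples it covers and a list of sound labels; the `AlignedWord` structure, the function names, and the toy data are all hypothetical. The first function cuts words out of the audio by searching the transcript, and the second checks whether a new word needs sounds that never occurred in the sample.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass
class AlignedWord:
    """One transcript word tied to a span of audio (hypothetical structure)."""
    text: str
    start: int                 # first sample index of the word (inclusive)
    end: int                   # sample index just past the word (exclusive)
    phonemes: Tuple[str, ...]  # illustrative sound labels for the word


def remove_words(samples: Sequence[float], words: List[AlignedWord], banned: Set[str]):
    """Cut every banned word out of the audio and re-time the remaining words.

    Assumes `words` is sorted by start position and the spans do not overlap.
    """
    banned_lower = {b.lower() for b in banned}
    kept_samples: List[float] = []
    kept_words: List[AlignedWord] = []
    cursor = 0   # next original sample that has not been copied yet
    removed = 0  # number of samples cut so far
    for word in words:
        if word.text.lower() in banned_lower:
            kept_samples.extend(samples[cursor:word.start])  # audio before the cut
            cursor = word.end                                # skip the word itself
            removed += word.end - word.start
        else:
            # Shift the kept word left by the amount of audio already removed.
            kept_words.append(AlignedWord(word.text, word.start - removed,
                                          word.end - removed, word.phonemes))
    kept_samples.extend(samples[cursor:])  # audio after the last cut
    return kept_samples, kept_words


def missing_sounds(words: List[AlignedWord], new_word_phonemes: Sequence[str]) -> List[str]:
    """Return the sounds a new word needs that never occur in the recording."""
    seen = {p for w in words for p in w.phonemes}
    return [p for p in new_word_phonemes if p not in seen]


# Toy data: ten "samples" stand in for real audio; labels are illustrative only.
samples = list(range(10))
words = [
    AlignedWord("hello", 0, 3, ("HH", "AH", "L", "OW")),
    AlignedWord("darn", 3, 6, ("D", "AA", "R", "N")),
    AlignedWord("world", 6, 10, ("W", "ER", "L", "D")),
]

edited_samples, edited_words = remove_words(samples, words, {"darn"})
print(edited_samples)                            # [0, 1, 2, 6, 7, 8, 9]
print([w.text for w in edited_words])            # ['hello', 'world']
print(missing_sounds(words, ("L", "OW", "D")))   # []
print(missing_sounds(words, ("K", "IH", "S")))   # ['K', 'IH', 'S']
```

A real editor would also have to smooth the joins between cuts and actually synthesize the new audio, which this sketch leaves out entirely.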
In the ideal use case, you could fire up this program to fix speeches, podcasts, or voice-overs where a mistake in the initial recording would otherwise force you to re-record. Audio is so sensitive that changes in the sound of the room or in the person’s voice (say, if they’ve developed a cold) make it next to impossible to re-record just one segment of a clip; to make it sound really good, you need to re-record the whole thing. With Project VoCo, you can make corrections that sound seamless. That said, the ability to create audio of someone’s voice saying words that never came out of their mouth is ripe for serious misuse. But the Adobe researchers say that this is not unlike the ability to Photoshop misleading images, like the fake viral images that circulate on the web.
Still, Jin says they “are looking for a technological solution to prevent misuse. We are investigating deep learning detectors to find the edited part [of the audio]” and create some sort of watermark for it.
All images courtesy of Adobe